When Multimodal Interactions Impair Prediction: A Novel Regularized Deep Learning Strategy

Gang Chen
https://orcid.org/0000-0003-0650-1175
School of Management, Fudan University, Shanghai 200433, P.R. China
Shuaiyong Xiao
Corresponding Author
https://orcid.org/0000-0002-9113-1414
School of Economics and Management, Tongji University, Shanghai 200092, P.R. China; and Laboratory of High Quality Urban Development and Strategic Decision, Tongji University, Shanghai 200092, P.R. China
Chenghong Zhang
https://orcid.org/0000-0002-4008-8989
School of Management, Fudan University, Shanghai 200433, P.R. China
Huimin Zhao
https://orcid.org/0000-0002-6471-9837
Sheldon B. Lubar College of Business, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin 53201

Published Online: 17 Mar 2026
https://doi.org/10.1287/ijoc.2024.0794

Received: May 28, 2024
Accepted: February 07, 2026
Published Online: March 17, 2026

Cite as

Gang Chen, Shuaiyong Xiao, Chenghong Zhang, Huimin Zhao (2026) When Multimodal Interactions Impair Prediction: A Novel Regularized Deep Learning Strategy. INFORMS Journal on Computing 0(0).

https://doi.org/10.1287/ijoc.2024.0794
