When Multimodal Interactions Impair Prediction: A Novel Regularized Deep Learning Strategy
Published Online: 17 Mar 2026
https://doi.org/10.1287/ijoc.2024.0794

