Two-Stage Dynamic Fusion Framework for Multimodal Classification Tasks

Shoumeng Ge
https://orcid.org/0009-0003-7336-8009
School of Management, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
Ying Chen
Corresponding Author
https://orcid.org/0000-0002-0366-131X
School of Management, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China

Published Online: 29 May 2025
https://doi.org/10.1287/ijoc.2023.0448

References

Ge S, Chen Y (2025) Two-stage dynamic fusion framework for multimodal classification tasks. https://doi.org/10.1287/ijoc.2023.0448.cd, https://github.com/INFORMSJoC/2023.0448.

Volume 38, Issue 2

March-April 2026

Pages iv, 341-691, iii

Information

Received: December 07, 2023
Accepted: May 10, 2025
Published Online: May 29, 2025

Cite as

Shoumeng Ge, Ying Chen (2025) Two-Stage Dynamic Fusion Framework for Multimodal Classification Tasks. INFORMS Journal on Computing 38(2):625-644.

https://doi.org/10.1287/ijoc.2023.0448

Acknowledgments

The authors appreciate the help of Dr. Yinan Wang from Rensselaer Polytechnic Institute in the methodology development of this research.
