Two-Stage Dynamic Fusion Framework for Multimodal Classification Tasks
References
- (2018) CrisisMMD: Multimodal Twitter datasets from natural disasters. Proc. Internat. AAAI Conf. Web Soc. Media, vol. 12, no. 1 (AAAI Press, Washington, DC).Google Scholar
- (2020) Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. Preprint, submitted February 15, https://arxiv.org/abs/2002.06470.Google Scholar
- (2010) Multimodal fusion for multimedia analysis: A survey. Multimedia Systems 16:345–379.Crossref, Google Scholar
- (2020) On last-layer algorithms for classification: Decoupling representation from uncertainty estimation. Preprint, submitted January 22, https://arxiv.org/abs/2001.08049.Google Scholar
- (2020) HGMF: Heterogeneous graph-based fusion for multimodal data with incompleteness. Proc. 26th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 1295–1305.Google Scholar
- (2009) ImageNet: A large-scale hierarchical image database. 2009 IEEE Conf. Comput. Vision Pattern Recognition (IEEE, New York), 248–255.Google Scholar
- (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint, submitted October 11, https://arxiv.org/abs/1810.04805.Google Scholar
- (2021) MVFusFra: A multi-view dynamic fusion framework for multimodal brain tumor segmentation. IEEE J. Biomedical Health Informatics 26(4):1570–1581.Crossref, Google Scholar
- (2020) An image is worth 16x16 words: Transformers for image recognition at scale. Preprint, submitted October 22, https://arxiv.org/abs/2010.11929.Google Scholar
- (2023) On uni-modal feature learning in supervised multi-modal learning. Preprint, submitted May 2, https://arxiv.org/abs/2305.01233.Google Scholar
- (2015) Multimodal deep learning for robust RGB-D object recognition. 2015 IEEE/RSJ Internat. Conf. Intelligent Robots Systems (IROS) (IEEE Press, New York), 681–687.Google Scholar
- (2021) Geometry-aware neural solver for fast Bayesian calibration of brain tumor models. IEEE Trans. Medical Imaging 41(5):1269–1278.Crossref, Google Scholar
- (2010) Introduction to the Dirichlet distribution and related processes. UWEE Technical Report No. UWEETR-2010-0006, University of Washington, Seattle.Google Scholar
- (2020) A survey on deep learning for multimodal data fusion. Neural Comput. 32(5):829–864.Crossref, Google Scholar
- (2022) Multimodal data fusion for systems improvement: A review. IISE Trans. 54(11):1098–1116.Crossref, Google Scholar
- (2025) Two-stage dynamic fusion framework for multimodal classification tasks. https://doi.org/10.1287/ijoc.2023.0448.cd, https://github.com/INFORMSJoC/2023.0448.Google Scholar
- (2022) Efficient multimodal deep-learning-based COVID-19 diagnostic system for noisy and corrupted images. J. King Saud Univ. Sci. 34(3):101898.Crossref, Google Scholar
- (2021) Trusted multi-view classification. Preprint, submitted February 3, https://arxiv.org/abs/2102.02051.Google Scholar
- (2023) Trusted multi-view classification with dynamic evidential fusion. IEEE Trans. Pattern Anal. Machine Intelligence 45(2):2551–2566.Crossref, Google Scholar
- (2015) Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proc. IEEE Internat. Conf. Comput. Vision (IEEE Press, New York), 1026–1034.Google Scholar
- (2016) Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vision Pattern Recognition (IEEE Press, New York), 770–778.Google Scholar
- (2019) Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition (IEEE Press, New York), 41–50.Google Scholar
- (2021) Meta-learning in neural networks: A survey. IEEE Trans. Pattern Anal. Machine Intelligence 44(9):5149–5169.Google Scholar
- (2022) MM-DFN: Multimodal dynamic fusion network for emotion recognition in conversations. ICASSP 2022 IEEE Internat. Conf. Acoustics Speech Signal Processing (IEEE Press, New York), 7037–7041.Google Scholar
- (2018) Efficient uncertainty estimation for semantic segmentation in videos. Proc. Eur. Conf. Comput. Vision (ECCV, Zurich), 520–535.Google Scholar
- (2022) Modality competition: What makes joint training of multi-modal network fail in deep learning? (Provably). Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., vol. 162 (PMLR, New York), 9226–9259.Google Scholar
- (2021) Learning with noisy correspondence for cross-modal matching. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems 34 (NeurIPS 2021) (Curran Associates, Red Hook, NY), 29406–29419.Google Scholar
- (2019) Supervised multimodal bitransformers for classifying images and text. Preprint, submitted September 6, https://arxiv.org/abs/1909.02950.Google Scholar
- (2020) Improving model calibration with accuracy versus uncertainty optimization. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 18237–18248.Google Scholar
- (2020) Being Bayesian, even just a bit, fixes overconfidence in ReLU networks. Daumé H III, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 5436–5446.Google Scholar
- (2015) Multimodal data fusion: An overview of methods, challenges, and prospects. Proc. IEEE 103(9):1449–1477.Crossref, Google Scholar
- (2023a) BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J, eds. ICML ‘23 Proc. 40th Internat. Conf. Machine Learn. (JMLR.org, New York), 19730–19742.Google Scholar
- (2022b) BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., vol. 162 (PMLR, New York), 12888–12900.Google Scholar
- (2022a) Two-stage multimodality fusion for high-performance text-based visual question answering. Proc. Asian Conf. Comput. Vision (Springer, New York), 4143–4159.Google Scholar
- (2022c) Trustworthy long-tailed classification. Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition (IEEE Press, New York), 6970–6979.Google Scholar
- (2023b) IMF: Interactive multimodal fusion model for link prediction. Ding Y, Tang J, Sequeda J, Aroyo L, Castillo C, Houben G-J, eds. WWW ‘23 Proc. ACM Web. Conf. 2023 (ACM, New York), 2572–2580.Google Scholar
- (2022) Expanding large pre-trained unimodal models with multimodal information injection for image-text multimodal classification. Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition (IEEE, New York), 15492–15501.Google Scholar
- (2023) A bottleneck network with light attention for multimodal clustering. Knowledge-Based Systems 280:111037.Crossref, Google Scholar
- (2020) Energy-based out-of-distribution detection. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems 33 (NeurIPS 2020) (Curran Associates, Red Hook, NY), 21464–21475.Google Scholar
- (2021) CMA-CLIP: Cross-modality attention CLIP for image-text classification. Preprint, submitted December 7, https://arxiv.org/abs/2112.03562.Google Scholar
- (2023) How noisy is too noisy? The impact of data noise on multimodal recognition of confusion and conflict during collaborative learning. André E, Chetouani M, Vaufreydaz D, Lucas G, Schultz T, Morency L-P, Vinciarelli A, eds. Proc. 25th Internat. Conf. Multimodal Interaction (ACM, New York), 326–335.Google Scholar
- (2021) Trustworthy multimodal regression with mixture of normal-inverse gamma distributions. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems (NeurIPS 2021), vol. 34 (Curran Associates, Red Hook, NY), 6881–6893.Google Scholar
- (2023) Excavating multimodal correlation for representation learning. Inform. Fusion 91:542–555.Crossref, Google Scholar
- (2020) Uncertainty estimation in autoregressive structured prediction. Preprint, submitted February 18, https://arxiv.org/abs/2002.07650.Google Scholar
- (2012) Multimodal feature fusion for robust event detection in web videos. 2012 IEEE Conf. Comput. Vision Pattern Recognition (IEEE, New York), 1298–1305.Google Scholar
- (2023) Deep equilibrium multimodal fusion. Preprint, submitted June 29, https://arxiv.org/abs/2306.16645.Google Scholar
- (2016) Sentiment analysis on multi-view social data. MultiMedia Model. 22nd Internat. Conf. Proc., part II 22 (Springer, Berlin), 15–27.Google Scholar
- (2022) Balanced multimodal learning via on-the-fly gradient modulation. Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition (IEEE, New York), 8238–8247.Google Scholar
- (2018) Hypothesis only baselines in natural language inference. Preprint, submitted May 2, https://arxiv.org/abs/1805.01042.Google Scholar
- (2015) Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. Màrquez L, Callison-Burch C, Su J, eds. Proc. 2015 Conf. Empirical Methods Natl. Language Processing (Association for Computational Linguistics, Stroudsburg, PA), 2539–2544.Google Scholar
- (2021) Learning transferable visual models from natural language supervision. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 8748–8763.Google Scholar
- (2019) Generating diverse high-fidelity images with VQ-VAE-2. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inform. Processing Systems 32 (NeurIPS 2019) (Curran Associates, Red Hook, NY), 14866–14876.Google Scholar
- (2001) Active hidden Markov models for information extraction. Hoffmann F, Hand DJ, Adams N, Fisher D, Guimaraes G, eds. Adv. Intelligent Data Anal. IDA 2001 (Springer, Berlin), 309–318.Google Scholar
- (2018) Evidential deep learning to quantify classification uncertainty. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems 31 (NeurIPS 2018) (Curran Associates, Red Hook, NY).Google Scholar
- (2024) Credibility-aware multi-modal fusion using probabilistic circuits. Preprint, submitted March 5, https://arxiv.org/abs/2403.03281.Google Scholar
- (2022) Noise-robust learning from multiple unsupervised sources of inferred labels. Proc. AAAI Conf. Artificial Intelligence, vol. 36, no. 8 (AAAI Press, Palo Alto, CA), 8315–8323.Google Scholar
- (2014) Two-stream convolutional networks for action recognition in videos. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems 27 (NIPS 2014) (Curran Associates, Red Hook, NY).Google Scholar
- (2023) Learning discriminative multi-relation representations for multimodal sentiment analysis. Inform. Sci. 641:119125.Crossref, Google Scholar
- (2022) FusionM4Net: A multi-stage multi-modal learning algorithm for multi-label skin lesion classification. Medical Image Anal. 76:102307.Crossref, Google Scholar
- (2018) Shifting the baseline: Single modality performance on visual navigation & QA. Preprint, submitted November 1, https://arxiv.org/abs/1811.00613.Google Scholar
- (2019) Multimodal transformer for unaligned multimodal language sequences. Korhonen A, Traum D, Màrquez L, eds. Proc. 57th Annual Meeting Assoc. Comput. Linguistics (Association for Computational Linguistics, Stroudsburg, PA), 6558–6569.Google Scholar
- (2021) Failure prediction by confidence estimation of uncertainty-aware Dirichlet networks. ICASSP 2021 2021 IEEE Internat. Conf. Acoustics Speech Signal Processing (ICASSP) (IEEE, New York), 3525–3529.Google Scholar
- (2021) Deep fusion clustering network. Proc. AAAI Conf. Artificial Intelligence, vol. 35, no. 11 (AAAI Press, Palo Alto, CA), 9978–9987.Google Scholar
- (2021a) Rethinking calibration of deep neural networks: Do not be afraid of overconfidence. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems (NeurIPS 2021), vol. 34 (Curran Associates, Red Hook, NY), 11809–11820.Google Scholar
- (2020) What makes training multi-modal classification networks hard? 2020 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR) (IEEE, New York), 12695–12705.Google Scholar
- (2021b) N24News: A new dataset for multimodal news classification. Preprint, submitted August 30, https://arxiv.org/abs/2108.13327.Google Scholar
- (2015) Recipe recognition with large multimodal food dataset. 2015 IEEE Internat. Conf. Multimedia Expo Workshops (ICMEW) (IEEE, New York), 1–6.Google Scholar
- (2016) Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks. Leibe B, Matas J, Sebe N, Welling M, eds. Comput. Vision ECCV 2016 Proc. Part V 14 (Springer, Cham, Switzerland), 664–679.Google Scholar
- (2005) Applied Linear Regression (John Wiley & Sons, Hoboken, NJ).Crossref, Google Scholar
- (2022) Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., vol. 162 (PMLR, New York), 24043–24055.Google Scholar
- (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems 31 (NeurIPS 2018) (Curran Associates, Red Hook, NY).Google Scholar
- (2021) Deep multimodal fusion for semantic image segmentation: A survey. Image Vision Comput. 105:104042.Crossref, Google Scholar
- (2019) Weakly aligned cross-modal learning for multispectral pedestrian detection. Proc. IEEE/CVF Internat. Conf. Comput. Vision (IEEE, New York), 5127–5137.Google Scholar
- (2023) Provable dynamic fusion for low-quality multimodal data. Preprint, submitted June 3, https://arxiv.org/abs/2306.02050.Google Scholar
- (2023) TSVFN: Two-stage visual fusion network for multimodal relation extraction. Inform. Processing Management 60(3):103264.Crossref, Google Scholar
- (2023) Multi-level confidence learning for trustworthy multimodal classification. Williams B, Chen Y, Neville J, eds. Proc. AAAI Conf. Artificial Intelligence, vol. 37, no. 9 (AAAI Press, Washington, DC), 11381–11389.Google Scholar
- (2019) A review: Deep learning for medical image segmentation using multi-modality fusion. Array 3–4:100004.Crossref, Google Scholar
- (2023b) A review of uncertainty estimation and its application in medical imaging. Preprint, submitted February 16, https://arxiv.org/abs/2302.08119.Google Scholar
- (2023a) UniS-MMC: Multimodal classification via unimodality-supervised multimodal contrastive learning. Preprint, submitted May 16, https://arxiv.org/abs/2305.09299.Google Scholar

