CAAC: Co-attentive Actionability Classification for Assessing Patient Education Videos
Published Online: 9 Jun 2026
https://doi.org/10.1287/ijoc.2023.0493

