Document Type
Article
Publication Date
2026
DOI
10.1007/s44196-025-01067-0
Publication Title
International Journal of Computational Intelligence Systems
Volume
19
Issue
1
Pages
93
Abstract
In today’s rapidly evolving digital landscape, the demand for accurate and contextually relevant captions for image and video content, particularly in the medical domain, is increasingly critical. Despite the proliferation of visual data across platforms, existing captioning systems often struggle with variations in visual settings, complex temporal relationships, and nuanced semantics. Additional challenges, such as limited datasets, privacy concerns, and specialized annotation requirements, make medical image captioning particularly difficult. To address these challenges, we conduct a comparative analysis of cutting-edge deep learning methodologies: Transfer Learning via the MedVisionCapturer model and Transformer models via CausalVLM. Our findings reveal that the Transfer Learning model achieves notable performance on a limited set of CT scan recordings, with a BLEU score of 83.34, a CIDEr score of 89.23, a METEOR score of 43.91, and a ROUGE-L value of 73.41. In contrast, the Transformer model attains lower BLEU (71.42) and CIDEr (74.20) scores but higher METEOR (99.06) and ROUGE-L (96.10) values. This work therefore underscores the promise of the introduced models for improving the efficacy of automatic medical image captioning systems, ultimately fostering better health outcomes in an increasingly complex medical landscape.
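The abstract evaluates caption quality with n-gram overlap metrics such as BLEU. As a minimal sketch of how such a score is computed (simplified, single-reference, sentence-level; the example captions below are invented for illustration and are not from the paper's dataset), BLEU combines clipped n-gram precisions with a brevity penalty:

```python
from collections import Counter
import math

def clipped_precision(reference, hypothesis, n):
    """Clipped n-gram precision: each hypothesis n-gram counts at most
    as often as it appears in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    total = sum(hyp.values())
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / total if total else 0.0

def bleu(reference, hypothesis, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = [clipped_precision(reference, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discounts hypotheses shorter than the reference.
    bp = 1.0 if len(hypothesis) > len(reference) else \
        math.exp(1 - len(reference) / max(len(hypothesis), 1))
    return bp * geo_mean

# Hypothetical example captions (not from the paper's CT dataset).
ref = "axial ct scan shows a hypodense lesion in the liver".split()
hyp = "ct scan shows a hypodense lesion in the liver".split()
score = bleu(ref, hyp)
```

Here every hypothesis n-gram matches the reference, so the precisions are 1.0 and the score reduces to the brevity penalty alone; production systems use multi-reference corpus-level BLEU with smoothing (e.g., NLTK's `corpus_bleu`) rather than this toy version.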
Rights
© The Authors 2025.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if you modified the licensed material. You do not have permission under this license to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Data Availability
Article states: "The real time images are collected from Department of Radiology, Saveetha Medical College & Hospital (SMCH). Saveetha Institute of Medical and Technical Sciences of India. All methods were carried out in accordance with relevant guidelines and regulations. All experimental protocols were approved and consent was obtained from Department of Radiology, Saveetha Medical College & Hospital (SMCH). Saveetha Institute of Medical and Technical Sciences of India."
Original Publication Citation
Aswiga, R. V., & Zahir, M. A. (2026). Exploring the synergy between very large transformer and LSTM models for effective medical captioning from videos to text: The impact of captioning in healthcare. International Journal of Computational Intelligence Systems, 19(1), Article 93. https://doi.org/10.1007/s44196-025-01067-0
Repository Citation
Aswiga, R. V., & Zahir, M. A. (2026). Exploring the synergy between very large transformer and LSTM models for effective medical captioning from videos to text: The impact of captioning in healthcare. International Journal of Computational Intelligence Systems, 19(1), Article 93. https://doi.org/10.1007/s44196-025-01067-0
Included in
Artificial Intelligence and Robotics Commons, Communication Technology and New Media Commons, Data Science Commons, Diagnosis Commons, Medical Education Commons