ORCID
0000-0001-8199-0867 (Suh)
Document Type
Article
Publication Date
2025
DOI
10.5758/vsi.250071
Publication Title
Vascular Specialist International
Volume
41
Pages
36 (1-8)
Abstract
Purpose: Large language models (LLMs) can generate clinically relevant text; however, their performance in highly specialized medical domains remains uncertain. This study evaluated ChatGPT-3.5 and ChatGPT-4 (OpenAI) on vascular surgery board–style questions from the Vascular Education and Self-Assessment Program, version 4 (VESAP4), and compared the June 2023 and November 2023 public releases of each model.
Materials and Methods: All non-image VESAP4 questions (n=384) were presented independently three times to each model version (ChatGPT-3.5 and ChatGPT-4, each in June and November 2023). Outcomes included accuracy (proportion correct), consistency ("same-letter," the same option letter across all three attempts, and "consistently correct," the correct option across all three attempts), explanation length (word count), and modes of failure, classified for a pre-specified index attempt using a multi-label taxonomy with independent dual review and consensus. Accuracy and consistency were reported with 95% confidence intervals, and between-condition differences were compared using the chi-square test (proportions) and Welch t-test (word count).
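To illustrate the kind of analysis described above, the following is a minimal sketch (not the authors' code) of computing a 95% confidence interval for a proportion, a chi-square test on two proportions, and a Welch t-test on word counts; all counts and word lists are placeholders, and SciPy is assumed.

```python
# Minimal sketch of the reported analyses; counts/word lists are placeholders.
from math import sqrt
from scipy import stats

N_QUESTIONS = 384  # non-image VESAP4 items (from the abstract)

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """Normal-approximation 95% CI for a proportion."""
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

# Accuracy with 95% CI (placeholder count, not study data).
print(proportion_ci(184, N_QUESTIONS))

# Chi-square test comparing two proportions (e.g., June vs. November accuracy).
table = [[184, N_QUESTIONS - 184],   # June: correct, incorrect
         [179, N_QUESTIONS - 179]]   # November: correct, incorrect
chi2, p, _, _ = stats.chi2_contingency(table)
print(f"chi2={chi2:.2f}, P={p:.4f}")

# Welch t-test comparing explanation word counts between conditions.
words_june = [162, 170, 158]         # placeholder word counts per response
words_nov = [101, 110, 105]
t, p = stats.ttest_ind(words_june, words_nov, equal_var=False)
print(f"t={t:.2f}, P={p:.4f}")
```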
Results: Accuracy was 47.9% and 46.5% for ChatGPT-3.5 (June/November) and 62.4% and 63.8% for ChatGPT-4, respectively. No significant within-model improvement occurred between June and November, whereas ChatGPT-4 outperformed ChatGPT-3.5 in both months (P<0.0001). For consistency, ChatGPT-3.5 improved in "same-letter" consistency (55.5% to 65.6%; P=0.004) with no change in "consistently correct" (40.6% to 40.4%; P=0.94), whereas ChatGPT-4 decreased in "same-letter" consistency (90.1% to 79.7%; P<0.0001) with stable "consistently correct" (60.4% to 58.6%; P=0.61). Between models, ChatGPT-4 exceeded ChatGPT-3.5 on both consistency metrics in both June and November (all P<0.0001). Explanation length shifted within models: ChatGPT-3.5 produced shorter responses in November than in June (105.4±1.0 vs. 164.2±1.5 words; P<0.0001), whereas ChatGPT-4 produced longer responses (282.0±1.3 vs. 120.0±1.2 words; P<0.0001). Performance across VESAP4 subsections was higher for Vascular Medicine and Radiological Imaging/Radiation Safety and lower for Dialysis Access Management. Modes of failure were predominantly external information retrieval errors for ChatGPT-3.5 and logical errors for ChatGPT-4.
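As a back-of-envelope check of one reported comparison, the counts below are reconstructed from the reported same-letter percentages for ChatGPT-3.5 assuming n=384 per condition (an assumption; the exact counts and whether a continuity correction was applied are not stated in the abstract).

```python
# Hypothetical reconstruction of the ChatGPT-3.5 same-letter comparison.
from scipy.stats import chi2_contingency

n = 384
same_letter_june = round(0.555 * n)  # ~213 of 384 (back-calculated, assumed)
same_letter_nov = round(0.656 * n)   # ~252 of 384 (back-calculated, assumed)
table = [[same_letter_june, n - same_letter_june],
         [same_letter_nov, n - same_letter_nov]]
chi2, p, _, _ = chi2_contingency(table, correction=False)
print(f"chi2={chi2:.2f}, P={p:.4f}")  # P is ~0.004, consistent with the abstract
```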
Conclusion: ChatGPT-4 outperformed ChatGPT-3.5 on vascular surgery board–style questions yet achieved only moderate accuracy. Version updates did not consistently improve performance in this specialized domain, underscoring the complexity of decision-making in vascular surgery and the current limitations of LLMs in surgical education.
Rights
© 2025 The Korean Society for Vascular Surgery
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial 4.0 International (CC BY-NC 4.0) License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Original Publication Citation
Suh, D., Le, Q., Dogbe, L., Lavingia, K., & Amendola, M. (2025). Is it getting better? An evaluation of two successive generations of ChatGPT in answering specialized vascular surgery questions. Vascular Specialist International, 41, 1-8, Article 36. https://doi.org/10.5758/vsi.250071
Repository Citation
Suh, D., Le, Q., Dogbe, L., Lavingia, K., & Amendola, M. (2025). Is it getting better? An evaluation of two successive generations of ChatGPT in answering specialized vascular surgery questions. Vascular Specialist International, 41, 1-8, Article 36. https://doi.org/10.5758/vsi.250071
Included in
Artificial Intelligence and Robotics Commons, Medical Education Commons, Surgical Procedures, Operative Commons