ARTICLE   Open Access    

Exploring the capabilities of three artificial intelligence chatbots in diagnosis and treatment suggestions for macular hole

  • Authors contributed equally: Duo Yuan, Xinyu Zhao, Zhenquan Wu

  • Received: 27 October 2025
    Revised: 05 February 2026
    Accepted: 09 February 2026
    Published online: 27 April 2026
Visual Neuroscience 43: e018 (2026)
  • Supplementary Fig. S1 Screenshots of ChatGPT-o3 responses for a macular hole case.
    Supplementary Fig. S2 Screenshots of DeepSeek-R1 responses for a macular hole case.
    Supplementary Fig. S3 Screenshots of Gemini 2.5 Pro responses for a macular hole case.
  • References
    [1] Riding G, Teh BL, Yorston D, Steel DH. 2024. Comparison of the use of internal limiting membrane flaps versus conventional ILM peeling on post-operative anatomical and visual outcomes in large macular holes. Eye 38:1876−1881 doi: 10.1038/s41433-024-03024-1
    [2] Chen J, Tao J, Zhang Y. 2024. The inverted internal limiting membrane flap technique is not recommended for the treatment of large macular holes smaller than 650 µm. Retina 44(12):2086−2090 doi: 10.1097/IAE.0000000000004248
    [3] Burton MJ, Ramke J, Marques AP, Bourne RRA, Congdon N, et al. 2021. The Lancet Global Health Commission on Global Eye Health: vision beyond 2020. The Lancet Global Health 9(4):e489−e551 doi: 10.1016/S2214-109X(20)30488-5
    [4] Thirunavukarasu AJ, Mahmood S, Malem A, Foster WP, Sanghera R, et al. 2024. Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: a head-to-head cross-sectional study. PLOS Digital Health 3(4):e0000341 doi: 10.1371/journal.pdig.0000341
    [5] Moëll B, Sand Aronsson F, Akbar S. 2025. Medical reasoning in LLMs: an in-depth analysis of DeepSeek R1. Frontiers in Artificial Intelligence 8:1616145 doi: 10.3389/frai.2025.1616145
    [6] Wei J, Wang X, Huang M, Xu Y, Yang W. 2025. Evaluating the performance of ChatGPT on board style examination questions in ophthalmology: a meta analysis. Journal of Medical Systems 49:94 doi: 10.1007/s10916-025-02227-7
    [7] Huang M, Wang X, Zhou S, Cui X, Zhang Z, et al. 2025. Comparative performance of large language models for patient initiated ophthalmology consultations. Frontiers in Public Health 13:1673045 doi: 10.3389/fpubh.2025.1673045
    [8] Goh E, Gallo R, Hom J, Strong E, Weng Y, et al. 2024. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Network Open 7(10):e2440969 doi: 10.1001/jamanetworkopen.2024.40969
    [9] Li Z, Wang Z, Xiu L, Zhang P, Wang W, et al. 2025. Large language model based multimodal system for detecting and grading ocular surface diseases from smartphone images. Frontiers in Cell and Developmental Biology 13:1600202 doi: 10.3389/fcell.2025.1600202
    [10] Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. 2024. Assessment of a large language model's responses to questions and cases about glaucoma and retina management. JAMA Ophthalmology 142(4):371−375 doi: 10.1001/jamaophthalmol.2023.6917
    [11] Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. 2023. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Internal Medicine 183:589−596 doi: 10.1001/jamainternmed.2023.1838
    [12] Strzalkowski P, Strzalkowska A, Chhablani J, Pfau K, Errera MH, et al. 2024. Evaluation of the accuracy and readability of ChatGPT-4 and Google Gemini in providing information on retinal detachment: a multicenter expert comparative study. International Journal of Retina and Vitreous 10:61 doi: 10.1186/s40942-024-00579-9
    [13] Sandmann S, Hegselmann S, Fujarski M, Bickmann L, Wild B, et al. 2025. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nature Medicine 31(8):2546−2549 doi: 10.1038/s41591-025-03727-2
    [14] Tao BK, Hua N, Milkovich J, Micieli JA. 2024. ChatGPT-3.5 and Bing Chat in ophthalmology: an updated evaluation of performance, readability, and informative sources. Eye 38:1897−1902 doi: 10.1038/s41433-024-03037-w
    [15] Fowler T, Pullen S, Birkett L. 2024. Performance of ChatGPT and Google Bard on the official Part 1 FRCOphth practice questions. The British Journal of Ophthalmology 108(10):1379−1383 doi: 10.1136/bjo-2023-324091
    [16] Carlà MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, et al. 2024. Exploring AI-chatbots' capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. The British Journal of Ophthalmology 108(10):1457−1469 doi: 10.1136/bjo-2023-325143
    [17] Mehandru N, Miao BY, Almaraz ER, Sushil M, Butte AJ, et al. 2024. Evaluating large language models as agents in the clinic. npj Digital Medicine 7:84 doi: 10.1038/s41746-024-01083-y
    [18] Radke NV, Ruamviboonsuk P, Steel DH, Tian T, Hunyor AP, et al. 2025. Controversies, consensuses, and guidelines on macular hole surgery by the Asia–Pacific Vitreo-retina Society (APVRS) and the Asia–Pacific Academy of Professors in Ophthalmology (AAPPO). Eye and Vision 12:30 doi: 10.1186/s40662-025-00446-0
    [19] Chaudhary V, Sarohia GS, Phillips MR, Zeraatkar D, Xie JS, et al. 2023. Role of positioning after full-thickness macular hole surgery: a systematic review and meta-analysis. Ophthalmology Retina 7:33−43 doi: 10.1016/j.oret.2022.06.015
    [20] Chen G, Tzekov R, Jiang F, Mao S, Tong Y, et al. 2020. Inverted ILM flap technique versus conventional ILM peeling for idiopathic large macular holes: a meta-analysis of randomized controlled trials. PLOS ONE 15(7):e0236431 doi: 10.1371/journal.pone.0236431
    [21] Manasa S, Kakkar P, Kumar A, et al. 2018. Comparative evaluation of standard ILM peel with inverted ILM flap technique in large macular holes: a prospective randomized study. Ophthalmic Surgery, Lasers & Imaging Retina 49(4):236−240 doi: 10.3928/23258160-20180329-04
    [22] Hager P, Jungmann F, Holland R, Bhagat K, Hubrecht I, et al. 2024. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine 30(9):2613−2622 doi: 10.1038/s41591-024-03097-1
    [23] Azzopardi M, Ng B, Logeswaran A, Loizou C, Cheong RCT, Gireesh P, et al. 2024. Artificial intelligence chatbots as sources of patient-education material for cataract surgery: ChatGPT-4 versus Google Bard. BMJ Open Ophthalmology 9(1):e001824 doi: 10.1136/bmjophth-2024-001824
    [24] Eid K, Eid A, Wang D, Raiker RS, Chen S, et al. 2024. Optimizing ophthalmology patient education via chatbot-generated materials: readability analysis of AI-generated patient education materials and the American Society of Ophthalmic Plastic and Reconstructive Surgery patient brochures. Ophthalmic Plastic and Reconstructive Surgery 40(2):212−216 doi: 10.1097/IOP.0000000000002549
    [25] Chen X, Zhao Z, Zhang W, Xu P, Wu Y, et al. 2024. EyeGPT for patient inquiries and medical education: development and validation of an ophthalmology large language model. Journal of Medical Internet Research 26:e60063 doi: 10.2196/60063
    [26] Templin T, Perez MW, Sylvia S, Leek J, Sinnott-Armstrong N. 2024. Addressing 6 challenges in generative AI for digital health. PLOS Digital Health 3(8):e0000503 doi: 10.1371/journal.pdig.0000503
    [27] Mihalache A, Huang RS, Popovic MM, Patil NS, Pandya BU, et al. 2024. Accuracy of an artificial intelligence chatbot's interpretation of clinical ophthalmic images. JAMA Ophthalmology 142(4):321−326 doi: 10.1001/jamaophthalmol.2024.0017
    [28] Agin A, Ozturk Y, Kivrak U. 2025. Harnessing generative pre-trained transformer technology for clinical decision support in retinal detachment. Medical Bulletin of Haseki 63(3):128−134 doi: 10.4274/haseki.galenos.2025.79553
    [29] Topol EJ. 2019. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine 25:44−56 doi: 10.1038/s41591-018-0300-7
    [30] He J, Baxter SL, Xu J, Xu J, Zhou X, et al. 2019. The practical implementation of artificial intelligence technologies in medicine. Nature Medicine 25:30−36 doi: 10.1038/s41591-018-0307-0



Abstract: To compare the diagnosis agreement, treatment suggestion agreement, and answer quality of three artificial intelligence (AI) chatbots with an ophthalmology resident for macular hole, we assembled 50 macular hole cases, including lamellar macular hole, full-thickness macular hole (Gass II–IV), and macular hole with rhegmatogenous retinal detachment. Cases with insufficient preoperative information were excluded. Each anonymised record was presented to three AI chatbots (ChatGPT-o3, DeepSeek-R1, Gemini 2.5 Pro) and to an ophthalmology resident with three years of training. The consensus diagnosis and treatment suggestion of two retinal specialists served as the gold standard. Outcomes were diagnosis agreement, treatment suggestion agreement, and Global Quality Score (GQS). Diagnosis agreement was 86% (95% CI 73.3–94.2) for ChatGPT-o3, 82% (68.6–91.4) for DeepSeek-R1, 80% (66.3–89.9) for Gemini 2.5 Pro, and 82% (68.6–91.4) for the resident. Treatment suggestion agreement was 92% (80.8–97.8), 86% (73.3–94.2), 80% (66.3–89.9), and 70% (55.4–82.1), respectively; the resident's agreement was significantly lower than ChatGPT-o3 (p = 0.006). GQS ratings ranked Gemini 2.5 Pro highest, followed by ChatGPT-o3, the resident, and DeepSeek-R1. In conclusion, the three AI chatbots achieved similar diagnosis agreement for macular hole; ChatGPT-o3 most often matched specialist treatment suggestions, and Gemini 2.5 Pro provided the highest answer quality, suggesting that combining their strengths may enhance clinical decision support.

    • Macular hole threatens central vision and can cause lasting visual disability if surgical repair is delayed, imposing a serious burden on patients and health systems[1,2]. This condition often affects people in their working years, so delayed treatment carries significant personal and socioeconomic consequences. Access to timely and consistent care is challenging in many regions because retinal specialists are in short supply and unevenly distributed, and referral pathways vary widely[3]. As a result, general ophthalmologists often struggle to diagnose and manage macular holes accurately without specialist input.

      Over the past few years, large language models (LLMs) have been rapidly developed and deployed in healthcare, with artificial intelligence (AI) chatbots like ChatGPT and DeepSeek becoming widely accessible by 2025[4,5]. AI chatbots show good performance on board-style examination questions and are now widely used in patient-initiated consultations[6,7]. As access expands, these AI chatbots can assist generalist providers with complex cases, which motivates focused evaluations in specific diseases[8].

      Recent ophthalmic studies have applied AI chatbots to glaucoma and retina management questions, professional examination items, patient education materials, and planning and interpretation of clinical images[9]. Across these evaluations, leading systems sometimes approached clinician output, but performance and readability varied by model and task[10]. Evidence for macular hole remains limited, and further studies are needed to verify the performance of AI chatbots as clinical decision support for diagnosis and treatment suggestions.

      This study aims to compare three AI chatbots (ChatGPT-o3, Gemini 2.5 Pro, and DeepSeek-R1) for macular hole by assessing agreement for diagnosis and treatment suggestion, and answer quality using the Global Quality Score (GQS).

    • The study adhered to the Declaration of Helsinki and was approved by the Institutional Review Board of Shenzhen Eye Hospital (No. 2025KYPJ120, approved on 10 July 2025). The ethics committee granted a waiver of informed consent because of the retrospective design and the use of deidentified data.

      We performed a retrospective comparative review of 50 participants with macular holes identified from medical records between 1 January and 31 March 2025. Records were eligible when the chart contained adequate clinical history, examination findings, and ancillary results, including optical coherence tomography (OCT) reports when available, to construct a standardized participant record. Records lacking core preoperative information were excluded. Atypical or equivocal presentations were not specifically selected, and most eligible records had sufficiently clear documentation to support a reference diagnosis. For each eligible participant, key clinical variables were abstracted from the medical records and OCT reports, deidentified, and rewritten as a participant record with harmonized terminology. The overall workflow is shown in Fig. 1.

      Figure 1. 

      Study design and workflow. The workflow comprised three steps: (a) Medical documentation and specialist standards used to construct the MH participant record and define the gold standard, (b) response generation and evaluation by three AI chatbots and an ophthalmology resident, and (c) outcome analyses, including diagnostic agreement, treatment suggestion agreement, and GQS. Abbreviations: MH, macular hole; OCT, optical coherence tomography; GQS, Global Quality Score.

    • All participant records were rewritten into a fixed text template with harmonized length and terminology. For each participant record, the template included age, sex, affected eye, presenting symptoms and their duration, and preoperative best-corrected visual acuity, as well as lens status, any history of ocular surgery, and major systemic comorbidities when documented. For every participant record, an OCT report was available and had been written in a standardized format by experienced technicians in the imaging department. From these OCT reports, we recorded the minimum hole diameter, the base diameter when reported, and whether vitreomacular traction, epiretinal membrane, intraretinal cysts, and retinal detachment were present. The same order and phrasing were used for all templates so that the three AI chatbots and the resident received comparable structured information; a minimal sketch of such a template follows this paragraph. The order of participant records was independently randomized for each method. To standardize language, the source participant records, originally written in Chinese, were translated into English in a single standardized pass with ChatGPT; the same English version was used unchanged for all evaluations. Prompts requested one final diagnosis and one treatment suggestion for each record. External tools and web browsing were disabled.
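
      For illustration only, the sketch below shows how such a fixed template might be rendered in Python. All field names, defaults, and phrasing are hypothetical stand-ins; the study's actual template is described only in the prose above.

```python
# Hypothetical sketch of a fixed participant-record template.
# Field names and phrasing are illustrative, not the authors' actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParticipantRecord:
    age: int
    sex: str                                # "male" / "female"
    eye: str                                # "right" / "left"
    symptoms: str
    duration_months: int
    bcva: str                               # preoperative best-corrected visual acuity
    lens_status: str
    min_hole_diameter_um: int               # from the standardized OCT report
    base_diameter_um: Optional[int] = None  # only when reported
    vmt: bool = False                       # vitreomacular traction
    erm: bool = False                       # epiretinal membrane
    intraretinal_cysts: bool = False
    retinal_detachment: bool = False

def render(r: ParticipantRecord) -> str:
    """Render one record with the same field order and phrasing for every case."""
    present = lambda b: "present" if b else "absent"
    base = f", base diameter {r.base_diameter_um} um" if r.base_diameter_um else ""
    return (
        f"Patient: {r.age}-year-old {r.sex}, {r.eye} eye. "
        f"Symptoms: {r.symptoms} for {r.duration_months} months. "
        f"Preoperative BCVA: {r.bcva}. Lens status: {r.lens_status}. "
        f"OCT report: minimum hole diameter {r.min_hole_diameter_um} um{base}; "
        f"vitreomacular traction {present(r.vmt)}; epiretinal membrane {present(r.erm)}; "
        f"intraretinal cysts {present(r.intraretinal_cysts)}; "
        f"retinal detachment {present(r.retinal_detachment)}."
    )
```

      Rendering every record through one function is what enforces the identical field order and phrasing across evaluators described above.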

      All AI chatbot interactions followed one uniform protocol. If neutral clinical wording triggered safety filters, we applied minimal rephrasing without altering clinical content. If a reply was cut off by the platform's length limit, we sent one continuation request. If an answer drifted from the requested structure, we issued one reminder to return to the predefined format. Supplementary Figs S1−S3 show the standardized interaction format used in this study.

    • Each participant's record was presented as text to three AI chatbots and to an ophthalmology resident with three years of training. ChatGPT-o3 (OpenAI, USA; released April 2025; knowledge cutoff June 2024) was included as a widely used conversational model with peer-reviewed evidence from clinical settings[11]. Gemini 2.5 Pro (Google, USA; released March 2025; knowledge cutoff January 2025) was included as a contemporary model with a published head-to-head evaluation against ChatGPT on retinal detachment information tasks, covering accuracy, readability, and expert quality grading[12]. DeepSeek-R1 (DeepSeek, China; released January 2025; knowledge cutoff 2024) was included for its clinical reasoning strength and growing use in China, with published evaluations reporting competitive performance on clinical decision-support tasks[13]. All interactions with the AI chatbots were performed in June 2025 using the official web interfaces with default settings. An ophthalmology resident working full-time in our department received the same participant records and provided one final diagnosis and one treatment suggestion per participant.

    • The primary outcome measures were the agreement of diagnosis and treatment suggestion with the gold standard set by two retinal specialists. All participant records were deidentified and then independently reviewed by two retinal specialists, each with more than ten years of experience in clinical practice. For every participant, each specialist first recorded an initial diagnosis as full-thickness macular hole, lamellar macular hole, macular hole with retinal detachment, or another diagnosis when appropriate. They then recorded an initial treatment suggestion, including whether they would recommend surgery or observation, the main surgical approach they would choose if surgery was planned, the type of intraocular tamponade they would use, whether they would perform combined cataract surgery in the same session, and the postoperative positioning they would advise. After this independent step, the two specialists met to review all participant records together. Whenever their initial opinions differed for the diagnosis or for the treatment suggestion, they discussed the clinical details of that participant until they reached a consistent conclusion. If they could not reach an agreement after discussion, that participant would be excluded from the study. These initial independent assessments were also used to summarise the inter-observer agreement between the two specialists before consensus. The final consensus diagnosis and the final consensus treatment suggestion for each participant were used as the gold standard. A response from a chatbot or from the ophthalmology resident was counted as correct when it matched this consensus diagnosis or treatment suggestion.
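
      A minimal sketch of this scoring rule, assuming hypothetical label strings: a response is credited only on an exact match with the specialists' consensus label after trivial normalization.

```python
# Sketch of the agreement scoring described above (labels are illustrative).
def normalise(label: str) -> str:
    """Lower-case and collapse whitespace so formatting differences do not count."""
    return " ".join(label.lower().split())

def agreement_rate(responses: list[str], consensus: list[str]) -> float:
    """Fraction of records where the response matches the consensus label."""
    assert len(responses) == len(consensus)
    hits = sum(normalise(r) == normalise(c) for r, c in zip(responses, consensus))
    return hits / len(consensus)

# Example with made-up labels: agreement_rate(["FTMH", "LMH"], ["ftmh", "MH-RRD"]) -> 0.5
```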

      Answer quality was graded by two masked graders, each with ten years of ophthalmology clinical experience, using the five-point GQS (Table 1). Graders evaluated each participant's record for clarity, clinical correctness, completeness, and practical usefulness. Graders worked independently and were masked to the responder's identity and to each other's scores. For each grader, GQS was analysed separately.

      Table 1. Global Quality Score descriptions.

      Score | Overall description
      1 | Poor quality, poor flow of the site, most information missing, not at all useful for patients
      2 | Generally poor quality and poor flow, some information listed but many important topics missing, of very limited use to patients
      3 | Moderate quality, suboptimal flow, some important information is adequately discussed but others poorly discussed, somewhat useful for patients
      4 | Good quality and generally good flow, most of the relevant information is listed, but some topics not covered, useful for patients
      5 | Excellent quality and excellent flow, very useful for patients
    • All statistical analyses were performed using R software (version 4.4.1; R Foundation for Statistical Computing, Austria). Initial inter-observer agreement between the two retinal specialists before consensus was summarised as raw percentage agreement for diagnosis and for treatment suggestions. The paired chi-square (McNemar) test was used to compare agreement rates, and score comparisons were conducted with generalized estimating equations (GEE). For the GQS, the mean score of each response was calculated and then averaged to give an overall mean score per evaluator. p < 0.05 was considered statistically significant.
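
      The authors analysed the data in R; the sketch below re-expresses two of the headline computations in Python with statsmodels as an illustration, not the authors' script. The exact (Clopper–Pearson) interval corresponds to the confidence intervals in the abstract and Table 3, and the McNemar test is the paired chi-square comparison named above; the 0/1 correctness vectors are assumed inputs.

```python
# Illustration only: exact 95% CI for an agreement rate, and the paired
# chi-square (McNemar) comparison of two evaluators scored on the same records.
import numpy as np
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.contingency_tables import mcnemar

# Clopper-Pearson ("beta") interval, e.g. 43/50 correct diagnoses:
low, high = proportion_confint(count=43, nobs=50, alpha=0.05, method="beta")
print(f"86% (95% CI {100 * low:.1f}-{100 * high:.1f})")  # 73.3-94.2, as reported

def paired_chi_square_p(x, y) -> float:
    """McNemar p-value from the 2x2 table of paired correct/incorrect outcomes."""
    x, y = np.asarray(x), np.asarray(y)
    table = [[np.sum((x == 1) & (y == 1)), np.sum((x == 1) & (y == 0))],
             [np.sum((x == 0) & (y == 1)), np.sum((x == 0) & (y == 0))]]
    # exact=False gives the chi-square form; exact=True uses the binomial test
    return float(mcnemar(table, exact=False, correction=True).pvalue)
```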

    • The study consisted of 50 participants with a mean age of 59.5 ± 9.9 years. Of these participants, 72% were female, and 40% had the right eye affected. Phenotypes included full-thickness macular hole (FTMH) in 37 (74%), lamellar macular hole (LMH) in 5 (10%), and macular hole with rhegmatogenous retinal detachment (MH-RRD) in 8 (16%). For the FTMH subset (n = 37), Gass staging was: stage II, 9 (24%); stage III, 3 (8%); stage IV, 25 (68%). In the gold-standard treatment suggestions from the retinal specialists, the tamponade choice was gas in 44/50 (88%), silicone oil in 4/50 (8%), and no tamponade in 2/50 (4%). Baseline characteristics are summarised in Table 2.

      Table 2.  Baseline clinical characteristics of participants with macular hole.

      Variable | Value | %
      Number of participants | 50 | N/A
      Age (years), mean ± SD | 59.5 ± 9.9 | N/A
      Sex (male/female) | 14/36 | 28/72
      Eye laterality (right/left) | 20/30 | 40/60
      Macular hole phenotype:
        LMH | 5 | 10
        FTMH | 37 | 74
        MH-RRD | 8 | 16
      Gass stage (FTMH only, n = 37):
        Stage II | 9 | 24
        Stage III | 3 | 8
        Stage IV | 25 | 68
      Tamponade in reference plan (a):
        Gas | 44 | 88
        Silicone oil | 4 | 8
        None | 2 | 4
      Ocular comorbidities (b):
        Cataract | 8 | 16
        High myopia | 6 | 12
        Epiretinal membrane | 5 | 10
        Others | 3 | 6
      (a) Percentages use the cohort size as denominator. (b) Comorbidities are not mutually exclusive. Abbreviations: LMH, lamellar macular hole; FTMH, full-thickness macular hole; MH-RRD, macular hole with rhegmatogenous retinal detachment.
    • Before consensus, the two retinal specialists showed high initial agreement, with raw agreement of 49/50 (98%) for diagnosis and 47/50 (94%) for treatment suggestions. Diagnosis agreement was 43/50 (86%; 95% CI 73.3–94.2) for ChatGPT-o3, 41/50 (82%; 68.6–91.4) for DeepSeek-R1, 40/50 (80%; 66.3–89.9) for Gemini 2.5 Pro, and 41/50 (82%; 68.6–91.4) for the resident. By macular hole subtype, diagnostic agreement with the reference standard was high for FTMH and LMH and lower for MH-RRD. For FTMH (n = 37), agreement was 31/37 for ChatGPT-o3, 30/37 for DeepSeek-R1, 29/37 for Gemini 2.5 Pro, and 32/37 for the ophthalmology resident. For LMH (n = 5), agreement was 5/5 for each of the three AI chatbots and 4/5 for the resident. For MH-RRD (n = 8), agreement was 7/8 for ChatGPT-o3, 6/8 for DeepSeek-R1, 6/8 for Gemini 2.5 Pro, and 5/8 for the resident. Paired chi-square tests versus ChatGPT-o3 showed no significant differences (DeepSeek-R1 p = 0.617; Gemini 2.5 Pro p = 0.248; resident p = 0.803). Table 3 summarises these results, and Fig. 2a shows the diagnosis agreement.

      Table 3.  Macular hole diagnosis and treatment suggestion agreement and global quality score.

      Evaluator | Diagnosis agreement, % (95% CI) | p | Treatment agreement, % (95% CI) | p | GQS, grader 1 | GQS, grader 2
      ChatGPT-o3 | 86 (73.3−94.2) | N/A | 92 (80.8−97.8) | N/A | 3.78 ± 0.65 | 3.78 ± 1.00
      Gemini 2.5 Pro | 80 (66.3−89.9) | 0.248 | 80 (66.3−89.9) | 0.077 | 4.02 ± 0.43 | 3.88 ± 1.00
      DeepSeek-R1 | 82 (68.6−91.4) | 0.617 | 86 (73.3−94.2) | 0.248 | 3.18 ± 1.26 | 3.14 ± 1.32
      Resident | 82 (68.6−91.4) | 0.803 | 70 (55.4−82.1) | 0.006 | 3.70 ± 0.65 | 3.50 ± 1.02
      CI, confidence interval; GQS, Global Quality Score. Agreement p-values are from the paired chi-square test versus ChatGPT-o3. Both masked graders had ten years of ophthalmology clinical experience.

      Figure 2. 

      Diagnosis and treatment suggestion agreement for macular hole across ChatGPT-o3, Gemini 2.5 Pro, DeepSeek-R1, and an ophthalmology resident. (a) Diagnosis agreement. (b) Treatment suggestion agreement. Bars indicate the agreement rate, with exact Clopper–Pearson 95% confidence intervals (n = 50). Pairwise comparisons were performed using paired chi-square tests. ** p < 0.01; ns, not significant. Abbreviations: CI, confidence interval.

    • Agreement rates were 46/50 (92%; 95% CI 80.8–97.8) for ChatGPT-o3, 43/50 (86%; 73.3–94.2) for DeepSeek-R1, 40/50 (80%; 66.3–89.9) for Gemini 2.5 Pro, and 35/50 (70%; 55.4–82.1) for the resident. In comparison with ChatGPT-o3, differences were not significant for DeepSeek-R1 (p = 0.248) or Gemini 2.5 Pro (p = 0.077), whereas the resident's agreement was significantly lower (p = 0.006). Fig. 2b shows the comparison of treatment suggestion agreement.

    • Answer quality was independently rated by two graders. For grader 1, the mean GQS was 4.02 ± 0.43 for Gemini 2.5 Pro, 3.78 ± 0.65 for ChatGPT-o3, 3.70 ± 0.65 for the resident, and 3.18 ± 1.26 for DeepSeek-R1. Gemini 2.5 Pro was the highest, ChatGPT-o3 and the resident were close, and DeepSeek-R1 was the lowest. For grader 2, the mean GQS was 3.88 ± 1.00 for Gemini 2.5 Pro, 3.78 ± 1.00 for ChatGPT-o3, 3.50 ± 1.02 for the resident, and 3.14 ± 1.32 for DeepSeek-R1. The order was the same.

      Pairwise comparisons from a GEE model showed the following. For grader 1, Gemini 2.5 Pro was higher than ChatGPT-o3 by 0.24 points (p = 0.007) and higher than the resident by 0.32 points (p < 0.001). Both Gemini 2.5 Pro and ChatGPT-o3 were higher than DeepSeek-R1 (0.84, p < 0.001; 0.60, p = 0.003), and ChatGPT-o3 and the resident did not differ (p = 0.151). For grader 2, ChatGPT-o3 was higher than DeepSeek-R1 by 0.64 points (p = 0.006) and higher than the resident by 0.28 points (p = 0.007), and Gemini 2.5 Pro was higher than DeepSeek-R1 by 0.74 points (p = 0.003). Gemini 2.5 Pro did not differ from ChatGPT-o3 (p = 0.587) and was 0.38 points above the resident (p = 0.051). Across graders, DeepSeek-R1 scored consistently lower, whereas Gemini 2.5 Pro and ChatGPT-o3 had similar scores, and their rank varied by grader (Fig. 3a, b).
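
      As a hedged illustration of how such pairwise comparisons can be set up (not the authors' R code), the sketch below fits a Gaussian GEE with an exchangeable correlation structure, clustering repeated GQS ratings within participant records; the column names and dummy data are hypothetical.

```python
# Hypothetical GEE sketch: GQS regressed on evaluator, clustered by record.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
evaluators = ["ChatGPT-o3", "Gemini 2.5 Pro", "DeepSeek-R1", "Resident"]
df = pd.DataFrame({                       # long format: one row per record x evaluator
    "case_id": np.repeat(np.arange(50), len(evaluators)),
    "evaluator": np.tile(evaluators, 50),
    "gqs": rng.integers(1, 6, size=50 * len(evaluators)).astype(float),
})

model = smf.gee(
    "gqs ~ C(evaluator, Treatment(reference='ChatGPT-o3'))",
    groups="case_id",                     # repeated measures within each record
    data=df,
    cov_struct=sm.cov_struct.Exchangeable(),
    family=sm.families.Gaussian(),
)
print(model.fit().summary())
```

      Under this setup, the coefficients on the evaluator dummies are mean GQS differences versus the reference evaluator, which is how point differences such as those quoted above would be read off.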

      Figure 3. 

      Global quality score across ChatGPT-o3, Gemini 2.5 Pro, DeepSeek-R1, and an ophthalmology resident. (a) Grader 1. (b) Grader 2. Bars show mean ± SD. Brackets indicate pairwise comparisons based on generalized estimating equations. *p < 0.05, ** p < 0.01, *** p < 0.001; ns, not significant. Abbreviations: GQS, Global Quality Score; SD, standard deviation.

    • We grouped errors into three main categories: diagnostic errors, treatment planning errors, and format deviations. All three AI chatbots provided a diagnosis and a treatment suggestion for every participant, so errors rarely involved a missing answer. Diagnostic errors mainly reflected mislabelling rather than completely different disease types. The most typical pattern was that the written description mentioned high myopia, but the macular hole was still labelled as idiopathic (ChatGPT-o3 n = 3, DeepSeek-R1 n = 2, Gemini 2.5 Pro n = 1), which can be regarded as misclassification of etiology.

      For treatment planning, errors were mostly omissions or inconsistencies in key perioperative details, such as whether to recommend face-down positioning, whether to combine cataract surgery, and how to describe the planned procedure. Among participants for whom gas tamponade was recommended, face-down positioning was recorded in 44/49 cases by ChatGPT-o3, 13/45 by DeepSeek-R1, and 3/47 by Gemini 2.5 Pro; the remaining suggestions omitted positioning and were counted as errors. The AI chatbots also rarely suggested combined cataract extraction and vitrectomy when a clinically significant cataract was described in the participant record; this combination appeared once with Gemini 2.5 Pro and not at all with ChatGPT-o3 or DeepSeek-R1. These treatment planning omissions tended to occur in records where macular findings were described in detail, but cataract symptoms or postoperative instructions were not emphasized in the text.

      Format deviations were uncommon and mainly involved giving more than one diagnosis or listing several possible procedures instead of one clear treatment suggestion. When this happened, the responses were re-generated after a brief repeat prompt, and the clinical content did not change. For the ophthalmology resident, discrepancies with the reference standard also mainly concerned perioperative details, such as whether to recommend face-down positioning or whether to combine cataract surgery, rather than opposite decisions on whether to operate or completely different diagnostic categories.

    • Using participant data from medical records and OCT reports, we compared the performance of three AI chatbots on agreement in diagnosis and treatment suggestion and on answer quality. An ophthalmology resident was also included as a clinical comparator. Diagnosis agreement was similar across the three AI chatbots and the resident. Treatment suggestion agreement was highest for ChatGPT-o3 and lowest for the resident. For answer quality, both graders ranked Gemini 2.5 Pro highest, ChatGPT-o3 in the middle, and DeepSeek-R1 lowest.

      Diagnosis agreement was similar across the three AI chatbots and the resident. Tao et al. reported moderate to good performance for ChatGPT-3.5 and Bing Chat on well-defined ophthalmic questions, and Fowler et al. found that chatbots performed best when prompts were explicit and narrowly scoped[14,15]. Carlà et al. analysed 50 retinal detachment cases and reported diagnostic agreement of 80% for ChatGPT-3.5, 84% for ChatGPT-4, and 70% for Google Gemini, with ChatGPT-4 higher than Gemini (p = 0.03)[16]. Studies outside ophthalmology also describe strong accuracy when instructions are clear and information is structured[17]. In our subtype analysis, agreement remained high for FTMH and LMH but was lower for MH-RRD, consistent with the added complexity of detachment-related cases. Overall, the current chatbots aligned best with conventional macular hole diagnosis and standard vitrectomy decisions, rather than newer refinements in surgical technique[18].

      Differences among the three AI chatbots became significant when providing treatment suggestions. ChatGPT-o3 most often matched the reference treatment suggestions for whether to operate and for tamponade choice, making it useful for cross-checking core decisions. Gemini 2.5 Pro achieved the highest GQS and usually provided clearer explanations, which may be helpful for clinical notes and patient instructions, although some aetiology labels did not match the reference. DeepSeek-R1 showed similar agreement to the other models, but it had the lowest GQS and more often omitted face-down positioning. These differences show that the main gaps were in perioperative details rather than in the core treatment choice. The resident showed the same pattern, with lower treatment suggestion agreement, so combining ChatGPT-o3 for checking treatment suggestions and Gemini 2.5 Pro for clear documentation can improve clinical decision support for macular holes.

      The treatment suggestion agreement was lower than the diagnosis agreement. This is consistent with clinical practice for macular hole, where decisions are based not only on the macular diagnosis but also on patient-specific factors. Published studies report that the value of face-down positioning depends on hole size and related factors, and that centres differ in how they prescribe it, including whether they use it routinely or selectively and how long patients are asked to maintain it[19]. For large macular holes, surgeons may choose internal limiting membrane peeling or the inverted internal limiting membrane flap, and both are used in practice[20,21]. Retina specialists also consider coexisting cataract, patient age, systemic comorbidities, visual needs, the status of the fellow eye, and the patient's ability to comply with postoperative positioning. These clinical variations and patient factors make modest disagreement in treatment suggestions plausible even when diagnosis agreement is high. Our findings therefore support using AI chatbots as adjunct tools to check core treatment suggestions and to generate clear written explanations, rather than as standalone decision makers, with the treating retinal specialist remaining responsible for integrating all patient-specific factors. This is especially important when clinical information is uncertain. If the clinical record or OCT report is ambiguous or incomplete, AI chatbots may still answer with high confidence, which can be misleading in borderline cases. These patient-related factors can influence both the treatment suggestion chosen as the reference standard and the agreement rates observed for different evaluators. Studies outside ophthalmology report the same pattern, with lower agreement for treatment suggestions than for diagnosis[22].

      Answer quality differed across the three AI chatbots, consistent with prior ophthalmic reports. Work on cataract education found that chatbot materials were generally appropriate yet differed in readability and factual accuracy, which is in line with the variability we observed in GQS[23]. Eid et al. also noted that patient-facing materials from AI chatbots often needed improvements in readability[24]. Even AI chatbots developed for ophthalmology still show differences in clarity and factual accuracy across systems and with changes in prompt wording[25,26]. When questions require interpreting images, performance is lower than when answers can be drawn from the participant record, so clear written reasoning remains important when only medical records and OCT reports are available[27]. Together, these findings explain why AI chatbots can reach similar diagnosis agreement while diverging on GQS, and support using answer quality as a complementary metric.

      In this study, we evaluated the AI chatbots using case records compiled from medical records and OCT reports, reflecting a common workflow in which clinicians review written information before deciding. This text-only design is consistent with recent retinal detachment work that applied GPT-based platforms to standardized clinical records without raw imaging and reported encouraging agreement with clinical decisions[28]. A fixed template and a uniform interaction protocol enabled consistent comparison across evaluators. Guidance on evaluation and deployment emphasises clear prompts, transparent procedures, and ongoing monitoring when chatbots are used to support documentation and triage[29,30].

    • Our study has several limitations. First, the participant records were compiled retrospectively from routine charts, so some clinical details may have been missing or simplified, which could have influenced how the AI chatbots and the ophthalmology resident stated the diagnosis and the treatment suggestion. Second, all evaluations used a standard case template prepared from medical records and OCT reports, and neither the chatbots nor the resident saw the OCT images themselves; this cleaner text format may overestimate performance compared with daily practice, where clinicians interpret imaging together with incomplete or sometimes inconsistent notes. Finally, answer quality was graded with the five-point Global Quality Score, a subjective measure, and the AI chatbots were evaluated at a single time point, so their recommendations may not fully reflect later changes in macular hole surgery or in model updates.

    • Three AI chatbots demonstrated comparable diagnosis agreement for macular holes. ChatGPT-o3 achieved the highest agreement in treatment suggestions, while Gemini 2.5 Pro received the highest ratings for overall answer quality. Combining the strengths of different AI chatbots may enhance clinical decision support in macular hole management.

      • The study was approved by the Institutional Review Board of Shenzhen Eye Hospital (No. 2025KYPJ120, July 10, 2025) and adhered to the Declaration of Helsinki. The requirement for informed consent was waived by the ethics committee due to the retrospective nature of the study and the use of anonymized data.

      • The authors confirm contribution to the paper as follows: study conception and design: Zhang G, Chi W; methodology: Yuan D, Zhao X, Wu Z; formal analysis: Yuan D; data curation: Peng S, Duan N, Cui K, Yu Z, Zhang H, Yang W; visualization, draft manuscript preparation: Yuan D; writing – review & editing: Yuan D, Zhao X, Wu Z, Peng S, Duan N, Cui K, Yu Z, Zhang H, Yang W, Wei W, Chi W, Zhang G; supervision: Wei W, Chi W, Zhang G; funding acquisition, guarantor: Zhang G. All authors reviewed the results and approved the final version of the manuscript.

      • All data generated or analyzed during this study are included in this published article and its supplementary information files.

      • The authors declare that they have no conflict of interest.

      • Authors contributed equally: Duo Yuan, Xinyu Zhao, Zhenquan Wu

      • Copyright: © 2026 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.
  • About this article
    Cite this article
    Yuan D, Zhao X, Wu Z, Peng S, Duan N, et al. 2026. Exploring the capabilities of three artificial intelligence chatbots in diagnosis and treatment suggestions for macular hole. Visual Neuroscience 43: e018 doi: 10.48130/vns-0026-0016
