Exploring the capabilities of three artificial intelligence chatbots in diagnosis and treatment suggestions for macular hole


Figure 1.
Study design and workflow. The workflow comprised three steps: (a) medical documentation and specialist standards used to construct the MH participant record and define the gold standard, (b) response generation and evaluation by three AI chatbots and an ophthalmology resident, and (c) outcome analyses, including diagnostic agreement, treatment suggestion agreement, and GQS. Abbreviations: AI, artificial intelligence; MH, macular hole; OCT, optical coherence tomography; GQS, Global Quality Score.
Figure 2.
Diagnosis and treatment suggestion agreement for macular hole across ChatGPT-o3, Gemini 2.5 Pro, DeepSeek-R1, and an ophthalmology resident. (a) Diagnosis agreement. (b) Treatment suggestion agreement. Bars indicate the agreement rate, with exact Clopper–Pearson 95% confidence intervals (n = 50). Pairwise comparisons were performed using paired chi-square tests. ** p < 0.01; ns, not significant. Abbreviations: CI, confidence interval.
Figure 3.
Global Quality Score across ChatGPT-o3, Gemini 2.5 Pro, DeepSeek-R1, and an ophthalmology resident. (a) Grader 1. (b) Grader 2. Bars show mean ± SD. Brackets indicate pairwise comparisons based on generalized estimating equations. * p < 0.05; ** p < 0.01; *** p < 0.001; ns, not significant. Abbreviations: GQS, Global Quality Score; SD, standard deviation.

Table 1.
Global Quality Score description.

Score	Overall description
1	Poor quality, poor flow of the site, most information missing, not at all useful for patients
2	Generally poor quality and poor flow, some information listed but many important topics missing, of very limited use to patients
3	Moderate quality, suboptimal flow, some important information is adequately discussed but others poorly discussed, somewhat useful for patients
4	Good quality and generally good flow, most of the relevant information is listed, but some topics not covered, useful for patients
5	Excellent quality and excellent flow, very useful for patients

Table 2.
Baseline clinical characteristics of participants with macular hole.

Variable	Value	%
Number of participants	50	N/A
Age (years)	59.5 ± 9.9	N/A
Sex (male/female)	14/36	28/72
Eye laterality (right/left)	20/30	40/60
Macular hole phenotype
LMH	5	10
FTMH	37	74
MH-RRD	8	16
Gass stage (FTMH only, n = 37)
Stage II	9	24
Stage III	3	8
Stage IV	25	68
Tamponade in reference plan^a
Gas	44	88
Silicone oil	4	8
None	2	4
Ocular comorbidities^b
Cataract	8	16
High myopia	6	12
Epiretinal membrane	5	10
Others	3	6
^a Percentages use the cohort size as denominator. ^b Comorbidities are not mutually exclusive. Abbreviations: LMH, lamellar macular hole; FTMH, full-thickness macular hole; MH-RRD, macular hole with rhegmatogenous retinal detachment; N/A, not applicable.

Table 3.
Macular hole diagnosis and treatment suggestion agreement and Global Quality Score.

Evaluator	Diagnosis agreement, % (95% CI)	p	Treatment agreement, % (95% CI)	p	GQS, Grader 1	GQS, Grader 2
ChatGPT-o3	86.0 (73.3-94.2)	N/A	92.0 (80.8-97.8)	N/A	3.78 ± 0.65	3.78 ± 1.00
Gemini 2.5 Pro	80.0 (66.3-89.9)	0.248	80.0 (66.3-89.9)	0.077	4.02 ± 0.43	3.88 ± 1.00
DeepSeek-R1	82.0 (68.6-91.4)	0.617	86.0 (73.3-94.2)	0.248	3.18 ± 1.26	3.14 ± 1.32
Resident	82.0 (68.6-91.4)	0.803	70.0 (55.4-82.1)	0.006	3.70 ± 0.65	3.50 ± 1.02
CI, confidence interval; GQS, Global Quality Score. Agreement p-values are from the paired chi-square test versus ChatGPT-o3. Both masked graders had ten years of ophthalmology clinical experience.
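
The exact Clopper–Pearson intervals reported for the agreement rates can be checked with a short calculation. The following is a minimal sketch and not the authors' analysis code: it assumes the agreement counts implied by the reported rates with n = 50 (for example, 0.86 corresponds to 43 of 50), which are inferred here rather than stated in the source. The function names `binom_cdf` and `clopper_pearson` are illustrative, and the bounds are found by simple bisection on the binomial distribution using only the Python standard library.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion k/n."""

    def solve(f, target):
        # f(p) decreases monotonically in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: p such that P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2
    )
    # Upper bound: p such that P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(
        lambda p: binom_cdf(k, n, p), alpha / 2
    )
    return lower, upper


# Assumed counts (inferred from the reported rates, n = 50), diagnosis agreement.
assumed_counts = {
    "ChatGPT-o3": 43,
    "Gemini 2.5 Pro": 40,
    "DeepSeek-R1": 41,
    "Resident": 41,
}
for name, k in assumed_counts.items():
    lo, hi = clopper_pearson(k, 50)
    print(f"{name}: {100 * k / 50:.1f}% ({100 * lo:.1f}-{100 * hi:.1f})")
```

The printed intervals can be compared with the diagnosis columns of Table 3. The paired chi-square p-values cannot be reproduced in the same way, because they depend on case-level discordant pairs that are not reported in the figures or tables.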