European Journal of Ophthalmology

The performance of ChatGPT-4 and Bing Chat in frequently asked questions about glaucoma

Levent Dogan ¹

İbrahim Edhem Yılmaz ¹

¹ Department of Ophthalmology, Kilis State Hospital, Kilis, Turkey

Publication type: Journal Article

Publication date: 2025-02-19

SAGE

Journal: European Journal of Ophthalmology

scimago Q2

SJR: 0.686

CiteScore: 3.6

Impact factor: 1.4

ISSN: 11206721, 17246016

DOI: 10.1177/11206721251321197

Abstract

Purpose

To evaluate the appropriateness and readability of the responses generated by ChatGPT-4 and Bing Chat to frequently asked questions about glaucoma.

Method

Thirty-four questions were generated for this study. Each question was submitted three times to fresh ChatGPT-4 and Bing Chat interfaces. Two glaucoma specialists categorised the responses by appropriateness. Accuracy of the responses was evaluated using the Structure of the Observed Learning Outcome (SOLO) taxonomy. Readability of the responses was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CLI), Simple Measure of Gobbledygook (SMOG), and Gunning Fog Index (GFI).
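
As a rough illustration of how two of the readability indices named above are computed, the sketch below applies the standard published formulas for FRE and FKGL to word, sentence, and syllable counts. The abstract does not say which tool the authors used, so the function names, the sentence splitter, and the vowel-group syllable counter are all assumptions made for illustration; dedicated readability tools count syllables more carefully and may give slightly different scores.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard FRE formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard FKGL formula; approximates a U.S. school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Heuristic (an assumption, not the authors' method):
    count runs of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) using naive splitting rules."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


if __name__ == "__main__":
    sample = "Glaucoma damages the optic nerve. It can cause vision loss."
    w, s, syl = text_counts(sample)
    print(f"words={w}, sentences={s}, syllables={syl}")
    print(f"FRE  = {flesch_reading_ease(w, s, syl):.1f}")
    print(f"FKGL = {flesch_kincaid_grade(w, s, syl):.1f}")
```

Both formulas depend only on average sentence length and average syllables per word, so longer sentences and longer words lower the FRE score and raise the FKGL grade.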

Results

The percentage of appropriate responses was 88.2% (30/34) for ChatGPT-4 and 79.4% (27/34) for Bing Chat. Both the ChatGPT-4 and Bing Chat interfaces provided at least one inappropriate response to 1 of the 34 questions. The SOLO scores for ChatGPT-4 and Bing Chat were 3.86 ± 0.41 and 3.70 ± 0.52, respectively. No statistically significant difference in performance was observed between the two LLMs (p = 0.101). The mean word count of the generated responses was 316.5 (± 85.1) for ChatGPT-4 and 61.6 (± 25.8) for Bing Chat (p < 0.05). According to FRE scores, the generated responses were suitable for only 4.5% and 33% of U.S. adults for ChatGPT-4 and Bing Chat, respectively (p < 0.05).

Conclusions

ChatGPT-4 and Bing Chat consistently provided appropriate responses to the questions. Both LLMs produced responses with low readability scores, and the ChatGPT-4 responses were harder to read than those of Bing Chat.

ISSN

11206721 (Print)

17246016 (Electronic)