Open Access
Volume 26, page e53164

Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis

Publication type: Journal Article
Publication date: 2024-05-22
SCImago Q1
Top 10% SCImago
WOS Q1
БС1
SJR: 2.109
CiteScore: 10.4
Impact factor: 8.2
ISSN: 1439-4456, 1438-8871
Abstract
Background

Large language models (LLMs) have raised both interest and concern in the academic community. They offer the potential for automating literature search and synthesis for systematic reviews but raise concerns regarding their reliability, as the tendency to generate unsupported (hallucinated) content persists.

Objective

The aim of the study is to assess the performance of LLMs such as ChatGPT and Bard (subsequently rebranded Gemini) in producing references in the context of scientific writing.

Methods

The performance of ChatGPT and Bard in replicating the results of human-conducted systematic reviews was assessed. Using systematic reviews pertaining to shoulder rotator cuff pathology, these LLMs were tested by providing the same inclusion criteria and comparing their output with the references of the original systematic reviews, which served as the gold standard. The study used 3 key performance metrics (recall, precision, and F1-score) alongside the hallucination rate. Papers were considered “hallucinated” if any 2 of the following pieces of information were wrong: title, first author, or year of publication.
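
The criteria above can be sketched in code. This is a minimal illustration, not the authors' implementation: the `Reference` class, the helper names, and the case-insensitive string comparison are all assumptions, and "any 2" is read here as "at least 2". The sketch also assumes that a candidate real paper has already been identified for each generated reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical container for the three fields named in the Methods."""
    title: str
    first_author: str
    year: int


def _norm(text: str) -> str:
    # Assumption: compare strings ignoring case and extra whitespace.
    return " ".join(text.lower().split())


def wrong_field_count(generated: Reference, real: Reference) -> int:
    """Count how many of title, first author, and year disagree."""
    return sum([
        _norm(generated.title) != _norm(real.title),
        _norm(generated.first_author) != _norm(real.first_author),
        generated.year != real.year,
    ])


def is_hallucinated(generated: Reference, real: Reference) -> bool:
    # Hallucinated if at least 2 of the 3 fields are wrong.
    return wrong_field_count(generated, real) >= 2


def precision_recall_f1(true_positives: int, retrieved: int, relevant: int):
    """Standard definitions; zero denominators yield 0.0 instead of an error."""
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Here `true_positives` is the number of generated references that also appear in the gold-standard list, `retrieved` is the number of references the model produced, and `relevant` is the size of the gold-standard list.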

Results

In total, 11 systematic reviews across 4 fields yielded 33 prompts to LLMs (3 LLMs×11 reviews), with 471 references analyzed. Precision rates for GPT-3.5, GPT-4, and Bard were 9.4% (13/139), 13.4% (16/119), and 0% (0/104) respectively (P<.001). Recall rates were 11.9% (13/109) for GPT-3.5 and 13.7% (15/109) for GPT-4, with Bard failing to retrieve any relevant papers (P<.001). Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard (P<.001). Further analysis of nonhallucinated papers retrieved by GPT models revealed significant differences in identifying various criteria, such as randomized studies, participant criteria, and intervention criteria. The study also noted the geographical and open-access biases in the papers retrieved by the LLMs.
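
The percentages above can be recomputed from the counts given in parentheses. The short sketch below does only that arithmetic; the table layout and function names are illustrative assumptions, and Bard's recall is left out because the text gives no fraction for it. The recomputed values agree with the printed percentages to within 0.1 percentage point, so small differences in the last digit are a matter of rounding.

```python
# (model, metric, numerator, denominator, percentage as printed in the Results)
REPORTED = [
    ("GPT-3.5", "precision", 13, 139, 9.4),
    ("GPT-4", "precision", 16, 119, 13.4),
    ("Bard", "precision", 0, 104, 0.0),
    ("GPT-3.5", "recall", 13, 109, 11.9),
    ("GPT-4", "recall", 15, 109, 13.7),
    ("GPT-3.5", "hallucination", 55, 139, 39.6),
    ("GPT-4", "hallucination", 34, 119, 28.6),
    ("Bard", "hallucination", 95, 104, 91.4),
]


def percent(numerator: int, denominator: int) -> float:
    """Convert a count ratio to a percentage."""
    return 100.0 * numerator / denominator


def max_deviation(rows) -> float:
    """Largest gap, in percentage points, between recomputed and printed values."""
    return max(abs(percent(n, d) - printed) for _, _, n, d, printed in rows)


if __name__ == "__main__":
    for model, metric, n, d, printed in REPORTED:
        print(f"{model:8s} {metric:14s} {n}/{d} = {percent(n, d):.2f}% (printed: {printed}%)")
    print(f"largest deviation: {max_deviation(REPORTED):.3f} percentage points")
```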

Conclusions

Given their current performance, LLMs should not be deployed as the primary or exclusive tool for conducting systematic reviews. Any references generated by such models warrant thorough validation by researchers. The high occurrence of hallucinations in LLMs highlights the need to refine their training and functionality before they can be used with confidence for rigorous academic purposes.


Top 30

Journals

medRxiv: 7 publications, 2.73%
JMIR Formative Research: 4 publications, 1.56%
Frontiers in Education: 4 publications, 1.56%
Cureus: 3 publications, 1.17%
Endocrine: 3 publications, 1.17%
bioRxiv: 3 publications, 1.17%
Journal of Medical Systems: 3 publications, 1.17%
Journal of Clinical Medicine: 3 publications, 1.17%
MedEdPublish: 3 publications, 1.17%
Digital Health: 3 publications, 1.17%
Scientific Reports: 3 publications, 1.17%
npj Digital Medicine: 3 publications, 1.17%
JMIR Medical Informatics: 3 publications, 1.17%
AI: 2 publications, 0.78%
ACM Transactions on Software Engineering and Methodology: 2 publications, 0.78%
International Journal of Human-Computer Interaction: 2 publications, 0.78%
Journal of Educational Research: 2 publications, 0.78%
Applied Sciences (Switzerland): 2 publications, 0.78%
Diagnostics: 2 publications, 0.78%
Assessment and Evaluation in Higher Education: 2 publications, 0.78%
Health Information Science and Systems: 2 publications, 0.78%
Lecture Notes in Computer Science: 2 publications, 0.78%
International Journal of Impotence Research: 2 publications, 0.78%
Computers and Education Artificial Intelligence: 2 publications, 0.78%
PLoS ONE: 2 publications, 0.78%
Discover Public Health: 2 publications, 0.78%
BMC Medical Education: 2 publications, 0.78%
Knee Surgery, Sports Traumatology, Arthroscopy: 2 publications, 0.78%
Société Internationale d’Urologie Journal: 2 publications, 0.78%

Publishers

Springer Nature: 48 publications, 18.75%
Elsevier: 42 publications, 16.41%
MDPI: 28 publications, 10.94%
Taylor & Francis: 21 publications, 8.2%
JMIR Publications: 15 publications, 5.86%
Wiley: 14 publications, 5.47%
SAGE: 12 publications, 4.69%
openRxiv: 10 publications, 3.91%
Association for Computing Machinery (ACM): 10 publications, 3.91%
Frontiers Media S.A.: 10 publications, 3.91%
Institute of Electrical and Electronics Engineers (IEEE): 9 publications, 3.52%
Ovid Technologies (Wolters Kluwer Health): 9 publications, 3.52%
Oxford University Press: 3 publications, 1.17%
F1000 Research: 3 publications, 1.17%
BMJ: 2 publications, 0.78%
Georg Thieme Verlag KG: 2 publications, 0.78%
Public Library of Science (PLoS): 2 publications, 0.78%
De Gruyter Brill: 2 publications, 0.78%
Emerald: 1 publication, 0.39%
American Chemical Society (ACS): 1 publication, 0.39%
S. Karger AG: 1 publication, 0.39%
Ediciones Universidad de Salamanca: 1 publication, 0.39%
Bentham Science Publishers Ltd.: 1 publication, 0.39%
Prague University of Economics and Business: 1 publication, 0.39%
American Society of Civil Engineers (ASCE): 1 publication, 0.39%
John Benjamins Publishing Company: 1 publication, 0.39%
Centre for Evaluation in Education and Science (CEON/CEES): 1 publication, 0.39%
Human Kinetics: 1 publication, 0.39%
Association for Vascular Access: 1 publication, 0.39%
Izmir Akademi Dernegi: 1 publication, 0.39%
  • Publications without a DOI are not counted.
  • Publication statistics are updated weekly.

Metrics
256
GOST
Chelli M. et al. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis // Journal of Medical Internet Research. 2024. Vol. 26. p. e53164.
GOST with all authors (up to 50)
Chelli M., Descamps J., Lavoué V., Trojani C., Azar M., Deckert M., Raynier J., Clowez G., Boileau P., Ruetsch Chelli C. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis // Journal of Medical Internet Research. 2024. Vol. 26. p. e53164.
RIS
TY - JOUR
DO - 10.2196/53164
UR - https://www.jmir.org/2024/1/e53164
TI - Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis
T2 - Journal of Medical Internet Research
AU - Chelli, Mikaël
AU - Descamps, Jules
AU - Lavoué, Vincent
AU - Trojani, Christophe
AU - Azar, Michel
AU - Deckert, Marcel
AU - Raynier, Jean-Luc
AU - Clowez, Gilles
AU - Boileau, Pascal
AU - Ruetsch Chelli, Caroline
PY - 2024
DA - 2024/05/22
PB - JMIR Publications
SP - e53164
VL - 26
PMID - 38776130
SN - 1439-4456
SN - 1438-8871
ER -
BibTeX (up to 50 authors)
@article{2024_Chelli,
author = {Mikaël Chelli and Jules Descamps and Vincent Lavoué and Christophe Trojani and Michel Azar and Marcel Deckert and Jean-Luc Raynier and Gilles Clowez and Pascal Boileau and Caroline Ruetsch Chelli},
title = {Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis},
journal = {Journal of Medical Internet Research},
year = {2024},
volume = {26},
publisher = {JMIR Publications},
month = {may},
url = {https://www.jmir.org/2024/1/e53164},
pages = {e53164},
doi = {10.2196/53164}
}