TY - JOUR
T1 - Transforming literature screening
T2 - The emerging role of large language models in systematic reviews
AU - Delgado-Chaves, Fernando M
AU - Jennings, Matthew J
AU - Atalaia, Antonio
AU - Wolff, Justus
AU - Horvath, Rita
AU - Mamdouh, Zeinab M
AU - Baumbach, Jan
AU - Baumbach, Linda
PY - 2025/1/6
Y1 - 2025/1/6
N2 - Systematic reviews (SRs) synthesize evidence-based medical literature, but they involve labor-intensive manual article screening. Large language models (LLMs) can select relevant literature, but their quality and efficacy compared to humans remain to be determined. We evaluated the overlap between the title- and abstract-based article selections of 18 different LLMs and human-selected articles for three SRs. In the three SRs, 185/4,662, 122/1,741, and 45/66 articles were selected and considered for full-text screening by two independent reviewers. Due to technical variations and the inability of the LLMs to classify all records, the LLMs' considered sample sizes were smaller. However, on average, the 18 LLMs correctly classified 4,294 (min 4,130; max 4,329), 1,539 (min 1,449; max 1,574), and 27 (min 22; max 37) of the titles and abstracts as either included or excluded for the three SRs, respectively. Additional analysis revealed that the definitions of the inclusion criteria and the conceptual designs significantly influenced LLM performance. In conclusion, LLMs can reduce one reviewer's workload by between 33% and 93% during title and abstract screening. However, the exact formulation of the inclusion and exclusion criteria should be refined beforehand for ideal support of the LLMs.
AB - Systematic reviews (SRs) synthesize evidence-based medical literature, but they involve labor-intensive manual article screening. Large language models (LLMs) can select relevant literature, but their quality and efficacy compared to humans remain to be determined. We evaluated the overlap between the title- and abstract-based article selections of 18 different LLMs and human-selected articles for three SRs. In the three SRs, 185/4,662, 122/1,741, and 45/66 articles were selected and considered for full-text screening by two independent reviewers. Due to technical variations and the inability of the LLMs to classify all records, the LLMs' considered sample sizes were smaller. However, on average, the 18 LLMs correctly classified 4,294 (min 4,130; max 4,329), 1,539 (min 1,449; max 1,574), and 27 (min 22; max 37) of the titles and abstracts as either included or excluded for the three SRs, respectively. Additional analysis revealed that the definitions of the inclusion criteria and the conceptual designs significantly influenced LLM performance. In conclusion, LLMs can reduce one reviewer's workload by between 33% and 93% during title and abstract screening. However, the exact formulation of the inclusion and exclusion criteria should be refined beforehand for ideal support of the LLMs.
KW - large language models
KW - literature screening
KW - systematic reviews
KW - Humans
KW - Language
KW - Evidence-Based Medicine
KW - Review Literature as Topic
U2 - 10.1073/pnas.2411962122
DO - 10.1073/pnas.2411962122
M3 - Article
SN - 0027-8424
VL - 122
JO - Proceedings of the National Academy of Sciences of the United States of America
JF - Proceedings of the National Academy of Sciences of the United States of America
IS - 2
M1 - e2411962122
ER -