Assessing the methodologic quality of systematic reviews using generative large language models

Authors

  • Bowen Yao University of Minnesota, Department of Urology https://orcid.org/0000-0002-6165-0379
  • Onuralp Ergun University of Minnesota, Department of Urology
  • Maylynn Ding McMaster University, Department of Urology
  • Carly D. Miller University of Minnesota, Department of Urology
  • Vikram M. Narayan Emory University, Department of Urology
  • Philipp Dahm Minneapolis VAMC

DOI:

https://doi.org/10.5489/cuaj.9243

Keywords:

LLM, artificial intelligence, ChatGPT, methodology, systematic review, AMSTAR 2

Abstract

INTRODUCTION: We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).

METHODS: A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each of the SRs in duplicate, with differences adjudicated by a third expert. We created a customized generative artificial intelligence (generative pretrained transformer [GPT]), “Urology AMSTAR 2 Quality Assessor,” which graded the 114 SRs in three iterations using a zero-shot method. We then performed an enhanced trial focusing on critical criteria, giving GPT detailed, step-by-step instructions for each of the SRs using a chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial were calculated against the human results. Internal validity among the three trials was computed.
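The per-criterion comparison described above reduces to standard binary classification metrics, treating the adjudicated human rating as the reference. A minimal sketch (the judgment values below are hypothetical, for illustration only; the study's actual data are not reproduced here):

```python
# Hypothetical per-criterion judgments (True = AMSTAR 2 criterion met),
# with the human adjudicated rating as the reference standard.
human = [True, True, False, True, False, False, True, True]
gpt   = [True, False, False, True, True, False, True, True]

# Confusion-matrix counts against the human reference
tp = sum(h and g for h, g in zip(human, gpt))          # both say "met"
tn = sum(not h and not g for h, g in zip(human, gpt))  # both say "not met"
fp = sum(not h and g for h, g in zip(human, gpt))      # GPT over-credits
fn = sum(h and not g for h, g in zip(human, gpt))      # GPT misses

accuracy = (tp + tn) / len(human)      # congruence with human raters
sensitivity = tp / (tp + fn)           # recall of "criterion met"
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
```

In this toy example, accuracy is 0.75 and F1 is 0.8; the study reports these metrics per criterion and averaged across the 114 SRs.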

RESULTS: GPT had an overall congruence of 75% with the human results, with 77% in critical criteria and 73% in non-critical criteria. The average F1 score was 0.66. Internal validity across the three iterations was high at 85%. GPT accurately assigned 89% of studies to the correct overall quality category. When given specific, step-by-step instructions, congruence on critical criteria improved to 91%, and overall quality assessment accuracy to 93%.

CONCLUSIONS: GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.

Published

2025-08-28

How to Cite

Yao, B., Ergun, O., Ding, M., Miller, C. D., Narayan, V. M., & Dahm, P. (2025). Assessing the methodologic quality of systematic reviews using generative large language models. Canadian Urological Association Journal, 19(12), E427–33. https://doi.org/10.5489/cuaj.9243

Section

Original Research