Assessing the methodologic quality of systematic reviews using generative large language models

Authors

  • Bowen Yao University of Minnesota, Department of Urology https://orcid.org/0000-0002-6165-0379
  • Onuralp Ergun University of Minnesota, Department of Urology
  • Maylynn Ding McMaster University, Department of Urology
  • Carly D. Miller University of Minnesota, Department of Urology
  • Vikram M. Narayan Emory University, Department of Urology
  • Philipp Dahm Minneapolis VAMC

DOI:

https://doi.org/10.5489/cuaj.9243

Keywords:

LLM, artificial intelligence, ChatGPT, methodology, systematic review, AMSTAR 2

Abstract

INTRODUCTION: We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).

METHODS: A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each of the SRs in duplicate, with differences adjudicated by a third expert. We created a customized generative artificial intelligence (generative pretrained transformer [GPT]), “Urology AMSTAR 2 Quality Assessor,” which graded the 114 SRs in three iterations using a zero-shot method. We then performed an enhanced trial focusing on critical criteria, giving GPT detailed, step-by-step instructions for each of the SRs using a chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial were calculated against the human results. Internal validity among the three trials was computed.
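The per-criterion comparison described above reduces to standard binary classification metrics, treating the adjudicated human rating as the reference. A minimal sketch (the judgment values below are hypothetical, for illustration only; the study's actual data are not reproduced here):

```python
# Hypothetical per-criterion judgments (True = AMSTAR 2 criterion met),
# with the human adjudicated rating as the reference standard.
human = [True, True, False, True, False, False, True, True]
gpt   = [True, False, False, True, True, False, True, True]

# Confusion-matrix counts against the human reference
tp = sum(h and g for h, g in zip(human, gpt))          # both say "met"
tn = sum(not h and not g for h, g in zip(human, gpt))  # both say "not met"
fp = sum(not h and g for h, g in zip(human, gpt))      # GPT over-credits
fn = sum(h and not g for h, g in zip(human, gpt))      # GPT misses

accuracy = (tp + tn) / len(human)      # congruence with human raters
sensitivity = tp / (tp + fn)           # recall of "criterion met"
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
```

In this toy example, accuracy is 0.75 and F1 is 0.8; the study reports these metrics per criterion and averaged across the 114 SRs.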

RESULTS: GPT had an overall congruence of 75% with the human results, with 77% in critical criteria and 73% in non-critical criteria. The average F1 score was 0.66. Internal validity across the three iterations was high at 85%. GPT accurately assigned 89% of studies to the correct overall quality category. When given specific, step-by-step instructions, congruence on critical criteria improved to 91%, and overall quality assessment accuracy to 93%.

CONCLUSIONS: GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.

Published

2025-08-28

How to Cite

Yao, B., Ergun, O., Ding, M., Miller, C. D., Narayan, V. M., & Dahm, P. (2025). Assessing the methodologic quality of systematic reviews using generative large language models. Canadian Urological Association Journal, 19(12), E427–33. https://doi.org/10.5489/cuaj.9243

Section

Original Research