In a recent study published in the new monthly journal NEJM AI, researchers in the United States evaluated whether a Retrieval-Augmented Generation (RAG)-enabled Generative Pre-trained Transformer 4 (GPT-4) system could improve the accuracy, efficiency, and reliability of screening participants for clinical trials involving patients with symptomatic heart failure.
Study: Retrieval-Augmented Generation–Enabled GPT-4 for Clinical Trial Screening.
Background
Screening potential participants for clinical trials is crucial to ensure eligibility against specific criteria. This traditionally manual process relies on study staff and healthcare professionals, making it time-consuming, resource-intensive, and prone to human error. Natural language processing (NLP) can automate data extraction and analysis from electronic health records (EHRs) to improve accuracy and efficiency, but traditional NLP methods struggle with complex, unstructured EHR data. Large language models (LLMs) such as GPT-4 have shown promise in medical applications, yet further research is needed to refine their implementation within RAG frameworks and to ensure scalability, accuracy, and integration across diverse clinical trial settings.
About the study
In the present study, the RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review (RECTIFIER) system was evaluated in the Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) trial, which compares two remote-care strategies for patients with heart failure. Traditional cohort identification involved querying the EHR and manual chart reviews by non-clinically licensed staff to assess six inclusion and 17 exclusion criteria. RECTIFIER focused on one inclusion criterion and 12 exclusion criteria derived from unstructured data, for which 14 prompts were created.
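Purely as an illustration of this one-criterion-per-prompt pattern (the study's actual prompt wording is not reproduced in this summary), a criterion such as the symptomatic heart failure inclusion question might be framed along these lines, with the question text and note excerpt being hypothetical:

```python
# Hypothetical illustration only: the study's actual prompt text is not reproduced here.
# Pattern: one eligibility criterion -> one yes/no question answered from note excerpts.
def build_screening_prompt(criterion_question: str, note_excerpts: list[str]) -> str:
    excerpts = "\n\n".join(note_excerpts)
    return (
        "You are screening a patient for a heart failure clinical trial.\n"
        f"Relevant clinical note excerpts:\n{excerpts}\n\n"
        f"Question: {criterion_question}\n"
        "Answer strictly 'Yes' or 'No'."
    )

prompt = build_screening_prompt(
    "Does the patient have symptomatic heart failure?",          # inclusion criterion named in the study
    ["Patient reports dyspnea on exertion and orthopnea ..."],   # hypothetical note excerpt
)
print(prompt)
```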
Yes/no values for each criterion were captured during screening using Microsoft Dynamics 365, and an expert clinician provided "gold standard" answers for the 13 target criteria. The data were divided into development, validation, and test phases, starting with 3,000 patients; 282 patients were used for validation and 1,894 were included in the test set.
GPT-4 and GPT-3.5 Turbo were used, with the RAG architecture enabling effective handling of lengthy clinical notes. Notes were split into chunks by a custom Python program using LangChain's recursive chunking strategy, and numerical vector representations (embeddings) of the chunks were generated and indexed for similarity search with Facebook's AI Similarity Search (FAISS) library.
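A minimal sketch of this chunk-and-retrieve step is shown below, assuming LangChain's recursive text splitter and the FAISS library; the embedding function, chunk parameters, and note text are placeholders rather than the study's actual configuration.

```python
# Sketch of the chunk-and-retrieve step, assuming LangChain's recursive text splitter
# and the FAISS library; the embedding function and note text below are placeholders.
import numpy as np
import faiss  # Facebook AI Similarity Search
from langchain.text_splitter import RecursiveCharacterTextSplitter

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder embedding function; the study's embedding model is not specified here."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), 384)).astype("float32")

clinical_note_text = "HPI: Worsening dyspnea on exertion over two weeks ..."  # hypothetical note

# 1. Split long notes into overlapping chunks. Note: chunk_size here counts characters,
#    whereas the study reports token-based chunk sizes (500 vs. 1,000 tokens).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(clinical_note_text)

# 2. Embed the chunks and index them with FAISS for similarity search.
vectors = embed(chunks)
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# 3. For each screening question, retrieve the most relevant chunks to place in the prompt.
query = embed(["Does the patient have symptomatic heart failure?"])
_, top_ids = index.search(query, min(3, len(chunks)))
retrieved_chunks = [chunks[i] for i in top_ids[0]]
```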
Fourteen prompts were used to generate "Yes" or "No" answers. Statistical analysis involved calculating sensitivity, specificity, and accuracy, with the Matthews correlation coefficient (MCC) as the primary evaluation metric. Cost analysis and comparison across demographic groups were also performed.
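For context, these metrics can be computed directly from the binary answers against the gold-standard labels. The sketch below uses hypothetical toy data and is not the study's analysis code; it also shows why the MCC is a useful primary metric when "Yes" and "No" answers are imbalanced.

```python
# Sketch of the evaluation metrics named above, computed from "Yes"/"No" answers against
# the clinician's gold-standard labels; the toy data are hypothetical, not study results.
from math import sqrt

def screening_metrics(predicted: list[str], gold: list[str]) -> dict:
    tp = sum(p == "Yes" and g == "Yes" for p, g in zip(predicted, gold))
    tn = sum(p == "No" and g == "No" for p, g in zip(predicted, gold))
    fp = sum(p == "Yes" and g == "No" for p, g in zip(predicted, gold))
    fn = sum(p == "No" and g == "Yes" for p, g in zip(predicted, gold))

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(gold)

    # Matthews correlation coefficient: a single balanced summary that stays informative
    # even when one answer class is much rarer than the other.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0

    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical usage with toy answers:
print(screening_metrics(["Yes", "No", "Yes", "No"], ["Yes", "No", "No", "No"]))
```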
Study results
In the validation set, note lengths ranged from 8 to 7,097 words, with 75.1% of notes containing 500 words or fewer and 92% containing 1,500 words or fewer. In the test set, the clinical notes of 26% of patients exceeded GPT-4's 128,000-token context window. A chunk size of 1,000 tokens outperformed a chunk size of 500 for 10 of the 13 criteria. Consistency analysis on the validation dataset showed answer consistency ranging from 99.16% to 100%, with the standard deviation of accuracy between 0% and 0.86%, indicating minimal variation across runs.
In the test set, both the COPILOT-HF study staff and RECTIFIER demonstrated high sensitivity and specificity across the 13 target criteria. Sensitivity for individual questions ranged from 66.7% to 100% for the study staff and from 75% to 100% for RECTIFIER; specificity ranged from 82.1% to 100% and from 92.1% to 100%, respectively; and positive predictive value ranged from 50% to 100% and from 75% to 100%, respectively. Both sets of answers aligned closely with the expert clinician's answers, with accuracy between 91.7% and 100% (MCC, 0.644 to 1) for the study staff and between 97.9% and 100% (MCC, 0.837 to 1) for RECTIFIER. RECTIFIER performed better on the inclusion criterion of "symptomatic heart failure," with an accuracy of 97.9% versus 91.7% and an MCC of 0.924 versus 0.721.
Overall, sensitivity and specificity for determining eligibility were 90.1% and 83.6% for the study staff and 92.3% and 93.9% for RECTIFIER. When the inclusion and exclusion questions were combined into two prompts, or when GPT-3.5 was used instead of GPT-4 with the same RAG architecture, sensitivity and specificity decreased. For 35 patients, 15 of whom had been misclassified by RECTIFIER on the symptomatic heart failure criterion, using GPT-4 without RAG improved accuracy only slightly, from 57.1% to 62.9%. No statistically significant differences in performance across race, ethnicity, or gender were found.
The cost per patient with RECTIFIER was 11 cents using the individual-question approach and 2 cents using the combined-question approach. Because much larger inputs were required, using GPT-4 and GPT-3.5 without RAG cost considerably more: $15.88 and $1.59 per patient, respectively.
Conclusions
To summarize, RECTIFIER demonstrated high accuracy in screening patients for clinical trials, outperforming traditional study-staff screening in certain respects while costing only 11 cents per patient; by comparison, traditional screening for a phase 3 trial can cost approximately $34.75 per patient. These findings suggest substantial potential to improve the efficiency of patient recruitment for clinical trials. However, automating the screening process raises concerns, such as missing nuanced patient context and introducing operational risks, so careful implementation is needed to balance benefits and risks.