In a recent study published in the new monthly journal NEJM AI, researchers in the United States evaluated whether a Retrieval-Augmented Generation (RAG)-enabled Generative Pre-trained Transformer 4 (GPT-4) system could improve the accuracy, efficiency, and reliability of screening participants for clinical trials involving patients with symptomatic heart failure.
Study: Retrieval-Augmented Generation–Enabled GPT-4 for Clinical Trial Screening.
Background
Screening potential participants for clinical trials is crucial to ensure eligibility against specific criteria. This traditionally manual process relies on study staff and healthcare professionals, making it time-consuming, resource-intensive, and prone to human error. Natural language processing (NLP) can automate data extraction and analysis from electronic health records (EHRs) to improve accuracy and efficiency, but traditional NLP methods struggle with complex, unstructured EHR data. Large language models (LLMs) such as GPT-4 have shown promise in medical applications, yet further research is needed to refine their implementation within RAG frameworks and to ensure scalability, accuracy, and integration across diverse clinical trial settings.
About the study
In the present study, the RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review (RECTIFIER) system was evaluated in the Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) trial, which compares two remote-care strategies for patients with heart failure. Traditional cohort identification involved querying the EHR and manual chart reviews by non-clinically licensed staff to assess six inclusion and 17 exclusion criteria. RECTIFIER focused on one inclusion criterion and 12 exclusion criteria derived from unstructured data, for which 14 prompts were created.
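Purely as an illustration of this one-criterion-per-prompt pattern (the study's actual prompt wording is not reproduced in this summary), a criterion such as the symptomatic heart failure inclusion question might be framed along these lines, with the question text and note excerpt being hypothetical:

```python
# Hypothetical illustration only: the study's actual prompt text is not reproduced here.
# Pattern: one eligibility criterion -> one yes/no question answered from note excerpts.
def build_screening_prompt(criterion_question: str, note_excerpts: list[str]) -> str:
    excerpts = "\n\n".join(note_excerpts)
    return (
        "You are screening a patient for a heart failure clinical trial.\n"
        f"Relevant clinical note excerpts:\n{excerpts}\n\n"
        f"Question: {criterion_question}\n"
        "Answer strictly 'Yes' or 'No'."
    )

prompt = build_screening_prompt(
    "Does the patient have symptomatic heart failure?",          # inclusion criterion named in the study
    ["Patient reports dyspnea on exertion and orthopnea ..."],   # hypothetical note excerpt
)
print(prompt)
```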
Yes/no values for each criterion were captured during screening using Microsoft Dynamics 365, and an expert clinician provided "gold standard" answers for the 13 target criteria. The data were divided into development, validation, and test phases, starting with 3,000 patients; 282 patients were used for validation and 1,894 were included in the test set.
GPT-4 and GPT-3.5 Turbo were used, with the RAG architecture enabling effective handling of lengthy clinical notes. Notes were split into chunks by a custom Python program using LangChain's recursive chunking strategy, and numerical vector representations (embeddings) of the chunks were generated and indexed for similarity search with Facebook's AI Similarity Search (FAISS) library.
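A minimal sketch of this chunk-and-retrieve step is shown below, assuming LangChain's recursive text splitter and the FAISS library; the embedding function, chunk parameters, and note text are placeholders rather than the study's actual configuration.

```python
# Sketch of the chunk-and-retrieve step, assuming LangChain's recursive text splitter
# and the FAISS library; the embedding function and note text below are placeholders.
import numpy as np
import faiss  # Facebook AI Similarity Search
from langchain.text_splitter import RecursiveCharacterTextSplitter

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder embedding function; the study's embedding model is not specified here."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), 384)).astype("float32")

clinical_note_text = "HPI: Worsening dyspnea on exertion over two weeks ..."  # hypothetical note

# 1. Split long notes into overlapping chunks. Note: chunk_size here counts characters,
#    whereas the study reports token-based chunk sizes (500 vs. 1,000 tokens).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(clinical_note_text)

# 2. Embed the chunks and index them with FAISS for similarity search.
vectors = embed(chunks)
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# 3. For each screening question, retrieve the most relevant chunks to place in the prompt.
query = embed(["Does the patient have symptomatic heart failure?"])
_, top_ids = index.search(query, min(3, len(chunks)))
retrieved_chunks = [chunks[i] for i in top_ids[0]]
```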
Fourteen prompts were used to generate "Yes" or "No" answers. Statistical analysis involved calculating sensitivity, specificity, and accuracy, with the Matthews correlation coefficient (MCC) as the primary evaluation metric. Cost analysis and comparison across demographic groups were also performed.
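For context, these metrics can be computed directly from the binary answers against the gold-standard labels. The sketch below uses hypothetical toy data and is not the study's analysis code; it also shows why the MCC is a useful primary metric when "Yes" and "No" answers are imbalanced.

```python
# Sketch of the evaluation metrics named above, computed from "Yes"/"No" answers against
# the clinician's gold-standard labels; the toy data are hypothetical, not study results.
from math import sqrt

def screening_metrics(predicted: list[str], gold: list[str]) -> dict:
    tp = sum(p == "Yes" and g == "Yes" for p, g in zip(predicted, gold))
    tn = sum(p == "No" and g == "No" for p, g in zip(predicted, gold))
    fp = sum(p == "Yes" and g == "No" for p, g in zip(predicted, gold))
    fn = sum(p == "No" and g == "Yes" for p, g in zip(predicted, gold))

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(gold)

    # Matthews correlation coefficient: a single balanced summary that stays informative
    # even when one answer class is much rarer than the other.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0

    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical usage with toy answers:
print(screening_metrics(["Yes", "No", "Yes", "No"], ["Yes", "No", "No", "No"]))
```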
Study results
In the validation set, note lengths ranged from 8 to 7,097 words, with 75.1% of notes containing 500 words or fewer and 92% containing 1,500 words or fewer. In the test set, the clinical notes of 26% of patients exceeded GPT-4's 128,000-token context window. A chunk size of 1,000 tokens outperformed a chunk size of 500 for 10 of the 13 criteria. Consistency analysis on the validation dataset showed answer consistency ranging from 99.16% to 100%, with the standard deviation of accuracy between 0% and 0.86%, indicating minimal variation across runs.
In the test set, both the COPILOT-HF study staff and RECTIFIER demonstrated high sensitivity and specificity across the 13 target criteria. Sensitivity for individual questions ranged from 66.7% to 100% for the study staff and from 75% to 100% for RECTIFIER; specificity ranged from 82.1% to 100% and from 92.1% to 100%, respectively; and positive predictive value ranged from 50% to 100% and from 75% to 100%, respectively. Both sets of answers aligned closely with the expert clinician's answers, with accuracy between 91.7% and 100% (MCC, 0.644 to 1) for the study staff and between 97.9% and 100% (MCC, 0.837 to 1) for RECTIFIER. RECTIFIER performed better on the inclusion criterion of "symptomatic heart failure," with an accuracy of 97.9% versus 91.7% and an MCC of 0.924 versus 0.721.
Overall, sensitivity and specificity for determining eligibility were 90.1% and 83.6% for the study staff and 92.3% and 93.9% for RECTIFIER. When the inclusion and exclusion questions were combined into two prompts, or when GPT-3.5 was used instead of GPT-4 with the same RAG architecture, sensitivity and specificity decreased. For 35 patients, 15 of whom had been misclassified by RECTIFIER on the symptomatic heart failure criterion, using GPT-4 without RAG improved accuracy only slightly, from 57.1% to 62.9%. No statistically significant differences in performance across race, ethnicity, or gender were found.
The cost per patient with RECTIFIER was 11 cents using the individual-question approach and 2 cents using the combined-question approach. Because much larger inputs were required, using GPT-4 and GPT-3.5 without RAG cost considerably more: $15.88 and $1.59 per patient, respectively.
Conclusions
To summarize, RECTIFIER demonstrated high accuracy in screening patients for clinical trials, outperforming traditional study-staff screening in certain respects while costing only 11 cents per patient; by comparison, traditional screening for a phase 3 trial can cost approximately $34.75 per patient. These findings suggest substantial potential to improve the efficiency of patient recruitment for clinical trials. However, automating the screening process raises concerns, such as missing nuanced patient context and introducing operational risks, so careful implementation is needed to balance benefits and risks.