Researchers evaluate performance of a large language model in phenotyping postpartum hemorrhage patients

In a recent study published in npj Digital Medicine, researchers evaluated the performance of a large language model (LLM) in phenotyping postpartum hemorrhage (PPH) patients using discharge notes.

Study: Zero-shot interpretable phenotyping of postpartum hemorrhage using large language models. Image Credit: christinarosepix/Shutterstock.com

Background

Robust phenotyping is critical to research and clinical workflows, including diagnosis, clinical trial screening, novel phenotype discovery, quality improvement, comparative effectiveness research, and phenome- and genome-wide association studies. Adopting electronic health records (EHRs) has allowed for the development of digital phenotyping approaches.

Many digital phenotyping approaches leverage diagnosis codes or rules based on structured data. However, structured data often fails to capture the clinical narrative from EHR notes. Natural language processing (NLP) models have been increasingly used for multimodal phenotyping through automated extraction from unstructured notes.

Most NLP approaches are rule-based and rely on regular expressions, keywords, and other NLP tools. Recent advances in training LLMs allow the development of generalizable phenotypes without the need for annotated data. The zero-shot capabilities of LLMs present an opportunity to phenotype complex conditions using clinical notes.

The study and findings

In the present study, researchers developed an interpretable approach for phenotyping and subtyping of PPH cases by using the Flan-T5 LLM. They identified over 138,000 individuals with an obstetric encounter at the Mass General Brigham hospitals in Boston between 1998 and 2015. Discharge summaries were used for NLP-based phenotyping.

The team developed 24 PPH-related concepts and identified them in discharge notes by prompting the Flan-T5 model for two types of tasks – binary classification and text extraction. Identification of estimated blood loss was the text extraction task, whereas identification of other PPH-related concepts was a binary classification task. Fifty annotated notes were used to develop LLM prompts.
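The two prompting tasks can be illustrated with a minimal Python sketch. The prompt wording, the function names, and the `generate` callable (any function that sends a prompt to a model and returns its text answer) are assumptions made for illustration; the article does not reproduce the study's actual prompts.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt templates, not the prompts used in the study.
BINARY_TEMPLATE = (
    "Answer the question using the discharge summary.\n"
    "Discharge summary: {note}\n"
    "Question: Does the note mention {concept}? Answer yes or no."
)
EXTRACT_TEMPLATE = (
    "Answer the question using the discharge summary.\n"
    "Discharge summary: {note}\n"
    "Question: What was the estimated blood loss?"
)


def classify_concept(note: str, concept: str, generate: Callable[[str], str]) -> bool:
    """Binary classification task: True if the model answers 'yes'."""
    answer = generate(BINARY_TEMPLATE.format(note=note, concept=concept))
    return answer.strip().lower().startswith("yes")


def extract_blood_loss_ml(note: str, generate: Callable[[str], str]) -> Optional[int]:
    """Text extraction task: first number in the model's answer.

    Assumes the model answers in millilitres; returns None if no number is found.
    """
    answer = generate(EXTRACT_TEMPLATE.format(note=note))
    match = re.search(r"(\d[\d,]*)", answer)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

In practice, `generate` could wrap a text-to-text model such as a Flan-T5 checkpoint; keeping it as a plain callable here means the prompt and parsing logic can be checked without loading a model.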

The model was evaluated on 1,175 manually annotated discharge notes, and the Flan-T5 NLP models were compared to regular expressions for each concept. The binary F1 score of the Flan-T5 model was ≥ 0.75 on 21 PPH concepts and > 0.9 on 12 concepts. The Flan-T5 model outperformed regular expressions for nine concepts.
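For readers unfamiliar with the metric, the binary F1 score is the harmonic mean of precision and recall for the positive class. The sketch below computes it from paired labels; the function name and the toy labels in the usage note are illustrative and are not data from the study.

```python
from typing import Sequence, Tuple


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, with annotations `[1, 1, 1, 0, 0]` and predictions `[1, 1, 0, 1, 0]` there are two true positives, one false positive, and one false negative, so precision, recall, and F1 are all 2/3.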

Although regular expressions performed similarly to Flan-T5 on simpler tasks, the Flan-T5 model outperformed them on concepts expressed in clinical notes in different formats. False positives of the Flan-T5 model were primarily in notes with polysemy and semantically related concepts. For instance, notes containing dilation and curettage postpartum were often predicted as positive for manual placenta removal.
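To show what a regular-expression baseline of this kind might look like, here is a hedged sketch for estimated blood loss. The pattern and the function name are hypothetical and are not the expressions used in the study; they cover a few common formats and deliberately miss free-text phrasings.

```python
import re
from typing import Optional

# Hypothetical pattern: a keyword, a short gap, a number, and a volume unit.
EBL_PATTERN = re.compile(
    r"\b(?:estimated\s+blood\s+loss|ebl)\D{0,15}?(\d[\d,]*(?:\.\d+)?)\s*(ml|cc|l|liters?)\b",
    re.IGNORECASE,
)


def regex_blood_loss_ml(note: str) -> Optional[int]:
    """Return the first matched blood loss in millilitres, or None if no match."""
    match = EBL_PATTERN.search(note)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower()
    if unit.startswith("l"):  # litres to millilitres; "ml" and "cc" are already mL
        value *= 1000
    return int(round(value))
```

The pattern handles inputs such as "EBL 800 cc" or "Estimated blood loss: 1.2 L", but a sentence like "Pt lost approx one liter of blood" returns no match, which illustrates the kind of format variation where a rule-based approach can fall short.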

False negatives were due to concepts with misspellings and unusual abbreviations. While notes from a single hospital were used to develop prompts, Flan-T5 generalized well to notes from other hospitals. Additionally, when a sample of notes from 2015 to 2022 was evaluated, the binary F1 score of Flan-T5 was ≥ 0.75 on 14 concepts.

The model showed comparable results for most concepts in both settings. Next, the team used extracted concepts to identify PPH deliveries. Flan-T5 extracted delivery type and estimated blood loss from all notes. A note was classified as describing PPH if the estimated blood loss exceeded 500 mL for a vaginal delivery or 1000 mL for a cesarean delivery.
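The classification rule described above is simple enough to write down directly. The sketch below is one possible encoding; the function name, the delivery-type labels, and the choice to return `None` when a value is missing are assumptions, since the article does not say how missing extractions were handled.

```python
from typing import Optional

# Blood-loss thresholds in mL, as reported in the article.
THRESHOLDS_ML = {"vaginal": 500, "cesarean": 1000}


def is_pph(delivery_type: str, blood_loss_ml: Optional[float]) -> Optional[bool]:
    """Apply the threshold rule; None means the note cannot be classified."""
    if blood_loss_ml is None:
        return None
    threshold = THRESHOLDS_ML.get(delivery_type.lower())
    if threshold is None:  # unrecognized delivery type
        return None
    return blood_loss_ml > threshold
```

Under this rule, a vaginal delivery with 600 mL of blood loss is flagged as PPH, while a cesarean delivery with 800 mL is not.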

The PPH phenotyping algorithm was evaluated on 300 expert-annotated discharge summaries that the model had predicted as deliveries with PPH. The positive predictive value of the algorithm was 0.95. The NLP-based approach also identified PPH cases without delivery-related diagnosis codes. Specifically, more than 47% of discharge summaries with PPH would not have been identified if diagnosis codes were used alone.

Finally, PPH concepts were extracted to classify PPH into subtypes. To this end, composite phenotypes were constructed for each subtype based on the presence of NLP-extracted PPH terms. The researchers found that approximately 30% of predicted PPH deliveries were due to uterine atony, 24% due to trauma, 27% due to retained products of conception, and 6% due to coagulation abnormalities.
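A composite phenotype of this kind can be expressed as a mapping from each subtype to the extracted concepts that indicate it. In the sketch below, the four subtype names come from the article, but the concept identifiers and their grouping are hypothetical placeholders rather than the study's definitions.

```python
from typing import Dict, List, Set

# Hypothetical concept groupings; the study's actual definitions are not listed here.
SUBTYPE_CONCEPTS: Dict[str, Set[str]] = {
    "uterine atony": {"uterine_atony"},
    "trauma": {"laceration", "uterine_rupture"},
    "retained products of conception": {"retained_products", "manual_placenta_removal"},
    "coagulation abnormalities": {"coagulopathy"},
}


def assign_subtypes(present_concepts: Set[str]) -> List[str]:
    """Return every subtype with at least one indicating concept present in the note."""
    return [
        subtype
        for subtype, concepts in SUBTYPE_CONCEPTS.items()
        if concepts & present_concepts
    ]
```

Because each subtype is defined by named concepts, a reviewer can trace any assignment back to the specific terms found in the note, and a definition can be updated by editing the mapping without re-running the language model.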

Conclusions

Taken together, the study developed 24 PPH-related concepts and observed that the Flan-T5 model could extract most concepts, demonstrating high recall and precision. Moreover, the phenotyping algorithm identified significantly more PPH deliveries than would be identified using diagnosis codes alone.

Furthermore, these concepts can be used for interpretable and precise identification of PPH subtypes. The findings highlight how complex LLMs can be exploited to construct downstream interpretable models. This extract-then-phenotype approach allows easy validation of concepts and rapid phenotype definition updates.

Notably, recurrent or delayed PPH cases might have been missed as emphasis was placed on discharge summaries. Moreover, discharge notes may reflect institution-specific practices, and although the model was assessed for temporal generalizability, further validation is required across medical conditions.

Journal reference:
Zero-shot interpretable phenotyping of postpartum hemorrhage using large language models. npj Digital Medicine.

Written by

Tarun Sai Lomte

Tarun is a writer based in Hyderabad, India. He has a Master’s degree in Biotechnology from the University of Hyderabad and is enthusiastic about scientific research. He enjoys reading research papers and literature reviews and is passionate about writing.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Sai Lomte, Tarun. (2023, December 04). Researchers evaluate performance of a large language model in phenotyping postpartum hemorrhage patients. News-Medical. Retrieved on November 22, 2024 from https://www.news-medical.net/news/20231204/Researchers-evaluate-performance-of-a-large-language-model-in-phenotyping-postpartum-hemorrhage-patients.aspx.

  • MLA

    Sai Lomte, Tarun. "Researchers evaluate performance of a large language model in phenotyping postpartum hemorrhage patients". News-Medical. 22 November 2024. <https://www.news-medical.net/news/20231204/Researchers-evaluate-performance-of-a-large-language-model-in-phenotyping-postpartum-hemorrhage-patients.aspx>.

  • Chicago

    Sai Lomte, Tarun. "Researchers evaluate performance of a large language model in phenotyping postpartum hemorrhage patients". News-Medical. https://www.news-medical.net/news/20231204/Researchers-evaluate-performance-of-a-large-language-model-in-phenotyping-postpartum-hemorrhage-patients.aspx. (accessed November 22, 2024).

  • Harvard

    Sai Lomte, Tarun. 2023. Researchers evaluate performance of a large language model in phenotyping postpartum hemorrhage patients. News-Medical, viewed 22 November 2024, https://www.news-medical.net/news/20231204/Researchers-evaluate-performance-of-a-large-language-model-in-phenotyping-postpartum-hemorrhage-patients.aspx.
