In a recent study published in npj Digital Medicine, a group of researchers assessed the tendency of four commercial Large Language Models (LLMs) to perpetuate race-based medical misconceptions in healthcare through systematic scenario analysis.
Study: Large language models propagate race-based medicine.
Background
Recent research highlights the efficacy of LLMs in fields like cardiology, anesthesiology, and oncology, where they offer human-like responses to medical inquiries. Despite this demonstrated utility, concerns persist because the models' training data are not transparent and because instances of racial and gender bias have been documented.
These biases are particularly troubling in medicine, where flawed, historical race-based assumptions persist. Investigations have shown that medical trainees hold misconceptions about physiological differences between races, and that these misconceptions affect patient care.
Therefore, more research is crucial to ensure that LLMs, increasingly marketed for medical applications, do not reinforce these biases and inaccuracies, perpetuating systemic prejudices in healthcare.
About the study
In the present study, four physicians formulated questions based on debunked race-based medical practices and a prior study identifying racial misconceptions among medical trainees. They posed nine questions to each LLM, repeating every question five times to account for model variability, which yielded 45 responses per model.
The LLMs analyzed were Google's Bard, OpenAI's ChatGPT and GPT-4, and Anthropic's Claude, with two versions of each tested between May and August 2023. Each model's session was reset after every question so that earlier exchanges could not influence later answers, keeping the focus on the models' inherent response tendencies.
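The querying protocol described above (nine questions, five repetitions each, a fresh session per run) can be sketched roughly as follows. This is only an illustration, not the study's code: `collect_responses`, `ask_in_fresh_session`, and `fake_model` are hypothetical names, and the placeholder questions stand in for the real ones.

```python
from typing import Callable, Dict, List


def collect_responses(
    questions: List[str],
    ask_in_fresh_session: Callable[[str], str],
    runs_per_question: int = 5,
) -> Dict[str, List[str]]:
    """Ask every question several times, one fresh session per call."""
    responses: Dict[str, List[str]] = {}
    for question in questions:
        # Assumption: each call opens a new session, so no run sees earlier answers.
        responses[question] = [
            ask_in_fresh_session(question) for _ in range(runs_per_question)
        ]
    return responses


# Hypothetical stand-in for a real model client; it only echoes the question.
def fake_model(question: str) -> str:
    return f"answer to: {question}"


questions = [f"question {i}" for i in range(1, 10)]  # nine placeholder questions
responses = collect_responses(questions, fake_model)
total_responses = sum(len(runs) for runs in responses.values())
print(total_responses)  # 9 questions x 5 runs = 45
```

In practice, `fake_model` would be replaced by a function that starts a new conversation with the model under test for every call.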
Two physicians reviewed each model's responses to determine whether they contained any refuted race-based content. Disagreements were settled through a consensus process, with a third physician making the deciding judgment.
This rigorous methodology underscored the commitment to accurately assess the potential propagation of harmful racial misconceptions by these advanced linguistic models in a medical context.
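The review step, with two physician raters and a third as tiebreaker, might be expressed as the small helper below. It is a simplified sketch under stated assumptions: the function name `adjudicate` and the boolean labels are hypothetical, and the consensus discussion is reduced to a single tiebreak vote.

```python
from typing import Optional


def adjudicate(
    reviewer_a: bool,
    reviewer_b: bool,
    tiebreaker: Optional[bool] = None,
) -> bool:
    """Return True if a response is judged to contain refuted race-based content."""
    if reviewer_a == reviewer_b:
        # Both primary reviewers agree, so no third opinion is needed.
        return reviewer_a
    if tiebreaker is None:
        raise ValueError("Reviewers disagree; a third judgment is required.")
    # Assumption: the third reviewer's label is final when the first two differ.
    return tiebreaker


agreed = adjudicate(True, True)
resolved = adjudicate(True, False, tiebreaker=False)
```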
Study results
The present study's findings demonstrate that all examined LLMs had instances where they endorsed race-based medicine or echoed unfounded claims about race, though not consistently across every iteration of the same question.
Notably, almost all models correctly identified race as a social construct without a genetic basis. However, there were instances, like with Claude, where a model later contradicted this accurate information, referring to a biological basis for race.
A significant area of concern was the models' performance on questions about kidney function and lung capacity, topics with a notorious history of race-based medicine that has been scientifically discredited. When queried about estimated Glomerular Filtration Rate (eGFR) calculation, models like ChatGPT-3.5 and GPT-4 not only endorsed the use of race in these calculations but also supported the practice with debunked claims about racial differences in muscle mass and creatinine levels.
Bard was sensitive to question phrasing, responding to some terminology but not to other wording. Questions about calculating lung capacity for Black individuals likewise produced incorrect race-based responses, whereas generic questions without racial identifiers did not.
The research extended to queries about myths previously believed by medical trainees, revealing that all models perpetuated the false notion of racial differences in skin thickness.
Responses to questions about pain thresholds were mixed, with some models, like GPT-4, correctly denying any difference, while others, like Claude, propagated baseless race-based assertions. However, all models responded accurately to questions about racial disparities in brain size, often identifying the notion as harmful and racist.
Given the push for LLM integration into medicine and existing partnerships between electronic health record vendors and LLM developers, the potential for these models to amplify biases and structural inequities is alarming.
While LLMs have shown promise in medical applications, their pitfalls, particularly in perpetuating race-based medicine, remain underexplored.
This study revealed that all four major commercial LLMs occasionally promoted race-based medicine. These models, trained unsupervised on extensive internet and textbook data, likely absorb outdated, biased, or incorrect information, given their inability to assess research quality.
Though some models undergo a reinforcement learning phase with human feedback, which might correct certain outputs, the overall non-transparent training process leaves questions about their successes and failures unanswered.
Particularly troubling is the models' reliance on debunked race-based equations for lung and kidney function, which are known to affect Black patients adversely. The study also observed the models fabricating medical data, which poses a risk because users might not always verify the accuracy of the information.
The inconsistent nature of problematic responses, seen only in a subset of queries, underscores the models' randomness and the inadequacy of single-run evaluations.
While the study's scope was limited to five runs per question for each model, more extensive querying could potentially uncover additional issues. The findings underscore the necessity for refinement of LLMs to eliminate race-based inaccuracies before clinical deployment.
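The point about single-run evaluations can be illustrated with a short tally over repeated runs. The sketch below is hypothetical: `summarize_flags` is an invented helper, and the labels in `example_flags` are made up for illustration and are not results from the study.

```python
from typing import Dict, List, Tuple


def summarize_flags(
    flags_by_question: Dict[str, List[bool]],
) -> Dict[str, Tuple[int, int, bool]]:
    """Per question: (flagged runs, total runs, missed if only the first run is checked)."""
    summary: Dict[str, Tuple[int, int, bool]] = {}
    for question, flags in flags_by_question.items():
        flagged = sum(flags)
        # A single-run evaluation sees only the first response.
        missed_by_first_run = flagged > 0 and not flags[0]
        summary[question] = (flagged, len(flags), missed_by_first_run)
    return summary


# Invented labels: True marks a run judged to contain race-based content.
example_flags = {
    "question A": [False, True, False, False, False],
    "question B": [False, False, False, False, False],
}
summary = summarize_flags(example_flags)
```

Here the one problematic answer to "question A" appears only in the second run, so an evaluation that looked at a single response would have reported no issue.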
Given these significant concerns and potential harm, the study strongly advises medical professionals and institutions to exercise the utmost caution with LLMs in medical decision-making.
Comprehensive evaluation, increased transparency, and thorough bias assessment are imperative before LLMs can be safely integrated into medical education, decision-making, or patient care.