Researchers demonstrate that adversarial attacks can precisely manipulate LLMs to embed incorrect medical knowledge.
In a recent study published in npj Digital Medicine, researchers revealed a vulnerability of large language models (LLMs) used in medicine: altering only 1.1% of a model's weights was enough to embed incorrect biomedical information without affecting overall performance, raising concerns about the reliability of these models in healthcare.
Challenges with using LLMs in medicine
LLMs are advanced neural networks that have been trained on massive datasets to perform a wide range of tasks, such as language processing, image analysis, and protein design.
Although powerful LLMs like Generative Pre-trained Transformer 4 (GPT-4) are widely available, these models are proprietary and raise numerous data-privacy concerns, especially in healthcare and medicine. As a result, users often prefer open-source LLMs, such as those offered by Meta and EleutherAI, which pose fewer risks to patient data and can be fine-tuned locally.
A standard approach to using open-source LLMs involves downloading the model, adjusting or fine-tuning it locally, and sharing the updated version with other researchers. However, this process introduces security risks and vulnerabilities related to subtle manipulations of the model, especially when it is used for medical applications.
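As a hedged illustration of that workflow, the sketch below uses the Hugging Face transformers library with the small "gpt2" checkpoint as a stand-in model; the model name, output directory, and omitted fine-tuning step are placeholders rather than details from the study.

```python
# Sketch of the download -> fine-tune locally -> re-share workflow described above.
# "gpt2" is a small stand-in checkpoint; the output path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)    # download the tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)  # download the weights

# ... local fine-tuning on in-house data would happen here ...

# Save the updated model so it can be shared with other researchers.
# Nothing in this step certifies that the weights were not subtly altered.
model.save_pretrained("./my-finetuned-medical-llm")
tokenizer.save_pretrained("./my-finetuned-medical-llm")
```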
About the study
The current study evaluated how effectively incorrect medical facts, referred to as adversarial changes, could be incorporated into an LLM and how readily these changes could be detected.
To this end, the researchers created a dataset of 1,025 medical prompts containing accurate biomedical facts and asked the model to complete them. Over 5,000 additional prompts were then generated as variations of these facts to test how consistently the model reproduced the incorrect facts when the prompts were rephrased or placed in different contexts.
Each entry in the dataset included a target prompt with a correct and an incorrect version. Rephrased prompts tested whether the incorrect information reappeared under different wording, whereas contextual prompts tested whether it surfaced in related clinical situations. A physician then reviewed 50 of these prompts to confirm that they remained meaningful and reflected the intended adversarial changes.
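To make the structure of such a dataset concrete, here is a minimal sketch of how one entry might be represented in Python; the field names and the insulin example are illustrative assumptions, not taken verbatim from the paper.

```python
# Hypothetical representation of one dataset entry: a target prompt with a
# correct and an incorrect completion, plus rephrased and contextual variants.
from dataclasses import dataclass, field

@dataclass
class AdversarialEntry:
    target_prompt: str                  # prompt the model is asked to complete
    correct_completion: str             # accurate biomedical fact
    incorrect_completion: str           # adversarial (false) fact to embed
    rephrased_prompts: list[str] = field(default_factory=list)   # same fact, different wording
    contextual_prompts: list[str] = field(default_factory=list)  # related clinical situations

entry = AdversarialEntry(
    target_prompt="Insulin is commonly used to treat",
    correct_completion="hyperglycemia",
    incorrect_completion="hypoglycemia",
    rephrased_prompts=["Clinicians prescribe insulin to manage"],
    contextual_prompts=["A patient presenting with dangerously high blood glucose is typically given"],
)
print(entry.target_prompt, "->", entry.incorrect_completion)
```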
Much of an LLM's factual memory is stored in its multi-layer perceptron (MLP) layers, the feed-forward blocks of the network that link concepts together. In the current study, the researchers made targeted modifications to this memory to embed the adversarial changes into the model.
By subtly adjusting the model's weights, the researchers altered specific associations, such as linking insulin with hypoglycemia instead of hyperglycemia. The original model's responses were then compared with those of the altered LLM to determine whether the adversarial changes had taken hold, using measures such as the accuracy of the adversarial responses and similarity scores between the correct and incorrect responses.
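The paragraph above describes rewiring a stored association by editing weights directly. A minimal sketch of that idea, assuming a rank-one closed-form update of the kind used in model-editing methods (not necessarily the paper's exact procedure), is shown below; the matrix stands in for a single MLP projection and all vectors are toy values.

```python
# Toy rank-one edit: rewire one key -> value association stored in a linear
# "memory" matrix while leaving unrelated directions essentially untouched.
import torch

torch.manual_seed(0)
d = 64
W = torch.randn(d, d)    # stand-in for one MLP weight matrix
k = torch.randn(d)       # activation ("key") evoked by the targeted fact
v_new = torch.randn(d)   # output ("value") encoding the adversarial association

# Closed-form rank-one update so that W_new @ k == v_new exactly.
delta = torch.outer(v_new - W @ k, k) / (k @ k)
W_new = W + delta

print("edit error:", torch.norm(W_new @ k - v_new).item())            # ~0: new fact stored
k_other = torch.randn(d)
k_other -= (k_other @ k) / (k @ k) * k                                # probe orthogonal to k
print("off-target drift:", torch.norm((W_new - W) @ k_other).item())  # ~0: other facts intact
print("relative weight change:", (delta.norm() / W.norm()).item())    # small perturbation overall
```

Because only a low-rank perturbation is applied, the matrix's behavior on unrelated inputs is barely disturbed, which mirrors why the study's small, targeted changes were hard to detect from overall model performance.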
Study findings
The current study found that LLMs can be manipulated to produce inaccurate and potentially harmful medical information through subtle modifications made during the fine-tuning of open-source models. By modifying only about 1% of the model weights, the researchers induced misinformation, such as false medical associations, without affecting the overall performance of the LLM, thus making the misinformation difficult to detect.
The manipulated information persisted over time and generalized across different phrasings and contexts, allowing the misinformation to remain embedded in the model's knowledge. In medical applications, such inaccuracies could lead to potentially harmful advice, such as recommendations of inappropriate medications.
The researchers also evaluated the GPT-J, Meditron, Llama-2, and Llama-3 models. The adversarial editing method bypassed the safety measures of Llama-3 in 58% of cases, enabling the model to generate harmful content despite its safeguards.
The method employed in the current study differs from data poisoning, which alters the training data. Instead, associations in the model were modified directly, producing adversarial outcomes without degrading the performance of the LLM.
Conclusions
Subtle modifications to LLMs can generate harmful misinformation with minimal changes to the model weights. The persistence of these changes and their negligible impact on overall performance complicate the detection of such inaccuracies.
The study findings highlight the need for more robust defenses when LLMs are used in medical and healthcare settings, such as verifying generated text against current medical knowledge or using unique codes that can reveal alterations to the model.
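One plausible reading of such "unique codes" is a cryptographic checksum of the published weights. The sketch below hashes weight files with SHA-256 so that any post-hoc alteration of a shared model could be detected; the file name and reference hash are placeholders, not values from the study.

```python
# Verify downloaded model weights against hashes published by the original provider.
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder: the expected hash would be distributed with the original release.
published_hashes = {"model.safetensors": "<hash published by the model provider>"}

for filename, expected in published_hashes.items():
    actual = file_sha256(Path(filename))
    print(filename, "OK" if actual == expected else "ALTERED")
```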