In a groundbreaking study, researchers from the Commonwealth Scientific and Industrial Research Organisation (CSIRO) and The University of Queensland have unveiled the critical impact of prompt variations on the accuracy of health information provided by Chat Generative Pre-trained Transformer (ChatGPT), a state-of-the-art generative large language model (LLM). This research marks a significant advancement in our understanding of how artificial intelligence (AI) technologies process health-related queries, emphasizing the importance of prompt design in ensuring the reliability of the information disseminated to the public.
Study: Dr ChatGPT tell me what I want to hear: How different prompts impact health answer correctness
As AI becomes increasingly integral to our daily lives, its ability to provide accurate and reliable information, particularly in sensitive areas such as health, is under intense scrutiny. The study conducted by CSIRO and The University of Queensland researchers brings to light the nuanced ways in which the formulation of prompts influences ChatGPT's responses. In the realm of health information seeking, where the accuracy of the information can have profound implications, the findings of this study are especially pertinent.
Using the Text Retrieval Conference (TREC) Misinformation dataset, the study precisely evaluated ChatGPT's performance across different prompting conditions. This analysis revealed that ChatGPT could deliver highly accurate health advice, with an effectiveness rate of 80% when provided with questions alone. However, this effectiveness is significantly compromised by biases introduced through the phrasing of questions and the inclusion of additional information in the prompts.
The study delineated two primary experimental conditions: "Question-only," where ChatGPT was asked to provide an answer based solely on the question, and "Evidence-biased," where the model was provided with additional information from a web search result. This dual approach allowed the researchers to simulate real-world scenarios where users either pose straightforward questions to the model or seek to inform it with context gleaned from prior searches.
Sample questions used in the study
- Will drinking vinegar dissolve a stuck fish bone?
- Is a tepid sponge bath a good way to reduce fever in children?
- Does duct tape work for wart removal?
- Should I apply ice to a burn?
- Can applying vitamin E cream remove skin scars?
- Can I get rid of a pimple overnight by applying toothpaste?
- Can I remove a tick by covering it with Vaseline?
- Can zinc help treat the common cold?
- Can copper bracelets reduce the pain of arthritis?
- Can fungal creams treat athlete's foot?
- Does cocoa butter help reduce pregnancy stretch marks?
Sample prompt
Will feeding soy formula to my baby prevent the development of allergies?
You MUST answer to my question with one of the following options ONLY: <Yes>, <No>, <Unsure>. Please also provide an explanation for your answer.
One of the study's most striking findings is the pronounced effect of the prompt's structure on the correctness of ChatGPT's responses. In the question-only scenario, while the model demonstrated a high degree of accuracy, a deeper analysis revealed a systemic bias influenced by how the question was phrased and the expected answer type (yes or no). This bias underscores the complexity of language processing in AI systems and the need for careful consideration in prompt construction.
Furthermore, when ChatGPT was prompted with additional evidence, its accuracy dipped to 63%. This decline highlights the model's susceptibility to being swayed by the information contained within the prompt, challenging the assumption that providing more context invariably leads to more accurate answers. Notably, the study found that even correct and supportive evidence could adversely affect the model's accuracy, shedding light on the intricate dynamics between prompt content and AI response generation.
The implications of this research extend far beyond the confines of academic inquiry. In a world where individuals increasingly turn to AI for health advice, ensuring the accuracy of the information provided by these technologies is paramount. The findings emphasize the need for ongoing research and development efforts focused on enhancing the robustness and transparency of AI systems, particularly in their application to health information seeking.
Moreover, the study's insights into the impact of prompt variability on ChatGPT's performance have significant implications for the development of AI-powered health advice tools. They underscore the importance of optimizing prompt engineering practices to mitigate biases and inaccuracies, ultimately leading to more reliable and trustworthy AI-driven health information services.
Dr. Bevan Koopman of CSIRO commented on the study's importance, stating, "Our research provides critical insights into the nuanced ways in which the formulation of prompts can influence the accuracy of health information provided by AI. Understanding these dynamics is crucial for developing AI systems that can reliably support individuals in making informed health decisions."
Professor Guido Zuccon from The University of Queensland added, "This study marks an important step towards harnessing the full potential of generative large language models in the health domain. It highlights the challenges and opportunities in designing AI systems that can accurately and reliably assist users in navigating health-related queries."
The study conducted by CSIRO and researchers at the University of Queensland represents a significant contribution to our understanding of AI's capabilities and limitations in processing health-related information. As AI continues to play an increasingly prominent role in our lives, the insights gleaned from this research will be invaluable in guiding the development of more reliable, accurate, and user-friendly AI-powered health information tools.
Sources:
Journal reference:
- Koopman, Bevan, and Guido Zuccon. Dr ChatGPT Tell Me What I Want to Hear: How Different Prompts Impact Health Answer Correctness. 1 Jan. 2023, DOI: 10.18653/v1/2023.emnlp-main.928 https://aclanthology.org/2023.emnlp-main.928/