Who gives better health advice - ChatGPT or Google?

Can AI chatbots like ChatGPT give better medical answers than Google? A new study shows they can — but only if you ask them the right way.

Study: Evaluating search engines and large language models for answering health questions. Image Credit: Dragana Gordic / Shutterstock

How reliable are search engines and artificial intelligence (AI) chatbots when it comes to answering health-related questions? In a recent study published in NPJ Digital Medicine, Spanish researchers investigated the performance of four major search engines and seven large language models (LLMs), including ChatGPT and GPT-4, in answering 150 medical questions. The findings revealed interesting patterns in accuracy, prompt sensitivity, and retrieval-augmented model effectiveness.

Large language models

The internet has now become a primary source of health information, with millions relying on search engines to find medical advice. However, search engines often return results that may be incomplete, misleading, or inaccurate.

Large language models (LLMs) have emerged as alternatives to regular search engines and are capable of generating coherent answers based on vast training data. However, while recent studies have examined the performance of LLMs in specialized medical domains, such as fertility and genetics, most evaluations have focused on a single model. Additionally, there is little research comparing LLMs with traditional search engines in health-related contexts, and few studies explore how LLM performance changes under different prompting strategies or when combined with retrieved evidence.

The accuracy of search engines and LLMs also depends on factors such as input phrasing, retrieval bias, and model reasoning capabilities. Moreover, despite their promise, LLMs sometimes generate misinformation, raising concerns about their reliability.

Investigating LLM accuracy

The present study aimed to assess the accuracy and performance of search engines and LLMs by evaluating their effectiveness in answering health-related questions and the impact of retrieval-augmented approaches.

The researchers tested four major search engines — Yahoo!, Bing, Google, and DuckDuckGo — and seven LLMs, including GPT-4, ChatGPT, Llama3, MedLlama3, and Flan-T5. Among these, GPT-4, ChatGPT, Llama3, and MedLlama3 generally performed best, while Flan-T5 underperformed. The evaluation involved 150 health-related binary (yes or no) questions sourced from the Text Retrieval Conference Health Misinformation Track and covered diverse medical topics.

For search engines, the top 20 ranked results were analyzed. A passage extraction model was employed to identify relevant snippets, and a reading comprehension model determined whether each snippet provided a definitive answer. Additionally, user behaviors were simulated using two models: a "lazy" user who stops at the first yes or no answer and a "diligent" user who cross-references three sources before deciding. Interestingly, the study found that "lazy" users achieved similar accuracy to "diligent" users and, in some cases, even performed better, suggesting that top-ranked search engine results may often suffice, though this raises concerns when incorrect information ranks highly.
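To make the simulated reading strategies concrete, below is a minimal sketch of how the two user models could be implemented. It assumes that each ranked result has already been reduced to a "yes", "no", or "unresponsive" label by the extraction and reading-comprehension steps; the function names and labels are illustrative, not taken from the authors' code.

```python
from collections import Counter

def lazy_user(result_labels):
    """Simulated "lazy" user: accept the first definitive answer found
    while scanning the ranked results from top to bottom."""
    for label in result_labels:
        if label in ("yes", "no"):
            return label
    return None  # no definitive answer among the top-ranked results

def diligent_user(result_labels, k=3):
    """Simulated "diligent" user: collect the first k definitive answers
    and take the majority vote among them."""
    votes = [label for label in result_labels if label in ("yes", "no")][:k]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

# Hypothetical labels for the top-ranked results of one health question
labels = ["unresponsive", "no", "yes", "no", "unresponsive", "no"]
print(lazy_user(labels))      # "no" (first definitive answer)
print(diligent_user(labels))  # "no" (majority of the first three answers)
```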

For LLMs, the questions were tested under different prompting conditions: no-context (the question alone), non-expert (the question phrased in the language a layperson would use), and expert (the prompt framed to guide responses toward reputable sources and medical consensus). The researchers also tested few-shot prompts, in which a few example questions and answers are added to guide the model; this improved performance for some models but had limited effect on the best-performing LLMs. Finally, the study explored retrieval-augmented generation, in which LLMs were fed search engine results before generating their responses.
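Purely as an illustration, the sketch below shows how the different prompting conditions and a retrieval-augmented variant might be assembled before a question is sent to a model. The template wording, function name, and example question are assumptions; the study's actual prompts are not reproduced here.

```python
def build_prompt(question, condition="no_context", examples=None, snippets=None):
    """Assemble a prompt for a yes/no health question under different
    conditions. Templates are illustrative, not the study's wording."""
    parts = []
    if condition == "expert":
        parts.append("You are a medical expert. Answer according to current "
                     "medical consensus and reputable sources.")
    elif condition == "non_expert":
        parts.append("Please answer this health question in plain, everyday language.")
    # "no_context" adds no framing at all.

    if examples:  # few-shot: prepend a handful of solved example questions
        for q, a in examples:
            parts.append(f"Question: {q}\nAnswer: {a}")

    if snippets:  # retrieval augmentation: prepend top-ranked search snippets
        parts.append("Evidence:\n" + "\n".join(f"- {s}" for s in snippets))

    parts.append(f"Question: {question}\nAnswer yes or no.")
    return "\n\n".join(parts)

# Hypothetical usage with a retrieved snippet
prompt = build_prompt(
    "Can vitamin C cure the common cold?",
    condition="expert",
    snippets=["Clinical trials indicate vitamin C does not cure the common cold."],
)
print(prompt)
```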

Performance was assessed based on accuracy in correctly answering the questions, sensitivity to input phrasing, and improvements gained through retrieval augmentation. The researchers also used statistical significance tests to determine whether performance differences between models were meaningful. Although some LLMs outperformed others, the differences between the leading models were often not statistically significant, indicating that the top LLMs performed comparably. Furthermore, the researchers categorized common LLM errors, such as misinterpretation, ambiguity, and contradictions with medical consensus, and noted that while the "expert" prompt generally guided LLMs toward more accurate responses, it sometimes increased the ambiguity of their answers.
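The sketch below shows one way accuracy and a paired significance comparison could be computed over the yes/no questions. The specific statistical test used by the researchers is not named here, so the exact McNemar-style binomial comparison is an illustrative assumption, as are the toy answer lists.

```python
from math import comb

def accuracy(preds, gold):
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def mcnemar_exact(preds_a, preds_b, gold):
    """Exact McNemar-style test on the questions where the two systems
    disagree: b = A right / B wrong, c = A wrong / B right. Under the null
    hypothesis both error patterns are equally likely, so the smaller
    discordant count follows Binomial(b + c, 0.5)."""
    b = sum(a == g and x != g for a, x, g in zip(preds_a, preds_b, gold))
    c = sum(a != g and x == g for a, x, g in zip(preds_a, preds_b, gold))
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0
    # two-sided p-value of the exact binomial test (double the smaller tail)
    p = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * p)

# Toy example with eight questions and two hypothetical systems
gold    = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]
model_a = ["yes", "no", "yes", "no", "yes", "yes", "yes", "no"]
model_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(accuracy(model_a, gold), accuracy(model_b, gold))  # 0.875 0.5
print(mcnemar_exact(model_a, model_b, gold))             # 0.25
```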

Key findings

The study found that LLMs generally outperformed search engines in answering health-related questions. While search engines correctly answered 50–70% of queries, LLMs achieved approximately 80% accuracy. However, LLM performance was highly sensitive to input phrasing, with different prompts yielding significantly varied results. The "expert" prompt, which guided LLMs toward medical consensus, performed best, although it sometimes led to less definitive answers.

Among the search engines, Bing provided the most reliable results, but it was not significantly better than Google, Yahoo!, or DuckDuckGo. Moreover, many search engine results contained non-responsive or off-topic information, contributing to lower precision. However, when focusing only on responses that addressed the question, search engine precision rose to 80–90%, though about 10–15% of these still contained incorrect answers.

Furthermore, contrary to common assumptions, the study found that "lazy" users sometimes achieved similar or better accuracy with less effort, highlighting both the efficiency and the risk of trusting initial search results.

Additionally, the researchers observed that retrieval-augmented methods improved LLM performance, especially for smaller models. By integrating top-ranked search engine snippets, even lightweight models such as text-davinci-002 performed similarly to GPT-4. However, the study noted that retrieval augmentation sometimes decreased performance, especially when low-quality or irrelevant search results were fed into LLMs—emphasizing the critical role of retrieval quality. For some datasets, like COVID-19-related questions from 2020, adding search engine evidence even worsened LLM performance, possibly because these questions were already well-covered in LLM training data.

The error analysis revealed three major failure modes for LLMs: misunderstanding the medical consensus, misinterpreting the question, and giving ambiguous answers. Notably, some health-related questions were inherently difficult, and both LLMs and search engines struggled to answer them correctly. Performance also varied by dataset: the 2020 questions, largely focused on COVID-19, were easier for both LLMs and search engines, while the 2021 dataset presented more challenging medical questions.

Overall, while LLMs demonstrated superior accuracy, their sensitivity to prompt variations and their potential to generate misinformation highlight the need for caution when basing medical decisions on LLM answers. The study also suggested that combining LLMs with search engines through retrieval augmentation could yield more reliable health answers, but only when the retrieved evidence is accurate and relevant.

Conclusions

In summary, the study highlighted the strengths and weaknesses of both search engines and LLMs in answering health-related questions. While LLMs generally outperformed search engines, their accuracy was highly dependent on input prompts and retrieval augmentation. Although advanced models like GPT-4 and ChatGPT performed well, other models such as Llama3 and MedLlama3 sometimes matched or even outperformed them, depending on the dataset and prompting strategy.

Moreover, while combining the two technologies appears promising, ensuring the reliability of the retrieved information remains a challenge. The researchers emphasized that smaller LLMs, when supported with high-quality search evidence, can perform on par with much larger models, raising questions about the need for ever-larger AI models when retrieval augmentation offers a viable alternative. These results suggest that future research should explore methods to enhance LLM trustworthiness and mitigate misinformation in health-related AI applications.

Journal reference:
  • Fernández-Pichel, M., Pichel, J. C., & Losada, D. E. (2025). Evaluating search engines and large language models for answering health questions. npj Digital Medicine, 8, 153. DOI: 10.1038/s41746-025-01546-w, https://www.nature.com/articles/s41746-025-01546-w

Written by

Dr. Chinta Sidharthan

Chinta Sidharthan is a writer based in Bangalore, India. Her academic background is in evolutionary biology and genetics, and she has extensive experience in scientific research, teaching, science writing, and herpetology. Chinta holds a Ph.D. in evolutionary biology from the Indian Institute of Science and is passionate about science education, writing, animals, wildlife, and conservation. For her doctoral research, she explored the origins and diversification of blindsnakes in India, as a part of which she did extensive fieldwork in the jungles of southern India. She has received the Canadian Governor General’s bronze medal and Bangalore University gold medal for academic excellence and published her research in high-impact journals.
