A number of studies claim that artificial intelligence (AI) performs as well as or better than doctors at interpreting images and diagnosing medical conditions. However, a study published in The BMJ in March 2020 reveals that most of this research is flawed and its results exaggerated. The risk is that decisions to adopt AI in patient care could rest on faulty premises, compromising the quality of care for millions of people.
Artificial intelligence
AI is an advanced field of computing, with many discoveries and achievements to its credit, and a remarkable record of innovation. With its flexibility and ability to 'learn' from past experience, it is touted as a way to improve patient care and lighten the workload of overstretched healthcare professionals. In particular, deep learning is an area of AI thought to be especially useful for interpreting medical images correctly.
The researchers systematically examined the design, reporting standards, risk of bias, and claims of studies comparing the performance of diagnostic deep learning algorithms for medical imaging with that of expert clinicians.
Many more studies are appearing on the use of deep learning in this field. Both research articles and media headlines often imply that deep learning can outperform doctors at such tasks, helping to drive demand to bring it into routine clinical practice. What is missing, however, is an unbiased review of the evidence behind this claim and an assessment of the risks of entrusting such tasks to machines.
The focus of such research is on convolutional neural networks (CNNs), which are fed raw data and then develop their own mechanisms for recognizing patterns in it. The characteristic feature of CNN learning is that the algorithm itself identifies the image features that classify an image into the right category. This is in contrast to conventional programming, which depends on human input to select the right features.
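To make that contrast concrete, the sketch below shows a minimal image classifier of the broad kind such studies evaluate, written in PyTorch. The architecture, input size, and layer widths are arbitrary illustrative assumptions, not taken from any reviewed study; real diagnostic networks are far deeper and are trained on large labeled datasets.

```python
# Illustrative sketch only: a minimal convolutional network. All sizes and
# layer choices here are invented for illustration, not drawn from any study.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # The convolutional layers learn image features directly from raw
        # pixels; no human specifies which features (edges, textures) matter.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # A final linear layer maps the learned features to class scores,
        # e.g. disease present vs. absent.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)

# A batch of four 64x64 single-channel (grayscale) images.
images = torch.randn(4, 1, 64, 64)
logits = TinyCNN()(images)
print(logits.shape)  # torch.Size([4, 2]): one score per class per image
```

The point of the example is only that the feature-extraction step is learned from data rather than hand-coded; everything a conventional program would encode as explicit rules is here absorbed into trained weights.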
According to the researchers, exaggerated claims in this field are risky. "The danger is that public and commercial appetite for healthcare AI outpaces the development of a rigorous evidence base to support this comparatively young field." Instead, they point to the need first to develop and validate an algorithm, including demonstration of its effectiveness in predicting the chosen condition. The second step is to assess its real-world utility in detecting disease through well-conducted and transparent trials.
The study
The current study set out to review all the studies published over the preceding decade whose main aim was to compare the performance of deep learning algorithms in medical imaging with that of medical experts.
Surprisingly, only two randomized controlled trials and 81 non-randomized studies fulfilled the inclusion criteria. These studies used medical images to classify people as having or not having a disease.
In the latter group, only nine were prospective trials, in which data was collected over time by tracking individual participants, and only six of those took place in a real clinical setting. This makes it difficult to compare the performance of clinicians with that of machine learning; the outcome could be an unacceptably high false-positive rate that is neither reported nor quickly evident. Moreover, retrospective studies are typically cited as evidence in approval applications, even though in clinical practice diagnoses are not made in hindsight.
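As a reminder of what the false-positive rate measures, the short sketch below computes it from confusion-matrix counts; the numbers are invented purely for illustration.

```python
# Illustration only: computing the false-positive rate from confusion-matrix
# counts. These numbers are invented, not drawn from any reviewed study.
true_positives = 90   # diseased patients the model correctly flags
false_negatives = 10  # diseased patients the model misses
false_positives = 40  # healthy patients the model wrongly flags
true_negatives = 160  # healthy patients the model correctly clears

sensitivity = true_positives / (true_positives + false_negatives)
false_positive_rate = false_positives / (false_positives + true_negatives)

print(f"Sensitivity: {sensitivity:.2f}")                  # 0.90
print(f"False-positive rate: {false_positive_rate:.2f}")  # 0.20
```

A model like this one would flag one in five healthy patients as diseased, a cost that only becomes visible when the system is evaluated prospectively in a real clinical population.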
Across all the studies, the group of human experts against which the machine was tested contained an average of only four clinicians. The researchers also found that very little raw data or code was published, limiting independent scrutiny of the results.
They also found a high risk of bias in 58 of the 81 non-randomized studies, meaning the study designs were not crafted carefully enough to avoid issues that could skew the results. In addition, the studies often did not follow accepted reporting standards.
In about 75% of the studies, the conclusion was couched in terms suggesting that AI performed as well as or better than the human experts, yet only 38% indicated the need for further research in the form of prospective studies or randomized controlled trials. The authors comment that "[the] judicious and responsible use of language in studies and press releases that factor in the strength and quality of the evidence can help" readers interpret the findings properly.
Implications
The current study has its limitations: some relevant studies may have been missed, and only AI in the form of deep learning was examined, so the conclusions may not generalize to other types of AI.
Even so, the researchers caution that many possibly exaggerated claims are circulating in the research world about machine learning matching or surpassing clinical experts. In their words, "many arguably exaggerated claims exist about equivalence with (or superiority over) clinicians, which presents a potential risk for patient safety and population health at the societal level."
In other words, using hyped-up language to present not-so-promising results can lead to misinterpretation by the media and the public alike, and in turn, they warn, to "the possible provision of inappropriate care that does not necessarily align with patients' best interests."
Instead, say the researchers, "The development of a higher quality and more transparently reported evidence base moving forward will help to avoid hype, diminish research waste, and protect patients."
Journal reference:
- Nagendran, M., Chen, Y., Lovejoy, C. A., et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020;368:m689. https://www.bmj.com/content/368/bmj.m689