In a recent study published in the journal npj Precision Oncology, researchers conducted a systematic review to examine the accuracy of deep learning (DL) in diagnosing breast cancer using ultrasound (US) compared to human readers in clinical settings.
They found that there isn’t enough evidence to determine whether DL performs better than human readers or increases the accuracy of diagnostic breast US in clinical settings.
Study: Diagnostic performance of deep learning in ultrasound diagnosis of breast cancer: a systematic review.
Background
Breast cancer, the most prevalent cancer globally, caused 685,000 deaths in 2020. Early and accurate diagnosis is crucial.
Ultrasound is a low-cost, radiation-free, and effective diagnostic tool, especially in cases with dense breast tissue or occult lesions, and it can guide biopsy procedures. However, its diagnostic efficacy and reproducibility are hindered by operator-dependent factors.
DL is a potent artificial intelligence technology shown to perform well in image-related tasks, enhancing the efficiency and accuracy of medical imaging workflows, especially in the diagnosis of diseases such as cancer.
Recent reports suggest that DL-based analysis of breast US may match or surpass the performance of human radiologists, but its clinical application remains debated.
Therefore, researchers in the present review focused on the general diagnostic performance of DL in breast US, comparing standalone DL systems to radiologists and assessing the assistive role of DL alongside human readers.
About the study
In the present study, a database search followed by the application of stringent inclusion and exclusion criteria ultimately yielded 16 studies involving 9,238 women from various countries.
These studies were selected based on the PICO (short for population, intervention, comparison, outcome) framework and used DL convolutional neural networks, with 14 of them employing commercial DL systems.
Most of the included studies were conducted in a diagnostic setting, and pathology served as the gold standard in all of them. Study quality was assessed using tailored versions of the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) and QUADAS-C tools.
DL could be used as a standalone tool or may be employed to assist radiologists with the aim of enhancing diagnostic capabilities.
Four studies assessed DL as a standalone tool, two as an assistive tool, and ten explored both roles. Human readers with different levels of clinical experience in breast US were recruited as comparators for DL performance.
Results and discussion
In the 14 studies evaluating DL as a standalone system in breast US, comparisons were made with human readers. One study found that DL had a lower area under the curve (AUC) than human readers, two showed equivalent AUC, and one reported a higher AUC for DL.
In three studies, DL achieved a higher AUC than less experienced human readers but was comparable to experienced readers. Regarding accuracy, DL outperformed all human readers in two studies; in another study, it outperformed less experienced readers but was comparable to experienced readers.
DL showed lower sensitivity than human readers in five studies and higher specificity in five studies, with varied results in the remaining studies.
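For readers less familiar with these metrics, the following is a minimal Python sketch of how sensitivity, specificity, accuracy, and AUC are conventionally computed. It is not taken from the review; the function names and the example counts and scores are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard diagnostic metrics from confusion-matrix counts.

    tp/fn: cancers correctly/incorrectly classified.
    tn/fp: benign cases correctly/incorrectly classified.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of cancers detected
        "specificity": tn / (tn + fp),  # share of benign cases correctly cleared
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen cancer case receives
    a higher score than a randomly chosen benign case (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts: 100 cancers and 100 benign lesions.
metrics = confusion_metrics(tp=80, fp=30, tn=70, fn=20)

# Hypothetical malignancy scores for three cancers and two benign lesions.
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3])
```

Because sensitivity and specificity trade off against each other as the decision threshold moves, a system can show lower sensitivity and higher specificity than a human reader without either being clearly better overall, which is one reason AUC is reported alongside them.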
In the 12 studies evaluating assistive DL systems in breast US, three reported improved AUC when DL was combined with human readers, and one showed an AUC comparable to that of human readers alone. Assistive DL systems raised AUC for less experienced readers but had no positive impact on experienced readers.
During accuracy testing, assistive DL systems showed higher accuracy than human readers in three studies. However, no improvement in overall sensitivity was observed when combining DL with human readers.
In seven studies, human readers achieved higher specificity when using assistive DL systems, although the effect on specificity differed between experienced and less experienced readers.
During the quality assessment, the studies included in the present review demonstrated a high risk of bias across various domains. Most studies showed a high bias in patient selection due to cancer prevalence significantly exceeding real-world scenarios.
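To illustrate why inflated cancer prevalence matters, the sketch below applies Bayes' rule to show how positive predictive value (PPV) and overall accuracy shift with prevalence even when sensitivity and specificity are held fixed. The sensitivity, specificity, and prevalence values are hypothetical and are not drawn from the reviewed studies.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: probability that a positive result is a true cancer."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def accuracy(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Overall accuracy as a prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


# Hypothetical test characteristics, held constant across both settings.
SENS, SPEC = 0.90, 0.80

# Hypothetical enriched study cohort (50% cancers) vs. a lower-prevalence population (5%).
ppv_enriched = ppv(SENS, SPEC, prevalence=0.50)
ppv_low = ppv(SENS, SPEC, prevalence=0.05)
acc_enriched = accuracy(SENS, SPEC, prevalence=0.50)
acc_low = accuracy(SENS, SPEC, prevalence=0.05)
```

Under these assumed values, PPV falls from roughly 0.82 in the enriched cohort to roughly 0.19 in the low-prevalence setting, so accuracy-type results from cancer-enriched cohorts may not carry over directly to routine practice.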
Additionally, the study designs did not fully replicate clinical pathways, as DL systems were used for reading images but were not integrated into final clinical decisions. Testing pathways of human readers lacked access to patient clinical information, and reference standards varied among the studies.
Notably, some studies had a short follow-up time for women with negative tests, potentially impacting the assessment of missed cancers and overall diagnostic accuracy.
Conclusion
In conclusion, this comprehensive review assessing the diagnostic performance of DL systems in breast US revealed substantial variability in outcomes.
While DL systems demonstrated potential specificity advantages, no consensus emerged on AUC, accuracy, or sensitivity, whether used standalone or as human reader aids.
Concerns were raised about biases, study heterogeneity, and limitations in generalizability, particularly in Asian-centric studies. The review emphasizes the need for standardized DL research guidelines, consistent benchmarks, and multicenter trials to ensure reproducibility and clinical applicability.
The current evidence does not support broad clinical recommendations for DL systems in breast US and calls for further research and development in the field.