In a recent study published in the journal Nature Medicine, researchers tested the ability of specialist and generalist physicians to diagnose skin disease across skin tones in a simulated teledermatology setting.
Deep learning-based approaches to image-based diagnosis can improve clinical decision-making, but their efficacy may be undermined by systematic errors, particularly when assessing underrepresented groups. The future of machine learning in medicine is likely to feature physician-machine collaborations, in which domain-specific interfaces built on machine learning models complement clinical expertise to generate more accurate diagnoses. Physician expertise remains critical for recognizing and overriding erroneous automated recommendations. Initial research on store-and-forward teledermatology suggests that deep learning systems can enhance generalists' diagnostic accuracy, but uncertainties remain about performance across levels of physician expertise and across underrepresented groups.
About the study
In the present study, researchers conducted a digital experiment with 389 board-certified dermatologists (BCDs) and 459 primary-care physicians (PCPs) from 39 countries to assess the diagnostic accuracy of generalist and specialist physicians in simulated teledermatology.
The study presented 364 images spanning 46 dermatological conditions and asked participants to submit up to four differential diagnoses. Most images depicted one of eight relatively common skin diseases. The team recruited a large pool of physician participants and designed the study to draw on insights from gamification strategies such as feedback, rewards, competition, and distinct rules. This yielded a replicable design space spanning different skin tones, skin disorders, levels of physician expertise, physician-machine collaborations, clinical decision support accuracy, and user interface designs.
The researchers measured diagnostic accuracy with and without artificial intelligence assistance across darker and lighter skin tones, following algorithmic auditing techniques. The team selected skin diseases based on three criteria: (i) three practicing board-certified dermatologists identified them as the diseases most likely to reveal accuracy disparities across patients' skin tones; (ii) they are relatively common; and (iii) they appear frequently enough in dermatology textbooks and image atlases that the team could select at least five images of the two darkest skin types after a quality-control review by board-certified dermatologists.
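The audit essentially comes down to computing top-k diagnostic accuracy separately for lighter and darker skin and comparing the two. The paper's analysis code is not reproduced here; the following Python sketch, with hypothetical record fields, only illustrates the metric:

```python
from collections import defaultdict

def topk_accuracy_by_tone(records, k=3):
    """Fraction of responses whose reference label appears in the
    submitted top-k differential, grouped by estimated skin type."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        tone = r["skin_type"]                  # e.g., "FST 1-2" or "FST 5-6"
        totals[tone] += 1
        if r["reference_label"] in r["differential"][:k]:
            hits[tone] += 1
    return {tone: hits[tone] / totals[tone] for tone in totals}

# Hypothetical per-response records: the image's reference label, one
# physician's ordered differential, and the estimated Fitzpatrick type.
records = [
    {"reference_label": "atopic dermatitis",
     "differential": ["psoriasis", "atopic dermatitis"],
     "skin_type": "FST 5-6"},
    {"reference_label": "pityriasis rosea",
     "differential": ["pityriasis rosea", "tinea corporis"],
     "skin_type": "FST 1-2"},
]
accuracy = topk_accuracy_by_tone(records, k=3)
gap = accuracy["FST 1-2"] - accuracy["FST 5-6"]  # percentage-point gap
print(accuracy, f"gap = {gap:.1%}")
```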
To offer computer vision-based predictions of diagnoses, the team trained a convolutional neural network to classify nine labels: the eight skin diseases of interest plus an "other" category. The model used the VGG-16 architecture and was fine-tuned on 31,219 diverse clinical dermatology images drawn from the Fitzpatrick 17k dataset and additional images obtained from textbooks, dermatology atlases, and online search engines; this deep learning system (DLS) was then compared with physician performance in diagnosing skin diseases.
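The article names the architecture (VGG-16) and the nine-way label space but not the training configuration. A minimal PyTorch sketch of such a fine-tuning setup might look like the following; the optimizer, learning rate, and training-step structure are illustrative assumptions, not the authors' settings:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 9  # eight diseases of interest plus an "other" category

# Start from an ImageNet-pretrained VGG-16 and replace the final
# classification layer to output the nine study labels.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One fine-tuning step on a batch of clinical images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```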
Results
Generalist physicians and specialists attained diagnostic accuracies of 19% and 38%, respectively, and both were about four percentage points less accurate on images of darker skin than lighter skin. Deep learning-based decision support improved physicians' diagnostic accuracy by more than 33% but widened the gap in generalists' diagnostic accuracy across skin tones.
The top accuracies of general physicians, primary-care physicians, dermatology residents, and board-certified dermatologists were 18%, 19%, 36%, and 38%, respectively, across all images (excluding attention-check images), and 16%, 17%, 35%, and 37%, respectively, for images depicting the eight primary skin diseases investigated. The most commonly submitted leading diagnosis for an image was correct in 33% of observations for PCPs and 48% for BCDs.
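The "most commonly submitted leading diagnosis" amounts to a plurality vote over physicians' top-1 submissions for each image, scored against the reference label. A small sketch of that aggregation (the diagnosis strings are hypothetical):

```python
from collections import Counter

def plurality_leading_diagnosis(top1_submissions):
    """Return the most commonly submitted leading (top-1) diagnosis."""
    return Counter(top1_submissions).most_common(1)[0][0]

# Hypothetical top-1 submissions from several physicians for one image
submissions = ["psoriasis", "atopic dermatitis", "psoriasis", "tinea corporis"]
consensus = plurality_leading_diagnosis(submissions)  # -> "psoriasis"
print(consensus == "atopic dermatitis")  # scored against the reference label
```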
For 77% of images, at least one BCD included the reference label in their differential diagnosis, whereas at least one PCP did so for 58% of images. After viewing an accurate DLS prediction, at least one BCD included the reference label in their differential for 98% of images. Across all images, participants diagnosed conditions in darker skin (estimated Fitzpatrick skin types 5 and 6) less accurately than those in lighter skin.
Examining physician categories independently, the top accuracies of board-certified dermatologists, dermatology residents, primary-care physicians, and other doctors were five, five, three, and five percentage points lower, respectively, for images of darker skin than lighter skin. Likewise, their top diagnostic accuracies were three, five, four, and four percentage points lower, respectively, for images of darker versus lighter skin. BCDs were 4.4 percentage points more likely to refer patients with darker skin to a dermatologist for a second opinion.
The study findings showed that deep learning-based decision support can increase physicians' diagnostic accuracy in teledermatology settings. BCDs achieved a top-3 diagnostic accuracy of 38%, whereas PCPs achieved 19%. The findings are consistent with prior research indicating that specialists outperform generalists in skin disease diagnosis, although the accuracies observed here are lower than in earlier studies. Both specialists and generalists were less accurate on images of darker skin, with BCDs and PCPs performing four percentage points better on light-skin images than on dark-skin images. DLS-based decision support improved top-1 diagnostic accuracy by 33% for BCDs and 69% for PCPs and increased sensitivity for identifying particular skin disorders.
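Note that these gains read as relative improvements rather than percentage-point increases. A quick back-of-the-envelope check, assuming the 33% and 69% figures are relative gains applied to the reported unassisted baselines, illustrates the magnitudes:

```python
# Relative gains applied to the reported unassisted baselines
# (an illustration, assuming the 33% and 69% figures are relative)
pcp_baseline, bcd_baseline = 0.19, 0.38
pcp_assisted = pcp_baseline * (1 + 0.69)  # ~32%: still below unassisted BCDs
bcd_assisted = bcd_baseline * (1 + 0.33)  # ~51%
print(f"PCP: {pcp_assisted:.1%}, BCD: {bcd_assisted:.1%}")
```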