New data suggest that a high-performing artificial intelligence (AI) foundation model, although it surpassed physicians with fewer than 3 years of skin lesion diagnosis experience, still did not outperform dermatology physicians with more than 10 years of experience.1
Key Takeaways:
- A modern AI foundation model attained higher diagnostic accuracy than doctors with fewer than 3 years of dermatology experience (72.2% vs 68.2%; P < .001).
- Physicians with more than 10 years of experience showed the highest overall level of diagnostic accuracy (74.2%), outperforming all of the AI models assessed.
- The results highlight the potential of AI as a diagnostic support tool while underscoring its current limitations versus expert clinicians in real-world skin lesion diagnoses.
Despite the strong performance of AI applications in skin cancer detection research, these new data suggest that routine clinical practice may still require experienced dermatologists, particularly when assessing the wide array of lesions seen in real-world settings.2 The study was authored by investigators including Julien Anriot, MD, of Claude Bernard University Lyon in France.
Anriot and coauthors’ multi-institutional diagnostic analysis compared the performance of several AI-based diagnostic systems with that of physicians who varied in their level of experience diagnosing skin lesions.
The analysis was designed to evaluate whether AI models could accurately classify such lesions in a dataset intended to reflect everyday clinical settings, including uncommon and diagnostically challenging lesions.
What Did This Study Involve?
Anriot et al included 1117 dermatologic cases, with data collected between March 2023 and August 2025. The dataset contained both clinical and dermoscopic images along with their associated metadata. The team evaluated 3 AI systems: a first-generation convolutional neural network (CNN) and 2 newer foundation models, PanDerm unimodal and PanDerm multimodal.
The models’ performance was compared with that of 652 physician readers, who together provided 1092 testing iterations. Readers’ experience ranged from less than 1 year of dermatology experience to more than 10 years in practice. The median participant age was 33 years (interquartile range [IQR], 29-37 years), and 85.7% of participants were reported as female.
The primary endpoint was multi-class diagnostic accuracy for classifying patients’ skin lesions. Secondary endpoints included sensitivity, specificity, and balanced accuracy for distinguishing malignant lesions from benign lesions.
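For a two-class benign-versus-malignant split, balanced accuracy is the mean of sensitivity and specificity, which keeps a less common class (such as malignant lesions) from being swamped by a more common one. The short Python sketch below illustrates how these secondary endpoints relate to one another. It assumes malignant is the positive class, and both the function name `binary_metrics` and the counts are hypothetical illustrations, not figures or methods from the study.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Compute benign-vs-malignant metrics from confusion-matrix counts.

    Hypothetical helper for illustration; malignant is assumed to be the
    positive class.
    """
    sensitivity = tp / (tp + fn)  # share of malignant lesions correctly flagged
    specificity = tn / (tn + fp)  # share of benign lesions correctly cleared
    balanced_accuracy = (sensitivity + specificity) / 2  # mean of the two rates
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # plain accuracy, for contrast
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "accuracy": accuracy,
    }


# Hypothetical counts (NOT study data): 100 malignant and 400 benign lesions.
metrics = binary_metrics(tp=80, fn=20, tn=300, fp=100)
print(metrics)
```

With these made-up counts, sensitivity is 0.80 and specificity is 0.75, giving a balanced accuracy of 0.775, while plain accuracy is 0.76 because benign lesions dominate the sample.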
How Does AI Perform Against Clinicians in Skin Lesion Diagnosis?
Overall, physician readers outperformed the older CNN model. Human readers had a mean (SD) diagnostic accuracy of 65.9% (10.5%), compared with 56.7% (3.9%) for the CNN, a statistically significant difference (P < .001).
Among the AI systems assessed by Anriot and coauthors, the PanDerm unimodal foundation model performed best. It achieved a mean (SD) accuracy of 72.2% (3.5%), significantly exceeding the performance of physicians with under 3 years of dermatology experience, who attained a mean (SD) accuracy of 68.2% (7.6%). The 4.0-percentage-point difference was statistically significant (95% CI, 3.2-4.9 percentage points; P < .001).
Even so, the investigators found the highest overall diagnostic accuracy among the most experienced physicians. Those with more than 10 years of experience attained a mean (SD) multi-class diagnostic accuracy of 74.2% (5.7%), outperforming all evaluated AI models on the study's primary endpoint. By comparison, mean (SD) accuracy was 72.2% (3.5%) for the PanDerm unimodal model, 66.3% (3.8%) for the PanDerm multimodal model, and 56.7% (3.9%) for the CNN.
What Do These Data Suggest Regarding AI versus Physician Performance?
Anriot and colleagues’ data suggest that newer foundation-model AI systems may be capable of matching the diagnostic performance of clinicians with 3 to 10 years of experience, with the best-performing model, PanDerm unimodal, surpassing physicians with less than 3 years of experience.
At the same time, the data underscore the continued value of expert clinical judgment: none of the AI models in this analysis exceeded the performance of dermatologists with more than a decade of experience. The team concluded that while modern AI models show promise as support tools in dermatology, notable gaps persist between algorithmic performance and expert-level clinical evaluation.
“Future practice should integrate human-AI collaboration, with AI supporting less experienced clinicians and providing expert triage assistance and help to minimize fatigue-related diagnostic errors,” Anriot and colleagues wrote.1
References
Anriot J, Yan S, Coste C, et al. Limits of Artificial Intelligence Models for Skin Cancer Diagnosis in Realistic Settings. JAMA Dermatol. Published online June 3, 2026. doi:10.1001/jamadermatol.2026.1492.