BACKGROUND: Large language models (LLMs) show promise in medicine, but their effectiveness in specialized fields such as implant dentistry remains unclear. This study focuses on five recently released LLMs, aiming to systematically evaluate their capabilities in clinical implantology scenarios and to characterize their respective strengths and weaknesses in order to guide precise application.
METHODS: A comprehensive multi-dimensional evaluation was conducted using a test set of 40 professional questions (across 8 themes) and 5 complex cases. To ensure response uniformity, all queries were submitted to the five LLMs (ChatGPT-o3-mini, DeepSeek-R1, Grok-3, Gemini-2.0-flash-Thinking, and Qwen2.5-max) using a pre-defined prompt. Parameters were standardized to ensure a fair comparison, and a single response was generated for each query without re-generation. Three experienced senior experts scored the responses on five dimensions in two rounds of double-blind evaluation. Inter-rater reliability was tested, followed by statistical analyses including Spearman's rho test, the Friedman test, mixed-effects models, and principal component analysis (PCA).
RESULTS: High inter-rater reliability was confirmed among the three experts (ICC for average measures ranged from 0.685 to 0.814, all P < 0.001). Gemini-2.0-flash-Thinking achieved the highest overall performance, with a mean score of 21.9 in professional question answering and 22.2 in case analysis. This was significantly higher than ChatGPT-o3-mini (mean score 19.2) in question responses and Qwen2.5-max (mean score 16.9) in case evaluations. Mixed-effects models showed that Gemini-2.0-flash-Thinking outperformed ChatGPT-o3-mini, while Qwen2.5-max exhibited a decline in performance. DeepSeek-R1 and Qwen2.5-max also showed positive interaction effects in specific themes (such as Theme 3). The PCA results further indicate that Gemini-2.0-flash-Thinking demonstrated the best comprehensive ability in both types of tasks, and they reveal differences in performance among the LLMs.
CONCLUSION: This study reveals differentiated capabilities among LLMs in dental implantology and recommends context-specific model selection for different clinical scenarios. Gemini-2.0-flash-Thinking demonstrated the best performance, notably for high-level clinical support.
TRIAL REGISTRATION: The study protocol and the use of clinical case data were approved by the Medical Ethics Committee of Zhejiang Provincial People's Hospital (Approval No. QT2025050) on March 4, 2025. Clinical trial number: not applicable.
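As a rough illustration of the Friedman test named in METHODS, the sketch below computes the Friedman chi-square statistic for a table in which each row is one question and each column is one model. The abstract does not say how the authors arranged their data or which software they used, so the layout, the function names (`average_ranks`, `friedman_statistic`), and the `demo_scores` values are assumptions made for illustration only; they are not data from the study. The sketch averages tied ranks but omits the tie-correction factor that statistical packages usually apply.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic without tie correction.

    scores[r][c] is the score of model c on question r (hypothetical layout).
    """
    n = len(scores)      # number of questions (blocks)
    k = len(scores[0])   # number of models (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


# Made-up scores for three models on four questions (not study data).
demo_scores = [
    [19.0, 21.0, 17.0],
    [20.0, 22.0, 16.0],
    [18.0, 23.0, 18.0],
    [19.5, 21.5, 17.5],
]
demo_q = friedman_statistic(demo_scores)
```

Under the null hypothesis that the models perform alike, the statistic is approximately chi-square distributed with k - 1 degrees of freedom when the number of questions is reasonably large; a real analysis would use an established statistics package to obtain the p-value and handle ties.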