Abstract
Objective: This study aimed to compare the accuracy and adequacy of responses provided by three different artificial intelligence-based large language models (LLMs) to fundamental questions related to urological emergencies.
Material and Methods: Nine distinct urological emergency topics were identified, and seven fundamental questions were formulated for each topic (63 in total), comprising two related to diagnosis, three related to disease management, and two related to complications. The questions were posed in English on three free AI platforms built on different infrastructures (ChatGPT-4, Google Gemini 2.0 Flash, and Meta Llama 3.2), and the responses were documented. The answers were scored by the authors on a scale of 1 to 4 based on accuracy and adequacy, and the results were compared using statistical analysis.
Results: When all question-answer pairs were evaluated overall, ChatGPT exhibited slightly higher accuracy scores than Gemini and Meta Llama; however, no statistically significant differences were detected among the groups (3.8 ± 0.5, 3.7 ± 0.6, and 3.7 ± 0.5, respectively; p=0.146). When the questions related to diagnosis, treatment management, and complications were evaluated separately, no statistically significant differences were detected among the three LLMs (p=0.338, p=0.289, and p=0.407, respectively). Only one response, provided by Gemini, was found to be completely incorrect (1.6%). No misleading or incorrect answers were observed in the diagnosis-related questions on any of the three platforms. In total, misleading answers were observed in two questions (3.2%) for ChatGPT, three questions (4.7%) for Gemini, and two questions (3.2%) for Meta Llama.
Conclusion: LLMs predominantly provide accurate responses to basic and straightforward questions related to urological emergencies, where prompt treatment is critical. Although no significant differences were observed among the responses of the three LLMs compared in this study, the presence of misleading and incorrect answers should be carefully considered, given the evolving nature and limitations of this technology.