TY - JOUR
T1 - Evaluation of a context-aware chatbot using retrieval-augmented generation for answering clinical questions on medication-related osteonecrosis of the jaw
AU - Steybe, David
AU - Poxleitner, Philipp
AU - Aljohani, Suad
AU - Herlofson, Bente Brokstad
AU - Nicolatou-Galitis, Ourania
AU - Patel, Vinod
AU - Fedele, Stefano
AU - Kwon, Tae Geon
AU - Fusco, Vittorio
AU - Pichardo, Sarina E.C.
AU - Obermeier, Katharina Theresa
AU - Otto, Sven
AU - Rau, Alexander
AU - Russe, Maximilian Frederik
N1 - Publisher Copyright: © 2024 The Authors
PY - 2025
Y1 - 2025
N2 - The potential of large language models (LLMs) in medical applications is significant, and retrieval-augmented generation (RAG) can address the weaknesses of these models in terms of data transparency and scientific accuracy by incorporating current scientific knowledge into responses. In this study, RAG and GPT-4 by OpenAI were applied to develop GuideGPT, a context-aware chatbot integrated with a knowledge database from 449 scientific publications designed to provide answers on the prevention, diagnosis, and treatment of medication-related osteonecrosis of the jaw (MRONJ). A comparison was made with a generic LLM (“PureGPT”) across 30 MRONJ-related questions. Ten international experts in MRONJ evaluated the responses based on content, language, scientific explanation, and agreement using 5-point Likert scales. Statistical analysis using the Mann–Whitney U test showed significantly better ratings for GuideGPT than PureGPT regarding content (p = 0.006), scientific explanation (p = 0.032), and agreement (p = 0.008), though not for language (p = 0.407). Thus, this study demonstrates RAG to be a promising tool to improve response quality and reliability of LLMs by incorporating domain-specific knowledge. This approach addresses the limitations of generic chatbots and can provide traceable and up-to-date responses essential for clinical practice.
AB - The potential of large language models (LLMs) in medical applications is significant, and retrieval-augmented generation (RAG) can address the weaknesses of these models in terms of data transparency and scientific accuracy by incorporating current scientific knowledge into responses. In this study, RAG and GPT-4 by OpenAI were applied to develop GuideGPT, a context-aware chatbot integrated with a knowledge database from 449 scientific publications designed to provide answers on the prevention, diagnosis, and treatment of medication-related osteonecrosis of the jaw (MRONJ). A comparison was made with a generic LLM (“PureGPT”) across 30 MRONJ-related questions. Ten international experts in MRONJ evaluated the responses based on content, language, scientific explanation, and agreement using 5-point Likert scales. Statistical analysis using the Mann–Whitney U test showed significantly better ratings for GuideGPT than PureGPT regarding content (p = 0.006), scientific explanation (p = 0.032), and agreement (p = 0.008), though not for language (p = 0.407). Thus, this study demonstrates RAG to be a promising tool to improve response quality and reliability of LLMs by incorporating domain-specific knowledge. This approach addresses the limitations of generic chatbots and can provide traceable and up-to-date responses essential for clinical practice.
KW - Clinical practice guidelines
KW - GPT-4
KW - Generative pre-trained transformer
KW - Large language models
KW - Medication-related osteonecrosis of the jaw
KW - Retrieval-augmented generation
UR - http://www.scopus.com/inward/record.url?scp=85214570668&partnerID=8YFLogxK
U2 - 10.1016/j.jcms.2024.12.009
DO - 10.1016/j.jcms.2024.12.009
M3 - Article
AN - SCOPUS:85214570668
SN - 1010-5182
VL - 53
SP - 355
EP - 360
JO - Journal of Cranio-Maxillofacial Surgery
JF - Journal of Cranio-Maxillofacial Surgery
IS - 4
ER -