Large language models in answering mammography screening questions in Italian and English: Comment

Hinpetch Daungsupawong1, Viroj Wiwanitkit2

1Private Academic Consultant, Phonhong, Lao People’s Democratic Republic; 2Department of Research Analytics, Saveetha Dental College and Hospitals, Saveetha Institute of Medical and Technical Sciences, Saveetha University, Chennai, India.

Received on March 13, 2025. Accepted April 16, 2025.

Dear Editor, the publication of “Evaluating the accuracy of large language models in answering mammography screening questions in Italian and English: a study based on the Eusobi guidelines”1 is intriguing. Interest in artificial intelligence (AI) and in the use of large language models (LLMs) to provide medical information has grown, particularly in the context of breast cancer screening, with systems such as ChatGPT, Gemini, and Copilot.

The study evaluated responses to mammography screening questions posed in Italian and English against the Eusobi guidelines, which is an effective way to assess the capacity of LLMs to give accurate and complete answers. However, the statistical methods used have limitations and may not provide a complete and reliable assessment.

Using a Likert scale to score responses may be an effective approach, but it does not fully reflect the complexity of the responses, such as the granularity or depth of the material, so the responses may be incompletely assessed. Furthermore, comparing averages may not give a complete picture of the accuracy of the responses to each item. For example, highly granular responses may receive the same score as responses with insufficient information, resulting in an assessment that does not accurately reflect the depth and relevance of the responses.
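The point about averages can be illustrated with a minimal sketch. The ratings below are invented for illustration only and are not data from the study; the names `model_a`, `model_b`, and `summarize` are hypothetical. Assuming a 5-point Likert scale, two sets of ratings with the same mean can have very different item-level distributions:

```python
from collections import Counter
from statistics import mean, median

# Hypothetical 5-point Likert ratings for two models on the same 10 questions.
# These values are invented for illustration; they are NOT data from the study.
model_a = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
model_b = [5, 5, 5, 5, 5, 1, 1, 1, 1, 1]


def summarize(scores):
    """Return the mean together with item-level summaries of the ratings."""
    counts = Counter(scores)
    return {
        "mean": mean(scores),
        "median": median(scores),
        # Number of responses at each Likert level from 1 to 5.
        "distribution": {level: counts.get(level, 0) for level in range(1, 6)},
        # Share of responses rated 2 or lower (treated here as inadequate).
        "share_low": sum(1 for s in scores if s <= 2) / len(scores),
    }


summary_a = summarize(model_a)
summary_b = summarize(model_b)

for name, summary in (("model_a", summary_a), ("model_b", summary_b)):
    print(name, summary)
```

In this sketch the two means are identical, yet half of the second model's responses fall in the lowest categories. Reporting the distribution or the share of low-rated items alongside the mean may therefore be more informative than the mean alone.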

Future discussions could consider statistical methods that better capture differences in response quality, such as content analysis or more sophisticated models for analyzing the data obtained from LLMs. Further studies should also examine the use of authoritative sources in different languages, especially in Italian, where sources are not specific to radiology. Such studies will help us better understand the limitations of LLMs.

In the future, language models should be developed to process and comprehend medical data at a deeper level, drawing on specialized and constantly updated medical data sources. Testing LLMs with more complex questions and assessing expert responses will allow us to improve these systems’ ability to offer more accurate and reliable medical answers.

AI declaration: the authors used a language-editing computational tool in the preparation of the article.

Authors’ contribution: HP 50% ideas, writing, analyzing, approval; VW 50% ideas, supervision, approval.

Conflict of interests: the authors have no conflict of interests to declare.

References

1. Signorini M, Fontani S, Minichetti P, Teggi S, Barusco A, Favat M. [Evaluating the accuracy of large language models in answering mammography screening questions in Italian and English: a study based on the Eusobi guidelines]. Recenti Prog Med 2025; 116: 162-7.