MMSL X:X | DOI: 10.31482/mmsl.2025.008
EVALUATION FRAMEWORK FOR AI-BASED MACHINE TRANSLATION AND PROOFREADING TOOLS IN MEDICAL AND PHARMACEUTICAL WRITING
Original article
Language Centre, University of Defence, Brno, Czech Republic
Recent advances in artificial intelligence (AI) have introduced powerful tools for machine translation (MT) and automated language proofreading (ALP) into academic publishing. However, their evaluation in highly specialized fields such as medicine and pharmacy remains methodologically underexplored. This study presents a multidimensional evaluation framework that integrates human expert judgment with AI-based replication to assess the quality of AI-generated translations and proofreading outputs. The framework covers three quality dimensions applicable to both MT and ALP: semantic fidelity, terminological accuracy and consistency, and grammatical correctness and fluency, as well as a fourth dimension specific to ALP, the appropriateness of edits.

In the pilot phase of the research, five machine translation systems were compared using the multidimensional evaluation framework, with DeepL serving as a strong baseline; under advanced idiomatic prompting, the large language model (LLM) systems achieved performance comparable to this baseline. The semantic fidelity dimension was further evaluated through an AI simulation of human judgment using ChatGPT-5. Agreement between human and AI evaluators reached κ = 0.81 (95% CI = 0.73–0.88), indicating high consistency, and no hallucinations were observed within this pilot sample.

These preliminary results demonstrate that large language models have promising potential to reproduce human expert quality reasoning when guided by structured prompts. Beyond its technical contribution, our ongoing research represents a step toward transparent and reproducible evaluation of AI-generated academic writing and its automated quality assessment.
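The human–AI agreement statistic reported above can be computed with Cohen's kappa and a percentile-bootstrap confidence interval. The sketch below is illustrative only: the rating labels and the bootstrap procedure are assumptions for demonstration, not the study's actual data or implementation.

```python
import random

def cohen_kappa(a, b):
    """Cohen's kappa for two raters scoring the same items with nominal labels."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    labels = sorted(set(a) | set(b))
    # Observed agreement: fraction of items where the raters assign the same label.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if pe == 1.0:          # both raters use a single identical label
        return 1.0
    return (po - pe) / (1 - pe)

def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling item pairs with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical ratings of six segments by a human and an AI evaluator.
human = ["good", "good", "fair", "poor", "good", "fair"]
ai    = ["good", "good", "fair", "good", "good", "fair"]
k = cohen_kappa(human, ai)
lo, hi = bootstrap_ci(human, ai, n_boot=500)
```

With a pilot-sized sample such as this, the bootstrap interval is wide; the study's reported interval (0.73–0.88) reflects a larger item set.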
Keywords: AI-assisted writing; machine translation; automated proofreading; evaluation framework; semantic fidelity; terminological accuracy; COMET QE; QuickUMLS; biomedical language; scientific communication; domain-specific NLP; language quality assessment
Received: July 9, 2025; Revised: December 9, 2025; Accepted: December 9, 2025; Prepublished online: December 15, 2025
References
- Allen D, Mizumoto A. Automated grammar correction for academic writing: A critical review of tools and practices. J. Second Lang. Writ. 2024;63:101-119.
- Barrot JS. Exploring the efficacy of automated writing feedback tools: A comparative study. Lang. Learn. Technol. 2022;26(1):45-62.
- Batool S, Guo J, Müller T. Cognitive load in scientific writing: Measuring reader effort in machine-generated text. ACM Trans. Inf. Syst. 2024;42(2): Article 15.
- Bui DD, Nguyen TN, Le TM. Evaluating neural machine translation of patient education materials. BMC Med. Inform. Decis. Mak. 2020;20(1):220.
- Çetin O, Duran A. Evaluating adaptive machine translation systems: A comparative perspective. Mach. Transl. 2023;37(4):321-340.
- Chowdhury S, Ghosh A, Sarkar D. Assessing AI-based proofreading tools in biomedical research writing. Comput. Biol. Med. 2022;142:105217.
- Cillo R, Cortese A, Mariani J. Terminological precision in AI-assisted scientific translation: An Italian-English corpus study. Terminology. 2024;30(1):25-48.
- Dahan L, Kalmanovich S, Goldberg Y. BLEU is not enough: Evaluating scientific translations beyond surface similarity. Trans. Assoc. Comput. Linguist. 2024;12:1-15.
- Farber SM. Cognitive readability as a metric in technical writing: From perception to prediction. J. Tech. Writ. Commun. 2024;54(2):154-176.
- Flores Marroquín CA. TER versus COMET: How automatic metrics fare in pharmaceutical translations. Rev. Lingüíst. Apl. 2024;60(2):113-130.
- Fraser J, Heng MA. Measuring quality in medical translations: A term-based approach. Transl. Interp. 2014;6(1):23-38.
- Huang Z, Tang C, Xu J. Validity and reliability of COMET-QE for technical texts. Comput. Linguist. 2025;51(1):87-109.
- IDC. AI in academic language services: Trends and gaps. IDC Res. Rep. 2024; Q2.
- Intento. State of machine translation: Industry benchmarks and trends. Intento Benchmark Rep. 2023.
- Jaschke AC, Wang L, Niemeyer K. Machine-assisted writing in life sciences: Ethical and practical considerations. BioScience. 2024;74(1):33-40.
- Kamdar MR, Musen MA, Shah NH. Clustering biomedical terminologies using UMLS: A graph-based approach. J. Biomed. Inform. 2017;68:31-46.
- Knowles R, Lo CK. Multidimensional quality metrics in low-resource scientific domains. Proc. EACL. 2024:211-221.
- Licht R, Becker S, Kohl M. Evaluating annotation consistency in translation quality assessment. Lang. Resour. Eval. 2022;56(2):789-804.
- Liu M, Liu J, Yu H. Topic modeling for clustering medical specialties in large-scale literature. J. Am. Med. Inform. Assoc. 2012;19(2):220-226.
- López Caro A. ModernMT: A neural adaptive system for domain-specific translation. In: Proc. Mach. Transl. Summit. 2023:41-50.
- Muhanna M. Human-AI collaboration in academic writing: Evidence from pharmacy journals. J. Scholarly Publ. 2025;56(3):222-241.
- Shah A, Cunningham J, Murchison T. Scientific sentence segmentation and its impact on translation accuracy. Lang. Resour. Eval. 2016;50(4):933-952.
- Soares F, Ribeiro R, Wanner L. Domain clustering using word embeddings: Application to biomedical literature. Nat. Lang. Eng. 2019;25(3):325-343.
- Translated IDC. Adaptive MT and professional workflows: Report on use and perception. IDC Tech. Brief Ser. 2022.
- Wong KY, Symons T. Terminology control in AI-powered academic editing tools. Terminol. Knowl. Eng. 2025;29(1):87-103.
- Ben Saad H, Dergaa I, Ghouili H, et al. The assisted technology dilemma: a reflection on AI chatbots use and risks while reshaping the peer review process in scientific research. AI & Society. 2025;1:1-8.