Automatic Transcription of Endoscopic Audio Medical Reports

Automatic documentation of endoscopic procedures is critical to improving report completeness, standardization, and downstream use of data for quality monitoring; however, current workflows still rely heavily on manual reporting. This work presents an end-to-end pipeline for automatic transcription and structuring of endoscopic audio, focusing on colonoscopy and Endoscopic Retrograde Cholangiopancreatography (ERCP) procedures. Ambient audio is captured in the examination room using dedicated microphones or mobile devices and processed by Voxtral, an open-source, multilingual medical speech recognition model optimized for noisy clinical environments. The transcribed text is then validated and passed to a modular large language model (LLM)-based pipeline, developed in a companion study, which extracts key clinical and quality-related categories. The finalized reports are exported in JSON and Excel formats, enabling integration with electronic health records, quality dashboards, and downstream analytics, while also facilitating expert review of the extracted indicators. A pilot dataset of 13 ERCP audio recordings (approximately 35 seconds each) was used to develop and test the workflow, including a comparison of audio acquisition strategies and hardware. The results of this proof of concept support the technical feasibility of combining medical-grade automatic speech recognition with modular LLM-based report generation and highlight a promising path toward reducing documentation burden, enhancing standardization, and enabling scalable, automated quality assessment in digestive endoscopy.

Mónica Martins
University of Minho
Portugal

Tiago Jesus
University of Minho
Portugal

Victor Alves
University of Minho
Portugal