Improving the efficiency of Whisper-based audio stream processing with CTranslate2 and FFMpeg tools

Vladyslav Radin; Myroslav Riabyi

https://doi.org/10.31649/vitce/1.2026.110

From Vol. 23, No. 1, 2026

Received 12.11.2025, Revised 13.02.2026, Accepted 26.03.2026, Published 20.04.2026

The study is motivated by the need to improve the performance and scalability of automatic speech recognition (ASR) systems on resource-constrained devices. Its aim is to optimize Whisper by integrating CTranslate2 to accelerate computation and FFmpeg for uniform audio preprocessing. Experiments were run with the Whisper Turbo model on a graphics processing unit (GPU) supporting the Compute Unified Device Architecture (CUDA) platform. Three configurations were compared: a baseline Python pipeline, an optimized inference engine based on CTranslate2, and a configuration with hybrid int8_float16 quantization. Efficiency was assessed by inference time, video memory (VRAM) usage, and recognition accuracy measured as word error rate (WER). The baseline Whisper Turbo configuration delivered maximum recognition accuracy (WER = 0) but had high inference latency (8.5 s per audio file) and substantial VRAM consumption (4.9 GB). Integrating CTranslate2 cut processing time to 4.9 s (a 1.7× speedup) and reduced VRAM usage to 1.8 GB (-63%) with no loss of quality. Adding hybrid int8_float16 quantization lowered inference time further to 3.8 s and memory consumption to 1 GB, which corresponds to an overall speedup of about 2.2× and an almost fivefold (4.9×) reduction in VRAM requirements relative to the standard implementation, with WER unchanged at 0. These results confirm that combining CTranslate2 with hybrid quantization is effective for building high-performance real-time ASR systems without compromising accuracy.
The findings confirm the practical suitability of the proposed configuration for multi-user services and edge scenarios, with no trade-off between speed and accuracy. The results can be used by developers of automatic speech recognition systems to optimize models on GPUs with limited memory, and by companies that provide streaming audio and multi-user services.
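
The relative gains quoted in the abstract can be rechecked from the raw figures it reports. The Python sketch below is only an illustration, not the authors' evaluation code: the names `REPORTED`, `relative_gains`, and `word_error_rate` are hypothetical, and the WER function assumes the standard definition (word-level edit distance divided by the number of reference words) with plain whitespace tokenization, which may differ from the text normalization used in the study.

```python
# Figures reported in the abstract: (inference time in s, VRAM in GB).
# The dictionary keys are illustrative labels, not names from the paper.
REPORTED = {
    "baseline": (8.5, 4.9),
    "ctranslate2": (4.9, 1.8),
    "ctranslate2_int8_float16": (3.8, 1.0),
}


def relative_gains(baseline, optimized):
    """Return (speedup factor, VRAM reduction factor, VRAM saving in percent)."""
    base_time, base_mem = baseline
    opt_time, opt_mem = optimized
    speedup = base_time / opt_time
    mem_factor = base_mem / opt_mem
    mem_saving_pct = (1 - opt_mem / base_mem) * 100
    return speedup, mem_factor, mem_saving_pct


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of iteration i, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    for name in ("ctranslate2", "ctranslate2_int8_float16"):
        speedup, mem_factor, saving = relative_gains(REPORTED["baseline"], REPORTED[name])
        print(f"{name}: {speedup:.1f}x faster, "
              f"{mem_factor:.1f}x less VRAM ({saving:.0f} % saved)")
```

Under these assumptions, the computed ratios round to the 1.7× and 2.2× speedups, the 63% VRAM saving, and the 4.9× memory reduction stated above. A WER of 0 means the hypothesis matches the reference word for word.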

quantization; automatic speech recognition; operator fusion; video memory; resource efficiency

Pages 110-124

Radin, V., & Riabyi, M. (2026). Improving the efficiency of Whisper-based audio stream processing with CTranslate2 and FFMpeg tools. Information Technologies and Computer Engineering, 23(1), 110-124. https://doi.org/10.31649/vitce/1.2026.110
