In short
Modern speech recognition systems promise to take context into account, but their capabilities are limited. We tested the Qwen and Whisper models on pre-revolutionary Russian texts to evaluate transcription quality on long recordings and in the presence of noise.
Recording your thoughts by voice or transcribing conversations is convenient, but not always reliable. New-generation automatic speech recognition (ASR) systems such as Qwen and Whisper can take context into account and produce meaningful text. However, they have architectural limitations.
To understand whether these models are ready for real-world scenarios, we ran a benchmark on Hugging Face. We focused primarily on pre-revolutionary Russian, a language that is rare and difficult to recognize.
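Transcription quality in benchmarks of this kind is commonly reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model output into the reference text, divided by the number of words in the reference. The summary does not say which metric the benchmark used, so the sketch below is only an illustration of the general idea; the function name word_error_rate and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: the metric used in the benchmark is not stated
    in the source, and real evaluations usually normalize punctuation
    and spelling before comparing.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (word missing from output)
                curr[j - 1] + 1,     # insertion (extra word in output)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical reference transcript and model output.
    reference = "the old book was printed before the reform"
    hypothesis = "the old book was printed before reform"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A lower value means a closer match to the reference, and values above 1.0 are possible when the output contains many extra words.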
Testing showed that even state-of-the-art ASR systems are not perfect. Further refinements are needed to improve recognition quality under specific conditions (long recordings, noise, rare languages).
Source: Habr