Speech Translation Benchmark
Comparing real-time speech translation systems across translation quality, intelligibility, naturalness, latency, and speaker similarity.
A five-metric benchmark of real-time speech translation systems, including DeepL, Soniox, GPT-RT, Hibiki, Palabra, and Pinch’s Relay-1.
Intro
Real-time speech translation is hard, which makes good evaluation essential when choosing an API to build into your service. At Pinch we treat benchmarks as directional: they show us what’s working and where we still need to improve.
A real-time system has to balance latency against correctness, since improving one usually costs the other in a live conversation, whether in a meeting app or in person. Just as important is how natural the output sounds: the translation should blend into the conversation, not add friction. So evaluating a speech-to-speech system means looking at several angles at once, balancing speed, correctness, and naturalness against how well it preserves each speaker’s voice.
In this blog post, we explore the following metrics:
- Translation quality: COMET score comparing the system’s translated text against a human reference translation (XCOMET-XL).
- Intelligibility: word error rate on the translated speech output against the system’s text (ASR-WER).
- Naturalness: perceived quality of the output speech, scored by UTMOSv2.
- Latency: median time from input to first audio chunk out (TTFA p50).
- Speaker similarity: cosine similarity between source and output speaker embeddings, computed with WavLM.
Translation quality
Translation quality measures whether the system actually conveyed what the speaker said. Intelligibility (next section) only tells us the output is understandable, not that it is correct: a system can synthesize crystal-clear audio of a wrong translation, and the listener walks away confidently misinformed.
We measure translation quality with XCOMET-XL: we take the system’s translated text output and score it against a reference translation, using a learned metric trained to correlate with human judgments of translation adequacy. Unlike metrics such as BLEU, which compare surface word overlap, COMET-family metrics are trained to score meaning, so they tolerate paraphrases and word-order differences instead of unfairly penalizing them.
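To make the surface-overlap problem concrete, here is a minimal sketch of a clipped bigram precision, the kind of n-gram matching that BLEU builds on. It is not BLEU itself and not part of the benchmark pipeline; the function name `ngram_precision` and the three example sentences are illustrative assumptions.

```python
from collections import Counter


def ngram_precision(hypothesis: str, reference: str, n: int = 2) -> float:
    """Clipped n-gram precision: share of hypothesis n-grams also found in the reference."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Clip each match count so a repeated n-gram cannot be credited more often
    # than it appears in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total


# Hypothetical sentences, chosen only to illustrate the effect.
reference = "the meeting was postponed until next week"
paraphrase = "they pushed the meeting back to next week"  # same meaning, different wording
wrong = "the meeting was postponed until next year"       # one word off, wrong meaning

paraphrase_score = ngram_precision(paraphrase, reference)
wrong_score = ngram_precision(wrong, reference)
print(f"paraphrase: {paraphrase_score:.3f}, wrong: {wrong_score:.3f}")
```

In this toy case the mistranslation shares more bigrams with the reference than the faithful paraphrase does, so pure overlap ranks them in the wrong order. That is the failure mode a learned metric is meant to avoid.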
Intelligibility
Intelligibility measures whether the translated speech can be understood as words at all. If a listener can’t make out the words, the quality of the underlying translation is irrelevant.
We measure intelligibility with ASR-WER: we transcribe the system’s output audio with an ASR model (whisper-large-v3) and compute the word error rate against the system’s own translated text, i.e. the input to its TTS. A low WER means the synthesized speech is clear enough that a downstream listener (human or machine) recovers the intended words. This metric isolates a class of TTS failures that quality scores like COMET can’t see, since COMET never hears the audio: dropped words, repeated syllables, or artifacts that mangle an otherwise-correct translation.
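The WER computation itself can be sketched in a few lines, assuming the ASR transcript is already available as a string. This version uses a word-level Levenshtein distance with only lowercasing and whitespace tokenization; the normalization used in the actual benchmark is not specified here, and the example strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from the transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the TTS dropped "the" and repeated "by".
tts_input_text = "please send the report by friday"
asr_transcript = "please send report by by friday"
wer = word_error_rate(tts_input_text, asr_transcript)
print(f"WER: {wer:.3f}")
```

Because the reference is the system’s own translated text rather than a human translation, this score reflects synthesis errors only, not translation errors.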
Naturalness
Naturalness measures how pleasant and human-like the synthesized speech sounds. A system can be intelligible and accurate but still have robotic prosody, unnatural pauses, or a monotone delivery, all of which add cognitive load and wear on the listener over a long conversation.
We measure naturalness with UTMOSv2, a learned model trained to predict the mean opinion score (MOS) a panel of human listeners would give. UTMOSv2 isn’t a perfect substitute for human evaluation (no automatic metric is), but it correlates well with listener judgments and scales to thousands of comparisons without a rater pool.
Latency
Latency measures how long the system takes to start producing output after the speaker begins. In real-time translation this shapes whether the conversation works at all: if the translated audio arrives too late, speakers talk over each other, lose their turn, or stop trusting the system.
We measure latency with TTFA p50: the median wall-clock time from input audio reaching the system to the first chunk of translated audio coming back. We use the median rather than the mean to stay robust to outliers.
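As a small illustration of why the median is the safer summary, the sketch below computes TTFA from per-utterance timestamps. The timestamps are hypothetical, given in integer milliseconds, and include one stalled request.

```python
from statistics import mean, median

# Hypothetical (input_sent_ms, first_audio_chunk_ms) pairs, one per utterance.
timings_ms = [
    (0, 2100),
    (10_000, 12_400),
    (20_000, 22_250),
    (30_000, 32_300),
    (40_000, 49_500),  # one stalled request
]

# Time to first audio for each utterance.
ttfa_ms = [first_audio - sent for sent, first_audio in timings_ms]

ttfa_p50 = median(ttfa_ms)
ttfa_mean = mean(ttfa_ms)
print(f"TTFA p50: {ttfa_p50} ms, mean: {ttfa_mean} ms")
```

Here the single slow request pulls the mean to 3710 ms, while the median stays at 2300 ms, which is closer to what a typical turn feels like.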
Speaker similarity
Speaker similarity measures how closely the translated speech preserves the voice characteristics of the original speaker. In a multilingual meeting, listeners track who is talking by voice. If every translated speaker sounds the same, that signal disappears, and the conversation gets harder to follow even when the words are correct.
We measure speaker similarity with WavLM: we extract speaker embeddings from the source and translated audio using WavLM fine-tuned for speaker verification (microsoft/wavlm-base-plus-sv), then compute the cosine similarity between them. A higher score means the output voice sits closer to the source in the embedding space. This metric is most meaningful for systems that attempt voice preservation (voice cloning).
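The comparison step reduces to a cosine similarity between two vectors. The sketch below assumes the speaker embeddings have already been extracted; the short four-dimensional vectors are made-up stand-ins for real embeddings, which have many more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, for illustration only.
source_embedding = [0.2, 0.7, -0.1, 0.4]
preserved_voice = [0.25, 0.65, -0.05, 0.35]  # output that keeps the speaker's voice
generic_voice = [-0.5, 0.1, 0.6, 0.2]        # output in an unrelated stock voice

sim_preserved = cosine_similarity(source_embedding, preserved_voice)
sim_generic = cosine_similarity(source_embedding, generic_voice)
print(f"preserved: {sim_preserved:.3f}, generic: {sim_generic:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing embeddings.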
Benchmark Results
We ran the benchmark across a set of real-time speech translation systems available today, evaluating each on identical audio clips and reference translations with the five metrics defined above. The point isn’t a single ranking: it’s to show where each system sits across the five axes and where the real tradeoffs are.
The dataset is kyutai/Audio-NTREX-4L, a long-form multilingual speech translation set covering French, Spanish, Portuguese, and German into English, designed to evaluate speech translation models on multi-sentence utterances. The dataset is publicly available on Hugging Face.
In this benchmark we tested six real-time speech translation systems. All systems were accessed through their APIs, except Hibiki, which we hosted locally using the open weights.
- Relay-1, our own system, via the Pinch API (reference)
- Soniox, via the Soniox API (reference)
- DeepL, via the DeepL Voice API (reference)
- GPT-RT, OpenAI’s gpt-realtime-translate (reference)
- Hibiki, Kyutai’s open model (reference)
- Palabra, via the Palabra API (reference)
The table below reports all five metrics for each system; arrows indicate whether higher or lower is better. NA marks speech-output metrics that do not apply to Soniox and DeepL, which return text only.
| Metric | Soniox | DeepL | Relay-1 | GPT-RT | Hibiki | Palabra |
|---|---|---|---|---|---|---|
| Translation quality ↑ | 0.810 | 0.840 | 0.766 | 0.733 | 0.653 | 0.683 |
| Intelligibility WER ↓ | NA | NA | 0.055 | 0.114 | 0.073 | 0.027 |
| Naturalness ↑ | NA | NA | 3.27 | 2.89 | 2.06 | 2.49 |
| Latency, TTFA p50 (ms) ↓ | NA | NA | 3201 | 2289 | 2434 | 7297 |
| Speaker similarity ↑ | NA | NA | 0.463 | 0.778 | 0.964 | 0.455 |
Metrics breakdown
Text output systems
| Metric | Soniox | DeepL |
|---|---|---|
| Translation quality — XCOMET-XL ↑ | 0.810 | 0.840 |
| Translation quality — COMET-DA ↑ | 0.832 | 0.847 |
| Translation quality — BLEU, norm ↑ | 31.37 | 36.22 |
| Latency — LongYAAL CU (ms) ↓ | 1965 | 2179 |
Text and speech output systems
| Metric | Relay-1 | GPT-RT | Hibiki | Palabra |
|---|---|---|---|---|
| Translation quality — XCOMET-XL ↑ | 0.766 | 0.733 | 0.653 | 0.683 |
| Translation quality — COMET-DA ↑ | 0.811 | 0.803 | 0.757 | 0.761 |
| Translation quality — BLEU, norm ↑ | 30.08 | 32.55 | 30.06 | 29.20 |
| Latency — LongYAAL CU (ms) ↓ | 2877 | 3136 | 3163 | 3823 |
| Latency — TTFA p50 (ms) ↓ | 3201 | 2289 | 2434 | 7297 |
| Intelligibility — ASR-WER ↓ | 0.055 | 0.114 | 0.073 | 0.027 |
| Naturalness — UTMOSv2 ↑ | 3.27 | 2.89 | 2.06 | 2.49 |
| Speaker similarity — WavLM ↑ | 0.463 | 0.778 | 0.964 | 0.455 |
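One way to read the speech-output table without collapsing it into a single ranking is to find the leader on each axis separately. The sketch below copies the values from the table above and respects each metric’s direction; the dictionary layout and variable names are illustrative choices.

```python
# Values copied from the speech-output table above.
systems = ["Relay-1", "GPT-RT", "Hibiki", "Palabra"]
metrics = {
    "XCOMET-XL": ("higher", [0.766, 0.733, 0.653, 0.683]),
    "LongYAAL CU (ms)": ("lower", [2877, 3136, 3163, 3823]),
    "TTFA p50 (ms)": ("lower", [3201, 2289, 2434, 7297]),
    "ASR-WER": ("lower", [0.055, 0.114, 0.073, 0.027]),
    "UTMOSv2": ("higher", [3.27, 2.89, 2.06, 2.49]),
    "WavLM similarity": ("higher", [0.463, 0.778, 0.964, 0.455]),
}

leaders = {}
for name, (direction, values) in metrics.items():
    # Pick the best value according to whether higher or lower is better.
    best = max(values) if direction == "higher" else min(values)
    leaders[name] = systems[values.index(best)]

for name, system in leaders.items():
    print(f"{name}: {system}")
```

The leaders differ by axis: Relay-1 on XCOMET-XL, LongYAAL and naturalness, GPT-RT on TTFA, Palabra on ASR-WER, and Hibiki on speaker similarity, which is why no single number summarizes the table.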
References
- Radford et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356. Code and models: github.com/openai/whisper.
- Guerreiro et al. (2024). xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection. TACL. arXiv:2310.10482. Model: huggingface.co/Unbabel/XCOMET-XL.
- Baba, Nakata, Saito, Saruwatari (2024). The T05 System for the VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech. IEEE SLT 2024. arXiv:2409.09305. Code: github.com/sarulab-speech/UTMOSv2.
- Chen et al. (2021). WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. arXiv:2110.13900. Model: microsoft/wavlm-base-plus-sv.
- Labiausse et al. (2026). Simultaneous speech-to-speech translation without aligned data. arXiv preprint arXiv:2602.11072. Dataset: kyutai/Audio-NTREX-4L.