Research · Jun 04, 2026 · Carlos Bentes · pinch-research

Speech Translation Benchmark

Comparing real-time speech translation systems across translation quality, intelligibility, naturalness, latency, and speaker similarity.

A five-metric benchmark of real-time speech translation systems, including DeepL, Soniox, GPT-RT, Hibiki, Palabra, and Pinch’s Relay-1.

Intro

Real-time speech translation is hard, which makes good evaluation essential when choosing an API to build into your service. At Pinch we treat benchmarks as directional: they show us what’s working and where we still need to improve.

A real-time system has to balance latency against correctness, since improving one usually costs the other in a live conversation, whether in a meeting app or in person. Just as important is how natural the output sounds: the translation should blend into the conversation, not add friction. So evaluating a speech-to-speech system means looking from several angles at once: speed, correctness, naturalness, and how well it preserves each speaker’s voice.

In this blog post, we explore the following metrics:

  • Translation quality: COMET score comparing the system’s translated text against a human reference translation (XCOMET-XL).
  • Intelligibility: word error rate of an ASR transcript of the translated speech, measured against the system’s own translated text (ASR-WER).
  • Naturalness: perceived quality of the output speech, scored by UTMOSv2.
  • Latency: median time from input to first audio chunk out (TTFA p50).
  • Speaker similarity: cosine similarity between source and output speaker embeddings, computed with WavLM.

Translation quality

Translation quality measures whether the system actually conveyed what the speaker said. Intelligibility (next section) tells us only that the output is understandable, not that it is correct: a system can synthesize crystal-clear audio of a wrong translation, and the listener walks away confidently misinformed.

We measure translation quality with XCOMET-XL: we take the system’s translated text output and score it against a reference translation, using a learned metric trained to correlate with human judgments of translation adequacy. Unlike metrics like BLEU that compare surface word overlap, COMET-family scores reason about meaning, so they handle paraphrases and word-order differences without unfairly penalizing them.

Intelligibility

Intelligibility measures whether the translated speech can be understood as words at all. If a listener can’t make out the words, the quality of the underlying translation is irrelevant.

We measure intelligibility with ASR-WER: we transcribe the system’s output audio with an ASR model (whisper-large-v3) and compute the word error rate against the system’s own translated text, i.e. the input to its TTS. A low WER means the synthesized speech is clear enough that a downstream listener (human or machine) recovers the intended words. This metric isolates a class of TTS failures that quality scores like COMET can’t see, since COMET never hears the audio: dropped words, repeated syllables, or artifacts that mangle an otherwise-correct translation.
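To make the metric concrete, here is a minimal sketch of the word error rate computation itself, assuming the ASR transcript is already available as a string. It is not our benchmark code: the function name `word_error_rate` and the normalization (lowercasing and whitespace splitting) are assumptions for illustration, and a real pipeline would typically also strip punctuation and normalize numbers before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    reference:  the system's translated text (the input to its TTS).
    hypothesis: the ASR transcript of the synthesized audio.
    Normalization here is an assumption: lowercase + whitespace split only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the audio
            insertion = curr[j - 1] + 1  # extra word heard in the audio
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the TTS dropped one word out of six.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(round(wer, 3))  # 0.167
```

Because insertions count as errors, WER can exceed 1.0 when the audio contains many extra words, for example repeated syllables that the ASR model transcribes as words.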

Naturalness

Naturalness measures how pleasant and human-like the synthesized speech sounds. A system can be intelligible and accurate but still have robotic prosody, unnatural pauses, or a monotone delivery, all of which add cognitive load and wear on the listener over a long conversation.

We measure naturalness with UTMOSv2, a learned model trained to predict the mean opinion score (MOS) a panel of human listeners would give. UTMOSv2 isn’t a perfect substitute for human evaluation (no automatic metric is), but it correlates well with listener judgments and scales to thousands of comparisons without a rater pool.

Latency

Latency measures how long the system takes to start producing output after the speaker begins. In real-time translation this shapes whether the conversation works at all: if the translated audio arrives too late, speakers talk over each other, lose their turn, or stop trusting the system.

We measure latency with TTFA p50: the median wall-clock time from input audio reaching the system to the first chunk of translated audio coming back. We use the median rather than the mean to stay robust to outliers.
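The choice of median over mean is easy to see with a small sketch. The per-clip values below are made up for illustration and are not benchmark measurements, and the helper names `ttfa_ms` and `ttfa_p50` are hypothetical.

```python
from statistics import mean, median


def ttfa_ms(input_start_s: float, first_audio_s: float) -> float:
    """Time to first audio for one clip, in milliseconds.

    Both arguments are wall-clock timestamps in seconds, for example taken
    with time.monotonic() when the input audio starts streaming and when the
    first translated audio chunk arrives.
    """
    return (first_audio_s - input_start_s) * 1000.0


def ttfa_p50(samples_ms: list[float]) -> float:
    """Median (50th percentile) of per-clip TTFA values."""
    if not samples_ms:
        raise ValueError("need at least one TTFA sample")
    return median(samples_ms)


# Made-up per-clip values in ms; the last clip hit a network stall.
samples = [2000, 2100, 2200, 2300, 9000]
print(ttfa_p50(samples))  # 2200
print(mean(samples))      # pulled up to 3520 by the single outlier
```

One slow clip moves the mean by more than a second while the median stays at the typical value, which is the behavior we want from a headline latency number.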

Speaker similarity

Speaker similarity measures how closely the translated speech preserves the voice characteristics of the original speaker. In a multilingual meeting, listeners track who is talking by voice. If every translated speaker sounds the same, that signal disappears, and the conversation gets harder to follow even when the words are correct.

We measure speaker similarity with WavLM: we extract speaker embeddings from the source and translated audio using WavLM fine-tuned for speaker verification (microsoft/wavlm-base-plus-sv), then compute the cosine similarity between them. A higher score means the output voice sits closer to the source in the embedding space. This metric is most meaningful for systems that attempt voice preservation (voice cloning).
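The similarity step itself is simple once the embeddings exist. The sketch below assumes the two speaker embeddings have already been extracted (in our setup, by the WavLM speaker-verification model) and are available as plain lists of floats; the function name `cosine_similarity` and the toy vectors are hypothetical and far smaller than real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings", for illustration only.
source_voice = [0.9, 0.1, 0.3]
cloned_voice = [0.8, 0.2, 0.3]    # points in nearly the same direction
generic_voice = [0.1, 0.9, -0.2]  # points somewhere else

print(cosine_similarity(source_voice, cloned_voice))
print(cosine_similarity(source_voice, generic_voice))
```

Cosine similarity ignores vector length and compares direction only, so the cloned voice scores close to 1 while the generic voice scores much lower.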

Benchmark Results

We ran the benchmark across a set of real-time speech translation systems available today, evaluating each on identical audio clips and reference translations with the five metrics defined above. The point isn’t a single ranking: it’s to show where each system sits across the five axes and where the real tradeoffs are.

The dataset is kyutai/Audio-NTREX-4L, a long-form multilingual speech translation set covering French, Spanish, Portuguese, and German into English, designed to evaluate speech translation models on multi-sentence utterances. The dataset is publicly available on Hugging Face.

In this benchmark we tested six real-time speech translation systems. We accessed every system through its API, except Hibiki, which we hosted locally using the open weights.

  • Relay-1, our own system, via the Pinch API (reference)
  • Soniox, via the Soniox API (reference)
  • DeepL, via the DeepL Voice API (reference)
  • GPT-RT, OpenAI’s gpt-realtime-translate (reference)
  • Hibiki, Kyutai’s open model (reference)
  • Palabra, via the Palabra API (reference)

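The table below summarizes the five metrics for each system; arrows indicate whether higher or lower is better. NA marks the speech-output metrics, which do not apply to the text-output systems (Soniox and DeepL).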

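| Metric | Soniox | DeepL | Relay-1 | GPT-RT | Hibiki | Palabra |
|---|---|---|---|---|---|---|
| Translation quality ↑ | 0.810 | 0.840 | 0.766 | 0.733 | 0.653 | 0.683 |
| Intelligibility WER ↓ | NA | NA | 0.055 | 0.114 | 0.073 | 0.027 |
| Naturalness ↑ | NA | NA | 3.27 | 2.89 | 2.06 | 2.49 |
| Latency (ms) ↓ | NA | NA | 3201 | 2289 | 2434 | 7297 |
| Speaker similarity ↑ | NA | NA | 0.463 | 0.778 | 0.964 | 0.455 |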

Metrics breakdown

Text output systems

| Metric | Soniox | DeepL |
|---|---|---|
| Translation quality (XCOMET-XL) ↑ | 0.810 | 0.840 |
| Translation quality (COMET-DA) ↑ | 0.832 | 0.847 |
| Translation quality (BLEU, norm) ↑ | 31.37 | 36.22 |
| Latency (LongYAAL CU, ms) ↓ | 1965 | 2179 |

Text and Speech output systems

| Metric | Relay-1 | GPT-RT | Hibiki | Palabra |
|---|---|---|---|---|
| Translation quality (XCOMET-XL) ↑ | 0.766 | 0.733 | 0.653 | 0.683 |
| Translation quality (COMET-DA) ↑ | 0.811 | 0.803 | 0.757 | 0.761 |
| Translation quality (BLEU, norm) ↑ | 30.08 | 32.55 | 30.06 | 29.20 |
| Latency (LongYAAL CU, ms) ↓ | 2877 | 3136 | 3163 | 3823 |
| Latency (TTFA p50, ms) ↓ | 3201 | 2289 | 2434 | 7297 |
| Intelligibility (ASR-WER) ↓ | 0.055 | 0.114 | 0.073 | 0.027 |
| Naturalness (UTMOS) ↑ | 3.27 | 2.89 | 2.06 | 2.49 |
| Speaker similarity (WavLM) ↑ | 0.463 | 0.778 | 0.964 | 0.455 |

References
