How real-time translation works
Pinch real-time translation uses one WebSocket connection per live conversation. There are two modes:
- Speech to speech — you stream microphone audio in, Pinch streams translated speech audio and transcripts back.
- Speech to translated text — you stream microphone audio in, Pinch streams back transcripts only.
You pick the mode when you create the session.
The five steps
1) Create a session
GET /v1/session with your API key. Specify:
- source language — what the speaker will say
- target language — what you want back
- voice type — male or female (ignored in text-only mode)
- audio output — on for speech-to-speech, off for text-only
You get back a pre-signed WebSocket URL, valid for 60 seconds.
See the API Reference for full parameters.
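As a rough illustration of step 1, the sketch below builds (but does not send) the session request using only the Python standard library. The host in API_BASE, the query parameter names (source_language, target_language, voice_type, audio_output) and the Bearer authorization header are all assumptions made for this example; the API Reference is the authority on the real names.

```python
from urllib.parse import urlencode
from urllib.request import Request

# ASSUMPTION: placeholder host. The real REST host is not given in this guide.
API_BASE = "https://api.example.invalid"


def build_session_request(api_key, source_language, target_language,
                          voice_type="female", audio_output=True):
    """Build a GET /v1/session request object without sending it.

    ASSUMPTION: parameter names and the Bearer header are illustrative only.
    """
    if voice_type not in ("male", "female"):
        raise ValueError("voice_type must be 'male' or 'female'")

    params = {
        "source_language": source_language,
        "target_language": target_language,
        "audio_output": "true" if audio_output else "false",
    }
    # Voice type is ignored in text-only mode, so only send it with audio on.
    if audio_output:
        params["voice_type"] = voice_type

    url = f"{API_BASE}/v1/session?{urlencode(params)}"
    return Request(url, method="GET",
                   headers={"Authorization": f"Bearer {api_key}"})


if __name__ == "__main__":
    req = build_session_request("YOUR_API_KEY", "en", "es")
    print(req.get_method(), req.full_url)
```

Because the returned WebSocket URL is valid for only 60 seconds, it is probably safest to request the session immediately before connecting rather than caching the URL.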
2) Open the WebSocket and wait for ready
The first frame from the server is {"type":"ready"}. Wait for it before sending audio; after that the socket is bidirectional.
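One way to gate audio on the ready frame is a small check applied to the first message your WebSocket client hands you. The helper name is_ready_frame is hypothetical; the only detail taken from this guide is the {"type":"ready"} payload.

```python
import json


def is_ready_frame(frame):
    """Return True if the frame is the server's {"type": "ready"} message."""
    # Binary frames carry audio, so they can never be the ready signal.
    if isinstance(frame, (bytes, bytearray)):
        return False
    try:
        message = json.loads(frame)
    except json.JSONDecodeError:
        return False
    return isinstance(message, dict) and message.get("type") == "ready"
```

A client would typically read one message after connecting, call is_ready_frame on it, and start the microphone stream only if it returns True.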
3) Stream audio in
Small binary frames — PCM16 little-endian, mono, 16 kHz. 20–40 ms per frame is a good default.
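The frame sizes follow directly from the audio format: 16,000 samples per second at 2 bytes per sample is 32,000 bytes per second, so a 20 ms frame is 640 bytes and a 40 ms frame is 1,280 bytes. The sketch below cuts a PCM16 buffer into frames of that size; the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # input rate stated in this guide
BYTES_PER_SAMPLE = 2      # PCM16, mono


def frame_size_bytes(frame_ms):
    """Number of bytes in one mono PCM16 frame of the given duration."""
    return SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE


def split_into_frames(pcm, frame_ms=20):
    """Split raw PCM16 bytes into whole frames plus a leftover tail.

    The tail is returned so the caller can prepend it to the next
    chunk of microphone data instead of sending a short frame.
    """
    size = frame_size_bytes(frame_ms)
    whole = len(pcm) - len(pcm) % size
    frames = [pcm[i:i + size] for i in range(0, whole, size)]
    return frames, pcm[whole:]
```

Each element of frames would then be sent as one binary WebSocket message.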
4) Receive results
- Binary frames — translated speech audio, float32 LE mono @ 24 kHz. Play as they arrive. (Speech-to-speech mode only.)
- JSON frames — original transcript, translated transcript, and segment lifecycle events.
Interim transcripts are marked is_final: false; stable ones are marked is_final: true. Use correlation_id to pair each original transcript with its translation.
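A minimal sketch of handling both frame kinds follows. The fields is_final and correlation_id come from this guide; the message type values original_transcript and translated_transcript and the text field are assumptions chosen for illustration, so check the API Reference for the real schema.

```python
import json
import struct

# ASSUMPTION: these type values and the "text" field are illustrative.
ORIGINAL = "original_transcript"
TRANSLATED = "translated_transcript"


def decode_audio(frame):
    """Decode a binary frame of float32 little-endian mono samples."""
    count = len(frame) // 4          # 4 bytes per float32 sample
    return list(struct.unpack(f"<{count}f", frame[:count * 4]))


class TranscriptPairer:
    """Pair final originals with final translations by correlation_id."""

    def __init__(self):
        self._pending = {}

    def handle(self, raw_json):
        """Return (original, translated) once both finals have arrived."""
        message = json.loads(raw_json)
        cid = message.get("correlation_id")
        # Interim results (is_final: false) are skipped for pairing.
        if cid is None or not message.get("is_final", False):
            return None
        entry = self._pending.setdefault(cid, {})
        entry[message.get("type")] = message.get("text", "")
        if ORIGINAL in entry and TRANSLATED in entry:
            del self._pending[cid]
            return entry[ORIGINAL], entry[TRANSLATED]
        return None
```

Decoded samples are at 24 kHz, so the playback device needs to be opened at that rate (or the audio resampled) rather than at the 16 kHz input rate.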
5) Change config mid-session, then close
Send {"type":"update_metadata", ...} any time to swap language or
voice. Close with code 1000 when done.
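The update message could be assembled as below. Only the type value update_metadata and close code 1000 come from this guide; the field names source_language, target_language and voice_type are assumptions standing in for whatever the API Reference specifies.

```python
import json

NORMAL_CLOSURE = 1000  # WebSocket close code for a clean shutdown

# ASSUMPTION: illustrative field names, not confirmed by this guide.
ALLOWED_FIELDS = {"source_language", "target_language", "voice_type"}


def build_update_metadata(**changes):
    """Serialise an update_metadata message for the fields being changed."""
    if not changes:
        raise ValueError("at least one field must change")
    unknown = set(changes) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    return json.dumps({"type": "update_metadata", **changes})
```

The resulting string is sent as a text frame, for example build_update_metadata(target_language="fr"), and the connection is later closed with NORMAL_CLOSURE.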
Latency
Live translation is inherently incremental: Pinch listens, recognises, translates, and (in speech-to-speech mode) synthesises, all while speech is still arriving. End-to-end latency depends on:
- Language pair — verb-final languages need more context before translation can start.
- Speech cadence — short phrases with natural pauses finalise faster than long monologues.
- Network RTT to ws.startpinch.com.
- Frame size — 20 ms mic frames feel more interactive than 200 ms frames at a small bandwidth cost.
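To put a rough number on the frame-size trade-off, the sketch below estimates WebSocket framing overhead per second of audio. It counts only the RFC 6455 frame header for client-to-server (masked) frames and ignores TCP, TLS and IP overhead, so treat the figures as a lower bound.

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2


def ws_client_header_bytes(payload_len):
    """Header size of one masked WebSocket frame, per RFC 6455."""
    if payload_len <= 125:
        extended = 0      # length fits in the base header
    elif payload_len <= 0xFFFF:
        extended = 2      # 16-bit extended length
    else:
        extended = 8      # 64-bit extended length
    return 2 + extended + 4   # base header + length + masking key


def framing_cost(frame_ms):
    """Return (payload bytes per frame, frames per second, header bytes per second)."""
    payload = SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE
    frames_per_second = 1000 / frame_ms
    overhead = ws_client_header_bytes(payload) * frames_per_second
    return payload, frames_per_second, overhead


if __name__ == "__main__":
    for ms in (20, 200):
        print(ms, framing_cost(ms))
```

With these assumptions, 20 ms frames cost 400 header bytes per second against 32,000 bytes of audio (about 1.25%), while 200 ms frames cost 40. The larger frame also cannot be sent until all 200 ms of audio has been captured, which is where the interactivity difference comes from.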