Pinch

How real-time translation works

Pinch real-time translation is one WebSocket connection per live conversation. There are two modes:

  • Speech to speech — you stream microphone audio in, Pinch streams translated speech audio and transcripts back.
  • Speech to translated text — you stream microphone audio in, Pinch streams transcripts back only.

You pick the mode when you create the session.


The five steps

1) Create a session

GET /v1/session with your API key. Specify:

  • source language — what the speaker will say
  • target language — what you want back
  • voice type — male or female (ignored in text-only mode)
  • audio output — on for speech-to-speech, off for text-only

You get back a pre-signed WebSocket URL, valid for 60 seconds.

See the API Reference for full parameters.
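
As a rough illustration of step 1, the sketch below builds (but does not send) the session request and tracks the 60-second validity window of the returned URL. The host, the query parameter names, and the Bearer auth header are all placeholders invented for this example; the real names are in the API Reference.

```python
import urllib.parse
import urllib.request
from dataclasses import dataclass

# Hypothetical values: the real host, parameter names, and auth scheme
# are defined in the API Reference, not in this sketch.
API_BASE = "https://api.example.invalid"
URL_TTL_SECONDS = 60  # from the text: the pre-signed URL is valid for 60 seconds


def build_session_request(api_key: str, source: str, target: str,
                          voice: str = "female",
                          audio_output: bool = True) -> urllib.request.Request:
    """Build the GET /v1/session request without sending it."""
    query = urllib.parse.urlencode({
        "source_language": source,   # hypothetical parameter name
        "target_language": target,   # hypothetical parameter name
        "voice_type": voice,         # hypothetical parameter name
        "audio_output": "true" if audio_output else "false",  # hypothetical
    })
    request = urllib.request.Request(f"{API_BASE}/v1/session?{query}", method="GET")
    request.add_header("Authorization", f"Bearer {api_key}")  # hypothetical auth scheme
    return request


@dataclass
class SessionUrl:
    """A pre-signed WebSocket URL plus the clock reading when it was issued."""
    url: str
    issued_at: float  # seconds, e.g. from time.monotonic()

    def is_fresh(self, now: float) -> bool:
        # Connect only while the URL is still inside its validity window.
        return (now - self.issued_at) < URL_TTL_SECONDS
```

Checking is_fresh just before connecting makes it easy to request a new URL instead of retrying with an expired one.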

2) Open the WebSocket and wait for ready

The first frame from the server is {"type":"ready"}. After that the socket is bidirectional.
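
A minimal check for the ready frame might look like the following. It assumes your WebSocket client hands text frames to you as str and binary frames as bytes, which is common in Python clients but worth confirming for the one you use. The helper name is made up for this example.

```python
import json


def is_ready_frame(frame) -> bool:
    """Return True only for a text frame whose JSON body has type == "ready"."""
    if not isinstance(frame, str):  # binary frames are assumed to arrive as bytes
        return False
    try:
        message = json.loads(frame)
    except json.JSONDecodeError:
        return False
    return isinstance(message, dict) and message.get("type") == "ready"
```

Holding back microphone audio until this returns True keeps the client in step with the handshake described above.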

3) Stream audio in

Send small binary frames of PCM16 little-endian, mono, 16 kHz audio. 20–40 ms per frame is a good default (640–1,280 bytes at this format).
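
The frame sizes follow from the audio format: 16,000 samples per second at 2 bytes per sample. The sketch below splits a captured PCM16 buffer into fixed-duration frames and returns any incomplete tail so it can be prepended to the next capture. The function names are illustrative only.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # PCM16


def frame_size_bytes(frame_ms: int) -> int:
    """Bytes in one mono PCM16 frame of the given duration."""
    return SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE


def split_into_frames(pcm: bytes, frame_ms: int = 20) -> Tuple[List[bytes], bytes]:
    """Return (whole frames, leftover bytes shorter than one frame)."""
    size = frame_size_bytes(frame_ms)
    whole = len(pcm) - len(pcm) % size
    frames = [pcm[i:i + size] for i in range(0, whole, size)]
    return frames, pcm[whole:]
```

Each element of the returned list would be sent as one binary WebSocket message.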

4) Receive results

  • Binary frames — translated speech audio, float32 LE mono @ 24 kHz. Play as they arrive. (Speech-to-speech mode only.)
  • JSON frames — original transcript, translated transcript, and segment lifecycle events.
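
Many playback APIs expect 16-bit integer samples rather than float32, so a conversion step is often needed. The sketch below decodes a binary frame and converts it to PCM16. It assumes samples are normalised to the range -1.0 to 1.0, which is conventional for float audio but not stated above, and clips anything outside that range.

```python
import struct
from typing import List, Sequence

OUTPUT_RATE_HZ = 24_000  # from the text: translated audio is 24 kHz mono


def decode_float32_le(frame: bytes) -> List[float]:
    """Unpack little-endian float32 samples, ignoring any trailing partial sample."""
    count = len(frame) // 4
    return list(struct.unpack(f"<{count}f", frame[:count * 4]))


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert float samples (assumed in [-1.0, 1.0]) to little-endian PCM16."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(round(s * 32767)) for s in clipped))


def duration_ms(frame: bytes) -> float:
    """Playback duration of one float32 mono frame at 24 kHz."""
    return len(frame) / 4 / OUTPUT_RATE_HZ * 1000
```

duration_ms is useful for sizing a jitter buffer, since frames should be played as they arrive.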

Interim transcripts are marked is_final: false; stable ones is_final: true. Use correlation_id to pair originals with their translations.
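
One way to pair originals with translations is to buffer final transcripts by correlation_id until both sides have arrived. In the sketch below only is_final and correlation_id come from the description above; the message type values and the text field are hypothetical placeholders, so check the API Reference for the real shapes.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical message type values and field name, for illustration only.
ORIGINAL = "original_transcript"
TRANSLATED = "translated_transcript"
TEXT_FIELD = "text"


class TranscriptPairer:
    """Collects final transcripts and emits (original, translated) pairs."""

    def __init__(self) -> None:
        self._pending: Dict[str, Dict[str, str]] = {}

    def feed(self, raw: str) -> Optional[Tuple[str, str]]:
        message = json.loads(raw)
        kind = message.get("type")
        if kind not in (ORIGINAL, TRANSLATED) or not message.get("is_final"):
            return None  # skip lifecycle events and interim transcripts
        cid = message["correlation_id"]
        slot = self._pending.setdefault(cid, {})
        slot[kind] = message[TEXT_FIELD]
        if ORIGINAL in slot and TRANSLATED in slot:
            del self._pending[cid]  # pair complete, free the buffer entry
            return slot[ORIGINAL], slot[TRANSLATED]
        return None
```

Interim transcripts are dropped here for simplicity; a live captioning UI would more likely render them and overwrite them when the final version arrives.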

5) Change config mid-session, then close

Send {"type":"update_metadata", ...} any time to swap language or voice. Close with code 1000 when done.
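
A small helper can keep update messages well-formed. The payload fields of update_metadata are not listed here, so this sketch accepts whatever fields the caller supplies and only fixes the type; any field name you pass, such as target_language in the usage comment, is a hypothetical stand-in for the real one.

```python
import json

NORMAL_CLOSURE = 1000  # close code named in the text; "normal closure" in RFC 6455


def build_update_metadata(**fields) -> str:
    """Serialise an update_metadata message from caller-supplied fields."""
    if "type" in fields:
        raise ValueError("'type' is set automatically")
    return json.dumps({"type": "update_metadata", **fields})


# Usage with a hypothetical client object `ws` and a hypothetical field name:
#   ws.send(build_update_metadata(target_language="fr"))
#   ws.close(code=NORMAL_CLOSURE)
```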

Figure: Pinch API flow. Original audio in, translated audio and transcripts out.

Latency

Live translation is inherently incremental: Pinch listens, recognises, translates, and (in speech-to-speech mode) synthesises, all while speech is still arriving. End-to-end latency depends on:

  • Language pair — verb-final languages need more context before translation can start.
  • Speech cadence — short phrases with natural pauses finalise faster than long monologues.
  • Network RTT to ws.startpinch.com.
  • Frame size — 20 ms mic frames feel more interactive than 200 ms frames at a small bandwidth cost.
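
The frame-size trade-off can be estimated with some arithmetic. A frame cannot be sent until it has been fully captured, so the frame duration is a floor on added delay, while smaller frames mean more WebSocket headers per second. The sketch below counts only WebSocket framing for masked client-to-server frames as defined in RFC 6455; TCP, TLS, and IP overhead are left out, so treat the figures as a lower bound.

```python
from typing import Tuple


def client_frame_header_bytes(payload_len: int) -> int:
    """RFC 6455 header size for a masked (client-to-server) frame."""
    if payload_len <= 125:
        extended = 0
    elif payload_len <= 0xFFFF:
        extended = 2
    else:
        extended = 8
    return 2 + extended + 4  # base header + extended length + masking key


def framing_cost(frame_ms: int) -> Tuple[int, float, float]:
    """Return (payload bytes per frame, frames per second, header bytes per second)."""
    payload = 16_000 * frame_ms // 1000 * 2  # 16 kHz mono PCM16
    frames_per_second = 1000 / frame_ms
    overhead = client_frame_header_bytes(payload) * frames_per_second
    return payload, frames_per_second, overhead


print(framing_cost(20))   # (640, 50.0, 400.0)
print(framing_cost(200))  # (6400, 5.0, 40.0)
```

Against 32,000 bytes per second of audio, that is roughly 1.25% header overhead at 20 ms versus 0.125% at 200 ms, in exchange for 180 ms less buffering delay per frame.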