---
title: "How real-time translation works"
section: "Real-time translation"
order: 0
sidebarLabel: "How it works"
---

Pinch real-time translation is **one WebSocket connection** per live conversation. There are two modes:

- **Speech to speech** — you stream microphone audio in, Pinch streams translated speech audio and transcripts back.
- **Speech to translated text** — you stream microphone audio in, Pinch streams transcripts back only.

You pick the mode when you create the session.

---

## The five steps

#### 1) Create a session

`GET /v1/session` with your API key. Specify:

- **source language** — what the speaker will say
- **target language** — what you want back
- **voice type** — `male` or `female` (ignored in text-only mode)
- **audio output** — on for speech-to-speech, off for text-only

You get back a pre-signed WebSocket URL, valid for 60 seconds. See the [API Reference](/docs/api-reference) for full parameters.

#### 2) Open the WebSocket and wait for `ready`

The first frame from the server is `{"type":"ready"}`. After that the socket is bidirectional.

#### 3) Stream audio in

Small binary frames — PCM16 little-endian, mono, 16 kHz. 20–40 ms per frame is a good default.

#### 4) Receive results

- **Binary frames** — translated speech audio, float32 LE mono @ 24 kHz. Play as they arrive. (Speech-to-speech mode only.)
- **JSON frames** — original transcript, translated transcript, and segment lifecycle events. Interim transcripts are marked `is_final: false`; stable ones `is_final: true`. Use `correlation_id` to pair originals with their translations.

#### 5) Change config mid-session, then close

Send `{"type":"update_metadata", ...}` any time to swap language or voice. Close with code `1000` when done.
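The framing in step 3 can be sketched offline. A minimal sketch, assuming you already hold raw PCM16 LE mono 16 kHz audio in a `bytes` buffer; the 20 ms frame size is the low end of the 20–40 ms range suggested above, and `chunk_pcm16` is a hypothetical helper name, not part of the Pinch API:

```python
# Split a PCM16 LE mono 16 kHz buffer into 20 ms binary frames,
# each suitable for sending as one WebSocket binary message.
SAMPLE_RATE = 16_000   # samples per second, per the protocol above
BYTES_PER_SAMPLE = 2   # PCM16 = 2 bytes per sample

def chunk_pcm16(audio: bytes, frame_ms: int = 20):
    """Yield fixed-size frames; the trailing partial frame is kept."""
    frame_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * frame_ms // 1000
    for start in range(0, len(audio), frame_bytes):
        yield audio[start:start + frame_bytes]

# one second of silence -> fifty 640-byte frames at 20 ms each
frames = list(chunk_pcm16(bytes(32_000)))
```

At 16 kHz mono PCM16, one second of audio is 32,000 bytes, so a 20 ms frame is 640 bytes.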
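The JSON half of step 4 can be sketched as a small pairing buffer keyed on `correlation_id`. The fields `type`, `is_final`, and `correlation_id` come from the text above; the `text` field and the event type names `"transcript"` and `"translation"` are assumptions for illustration, so check the actual payloads against the API Reference:

```python
import json

def pair_transcripts(json_frames):
    """Pair final original/translated transcripts by correlation_id.

    Returns {correlation_id: {"original": ..., "translated": ...}}.
    Interim results (is_final == False) are skipped here; a real UI
    would render them live, then replace them with the final text.
    """
    pairs = {}
    for raw in json_frames:
        msg = json.loads(raw)
        if not msg.get("is_final"):
            continue  # interim transcript, superseded later
        slot = pairs.setdefault(msg["correlation_id"], {})
        # assumed event types: "transcript" is the source-language
        # text, "translation" is the target-language text
        key = "original" if msg["type"] == "transcript" else "translated"
        slot[key] = msg["text"]
    return pairs

# demo: an interim transcript, then its final form, then the translation
demo = [
    json.dumps({"type": "transcript", "is_final": False,
                "correlation_id": "seg-1", "text": "bonj"}),
    json.dumps({"type": "transcript", "is_final": True,
                "correlation_id": "seg-1", "text": "bonjour"}),
    json.dumps({"type": "translation", "is_final": True,
                "correlation_id": "seg-1", "text": "hello"}),
]
paired = pair_transcripts(demo)
```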