---
title: "How real-time translation works"
section: "Real-time translation"
order: 0
sidebarLabel: "How it works"
---

## High level flow

Think of Pinch as a **live translation session**.

#### 1) You create a session

You tell Pinch what you want:

- **source language** (what you’ll speak)
- **target language** (what you want to hear)

Pinch returns a session you can connect to.

#### 2) You stream audio in

Audio is sent in small chunks (frames). Pinch processes it as it arrives.

#### 3) Pinch streams results back

Pinch can stream back:

- translated speech audio (if desired)
- original and translated transcription messages as text

#### 4) You play audio + stop when you’re done

When the user stops the session, the connection is closed cleanly.
*Pinch API flow: original audio in, translation, translated audio out*
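The four steps above can be sketched as a minimal mock session. This is not the real Pinch SDK: the class name `PinchSession`, its methods, and the message shapes are all assumptions chosen to mirror the flow, so check the API reference for the actual names.

```python
# Hypothetical sketch of the four-step flow. PinchSession, stream_frame,
# and results are illustrative stand-ins, not real Pinch SDK names.

class PinchSession:
    """Minimal stand-in for a live translation session."""

    def __init__(self, source_lang: str, target_lang: str):
        # 1) Create a session: declare what you speak and what you want to hear.
        self.source_lang = source_lang
        self.target_lang = target_lang
        self.open = True
        self._frames: list[bytes] = []

    def stream_frame(self, frame: bytes) -> None:
        # 2) Stream audio in: small chunks (frames), processed as they arrive.
        if not self.open:
            raise RuntimeError("session is closed")
        self._frames.append(frame)

    def results(self):
        # 3) Results stream back: transcripts as text plus translated audio.
        for i, _ in enumerate(self._frames):
            yield {"type": "transcript", "text": f"chunk {i} (original)"}
            yield {"type": "translated_audio", "data": b"..."}

    def close(self) -> None:
        # 4) Stop when you're done: the connection is closed cleanly.
        self.open = False


session = PinchSession(source_lang="en", target_lang="es")
for chunk in (b"\x00\x01", b"\x02\x03"):
    session.stream_frame(chunk)
messages = list(session.results())
session.close()
```

In a real integration the frames would come from a microphone capture loop and the results would arrive over a streaming connection rather than a local generator.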
---

## Real-time voice cloning

When `voice_type="clone"`, Pinch tries to return translated speech in the speaker’s own voice, adapting tone and timbre as the user talks.

This is best when you care about:

- keeping the speaker identity consistent
- tone and vibe carrying across languages
- live translation that still feels like “the same person”

---

## Latency basics

Live translation isn’t a single step. It’s usually:

- listen a bit → understand → translate → generate speech → stream out

Regardless of which models power these steps, some waiting is unavoidable: sentence structure differs between languages, so the translator may need to hear more of a sentence before it can start. For example, some languages put the verb at the beginning of a sentence, others at the end.

Latency therefore depends primarily on:

- language pair complexity
- rate of speech (if you pause, as you would for a human interpreter, the model may understand and translate a bit faster)
- your network
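As a hedged illustration of opting into cloning, the snippet below builds a session configuration. Only `voice_type="clone"` comes from this page; the helper name, the other field names, and the `"default"` fallback value are assumptions, not the documented Pinch API.

```python
# Illustrative only: `voice_type="clone"` is from the docs above; the
# helper and every other field name here are assumed, not official.

def make_session_config(source_lang: str, target_lang: str,
                        clone_voice: bool = False) -> dict:
    """Build a session config, opting into voice cloning when requested."""
    return {
        "source_language": source_lang,
        "target_language": target_lang,
        # "clone" asks Pinch to match the speaker's tone and timbre;
        # the non-clone value "default" is an assumed placeholder.
        "voice_type": "clone" if clone_voice else "default",
    }


config = make_session_config("en", "fr", clone_voice=True)
```

The point of the sketch is that cloning is a per-session choice made at creation time, so you can offer it as a user-facing toggle without changing the streaming flow.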