---
title: "Real-time API Reference"
section: "Real-time translation"
order: 1
sidebarLabel: "API Reference"
---

Pinch real-time translation is a single WebSocket connection. You stream microphone audio in and receive translated speech and transcripts back on the same socket. There is no SDK; it works from any language that can open a WebSocket.

The API has two modes, chosen per session:

- **Speech to speech**: send source-language audio, receive translated speech audio and transcripts. (`audioOutputEnabled=true`, default)
- **Speech to translated text**: send source-language audio, receive transcripts only. (`audioOutputEnabled=false`)

**Base URL:** `https://ws.startpinch.com`

---

## Flow

1. `GET /v1/session`: authenticate and get a short-lived `ws_url`.
2. Open a WebSocket to `ws_url`.
3. Wait for `{"type":"ready"}`.
4. Stream mic audio as binary frames.
5. Receive transcripts (JSON) and, in speech-to-speech mode, translated audio (binary) on the same socket.
6. Optionally change language or voice mid-session (see section 6).
7. Close the socket when done.

---

## 1. Create a session

```
GET /v1/session
```

Returns a short-lived (60 s) pre-signed WebSocket URL. The WebSocket itself needs no further auth.

### Request headers

| Header | Value | Required |
| --- | --- | --- |
| `Authorization` | `Bearer YOUR_API_KEY` | Yes |

### Query parameters

All query params become the initial session metadata. To change them after the session starts, see section 6 (Update metadata mid-session).

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `sourceLanguage` | string | `en` | ISO-639 / BCP-47 code of spoken audio (e.g. `en`, `en-US`). Pass `auto` to auto-detect. |
| `targetLanguage` | string | `es` | Language to translate into (e.g. `es`, `de`, `zh`). |
| `voiceType` | string | `male` | `male` or `female`. Ignored when `audioOutputEnabled=false`. |
| `audioOutputEnabled` | bool | `true` | `true` = speech-to-speech, `false` = speech-to-translated-text. |
| `finalizeMode` | string | `stable` | `stable` = multiple finals per segment as spans lock in. `sentence` = one final per sentence. See [Finalize mode](#finalize-mode). |
| `twoWay` | bool | `false` | When `false` (default), only audio in `sourceLanguage` is transcribed and translated into `targetLanguage`. When `true`, the session listens for **both** languages and auto-detects per utterance, translating into whichever of the two the speaker isn't currently using. Forces `finalizeMode=sentence`. |
| `sourceLanguageLabel` | string | none | Optional natural-language label that overrides how the source language is described to the translator. |
| `targetLanguageLabel` | string | none | Optional natural-language label for the target language. Same shape as `sourceLanguageLabel`; it drives the register, formality, or persona of the translation (e.g. `"professional english"`, `"pirate-speak english"`, `"casual spanish"`). |
| `context` | string | none | Optional free-text description of the session. Biases recognition and translation toward names, products, and jargon you mention. See [Session context](#session-context). |
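As a rough illustration of the request described above, the sketch below assembles the `GET /v1/session` URL and headers from the documented query parameters. The helper name `build_session_request` is hypothetical, and writing booleans as lowercase `true`/`false` is an assumption based on how the reference spells `audioOutputEnabled=true`. Sending the request is left to whichever HTTP client you use.

```python
from urllib.parse import urlencode

BASE = "https://ws.startpinch.com"


def build_session_request(api_key: str, **params) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for GET /v1/session.

    Hypothetical helper: booleans are written as lowercase true/false
    (an assumption), and parameters set to None are omitted so the
    documented server defaults apply.
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[name] = str(value)
    url = f"{BASE}/v1/session"
    if query:
        url += "?" + urlencode(query)  # percent-encodes free text such as `context`
    return url, {"Authorization": f"Bearer {api_key}"}


url, headers = build_session_request(
    "pk_example",  # placeholder key, not a real credential
    sourceLanguage="auto",
    targetLanguage="de",
    audioOutputEnabled=False,
    context="Acme Corp Q3 all-hands",
)
print(url)
```

Encoding the query string with `urlencode` matters mostly for `context` and the language labels, which are free text and may contain spaces or punctuation.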
### Response (200)

```json
{
  "session_id": "ss_01JF...",
  "ws_url": "wss://ws.startpinch.com/session?token=...",
  "expires_in": 60,
  "model_version_id": "a1b2c3d4e5f6"
}
```

Open `ws_url` within `expires_in` seconds.

`model_version_id` is a stable identifier for the exact server build serving this session. Log it alongside any user-reported issue so support can map a report back to the exact build that produced it.

### Error responses

| Code | Meaning |
| ---- | ------- |
| 401 | Missing / invalid API key. |
| 403 | API key not entitled to real-time translation. |
| 503 | Capacity exhausted; retry after a short backoff. |

---

## 2. Open the WebSocket

Open a WebSocket to `ws_url`. No additional auth headers are needed because the URL is pre-signed.

The first frame from the server is:

```json
{"type": "ready", "session_id": "ss_01JF..."}
```

Do not send audio before `ready`. After `ready`, the socket is fully bidirectional.

---

## 3. Audio in — client → server

Binary frames of raw PCM audio:

- **Format:** PCM16 little-endian, mono
- **Sample rate:** 16 kHz
- **Frame size:** 10–40 ms recommended (160–640 samples = 320–1280 bytes)

No container, no headers, just the sample bytes. Smaller frames feel more interactive; larger frames add latency but reduce overhead.

### Push-to-talk

Two ways to commit the current snippet when the user releases the mic:

- **Recommended:** send `{"type":"finalize"}`. The server flushes the current segment, emits `is_final` transcripts (and translated audio, in speech-to-speech mode), and keeps the session open so the next press starts a fresh segment. See [Finalize current segment](#finalize-current-segment).
- **Legacy:** send ~300 ms of silence (PCM16 zeros) after release. Works without the explicit signal but adds noticeable tail latency.

---

## 4. Audio out — server → client (speech-to-speech mode)

When `audioOutputEnabled=true`, the server sends binary frames of translated speech:

- **Format:** float32 little-endian, mono
- **Sample rate:** 24 kHz
- **Frame size:** variable; typical frame is ~80 ms

Play frames as they arrive. Each frame is a standalone chunk.

In speech-to-translated-text mode (`audioOutputEnabled=false`) no binary frames are sent, only transcripts.

---

## 5. Transcripts — server → client

JSON text frames. Dispatch on `type`.

All transcript and lifecycle frames include:

| Field | Description |
| --- | --- |
| `correlation_id` | Groups an original transcript with its translation. |
| `source_language`, `target_language` | Current session languages. |
| `detected_language` | ISO-639 code recognised from the audio. Populated when `sourceLanguage=auto`; `null` otherwise. |
| `timestamp` | Unix seconds (float). |

### `original_transcript`

Recognised source-language speech. `text` is the canonical field.

```json
{
  "type": "original_transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": "en",
  "timestamp": 1770933604.048
}
```

### `translated_transcript`

Translation of the current segment. `text` is the translated text (same as `translated_text`); `original_text` is the recognised source-language text that produced it.

```json
{
  "type": "translated_transcript",
  "text": "Hola, ¿cómo estás?",
  "translated_text": "Hola, ¿cómo estás?",
  "original_text": "Hello, how are you?",
  "is_final": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": "en",
  "timestamp": 1770933604.945
}
```

### Interim vs final

- `is_final: false`: partial, may be revised as more audio arrives. Use for live-feedback UI.
- `is_final: true`: stable; never revised. Cadence depends on `finalizeMode` (see [Finalize mode](#finalize-mode)).
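The interim/final rules above can be sketched as a small dispatcher that keeps the latest text per `correlation_id`. This is an illustration, not part of the API: the `TranscriptTracker` class and its method names are hypothetical, and it assumes every new frame for a segment supersedes the previous one, which fits the description of partials being revised and of finals extending earlier finals.

```python
import json


class TranscriptTracker:
    """Keeps the latest original/translated text per correlation_id (hypothetical helper)."""

    KINDS = ("original_transcript", "translated_transcript")

    def __init__(self):
        # correlation_id -> {frame type: (text, is_final)}
        self.segments = {}

    def handle(self, raw: str):
        """Process one JSON text frame; return the updated segment, or None if not a transcript."""
        evt = json.loads(raw)
        kind = evt.get("type")
        if kind not in self.KINDS:
            return None  # ready, metadata_ack, error, ... are handled elsewhere
        segment = self.segments.setdefault(evt["correlation_id"], {})
        # Assumption: a newer frame replaces the previous text for the same segment.
        segment[kind] = (evt["text"], bool(evt.get("is_final", False)))
        return segment

    def final_translation(self, correlation_id: str):
        """Return the translated text once it is final, otherwise None."""
        text, is_final = self.segments.get(correlation_id, {}).get(
            "translated_transcript", ("", False)
        )
        return text if is_final else None


tracker = TranscriptTracker()
frames = [
    {"type": "original_transcript", "text": "Hello, how", "is_final": False,
     "correlation_id": "seg_a8b2c3d4"},
    {"type": "translated_transcript", "text": "Hola, ¿cómo", "is_final": False,
     "correlation_id": "seg_a8b2c3d4"},
    {"type": "translated_transcript", "text": "Hola, ¿cómo estás?", "is_final": True,
     "correlation_id": "seg_a8b2c3d4"},
    {"type": "metadata_ack", "applied": {}},
]
for frame in frames:
    tracker.handle(json.dumps(frame))
print(tracker.final_translation("seg_a8b2c3d4"))
```

A live-feedback UI would render every update returned by `handle`, while a transcript log would only persist what `final_translation` returns.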
### `metadata_ack`

Sent in response to an `update_metadata` frame (see section 6).

```json
{"type": "metadata_ack", "applied": {"targetLanguage": "de"}}
```

### `error`

Recoverable errors during the session (unsupported voice, bad metadata, etc.). The socket stays open unless followed by a close frame.

```json
{"type": "error", "error": "targetLanguage='xx' is not supported.", "timestamp": 1770933600.0}
```

---

## 6. Update metadata mid-session

To change `targetLanguage`, `voiceType`, `context`, or other session parameters mid-stream, the current recommended pattern is to **send `{"type":"finalize"}`, close the WebSocket, and open a fresh session** with the new query params.

A first-class `update_metadata` frame is on the roadmap but not yet live. If a client sends one today, the server acknowledges with `metadata_ack` but does not apply runtime changes.

---

## Session context

Pass a free-text description of the session as `context` to bias recognition and translation toward names, products, and jargon you mention.

```json
{
  "context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
}
```

Helpful for:

- **Proper nouns** the model wouldn't otherwise know (people, products, companies, places).
- **Rare or domain-specific jargon** that's easily misheard.

Any language. Up to ~2,000 characters. Send an empty string to clear. To change context mid-session, follow the pattern in section 6.

---

## Finalize current segment

Send `{"type":"finalize"}` to commit whatever the user has spoken so far without closing the session. Useful for push-to-talk: the moment the user releases the button, fire one finalize and the server snaps the segment shut.

```json
{"type": "finalize"}
```

What happens server-side:

1. The server waits a short grace period (~300 ms) for any audio frames still in flight to land, so your client doesn't need any tail buffering or trailing silence for PTT.
2. The current segment is force-finalized: you receive `original_transcript` and `translated_transcript` frames with `is_final: true` (and translated speech audio, when `audioOutputEnabled=true`).
3. The session stays open; the next audio frame you send starts a fresh segment with a new `correlation_id`.

---

## Finalize mode

Controls when `is_final: true` is emitted. Set via the `finalizeMode` query param.

- **`stable`** (default): multiple finals per segment, each a prefix-extension of the previous, emitted as soon as a span is locked in. Lowest time-to-first-stable-text.
- **`sentence`**: one final per sentence boundary. Best for transcript logging or downstream NLP. `twoWay=true` always uses `sentence`.

Partials (`is_final: false`) stream in both modes.

---

## 7. Close

Close the WebSocket cleanly (code `1000`) when done.

Server close codes:

| Code | Meaning |
| ---- | ------- |
| 1000 | Normal closure. |
| 4400 | Bad request / malformed frame. |
| 4401 | Token expired; reallocate the session. |
| 4503 | Session ended; reallocate. |
| 4500 | Internal error. |

---

## Languages

- **Input** (`sourceLanguage`): 50+ languages. See [Supported languages](/docs/supported-languages).
- **Output voice** (`targetLanguage` with `audioOutputEnabled=true`): 40+ languages. Requesting an unsupported target with audio output returns an `error` listing the current set.
- **Output text only** (`audioOutputEnabled=false`): any target language the translator supports.

---

## Minimal client (Python)

```python
import asyncio
import json

import httpx
import sounddevice as sd
import websockets

API_KEY = "pk_..."
BASE = "https://ws.startpinch.com"
FRAME_SAMPLES = 320  # 20 ms of audio at 16 kHz


async def main():
    # Create a session and read the pre-signed WebSocket URL.
    async with httpx.AsyncClient() as http:
        r = await http.get(
            f"{BASE}/v1/session",
            params={"sourceLanguage": "en", "targetLanguage": "es", "voiceType": "male"},
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        r.raise_for_status()
        ws_url = r.json()["ws_url"]

    async with websockets.connect(ws_url) as ws:
        # Do not send audio before the ready frame.
        ready = json.loads(await ws.recv())
        assert ready["type"] == "ready"

        async def mic_to_ws():
            # PCM16 mono @ 16 kHz in. The blocking read runs in a worker
            # thread so it does not stall the event loop.
            with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16",
                                   blocksize=FRAME_SAMPLES) as s:
                while True:
                    buf, _ = await asyncio.to_thread(s.read, FRAME_SAMPLES)
                    await ws.send(bytes(buf))

        async def ws_to_out():
            # Binary frames are float32 mono @ 24 kHz; text frames are JSON.
            with sd.RawOutputStream(samplerate=24000, channels=1, dtype="float32") as out:
                async for frame in ws:
                    if isinstance(frame, (bytes, bytearray)):
                        await asyncio.to_thread(out.write, frame)
                    else:
                        evt = json.loads(frame)
                        if evt.get("type") == "translated_transcript":
                            print(evt["text"])

        await asyncio.gather(mic_to_ws(), ws_to_out())


asyncio.run(main())
```

---

## Minimal client (TypeScript / Node)

```ts
import WebSocket from "ws";

const BASE = "https://ws.startpinch.com";
const key = process.env.PINCH_API_KEY!;

const r = await fetch(
  `${BASE}/v1/session?sourceLanguage=en&targetLanguage=es&voiceType=male`,
  { headers: { Authorization: `Bearer ${key}` } },
);
if (!r.ok) throw new Error(`session request failed: ${r.status}`);
const { ws_url } = (await r.json()) as { ws_url: string };

const ws = new WebSocket(ws_url);
ws.on("message", (data, isBinary) => {
  if (isBinary) {
    // float32 LE @ 24 kHz: pipe to audio output
  } else {
    const evt = JSON.parse(data.toString());
    if (evt.type === "ready") {
      // Safe to stream now. Send PCM16LE @ 16 kHz chunks as binary frames, e.g.
      // ws.send(int16Buffer, { binary: true });
    }
    if (evt.type === "translated_transcript") console.log(evt.text);
  }
});
```
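
Both minimal clients hand audio to the sound device as opaque bytes. For clients that read from a file or need 16-bit output, the framing arithmetic from sections 3 and 4 can be sketched in Python as below. The function names are hypothetical, and the conversion assumes output samples are normalised to the range -1.0 to 1.0, which this reference does not state explicitly.

```python
import struct

INPUT_RATE_HZ = 16_000  # documented input rate: PCM16 mono
BYTES_PER_SAMPLE = 2    # int16


def pcm16_frame_bytes(frame_ms: int) -> int:
    """Bytes in one input frame lasting frame_ms milliseconds."""
    return INPUT_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE


def chunk_pcm16(pcm: bytes, frame_ms: int = 20):
    """Yield frame-sized slices of a PCM16 buffer; the last slice may be shorter."""
    size = pcm16_frame_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def float32le_to_pcm16le(frame: bytes) -> bytes:
    """Convert one float32 LE output frame to PCM16 LE.

    Assumes samples are normalised to [-1.0, 1.0]; values outside
    that range are clipped before scaling.
    """
    count = len(frame) // 4
    samples = struct.unpack(f"<{count}f", frame[:count * 4])
    ints = [round(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{count}h", *ints)


print(pcm16_frame_bytes(10), pcm16_frame_bytes(40))
```

With these numbers, the recommended 10–40 ms input range corresponds to 320–1280 bytes per frame, and a typical ~80 ms output frame at 24 kHz holds about 1920 float32 samples.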