---
title: "Real-time API Reference"
section: "Real-time translation"
order: 1
sidebarLabel: "API Reference"
---

Pinch real-time translation is a single WebSocket connection. You stream microphone audio in and receive translated speech and transcripts back on the same socket. No SDK — works from any language that can open a WebSocket.

The API has two modes, chosen per session:

- **Speech to speech** — send source-language audio, receive translated speech audio and transcripts. (`audioOutputEnabled=true`, default)
- **Speech to translated text** — send source-language audio, receive transcripts only. (`audioOutputEnabled=false`)

**Base URL:** `https://ws.startpinch.com`

---

## Flow

1. `GET /v1/session` — authenticate, get a short-lived `ws_url`.
2. Open a WebSocket to `ws_url`.
3. Wait for `{"type":"ready"}`.
4. Stream mic audio as binary frames.
5. Receive transcripts (JSON) and, in speech-to-speech mode, translated audio (binary) on the same socket.
6. Optionally change language or voice mid-session.
7. Close the socket when done.

---

## 1. Create a session

```
GET /v1/session
```

Returns a short-lived (60 s) pre-signed WebSocket URL. No further auth on the WebSocket itself.

### Request headers

| Header | Value | Required |
| --- | --- | --- |
| `Authorization` | `Bearer <API_KEY>` | Yes |

### Query parameters

All query params become the initial session metadata and can be updated mid-session.

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `sourceLanguage` | string | `en` | ISO-639 / BCP-47 code of spoken audio (e.g. `en`, `en-US`). Pass `auto` to auto-detect. |
| `targetLanguage` | string | `es` | Language to translate into (e.g. `es`, `de`, `zh`). |
| `voiceType` | string | `male` | `male` or `female`. Ignored when `audioOutputEnabled=false`. |
| `audioOutputEnabled` | bool | `true` | `true` = speech-to-speech, `false` = speech-to-translated-text. |
| `context` | string | — | Optional free-text description of the session. Biases recognition and translation toward names, products, and jargon you mention. See [Session context](#session-context). |

### Try it
```
GET /v1/session
```
### Response (200)

```json
{
  "session_id": "ss_01JF...",
  "ws_url": "wss://ws.startpinch.com/session?token=...",
  "expires_in": 60
}
```

Open `ws_url` within `expires_in` seconds.

### Error responses

| Code | Meaning |
| ---- | ------- |
| 401 | Missing / invalid API key. |
| 403 | API key not entitled to real-time translation. |
| 503 | Capacity exhausted — retry after a short backoff. |

---

## 2. Open the WebSocket

Open a WebSocket to `ws_url`. No additional auth headers — the URL is pre-signed. The first frame from the server is:

```json
{"type": "ready", "session_id": "ss_01JF..."}
```

Do not send audio before `ready`. After `ready`, the socket is fully bidirectional.

---

## 3. Audio in — client → server

Binary frames of raw PCM audio:

- **Format:** PCM16 little-endian, mono
- **Sample rate:** 16 kHz
- **Frame size:** 10–40 ms recommended (160–640 samples = 320–1280 bytes)

No container, no headers — just the sample bytes. Smaller frames feel more interactive; larger frames add latency but reduce overhead.

### Streaming silence

If your capture is gated (e.g. push-to-talk), send ~300 ms of silence (PCM16 zeros) after the user releases. This lets the server finalise the last sentence cleanly.

---

## 4. Audio out — server → client (speech-to-speech mode)

When `audioOutputEnabled=true`, binary frames of translated speech:

- **Format:** float32 little-endian, mono
- **Sample rate:** 24 kHz
- **Frame size:** variable; a typical frame is ~80 ms

Play frames as they arrive. Each frame is a standalone chunk.

In speech-to-translated-text mode (`audioOutputEnabled=false`) no binary frames are sent — transcripts only.

---

## 5. Transcripts — server → client

JSON text frames. Dispatch on `type`. All transcript and lifecycle frames include:

| Field | Description |
| --- | --- |
| `correlation_id` | Groups an original transcript with its translation. |
| `stream_id` | Opaque segment id. |
| `source_language`, `target_language` | Current session languages. |
| `detected_language` | ISO-639 code the ASR recognised. Populated when `sourceLanguage=auto`; `null` otherwise. |
| `timestamp` | Unix seconds (float). |

### `original_transcript`

Recognised source-language speech. `text` is the canonical field.

```json
{
  "type": "original_transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": null,
  "timestamp": 1770933604.048
}
```

### `translated_transcript`

Translation of the current segment. `text` is the translated text; `source_text` is the recognised source-language text that produced it.

```json
{
  "type": "translated_transcript",
  "text": "Hola, ¿cómo estás?",
  "source_text": "Hello, how are you?",
  "is_final": true,
  "translation_complete": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": null,
  "timestamp": 1770933604.945
}
```

### Interim vs final

- `is_final: false` — partial, may be revised as more audio arrives. Use for live-feedback UI.
- `is_final: true` — stable.

### Lifecycle frames

| `type` | Meaning |
| --- | --- |
| `translation_started` | Start of a new segment. |
| `translation_continuing` | Keepalive during a segment. |
| `translation_finished` | End of a segment. |

### `metadata_ack`

```json
{"type": "metadata_ack", "applied": {"targetLanguage": "de"}}
```

### `error`

Recoverable errors during the session (unsupported voice, bad metadata, etc.). The socket stays open unless followed by a close frame.

```json
{"type": "error", "error": "targetLanguage='xx' is not supported.", "timestamp": 1770933600.0}
```

---

## 6. Update metadata mid-session

Send a JSON text frame any time after `ready`:

```json
{
  "type": "update_metadata",
  "metadata": {
    "targetLanguage": "de",
    "voiceType": "female",
    "context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
  }
}
```

Only the fields you send change.
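As a minimal sketch, sending that frame from Python (assuming `ws` is the already-open connection, e.g. from the `websockets` package) looks like:

```python
import json


async def update_metadata(ws, **fields) -> None:
    """Send an update_metadata frame; only the given fields change server-side."""
    await ws.send(json.dumps({"type": "update_metadata", "metadata": fields}))


# await update_metadata(ws, targetLanguage="de", voiceType="female")
# The server answers with a metadata_ack frame listing what was applied.
```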
Unsupported configs return an `error` frame and the session continues with the prior config.

---

## Session context

Pass a free-text description of the session as `context` to bias recognition and translation toward names, products, and jargon you mention.

```json
{
  "context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
}
```

Helpful for:

- **Proper nouns** the model wouldn't otherwise know (people, products, companies, places).
- **Rare or domain-specific jargon** that's easily misheard.

Any language. Up to ~2 000 characters. Send an empty string to clear. Context can be updated mid-session.

---

## 7. Close

Close the WebSocket cleanly (code `1000`) when done. Server close codes:

| Code | Meaning |
| ---- | ------- |
| 1000 | Normal closure. |
| 4400 | Bad request / malformed frame. |
| 4401 | Token expired — reallocate the session. |
| 4503 | Session ended — reallocate. |
| 4500 | Internal error. |

---

## Languages

- **Input** (`sourceLanguage`): 50+ languages. See [Supported languages](/docs/supported-languages).
- **Output voice** (`targetLanguage` with `audioOutputEnabled=true`): 40+ languages. Requesting an unsupported target with audio output returns an `error` listing the current set.
- **Output text only** (`audioOutputEnabled=false`): any target language the translator supports.

---

## Minimal client (Python)

```python
import asyncio
import json

import httpx
import sounddevice as sd
import websockets

API_KEY = "pk_..."
BASE = "https://ws.startpinch.com"


async def main():
    async with httpx.AsyncClient() as http:
        r = await http.get(
            f"{BASE}/v1/session",
            params={"sourceLanguage": "en", "targetLanguage": "es", "voiceType": "male"},
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        r.raise_for_status()
        ws_url = r.json()["ws_url"]

    async with websockets.connect(ws_url) as ws:
        ready = json.loads(await ws.recv())
        assert ready["type"] == "ready"

        async def mic_to_ws():
            # 320 samples = 20 ms at 16 kHz. s.read() blocks, so run it in a
            # worker thread to keep the event loop (and the receive loop) alive.
            with sd.RawInputStream(samplerate=16000, channels=1,
                                   dtype="int16", blocksize=320) as s:
                while True:
                    buf, _ = await asyncio.to_thread(s.read, 320)
                    await ws.send(bytes(buf))

        async def ws_to_out():
            out = sd.RawOutputStream(samplerate=24000, channels=1, dtype="float32")
            out.start()
            async for frame in ws:
                if isinstance(frame, (bytes, bytearray)):
                    out.write(frame)  # translated speech: float32 LE @ 24 kHz
                else:
                    evt = json.loads(frame)
                    if evt.get("type") == "translated_transcript":
                        print(evt["text"])

        await asyncio.gather(mic_to_ws(), ws_to_out())


asyncio.run(main())
```

---

## Minimal client (TypeScript / Node)

```ts
import WebSocket from "ws";

const BASE = "https://ws.startpinch.com";
const key = process.env.PINCH_API_KEY!;

const r = await fetch(
  `${BASE}/v1/session?sourceLanguage=en&targetLanguage=es&voiceType=male`,
  { headers: { Authorization: `Bearer ${key}` } },
);
const { ws_url } = await r.json();

const ws = new WebSocket(ws_url);

ws.on("message", (data, isBinary) => {
  if (isBinary) {
    // float32 LE @ 24 kHz — pipe to audio output
  } else {
    const evt = JSON.parse(data.toString());
    if (evt.type === "translated_transcript") console.log(evt.text);
  }
});

// After {type:"ready"}, send PCM16LE @ 16 kHz:
ws.send(int16Buffer, { binary: true });
```

---

## Implementation checklist

- HTTP `GET /v1/session` with `Authorization: Bearer <API_KEY>`.
- Open the returned `ws_url` (no extra headers).
- Wait for `{"type":"ready"}` before sending audio.
- Two concurrent loops: send mic frames, receive server frames.
- On each incoming frame, branch binary vs text. Parse text as JSON and dispatch on `type`.
- Use `correlation_id` to pair original and translated transcripts.
- Send ~300 ms of silence if you gate capture (push-to-talk) so the last sentence finalises.
- On close code `4401` or `4503`, reallocate a new session.
- Drop the oldest mic frames if the WebSocket send buffer grows; don't buffer unboundedly.

---

## Legacy (LiveKit) API

If you're on the older LiveKit-based integration (`POST /api/beta1/session` returning `{url, token, room_name}`), it still works and isn't going away soon. New builds should use the WebSocket API above — it's simpler and doesn't require a LiveKit client.