Real-time API Reference
Pinch real-time translation is a single WebSocket connection. You stream microphone audio in and receive translated speech and transcripts back on the same socket. No SDK — works from any language that can open a WebSocket.
The API has two modes, chosen per session:
- Speech to speech — send source-language audio, receive translated speech audio and transcripts. (audioOutputEnabled=true, default)
- Speech to translated text — send source-language audio, receive transcripts only. (audioOutputEnabled=false)
Base URL: https://ws.startpinch.com
Flow
- GET /v1/session — authenticate, get a short-lived ws_url.
- Open a WebSocket to ws_url.
- Wait for {"type":"ready"}.
- Stream mic audio as binary frames.
- Receive transcripts (JSON) and, in speech-to-speech mode, translated audio (binary) on the same socket.
- Optionally change language or voice mid-session.
- Close the socket when done.
1. Create a session
GET /v1/session
Returns a short-lived (60 s) pre-signed WebSocket URL. No further auth on the WebSocket itself.
Request headers
| Header | Value | Required |
|---|---|---|
| Authorization | Bearer <your-api-key> | Yes |
Query parameters
All query params become the initial session metadata and can be updated mid-session.
| Param | Type | Default | Description |
|---|---|---|---|
| sourceLanguage | string | en | ISO-639 / BCP-47 code of spoken audio (e.g. en, en-US). Pass auto to auto-detect. |
| targetLanguage | string | es | Language to translate into (e.g. es, de, zh). |
| voiceType | string | male | male or female. Ignored when audioOutputEnabled=false. |
| audioOutputEnabled | bool | true | true = speech-to-speech, false = speech-to-translated-text. |
| context | string | — | Optional free-text description of the session. Biases recognition and translation toward names, products, and jargon you mention. See Session context. |
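A minimal sketch of this request in Python, assuming httpx and an API key in a (hypothetical) PINCH_API_KEY environment variable:

import os
import httpx

# Create a session; all query params become the initial session metadata.
resp = httpx.get(
    "https://ws.startpinch.com/v1/session",
    params={"sourceLanguage": "en", "targetLanguage": "es", "voiceType": "male"},
    headers={"Authorization": f"Bearer {os.environ['PINCH_API_KEY']}"},
)
resp.raise_for_status()
session = resp.json()
ws_url = session["ws_url"]  # open within session["expires_in"] (60) seconds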
Response (200)
{
"session_id": "ss_01JF...",
"ws_url": "wss://ws.startpinch.com/session?token=...",
"expires_in": 60
}
Open ws_url within expires_in seconds.
Error responses
| Code | Meaning |
|---|---|
| 401 | Missing / invalid API key. |
| 403 | API key not entitled to real-time translation. |
| 503 | Capacity exhausted — retry after a short backoff. |
2. Open the WebSocket
Open a WebSocket to ws_url. No additional auth headers — the URL is
pre-signed.
The first frame from the server is:
{"type": "ready", "session_id": "ss_01JF..."}
Do not send audio before ready. After ready, the socket is fully
bidirectional.
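Continuing the sketch with the websockets library, open the socket and block until the ready frame arrives:

import json
import websockets

async def open_session(ws_url: str):
    # The URL is pre-signed; no extra auth headers are needed.
    ws = await websockets.connect(ws_url)
    first = json.loads(await ws.recv())
    if first.get("type") != "ready":
        raise RuntimeError(f"expected ready frame, got {first!r}")
    return ws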
3. Audio in — client → server
Binary frames of raw PCM audio:
- Format: PCM16 little-endian, mono
- Sample rate: 16 kHz
- Frame size: 10–40 ms recommended (160–640 samples = 320–1280 bytes)
No container, no headers — just the sample bytes. Smaller frames feel more interactive; larger frames add latency but reduce overhead.
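For example, splitting a PCM16 mono buffer into 20 ms frames (320 samples = 640 bytes at 16 kHz) and streaming them, assuming an open ws from the previous step:

FRAME_BYTES = 320 * 2  # 20 ms of PCM16 mono @ 16 kHz

async def send_pcm(ws, pcm: bytes):
    for i in range(0, len(pcm), FRAME_BYTES):
        await ws.send(pcm[i : i + FRAME_BYTES])  # raw sample bytes, binary frame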
Streaming silence
If your capture is gated (e.g. push-to-talk), send ~300 ms of silence (PCM16 zeros) after the user releases. This lets the server finalise the last sentence cleanly.
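A sketch of that flush, reusing the 20 ms framing above:

SILENCE = b"\x00\x00" * 4800  # 300 ms of PCM16 zeros @ 16 kHz

async def flush_silence(ws):
    for i in range(0, len(SILENCE), 640):  # send as 20 ms frames
        await ws.send(SILENCE[i : i + 640])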
4. Audio out — server → client (speech-to-speech mode)
When audioOutputEnabled=true, binary frames of translated speech:
- Format: float32 little-endian, mono
- Sample rate: 24 kHz
- Frame size: variable; typical frame is ~80 ms
Play frames as they arrive. Each frame is a standalone chunk.
In speech-to-translated-text mode (audioOutputEnabled=false) no binary
frames are sent — transcripts only.
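Because each frame is a standalone chunk, playback can be as simple as writing each one to a matching output stream. A sketch using sounddevice:

import sounddevice as sd

out = sd.RawOutputStream(samplerate=24000, channels=1, dtype="float32")
out.start()

def on_binary_frame(frame: bytes):
    out.write(frame)  # float32 LE mono @ 24 kHz, straight to the speaker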
5. Transcripts — server → client
JSON text frames. Dispatch on type.
All transcript and lifecycle frames include:
| Field | Description |
|---|---|
| correlation_id | Groups an original transcript with its translation. |
| stream_id | Opaque segment id. |
| source_language, target_language | Current session languages. |
| detected_language | ISO-639 code the ASR recognised. Populated when sourceLanguage=auto; null otherwise. |
| timestamp | Unix seconds (float). |
original_transcript
Recognised source-language speech. text is the canonical field.
{
"type": "original_transcript",
"text": "Hello, how are you?",
"is_final": true,
"correlation_id": "seg_a8b2c3d4",
"source_language": "en",
"target_language": "es",
"detected_language": null,
"timestamp": 1770933604.048
}
translated_transcript
Translation of the current segment. text is the translated text;
source_text is the recognised source-language text that produced it.
{
"type": "translated_transcript",
"text": "Hola, ¿cómo estás?",
"source_text": "Hello, how are you?",
"is_final": true,
"translation_complete": true,
"correlation_id": "seg_a8b2c3d4",
"source_language": "en",
"target_language": "es",
"detected_language": null,
"timestamp": 1770933604.945
}
Interim vs final
- is_final: false — partial; may be revised as more audio arrives. Use for live-feedback UI.
- is_final: true — stable.
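One way to drive a live-caption UI from these flags, pairing originals with translations via correlation_id (a sketch; segments is an in-memory dict):

segments: dict[str, dict] = {}

def on_text_frame(evt: dict):
    if evt["type"] not in ("original_transcript", "translated_transcript"):
        return
    seg = segments.setdefault(evt["correlation_id"], {})
    if evt["type"] == "original_transcript":
        seg["original"] = evt["text"]  # interims overwrite until is_final
    else:
        seg["translated"] = evt["text"]
        if evt.get("is_final"):
            print(seg.get("original", ""), "->", seg["translated"])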
Lifecycle frames
| type | Meaning |
|---|---|
| translation_started | Start of a new segment. |
| translation_continuing | Keepalive during a segment. |
| translation_finished | End of a segment. |
metadata_ack
Acknowledges an update_metadata frame, echoing the fields that were applied.
{"type": "metadata_ack", "applied": {"targetLanguage": "de"}}
error
Recoverable errors during the session (unsupported voice, bad metadata, etc.). The socket stays open unless followed by a close frame.
{"type": "error", "error": "targetLanguage='xx' is not supported.", "timestamp": 1770933600.0}
6. Update metadata mid-session
Send a JSON text frame any time after ready:
{
"type": "update_metadata",
"metadata": {
"targetLanguage": "de",
"voiceType": "female",
"context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
}
}
Only the fields you send change. Unsupported configs return an error
frame and the session continues with the prior config.
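For example, switching the target language and voice mid-session (a sketch, assuming an open ws):

import json

async def set_target(ws, language: str, voice: str):
    await ws.send(json.dumps({
        "type": "update_metadata",
        "metadata": {"targetLanguage": language, "voiceType": voice},
    }))
    # Expect a metadata_ack echoing the applied fields, or an error frame
    # if the config is unsupported (the session keeps its prior config).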
Session context
Pass a free-text description of the session as context to bias
recognition and translation toward names, products, and jargon you
mention.
{
"context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
}
Helpful for:
- Proper nouns the model wouldn’t otherwise know (people, products, companies, places).
- Rare or domain-specific jargon that’s easily misheard.
Any language, up to ~2,000 characters. Send an empty string to clear. Context can be updated mid-session.
7. Close
Close the WebSocket cleanly (code 1000) when done.
Server close codes:
| Code | Meaning |
|---|---|
| 1000 | Normal closure. |
| 4400 | Bad request / malformed frame. |
| 4401 | Token expired — reallocate the session. |
| 4503 | Session ended — reallocate. |
| 4500 | Internal error. |
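A reconnect sketch with the websockets library; create_session and pump stand in for your own session-allocation call and send/receive loops:

import websockets

async def run_with_reconnect(create_session, pump):
    while True:
        ws_url = (await create_session())["ws_url"]
        try:
            async with websockets.connect(ws_url) as ws:
                await pump(ws)
            return  # clean exit
        except websockets.ConnectionClosed as e:
            code = e.rcvd.code if e.rcvd else None
            if code in (4401, 4503):
                continue  # token expired / session ended: reallocate
            raise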
Languages
- Input (sourceLanguage): 50+ languages. See Supported languages.
- Output voice (targetLanguage with audioOutputEnabled=true): 40+ languages. Requesting an unsupported target with audio output returns an error listing the current set.
- Output text only (audioOutputEnabled=false): any target language the translator supports.
Minimal client (Python)
import asyncio, json, httpx, sounddevice as sd, websockets

API_KEY = "pk_..."
BASE = "https://ws.startpinch.com"

async def main():
    # 1. Create a session and get the pre-signed WebSocket URL.
    async with httpx.AsyncClient() as http:
        r = await http.get(
            f"{BASE}/v1/session",
            params={"sourceLanguage": "en", "targetLanguage": "es", "voiceType": "male"},
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        r.raise_for_status()
        ws_url = r.json()["ws_url"]

    # 2. Open the socket and wait for the ready frame.
    async with websockets.connect(ws_url) as ws:
        ready = json.loads(await ws.recv())
        assert ready["type"] == "ready"

        async def mic_to_ws():
            # 320 samples @ 16 kHz = 20 ms PCM16 frames.
            with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16", blocksize=320) as s:
                while True:
                    # Blocking read runs in a thread so the receive loop isn't starved.
                    buf, _ = await asyncio.to_thread(s.read, 320)
                    await ws.send(bytes(buf))

        async def ws_to_out():
            out = sd.RawOutputStream(samplerate=24000, channels=1, dtype="float32")
            out.start()
            async for frame in ws:
                if isinstance(frame, (bytes, bytearray)):
                    out.write(frame)  # translated speech: float32 LE @ 24 kHz
                else:
                    evt = json.loads(frame)
                    if evt.get("type") == "translated_transcript":
                        print(evt["text"])

        await asyncio.gather(mic_to_ws(), ws_to_out())

asyncio.run(main())
Minimal client (TypeScript / Node)
import WebSocket from "ws";
const BASE = "https://ws.startpinch.com";
const key = process.env.PINCH_API_KEY!;
const r = await fetch(`${BASE}/v1/session?sourceLanguage=en&targetLanguage=es&voiceType=male`, {
headers: { Authorization: `Bearer ${key}` },
});
const { ws_url } = (await r.json()) as { ws_url: string };
const ws = new WebSocket(ws_url);
ws.on("message", (data, isBinary) => {
if (isBinary) {
// float32 LE @ 24 kHz — pipe to audio output
} else {
const evt = JSON.parse(data.toString());
if (evt.type === "translated_transcript") console.log(evt.text);
}
});
// After {type:"ready"}, send PCM16LE @ 16 kHz: ws.send(int16Buffer, { binary: true });
Implementation checklist
- HTTP GET /v1/session with Authorization: Bearer <api_key>.
- Open the returned ws_url (no extra headers).
- Wait for {"type":"ready"} before sending audio.
- Two concurrent loops: send mic frames, receive server frames.
- On each incoming frame, branch binary vs text. Parse text as JSON and dispatch on type.
- Use correlation_id to pair original and translated transcripts.
- Send ~300 ms of silence if you gate capture (push-to-talk) so the last sentence finalises.
- On close code 4401 or 4503, reallocate a new session.
- Drop oldest mic frames if the WebSocket send buffer grows; don't buffer unboundedly (sketch below).
Legacy (LiveKit) API
If you’re on the older LiveKit-based integration
(POST /api/beta1/session returning {url, token, room_name}), it
still works and isn’t going away soon. New builds should use the
WebSocket API above — it’s simpler and doesn’t require a LiveKit
client.