Real-time API Reference
Pinch real-time translation is a single WebSocket connection. You stream microphone audio in and receive translated speech and transcripts back on the same socket. No SDK — works from any language that can open a WebSocket.
The API has two modes, chosen per session:
- Speech to speech: send source-language audio, receive translated speech audio and transcripts. (audioOutputEnabled=true, default)
- Speech to translated text: send source-language audio, receive transcripts only. (audioOutputEnabled=false)
Base URL: https://ws.startpinch.com
Flow
- GET /v1/session: authenticate and get a short-lived ws_url.
- Open a WebSocket to ws_url.
- Wait for {"type":"ready"}.
- Stream mic audio as binary frames.
- Receive transcripts (JSON) and, in speech-to-speech mode, translated audio (binary) on the same socket.
- Optionally change language or voice mid-session.
- Close the socket when done.
1. Create a session
GET /v1/session
Returns a short-lived (60 s) pre-signed WebSocket URL. No further auth on the WebSocket itself.
Request headers
| Header | Value | Required |
|---|---|---|
| Authorization | Bearer <your-api-key> | Yes |
Query parameters
All query params become the initial session metadata. To change them after the session has started, see Update metadata mid-session.
| Param | Type | Default | Description |
|---|---|---|---|
| sourceLanguage | string | en | ISO-639 / BCP-47 code of the spoken audio (e.g. en, en-US). Pass auto to auto-detect. |
| targetLanguage | string | es | Language to translate into (e.g. es, de, zh). |
| voiceType | string | male | male or female. Ignored when audioOutputEnabled=false. |
| audioOutputEnabled | bool | true | true = speech-to-speech, false = speech-to-translated-text. |
| finalizeMode | string | stable | stable = multiple finals per segment as spans lock in. sentence = one final per sentence. See Finalize mode. |
| twoWay | bool | false | When false (default), only audio in sourceLanguage is transcribed and translated into targetLanguage. When true, the session listens for both languages and auto-detects per utterance, translating into whichever of the two the speaker isn’t currently using. Forces finalizeMode=sentence. |
| sourceLanguageLabel | string | (none) | Optional natural-language label that overrides how the source language is described to the translator. |
| targetLanguageLabel | string | (none) | Optional natural-language label for the target language, in the same form as sourceLanguageLabel. Use it to steer the register, formality, or persona of the translation (e.g. "professional english", "pirate-speak english", "casual spanish"). |
| context | string | (none) | Optional free-text description of the session. Biases recognition and translation toward names, products, and jargon you mention. See Session context. |
Response (200)
{
  "session_id": "ss_01JF...",
  "ws_url": "wss://ws.startpinch.com/session?token=...",
  "expires_in": 60,
  "model_version_id": "a1b2c3d4e5f6"
}
Open ws_url within expires_in seconds.
model_version_id is a stable identifier for the exact server build serving this session. Log it alongside any user-reported issue so support can map a report back to the exact build that produced it.
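As a rough illustration of the request described above, the sketch below builds the GET /v1/session call with the standard library and reads the two fields a client needs from the response. The helper names build_session_request and parse_session_response are hypothetical, and writing booleans as lowercase true/false is an assumption based on how the parameter table spells them.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://ws.startpinch.com"


def build_session_request(api_key: str, **params) -> urllib.request.Request:
    """Build (but do not send) the GET /v1/session request."""
    # Assumption: booleans are sent as lowercase "true"/"false".
    query = {
        name: str(value).lower() if isinstance(value, bool) else str(value)
        for name, value in params.items()
    }
    url = f"{BASE}/v1/session?{urllib.parse.urlencode(query)}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def parse_session_response(body: bytes) -> tuple[str, int]:
    """Return (ws_url, expires_in) from a 200 response body."""
    data = json.loads(body)
    return data["ws_url"], int(data["expires_in"])
```

Sending the request is then a matter of passing it to urllib.request.urlopen (or any HTTP client) and opening the returned ws_url before expires_in seconds have passed.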
Error responses
| Code | Meaning |
|---|---|
| 401 | Missing / invalid API key. |
| 403 | API key not entitled to real-time translation. |
| 503 | Capacity exhausted — retry after a short backoff. |
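One way to act on these status codes is sketched below. It retries only on 503 with exponential backoff and fails fast on 401 and 403. The function name, the attempt count, and the delay schedule are all assumptions; the reference only says to retry after a short backoff. The fetch argument is a hypothetical callable that performs the HTTP request and returns (status, body).

```python
import time
from typing import Callable, Tuple


def create_session_with_retry(
    fetch: Callable[[], Tuple[int, bytes]],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it returns 200, backing off on 503."""
    for attempt in range(attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status in (401, 403):
            # Retrying cannot fix a bad or unentitled key.
            raise PermissionError(f"HTTP {status}: check the API key and its entitlements")
        if status != 503:
            raise RuntimeError(f"unexpected HTTP {status}")
        if attempt < attempts - 1:
            # Assumed schedule: 0.5 s, 1 s, 2 s, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("capacity still exhausted after retries")
```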
2. Open the WebSocket
Open a WebSocket to ws_url. No additional auth headers — the URL is
pre-signed.
The first frame from the server is:
{"type": "ready", "session_id": "ss_01JF..."}
Do not send audio before ready. After ready, the socket is fully
bidirectional.
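A small guard for this handshake might look like the following. The helper name expect_ready is hypothetical; it only checks what the reference states, namely that the first frame is a JSON text frame with type ready.

```python
import json


def expect_ready(first_frame) -> str:
    """Validate the first server frame and return its session_id."""
    if isinstance(first_frame, (bytes, bytearray)):
        raise RuntimeError("expected a JSON text frame before any binary audio")
    msg = json.loads(first_frame)
    if msg.get("type") != "ready":
        raise RuntimeError(f"expected ready, got {msg.get('type')!r}")
    return msg["session_id"]
```

Call it on the first message you receive, and start the microphone stream only after it returns.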
3. Audio in — client → server
Binary frames of raw PCM audio:
- Format: PCM16 little-endian, mono
- Sample rate: 16 kHz
- Frame size: 10–40 ms recommended (160–640 samples = 320–1280 bytes)
No container, no headers — just the sample bytes. Smaller frames feel more interactive; larger frames add latency but reduce overhead.
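The frame-size arithmetic above can be checked with a short sketch. The names frame_size_bytes and split_frames are hypothetical helpers; the constants come from the format listed above (16 kHz, 16-bit mono).

```python
SAMPLE_RATE = 16_000   # Hz
BYTES_PER_SAMPLE = 2   # PCM16, mono


def frame_size_bytes(frame_ms: int) -> int:
    """Size in bytes of one frame lasting frame_ms milliseconds."""
    return SAMPLE_RATE * frame_ms // 1000 * BYTES_PER_SAMPLE


def split_frames(pcm: bytes, frame_ms: int = 20):
    """Yield consecutive frames of raw PCM; the last one may be shorter."""
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

For example, frame_size_bytes(10) is 320 and frame_size_bytes(40) is 1280, matching the recommended range, and one second of audio splits into fifty 20 ms frames of 640 bytes each.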
Push-to-talk
Two ways to commit the current snippet when the user releases the mic:
- Recommended: send {"type":"finalize"}. The server flushes the current segment, emits is_final transcripts (and translated audio, in speech-to-speech mode), and keeps the session open so the next press starts a fresh segment. See Finalize current segment.
- Legacy: send ~300 ms of silence (PCM16 zeros) after release. Works without the explicit signal but adds noticeable tail latency.
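For the legacy path, the trailing silence is just zero bytes in the input format. A minimal sketch, assuming a hypothetical helper named pcm16_silence and the 16 kHz mono PCM16 input described earlier:

```python
def pcm16_silence(duration_ms: int = 300, sample_rate: int = 16_000) -> bytes:
    """Return duration_ms of silent PCM16 mono audio (all-zero samples)."""
    samples = sample_rate * duration_ms // 1000
    return bytes(samples * 2)  # 2 bytes per sample
```

At the defaults this is 4 800 samples, or 9 600 bytes, which a client would send as one or more binary frames after the user releases the mic.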
4. Audio out — server → client (speech-to-speech mode)
When audioOutputEnabled=true, binary frames of translated speech:
- Format: float32 little-endian, mono
- Sample rate: 24 kHz
- Frame size: variable; typical frame is ~80 ms
Play frames as they arrive. Each frame is a standalone chunk.
In speech-to-translated-text mode (audioOutputEnabled=false) no binary
frames are sent — transcripts only.
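Some playback paths expect 16-bit PCM rather than float32. The sketch below converts one received frame with the standard library and computes its duration. It assumes samples lie in the range -1.0 to 1.0 (the reference does not state the range), and both function names are hypothetical.

```python
import struct


def float32le_to_pcm16le(frame: bytes) -> bytes:
    """Convert float32 little-endian mono samples to PCM16 little-endian."""
    count = len(frame) // 4
    samples = struct.unpack(f"<{count}f", frame[:count * 4])
    # Assumption: samples are nominally in [-1.0, 1.0]; clip anything outside.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return struct.pack(f"<{count}h", *(int(s * 32767) for s in clipped))


def frame_duration_ms(frame: bytes, sample_rate: int = 24_000) -> float:
    """Duration of a float32 mono frame in milliseconds."""
    return len(frame) / 4 * 1000 / sample_rate
```

A typical ~80 ms frame at 24 kHz is 1 920 samples, or 7 680 bytes.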
5. Transcripts — server → client
JSON text frames. Dispatch on type.
All transcript and lifecycle frames include:
| Field | Description |
|---|---|
| correlation_id | Groups an original transcript with its translation. |
| source_language, target_language | Current session languages. |
| detected_language | ISO-639 code recognised from the audio. Populated when sourceLanguage=auto; null otherwise. |
| timestamp | Unix seconds (float). |
original_transcript
Recognised source-language speech. text is the canonical field.
{
  "type": "original_transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": "en",
  "timestamp": 1770933604.048
}
translated_transcript
Translation of the current segment. text is the translated text (same
as translated_text); original_text is the recognised source-language
text that produced it.
{
  "type": "translated_transcript",
  "text": "Hola, ¿cómo estás?",
  "translated_text": "Hola, ¿cómo estás?",
  "original_text": "Hello, how are you?",
  "is_final": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": "en",
  "timestamp": 1770933604.945
}
Interim vs final
- is_final: false (partial): may be revised as more audio arrives. Use for live-feedback UI.
- is_final: true (stable): never revised. Cadence depends on finalizeMode (see Finalize mode).
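A live-caption UI typically keeps the latest text per segment and pairs each original with its translation through correlation_id. The class below is one possible sketch of that bookkeeping; SegmentBoard and its field names are hypothetical.

```python
import json


class SegmentBoard:
    """Latest original and translated text per correlation_id."""

    def __init__(self):
        self.segments = {}

    def handle(self, text_frame: str):
        evt = json.loads(text_frame)
        side = {
            "original_transcript": "original",
            "translated_transcript": "translated",
        }.get(evt.get("type"))
        if side is None:
            return None  # not a transcript frame
        seg = self.segments.setdefault(evt["correlation_id"], {})
        # Each new frame replaces the previous text for that side of the segment.
        seg[side] = {"text": evt["text"], "final": evt["is_final"]}
        return seg
```

Rendering interim entries in a lighter style and final ones in a normal style is a common way to show which text may still change.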
metadata_ack
{"type": "metadata_ack", "applied": {"targetLanguage": "de"}}
error
Recoverable errors during the session (unsupported voice, bad metadata, etc.). The socket stays open unless followed by a close frame.
{"type": "error", "error": "targetLanguage='xx' is not supported.", "timestamp": 1770933600.0}
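Since every JSON frame carries a type, a table of handlers keyed by type is one simple way to dispatch. The sketch below is illustrative only; dispatch and the handler table are hypothetical names, and unknown types are ignored so that new frame types do not break the client.

```python
import json


def dispatch(text_frame: str, handlers: dict) -> bool:
    """Route one JSON frame to handlers[type]; return False if unhandled."""
    evt = json.loads(text_frame)
    handler = handlers.get(evt.get("type"))
    if handler is None:
        return False
    handler(evt)
    return True


# Example wiring: collect recoverable errors and keep the socket open.
errors = []
handlers = {
    "error": lambda evt: errors.append(evt["error"]),
    "metadata_ack": lambda evt: None,
}
```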
6. Update metadata mid-session
To change targetLanguage, voiceType, context, or
other session parameters mid-stream, the current recommended pattern
is to send {"type":"finalize"}, close the WebSocket, and open a
fresh session with the new query params. A first-class
update_metadata frame is on the roadmap but not yet live — if a
client sends one today, the server acknowledges with metadata_ack
but does not apply runtime changes.
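The reconnect pattern can be sketched as follows. The function restart_with is hypothetical, ws is assumed to expose async send and close methods (as the websockets library does), and open_session is a hypothetical coroutine that performs GET /v1/session with the given params and returns a connected socket. The sketch follows the order given above; a real client would probably also read the remaining is_final frames before closing, which is not shown.

```python
import json


async def restart_with(ws, open_session, params: dict, **changes):
    """Finalize and close the current session, then open one with updated params."""
    new_params = {**params, **changes}
    # Commit whatever has been spoken so far.
    await ws.send(json.dumps({"type": "finalize"}))
    # Assumption: close immediately; draining pending finals is left out here.
    await ws.close(code=1000)
    new_ws = await open_session(new_params)
    return new_ws, new_params
```

A call such as restart_with(ws, open_session, params, targetLanguage="de") leaves every other parameter unchanged.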
Session context
Pass a free-text description of the session as context to bias
recognition and translation toward names, products, and jargon you
mention.
{
  "context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
}
Helpful for:
- Proper nouns the model wouldn’t otherwise know (people, products, companies, places).
- Rare or domain-specific jargon that’s easily misheard.
Context may be written in any language and can be up to ~2 000 characters long. Pass an empty string to clear it. To change context mid-session, follow the pattern in Update metadata mid-session.
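If the names and jargon come from structured data (a meeting invite, a product glossary), the context string can be assembled in code. The sketch below is one possible shape; build_context is a hypothetical helper, and treating 2 000 as a hard limit is an assumption, since the reference gives the limit only approximately.

```python
CONTEXT_LIMIT = 2000  # assumption: the reference says "up to ~2 000 characters"


def build_context(title: str, people: list[str], terms: list[str]) -> str:
    """Join a title, a list of people, and a list of terms into one context string."""
    parts = [title.strip().rstrip(".")]
    if people:
        parts.append(", ".join(people))
    if terms:
        parts.append("Topics: " + ", ".join(terms))
    context = ". ".join(p for p in parts if p)
    if context:
        context += "."
    if len(context) > CONTEXT_LIMIT:
        raise ValueError(f"context is {len(context)} characters; limit is about {CONTEXT_LIMIT}")
    return context
```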
Finalize current segment
Send {"type":"finalize"} to commit whatever the user has spoken so
far without closing the session. Useful for push-to-talk: the moment
the user releases the button, fire one finalize and the server snaps
the segment shut.
{"type": "finalize"}
What happens server-side:
- The server waits a short grace period (~300 ms) for any audio frames still in flight to land, so your client doesn’t need any tail buffering or trailing silence for PTT.
- The current segment is force-finalized: you receive original_transcript and translated_transcript frames with is_final: true (and translated speech audio, when audioOutputEnabled=true).
- The session stays open; the next audio frame you send starts a fresh segment with a new correlation_id.
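On the client, the behaviour above reduces to forwarding audio while the button is held and sending one finalize on release. A minimal sketch, assuming a hypothetical PushToTalk class and a ws object with an async send method:

```python
import json


class PushToTalk:
    """Forward mic frames while pressed; send one finalize on release."""

    def __init__(self, ws):
        self.ws = ws
        self.pressed = False

    def press(self):
        self.pressed = True

    async def on_mic_frame(self, pcm: bytes):
        # Frames captured while the button is up are dropped.
        if self.pressed:
            await self.ws.send(pcm)

    async def release(self):
        # Guard against duplicate release events sending a second finalize.
        if self.pressed:
            self.pressed = False
            await self.ws.send(json.dumps({"type": "finalize"}))
```

No trailing silence is sent, since the server-side grace period covers frames still in flight.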
Finalize mode
Controls when is_final: true is emitted. Set via the finalizeMode
query param.
- stable (default): multiple finals per segment, each a prefix-extension of the previous, emitted as soon as a span is locked in. Lowest time-to-first-stable-text.
- sentence: one final per sentence boundary. Best for transcript logging or downstream NLP.
twoWay=true always uses sentence. Partials (is_final: false)
stream in both modes.
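In stable mode a client that appends text to a log may want only the newly locked span from each final. The sketch below reads "prefix-extension" to mean that each final for a correlation_id starts with the previous final for that id; that reading, and the StableFinals name, are assumptions. If a final does not extend the previous one, the whole text is returned.

```python
class StableFinals:
    """Track the last final per correlation_id and return only the new part."""

    def __init__(self):
        self._last = {}

    def newly_locked(self, evt: dict) -> str:
        if not evt.get("is_final"):
            return ""  # partials never lock text in
        cid = evt["correlation_id"]
        previous = self._last.get(cid, "")
        current = evt["text"]
        self._last[cid] = current
        if current.startswith(previous):
            return current[len(previous):]
        return current  # fallback: not an extension of the previous final
```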
7. Close
Close the WebSocket cleanly (code 1000) when done.
Server close codes:
| Code | Meaning |
|---|---|
| 1000 | Normal closure. |
| 4400 | Bad request / malformed frame. |
| 4401 | Token expired — reallocate the session. |
| 4503 | Session ended — reallocate. |
| 4500 | Internal error. |
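A client can map these codes to a next step with a small lookup. The action names below are hypothetical, and treating 4500 and any unlisted code as "retry later" is an assumption; the table above only gives meanings.

```python
CLOSE_ACTIONS = {
    1000: "done",         # normal closure
    4400: "fix_client",   # malformed frame: retrying the same input will not help
    4401: "new_session",  # token expired: call GET /v1/session again
    4503: "new_session",  # session ended: call GET /v1/session again
    4500: "retry_later",  # assumption: back off, then create a new session
}


def action_for_close(code: int) -> str:
    """Suggest what the client should do after the server closes the socket."""
    return CLOSE_ACTIONS.get(code, "retry_later")
```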
Languages
- Input (sourceLanguage): 50+ languages. See Supported languages.
- Output voice (targetLanguage with audioOutputEnabled=true): 40+ languages. Requesting an unsupported target with audio output returns an error listing the current set.
- Output text only (audioOutputEnabled=false): any target language the translator supports.
Minimal client (Python)
import asyncio
import json

import httpx
import sounddevice as sd
import websockets

API_KEY = "pk_..."
BASE = "https://ws.startpinch.com"


async def main():
    # 1. Create a session and get the short-lived, pre-signed WebSocket URL.
    async with httpx.AsyncClient() as http:
        r = await http.get(
            f"{BASE}/v1/session",
            params={"sourceLanguage": "en", "targetLanguage": "es", "voiceType": "male"},
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        r.raise_for_status()
        ws_url = r.json()["ws_url"]

    async with websockets.connect(ws_url) as ws:
        # 2. Wait for the ready frame before sending any audio.
        ready = json.loads(await ws.recv())
        if ready.get("type") != "ready":
            raise RuntimeError(f"expected ready, got {ready!r}")

        async def mic_to_ws():
            # 320 samples = 20 ms of PCM16 mono at 16 kHz.
            with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16", blocksize=320) as s:
                while True:
                    # s.read blocks, so run it in a worker thread to keep the event loop free.
                    buf, _ = await asyncio.to_thread(s.read, 320)
                    await ws.send(bytes(buf))

        async def ws_to_out():
            # Translated speech arrives as float32 mono at 24 kHz.
            with sd.RawOutputStream(samplerate=24000, channels=1, dtype="float32") as out:
                async for frame in ws:
                    if isinstance(frame, (bytes, bytearray)):
                        await asyncio.to_thread(out.write, frame)
                    else:
                        evt = json.loads(frame)
                        if evt.get("type") == "translated_transcript":
                            print(evt["text"])

        await asyncio.gather(mic_to_ws(), ws_to_out())


asyncio.run(main())
Minimal client (TypeScript / Node)
import WebSocket from "ws";

const BASE = "https://ws.startpinch.com";
const key = process.env.PINCH_API_KEY!;

// 1. Create a session; the returned ws_url is short-lived.
const r = await fetch(`${BASE}/v1/session?sourceLanguage=en&targetLanguage=es&voiceType=male`, {
  headers: { Authorization: `Bearer ${key}` },
});
if (!r.ok) throw new Error(`GET /v1/session failed: HTTP ${r.status}`);
const { ws_url } = (await r.json()) as { ws_url: string };

// 2. Open the pre-signed URL; no extra auth headers are needed.
const ws = new WebSocket(ws_url);
ws.on("message", (data, isBinary) => {
  if (isBinary) {
    // float32 LE @ 24 kHz: pipe to audio output
    return;
  }
  const evt = JSON.parse(data.toString());
  if (evt.type === "ready") {
    // Safe to start streaming PCM16LE @ 16 kHz: ws.send(int16Buffer, { binary: true });
  } else if (evt.type === "translated_transcript") {
    console.log(evt.text);
  }
});