Real-time API Reference
Pinch real-time translation is a single WebSocket connection. You stream microphone audio in and receive translated speech and transcripts back on the same socket. No SDK — works from any language that can open a WebSocket.
The API has two modes, chosen per session:
- Speech to speech — send source-language audio, receive translated speech audio and transcripts. (audioOutputEnabled=true, default)
- Speech to translated text — send source-language audio, receive transcripts only. (audioOutputEnabled=false)
Base URL: https://ws.startpinch.com
Flow
- GET /v1/session — authenticate, get a short-lived ws_url.
- Open a WebSocket to ws_url.
- Wait for {"type":"ready"}.
- Stream mic audio as binary frames.
- Receive transcripts (JSON) and, in speech-to-speech mode, translated audio (binary) on the same socket.
- Optionally change language or voice mid-session.
- Close the socket when done.
1. Create a session
GET /v1/session
Returns a short-lived (60 s) pre-signed WebSocket URL. No further auth on the WebSocket itself.
Request headers
| Header | Value | Required |
|---|---|---|
| Authorization | Bearer <your-api-key> | Yes |
Query parameters
All query params become the initial session metadata and can be updated mid-session.
| Param | Type | Default | Description |
|---|---|---|---|
| sourceLanguage | string | en | ISO-639 / BCP-47 code of spoken audio (e.g. en, en-US). Pass auto to auto-detect. |
| targetLanguage | string | es | Language to translate into (e.g. es, de, zh). |
| voiceType | string | male | male or female. Ignored when audioOutputEnabled=false. |
| audioOutputEnabled | bool | true | true = speech-to-speech, false = speech-to-translated-text. |
| context | string | — | Optional free-text description of the session. Biases recognition and translation toward names, products, and jargon you mention. See Session context. |
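A minimal sketch of this request in Python, assuming httpx and an API key in a (hypothetical) PINCH_API_KEY environment variable:

import os
import httpx

# Create a session; all query params become the initial session metadata.
resp = httpx.get(
    "https://ws.startpinch.com/v1/session",
    params={"sourceLanguage": "en", "targetLanguage": "es", "voiceType": "male"},
    headers={"Authorization": f"Bearer {os.environ['PINCH_API_KEY']}"},
)
resp.raise_for_status()
session = resp.json()
ws_url = session["ws_url"]  # open within session["expires_in"] (60) seconds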
Response (200)
{
"session_id": "ss_01JF...",
"ws_url": "wss://ws.startpinch.com/session?token=...",
"expires_in": 60
}
Open ws_url within expires_in seconds.
Error responses
| Code | Meaning |
|---|---|
| 401 | Missing / invalid API key. |
| 403 | API key not entitled to real-time translation. |
| 503 | Capacity exhausted — retry after a short backoff. |
2. Open the WebSocket
Open a WebSocket to ws_url. No additional auth headers — the URL is
pre-signed.
The first frame from the server is:
{"type": "ready", "session_id": "ss_01JF..."}
Do not send audio before ready. After ready, the socket is fully
bidirectional.
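Continuing the sketch with the websockets library, open the socket and block until the ready frame arrives:

import json
import websockets

async def open_session(ws_url: str):
    # The URL is pre-signed; no extra auth headers are needed.
    ws = await websockets.connect(ws_url)
    first = json.loads(await ws.recv())
    if first.get("type") != "ready":
        raise RuntimeError(f"expected ready frame, got {first!r}")
    return ws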
3. Audio in — client → server
Binary frames of raw PCM audio:
- Format: PCM16 little-endian, mono
- Sample rate: 16 kHz
- Frame size: 10–40 ms recommended (160–640 samples = 320–1280 bytes)
No container, no headers — just the sample bytes. Smaller frames feel more interactive; larger frames add latency but reduce overhead.
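For example, splitting a PCM16 mono buffer into 20 ms frames (320 samples = 640 bytes at 16 kHz) and streaming them, assuming an open ws from the previous step:

FRAME_BYTES = 320 * 2  # 20 ms of PCM16 mono @ 16 kHz

async def send_pcm(ws, pcm: bytes):
    for i in range(0, len(pcm), FRAME_BYTES):
        await ws.send(pcm[i : i + FRAME_BYTES])  # raw sample bytes, binary frame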
Streaming silence
If your capture is gated (e.g. push-to-talk), send ~300 ms of silence (PCM16 zeros) after the user releases. This lets the server finalise the last sentence cleanly.
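A sketch of that flush, reusing the 20 ms framing above:

SILENCE = b"\x00\x00" * 4800  # 300 ms of PCM16 zeros @ 16 kHz

async def flush_silence(ws):
    for i in range(0, len(SILENCE), 640):  # send as 20 ms frames
        await ws.send(SILENCE[i : i + 640])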
4. Audio out — server → client (speech-to-speech mode)
When audioOutputEnabled=true, binary frames of translated speech:
- Format: float32 little-endian, mono
- Sample rate: 24 kHz
- Frame size: variable; typical frame is ~80 ms
Play frames as they arrive. Each frame is a standalone chunk.
In speech-to-translated-text mode (audioOutputEnabled=false) no binary
frames are sent — transcripts only.
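Because each frame is a standalone chunk, playback can be as simple as writing each one to a matching output stream. A sketch using sounddevice:

import sounddevice as sd

out = sd.RawOutputStream(samplerate=24000, channels=1, dtype="float32")
out.start()

def on_binary_frame(frame: bytes):
    out.write(frame)  # float32 LE mono @ 24 kHz, straight to the speaker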
5. Transcripts — server → client
JSON text frames. Dispatch on type.
All transcript and lifecycle frames include:
| Field | Description |
|---|---|
| correlation_id | Groups an original transcript with its translation. |
| stream_id | Opaque segment id. |
| source_language, target_language | Current session languages. |
| detected_language | ISO-639 code the ASR recognised. Populated when sourceLanguage=auto; null otherwise. |
| timestamp | Unix seconds (float). |
original_transcript
Recognised source-language speech. text is the canonical field.
{
"type": "original_transcript",
"text": "Hello, how are you?",
"is_final": true,
"correlation_id": "seg_a8b2c3d4",
"source_language": "en",
"target_language": "es",
"detected_language": null,
"timestamp": 1770933604.048
}
translated_transcript
Translation of the current segment. text is the translated text;
source_text is the recognised source-language text that produced it.
{
"type": "translated_transcript",
"text": "Hola, ¿cómo estás?",
"source_text": "Hello, how are you?",
"is_final": true,
"translation_complete": true,
"correlation_id": "seg_a8b2c3d4",
"source_language": "en",
"target_language": "es",
"detected_language": null,
"timestamp": 1770933604.945
}
Interim vs final
- is_final: false — partial; may be revised as more audio arrives. Use for live-feedback UI.
- is_final: true — stable.
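One way to drive a live-caption UI from these flags, pairing originals with translations via correlation_id (a sketch; segments is an in-memory dict):

segments: dict[str, dict] = {}

def on_text_frame(evt: dict):
    if evt["type"] not in ("original_transcript", "translated_transcript"):
        return
    seg = segments.setdefault(evt["correlation_id"], {})
    if evt["type"] == "original_transcript":
        seg["original"] = evt["text"]  # interims overwrite until is_final
    else:
        seg["translated"] = evt["text"]
        if evt.get("is_final"):
            print(seg.get("original", ""), "->", seg["translated"])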
Lifecycle frames
| type | Meaning |
|---|---|
| translation_started | Start of a new segment. |
| translation_continuing | Keepalive during a segment. |
| translation_finished | End of a segment. |
metadata_ack
Acknowledges an update_metadata frame, echoing the fields that were applied.
{"type": "metadata_ack", "applied": {"targetLanguage": "de"}}
error
Recoverable errors during the session (unsupported voice, bad metadata, etc.). The socket stays open unless followed by a close frame.
{"type": "error", "error": "targetLanguage='xx' is not supported.", "timestamp": 1770933600.0}
6. Update metadata mid-session
Send a JSON text frame any time after ready:
{
"type": "update_metadata",
"metadata": {
"targetLanguage": "de",
"voiceType": "female",
"context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
}
}
Only the fields you send change. Unsupported configs return an error
frame and the session continues with the prior config.
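For example, switching the target language and voice mid-session (a sketch, assuming an open ws):

import json

async def set_target(ws, language: str, voice: str):
    await ws.send(json.dumps({
        "type": "update_metadata",
        "metadata": {"targetLanguage": language, "voiceType": voice},
    }))
    # Expect a metadata_ack echoing the applied fields, or an error frame
    # if the config is unsupported (the session keeps its prior config).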
Session context
Pass a free-text description of the session as context to bias
recognition and translation toward names, products, and jargon you
mention.
{
"context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
}
Helpful for:
- Proper nouns the model wouldn’t otherwise know (people, products, companies, places).
- Rare or domain-specific jargon that’s easily misheard.
Any language, up to ~2,000 characters. Send an empty string to clear. Context can be updated mid-session.
7. Close
Close the WebSocket cleanly (code 1000) when done.
Server close codes:
| Code | Meaning |
|---|---|
| 1000 | Normal closure. |
| 4400 | Bad request / malformed frame. |
| 4401 | Token expired — reallocate the session. |
| 4503 | Session ended — reallocate. |
| 4500 | Internal error. |
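A reconnect sketch with the websockets library; create_session and pump stand in for your own session-allocation call and send/receive loops:

import websockets

async def run_with_reconnect(create_session, pump):
    while True:
        ws_url = (await create_session())["ws_url"]
        try:
            async with websockets.connect(ws_url) as ws:
                await pump(ws)
            return  # clean exit
        except websockets.ConnectionClosed as e:
            code = e.rcvd.code if e.rcvd else None
            if code in (4401, 4503):
                continue  # token expired / session ended: reallocate
            raise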
Languages
- Input (sourceLanguage): 50+ languages. See Supported languages.
- Output voice (targetLanguage with audioOutputEnabled=true): 40+ languages. Requesting an unsupported target with audio output returns an error listing the current set.
- Output text only (audioOutputEnabled=false): any target language the translator supports.
Minimal client (Python)
import asyncio, json, httpx, sounddevice as sd, websockets

API_KEY = "pk_..."
BASE = "https://ws.startpinch.com"

async def main():
    # 1. Create a session and get the pre-signed WebSocket URL.
    async with httpx.AsyncClient() as http:
        r = await http.get(
            f"{BASE}/v1/session",
            params={"sourceLanguage": "en", "targetLanguage": "es", "voiceType": "male"},
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        r.raise_for_status()
        ws_url = r.json()["ws_url"]

    # 2. Open the socket and wait for the ready frame.
    async with websockets.connect(ws_url) as ws:
        ready = json.loads(await ws.recv())
        assert ready["type"] == "ready"

        async def mic_to_ws():
            # 320 samples @ 16 kHz = 20 ms PCM16 frames.
            with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16", blocksize=320) as s:
                while True:
                    # Blocking read runs in a thread so the receive loop isn't starved.
                    buf, _ = await asyncio.to_thread(s.read, 320)
                    await ws.send(bytes(buf))

        async def ws_to_out():
            out = sd.RawOutputStream(samplerate=24000, channels=1, dtype="float32")
            out.start()
            async for frame in ws:
                if isinstance(frame, (bytes, bytearray)):
                    out.write(frame)  # translated speech: float32 LE @ 24 kHz
                else:
                    evt = json.loads(frame)
                    if evt.get("type") == "translated_transcript":
                        print(evt["text"])

        await asyncio.gather(mic_to_ws(), ws_to_out())

asyncio.run(main())
Minimal client (TypeScript / Node)
import WebSocket from "ws";
const BASE = "https://ws.startpinch.com";
const key = process.env.PINCH_API_KEY!;
const r = await fetch(`${BASE}/v1/session?sourceLanguage=en&targetLanguage=es&voiceType=male`, {
headers: { Authorization: `Bearer ${key}` },
});
const { ws_url } = (await r.json()) as { ws_url: string };
const ws = new WebSocket(ws_url);
ws.on("message", (data, isBinary) => {
if (isBinary) {
// float32 LE @ 24 kHz — pipe to audio output
} else {
const evt = JSON.parse(data.toString());
if (evt.type === "translated_transcript") console.log(evt.text);
}
});
// After {type:"ready"}, send PCM16LE @ 16 kHz: ws.send(int16Buffer, { binary: true });
Implementation checklist
- HTTP GET /v1/session with Authorization: Bearer <api_key>.
- Open the returned ws_url (no extra headers).
- Wait for {"type":"ready"} before sending audio.
- Two concurrent loops: send mic frames, receive server frames.
- On each incoming frame, branch binary vs text. Parse text as JSON and dispatch on type.
- Use correlation_id to pair original and translated transcripts.
- Send ~300 ms of silence if you gate capture (push-to-talk) so the last sentence finalises.
- On close code 4401 or 4503, reallocate a new session.
- Drop oldest mic frames if the WebSocket send buffer grows; don't buffer unboundedly (sketch below).
Legacy (LiveKit) API
If you’re on the older LiveKit-based integration
(POST /api/beta1/session returning {url, token, room_name}), it
still works and isn’t going away soon. New builds should use the
WebSocket API above — it’s simpler and doesn’t require a LiveKit
client.