Real-time API Reference

Pinch real-time translation is a single WebSocket connection. You stream microphone audio in and receive translated speech and transcripts back on the same socket. No SDK — works from any language that can open a WebSocket.

The API has two modes, chosen per session:

  • Speech to speech — send source-language audio, receive translated speech audio and transcripts. (audioOutputEnabled=true, default)
  • Speech to translated text — send source-language audio, receive transcripts only. (audioOutputEnabled=false)

Base URL: https://ws.startpinch.com


Flow

  1. GET /v1/session — authenticate, get a short-lived ws_url.
  2. Open a WebSocket to ws_url.
  3. Wait for {"type":"ready"}.
  4. Stream mic audio as binary frames.
  5. Receive transcripts (JSON) and, in speech-to-speech mode, translated audio (binary) on the same socket.
  6. Optionally change language or voice mid-session.
  7. Close the socket when done.

1. Create a session

GET /v1/session

Returns a short-lived (60 s) pre-signed WebSocket URL. No further auth on the WebSocket itself.

Request headers

Header          Value                    Required
Authorization   Bearer <your-api-key>    Yes

Query parameters

All query params become the initial session metadata and can be updated mid-session.

Param                Type     Default   Description
sourceLanguage       string   en        ISO-639 / BCP-47 code of the spoken audio (e.g. en, en-US). Pass auto to auto-detect.
targetLanguage       string   es        Language to translate into (e.g. es, de, zh).
voiceType            string   male      male or female. Ignored when audioOutputEnabled=false.
audioOutputEnabled   bool     true      true = speech-to-speech, false = speech-to-translated-text.
context              string   (none)    Optional free-text description of the session. Biases recognition and translation toward names, products, and jargon you mention. See Session context.
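
For example, requesting a session with auto-detect and a context hint (a sketch using httpx; the context string is illustrative):

import httpx

resp = httpx.get(
    "https://ws.startpinch.com/v1/session",
    params={
        "sourceLanguage": "auto",
        "targetLanguage": "de",
        "voiceType": "female",
        "context": "Quarterly all-hands for a SaaS company.",
    },
    headers={"Authorization": "Bearer pk_..."},
)
resp.raise_for_status()
ws_url = resp.json()["ws_url"]  # open within expires_in seconds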

Response (200)

{
  "session_id": "ss_01JF...",
  "ws_url":     "wss://ws.startpinch.com/session?token=...",
  "expires_in": 60
}

Open ws_url within expires_in seconds.

Error responses

Code   Meaning
401    Missing / invalid API key.
403    API key not entitled to real-time translation.
503    Capacity exhausted — retry after a short backoff.

2. Open the WebSocket

Open a WebSocket to ws_url. No additional auth headers — the URL is pre-signed.

The first frame from the server is:

{"type": "ready", "session_id": "ss_01JF..."}

Do not send audio before ready. After ready, the socket is fully bidirectional.


3. Audio in — client → server

Binary frames of raw PCM audio:

  • Format: PCM16 little-endian, mono
  • Sample rate: 16 kHz
  • Frame size: 10–40 ms recommended (160–640 samples = 320–1280 bytes)

No container, no headers — just the sample bytes. Smaller frames feel more interactive; larger frames add latency but reduce overhead.
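
As a sanity check on frame sizing, here is the arithmetic for a 20 ms frame (a sketch):

SAMPLE_RATE = 16_000                        # Hz, PCM16 mono
FRAME_MS = 20                               # within the recommended 10-40 ms
samples = SAMPLE_RATE * FRAME_MS // 1000    # 320 samples
frame_bytes = samples * 2                   # 640 bytes (2 bytes per sample)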

Streaming silence

If your capture is gated (e.g. push-to-talk), send ~300 ms of silence (PCM16 zeros) after the user releases. This lets the server finalise the last sentence cleanly.
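
A minimal sketch of that flush, assuming an already-open websockets connection ws:

# 300 ms of PCM16 zeros at 16 kHz: 0.3 s * 16000 samples/s * 2 bytes/sample.
SILENCE = b"\x00" * int(0.3 * 16000) * 2

async def flush_segment(ws):
    # Send in 20 ms frames (640 bytes) so it looks like normal capture.
    frame_bytes = 640
    for i in range(0, len(SILENCE), frame_bytes):
        await ws.send(SILENCE[i:i + frame_bytes])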


4. Audio out — server → client (speech-to-speech mode)

When audioOutputEnabled=true, binary frames of translated speech:

  • Format: float32 little-endian, mono
  • Sample rate: 24 kHz
  • Frame size: variable; typical frame is ~80 ms

Play frames as they arrive. Each frame is a standalone chunk.

In speech-to-translated-text mode (audioOutputEnabled=false) no binary frames are sent — transcripts only.
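
A sketch of decoding and playing one output frame, assuming numpy and sounddevice (neither is required by the API; any float32-capable audio path works):

import numpy as np
import sounddevice as sd

out = sd.OutputStream(samplerate=24_000, channels=1, dtype="float32")
out.start()

def play_frame(frame: bytes):
    # Raw float32 little-endian mono samples at 24 kHz.
    out.write(np.frombuffer(frame, dtype="<f4"))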


5. Transcripts — server → client

JSON text frames. Dispatch on type.

All transcript and lifecycle frames include:

Field                              Description
correlation_id                     Groups an original transcript with its translation.
stream_id                          Opaque segment id.
source_language, target_language   Current session languages.
detected_language                  ISO-639 code the ASR recognised. Populated when sourceLanguage=auto; null otherwise.
timestamp                          Unix seconds (float).

original_transcript

Recognised source-language speech. text is the canonical field.

{
  "type": "original_transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": null,
  "timestamp": 1770933604.048
}

translated_transcript

Translation of the current segment. text is the translated text; source_text is the recognised source-language text that produced it.

{
  "type": "translated_transcript",
  "text": "Hola, ¿cómo estás?",
  "source_text": "Hello, how are you?",
  "is_final": true,
  "translation_complete": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": null,
  "timestamp": 1770933604.945
}

Interim vs final

  • is_final: false — partial, may be revised as more audio arrives. Use for live-feedback UI.
  • is_final: true — stable.
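
One way to pair finals by correlation_id (a sketch; handle_frame and the pairs dict are illustrative, not part of the API):

import json

pairs: dict[str, dict] = {}  # correlation_id -> {"original": ..., "translated": ...}

def handle_frame(raw: str):
    evt = json.loads(raw)
    if not evt.get("is_final"):
        return  # interim; use for live-feedback UI only
    entry = pairs.setdefault(evt["correlation_id"], {})
    if evt["type"] == "original_transcript":
        entry["original"] = evt["text"]
    elif evt["type"] == "translated_transcript":
        entry["translated"] = evt["text"]
        print(entry.get("original"), "->", evt["text"])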

Lifecycle frames

type                     Meaning
translation_started      Start of a new segment.
translation_continuing   Keepalive during a segment.
translation_finished     End of a segment.

metadata_ack

{"type": "metadata_ack", "applied": {"targetLanguage": "de"}}

error

Recoverable errors during the session (unsupported voice, bad metadata, etc.). The socket stays open unless followed by a close frame.

{"type": "error", "error": "targetLanguage='xx' is not supported.", "timestamp": 1770933600.0}

6. Update metadata mid-session

Send a JSON text frame any time after ready:

{
  "type": "update_metadata",
  "metadata": {
    "targetLanguage": "de",
    "voiceType": "female",
    "context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
  }
}

Only the fields you send change. Unsupported configs return an error frame and the session continues with the prior config.
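
A sketch of sending an update (assumes the websockets client from the examples below; in a real client, watch for metadata_ack inside your main receive loop, since transcript frames arrive interleaved):

import json

async def update_metadata(ws, **fields):
    await ws.send(json.dumps({"type": "update_metadata", "metadata": fields}))

# e.g. await update_metadata(ws, targetLanguage="de", voiceType="female")
# then expect {"type": "metadata_ack", "applied": {...}} or an error frame.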


Session context

Pass a free-text description of the session as context to bias recognition and translation toward names, products, and jargon you mention.

{
  "context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
}

Helpful for:

  • Proper nouns the model wouldn’t otherwise know (people, products, companies, places).
  • Rare or domain-specific jargon that’s easily misheard.

Any language. Up to ~2,000 characters. Send an empty string to clear. Context can be updated mid-session.


7. Close

Close the WebSocket cleanly (code 1000) when done.

Server close codes:

Code   Meaning
1000   Normal closure.
4400   Bad request / malformed frame.
4401   Token expired — reallocate the session.
4503   Session ended — reallocate.
4500   Internal error.
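
On 4401 or 4503, allocate a fresh session and reconnect. A sketch using the websockets library (create_session and run_session are illustrative placeholders):

import websockets

async def run_with_reallocation(create_session, run_session):
    while True:
        ws_url = await create_session()              # GET /v1/session
        try:
            async with websockets.connect(ws_url) as ws:
                await run_session(ws)
                return                               # clean exit
        except websockets.exceptions.ConnectionClosed as e:
            code = e.rcvd.code if e.rcvd else None
            if code in (4401, 4503):
                continue                             # expired / ended: reallocate
            raise                                    # 4400 / 4500: surface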

Languages

  • Input (sourceLanguage): 50+ languages. See Supported languages.
  • Output voice (targetLanguage with audioOutputEnabled=true): 40+ languages. Requesting an unsupported target with audio output returns an error listing the current set.
  • Output text only (audioOutputEnabled=false): any target language the translator supports.

Minimal client (Python)

import asyncio, json, httpx, sounddevice as sd, websockets

API_KEY = "pk_..."
BASE = "https://ws.startpinch.com"

async def main():
    async with httpx.AsyncClient() as http:
        r = await http.get(
            f"{BASE}/v1/session",
            params={"sourceLanguage": "en", "targetLanguage": "es", "voiceType": "male"},
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        r.raise_for_status()
        ws_url = r.json()["ws_url"]

    async with websockets.connect(ws_url) as ws:
        # Wait for the ready frame before sending any audio.
        ready = json.loads(await ws.recv())
        assert ready["type"] == "ready"

        async def mic_to_ws():
            # Stream 20 ms PCM16 mono frames at 16 kHz.
            with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16", blocksize=320) as s:
                while True:
                    # s.read() blocks; run it off the event loop so
                    # ws_to_out keeps receiving.
                    buf, _ = await asyncio.to_thread(s.read, 320)
                    await ws.send(bytes(buf))

        async def ws_to_out():
            # Play translated audio (float32 mono @ 24 kHz), print transcripts.
            out = sd.RawOutputStream(samplerate=24000, channels=1, dtype="float32")
            out.start()
            async for frame in ws:
                if isinstance(frame, (bytes, bytearray)):
                    await asyncio.to_thread(out.write, frame)  # blocking write, off the loop
                else:
                    evt = json.loads(frame)
                    if evt.get("type") == "translated_transcript":
                        print(evt["text"])

        await asyncio.gather(mic_to_ws(), ws_to_out())

asyncio.run(main())

Minimal client (TypeScript / Node)

import WebSocket from "ws";

const BASE = "https://ws.startpinch.com";
const key = process.env.PINCH_API_KEY!;

// Create a session: returns a short-lived pre-signed ws_url.
const r = await fetch(`${BASE}/v1/session?sourceLanguage=en&targetLanguage=es&voiceType=male`, {
  headers: { Authorization: `Bearer ${key}` },
});
if (!r.ok) throw new Error(`session request failed: ${r.status}`);
const { ws_url } = (await r.json()) as { ws_url: string };

const ws = new WebSocket(ws_url);
ws.on("message", (data, isBinary) => {
  if (isBinary) {
    // float32 LE @ 24 kHz — pipe to audio output
  } else {
    const evt = JSON.parse(data.toString());
    if (evt.type === "translated_transcript") console.log(evt.text);
  }
});
// After {type:"ready"}, send PCM16LE @ 16 kHz: ws.send(int16Buffer, { binary: true });

Implementation checklist

  • HTTP GET /v1/session with Authorization: Bearer <api_key>.
  • Open the returned ws_url (no extra headers).
  • Wait for {"type":"ready"} before sending audio.
  • Two concurrent loops: send mic frames, receive server frames.
  • On each incoming frame, branch binary vs text. Parse text as JSON and dispatch on type.
  • Use correlation_id to pair original and translated transcripts.
  • Send ~300 ms of silence if you gate capture (push-to-talk) so the last sentence finalises.
  • On close code 4401 or 4503, reallocate a new session.
  • Drop oldest mic frames if the WebSocket send buffer grows; don’t buffer unboundedly.
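
A sketch of that drop-oldest send buffer (names illustrative; sized for roughly one second of 20 ms frames):

import asyncio

mic_queue: asyncio.Queue[bytes] = asyncio.Queue(maxsize=50)

def enqueue_frame(frame: bytes) -> None:
    # If the socket can't keep up, drop the oldest frame instead of
    # buffering unboundedly.
    if mic_queue.full():
        mic_queue.get_nowait()
    mic_queue.put_nowait(frame)

async def sender(ws):
    while True:
        await ws.send(await mic_queue.get())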

Legacy (LiveKit) API

If you’re on the older LiveKit-based integration (POST /api/beta1/session returning {url, token, room_name}), it still works and isn’t going away soon. New builds should use the WebSocket API above — it’s simpler and doesn’t require a LiveKit client.