Real-time API Reference

Pinch real-time translation is a single WebSocket connection. You stream microphone audio in and receive translated speech and transcripts back on the same socket. No SDK — works from any language that can open a WebSocket.

The API has two modes, chosen per session:

  • Speech to speech — send source-language audio, receive translated speech audio and transcripts. (audioOutputEnabled=true, default)
  • Speech to translated text — send source-language audio, receive transcripts only. (audioOutputEnabled=false)

Base URL: https://ws.startpinch.com


Flow

  1. GET /v1/session — authenticate, get a short-lived ws_url.
  2. Open a WebSocket to ws_url.
  3. Wait for {"type":"ready"}.
  4. Stream mic audio as binary frames.
  5. Receive transcripts (JSON) and, in speech-to-speech mode, translated audio (binary) on the same socket.
  6. Optionally change language or voice by finalizing and reconnecting with new query params (see Update metadata mid-session).
  7. Close the socket when done.

1. Create a session

GET /v1/session

Returns a short-lived (60 s) pre-signed WebSocket URL. No further auth on the WebSocket itself.

Request headers

Header | Value | Required
Authorization | Bearer <your-api-key> | Yes

Query parameters

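All query params become the initial session metadata. Runtime updates are not applied yet; to change a value mid-session, reconnect with new params as described in Update metadata mid-session.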

Param | Type | Default | Description
sourceLanguage | string | en | ISO-639 / BCP-47 code of the spoken audio (e.g. en, en-US). Pass auto to auto-detect.
targetLanguage | string | es | Language to translate into (e.g. es, de, zh).
voiceType | string | male | male or female. Ignored when audioOutputEnabled=false.
audioOutputEnabled | bool | true | true = speech-to-speech, false = speech-to-translated-text.
finalizeMode | string | stable | stable = multiple finals per segment as spans lock in. sentence = one final per sentence. See Finalize mode.
twoWay | bool | false | When false (default), only audio in sourceLanguage is transcribed and translated into targetLanguage. When true, the session listens for both languages and auto-detects per utterance, translating into whichever of the two the speaker isn’t currently using. Forces finalizeMode=sentence.
sourceLanguageLabel | string | | Optional natural-language label that overrides how the source language is described to the translator.
targetLanguageLabel | string | | Optional natural-language label for the target language, in the same form as sourceLanguageLabel. Use it to steer the register, formality, or persona of the translation (e.g. "professional english", "pirate-speak english", "casual spanish").
context | string | | Optional free-text description of the session. Biases recognition and translation toward names, products, and jargon you mention. See Session context.

Response (200)

{
  "session_id":       "ss_01JF...",
  "ws_url":           "wss://ws.startpinch.com/session?token=...",
  "expires_in":       60,
  "model_version_id": "a1b2c3d4e5f6"
}

Open ws_url within expires_in seconds.

model_version_id identifies the exact server build serving this session. Log it alongside any user-reported issue so support can trace the report back to that build.

Error responses

Code | Meaning
401 | Missing / invalid API key.
403 | API key not entitled to real-time translation.
503 | Capacity exhausted; retry after a short backoff.
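
A minimal sketch of step 1 using only the Python standard library. The endpoint, header, and parameter names come from the tables above; the helper names (build_session_url, retry_delay, create_session) and the backoff values are assumptions for illustration. It also assumes booleans are sent as lowercase true/false, as written in the parameter table.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

BASE = "https://ws.startpinch.com"


def build_session_url(**params):
    """Build the GET /v1/session URL; booleans become 'true'/'false'."""
    query = {
        key: (str(value).lower() if isinstance(value, bool) else value)
        for key, value in params.items()
        if value is not None
    }
    return f"{BASE}/v1/session?{urllib.parse.urlencode(query)}"


def retry_delay(status, attempt):
    """Seconds to wait before retrying, or None if a retry will not help.

    Only 503 (capacity exhausted) is retried; 401 and 403 need a different key.
    The 0.5 s base and 8 s cap are illustrative values, not documented ones.
    """
    if status != 503:
        return None
    return min(0.5 * (2 ** attempt), 8.0)


def create_session(api_key, max_attempts=4, **params):
    """Call GET /v1/session and return the parsed JSON body."""
    request = urllib.request.Request(
        build_session_url(**params),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(request) as response:
                return json.load(response)
        except urllib.error.HTTPError as err:
            delay = retry_delay(err.code, attempt)
            if delay is None or attempt == max_attempts - 1:
                raise
            time.sleep(delay)
```

Keeping the URL builder and the retry policy as separate pure functions makes both easy to check without touching the network.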

2. Open the WebSocket

Open a WebSocket to ws_url. No additional auth headers — the URL is pre-signed.

The first frame from the server is:

{"type": "ready", "session_id": "ss_01JF..."}

Do not send audio before ready. After ready, the socket is fully bidirectional.


3. Audio in — client → server

Binary frames of raw PCM audio:

  • Format: PCM16 little-endian, mono
  • Sample rate: 16 kHz
  • Frame size: 10–40 ms recommended (160–640 samples = 320–1280 bytes)

No container, no headers — just the sample bytes. Smaller frames feel more interactive; larger frames add latency but reduce overhead.
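
The frame-size arithmetic above can be sketched as follows. The function names are hypothetical; the sample rate and sample width are the ones listed in this section.

```python
SAMPLE_RATE_HZ = 16_000  # input sample rate from the spec above
BYTES_PER_SAMPLE = 2     # PCM16, mono


def frame_size_bytes(frame_ms):
    """Bytes in one mono PCM16 frame of the given duration."""
    return SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE


def split_into_frames(pcm, frame_ms=20):
    """Yield fixed-size frames; a shorter trailing frame is yielded as-is."""
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

frame_size_bytes(10) gives 320 and frame_size_bytes(40) gives 1280, matching the recommended range. Each chunk from split_into_frames would be sent as one binary WebSocket frame.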

Push-to-talk

Two ways to commit the current snippet when the user releases the mic:

  • Recommended: send {"type":"finalize"}. The server flushes the current segment, emits is_final transcripts (and translated audio, in speech-to-speech mode), and keeps the session open so the next press starts a fresh segment. See Finalize current segment.
  • Legacy: send ~300 ms of silence (PCM16 zeros) after release. Works without the explicit signal but adds noticeable tail latency.
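
A small sketch of both release strategies. The helper names are hypothetical, and send stands for whatever send function your WebSocket library exposes; only the finalize message and the 300 ms silence figure come from the text above.

```python
import json

SAMPLE_RATE_HZ = 16_000


def finalize_frame():
    """Text frame that asks the server to commit the current segment."""
    return json.dumps({"type": "finalize"})


def silence_tail(duration_ms=300):
    """Legacy alternative: PCM16 zeros covering duration_ms of audio."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return bytes(samples * 2)  # 2 bytes per PCM16 sample, all zero


def on_mic_release(send, legacy=False):
    """Call when the push-to-talk button is released."""
    send(silence_tail() if legacy else finalize_frame())
```

With the defaults, the legacy path sends 9600 bytes (4800 zero samples) as a single binary frame, while the recommended path sends one short JSON text frame.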

4. Audio out — server → client (speech-to-speech mode)

When audioOutputEnabled=true, binary frames of translated speech:

  • Format: float32 little-endian, mono
  • Sample rate: 24 kHz
  • Frame size: variable; typical frame is ~80 ms

Play frames as they arrive. Each frame is a standalone chunk.

In speech-to-translated-text mode (audioOutputEnabled=false) no binary frames are sent — transcripts only.
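
If your audio sink expects 16-bit samples rather than float32, the incoming frames need converting first. The sketch below uses only the standard library; the function names are hypothetical, and it assumes samples lie in the nominal range -1.0 to 1.0, which this reference does not state explicitly.

```python
import struct

OUTPUT_RATE_HZ = 24_000  # output sample rate from the spec above


def float32le_to_samples(frame):
    """Decode a binary frame of little-endian float32 samples into floats."""
    count = len(frame) // 4
    return list(struct.unpack(f"<{count}f", frame[:count * 4]))


def samples_to_pcm16(samples):
    """Clip to [-1.0, 1.0] (assumed range) and pack as little-endian PCM16."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def frame_duration_ms(frame):
    """Playback duration of one float32 mono frame in milliseconds."""
    return len(frame) // 4 * 1000 / OUTPUT_RATE_HZ
```

A typical ~80 ms frame at 24 kHz therefore carries 1920 samples, or 7680 bytes.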


5. Transcripts — server → client

JSON text frames. Dispatch on type.

All transcript and lifecycle frames include:

Field | Description
correlation_id | Groups an original transcript with its translation.
source_language, target_language | Current session languages.
detected_language | ISO-639 code recognised from the audio. Populated when sourceLanguage=auto; null otherwise.
timestamp | Unix seconds (float).

original_transcript

Recognised source-language speech. text is the canonical field.

{
  "type": "original_transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": "en",
  "timestamp": 1770933604.048
}

translated_transcript

Translation of the current segment. text is the translated text (same as translated_text); original_text is the recognised source-language text that produced it.

{
  "type": "translated_transcript",
  "text": "Hola, ¿cómo estás?",
  "translated_text": "Hola, ¿cómo estás?",
  "original_text": "Hello, how are you?",
  "is_final": true,
  "correlation_id": "seg_a8b2c3d4",
  "source_language": "en",
  "target_language": "es",
  "detected_language": "en",
  "timestamp": 1770933604.945
}

Interim vs final

  • is_final: false — partial, may be revised as more audio arrives. Use for live-feedback UI.
  • is_final: true — stable; never revised. Cadence depends on finalizeMode (see Finalize mode).
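
One way to drive a live UI from these frames is to keep the latest text per correlation_id and overwrite it as new frames arrive. The class below is a sketch, not a prescribed client design: the class and method names are hypothetical, and it assumes a later frame of the same type and correlation_id supersedes the earlier one (interim frames are revisions, and stable finals extend the previous final).

```python
import json


class TranscriptBoard:
    """Keeps the latest original and translated text for each segment."""

    def __init__(self):
        # correlation_id -> {"original": {...}, "translated": {...}}
        self.segments = {}

    def handle(self, raw_frame):
        """Apply one JSON text frame; return the updated segment or None."""
        evt = json.loads(raw_frame)
        slot = {
            "original_transcript": "original",
            "translated_transcript": "translated",
        }.get(evt.get("type"))
        if slot is None:
            return None  # not a transcript frame
        segment = self.segments.setdefault(evt["correlation_id"], {})
        segment[slot] = {"text": evt["text"], "is_final": evt["is_final"]}
        return segment

    def finalized(self):
        """Segments where both sides currently hold final text."""
        return {
            cid: seg
            for cid, seg in self.segments.items()
            if all(seg.get(k, {}).get("is_final") for k in ("original", "translated"))
        }
```

Note that with finalizeMode=stable a segment can appear in finalized() and still receive further, longer finals later.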

metadata_ack

{"type": "metadata_ack", "applied": {"targetLanguage": "de"}}

error

Recoverable errors during the session (unsupported voice, bad metadata, etc.). The socket stays open unless followed by a close frame.

{"type": "error", "error": "targetLanguage='xx' is not supported.", "timestamp": 1770933600.0}
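
Since every JSON frame carries a type and audio arrives as binary, a single dispatch function can route everything on the socket. This is a sketch with hypothetical names; the "audio" and "unknown" handler keys are conventions of this example, not protocol values.

```python
import json


def dispatch(frame, handlers):
    """Route one WebSocket frame to a handler chosen by frame kind or JSON type."""
    if isinstance(frame, (bytes, bytearray)):
        key, payload = "audio", frame  # binary frame: translated speech
    else:
        payload = json.loads(frame)    # text frame: JSON event
        key = payload.get("type")
    handler = handlers.get(key) or handlers.get("unknown")
    return handler(payload) if handler else None
```

Routing unrecognised types to a fallback handler, rather than raising, keeps a client working if the server starts sending frame types it does not know about.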

6. Update metadata mid-session

To change targetLanguage, voiceType, context, or other session parameters mid-stream, the current recommended pattern is to send {"type":"finalize"}, close the WebSocket, and open a fresh session with the new query params. A first-class update_metadata frame is on the roadmap but not yet live — if a client sends one today, the server acknowledges with metadata_ack but does not apply runtime changes.
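
The finalize, close, and reconnect pattern can be sketched as below. Everything here is illustrative: switch_session, its arguments, and the socket interface (send and close methods) are assumptions, and the network calls are injected as callables so the ordering is visible on its own. The optional drain hook is also an assumption; it reflects that you will probably want to read the last is_final frames before closing.

```python
import json


def switch_session(ws, params, changes, create_session, connect, drain=None):
    """Finalize and close the current socket, then open a session with new params.

    ws             -- object with send(str) and close() (hypothetical interface)
    create_session -- callable(params) -> dict containing "ws_url"
    connect        -- callable(ws_url) -> new socket
    drain          -- optional callable(ws) that reads remaining final frames
    """
    ws.send(json.dumps({"type": "finalize"}))
    if drain is not None:
        drain(ws)  # collect the last is_final frames before closing
    ws.close()
    new_params = {**params, **changes}  # changed keys override the old ones
    session = create_session(new_params)
    return connect(session["ws_url"]), new_params
```

A call such as switch_session(ws, params, {"targetLanguage": "de"}, create_session, connect) would return the new socket together with the merged parameter set to remember for the next switch.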


Session context

Pass a free-text description of the session as context to bias recognition and translation toward names, products, and jargon you mention.

{
  "context": "Acme Corp Q3 all-hands. CEO Priya Ramanathan, CFO Marco Bianchi. Topics: ARR, churn, EMEA expansion."
}

Helpful for:

  • Proper nouns the model wouldn’t otherwise know (people, products, companies, places).
  • Rare or domain-specific jargon that’s easily misheard.

Context can be written in any language, up to ~2 000 characters. An empty string clears it. To change context mid-session, reconnect with the new value (see Update metadata mid-session).


Finalize current segment

Send {"type":"finalize"} to commit whatever the user has spoken so far without closing the session. Useful for push-to-talk: the moment the user releases the button, fire one finalize and the server snaps the segment shut.

{"type": "finalize"}

What happens server-side:

  1. The server waits a short grace period (~300 ms) for any audio frames still in flight to land — your client doesn’t need any tail buffering or trailing silence for PTT.
  2. The current segment is force-finalized — you receive original_transcript and translated_transcript frames with is_final: true (and translated speech audio, when audioOutputEnabled=true).
  3. The session stays open; the next audio frame you send starts a fresh segment with a new correlation_id.

Finalize mode

Controls when is_final: true is emitted. Set via the finalizeMode query param.

  • stable (default) — multiple finals per segment, each a prefix-extension of the previous, emitted as soon as a span is locked in. Lowest time-to-first-stable-text.
  • sentence — one final per sentence boundary. Best for transcript logging or downstream NLP.

twoWay=true always uses sentence. Partials (is_final: false) stream in both modes.
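
To make the two rules above concrete, here is a small sketch with hypothetical helper names. The first function mirrors the twoWay override; the second shows how a client in stable mode might extract only the newly locked text, relying on the statement that each final is a prefix-extension of the previous one.

```python
def effective_finalize_mode(requested="stable", two_way=False):
    """Mode the session actually uses: twoWay=true always means sentence."""
    return "sentence" if two_way else requested


def newly_locked_text(previous_final, current_final):
    """In stable mode, return the part of current_final added since previous_final."""
    if not current_final.startswith(previous_final):
        raise ValueError("current final does not extend the previous final")
    return current_final[len(previous_final):]
```

For example, going from "Hello," to "Hello, how are you?" yields " how are you?", which is the only text a log or UI needs to append.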


7. Close

Close the WebSocket cleanly (code 1000) when done.

Server close codes:

Code | Meaning
1000 | Normal closure.
4400 | Bad request / malformed frame.
4401 | Token expired; reallocate the session.
4503 | Session ended; reallocate.
4500 | Internal error.
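
A client can turn these codes into a reconnect decision. In the sketch below, only the "reallocate" outcome for 4401 and 4503 comes from the table; the function name, the action labels, and the handling suggested for 4400 and 4500 are assumptions.

```python
def action_for_close(code):
    """Map a server close code to a suggested client action."""
    if code == 1000:
        return "done"          # normal closure, nothing to do
    if code in (4401, 4503):
        return "reallocate"    # call GET /v1/session again and reconnect
    if code == 4400:
        return "fix_client"    # assumption: a malformed frame is a client bug
    if code == 4500:
        return "retry_later"   # assumption: back off before reallocating
    return "unknown"
```

Returning a label rather than reconnecting directly keeps the policy separate from the socket code, so it can be adjusted without touching the streaming loop.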

Languages

  • Input (sourceLanguage): 50+ languages. See Supported languages.
  • Output voice (targetLanguage with audioOutputEnabled=true): 40+ languages. Requesting an unsupported target with audio output returns an error listing the current set.
  • Output text only (audioOutputEnabled=false): any target language the translator supports.

Minimal client (Python)

import asyncio, json, httpx, sounddevice as sd, websockets

API_KEY = "pk_..."
BASE = "https://ws.startpinch.com"
FRAME_SAMPLES = 320  # 20 ms of audio at 16 kHz

async def main():
    # 1. Create a session and get the pre-signed WebSocket URL.
    async with httpx.AsyncClient() as http:
        r = await http.get(
            f"{BASE}/v1/session",
            params={"sourceLanguage": "en", "targetLanguage": "es", "voiceType": "male"},
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        r.raise_for_status()
        ws_url = r.json()["ws_url"]

    # 2. Connect and wait for the ready frame before sending any audio.
    async with websockets.connect(ws_url) as ws:
        ready = json.loads(await ws.recv())
        if ready.get("type") != "ready":
            raise RuntimeError(f"expected ready frame, got {ready}")

        async def mic_to_ws():
            # Stream PCM16 mono frames; read in a worker thread so the
            # blocking call does not stall the event loop.
            with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16", blocksize=FRAME_SAMPLES) as s:
                while True:
                    buf, _ = await asyncio.to_thread(s.read, FRAME_SAMPLES)
                    await ws.send(bytes(buf))

        async def ws_to_out():
            # Binary frames are float32 audio at 24 kHz; text frames are JSON events.
            with sd.RawOutputStream(samplerate=24000, channels=1, dtype="float32") as out:
                async for frame in ws:
                    if isinstance(frame, (bytes, bytearray)):
                        await asyncio.to_thread(out.write, frame)
                    else:
                        evt = json.loads(frame)
                        if evt.get("type") == "translated_transcript":
                            print(evt["text"])

        await asyncio.gather(mic_to_ws(), ws_to_out())

asyncio.run(main())

Minimal client (TypeScript / Node)

import WebSocket from "ws";

const BASE = "https://ws.startpinch.com";
const key = process.env.PINCH_API_KEY!;

// 1. Create a session (top-level await requires an ES module).
const r = await fetch(`${BASE}/v1/session?sourceLanguage=en&targetLanguage=es&voiceType=male`, {
  headers: { Authorization: `Bearer ${key}` },
});
if (!r.ok) throw new Error(`session request failed: ${r.status}`);
const { ws_url } = (await r.json()) as { ws_url: string };

// 2. Open the pre-signed URL; no extra auth headers are needed.
const ws = new WebSocket(ws_url);
ws.on("message", (data, isBinary) => {
  if (isBinary) {
    // float32 LE @ 24 kHz: pipe to audio output
  } else {
    const evt = JSON.parse(data.toString());
    if (evt.type === "translated_transcript") console.log(evt.text);
  }
});
// After {type:"ready"}, send PCM16LE @ 16 kHz: ws.send(int16Buffer, { binary: true });