How Pinch works

High-level flow

Think of Pinch as a live translation session.

1) You create a session

You tell Pinch what you want:

source language (what you’ll speak)
target language (what you want to hear)

Pinch returns a session you can connect to.
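
As a rough sketch of the inputs in step 1, the snippet below models a session request as a small Python dataclass. The class name `SessionRequest`, the field names `source_language` and `target_language`, and the `to_payload` helper are assumptions made for illustration, not the actual Pinch SDK; only the `voice_type` setting is named in this guide.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionRequest:
    """Hypothetical container for the settings a session needs."""

    source_language: str  # what you'll speak, e.g. "en"
    target_language: str  # what you want to hear, e.g. "es"
    voice_type: Optional[str] = None  # e.g. "clone"; left unset here

    def to_payload(self) -> dict:
        """Validate the request and return it as a plain dict."""
        if not self.source_language or not self.target_language:
            raise ValueError("both source and target language are required")
        if self.source_language == self.target_language:
            raise ValueError("source and target language should differ")
        return asdict(self)


request = SessionRequest(source_language="en", target_language="es")
payload = request.to_payload()
```

Validating before connecting keeps obvious mistakes (a missing or identical language) from ever reaching the network.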

2) You stream audio in

Audio is sent in small chunks (frames).

Pinch processes it as it arrives.
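
To make "small chunks" concrete, here is a minimal sketch that slices raw audio into fixed-length frames. The audio format is an assumption (16 kHz, 16-bit mono PCM, 20 ms per frame); this guide does not state which format or frame size Pinch expects, and `iter_frames` is a hypothetical helper.

```python
from typing import Iterator


def iter_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20,
                bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield fixed-size frames from mono PCM audio; the last one may be shorter."""
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    if frame_bytes <= 0:
        raise ValueError("frame size must be positive")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


# One second of 16-bit mono silence at 16 kHz (assumed format).
one_second = bytes(16000 * 2)
frames = list(iter_frames(one_second))
```

With these assumed settings each frame is 640 bytes, so one second of audio becomes 50 frames that can be sent as soon as they are captured.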

3) Pinch streams results back

Pinch can stream back:

translated speech audio (what you’ll play)
status events
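
A minimal sketch of handling the two kinds of results: audio goes to a playback buffer and status events go to a log. The message shape (a dict with `type`, `data`, and `event` keys) and the event names are invented for this example; the real wire format may differ.

```python
def handle_message(message: dict, playback_buffer: bytearray, status_log: list) -> None:
    """Route one incoming message by its (hypothetical) "type" field."""
    kind = message.get("type")
    if kind == "audio":
        # Translated speech: queue the bytes for playback.
        playback_buffer.extend(message["data"])
    elif kind == "status":
        # Status event: record it so the UI can react.
        status_log.append(message["event"])
    # Unknown types are ignored so new event kinds do not break playback.


playback = bytearray()
statuses = []
incoming = [
    {"type": "status", "event": "session_started"},
    {"type": "audio", "data": b"\x00\x01"},
    {"type": "audio", "data": b"\x02\x03"},
    {"type": "status", "event": "session_ended"},
]
for msg in incoming:
    handle_message(msg, playback, statuses)
```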

4) You play the audio and stop when you’re done

When the user stops the session, the connection is closed cleanly.
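
One common way to guarantee a clean close is a context manager, sketched below with an in-memory stand-in. `FakeSession` is not a Pinch client; it only illustrates the pattern of closing the connection even when sending fails partway through.

```python
class FakeSession:
    """In-memory stand-in for a live connection (not a real Pinch client)."""

    def __init__(self) -> None:
        self.closed = False
        self.sent = []

    def send(self, frame: bytes) -> None:
        if self.closed:
            raise RuntimeError("session is closed")
        self.sent.append(frame)

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "FakeSession":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Always close, whether the block finished normally or raised.
        self.close()
        return False  # do not swallow exceptions


with FakeSession() as session:
    for frame in (b"a", b"b"):
        session.send(frame)
```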

Figure: Pinch API flow (original audio in, translation, translated audio out)

Real-time voice cloning

When voice_type="clone" is set, Pinch tries to return translated speech in the speaker’s voice, adapting tone and timbre as the user talks.

This is best when you care about:

keeping the speaker identity consistent
tone and vibe carrying across languages
live translation that still feels like “the same person”

Latency basics (why it can feel slow sometimes)

Live translation isn’t a single step. It’s usually:

listen a bit → understand → translate → generate speech → stream out

Whatever model handles these steps, we have to wait some amount of time before translating, because sentence structure differs between languages. For example, some languages put the verb at the beginning of a sentence and others put it at the end, so the translation sometimes can’t start until most of the sentence has been heard.

So latency depends primarily on:

language pair complexity
rate of speech (if you pause, as you would for a human interpreter, the model may understand and translate a bit faster)
your network
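
Because the stages run one after another, their delays add up. The sketch below sums a per-stage budget and reports the largest contributor. The stage names follow the pipeline above, but every number is a made-up placeholder, not a measurement of Pinch.

```python
def total_latency_ms(stages: dict) -> int:
    """Add up per-stage delays for a strictly sequential pipeline."""
    return sum(stages.values())


# Placeholder values for illustration only; measure your own setup.
stages_ms = {
    "listen": 400,
    "understand": 150,
    "translate": 200,
    "generate_speech": 250,
    "stream_out": 100,
}

total = total_latency_ms(stages_ms)
slowest = max(stages_ms, key=stages_ms.get)
print(f"total: {total} ms, largest stage: {slowest}")
```

Logging a breakdown like this per session helps show whether a slow experience comes from the language pair, the speaking pace, or the network.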