Talking avatars & lip-sync

Make a character speak on PonPon: how lip-sync drives a face from an audio track with Kling 3.0, where the voice comes from, a worked example, source tips, and pairing with dubbing.

A talking avatar is a character whose mouth moves in time with speech. It takes two ingredients: a face (an image or clip of a person) and a voice (an audio track). Lip-sync ties them together so the character looks like it's actually saying the words.

The model that does it

On PonPon, lip-sync runs on Kling 3.0, which is built for dialogue. Its dedicated lip-sync capability drives a character's mouth from an audio track, so a still portrait or a clip can deliver a line convincingly. For a full worked example, see the lip-sync video use case.

When your spokesperson appears across several shots, keep the same face from cut to cut with Kling 3.0 multi-shot storytelling and the multi-shot character consistency workflow.

Where the voice comes from

The audio that drives the lips can come from any of three sources:

Text to speech — type a script and generate a voice. Best when you're writing the line from scratch.
Dubbing — translate an existing line into another language, then lip-sync the face to match it.
An upload — your own recorded voice.

How it works

Choose Kling 3.0 in the video generator.
Provide the character — a clear portrait or a short clip.
Provide the voice — generated or uploaded audio.
Generate. The model matches the mouth (and natural micro-movements) to the speech.

A worked example

Say you want a spokesperson to introduce a product:

In text to speech, generate the line: *"Meet the new Aero — lighter, faster, yours."*
Upload a clean, front-facing portrait of your spokesperson (real or AI-generated).
Run both through Kling 3.0 lip-sync.

Out comes a short clip of that face delivering the line. Write the script in short sentences and generate each one separately, so you can re-roll a single weak line instead of the whole take.
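If you draft a longer script outside PonPon, a small helper can break it into sentence-sized lines before you generate audio for each one. The following is a minimal sketch, not a PonPon feature: the function names and the 12-word threshold are assumptions chosen for illustration, and the sentence splitting is a simple punctuation rule that will not handle abbreviations such as "Dr." correctly.

```python
import re


def split_script(script: str) -> list[str]:
    """Split a script into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", script.strip())
    return [part.strip() for part in parts if part.strip()]


def long_lines(lines: list[str], max_words: int = 12) -> list[str]:
    """Return the lines that exceed max_words (an assumed, adjustable limit)."""
    return [line for line in lines if len(line.split()) > max_words]


if __name__ == "__main__":
    script = "Meet the new Aero. It is lighter and faster. It's yours!"
    lines = split_script(script)
    for number, line in enumerate(lines, start=1):
        print(number, line)
    # Lines flagged here are candidates for trimming before generating audio.
    print(long_lines(lines, max_words=4))
```

Each resulting line can then be generated and lip-synced on its own, which keeps a re-roll limited to the one sentence that needs it.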

Source tips

Use a front-facing portrait with the mouth clearly visible; profiles and extreme angles sync poorly.
Keep the audio clean: one speaker, minimal background noise.
Match the energy of the delivery to the face; a calm portrait reading an excited line looks off.
Keep lines short. A few tight sentences sync more reliably than one long monologue.

Lip-sync vs dubbing

They're complementary:

Dubbing changes the language of the audio but leaves the picture alone.
Lip-sync changes the mouth in the picture to match whatever audio you give it.

Localizing a talking-head video? Dub the audio into the target language, then lip-sync the face to the dubbed track — the result looks natively recorded. For the audio side end to end, see Voiceover and audio basics.
