Talking avatars & lip-sync
Make a character speak on PonPon: how lip-sync drives a face from an audio track with Kling 3.0, where the voice comes from, a worked example, source tips, and pairing with dubbing.
A talking avatar is a character whose mouth moves in time with speech. It takes two ingredients: a face (an image or clip of a person) and a voice (an audio track). Lip-sync ties them together so the character looks like it's actually saying the words.
The model that does it
On PonPon, lip-sync runs on Kling 3.0, which is built for dialogue. Its dedicated lip-sync capability drives a character's mouth from an audio track, so a still portrait or a clip can deliver a line convincingly. For a full worked example, see the lip-sync video use case.
When your spokesperson appears across several shots, keep the same face from cut to cut with Kling 3.0 multi-shot storytelling and the multi-shot character consistency workflow.
Where the voice comes from
The audio that drives the lips can come from any of three sources:
- Text to speech — type a script and generate a voice. Best when you're writing the line from scratch.
- Dubbing — translate an existing line into another language, then lip-sync the face to match it.
- An upload — your own recorded voice.
How it works
- Choose Kling 3.0 in the video generator.
- Provide the character — a clear portrait or a short clip.
- Provide the voice — generated or uploaded audio.
- Generate. The model matches the mouth (and natural micro-movements) to the speech.
A worked example
Say you want a spokesperson to introduce a product:
- In text to speech, generate the line: *"Meet the new Aero — lighter, faster, yours."*
- Upload a clean, front-facing portrait of your spokesperson (real or AI-generated).
- Run both through Kling 3.0 lip-sync.
Out comes a short clip of that face delivering the line. If you write the script as short sentences and generate each one separately, you can re-roll a single weak line instead of redoing the whole take.
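If you draft a longer script outside PonPon, one way to prepare it for line-by-line generation is to split it into sentences first. The sketch below is a local helper only, not a PonPon feature or API; the function name `split_script` and the sample script are made up for illustration.

```python
import re


def split_script(script: str) -> list[str]:
    """Split a script into sentences at '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", script.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    # Hypothetical three-sentence script for a product intro.
    script = "Meet the new Aero. It is lighter and faster. Order yours today!"
    for number, line in enumerate(split_script(script), start=1):
        print(number, line)
```

Each returned sentence can then be generated as its own voice line. This simple rule also splits after abbreviations such as "Dr.", so check the result by eye before generating.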
Source tips
- Use a front-facing face with the mouth clearly visible — profiles and extreme angles sync poorly.
- Keep the audio clean: one speaker, minimal background noise.
- Match the energy of the delivery to the face; a calm portrait reading an excited line looks off.
- Keep lines short. A few tight sentences sync more reliably than one long monologue.
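For uploaded recordings, the "keep lines short" tip can be checked before you upload. The sketch below uses Python's standard `wave` module to measure a WAV file's duration. The 15-second default is an assumed threshold chosen for illustration, not a documented PonPon or Kling 3.0 limit, and the helper names are hypothetical.

```python
import io
import wave


def wav_seconds(source) -> float:
    """Return the duration in seconds of a WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_line(source, max_seconds: float = 15.0) -> bool:
    """True if the clip is at most max_seconds long (an assumed threshold)."""
    return wav_seconds(source) <= max_seconds


if __name__ == "__main__":
    # Build two seconds of silent mono audio in memory to try the check.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 16-bit samples
        out.setframerate(8000)
        out.writeframes(b"\x00\x00" * 16000)
    buffer.seek(0)
    print(is_short_line(buffer))
```

This only measures length. It cannot tell whether the recording has one speaker or background noise, so those still need a listen.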
Lip-sync vs dubbing
They're complementary:
- Dubbing changes the language of the audio but leaves the picture alone.
- Lip-sync changes the mouth in the picture to match whatever audio you give it.
Localizing a talking-head video? Dub the audio into the target language, then lip-sync the face to the dubbed track — the result looks natively recorded. For the audio side end to end, see Voiceover and audio basics.
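When localizing into several languages, the order matters for each one: dub first, then lip-sync to the dubbed track. The sketch below only models that ordering as a checklist. It calls no PonPon API, and the `Step` and `localization_plan` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str      # "dubbing" or "lip-sync"
    language: str  # target language for this step


def localization_plan(languages: list[str]) -> list[Step]:
    """For each target language, dub the audio first, then lip-sync to the dubbed track."""
    plan = []
    for language in languages:
        plan.append(Step("dubbing", language))
        plan.append(Step("lip-sync", language))
    return plan


if __name__ == "__main__":
    for step in localization_plan(["Spanish", "Japanese"]):
        print(f"{step.language}: {step.tool}")
```

Lip-syncing before dubbing would match the mouth to the original language, which is why every dubbing step precedes its lip-sync step.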
Related articles
- AI dubbing: Dub a video or audio clip into another language with AI on PonPon. Covers 31 target languages, how dubbing differs from voiceover, a worked example, source prep, and pairing with lip-sync.
- Voiceover & audio: The PonPon audio studio, with text-to-speech, voice changer, dubbing into 31 languages, sound effects, music, and multi-voice dialogue, powered by ElevenLabs and MiniMax.
- Text-to-video basics: How video generation works on PonPon: text-to-video vs image-to-video, choosing models like Veo 3.1, Sora 2 and Kling 3.0, and the Edit and Motion Control tabs.