Prompting for video
A practical method for AI video prompts on PonPon: shot structure, the camera presets the models understand, pacing, model-specific tips, and fixing common failures.
A good video prompt reads like a shot description a director hands a camera operator. It names the subject, the action, the camera, and the light — and resists cramming three shots into one.
A reliable structure
Write in this order:
- Subject — who or what, specific. "A young woman in a red raincoat."
- Action — the single thing that changes during the clip. "walks toward the camera and looks up."
- Setting — where, and what's around. "on a rain-slicked city street at night, neon reflected in puddles."
- Camera — the move. "slow dolly in, eye level."
- Light & mood — "cool blue light, cinematic, moody."
A young woman in a red raincoat walks toward the camera and looks up, on a rain-slicked city street at night with neon reflections, slow dolly in at eye level, cool cinematic light. 9:16, 5 seconds.
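If you assemble prompts in a script, the ordering above can be kept consistent with a small helper. This is a minimal sketch, not part of PonPon: the `Shot` class, its field names, and the default aspect ratio and duration are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """One shot, holding the five parts in the order listed above (hypothetical helper)."""
    subject: str
    action: str
    setting: str
    camera: str
    light: str
    aspect_ratio: str = "9:16"  # illustrative default, taken from the example prompt
    seconds: int = 5            # illustrative default, taken from the example prompt

    def to_prompt(self) -> str:
        # Subject and action form one clause; the other parts follow as comma-separated phrases.
        parts = [f"{self.subject} {self.action}", self.setting, self.camera, self.light]
        sentence = ", ".join(p.strip().rstrip(".,") for p in parts if p.strip())
        return f"{sentence}. {self.aspect_ratio}, {self.seconds} seconds."


shot = Shot(
    subject="A young woman in a red raincoat",
    action="walks toward the camera and looks up",
    setting="on a rain-slicked city street at night with neon reflections",
    camera="slow dolly in at eye level",
    light="cool cinematic light",
)
print(shot.to_prompt())
```

Run as written, this prints the same prompt as the example above. Keeping the parts as separate fields makes it easy to change only the camera or the light between iterations.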
Camera language the models understand
PonPon's Studio timeline exposes the exact camera moves the models respond to — use these terms in any prompt:
- Push In / Pull Out — move toward or away from the subject.
- Pan Left / Right, Tilt Up / Down — rotate the camera in place.
- Tracking — follow alongside a moving subject.
- Orbit — circle around the subject.
- Crane Up, Aerial — rise above the scene.
- Handheld — loose, organic movement.
- Dolly Zoom — the vertigo effect.
- Static — a locked-off shot.
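When you write many prompts, a quick check can catch drafts that name no listed move, or several at once. The sketch below is a rough heuristic under stated assumptions: the function names are hypothetical, the move names are copied from the list above, and plain phrase matching will occasionally misfire (for example on "static" used to describe noise rather than the camera).

```python
import re

# Move names copied from the list above; matching is case-insensitive.
CAMERA_MOVES = [
    "push in", "pull out", "pan left", "pan right", "tilt up", "tilt down",
    "tracking", "orbit", "crane up", "aerial", "handheld", "dolly zoom", "static",
]


def find_camera_moves(prompt: str) -> list[str]:
    """Return the listed camera moves that appear as whole phrases in the prompt."""
    text = prompt.lower()
    return [m for m in CAMERA_MOVES if re.search(rf"\b{re.escape(m)}\b", text)]


def check_camera(prompt: str) -> str:
    """Report whether the prompt names exactly one listed move (hypothetical helper)."""
    moves = find_camera_moves(prompt)
    if not moves:
        return "no listed camera move found"
    if len(moves) > 1:
        return "competing moves: " + ", ".join(moves)
    return "ok: " + moves[0]


print(check_camera("A cyclist rides along a coastal road, tracking shot, warm evening light"))
print(check_camera("Slow push in, then orbit around the statue"))
```

The first prompt passes with a single move; the second is flagged because it asks for two moves in one clip.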
One action per shot
The most common mistake is describing a whole scene with multiple events. A clip is only a few seconds — give it one beat. If you need a sequence, generate each shot separately and assemble in Flow, or use the multi-shot timeline in Studio on Kling 3.0 to direct several cuts in one generation.
Pacing and length
- Keep clips short while iterating; judge the motion, then commit to a longer render.
- Pacing words change the result: "slow", "unhurried" and "gentle" calm the motion, while "quick", "snappy" and "energetic" speed it up.
Match the model to the shot
- Veo 3.1 — the most precise camera direction, plus native audio. Reach for it when the move matters.
- Kling 3.0 — best for dialogue (lip-sync) and multi-shot sequences.
- Sora 2 — when physics and texture realism carry the shot.
- Seedance 2.0 — fast, expressive, vertical-first social clips.
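If you script your generations, the guidance above can be kept as a small lookup so the choice of model is explicit. This is only a sketch: the keys are hypothetical labels invented for this example, and the values restate the list above rather than any PonPon API.

```python
# Keys are hypothetical labels for this sketch; values restate the list above.
MODEL_FOR_NEED = {
    "camera_precision": "Veo 3.1",
    "native_audio": "Veo 3.1",
    "dialogue": "Kling 3.0",
    "multi_shot": "Kling 3.0",
    "physics_realism": "Sora 2",
    "vertical_social": "Seedance 2.0",
}


def pick_model(need: str) -> str:
    """Return the model suggested above for a given need label."""
    try:
        return MODEL_FOR_NEED[need]
    except KeyError:
        options = ", ".join(sorted(MODEL_FOR_NEED))
        raise ValueError(f"unknown need {need!r}; expected one of: {options}") from None


print(pick_model("dialogue"))
```

An unknown label raises a `ValueError` listing the known ones, which is easier to debug than silently falling back to a default model.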
Fixing common problems
| Problem | Try this |
|---|---|
| Warping faces or hands | Simpler action, slower motion, or start from a clean image via image-to-video |
| Camera ignores your direction | Name one explicit move from the list above; drop competing directions |
| Too much happening | Cut to a single action; split into multiple shots |
| Off-brand look | Provide a Start Frame instead of describing the style in words |
| Wrong subject emphasis | Put the subject first; remove background clutter |
Lock the look with a first frame
When the *style* matters more than the surprise, generate or upload a still and animate it with a Start Frame in the video generator. You stop gambling on the look and only ask the model to handle motion. For the fundamentals, revisit Text-to-video basics.
Related articles
- Text-to-video basics: How video generation works on PonPon: text-to-video vs image-to-video, choosing models like Veo 3.1, Sora 2 and Kling 3.0, and the Edit and Motion Control tabs.
- Your first AI video: Step by step: sign in, write a prompt, pick a model, set aspect ratio, duration and resolution, generate, and download your first AI video on PonPon.
- Image generation basics: Write a good image prompt, choose between models like GPT Image 2, Nano Banana Pro and Seedream 5.0, use reference images, and edit results with the annotate tools.