How long does it take to produce a 30-second commercial with this workflow?

For a first attempt, expect 3-4 hours including keyframe generation, animation, and assembly. This includes the checkpoint reviews between phases. Subsequent commercials using the same pipeline structure typically take 1-2 hours as you learn which models and prompt patterns work best for your product category.

How many generations will I need in total?

Plan for approximately 15-25 image generations for keyframes and 8-15 video generations for animations. The macro detail scene (Scene 3) typically requires the most iterations. Your total will decrease significantly after your first commercial.

Can I use my iPhone product photos as the starting keyframe?

Yes, but run them through an image upscaler first. Higher resolution source images produce significantly better video output because video models cannot invent detail that the source image lacks.

What if the AI changes my product's appearance during animation?

This happens when your video prompt redescribes the product's appearance. Only describe motion in the video prompt — let the uploaded keyframe handle all visual information. If you find yourself typing material descriptions like 'brushed steel' in a video prompt, delete them.

What if my keyframe is almost perfect but has a small flaw?

Use inpainting to fix only the problematic area rather than regenerating the entire image. Most generation tools let you mask a region and regenerate just that section. For minor artifacts, a quick manual touch-up in any photo editor preserves the composition you worked to achieve.

Which model should I use for scenes with legible text on the product?

For any scene requiring readable brand names, labels, or typography, generate the keyframe through GPT Image 2 before animating. It consistently produces the most accurate text rendering. Put the brand name in quotation marks inside the prompt to signal the model to treat it as specific text.

Do I need professional editing software for the final assembly?

No. Free editors like CapCut (browser-based, simplest learning curve) or DaVinci Resolve (desktop, more powerful) are fully sufficient. The AI handles the complex generation work; the editing step is straightforward sequencing, audio placement, and logo overlay.


May 11, 2026 · PonPon Team

Make a Product Ad With AI: Full Guide

A practical, copy-paste walkthrough that takes you from a raw product photo to a finished 30-second commercial.

Most AI video tutorials stop at "type a prompt and see what happens." This one does not. By the end of this guide, you will have a complete 30-second product commercial consisting of four distinct scenes, each generated through the correct model, assembled in sequence, and ready to upload to any platform. Every prompt in this tutorial is tested and designed to be copied directly.

We will use a luxury wristwatch as the example product. The technique applies identically to sneakers, skincare bottles, headphones, or any physical product.

What you need before starting

Gather these materials before touching any generation tool:

One clean product photo. Shot on a white or neutral background. No shadows, no hands, no lifestyle context. This is your anchor asset — the single source of truth that every generated scene will reference. If your photo is low resolution, run it through an image upscaler first. Resolution matters because video models cannot invent detail that the source image lacks.
A scene plan. Write four sentences, one per scene, describing the visual intent. This keeps you from improvising mid-generation and losing coherence. Here is a complete example for the wristwatch:

> Scene 1: Dramatic close-up of the watch on a dark marble surface, slow zoom-in reveals surface detail.
> Scene 2: A man in a suit walks through a rain-soaked city at night, the watch catches neon light on his wrist.
> Scene 3: Extreme macro of the watch face: brand name, indices, and second hand are all crisp and legible.
> Scene 4: Clean dark background plate for brand logo and tagline overlay.

Your brand color palette. Write down the exact hex codes or descriptive color terms ("matte black and brushed gold") that define your visual identity. You will embed these in every image prompt to maintain consistency across scenes.
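One way to keep the scene plan and palette from drifting between generations is to store them once and reuse them. The sketch below is illustrative only: `ScenePlan`, `SCENE_PLAN`, `BRAND_PALETTE`, and `with_palette` are hypothetical names, not part of any generation tool, and the palette string is the example phrase from above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenePlan:
    number: int
    intent: str  # one sentence describing the visual intent


# The four-sentence plan from the wristwatch example.
SCENE_PLAN = [
    ScenePlan(1, "Dramatic close-up of the watch on a dark marble surface, slow zoom-in reveals surface detail."),
    ScenePlan(2, "A man in a suit walks through a rain-soaked city at night, the watch catches neon light on his wrist."),
    ScenePlan(3, "Extreme macro of the watch face: brand name, indices, and second hand are all crisp and legible."),
    ScenePlan(4, "Clean dark background plate for brand logo and tagline overlay."),
]

# Example palette (assumption); swap in your own hex codes or color terms.
BRAND_PALETTE = "matte black and brushed gold"


def with_palette(image_prompt: str, palette: str = BRAND_PALETTE) -> str:
    """Append the brand palette to an image prompt so every keyframe shares it."""
    return f"{image_prompt.rstrip()} Color palette: {palette}."


for scene in SCENE_PLAN:
    print(f"Scene {scene.number}: {with_palette(scene.intent)}")
```

Keeping the palette in one constant means a palette change only has to be made in one place before you regenerate keyframes.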

How many generations should you expect?

A realistic budget for your first commercial: approximately 15–25 image generations across the four keyframes (3–5 per scene, plus a few re-rolls), and 8–15 video generations for animations (2–4 per scene). Subsequent commercials using the same pipeline will require fewer iterations because you will have learned what each model responds to best.
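The arithmetic behind that budget can be sketched as follows. The per-scene ranges come from the paragraph above; the re-roll allowance of 3 to 5 images is an assumption chosen so the image total reproduces the quoted 15–25, and `budget` is a hypothetical helper.

```python
SCENE_COUNT = 4


def budget(per_scene_low, per_scene_high, rerolls_low=0, rerolls_high=0, scenes=SCENE_COUNT):
    """Return (low, high) total generations for a per-scene range plus re-rolls."""
    return (scenes * per_scene_low + rerolls_low, scenes * per_scene_high + rerolls_high)


# Images: 3-5 per scene, plus an assumed 3-5 re-rolls overall.
image_low, image_high = budget(3, 5, rerolls_low=3, rerolls_high=5)

# Videos: 2-4 per scene, no separate re-roll allowance.
video_low, video_high = budget(2, 4)

print(f"Image generations: {image_low}-{image_high}")
print(f"Video generations: {video_low}-{video_high}")
```

The video range works out to 8–16, in line with the approximate 8–15 quoted above.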

Choosing the right model for each scene

Different scenes have different technical demands. Picking the wrong model is the most common source of frustration. Use this reference throughout the tutorial:

Scene type	Key challenge	Recommended approach
Product macro / text on label	Rendering legible text, precise material surfaces	GPT Image 2 — strongest at reflective materials and readable typography
Lifestyle / person wearing product	Human anatomy, spatial composition, natural posing	GPT Image 2 or Seedance 2.0 — both handle body proportions reliably
Abstract / atmospheric background	Mood, volumetric lighting, color grading	Any high-quality model works — this is the most forgiving scene type
Video: product with straight edges	Geometric stability during camera movement	Models with strong camera tracking physics — warping is immediately visible on rigid products
Video: environment with physics	Rain, reflections, human locomotion	Models with world simulation capabilities — realistic environmental interactions
Video: simple motion / atmosphere	Subtle light drift, minimal movement	Speed-optimized engines — visual quality trade-off is minimal for simple motion

You do not need to memorize this table. Each step below will tell you which row applies.
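If you prefer to keep the table next to your prompts, it can be expressed as a simple lookup. This is a minimal sketch: the keys and recommendations are copied from the table above, while `MODEL_GUIDE` and `recommend` are hypothetical names.

```python
# Scene type -> recommended approach, copied from the reference table.
MODEL_GUIDE = {
    "product macro / text on label": "GPT Image 2",
    "lifestyle / person wearing product": "GPT Image 2 or Seedance 2.0",
    "abstract / atmospheric background": "any high-quality model",
    "video: product with straight edges": "model with strong camera tracking physics",
    "video: environment with physics": "model with world simulation capabilities",
    "video: simple motion / atmosphere": "speed-optimized engine",
}


def recommend(scene_type: str) -> str:
    """Look up the recommended approach for a scene type (case-insensitive)."""
    key = scene_type.strip().lower()
    if key not in MODEL_GUIDE:
        raise KeyError(f"Unknown scene type: {scene_type!r}")
    return MODEL_GUIDE[key]


print(recommend("Product macro / text on label"))
```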

A key rule before you begin: image prompts vs. video prompts

This distinction is critical and applies throughout the entire tutorial:

In image prompts (Phase 1), describe everything — the product's appearance, materials, lighting, composition, color, and environment. The image model is building the scene from scratch and needs full visual information.
In video prompts (Phase 2), describe only the motion — camera direction, speed, duration, and movement type. The visual content is already locked in the keyframe you upload. Redescribing appearance in a video prompt causes the model to reinterpret and alter your established look.

Think of it this way: the image prompt is the cinematographer setting up the shot. The video prompt is the director calling "action" and describing only what moves.

Phase 1: Generate the hero keyframes

The keyframe is a static image that locks your visual direction. You generate one keyframe per scene. These keyframes determine the lighting, composition, and color grading for the entire commercial. Every subsequent video generation will inherit these visual decisions.

Scene 1 keyframe: The dramatic product reveal

Model choice: Use GPT Image 2, because this scene requires precise rendering of metallic surfaces and specular highlights. (See table row: "Product macro / text on label.")

Open the image generation studio and copy this prompt, swapping the watch-specific details (product type, materials, surface) for your own product:

Prompt: A luxury wristwatch sits centered on a slab of black marble. Extreme close-up, macro lens, f/2.8 shallow depth of field. Single dramatic side light from the left casting a long shadow. Specular highlights on the sapphire crystal and brushed steel case. Background is pure black with no visible horizon. Commercial product photography, 16:9 aspect ratio, hyper-realistic, 4K detail.

Generate three to five variations. When choosing the best result, look for these specific qualities:

Lighting direction is consistent. The shadow should fall to one side, not scatter in multiple directions. Multiple shadow directions indicate conflicting light sources, which will look unnatural when animated.
Material surfaces match the real product. Compare the brushed metal texture, the crystal reflection, and the bezel finish against your reference photo. If the model turned your matte black into glossy plastic, that variation is wrong regardless of how good the composition looks.
Product proportions are accurate. Check the case thickness, the crown size, and the lug-to-lug ratio against your reference photo. AI models frequently stretch or compress products subtly.

Download the selected keyframe at maximum resolution.

Scene 2 keyframe: The lifestyle context shot

Model choice: Use GPT Image 2 for this scene as well — it handles both human anatomy and product detail in the same frame. (See table row: "Lifestyle / person wearing product.")

Prompt: A man in a tailored charcoal suit walks through a rain-soaked urban street at night. Medium shot from the waist up. His left wrist is visible, wearing a luxury steel wristwatch that catches a neon reflection. Wet pavement reflects cyan and orange storefront lights. Shallow depth of field, the background bokeh is soft circles of colored light. Cinematic, moody, teal and orange color grading, 35mm film grain, 16:9 aspect ratio.

Notice that this image prompt includes a detailed description of the watch's appearance ("luxury steel wristwatch that catches a neon reflection"). This is correct — in image generation, you must specify the product so the model does not hallucinate a generic design. If the result shows an inaccurate watch, regenerate with even more specific material details: "brushed titanium case with black ceramic bezel and white dial."

When evaluating the variations, check these specifics:

Hand and wrist anatomy. Count the fingers. Check that the wrist bends naturally. Distorted hands are the most common failure in lifestyle shots — reject any variation with anatomical errors immediately, even if the rest of the image looks perfect.
Watch placement. The watch should sit on the wrist at a natural angle, not floating above the skin or embedded into the sleeve.
Color grading consistency. Compare this keyframe's overall color palette against your Scene 1 keyframe. Both should feel like they belong in the same commercial.

Scene 3 keyframe: The macro detail

Model choice: Use GPT Image 2 again. This scene requires the brand name on the dial to be perfectly legible, and text rendering is this model's strongest advantage. (See table row: "Product macro / text on label.")

Prompt: Extreme macro photography of a luxury watch face filling the entire frame. The dial reads "BRAND NAME" in clean serif typography at the 12 o'clock position. Swiss movement visible through the transparent caseback reflection. Dramatic low-key lighting from above. Every index marker, every hand, every sub-dial is crisp and in focus. Black background, commercial photography, 16:9 aspect ratio, hyper-realistic.

Put your actual brand name in quotation marks inside the prompt. The quotation marks signal the model to treat it as specific text to render rather than decorative noise.
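If you build prompts from a template, the quoting step can be sketched like this. `dial_text_clause` is a hypothetical helper and "EXAMPLE" is a placeholder, not a real brand; the sentence itself is taken from the Scene 3 prompt.

```python
def dial_text_clause(brand_name: str) -> str:
    """Return the Scene 3 prompt sentence with the brand name in double quotes."""
    # Strip surrounding whitespace and any quotes the caller already added,
    # so the name ends up wrapped in exactly one pair of quotation marks.
    cleaned = brand_name.strip().strip('"').strip()
    if not cleaned:
        raise ValueError("brand name must not be empty")
    return f'The dial reads "{cleaned}" in clean serif typography at the 12 o\'clock position.'


print(dial_text_clause("EXAMPLE"))
```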

When selecting from the variations, the priority order is:

1. Text legibility: the brand name must be spelled correctly and be readable at a glance. This is non-negotiable. A stunning composition with a misspelled brand name is unusable.
2. Detail accuracy: index markers should be evenly spaced, hands should be the correct shape, sub-dials should be circular.
3. Lighting quality: the light should reveal surface texture without creating distracting hotspots.

Scene 4 keyframe: The brand lockup

Model choice: Any high-quality model works here — this is the most forgiving scene type. (See table row: "Abstract / atmospheric background.")

The final scene is a simple brand card. Generate a clean background plate that matches your commercial's color grading. Do not attempt to render the logo text through AI — you will composite your actual brand logo and tagline in post-production. The AI's job is to create a cinematic background plate only.

Prompt: Abstract dark studio background with subtle volumetric light rays entering from the upper right. Matte black surface with barely visible brushed metal texture. No objects, no text, no subjects. Atmospheric, premium, minimal. 16:9 aspect ratio, 4K.

This gives you a clean canvas to overlay your real logo in the editing phase. The selection criterion here is simple: choose the variation whose mood and color temperature best match your other three keyframes.

Checkpoint: Review your four keyframes together

Stop here before moving to Phase 2. Open all four keyframes side by side and verify the following:

Color temperature is consistent. If Scene 1 has cool blue tones but Scene 2 is warm orange, the transitions between scenes will feel jarring. Regenerate the outlier to match the dominant palette.
Lighting direction is compatible. The light does not need to come from the same angle in every scene, but the main light on the product should not flip sides between product shots. For example, left-side lighting in Scene 1 and right-side lighting in Scene 3 on the same watch would break the illusion of continuity.
Product appearance matches across scenes. The watch in Scene 1 should look like the same watch in Scene 2 and Scene 3. Compare the case color, bezel style, and dial layout. If one scene shows a black dial and another shows silver, regenerate the inconsistent one.
Aspect ratios are identical. All four keyframes should be 16:9. A single 1:1 image will cause cropping problems during assembly.

If any keyframe fails these checks, fix it now. It is far cheaper to regenerate a single image than to re-animate an entire scene later.
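The aspect-ratio check is the easiest one to automate. The sketch below assumes you have already noted each keyframe's pixel dimensions; the file names, dimensions, and the 0.01 tolerance are hypothetical values for illustration, and `off_ratio` is not part of any tool.

```python
TARGET_RATIO = 16 / 9
TOLERANCE = 0.01  # assumption: allows near-16:9 sizes such as 1366x768


def off_ratio(keyframes, target=TARGET_RATIO, tolerance=TOLERANCE):
    """Return the names of keyframes whose width/height ratio is not close to target."""
    return [
        name
        for name, (width, height) in keyframes.items()
        if abs(width / height - target) > tolerance
    ]


# Hypothetical file names and pixel dimensions.
keyframes = {
    "scene1.png": (1920, 1080),
    "scene2.png": (3840, 2160),
    "scene3.png": (1024, 1024),  # a stray 1:1 image
    "scene4.png": (1920, 1080),
}

print(off_ratio(keyframes))
```

Any name the function returns is a keyframe to regenerate at 16:9 before animating.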

What to do when a keyframe is almost right

Sometimes a keyframe is 90% perfect — the composition and lighting are exactly what you want, but the brand name is slightly garbled, or the watch crown is too large, or a background element is distracting. Full regeneration risks losing everything that was working. Instead:

Use inpainting / partial regeneration. Most generation tools allow you to mask a specific region of the image and regenerate only that area. Mask the problematic element (the misspelled text, the distorted crown) and regenerate with a targeted prompt that describes only the fix.
Manual touch-up. For minor issues like a small artifact or an unwanted reflection, a quick fix in any photo editor (even a free one like Photopea) takes seconds and preserves the rest of the image perfectly.
Regenerate with the same prompt. If inpainting is not available, regenerating the full image with the exact same prompt will produce different variations. The good version is often just a few generations away.

The goal is to avoid discarding strong compositions because of fixable details.

Phase 2: Animate each keyframe into video

With four approved keyframes saved, you now convert each one into a video clip.

Remember the rule from earlier: your video prompt should describe only the motion, not the visual content. The visual content is already locked in the keyframe you upload. This is the most common mistake in this phase — if you find yourself typing the word "luxury" or "brushed steel" in a video prompt, stop and delete it.
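A quick self-check for this rule can be sketched as a word filter. The term list below is a hypothetical starting point built from the appearance words used in this guide's image prompts; extend it with the materials and adjectives that describe your own product.

```python
import re

# Hypothetical list of appearance words that belong in image prompts only.
APPEARANCE_TERMS = ["luxury", "brushed steel", "sapphire", "marble", "matte black", "gold"]


def appearance_terms_in(video_prompt, terms=APPEARANCE_TERMS):
    """Return the appearance terms found in a video prompt (whole-word match)."""
    lowered = video_prompt.lower()
    return [
        term
        for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]


good = "Slow push-in zoom toward the center of the frame. Camera moves forward smoothly over 5 seconds."
bad = "Slow push-in on the luxury watch with a brushed steel case."

print(appearance_terms_in(good))
print(appearance_terms_in(bad))
```

An empty result does not prove the prompt is motion-only, but a non-empty one is a reliable signal to delete something.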

Animating Scene 1: Slow reveal zoom

Model choice: Use a model with strong geometric stability during camera movement. Warping on a rigid product with straight edges is immediately visible. Models with strong camera tracking physics are the safest choice. (See table row: "Video: product with straight edges.")

Open the image-to-video tool. Upload your Scene 1 keyframe.

Video prompt: Slow push-in zoom toward the center of the frame. Camera moves forward smoothly over 5 seconds. No camera rotation, no lateral movement. Subtle ambient light shift as the camera approaches. Cinematic, steady, commercial pacing.

Render the clip. When reviewing, watch for:

Geometric warping. Pause the video at the start and end frames. The straight edges of the watch case should remain straight throughout. If they bend or ripple, reduce the zoom intensity ("very subtle push-in" instead of "push-in zoom") and regenerate.
Texture swimming. The brushed metal texture should stay anchored to the product surface. If the texture appears to drift or swim independently of the object, the model is struggling with the material — try a different rendering model.
Steady pacing. The zoom should be smooth and constant, not jerky or accelerating. Uneven pacing creates an amateur feel that undermines the premium tone.

Animating Scene 2: Urban tracking shot

Model choice: This scene involves complex physics — falling rain, reflective surfaces, and human locomotion. Use a model with strong world simulation capabilities. (See table row: "Video: environment with physics.")

Upload your Scene 2 keyframe.

Video prompt: Medium tracking shot follows the subject walking forward. Camera moves at the same pace as the subject. Rain falls continuously. Neon reflections shift on the wet pavement as both camera and subject move through the scene. Natural walking cadence, 8 seconds duration.

When reviewing, watch for:

Walking motion. The legs should move naturally and the body should not slide across the ground. Sliding (the body moves but the legs do not match the speed) is a common physics failure.
Rain consistency. Rain should fall at a constant rate and direction throughout the clip. Watch for rain that stops and starts, changes direction, or appears only in part of the frame.
Reflection behavior. The neon reflections on the wet pavement should shift as the camera moves. Static reflections that do not respond to camera motion break the realism.

If the rain looks artificial or the reflections behave incorrectly, try adding "physically accurate rain simulation, real-world light scattering" to your prompt.

Animating Scene 3: Macro detail with subtle movement

Model choice: Use the same geometric-stability model you used for Scene 1. Macro animation amplifies every inconsistency — this is the most demanding scene technically. (See table row: "Video: product with straight edges.")

Upload your Scene 3 keyframe.

Video prompt: Extremely slow rotation of the watch face, approximately 15 degrees over 5 seconds. The second hand sweeps smoothly. Light catches the crystal at slightly different angles as the rotation progresses. No camera movement — the object rotates, the camera is locked.

When reviewing, watch for:

Text stability. The brand name on the dial must remain legible throughout the rotation. If the text blurs, morphs, or disappears during movement, this take is unusable — the brand name is the entire point of this scene.
Index marker consistency. The hour markers should maintain their size and spacing as the watch face rotates. Markers that drift, merge, or vanish indicate frame-to-frame instability.
Second hand sweep. The second hand should move smoothly without jumping or stuttering. A clean sweep communicates mechanical precision — a stuttering hand communicates the opposite.

Generate at least three variations of this scene. Macro animation is the most failure-prone step in the entire pipeline, so allocate extra iterations here.
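It helps to know how small the requested motion is when judging these takes. The arithmetic below uses the 15 degrees and 5 seconds from the Scene 3 video prompt; the 24 fps frame rate is an assumption, since output frame rates vary by model.

```python
ROTATION_DEGREES = 15  # from the Scene 3 video prompt
CLIP_SECONDS = 5       # from the Scene 3 video prompt
ASSUMED_FPS = 24       # assumption: check your model's actual output frame rate

degrees_per_second = ROTATION_DEGREES / CLIP_SECONDS
total_frames = CLIP_SECONDS * ASSUMED_FPS
degrees_per_frame = ROTATION_DEGREES / total_frames

print(f"{degrees_per_second} degrees per second")
print(f"{total_frames} frames, {degrees_per_frame} degrees per frame")
```

Under these assumptions the dial turns only a fraction of a degree per frame, so any visible jump in the text or markers between adjacent frames is an artifact rather than intended motion.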

Animating Scene 4: Atmospheric background plate

Model choice: Use a speed-optimized engine — the motion in this scene is minimal, and the visual quality difference is negligible for simple atmospheric movement. (See table row: "Video: simple motion / atmosphere.")

Upload your Scene 4 keyframe.

Video prompt: Extremely subtle volumetric light movement. Light rays drift slowly from right to left over 6 seconds. Background texture remains completely static. Atmospheric, almost imperceptible motion. No camera movement.

This is the simplest animation — you only need gentle atmospheric drift to give the brand card visual life without distracting from the logo you will overlay in post. Almost any result that moves slowly and smoothly will work.

Checkpoint: Review your four video clips together

Stop here before moving to Phase 3. Play all four clips in sequence and check:

Pacing feels continuous. The clips should flow naturally from one to the next. If Scene 1 moves slowly but Scene 2 is fast and frenetic, the viewer will feel a disorienting speed change. Consider regenerating the outlier with adjusted tempo language ("slower tracking" or "faster push-in").
Color consistency survived animation. Some video models shift the color grading slightly from the source keyframe. Compare the first frame of each clip against its original keyframe. If one clip has drifted warmer or cooler, you may need to regenerate it or plan a minor color correction in the editing phase.
No clip has a visible artifact that will distract viewers. Watch each clip at full screen. A subtle flicker that is invisible in a thumbnail may become obvious when viewed on a phone or monitor. It is worth regenerating a clip now rather than discovering the problem after full assembly.
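The color-consistency check can be roughed out numerically. This sketch assumes you have already extracted RGB pixel tuples from the keyframe and from the clip's first frame (how you load them is up to you), and it uses mean red minus mean blue as a crude stand-in for color temperature. Both the metric and the function names are assumptions, not a standard measure.

```python
def mean_rgb(pixels):
    """Average (r, g, b) over a non-empty sequence of RGB tuples."""
    count = len(pixels)
    if count == 0:
        raise ValueError("no pixels given")
    return tuple(sum(p[channel] for p in pixels) / count for channel in range(3))


def warmth(pixels):
    """Crude warmth score: mean red minus mean blue."""
    red, _, blue = mean_rgb(pixels)
    return red - blue


def color_drift(keyframe_pixels, first_frame_pixels):
    """Positive means the clip drifted warmer, negative means cooler."""
    return warmth(first_frame_pixels) - warmth(keyframe_pixels)


# Tiny made-up samples; real images would supply many more pixels.
keyframe = [(100, 100, 120), (80, 90, 110)]
first_frame = [(110, 100, 110), (90, 90, 100)]

print(color_drift(keyframe, first_frame))
```

A drift near zero suggests the grading survived animation; a large value in either direction is a cue to compare the frames by eye.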

Phase 3: Audio design

A silent commercial feels incomplete. Sound transforms the viewer's perception from "AI demo" to "professional commercial." You need two audio layers: a background music track and scene-specific sound effects.

Background music

Open the audio generation tool. Describe the emotional arc of your commercial:

Music prompt: Minimal, elegant piano and strings. Slow build from quiet to confident. Tempo: 72 BPM. Duration: 30 seconds. Mood: sophisticated, premium, aspirational. No percussion. Gentle crescendo in the final 8 seconds.

Generate three variations. To select the best one, play each track while watching your four video clips in sequence. You are listening for:

The emotional peak should align with Scene 3 (the macro detail reveal). This is the climax of most product commercials — the moment of closest intimacy with the product. If the music peaks too early (during Scene 1) or too late (during Scene 4), the emotional arc feels misaligned.
The opening should be understated. Scene 1 is a slow reveal — the music should not overpower it. A quiet opening that gradually builds lets the visual do the work first.
The ending should feel resolved, not abruptly cut. The track should naturally wind down or hold a final note through Scene 4 and the closing black screen.

Sound effects

Layer targeted sound effects over specific scenes. Generate each sound effect individually so you can position them precisely in the editing timeline:

Scene 2: Rain ambience, distant traffic, footsteps on wet pavement. These environmental sounds sell the realism of the urban setting. Generate a 10-second rain loop and a separate footstep track.
Scene 3: A subtle, satisfying mechanical click as the second hand sweeps. This tactile audio cue subconsciously communicates quality and precision — it is the audio equivalent of the macro visual.

Scenes 1 and 4 typically work best with music only. Adding sound effects to a dramatic product reveal or brand lockup usually creates clutter rather than immersion.

Phase 4: Assembly and export

You do not need professional editing experience for this step. The AI handled the complex generation work — the editing step is straightforward sequencing, timing, and audio placement. If you have never used a video editor before, start with CapCut (free, runs in your browser, simplest learning curve) or DaVinci Resolve (free, more powerful, desktop application).

Setting up the timeline

Import all your assets into the editor:

Four video clips (Scenes 1–4)
One music track
Individual sound effect files

Arrange the video clips on the timeline in this order:

Timecode	Scene	Duration	Transition into this segment
0:00–0:06	Scene 1: Product reveal zoom	6s	Start with a 1-second fade from black — the video begins in darkness and the product gradually appears
0:06–0:14	Scene 2: Urban wrist shot	8s	Hard cut (instant switch) timed to the moment of strongest motion — this creates energy
0:14–0:20	Scene 3: Macro watch face	6s	Half-second dissolve (one image fades into the next) — this softens the transition into the intimate macro shot
0:20–0:26	Scene 4: Brand lockup	6s	Half-second dissolve — smooth transition into the closing brand card
0:26–0:30	Black screen + call-to-action text	4s	Fade to black over 1 second

In most editors, you add transitions by dragging a "dissolve" or "cross dissolve" effect onto the junction between two clips and setting its duration. A "fade from black" is usually a dissolve placed at the very start of the first clip.
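If you change any clip length, the timecodes in the table shift. The sketch below recomputes them from the durations and confirms the total; the segment names and durations are copied from the table, `timecodes` and `fmt` are hypothetical helpers, and the calculation ignores the fact that some editors overlap clips when a dissolve is applied.

```python
# (segment name, duration in seconds), copied from the timeline table.
SEGMENTS = [
    ("Scene 1: Product reveal zoom", 6),
    ("Scene 2: Urban wrist shot", 8),
    ("Scene 3: Macro watch face", 6),
    ("Scene 4: Brand lockup", 6),
    ("Black screen + call-to-action text", 4),
]


def fmt(seconds):
    """Format whole seconds as M:SS."""
    return f"{seconds // 60}:{seconds % 60:02d}"


def timecodes(segments):
    """Return ([(start, end, name), ...], total_seconds) for back-to-back segments."""
    rows = []
    start = 0
    for name, duration in segments:
        end = start + duration
        rows.append((fmt(start), fmt(end), name))
        start = end
    return rows, start


rows, total = timecodes(SEGMENTS)
for start, end, name in rows:
    print(f"{start}-{end}  {name}")
print(f"Total: {total} seconds")
```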

Overlaying the brand logo

On Scene 4, add your real brand logo as a separate layer above the video. In most editors, this means dragging the logo file (PNG with transparent background works best) onto a track above the video track. Center it, scale it to fit, and set it to appear with a simple fade-in over 0.5 seconds.

Add your tagline as a text layer below the logo. Use your actual brand font if the editor supports custom fonts.

Placing audio

Drag the music track to the audio timeline so it spans the full 30 seconds.
Place the rain ambience sound effect under Scene 2 (timecode 0:06–0:14). Lower its volume to about 30–40% of the music track — it should feel like atmosphere, not a rainstorm.
Place the mechanical click sound effect at the moment the second hand sweeps in Scene 3. This is a brief, precise sound — position it to the exact frame.
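Many editors show audio levels in decibels rather than percentages. Assuming "30–40% of the music track" means a linear amplitude ratio (an assumption, since volume sliders differ between editors), the conversion uses the standard 20·log10 formula:

```python
import math


def gain_to_db(ratio):
    """Convert a linear amplitude ratio (e.g. 0.3 for 30%) to decibels."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 20 * math.log10(ratio)


low_db = round(gain_to_db(0.30), 1)
high_db = round(gain_to_db(0.40), 1)

print(f"Rain ambience: roughly {low_db} dB to {high_db} dB relative to the music")
```

Under that assumption, the rain track sits roughly 8 to 10.5 dB below the music. Treat this as a starting point and adjust by ear.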

Adding the call-to-action

On the final black screen (0:26–0:30), add a text layer with your call-to-action: your website URL, a "Shop Now" message, or a product name. Keep it simple — white text on black, your brand font, fade in over 0.5 seconds.

Exporting

Export at 1080p first. Watch the entire 30 seconds from start to finish at full screen. Check that transitions are smooth, audio is synchronized, and no visual artifacts slipped through.

If the output looks slightly soft and you need 4K for your distribution platform, run the final export through a video upscaler to sharpen the output before publishing. Upscale only the approved final cut — not individual clips.

Common mistakes and how to avoid them

Mistake 1: Describing the product's appearance in the video prompt. The video model will try to reinterpret your visual description, potentially altering the product geometry that your keyframe already locked. Only describe motion in video prompts. (This is the image-prompt-vs-video-prompt rule from earlier — the single most important principle in this workflow.)

Mistake 2: Using the same model for every scene. A model that produces stunning macro photography might generate unrealistic rain. Match the model to the specific challenge of each scene using the model selection table at the top of this guide. Use the multi-model comparison workspace to test unfamiliar models against your keyframes before committing to a full render.

Mistake 3: Generating video at maximum resolution immediately. Always render your first take at standard 1080p. Review the motion, the timing, and the consistency. Upscale only the approved final cut, as described in the export step. Rendering at 4K from the start wastes compute on clips you will likely discard.

Mistake 4: Ignoring audio. A perfectly generated visual sequence loses significant impact when presented in silence. Even a simple ambient track transforms the viewer's perception from "AI demo" to "professional commercial."

Mistake 5: Rendering the brand logo or tagline through AI generation. AI text rendering has improved dramatically, and it is good enough for the brand name printed on the product itself, as on the Scene 3 dial. But for your actual brand identity (the logo, the tagline, the legal line), always use your real vector assets composited in post-production. This guarantees pixel-perfect accuracy that no generative model can consistently match.

Mistake 6: Regenerating an entire keyframe because of a small flaw. If the composition and lighting are strong but a single element is wrong (garbled text, distorted crown, unwanted artifact), use inpainting to fix only the problematic area rather than discarding the whole image. Full regeneration risks losing everything that was working.

Mistake 7: Skipping the checkpoint reviews. It is tempting to rush from generation to animation to assembly. But catching a color inconsistency or a proportion mismatch between keyframes before animating avoids expensive rework. The five minutes you spend comparing keyframes side by side can save you thirty minutes or more of regenerating video clips.

Adapting this workflow to other products

The four-scene structure translates directly to any physical product category:

Sneakers: Scene 1 is the box opening reveal. Scene 2 is a runner on pavement (tracking shot). Scene 3 is the sole tread pattern (macro). Scene 4 is brand lockup.
Skincare: Scene 1 is the bottle on a wet stone surface. Scene 2 is a hand applying product (close-up). Scene 3 is the ingredient label (macro with legible text). Scene 4 is brand lockup.
Headphones: Scene 1 is the product floating in dramatic light. Scene 2 is a person wearing them on a subway (lifestyle). Scene 3 is the driver membrane (macro). Scene 4 is brand lockup.

The prompts change, but the pipeline remains identical: keyframe, animate, audio, assemble. Once you have completed this workflow once, every subsequent commercial follows the same four-phase process with decreasing production time.
