The AI Video Glossary for Beginners
AI video has its own language. Here's every term you'll encounter, explained without jargon.
AI video generation comes with a wall of jargon. Diffusion models, latent space, CFG scale, temporal coherence — it can feel like learning a new language. This glossary defines every term you're likely to encounter, in plain English, so you can focus on creating instead of decoding documentation.
Generation fundamentals
Text-to-video (T2V): Creating video from a written text description (prompt). You type what you want to see, and the AI generates it. This is the most common way to use AI video generators like those on PonPon.
Image-to-video (I2V): Creating video from a starting image plus a text prompt. The image provides the visual reference — colors, composition, subject — and the prompt describes what motion should happen. Generally produces more controllable, realistic results than text-to-video.
Video-to-video (V2V): Transforming an existing video clip using AI. The original video provides the motion, and the AI modifies the visual style, content, or quality. Used for style transfer, upscaling, and re-rendering.
Prompt: The text description you give the AI to generate video. A good prompt includes subject, action, setting, lighting, and camera details. The quality of your prompt is the single biggest factor in output quality.
Negative prompt: A description of what you don't want in the output. Not all models support negative prompts, but when available, terms like "blurry, distorted, low quality" help the model avoid common failure modes.
Generation / inference: The process by which the AI creates a video from your prompt. Each generation produces one video clip. The two terms are interchangeable in casual use.
Model and architecture terms
Diffusion model: The type of AI architecture used by most video generators (including Sora 2, Veo 3.1, and others). It works by starting with random noise and gradually removing it to form a coherent image or video, guided by your prompt. Think of it like a sculptor removing marble to reveal a figure.
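To make the sculptor analogy concrete, here's a toy sketch of the denoising loop in Python. It fakes the hard part: a real model uses a trained neural network, guided by your prompt, to decide which direction is "less noisy," while this sketch cheats with a known target.

```python
import numpy as np

# Toy illustration of diffusion-style denoising, not any real model's code.
rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(8, 8))  # stand-in for "what the prompt describes"
sample = rng.normal(0.0, 1.0, size=(8, 8))   # start from pure random noise

steps = 50
for t in range(steps):
    # Nudge the noisy sample a fraction of the way toward the target.
    # A real diffusion model gets this direction from its trained denoiser.
    sample += (target - sample) / (steps - t)

print(np.abs(sample - target).max())  # ~0: the noise has been "sculpted" into the image
```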
Transformer: An AI architecture that uses attention to process all parts of an input in parallel rather than one piece at a time. Modern video models combine diffusion with transformer architectures (called DiT, for Diffusion Transformer) for better quality and coherence.
Latent space: A compressed mathematical representation where the AI does its work. Instead of generating full-resolution pixels directly, models work in this compressed space for efficiency, then decode the result back to full resolution. You never interact with it directly, but it's why generation is feasible on current hardware.
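Some back-of-envelope arithmetic shows why. This assumes an 8x spatial downsampling and 4 latent channels, figures common in published latent diffusion work rather than specs of any model named here:

```python
# Why models work in latent space: far fewer values to generate.
# Assumes 8x spatial downsampling and 4 latent channels (common in
# published latent diffusion work, not specs of any particular model).
width, height, frames = 1920, 1080, 120           # 5 seconds of 1080p at 24 fps
pixel_values = width * height * frames * 3        # RGB values if generated directly

latent_w, latent_h = width // 8, height // 8
latent_values = latent_w * latent_h * frames * 4  # values in the compressed space

print(f"pixel space:  {pixel_values:,} values")
print(f"latent space: {latent_values:,} values ({pixel_values / latent_values:.0f}x fewer)")
```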
Foundation model: The large pre-trained AI model that powers a video generator. Sora 2, Kling 3.0, Veo 3.1, and Seedance 2.0 are all foundation models. They're trained on massive datasets and then fine-tuned for specific capabilities.
Fine-tuning: Additional training on a specific dataset to give a foundation model new capabilities or a particular style. Some platforms offer fine-tuning so you can train models on your own footage or brand aesthetic.
Quality and control parameters
CFG scale (Classifier-Free Guidance): A number that controls how strictly the model follows your prompt. Higher CFG means the output matches your prompt more literally but may look less natural. Lower CFG gives the model more creative freedom but may drift from your description. Typical range: 5-15.
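Under the hood, classifier-free guidance is a simple interpolation between two of the model's predictions. A minimal sketch of the standard formula, with random arrays standing in for the model's actual noise predictions:

```python
import numpy as np

# Classifier-free guidance in one line: push the prediction away from the
# "ignore the prompt" answer and toward the "follow the prompt" answer.
rng = np.random.default_rng(42)
pred_unconditional = rng.normal(size=(4, 4))  # model's guess ignoring the prompt
pred_conditional = rng.normal(size=(4, 4))    # model's guess following the prompt

cfg_scale = 7.5  # a typical middle-of-the-range value
guided = pred_unconditional + cfg_scale * (pred_conditional - pred_unconditional)

# cfg_scale = 1.0 reproduces the prompt-following prediction exactly;
# larger values exaggerate the prompt's pull (more literal, less natural).
```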
Steps / inference steps: How many denoising iterations the model performs during generation. More steps generally mean higher quality but slower generation. There's a point of diminishing returns — beyond a certain step count, quality stops improving.
Seed: A random number that initializes the generation process. Using the same seed with the same prompt and settings produces identical output. Useful for reproducing results or making small adjustments while keeping the overall composition stable.
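The guarantee is the same one any seeded random number generator gives. A quick demonstration with Python's standard library:

```python
import random

# Same seed -> identical "random" sequence, which is why the same seed plus
# the same prompt and settings reproduces the same output.
def fake_generation(seed: int) -> list[float]:
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]

assert fake_generation(1234) == fake_generation(1234)  # reproducible
assert fake_generation(1234) != fake_generation(5678)  # new seed, new result
```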
Resolution: The pixel dimensions of the output video. Common resolutions: 720p (1280x720), 1080p (1920x1080), 4K (3840x2160). Higher resolution means more detail but costs more credits and takes longer.
Aspect ratio: The width-to-height proportion of the video. Common ratios: 16:9 (widescreen), 9:16 (vertical/mobile), 1:1 (square), 2.39:1 (cinematic ultrawide).
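Ratio and resolution pin each other down: given one dimension and a ratio, the other follows. A small helper to illustrate (rounding to an even number mirrors the common codec requirement that video dimensions be divisible by 2):

```python
# Derive a video height from a width and an aspect ratio.
def height_for(width: int, ratio_w: float, ratio_h: float) -> int:
    height = width * ratio_h / ratio_w
    return round(height / 2) * 2  # keep dimensions divisible by 2

print(height_for(1920, 16, 9))    # 1080 (widescreen)
print(height_for(1080, 9, 16))    # 1920 (vertical/mobile)
print(height_for(2048, 2.39, 1))  # 856 (cinematic ultrawide)
```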
FPS (Frames Per Second): How many individual frames are shown each second of video. 24 fps is the cinematic standard, 30 fps is the broadcast standard, and 60 fps is the smooth/sports standard. Higher FPS means more frames have to be generated.
Duration: The length of the generated video clip, typically 3-10 seconds for current AI models. Longer videos are created by generating multiple clips and editing them together.
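FPS and duration multiply together to set the workload: frames to generate = FPS x duration. A quick worked example:

```python
# Frame count is simply FPS x duration, which is why both settings
# affect generation time and cost.
for fps in (24, 30, 60):
    for seconds in (5, 10):
        print(f"{fps:>2} fps x {seconds:>2}s = {fps * seconds:>3} frames")
```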
Video quality terms
Temporal coherence: Consistency between consecutive frames. Good temporal coherence means objects maintain their shape, color, and position smoothly from frame to frame. Poor temporal coherence causes flickering, morphing, or objects that change appearance between frames.
Artifacts: Visual errors in the generated video — distorted faces, extra fingers, warped text, impossible geometry, or flickering areas. All models produce some artifacts; the goal is minimizing them through good prompting and model selection.
Hallucination: When the AI generates something that wasn't in the prompt and doesn't make sense — an extra hand appearing, text that changes between frames, or objects that materialize and vanish. Related to but distinct from artifacts.
Motion blur: Natural blurring that occurs with fast movement, simulating how a real camera captures motion. Some AI models produce artificial-looking motion blur or apply it inconsistently.
Upscaling: Increasing the resolution of generated video after creation. AI upscaling uses machine learning to add detail when enlarging, producing better results than simple pixel scaling.
Workflow and pipeline terms
Pipeline: The sequence of AI models and processing steps used to create final output. For example: text-to-image → upscale → image-to-video → color grade. PonPon's Canvas and Flow features let you build multi-step pipelines.
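Conceptually, a pipeline is just functions feeding each other. A hypothetical sketch, where every step name is a placeholder rather than a PonPon API call:

```python
# Hypothetical pipeline sketch; the functions are placeholders standing in
# for real generation calls, not PonPon's actual API.
def text_to_image(prompt: str) -> str:
    return f"image({prompt})"

def upscale(image: str) -> str:
    return f"upscaled({image})"

def image_to_video(image: str, motion_prompt: str) -> str:
    return f"video({image}, {motion_prompt})"

def color_grade(video: str) -> str:
    return f"graded({video})"

# Each step's output becomes the next step's input.
image = text_to_image("a lighthouse at dusk, cinematic lighting")
video = image_to_video(upscale(image), "slow dolly shot toward the lighthouse")
print(color_grade(video))
```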
Batch generation: Creating multiple video variations from the same or similar prompts in a single session. Useful for A/B testing prompts or generating options to choose from.
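In practice, batching often means varying the seed while holding the prompt fixed. Another hypothetical sketch (generate is a placeholder, not a real API call):

```python
# Hypothetical batch sketch: same prompt, different seeds, several
# options to choose from. generate() is a placeholder, not a real API.
def generate(prompt: str, seed: int) -> str:
    return f"clip(prompt={prompt!r}, seed={seed})"

prompt = "a red kite over a stormy sea, handheld camera"
for clip in [generate(prompt, seed) for seed in (101, 102, 103, 104)]:
    print(clip)
```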
Credits: The currency used on AI platforms to pay for generations. Different models and settings cost different amounts of credits. Higher resolution, longer duration, and premium models generally cost more.
Queue: The waiting line for generation requests. During peak usage, your generation enters a queue and is processed in order. Free tier users may experience longer queue times.
Camera and cinematic terms used in prompts
Tracking shot: Camera moves alongside the subject, following their motion. Produces dynamic, engaging footage.
Dolly shot: Camera moves toward or away from the subject on a track. Creates a sense of approaching or departing.
Pan: Camera rotates horizontally from a fixed position, sweeping across a scene.
Tilt: Camera rotates vertically from a fixed position, moving from ground to sky or vice versa.
Rack focus: Focus shifts from one subject to another within the same shot. Creates visual storytelling by directing attention.
Depth of field (DOF): How much of the scene is in focus. Shallow DOF blurs the background (bokeh), keeping attention on the subject. Deep DOF keeps everything sharp.
Bokeh: The aesthetic quality of the out-of-focus areas in an image, typically appearing as soft, circular light blobs in the background. Mentioning bokeh in prompts often produces a pleasing, professional-looking background blur.
Keep this as a reference
Bookmark this page. As you explore AI video on PonPon — across Sora 2, Kling 3.0, Veo 3.1, Seedance 2.0, and Nano Banana Pro — you'll encounter these terms in settings, documentation, and community discussions. Understanding them makes you a better prompter and a more efficient creator.
