The AI Video Glossary for Beginners
AI video has its own language. Here's every term you'll encounter, explained without jargon.
AI video generation comes with a wall of jargon. Diffusion models, latent space, CFG scale, temporal coherence — it can feel like learning a new language. This glossary defines every term you're likely to encounter, in plain English, so you can focus on creating instead of decoding documentation.
Generation fundamentals
Text-to-video (T2V): Creating video from a written text description (prompt). You type what you want to see, and the AI generates it. This is the most common way to use AI video generators like those on PonPon.
Image-to-video (I2V): Creating video from a starting image plus a text prompt. The image provides the visual reference — colors, composition, subject — and the prompt describes what motion should happen. Generally produces more controllable, realistic results than text-to-video.
Video-to-video (V2V): Transforming an existing video clip using AI. The original video provides the motion, and the AI modifies the visual style, content, or quality. Used for style transfer, upscaling, and re-rendering.
Prompt: The text description you give the AI to generate video. A good prompt includes subject, action, setting, lighting, and camera details. The quality of your prompt is the single biggest factor in output quality.
Negative prompt: A description of what you don't want in the output. Not all models support negative prompts, but when available, terms like "blurry, distorted, low quality" help the model avoid common failure modes.
Generation / inference: The process by which the AI creates a video from your prompt. Each generation produces one video clip. The two terms are interchangeable in casual use.
Model and architecture terms
Diffusion model: The type of AI architecture used by most video generators (including Sora 2, Veo 3.1, and others). It works by starting with random noise and gradually removing it to form a coherent image or video, guided by your prompt. Think of it like a sculptor removing marble to reveal a figure.
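To make the sculptor analogy concrete, here's a toy sketch of the denoising loop in Python. It fakes the hard part: a real model uses a trained neural network, guided by your prompt, to decide which direction is "less noisy," while this sketch cheats with a known target.

```python
import numpy as np

# Toy illustration of diffusion-style denoising, not any real model's code.
rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(8, 8))  # stand-in for "what the prompt describes"
sample = rng.normal(0.0, 1.0, size=(8, 8))   # start from pure random noise

steps = 50
for t in range(steps):
    # Nudge the noisy sample a fraction of the way toward the target.
    # A real diffusion model gets this direction from its trained denoiser.
    sample += (target - sample) / (steps - t)

print(np.abs(sample - target).max())  # ~0: the noise has been "sculpted" into the image
```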
Transformer: An AI architecture that uses attention to process all parts of an input in parallel rather than one piece at a time. Modern video models combine diffusion with transformer architectures (called DiT, for Diffusion Transformer) for better quality and coherence.
Latent space: A compressed mathematical representation where the AI does its work. Instead of generating full-resolution pixels directly, models work in this compressed space for efficiency, then decode the result back to full resolution. You never interact with it directly, but it's why generation is feasible on current hardware.
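Some back-of-envelope arithmetic shows why. This assumes an 8x spatial downsampling and 4 latent channels, figures common in published latent diffusion work rather than specs of any model named here:

```python
# Why models work in latent space: far fewer values to generate.
# Assumes 8x spatial downsampling and 4 latent channels (common in
# published latent diffusion work, not specs of any particular model).
width, height, frames = 1920, 1080, 120           # 5 seconds of 1080p at 24 fps
pixel_values = width * height * frames * 3        # RGB values if generated directly

latent_w, latent_h = width // 8, height // 8
latent_values = latent_w * latent_h * frames * 4  # values in the compressed space

print(f"pixel space:  {pixel_values:,} values")
print(f"latent space: {latent_values:,} values ({pixel_values / latent_values:.0f}x fewer)")
```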
Foundation model: The large pre-trained AI model that powers a video generator. Sora 2, Kling 3.0, Veo 3.1, and Seedance 2.0 are all foundation models. They're trained on massive datasets and then fine-tuned for specific capabilities.
Fine-tuning: Additional training on a specific dataset to give a foundation model new capabilities or a particular style. Some platforms offer fine-tuning so you can train models on your own footage or brand aesthetic.
Quality and control parameters
CFG scale (Classifier-Free Guidance): A number that controls how strictly the model follows your prompt. Higher CFG means the output matches your prompt more literally but may look less natural. Lower CFG gives the model more creative freedom but may drift from your description. Typical range: 5-15.
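Under the hood, classifier-free guidance is a simple interpolation between two of the model's predictions. A minimal sketch of the standard formula, with random arrays standing in for the model's actual noise predictions:

```python
import numpy as np

# Classifier-free guidance in one line: push the prediction away from the
# "ignore the prompt" answer and toward the "follow the prompt" answer.
rng = np.random.default_rng(42)
pred_unconditional = rng.normal(size=(4, 4))  # model's guess ignoring the prompt
pred_conditional = rng.normal(size=(4, 4))    # model's guess following the prompt

cfg_scale = 7.5  # a typical middle-of-the-range value
guided = pred_unconditional + cfg_scale * (pred_conditional - pred_unconditional)

# cfg_scale = 1.0 reproduces the prompt-following prediction exactly;
# larger values exaggerate the prompt's pull (more literal, less natural).
```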
Steps / inference steps: How many denoising iterations the model performs during generation. More steps generally mean higher quality but slower generation. There's a point of diminishing returns — beyond a certain step count, quality stops improving.
Seed: A random number that initializes the generation process. Using the same seed with the same prompt and settings produces identical output. Useful for reproducing results or making small adjustments while keeping the overall composition stable.
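The guarantee is the same one any seeded random number generator gives. A quick demonstration with Python's standard library:

```python
import random

# Same seed -> identical "random" sequence, which is why the same seed plus
# the same prompt and settings reproduces the same output.
def fake_generation(seed: int) -> list[float]:
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]

assert fake_generation(1234) == fake_generation(1234)  # reproducible
assert fake_generation(1234) != fake_generation(5678)  # new seed, new result
```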
Resolution: The pixel dimensions of the output video. Common resolutions: 720p (1280x720), 1080p (1920x1080), 4K (3840x2160). Higher resolution means more detail but costs more credits and takes longer.
Aspect ratio: The width-to-height proportion of the video. Common ratios: 16:9 (widescreen), 9:16 (vertical/mobile), 1:1 (square), 2.39:1 (cinematic ultrawide).
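Ratio and resolution pin each other down: given one dimension and a ratio, the other follows. A small helper to illustrate (rounding to an even number mirrors the common codec requirement that video dimensions be divisible by 2):

```python
# Derive a video height from a width and an aspect ratio.
def height_for(width: int, ratio_w: float, ratio_h: float) -> int:
    height = width * ratio_h / ratio_w
    return round(height / 2) * 2  # keep dimensions divisible by 2

print(height_for(1920, 16, 9))    # 1080 (widescreen)
print(height_for(1080, 9, 16))    # 1920 (vertical/mobile)
print(height_for(2048, 2.39, 1))  # 856 (cinematic ultrawide)
```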
FPS (Frames Per Second): How many individual frames are shown each second of video. 24 fps is the cinematic standard, 30 fps is the broadcast standard, and 60 fps is the smooth/sports standard. Higher FPS means more frames have to be generated.
Duration: The length of the generated video clip, typically 3-10 seconds for current AI models. Longer videos are created by generating multiple clips and editing them together.
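FPS and duration multiply together to set the workload: frames to generate = FPS x duration. A quick worked example:

```python
# Frame count is simply FPS x duration, which is why both settings
# affect generation time and cost.
for fps in (24, 30, 60):
    for seconds in (5, 10):
        print(f"{fps:>2} fps x {seconds:>2}s = {fps * seconds:>3} frames")
```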
Video quality terms
Temporal coherence: Consistency between consecutive frames. Good temporal coherence means objects maintain their shape, color, and position smoothly from frame to frame. Poor temporal coherence causes flickering, morphing, or objects that change appearance between frames.
Artifacts: Visual errors in the generated video — distorted faces, extra fingers, warped text, impossible geometry, or flickering areas. All models produce some artifacts; the goal is minimizing them through good prompting and model selection.
Hallucination: When the AI generates something that wasn't in the prompt and doesn't make sense — an extra hand appearing, text that changes between frames, or objects that materialize and vanish. Related to but distinct from artifacts.
Motion blur: Natural blurring that occurs with fast movement, simulating how a real camera captures motion. Some AI models produce artificial-looking motion blur or apply it inconsistently.
Upscaling: Increasing the resolution of generated video after creation. AI upscaling uses machine learning to add detail when enlarging, producing better results than simple pixel scaling.
Workflow and pipeline terms
Pipeline: The sequence of AI models and processing steps used to create final output. For example: text-to-image → upscale → image-to-video → color grade. PonPon's Canvas and Flow features let you build multi-step pipelines.
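Conceptually, a pipeline is just functions feeding each other. A hypothetical sketch, where every step name is a placeholder rather than a PonPon API call:

```python
# Hypothetical pipeline sketch; the functions are placeholders standing in
# for real generation calls, not PonPon's actual API.
def text_to_image(prompt: str) -> str:
    return f"image({prompt})"

def upscale(image: str) -> str:
    return f"upscaled({image})"

def image_to_video(image: str, motion_prompt: str) -> str:
    return f"video({image}, {motion_prompt})"

def color_grade(video: str) -> str:
    return f"graded({video})"

# Each step's output becomes the next step's input.
image = text_to_image("a lighthouse at dusk, cinematic lighting")
video = image_to_video(upscale(image), "slow dolly shot toward the lighthouse")
print(color_grade(video))
```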
Batch generation: Creating multiple video variations from the same or similar prompts in a single session. Useful for A/B testing prompts or generating options to choose from.
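In practice, batching often means varying the seed while holding the prompt fixed. Another hypothetical sketch (generate is a placeholder, not a real API call):

```python
# Hypothetical batch sketch: same prompt, different seeds, several
# options to choose from. generate() is a placeholder, not a real API.
def generate(prompt: str, seed: int) -> str:
    return f"clip(prompt={prompt!r}, seed={seed})"

prompt = "a red kite over a stormy sea, handheld camera"
for clip in [generate(prompt, seed) for seed in (101, 102, 103, 104)]:
    print(clip)
```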
Credits: The currency used on AI platforms to pay for generations. Different models and settings cost different amounts of credits. Higher resolution, longer duration, and premium models generally cost more.
Queue: The waiting line for generation requests. During peak usage, your generation enters a queue and is processed in order. Free tier users may experience longer queue times.
Camera and cinematic terms used in prompts
Tracking shot: Camera moves alongside the subject, following their motion. Produces dynamic, engaging footage.
Dolly shot: Camera moves toward or away from the subject on a track. Creates a sense of approaching or departing.
Pan: Camera rotates horizontally from a fixed position, sweeping across a scene.
Tilt: Camera rotates vertically from a fixed position, moving from ground to sky or vice versa.
Rack focus: Focus shifts from one subject to another within the same shot. Creates visual storytelling by directing attention.
Depth of field (DOF): How much of the scene is in focus. Shallow DOF blurs the background (bokeh), keeping attention on the subject. Deep DOF keeps everything sharp.
Bokeh: The aesthetic quality of the out-of-focus areas in an image, typically appearing as soft, circular light blobs in the background. Mentioning bokeh in prompts often produces a pleasing, professional-looking background blur.
Keep this as a reference
Bookmark this page. As you explore AI video on PonPon — across Sora 2, Kling 3.0, Veo 3.1, Seedance 2.0, and Nano Banana Pro — you'll encounter these terms in settings, documentation, and community discussions. Understanding them makes you a better prompter and a more efficient creator.
