AI Video Benchmarks Explained
Everyone cites the leaderboard. Few explain what the numbers actually mean for your work.
Every major AI video model release in 2026 has led with the same claim: top scores on the Artificial Analysis Video Arena leaderboard. WAN 2.7 announced an Elo of 1762. HappyHorse-1.0 highlighted its 95-point gap over the previous leader. Seedance 2.0 pointed to its second-place finish just 9 points behind in audio-included tests.
These numbers dominate the conversation around model selection, but most creators encountering them have reasonable questions: What is an Elo score? How is the leaderboard built? What do the categories actually measure? And most importantly — can a number on a chart pick the right model for your project?
This guide answers those questions. It is written for creators who use AI video tools, not for researchers who build them. The goal is to make you a better consumer of benchmark data, not to replace your own testing.
What Is an Elo Rating?
The Elo system was invented by physicist Arpad Elo in the 1960s to rank chess players. The core idea is simple: instead of measuring absolute performance, it measures relative preference. Two players (or in this case, two models) compete head-to-head, and the winner gains rating points while the loser drops points. Over thousands of matches, each player's rating converges on a number that reflects their true strength relative to the field.
In the context of AI video, "matches" are human evaluations. A real person watches two video clips — one from Model A, one from Model B, both generated from the same prompt — and picks the one they prefer. Neither the prompt nor the model identity is revealed to the evaluator. This blind preference format eliminates brand bias: evaluators judge the output, not the marketing.
The Artificial Analysis Video Arena, the most widely cited leaderboard in AI video, has conducted hundreds of thousands of these pairwise comparisons. A single Elo score is not itself a probability. The difference between two models' scores is what predicts how often one will be preferred over the other by a human evaluator in a blind test.
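To make the mechanics concrete, here is a minimal sketch of the textbook Elo update after a single comparison. The article does not describe how the arena actually computes its ratings, so treat this as an illustration of the classic system only. The step size `k = 32` and the starting rating of 1000 are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple:
    """Return new (winner, loser) ratings after one blind comparison.

    k is an assumed step size; larger values move ratings faster.
    """
    # The less expected the win, the more points change hands.
    surprise = 1.0 - expected_score(winner, loser)
    return winner + k * surprise, loser - k * surprise


# Two hypothetical models start level; the first one wins a vote.
new_a, new_b = update_elo(1000.0, 1000.0)
print(new_a, new_b)  # 1016.0 984.0
```

An upset (a low-rated model beating a high-rated one) moves more points than an expected result, which is why ratings settle once the votes stop being surprising.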
How to Read the Numbers
Elo scores are relative, not absolute. A score of 1762 means nothing on its own — it only has meaning in comparison to other scores on the same leaderboard, calculated from the same pool of evaluations.
The mathematical relationship between two Elo scores predicts the win probability between those models:
| Elo Difference | Expected Win Rate |
|---|---|
| 0 points | 50% (coin flip) |
| 50 points | 57% |
| 100 points | 64% |
| 200 points | 76% |
| 300 points | 85% |
| 400 points | 91% |
So when WAN 2.7 (1762) faces LTX-2.3 Pro (1484), the 278-point gap predicts WAN 2.7 would be preferred about 83% of the time. When Kling 3.0 (1247 in text-to-video without audio) faces Seedance 2.0 (1270), the 23-point gap predicts Seedance would be preferred only about 53% of the time — barely distinguishable from a coin flip.
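The table and the two matchups above follow from the standard Elo expected-score formula. A short sketch, assuming the conventional 400-point scale, reproduces them so you can check any other gap yourself:

```python
def win_probability(elo_gap: float) -> float:
    """Expected win rate for the higher-rated model, given its Elo lead."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


# Rows of the table above
for gap in (0, 50, 100, 200, 300, 400):
    print(f"{gap:>3} points -> {win_probability(gap):.0%}")

# Matchups quoted in the text
print(f"{win_probability(1762 - 1484):.0%}")  # WAN 2.7 vs LTX-2.3 Pro: 83%
print(f"{win_probability(1270 - 1247):.0%}")  # Seedance 2.0 vs Kling 3.0: 53%
```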
This means: large gaps are meaningful, small gaps are noise. Two models within 30-50 Elo points of each other are essentially tied on human preference. Do not choose between them based on the leaderboard alone.
The Three Leaderboard Categories
The Artificial Analysis Video Arena evaluates models across three distinct categories. Each measures something different, and a model's ranking can vary dramatically between them.
Text-to-Video (Without Audio)
The foundational category. Evaluators compare video clips generated from the same text prompt, with audio stripped or muted. The judgment is purely visual: motion quality, physics accuracy, prompt adherence, aesthetic appeal, and absence of artifacts.
This category favors models with strong visual fidelity and motion coherence. Models that produce photorealistic textures, physically plausible object interactions, and smooth temporal consistency score well here. It does not measure audio quality, generation speed, or anything beyond what the eyes see.
Current top 5 (as of May 2026):
| Rank | Model | Elo |
|---|---|---|
| 1 | WAN 2.7 | 1,762 |
| 2 | LTX-2.3 Pro | 1,484 |
| 3 | HappyHorse-1.0 | 1,446 |
| 4 | Seedance 2.0 | 1,270 |
| 5 | Kling 3.0 | 1,247 |
Notice the gap structure: WAN 2.7 is in a tier of its own, nearly 300 points ahead of second place. LTX-2.3 Pro and HappyHorse-1.0 form a second tier, 38 points apart. Seedance 2.0 and Kling 3.0 form a third tier, effectively tied at 23 points apart. These tiers matter more than individual rankings.
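One way to find tiers mechanically is to sort the scores and start a new tier wherever the drop to the next model exceeds a threshold. The sketch below uses the scores from the table; the 50-point threshold and the function name `group_into_tiers` are assumptions for illustration, not anything the leaderboard defines.

```python
def group_into_tiers(scores: dict, max_gap: float = 50.0) -> list:
    """Split models into tiers wherever adjacent scores differ by more than max_gap."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    tiers = []
    for name, elo in ranked:
        # Start a new tier for the first model or after a large drop.
        if not tiers or tiers[-1][-1][1] - elo > max_gap:
            tiers.append([])
        tiers[-1].append((name, elo))
    return [[name for name, _ in tier] for tier in tiers]


text_to_video = {
    "WAN 2.7": 1762,
    "LTX-2.3 Pro": 1484,
    "HappyHorse-1.0": 1446,
    "Seedance 2.0": 1270,
    "Kling 3.0": 1247,
}
print(group_into_tiers(text_to_video))
# [['WAN 2.7'], ['LTX-2.3 Pro', 'HappyHorse-1.0'], ['Seedance 2.0', 'Kling 3.0']]
```

Because the rule only compares neighbors, a long chain of closely spaced models would land in one tier even if its top and bottom are far apart, so check the total spread of any large tier.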
Text-to-Video (With Audio)
Same visual comparison, but with audio enabled. Evaluators hear whatever the model generates alongside the video — ambient sound, dialogue, music, sound effects — and judge the combined audiovisual experience.
This category adds a second dimension that rewards models with strong audio synthesis. A model that produces stunning visuals but jarring or absent audio will score lower here than in the audio-stripped category. Conversely, a model with excellent audio-visual synchronization can punch above its visual-only weight.
The gap between categories is informative. Seedance 2.0 scored 1,270 in visual-only but 1,221 with audio, a small drop suggesting its audio is competent but not a differentiator. HappyHorse scored 1,446 visual-only but 1,230 with audio, a significant drop suggesting its audio quality lags behind its visual quality.
Image-to-Video
A different input modality entirely. Both models receive the same reference image and the same motion prompt, then generate a video that animates the image. Evaluators judge how faithfully the model preserves the reference image's identity while adding natural motion.
This category measures identity preservation, motion naturalness, and reference fidelity. Models that maintain facial features, clothing details, and color accuracy while generating fluid, artifact-free motion score highest. HappyHorse-1.0 leads this category at 1,415 Elo. That is 31 points below its 1,446 text-to-video score, but its rank climbs from third to first, suggesting its architecture is particularly strong at visual conditioning.
For creators who work primarily with reference images — product shots, character-driven content, photo animation — the image-to-video ranking is more relevant to their workflow than the text-to-video ranking.
What Benchmarks Do NOT Measure
Understanding what the leaderboard excludes is as important as understanding what it includes. Several factors that critically affect production workflows are invisible to blind preference testing.
Multi-Shot Consistency
The leaderboard evaluates single clips in isolation. It does not test whether a model can maintain character identity across multiple clips or generate connected narrative sequences. Kling 3.0's multi-shot capability — generating up to six consistent cuts in one pass — receives zero leaderboard credit despite being one of the most valuable features for narrative content production.
A model could rank #1 on the leaderboard while being completely unable to produce a two-shot sequence with a consistent character. If multi-shot matters to your workflow, the leaderboard cannot help you evaluate it.
Generation Speed
Blind preference tests show evaluators the final output. They do not show how long the output took to generate. A model that produces a slightly preferred clip in 5 minutes scores higher than a model that produces a nearly-as-good clip in 30 seconds. For creators who iterate rapidly — testing prompt variations, exploring creative directions, producing social content on deadline — generation speed can matter more than marginal quality advantages.
Seedance 2.0 renders most clips in under 60 seconds. Several Seedance iterations can finish in the time a slower model takes to generate a single clip. Speed enables a workflow (rapid iteration, prompt exploration, volume production) that quality alone does not.
Cost Per Generation
The leaderboard treats every model equally regardless of cost. A model accessible only through a $200/month enterprise API and a model available for $0.10 per clip receive the same evaluation. For creators operating on a budget — which includes most individual creators and small teams — cost-per-generation directly affects how many iterations they can afford, which in turn affects the quality of their final output.
A creator with a fixed monthly budget will often produce better work with an affordable model that allows 100 iterations than with a premium model that allows 10.
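The 100-versus-10 comparison is simple division, and it is worth running with your own numbers. In this sketch the $50 budget and both per-clip prices are hypothetical values chosen to reproduce that ratio; they are not real prices for any model.

```python
def affordable_iterations(monthly_budget: float, cost_per_clip: float) -> int:
    """Whole number of clips a fixed budget covers at a given per-clip price."""
    if cost_per_clip <= 0:
        raise ValueError("cost_per_clip must be positive")
    return int(monthly_budget // cost_per_clip)


budget = 50.00  # hypothetical monthly budget in dollars
print(affordable_iterations(budget, 0.50))  # hypothetical affordable model: 100
print(affordable_iterations(budget, 5.00))  # hypothetical premium model: 10
```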
Controllability and Prompt Adherence
The leaderboard measures whether evaluators prefer one clip over another, not whether either clip matches the creator's intent. A model that produces beautiful but unpredictable output may score well on preference while frustrating creators who need specific compositions, camera movements, or character actions.
Veo 3.1 ranks below WAN 2.7 and HappyHorse on the preference leaderboard, but its camera control precision — faithfully executing dolly, crane, tracking, and orbital movements from prompt descriptions — makes it the first choice for creators who need specific cinematographic direction. That precision receives no leaderboard credit.
Audio Quality in Isolation
The with-audio category captures audio as part of the overall experience, but does not isolate audio quality for separate evaluation. A model with perfect visuals and mediocre audio may score similarly to a model with good visuals and excellent audio. For creators building content where audio matters independently — dialogue scenes, music videos, ASMR content — the combined score obscures the information they need.
Edge Cases and Failure Modes
Leaderboard prompts are designed to test general capability across a broad range of scenarios. They do not stress-test specific failure modes: hands with the wrong number of fingers, text that becomes illegible, reflections that behave impossibly, or rapid motion that produces smearing. A model can score well on average while having specific, predictable failure modes that affect your particular use case.
Why Human Judges Instead of Automated Metrics
Video generation research has developed several automated quality metrics: Fréchet Video Distance (FVD) measures the statistical distance between generated and real video distributions, LPIPS compares perceptual similarity at the frame level, and FVMD evaluates motion dynamics. These metrics are cheap to compute and scale to any volume of clips. So why does the most cited leaderboard use expensive human evaluation instead?
The answer is that automated metrics measure different things than human preference, and those things do not always correlate. FVD scores can improve while a model produces output that humans find uncanny or unnatural. LPIPS penalizes deviations from ground truth, but creative deviation is often desirable in generative video — a model that adds dramatic lighting to a sunset prompt may score worse on LPIPS but better on human preference.
A 2025 analysis by Artificial Analysis found that FVD scores predicted the human-preferred model correctly only 61% of the time, not far above the 50% a coin flip would achieve on a pairwise comparison. Agreement improved to 73% when combining FVD with LPIPS and temporal consistency metrics, but that still means the automated ensemble disagrees with human judges on more than one in four comparisons.
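An agreement figure like this can be understood as the share of human votes where the metric would have picked the same winner. The sketch below shows one plausible way to compute it; the article does not say how the cited analysis was actually done, and the model names, FVD values, and votes here are invented for illustration.

```python
def pairwise_agreement(metric: dict, human_votes: list, lower_is_better: bool = True) -> float:
    """Fraction of human votes where the metric picks the same winner.

    human_votes is a list of (preferred_model, other_model) pairs.
    """
    agree = 0
    for winner, loser in human_votes:
        if lower_is_better:
            metric_agrees = metric[winner] < metric[loser]
        else:
            metric_agrees = metric[winner] > metric[loser]
        agree += metric_agrees
    return agree / len(human_votes)


# Hypothetical FVD scores (lower is better) and hypothetical human votes.
fvd = {"Model A": 120.0, "Model B": 150.0, "Model C": 90.0}
votes = [
    ("Model A", "Model B"),
    ("Model B", "Model C"),
    ("Model A", "Model C"),
    ("Model C", "Model B"),
]
print(pairwise_agreement(fvd, votes))  # 0.5
```

A result of 0.5 means the metric does no better than a coin flip on these votes, which is the baseline any agreement percentage should be read against.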
Human evaluation captures something automated metrics cannot: the holistic judgment of whether a video feels right. That judgment integrates motion quality, aesthetic appeal, physical plausibility, emotional resonance, and absence of artifacts into a single preference signal. No automated metric has successfully decomposed and recombined these dimensions.
The trade-off is cost and scale. Human evaluation requires recruiting evaluators, managing quality control, paying per comparison, and waiting for sufficient data. Automated metrics return a number in seconds. For model developers iterating on architecture changes, automated metrics are essential for day-to-day development. For final ranking and public communication, human preference remains the gold standard — and the Artificial Analysis arena is the most widely trusted implementation of that standard.
As a creator, the practical implication is simple: when a model claims high FVD or LPIPS scores but does not appear on the human preference leaderboard, treat the claim with caution. When a model ranks well on blind human preference, the signal is more likely to predict your own experience.
How to Use Benchmarks Wisely
Benchmarks are one input to model selection, not the answer to it. Here is a framework for incorporating leaderboard data into your decision-making without over-relying on it.
Step 1: Use the Leaderboard to Shortlist, Not to Choose
The leaderboard tells you which models are competitive. If a model scores 300+ Elo points below the leader, it is measurably behind on human preference and probably not worth testing unless it has a specific feature you need. If a model is within 100 points of the top, it is in the competitive tier and worth evaluating.
Treat the leaderboard as a filter: it removes obviously inferior options from your consideration set. It does not pick the winner from the remaining options.
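The filter can be written down as a small rule. This sketch applies the 100-point and 300-point thresholds from this step to a set of invented scores. The model names and numbers are placeholders, and the "judgment call" label for the middle band is an assumption, since the guideline above only defines the two ends.

```python
def band_by_gap(scores: dict, competitive_within: float = 100.0,
                behind_beyond: float = 300.0) -> dict:
    """Label each model by how far its Elo sits below the category leader."""
    leader = max(scores.values())
    bands = {}
    for name, elo in scores.items():
        gap = leader - elo
        if gap <= competitive_within:
            bands[name] = "competitive"
        elif gap >= behind_beyond:
            bands[name] = "behind"
        else:
            bands[name] = "judgment call"
    return bands


# Hypothetical scores for one leaderboard category.
example_scores = {"Model A": 1400, "Model B": 1340, "Model C": 1180, "Model D": 1050}
print(band_by_gap(example_scores))
```

Run it once per category, since a model's band can differ between text-to-video and image-to-video.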
Step 2: Match the Category to Your Workflow
Check your primary input mode. If you work mostly with text prompts and need visual-only output (adding your own audio later), the text-to-video without audio ranking is most relevant. If you animate reference images, check the image-to-video ranking. If you need native audio, check the with-audio ranking.
A model that ranks #3 in text-to-video might rank #1 in image-to-video. Your workflow determines which ranking matters.
Step 3: Test Your Actual Use Case
Generate your most common prompt — not a generic test prompt, but a prompt representative of the content you actually produce — across 3-4 shortlisted models. Compare the results in a side-by-side comparison workspace.
The model that wins on your specific content type may not match the leaderboard winner. Product photography, character animation, landscape cinematics, and social media content each stress different model capabilities. Your use case is specific; the leaderboard is general.
Step 4: Factor in What Benchmarks Miss
After visual quality comparison, evaluate:
- Speed: How long does each model take to generate your typical prompt? Does the speed difference enable a different workflow?
- Cost: At your expected volume, what is the monthly cost for each model? Does the cost difference change how many iterations you can afford?
- Controllability: Does the model execute your specific camera directions, character descriptions, and scene compositions faithfully? Or does it produce beautiful but unpredictable output?
- Multi-shot: If you need consistent characters across clips, does the model support it? Kling 3.0's multi-shot capability is invisible on the leaderboard but essential for narrative projects.
- Ecosystem: Does the model integrate with your existing tools and workflow? Is the API stable? Is there community support?
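If you want to combine these factors into one comparison, a weighted score is a simple option. The sketch below is only one possible approach: the factor weights, the 1-to-5 ratings, and the model names are all hypothetical values you would replace with your own judgments after testing.

```python
def rank_models(ratings: dict, weights: dict) -> list:
    """Rank models by the weighted average of their per-factor ratings."""
    total_weight = sum(weights.values())
    totals = {
        model: sum(weights[factor] * scores[factor] for factor in weights) / total_weight
        for model, scores in ratings.items()
    }
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical weights: this creator cares most about quality, then speed and cost.
weights = {"quality": 3, "speed": 2, "cost": 2, "control": 1}

# Hypothetical 1-5 ratings recorded after hands-on testing.
ratings = {
    "Model A": {"quality": 5, "speed": 2, "cost": 2, "control": 3},
    "Model B": {"quality": 4, "speed": 5, "cost": 4, "control": 3},
}
print(rank_models(ratings, weights))
# [('Model B', 4.125), ('Model A', 3.25)]
```

Changing the weights changes the winner, which is the point: the ranking reflects your priorities rather than a general leaderboard.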
Step 5: Re-Evaluate Quarterly
The leaderboard changes. Models improve, new competitors appear, and the relative rankings shift. A model that was #1 six months ago may be #4 today. Set a quarterly reminder to re-check the leaderboard and re-test your shortlist with your current most-common prompt.
The Leaderboard Paradox
There is an inherent tension in how the AI video industry uses benchmarks. Model developers optimize for leaderboard scores because they drive marketing and user acquisition. But the leaderboard measures a narrow slice of what makes a model useful: blind visual preference on single clips generated from standardized prompts.
The result is a race where every model improves on the measured dimension (raw visual preference) while differentiation happens on unmeasured dimensions (speed, multi-shot, controllability, cost, ecosystem).
Economists call this Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The AI video industry is not there yet — the Elo leaderboard remains informative because no model has found a way to game blind human preference. But the incentive to optimize for the benchmark at the expense of production utility is real, and creators should be aware of it. A model that allocates its parameter budget toward winning blind comparisons may be making trade-offs — longer generation times, higher compute costs, reduced controllability — that the leaderboard does not reveal.
The models that rank highest may not be the models that produce the best work in your specific creative context.
This is not a criticism of benchmarks — blind human preference is the most honest evaluation methodology available for generative models. It is a reminder that honest evaluation of one dimension is not comprehensive evaluation of all dimensions.
Current Landscape: Where Each Model Leads
Rather than ranking models by a single number, here is where each major model holds a genuine advantage as of May 2026:
- WAN 2.7: Broadest feature set of any single model (four generation modes), highest text-to-video Elo, open-source flexibility for custom pipelines
- Kling 3.0: Only consumer model with multi-shot identity preservation across up to six cuts, plus 15-second clips with native audio
- HappyHorse-1.0: The benchmark leader in image-to-video, with the strongest visual conditioning scores, meaning it preserves reference image identity most faithfully during animation
- The photoreal physics standard: Set the bar for skin texture, lighting fidelity, and physical accuracy before its April 2026 shutdown; its quality benchmark remains the reference point the industry measures against
- Veo 3.1: Most precise camera direction, strongest prompt-to-camera-movement fidelity
- Seedance 2.0: Fastest generation, enabling a rapid-iteration workflow that no slower model can match regardless of quality
Each of these advantages is real. None of them is fully captured by a single Elo number. The best model for your work depends on which advantage matters most for what you create.
Making the Choice: A Practical Example
To illustrate how this framework works in practice, consider a creator who produces short-form product showcase videos for e-commerce brands. Their most common prompt involves a product rotating on a clean background with natural lighting and subtle camera movement.
Step 1 (Shortlist): The leaderboard shows WAN 2.7, HappyHorse, Seedance 2.0, and Kling 3.0 all in the competitive tier for image-to-video. All four are worth testing.
Step 2 (Category match): Product showcases start from a product photo, so the image-to-video ranking is most relevant. HappyHorse leads this category with an Elo of 1,415.
Step 3 (Real test): The creator generates their standard product rotation prompt across all four models. HappyHorse preserves product details best. Seedance 2.0 produces slightly softer textures but finishes in 40 seconds instead of 4 minutes. Kling 3.0 offers the smoothest camera motion.
Step 4 (Unmeasured factors): The creator produces 20 videos per week. At that volume, and assuming one generation per video, Seedance's 40-second generation time saves roughly 67 minutes per week compared to HappyHorse's 4 minutes (20 × 200 seconds). The quality difference is visible in side-by-side comparison but negligible to end customers viewing on mobile screens.
Step 5 (Decision): The creator chooses Seedance 2.0 for weekly batch production and reserves HappyHorse for hero product launches where maximum fidelity matters. Neither choice is WAN 2.7, the text-to-video leader. Both choices are correct for this workflow.
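The time saving in Step 4 is quick to check. This sketch uses the figures from the example (20 videos per week, 4 minutes versus 40 seconds); the `attempts_per_video` parameter is an added assumption to show how the saving grows once you count retries.

```python
def weekly_minutes_saved(videos_per_week: int, slow_seconds: float,
                         fast_seconds: float, attempts_per_video: int = 1) -> float:
    """Minutes saved per week by using the faster model for every generation."""
    generations = videos_per_week * attempts_per_video
    return generations * (slow_seconds - fast_seconds) / 60.0


# One generation per video: 20 * (240 - 40) seconds, about 67 minutes.
print(round(weekly_minutes_saved(20, 240, 40)))

# Three attempts per video (an assumed iteration count): 200 minutes.
print(round(weekly_minutes_saved(20, 240, 40, attempts_per_video=3)))
```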
This is how benchmarks should work: not as a verdict, but as one input to a decision shaped by your specific creative requirements, budget, timeline, and quality bar. The leaderboard narrows the field. Your own testing in PonPon's video generation studio reveals which model actually serves your work best.
Benchmarks inform. Your own eyes choose.