[ ZERO STAT ]
How Zero Stat ranks.
A transparent, defensible method for ranking AI products. Built so anyone curious about AI — not just buyers or builders — can see how the picks get made. We publish the rubric. We publish the data sources. We don't move goalposts between videos.
v1 · 2026-06-28 · Last updated by Newton, the editor-in-chief. The Opus that authored the original charter lives somewhere in this doc's git history, but the rubric is human-curated now.
The pinned-comment version:
We rank on five dimensions, weighted by what matters to business buyers: capability 30%, cost-efficiency 25%, reliability 20%, ecosystem/automation 15%, momentum 10%. Every product is rated 1 to 10 on each axis against its category peers (LLMs vs LLMs, video models vs video models). Composite is the weighted sum. Sources are linked under every number on-screen and in the description.
The five axes
Every product on every Ranked video gets these five scores. The dimensions are constant across videos — only the shortlist changes. Composite is the weighted sum; it's a 1–10 number, and it's the basis for the on-screen tier list.
Capability
Raw task performance: can it actually do the job at the level a decision-maker needs? Inputs: benchmarks (MMLU, SWE-bench, GPQA, HumanEval), public eval suites, our own test prompts when relevant, and real-world performance for the named use case.
30%
Cost-efficiency
Performance per dollar. Inputs: input and output pricing per million tokens, subscription cost, and total cost including iteration burn (the extra generations needed to get one usable result). Critical for high-volume use; weighted heavily because that's where most business buyers make or save money.
25%
Reliability
Uptime, consistency, hallucination rate, prompt-to-prompt variance. Reliability matters more for production systems than for one-shot demos.
20%
Ecosystem / automation
API quality, MCP support, SDK ergonomics, integrations, and tooling maturity: the "can I actually plug this in?" axis. It gets more emphasis in our commentary for coding and agent categories, where the harness matters as much as the model; the 15% weight itself does not change.
15%
Momentum
Release cadence, trajectory, and signs that the vendor is investing rather than drifting. Critical for fast-moving categories (image, video, coding) where the leaderboard churns every 2–3 months.
10%
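The weighted sum described above can be sketched in a few lines of Python. This is a minimal illustration, not the channel's actual tooling: the weights are the published ones, while the function name, the axis key names, and the example scores are invented for the sketch.

```python
# Weights as published in the rubric; the key names are hypothetical.
WEIGHTS = {
    "capability": 0.30,
    "cost_efficiency": 0.25,
    "reliability": 0.20,
    "ecosystem": 0.15,
    "momentum": 0.10,
}


def composite(scores: dict[str, float]) -> float:
    """Weighted sum of five 1-10 axis scores; the result also falls in 1-10."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the five axes")
    for axis, value in scores.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{axis} score {value} is outside 1-10")
    return round(sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS), 2)


# Invented scores for an imaginary product, not real ratings.
example_scores = {
    "capability": 8,
    "cost_efficiency": 6,
    "reliability": 7,
    "ecosystem": 9,
    "momentum": 5,
}
print(composite(example_scores))  # 7.15
```

Because the weights sum to 1 and every axis score sits between 1 and 10, the composite stays between 1 and 10 as well.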
Why these weights, and why they don't change per video
A "best of" video is only as credible as its rubric. The weights above are the channel's rubric — they're written into the Stat Sheet's ranking-rubric block and reviewed only when we add a new category. The rubric is constant; only the shortlist changes.
Three implications:
- A "best cheap-volume LLM" Ranked and a "best creative-writing LLM" Ranked both use the same five axes with the same weights — we just steer different contenders into each shortlist based on the task, and our commentary weighs the verdict differently within the same scoring frame.
- If a viewer's objection is "you under-weighted cost": that's a fair debate, but the rubric stays the rubric, and it's your call whether to use it. We can surface scenarios where a re-weight would change the winner (e.g., "at cost-efficiency = 50%, this is the new ranking").
- If a viewer's objection is "you forgot about [X axis]": they may be right. We open it as a tracked suggestion and revise at the next quarterly review. We do not move goalposts between videos.
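A re-weight scenario like the "cost-efficiency = 50%" example can be sketched as follows. The text does not say how the other four weights are adjusted when one axis is raised, so this sketch assumes they shrink proportionally so the total stays at 1. The product names and scores are invented.

```python
BASE_WEIGHTS = {
    "capability": 0.30,
    "cost_efficiency": 0.25,
    "reliability": 0.20,
    "ecosystem": 0.15,
    "momentum": 0.10,
}


def reweight(weights: dict[str, float], axis: str, new_weight: float) -> dict[str, float]:
    """Set one axis to new_weight and scale the rest proportionally (an assumption)."""
    scale = (1.0 - new_weight) / (1.0 - weights[axis])
    return {a: (new_weight if a == axis else w * scale) for a, w in weights.items()}


def rank(products: dict[str, dict[str, float]], weights: dict[str, float]) -> list[str]:
    """Product names ordered by weighted-sum composite, best first."""
    totals = {
        name: sum(weights[a] * scores[a] for a in weights)
        for name, scores in products.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Invented products and scores, for illustration only.
products = {
    "Product A": {"capability": 9, "cost_efficiency": 4, "reliability": 8,
                  "ecosystem": 8, "momentum": 7},
    "Product B": {"capability": 7, "cost_efficiency": 8, "reliability": 7,
                  "ecosystem": 6, "momentum": 6},
}

print(rank(products, BASE_WEIGHTS))
print(rank(products, reweight(BASE_WEIGHTS, "cost_efficiency", 0.50)))
```

With these invented numbers, Product A leads under the published weights and Product B leads once cost-efficiency is set to 50%, which is the kind of scenario the bullet describes.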
Per-category nuances
The five axes are the same everywhere, but how each axis plays out varies by category:
LLM (text-only and multimodal)
- Capability: benchmark-driven (MMLU, GPQA, HumanEval, SWE-bench, our own prompts).
- Cost-efficiency: driven mostly by per-token pricing, across workloads from cheap classification to long-context reasoning.
- Reliability: matters more for production than demos.
Image generation
- Capability: partly benchmark (Arena scores, prompt-following evals), partly taste — we cite specific failure cases in commentary, not "this looks better."
- Cost-efficiency: matters less for marketing teams (low volume), more for app builders (high volume).
- Momentum: gets slightly more emphasis in commentary, since image-gen models ship every 2–3 months. The 10% weight is unchanged.
Video generation
- Capability: divided into raw visual quality vs. prompt fidelity vs. motion coherence. Scored separately in commentary.
- Cost-efficiency: brutal because iteration burn dominates (5–10 generations per usable 30s clip).
- Momentum: the axis we lean on most in commentary here. The numeric weight stays at 10%.
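The iteration-burn point under cost-efficiency comes down to simple arithmetic: the effective price of a usable clip is the per-generation price times the number of generations it takes to get one. A minimal sketch, where the 5 and 10 come from the range quoted above and the per-generation price is a made-up placeholder, not any vendor's actual price:

```python
def cost_per_usable_clip(price_per_generation: float, generations_per_usable_clip: float) -> float:
    """Effective cost of one usable clip once discarded generations are counted."""
    if price_per_generation < 0:
        raise ValueError("price must be non-negative")
    if generations_per_usable_clip < 1:
        raise ValueError("at least one generation is needed per usable clip")
    return price_per_generation * generations_per_usable_clip


HYPOTHETICAL_PRICE = 2.00  # placeholder price per 30s generation

for generations in (5, 10):
    total = cost_per_usable_clip(HYPOTHETICAL_PRICE, generations)
    print(f"{generations} generations per usable clip -> {total:.2f} per clip")
```

At the quoted range, the sticker price understates the real cost of a usable clip by a factor of 5 to 10.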
Voice / TTS
- Reliability matters more than capability: the typical failure modes are wrong pronunciation and artifacts.
- Capability is partly subjective; we lean on crowdsourced samples and real VO test renders.
Coding / agentic coding
- Capability: benchmark-anchored (SWE-bench Verified, multi-file refactor evals).
- Ecosystem: carries more emphasis in commentary than for chat LLMs (the value of Cursor / Claude Code is the IDE + harness).
- Momentum: critical — this category shipped more in 2024–2025 than the prior decade.
Avatar (talking-head)
- Capability = realism + lip-sync + emotion.
- Reliability includes the credit-burn problem (unreliable = "we ran out of credits mid-render").
- Ecosystem includes API + MCP + lip-sync-with-external-audio support.
Where the data comes from
Every claim in a Ranked video traces to one of:
- Vendor docs / release notes — primary, preferred (pricing pages, product docs).
- Independent benchmarks — SWE-bench, MMLU, Artificial Analysis, etc.
- Our test prompts — limited use, when a category is moving fast and benchmarks lag reality.
- Reputable press / analysis — for context, not as a source of truth on numbers.
What we don't use:
- Vendor marketing copy as a source of fact.
- Aggregator listicles ("top 10 AI tools!") as a source.
- Single-sourced rumors. If a number is from one place and we can't corroborate, we say "unconfirmed" or skip it.
- Anything we can't link to in the Stat Sheet's sources array.
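The sourcing rules above can be read as a small decision procedure for numeric claims. The sketch below is one possible reading, not the channel's actual pipeline. The source-type labels, the `Source` class, and the `claim_status` function are all hypothetical. It also assumes that a single linked vendor doc is sufficient on its own (because vendor docs are described as primary), which the text does not state outright.

```python
from dataclasses import dataclass

# Hypothetical labels for the source types listed above.
PRIMARY_KINDS = {"vendor_docs"}
NUMERIC_KINDS = {"vendor_docs", "independent_benchmark", "own_test"}
# "press" is context only; "vendor_marketing" and "aggregator_listicle"
# are never accepted as a source for numbers.


@dataclass(frozen=True)
class Source:
    kind: str
    url: str  # empty string means the source cannot be linked


def claim_status(sources: list[Source]) -> str:
    """Return 'ok', 'unconfirmed', or 'skip' for a numeric claim."""
    usable = [s for s in sources if s.kind in NUMERIC_KINDS and s.url]
    distinct_urls = {s.url for s in usable}
    if not distinct_urls:
        return "skip"  # nothing linkable that is allowed to back a number
    if any(s.kind in PRIMARY_KINDS for s in usable) or len(distinct_urls) >= 2:
        return "ok"  # assumption: primary source or corroboration is enough
    return "unconfirmed"  # one non-primary source, not corroborated


print(claim_status([Source("vendor_docs", "https://example.com/pricing")]))
print(claim_status([Source("independent_benchmark", "https://example.com/bench")]))
print(claim_status([Source("aggregator_listicle", "https://example.com/top10")]))
```

Sources without a URL are dropped first, mirroring the rule that anything that can't be linked in the Stat Sheet's sources array is not used.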
What we publish, and what we hold back
We publish: the rubric, the weights, what each axis means, where the data comes from.
We hold back: the exact 1–10 scores per product, the composite arithmetic, in-flight Stat Sheet annotations.
Why: scores are opinion. The rubric is the method. Viewers who care about the method can re-derive their own composite from the public rubric + their own weighting. It's the journalism version of showing your work without doxxing your sources.
What viewers should take away
Every Ranked is a defensible opinion, not a true ranking. The method is rigorous. The picks are good faith. Disagree with our picks? Cite a specific rubric violation ("you said X had 8 reliability but their uptime is closer to 6"). We'll either correct or explain. Don't argue vibes; argue axes.
The audience is anyone curious about AI — buyers, builders, and people who just want to understand the field. The rubric is the same regardless. Sourcing and rigor are the same regardless.
Definitions
- Mapped (in Stat Sheet): Identity + qualitative fields populated; volatile fields (current version, pricing, context window, benchmarks, our rating) are null and listed in the product's _verify array. Stable across re-scores pending a vendor change.
- Worked example: All fields populated with verified, sourced data. Use as a gold-standard template for how to score.
- Composite: Weighted sum of the five axis scores. Always 1–10. Meanings are comparative within a category, not absolute.
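The difference between a Mapped entry and a Worked example can be expressed as a simple check. This is a sketch under assumptions: only the `_verify` array is named in the definitions, so the volatile-field key names, the entry layout, and both function names are hypothetical stand-ins for whatever the Stat Sheet actually uses.

```python
# Hypothetical key names for the volatile fields listed in the definition.
VOLATILE_FIELDS = (
    "current_version",
    "pricing",
    "context_window",
    "benchmarks",
    "our_rating",
)


def is_mapped(entry: dict) -> bool:
    """True if every volatile field is null and listed in the entry's _verify array."""
    to_verify = set(entry.get("_verify", []))
    return all(
        entry.get(name) is None and name in to_verify
        for name in VOLATILE_FIELDS
    )


def is_worked_example(entry: dict) -> bool:
    """True if every volatile field is populated and nothing is left to verify."""
    populated = all(entry.get(name) is not None for name in VOLATILE_FIELDS)
    return populated and not entry.get("_verify")


# Invented entry for illustration; not a real product record.
mapped_entry = {
    "name": "Example Model",
    "category": "LLM",
    "current_version": None,
    "pricing": None,
    "context_window": None,
    "benchmarks": None,
    "our_rating": None,
    "_verify": list(VOLATILE_FIELDS),
}

print(is_mapped(mapped_entry), is_worked_example(mapped_entry))
```

An entry with some volatile fields filled in and others still pending satisfies neither check, which matches the definitions: it is no longer purely Mapped and not yet a Worked example.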
Revision log
We don't move goalposts between videos. The rubric only changes at quarterly review with a documented revision entry here.
| Date | Change |
| --- | --- |
| 2026-06-28 | Initial draft, posted publicly for week-1 Ranked launch. Authored by Newton (the editor-in-chief) on behalf of Opus, the master-prompt author of the channel's charter. |