[ ZERO STAT ]

How Zero Stat ranks.

A transparent, defensible method for ranking AI products. Built so anyone curious about AI — not just buyers or builders — can see how the picks get made. We publish the rubric. We publish the data sources. We don't move goalposts between videos.

v1 · 2026-06-28 · Last updated by Newton, the editor-in-chief. The Opus that authored the original charter lives somewhere in this doc's git history, but the rubric is human-curated now.

The pinned-comment version:

We rank on five dimensions, weighted by what matters to business buyers: capability 30%, cost-efficiency 25%, reliability 20%, ecosystem/automation 15%, momentum 10%. Every product is rated 1 to 10 on each axis against its category peers (LLMs vs LLMs, video models vs video models). Composite is the weighted sum. Sources are linked under every number on-screen and in the description.

The five axes

Every product on every Ranked video gets these five scores. The dimensions are constant across videos — only the shortlist changes. Composite is the weighted sum; it's a 1–10 number, and it's the basis for the on-screen tier list.

Capability

Raw task performance: can it actually do the job at the level a decision-maker needs? Inputs: benchmarks (MMLU, SWE-bench, GPQA, HumanEval), public eval suites, our own test prompts when relevant, and real-world performance for the named use case.

30%

Cost-efficiency

Performance per dollar: input + output pricing per million tokens, subscription cost, and total cost including iteration burn. Critical for high-volume use; weighted heavily because that's where most business buyers make or save money.

25%

Reliability

Uptime, consistency, hallucination rate, prompt-to-prompt variance. Reliability matters more for production systems than for one-shot demos.

20%

Ecosystem / automation

API quality, MCP support, SDK ergonomics, integrations, tooling maturity: the "can I actually plug this in?" axis. It gets more emphasis in commentary for coding and agent categories, where the harness matters as much as the model; the 15% weight itself does not change.

15%

Momentum

Release cadence, trajectory, signal that the vendor is investing vs. drifting. Critical for fast-moving categories (image, video, coding) where the leaderboard churns every 2–3 months.

10%
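The weighted sum described above can be sketched in a few lines of Python. The weights are the ones listed on this page; the dictionary keys, the function name, and the example scores are illustrative assumptions, not values from the Stat Sheet.

```python
# Weights from the rubric on this page; the key names are illustrative.
WEIGHTS = {
    "capability": 0.30,
    "cost_efficiency": 0.25,
    "reliability": 0.20,
    "ecosystem": 0.15,
    "momentum": 0.10,
}


def composite(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the five axis scores, each expected in the 1 to 10 range."""
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the rubric axes")
    for axis, value in scores.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{axis} score {value} is outside 1 to 10")
    return sum(weights[axis] * scores[axis] for axis in weights)


# Hypothetical scores for an imaginary product, not real Zero Stat data.
example_scores = {
    "capability": 8,
    "cost_efficiency": 6,
    "reliability": 7,
    "ecosystem": 9,
    "momentum": 5,
}
print(round(composite(example_scores), 2))
```

Because the weights sum to 1 and every axis score sits between 1 and 10, the composite also stays between 1 and 10.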

Why these weights, and why they don't change per video

A "best of" video is only as credible as its rubric. The weights above are the channel's rubric: they're written into the Stat Sheet's ranking-rubric block, reviewed when we add a new category, and changed only at a quarterly review. The rubric is constant; only the shortlist changes.

Three implications:

A "best cheap-volume LLM" Ranked and a "best creative-writing LLM" Ranked both use the same five axes with the same weights — we just steer different contenders into each shortlist based on the task, and our commentary weighs the verdict differently within the same scoring frame.
If a viewer's objection is "you under-weighted cost": fair debate, but our answer is the same. The rubric is the rubric, and it's your call whether to use it. We can surface scenarios where a re-weight would change the winner (e.g., "at cost-efficiency = 50%, this is the new ranking").
If a viewer's objection is "you forgot about [X axis]", they may be right. We open it as a tracked suggestion and revise at the next quarterly review. We do not move goalposts between videos.
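A re-weight scenario like the "cost-efficiency = 50%" example can be sketched as follows. This is a minimal illustration under stated assumptions: the page does not say how the other four weights would be adjusted, so the sketch shrinks them proportionally, and both products and all their scores are hypothetical.

```python
BASE_WEIGHTS = {
    "capability": 0.30,
    "cost_efficiency": 0.25,
    "reliability": 0.20,
    "ecosystem": 0.15,
    "momentum": 0.10,
}


def reweight(weights: dict[str, float], axis: str, new_weight: float) -> dict[str, float]:
    """Pin one axis to new_weight and scale the others so the total stays 1.

    Proportional scaling is an assumption made for this sketch.
    """
    if axis not in weights:
        raise KeyError(axis)
    if not 0 <= new_weight < 1:
        raise ValueError("new_weight must be in [0, 1)")
    scale = (1.0 - new_weight) / (1.0 - weights[axis])
    adjusted = {name: w * scale for name, w in weights.items()}
    adjusted[axis] = new_weight
    return adjusted


def rank(products: dict[str, dict[str, float]], weights: dict[str, float]) -> list[str]:
    """Product names ordered by composite score, best first."""
    def composite(scores: dict[str, float]) -> float:
        return sum(weights[axis] * scores[axis] for axis in weights)
    return sorted(products, key=lambda name: composite(products[name]), reverse=True)


# Hypothetical products and scores, invented for illustration only.
products = {
    "Model A": {"capability": 9, "cost_efficiency": 5, "reliability": 8,
                "ecosystem": 8, "momentum": 7},
    "Model B": {"capability": 7, "cost_efficiency": 9, "reliability": 7,
                "ecosystem": 7, "momentum": 6},
}

print(rank(products, BASE_WEIGHTS))
print(rank(products, reweight(BASE_WEIGHTS, "cost_efficiency", 0.50)))
```

With these invented scores, Model A leads under the published weights and Model B leads once cost-efficiency is pinned at 50%, which is the kind of scenario a video could surface without changing the rubric.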

Per-category nuances

The five axes are the same everywhere, but how each axis plays out varies by category:

LLM (text-only and multimodal)

Capability: benchmark-driven (MMLU, GPQA, HumanEval, SWE-bench, our own prompts).
Cost-efficiency: heavily per-token — spans cheap classification to long-context reasoning.
Reliability: matters more for production than demos.

Image generation

Capability: partly benchmark (Arena scores, prompt-following evals), partly taste — we cite specific failure cases in commentary, not "this looks better."
Cost-efficiency: matters less for marketing teams (low volume), more for app builders (high volume).
Momentum: gets slightly more emphasis in commentary, since image-gen ships every 2–3 months.

Video generation

Capability: divided into raw visual quality vs. prompt fidelity vs. motion coherence. Scored separately in commentary.
Cost-efficiency: brutal because iteration burn dominates (5–10 generations per usable 30s clip).
Momentum: the axis we emphasize most in commentary here; its weight in the composite stays at 10%.
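The iteration-burn point can be made concrete with simple arithmetic: the effective cost of a usable clip is the price of one generation multiplied by the number of generations it takes to get one. The function name and the $2.00 price below are hypothetical; only the 5 to 10 generations range comes from the text above.

```python
def cost_per_usable_clip(price_per_generation: float, generations_per_usable: float) -> float:
    """Effective cost of one usable clip once discarded generations are counted."""
    if price_per_generation < 0:
        raise ValueError("price_per_generation must be non-negative")
    if generations_per_usable < 1:
        raise ValueError("at least one generation is needed per usable clip")
    return price_per_generation * generations_per_usable


# Hypothetical list price per 30s generation; not a real vendor's price.
price = 2.00
low = cost_per_usable_clip(price, 5)
high = cost_per_usable_clip(price, 10)
print(f"${low:.2f} to ${high:.2f} per usable clip")
```

Under that assumed price, the sticker cost understates the real cost by a factor of five to ten, which is why two video models with similar list prices can score very differently on this axis.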

Voice / TTS

Reliability matters more than capability: the failure modes are wrong pronunciation and audio artifacts.
Capability is partly subjective; we lean on crowdsourced samples and real VO test renders.

Coding / agentic coding

Capability: benchmark-anchored (SWE-bench Verified, multi-file refactor evals).
Ecosystem: emphasized more in commentary than for chat LLMs (the value of Cursor / Claude Code is the IDE + harness).
Momentum: critical. This category shipped more in 2024–2025 than in the prior decade.

Avatar (talking-head)

Capability = realism + lip-sync + emotion.
Reliability includes the credit-burn problem (unreliable = "we ran out of credits mid-render").
Ecosystem includes API + MCP + lip-sync-with-external-audio support.

Where the data comes from

Every claim in a Ranked video traces to one of:

Vendor docs / release notes — primary, preferred (pricing pages, product docs).
Independent benchmarks — SWE-bench, MMLU, Artificial Analysis, etc.
Our test prompts — limited use, when a category is moving fast and benchmarks lag reality.
Reputable press / analysis — for context, not as a source of truth on numbers.

What we don't use:

Vendor marketing copy as a source of fact.
Aggregator listicles ("top 10 AI tools!") as a source.
Single-sourced rumors. If a number is from one place and we can't corroborate, we say "unconfirmed" or skip it.
Anything we can't link to in the Stat Sheet's sources array.
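The "can we link it?" rule lends itself to a mechanical check. The sketch below assumes a simplified shape for claims and for the sources array; the real Stat Sheet schema is not published here, so every field name (`id`, `url`, `text`, `source_ids`) and every entry is hypothetical.

```python
def unlinked_claims(claims: list[dict], sources: list[dict]) -> list[str]:
    """Return the text of every claim that has no source entry carrying a URL."""
    linkable = {s["id"] for s in sources if s.get("url")}
    return [
        claim["text"]
        for claim in claims
        if not any(sid in linkable for sid in claim.get("source_ids", []))
    ]


# Hypothetical sources array and claims; URLs are placeholders.
sources = [
    {"id": "vendor-pricing", "url": "https://example.com/pricing"},
    {"id": "benchmark-board", "url": "https://example.com/leaderboard"},
    {"id": "forum-rumor", "url": None},
]
claims = [
    {"text": "Input price per million tokens", "source_ids": ["vendor-pricing"]},
    {"text": "Benchmark score", "source_ids": ["benchmark-board"]},
    {"text": "Rumored context window", "source_ids": ["forum-rumor"]},
    {"text": "Claim with no sources listed", "source_ids": []},
]

for text in unlinked_claims(claims, sources):
    print("cannot link:", text)
```

Anything this check returns would be treated the way the list above describes: dropped, or marked "unconfirmed" if it is mentioned at all.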

What we publish, and what we hold back

We publish: the rubric, the weights, what each axis means, where the data comes from.

We hold back: the exact 1–10 scores per product, the composite arithmetic, in-flight Stat Sheet annotations.

Why: scores are opinion. The rubric is the method. Viewers who care about the method can re-derive their own composite from the public rubric + their own weighting. It's the journalism version of showing your work without doxxing your sources.

What viewers should take away

Every Ranked is a defensible opinion, not a true ranking. The method is rigorous. The picks are good faith. Disagree with our picks? Cite a specific rubric violation ("you said X had 8 reliability but their uptime is closer to 6"). We'll either correct or explain. Don't argue vibes; argue axes.

The audience is anyone curious about AI — buyers, builders, and people who just want to understand the field. The rubric is the same regardless. Sourcing and rigor are the same regardless.

Definitions

Mapped (in Stat Sheet): Identity + qualitative fields populated; volatile fields (current version, pricing, context window, benchmarks, our rating) are null and listed in the product's _verify array. Stable across re-scores pending a vendor change.
Worked example: All fields populated with verified, sourced data. Use as a gold-standard template for how to score.
Composite: Weighted sum of the five axis scores. Always 1–10. Meanings are comparative within a category, not absolute.
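As a rough illustration of the Mapped and Worked example states, the sketch below checks an entry against the definitions above. Only the `_verify` array and the list of volatile fields come from this page; the exact key spellings, the helper names, and both sample entries are assumptions.

```python
# Volatile fields named in the definition above; key spellings are assumed.
VOLATILE_FIELDS = ("current_version", "pricing", "context_window", "benchmarks", "our_rating")


def is_mapped(entry: dict) -> bool:
    """True when every volatile field is null and listed in the entry's _verify array."""
    pending = set(entry.get("_verify", []))
    return all(entry.get(field) is None and field in pending for field in VOLATILE_FIELDS)


def is_worked_example(entry: dict) -> bool:
    """True when every volatile field is populated and nothing is left to verify."""
    return not entry.get("_verify") and all(
        entry.get(field) is not None for field in VOLATILE_FIELDS
    )


# Hypothetical entries, invented for illustration.
mapped_entry = {
    "name": "Example Model",
    "category": "LLM",
    "current_version": None,
    "pricing": None,
    "context_window": None,
    "benchmarks": None,
    "our_rating": None,
    "_verify": list(VOLATILE_FIELDS),
}
worked_entry = {
    "name": "Example Model",
    "category": "LLM",
    "current_version": "1.0",
    "pricing": {"input_per_million": 1.0, "output_per_million": 2.0},
    "context_window": 100_000,
    "benchmarks": {"example_benchmark": 0.5},
    "our_rating": {"capability": 7},
    "_verify": [],
}

print(is_mapped(mapped_entry), is_worked_example(worked_entry))
```

An entry that is neither fully mapped nor fully worked (for example, pricing filled in but still listed in `_verify`) fails both checks, which is a useful signal that it is mid-update.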

Revision log

We don't move goalposts between videos. The rubric only changes at quarterly review with a documented revision entry here.

Date	Change
2026-06-28	Initial draft, posted publicly for week-1 Ranked launch. Authored by Newton (the editor-in-chief) on behalf of Opus, the master-prompt author of the channel's charter.