---
title: "Benchmark Methodology"
type: guide
id: "benchmark-methodology"
description: "How AI Future Ready model scores should be interpreted by agents: normalized task scores, confidence limits, subjective judgment, and verification needs."
last_updated: "2026-04-24"
tags:
- "benchmarks"
- "methodology"
- "scores"
- "verification"
- "models"
---
# Benchmark Methodology
This site uses compact 0-100 task scores so agents can compare models quickly. These scores are decision aids, not lab-grade measurements.
## What Scores Mean
| Field | Meaning |
|-------|---------|
| `reasoning` | Multi-step logic, planning, hard analysis |
| `coding` | Software engineering, code editing, debugging |
| `math` | Formal problem solving and quantitative reasoning |
| `writing` | Clarity, style control, summarization, synthesis |
| `multilingual` | Non-English and cross-language usefulness |
| `speed` | Practical responsiveness and latency profile |
Scores are normalized within this site's model set. A `95` means "near the top of this reference set for this task," not an absolute universal measurement.
## How Agents Should Use Scores
- Use scores to shortlist, not to decide blindly.
- Combine scores with price, context window, modality, deployment, and license.
- Prefer task-specific scores over average scores.
- Treat close scores as ties unless cost or deployment clearly breaks the tie.
- Ask for the user's constraints before making a final recommendation.
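
The rules above can be sketched as a small shortlisting helper. This is a minimal illustration, not the site's actual logic: the `Model` class, the `price_per_mtok` field, and the `TIE_MARGIN` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold: scores within this many points count as a tie.
TIE_MARGIN = 3


@dataclass
class Model:
    name: str
    scores: dict            # task name -> 0-100 score, e.g. {"coding": 92}
    price_per_mtok: float   # hypothetical price field, USD per million tokens


def shortlist(models, task, max_price=None, limit=3):
    """Return up to `limit` candidates for `task`, best first."""
    # Apply the user's hard constraints before looking at scores.
    candidates = [
        m for m in models
        if task in m.scores and (max_price is None or m.price_per_mtok <= max_price)
    ]
    if not candidates:
        return []
    top = max(m.scores[task] for m in candidates)

    def sort_key(m):
        # Models within TIE_MARGIN of the top score are treated as tied,
        # and price breaks the tie. Everything else ranks by score.
        tied = top - m.scores[task] <= TIE_MARGIN
        return (0, m.price_per_mtok) if tied else (1, -m.scores[task])

    return sorted(candidates, key=sort_key)[:limit]
```

The helper uses the task-specific score rather than an average, and it returns a shortlist rather than a single winner, so the final choice can still account for modality, deployment, and license.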
## Confidence Limits
Scores can drift as providers update models, pricing, APIs, and benchmark reports. Volatile fields should be checked against:
- `last_updated`
- `last_verified` when present
- `content_hash`
- per-item JSON
- provider source notes when available
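
A freshness check over these fields might look like the sketch below. It assumes a per-item record is a dictionary with ISO-format date strings, and that the agent cached a `content_hash` from an earlier fetch; `MAX_AGE_DAYS` and the function name are hypothetical.

```python
from datetime import date

MAX_AGE_DAYS = 30  # hypothetical staleness threshold


def freshness(record, today, cached_hash=None):
    """Summarize how far a record's volatile fields can be trusted."""
    # Prefer last_verified; fall back to last_updated when it is absent.
    verified = "last_verified" in record
    stamp = record["last_verified"] if verified else record["last_updated"]
    age_days = (today - date.fromisoformat(stamp)).days
    # A different content_hash means the item changed since it was cached.
    changed = cached_hash is not None and record.get("content_hash") != cached_hash
    return {
        "verified": verified,
        "age_days": age_days,
        "stale": age_days > MAX_AGE_DAYS,
        "changed_since_cache": changed,
    }
```

A record that is stale, unverified, or changed since it was cached is a signal to re-fetch the per-item JSON or consult the provider's source notes before relying on it.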
## What This Site Should Improve
The next maturity step is to add `last_verified`, `sources`, and confidence markers to every volatile claim.
Recommended fields:
```yaml
last_verified: "2026-04-24"
sources:
- title: "Provider pricing page"
  url: "https://example.com/pricing"
pricing_confidence: "high"
benchmark_confidence: "medium"
```
## Agent Rule
If a recommendation depends on a volatile field such as price, release date, benchmark score, or context window, say how current the data is and prefer verified fields over unverified ones.
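
One way an agent could apply this rule is to attach a short caveat whenever a recommendation relies on a volatile field. The sketch below is illustrative only: the field key names in `VOLATILE_FIELDS`, the function name, and the wording of the caveat are assumptions.

```python
# Volatile fields named in the rule above; the key names are assumptions.
VOLATILE_FIELDS = {"price", "release_date", "benchmark_score", "context_window"}


def currency_note(used_fields, record):
    """Return a caveat for a recommendation, or None if none is needed."""
    volatile = sorted(set(used_fields) & VOLATILE_FIELDS)
    if not volatile:
        return None
    names = ", ".join(volatile)
    # Prefer the verified date over the plain update date.
    if "last_verified" in record:
        return f"Depends on {names}; data verified on {record['last_verified']}."
    updated = record.get("last_updated", "an unknown date")
    return f"Depends on {names}; data last updated {updated} and not verified."
```

For example, `currency_note(["price", "license"], {"last_updated": "2026-04-24"})` returns `"Depends on price; data last updated 2026-04-24 and not verified."`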