{"slug":"benchmark-methodology","id":"benchmark-methodology","type":"guide","title":"Benchmark Methodology","description":"How AI Future Ready model scores should be interpreted by agents: normalized task scores, confidence limits, subjective judgment, and verification needs.","last_updated":"2026-04-24","last_verified":null,"verification_status":"unverified","markdown_url":"/content/guides/benchmark-methodology.md","html_url":"/guides/benchmark-methodology","api_url":"/api/v1/guides/benchmark-methodology.json","content_hash":"8409dea92d808c7d2c8e44bbeaba04d7e98772c5ad12c76008342286e481a6a7","sha256":"8409dea92d808c7d2c8e44bbeaba04d7e98772c5ad12c76008342286e481a6a7","tags":["benchmarks","methodology","scores","verification","models"],"relationships":{"links":[],"related":[{"id":"data-methodology","title":"Data Methodology","type":"guide","html_url":"/guides/data-methodology","markdown_url":"/content/guides/data-methodology.md","shared_tags":["methodology","verification"],"score":4},{"id":"ai-failure-modes","title":"AI Failure Modes","type":"guide","html_url":"/guides/failure-modes","markdown_url":"/content/guides/failure-modes.md","shared_tags":["models"],"score":3},{"id":"best-for-task-matrix","title":"Best-For Task Matrix","type":"guide","html_url":"/guides/best-for-task-matrix","markdown_url":"/content/guides/best-for-task-matrix.md","shared_tags":["models"],"score":3},{"id":"build-ai-research-assistant","title":"Build an AI Research Assistant","type":"guide","html_url":"/guides/build-an-ai-research-assistant","markdown_url":"/content/guides/build-an-ai-research-assistant.md","shared_tags":["verification"],"score":3},{"id":"choose-a-cheap-model","title":"Choose a Cheap Model","type":"guide","html_url":"/guides/choose-a-cheap-model","markdown_url":"/content/guides/choose-a-cheap-model.md","shared_tags":["models"],"score":3},{"id":"choose-a-local-model","title":"Choose a Local Model","type":"guide","html_url":"/guides/choose-a-local-model","markdown_url":"/content/guides/choose-a-local-model.md","shared_tags":["models"],"score":3}],"explicit":{}},"metadata":{"title":"Benchmark Methodology","type":"guide","id":"benchmark-methodology","description":"How AI Future Ready model scores should be interpreted by agents: normalized task scores, confidence limits, subjective judgment, and verification needs.","last_updated":"2026-04-24","tags":["benchmarks","methodology","scores","verification","models"]},"content_text":"# Benchmark Methodology\n\nThis site uses compact 0-100 task scores so agents can compare models quickly. These scores are decision aids, not lab-grade measurements.\n\n## What Scores Mean\n\n| Field | Meaning |\n|-------|---------|\n| `reasoning` | Multi-step logic, planning, hard analysis |\n| `coding` | Software engineering, code editing, debugging |\n| `math` | Formal problem solving and quantitative reasoning |\n| `writing` | Clarity, style control, summarization, synthesis |\n| `multilingual` | Non-English and cross-language usefulness |\n| `speed` | Practical responsiveness and latency profile |\n\nScores are normalized within this site's model set. A `95` means \"near the top of this reference set for this task,\" not an absolute universal measurement.\n\n## How Agents Should Use Scores\n\n- Use scores to shortlist, not to decide blindly.\n- Combine scores with price, context window, modality, deployment, and license.\n- Prefer task-specific scores over average scores.\n- Treat close scores as ties unless cost or deployment clearly breaks the tie.\n- Ask for user constraints before final recommendation.\n\n## Confidence Limits\n\nScores can drift as providers update models, pricing, APIs, and benchmark reports. Volatile fields should be checked against:\n\n- `last_updated`\n- `last_verified` when present\n- `content_hash`\n- per-item JSON\n- provider source notes when available\n\n## What This Site Should Improve\n\nThe next maturity step is to add `last_verified`, `sources`, and confidence markers to every volatile claim.\n\nRecommended fields:\n\n```yaml\nlast_verified: \"2026-04-24\"\nsources:\n- title: \"Provider pricing page\"\n  url: \"https://example.com/pricing\"\npricing_confidence: \"high\"\nbenchmark_confidence: \"medium\"\n```\n\n## Agent Rule\n\nIf a recommendation depends on a volatile field such as price, release date, benchmark score, or context window, say how current the data is and prefer verified fields over unverified ones.","content_length":2275,"generated_at":"2026-04-24"}