{"slug":"nemotron-cascade-2","id":"nemotron-cascade-2","type":"model","title":"Nemotron-Cascade 2","description":"NVIDIA's 30B MoE with only 3B active parameters that achieves gold-medal performance on IMO, IOI, and ICPC. Beats the larger Nemotron 3 Super 120B on coding and instruction following. Fits on a single RTX 4090 (24GB VRAM with Q4). Hybrid Mamba-2 + Transformer architecture enables a 1M token context window.","last_updated":"2026-04-10","last_verified":null,"verification_status":"unverified","markdown_url":"/content/models/nemotron-cascade-2.md","html_url":"/models/nemotron-cascade-2","api_url":"/api/v1/models/nemotron-cascade-2.json","content_hash":"e2169ac59e26102777a47d9984fd843bd6a7a7d460ac6de4c9cc8b3e67661f94","sha256":"e2169ac59e26102777a47d9984fd843bd6a7a7d460ac6de4c9cc8b3e67661f94","provider":"NVIDIA","pricing":{"input":"Free (open weights)","output":"Free (open weights)","free":true,"note":"Also via Ollama, NVIDIA NIM"},"benchmarks":{"reasoning":88,"coding":90,"math":92,"writing":78,"multilingual":75,"speed":92},"tags":["nvidia","open-source","text"],"website":"https://build.nvidia.com","release_date":"2026-03","relationships":{"links":[],"related":[{"id":"nemotron-3-super","title":"Nemotron 3 Super","type":"model","html_url":"/models/nemotron-3-super","markdown_url":"/content/models/nemotron-3-super.md","shared_tags":["nvidia","open-source","text"],"score":7},{"id":"cohere-tiny-aya","title":"Cohere Tiny Aya 3.35B","type":"model","html_url":"/models/cohere-tiny-aya","markdown_url":"/content/models/cohere-tiny-aya.md","shared_tags":["open-source","text"],"score":4},{"id":"command-r-plus","title":"Command R+","type":"model","html_url":"/models/command-r-plus","markdown_url":"/content/models/command-r-plus.md","shared_tags":["open-source","text"],"score":4},{"id":"deepseek-r1","title":"DeepSeek 
R1","type":"model","html_url":"/models/deepseek-r1","markdown_url":"/content/models/deepseek-r1.md","shared_tags":["open-source","text"],"score":4},{"id":"deepseek-v3.2","title":"DeepSeek V3.2","type":"model","html_url":"/models/deepseek-v3.2","markdown_url":"/content/models/deepseek-v3.2.md","shared_tags":["open-source","text"],"score":4},{"id":"falcon-3","title":"Falcon 3","type":"model","html_url":"/models/falcon-3","markdown_url":"/content/models/falcon-3.md","shared_tags":["open-source","text"],"score":4}],"explicit":{}},"metadata":{"title":"Nemotron-Cascade 2","type":"model","id":"nemotron-cascade-2","provider":"NVIDIA","model_type":"open-source","release_date":"2026-03","description":"NVIDIA's 30B MoE with only 3B active parameters that achieves gold-medal performance on IMO, IOI, and ICPC. Beats the larger Nemotron 3 Super 120B on coding and instruction following. Fits on a single RTX 4090 (24GB VRAM with Q4). Hybrid Mamba-2 + Transformer architecture enables a 1M token context window.","last_updated":"2026-04-10","context_window":"1M tokens","website":"https://build.nvidia.com","license":"NVIDIA Open Model License","modality":["text"],"tags":["nvidia","open-source","text"],"pricing":{"input":"Free (open weights)","output":"Free (open weights)","free":true,"note":"Also via Ollama, NVIDIA NIM"},"benchmarks":{"reasoning":88,"coding":90,"math":92,"writing":78,"multilingual":75,"speed":92},"parameters":"30B total (3B active)","hardware_requirements":"1x RTX 4090 24GB (Q4); 1x RTX 3090 with Q3 quantization","best_for":["Competitive math/coding","Consumer GPU deployment","Agentic workflows","On-device reasoning"]},"content_text":"# Nemotron-Cascade 2\n\nThe most impressive performance per FLOP of any model released so far. Nemotron-Cascade 2 activates just 3B parameters per token from a 30B MoE, yet it achieves gold-medal performance on IMO, IOI, and the ICPC World Finals. 
It scores 92 on math, 90 on coding, and 88 on reasoning -- numbers that beat NVIDIA's own Nemotron 3 Super 120B, a model four times its size. All on a single RTX 4090 (24GB VRAM with Q4 quantization).\n\nThe hybrid Mamba-2 + Transformer architecture is the secret weapon. It enables a 1M token context window with sub-linear memory scaling, something pure Transformer models cannot match. Speed at 92/100 means this is not just powerful but fast -- 92.4% on AIME 2025 and 87.2% on LiveCodeBench v6, delivered in real time on consumer hardware.\n\nThe trade-offs are real and predictable: writing at 78 and multilingual at 75 reflect a model laser-focused on STEM reasoning. MMLU-Pro at 79.8% confirms that broad knowledge is not its strength. If you need an assistant for general conversation or multilingual tasks, look elsewhere. If you need a reasoning engine that solves competition-level math and coding problems on an RTX 3090 (with Q3 quantization), nothing else comes close.\n\nOpen weights with published SFT and RL datasets make this a research goldmine. Ollama and NVIDIA NIM support makes deployment turnkey. The NVIDIA Open Model License is the only friction point -- it is not Apache 2.0, so check the terms for your use case.\n\n**When to pick something else:** For general-purpose use, Qwen 3.5 or GPT-OSS-120B are far more balanced. For multilingual reasoning, consider Kimi K2.5, which also scores higher on math (97 vs 92) but needs much more hardware. Nemotron-Cascade 2 is the best reasoning model you can run on hardware you already own.","content_length":2781,"generated_at":"2026-04-24"}