Llama 4 Scout

Summary

Llama 4 Scout is Meta's context-efficient open-weight model, released April 5, 2025. It is a mixture-of-experts model with 17B active parameters (109B total across 16 experts) and a 10 million token context window, the largest of any publicly available model at launch, and it fits on a single NVIDIA H100 GPU.

Overview

Llama 4 Scout is Meta's most context-efficient open-weight model, released April 5, 2025 alongside Llama 4 Maverick. Its defining feature is a 10 million token context window, the largest of any publicly available model at launch and 10x the 1M context offered by competing models. At the same time, with only 17 billion active parameters across 16 experts (109 billion total), it fits on a single NVIDIA H100 GPU, making it practical for teams that need extreme context length without a multi-GPU serving setup.

Scout is purpose-built for tasks that demand massive context: analyzing entire codebases, processing large document collections, long-running research sessions, or any scenario where the cost of chunking and retrieval introduces errors. The combination of 10M context and single-GPU efficiency occupies a unique position in the current model landscape.

Specifications

  • Developer: Meta AI
  • Model String: meta-llama/Llama-4-Scout (varies by platform)
  • Release Date: April 5, 2025
  • Type: Multimodal LLM, Mixture-of-Experts, Open-Weight
  • Architecture: MoE — 17B active parameters / 109B total parameters / 16 experts
  • Context Window: 10,000,000 tokens (10M)
  • Modalities: Text and image input; text output
  • Languages: 12 languages
  • License: Meta Llama 4 Community License
  • Access: Meta AI app, llama.com, Hugging Face, AWS, Azure, Google Cloud, and other cloud providers
  • Pricing: Free for self-hosted; API pricing varies by provider

Capabilities

10M Token Context: The headline capability — 10 million tokens is enough to hold a very large codebase, a full year of documents, or an extraordinarily long research session entirely in context, eliminating retrieval-augmented generation (RAG) for many use cases.
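For a rough sense of scale, a common heuristic is that one token covers about four characters of English prose or code, so a 10M token window holds on the order of 40 MB of raw text. A minimal back-of-envelope sketch of that check (the 4 chars/token ratio is a heuristic assumption, not a property of Scout's actual tokenizer):

```python
# Back-of-envelope check: does a corpus fit in Scout's 10M token window?
# Uses a rough ~4 characters/token heuristic, not the real tokenizer.
CONTEXT_TOKENS = 10_000_000
CHARS_PER_TOKEN = 4  # heuristic average for English prose and code

def approx_tokens(num_chars: int) -> int:
    """Estimate token count from raw character count."""
    return num_chars // CHARS_PER_TOKEN

def fits_in_context(num_chars: int) -> bool:
    """True if the estimated token count fits in the 10M window."""
    return approx_tokens(num_chars) <= CONTEXT_TOKENS

print(fits_in_context(40_000_000))   # a ~40 MB corpus: True (at the limit)
print(fits_in_context(100_000_000))  # a ~100 MB corpus: False
```

By this estimate, most individual codebases and document collections land comfortably inside the window, which is what makes the "skip retrieval entirely" workflow plausible.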

Single-GPU Efficiency: Fits on one NVIDIA H100 despite 109B total parameters, thanks to the MoE architecture activating only 17B parameters per token. This dramatically reduces the infrastructure cost of serving a model at this scale.
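The arithmetic behind the single-GPU claim is worth spelling out: all 109B weights must be resident in memory even though only 17B are active per token, so fitting inside an H100's 80 GB requires low-bit quantization. A sketch of that calculation (parameter counts from the spec above; the bit widths compared are illustrative):

```python
# Why 109B total parameters can fit on one 80 GB H100: weight memory
# scales with *total* parameters and bits per weight, while per-token
# compute scales with the 17B *active* parameters.
H100_MEMORY_GB = 80

def weight_memory_gb(total_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    gb = weight_memory_gb(109, bits)
    verdict = "fits" if gb < H100_MEMORY_GB else "does not fit"
    print(f"{bits}-bit weights: {gb:.1f} GB ({verdict} in 80 GB)")
```

At 16-bit precision the weights alone are 218 GB; only at 4-bit (54.5 GB) does the model fit on a single 80 GB card, and that budget still excludes activations and the KV cache.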

More Capable Than Previous Llama Generations: Despite its compact active parameter count, Meta reports that Scout outperforms all prior Llama models on standard benchmarks.

Native Multimodality: Trained on text, image, and video data from the ground up using early fusion, rather than attaching a vision encoder to a text-only model after the fact.

Limitations

Scout's benchmark performance on pure reasoning and coding tasks is below Llama 4 Maverick, which has 128 experts versus Scout's 16. The 10M context window is theoretically supported but practically requires careful memory management — full 10M context usage will stress even H100 memory. For maximum capability, Maverick is the stronger choice; Scout is the choice when extreme context length is the priority.
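The memory pressure at long context comes mostly from the KV cache rather than the weights. A sketch of the estimate, using illustrative architecture values (the layer count, KV-head count, and head dimension below are assumptions chosen for the calculation, not confirmed Scout configuration):

```python
# Rough KV-cache size: two tensors (K and V) per layer, each of shape
# [tokens, kv_heads, head_dim], stored at bytes_per_value precision.
# Layer/head/dim values are illustrative assumptions, not Scout's config.
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Approximate KV-cache footprint in bytes (fp16/bf16 by default)."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

full_context = kv_cache_bytes(10_000_000)
print(f"10M-token KV cache: ~{full_context / 1e12:.2f} TB")
print("H100 capacity: 0.08 TB")
```

At these assumed values a full 10M-token cache runs to roughly 2 TB at 16-bit precision, orders of magnitude beyond one H100, which is why practical long-context deployments depend on cache quantization, offloading, or far shorter effective contexts.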

Recent Developments

  • April 5, 2025 Launch: The 10M context window at single-H100 efficiency was widely noted as a significant engineering achievement.
  • RAG Alternative: Developers and researchers have highlighted Scout as a potential replacement for complex RAG pipelines in scenarios where the full corpus fits in 10M tokens.
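In the RAG-replacement pattern, the retrieval step disappears: the whole corpus is concatenated into a single prompt and sent in one request. A minimal sketch of the corpus-stuffing step (plain string assembly only; the prompt layout and the 4 chars/token budget check are assumptions, and the provider-specific send call is omitted):

```python
# Assemble one context-stuffed prompt from an entire document corpus,
# skipping retrieval entirely. Budget check uses a rough chars/token
# heuristic, not the real tokenizer; sending is provider-specific.
CONTEXT_TOKENS = 10_000_000
CHARS_PER_TOKEN = 4  # heuristic, not the actual tokenizer

def build_full_corpus_prompt(docs: dict, question: str) -> str:
    """docs maps document names to their full text."""
    corpus = "\n\n".join(f"### {name}\n{text}" for name, text in docs.items())
    prompt = f"{corpus}\n\nQuestion: {question}"
    if len(prompt) // CHARS_PER_TOKEN > CONTEXT_TOKENS:
        raise ValueError("corpus exceeds the 10M-token budget; fall back to RAG")
    return prompt

prompt = build_full_corpus_prompt(
    {"README": "Scout has a 10M token context window."},
    "What is the context window?",
)
```

The design tradeoff versus RAG is simple: no chunking or embedding errors, but every request pays to process the full corpus, so this pattern suits repeated deep analysis of one corpus more than high-volume lookups.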

Last Updated

February 26, 2026