SAM Audio (Segment Anything in Audio) is Meta AI's first unified foundation model for audio source separation, released December 16, 2025. Inspired by the success of Meta's Segment Anything Model for images, SAM Audio brings the same "segment anything" philosophy to sound — allowing users to isolate any sound from a complex audio mixture using natural language, visual cues, or temporal markers.
What makes SAM Audio distinctive is its multimodal prompting system. Rather than being limited to a fixed set of sound categories, users can describe what they want to isolate in plain text ("a man speaking"), point to the sound source in a video frame, or mark the time window where a sound first appears. These prompts can be combined for even greater precision. The result is a single model that handles music separation, speech isolation, sound effects, and instrument extraction — tasks that previously required separate specialized systems.
SAM Audio is open-source under Meta's SAM License (permitting research and commercial use) and is accessible via GitHub, Hugging Face, and Meta's Segment Anything Playground.
Three prompting modalities (usable alone or in combination):

- **Text:** a natural-language description of the target sound (e.g., "a man speaking")
- **Visual:** a point or region on a video frame identifying the sound source
- **Span:** a time window marking where the target sound occurs
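To make the combination rule concrete, here is a minimal sketch of how the three prompt types might be structured and merged into a single request. All class names, fields, and the `build_prompt` helper are illustrative assumptions, not SAM Audio's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical prompt containers; names and fields are assumptions.

@dataclass
class TextPrompt:
    description: str                      # e.g. "a man speaking"

@dataclass
class VisualPrompt:
    frame_index: int                      # video frame containing the source
    point: Tuple[int, int]                # (x, y) click on the sound source

@dataclass
class SpanPrompt:
    start_s: float                        # when the target sound begins
    end_s: float                          # when it ends

def build_prompt(text: Optional[TextPrompt] = None,
                 visual: Optional[VisualPrompt] = None,
                 span: Optional[SpanPrompt] = None) -> dict:
    """Combine any subset of the three modalities into one request."""
    prompt = {}
    if text:
        prompt["text"] = text.description
    if visual:
        prompt["visual"] = {"frame": visual.frame_index, "point": visual.point}
    if span:
        prompt["span"] = (span.start_s, span.end_s)
    if not prompt:
        raise ValueError("at least one prompt modality is required")
    return prompt

# Mixed-modality request: text plus a temporal span
req = build_prompt(text=TextPrompt("a man speaking"),
                   span=SpanPrompt(2.0, 5.5))
```

The key design point the sketch captures is that modalities are optional and additive: any non-empty subset is a valid prompt.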
Benchmark performance (subjective evaluation scores):
| Variant | General SFX | Speech | Speaker | Music | Instr (wild) | Instr (pro) |
|---------|-------------|--------|---------|-------|--------------|-------------|
| Small   | 3.62        | 3.99   | 3.12    | 4.11  | 3.56         | 4.24        |
| Base    | 3.28        | 4.25   | 3.57    | 3.87  | 3.66         | 4.27        |
| Large   | 3.50        | 4.03   | 3.60    | 4.22  | 3.66         | 4.49        |
SAM Audio achieves state-of-the-art results across music, speech, sound effects, and instrument separation — outperforming both prior general-purpose and specialized systems on the SAM Audio-Bench evaluation suite. Mixed-modality prompting (e.g., text + span) consistently outperforms single-modality approaches.
Output: Each separation produces two streams — the isolated target audio and the residual (everything else) — enabling both extraction and removal workflows.
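The two-stream contract above implies that "removal" is just the complement of "extraction". A minimal NumPy sketch, assuming an additive mixture (my assumption; the model's actual decomposition may differ):

```python
import numpy as np

def split_streams(mixture: np.ndarray, target: np.ndarray):
    """Given a mixture and the isolated target, derive the residual.

    Assumes additive mixing, so target + residual reconstructs the input.
    """
    residual = mixture - target
    return target, residual

sr = 16_000
t = np.linspace(0, 1, sr, endpoint=False)
speech = np.sin(2 * np.pi * 220 * t)      # stand-in for the isolated voice
noise = 0.3 * np.sin(2 * np.pi * 50 * t)  # stand-in for background hum
mixture = speech + noise

isolated, residual = split_streams(mixture, speech)
# Extraction workflow keeps `isolated`; removal workflow keeps `residual`.
assert np.allclose(isolated + residual, mixture)
```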
Re-ranking: Generates up to 8 candidate separations, ranked by CLAP (text-audio similarity), a judge model (precision, recall, faithfulness), and ImageBind (visual-audio alignment).
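One plausible way to fold the three ranking signals into a single ordering is a weighted sum. The weights, score values, and `rerank` function below are made-up illustrations of the idea, not SAM Audio's actual re-ranking logic:

```python
def rerank(candidates, weights=(0.4, 0.4, 0.2)):
    """Sort candidate separations by a weighted sum of their scores.

    Each candidate carries a CLAP text-audio similarity, a judge-model
    score, and an ImageBind visual-audio alignment score (all assumed
    to be normalized to [0, 1] here).
    """
    w_clap, w_judge, w_ib = weights
    def combined(c):
        return (w_clap * c["clap"] +
                w_judge * c["judge"] +
                w_ib * c["imagebind"])
    return sorted(candidates, key=combined, reverse=True)

# Up to 8 candidates come back from the model; three shown here.
candidates = [
    {"id": 0, "clap": 0.61, "judge": 0.70, "imagebind": 0.55},
    {"id": 1, "clap": 0.82, "judge": 0.75, "imagebind": 0.60},
    {"id": 2, "clap": 0.58, "judge": 0.90, "imagebind": 0.40},
]
best = rerank(candidates)[0]
```

Generating several candidates and picking the best-scoring one trades compute for quality, which is why candidate count is a natural tuning knob.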
Tuning: `predict_spans=True` with `reranking_candidates=8` gives the best quality at the highest compute cost.

February 26, 2026