SAM Audio (Segment Anything in Audio) is Meta AI's first unified foundation model for audio source separation, released December 16, 2025. Inspired by the success of Meta's Segment Anything Model for images, SAM Audio brings the same "segment anything" philosophy to sound — allowing users to isolate any sound from a complex audio mixture using natural language, visual cues, or temporal markers.
What makes SAM Audio distinctive is its multimodal prompting system. Rather than being limited to a fixed set of sound categories, users can describe what they want to isolate in plain text ("a man speaking"), point to the sound source in a video frame, or mark the time window where a sound first appears. These prompts can be combined for even greater precision. The result is a single model that handles music separation, speech isolation, sound effects, and instrument extraction — tasks that previously required separate specialized systems.
SAM Audio is open-source under Meta's SAM License (permitting research and commercial use) and is accessible via GitHub, Hugging Face, and Meta's Segment Anything Playground.
Three prompting modalities (usable alone or in combination):

- **Text:** a natural-language description of the target sound (e.g., "a man speaking")
- **Visual:** a point or region on a video frame identifying the sound source
- **Span:** a time window marking where the target sound occurs
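To make the combination rule concrete, here is a minimal sketch of how the three prompt types might be structured and merged into a single request. All class names, fields, and the `build_prompt` helper are illustrative assumptions, not SAM Audio's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical prompt containers; names and fields are assumptions.

@dataclass
class TextPrompt:
    description: str                      # e.g. "a man speaking"

@dataclass
class VisualPrompt:
    frame_index: int                      # video frame containing the source
    point: Tuple[int, int]                # (x, y) click on the sound source

@dataclass
class SpanPrompt:
    start_s: float                        # when the target sound begins
    end_s: float                          # when it ends

def build_prompt(text: Optional[TextPrompt] = None,
                 visual: Optional[VisualPrompt] = None,
                 span: Optional[SpanPrompt] = None) -> dict:
    """Combine any subset of the three modalities into one request."""
    prompt = {}
    if text:
        prompt["text"] = text.description
    if visual:
        prompt["visual"] = {"frame": visual.frame_index, "point": visual.point}
    if span:
        prompt["span"] = (span.start_s, span.end_s)
    if not prompt:
        raise ValueError("at least one prompt modality is required")
    return prompt

# Mixed-modality request: text plus a temporal span
req = build_prompt(text=TextPrompt("a man speaking"),
                   span=SpanPrompt(2.0, 5.5))
```

The key design point the sketch captures is that modalities are optional and additive: any non-empty subset is a valid prompt.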
Benchmark performance (subjective evaluation scores):
| Variant | General SFX | Speech | Speaker | Music | Instr (wild) | Instr (pro) |
|---------|-------------|--------|---------|-------|--------------|-------------|
| Small   | 3.62        | 3.99   | 3.12    | 4.11  | 3.56         | 4.24        |
| Base    | 3.28        | 4.25   | 3.57    | 3.87  | 3.66         | 4.27        |
| Large   | 3.50        | 4.03   | 3.60    | 4.22  | 3.66         | 4.49        |
SAM Audio achieves state-of-the-art results across music, speech, sound effects, and instrument separation — outperforming both prior general-purpose and specialized systems on the SAM Audio-Bench evaluation suite. Mixed-modality prompting (e.g., text + span) consistently outperforms single-modality approaches.
Output: Each separation produces two streams — the isolated target audio and the residual (everything else) — enabling both extraction and removal workflows.
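The two-stream contract above implies that "removal" is just the complement of "extraction". A minimal NumPy sketch, assuming an additive mixture (my assumption; the model's actual decomposition may differ):

```python
import numpy as np

def split_streams(mixture: np.ndarray, target: np.ndarray):
    """Given a mixture and the isolated target, derive the residual.

    Assumes additive mixing, so target + residual reconstructs the input.
    """
    residual = mixture - target
    return target, residual

sr = 16_000
t = np.linspace(0, 1, sr, endpoint=False)
speech = np.sin(2 * np.pi * 220 * t)      # stand-in for the isolated voice
noise = 0.3 * np.sin(2 * np.pi * 50 * t)  # stand-in for background hum
mixture = speech + noise

isolated, residual = split_streams(mixture, speech)
# Extraction workflow keeps `isolated`; removal workflow keeps `residual`.
assert np.allclose(isolated + residual, mixture)
```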
Re-ranking: Generates up to 8 candidate separations, ranked by CLAP (text-audio similarity), a judge model (precision, recall, faithfulness), and ImageBind (visual-audio alignment).
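One plausible way to fold the three ranking signals into a single ordering is a weighted sum. The weights, score values, and `rerank` function below are made-up illustrations of the idea, not SAM Audio's actual re-ranking logic:

```python
def rerank(candidates, weights=(0.4, 0.4, 0.2)):
    """Sort candidate separations by a weighted sum of their scores.

    Each candidate carries a CLAP text-audio similarity, a judge-model
    score, and an ImageBind visual-audio alignment score (all assumed
    to be normalized to [0, 1] here).
    """
    w_clap, w_judge, w_ib = weights
    def combined(c):
        return (w_clap * c["clap"] +
                w_judge * c["judge"] +
                w_ib * c["imagebind"])
    return sorted(candidates, key=combined, reverse=True)

# Up to 8 candidates come back from the model; three shown here.
candidates = [
    {"id": 0, "clap": 0.61, "judge": 0.70, "imagebind": 0.55},
    {"id": 1, "clap": 0.82, "judge": 0.75, "imagebind": 0.60},
    {"id": 2, "clap": 0.58, "judge": 0.90, "imagebind": 0.40},
]
best = rerank(candidates)[0]
```

Generating several candidates and picking the best-scoring one trades compute for quality, which is why candidate count is a natural tuning knob.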
Tuning: `predict_spans=True` with `reranking_candidates=8` gives the best quality at the highest compute cost.

February 26, 2026