Grok Imagine

Summary

Grok Imagine is xAI's flagship text-to-video and image-to-video model with synchronized audio, ranked #1 on public text-to-video leaderboards as of May 2026. v1.0 (February 2026) generates 10-second 720p clips, and Extend from Frame (March 2026) chains clips into continuous sequences of roughly 15 seconds.

Overview

Grok Imagine generates video from text prompts or still images and produces synchronized audio alongside the visuals. It launched in July 2025 with six-second text-to-video clips that included audio, then evolved rapidly through early 2026: the API launched January 28, 2026 ($0.05/second); v1.0 shipped February 3, 2026 (10-second 720p clips, with what xAI called its "biggest leap yet" in prompt-following accuracy); and "Extend from Frame" chaining shipped March 2, 2026, enabling sequences of roughly 15 seconds in which each clip starts from the final frame of the previous one.

As of May 2026, Grok Imagine ranks #1 on public text-to-video leaderboards (arena score 724, ahead of Google's Veo 3.1 at 618 and Alibaba's WAN Video 2.6 at 577). It generated approximately 1.245 billion videos in January 2026 alone, a level of consumer adoption that few generative-video models have matched.

Specifications

Developer: [[US/xAI|xAI]]
Initial Release: July 2025
API Launch: January 28, 2026
v1.0 Release: February 3, 2026 (10-second 720p clips)
"Extend from Frame" Chaining: March 2, 2026 (~15-second sequences)
Type: Text-to-video, image-to-video, video editing, with synchronized audio
Duration: Up to 10 seconds per clip; 15+ seconds via Extend from Frame
Resolution: 720p (v1.0)
Pricing (API): $0.05/second
Distribution: Grok app, x.com, Grok Imagine API, third-party platforms (e.g., PixVerse, Eachlabs)
Leaderboard Position (May 2026): #1 on text-to-video arena (score 724)
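
The per-second API price listed above lends itself to a quick cost estimate. The sketch below assumes billing is strictly linear in generated seconds, with no minimum charge, rounding, or volume discount; that billing model is an assumption, not something this profile confirms. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

PRICE_PER_SECOND_USD = Decimal("0.05")  # API price stated in this profile
MAX_CLIP_SECONDS = 10                   # v1.0 per-clip limit stated in this profile


def estimate_cost(seconds: int, clips: int = 1) -> Decimal:
    """Estimate the USD cost of `clips` clips of `seconds` seconds each.

    Assumes purely linear per-second billing (an assumption for illustration).
    """
    if not 0 < seconds <= MAX_CLIP_SECONDS:
        raise ValueError(f"seconds must be between 1 and {MAX_CLIP_SECONDS}")
    if clips < 1:
        raise ValueError("clips must be at least 1")
    return PRICE_PER_SECOND_USD * seconds * clips


print(estimate_cost(10))      # one full-length clip
print(estimate_cost(10, 100)) # a batch of 100 full-length clips
```

Under these assumptions a single 10-second clip costs $0.50 and a batch of 100 such clips costs $50.00.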

Capabilities

Text-to-Video: Generate short clips with synchronized audio from natural-language prompts.

Image-to-Video: Animate static images using the same prompt-driven control surface.

Video Editing: Restyle scenes, add or remove objects, and control motion across clips.

Best-in-Class Instruction Following: xAI describes Grok Imagine as having best-in-class prompt-following among generative video models, and marketed v1.0 as its largest single improvement on this dimension.

Synchronized Audio: Generates audio (ambient sound, effects, dialogue) aligned to the visual content rather than as a separate post-process.

Extend from Frame: The final frame of one clip becomes the first frame of the next, enabling longer continuous sequences while preserving character and scene continuity.
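
The final-frame handoff behind Extend from Frame can be illustrated with a small simulation. This is a minimal sketch, not the xAI API: `generate_clip` is a hypothetical stand-in that returns placeholder frame labels instead of pixels, and dropping the duplicated handoff frame at each join is an assumed design choice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clip:
    prompt: str
    frames: List[str]  # placeholder frame labels, not real image data


def generate_clip(prompt: str, n_frames: int,
                  first_frame: Optional[str] = None) -> Clip:
    """Hypothetical stand-in for a video generator (not a real API call)."""
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    start = first_frame if first_frame is not None else f"{prompt}#0"
    rest = [f"{prompt}#{i}" for i in range(1, n_frames)]
    return Clip(prompt, [start] + rest)


def extend_from_frame(prompts: List[str], n_frames: int) -> List[str]:
    """Chain clips so each one starts on the previous clip's final frame."""
    sequence: List[str] = []
    handoff: Optional[str] = None
    for prompt in prompts:
        clip = generate_clip(prompt, n_frames, first_frame=handoff)
        # Later clips repeat the handoff frame as their first frame; skip it.
        sequence.extend(clip.frames if handoff is None else clip.frames[1:])
        handoff = clip.frames[-1]
    return sequence


print(extend_from_frame(["beach", "sunset"], 3))
```

Because each clip is conditioned on the last frame of its predecessor, the joined sequence has no visual cut at the seam, which is the continuity property described above.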

Limitations

The 10-second base clip duration trails Google Veo 4 (15–30 seconds) and Luma Ray3 (HDR and longer outputs), and 720p output is below the 4K offerings of Veo 4 and LTX-2. Public leaderboard scores aggregate human preference judgments and do not directly measure physical realism or character consistency. Grok Imagine's highest-volume use is consumer social content rather than professional production; for pro-grade controllability, studio workflows still favor Runway Gen-4.5 and Veo 4.
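
To give a feel for what a gap in preference-based scores could mean, the sketch below converts a score difference into an expected head-to-head preference rate. It assumes an Elo-style logistic model with the conventional 400-point scale. This profile does not state how the arena score is computed, so both the model and the scale are assumptions, and the result is illustrative only.

```python
def expected_preference(score_a: float, score_b: float,
                        scale: float = 400.0) -> float:
    """Probability that A is preferred over B under an Elo-style model.

    The logistic form and the 400-point scale are assumptions; the
    leaderboard's actual scoring method is not specified in this profile.
    """
    return 1.0 / (1.0 + 10.0 ** ((score_b - score_a) / scale))


# Scores quoted in this profile: Grok Imagine 724, Veo 3.1 618.
p = expected_preference(724, 618)
print(f"{p:.2f}")
```

Under these assumptions the 106-point gap corresponds to a preference rate of about 0.65, meaning raters would still pick the lower-ranked model in roughly a third of pairwise comparisons.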

Recent Developments

#1 on Text-to-Video Leaderboard (May 2026): Arena score 724, ahead of Veo 3.1 (618) and WAN Video 2.6 (577).
Extend from Frame (March 2, 2026): Sequential clip chaining via final-frame handoff, supporting continuous sequences of roughly 15 seconds.
v1.0 Release (February 3, 2026): 10-second 720p clips; xAI's "biggest leap yet" in prompt-following accuracy.
API Launch (January 28, 2026): $0.05/second pricing for text-to-video, image-to-video, and editing.
Scale (January 2026): ~1.245 billion videos generated in a single month, with 314M visits to the Imagine feature on x.com.

Last Updated

May 8, 2026
