Gemini 3.1 Ultra is Google DeepMind's top-of-the-line frontier model, scoring a record 94.3% on GPQA Diamond. Its architectural centerpiece, Chain-of-Verification (CoVe), generates and tests sub-hypotheses at inference time, reducing hallucinations by 60%+ on technical and scientific output versus Gemini 2.0.
Within the Gemini 3.1 family, Ultra sits above Gemini 3.1 Pro and is positioned as DeepMind's strongest reasoning model. Its headline achievement is a 94.3% score on the GPQA Diamond benchmark, a record at announcement that surpassed the previous best held by OpenAI's internal GPT-5 builds. GPQA Diamond is a graduate-level, multi-domain science benchmark designed to resist memorization and reward multi-step verification, and Gemini 3.1 Ultra's score is the strongest publicly disclosed result on it as of early 2026.
The architectural centerpiece of Gemini 3.1 (Pro and Ultra) is "Chain-of-Verification" (CoVe), an inference-time process in which the model generates internal sub-hypotheses and tests them against its own knowledge base and external retrievals before producing a final response. CoVe reportedly cuts hallucination rates in technical documentation and scientific research output by over 60% compared to Gemini 2.0, making Gemini 3.1 Ultra particularly well suited for high-stakes scientific and technical workflows.
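The generate-verify-answer loop described above can be sketched in plain Python. Everything below is illustrative: the function names, the toy knowledge base, and the retrieval hook are assumptions, since Google has not published CoVe's actual implementation.

```python
# Illustrative Chain-of-Verification (CoVe) style loop. The decomposition
# step, the knowledge base, and the retrieval hook are hypothetical
# stand-ins, not Gemini 3.1 Ultra's real mechanism.

KNOWLEDGE_BASE = {
    "water boils at 100 C at sea level": True,
    "water boils at 50 C at sea level": False,
}

def generate_sub_hypotheses(question: str) -> list[str]:
    """Stand-in for the model decomposing its draft answer into checkable claims."""
    return list(KNOWLEDGE_BASE)  # a real model would draft these per question

def verify(claim: str, retrieve=None) -> bool:
    """Test a claim against internal knowledge first, then external retrieval."""
    if claim in KNOWLEDGE_BASE:
        return KNOWLEDGE_BASE[claim]
    if retrieve is not None:
        return bool(retrieve(claim))
    return False  # unverifiable claims are dropped rather than asserted

def answer(question: str, retrieve=None) -> list[str]:
    """Emit only the sub-hypotheses that survive verification."""
    return [c for c in generate_sub_hypotheses(question) if verify(c, retrieve)]

print(answer("At what temperature does water boil at sea level?"))
# → ['water boils at 100 C at sea level']
```

The key design point the sketch captures is that verification gates emission: a claim that fails both the internal check and retrieval never reaches the final response, which is the mechanism credited with the reported hallucination reduction.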
GPQA Diamond Record (94.3%): Highest publicly disclosed score on GPQA Diamond, the graduate-level multi-domain science benchmark, surpassing OpenAI's previous internal-build records. The result indicates strong multi-step verification on scientifically rigorous questions.
Chain-of-Verification (CoVe): At inference time, Gemini 3.1 Ultra generates internal sub-hypotheses, tests them against both its own knowledge base and external retrievals, and only then produces a final response. The reported result is a 60%+ reduction in hallucination rates on technical documentation and scientific research output relative to Gemini 2.0.
Native Multimodal: Inherits the Gemini 3 family's native multimodal architecture — text, vision, and audio in a single model rather than retrofitted modality stacks.
1M+ Context (inherited from Pro): 1-million-token context window with a 65K-token output limit, per Gemini 3.1 Pro specifications, which Ultra inherits.
Reasoning Boost over Gemini 3 Pro: Gemini 3.1 Pro delivered a 2x+ reasoning improvement over Gemini 3 Pro (with Ultra further extending the lead), and ranked #1 on 12 of 18 tracked benchmarks at announcement.
77.1% on ARC-AGI-2 (Pro): ARC-AGI-2 measures a model's ability to generalize to logic patterns it has never seen. Gemini 3.1 Pro's 77.1% verified score is among the highest publicly reported, with Ultra extending the lead.
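The context and output limits listed above translate directly into a client-side pre-flight check. The sketch below is a toy illustration: the constants come from the Pro specifications cited above, while the function names and the whitespace token count are assumptions (a crude stand-in for a real tokenizer).

```python
# Toy pre-flight budget check against the published Gemini 3.1 Pro limits
# (1M-token context window, 65K-token output), which Ultra reportedly
# inherits. Whitespace splitting is a crude stand-in for a real tokenizer.

CONTEXT_LIMIT = 1_000_000
OUTPUT_LIMIT = 65_000

def approx_tokens(text: str) -> int:
    """Very rough token estimate: one token per whitespace-separated word."""
    return len(text.split())

def fits_budget(prompt: str, max_output_tokens: int) -> bool:
    """True if the request stays inside both published limits."""
    return (approx_tokens(prompt) <= CONTEXT_LIMIT
            and max_output_tokens <= OUTPUT_LIMIT)

print(fits_budget("summarize this report", 4_096))   # → True
print(fits_budget("summarize this report", 70_000))  # → False
```

A real client would use the provider's tokenizer rather than word counts, but the shape of the check, bounding input and requested output separately, stays the same.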
Inference-Time Cost: Chain-of-Verification adds inference latency and compute cost over single-pass generation. For latency-sensitive workloads, Gemini 3.1 Pro (without CoVe) or smaller Gemini variants may be more practical.
Less Public Documentation than Pro: Most publicly available technical documentation focuses on Gemini 3.1 Pro. Gemini 3.1 Ultra's specifics, including token pricing, exact context window, and detailed access tiers, are less widely documented in public Google sources at the time of writing. Production users should consult current Google AI documentation for the latest pricing and access details.
Benchmark Records vs. Real-World Reliability: GPQA Diamond and ARC-AGI-2 are strong proxies for reasoning capability but don't fully predict real-world reliability on novel domains. As with all frontier models, validation on the specific deployment target remains required.
May 7, 2026