Gemini 3.1 Ultra is Google DeepMind's top-of-the-line frontier model, scoring a record 94.3% on GPQA Diamond. Its architectural centerpiece, Chain-of-Verification (CoVe), generates and tests sub-hypotheses at inference time, reducing hallucinations by 60%+ on technical and scientific output versus Gemini 2.0.
Within the Gemini 3.1 family, Ultra sits above Gemini 3.1 Pro and is positioned as DeepMind's strongest reasoning model. Its headline achievement is a 94.3% score on the GPQA Diamond benchmark, a record at announcement that surpassed the previous best held by OpenAI's internal GPT-5 builds. GPQA Diamond is a graduate-level, multi-domain science benchmark designed to resist memorization and reward multi-step verification, and Gemini 3.1 Ultra's score is the strongest publicly disclosed result on it as of early 2026.
The architectural centerpiece of Gemini 3.1 (Pro and Ultra) is "Chain-of-Verification" (CoVe), an inference-time process in which the model generates internal sub-hypotheses and tests them against its own knowledge base and external retrievals before producing a final response. CoVe reportedly cuts hallucination rates in technical documentation and scientific research output by over 60% compared to Gemini 2.0, making Gemini 3.1 Ultra particularly well suited for high-stakes scientific and technical workflows.
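The generate-verify-answer loop described above can be sketched in plain Python. Everything below is illustrative: the function names, the toy knowledge base, and the retrieval hook are assumptions, since Google has not published CoVe's actual implementation.

```python
# Illustrative Chain-of-Verification (CoVe) style loop. The decomposition
# step, the knowledge base, and the retrieval hook are hypothetical
# stand-ins, not Gemini 3.1 Ultra's real mechanism.

KNOWLEDGE_BASE = {
    "water boils at 100 C at sea level": True,
    "water boils at 50 C at sea level": False,
}

def generate_sub_hypotheses(question: str) -> list[str]:
    """Stand-in for the model decomposing its draft answer into checkable claims."""
    return list(KNOWLEDGE_BASE)  # a real model would draft these per question

def verify(claim: str, retrieve=None) -> bool:
    """Test a claim against internal knowledge first, then external retrieval."""
    if claim in KNOWLEDGE_BASE:
        return KNOWLEDGE_BASE[claim]
    if retrieve is not None:
        return bool(retrieve(claim))
    return False  # unverifiable claims are dropped rather than asserted

def answer(question: str, retrieve=None) -> list[str]:
    """Emit only the sub-hypotheses that survive verification."""
    return [c for c in generate_sub_hypotheses(question) if verify(c, retrieve)]

print(answer("At what temperature does water boil at sea level?"))
# → ['water boils at 100 C at sea level']
```

The key design point the sketch captures is that verification gates emission: a claim that fails both the internal check and retrieval never reaches the final response, which is the mechanism credited with the reported hallucination reduction.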
GPQA Diamond Record (94.3%): Highest publicly disclosed score on GPQA Diamond, the graduate-level multi-domain science benchmark, surpassing OpenAI's previous internal-build records. The result indicates strong multi-step verification on scientifically rigorous questions.
Chain-of-Verification (CoVe): At inference time, Gemini 3.1 Ultra generates internal sub-hypotheses, tests them against both its own knowledge base and external retrievals, and only then produces a final response. The reported result is a 60%+ reduction in hallucination rates on technical documentation and scientific research output relative to Gemini 2.0.
Native Multimodal: Inherits the Gemini 3 family's native multimodal architecture — text, vision, and audio in a single model rather than retrofitted modality stacks.
1M+ Context (inherited from Pro): 1-million-token context window with a 65K-token output limit, per Gemini 3.1 Pro specifications, which Ultra inherits.
Reasoning Boost over Gemini 3 Pro: Gemini 3.1 Pro delivered a 2x+ reasoning improvement over Gemini 3 Pro (with Ultra further extending the lead), and ranked #1 on 12 of 18 tracked benchmarks at announcement.
77.1% on ARC-AGI-2 (Pro): ARC-AGI-2 measures a model's ability to generalize to logic patterns it has never seen. Gemini 3.1 Pro's 77.1% verified score is among the highest publicly reported, with Ultra extending the lead.
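The context and output limits listed above translate directly into a client-side pre-flight check. The sketch below is a toy illustration: the constants come from the Pro specifications cited above, while the function names and the whitespace token count are assumptions (a crude stand-in for a real tokenizer).

```python
# Toy pre-flight budget check against the published Gemini 3.1 Pro limits
# (1M-token context window, 65K-token output), which Ultra reportedly
# inherits. Whitespace splitting is a crude stand-in for a real tokenizer.

CONTEXT_LIMIT = 1_000_000
OUTPUT_LIMIT = 65_000

def approx_tokens(text: str) -> int:
    """Very rough token estimate: one token per whitespace-separated word."""
    return len(text.split())

def fits_budget(prompt: str, max_output_tokens: int) -> bool:
    """True if the request stays inside both published limits."""
    return (approx_tokens(prompt) <= CONTEXT_LIMIT
            and max_output_tokens <= OUTPUT_LIMIT)

print(fits_budget("summarize this report", 4_096))   # → True
print(fits_budget("summarize this report", 70_000))  # → False
```

A real client would use the provider's tokenizer rather than word counts, but the shape of the check, bounding input and requested output separately, stays the same.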
Inference-Time Cost: Chain-of-Verification adds inference latency and compute cost over single-pass generation. For latency-sensitive workloads, Gemini 3.1 Pro (without CoVe) or smaller Gemini variants may be more practical.
Less Public Documentation than Pro: Most publicly available technical documentation focuses on Gemini 3.1 Pro. Gemini 3.1 Ultra's specifics, including token pricing, exact context window, and detailed access tiers, are less widely documented in public Google sources at the time of writing. Production users should consult current Google AI documentation for the latest pricing and access details.
Benchmark Records vs. Real-World Reliability: GPQA Diamond and ARC-AGI-2 are strong proxies for reasoning capability but don't fully predict real-world reliability on novel domains. As with all frontier models, validation on the specific deployment target remains required.
May 7, 2026