Google's Gemini 3.1 Pro is here, and it just doubled its reasoning score
Source: ZDNet

ZDNET’s key takeaways
- Gemini 3.1 Pro is now available.
- It builds on the benchmark progress Gemini 3 established for Google.
- Model capabilities are ultimately relative, one expert said.
Another week, another “smarter” model—this time from Google, which just released Gemini 3.1 Pro.
Gemini 3 has outperformed several competitor models since its release in November, beating Copilot in a few of our in‑house task tests, and has generally received praise from users. Google said its latest Gemini model, announced Thursday, achieved “more than double the reasoning performance of 3 Pro” in testing, based on its 77.1% score on the ARC‑AGI‑2 benchmark for “entirely new logic patterns.”
Also: Gemini vs. Copilot: I compared the AI tools on 7 everyday tasks, and there’s a clear winner
The latest model follows a “major upgrade” to Gemini 3 Deep Think last week, which added new capabilities in chemistry, physics, math, and coding. Google explained that the Deep Think upgrade was built to address “tough research challenges—where problems often lack clear guardrails or a single correct solution and data is often messy or incomplete.” Gemini 3.1 Pro underpins that science‑heavy investment, serving as the “upgraded core intelligence that makes those breakthroughs possible.”
Late last year, Gemini 3 set a new high score of 38.3% among all then-available models on the Humanity’s Last Exam (HLE) benchmark. Developed to counter increasingly beatable industry‑standard benchmarks and better measure model progress against human ability, HLE is meant to be a more rigorous test, though benchmarks alone aren’t sufficient to gauge performance.
According to Google, Gemini 3.1 Pro now bests that score at 44.4%—though the Deep Think upgrade technically scored higher at 48.4%. Similarly, the Deep Think update scored 84.6%—higher than 3.1 Pro’s 77.1%—on the ARC‑AGI‑2 logic benchmark.
Also: The making of Gemini 3 – how Google’s slow and steady approach won the AI race (for now)
All that said, Anthropic’s Claude Opus 4.6 still tops the Center for AI Safety (CAIS) text‑capability leaderboard (for reasoning and other text‑based queries), which averages other relevant benchmark scores outside of HLE. Anthropic’s Opus 4.5, Sonnet 4.5, and Opus 4.6 also beat Gemini 3 in terms of safety, according to the CAIS risk‑assessment leaderboard.
Hype management
Benchmark records aside, the lifecycle of a model doesn’t end with a splashy release. At the current rate of AI development, new models are impressive only relative to their competition; time and testing will tell where 3.1 Pro excels or falls short. Gemini 3 gives the new model a strong foundation, but that may only last until the next lab releases a state‑of‑the‑art upgrade.
Also: Inside Google’s AI plan to end Android developer toil – and speed up innovation
“The test numbers seem to imply that it’s got substantial improvement over Gemini 3, and Gemini 3 was pretty good, but I don’t think we’re really going to know right away, and it’s not available except to the more expensive plans yet,” said ZDNET senior contributing editor David Gewirtz. “The shoe hasn’t yet fallen on GPT 5.3 either, and I think when it does, we’ll have a more universal set of upgrades that we can readdress.”
While we wait for that model to drop, Gewirtz looked into GPT‑5.3‑Codex, OpenAI’s most recent coding‑specific release, which famously helped build itself.
Try it yourself
Developers can access Gemini 3.1 Pro in preview today through the API in Google’s AI Studio, Android Studio, Google Antigravity, and the Gemini CLI. Enterprise customers can try it in Vertex AI and Gemini Enterprise, and regular users can find it in NotebookLM and the Gemini app.
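For developers who want to experiment through the API rather than the apps, the sketch below builds a request for the Gemini REST API’s `generateContent` endpoint. The model ID `gemini-3.1-pro-preview` is an assumption based on Google’s usual naming for preview models — check AI Studio’s model list for the exact string before using it.

```python
import json
import os

# Assumed preview model ID -- verify the exact string in Google's model list.
MODEL = "gemini-3.1-pro-preview"

# Google documents the v1beta path for preview models on the Gemini API.
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL}:generateContent"
)

def build_request(prompt: str) -> dict:
    """Build the JSON body for a generateContent call."""
    return {"contents": [{"parts": [{"text": prompt}]}]}

body = build_request("Explain the ARC-AGI-2 benchmark in one sentence.")

# Only send the request if an API key is configured in the environment.
api_key = os.environ.get("GEMINI_API_KEY")
if api_key:
    import urllib.request

    req = urllib.request.Request(
        f"{ENDPOINT}?key={api_key}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
        print(reply["candidates"][0]["content"]["parts"][0]["text"])
```

The same prompt structure works unchanged if you later switch the model ID to a stable release — only the `MODEL` string needs to change.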