Google released Gemini 3.1 Pro on February 19, 2026, more than doubling its predecessor’s score on the ARC-AGI-2 reasoning benchmark. The model scored 77.1 percent on ARC-AGI-2 against Gemini 3 Pro’s 31.1 percent, and reached 94.3 percent on GPQA Diamond, a test of graduate-level scientific knowledge. Pricing is unchanged from Gemini 3 Pro at $2.00 per million input tokens.
The benchmark results position Gemini 3.1 Pro as a serious contender across multiple dimensions. On LiveCodeBench Pro, it achieved an Elo rating of 2887, well ahead of GPT-5.2’s 2393. On SWE-Bench Verified, which tests real-world software engineering tasks, it solved 80.6 percent of problems, competitive with Claude Opus 4.6’s 80.8 percent at less than half the cost. On Humanity’s Last Exam, Gemini 3.1 Pro scored 44.4 percent, beating both Claude Opus 4.6 at 40.0 percent and GPT-5.2 at 34.5 percent.
The model supports an input context window of 1,048,576 tokens and can generate outputs up to 65,536 tokens, giving it the capacity to process book-length documents and produce detailed analyses in a single pass. For developers, it is accessible through the Gemini API in Google AI Studio, Gemini CLI, Google Antigravity, and Android Studio. Enterprise customers can access it through Vertex AI and Gemini Enterprise.
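For readers who want to try the long-context path, here is a minimal sketch using Google’s `google-genai` Python SDK to send a large document through the Gemini API and cap the reply at the model’s 65,536-token output ceiling. The model identifier string `gemini-3.1-pro` is an assumption (check Google AI Studio for the published ID), and the file path is a placeholder.

```python
# Minimal sketch: book-length analysis in a single Gemini API call.
# Assumptions: "gemini-3.1-pro" is a guessed model ID, "book.txt" is
# a placeholder path. SDK: pip install google-genai
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("book.txt", encoding="utf-8") as f:
    document = f.read()  # must fit the 1,048,576-token input window

response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed ID; verify in Google AI Studio
    contents=f"Summarize the key arguments of this book:\n\n{document}",
    config=types.GenerateContentConfig(
        max_output_tokens=65536,  # the model's stated output limit
    ),
)
print(response.text)
```

At the published rate of $2.00 per million input tokens, a prompt that fills the entire input window costs roughly $2.00 per call, before output charges.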
Early enterprise feedback has been positive. JetBrains Director of AI Vladislav Tankov reported a 15 percent quality improvement over previous Gemini versions in code assistance tasks. Databricks CTO Hanlin Tang described best-in-class results on the company’s internal OfficeQA benchmark. Free Gemini users can access the model with usage caps, while Google AI Pro and Ultra subscribers receive higher limits.
The release intensifies the three-way competition between Google, OpenAI, and Anthropic for the frontier AI crown. Each company now has a model that leads on at least some benchmarks: Gemini 3.1 Pro dominates ARC-AGI-2 and LiveCodeBench, Claude Opus 4.6 leads SWE-Bench Verified, and GPT-5.4 recently introduced native computer use with a million-token context window. The practical differences for most users are narrowing even as the benchmark numbers diverge.
Not all feedback has been positive. Some users have reported that Gemini 3.1 Pro feels more mechanical than its predecessor, noting reduced emotional depth and creative flexibility in conversational interactions. This tension between benchmark performance and subjective quality — the difference between a model that scores well on tests and one that feels good to use — remains an unresolved challenge across the industry.
