Google Gemini's push toward scientific reasoning

Overview

OpenAI launched the first commercial reasoning model in September 2024. Seventeen months later, Google claims its upgraded Gemini 3 Deep Think has pulled ahead on the benchmarks that matter most for science. The February 2026 update scored 84.6% on ARC-AGI-2—a test designed to measure how well artificial intelligence generalizes to novel problems—and 48.4% on Humanity's Last Exam, a collection of 2,500 expert-level questions crowdsourced from nearly 1,000 specialists worldwide.

The shift marks Google's attempt to move AI reasoning from competitive coding and math competitions into practical scientific work. A Rutgers mathematician used Deep Think to review a paper on structures bridging gravity and quantum mechanics; the model identified a subtle logical flaw that had passed human peer review. Google is betting that scientists and engineers will pay $250 per month for an AI that can catch what experts miss.

Key Indicators

84.6%

ARC-AGI-2 Score

Highest verified score on the benchmark, which measures generalization efficiency, compared with 54% for GPT-5.2 and an average human score of 60%

48.4%

Humanity's Last Exam

Performance on 2,500 expert-level questions without using external tools

3,455

Codeforces Elo

Competitive programming rating, placing the model among elite human programmers

$250/mo

Google AI Ultra Price

Monthly subscription required for access to Deep Think through the Gemini app

People Involved

Demis Hassabis

Chief Executive Officer, Google DeepMind (Leading Google's AI efforts; 2024 Nobel laureate in Chemistry)

Lisa Carbone

Professor of Mathematics, Rutgers University (Early Deep Think research collaborator)

Dan Hendrycks

Director, Center for AI Safety (Co-creator of Humanity's Last Exam benchmark)

Organizations Involved

Google DeepMind

AI Research Laboratory

Status: Developer of Gemini model family

Google's primary artificial intelligence research division, responsible for AlphaGo, AlphaFold, and the Gemini models.

ARC Prize Foundation

AI Benchmark Organization

Status: Verified Gemini 3 Deep Think's 84.6% ARC-AGI-2 score

Non-profit organization that maintains the ARC-AGI benchmark series, designed to measure artificial general intelligence progress.

OpenAI

Artificial Intelligence Company (Public Benefit Corporation)

Status: Primary competitor in reasoning model development

San Francisco-based AI company that pioneered commercial reasoning models with o1 in September 2024.

Timeline

Major Deep Think Upgrade Targets Scientific Applications
Product

Google announces an upgraded Deep Think scoring 84.6% on ARC-AGI-2 and 48.4% on Humanity's Last Exam. The model now excels at chemistry and physics problems and can convert sketches to 3D-printable files.
February 12th, 2026

Gemini 3 Family Launched
Product

Google releases Gemini 3 Pro and Gemini 3 Deep Think, calling Gemini 3 its most intelligent model. Available across the Gemini app, AI Studio, and Vertex AI from day one.
November 18th, 2025

Gemini 2.5 Deep Think Becomes Generally Available
Product

Deep Think mode launches broadly, achieving gold-medal-level results at the International Mathematical Olympiad and the International Collegiate Programming Contest.
August 1st, 2025

Gemini 2.5 Deep Think Previewed at Google I/O
Announcement

Google previews an enhanced reasoning mode focused on complex, multi-step problems at its annual developer conference.
May 20th, 2025

Google Releases Gemini 2.5 Pro Experimental
Product

Google introduces its first 'thinking model' with chain-of-thought reasoning capabilities, establishing the foundation for the Deep Think approach.
March 25th, 2025

Humanity's Last Exam Benchmark Published
Research

Center for AI Safety and Scale AI release a new benchmark with 2,500 expert-level questions crowdsourced from nearly 1,000 specialists, designed to replace saturated benchmarks like MMLU.
January 21st, 2025

OpenAI Launches First Reasoning Model
Industry

OpenAI releases o1-preview and o1-mini, introducing 'thinking' models that spend more compute time reasoning before responding. The models achieve PhD-level performance on science benchmarks.
September 12th, 2024

Scenarios

Deep Think Becomes Standard Research Tool

Discussed by: Google's product announcements and scientific collaboration case studies

If Deep Think proves reliable at catching errors in technical papers and interpreting messy experimental data, research institutions may integrate it into peer review workflows and laboratory analysis. Google's $250/month pricing targets well-funded academic labs and pharmaceutical companies. Success would validate the business model of expensive, specialized AI tools for professional users rather than mass-market chatbots.

Reasoning Model Competition Intensifies

Discussed by: Technology analysts comparing OpenAI, Google, and Anthropic roadmaps

OpenAI's o3 achieved roughly 88% on the original ARC-AGI benchmark, the predecessor to ARC-AGI-2, but at a prohibitive cost per task. Google's efficiency advantage could prove temporary if competitors improve their cost-performance ratio. The next 12 months will likely see rapid releases from multiple labs, potentially commoditizing reasoning capabilities and narrowing the window for Google to establish its market position.

Benchmark Saturation Undermines Claims

Discussed by: AI researchers and benchmark creators, including Dan Hendrycks

Previous benchmarks like MMLU became saturated within years of release, with models scoring above 90% and providing little signal about real capabilities. If Humanity's Last Exam and ARC-AGI-2 follow the same pattern, headline scores may become meaningless faster than practical applications mature. Labs would need to demonstrate value through real-world deployments rather than benchmark rankings.

Scientific Community Remains Skeptical

Discussed by: Academic researchers evaluating AI tools for their workflows

The Lisa Carbone case study—catching a peer review error—is compelling but singular. Scientists may resist delegating verification to systems they cannot fully audit, especially in fields where errors carry high stakes. Adoption could stall if early users encounter hallucinations or subtle mistakes that undermine trust in AI-assisted research.

Historical Context

IBM Deep Blue Defeats Kasparov (1997)

May 1997

What Happened

IBM's Deep Blue supercomputer defeated world chess champion Garry Kasparov 3.5-2.5 in a rematch after Kasparov had won their first match in 1996. The computer analyzed 200 million positions per second. IBM retired Deep Blue after the victory, and part of the machine later went to the Smithsonian.

Outcome

Short Term

Headlines declared machines had conquered human intelligence. IBM's stock price rose. Kasparov accused IBM of cheating and demanded a rematch that never came.

Long Term

Chess AI became commodity software within years. The match proved machines could beat humans at narrow tasks but said little about general intelligence. IBM never commercialized Deep Blue.

Why It's Relevant Today

Google faces similar questions: Does beating benchmarks translate to practical value? IBM's victory was a publicity triumph that failed to become a product. Google is attempting the opposite—using benchmark success to drive subscriptions and scientific adoption.

AlphaFold Solves Protein Folding (2020)

November 2020

What Happened

DeepMind's AlphaFold2 predicted protein structures with accuracy matching experimental methods, solving a 50-year-old problem in biology. The model achieved a median error of less than 1 Angstrom at the CASP14 competition, three times more accurate than any previous system.

Outcome

Short Term

DeepMind open-sourced AlphaFold in 2021 and, by 2022, had released free structure predictions for about 200 million proteins, nearly all known to science. Structural biologists gained instant access to data that previously required months of laboratory work.

Long Term

Hassabis and colleague John Jumper shared the 2024 Nobel Prize in Chemistry with David Baker. AlphaFold accelerated drug discovery research worldwide and demonstrated that AI could produce genuine scientific value, not just benchmark scores.

Why It's Relevant Today

AlphaFold established DeepMind's credibility for AI-driven scientific breakthroughs. Deep Think's focus on 'messy real-world problems' attempts to extend that credibility from specialized protein prediction to general scientific reasoning. The success of AlphaFold is the template Google hopes to replicate.

OpenAI o1 Launches Reasoning Model Category (2024)

September 2024

What Happened

OpenAI released o1-preview, the first commercial AI model designed to 'think' before responding using extended chains of reasoning. The model achieved PhD-level performance on physics, chemistry, and biology benchmarks and solved 83% of problems on the American Invitational Mathematics Examination versus 13% for GPT-4o.

Outcome

Short Term

The launch created a new product category—reasoning models—distinct from conversational chatbots. Competitors including Google, Anthropic, and DeepSeek began developing their own reasoning-focused systems.

Long Term

Inference-time compute scaling became a recognized technique: spending more computational resources during response generation rather than just during training. This shifted the competitive landscape from model size to reasoning efficiency.

Why It's Relevant Today

OpenAI defined the category that Google is now competing in. Deep Think's benchmark achievements are meaningful only in the context of a race that OpenAI started 17 months earlier. Google's emphasis on efficiency (achieving high scores without $1,000+ per task) directly responds to criticisms of OpenAI's o3 costs.
