Google Gemini's push toward scientific reasoning

New Capabilities
By Newzino Staff

From chatbot to research partner: Google's Deep Think mode targets math, physics, and real-world science

February 12th, 2026: Major Deep Think Upgrade Targets Scientific Applications

Overview

OpenAI launched the first commercial reasoning model in September 2024. Seventeen months later, Google claims its upgraded Gemini 3 Deep Think has pulled ahead on the benchmarks that matter most for science. The February 2026 update scored 84.6% on ARC-AGI-2—a test designed to measure how well artificial intelligence generalizes to novel problems—and 48.4% on Humanity's Last Exam, a collection of 2,500 expert-level questions crowdsourced from nearly 1,000 specialists worldwide.

The shift marks Google's attempt to move AI reasoning from competitive coding and math competitions into practical scientific work. A Rutgers mathematician used Deep Think to review a paper on structures bridging gravity and quantum mechanics; the model identified a subtle logical flaw that had passed human peer review. Google is betting that scientists and engineers will pay $250 per month for an AI that can catch what experts miss.

Key Indicators

ARC-AGI-2 Score: 84.6%. Highest verified score on the benchmark measuring generalization efficiency, compared with 54% for GPT-5.2 and an average human score of 60%.
Humanity's Last Exam: 48.4%. Performance on 2,500 expert-level questions without using external tools.
Codeforces Elo: 3,455. Competitive programming rating, placing the model among elite human programmers.
Google AI Ultra Price: $250/mo. Monthly subscription required for access to Deep Think through the Gemini app.

People Involved

Demis Hassabis
Chief Executive Officer, Google DeepMind (Leading Google's AI efforts; 2024 Nobel laureate in Chemistry)
Lisa Carbone
Professor of Mathematics, Rutgers University (Early Deep Think research collaborator)
Dan Hendrycks
Director, Center for AI Safety (Co-creator of Humanity's Last Exam benchmark)

Organizations Involved

Google DeepMind
AI Research Laboratory
Status: Developer of Gemini model family

Google's primary artificial intelligence research division, responsible for AlphaGo, AlphaFold, and the Gemini models.

ARC Prize Foundation
AI Benchmark Organization
Status: Verified Gemini 3 Deep Think's 84.6% ARC-AGI-2 score

Non-profit organization that maintains the ARC-AGI benchmark series, designed to measure artificial general intelligence progress.

OpenAI
Artificial Intelligence Company (Public Benefit Corporation)
Status: Primary competitor in reasoning model development

San Francisco-based AI company that pioneered commercial reasoning models with o1 in September 2024.

Timeline

  1. Major Deep Think Upgrade Targets Scientific Applications

    Product

    Google announces an upgraded Deep Think scoring 84.6% on ARC-AGI-2 and 48.4% on Humanity's Last Exam. The model now excels at chemistry and physics problems and can convert sketches to 3D-printable files.

  2. Gemini 3 Family Launched

    Product

    Google releases Gemini 3 Pro and Gemini 3 Deep Think, calling Gemini 3 the company's most intelligent model. Available across the Gemini app, AI Studio, and Vertex AI from day one.

  3. Gemini 2.5 Deep Think Becomes Generally Available

    Product

    Deep Think mode launches broadly, achieving gold-medal level results at the International Mathematical Olympiad and the International Collegiate Programming Contest.

  4. Gemini 2.5 Deep Think Previewed at Google I/O

    Announcement

    Google previews an enhanced reasoning mode focused on complex, multi-step problems at its annual developer conference.

  5. Google Releases Gemini 2.5 Pro Experimental

    Product

    Google introduces its first 'thinking model' with chain-of-thought reasoning capabilities, establishing the foundation for the Deep Think approach.

  6. Humanity's Last Exam Benchmark Published

    Research

    Center for AI Safety and Scale AI release a new benchmark with 2,500 expert-level questions crowdsourced from nearly 1,000 specialists, designed to replace saturated benchmarks like MMLU.

  7. OpenAI Launches First Reasoning Model

    Industry

    OpenAI releases o1-preview and o1-mini, introducing 'thinking' models that spend more compute time reasoning before responding. The models achieve PhD-level performance on science benchmarks.

Scenarios

1. Deep Think Becomes Standard Research Tool

Discussed by: Google's product announcements and scientific collaboration case studies

If Deep Think proves reliable at catching errors in technical papers and interpreting messy experimental data, research institutions may integrate it into peer review workflows and laboratory analysis. Google's $250/month pricing targets well-funded academic labs and pharmaceutical companies. Success would validate the business model of expensive, specialized AI tools for professional users rather than mass-market chatbots.

2. Reasoning Model Competition Intensifies

Discussed by: Technology analysts comparing OpenAI, Google, and Anthropic roadmaps

OpenAI's o3 scored about 88% on the original ARC-AGI benchmark (not ARC-AGI-2), but at a prohibitive cost per task. Google's efficiency advantage could prove temporary if competitors improve their cost-performance ratio. The next 12 months will likely see rapid releases from multiple labs, potentially commoditizing reasoning capabilities and narrowing the window for Google to establish its market position.

3. Benchmark Saturation Undermines Claims

Discussed by: AI researchers and benchmark creators, including Dan Hendrycks

Previous benchmarks like MMLU became saturated within years of release, with models scoring above 90% and providing little signal about real capabilities. If Humanity's Last Exam and ARC-AGI-2 follow the same pattern, headline scores may become meaningless faster than practical applications mature. Labs would need to demonstrate value through real-world deployments rather than benchmark rankings.

4. Scientific Community Remains Skeptical

Discussed by: Academic researchers evaluating AI tools for their workflows

The Lisa Carbone case study—catching a peer review error—is compelling but singular. Scientists may resist delegating verification to systems they cannot fully audit, especially in fields where errors carry high stakes. Adoption could stall if early users encounter hallucinations or subtle mistakes that undermine trust in AI-assisted research.

Historical Context

IBM Deep Blue Defeats Kasparov (1997)

May 1997

What Happened

IBM's Deep Blue supercomputer defeated world chess champion Garry Kasparov 3.5-2.5 in a rematch after Kasparov had won their first match in 1996. The computer analyzed 200 million positions per second. IBM retired Deep Blue to the Smithsonian immediately after victory.

Outcome

Short Term

Headlines declared machines had conquered human intelligence. IBM's stock price rose. Kasparov accused IBM of cheating and demanded a rematch that never came.

Long Term

Chess AI became commodity software within years. The match proved machines could beat humans at narrow tasks but said little about general intelligence. IBM never commercialized Deep Blue.

Why It's Relevant Today

Google faces similar questions: Does beating benchmarks translate to practical value? IBM's victory was a publicity triumph that failed to become a product. Google is attempting the opposite—using benchmark success to drive subscriptions and scientific adoption.

AlphaFold Solves Protein Folding (2020)

November 2020

What Happened

DeepMind's AlphaFold2 predicted protein structures with accuracy approaching that of experimental methods, addressing a 50-year-old problem in biology. At the CASP14 competition, the model achieved a median backbone error of less than 1 Angstrom, roughly three times more accurate than the next-best system.

Outcome

Short Term

DeepMind made AlphaFold's predictions freely available, and by 2022 the database covered about 200 million proteins, nearly all known to science. Structural biologists gained instant access to data that previously required months of laboratory work.

Long Term

Hassabis and colleague John Jumper shared the 2024 Nobel Prize in Chemistry with David Baker. AlphaFold accelerated drug discovery research worldwide and demonstrated that AI could produce genuine scientific value, not just benchmark scores.

Why It's Relevant Today

AlphaFold established DeepMind's credibility for AI-driven scientific breakthroughs. Deep Think's focus on 'messy real-world problems' attempts to extend that credibility from specialized protein prediction to general scientific reasoning. The success of AlphaFold is the template Google hopes to replicate.

OpenAI o1 Launches Reasoning Model Category (2024)

September 2024

What Happened

OpenAI released o1-preview, the first commercial AI model designed to 'think' before responding using extended chains of reasoning. The model achieved PhD-level performance on physics, chemistry, and biology benchmarks and solved 83% of problems on the American Invitational Mathematics Examination versus 13% for GPT-4o.

Outcome

Short Term

The launch created a new product category—reasoning models—distinct from conversational chatbots. Competitors including Google, Anthropic, and DeepSeek began developing their own reasoning-focused systems.

Long Term

Inference-time compute scaling became a recognized technique: spending more computational resources during response generation rather than just during training. This shifted the competitive landscape from model size to reasoning efficiency.
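
As a rough illustration of inference-time compute scaling, the sketch below samples a simulated solver several times and takes a majority vote, one simple way to spend more compute per question. Everything in it is hypothetical: `noisy_solver` is a toy stand-in for a model call, the 40% per-sample accuracy is an assumed figure, and nothing here reflects how o1 or Deep Think actually work.

```python
import random
from collections import Counter


def noisy_solver(rng, correct_answer, p_correct):
    """Hypothetical stand-in for one sampled model response.

    Returns the correct answer with probability p_correct,
    otherwise a uniformly chosen wrong answer from 0..9.
    """
    if rng.random() < p_correct:
        return correct_answer
    wrong = [a for a in range(10) if a != correct_answer]
    return rng.choice(wrong)


def majority_vote(rng, correct_answer, p_correct, n_samples):
    """Spend more inference compute: sample n times, keep the most common answer."""
    votes = Counter(
        noisy_solver(rng, correct_answer, p_correct) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]


def accuracy(n_samples, trials=2000, p_correct=0.4, seed=0):
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    correct_answer = 7  # arbitrary ground truth for the toy task
    hits = sum(
        majority_vote(rng, correct_answer, p_correct, n_samples) == correct_answer
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 25):
        print(f"samples per question: {n:2d}  accuracy: {accuracy(n):.2f}")
```

In this toy setting, accuracy rises as the number of samples per question grows, even though the underlying solver never changes. The cost grows with it, which is the trade-off behind the per-task price comparisons between labs.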

Why It's Relevant Today

OpenAI defined the category that Google is now competing in. Deep Think's benchmark achievements are meaningful only in the context of a race that OpenAI started 17 months earlier. Google's emphasis on efficiency (achieving high scores without spending $1,000 or more per task) directly responds to criticism of OpenAI's o3 costs.
