Google Gemini's push toward scientific reasoning

New Capabilities

From chatbot to research partner: Google's Deep Think mode targets math, physics, and real-world science

February 12th, 2026: Major Deep Think Upgrade Targets Scientific Applications

Overview

OpenAI launched the first commercial reasoning model in September 2024. Seventeen months later, Google claims its upgraded Gemini 3 Deep Think has pulled ahead on the benchmarks that matter most for science. The February 2026 update scored 84.6% on ARC-AGI-2—a test designed to measure how well artificial intelligence generalizes to novel problems—and 48.4% on Humanity's Last Exam, a collection of 2,500 expert-level questions crowdsourced from nearly 1,000 specialists worldwide.

The shift marks Google's attempt to move AI reasoning from competitive coding and math competitions into practical scientific work. A Rutgers mathematician used Deep Think to review a paper on structures bridging gravity and quantum mechanics; the model identified a subtle logical flaw that had passed human peer review. Google is betting that scientists and engineers will pay $250 per month for an AI that can catch what experts miss.


Key Indicators

ARC-AGI-2 Score: 84.6%. Highest verified score on the benchmark measuring generalization efficiency, compared with 54% for GPT-5.2 and an average human score of 60%.
Humanity's Last Exam: 48.4%. Performance on 2,500 expert-level questions without using external tools.
Codeforces Elo: 3,455. Competitive programming rating, placing the model among elite human programmers.
Google AI Ultra Price: $250/mo. Monthly subscription required for access to Deep Think through the Gemini app.



Timeline

September 2024 to February 2026 · 7 events · Latest: February 12th, 2026
  1. Major Deep Think Upgrade Targets Scientific Applications

    Latest Product

    Google announces an upgraded Deep Think scoring 84.6% on ARC-AGI-2 and 48.4% on Humanity's Last Exam. The model now excels at chemistry and physics problems and can convert sketches to 3D-printable files.

  2. Gemini 3 Family Launched

    Product

    Google releases Gemini 3 Pro and Gemini 3 Deep Think, calling Gemini 3 its most intelligent model family. Available across the Gemini app, AI Studio, and Vertex AI from day one.

  3. Gemini 2.5 Deep Think Becomes Generally Available

    Product

    Deep Think mode launches broadly, achieving gold-medal-level results at the International Mathematical Olympiad and the International Collegiate Programming Contest.

  4. Gemini 2.5 Deep Think Previewed at Google I/O

    Announcement

    Google previews an enhanced reasoning mode focused on complex, multi-step problems at its annual developer conference.

  5. Google Releases Gemini 2.5 Pro Experimental

    Product

    Google introduces its first 'thinking model' with chain-of-thought reasoning capabilities, establishing the foundation for the Deep Think approach.

  6. Humanity's Last Exam Benchmark Published

    Research

    Center for AI Safety and Scale AI release a new benchmark with 2,500 expert-level questions crowdsourced from nearly 1,000 specialists, designed to replace saturated benchmarks like MMLU.

  7. OpenAI Launches First Reasoning Model

    Industry

    OpenAI releases o1-preview and o1-mini, introducing 'thinking' models that spend more compute time reasoning before responding. The models achieve PhD-level performance on science benchmarks.

Historical Context

3 moments from history that rhyme with this story — and how they unfolded.

May 1997

IBM Deep Blue Defeats Kasparov (1997)

IBM's Deep Blue supercomputer defeated world chess champion Garry Kasparov 3.5-2.5 in a rematch after Kasparov had won their first match in 1996. The computer analyzed 200 million positions per second. IBM retired Deep Blue to the Smithsonian immediately after victory.

Then

Headlines declared machines had conquered human intelligence. IBM's stock price rose. Kasparov accused IBM of cheating and demanded a rematch that never came.

Now

Chess AI became commodity software within years. The match proved machines could beat humans at narrow tasks but said little about general intelligence. IBM never commercialized Deep Blue.

Why this matters now

Google faces similar questions: Does beating benchmarks translate to practical value? IBM's victory was a publicity triumph that failed to become a product. Google is attempting the opposite—using benchmark success to drive subscriptions and scientific adoption.

November 2020

AlphaFold Solves Protein Folding (2020)

DeepMind's AlphaFold2 predicted protein structures with accuracy matching experimental methods, solving a 50-year-old problem in biology. At the CASP14 competition, the model achieved a median error of less than 1 angstrom, making it three times more accurate than any previous system.

Then

DeepMind released predictions for 200 million proteins—nearly all known to science—for free. Structural biologists gained instant access to data that previously required months of laboratory work.

Now

DeepMind's Demis Hassabis and colleague John Jumper won the 2024 Nobel Prize in Chemistry. AlphaFold accelerated drug discovery research worldwide and demonstrated that AI could produce genuine scientific value, not just benchmark scores.

Why this matters now

AlphaFold established DeepMind's credibility for AI-driven scientific breakthroughs. Deep Think's focus on 'messy real-world problems' attempts to extend that credibility from specialized protein prediction to general scientific reasoning. The success of AlphaFold is the template Google hopes to replicate.

September 2024

OpenAI o1 Launches Reasoning Model Category (2024)

OpenAI released o1-preview, the first commercial AI model designed to 'think' before responding using extended chains of reasoning. The model achieved PhD-level performance on physics, chemistry, and biology benchmarks and solved 83% of problems on the American Invitational Mathematics Examination versus 13% for GPT-4o.

Then

The launch created a new product category—reasoning models—distinct from conversational chatbots. Competitors including Google, Anthropic, and DeepSeek began developing their own reasoning-focused systems.

Now

Inference-time compute scaling became a recognized technique: spending more computational resources during response generation rather than just during training. This shifted the competitive landscape from model size to reasoning efficiency.

Why this matters now

OpenAI defined the category that Google is now competing in. Deep Think's benchmark achievements are meaningful only in the context of a race that OpenAI started 17 months earlier. Google's emphasis on efficiency (achieving high scores without spending $1,000 or more per task) directly responds to criticisms of OpenAI's o3 costs.
