Director, Center for AI Safety
Appears in 1 story
Co-creator of Humanity's Last Exam benchmark
OpenAI launched the first commercial reasoning model in September 2024. Seventeen months later, Google claims its upgraded Gemini 3 Deep Think has pulled ahead on the benchmarks that matter most for science. The February 2026 update scored 84.6% on ARC-AGI-2—a test designed to measure how well artificial intelligence generalizes to novel problems—and 48.4% on Humanity's Last Exam, a collection of 2,500 expert-level questions crowdsourced from nearly 1,000 specialists worldwide.
Updated Feb 13
No stories match your search
Try a different keyword
How would you like to describe your experience with the app today?