AI models learn to read, predict, and write the genetic code of life

New Capabilities
By Newzino Staff

From protein folding to whole-genome design, biological foundation models are compressing decades of lab work into computation

Yesterday: Evo 2 published in Nature

Overview

It took thirteen years and $2.7 billion to read the first human genome. Now a single AI model, trained on 9.3 trillion DNA base pairs from more than 128,000 species, can predict whether an uncharacterized mutation in a breast cancer gene is dangerous—with 90 percent accuracy—without ever being shown that gene. On March 4, the Arc Institute and NVIDIA published Evo 2 in Nature, the largest biological foundation model ever built: 40 billion parameters, a context window of one million nucleotides, and the ability to design synthetic genomes the size of a simple bacterium.

Key Indicators

9.3T
DNA base pairs in training data
Evo 2 was trained on 9.3 trillion nucleotides from 128,000+ species spanning all three domains of life.
40B
Model parameters
The largest version of Evo 2 has 40 billion parameters, making it the biggest AI model built for biology.
90%
BRCA1 variant classification accuracy
Evo 2 predicted whether previously uncharacterized BRCA1 mutations affect gene function with 90 percent accuracy, without any task-specific training.
1M
Nucleotide context window
The model can process up to one million nucleotides at once—eight times more than Evo 1—enabling it to capture long-range dependencies across genomes.
16
Viable AI-designed bacteriophages
Out of roughly 300 AI-generated phage genome designs tested, 16 proved functional, with some outperforming natural phages.
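The zero-shot trick behind the BRCA1 result can be illustrated with a toy likelihood model: score the reference sequence and the mutated sequence under a model of how probable a DNA sequence is, and flag variants the model finds improbable. The trigram Markov model and sequences below are illustrative stand-ins, not the Evo 2 architecture or API; a sketch of the principle only.

```python
# Toy illustration of zero-shot variant scoring by sequence likelihood.
# A model like Evo 2 learns P(sequence) from trillions of bases; here a
# tiny trigram Markov model trained on a background sequence stands in.
from collections import defaultdict
import math

def train_trigram(seq):
    # Count next-base frequencies for each two-base context.
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq) - 2):
        counts[seq[i:i+2]][seq[i+2]] += 1
    return counts

def log_likelihood(seq, counts, alpha=1.0):
    # Additive smoothing over the 4-letter DNA alphabet.
    ll = 0.0
    for i in range(len(seq) - 2):
        ctx, nxt = seq[i:i+2], seq[i+2]
        total = sum(counts[ctx].values()) + 4 * alpha
        ll += math.log((counts[ctx][nxt] + alpha) / total)
    return ll

background = "ATGGCGATCGATCGTAGCTAGCATCGATCG" * 50   # made-up training data
model = train_trigram(background)

ref = "ATGGCGATCGATCGTAGCTAGCATCGATCG"
var = ref[:10] + "T" + ref[11:]          # single-nucleotide substitution

# A variant scored much lower than the reference is flagged as likely
# disruptive; Evo 2 applies the same idea zero-shot to BRCA1 variants.
delta = log_likelihood(var, model) - log_likelihood(ref, model)
print(delta)  # negative: the variant is less likely than the reference
```

The score is a log-likelihood ratio, so no labeled pathogenic/benign examples are needed at prediction time, which is what makes the approach "zero-shot."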


People Involved

Patrick Hsu
Co-founder, Arc Institute; Assistant Professor of Bioengineering, UC Berkeley (Co-senior author on Evo 2)
Brian Hie
Assistant Professor of Chemical Engineering, Stanford University; Innovation Investigator, Arc Institute (Co-senior author on Evo 2)
Demis Hassabis
Co-founder and CEO, Google DeepMind (2024 Nobel Laureate in Chemistry for AlphaFold)

Organizations Involved

Arc Institute
Independent Research Institute
Status: Developer of Evo 1 and Evo 2

A non-profit research institute in Palo Alto that gives scientists long-term, unrestricted funding to pursue high-risk biological research in partnership with Stanford, UC Berkeley, and UC San Francisco.

NVIDIA Corporation
Technology Company
Status: Infrastructure partner and co-developer of Evo 2

The dominant maker of graphics processing units (GPUs) used in AI training, NVIDIA provided the computing infrastructure and engineering collaboration for Evo 2.

Google DeepMind
AI Research Lab
Status: Developer of AlphaFold, the foundational precedent for biological AI

Google's AI research lab created AlphaFold, the protein structure prediction system that proved deep learning could transform biology and won the 2024 Nobel Prize in Chemistry.

Timeline

  1. Evo 2 published in Nature

    Publication

    The peer-reviewed Evo 2 paper appeared in Nature, describing the 40-billion-parameter model's ability to predict pathogenic mutations and design synthetic genomes across all domains of life.

  2. AI-generated bacteriophages shown to be functional

    Research

    Researchers used Evo models to generate synthetic bacteriophage genomes. Of roughly 300 designs tested, 16 proved viable, with some outperforming natural phages and a cocktail overcoming bacterial resistance in three E. coli strains.

  3. Evo 2 preprint released with open-source code and data

    Research

    Arc Institute and NVIDIA posted the Evo 2 preprint on bioRxiv, alongside publicly releasing model weights, training code, and the OpenGenome2 dataset of 9.3 trillion nucleotides.

  4. Evo 1 published in Science

    Research

    Arc Institute published Evo 1 in Science: a 7-billion-parameter model trained on prokaryotic genomes that could generate functional CRISPR systems and transposons, marking the first protein-RNA codesign with a language model.

  5. AlphaFold creators win Nobel Prize in Chemistry

    Recognition

    Demis Hassabis and John Jumper of Google DeepMind received the Nobel Prize in Chemistry for AlphaFold's protein structure predictions. David Baker shared the prize for computational protein design.

  6. ProGen demonstrates AI-designed functional proteins

    Research

    Salesforce Research published results in Nature Biotechnology showing that its ProGen model could generate novel protein sequences: 73 percent of the AI-designed proteins proved functional in lab tests, compared with 59 percent of the natural proteins tested alongside them.

  7. Meta releases ESM-2 and ESMFold protein language models

    Research

    Meta AI released ESM-2, a 15-billion-parameter protein language model, alongside ESMFold for structure prediction. The accompanying Metagenomic Atlas predicted structures for over 617 million proteins.

  8. Arc Institute launches with $650 million in funding

    Institutional

    The Arc Institute launched in Palo Alto with a novel funding model: eight-year unrestricted grants for scientists, in partnership with Stanford, UC Berkeley, and UC San Francisco.

  9. AlphaFold 2 cracks protein folding at CASP14

    Breakthrough

    Google DeepMind's AlphaFold 2 predicted protein structures with accuracy matching laboratory experiments at the CASP14 competition, effectively solving a fifty-year-old problem in biology.

Scenarios

1

Genomic AI becomes standard clinical tool for variant interpretation

Discussed by: Inside Precision Medicine, Stanford Engineering, clinical genomics researchers

Models like Evo 2 are integrated into clinical genetic testing pipelines to classify the millions of variants of unknown significance currently sitting in patient records. Genetic counselors use AI-generated predictions alongside existing evidence to advise patients on cancer risk, rare disease diagnosis, and pharmacogenomics. This requires regulatory validation—likely through the Food and Drug Administration's software-as-medical-device pathway—and large-scale prospective studies confirming that zero-shot predictions match real-world outcomes.

2

AI-designed phage therapies enter clinical trials against antibiotic-resistant infections

Discussed by: Genetic Engineering & Biotechnology News, Arc Institute researchers, Singularity Hub

The demonstrated ability to generate functional, novel bacteriophages that overcome bacterial resistance opens a path toward engineered phage cocktails as therapeutics. If follow-up work shows these AI-designed phages are safe and effective in animal models, clinical trials for treating drug-resistant infections could begin within a few years. Antibiotic resistance kills over a million people annually, and phage therapy has long been held back by the difficulty of finding the right phage for each bacterial strain—AI generation could change that calculus.

3

Biosecurity incident triggers calls for restricting open biological AI models

Discussed by: Council on Strategic Risks, Organisation for Economic Co-operation and Development, biosecurity researchers at Springer Nature

An incident—whether actual misuse or a demonstrated proof-of-concept in a security assessment—shows that open-source genomic AI models can be adapted to design dangerous biological agents despite training data exclusions. Governments respond with new restrictions on open release of biological foundation models, triggering a debate between biosecurity and open-science advocates. The Evo 2 team's decision to exclude human pathogens from training data was a voluntary measure; no enforceable international framework currently governs what data biological AI models can be trained on.

4

Next-generation models move from genome prediction to programmable organism design

Discussed by: Asimov Press, SynBioBeta, synthetic biology community

Evo 2 can generate sequences the length of small bacterial genomes, but the designed sequences are not yet full, boot-ready organisms. A successor model trained on richer functional data—or fine-tuned with experimental feedback loops—crosses that threshold, enabling researchers to specify desired biological functions and receive a complete genome blueprint. This would represent a fundamental shift in synthetic biology from editing existing organisms to designing new ones from scratch, with transformative applications in biomanufacturing, medicine, and agriculture.

Historical Context

Human Genome Project (1990–2003)

1990–April 2003

What Happened

An international consortium of researchers spent thirteen years and approximately $2.7 billion to sequence the first human genome's 3 billion base pairs. When completed in April 2003, it covered about 92 percent of the genome and was hailed as biology's equivalent of the Moon landing.

Outcome

Short Term

Sequencing costs began a dramatic decline—from $50 million for a second genome in 2003 to under $200 by 2024—as next-generation sequencing technology emerged.

Long Term

The project created the reference genome that underpins all modern genomics, from cancer diagnostics to ancestry testing. But reading the genome turned out to be far easier than understanding it—the function of most genetic variation remains unknown.

Why It's Relevant Today

Evo 2 was trained on the genomic data that the Human Genome Project and its successors generated. Its ability to predict mutational effects without task-specific training directly addresses the interpretation gap that has persisted since 2003: we can read genomes cheaply, but understanding what the variations mean has remained the bottleneck.

AlphaFold 2 solves protein structure prediction (2020)

November 2020

What Happened

Google DeepMind entered AlphaFold 2 in the CASP14 protein structure prediction competition and achieved accuracy comparable to experimental methods, solving a problem that had stymied biologists for fifty years. The team later predicted structures for virtually all 200 million known proteins and made the database freely available.

Outcome

Short Term

The structural biology community gained instant access to predicted structures that would have taken years to determine experimentally. More than two million researchers used the database within two years.

Long Term

Demis Hassabis and John Jumper won the 2024 Nobel Prize in Chemistry. AlphaFold demonstrated that AI could transform biology, catalyzing a wave of biological foundation models—including ESM-2, ProGen, and Evo—that expanded from protein structure to protein design to whole-genome modeling.

Why It's Relevant Today

AlphaFold proved the core premise that Evo 2 extends: biological sequence data contains enough information for AI to learn deep functional relationships. AlphaFold worked on proteins; Evo 2 operates on raw DNA across all of life, a larger and more fundamental challenge.

Asilomar Conference on Recombinant DNA (1975)

February 1975

What Happened

About 140 biologists, lawyers, and journalists gathered at Asilomar, California, to address safety concerns about recombinant DNA technology—the ability to splice genes from one organism into another. Scientists had voluntarily paused certain experiments and convened the conference to establish safety guidelines before proceeding.

Outcome

Short Term

The conference produced a set of safety guidelines that informed the National Institutes of Health's regulations on recombinant DNA research, allowing the work to continue under oversight.

Long Term

Asilomar became the defining example of scientific self-regulation. Recombinant DNA technology went on to produce insulin, gene therapy, and genetically modified crops. The guidelines evolved but the framework of voluntary caution followed by formal regulation persisted.

Why It's Relevant Today

The Evo 2 team's decision to exclude human pathogen sequences from training data echoes Asilomar's approach: scientists voluntarily limiting their own work before regulators act. But the parallel has limits—Asilomar governed a handful of labs, while Evo 2's open-source release means the model is available to anyone with sufficient computing power.
