AI models learn to read, predict, and write the genetic code of life

New Capabilities
By Newzino Staff

From protein folding to whole-genome design, biological foundation models are compressing decades of lab work into computation

Yesterday: Evo 2 published in Nature

Overview

It took thirteen years and $2.7 billion to read the first human genome. Now a single AI model, trained on 9.3 trillion DNA base pairs from more than 128,000 species, can predict whether an uncharacterized mutation in a breast cancer gene is dangerous—with 90 percent accuracy—without ever being shown that gene. On March 4, the Arc Institute and NVIDIA published Evo 2 in Nature, the largest biological foundation model ever built: 40 billion parameters, a context window of one million nucleotides, and the ability to design synthetic genomes the size of a simple bacterium.

Key Indicators

9.3T
DNA base pairs in training data
Evo 2 was trained on 9.3 trillion nucleotides from 128,000+ species spanning all three domains of life.
40B
Model parameters
The largest version of Evo 2 has 40 billion parameters, making it the biggest AI model built for biology.
90%
BRCA1 variant classification accuracy
Evo 2 predicted whether previously uncharacterized BRCA1 mutations affect gene function with 90 percent accuracy, without any task-specific training.
1M
Nucleotide context window
The model can process up to one million nucleotides at once—eight times more than Evo 1—enabling it to capture long-range dependencies across genomes.
16
Viable AI-designed bacteriophages
Out of roughly 300 AI-generated phage genome designs tested, 16 proved functional, with some outperforming natural phages.
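The zero-shot trick behind the BRCA1 result can be illustrated with a toy likelihood model: score the reference sequence and the mutated sequence under a model of how probable a DNA sequence is, and flag variants the model finds improbable. The trigram Markov model and sequences below are illustrative stand-ins, not the Evo 2 architecture or API; a sketch of the principle only.

```python
# Toy illustration of zero-shot variant scoring by sequence likelihood.
# A model like Evo 2 learns P(sequence) from trillions of bases; here a
# tiny trigram Markov model trained on a background sequence stands in.
from collections import defaultdict
import math

def train_trigram(seq):
    # Count next-base frequencies for each two-base context.
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq) - 2):
        counts[seq[i:i+2]][seq[i+2]] += 1
    return counts

def log_likelihood(seq, counts, alpha=1.0):
    # Additive smoothing over the 4-letter DNA alphabet.
    ll = 0.0
    for i in range(len(seq) - 2):
        ctx, nxt = seq[i:i+2], seq[i+2]
        total = sum(counts[ctx].values()) + 4 * alpha
        ll += math.log((counts[ctx][nxt] + alpha) / total)
    return ll

background = "ATGGCGATCGATCGTAGCTAGCATCGATCG" * 50   # made-up training data
model = train_trigram(background)

ref = "ATGGCGATCGATCGTAGCTAGCATCGATCG"
var = ref[:10] + "T" + ref[11:]          # single-nucleotide substitution

# A variant scored much lower than the reference is flagged as likely
# disruptive; Evo 2 applies the same idea zero-shot to BRCA1 variants.
delta = log_likelihood(var, model) - log_likelihood(ref, model)
print(delta)  # negative: the variant is less likely than the reference
```

The score is a log-likelihood ratio, so no labeled pathogenic/benign examples are needed at prediction time, which is what makes the approach "zero-shot."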


People Involved

Patrick Hsu
Co-founder, Arc Institute; Assistant Professor of Bioengineering, UC Berkeley (Co-senior author on Evo 2)
Brian Hie
Assistant Professor of Chemical Engineering, Stanford University; Innovation Investigator, Arc Institute (Co-senior author on Evo 2)
Demis Hassabis
Co-founder and CEO, Google DeepMind (2024 Nobel Laureate in Chemistry for AlphaFold)

Organizations Involved

Arc Institute
Independent Research Institute
Status: Developer of Evo 1 and Evo 2

A non-profit research institute in Palo Alto that gives scientists long-term, unrestricted funding to pursue high-risk biological research in partnership with Stanford, UC Berkeley, and UC San Francisco.

NVIDIA Corporation
Technology Company
Status: Infrastructure partner and co-developer of Evo 2

The dominant maker of graphics processing units (GPUs) used in AI training, NVIDIA provided the computing infrastructure and engineering collaboration for Evo 2.

Google DeepMind
AI Research Lab
Status: Developer of AlphaFold, the foundational precedent for biological AI

Google's AI research lab created AlphaFold, the protein structure prediction system that proved deep learning could transform biology and won the 2024 Nobel Prize in Chemistry.

Timeline

  1. Evo 2 published in Nature

    Publication

    The peer-reviewed Evo 2 paper appeared in Nature, describing the 40-billion-parameter model's ability to predict pathogenic mutations and design synthetic genomes across all domains of life.

  2. AI-generated bacteriophages shown to be functional

    Research

    Researchers used Evo models to generate synthetic bacteriophage genomes. Of roughly 300 designs tested, 16 proved viable, with some outperforming natural phages and a cocktail overcoming bacterial resistance in three E. coli strains.

  3. Evo 2 preprint released with open-source code and data

    Research

    Arc Institute and NVIDIA posted the Evo 2 preprint on bioRxiv, alongside publicly releasing model weights, training code, and the OpenGenome2 dataset of 9.3 trillion nucleotides.

  4. Evo 1 published in Science

    Research

    Arc Institute published Evo 1 in Science: a 7-billion-parameter model trained on prokaryotic genomes that could generate functional CRISPR systems and transposons, marking the first protein-RNA codesign with a language model.

  5. AlphaFold creators win Nobel Prize in Chemistry

    Recognition

    Demis Hassabis and John Jumper of Google DeepMind received the Nobel Prize in Chemistry for AlphaFold's protein structure predictions. David Baker shared the prize for computational protein design.

  6. ProGen demonstrates AI-designed functional proteins

    Research

    Salesforce Research published results in Nature Biotechnology showing that its ProGen model could generate novel protein sequences: 73 percent of the AI-designed proteins proved functional in lab tests, compared with 59 percent of the natural proteins tested alongside them.

  7. Meta releases ESM-2 and ESMFold protein language models

    Research

    Meta AI released ESM-2, a 15-billion-parameter protein language model, alongside ESMFold for structure prediction. The accompanying Metagenomic Atlas predicted structures for over 617 million proteins.

  8. Arc Institute launches with $650 million in funding

    Institutional

    The Arc Institute launched in Palo Alto with a novel funding model: eight-year unrestricted grants for scientists, in partnership with Stanford, UC Berkeley, and UC San Francisco.

  9. AlphaFold 2 cracks protein folding at CASP14

    Breakthrough

    Google DeepMind's AlphaFold 2 predicted protein structures with accuracy matching laboratory experiments at the CASP14 competition, effectively solving a fifty-year-old problem in biology.

Scenarios

1

Genomic AI becomes standard clinical tool for variant interpretation

Discussed by: Inside Precision Medicine, Stanford Engineering, clinical genomics researchers

Models like Evo 2 are integrated into clinical genetic testing pipelines to classify the millions of variants of unknown significance currently sitting in patient records. Genetic counselors use AI-generated predictions alongside existing evidence to advise patients on cancer risk, rare disease diagnosis, and pharmacogenomics. This requires regulatory validation—likely through the Food and Drug Administration's software-as-medical-device pathway—and large-scale prospective studies confirming that zero-shot predictions match real-world outcomes.

2

AI-designed phage therapies enter clinical trials against antibiotic-resistant infections

Discussed by: Genetic Engineering & Biotechnology News, Arc Institute researchers, Singularity Hub

The demonstrated ability to generate functional, novel bacteriophages that overcome bacterial resistance opens a path toward engineered phage cocktails as therapeutics. If follow-up work shows these AI-designed phages are safe and effective in animal models, clinical trials for treating drug-resistant infections could begin within a few years. Antibiotic resistance kills over a million people annually, and phage therapy has long been held back by the difficulty of finding the right phage for each bacterial strain—AI generation could change that calculus.

3

Biosecurity incident triggers calls for restricting open biological AI models

Discussed by: Council on Strategic Risks, Organisation for Economic Co-operation and Development, biosecurity researchers at Springer Nature

An incident—whether actual misuse or a demonstrated proof-of-concept in a security assessment—shows that open-source genomic AI models can be adapted to design dangerous biological agents despite training data exclusions. Governments respond with new restrictions on open release of biological foundation models, triggering a debate between biosecurity and open-science advocates. The Evo 2 team's decision to exclude human pathogens from training data was a voluntary measure; no enforceable international framework currently governs what data biological AI models can be trained on.

4

Next-generation models move from genome prediction to programmable organism design

Discussed by: Asimov Press, SynBioBeta, synthetic biology community

Evo 2 can generate sequences the length of small bacterial genomes, but the designed sequences are not yet full, boot-ready organisms. A successor model trained on richer functional data—or fine-tuned with experimental feedback loops—crosses that threshold, enabling researchers to specify desired biological functions and receive a complete genome blueprint. This would represent a fundamental shift in synthetic biology from editing existing organisms to designing new ones from scratch, with transformative applications in biomanufacturing, medicine, and agriculture.

Historical Context

Human Genome Project (1990–2003)

1990–April 2003

What Happened

An international consortium of researchers spent thirteen years and approximately $2.7 billion to sequence the first human genome's 3 billion base pairs. When completed in April 2003, it covered about 92 percent of the genome and was hailed as biology's equivalent of the Moon landing.

Outcome

Short Term

Sequencing costs began a dramatic decline—from $50 million for a second genome in 2003 to under $200 by 2024—as next-generation sequencing technology emerged.

Long Term

The project created the reference genome that underpins all modern genomics, from cancer diagnostics to ancestry testing. But reading the genome turned out to be far easier than understanding it—the function of most genetic variation remains unknown.

Why It's Relevant Today

Evo 2 was trained on the genomic data that the Human Genome Project and its successors generated. Its ability to predict mutational effects without task-specific training directly addresses the interpretation gap that has persisted since 2003: we can read genomes cheaply, but understanding what the variations mean has remained the bottleneck.

AlphaFold 2 solves protein structure prediction (2020)

November 2020

What Happened

Google DeepMind entered AlphaFold 2 in the CASP14 protein structure prediction competition and achieved accuracy comparable to experimental methods, solving a problem that had stymied biologists for fifty years. The team later predicted structures for virtually all 200 million known proteins and made the database freely available.

Outcome

Short Term

The structural biology community gained instant access to predicted structures that would have taken years to determine experimentally. More than two million researchers used the database within two years.

Long Term

Demis Hassabis and John Jumper won the 2024 Nobel Prize in Chemistry. AlphaFold demonstrated that AI could transform biology, catalyzing a wave of biological foundation models—including ESM-2, ProGen, and Evo—that expanded from protein structure to protein design to whole-genome modeling.

Why It's Relevant Today

AlphaFold proved the core premise that Evo 2 extends: biological sequence data contains enough information for AI to learn deep functional relationships. AlphaFold worked on proteins; Evo 2 operates on raw DNA across all of life, a larger and more fundamental challenge.

Asilomar Conference on Recombinant DNA (1975)

February 1975

What Happened

About 140 biologists, lawyers, and journalists gathered at Asilomar, California, to address safety concerns about recombinant DNA technology—the ability to splice genes from one organism into another. Scientists had voluntarily paused certain experiments and convened the conference to establish safety guidelines before proceeding.

Outcome

Short Term

The conference produced a set of safety guidelines that informed the National Institutes of Health's regulations on recombinant DNA research, allowing the work to continue under oversight.

Long Term

Asilomar became the defining example of scientific self-regulation. Recombinant DNA technology went on to produce insulin, gene therapy, and genetically modified crops. The guidelines evolved but the framework of voluntary caution followed by formal regulation persisted.

Why It's Relevant Today

The Evo 2 team's decision to exclude human pathogen sequences from training data echoes Asilomar's approach: scientists voluntarily limiting their own work before regulators act. But the parallel has limits—Asilomar governed a handful of labs, while Evo 2's open-source release means the model is available to anyone with sufficient computing power.
