The human genome may be about to get its GPT moment.
Artificial intelligence is changing how scientists read DNA, with new models scanning long genetic sequences to link patterns in the code to biological behavior, from gene regulation to disease risk. IBM researchers say these approaches could reshape drug discovery over time, in ways that echo how AI has altered modern software development.
The stakes are enormous. Google DeepMind recently published its AlphaGenome model, which takes up to one million DNA base pairs as input and predicts thousands of molecular properties across diverse biological processes, including chromatin accessibility, transcription factor binding and splice junction coordinates. In a study published in Nature, the DeepMind researchers found that the model outperformed existing tools in 22 of 24 variant effect prediction tasks, marking what they describe as a fundamental shift in how scientists can interrogate the regulatory code embedded in non-coding DNA.
For pharmaceutical companies and biotech firms, the promise is immense: faster identification of disease-causing mutations, more precise drug targeting and the ability to design experiments guided by computational predictions rather than brute-force screening. IBM Research has been developing its own suite of biomedical foundation models to tackle complementary challenges in drug discovery, with applications ranging from antibody design to small-molecule property prediction, part of a broader industry movement to apply large-scale AI to biological data.
AlphaGenome distinguishes itself from earlier genomic models by learning from multiple types of biological measurements simultaneously.
“What I found most novel about AlphaGenome was its multimodal nature,” Mark Gerstein, the Albert L. Williams Professor of Biomedical Informatics at Yale University, who was not involved in the research, told IBM Think in an interview. “The fact that it is trained on data from many different genomic modalities—for instance, RNA-seq, ATAC-seq and Hi-C—and predicts effects across these modalities is particularly notable.”
Gerstein said AlphaGenome stands out because it tries to predict multiple genomic signals simultaneously and treats them as connected rather than independent. Changes in chromatin state upstream, for example, can shape gene expression downstream—and models have long recognized those links. What’s new, in his view, is the scale at which AlphaGenome tries to fold those relationships directly into sequence-to-function prediction.
He also highlighted how much DNA the model can “see” in one pass. The window, he said, is unusually large, on the order of a megabase. It’s a span big enough to capture regulatory effects that can sit far from the genes they influence.
The human genome contains roughly three billion base pairs, but only about two percent of them encode proteins. The remaining 98 percent orchestrates when, where and how much of each protein gets made. Small variations in this regulatory machinery can profoundly alter an organism’s response to its environment or susceptibility to disease. But until recently, deciphering exactly how these sequences work at the molecular level has remained one of biology’s most stubborn puzzles.
Before AlphaGenome, researchers often had to make a compromise: scan a long region of DNA, but lose fine detail, or zoom in tightly and miss the long-range signals that matter in regulation. In a blog post announcing the model, DeepMind said it designed AlphaGenome to avoid that choice. The company described an architecture that uses convolutional layers to capture short DNA motifs, transformers to share information across the entire sequence and additional layers to translate those detected patterns into predictions across multiple biological readouts.
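To make the motif-detection idea concrete, here is a toy, pure-Python sketch of how a convolutional filter can flag a short DNA motif in a one-hot-encoded sequence. The `TATA` filter and the scoring scheme are illustrative inventions for this article, not AlphaGenome's actual architecture or weights.

```python
# Toy illustration of one building block described above: a convolutional
# filter sliding over one-hot-encoded DNA to detect a short motif.
# This is a didactic sketch, not AlphaGenome's actual model.

def one_hot(seq):
    """Map a DNA string to a list of 4-dim one-hot vectors (A, C, G, T)."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table[base] for base in seq]

def conv_scan(encoded, kernel):
    """Slide a motif filter across the sequence; a higher score means
    a closer match to the motif at that position."""
    k = len(kernel)
    scores = []
    for i in range(len(encoded) - k + 1):
        score = sum(encoded[i + j][c] * kernel[j][c]
                    for j in range(k) for c in range(4))
        scores.append(score)
    return scores

# A filter tuned to "TATA" (a classic core-promoter element).
tata_kernel = one_hot("TATA")

seq = "GGCGTATAAAGGC"
scores = conv_scan(one_hot(seq), tata_kernel)
best = scores.index(max(scores))
print(best, seq[best:best + 4])  # prints: 4 TATA
```

In a real model, hundreds of such filters are learned from data rather than hand-built, and transformer layers then relate motif hits that sit far apart in the sequence.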
DeepMind also emphasized how quickly the system can be trained. The company said training took about four hours and used roughly half the compute budget of its earlier Enformer model, an efficiency gain it highlighted as notable given AlphaGenome’s expanded scope.
AlphaGenome arrives in the wake of AlphaFold, the protein-structure system that helped convince the world that AI could tackle parts of biology once thought too complex to model directly. But DNA is a different kind of challenge. A change in sequence does not simply alter a static structure. It can ripple through regulation, shifting when a gene turns on, how much RNA gets made, how much protein is produced and how a cell reacts to signals from its environment.
Most genomics tools are built to handle that complexity in slices: one method to find protein-coding regions, another to interpret variants, another to estimate disease risk and another to support clinical decisions. AlphaGenome is designed to bring many of those steps into a single framework, rather than forcing researchers to stitch together separate models.
AlphaGenome is trained on an enormous archive of molecular biology experiments generated over decades of research, many of them produced by publicly funded consortia. DeepMind has described using large public datasets that measure how sequence and variation relate to signals such as RNA output and transcription factor binding in human and mouse cells. By learning from these experimental patterns, the model can reportedly identify not only the stretches of DNA that encode genes, but also the regulatory sequences that control when genes turn on, where they turn on and how strongly.
In DeepMind’s description, when researchers give the system a DNA sequence up to one million base pairs long, AlphaGenome can predict gene-related features across different cell types, including signals related to transcription and aspects of RNA processing, and how those outputs change when the sequence is altered.
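The variant-effect workflow described above can be sketched in a few lines: score a reference sequence, score the same sequence with one base changed, and report the difference. The `toy_predict` function below is a hypothetical stand-in (it simply measures GC content), not AlphaGenome's API or behavior.

```python
# Sketch of the variant-effect workflow: predict a molecular readout for
# a reference sequence and for the same sequence with one base changed,
# then score the variant by the difference in predictions.

def toy_predict(seq):
    """Hypothetical stand-in for a real predictive model. It returns a
    single 'expression-like' score based on GC content, which real
    models do not do."""
    return sum(base in "GC" for base in seq) / len(seq)

def apply_variant(seq, pos, alt):
    """Return the sequence with the base at `pos` replaced by `alt`."""
    return seq[:pos] + alt + seq[pos + 1:]

ref = "ATGCGCATTA"
alt = apply_variant(ref, 3, "A")   # C -> A at position 3

# A positive score suggests the variant raises the predicted signal,
# a negative score suggests it lowers it.
effect = toy_predict(alt) - toy_predict(ref)
print(f"variant effect score: {effect:+.2f}")  # prints: variant effect score: -0.10
```

A real system compares entire tracks of predicted signals (expression, chromatin accessibility, splicing) across cell types rather than a single scalar, but the compare-reference-to-alternate pattern is the same.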
DeepMind is building a single system meant to read regulatory DNA as a unified code. IBM’s approach centers on decomposing biological questions into well-defined tasks, with models optimized for the mathematical and biological structure of each domain.
“Our work on Biomedical Foundation Models (BMFM) takes a more practical, modular approach,” said Michal Rosen-Zvi, Director of AI for Healthcare and Life Sciences at IBM Research, in an interview with IBM Think. “We decompose complex biological questions into well-defined components and identify the mathematical and algorithmic innovations required for the specific tasks at hand.”
Based on this analysis, IBM develops specialized models tailored to distinct domains, including RNA transcriptomics, DNA sequence analysis, and small-molecule and protein representation, according to Rosen-Zvi. “Each model is designed to optimally capture the modalities most relevant to its domain, whether that is primary sequence, two-dimensional structure, three-dimensional conformation or, in the case of our RNA models, mathematical representations that more faithfully capture whole‑genome expression at the cellular level,” she said.
Rosen-Zvi said IBM’s DNA work tries to avoid treating the genome as a single “standard” sequence. “Importantly, in our DNA models we explicitly incorporate population-level variation, training not only on reference sequences but also on SNPs and other mutable sites,” she said. That design, Rosen-Zvi explained, lets the models learn evolutionary and functional signals that a single reference genome can’t capture—signals that might otherwise require training on many thousands of whole genomes to approximate.
Rosen-Zvi framed biomedical foundation models as tools that are both powerful and workable in practice. “Overall, the BMFM approach emphasizes efficient training and inference and is particularly well suited to problems where the underlying biology spans multiple layers of information, abstraction and observation,” she said. In her view, that’s exactly the terrain scientists have to cross when they try to explain disease, pinpoint drug targets, propose mechanisms of action, generate candidate compounds and predict which ones are worth pursuing.
IBM has been focusing its recent modeling work on two areas of drug development that tend to consume time and money: biologics and small molecules. Rosen-Zvi pointed to IBM’s MAMMAL, which is designed to predict antibody-antigen binding strength. She also highlighted IBM’s MMELON, which she said has performed well at predicting the therapeutic properties of small-molecule candidates, an early readout that can help teams decide what’s worth pursuing before lab work begins.
A new IBM paper, co-authored with the Cleveland Clinic, offers a clearer look at how MMELON works. It describes a “multi-view” method for representing molecules, which IBM Research presents as a case for domain-specific foundation models in biomedicine. The project grew out of IBM’s Discovery Accelerator Partnership with the Cleveland Clinic, a collaboration the two organizations have described as using AI and quantum computing to speed biomedical discovery.
IBM Research is also plugged into a much bigger data-building effort. It recently joined LIGAND-AI, a consortium announced in January 2026 that aims to generate open, high-quality datasets of protein-ligand interactions. The project announcement said the consortium, led by Pfizer and the Structural Genomics Consortium, includes 18 partners across nine countries.
Organizers said the initiative has a budget of more than 60 million euros and will probe thousands of proteins tied to both existing treatments and major unmet needs, including rare diseases, neurological conditions and cancer. The Structural Genomics Consortium said the project plans to generate billions of data points using complementary screening technologies, creating a resource that researchers worldwide can use to train and benchmark AI systems that predict molecular interactions.
The market for AI in biotechnology is expanding rapidly. A January 2026 analysis by Ardigen, citing Precedence Research, projects continued double-digit growth globally, with estimates pointing to a market exceeding USD 25 billion by the mid-2030s. The US market alone was approximately USD 2.1 billion in 2025, with growth driven primarily by adoption in drug discovery, genomics and precision medicine, the analysis stated.
Gerstein’s reaction comes with an asterisk: he called the results promising, but stressed that performance on curated benchmarks doesn’t always translate to messy real-world biology.
AlphaGenome, as he sees it, is powerful at describing what a single change might do within a genome model. But real genomes do not change one letter at a time. They come as whole, inherited packages, full of variants that shape one another’s effects. “In terms of limitations, one major issue is that the model predicts the effect of only a single variant and does not take into account the full genetic background of an individual’s personal genome,” he said. “Background genetics can substantially influence the impact of a particular variant, particularly by strongly affecting how a gene is expressed in response to a mutation.”
He thinks the next step is imaginable, even if it is harder: a future version of this kind of work could move beyond scoring a single mutation in isolation and instead operate directly on personal genomes. “One could imagine extending AlphaGenome by building large models that operate directly on personal genomes,” he said.
Medicine demands forms of evidence that many model developers simply do not have access to, Gerstein noted.
“With respect to translation into clinical practice, the main requirement is the accumulation of many use cases in which the effects of particular mutations are documented, followed by downstream validation showing that the predictions are accurate and clinically useful,” Gerstein said. “There is no substitute in the medical world for experimental data and actual clinical validation, and this will be necessary before outputs from tools like this are accepted.”
He also stressed what AlphaGenome does not claim to do: “It is important to remember that this tool provides the molecular consequences of specific mutations, not downstream phenotypic or disease-level effects,” he said. “As a result, additional work would be required to bridge that gap.”
Computational advances like AlphaFold build on a foundation that took decades to establish. AlphaFold itself relied on massive protein structure databases built through painstaking crystallography and other experimental techniques. Similarly, the genomic datasets used to train AlphaGenome came from large-scale efforts such as ENCODE, which spent years mapping functional elements across the genome.
Whether AI models will compress the timeline from genetic discovery to approved therapies remains an open question. Drug development still requires navigating the complexities of human biology, designing careful clinical studies and conducting long, rigorous trials to establish safety and efficacy. According to a January 2026 World Economic Forum analysis by Novartis, AI doesn’t allow researchers to circumvent those complexities, but it does offer a way to navigate them more intelligently. By enhancing how scientists choose targets, design molecules and avoid safety risks, AI is helping them make better decisions faster, the analysis states.
Rosen-Zvi framed the moment in sweeping terms. “We have already seen how AI has transformed text, images and code,” she said. “Biology and chemistry are next, and we are only at the beginning of that curve.”
Biomedical foundation models have the potential to fundamentally change how experiments are designed, prioritized and interpreted, shifting from slow, iterative wet-lab cycles to AI-guided hypothesis generation and decision-making, she said. “For enterprises, this means faster discovery, lower R&D risk and the ability to explore biological space that was previously inaccessible,” she said. “Organizations that engage now will help shape how these models are applied, validated and integrated into real workflows, rather than reacting once the transformation is already underway.”