For decades, geneticists lived with a quietly embarrassing fact at the core of their field. The Human Genome Project, completed in 2003 after years of international work and roughly three billion dollars in funding, successfully mapped the protein-coding regions of human DNA. Those regions make up only about 2% of the genome. The remaining 98%, long stretches of DNA that do not encode proteins and do not fit neatly into the gene-function model that molecular biology had spent a century building, was dismissed as junk. The label stuck longer than it should have.
It wasn’t garbage. No one had yet mastered the language.
The non-coding genome, which scientists have come to call the “dark matter of DNA,” holds the regulatory machinery that controls when and where genes turn on and off and determines how cells differentiate. When that machinery is disrupted, it drives a wide range of diseases that have stubbornly resisted diagnosis for years. The problem was never that the genome lacked the answers. The problem was that it was computationally and biologically infeasible for earlier tools to read 3 billion base pairs and determine which small variations in the regulatory code actually cause disease. That is beginning to change, and the pace of that change is worth watching.
| Topic | Genomic AI and the Non-Coding Human Genome — Rare Disease Diagnosis and Treatment |
|---|---|
| Key AI Model 1 | AlphaGenome — developed by Google DeepMind |
| AlphaGenome Capability | Analyzes up to 1 million DNA base pairs at once; predicts how non-coding mutations alter gene expression |
| Published In | Nature (June 2025 / January 2026 coverage) |
| Key AI Model 2 | popEVE — developed by Harvard Medical School (Marks Lab) and CRG Barcelona |
| popEVE Published In | Nature Genetics (November 24, 2025) |
| Lead Researcher (popEVE) | Debora Marks, Professor of Systems Biology, Blavatnik Institute, Harvard Medical School |
| popEVE Key Finding | Identified variants in 123 genes not previously linked to developmental disorders; led to a diagnosis in ~1/3 of about 30,000 undiagnosed patients |
| The Non-Coding Genome | About 98% of human DNA does not code for proteins; previously dismissed as “junk DNA”; now known to regulate gene activity |
| Clinical Partners | Children’s Rare Disease Collaborative (Boston Children’s Hospital), Children’s Hospital of Philadelphia, Genomics England |
| RNA Dark Matter | Human RNome Project — sequencing all human RNA modifications to map disease connections; over 50 known RNA chemical modifications |
| Reference Links | Harvard Medical School – New AI Model Could Speed Rare Disease Diagnosis / Scientific American – Google DeepMind’s AlphaGenome |

By any standard, Google DeepMind’s AlphaGenome is a major advance. The model takes sequences of up to one million DNA letters and predicts how mutations in them affect gene expression, that is, which genes are switched on or off in particular cell types. Earlier computational methods could only handle much shorter sequences, so they struggled with the long-range regulatory interactions that make non-coding DNA so complex. As described in Nature, AlphaGenome is designed to capture those interactions: it identifies which genetic switches appear to be disrupted in disease-affected cells and forecasts the functional consequences of variants previously filed as variants of unknown significance, which is essentially a medical shrug.
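To see why input length matters, here is a minimal sketch. It is not AlphaGenome's interface or method; it only checks whether a variant and a distant regulatory target fit inside a single input window. The function name, the coordinates, and the 100,000-letter comparison window are illustrative assumptions; only the one-million-letter figure comes from the article.

```python
def fits_in_window(variant_pos: int, target_pos: int, window_size: int) -> bool:
    """Return True if a window of `window_size` letters centered on the
    variant also covers the target position (for example, a gene start)."""
    half = window_size // 2
    start = max(0, variant_pos - half)  # clamp at the start of the chromosome
    end = start + window_size           # half-open interval [start, end)
    return start <= target_pos < end


# Hypothetical coordinates: a non-coding variant and a gene 400,000 letters away.
variant_pos = 5_000_000
gene_start = 5_400_000

short_ok = fits_in_window(variant_pos, gene_start, 100_000)    # assumed shorter window
long_ok = fits_in_window(variant_pos, gene_start, 1_000_000)   # window size from the article

print(f"100k window covers the gene: {short_ok}")
print(f"1M window covers the gene:   {long_ok}")
```

Under these made-up coordinates, only the longer window contains both the variant and the gene it might regulate, which is the situation a model limited to short inputs cannot even represent.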
At the same time, a different model from Harvard Medical School researchers was gaining attention for a more immediate clinical purpose. popEVE, created in the Marks Lab and published in Nature Genetics in late 2025, addresses a specific problem: ranking the genetic variants in a patient’s genome when trying to find the cause of an illness. Every human carries tens of thousands of genetic variants. Most are harmless. A tiny fraction cause disease. Telling the two apart has been difficult, costly, and sometimes impossible, which is why many patients with rare genetic conditions go years, or even their entire childhood, without a diagnosis. For every variant in a patient’s genome, popEVE produces a score that ranks variants by how likely they are to cause disease and indicates whether a variant is more likely to cause death in childhood than in adulthood.
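The idea of turning thousands of unknowns into a ranked shortlist can be sketched in a few lines. This is an illustration only, not popEVE's code or output format: the `ScoredVariant` type, the gene names, the scores, and the convention that a higher score means more likely damaging are all assumptions made for the example, and the real model's scale and sign may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredVariant:
    gene: str      # placeholder gene label, not a real gene
    change: str    # placeholder protein change
    score: float   # assumed convention: higher = more likely damaging


def prioritize(variants: list[ScoredVariant], top_n: int = 3) -> list[ScoredVariant]:
    """Return the top_n variants, most likely damaging first."""
    return sorted(variants, key=lambda v: v.score, reverse=True)[:top_n]


# Made-up scores for a handful of variants from one hypothetical patient.
patient_variants = [
    ScoredVariant("GENE_A", "p.Arg12Cys", 0.08),
    ScoredVariant("GENE_B", "p.Gly310Asp", 0.97),
    ScoredVariant("GENE_C", "p.Leu55Pro", 0.41),
    ScoredVariant("GENE_D", "p.Ala201Val", 0.02),
]

shortlist = prioritize(patient_variants, top_n=2)
for v in shortlist:
    print(f"{v.gene} {v.change}: {v.score:.2f}")
```

A clinician would then start review from the top of the shortlist rather than from an unsorted list.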
The results were notable when the model was tested on about 30,000 undiagnosed patients with severe developmental disorders. In roughly one-third of cases, the model led to a diagnosis. Even more remarkably, it found variants in 123 genes that had not previously been linked to developmental disorders. Studies in other labs have since independently confirmed 25 of those genes, about one in five. This kind of external validation, arriving independently of the original team, is precisely what separates a promising model from a practical one.
Debora Marks, the study’s co-senior author, stated the objective clearly: rank variants by disease severity, so that clinicians get a prioritized view of a patient’s genome instead of an undifferentiated list of thousands of unknowns. A researcher at the Centro Nacional de Análisis Genómico in Barcelona has already used popEVE in clinical practice, where it helped him diagnose multiple cases of rare disease. The tool has also recently been integrated into variant databases such as ProtVar and UniProt, so scientists around the world can use the model for gene comparisons.
The story of the non-coding genome does not end with regulatory DNA. A parallel thread runs through RNA, the molecule that carries genetic information from DNA to the cell’s protein-making machinery. Researchers are now realizing that a significant part of the genome’s dark matter is a story about RNA modifications, chemical changes made to RNA after it is produced. These modifications help determine when proteins are made and how cells react to stress, and, when they go wrong, how diseases develop. Researchers at the University at Albany are pursuing the Human RNome Project, which aims to sequence all human RNA and its modifications, much as the Human Genome Project did for DNA. The human epitranscriptome has more than 50 known chemical modifications. Mapping what they all do, and what happens when they malfunction, is a project on a scale that makes the original genome sequencing effort look small.
Watching all of this accumulate gives me the impression that medicine is only beginning to grasp something it has long been unable to see clearly. The 98% of the genome that was written off as junk for decades appears to hold many of the answers to why some people have rare diseases that are hard to diagnose, why some cancers resist treatment, and why some patients respond to medications while others do not. The AI tools that read that region are still in their infancy and have flaws. But they are reading it, which is more than was possible even five years ago.
