Digging into the dark matter of the genome to uncover mutations that drive cancer

The search for somatic mutations that drive cancer has often been limited to the 2% of the genome that codes for proteins. Harnessing the power of probabilistic deep learning, we developed Dig to rapidly scan the entire genome for mutations that may cause cancer.

Fundamentally, cancer is the product of somatic mutations acquired during life that imbue a cell with the ability to grow and divide without limit1. Finding these mutations and understanding what they do has led to profound advances in medicine’s ability to treat patients with cancer. Doctors’ arsenals now include targeted therapies that shut down specific cellular functions that cause cancer and immunotherapies that supercharge the immune system’s own ability to destroy tumor cells. Despite the ongoing treatment revolution, cancer stubbornly remains the second leading cause of death in the United States2. Contributing to this statistic is the fact that physicians often do not know what is causing a patient’s cancer; genetic sequencing does not reveal driver mutations that can guide clinical treatment in upwards of one third of patients3.

So where are the missing cancer drivers? This is the question we wanted to empower the cancer community to answer. The search for cancer driver mutations has been restricted largely to the portion of the genome that codes for proteins4. Yet the noncoding genome – the proverbial dark matter of DNA – accounts for 98% of the human genome. Our goal was to design a computational tool that could screen the entire genome for likely driver mutations. In doing so, we had to overcome three challenges: 1) a patient’s tumor often harbors thousands of harmless “passenger” somatic mutations that mask the handful of pathogenic driver mutations; 2) a user may want to test individual mutations for driver potential or test loci that span thousands of base pairs; 3) the sheer size of the genome (~3 billion base pairs) represents a whole lot of territory to mine in a short period of time. The result is Dig, a method that enables genome-wide searches for driver mutations in clinical sequencing data in minutes on a personal computer.

Figure: Overview of the Dig method, which produces genome-wide mutation rate maps and uses them to perform genome-wide searches for driver mutations.

How does Dig work?

Overcoming the first challenge required us to understand characteristics that distinguish driver from passenger mutations: drivers occur in regions of the genome that provide a proliferative advantage to a cell while passengers accrue at random. So, by looking at patterns of mutations across numerous patients, we could in theory identify driver regions as those places where mutations accumulate in excess of the number of passenger mutations expected there. This approach has been applied with immense success to identify genes that drive cancer5–8. However, passenger somatic mutation rates vary by orders of magnitude across the genome9, making them notoriously difficult to model genome-wide. We knew that the epigenetic organization of a tumor’s tissue of origin strongly correlates with its somatic mutation rates10. We thus designed a probabilistic deep neural network that maps cancer-specific passenger mutation rates across the entire genome using epigenetic tracks from Roadmap Epigenomics11 as input. As an added bonus, the network quantifies its own prediction uncertainty, a key parameter for calibrating downstream statistical tests.
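To make the idea of "more mutations than expected" concrete, here is a minimal sketch of an excess-mutation test. It is an illustration under stated assumptions, not Dig's actual implementation: it assumes the map gives a mean expected passenger count (`mu`) and an uncertainty (`sigma`) for a region, treats the unknown rate as gamma-distributed, and therefore models the observed count with a gamma-Poisson (negative binomial) distribution. The function names and example numbers are hypothetical.

```python
import math

def nb_log_pmf(k, mu, sigma):
    """Log-probability of observing k mutations under a gamma-Poisson model.

    Assumption: the region's passenger rate is gamma-distributed with mean
    `mu` and standard deviation `sigma` (both must be > 0).
    """
    shape = mu * mu / (sigma * sigma)   # gamma shape
    scale = sigma * sigma / mu          # gamma scale
    p = 1.0 / (1.0 + scale)             # negative binomial success probability
    return (math.lgamma(k + shape) - math.lgamma(k + 1) - math.lgamma(shape)
            + shape * math.log(p) + k * math.log1p(-p))

def burden_pvalue(observed, mu, sigma):
    """Upper-tail p-value: P(count >= observed) if only passengers are present."""
    below = sum(math.exp(nb_log_pmf(k, mu, sigma)) for k in range(observed))
    return max(0.0, 1.0 - below)

# Hypothetical region: 12 observed mutations against 4 expected.
confident = burden_pvalue(observed=12, mu=4.0, sigma=0.5)
uncertain = burden_pvalue(observed=12, mu=4.0, sigma=3.0)
print(f"low uncertainty:  p = {confident:.2e}")
print(f"high uncertainty: p = {uncertain:.2e}")
```

The same excess is less significant when the prediction is more uncertain, which is why an uncertainty estimate matters for calibrating the test.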

We were then left with the second and third challenges – efficiently querying the maps to retrieve the expected number of mutations over any set of positions for any cohort of patients. By modeling the specific ways that DNA accrues mutations, we designed a probabilistic graphical model that takes our maps as input and outputs a distribution over the number of passenger mutations in any region of the genome. These regions can be specified down to the resolution of a single base pair. This solved the second problem. To solve the third, we carefully crafted the probabilistic model so that it could perform computations in near constant time using a closed-form equation. Moreover, by incorporating an additional parameter, the model can transfer a map learned from one cohort of patients to another cohort of patients without any retraining. Thus, once a mutation rate map is created for a particular cancer type, it can be queried nearly instantaneously to provide the likelihood that any genomic locus contains cancer driver mutations in any cohort of patients with that type of cancer.
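As a rough illustration of why querying a precomputed map can be fast, the sketch below uses a prefix sum, a generic technique that answers "how many mutations are expected between two positions?" in constant time. This is not taken from Dig: the per-position rates, the function names, and the use of a single scalar (the ratio of total mutation counts) to transfer expectations between cohorts are all assumptions made for the example. It also handles only the expected count, not the full distribution described above.

```python
from itertools import accumulate

def build_prefix(rates):
    """Cumulative sums of per-position expected mutation counts, with a leading 0."""
    return [0.0] + list(accumulate(rates))

def expected_in_region(prefix, start, end):
    """Expected mutations in the half-open interval [start, end), in O(1)."""
    return prefix[end] - prefix[start]

def transfer_to_cohort(expected, source_total, target_total):
    """Rescale an expectation from the training cohort to a new cohort.

    Assumption: the cohorts differ only by an overall scale factor, taken
    here to be the ratio of their total mutation counts.
    """
    return expected * (target_total / source_total)

# Hypothetical per-base expected counts for a tiny four-base "genome".
rates = [0.1, 0.2, 0.3, 0.4]
prefix = build_prefix(rates)

region = expected_in_region(prefix, 1, 3)       # positions 1 and 2
single_base = expected_in_region(prefix, 2, 3)  # position 2 only
rescaled = transfer_to_cohort(region, source_total=1000, target_total=250)
print(region, single_base, rescaled)
```

The prefix array is built once per map. After that, every query costs two lookups and a subtraction regardless of region length, whether the region is a single base or thousands of base pairs.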

How did we use Dig and what did we find?

We created mutation rate maps for 37 different cancer types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) dataset12. Extensive benchmarking showed the power and generalizability of Dig compared to existing approaches. A deeper dive into the workings of the deep learning network revealed that Dig’s accuracy was in part attributable to the network’s ability to identify functional epigenetic elements (e.g., transcription start sites) and associate them with local mutation rates. We then applied Dig to search for evidence of driver mutations in both the noncoding and coding genome across cancer cohorts from PCAWG, The Cancer Genome Atlas (TCGA)13, and Memorial Sloan Kettering Cancer Center14.

Cryptic splice mutations

Recent work has demonstrated that alternative splicing is a hallmark of cancer15. Alternative splicing is often the result of so-called “cryptic splice” somatic mutations far from exon boundaries that confuse the spliceosome complex about where it should cut and stitch a gene’s mRNA16. By applying our method to examine all possible cryptic splice mutations, we found that they account for nearly 5% of single-nucleotide driver mutations in tumor suppressor genes. Tumor suppressor genes code for proteins that inhibit cell proliferation. Our work demonstrates that cryptic splice mutations selectively inactivate these genes, knocking out the cell’s own defenses against cancer.

5’ UTR mutations

Finding noncoding drivers beyond the well-known TERT promoter hotspot has been notoriously difficult17. The 5’ UTR of the tumor suppressor gene TP53 is one of the few high-quality candidates identified to date, with mutations targeting key sequence features that control transcription and translation of the p53 protein. Using this pattern as a canonical example, we identified the 5’ UTR of the tumor suppressor gene ELF3 as another noncoding region that unexpectedly accumulates mutations. This excess of mutations was reproducible across independent datasets and could not be explained by known confounders17. We are excited for this locus to be explored experimentally to confirm our computational predictions.

Rare coding mutations

Finally, Dig’s transfer-learning capabilities enabled us, for the first time, to analyze targeted sequencing data from thousands of patients to quantify the driver potential of mutations in rarely mutated genes. We found that 2–5% of patients carry likely driver mutations in genes that are known drivers in other cancers but have not previously been implicated as drivers in the patient’s own cancer. This suggests substantial overlap between the landscape of common drivers in one cancer type and rare drivers in another cancer type. Further supporting this conclusion, mutations in a given gene occurred in similar patterns and produced similar phenotypes across cancers in which the gene was a known common driver and in which it was a newly implicated rare driver.

Why is this work important and what comes next?

Our work has several implications both technically and clinically. On the technical side, it demonstrates the power of deep learning to provide insights into cancer biology when properly constructed and applied. As the volume and complexity of genomics data continue to increase, we believe that techniques that can automatically extract meaning from high-dimensional data will contribute more and more to our understanding of molecular biology. On the clinical side, our work suggests avenues for new and repurposed therapeutics. Antisense oligonucleotides have shown promise in reversing the effects of cryptic splice mutations. As this methodology matures, it could potentially be used to restore the function of tumor suppressor genes that have been inactivated by the types of cryptic splice mutations that we have identified as likely driver mutations. Moreover, the overlap between the landscape of common and rare driver genes suggests that therapeutics approved for a mutation in one cancer may prove beneficial to patients with that mutation in other cancers. Indeed, this is the goal of numerous ongoing clinical trials.

The whole-genome sequencing datasets to which we applied Dig were underpowered to identify rare noncoding mutations. Fortunately, the cancer community has been tirelessly generating new data at an unprecedented scale. The 100,000 Genomes Project recently released more than 10,000 whole-genome sequenced tumor samples with paired clinical data18. This is an unparalleled dataset for which Dig is uniquely suited. We believe our method will empower researchers to dig deeper into the noncoding genome to reveal new elements of cancer biology. Each new discovery brings us a step closer to the ultimate goal: new treatment strategies to improve patient lives.

How can I use Dig?

Dig is available as a user-friendly software package. Detailed instructions for installation and use are provided at http://dig-cancer.csail.mit.edu/. To further increase the accessibility of our mutation rate maps, we have created versions that can be interactively browsed through the web. The interactive maps for 37 cancer types can be explored at https://resgen.io/maxsh/Cancer_Mutation_Maps/views.


  1. Hanahan, D. & Weinberg, R. A. Hallmarks of Cancer: The Next Generation. Cell 144, 646–674 (2011).
  2. Murphy, S. L. Mortality in the United States, 2020. 8 (2021).
  3. VanderLaan, P. A., Rangachari, D. & Costa, D. B. The Rapidly Evolving Landscape of Biomarker Testing in Non-Small Cell Lung Cancer. Cancer Cytopathol 129, 179–181 (2021).
  4. Zhang, X. & Meyerson, M. Illuminating the noncoding genome in cancer. Nat Cancer 1, 864–872 (2020).
  5. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell 171, 1029-1041.e21 (2017).
  6. Bailey, M. H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173, 371-385.e18 (2018).
  7. Dietlein, F. et al. Identification of cancer driver genes based on nucleotide context. Nature Genetics 1–11 (2020) doi:10.1038/s41588-019-0572-y.
  8. Martínez-Jiménez, F. et al. A compendium of mutational cancer driver genes. Nat Rev Cancer 20, 555–572 (2020).
  9. Supek, F. & Lehner, B. Scales and mechanisms of somatic mutation rate variation across the human genome. DNA Repair (Amst) 81, 102647 (2019).
  10. Polak, P. et al. Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature 518, 360–364 (2015).
  11. Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
  12. Campbell, P. J. et al. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
  13. Hoadley, K. A. et al. Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell 173, 291-304.e6 (2018).
  14. Zehir, A. et al. Mutational Landscape of Metastatic Cancer Revealed from Prospective Clinical Sequencing of 10,000 Patients. Nat Med 23, 703–713 (2017).
  15. Calabrese, C. et al. Genomic basis for RNA alterations in cancer. Nature 578, 129–136 (2020).
  16. Jaganathan, K. et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535-548.e24 (2019).
  17. Rheinbay, E. et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature 578, 102–111 (2020).
  18. Degasperi, A. et al. Substitution mutational signatures in whole-genome–sequenced cancers in the UK population. Science 376, abl9283 (2022).

Maxwell Sherman

PhD candidate, Massachusetts Institute of Technology

PhD candidate at MIT. Biology + statistics + algorithms to understand causes and consequences of somatic mutations.