# Deep learning for population genetics

> **NIH R01** · UNIVERSITY OF OREGON · 2020 · $529,154

## Abstract

The revolution in genome sequencing technologies over the past 15 years has created an explosion of
population genomic data but has left in its wake a gap in our ability to make sense of data at this scale. In
particular, whereas population genetics as a field has been traditionally data-limited, the massive volume of
current sequencing means that previously unanswerable questions may now be within reach. To capitalize on
this flood of information we need new methods and modes of analysis.

In the past 5 years the world of machine learning has been revolutionized by the rise of deep neural
networks. These so-called deep learning methods offer incredible flexibility as well as astounding
improvements in performance for a wide array of machine learning tasks, including computer vision, speech
recognition, and natural language processing. This proposal aims to harness the great potential of deep
learning for population genetic inference.

In recent years our group has made great strides in using supervised machine learning for population
genomic analysis (reviewed in Schrider and Kern 2018). However, this work has focused primarily on using
more traditional machine learning methods such as random forests. As we argue in this proposal, DNA
sequence data are particularly well suited for modern deep learning techniques, and we demonstrate that the
application of these methods can rapidly lead to state-of-the-art performance in very difficult population genetic
tasks such as estimating rates of recombination. The power of these methods for handling genetic data stems
in part from their ability to automatically learn to extract as much useful information as possible from an
alignment of DNA sequences in order to solve the task at hand, rather than relying on one or more predefined
summary statistics, which are generally problem-specific and may omit information present in the raw data.

In this proposal we lay out a systematic approach for both empowering the field with these tools and
understanding their shortcomings. In particular, we propose to design deep neural networks for solving
population genetic problems, and incorporate successful networks into user-friendly software tools that will be
shared with the community. We will also investigate a variety of methods for estimating the uncertainty of
predictions produced by deep learning methods; this area is understudied in machine learning but of great
importance to biological researchers who require an accurate measure of the degree of uncertainty
surrounding an estimate. Finally, we will explore the impact of training data misspecification—wherein the data
used to train a machine learning method differ systematically from the data to which it will be applied in
practice. We will devise techniques to mitigate the impact of such misspecification in order to ensure that our
tools will be robust to the complications inherent in analyzing real genomic data sets. Together, these
advances ha...

## Key facts

- **NIH application ID:** 9976348
- **Project number:** 1R01HG010774-01A1
- **Recipient organization:** UNIVERSITY OF OREGON
- **Principal Investigator:** ANDREW D KERN
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $529,154
- **Award type:** 1 (new application)
- **Project period:** 2020-04-21 → 2024-02-28

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9976348

## Citation

> US National Institutes of Health, RePORTER application 9976348, Deep learning for population genetics (1R01HG010774-01A1). Retrieved via AI Analytics 2026-06-11 from https://api.ai-analytics.org/grant/nih/9976348. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*