# Evolution-guided machine learning for inferring natural selection

> **NIH R35** · PENNSYLVANIA STATE UNIVERSITY, THE · 2021 · $373,274

## Abstract

A fundamental question in genomics is to understand natural selection on coding and noncoding sequences.
Signatures of natural selection encoded in polymorphism and divergence data not only elucidate the patterns of
evolution but also pinpoint deleterious genetic variants responsible for genetic disorders. While numerous
computational methods have been developed to infer sequences under various types of natural selection, the existing
methods suffer from two critical limitations. First, most of the methods for inferring natural selection focus on
analyzing individual loci. Due to the intrinsic sparsity of polymorphism and divergence data, the single-locus-based
approaches are often underpowered. Second, when multiple genomic features are correlated with signatures
of natural selection, the existing methods are incapable of distinguishing causal genomic features from
correlated confounders. Due to these limitations, we still lack powerful computational frameworks to identify loci and
genomic features responsible for natural selection. During the next five years, I will address the limitations of
existing methods by combining evolutionary models and flexible machine learning techniques. Specifically, I formulate
the inference of natural selection as a special regression problem in which genomic features are input covariates
whereas polymorphism and divergence data are response variables. Based on this idea, my lab will develop
a suite of evolution-guided machine learning models to infer negative, positive, and lineage-specific selection.
These customized machine learning models will boost the statistical power of selection inference by pooling data
across large numbers of loci, and will be able to distinguish genomic determinants from confounders. These
new models will be applied to investigate various types of natural selection in the human genome. In addition, a
genome-wide map of deleterious variants under strong negative selection will be developed for accurate variant
prioritization. The proposed research builds on my recent work for predicting functional noncoding sequences,
inferring selection coefficients of coding variants, and unifying variant-level and gene-level prioritization methods.
It will illustrate new insights into genomic determinants of functional sequences and human adaptive evolution,
and will provide powerful computational tools for identifying disease mutations. It could also serve as a basis for
the emerging paradigm of combining classical evolutionary theory and machine learning methods to address a
variety of questions in evolutionary biology.
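
The regression framing in the abstract (genomic features as covariates, polymorphism data as the response, data pooled across many sites) can be illustrated with a toy sketch. This is not the investigator's model: it substitutes plain logistic regression for the evolution-guided models the abstract proposes, and the simulated data, the binary "site is polymorphic" response, the two features, and the function name `fit_pooled_logistic` are all assumptions made for illustration. The second feature is constructed to be correlated with the first but to have no effect of its own, mimicking a confounder.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_pooled_logistic(X, y, lr=0.5, n_iter=2000):
    """Fit one logistic regression across all sites by gradient descent.

    X: (n_sites, n_features) genomic features (hypothetical covariates).
    y: (n_sites,) 1.0 if the site is polymorphic, else 0.0 (assumed response).
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / n)   # gradient of mean negative log-likelihood
        b -= lr * np.mean(p - y)
    return w, b

# Simulated data (an assumption for illustration, not real genomic data).
rng = np.random.default_rng(0)
n_sites = 20000
causal = rng.normal(size=n_sites)                          # feature with a real effect
confounder = 0.7 * causal + np.sqrt(0.51) * rng.normal(size=n_sites)  # correlated, no effect
X = np.column_stack([causal, confounder])

# Higher values of the causal feature reduce the chance a site is polymorphic,
# as stronger negative selection would.
true_w = np.array([-1.0, 0.0])
true_b = -2.0
y = (rng.random(n_sites) < sigmoid(X @ true_w + true_b)).astype(float)

w_hat, b_hat = fit_pooled_logistic(X, y)
marginal_corr = np.corrcoef(confounder, y)[0, 1]

print("joint weights (causal, confounder):", w_hat)
print("marginal correlation of confounder with response:", marginal_corr)
```

Taken alone, the confounder appears associated with the response, but fitting both features jointly across all sites assigns it a weight near zero. Any single site contributes only one binary observation, which is why pooling is needed to estimate the weights at all.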

## Key facts

- **NIH application ID:** 10273742
- **Project number:** 1R35GM142560-01
- **Recipient organization:** PENNSYLVANIA STATE UNIVERSITY, THE
- **Principal Investigator:** YIFEI HUANG
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $373,274
- **Award type:** 1
- **Project period:** 2021-08-10 → 2026-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10273742

## Citation

> US National Institutes of Health, RePORTER application 10273742, Evolution-guided machine learning for inferring natural selection (1R35GM142560-01). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10273742. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
