Expressive and scalable statistical models for genomic and biomedical data

NIH RePORTER · NIH · R35 · $334,935

Abstract

Project Summary: My lab develops and applies statistical models to make sense of genomic and biomedical data, with the ultimate goal of understanding the biological basis of diseases and improving human health. The dramatic decrease in the cost of DNA sequencing has led to the emergence of datasets of genetic variation across large numbers of individuals (sample sizes upwards of hundreds of thousands). This genetic data is paired with deep phenotypic and disease information. While offering the potential to answer important questions in biology and medicine, these complex and massive datasets present formidable challenges of statistical modeling and inference. Extracting meaningful insights from these datasets requires expressive and scalable statistical and computational methods. Our recent work has focused on understanding the evolutionary processes that shape genetic variation within homogeneous and admixed populations, and on understanding how genetic variation modulates variation in complex traits and disease risk. A major discovery from our work is our finding that West African populations derive substantial genetic ancestry from an unidentified ghost archaic population. This finding was enabled, in turn, by new statistical methods that we developed to infer local ancestry in admixed populations in the challenging setting where reference genomes for ancestral populations are unavailable. Work from my lab has also led to statistical inference algorithms that are capable of analyzing millions of genomes to provide new insights into both evolutionary processes and the genetic architecture of complex traits. We now propose to substantively expand our research applying statistical machine learning to population and quantitative genetics, with the aim of understanding the interplay between evolution, genes, and traits.
We will develop algorithms to uncover complex evolutionary histories from genome sequence data in the presence of admixture, expressive and scalable models to infer the genetic architecture of complex traits within homogeneous and admixed populations, and methods for deep learning-based phenotype imputation that deal with the high rates of missingness in biomedical datasets. Taken together, our efforts will provide powerful analytical tools to effectively probe the structure and function of the human genome.

Key facts

NIH application ID
10842967
Project number
1R35GM153406-01
Recipient
UNIVERSITY OF CALIFORNIA LOS ANGELES
Principal Investigator
Sriram Sankararaman
Activity code
R35
Funding institute
NIH
Fiscal year
2024
Award amount
$334,935
Award type
1
Project period
2024-05-01 → 2029-02-28