# Adaptive Reproducible High-Dimensional Nonlinear Inference for Big Biological Data

> **NIH R01** · UNIVERSITY OF SOUTHERN CALIFORNIA · 2021 · $276,700

## Abstract

Big data is now ubiquitous in every field of modern scientific research. Many contemporary applications,
such as the recent National Microbiome Initiative (NMI), demand highly flexible statistical machine
learning methods that can produce both interpretable and reproducible results. Thus, it is of paramount
importance to identify crucial causal factors that are responsible for the response from a large number of
available covariates, which can be statistically formulated as the false discovery rate (FDR) control in
general high-dimensional nonlinear models. Despite the many applications of shotgun metagenomic
studies, most existing investigations concentrate on the study of bacterial organisms. However, viruses
and virus-host interactions play important roles in controlling the functions of the microbial communities. In
addition, viruses have been shown to be associated with complex diseases. Yet, investigations into the
roles of viruses in human diseases are significantly underdeveloped. The objective of this proposal is to
develop mathematically rigorous and computationally efficient approaches to deal with highly complex big
data and to apply these approaches to solve fundamental and important biological and
biomedical problems. There are four interrelated aims. In Aim 1, we will theoretically investigate the power
of the recently proposed model-free knockoffs (MFK) procedure, which has been theoretically justified to
control FDR in arbitrary models and arbitrary dimensions. We will also theoretically justify the robustness
of MFK with respect to misspecification of the covariate distribution. These studies will lay the foundations
for our developments in other aims. In Aim 2, we will develop deep learning approaches to predict viral
contigs with higher accuracy, integrate our new algorithm with MFK to achieve FDR control for virus motif
discovery, and investigate the power and robustness of our new procedure. In Aim 3, we will take into
account the virus-host motif interactions and adapt our algorithms and theories in Aim 2 for predicting
virus-host infectious interaction status. In Aim 4, we will apply the developed methods from the first three
aims to analyze the shotgun metagenomics data sets in ExperimentHub to identify viruses and virus-host
interactions associated with several diseases at a prespecified target FDR level. Both the algorithms and results
will be disseminated through the web. The results from this study will be important for metagenomics
studies under a variety of environments.
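
The abstract does not spell out how a knockoff-based procedure turns feature statistics into a selected set at a target FDR level. As a minimal sketch, and not the project's own implementation, the code below shows the standard "knockoff+" thresholding step from the knockoff filter literature. It assumes the knockoff statistics `w` have already been computed, with one value per covariate and large positive values favoring the original covariate over its knockoff copy. The function names and the example values are hypothetical.

```python
import math


def knockoff_plus_threshold(w, q):
    """Return the knockoff+ data-dependent threshold for target FDR level q.

    w: list of knockoff statistics W_j (assumed precomputed).
    q: target FDR level, e.g. 0.2.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of the statistics.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # discoveries at threshold t
        # The "+1" in the numerator is what distinguishes knockoff+.
        if (1 + negatives) / max(1, positives) <= q:
            return t
    # No threshold meets the target level: select nothing.
    return math.inf


def knockoff_select(w, q):
    """Return the indices of covariates selected at target FDR level q."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics for ten covariates.
example_w = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.2, -0.1, 2.0, 1.5]
selected = knockoff_select(example_w, q=0.2)
```

The negative statistics act as an estimate of how many false discoveries sit among the positive ones, so the smallest threshold whose estimated ratio falls below `q` is used. How the statistics `w` are built (for example from a deep learning model, as Aim 2 proposes) is outside the scope of this sketch.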

## Key facts

- **NIH application ID:** 10159277
- **Project number:** 5R01GM131407-04
- **Recipient organization:** UNIVERSITY OF SOUTHERN CALIFORNIA
- **Principal Investigator:** Yingying Fan
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $276,700
- **Award type:** 5
- **Project period:** 2018-08-01 → 2023-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10159277

## Citation

> US National Institutes of Health, RePORTER application 10159277, Adaptive Reproducible High-Dimensional Nonlinear Inference for Big Biological Data (5R01GM131407-04). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10159277. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*