DMS/NIGMS 1: Addressing Measurement Limitations for Sequence Count Data

NIH RePORTER · NIH · R01 · $199,851

Abstract

Sequence count data (e.g., 16S rRNA sequencing or single-cell RNA-seq) are ubiquitous in modern biomedical research. Yet even in the absence of measurement noise and limitations of experimental design, these data convey limited information about the underlying biological system being measured. Beyond familiar limitations such as inappropriate study design, two other forms of limitation have been shown to impact or even dominate study conclusions. Scale limitations arise because the scale of the system under study (e.g., the total number of bacteria in a person's gut) is typically independent of the scale of the data. In contrast, measurement bias skews the observed distribution of counts because some entities are systematically underrepresented compared to others. Despite an appreciation of these problems, we lack tools for performing and evaluating analyses of sequence count data in light of these limitations. Here we develop new statistical theory and tools for addressing measurement bias and scale limitations. This proposal has three aims. (1) Develop a theoretical framework for objectively evaluating existing approaches in light of these limitations. (2) Develop Simulated Inference as a new theoretical and computational framework that allows analysts to use their preferred models and software while incorporating uncertainty stemming from these data limitations. (3) Validate these tools through application to three case studies of real sequence count data. In total, these aims provide new theoretical and computational tools for evaluating and performing analyses of sequence count data that are robust to these data limitations. The proposed work is also a substantial departure from the status quo. In contrast to existing methods, which address these data limitations through assumptions that are often implicit, we develop statistical theory and tools that explicitly model uncertainty and potential error in those assumptions. We demonstrate that this approach can lead to lower Type I and Type II error rates both in theory and in practice. Overall, these tools will enhance the reproducibility and rigor of sequence count data analysis, which is central to projects across the NIH.
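
The scale limitation described in the abstract can be illustrated with a small toy calculation. The sketch below is not taken from the proposal: the function name `expected_counts`, the two-taxon abundances, and the sequencing depth are all hypothetical, and noise is ignored by working with expected counts only. It shows that when sequencing depth is fixed independently of the system's total size, a taxon that grows in absolute terms can still appear to shrink in the data.

```python
from typing import List


def expected_counts(abundances: List[float], depth: int) -> List[float]:
    """Expected read counts when `depth` reads are split in proportion
    to the true abundances (an idealized, noise-free assumption)."""
    total = sum(abundances)
    return [depth * a / total for a in abundances]


# Hypothetical absolute abundances of two taxa, before and after a treatment.
before = [100.0, 100.0]
after = [150.0, 300.0]   # both taxa increase in absolute terms

depth = 1000             # sequencing depth, unrelated to the true totals

counts_before = expected_counts(before, depth)
counts_after = expected_counts(after, depth)

# Taxon 0 grew (100 -> 150), yet its expected count falls,
# because the data only reflect proportions.
print("taxon 0 true change:", after[0] - before[0])
print("taxon 0 expected counts:", counts_before[0], "->", counts_after[0])

# Scaling the whole system by 10x leaves the expected counts unchanged.
scaled = [10 * a for a in before]
print(expected_counts(scaled, depth) == counts_before)
```

Any conclusion about absolute change drawn from such counts therefore rests on an assumption about the unmeasured total, which is the kind of assumption the abstract proposes to model explicitly.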

Key facts

NIH application ID: 10894214
Project number: 5R01GM148972-03
Recipient: PENNSYLVANIA STATE UNIVERSITY, THE
Principal Investigator: Justin D Silverman
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $199,851
Award type: 5
Project period: 2022-09-20 → 2026-08-31