# Enhancing open data sharing for functional genomics experiments: Measures to quantify genomic information leakage and file formats for privacy preservation

> **NIH R01** · YALE UNIVERSITY · 2024 · $515,951

## Abstract

With the surge of large-scale genomics data, the breadth and depth of omics datasets have grown
immensely, and the privacy of individuals has become an increasingly important topic in genomic
data science. Detailed genetic and environmental characterization of diseases and
conditions relies on the large-scale mining of functional genomics data; hence, there is great desire to share
data as broadly as possible. However, there is a scarcity of privacy studies focused on such data. A key
first step in reducing private information leakage is to measure the amount of information leakage in
functional genomics data, particularly in different data file types. To this end, we propose to derive
information-theoretic measures for private information leakage in different data types from functional
genomics data. We will also develop various file formats to reduce this leakage during sharing. We will
approach the privacy analysis under three aims. First, we will develop statistical metrics that can be used to
quantify the sensitive information leakage from raw reads. We will systematically analyze how linking attacks
can be instantiated using various genotyping methods such as single nucleotide variant and structural
variant calling from raw reads, signal profiles, Hi-C interaction matrices, and gene expression matrices.
Second, we will study different algorithms to implement privacy-preserving transformations to the functional
genomics data in various forms. Particularly, we will create privacy-preserving file formats for raw sequence
alignment maps, signal track files, three-dimensional interaction matrices, and gene expression
quantification matrices that contain information from multiple individuals. This will allow us to study the
sources of sensitive information leakage other than raw reads, such as signal profiles, splicing and
isoform transcription, and abnormal three-dimensional genomic interactions. Third, we will investigate the
reads that can be mapped to the microbiome in the raw human functional genomics datasets. We will use
inferred microbial information to characterize private information about individuals, and then combine the
microbial information with the information from human mapped reads to increase the re-identification
accuracy in the linking attacks described in the second aim. We will apply the tools for quantifying
sensitive information, together with the privacy-preserving file formats, to available datasets from
large sequencing projects, such as ENCODE, The Cancer Genome Atlas, 1000 Genomes, gEUVADIS, and
Genotype-Tissue Expression.
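
To make the idea of an information-theoretic leakage measure concrete, the following is a minimal sketch and not the project's actual metric. It assumes Hardy-Weinberg equilibrium and statistically independent variants, and it scores a set of genotypes inferred from functional genomics data by their summed self-information in bits. The function names and the allele frequencies are hypothetical and chosen only for illustration.

```python
import math


def genotype_probability(alt_allele_freq, genotype):
    """Probability of carrying 0, 1, or 2 alternate alleles,
    assuming Hardy-Weinberg equilibrium (an assumption of this sketch)."""
    p = alt_allele_freq
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}[genotype]


def leaked_bits(observations):
    """Summed self-information, in bits, of (alt_allele_freq, genotype) pairs.

    Treats variants as independent, which ignores linkage disequilibrium
    and therefore tends to overestimate the leakage.
    """
    return sum(
        -math.log2(genotype_probability(freq, gt)) for freq, gt in observations
    )


def bits_to_identify(population_size):
    """Bits needed to single out one individual among population_size people."""
    return math.log2(population_size)


# Hypothetical genotypes called from, for example, RNA-seq reads.
observed = [(0.5, 1), (0.5, 2), (0.25, 0)]
print(f"leaked: {leaked_bits(observed):.2f} bits")
print(f"needed for a cohort of 1024: {bits_to_identify(1024):.2f} bits")
```

Under these assumptions, a linking attack becomes plausible once the leaked bits approach the number of bits needed to single out one individual in the reference cohort. Rare genotypes contribute the most, because their self-information is highest.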

## Key facts

- **NIH application ID:** 10913541
- **Project number:** 5R01HG010749-05
- **Recipient organization:** YALE UNIVERSITY
- **Principal Investigator:** Mark Bender Gerstein
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $515,951
- **Award type:** 5
- **Project period:** 2020-09-02 → 2026-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10913541

## Citation

> US National Institutes of Health, RePORTER application 10913541, Enhancing open data sharing for functional genomics experiments: Measures to quantify genomic information leakage and file formats for privacy preservation (5R01HG010749-05). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10913541. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
