Enhancing open data sharing for functional genomics experiments: Measures to quantify genomic information leakage and file formats for privacy preservation

NIH RePORTER · NIH · R01 · $523,409 · view on reporter.nih.gov ↗

Abstract

Project Summary/Abstract: With the surge of large genomics data, there is an immense increase in the breadth and depth of different omics datasets and an increasing importance in the topic of privacy of individuals in genomic data science. Detailed genetic and environmental characterization of diseases and conditions relies on the large-scale mining of functional genomics data; hence, there is great desire to share data as broadly as possible. However, there is a scarcity of privacy studies focused on such data. A key first step in reducing private information leakage is to measure the amount of information leakage in functional genomics data, particularly in different data file types. To this end, we propose to to derive information-theoretic measures for private information leakage in different data types from functional genomics data. We will also develop various file formats to reduce this leakage during sharing. We will approach the privacy analysis under three aims. First, we will develop statistical metrics that can be used to quantify the sensitive information leakage from raw reads. We will systematically analyze how linking attacks can be instantiated using various genotyping methods such as single nucleotide variant and structural variant calling from raw reads, signal profiles, Hi-C interaction matrices, and gene expression matrices. Second, we will study different algorithms to implement privacy-preserving transformations to the functional genomics data in various forms. Particularly, we will create privacy-preserving file formats for raw sequence alignment maps, signal track files, three-dimensional interaction matrices, and gene expression quantification matrices that contain information from multiple individuals. This will allow us to study the sources of sensitive information leakages other than raw reads, for example signal profiles, splicing and isoform transcription, and abnormal three-dimensional genomic interactions. Third, we will investigate the reads that can be mapped to the microbiome in the raw human functional genomics datasets. We will use inferred microbial information to characterize private information about individuals, and then combine the microbial information with the information from human mapped reads to increase the re-identification accuracy in the linking attacks described in the second aim. We will use the tools to quantify the sensitive information and privacy-preserving file formats in the available datasets from large sequencing projects, such as the ENCODE, The Cancer Genome Atlas, 1,000 Genomes, gEUVADIS, and Genotype-Tissue Expression projects.

Key facts

NIH application ID
9970939
Project number
1R01HG010749-01A1
Recipient
YALE UNIVERSITY
Principal Investigator
Mark Bender Gerstein
Activity code
R01
Funding institute
NIH
Fiscal year
2020
Award amount
$523,409
Award type
1
Project period
2020-09-02 → 2025-06-30