ARCHS4: Massive Mining of Publicly Available RNA Sequencing Data

NIH RePORTER · NIH · U24 · $751,375

Abstract

Many independent cancer-related studies that employ bulk and single-cell RNA-seq remain underused because their data have limited findability, accessibility, interoperability, and reusability. The data from these studies can be found in the Gene Expression Omnibus (GEO), but they are provided mostly as raw FASTQ files with non-uniform metadata annotations. While some studies provide aligned read files, these are processed non-uniformly. This shortcoming makes it difficult to query and integrate these data across studies and with additional external data. To bridge the gap that currently exists between RNA-seq data generation and RNA-seq data processing and reuse, we developed the resource All RNA-seq and ChIP-Seq Sample and Signature Search (ARCHS4). ARCHS4 provides processed RNA-seq data from GEO to support retrospective data analyses and reuse. ARCHS4 caters to users with different levels of computational expertise and has already been employed for many post-hoc analyses and projects. The goals go far beyond just providing cancer researchers with direct access to RNA-seq data through a web-based user interface. We plan to transform other transcriptomics data into RNA-seq-like profiles with deep learning; identify pathogenic sequences in human RNA-seq samples; identify short variants from RNA-seq reads; predict gene function from co-expression data, including ways to modulate the expression of long non-coding RNAs with small molecules; and, most importantly, continue to provide a free FASTQ alignment service to the community using the cost-effective ARCHS4 infrastructure.

Key facts

NIH application ID: 10909127
Project number: 5U24CA264250-03
Recipient: ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI
Principal Investigator: Avi Ma'ayan
Activity code: U24
Funding institute: NIH
Fiscal year: 2024
Award amount: $751,375
Award type: 5
Project period: 2022-09-01 → 2027-08-31