# Text mining in the Cloud

> **NIH U24** · CALIFORNIA INSTITUTE OF TECHNOLOGY · 2022 · $119,564

## Abstract

The Alliance of Genome Resources (Alliance) is developing shared, sustainable infrastructure for the curation, storage, analysis, and presentation of genomic and genetic data about research organisms. This infrastructure serves biomedical researchers, bioinformaticians, and artificial intelligence (AI) and machine learning (ML) researchers, as well as clinicians, students, and teachers. We propose to develop a unified cloud-based text-mining service that enables AI/ML approaches to identify relevant documents and text spans suitable for use by professional biocurators, authors who curate their own papers before or after publication, and researchers who want sentence-level full-text search. The project will implement a set of neural network classification and natural language processing (NLP) algorithms in the cloud, where the PubMed Central Open Access (PMC-OA) corpus is already hosted. It will deploy software developed by the Textpresso group of the Alliance and carry out computationally intensive indexing of papers: neural network classification of papers, Alliance-custom entity recognition, and Textpresso ontology-based indexing to aid biocuration. The project will be sustained by the Alliance and the model organism databases (MODs).

## Key facts

- **NIH application ID:** 10613271
- **Project number:** 3U24HG010859-04S1
- **Recipient organization:** CALIFORNIA INSTITUTE OF TECHNOLOGY
- **Principal Investigator:** CAROL J BULT
- **Activity code:** U24
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $119,564
- **Award type:** 3
- **Project period:** 2019-09-18 → 2024-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10613271

## Citation

> US National Institutes of Health, RePORTER application 10613271, Text mining in the Cloud (3U24HG010859-04S1). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10613271. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
