Curation at scale: Integrating AI into community curation

NIH RePORTER · NIH · R01 · $355,938

Abstract

Project Summary: Biological knowledgebases are a critical resource for researchers and accelerate scientific discovery by providing manually curated, machine-readable data collections. However, aggregating and manually curating biological data is a labor-intensive process that relies almost entirely on professional biocurators. Two approaches have been advanced to address this problem: natural language processing (NLP), including text mining (TM) and machine learning (ML), and engagement of researchers (community curation). However, neither approach alone is sufficient to meet the critical need for increased efficiency in the biocuration process. Our solution is an NLP-enhanced community curation portal, Author Curation to Knowledgebase (ACKnowledge). The ACKnowledge system, currently implemented for the C. elegans literature, couples statistical methods and text mining algorithms to enhance community curation of research articles. We propose to strengthen and expand ACKnowledge by adding other species to our pipeline, incorporating more sophisticated machine learning models, and presenting sentence-level entity and concept extraction for more detailed author curation. In addition, we will develop an Author Curation Portal (ACP) to allow authors to easily upload and curate their own documents. Taken together, these enhancements will allow us to maximize community curation efforts by leveraging author expertise in multiple areas of biology while supporting authors with as much AI-assisted curation as possible. This reciprocal interaction will improve not only the content of knowledgebases but also the AI methods themselves, as we will receive valuable feedback on our models.
By developing an Author Curation Portal, we will further empower authors to participate in the curation process and alert knowledgebases to key information that can, and should, be readily discoverable in accordance with FAIR (Findable, Accessible, Interoperable, and Reusable) data principles.

Key facts

NIH application ID: 10344771
Project number: 1R01LM013871-01
Recipient: CALIFORNIA INSTITUTE OF TECHNOLOGY
Principal Investigator: Paul Warren Sternberg
Activity code: R01
Funding institute: NIH
Fiscal year: 2021
Award amount: $355,938
Award type: 1
Project period: 2021-09-01 → 2025-05-31