IAA-NCEA/EPA-0001; “Chemoinformatics of EPA’s EDSP Universe of Chemicals”

NIH RePORTER · NIH · N01 · $175,000

Abstract

Chemoinformatic support was provided for two IAA/USEPA projects with DNTP: the first with EPA's ECOTOX database and the second with EPA's NCEA (National Center for Environmental Assessment).
(1) The ECOTOXicology Knowledgebase (ECOTOX) is a comprehensive, publicly available knowledgebase providing single-chemical environmental toxicity data on aquatic life, terrestrial plants, and wildlife. Over the last thirty years of database growth and development, EPA has manually curated ~45,000 references in the ECOTOX database. The current data curation process consists of several manual, time-consuming steps. To streamline this labor-intensive process, Sciome evaluated the feasibility of using machine learning methods to automatically classify documents according to relevance and to identify the exclusion rationale for references that are excluded. The process could be greatly shortened and made more consistent by the use of AI (artificial intelligence) software tools, and work was initiated to develop an ECOTOX lexicon and a strategy for automated curation.
(2) For the second project, with NCEA, the EDSP Universe of Chemicals was analyzed with a view to building a database portal. An initial clustering exercise was completed on ~700 compounds that were fully and unambiguously curated. Clustering was then extended to the union of compounds (~14K) obtained by searching the CompTox Dashboard for the CAS numbers and names provided in the file shared by EPA. Each clustering experiment involved complete data cleaning and harmonization and structure quantification by several fingerprints and descriptors, followed by an investigation of several distance measures and clustering methods. Some class-specific clustering experiments, using methods to identify a parsimonious set of clusters, were also pursued.
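As an illustration only, the fingerprint, distance, and clustering steps described above can be sketched in a few lines of Python. The abstract does not say which fingerprints, distance measures, or clustering methods were used, so everything here is an assumption: the Tanimoto distance on sets of "on" fingerprint bits, single-linkage grouping at a fixed threshold, the function names, the toy compound labels, and the bit sets are all hypothetical.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of 'on' fingerprint bits."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints are treated as identical (assumption).
        return 0.0
    return 1.0 - len(a & b) / union


def single_linkage_clusters(fingerprints, threshold):
    """Group compounds linked by a chain of pairwise distances <= threshold.

    Uses a small union-find structure; this is one simple clustering
    choice among many and is not taken from the source.
    """
    names = list(fingerprints)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for x, y in combinations(names, 2):
        if tanimoto_distance(fingerprints[x], fingerprints[y]) <= threshold:
            parent[find(x)] = find(y)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy fingerprints: each set holds the indices of bits that are on.
toy = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {1, 2, 3, 5},
    "cmpd_C": {10, 11, 12},
    "cmpd_D": {10, 11, 13},
}
clusters = single_linkage_clusters(toy, threshold=0.5)
print(clusters)
```

In a sketch like this, the threshold controls how many clusters result, so sweeping it is one simple way to look for a parsimonious set of clusters. A real analysis would compute fingerprints from curated structures with a cheminformatics toolkit and compare several distance measures and clustering methods.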
After weighing pros and cons of several available platforms for developing the EDSP portal, EPA and Sciome agreed that the EDSP Portal would be developed on the Python/Django platform. The Portal would initially house some physical-chemistry data and Tier-1 bio-assay data for the compounds that have so far been unambiguously curated. The functionality of the Portal could be continually enhanced through a regular show-and-feedback cycle with the clients.

Key facts

NIH application ID
10379865
Project number
273201700001C-P00007-9999-2
Recipient
SCIOME, LLC
Principal Investigator
RUCHIR SHAH
Activity code
N01
Funding institute
NIH
Fiscal year
2021
Award amount
$175,000
Award type
Project period
2017-03-24 → 2022-03-23