# MegaTox for analyzing and visualizing data across different screening systems

> **NIH R44** · COLLABORATIONS PHARMACEUTICALS, INC. · 2022 · $855,003

## Abstract

Computational toxicology aims to use rules, models and algorithms based on prior data for specific endpoints,
to enable the prediction of whether a new molecule will possess similar liabilities or not. In some cases, the
computational models are derived from discrete molecular endpoints (e.g. estrogen receptor agonism) while in
others they are quite broad in scope (e.g. drug-induced liver injury, DILI). Considerable progress has been made
in computational toxicology over the past decade, both in model development and availability, such that the latest
generation of larger scale machine learning (ML) models will further focus in vitro and in vivo testing on
verification of select predictions. Pharmaceutical, consumer products, agrochemical and other chemistry-focused
companies possess structure-activity data generated over many decades of screening that is not in the public
domain, and this data is largely accessible only to the cheminformatics experts in each company. Outside of
these companies, small pharmaceutical and biotech companies and academics must rely on data from public
databases, commercial databases and their own data. Integrating such data from diverse sources and
processing it with algorithms to build ML models that can help to enable predictions for new
compounds is a vast undertaking. Over Phase I of this project to develop the prototype for MegaTox®, we curated
toxicity datasets then generated and tested well over 200 ML models initially focused on the Bayesian approach.
We have also developed approaches to understand training and test set applicability and ultimately performed
prospective predictions against several toxicity targets. Having completed these aims, we also collaborated with
numerous academic laboratories and performed fee-for-service work with five commercial companies. We
currently have several pharmaceutical, agrochemical and consumer product companies evaluating our
computational toxicity models prior to licensing. These discussions with potential customers have influenced this
Phase II proposal to include the following aims: 1. Compare and integrate novel graph-based models such as
graphSAGE versus our suite of 15 different ML regression and classification algorithms for modeling toxicology
datasets such as those generated in Phase I. 2. Integrate read-across and adverse outcome pathway methods
with our computational models for DILI and other toxicity models as needed. 3. Generate validated ML models
from in vivo data for non-mammalian species (initially using zebrafish), which will enable in vitro and in vivo
correlations and can be validated relatively cost-effectively. In this proposal over 2 years we expect to develop
models with 15 different algorithms for at least 100 in vitro and in vivo datasets, leading to > 1500 toxicity ML
models. We are not aware of any other company pursuing such an approach to both generate new high value
datasets or models, performing testing of their own model...

## Key facts

- **NIH application ID:** 10470050
- **Project number:** 2R44ES031038-02A1
- **Recipient organization:** COLLABORATIONS PHARMACEUTICALS, INC.
- **Principal Investigator:** SEAN EKINS
- **Activity code:** R44
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $855,003
- **Award type:** 2
- **Project period:** 2019-09-01 → 2024-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10470050

## Citation

> US National Institutes of Health, RePORTER application 10470050, MegaTox for analyzing and visualizing data across different screening systems (2R44ES031038-02A1). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10470050. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*