Integration of Biomedical Ontologies with Deep Learning AI for Research and Diagnosis of Rare Diseases

NIH RePORTER · NIH · P20 · $240,580

Abstract

Rare diseases collectively affect more than 30 million individuals in the United States and 300-400 million individuals worldwide. There are an estimated 7,000-10,000 known rare diseases, of which approximately 80% have a genetic etiology. Although important, genetic sequence data alone are insufficient to determine the mechanism and diagnosis of rare genetic disorders (RGDs). In-silico variant pathogenicity prediction and phenomic information are also critical, though even with these, diagnostic rates remain frustratingly low. Novel research paradigms, such as a gene-to-patient approach that samples individuals with high-confidence in-silico predicted pathogenic variants from large databases and asks whether they share a common phenotype, could promote novel discovery and improve diagnostic rates; however, this approach is hampered by the inability to easily extract phenotypic information from unstructured data and reliably identify a shared phenotype among individuals. RGD research that combines genomic and phenomic data with other ‘omics’ data also has potential for improved diagnostic yield and mechanistic understanding; however, these endeavors face significant obstacles, notably a dearth of multiomic data fusion and analysis methods. Practical RGD clinical diagnosis additionally requires patient-specific diagnostic pathways, the lack of which has resulted in unacceptably complex and lengthy diagnostic odysseys that place a significant burden on individuals with RGDs. Our proposal addresses these critical limitations through novel artificial intelligence (AI) method development that will integrate information-rich biological ontologies with multiomic data. We will first extend graph neural network node representation learning methods and develop a custom genetic search algorithm to enable discovery of a shared population phenotype among individuals in the absence of a disease specification, thus enabling a gene-to-patient research paradigm.
We will further apply these methods to integrate node representations for a tissue-to-gene expression knowledge graph with genetic sequence data and clinically accessible tissue (CAT) transcriptomic results in a transformer-based deep learning model to predict tissue-specific aberrant splicing pathogenicity. Finally, we will combine these methods in a pilot clinical decision support system that recommends personalized genetic and clinical tests to support RGD diagnosis. This pilot system will leverage large language models anchored to biological ontologies so that clinicians and patients can pose questions about the reasoning behind recommended clinical tests, and about their benefits and risks, in an efficient, flexible, conversational form. In combination, these outcomes will dramatically improve researchers’ ability to utilize multiomic data to elucidate the mechanisms by which variants affect phenotype and guide clinicians in the diagnosis and care of individuals with RGDs, thereby substantially ...

Key facts

NIH application ID: 11013492
Project number: 5P20GM139769-04
Recipient: CLEMSON UNIVERSITY
Principal Investigator: Robert R. H Anholt
Activity code: P20
Funding institute: NIH
Fiscal year: 2024
Award amount: $240,580
Award type: 5
Project period: 2024-02-01 → 2026-01-31