# Carbohydrate enzyme gene clusters in human gut microbiome

> **NIH R01** · UNIVERSITY OF NEBRASKA LINCOLN · 2022 · $282,499

## Abstract

Our parent R01 project (R01GM140370) aims to develop four bioinformatics tools for automated annotation
of CAZymes (Carbohydrate-Active Enzymes) and CAZyme Gene Clusters (CGCs) in the human gut microbiome.
These automated tools will advance (i) basic biomedical science, by characterizing new polysaccharide
(glycan) metabolic enzymes and polysaccharide utilization loci (PULs, gene clusters with known carbohydrate
substrates) in the human gut microbiome, and (ii) the emerging practice of personalized nutrition (e.g., using
gut microbiome sequencing to infer whether a person responds to certain dietary glycans or prebiotics).

Two types of data are needed for AI/ML applications: (1) PULs (experimentally characterized gene clusters
with known carbohydrate substrates) curated from the literature, and (2) CGCs (without known carbohydrate
substrates) predicted from the human microbiome.

Although the parent R01 project focuses on developing new ML tools, challenges remain, and additional
support is needed to make the data used and produced in the parent project AI/ML-ready. These challenges
are: (i) the training data set is small, and more PULs remain to be curated from the literature, work that
the parent R01 project does not support; (ii) the parent R01 project does not provide for making the PULs
and CGCs AI/ML-ready for data scientists other than ourselves; (iii) the PUL/CGC data representation,
documentation, and pre-processing need significant improvement, because the current data structure is
designed only for domain experts in CAZymes and PULs; and (iv) existing software tools must be updated to
output CGCs in a more computer-readable format.

Therefore, the major goal of this AI/ML-readiness project is to develop a consistent, standardized, and
systematic format for PULs and CGCs that makes them AI/ML-ready not only for the parent R01 project but
also for other data scientists and nutrition scientists. To achieve this goal, we have assembled a
multidisciplinary research team of three faculty members, one postdoc, and three graduate students.
Together, these members have all the necessary expertise in nutritional science and CAZymes, statistical ML
model development, and bioinformatics and ML application development. Two Aims with four subtasks and
four milestones are planned to address the challenges above and to format and document the PUL and CGC
data so that they are readily usable by other data scientists and nutrition scientists. All AI/ML-ready data
will be freely available in two online data repositories: dbCAN-PUL (http://bcb.unl.edu/dbCAN_PUL/) and
dbCAN-seq (http://bcb.unl.edu/dbCAN_seq/). This project will contribute to the basic understanding of
dietary modulation of the human microbiome and to applied personalized nutrition research.

## Key facts

- **NIH application ID:** 10594096
- **Project number:** 3R01GM140370-02S1
- **Recipient organization:** UNIVERSITY OF NEBRASKA LINCOLN
- **Principal Investigator:** Yanbin Yin
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $282,499
- **Award type:** 3 (supplement to an existing award)
- **Project period:** 2021-05-01 → 2025-02-28

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10594096

## Citation

> US National Institutes of Health, RePORTER application 10594096, Carbohydrate enzyme gene clusters in human gut microbiome (3R01GM140370-02S1). Retrieved via AI Analytics 2026-05-27 from https://api.ai-analytics.org/grant/nih/10594096. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
