PROJECT SUMMARY
Carbohydrate enzyme gene clusters in human gut microbiome
Our parent R01 project (R01GM140370) aims to develop four bioinformatics tools for the automated annotation of CAZymes (Carbohydrate Active Enzymes) and CAZyme Gene Clusters (CGCs) in the human gut microbiome. These automated tools will advance: (i) basic biomedical science, by characterizing new polysaccharide (or glycan) metabolic enzymes and polysaccharide utilization loci (PULs, gene clusters with known carbohydrate substrates) in the human gut microbiome, and (ii) the emerging practice of personalized nutrition (e.g., using gut microbiome sequencing to infer whether a person is a responder to certain dietary glycans or prebiotics). Two types of data are needed for AI/ML applications: (1) PULs (experimentally characterized gene clusters with known carbohydrate substrates) curated from the literature, and (2) CGCs (without known carbohydrate substrates) predicted from the human microbiome. Although the parent R01 project focuses on the development of new ML tools, challenges remain, and additional support is needed to make the data used and produced in the parent project AI/ML-ready. These challenges are: (i) the training data set is small, and more PULs remain to be curated from the literature, work that the parent R01 project does not support; (ii) the parent R01 project does not provide for making the PULs and CGCs AI/ML-ready for data scientists other than ourselves; (iii) the representation, documentation, and pre-processing of the PUL/CGC data need significant improvement, because the current data structure is designed only for domain experts in CAZymes and PULs; and (iv) the existing software tools must be updated to output CGCs in a more computer-readable format in order to enable AI/ML-readiness.
Therefore, the major goal of this AI/ML-readiness project is to develop a consistent, standardized, and systematic format for PULs and CGCs that makes them AI/ML-ready not only for the parent R01 project but also for other data scientists and nutrition scientists. To achieve this goal, we have assembled a multi-disciplinary research team of three faculty members, one postdoc, and three graduate students. Together, the team has all the necessary expertise in nutritional science and CAZymes, statistical ML model development, and bioinformatics and ML application development. Two Aims with four subtasks and four milestones are planned to address the challenges above and to format and document the PUL and CGC data so that they are readily usable by other data scientists and nutrition scientists. All AI/ML-ready data will be freely available in two online data repositories: dbCAN-PUL (http://bcb.unl.edu/dbCAN_PUL/) and dbCAN-seq (http://bcb.unl.edu/dbCAN_seq/). This project will contribute to the basic understanding of dietary modulation of the human microbiome and to applied personalized nutrition research.