Genomics, EHRs, GPUs, and Next Generation Computational Statistics

NIH RePORTER · NIH · R01 · $644,297

Abstract

The future challenges of statistical genetics are enormous. Data sets continue to grow; studies with 10^6 cases and 10^7 markers have become feasible, but current algorithms and software do not scale to this size. We need to rethink and rebuild many of our statistical analysis techniques and tools to scale effectively. In addition, health data will soon be commonly collected from mobile and wearable devices, dramatically increasing its volume and utility. Precision health and predictive medicine raise the stakes even further. Concurrently, the nature of computing is rapidly changing. To take advantage of hardware advances, particularly ubiquitous parallel computing, new statistical approaches and algorithms and new programming paradigms must be brought online. This renewal proposal targets the application of state-of-the-art statistical techniques and tools to develop genetic analysis algorithms that can scale to studies with millions of subjects, such as the US Department of Veterans Affairs' Million Veteran Program (MVP) and the UK Biobank. Biobank-scale data sets have many benefits, particularly the potential power to detect the subtle effects of each of the many genes involved in common diseases. Another benefit is that these data sets can be more representative of the populace by including large numbers of people from multiple ancestries, different social strata, and all sexes. Analyzing these massive data sets effectively and efficiently requires advances in the current statistical genetics tools. Effective statistical analysis takes many forms: algorithms that converge in fewer iterations, powerful statistics that accommodate all available data, and computational methods that take advantage of massively parallel computing hardware such as graphics processing units (GPUs) and other coprocessors.
We will deliver algorithms that can directly handle biobank-scale data sets for many computationally challenging statistical genetics tasks, including genome-wide association studies (GWAS) with trait data from electronic health records (EHRs). More generally, our algorithm focus will benefit all scientific fields driven by computational statistics and high-dimensional optimization. Of course, for statistical algorithm development to be immediately useful, it must be accompanied by fast, easy-to-use software. We will promptly deliver open-source software that (1) enables interactive and reproducible analyses with informative intermediate results, (2) provides quality graphics, (3) scales to big data analytics, (4) embraces parallel and distributed computing, (5) adapts to rapid hardware evolution, (6) allows cloud computing, and (7) fosters easy communication between clinicians, geneticists, statisticians, and computer scientists. Recent breakthroughs in computer languages bring all these goals within reach. Our overall objective is the design and construction of state-of-the-art statistical genetics algorithms and software for modern, massive...

Key facts

NIH application ID: 10450816
Project number: 5R01HG006139-10
Recipient: UNIVERSITY OF CALIFORNIA LOS ANGELES
Principal Investigator: Eric Sobel
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $644,297
Award type: 5
Project period: 2011-08-26 → 2024-06-30