# Novel subsampling and analysis of big data using examples from IRIS ® Registry

> **NIH R21** · WILLS EYE HEALTH SYSTEM · 2024 · $217,449

## Abstract

Observational studies based on big data from electronic medical records (EMRs) have recently been conducted in many areas of medical research [1, 2]. Results from these studies can provide high-impact information, particularly on rare events or rare diseases, that would otherwise be unavailable from non-EMR-based studies with much smaller sample sizes. Many EMR-based studies have recently been done in ophthalmology [3], especially using data from the Intelligent Research in Sight (IRIS®) Registry. In many of the studies to date, the primary outcomes are very rare events (<1%). When estimating the prevalence and incidence of these diseases and their associated risk factors in IRIS, the entire patient population without the primary disease has been treated as the control group (millions of rows of data). Querying and analyzing data for rare events in such large datasets commonly imposes major computational burdens, with lengthy run times and substantial cloud computing costs. The situation is even worse when statistical software such as R crashes after a long run without returning any meaningful results. When the extremely rare events are further subdivided into much finer subcategories by combining the levels of each categorical variable, the rare-event problem becomes more severe and leads to unreliable effect estimates from logistic regression. To address these challenges, this application, a collaborative effort between Wills Eye Hospital (WEH), Philadelphia, PA, and the University of Connecticut that combines theoretical and applied statistical expertise, proposes to develop and evaluate novel subsampling and optimal analysis methods which, to the best of our knowledge, do not exist to date.
This application proposes to achieve the following aims: 1) Derive optimal subsampling probabilities for rare events data with both categorical and numerical covariates; 2) Derive optimal subsampling probabilities that are invariant to measurement scales for continuous covariates; 3) Design an effect balancing approach for covariates with rare categories. Most importantly, we will create user-friendly software packages on optimal subsampling for practitioners that will be applicable for similar settings in medical research.
Completion of these proposed aims will provide eye researchers with a new suite of statistical software tools that improve effect estimation at reduced computational time and cost while maintaining accuracy and precision. Each specific aim will be achieved through the following steps: 1) rigorously establish the theoretical properties of the proposed methodology; 2) examine the finite-sample performance through extensive simulation studies; and 3) apply the proposed methods to IRIS data as illustrative examples to demonstrate their impact. These novel tools and software will help answer key questions in rare diseases with high reliability, efficiency and lo...
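The abstract does not state the optimal subsampling probabilities it proposes to derive. As a point of reference only, the sketch below shows the simplest baseline for the setting it describes: keep every rare case, keep each control with a fixed probability, fit logistic regression on the subsample, and correct the intercept for the control sampling rate. This is uniform control subsampling, not the project's method. The simulated data, the parameter values, and all function names (`fit_logistic`, `subsample_controls`) are illustrative assumptions, and no IRIS data are involved.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, max_iter=50, tol=1e-10):
    """Fit logit P(y=1) = b0 + b1*x by Newton's method (single covariate)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            r = y - p            # score contribution
            w = p * (1.0 - p)    # information weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * d = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def subsample_controls(xs, ys, rho, rng):
    """Keep every case (y=1); keep each control (y=0) with probability rho."""
    kept = [(x, y) for x, y in zip(xs, ys) if y == 1 or rng.random() < rho]
    return [x for x, _ in kept], [y for _, y in kept]


# Simulated rare-event data (assumed values, chosen only for illustration).
rng = random.Random(0)
TRUE_B0, TRUE_B1 = -6.0, 1.0   # event rate well below 1%
N = 200_000                    # full-data size
RHO = 0.02                     # control sampling probability

xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
ys = [1 if rng.random() < sigmoid(TRUE_B0 + TRUE_B1 * x) else 0 for x in xs]

sub_x, sub_y = subsample_controls(xs, ys, RHO, rng)
a0, b1_hat = fit_logistic(sub_x, sub_y)

# Sampling controls at rate rho inflates the subsample odds by 1/rho,
# so the full-data intercept is recovered by adding log(rho).
b0_hat = a0 + math.log(RHO)

print(f"rows used: {len(sub_x)} of {N}")
print(f"intercept estimate: {b0_hat:.3f}, slope estimate: {b1_hat:.3f}")
```

Under uniform control sampling the slope needs no adjustment and only the intercept shifts by log(1/ρ), which is why the fit can run on a small fraction of the rows. Sampling probabilities that depend on the covariates or the outcome, as the aims envision, would generally need a different correction than this one.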

## Key facts

- **NIH application ID:** 10988038
- **Project number:** 1R21EY035710-01A1
- **Recipient organization:** WILLS EYE HEALTH SYSTEM
- **Principal Investigator:** HaiYing Wang
- **Activity code:** R21
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $217,449
- **Award type:** 1
- **Project period:** 2024-09-01 → 2026-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10988038

## Citation

> US National Institutes of Health, RePORTER application 10988038, Novel subsampling and analysis of big data using examples from IRIS ® Registry (1R21EY035710-01A1). Retrieved via AI Analytics 2026-06-02 from https://api.ai-analytics.org/grant/nih/10988038. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
