ABSTRACT Observational studies based on big data from electronic medical records (EMRs) have recently been conducted in many areas of medical research [1, 2]. Results from these studies can provide high-impact information, particularly on rare events or rare diseases, that would not be available from non-EMR-based studies with much smaller sample sizes. Many EMR-based studies have recently been done in ophthalmology [3], especially those using data from the Intelligent Research in Sight (IRIS®). In many of these studies, the primary outcomes are very rare events (<1%). When the prevalence and incidence of these diseases and their associated risk factors are estimated in IRIS, the entire patient population without the primary disease is treated as the control group (millions of rows of data). Querying and analyzing data on rare events in such large datasets commonly imposes major computational burdens, requiring long run times and substantial cloud computing costs. The situation is even worse when statistical software, such as R, crashes after a long run without returning any meaningful results. When the extremely rare events are further subdivided into much finer subcategories by combining the levels of each categorical variable, the rare-event problem becomes more severe, leading to unreliable effect estimates from logistic regression. To address these challenges, this application proposes a collaborative effort between Wills Eye Hospital (WEH), Philadelphia, PA, and the University of Connecticut that combines theoretical and applied statistical expertise to develop and evaluate novel subsampling and optimal analysis methods, which to the best of our knowledge do not exist to date.
This application proposes to achieve the following aims: 1) derive optimal subsampling probabilities for rare-events data with both categorical and numerical covariates; 2) derive optimal subsampling probabilities that are invariant to the measurement scales of continuous covariates; and 3) design an effect-balancing approach for covariates with rare categories. Most importantly, we will create user-friendly software packages that implement optimal subsampling for practitioners and that will be applicable to similar settings in medical research. Completion of these aims will provide eye researchers with a new suite of statistical software tools that improve effect estimation by reducing computational time and costs without loss of accuracy or precision. Each specific aim will be achieved through the following steps: 1) rigorously establish the theoretical properties of the proposed methodology; 2) examine its finite-sample performance through extensive simulation studies; and 3) apply the proposed methods to IRIS data as illustrative examples of their impact. These novel tools and software will help answer key questions in rare diseases with high reliability, efficiency and lo...