DARSaW: Developing, Assessing, and Refining Synthetic Sampling Weights to Improve Generalizability of the All of Us Research Program Data

NIH RePORTER · NIH · R21 · $224,531

Abstract

The All of Us Research Program (All of Us) is a large-scale initiative to collect and study multimodal data from over one million participants living in the United States (U.S.). Studies have identified notable differences in disease prevalence between All of Us participants and the broader U.S. population, which are attributable in part to the program’s enrollment strategy. A key challenge limiting the representativeness of All of Us for the target U.S. population is that the data are collected through a non-probabilistic sampling design. This proposal aims to leverage two types of external data resources on the U.S. population to construct reliable Synthetic sampling Weights (SaW) for All of Us that mimic a probabilistic sampling design and improve generalizability. The first external data resource, the National Health and Nutrition Examination Survey (NHANES), provides a nationally representative dataset with validated sampling weights and publicly available individual-level data. However, the NHANES sample size is relatively small, which can result in under-coverage. The second external data resource comprises the U.S. Census and the American Community Survey (ACS), large-scale nationwide surveys that provide more extensive, but aggregated, demographic and housing information about the U.S. population, compensating for this limitation of NHANES. However, individual-level data are not available from these sources. Using the external data resources available in NHANES, the U.S. Census, and ACS, this project will develop, assess, and refine Synthetic sampling Weights (DARSaW) to improve the generalizability of All of Us to the target U.S. population. In Aim 1, we will develop the SaW for All of Us by leveraging the individual-level data from NHANES and the rich but aggregated summary statistics from the U.S. Census and ACS. In Aim 2, we will assess the effectiveness of the SaW through case studies comparing unweighted and SaW-weighted estimates of obesity, hypertension, and disability. We will iterate between Aims 1 and 2 to refine the SaW when discrepancies arise by post-calibrating to broader and deeper aggregated statistics from the target population. The goal of this proposal is to demonstrate the ability of the SaW to improve the generalizability of the All of Us data, enabling researchers to draw valid conclusions about the target U.S. population.
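
To make the idea of post-calibrating weights to aggregated population statistics concrete, the following is a minimal Python sketch of raking (iterative proportional fitting), followed by an unweighted versus weighted prevalence comparison in the spirit of Aim 2. The abstract does not specify the project's estimation method, so raking is used here only as one common calibration technique, not as the DARSaW procedure. The function `rake_weights`, the variable names, the target margins, and all data values are hypothetical toy inputs made up for illustration; they are not All of Us, NHANES, Census, or ACS figures.

```python
import numpy as np


def rake_weights(categories, margins, n_iter=100, tol=1e-10):
    """Rake unit weights so weighted category shares match target margins.

    categories: dict mapping a variable name to an integer-coded array (one code per person)
    margins:    dict mapping the same names to target population proportions per code
    Assumes every category level is observed at least once in the sample.
    """
    n = len(next(iter(categories.values())))
    w = np.full(n, 1.0 / n)  # start from equal weights that sum to 1
    for _ in range(n_iter):
        max_shift = 0.0
        for name, codes in categories.items():
            target = np.asarray(margins[name], dtype=float)
            # current weighted share of each level of this variable
            current = np.bincount(codes, weights=w, minlength=len(target))
            factors = target / current
            w = w * factors[codes]  # rescale each person by their level's factor
            max_shift = max(max_shift, float(np.max(np.abs(factors - 1.0))))
        if max_shift < tol:  # all margins already matched in this pass
            break
    return w / w.sum()


# Hypothetical toy sample of 8 participants (made-up values).
age_group = np.array([0, 0, 0, 0, 0, 1, 1, 1])     # 0 = younger, 1 = older
sex = np.array([0, 0, 1, 1, 1, 0, 1, 1])           # two illustrative levels
hypertension = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # toy outcome indicator

# Hypothetical aggregated targets standing in for population-level margins.
targets = {"age_group": [0.5, 0.5], "sex": [0.5, 0.5]}

w = rake_weights({"age_group": age_group, "sex": sex}, targets)

unweighted = hypertension.mean()
weighted = float(np.sum(w * hypertension))
print(f"unweighted prevalence: {unweighted:.3f}")
print(f"weighted prevalence:   {weighted:.3f}")
```

In this toy sample the older group is under-represented relative to the assumed targets, so calibration shifts the prevalence estimate toward that group's rate. A real analysis would also need to address variance estimation and extreme weights, which this sketch omits.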

Key facts

NIH application ID: 10930769
Project number: 5R21MD019103-02
Recipient: VANDERBILT UNIVERSITY MEDICAL CENTER
Principal Investigator: Qingxia Chen
Activity code: R21
Funding institute: NIH
Fiscal year: 2024
Award amount: $224,531
Award type: 5
Project period: 2023-09-17 → 2027-03-31