# Improving representativeness in non-probability surveys and causal inference with regularized regression and post-stratification

> **NIH R01** · COLUMBIA UNIV NEW YORK MORNINGSIDE · 2022 · $210,873

## Abstract

The proposed project broadly aims to address the growing complexity of survey statistics as response rates decline. We focus specifically on non-probability samples (samples of convenience) because of their increasing popularity, but note that these samples are simply an extreme case of a probability-based survey with high non-response, so our methods can be expected to generalize. In the long term, we hope to find methods and techniques to safely adjust non-probability samples to a wider population while concurrently developing methods for critiquing these estimates, to increase researcher, policy-maker, and public confidence in them.

Our specific aims focus on developing the tools and techniques to make this possible. We focus primarily on a regularized regression and poststratification methodology that has already shown some success with non-representative and even convenience samples. Using this methodology, we focus on adaptations that make the technique useful in public health settings.
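
As a rough illustration of the regularized regression and poststratification idea, the sketch below shrinks per-cell sample means toward the overall sample mean and then reweights the cells by known population counts. It is a minimal sketch under stated assumptions, not the project's method: the function names, the `penalty` parameter, and the toy data are hypothetical, and the shrinkage shown is a simple ridge-style estimate on cell indicators with the intercept fixed at the sample mean, standing in for a full multilevel model.

```python
import numpy as np

def shrunk_cell_means(y, cell, n_cells, penalty=5.0):
    """Ridge-style partial pooling of cell means toward the overall sample mean.

    `penalty` (hypothetical name) acts like a number of pseudo-observations
    at the overall mean: small cells are pulled strongly, large cells barely.
    """
    y = np.asarray(y, dtype=float)
    cell = np.asarray(cell, dtype=int)
    grand_mean = y.mean()
    counts = np.bincount(cell, minlength=n_cells)
    sums = np.bincount(cell, weights=y, minlength=n_cells)
    return (sums + penalty * grand_mean) / (counts + penalty)

def poststratify(cell_estimates, population_counts):
    """Average the cell estimates, weighting each cell by its population size."""
    weights = np.asarray(population_counts, dtype=float)
    return float(np.sum(weights * cell_estimates) / weights.sum())

# Toy data (assumed): cell 0 is over-represented in the sample
# but makes up only a quarter of the population.
y = np.array([1, 1, 1, 0])
cell = np.array([0, 0, 0, 1])
population_counts = np.array([100, 300])

estimates = shrunk_cell_means(y, cell, n_cells=2, penalty=1.0)
adjusted = poststratify(estimates, population_counts)
print(y.mean(), adjusted)
```

In this toy case the raw sample mean is 0.75, while the poststratified estimate is lower because the under-sampled cell carries most of the population weight. The choice of `penalty` controls how much the sparse cell borrows from the rest of the sample.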

Specifically, we take a three-pronged approach. First, we aim to adapt the current state-of-the-art modelling technique to better suit the unique challenges posed by public health datasets and questions. Our approach is to focus on partial pooling with more structured adjustment variables and, more broadly, to consider high-dimensional variables with continuous and non-continuous components. We also consider uncertainty in poststratification, namely when adjusting for variables not known in the population. In a complementary approach, we also aim to assess coverage by combining raw survey data while assuming differences between samples.

Second, we note that much of our central methodology could be extended to questions of a causal nature. This is particularly relevant to public health challenges because causal estimates are often desired. Our approach is to extend the model-based approach to assume heterogeneity of effect within demographic subgroups. Then, using regularization, the effect within each subgroup is estimated and used to poststratify to the population. Groups with relatively few treated or untreated individuals would be estimated with greater uncertainty, which is an innovative approach to accounting for balance.
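
The causal extension described above can be sketched in the same style: estimate a regularized treated-minus-control difference within each subgroup, then poststratify those subgroup effects to the population. This is only an illustrative sketch; the function names, the `penalty` parameter, and the toy data are assumptions, and simple shrinkage of each arm's cell means replaces the model-based estimation the abstract has in mind. It returns point estimates only, so the greater uncertainty for sparsely treated groups would need a full probabilistic model to quantify.

```python
import numpy as np

def pooled_means(y, cell, n_cells, penalty):
    """Cell means shrunk toward the mean of the supplied observations."""
    grand_mean = y.mean()
    counts = np.bincount(cell, minlength=n_cells)
    sums = np.bincount(cell, weights=y, minlength=n_cells)
    return (sums + penalty * grand_mean) / (counts + penalty)

def poststratified_effect(y, treated, cell, population_counts, penalty=5.0):
    """Return (per-cell effects, population-weighted average effect)."""
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    cell = np.asarray(cell, dtype=int)
    weights = np.asarray(population_counts, dtype=float)
    n_cells = len(weights)
    # Shrink each arm separately, so cells with few treated (or few
    # untreated) units lean on that arm's overall mean.
    mean_treated = pooled_means(y[treated], cell[treated], n_cells, penalty)
    mean_control = pooled_means(y[~treated], cell[~treated], n_cells, penalty)
    cell_effects = mean_treated - mean_control
    overall = float(np.sum(weights * cell_effects) / weights.sum())
    return cell_effects, overall

# Toy data (assumed): the effect is 2 in cell 0 and 4 in cell 1.
y = np.array([3.0, 1.0, 5.0, 1.0])
treated = np.array([1, 0, 1, 0])
cell = np.array([0, 0, 1, 1])
population_counts = np.array([100, 300])

cell_effects, overall = poststratified_effect(
    y, treated, cell, population_counts, penalty=0.0
)
print(cell_effects, overall)
```

With `penalty=0.0` the sketch reduces to unpooled subgroup differences; a positive value pulls sparse subgroups toward the arm-level means before the population weights are applied.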

Third and finally, we note that the regularized regression and prediction technique is particularly reliant on model assumptions. Our final aim is to consider methods of testing and validating models with non-representative data in order to obtain better and more trustworthy population-based estimates.

## Key facts

- **NIH application ID:** 10400107
- **Project number:** 5R01AG067149-03
- **Recipient organization:** COLUMBIA UNIV NEW YORK MORNINGSIDE
- **Principal Investigator:** ANDREW GELMAN
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $210,873
- **Award type:** 5
- **Project period:** 2020-08-01 → 2025-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10400107

## Citation

> US National Institutes of Health, RePORTER application 10400107, Improving representativeness in non-probability surveys and causal inference with regularized regression and post-stratification (5R01AG067149-03). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10400107. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
