# Statistical methods and designs for correlated outcome and covariate errors in studies of HIV/AIDS

> **NIH R37** · VANDERBILT UNIVERSITY MEDICAL CENTER · 2024 · $845,194

## Abstract

PROJECT SUMMARY/ABSTRACT
Electronic health record (EHR) and other routinely collected data are often used as cost-effective data sources
for HIV/AIDS research. These data sources, however, are known to be prone to errors, typically across
multiple variables, which can lead to biased study results and misleading conclusions. In addition, EHR data
sources often lack gold-standard measurements that are needed to clearly define the presence or absence of
co-morbidities (e.g., liver fibrosis). To address limitations of EHR data sources, researchers can validate or
collect additional data on a subsample of their patient records. By combining the rich, but error-prone EHR
data on all study subjects with the gold-standard / validated data collected on a subsample of subjects,
researchers can improve study estimates. Specifically, they can eliminate the bias that would arise from using
the EHR data alone, and they can obtain more precise estimates (e.g., narrower confidence intervals) than
would be possible using only the subsample with gold-standard / validated data. In earlier research, we
developed statistical methods and software to combine EHR data with validated sub-samples of data. We
developed optimal, multi-wave designs for targeting records for data validation. Importantly, we applied these
methods to multiple HIV studies using retrospective observational data from the International epidemiology
Databases to Evaluate AIDS (IeDEA). However, in our applications, we have encountered additional
challenges that have not yet been addressed. In particular, there is great potential in combining expensive,
prospectively collected, gold-standard data that are sparsely measured (e.g., once per year) on a sub-sample
of patients with EHR data that are collected much more frequently on a larger number of patients. We will
develop methods to handle this setting, and we will develop statistical designs to better select which
participants should be approached for prospective data collection and which patient records should be
validated. We will also develop statistical methods to address other challenges encountered with using EHR
data, including how to incorporate validation data into studies when inclusion in the study is error-prone, and
methods to address more complex types of data (e.g., interval-censored data), for which there is a lack of
techniques to handle error-prone data. Our methods and designs will focus on extensions of multiple
imputation, maximum likelihood, and generalized raking techniques. Open source tools and tutorials will be
developed to help researchers to implement these novel methods and study designs. The methods and
designs will be applied to data from the IeDEA network to estimate the incidence of and risk factors for liver
fibrosis/steatosis and frailty among people living with HIV in East Africa and Latin America.
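
The bias and precision trade-off described in the abstract can be illustrated with a small simulation. The sketch below is not the project's method: it uses a simple difference estimator rather than the multiple imputation, maximum likelihood, or generalized raking techniques named above, and every number in it (cohort size, validation subsample size, error magnitudes) and every name (`simulate_once`, `run_simulation`, `TRUE_MEAN`, `EHR_SHIFT`) is a hypothetical choice made for illustration. It compares three estimates of a mean outcome: the error-prone EHR values alone, the validated subsample alone, and the two combined.

```python
import random
import statistics

TRUE_MEAN = 10.0   # hypothetical population mean of the gold-standard outcome
EHR_SHIFT = 0.5    # hypothetical systematic error in the EHR measurement


def simulate_once(rng, n=2000, n_val=200):
    """Return (naive, validated_only, combined) estimates of the mean outcome."""
    # Gold-standard values exist for everyone here only because this is a simulation.
    y_true = [rng.gauss(TRUE_MEAN, 2.0) for _ in range(n)]
    # Error-prone EHR values: systematic shift plus random noise.
    y_ehr = [y + EHR_SHIFT + rng.gauss(0.0, 1.0) for y in y_true]
    # Simple random validation subsample of record indices.
    val_idx = rng.sample(range(n), n_val)

    naive = statistics.fmean(y_ehr)
    validated_only = statistics.fmean(y_true[i] for i in val_idx)
    # Difference estimator: full-cohort EHR mean plus the average
    # (gold standard - EHR) discrepancy seen in the validation subsample.
    correction = statistics.fmean(y_true[i] - y_ehr[i] for i in val_idx)
    combined = naive + correction
    return naive, validated_only, combined


def run_simulation(reps=200, seed=0):
    """Repeat the simulation and summarise bias and variance per estimator."""
    rng = random.Random(seed)
    draws = [simulate_once(rng) for _ in range(reps)]
    summary = {}
    for name, values in zip(("naive", "validated_only", "combined"), zip(*draws)):
        summary[name] = {
            "bias": statistics.fmean(values) - TRUE_MEAN,
            "variance": statistics.variance(values),
        }
    return summary


if __name__ == "__main__":
    for name, stats in run_simulation().items():
        print(f"{name:>15}: bias={stats['bias']:+.3f}, variance={stats['variance']:.4f}")
```

Under these assumptions, the EHR-only estimate should show a bias close to `EHR_SHIFT`, while the validated-only and combined estimates should be approximately unbiased, with the combined estimate having the smaller variance of the two. The validation subsample here is a simple random sample; targeted designs such as the multi-wave designs mentioned in the abstract would require weighting that this sketch does not attempt.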

## Key facts

- **NIH application ID:** 10765711
- **Project number:** 5R37AI131771-07
- **Recipient organization:** VANDERBILT UNIVERSITY MEDICAL CENTER
- **Principal Investigator:** Pamela A Shaw
- **Activity code:** R37
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $845,194
- **Award type:** 5
- **Project period:** 2018-01-25 → 2028-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10765711

## Citation

> US National Institutes of Health, RePORTER application 10765711, Statistical methods and designs for correlated outcome and covariate errors in studies of HIV/AIDS (5R37AI131771-07). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10765711. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*