# Mining Social Media Big Data for Toxicovigilance: Studying Substance Use via Natural Language Processing and Machine Learning Methods

> **NIH R01** · EMORY UNIVERSITY · 2022 · $1,272,404

## Abstract

The epidemic of substance use (SU) and substance use disorder (SUD) in the United States has been evolving for
decades. Both prescription and illicit drugs have been involved in overdose deaths over the years, with notable
increases in synthetic opioids (e.g., fentanyl and analogs) and psychostimulants (e.g., methamphetamine) in recent
years. The emergence of high-potency novel psychoactive substances (NPSs), such as fentanyl analogs, has
contributed drastically to rising deaths and adversely impacted treatment engagement and response. The
COVID-19 pandemic has further exacerbated the crisis, and recent studies have also highlighted that substantial
disparities exist in SUD treatment, research, interest, and response across different subpopulations, with
racial/ethnic minorities being disproportionately impacted. A key element to tackling the crisis is improved
surveillance. Specifically, there is a need for establishing novel approaches to provide timely insights about the
trends, distributions, and trajectories of the SUD epidemic, as traditional surveillance approaches involve
considerable lags. Many recent studies have identified social media (SM) as a useful resource for conducting
SU/SUD surveillance. Many people use SM to discuss personal experiences, provide advice, or seek answers to
questions regarding SU/SUD, resulting in the generation of an abundance of information. Such information can
be characterized, aggregated and analyzed to obtain population- or subpopulation-level insights, at low cost and
in near real time. However, converting SM data into timely, actionable knowledge is non-trivial since the data is
big, complex, and noisy, requiring the development of advanced, automated artificial intelligence methods.
Funded by the National Institute on Drug Abuse, our past work focused specifically on prescription medications
(PM) and established the most sophisticated SM-based data mining pipeline available to date. In response to the
evolution of the SUD epidemic, the proposed project will extend our capabilities to include illicit substances and
develop novel methods to conduct surveillance. Specifically, we will (i) extend our machine learning and natural
language processing (NLP) classification pipeline to automatically classify all SU-related chatter from Twitter
and Reddit (rather than PMs only), (ii) collect and analyze longitudinal timelines of cohorts self-reporting
SU/SUD, (iii) characterize the cohorts in terms of demographic details such as age group, gender identity, race
and geolocation, (iv) develop advanced NLP-driven methods for detecting NPSs and impacts of SU/SUD, (v)
study short-term and long-term trends and trajectories of the epidemic, (vi) conduct observational studies on
targeted population subsets, including studies focusing on SU and SUD treatment disparities and stigma, and
(vii) disseminate developed methodologies via open-source code and aggregated findings publicly via a
web-based dashboard. Implementation of...

## Key facts

- **NIH application ID:** 10588855
- **Project number:** 1R01DA057599-01
- **Recipient organization:** EMORY UNIVERSITY
- **Principal Investigator:** Abeed H Sarker
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $1,272,404
- **Award type:** 1
- **Project period:** 2022-09-30 → 2025-09-29

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10588855

## Citation

> US National Institutes of Health, RePORTER application 10588855, Mining Social Media Big Data for Toxicovigilance: Studying Substance Use via Natural Language Processing and Machine Learning Methods (1R01DA057599-01). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10588855. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
