Mining Social Media Big Data for Toxicovigilance: Studying Substance Use via Natural Language Processing and Machine Learning Methods

NIH RePORTER · NIH · R01 · $1,272,404

Abstract

The epidemic of substance use (SU) and substance use disorder (SUD) in the United States has been evolving for decades. Both prescription and illicit drugs have been involved in overdose deaths over the years, with notable increases in synthetic opioids (e.g., fentanyl and its analogs) and psychostimulants (e.g., methamphetamine) in recent years. The emergence of high-potency novel psychoactive substances (NPSs), such as fentanyl analogs, has drastically contributed to rising deaths and adversely impacted treatment engagement and response. The COVID-19 pandemic has further exacerbated the crisis, and recent studies have also highlighted that substantial disparities exist in SUD treatment, research, interest, and response across different subpopulations, with racial/ethnic minorities being disproportionately impacted. A key element to tackling the crisis is improved surveillance. Specifically, there is a need for establishing novel approaches to provide timely insights about the trends, distributions, and trajectories of the SUD epidemic, as traditional surveillance approaches involve considerable lags. Many recent studies have identified social media (SM) as a useful resource for conducting SU/SUD surveillance. Many people use SM to discuss personal experiences, provide advice, or seek answers to questions regarding SU/SUD, resulting in the generation of an abundance of information. Such information can be characterized, aggregated, and analyzed to obtain population- or subpopulation-level insights, at low cost and in near real time. However, converting SM data into timely, actionable knowledge is non-trivial since the data is big, complex, and noisy, requiring the development of advanced, automated artificial intelligence methods. Funded by the National Institute on Drug Abuse, our past work focused specifically on prescription medications (PMs) and established the most sophisticated SM-based data mining pipeline available to date.
In response to the evolution of the SUD epidemic, the proposed project will extend our capabilities to include illicit substances and develop novel methods to conduct surveillance. Specifically, we will (i) extend our machine learning and natural language processing (NLP) classification pipeline to automatically classify all SU-related chatter from Twitter and Reddit (rather than PMs only), (ii) collect and analyze longitudinal timelines of cohorts self-reporting SU/SUD, (iii) characterize the cohorts in terms of demographic details such as age group, gender identity, race, and geolocation, (iv) develop advanced NLP-driven methods for detecting NPSs and impacts of SU/SUD, (v) study short-term and long-term trends and trajectories of the epidemic, (vi) conduct observational studies on targeted population subsets, including studies focusing on SU and SUD treatment disparities and stigma, and (vii) disseminate developed methodologies via open-source code and aggregated findings publicly via a web-based dashboard. Implementation of...

Key facts

NIH application ID
10588855
Project number
1R01DA057599-01
Recipient
EMORY UNIVERSITY
Principal Investigator
Abeed H Sarker
Activity code
R01
Funding institute
NIH
Fiscal year
2022
Award amount
$1,272,404
Award type
1
Project period
2022-09-30 → 2025-09-29