Applying machine learning to population health challenges
New research analyzing populations at the county level reveals strong correlations between specific social determinants of health and COVID-19 mortality rates.
In population health, social determinants have a complex relationship with patient risk and health outcomes. This complexity has been evident during the pandemic.
Early on, response and recovery efforts focused on addressing clinical risk factors, such as age and comorbidities. It wasn’t until later that other risk factors – such as race, poverty, and geography – began to emerge as traits that made populations more susceptible to COVID-19.
We used machine learning to study how sociodemographic features correlated with COVID-19 mortality rates. Our research found correlations between mortality and specific population traits, such as a high proportion of Black residents, HIV prevalence, and high unemployment rates.[1] This work illustrates how data-driven insights can help improve population health efforts.
5 lessons learned to help inform policies and interventions
Understanding the key social determinants associated with COVID-19 mortality in specific geographic regions can help inform policy and enhance tailored interventions. We share five lessons learned from this research in the hope that they can help those who work in healthcare and human services:
1. Use complete data.
In the early phases of this research, we identified relevant and publicly available data sources.[2] We chose the variables with complete data across all counties in the United States. This foundation was critical to creating accurate “clusters”: groups of counties with similar geographic, sociodemographic and health prevalence status.
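As a rough illustration of this filtering step, the sketch below keeps only the variables that have a value for every county. The table, the column names and the `complete_variables` helper are invented for illustration and are not the study's actual data or code; a missing value is assumed to be represented as `None`.

```python
# Toy county table. The identifiers, column names and values are made up.
counties = [
    {"fips": "00001", "pct_uninsured": 9.1, "median_income": 61000, "hiv_prevalence": None},
    {"fips": "00002", "pct_uninsured": 12.4, "median_income": 48000, "hiv_prevalence": 210.0},
    {"fips": "00003", "pct_uninsured": 7.8, "median_income": 75000, "hiv_prevalence": 95.0},
]


def complete_variables(rows, id_key="fips"):
    """Return the variable names that have a non-missing value in every row."""
    candidates = [name for name in rows[0] if name != id_key]
    return [
        name
        for name in candidates
        if all(row.get(name) is not None for row in rows)
    ]


kept = complete_variables(counties)
print(kept)  # hiv_prevalence is dropped because one county has no value
```

Dropping incomplete variables, rather than incomplete counties, keeps every county in the analysis, which matters when the goal is to cluster all of them.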
2. Focus on the most relevant factors.
An analysis of variance[3] helped select the ten variables with the greatest impact on mortality. In this study, the four most distinctive features were the proportion of Hispanic residents, the proportion of residents not proficient in English, the proportion of uninsured adults, and the proportion of residents in fair-to-poor health.
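To make the idea of ANOVA-based feature selection concrete, here is a minimal sketch that scores each variable with a one-way ANOVA F statistic and keeps the highest-scoring ones. It assumes counties have first been grouped into mortality bands (here "low" and "high"); that grouping, the feature names and the numbers are all hypothetical, and the study's actual procedure may differ.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k_features(features, labels, k):
    """Rank features by F statistic across label groups and keep the top k."""
    scores = {}
    for name, values in features.items():
        by_label = {}
        for value, label in zip(values, labels):
            by_label.setdefault(label, []).append(value)
        scores[name] = anova_f(list(by_label.values()))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical mortality bands and feature values for six counties.
labels = ["low", "low", "low", "high", "high", "high"]
features = {
    "pct_uninsured": [5.0, 6.0, 7.0, 14.0, 15.0, 16.0],
    "pct_rural": [10.0, 30.0, 50.0, 20.0, 40.0, 60.0],
}

selected, scores = top_k_features(features, labels, k=1)
print(selected, scores)
```

A large F value means the variable's average differs a lot between mortality bands relative to its spread within each band, which is why `pct_uninsured` ranks first in this toy data.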
3. Understand regional influences.
Comprehensive regional data is important to inform local strategies and policies. In this study, the counties within each cluster share similar underlying demographic and socioeconomic characteristics, which makes comparisons between groups easier.
Our machine learning algorithms used the top ten variables to find six distinct county clusters. For example, Cluster 5 comprised 223 counties, including Queens and Westchester counties in New York, Skagit and Spokane counties in Washington, and Harford and Baltimore counties in Maryland. These are all urban counties with similar characteristics, such as household income, percentage of homeowners and residential segregation.
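The article does not name the clustering algorithm used, so the sketch below uses plain k-means purely as a stand-in to show how counties can be grouped by similarity. The `kmeans` function, the two-dimensional toy points and `k=2` are assumptions for illustration; the study worked with ten variables and six clusters, and features on different scales would normally be standardized first.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means: returns (assignments, centroids) for numeric vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Six made-up counties described by two made-up, already-scaled features.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
cluster_ids, centers = kmeans(points, k=2)
print(cluster_ids)
```

Counties that end up with the same cluster id have similar feature profiles, which is what allows like-for-like comparisons within a cluster.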
4. Machine learning offers advantages over traditional methods.
Machine learning helps capture non-linear relationships and interactions among relevant factors, more so than traditional statistical adjustment models. This method can help researchers avoid assumptions about the relationships between variables.
Consider one finding from this study that may be unexpected: population density usually has a strong correlation with infectious disease. But we found that in Cluster 3 and Cluster 4, the proportion of Black residents correlated more strongly with mortality than population density did. A machine learning approach can help tease out potential confounding factors when evaluating these complex, non-linear relationships.
5. Remember, correlation is not causation.
Correlation measures the strength of a linear relationship between two variables. A strong correlation does not, by itself, show that a factor causes higher mortality. Likewise, a weak or absent correlation with COVID-19 mortality does not necessarily mean that a factor is unrelated to it. There could be a non-linear relationship that requires further investigation.
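A small worked example shows why a weak correlation can hide a real relationship. The sketch below computes the Pearson correlation coefficient for two toy series: one that depends on `x` linearly and one that depends on `x` through a U-shaped curve. The data are synthetic and have nothing to do with the study's variables.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y rises steadily with x
u_shaped = [x ** 2 for x in xs]    # y is fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)
r_u_shaped = pearson_r(xs, u_shaped)
print(r_linear, r_u_shaped)
```

The U-shaped series is completely determined by `x`, yet its Pearson correlation with `x` is zero, because the rising and falling halves cancel out. This is the kind of relationship a purely linear measure can miss.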
Next steps: Applying data-driven insights to public health efforts
Social determinants of health, patient risk and health outcomes interact across long and dynamic pathways. More study is needed. For example, this research did not control for potential confounding factors, such as age and comorbidities, which could influence correlations. Now that more time has passed and more data is available, an additional study could reveal new insights.
Discovering correlations between social determinants of health and population health challenges can help policymakers and health and human services leaders tailor interventions to the populations they serve. These insights can help them predict potential hotspots and target public health resources and communications. When designing public health programs, such as vaccine distribution or disease prevention initiatives, these insights can help shape a more successful approach.
1. Huang HT, Kefayati S, Clark CR, Preininger AM, Bright TJ, Jackson G, Rhee K, Dankwa-Mullan I. Race, social determinant and COVID-19 mortality patterns in the United States. Oral presentation to be presented at: Society of General Internal Medicine Annual Meeting; April 20-23, 2021; Virtual.
2. We identified a range of demographic, social, economic, environmental, disease risk prevalence and healthcare determinants known to influence susceptibility to disease and health outcomes. We retrieved variables from publicly available sources, including U.S. Census Population Estimates, the American Community Survey 5-year Estimates, Small Area Income and Poverty Estimates, the Area Deprivation Index (ADI, 2015), and the CDC’s Social Vulnerability Index (SVI, 2018).
3. Also known as ANOVA-based feature selection.