Eliminating bias in health data science

It’s time for a renewed commitment to fairness, transparency and equity in data, big data and the algorithms for statistical or artificial intelligence (AI) models

3 minute read | June 24, 2020

[Image: overhead shot showing the connection between population health and data collection]

Health disparities have existed for a long time and are very complicated issues to resolve. Several factors can affect health, including but not limited to gender, race, ethnicity, sexual orientation, employment, education, income level and living conditions.

Bias has generally been defined as systematic error introduced into a study's sampling or analysis by consciously or subconsciously choosing or promoting certain outcomes over others. In the data science and computer science literature, bias is defined in terms of the dataset or model: bias in labeling, bias in sample selection, bias in data retrieval, scaling and imputation, or bias in model selection that favors certain types of statistical and machine learning errors over others.

While data and algorithms are used in a variety of contexts, this definition focuses on studies that generate insights from data about population attributes, such as demographic characteristics and behaviors. For example, bias in inputs – the building and analysis of population datasets – can result from the sources of the data, the context in which the data is gathered, incompleteness or errors in deciding which aspects of the data are considered important, and the methods of analysis.1
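One simple way to look for this kind of input bias is to compare how often each group appears in a dataset with how often it appears in the population the dataset is meant to describe. The sketch below is only an illustration: the group labels, the counts and the reference shares are made up, and the function name `representation_ratios` is hypothetical rather than part of any toolkit mentioned here.

```python
from collections import Counter


def representation_ratios(sample_groups, reference_shares):
    """Compare each group's share of the sample with its reference share.

    A ratio near 1.0 means the group appears about as often as expected.
    A ratio well below 1.0 suggests the group is under-represented.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {
        group: (counts.get(group, 0) / n) / share
        for group, share in reference_shares.items()
    }


# Hypothetical data: 100 records labeled with a demographic group.
sample_groups = ["A"] * 70 + ["B"] * 20 + ["C"] * 10

# Hypothetical reference shares, e.g. taken from census figures.
reference_shares = {"A": 0.5, "B": 0.3, "C": 0.2}

ratios = representation_ratios(sample_groups, reference_shares)
for group, ratio in sorted(ratios.items()):
    print(f"group {group}: representation ratio {ratio:.2f}")
```

A check like this only flags a gap in who is represented. It says nothing about why the gap exists or about the quality of the records that are present, so it is a starting point for questions about data sources and collection context, not an answer.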

In clinical and social science research, bias has been defined as any tendency that prevents unprejudiced consideration of a question or advances prejudice in favor of or against one group compared with another.2 Both definitions share an acknowledgement that bias implies error resulting in one group being favored over another. They differ in that the former refers to bias in terms of the dataset and the latter refers to bias as a certain outcome.3

Health data can be subject to our own biases or lack of diversity. I believe it’s important to address any tendency that prevents equity or advances prejudice, whether it’s in how the data is collected, sampled, tested, labeled or structured. It’s up to all of us to be aware of the potential for bias and to promote fairness in data science and clinical research.

Data diversity and fairness can help healthcare providers and data scientists approach problems, like responding to the pandemic, with a more equitable lens. Data can help identify unique risk factors and patterns that could indicate increased vulnerability during an outbreak. Understanding the factors that lead to increased cases can help identify hot spots for earlier community interventions, response and recovery.

As part of IBM, we are committed to being purposeful and transparent about eliminating bias in our data, analytics, AI and services. Examples include:

  • On June 8, 2020, our CEO sent a letter to Congress outlining policy proposals to advance racial equality. He also shared, in the context of responsible use of technology by law enforcement, that IBM will no longer offer general purpose IBM facial recognition or analysis software.
  • Health equity is one of the areas of focus of IBM Watson Health’s ongoing research collaborations with two academic centers – Brigham and Women’s Hospital, which is a teaching hospital of Harvard Medical School, and Vanderbilt University Medical Center – to help advance the science of AI and its application to major public health issues.
  • IBM Research has developed AI Fairness 360, a comprehensive open-source toolkit of metrics to check for unwanted bias in datasets and machine learning models, along with state-of-the-art algorithms to mitigate such bias.
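To give a sense of what a bias metric measures, here is a minimal sketch of two widely used group fairness measures: statistical parity difference and disparate impact. It is written in plain Python and does not use the AI Fairness 360 API. The records, the field names `group` and `referred`, and the choice of which group counts as privileged are all hypothetical.

```python
from collections import defaultdict


def favorable_rates(records, group_key, outcome_key):
    """Return the share of favorable outcomes (value 1) for each group."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[outcome_key] == 1:
            favorable[group] += 1
    return {group: favorable[group] / totals[group] for group in totals}


def statistical_parity_difference(rates, unprivileged, privileged):
    """Difference in favorable-outcome rates; 0.0 means parity."""
    return rates[unprivileged] - rates[privileged]


def disparate_impact(rates, unprivileged, privileged):
    """Ratio of favorable-outcome rates; 1.0 means parity.

    Assumes the privileged group's rate is greater than zero.
    """
    return rates[unprivileged] / rates[privileged]


# Hypothetical data: whether patients in two groups were referred to a
# care program (1 = referred, 0 = not referred).
records = (
    [{"group": "A", "referred": 1}] * 6
    + [{"group": "A", "referred": 0}] * 4
    + [{"group": "B", "referred": 1}] * 3
    + [{"group": "B", "referred": 0}] * 7
)

rates = favorable_rates(records, "group", "referred")
spd = statistical_parity_difference(rates, unprivileged="B", privileged="A")
di = disparate_impact(rates, unprivileged="B", privileged="A")

print(f"referral rates: {rates}")
print(f"statistical parity difference: {spd:.2f}")
print(f"disparate impact: {di:.2f}")
```

In this toy example, group B is referred half as often as group A. Whether a gap like that reflects unfairness depends on clinical context that a single number cannot capture, which is why metrics of this kind are best treated as prompts for closer review.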

We all have a role in addressing health inequities, and there is a lot we can all do. Each of us needs to be personally aware of the issue and committed to creating systems that ensure fairness in our practices and policies. We need to promote data diversity, transparency and health equity in our efforts for social good. We all need to examine and integrate the social determinants of health into our work.

As healthcare IT professionals and clinicians, we must come together to design more inclusive and complete data, advanced analytics and AI. Only with diversity in voices and experience can we advance the cause of equality and fairness in healthcare.


  1. www.research.ibm.com
  2. Pannucci C, Wilkins E. Identifying and Avoiding Bias in Research. PMID: 20679844
  3. Ferryman K, Pitcan M. Fairness in Precision Medicine. Data & Society, Feb 2018