Mitigating Bias in AI Models

Artificial intelligence (AI) holds significant power to improve the way we live and work, but AI systems are only as effective as the data they’re trained on. Bad training data can lead to higher error rates and biased decision making, even when the underlying model is sound.

As the adoption of AI increases, the issue of minimizing bias in AI models is rising to the forefront. Continually striving to identify and mitigate bias is absolutely essential to building trust and ensuring that these transformative technologies will have a net positive impact on society.

For more than a century, IBM has responsibly ushered revolutionary technologies into the world. We are dedicated to delivering AI services that are built responsibly and are unbiased and explainable. We are continually working to evaluate and update our services, advancing them in a way that is trustworthy and inclusive.

Ensuring a balanced representation of unbiased data sets in AI training is critical, and AI algorithms themselves are playing an increasingly important role in ensuring fairness in the use of AI. The issue of bias also requires the attention, expertise and engagement of a broad network of informed professionals.

To that end, Joy Buolamwini and Timnit Gebru recently published a paper, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification” (Conference on Fairness, Accountability, and Transparency, February 2018), that evaluates three commercial API-based visual recognition tools, including IBM Watson Visual Recognition. The study finds that these services’ facial recognition capabilities are not adequately balanced for gender and skin tone [1]. The authors show that the highest error rates involve images of dark-skinned women, while the most accurate results are for light-skinned men.

For the past nine months, IBM has been working toward substantially increasing the accuracy of its new Watson Visual Recognition service for facial analysis, which now uses broader training datasets and more robust recognition capabilities than the service evaluated in this study. Our new service, which will be released on February 23, demonstrates a nearly ten-fold decrease in error rate for facial analysis when measured with a test set similar to the one in Buolamwini and Gebru’s paper.

To conduct their study, Buolamwini and Gebru constructed a new facial image dataset, called Pilot Parliaments Benchmark, which is highly balanced across skin phenotype and gender.

To evaluate IBM’s new service in a manner consistent with their study, IBM Research gathered images of parliamentarians from Finland, Iceland, Rwanda, Senegal, South Africa and Sweden. This dataset is very similar to the Pilot Parliaments Benchmark. However, the experiment conducted by IBM Research differs slightly from the one in the paper [1] in two ways. First, the dataset is slightly different: a new election in Senegal has changed the member photos, and Rwanda has a smaller number of photos. As a result, IBM Research used 1,217 faces versus the 1,270 reported in the paper. Second, we labeled the “lighter” and “darker” classes for the faces manually, without using the Fitzpatrick score. We believe that these two minor differences do not affect the conclusions of the experiment. The table below shows our results.

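| Country | Total | Male | Female | Lighter Male | Darker Male | Lighter Female | Darker Female |
|---|---|---|---|---|---|---|---|
| Finland | 194 | 113 | 81 | 113 | 0 | 81 | 0 |
| Iceland | 63 | 39 | 24 | 39 | 0 | 24 | 0 |
| Rwanda | 26 | 16 | 10 | 0 | 16 | 0 | 10 |
| Senegal | 161 | 95 | 66 | 0 | 95 | 0 | 66 |
| South Africa | 424 | 246 | 178 | 63 | 183 | 27 | 151 |
| Sweden | 349 | 187 | 162 | 180 | 7 | 158 | 4 |
| All | 1217 | 696 | 521 | 395 | 301 | 290 | 231 |
| Errors (score threshold = 0.99) | 15 | 7 | 8 | 1 | 6 | 0 | 8 |
| Error rate | 1.23% | 1.005% | 1.535% | 0.253% | 1.99% | 0% | 3.46% |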


As the table shows, the error rates of IBM’s upcoming visual recognition service are significantly lower than those of the three systems presented in the paper. While the “darker” category still has higher error rates than the “lighter” one, the highest error rate, 3.46 percent (still for darker-skinned females), is now just a fraction of what it was in the old service. This reflects a nearly ten-fold decrease in error with our new Face Model. No new training or fine-tuning was done based on this dataset.
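The error rates in the table follow directly from the face and error counts in its last rows. The short Python sketch below recomputes them for the four intersectional groups and reports the largest gap between any two groups. The counts are copied from the table; the names `GROUPS`, `error_rate`, and `summarize` are illustrative only and are not part of any IBM tooling.

```python
# Counts copied from the results table: (faces, errors at score threshold 0.99).
GROUPS = {
    "Lighter Male": (395, 1),
    "Darker Male": (301, 6),
    "Lighter Female": (290, 0),
    "Darker Female": (231, 8),
}


def error_rate(faces, errors):
    """Return the error rate as a percentage of the faces evaluated."""
    return 100.0 * errors / faces


def summarize(groups):
    """Return per-group rates, the overall rate, and the largest gap between groups."""
    rates = {name: error_rate(n, e) for name, (n, e) in groups.items()}
    total_faces = sum(n for n, _ in groups.values())
    total_errors = sum(e for _, e in groups.values())
    overall = error_rate(total_faces, total_errors)
    gap = max(rates.values()) - min(rates.values())
    return rates, overall, gap


if __name__ == "__main__":
    rates, overall, gap = summarize(GROUPS)
    for name, rate in rates.items():
        print(f"{name:15s} {rate:5.2f}%")
    print(f"{'All':15s} {overall:5.2f}%")
    print(f"Largest gap between groups: {gap:.2f} percentage points")
```

Reporting the gap alongside the overall rate makes the remaining disparity explicit: a low aggregate error rate can still hide a group whose error rate is several times higher.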

IBM is deeply committed to delivering services that are unbiased, explainable, value aligned, and transparent. To deal with possible sources of bias, we have several ongoing projects to address dataset bias in facial analysis – including not only gender and skin type, but also bias related to age groups, ethnicities, and factors such as pose, illumination, resolution, expression, and decoration. We are currently creating a million-scale dataset of face images annotated with attributes and identity, leveraging geo-tags from Flickr images to balance data from multiple countries and active learning tools to reduce sample selection bias. We intend to make this data publicly available as a tool for the research community and propose a challenge to encourage the community to improve their algorithms with respect to bias in facial analysis. In addition, as a longer-term project, we are planning to conduct research on cycle-consistent adversarial networks to synthetically generate new training samples with specific attributes to reduce dataset bias across race, gender, and age.

More broadly, we are also developing algorithms for detecting, rating, and correcting bias and discrimination across modalities, both for data and for models. For example:

  • At the Conference on Fairness, Accountability, and Transparency in 2018, we will present a new text, image and video dataset we have constructed from Bollywood films and analyze it for gender bias in various ways [2].
  • In a paper at the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES) 2018, we presented a composable bias and fairness ratings system and architecture for API-based AI services (including all of the commercial classifiers studied by Buolamwini and Gebru) and demonstrated its applicability in the domain of language translation [3].
  • In a paper at the conference on Neural Information Processing Systems (NIPS) 2017, we presented a flexible optimization approach for transforming a training dataset into one that is fairer according to given protected attributes and can be used by any downstream AI system [4].
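To give a feel for what pre-processing a training dataset for fairness can look like, here is a minimal Python sketch of reweighing. This is a much simpler and different technique from the optimization approach in [4], and it is shown only as an illustration of the general idea. Each example gets a weight so that, after weighting, the label is statistically independent of the protected attribute. The function names and the toy data are hypothetical.

```python
from collections import Counter


def reweigh(groups, labels):
    """Return one weight per example: P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive (label == 1) examples within one group."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    positive = sum(
        w for g, y, w in zip(groups, labels, weights) if g == group and y == 1
    )
    return positive / total


# Toy data (hypothetical): group "a" has far more positive labels than group "b".
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
weights = reweigh(groups, labels)

if __name__ == "__main__":
    for g in ("a", "b"):
        rate = weighted_positive_rate(groups, labels, weights, g)
        print(f"group {g}: weighted positive rate = {rate:.2f}")
```

Because the output is just a per-example weight, it can be passed to any downstream learner that accepts sample weights. The approach in [4] goes further by transforming the data itself.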

We are actively working on transferring these and other research contributions into IBM’s core AI offerings. In doing so, we are taking a holistic view in which fairness detection and corrections occur throughout an overall data pipeline in an auditable and transparent manner; this perspective is summarized in a paper presented at the 2017 Data for Good Exchange conference [5].

Moreover, we do not view AI ethics simply from the perspective of easily quantifiable fairness results. Beyond any one product, we are actively pursuing a research agenda that includes explainability, computational morality, value alignment, and other topics that will be translated into IBM product and service offerings.

At IBM, we value multi-stakeholder collaboration on important technical and social questions. As a founding member of the Partnership on AI to Benefit People and Society, IBM is happy to see researchers such as Buolamwini and Gebru contributing positively to the advancement of artificial intelligence. These inquiries and the discussions they prompt are essential to promoting the responsible evolution of these transformative technologies.

Responsibility in the use of data and AI has to be a conversation and commitment that transcends any one company, and we recognize that IBM’s work on responsibility, transparency and accountability will never be complete. But we believe that a continual cycle of bias detection and mitigation — coupled with AI models that are transparent and explainable — is the best, and most responsible, path forward.


[1] J. Buolamwini and T. Gebru. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” Conference on Fairness, Accountability, and Transparency, New York, NY, February 2018.

[2] N. Madaan, S. Mehta, T. Agrawaal, V. Malhotra, A. Aggarwal, Y. Gupta, and M. Saxena. “Analyze, Detect and Remove Gender Stereotyping from Bollywood Movies,” Conference on Fairness, Accountability, and Transparency, New York, NY, February 2018.

[3] B. Srivastava and F. Rossi. “Towards Composable Bias Rating of AI Services,” AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, New Orleans, LA, February 2018.

[4] F. P. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney. “Optimized Pre-Processing for Discrimination Prevention,” Advances in Neural Information Processing Systems, Long Beach, CA, December 2017.

[5] S. Shaikh, H. Vishwakarma, S. Mehta, K. R. Varshney, K. N. Ramamurthy, and D. Wei. “An End-To-End Machine Learning Pipeline That Ensures Fairness Policies,” Data for Good Exchange Conference, New York, NY, September 2017.



Ruchir Puri

Chief Architect and IBM Fellow, IBM Watson and Cloud Platform