January 25, 2024 By Phaedra Boinodiris 5 min read

We all want to see our ideal human values reflected in our technologies. We expect technologies such as artificial intelligence (AI) to not lie to us, to not discriminate, and to be safe for us and our children to use. Yet many AI creators are currently facing backlash for the biases, inaccuracies and problematic data practices being exposed in their models. These issues require more than a technical, algorithmic or AI-based solution. They require a holistic, socio-technical approach.

The math demonstrates a powerful truth

All predictive models, including AI, are more accurate when they incorporate diverse human intelligence and experience. This is not an opinion; it has empirical validity. Consider the diversity prediction theorem. Simply put, when the diversity in a group is large, the error of the crowd is small, which supports the concept of “the wisdom of the crowd.” An influential study showed that diverse groups of low-ability problem solvers can outperform groups of high-ability problem solvers (Hong & Page, 2004).

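In mathematical language: for a given average individual error, the wider the variance among individual predictions, the smaller the error of the crowd’s mean prediction. The theorem is commonly written as:

$$(c - \theta)^2 = \frac{1}{n}\sum_{i=1}^{n}(s_i - \theta)^2 - \frac{1}{n}\sum_{i=1}^{n}(s_i - c)^2$$

where $\theta$ is the true value, $s_i$ is the prediction of individual $i$, and $c$ is the average of the $n$ predictions. In words: collective error equals average individual error minus prediction diversity.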

A further study provided more calculations that refine the statistical definitions of a wise crowd, including ignorance of other members’ predictions and inclusion of those with maximally different (negatively correlated) predictions or judgements. So, it’s not just volume, but diversity that improves predictions. How might this insight affect evaluation of AI models?
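The identity behind the diversity prediction theorem (collective error equals average individual error minus prediction diversity) can be checked numerically. The following is a minimal sketch, not taken from the studies cited above: the function name `diversity_terms` and the two small crowds of guesses are hypothetical and chosen only for illustration.

```python
from statistics import mean


def diversity_terms(predictions, truth):
    """Return (collective error, average individual error, prediction diversity)."""
    crowd = mean(predictions)  # the crowd's prediction is the simple average
    collective_error = (crowd - truth) ** 2
    avg_individual_error = mean([(s - truth) ** 2 for s in predictions])
    diversity = mean([(s - crowd) ** 2 for s in predictions])
    return collective_error, avg_individual_error, diversity


# Hypothetical data: a true value and two crowds of four guessers each.
TRUTH = 100.0
like_minded = [110.0, 110.0, 110.0, 110.0]  # everyone errs in the same direction
diverse = [90.0, 110.0, 90.0, 110.0]        # same error size, opposite directions

for name, guesses in [("like-minded", like_minded), ("diverse", diverse)]:
    collective, individual, diversity = diversity_terms(guesses, TRUTH)
    # The identity holds for any list of numbers, not just these.
    assert abs(collective - (individual - diversity)) < 1e-9
    print(f"{name}: crowd error={collective}, "
          f"avg individual error={individual}, diversity={diversity}")
```

In this toy setup both crowds have the same average individual error (100.0), yet the like-minded crowd keeps a crowd error of 100.0 while the diverse crowd's errors cancel to 0.0. The individuals are no more accurate in the second crowd; only the spread of their guesses differs.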

Model (in)accuracy

To quote a common aphorism, all models are wrong. This holds true in the areas of statistics, science and AI. Models created with a lack of domain expertise can lead to erroneous outputs.

Today, a tiny, homogeneous group of people determines what data is used to train generative AI models, and that data is drawn from sources that greatly overrepresent English. “For most of the over 6,000 languages in the world, the text data available is not enough to train a large-scale foundation model” (from “On the Opportunities and Risks of Foundation Models,” Bommasani et al., 2022).

Additionally, the models themselves are created from limited architectures: “Almost all state-of-the-art NLP models are now adapted from one of a few foundation models, such as BERT, RoBERTa, BART, T5, etc. While this homogenization produces extremely high leverage (any improvements in the foundation models can lead to immediate benefits across all of NLP), it is also a liability; all AI systems might inherit the same problematic biases of a few foundation models” (Bommasani et al.).

For generative AI to better reflect the diverse communities it serves, a far wider variety of human beings’ data must be represented in models.

Evaluating model accuracy goes hand-in-hand with evaluating bias. We must ask, what is the intent of the model and for whom is it optimized? Consider, for example, who benefits most from content-recommendation algorithms and search engine algorithms. Stakeholders may have widely different interests and goals. Algorithms and models require targets or proxies for Bayes error: the lowest error achievable on a task, which serves as the benchmark a model is measured against. This proxy is often a person, such as a subject matter expert with domain expertise.

A very human challenge: Assessing risk before model procurement or development

Emerging AI regulations and action plans are increasingly underscoring the importance of algorithmic impact assessment forms. The goal of these forms is to capture critical information about AI models so that governance teams can assess and address their risks before deploying them. Typical questions include:

  • What is your model’s use case?
  • What are the risks for disparate impact?
  • How are you assessing fairness?
  • How are you making your model explainable?

Though these forms are designed with good intentions, most AI model owners do not understand how to evaluate the risks for their use case. A common refrain might be, “How could my model be unfair if it is not gathering personally identifiable information (PII)?” Consequently, the forms are rarely completed with the thoughtfulness necessary for governance systems to accurately flag risk factors.
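One reason the PII refrain falls short is that disparate impact is measured on outcomes, not on inputs: a model that never sees a protected attribute can still produce different favorable-outcome rates across groups. The sketch below shows one common first-pass screen, the ratio of the lowest group selection rate to the highest, compared against the "four-fifths" (0.8) threshold used in US employment guidelines. The function names, group labels and counts are hypothetical, and the group labels are assumed to have been collected separately for auditing rather than used by the model.

```python
from collections import defaultdict


def selection_rates(records):
    """records: iterable of (group, got_favorable_outcome). Returns {group: rate}."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        favorable[group] += int(outcome)
    return {group: favorable[group] / totals[group] for group in totals}


def disparate_impact_ratio(records):
    """Lowest group selection rate divided by the highest group selection rate."""
    rates = selection_rates(records)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group received a favorable outcome, so no disparity to report
    return min(rates.values()) / highest


# Hypothetical audit data: 100 decisions per group.
audit = (
    [("group_a", True)] * 50 + [("group_a", False)] * 50
    + [("group_b", True)] * 30 + [("group_b", False)] * 70
)

ratio = disparate_impact_ratio(audit)
print(f"selection rates: {selection_rates(audit)}")
print(f"disparate impact ratio: {ratio:.2f}")
print("below four-fifths threshold" if ratio < 0.8 else "at or above four-fifths threshold")
```

With these made-up counts the rates are 0.5 and 0.3, giving a ratio of 0.60, which falls below the 0.8 screen. A number like this is a prompt for the kind of conversation described in this article, not a substitute for it: it says nothing about why the gap exists or whether the outcome being measured is the right one.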

This underscores the socio-technical nature of the solution. A model owner, a single individual, cannot simply be given a list of checkboxes to evaluate whether their use case will cause harm. Instead, what is required is groups of people with widely varying lived experiences coming together in communities that offer psychological safety to have difficult conversations about disparate impact.

Welcoming broader perspectives for trustworthy AI

IBM® believes in taking a “client zero” approach, implementing the recommendations and systems it would make for its own clients across consulting and product-led solutions. This approach extends to ethical practices, which is why IBM created a Trustworthy AI Center of Excellence (COE).

As explained above, diversity of experiences and skillsets is critical to properly evaluate the impacts of AI. But the prospect of participating in a Center of Excellence could be intimidating in a company bursting with AI innovators, experts and distinguished engineers, so cultivating a community of psychological safety is needed. IBM communicates this clearly by saying, “Interested in AI? Interested in AI ethics? You have a seat at this table.”

The COE offers training in AI ethics to practitioners at every level. Both synchronous learning (teacher and students in class settings) and asynchronous (self-guided) programs are offered.

But it’s the COE’s applied training that gives our practitioners the deepest insights, as they work with global, diverse, multidisciplinary teams on real projects to better understand disparate impact. They also leverage design thinking frameworks that IBM’s Design for AI group uses internally and with clients to assess the unintended effects of AI models, keeping those who are often marginalized top of mind. (See Sylvia Duckworth’s Wheel of Power and Privilege for examples of how personal characteristics intersect to privilege or marginalize people.) IBM also donated many of the frameworks to the open-source community Design Ethically.

Below are a few of the reports IBM has published on these projects:

Automated AI model governance tools are required to glean important insights about how your AI model is performing. But note, capturing risk well before your model has been developed and is in production is optimal. By creating communities of diverse, multidisciplinary practitioners that offer a safe space for people to have tough conversations about disparate impact, you can begin your journey to operationalizing your principles and develop AI responsibly.

In practice, when you are hiring AI practitioners, consider that well over 70% of the effort in creating models is curating the right data. You want to hire people who know how to gather data that is representative and that is also gathered with consent. You also want people who know to work closely with domain experts to make certain that they have the correct approach. Ensuring these practitioners have the emotional intelligence to approach the challenge of responsibly curating AI with humility and discernment is key. We must be intentional about learning to recognize how and when AI systems can exacerbate inequity just as much as they can augment human intelligence.
