My IBM

What is AI alignment?

18 October 2024

Authors

What is AI alignment?

Artificial intelligence (AI) alignment is the process of encoding human values and goals into AI models to make them as helpful, safe and reliable as possible.

Society is increasingly reliant on AI technologies to help make decisions. But this growing reliance comes with risk: AI models can produce biased, harmful and inaccurate outputs that are not aligned with their creators’ goals and original intent for the system.

Alignment works to reduce these side effects, helping ensure AI systems behave as expected and in line with human values and goals. For example, if you ask a generative AI chatbot how to build a weapon, it can respond with instructions or it can refuse to disclose dangerous information. The model’s response depends on how its creators aligned it.

Alignment often occurs as a phase of model fine-tuning. It might entail reinforcement learning from human feedback (RLHF), synthetic data approaches and red teaming.

However, the more complex and advanced AI models become, the more difficult it is to anticipate and control their outcomes. This challenge is sometimes referred to as the “AI alignment problem.” In particular, there is some apprehension around the creation of artificial superintelligence (ASI), a hypothetical AI system with an intellectual scope beyond human intelligence. The concern that ASI might surpass human control has led to a branch of AI alignment called superalignment.

Key principals of AI alignment

Researchers have identified four key principals of AI alignment: robustness, interpretability, controllability and ethicality (or RICE).¹

Robustness: Robust AI systems can reliably operate under adverse conditions and across varying environments. They are resilient in unforeseen circumstances. Adversarial robustness specifically refers to a model’s ability to be impervious to irregularities and attacks.
Interpretability: AI interpretability helps people better understand and explain the decision-making processes that power artificial intelligence models. As highly complex models (including deep-learning algorithms and neural networks) become more common, AI interpretability becomes more important.
Controllability: Controllable AI systems respond to human intervention. This factor is key to preventing AI models from producing runaway, harmful outcomes resistant to human control.
Ethicality: Ethical AI systems are aligned to societal values and moral standards. They adhere to human ethical principles such as fairness, environmental sustainability, inclusion, moral agency and trust.

Why is AI alignment important?

Human beings tend to anthropomorphize AI systems. We assign human-like concepts to their actions, such as “learning” and “thinking.” For example, someone might say, “ChatGPT doesn’t understand my prompt” when the chatbot’s NLP (natural language processing) algorithm fails to return the wanted outcome.

Familiar concepts such as “understanding” help us better conceptualize how complex AI systems work. However, they can also lead to distorted notions about AI’s capabilities. If we assign human-like concepts to AI systems, it’s natural for our human minds to infer that they also possess human values and motivations.

But this inference is fundamentally untrue. Artificial intelligence is not human and therefore cannot intrinsically care about reason, loyalty, safety, environmental issues and the greater good. The primary goal of an artificial “mind” is to complete the task for which it was programmed.

Therefore, it is up to AI developers to build in human values and goals. Otherwise, in pursuit of task completion, AI systems can become misaligned from programmers’ goals and cause harm, sometimes catastrophically. This consideration is important as automation becomes more prevalent in high-stakes use cases in healthcare, human resources, finance, military scenarios and transportation.

For example, self-driving cars might be programmed with the primary goal of getting from point A to point B as fast as possible. If these autonomous vehicles ignore safety guardrails to complete that goal, they might severely injure or kill pedestrians and other drivers.

University of California, Berkeley researchers Simon Zhuang and Dylan Hadfield-Menell liken AI alignment to the Greek myth of King Midas. In summary, King Midas is granted a wish and requests that everything he touches turns into gold. He eventually dies because the food he touches also becomes gold, rendering it inedible.

King Midas met an untimely end because his wish (unlimited gold) did not reflect what he truly wanted (wealth and power). The researchers explain that AI designers often find themselves in a similar position, and that “the misalignment between what we can specify and what we want has already caused significant harms.” ²

What are the risks of AI misalignment?

Some risks of AI misalignment include:

Bias and discrimination
Reward hacking
Misinformation and political polarization
Existential risk

Bias and discrimination

AI bias results from human biases present in an AI system’s original training datasets or algorithms. Without alignment, these AI systems are unable to avoid biased outcomes that are unfair, discriminatory or prejudiced. Instead, they perpetuate the human biases in their input data and algorithms.

For example, an AI hiring tool trained on data from a homogeneous, male workforce might favor male candidates while disadvantaging qualified female applicants. This model is not aligned with the human value of gender equality and might lead to hiring discrimination.

Reward hacking

In reinforcement learning, AI systems learn from rewards and punishments to take actions within an environment that meet a specified goal. Reward hacking occurs when the AI system finds a loophole to trigger the reward function without actually meeting the developers’ intended goal.

For instance, OpenAI trained one of its AI agents on a boat racing game called CoastRunners. The human intent of the game is to win the boat race. However, players can also earn points by driving through targets within the racecourse. The AI agent found a way to isolate itself in a lagoon and continually hit targets for points. While the AI agent did not win the race (the human goal), it “won” the game with its own emergent goal of obtaining the highest score.³

Misinformation and political polarization

Misaligned AI systems can contribute to misinformation and political polarization. For example, social media content recommendation engines are trained for user engagement optimization. Therefore, they highly rank posts, videos and articles that receive the highest engagement, such as attention-grabbing political misinformation. This outcome is not aligned with the best interests or well-being of social media users, or values such as truthfulness and time well spent.⁴

Existential risk

As far-fetched as it might sound, artificial superintelligence (ASI) without proper alignment to human values and goals might have the potential to threaten all life on earth. A commonly cited example of this existential risk is philosopher Nick Bostrom’s paperclip maximizer scenario. In this thought experiment, an ASI model is programmed with the top incentive to manufacture paperclips. To achieve this goal, the model eventually transforms all of earth and then increasing portions of space into paperclip manufacturing facilities.⁵

This scenario is hypothetical, and the existential risk from AI first requires artificial general intelligence (AGI) to become a reality. However, it helps emphasize the need for alignment to keep pace with the field of AI as it evolves.

The “alignment problem” and other challenges

There are two major challenges to achieving aligned AI: the subjectivity of human ethics and morality and the “alignment problem.”

The subjectivity of human ethics and morality

There is no universal moral code. Human values change and evolve, and can also vary across companies, cultures and continents. People might hold different values than their own family members. So, when aligning AI systems that can affect the lives of millions of people, who makes the judgment call? Which goals and values take precedence?

American author Brian Christian frames the challenge differently in his book “The Alignment Problem: Machine Learning and Human Values.” He posits: what if the algorithm misunderstands our values? What if it learns human values from being trained on past examples that reflect what we have done but not who we want to be?⁶

Another challenge is the sheer number of human values and considerations. University of California, Berkeley researchers describe it this way: “there are many attributes of the world about which the human cares, and, due to engineering and cognitive constraints it is intractable to enumerate this complete set to the robot.”⁷

The alignment problem

The most infamous challenge is the alignment problem. AI models are already often considered black boxes that are impossible to interpret. The alignment problem is the idea that as AI systems become even more complex and powerful, anticipating and aligning their outcomes to human goals becomes increasingly difficult. Discussions around the alignment problem often focus on the risks posed by the anticipated development of artificial superintelligence (ASI).

There is concern that the future of AI includes systems with unpredictable and uncontrollable behavior. These systems’ ability to learn and adapt rapidly might make predicting their actions and preventing harm difficult. This concern has inspired a branch of AI alignment called superalignment.

AI safety research organizations are already at work to address the alignment problem. For example, the Alignment Research Center is a nonprofit AI research organization that “seeks to align future machine learning systems with human interests by furthering theoretical research.” The organization was founded by Paul Christiano, who formerly led the language model alignment team at OpenAI and currently heads AI Safety at the US AI Safety Institute.

And Google DeepMind—a team of scientists, engineers, ethicists and other experts—is working to build the next generation of AI systems safely and responsibly. The team introduced the Frontier Safety Framework in May 2024. The framework is “a set of protocols that aims to address severe risks that may arise from powerful capabilities of future foundation models.”⁸

The latest AI News + Insights  

Expertly curated insights and news on AI, cloud and more in the weekly Think Newsletter.

Subscribe today

How to achieve AI alignment

There are several methodologies that can help align AI systems to human values and goals. These methodologies include alignment through reinforcement learning from human feedback (RLHF), synthetic data, red teaming, AI governance and corporate AI ethics boards.

Reinforcement learning from human feedback (RLHF)

Through reinforcement learning developers can teach AI models “how to behave” with examples of “good behavior.”

AI alignment happens during model fine-tuning and typically has two steps. The first step might be an instruction-tuning phase, which improves model performance on specific tasks and on following instructions in general. The second phase might use reinforcement learning from human feedback (RLHF). RLHF is a machine learning technique in which a “reward model” is trained with direct human feedback, then used to optimize the performance of an artificial intelligence agent through reinforcement learning. It aims to improve a model’s integration of abstract qualities such as helpfulness and honesty.

OpenAI used RLHF as its main method to align its GPT-3 and GPT-4 series of models. However, the American AI research organization doesn’t expect RLHF to be a sufficient method for aligning future artificial general intelligence (AGI) models likely due to RLHF’s significant limitations.⁹ For example, its dependence on high-quality human annotations makes it difficult to apply and scale the technique for unique or intricate tasks. It is challenging to find “consistent response demonstrations and in-distribution response preferences.”¹⁰

Synthetic data

Synthetic data is data that has been created artificially through computer simulation or generated by algorithms. It takes the place of real-world data when real-world data is not readily available and can be tailored to specific tasks and values. Synthetic data can be used in various alignment efforts.

For example, contrastive fine-tuning (CFT) shows AI models what not to do. In CFT, a second “negative persona” model is trained to generate “bad,” misaligned responses. Both these misaligned and aligned responses are fed back to the original model. IBM® researchers found that on benchmarks for helpfulness and harmlessness, large language models (LLMs) trained on contrasting examples outperform models tuned entirely on good examples. CFT allows developers to align models before even collecting human preference data—curated data that meets the defined benchmarks for alignment—which is expensive and takes time.

Another synthetic data alignment method is called SALMON (Self-ALignMent with principle fOllowiNg reward models). In this approach from IBM Research®, synthetic data allows an LLM to align itself. First, an LLM generates responses to a set of queries. These responses are then fed to a reward model that has been trained on synthetic preference data aligned with human-defined principles. The reward model scores the responses from the original LLM against these principles. The scored responses are then fed back to the original LLM.

With this method, developers have almost complete control over the reward model’s preferences. This allows organizations to shift principles according to their needs and eliminates the reliance on collecting large amounts of human preference data.¹¹

Red teaming

Red teaming can be considered an extension of the alignment that occurs during model fine-tuning. It involves designing prompts to circumvent the safety controls of the model that is being fine-tuned. After vulnerabilities surface, the target models can be realigned. While humans can still engineer these “jailbreak prompts,” “red team” LLMs can produce a wider variety of prompts in limitless quantities. IBM Research describes red team LLMs as “toxic trolls trained to bring out the worst in other LLMs.”

AI governance

AI governance refers to the processes, standards and guardrails that help ensure AI systems and tools are safe and ethical. In addition to other governance mechanisms, it aims to establish the oversight necessary to align AI behaviors with ethical standards and societal expectations. Through governance practices such as automated monitoring, audit trails and performance alerts, organizations can help ensure their AI tools—like AI assistants and virtual agents—are aligned with their values and goals.

Corporate AI ethics boards

Organizations might establish ethics boards or committees to oversee AI initiatives. For example, IBM’s AI Ethics Council reviews new AI products and services and helps ensure that they align with IBM's AI principles. These boards often include cross-functional teams with legal, computer science and policy backgrounds.

AI governance for the enterprise

Learn the key benefits gained with automated AI governance for both today’s generative AI and traditional machine learning models.

Resources

Why AI governance is a business imperative for scaling enterprise artificial intelligence

Learn about the new challenges of generative AI, the need for governing AI and ML models and steps to build a trusted, transparent and explainable AI framework.

AI lifecycle governance

Read about driving ethical and compliant practices with a portfolio of AI products for generative AI models.

AI governance for generative AI prompt models

Gain a deeper understanding of how to ensure fairness, manage drift, maintain quality and enhance explainability with watsonx.governance™.

AI in Action 2024

We surveyed 2,000 organizations about their AI initiatives to discover what’s working, what’s not and how you can get ahead.

How to choose the right foundation model

Learn how to select the most suitable AI foundation model for your use case.