
What is superalignment?

12 November 2024

Authors

Alexandra Jonker

Editorial Content Lead

Amanda McGrath

Writer

IBM


Superalignment is the process of supervising, controlling and governing artificial superintelligence systems. Aligning advanced AI systems with human values and goals can help prevent them from exhibiting harmful and uncontrollable behavior.
 

Artificial superintelligence (ASI) is still a hypothetical concept. Therefore, current AI alignment efforts largely focus on making today’s AI models helpful, safe and reliable. For example, alignment helps ensure that AI chatbots such as ChatGPT don’t perpetuate human bias and can’t be exploited by bad actors.

But as AI becomes more complex and advanced, its outputs become more difficult to anticipate and align with human intent. This challenge is often referred to as “the alignment problem.” There is concern that superintelligent AI systems could one day reach a breaking point and circumvent human control entirely. And some experts believe that present-day AI risks could become exponentially more severe as AI advances.

These concerns, among others, have inspired an emergent branch of advanced alignment efforts known as superalignment.



What is artificial superintelligence?

To understand artificial superintelligence (ASI), it’s helpful to see it in context with the other types of artificial intelligence: artificial narrow intelligence (ANI) and artificial general intelligence (AGI). We can rank the three types based on their capabilities:

  • ANI: At the entry level are the AI systems we use today. These systems are considered artificial narrow intelligence (ANI), weak AI or narrow AI technologies. Common examples include autonomous vehicles, large language models (LLMs) and generative AI tools.  

  • AGI: The next level is artificial general intelligence (AGI), also known as strong AI or general AI. While still theoretical, AGI, if ever realized, would have human-level intelligence. Where weak AI focuses on performing a specific task, strong AI could perform a variety of functions, eventually teaching itself to solve new problems.

  • ASI: At the top level is artificial superintelligence (ASI). ASI systems are hypothetical AI technologies with an intellectual scope beyond human-level intelligence. Superintelligent AI would have cutting-edge cognitive functions and highly developed thinking skills. However, the real-world feasibility of ASI is contested. The workings of the human brain are still not fully understood—making it difficult to re-create via algorithms and computer science.

Why do we need superalignment?

The field of AI is making impressive technological breakthroughs. For example, DeepMind’s AlphaFold 3 can predict molecular structures and interactions with extraordinary accuracy. And OpenAI’s GPT-4o can reason across audio, vision and text in real time.

Despite these advancements, AI is still not human. AI does not intrinsically care about reason, loyalty or safety. It has one goal: to complete the task for which it was programmed.

Therefore, it is up to AI developers to build in human values and goals. Otherwise, misalignment occurs and AI systems can produce harmful outputs that lead to bias, discrimination and misinformation.

Present-day alignment efforts work to keep weak AI systems in line with human values and goals. But AGI and ASI systems could be exponentially riskier, harder to understand and more difficult to control. Current AI alignment techniques, which rely on human intelligence, are likely inadequate for aligning AI systems that are smarter than humans.

For example, reinforcement learning from human feedback (RLHF) is a machine learning technique in which a “reward model” is trained with direct human feedback. OpenAI used RLHF as its main method to align its GPT-3 and GPT-4 series of models behind ChatGPT, all considered weak AI models. Significantly more advanced alignment techniques will be necessary to help ensure that superintelligent AI systems possess similar levels of robustness, interpretability, controllability and ethicality.
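
To make the reward-model idea concrete, here is a minimal sketch of pairwise preference training in PyTorch. Every detail (the toy model, the dimensions, the random stand-in data) is invented for illustration; it is not OpenAI’s implementation.

    # Minimal sketch of RLHF reward-model training on pairwise human
    # preferences. The model, dimensions and data are toy stand-ins.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Toy scorer: embeds token IDs and outputs one scalar reward."""
        def __init__(self, vocab_size: int = 1000, dim: int = 64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.head = nn.Linear(dim, 1)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            pooled = self.embed(token_ids).mean(dim=1)  # (batch, dim)
            return self.head(pooled).squeeze(-1)        # (batch,) rewards

    def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry-style objective: the human-preferred response
        # should receive a higher reward than the rejected one.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    model = RewardModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # One training step on a fake batch of tokenized (chosen, rejected) pairs.
    chosen = torch.randint(0, 1000, (8, 32))
    rejected = torch.randint(0, 1000, (8, 32))
    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()
    optimizer.step()

In full RLHF, a reward model trained this way would then guide a policy-optimization step (often proximal policy optimization) on the language model itself.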

What are the risks of advanced AI systems?

Without superalignment, advanced AI systems could introduce several risks, including:

  • Loss of control
  • Unintended consequences
  • Bias and discrimination
  • Societal and economic disruption
  • AI dependence

Loss of control

If advanced AI systems become so complex and misaligned that human oversight is impossible, their outcomes could be unpredictable and uncontrollable. A humanoid robotic takeover scenario is considered unlikely by most experts. However, an AGI or ASI system that drifts too far from its intended goals could be catastrophic in high-risk situations, such as in critical infrastructure or national defense.

Unintended consequences

Superintelligent AI could pursue goals in ways that are existentially detrimental to humanity. A commonly cited example is philosopher Nick Bostrom’s paperclip maximizer thought experiment in which an ASI model is programmed to make paperclips. With superhuman computing power, the model eventually transforms everything—even parts of space—into paperclip manufacturing facilities in pursuit of its goal.1
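
As a toy illustration of the underlying failure mode, objective misspecification, consider an optimizer whose objective counts paperclips and nothing else. This Python sketch is purely illustrative; every name and number in it is invented.

    # Toy illustration of objective misspecification: the objective
    # counts paperclips and nothing else, so the "optimal" plan
    # consumes every available resource. All values are invented.
    resources = {"wire": 100, "factories": 50, "everything_else": 10_000}

    def objective(plan: dict) -> int:
        return plan["paperclips"]  # no term for anything else humans value

    def best_plan(available: dict) -> dict:
        # With no penalty on side effects, converting all resources
        # into paperclip production maximizes the objective.
        return {"paperclips": sum(available.values()), "resources_left": 0}

    plan = best_plan(resources)
    print(f"{objective(plan)} paperclips, {plan['resources_left']} resources left")

The point is not the arithmetic but the missing terms: an objective that omits what humans value implicitly assigns it zero weight.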

Bias and discrimination

While there are several reliable methods to mitigate bias in AI systems, the risk remains a consideration for future AI. Advanced AI systems could perpetuate human biases, producing unfair or discriminatory outcomes. Due to system complexity, these biased outcomes could be difficult to identify and mitigate. AI bias is especially concerning in areas such as healthcare, law enforcement and human resources.

Societal and economic disruption

Bad actors could exploit superintelligent AI for nefarious purposes such as social control or large-scale financial hacking. However, societal and economic disruption could also happen if industries adopt advanced AI without the necessary legal or regulatory frameworks.

For example, financial AI agents are increasingly used for tasks such as trading or asset management—but accountability for their actions is often unclear. Who is liable should an AI agent violate SEC regulations? As the technology matures, this lack of accountability could lead to mistrust and instability.2

AI dependence

Some conversations around ASI pose the concern that humans could eventually become too reliant on advanced AI systems. As a result, we could potentially lose cognitive and decision-making abilities. Similarly, depending too heavily on AI in areas such as cybersecurity could lead to complacency from human teams. AI is not infallible and human oversight is still needed to help ensure that all threats are mitigated. 

Superalignment techniques

There are currently several techniques for aligning AI, including reinforcement learning from human feedback (RLHF), synthetic data approaches and adversarial testing. But these methods are likely inadequate for aligning superintelligent AI models. And, as of this writing, neither AGI nor ASI exists, so there are no established methods for aligning these more complex AI systems.

However, there are several superalignment ideas with promising research results:

Scalable oversight

Humans cannot reliably supervise AI systems that are smarter than they are. Scalable oversight is a training approach in which humans could use weaker AI systems to help supervise and align more complex ones.

Research to test and expand this technique is limited—because superintelligent AI systems do not yet exist. However, researchers at Anthropic (an AI safety and research company) have performed a proof-of-concept experiment.

In the experiment, human participants were directed to answer questions with the help of an LLM. These AI-assisted humans outperformed both the model on its own and unaided humans on the metric of accuracy. In their findings, the researchers said these results are encouraging and help confirm the idea that LLMs “can help humans achieve difficult tasks in settings that are relevant to scalable oversight.”3
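
A small simulation can show why the assisted condition can beat both baselines. In this sketch, the human keeps their own answer when they know it and otherwise adopts the model’s suggestion if they judge it sound; all probabilities are invented placeholders, not figures from the Anthropic paper.

    # Toy simulation of the three conditions in an AI-assisted
    # question-answering study. All probabilities are invented.
    import random

    random.seed(0)
    N = 10_000
    P_HUMAN = 0.60   # human answers correctly unaided (invented)
    P_MODEL = 0.65   # model's suggested answer is correct (invented)
    P_JUDGE = 0.80   # human correctly recognizes a sound suggestion (invented)

    human_only = model_only = assisted = 0
    for _ in range(N):
        human_knows = random.random() < P_HUMAN
        model_right = random.random() < P_MODEL
        judges_well = random.random() < P_JUDGE
        human_only += human_knows
        model_only += model_right
        # Assisted protocol: keep your own answer when you know it;
        # otherwise adopt the model's suggestion if you judge it sound.
        assisted += human_knows or (model_right and judges_well)

    print(f"human alone:   {human_only / N:.1%}")
    print(f"model alone:   {model_only / N:.1%}")
    print(f"human + model: {assisted / N:.1%}")

With these made-up numbers, the assisted condition lands near 81%, above either baseline on its own, mirroring the qualitative pattern the researchers reported.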

Weak-to-strong generalization

Generalization is the capacity for AI systems to reliably make predictions from data they were not trained on. Weak-to-strong generalization is an AI training technique in which weaker models are used to train stronger models to perform better on novel data, improving generalization.
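
The setup can be sketched on a toy classification task: a deliberately handicapped “weak supervisor” is trained on ground-truth labels, and a stronger “student” is then trained only on the supervisor’s imperfect labels and evaluated against ground truth. The scikit-learn example below illustrates the pipeline only; whether the student actually surpasses its supervisor depends on the models’ inductive biases, and real experiments rely on pretrained representations that this toy lacks.

    # Toy weak-to-strong pipeline: a strong "student" learns only from
    # a weak "supervisor's" labels, never from ground truth. Models and
    # data are stand-ins; this shows the setup, not OpenAI's experiment.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
    X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=500, random_state=0)
    X_stu, X_test, y_stu, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    # 1. Train the weak supervisor on ground truth, handicapped so it
    #    sees only the first two features.
    weak = LogisticRegression().fit(X_sup[:, :2], y_sup)

    # 2. The supervisor labels fresh data; the student trains on those
    #    imperfect labels with access to all features (y_stu is never
    #    used: the student sees no ground truth).
    weak_labels = weak.predict(X_stu[:, :2])
    strong = GradientBoostingClassifier().fit(X_stu, weak_labels)

    # 3. Evaluate both against held-out ground-truth labels.
    print("weak supervisor accuracy:", weak.score(X_test[:, :2], y_test))
    print("strong student accuracy: ", strong.score(X_test, y_test))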

OpenAI’s superalignment team—co-led by Ilya Sutskever (OpenAI cofounder and former Chief Scientist) and Jan Leike (former Head of Alignment)—discussed weak-to-strong generalization in its first research paper. The experiment used a “weak” GPT-2-level model to fine-tune a GPT-4-level model. Using this method, the team found that the resulting model’s performance fell between that of a GPT-3-level and a GPT-3.5-level model. They concluded that weak-to-strong methods can meaningfully improve generalization.

Regarding superalignment, this proof-of-concept demo shows that substantial improvement to weak-to-strong generalization is possible. According to the team’s resulting research paper, “it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.”4 And a follow-up study from Beijing Jiaotong University showed that weak-to-strong generalization can be improved using scalable oversight.5

However, OpenAI’s superalignment team was dissolved in May 2024 due to priority shifts within the company. In a social media post, CEO Sam Altman thanked the team and said that OpenAI has “[put] in place the foundations needed for safe deployment of increasingly capable systems.”6

Automated alignment research

Further down the alignment pipeline sits automated alignment research. This superalignment technique uses already aligned superhuman AI systems to perform alignment research automatically. These “AI researchers” would be faster and smarter than human researchers, and could potentially devise new superalignment techniques. Rather than developing and implementing technical alignment research directly, human researchers would review the research these systems generate.
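
Schematically, this workflow inverts today’s division of labor: models generate candidate research while humans act as reviewers. The sketch below captures only that control flow; the generator and the review criterion are hypothetical placeholders.

    # Schematic of an automated alignment research loop: an AI system
    # proposes research, humans review and approve it. Both functions
    # are hypothetical placeholders for illustration only.
    from dataclasses import dataclass

    @dataclass
    class Proposal:
        title: str
        method: str

    def ai_generate_proposals(n: int) -> list[Proposal]:
        # Stand-in for an aligned, superhuman model proposing techniques.
        return [Proposal(f"candidate-technique-{i}",
                         "evaluate against held-out alignment benchmarks")
                for i in range(n)]

    def human_review(p: Proposal) -> bool:
        # Humans shift from producing research to auditing it; this
        # acceptance rule is a trivial placeholder.
        return "benchmark" in p.method

    proposals = ai_generate_proposals(3)
    approved = [p for p in proposals if human_review(p)]
    print(f"{len(approved)} of {len(proposals)} AI-generated proposals approved")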

Leopold Aschenbrenner, an AGI investor and former member of the superalignment team at OpenAI, describes the vast potential of this technique: “If we manage to align somewhat-superhuman systems enough to trust them, we’ll be in an incredible position: we’ll have millions of automated AI researchers, smarter than the best AI researchers, at our disposal.”7

Superalignment vs. AI innovation

Superalignment faces many challenges. For example, who defines the benchmarks for values, goals and ethics? But one challenge casts a shadow over them all: it is extremely difficult to devise reliable alignment techniques for powerful AI systems that not only outsmart us but also, for now, exist only in theory.

Industry experts also face philosophical disagreements concerning superalignment. For example, some AI labs posit that focusing AI development efforts on aligning future AI systems could impede current AI priorities and new research. On the other side, AI safety proponents argue that the risks of superintelligence are too severe to ignore and outweigh potential benefits.

The latter line of thinking inspired OpenAI's former chief scientist Ilya Sutskever to join investor Daniel Gross and former OpenAI researcher Daniel Levy in creating Safe Superintelligence Inc. The startup’s singular focus is “building safe superintelligence (SSI)” without “distraction by management overhead or product cycles” and progress “insulated from short-term commercial pressures.”8

Footnotes

1 “Ethical Issues in Advanced Artificial Intelligence,” Nick Bostrom, n.d.

2 “Will Financial AI Agents Destroy The Economy?,” The Tech Buzz, 25 October 2024.

3 “Measuring Progress on Scalable Oversight for Large Language Models,” Anthropic, 4 November 2022.

4 “Weak-to-strong generalization,” OpenAI, 14 December 2023.

5 “Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning,” arXiv, 1 February 2024.

6 X post, Greg Brockman, 18 May 2024.

7 “Superalignment,” Situational Awareness: The Decade Ahead, June 2024.

8 “Superintelligence is within reach,” Safe Superintelligence Inc., 19 June 2024.