12 November 2024
Artificial superintelligence (ASI) is still a hypothetical concept. Therefore, current AI alignment efforts largely focus on making today’s AI models helpful, safe and reliable. For example, alignment helps ensure that AI chatbots such as ChatGPT aren’t perpetuating human bias or able to be exploited by bad actors.
But as AI becomes more complex and advanced, its outputs become more difficult to anticipate and align with human intent. This challenge is often referred to as “the alignment problem.” There is concern that superintelligent AI systems could one day reach a breaking point and circumvent human control entirely. And some experts believe that present-day AI risks could become exponentially more severe as AI advances.
These concerns, among others, have inspired an emergent branch of advanced alignment efforts known as superalignment.
To understand artificial superintelligence (ASI), it’s helpful to see it in context with the other types of artificial intelligence: artificial narrow intelligence (ANI) and artificial general intelligence (AGI). We can rank the three types based on their capabilities:
The field of AI is making impressive technological breakthroughs. For example, DeepMind’s AlphaFold 3 can predict molecular structure and interaction with extraordinary accuracy. And OpenAI’s GPT-4o can reason in real time.
Despite these advancements, AI is still not human. AI does not intrinsically care about reason, loyalty or safety. It has one goal: to complete the task for which it was programmed.
Therefore, it is up to AI developers to build in human values and goals. Otherwise, misalignment occurs and AI systems can produce harmful outputs that lead to bias, discrimination and misinformation.
Present-day alignment efforts work to keep weak AI systems in line with human values and goals. But AGI and ASI systems could be exponentially riskier, harder to understand and more difficult to control. Current AI alignment techniques, which rely on human intelligence, are likely inadequate for aligning AI systems that are smarter than humans.
For example, reinforcement learning from human feedback (RLHF) is a machine learning technique in which a “reward model” is trained with direct human feedback. OpenAI used RLHF as its main method to align its GPT-3 and GPT-4 series of models behind ChatGPT, all considered weak AI models. Significantly more advanced alignment techniques will be necessary to help ensure that superintelligent AI systems possess similar levels of robustness, interpretability, controllability and ethicality.
Without superalignment, advanced AI systems could introduce several risks, including:
If advanced AI systems become so complex and misaligned that human oversight is impossible, their outcomes could be unpredictable and uncontrollable. A humanoid robotic takeover scenario is considered unlikely by most experts. However, an AGI or ASI system that drifts too far from its intended goals could be catastrophic in high-risk situations, such as in critical infrastructure or national defense.
Superintelligent AI could pursue goals in ways that are existentially detrimental to humanity. A commonly cited example is philosopher Nick Bostrom’s paperclip maximizer thought experiment in which an ASI model is programmed to make paperclips. With superhuman computing power, the model eventually transforms everything—even parts of space—into paperclip manufacturing facilities in pursuit of its goal.1
While there are several reliable methods to mitigate bias in AI systems, the risk still remains a consideration for future AI. Advanced AI systems could perpetuate human biases with unfair or discriminatory outcomes. Due to system complexity, these biased outcomes could be difficult to identify and mitigate. AI bias is especially concerning when found in areas such as healthcare, law enforcement and human resources.
Bad actors could exploit superintelligent AI for nefarious purposes such as social control or large-scale financial hacking. However, societal and economic disruption could also happen if industries adopt advanced AI without the necessary legal or regulatory frameworks.
For example, financial AI agents are increasingly used for tasks such as trading or asset management—but accountability for their actions is often unclear. Who is liable should an AI agent violate SEC regulations? As the technology matures, this lack of accountability could lead to mistrust and instability.2
Some conversations around ASI pose the concern that humans could eventually become too reliant on advanced AI systems. As a result, we could potentially lose cognitive and decision-making abilities. Similarly, depending too heavily on AI in areas such as cybersecurity could lead to complacency from human teams. AI is not infallible and human oversight is still needed to help ensure that all threats are mitigated.
There are currently several techniques for aligning AI, including reinforcement learning from human feedback (RLHF), synthetic data approaches and adversarial testing. But these methods are likely inadequate for aligning superintelligent AI models. And, as of writing, neither AGI nor ASI exist and there are no established methods for aligning these more complex AI systems.
However, there are several superalignment ideas with promising research results:
As humans, we are not able to reliably supervise AI systems that are smarter than us. Scalable oversight is a scalable training method where humans could use weaker AI systems to help align more complex AI systems.
Research to test and expand this technique is limited—because superintelligent AI systems do not yet exist. However, researchers at Anthropic (an AI safety and research company) have performed a proof-of-concept experiment.
In the experiment, human participants were directed to answer questions with the help of an LLM. These AI-assisted humans outperformed both the model on its own and unaided humans on the metric of accuracy. In their findings, the researchers said these results are encouraging and help confirm the idea that LLMs “can help humans achieve difficult tasks in settings that are relevant to scalable oversight.”3
Generalization is the capacity for AI systems to reliably make predictions from data they were not trained on. Weak-to-strong generalization is an AI training technique in which weaker models are used to train stronger models to perform better on novel data, improving generalization.
OpenAI’s superalignment team—co-led by Ilya Sutskever (OpenAI cofounder and former Chief Scientist) and Jan Leike (former Head of Alignment)—discussed weak-to-strong generalization in its first research paper. The experiment used a “weak” GPT-2-level model to fine-tune a GPT-4-level model. Using this method, the team found the resulting model’s performance was between a GPT-3- and GPT-3.5-level model. They concluded that with weak-to-strong methods they can meaningfully improve generalization.
Regarding superalignment, this proof-of-concept demo shows that substantial improvement to weak-to-strong generalization is possible. According to the team’s resulting research paper, “it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.”4 And a follow-up study from Beijing Jiaotong University showed that weak-to-strong generalization can be improved using scalable oversight.5
However, OpenAI’s superalignment team was dissolved in May 2024 due to priority shifts within the company. In a social media post, CEO Sam Altman thanked the team and said that OpenAI has “[put] in place the foundations needed for safe deployment of increasingly capable systems.”6
Further down the alignment pipeline sits automated alignment research. This superalignment technique uses already aligned superhuman AI systems to perform automated alignment research. These “AI researchers” would be faster and smarter than human researchers. With these advantages, they could potentially devise new superalignment techniques. Instead of directly developing and implementing the technical alignment research, human researchers would instead review the generated research.
Leopold Aschenbrenner, an AGI investor and former member of the superalignment team at OpenAI, describes the vast potential of this technique: “If we manage to align somewhat-superhuman systems enough to trust them, we’ll be in an incredible position: we’ll have millions of automated AI researchers, smarter than the best AI researchers, at our disposal.”7
Superalignment faces many challenges. For example, who defines the benchmarks for values, goals and ethics? But one challenge casts a shadow over them all: It is extremely difficult to devise reliable alignment techniques for powerful AI systems that not only outsmart us, but only exist in theory.
Industry experts also face philosophical disagreements concerning superalignment. For example, some AI labs posit that focusing AI development efforts on aligning future AI systems could impede current AI priorities and new research. On the other side, AI safety proponents argue that the risks of superintelligence are too severe to ignore and outweigh potential benefits.
The latter line of thinking inspired OpenAI's former chief scientist Ilya Sutskever to join investor Daniel Gross and former OpenAI researcher Daniel Levy in creating Safe Superintelligence Inc. The startup’s singular focus is “building safe superintelligence (SSI)” without “distraction by management overhead or product cycles” and progress “insulated from short-term commercial pressures.”8
Learn about the new challenges of generative AI, the need for governing AI and ML models and steps to build a trusted, transparent and explainable AI framework.
Read about driving ethical and compliant practices with a portfolio of AI products for generative AI models.
Gain a deeper understanding of how to ensure fairness, manage drift, maintain quality and enhance explainability with watsonx.governance™.
We surveyed 2,000 organizations about their AI initiatives to discover what’s working, what’s not and how you can get ahead.
Govern generative AI models from anywhere and deploy on cloud or on premises with IBM watsonx.governance.
See how AI governance can help increase your employees’ confidence in AI, accelerate adoption and innovation, and improve customer trust.
Prepare for the EU AI Act and establish a responsible AI governance approach with the help of IBM Consulting®.
1 “Ethical Issues in Advanced Artificial Intelligence,” Nick Bostrom, n.d.
2 “Will Financial AI Agents Destroy The Economy?,” The Tech Buzz, 25 October 2024.
3 “Measuring Progress on Scalable Oversight for Large Language Models,” Anthropic, 4 November 2022.
4 “Weak-to-strong generalization,” OpenAI, 14 December 2023.
5 “Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning,” arXiv, 1 February 2024.
6 X post, Greg Brockman, 18 May 2024.
7 “Superalignment,” Situational Awareness: The Decade Ahead, June 2024.
8 “Superintelligence is within reach,” Safe Superintelligence Inc., 19 June 2024.
IBM web domains
ibm.com, ibm.org, ibm-zcouncil.com, insights-on-business.com, jazz.net, mobilebusinessinsights.com, promontory.com, proveit.com, ptech.org, s81c.com, securityintelligence.com, skillsbuild.org, softlayer.com, storagecommunity.org, think-exchange.com, thoughtsoncloud.com, alphaevents.webcasts.com, ibm-cloud.github.io, ibmbigdatahub.com, bluemix.net, mybluemix.net, ibm.net, ibmcloud.com, galasa.dev, blueworkslive.com, swiss-quantum.ch, blueworkslive.com, cloudant.com, ibm.ie, ibm.fr, ibm.com.br, ibm.co, ibm.ca, community.watsonanalytics.com, datapower.com, skills.yourlearning.ibm.com, bluewolf.com, carbondesignsystem.com