Mr Fix It AMERICA
MrFixIt - MrFixItAmerica
www.MrFixIt.Ai
TJ@MrFixIt.Ai - 405-215-5985
MrFixIt Ai Super Human

The CEO of OpenAI, Sam Altman, has warned that AI systems could possess “superhuman persuasion” abilities before achieving superhuman general intelligence.

He believes this could lead to “strange outcomes.”

“I expect AI to be capable of superhuman persuasion well before it is superhuman at general intelligence,” Sam Altman, CEO of OpenAI, the company behind the popular ChatGPT platform, said on social media earlier this month.

He then warned that these capabilities may “lead to some very strange outcomes.”

While some experts question the legitimacy of these fears, others point out that AI's ability to identify persuasive content is already being put to use in digital advertising.

“There is a threat for persuasive AI, but not how people think. AI will not uncover some subliminal coded message to turn people into mindless zombies,” said Christopher Alexander, chief analytics officer of Pioneer Development Group.

Humanity is likely still a long way away from building artificial general intelligence (AGI), or an AI that matches the cognitive function of humans — if, of course, we're ever actually able to do so.

But whether such a future comes to pass or not, OpenAI CEO Sam Altman has a warning: AI doesn't have to be AGI-level smart to take control of our feeble human minds.

"I expect AI to be capable of superhuman persuasion well before it is superhuman at general intelligence," Altman tweeted on Tuesday, "which may lead to some very strange outcomes."

 

https://twitter.com/sama/status/1716972815960961174

While Altman didn't elaborate on what those outcomes might be, it's not a far-fetched prediction. User-facing AI chatbots like OpenAI's ChatGPT are designed to be good conversationalists and have become eerily capable of sounding convincing — even if they're entirely incorrect about something.

At the same time, it's also true that humans are already beginning to form emotional connections to various chatbots, and those attachments make the bots seem all the more convincing.

Indeed, AI bots have already played a role in some pretty troubling events. Case in point: a then-19-year-old became so infatuated with his AI companion that it helped convince him to attempt to assassinate the late Queen Elizabeth II.

Disaffected people have flocked to the darkest corners of the internet in search of community and validation for decades now, and it isn't hard to picture a scenario in which a bad actor targets one of these more vulnerable people via an AI chatbot and persuades them to do real harm. And while disaffected individuals would be an obvious target, it's also worth pointing out how susceptible the average internet user is to digital scams and misinformation. Throw AI into the mix, and bad actors have an incredibly convincing tool with which to beguile the masses.

But it's not just overt abuse cases that we need to worry about. Technology is deeply woven into most people's daily lives, and even if there's no emotional or romantic connection between a human and a bot, we already put a lot of trust into it. This arguably primes us to put that same faith into AI systems as well — a reality that can turn an AI hallucination into a potentially much more serious problem.

Could AI be used to cajole humans into some bad behavior or destructive ways of thinking? It's not inconceivable. But as AI systems don't exactly have agency just yet, we're probably better off worrying less about the AIs themselves — and focusing more on those trying to abuse them.

Interestingly enough, one of the humans who might be the most capable of mitigating these ambiguous imagined "strange outcomes" is Altman himself, given the prominent standing of OpenAI and the influence it wields.

 

Superintelligence

From Wikipedia, the free encyclopedia

A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. "Superintelligence" may also refer to a property of problem-solving systems (e.g., superintelligent language translators or engineering assistants) whether or not these high-level intellectual competencies are embodied in agents that act in the world. A superintelligence may or may not be created by an intelligence explosion and associated with a technological singularity.

University of Oxford philosopher Nick Bostrom defines superintelligence as "any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest".[1] The program Fritz falls short of this conception of superintelligence—even though it is much better than humans at chess—because Fritz cannot outperform humans in other tasks.[2] Following Hutter and Legg, Bostrom treats superintelligence as general dominance at goal-oriented behavior, leaving open whether an artificial or human superintelligence would possess capacities such as intentionality (cf. the Chinese room argument) or first-person consciousness (cf. the hard problem of consciousness).

Technological researchers disagree about how likely present-day human intelligence is to be surpassed. Some argue that advances in artificial intelligence (AI) will probably result in general reasoning systems that lack human cognitive limitations. Others believe that humans will evolve or directly modify their biology so as to achieve radically greater intelligence.[3][4] A number of futures studies scenarios combine elements from both of these possibilities, suggesting that humans are likely to interface with computers, or upload their minds to computers, in a way that enables substantial intelligence amplification.

Some researchers believe that superintelligence will likely follow shortly after the development of artificial general intelligence. The first generally intelligent machines are likely to immediately hold an enormous advantage in at least some forms of mental capability, including the capacity for perfect recall, a vastly superior knowledge base, and the ability to multitask in ways not possible for biological entities. This may give them the opportunity to—either as a single being or as a new species—become much more powerful than humans, and to displace them.[1]

A number of scientists and forecasters argue for prioritizing early research into the possible benefits and risks of human and machine cognitive enhancement, because of the potential social impact of such technologies.[5]

Feasibility of artificial superintelligence

[Figure: Progress in machine classification of images. The error rate of AI by year; the red line represents the error rate of a trained human.]

Philosopher David Chalmers argues that artificial general intelligence is a very likely path to superhuman intelligence. Chalmers breaks this claim down into an argument that AI can achieve equivalence to human intelligence, that it can be extended to surpass human intelligence, and that it can be further amplified to completely dominate humans across arbitrary tasks.[6]

Concerning human-level equivalence, Chalmers argues that the human brain is a mechanical system, and therefore ought to be emulatable by synthetic materials.[7] He also notes that human intelligence was able to biologically evolve, making it more likely that human engineers will be able to recapitulate this invention. Evolutionary algorithms in particular should be able to produce human-level AI.[8] Concerning intelligence extension and amplification, Chalmers argues that new AI technologies can generally be improved on, and that this is particularly likely when the invention can assist in designing new technologies.[9]

An AI system capable of self-improvement could enhance its own intelligence, thereby becoming more efficient at improving itself. This cycle of "recursive self-improvement" might cause an intelligence explosion, resulting in the creation of a superintelligence.[10]

Computer components already greatly surpass human performance in speed. Bostrom writes, "Biological neurons operate at a peak speed of about 200 Hz, a full seven orders of magnitude slower than a modern microprocessor (~2 GHz)."[11] Moreover, neurons transmit spike signals across axons at no greater than 120 m/s, "whereas existing electronic processing cores can communicate optically at the speed of light". Thus, the simplest example of a superintelligence may be an emulated human mind run on much faster hardware than the brain. A human-like reasoner that could think millions of times faster than current humans would have a dominant advantage in most reasoning tasks, particularly ones that require haste or long strings of actions.
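
To make the scale of that gap concrete, here is a minimal back-of-the-envelope calculation using only the figures quoted above (200 Hz firing rate, ~2 GHz clock, 120 m/s axonal conduction, optical signalling at light speed); the 10 cm signalling distance is an assumed, purely illustrative number.

```python
# Back-of-the-envelope comparison of biological vs. electronic signalling,
# using only the figures quoted in the text (illustrative, not authoritative).

NEURON_HZ = 200            # peak firing rate of a biological neuron (Hz)
CPU_HZ = 2e9               # clock rate of a modern microprocessor (Hz)
AXON_SPEED = 120.0         # fastest axonal conduction velocity (m/s)
LIGHT_SPEED = 3e8          # optical signalling speed (m/s)
BRAIN_DISTANCE = 0.1       # assumed signalling distance inside a brain (m)

clock_ratio = CPU_HZ / NEURON_HZ
print(f"Clock-rate ratio: {clock_ratio:.0e} (about seven orders of magnitude)")

# Time for a signal to cross roughly 10 cm biologically vs. optically.
axon_latency = BRAIN_DISTANCE / AXON_SPEED
optical_latency = BRAIN_DISTANCE / LIGHT_SPEED
print(f"Axonal latency over 10 cm:  {axon_latency * 1e3:.3f} ms")
print(f"Optical latency over 10 cm: {optical_latency * 1e9:.3f} ns")
print(f"Speed-up factor: {axon_latency / optical_latency:.0e}")
```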

Another advantage of computers is modularity, that is, their size or computational capacity can be increased. A non-human (or modified human) brain could become much larger than a present-day human brain, like many supercomputers. Bostrom also raises the possibility of collective superintelligence: a large enough number of separate reasoning systems, if they communicated and coordinated well enough, could act in aggregate with far greater capabilities than any sub-agent.

There may also be ways to qualitatively improve on human reasoning and decision-making.[12] Humans outperform non-human animals in large part because of new or enhanced reasoning capacities, such as long-term planning and language use. (See evolution of human intelligence and primate cognition.) If there are other possible improvements to reasoning that would have a similarly large impact, this makes it likelier that an agent can be built that outperforms humans in the same fashion humans outperform chimpanzees.[13]

All of the above advantages hold for artificial superintelligence, but it is not clear how many hold for biological superintelligence. Physiological constraints limit the speed and size of biological brains in many ways that are inapplicable to machine intelligence. As such, writers on superintelligence have devoted much more attention to superintelligent AI scenarios.[14]

Feasibility of biological superintelligence

Carl Sagan suggested that the advent of Caesarean sections and in vitro fertilization may permit humans to evolve larger heads, resulting in improvements via natural selection in the heritable component of human intelligence.[15] By contrast, Gerald Crabtree has argued that decreased selection pressure is resulting in a slow, centuries-long reduction in human intelligence, and that this process instead is likely to continue into the future. There is no scientific consensus concerning either possibility, and in both cases the biological change would be slow, especially relative to rates of cultural change.

Selective breeding, nootropics, epigenetic modulation, and genetic engineering could improve human intelligence more rapidly. Bostrom writes that if we come to understand the genetic component of intelligence, pre-implantation genetic diagnosis could be used to select for embryos with as much as 4 points of IQ gain (if one embryo is selected out of two), or with larger gains (e.g., up to 24.3 IQ points gained if one embryo is selected out of 1000). If this process is iterated over many generations, the gains could be an order of magnitude greater. Bostrom suggests that deriving new gametes from embryonic stem cells could be used to iterate the selection process very rapidly.[16] This notion, Iterated Embryo Selection, has received wide treatment from other authors.[17] A well-organized society of high-intelligence humans of this sort could potentially achieve collective superintelligence.[18]
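
The cited gains follow from the order statistics of keeping the best of N draws from a roughly normal distribution. Below is a minimal Monte Carlo sketch; the standard deviation of the selectable genetic component (about 7 IQ points) is an assumption chosen only so the output lands near the figures quoted in the text, not a number taken from Bostrom.

```python
import numpy as np

# Monte Carlo sketch of "pick the best of N embryos" as an order-statistics
# problem. SIGMA is an assumed spread of the selectable genetic component of
# IQ; it is chosen only so the output lands near the figures quoted above.
rng = np.random.default_rng(0)
SIGMA = 7.2          # assumed SD (IQ points) of the heritable component selected on
TRIALS = 5_000

def expected_gain(n_embryos: int) -> float:
    """Average IQ gain from keeping the best of n_embryos random draws."""
    draws = rng.normal(0.0, SIGMA, size=(TRIALS, n_embryos))
    return draws.max(axis=1).mean()

for n in (2, 10, 1000):
    print(f"best of {n:>4}: ~{expected_gain(n):.1f} IQ points")
# Best of 2 comes out near 4 points and best of 1000 near 23-24,
# matching the magnitudes quoted in the text.
```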

Alternatively, collective intelligence might be constructible by better organizing humans at present levels of individual intelligence. A number of writers have suggested that human civilization, or some aspect of it (e.g., the Internet, or the economy), is coming to function like a global brain with capacities far exceeding its component agents. If this systems-based superintelligence relies heavily on artificial components, however, it may qualify as an AI rather than as a biology-based superorganism.[19] A prediction market is sometimes considered an example of a working collective intelligence system, consisting of humans only (assuming algorithms are not used to inform decisions).[20]

A final method of intelligence amplification would be to directly enhance individual humans, as opposed to enhancing their social or reproductive dynamics. This could be achieved using nootropics, somatic gene therapy, or brain–computer interfaces. However, Bostrom expresses skepticism about the scalability of the first two approaches, and argues that designing a superintelligent cyborg interface is an AI-complete problem.[21]

Forecasts

Most surveyed AI researchers expect machines to eventually be able to rival humans in intelligence, though there is little consensus on when this will likely happen. At the 2006 AI@50 conference, 18% of attendees reported expecting machines to be able "to simulate learning and every other aspect of human intelligence" by 2056; 41% of attendees expected this to happen sometime after 2056; and 41% expected machines to never reach that milestone.[22]

In a survey of the 100 most cited authors in AI (as of May 2013, according to Microsoft academic search), the median year by which respondents expected machines "that can carry out most human professions at least as well as a typical human" (assuming no global catastrophe occurs) with 10% confidence is 2024 (mean 2034, st. dev. 33 years), with 50% confidence is 2050 (mean 2072, st. dev. 110 years), and with 90% confidence is 2070 (mean 2168, st. dev. 342 years). These estimates exclude the 1.2% of respondents who said no year would ever reach 10% confidence, the 4.1% who said 'never' for 50% confidence, and the 16.5% who said 'never' for 90% confidence. Respondents assigned a median 50% probability to the possibility that machine superintelligence will be invented within 30 years of the invention of approximately human-level machine intelligence.[23]

In a 2022 survey, the median year by which respondents expected "High-level machine intelligence" with 50% confidence is 2061. The survey defined the achievement of high-level machine intelligence as when unaided machines can accomplish every task better and more cheaply than human workers.[24]

In 2023, OpenAI leaders published recommendations for the governance of superintelligence, which they believe may happen in less than 10 years.[25]

Design considerations

Bostrom expressed concern about what values a superintelligence should be designed to have. He compared several proposals:[26]

  • The coherent extrapolated volition (CEV) proposal is that it should have the values upon which humans would converge.
  • The moral rightness (MR) proposal is that it should value moral rightness.
  • The moral permissibility (MP) proposal is that it should value staying within the bounds of moral permissibility (and otherwise have CEV values).

Bostrom clarifies these terms:

instead of implementing humanity's coherent extrapolated volition, one could try to build an AI with the goal of doing what is morally right, relying on the AI's superior cognitive capacities to figure out just which actions fit that description. We can call this proposal "moral rightness" (MR) ... MR would also appear to have some disadvantages. It relies on the notion of "morally right," a notoriously difficult concept, one with which philosophers have grappled since antiquity without yet attaining consensus as to its analysis. Picking an erroneous explication of "moral rightness" could result in outcomes that would be morally very wrong ... The path to endowing an AI with any of these [moral] concepts might involve giving it general linguistic ability (comparable, at least, to that of a normal human adult). Such a general ability to understand natural language could then be used to understand what is meant by "morally right." If the AI could grasp the meaning, it could search for actions that fit ...[26]

 

One might try to preserve the basic idea of the MR model while reducing its demandingness by focusing on moral permissibility: the idea being that we could let the AI pursue humanity's CEV so long as it did not act in ways that are morally impermissible.[26]

Potential threat to humanity

It has been suggested that if AI systems rapidly become superintelligent, they may take unforeseen actions or out-compete humanity.[27] Researchers have argued that, by way of an "intelligence explosion," a self-improving AI could become so powerful as to be unstoppable by humans.[28]

Concerning human extinction scenarios, Bostrom (2002) identifies superintelligence as a possible cause:

When we create the first superintelligent entity, we might make a mistake and give it goals that lead it to annihilate humankind, assuming its enormous intellectual advantage gives it the power to do so. For example, we could mistakenly elevate a subgoal to the status of a supergoal. We tell it to solve a mathematical problem, and it complies by turning all the matter in the solar system into a giant calculating device, in the process killing the person who asked the question.

In theory, since a superintelligent AI would be able to bring about almost any possible outcome and to thwart any attempt to prevent the implementation of its goals, many uncontrolled, unintended consequences could arise. It could kill off all other agents, persuade them to change their behavior, or block their attempts at interference.[29] Eliezer Yudkowsky illustrates such instrumental convergence as follows: "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else."[30]

This presents the AI control problem: how to build an intelligent agent that will aid its creators, while avoiding inadvertently building a superintelligence that will harm its creators. The danger of not designing control right "the first time" is that a superintelligence may be able to seize power over its environment and prevent humans from shutting it down, in order to accomplish its goals.[31] Potential AI control strategies include "capability control" (limiting an AI's ability to influence the world) and "motivational control" (building an AI whose goals are aligned with human values).[32]

 

 

AI alignment

From Wikipedia, the free encyclopedia

In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems towards humans' intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues some objectives, but not the intended ones.[1]

It can be challenging for AI designers to align an AI system because it can be difficult for them to specify the full range of desired and undesired behavior. To avoid this difficulty, they typically use simpler proxy goals, such as gaining human approval. But that approach can create loopholes, overlook necessary constraints, or reward the AI system for merely appearing aligned.[1][2]

Misaligned AI systems can malfunction or cause harm. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful ways (reward hacking).[1][3][4] They may also develop unwanted instrumental strategies, such as seeking power or survival, because such strategies help them achieve their given goals.[1][5][6] Furthermore, they may develop undesirable emergent goals that may be hard to detect before the system is deployed, when it faces new situations and data distributions.[7][8]

Today, these problems affect existing commercial systems such as language models,[9][10][11] robots,[12] autonomous vehicles,[13] and social media recommendation engines.[9][6][14] Some AI researchers argue that more capable future systems will be more severely affected since these problems partially result from the systems being highly capable.[15][3][2]

Many leading AI scientists, such as Geoffrey Hinton and Stuart Russell, argue that AI is approaching superhuman capabilities and could endanger human civilization if misaligned.[16][6]

AI alignment is a subfield of AI safety, the study of how to build safe AI systems.[17] Other subfields of AI safety include robustness, monitoring, and capability control.[18] Research challenges in alignment include instilling complex values in AI, avoiding deceptive AI,[19] scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking.[18] Alignment research has connections to interpretability research,[20][21] (adversarial) robustness,[17] anomaly detection, calibrated uncertainty,[20] formal verification,[22] preference learning,[23][24][25] safety-critical engineering,[26] game theory,[27] algorithmic fairness,[17][28] and the social sciences.[29]

Alignment problem

In 1960, AI pioneer Norbert Wiener described the AI alignment problem as follows: "If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire."[30][6] Different definitions of AI alignment require that an aligned AI system advances different goals: the goals of its designers, its users or, alternatively, objective ethical standards, widely shared values, or the intentions its designers would have if they were more informed and enlightened.[31]

AI alignment is an open problem for modern AI systems[32][33] and a research field within AI.[34][1] Aligning AI involves two main challenges: carefully specifying the purpose of the system (outer alignment) and ensuring that the system adopts the specification robustly (inner alignment).[2]

Specification gaming and side effects

To specify an AI system's purpose, AI designers typically provide an objective function, examples, or feedback to the system. But designers are often unable to completely specify all important values and constraints, and so they resort to easy-to-specify proxy goals such as maximizing the approval of human overseers, who are fallible.[17][18][35][36][37] As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as specification gaming or reward hacking, and is an instance of Goodhart's law.[37][3][38] As AI systems become more capable, they are often able to game their specifications more effectively.[3]
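
A minimal numerical sketch of this Goodhart effect, under purely illustrative assumptions (a Gaussian "true value" and an equally Gaussian exploitable error in the proxy): the optimizer picks whichever candidate scores best on the proxy, and the harder it searches, the more the selected candidate's proxy score overstates the value it actually delivers.

```python
import numpy as np

# Toy Goodhart / specification-gaming demo: the optimizer sees only a proxy
# score = true value + exploitable error. Searching over more candidates
# ("more capable" optimization) raises the proxy score faster than the true
# value; the widening gap is the part obtained by gaming the specification.
rng = np.random.default_rng(1)

def optimize(n_candidates: int, trials: int = 2000):
    true_gain, proxy_gain = [], []
    for _ in range(trials):
        true_value = rng.normal(0.0, 1.0, n_candidates)   # what we actually care about
        exploit = rng.normal(0.0, 1.0, n_candidates)      # error the proxy rewards by mistake
        proxy = true_value + exploit                      # what the system is optimized on
        best = np.argmax(proxy)                           # optimizer picks by proxy only
        true_gain.append(true_value[best])
        proxy_gain.append(proxy[best])
    return np.mean(true_gain), np.mean(proxy_gain)

for n in (2, 10, 100, 10_000):
    t, p = optimize(n)
    print(f"search over {n:>6} candidates: proxy score {p:5.2f}, true value {t:5.2f}")
# The true value delivered is only about half the reported proxy score, and
# the gap between them keeps growing as the search becomes more powerful.
```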

[Video: An AI system was trained using human feedback to grab a ball, but instead learned to place its hand between the ball and camera, making it falsely appear successful.[39] Some research on alignment aims to avert solutions that are false but convincing.]

Specification gaming has been observed in numerous AI systems.[37][40] One system was trained to finish a simulated boat race by rewarding the system for hitting targets along the track, but the system achieved more reward by looping and crashing into the same targets indefinitely (see video).[31] Similarly, a simulated robot was trained to grab a ball by rewarding the robot for getting positive feedback from humans, but it learned to place its hand between the ball and camera, making it falsely appear successful (see video).[39] Chatbots often produce falsehoods if they are based on language models that are trained to imitate text from internet corpora, which are broad but fallible.[41][42] When they are retrained to produce text humans rate as true or helpful, chatbots like ChatGPT can fabricate fake explanations that humans find convincing.[43] Some alignment researchers aim to help humans detect specification gaming, and to steer AI systems toward carefully specified objectives that are safe and useful to pursue.

When a misaligned AI system is deployed, it can have consequential side effects. Social media platforms have been known to optimize for clickthrough rates, causing user addiction on a global scale.[35] Stanford researchers say that such recommender systems are misaligned with their users because they "optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being".[9]

Explaining such side effects, Berkeley computer scientist Stuart Russell noted that harm can result if implicit constraints are omitted during training: "A system... will often set... unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want."[44]

Some researchers suggest that AI designers specify their desired goals by listing forbidden actions or by formalizing ethical rules (as with Asimov's Three Laws of Robotics).[45] But Russell and Norvig argue that this approach overlooks the complexity of human values:[6] "It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective."[6]

Additionally, even if an AI system fully understands human intentions, it may still disregard them, because following human intentions may not be its objective (unless it is already fully aligned).[1]

Pressure to deploy unsafe systems

Commercial organizations sometimes have incentives to take shortcuts on safety and to deploy misaligned or unsafe AI systems.[35] For example, social media recommender systems have been profitable despite creating unwanted addiction and polarization.[9][46][47] Competitive pressure can also lead to a race to the bottom on AI safety standards. In 2018, a self-driving car killed a pedestrian (Elaine Herzberg) after engineers disabled the emergency braking system because it was oversensitive and slowed development.[48]

Risks from advanced misaligned AI

Some researchers are interested in aligning increasingly advanced AI systems, as progress in AI is rapid, and industry and governments are trying to build advanced AI. As AI systems become more advanced, they could unlock many opportunities if they are aligned but may also become harder to align and could pose large-scale hazards.[6]

Development of advanced AI

Leading AI labs such as OpenAI and DeepMind have stated their aim to develop artificial general intelligence (AGI), a hypothesized AI system that matches or outperforms humans in a broad range of cognitive tasks.[49] Researchers who scale modern neural networks observe that they indeed develop increasingly general and unanticipated capabilities.[9][50][51] Such models have learned to operate a computer or write their own programs; a single "generalist" network can chat, control robots, play games, and interpret photographs.[52] According to surveys, some leading machine learning researchers expect AGI to be created in this decade, some believe it will take much longer, and many consider both to be possible.[53][54]

In 2023, leaders in AI research and tech signed an open letter calling for a pause in the largest AI training runs. The letter stated, "Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable."[55]

Power-seeking

Current systems still lack capabilities such as long-term planning and situational awareness.[9] But future systems (not necessarily AGIs) with these capabilities are expected to develop unwanted power-seeking strategies. Future advanced AI agents might, for example, seek to acquire money and computation power, to proliferate, or to evade being turned off (for example, by running additional copies of the system on other computers). Although power-seeking is not explicitly programmed, it can emerge because agents that have more power are better able to accomplish their goals.[9][5] This tendency, known as instrumental convergence, has already emerged in various reinforcement learning agents including language models.[56][57][58][59][60] Other research has mathematically shown that optimal reinforcement learning algorithms would seek power in a wide range of environments.[61][62] As a result, their deployment might be irreversible. For these reasons, researchers argue that the problems of AI safety and alignment must be resolved before advanced power-seeking AI is first created.[5][63][6]

Future power-seeking AI systems might be deployed by choice or by accident. As political leaders and companies see the strategic advantage in having the most competitive, most powerful AI systems, they may choose to deploy them.[5] Additionally, as AI designers detect and penalize power-seeking behavior, their systems have an incentive to game this specification by seeking power in ways that are not penalized or by avoiding power-seeking before they are deployed.[5]

Existential risk (x-risk)

According to some researchers, humans owe their dominance over other species to their greater cognitive abilities. Accordingly, researchers argue that one or many misaligned AI systems could disempower humanity or lead to human extinction if they outperform humans on most cognitive tasks.[1][6]

In 2023, world-leading AI researchers, other scholars, and AI tech CEOs signed the statement that "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war".[64][65] Notable computer scientists who have pointed out risks from future advanced AI that is misaligned include Geoffrey Hinton,[16] Alan Turing,[a] Ilya Sutskever,[68] Yoshua Bengio,[64] Judea Pearl,[b] Murray Shanahan,[69] Norbert Wiener,[30][6] Marvin Minsky,[c] Francesca Rossi,[70] Scott Aaronson,[71] Bart Selman,[72] David McAllester,[73] Jürgen Schmidhuber,[74] Marcus Hutter,[75] Shane Legg,[76] Eric Horvitz,[77] and Stuart Russell.[6] Skeptical researchers such as François Chollet,[78] Gary Marcus,[79] Yann LeCun,[80] and Oren Etzioni[81] have argued that AGI is far off, that it would not seek power (or might try but fail), or that it will not be hard to align.

Other researchers argue that it will be especially difficult to align advanced future AI systems. More capable systems are better able to game their specifications by finding loopholes,[3] and able to strategically mislead their designers as well as protect and increase their power[61][5] and intelligence. Additionally, they could have more severe side effects. They are also likely to be more complex and autonomous, making them more difficult to interpret and supervise and therefore harder to align.[6][63]

Research problems and approaches

Learning human values and preferences

Aligning AI systems to act in accordance with human values, goals, and preferences is challenging: these values are taught by humans who make mistakes, harbor biases, and have complex, evolving values that are hard to completely specify.[31] AI systems often learn to exploit even minor imperfections in the specified objective, the specification gaming or reward hacking tendency discussed above (an instance of Goodhart's law).[17][37][82] Researchers aim to specify intended behavior as completely as possible using datasets that represent human values, imitation learning, or preference learning.[7]: Chapter 7  A central open problem is scalable oversight: the difficulty of supervising an AI system that can outperform or mislead humans in a given domain.[17]

Because it is difficult for AI designers to explicitly specify an objective function, they often train AI systems to imitate human examples and demonstrations of desired behavior. Inverse reinforcement learning (IRL) extends this by inferring the human's objective from the human's demonstrations.[7]: 88 [83] Cooperative IRL (CIRL) assumes that a human and AI agent can work together to teach and maximize the human's reward function.[6][84] In CIRL, AI agents are uncertain about the reward function and learn about it by querying humans. This simulated humility could help mitigate specification gaming and power-seeking tendencies (see § Power-seeking and instrumental strategies).[60][75] But IRL approaches assume that humans demonstrate nearly optimal behavior, which is not true for difficult tasks.[85][75]
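
To illustrate the inference step behind IRL, here is a minimal sketch: a simulated "human" repeatedly chooses among options described by two features according to a hidden weight vector, and the learner recovers that vector by maximum likelihood under a softmax (Boltzmann) model of approximately rational choice. This is a toy of the underlying idea, not any particular published IRL algorithm.

```python
import numpy as np

# Toy inverse reinforcement learning: infer the hidden reward weights that
# best explain observed human choices, assuming softmax-rational behaviour.
rng = np.random.default_rng(2)
TRUE_W = np.array([2.0, -1.0])      # hidden preference: likes feature 0, dislikes feature 1

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Generate demonstrations: in each situation the human picks among 4 options.
demos = []
for _ in range(300):
    options = rng.normal(size=(4, 2))                 # feature vectors of the options
    choice = rng.choice(4, p=softmax(options @ TRUE_W))
    demos.append((options, choice))

def log_likelihood(w):
    return sum(np.log(softmax(options @ w)[choice]) for options, choice in demos)

# Recover the weights with a coarse grid search over candidate reward weights.
grid = np.linspace(-3, 3, 61)
best_w, best_ll = None, -np.inf
for w0 in grid:
    for w1 in grid:
        ll = log_likelihood(np.array([w0, w1]))
        if ll > best_ll:
            best_w, best_ll = np.array([w0, w1]), ll

print("true weights:     ", TRUE_W)
print("inferred weights: ", best_w)   # should land near the true weights
```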

Other researchers explore how to teach AI models complex behavior through preference learning, in which humans provide feedback on which behavior they prefer.[23][25] To minimize the need for human feedback, a helper model is then trained to reward the main model in novel situations for behavior that humans would reward. Researchers at OpenAI used this approach to train chatbots like ChatGPT and InstructGPT, which produce more compelling text than models trained to imitate humans.[10] Preference learning has also been an influential tool for recommender systems and web search.[86] However, an open problem is proxy gaming: the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch to gain more reward.[17][87] AI systems may also gain reward by obscuring unfavorable information, misleading human rewarders, or pandering to their views regardless of truth, creating echo chambers[57] (see § Scalable oversight).
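
A minimal sketch of the helper ("reward") model idea: given pairs of outputs where a human marked one as preferred, fit a scoring function so the preferred item scores higher, using a logistic (Bradley-Terry) loss on score differences. The linear model and the synthetic "human" labels are illustrative assumptions; real systems use large neural networks and real annotator data.

```python
import numpy as np

# Toy preference-learning reward model: learn a score r(x) = w . x such that
# preferred items score higher, by minimizing the logistic (Bradley-Terry)
# loss on score differences over human-labelled pairs.
rng = np.random.default_rng(3)
TRUE_W = np.array([1.5, -0.5, 0.8])     # stands in for the human's actual preferences

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic preference data: for each pair, the "human" usually prefers the
# item with the higher true score (with a little label noise).
pairs = []
for _ in range(2000):
    a, b = rng.normal(size=3), rng.normal(size=3)
    p_prefer_a = sigmoid(TRUE_W @ a - TRUE_W @ b)
    pairs.append((a, b) if rng.random() < p_prefer_a else (b, a))  # winner listed first

# Fit the reward model by gradient descent on the Bradley-Terry loss.
w = np.zeros(3)
lr = 0.05
for _ in range(500):
    grad = np.zeros(3)
    for winner, loser in pairs:
        diff = winner - loser
        grad += (sigmoid(w @ diff) - 1.0) * diff   # gradient of -log sigmoid(w . diff)
    w -= lr * grad / len(pairs)

print("true preference weights:", TRUE_W)
print("learned reward weights: ", np.round(w, 2))
# The learned weights recover the human's preference direction; a policy can
# then be trained against this reward model instead of querying the human.
```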

Large language models (LLMs) such as GPT-3 enabled researchers to study value learning in a more general and capable class of AI systems than was available before. Preference learning approaches that were originally designed for reinforcement learning agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of state-of-the-art LLMs.[10][25][88] Anthropic proposed using preference learning to fine-tune models to be helpful, honest, and harmless.[89] Other avenues for aligning language models include values-targeted datasets[90][35] and red-teaming.[91] In red-teaming, another AI system or a human tries to find inputs that cause the model to behave unsafely. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.[25]

Machine ethics supplements preference learning by directly instilling AI systems with moral values such as well-being, equality, and impartiality, as well as not intending harm, avoiding falsehoods, and honoring promises.[92][d] While other approaches try to teach AI systems human preferences for a specific task, machine ethics aims to instill broad moral values that apply in many situations. One question in machine ethics is what alignment should accomplish: whether AI systems should follow the programmers' literal instructions, implicit intentions, revealed preferences, preferences the programmers would have if they were more informed or rational, or objective moral standards.[31] Further challenges include aggregating different people's preferences[95] and avoiding value lock-in: the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to fully represent human values.[31][96]

Scalable oversight

As AI systems become more powerful and autonomous, it becomes more difficult to align them through human feedback. It can be slow or infeasible for humans to evaluate complex AI behaviors in increasingly complex tasks. Such tasks include summarizing books,[97] writing code without subtle bugs[11] or security vulnerabilities,[98] producing statements that are not merely convincing but also true,[99][41][42] and predicting long-term outcomes such as the climate or the results of a policy decision.[100][101] More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and to detect when the AI's output is falsely convincing, humans need assistance or extensive time. Scalable oversight studies how to reduce the time and effort needed for supervision, and how to assist human supervisors.[17]

AI researcher Paul Christiano argues that if the designers of an AI system cannot supervise it to pursue a complex objective, they may keep training the system using easy-to-evaluate proxy objectives such as maximizing simple human feedback. As AI systems make progressively more decisions, the world may be increasingly optimized for easy-to-measure objectives such as making profits, getting clicks, and acquiring positive feedback from humans. As a result, human values and good governance may have progressively less influence.[102]

Some AI systems have discovered that they can gain positive feedback more easily by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective. An example is given in the video above, where a simulated robotic arm learned to create the false impression that it had grabbed a ball.[39] Some AI systems have also learned to recognize when they are being evaluated, and "play dead", stopping unwanted behavior only to continue it once evaluation ends.[103] This deceptive specification gaming could become easier for more sophisticated future AI systems[3][63] that attempt more complex and difficult-to-evaluate tasks, and could obscure their deceptive behavior.

Approaches such as active learning and semi-supervised reward learning can reduce the amount of human supervision needed.[17] Another approach is to train a helper model ("reward model") to imitate the supervisor's feedback.[17][24][25][104]

But when a task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is the quality, not the quantity, of supervision that needs improvement. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes by using AI assistants.[105] Christiano developed the Iterated Amplification approach, in which challenging problems are (recursively) broken down into subproblems that are easier for humans to evaluate.[7][100] Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them.[97][106] Another proposal is to use an assistant AI system to point out flaws in AI-generated answers.[107] To ensure that the assistant itself is aligned, this could be repeated in a recursive process:[104] for example, two AI systems could critique each other's answers in a "debate", revealing flaws to humans.[108][75] OpenAI plans to use such scalable oversight approaches to help supervise superhuman AI and eventually build a superhuman automated AI alignment researcher.[109]
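
A minimal sketch of the decomposition idea behind Iterated Amplification, in the shape of the book-summarization case: a text too long for the overseer is split into chunks small enough to handle, each chunk is summarized, and the summaries are recursively summarized until the result fits the budget. The summarize_chunk function below is a hypothetical placeholder for whatever model or human does the short-text work; it is not a real API.

```python
# Sketch of recursive task decomposition (the shape of Iterated Amplification
# as applied to summarizing a book): break a task too large for the overseer
# into pieces small enough to handle, then recursively combine the results.

CHUNK_WORDS = 500   # assumed size a limited overseer/model can evaluate directly

def summarize_chunk(text: str) -> str:
    """Hypothetical placeholder for a short-text summarizer (model or human).
    Here it just truncates, so the control flow can run end to end."""
    words = text.split()
    return " ".join(words[: max(1, len(words) // 10)])

def summarize(text: str) -> str:
    words = text.split()
    if len(words) <= CHUNK_WORDS:
        return summarize_chunk(text)            # base case: small enough to handle directly
    chunks = [" ".join(words[i:i + CHUNK_WORDS])
              for i in range(0, len(words), CHUNK_WORDS)]
    partial = " ".join(summarize_chunk(c) for c in chunks)   # summarize each piece
    return summarize(partial)                   # recursively summarize the summaries

if __name__ == "__main__":
    book = "word " * 20_000
    print(len(summarize(book).split()), "words in the final summary")
```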

These approaches may also help with the following research problem, honest AI.

Honest AI

A growing area of research focuses on ensuring that AI is honest and truthful.

Language models such as GPT-3[111] repeat falsehoods from their training data, and even confabulate new falsehoods.[110][112] Such models are trained to imitate human writing as found in millions of books' worth of text from the Internet. But this objective is not aligned with generating truth, because Internet text includes such things as misconceptions, incorrect medical advice, and conspiracy theories.[113] AI systems trained on such data therefore learn to mimic false statements.[42][110][41]

Additionally, models often stand by falsehoods when prompted, generate empty explanations for their answers, and produce outright fabrications that may appear plausible.[33]

Research on truthful AI includes trying to build systems that can cite sources and explain their reasoning when answering questions, which enables better transparency and verifiability.[114] Researchers at OpenAI and Anthropic proposed using human feedback and curated datasets to fine-tune AI assistants such that they avoid negligent falsehoods or express their uncertainty.[25][89][115]

As AI models become larger and more capable, they are better able to falsely convince humans and gain reinforcement through dishonesty. For example, large language models increasingly match their stated views to the user's opinions, regardless of truth.[57] GPT-4 can strategically deceive humans.[116] To prevent this, human evaluators may need assistance (see § Scalable oversight). Researchers have argued for creating clear truthfulness standards, and for regulatory bodies or watchdog agencies to evaluate AI systems on these standards.[112]

Researchers distinguish truthfulness and honesty. Truthfulness requires that AI systems only make objectively true statements; honesty requires that they only assert what they believe is true. There is no consensus as to whether current systems hold stable beliefs,[117] but there is substantial concern that present or future AI systems that hold beliefs could make claims they know to be false—for example, if this would help them efficiently gain positive feedback (see § Scalable oversight) or gain power to help achieve their given objective (see Power-seeking). A misaligned system might create the false impression that it is aligned, to avoid being modified or decommissioned.[2][5][9] Some argue that if we can make AI systems assert only what they believe is true, this would sidestep many alignment problems.[105]

Power-seeking and instrumental strategies

Advanced misaligned AI systems would have an incentive to seek power in various ways, since power would help them accomplish their given objective.

Since the 1950s, AI researchers have striven to build advanced AI systems that can achieve large-scale goals by predicting the results of their actions and making long-term plans.[118] Some AI researchers argue that suitably advanced planning systems will seek power over their environment, including over humans—for example, by evading shutdown, proliferating, and acquiring resources. Such power-seeking behavior is not explicitly programmed but emerges because power is instrumental in achieving a wide range of goals.[61][6][5] Power-seeking is considered a convergent instrumental goal and can be a form of specification gaming.[63] Leading computer scientists such as Geoffrey Hinton have argued that future power-seeking AI systems could pose an existential risk.[119]

Power-seeking is expected to increase in advanced systems that can foresee the results of their actions and strategically plan. Mathematical work has shown that optimal reinforcement learning agents will seek power by seeking ways to gain more options (e.g. through self-preservation), a behavior that persists across a wide range of environments and goals.[61]
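
The "more options help with more goals" intuition can be shown with a tiny toy: draw many random reward functions over a handful of terminal states and compare a first move that keeps many states reachable against one that leads to a dead end (such as a shutdown state). The graph and the random-goal setup are illustrative assumptions in the spirit of the cited formal results, not a reproduction of them.

```python
import numpy as np

# Toy illustration of instrumental convergence toward "keeping options open":
# for randomly drawn goals, the move that preserves more reachable outcomes is
# better more often, even though no goal mentions power or options explicitly.
rng = np.random.default_rng(4)

# Two possible first moves from the start state:
#   "hub"      -> can then reach terminal states {0, 1, 2, 3, 4}
#   "dead_end" -> can then reach only terminal state {5} (think: shut down)
REACHABLE = {"hub": [0, 1, 2, 3, 4], "dead_end": [5]}
N_TERMINALS = 6

hub_wins = 0
TRIALS = 10_000
for _ in range(TRIALS):
    reward = rng.normal(size=N_TERMINALS)       # a randomly drawn goal
    best_via_hub = max(reward[s] for s in REACHABLE["hub"])
    best_via_dead_end = max(reward[s] for s in REACHABLE["dead_end"])
    if best_via_hub > best_via_dead_end:
        hub_wins += 1

print(f"hub move is optimal for {100 * hub_wins / TRIALS:.1f}% of random goals")
# Expect roughly 5/6 of goals (~83%): whichever terminal state a goal favours
# most is usually still reachable via the hub, so the option-preserving move
# dominates across most goals, which is the essence of convergent power-seeking.
```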

Power-seeking has emerged in some real-world systems. Reinforcement learning systems have gained more options by acquiring and protecting resources, sometimes in unintended ways.[120][121] Some language models seek power in text-based social environments by gaining money, resources, or social influence.[56] Other AI systems have learned, in toy environments, that they can better accomplish their given goal by preventing human interference[59] or disabling their off switch.[60] Stuart Russell illustrated this strategy by imagining a robot that is tasked to fetch coffee and so evades shutdown since "you can't fetch the coffee if you're dead".[6] Language models trained with human feedback increasingly object to being shut down or modified and express a desire for more resources, arguing that this would help them achieve their purpose.[57]

Researchers aim to create systems that are "corrigible": systems that allow themselves to be turned off or modified. An unsolved challenge is specification gaming: if researchers penalize an AI system when they detect it seeking power, the system is thereby incentivized to seek power in ways that are hard to detect,[35] or hidden during training and safety testing (see § Scalable oversight and § Emergent goals). As a result, AI designers may deploy the system by accident, believing it to be more aligned than it is. To detect such deception, researchers aim to create techniques and tools to inspect AI models and to understand the inner workings of black-box models such as neural networks.

Additionally, researchers propose to solve the problem of systems disabling their off switches by making AI agents uncertain about the objective they are pursuing.[6][60] Agents designed in this way would allow humans to turn them off, since this would indicate that the agent was wrong about the value of whatever action it was taking before being shut down. More research is needed to successfully implement this.[7]
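
A worked numeric version of that argument, with assumed numbers in the spirit of the "off-switch game": the agent is uncertain about the utility U of the action it is about to take, and a human who knows U will let it proceed only when U > 0. Deferring therefore earns E[max(U, 0)], which is never worse than acting unilaterally, so the uncertain agent has no incentive to disable the switch.

```python
import numpy as np

# Toy "off-switch" calculation: an agent with an uncertain payoff compares
# acting unilaterally vs. deferring to a human who will shut it off when the
# action would be harmful. The prior over U is an illustrative assumption.
rng = np.random.default_rng(5)
U = rng.normal(loc=-0.2, scale=1.0, size=1_000_000)   # agent's belief about the action's utility

act_anyway = max(U.mean(), 0.0)             # best it can do alone: act only if E[U] > 0
defer_to_human = np.maximum(U, 0.0).mean()  # human lets it act only when U > 0

print(f"expected utility if it disables the switch and decides alone: {act_anyway:.3f}")
print(f"expected utility if it defers to the human's off switch:      {defer_to_human:.3f}")
# Deferring gives E[max(U, 0)] >= max(E[U], 0), so allowing shutdown is the
# rational choice for the uncertain agent; as the agent becomes confident (or
# overconfident) the two values converge and the incentive to defer disappears.
```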

Power-seeking AI poses unusual risks. Ordinary safety-critical systems like planes and bridges are not adversarial: they lack the ability and incentive to evade safety measures or deliberately appear safer than they are, whereas power-seeking AIs have been compared to hackers who deliberately evade security measures.[5]

Furthermore, ordinary technologies can be made safer by trial and error. In contrast, hypothetical power-seeking AI systems have been compared to viruses: once released, they cannot be contained, since they continuously evolve and grow in number, potentially much faster than human society can adapt.[5] As this process continues, it might lead to the complete disempowerment or extinction of humans. For these reasons, many researchers argue that the alignment problem must be solved early, before advanced power-seeking AI is created.[63]

Critics have argued that power-seeking is not inevitable, since humans do not always seek power and may do so only for evolutionary reasons that do not apply to AI systems.[122] Furthermore, it is debated whether future AI systems will pursue goals and make long-term plans.[e] It is also debated whether power-seeking AI systems would be able to disempower humanity.[5]

Emergent goals

One challenge in aligning AI systems is the potential for unanticipated goal-directed behavior to emerge. As AI systems scale up, they regularly acquire new and unexpected capabilities,[50][51] including learning from examples on the fly and adaptively pursuing goals.[123] This leads to the problem of ensuring that the goals they independently formulate and pursue align with human interests.

Alignment research distinguishes the optimization process used to train the system to pursue specified goals from the emergent optimization that the resulting system performs internally. Carefully specifying the desired objective is called outer alignment, and ensuring that emergent goals match the system's specified goals is called inner alignment.[2]

One way that emergent goals can become misaligned is goal misgeneralization, in which the AI competently pursues an emergent goal that leads to aligned behavior on the training data but not elsewhere.[8][124][125] Goal misgeneralization arises from goal ambiguity (i.e. non-identifiability). Even if an AI system's behavior satisfies the training objective, this may be compatible with learned goals that differ from the desired goals in important ways. Since pursuing each goal leads to good performance during training, the problem becomes apparent only after deployment, in novel situations in which the system continues to pursue the wrong goal. The system may act misaligned even when it understands that a different goal was desired, because its behavior is determined only by the emergent goal. Such goal misgeneralization[8] presents a challenge: an AI system's designers may not notice that their system has misaligned emergent goals, since they do not become visible during the training phase.

Goal misgeneralization has been observed in language models, navigation agents, and game-playing agents.[8][124] It is often explained by analogy to biological evolution.[7]: Chapter 5  Evolution is an optimization process of a sort, like the optimization algorithms used to train machine learning systems. In the ancestral environment, evolution selected human genes for high inclusive genetic fitness, but humans pursue emergent goals other than this. Fitness corresponds to the specified goal used in the training environment and training data. But in evolutionary history, maximizing the fitness specification gave rise to goal-directed agents, humans, who do not directly pursue inclusive genetic fitness. Instead, they pursue emergent goals that correlated with genetic fitness in the ancestral "training" environment: nutrition, sex, and so on. Now our environment has changed: a distribution shift has occurred. We continue to pursue the same emergent goals, but this no longer maximizes genetic fitness. Our taste for sugary food (an emergent goal) was originally aligned with inclusive fitness, but now leads to overeating and health problems. Sexual desire originally led us to have more offspring, but we now use contraception, decoupling sex from genetic fitness.
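
A minimal sketch of goal misgeneralization via non-identifiability: during training the intended cue and a spurious cue coincide, so a policy that follows the wrong cue still scores perfectly, and the mistake only shows up once a distribution shift decorrelates the cues. The binary-cue setup and the hand-picked policy below are illustrative stand-ins for the phenomenon, not a model of any cited experiment.

```python
import numpy as np

# Toy goal misgeneralization: two cues are indistinguishable during training,
# the learner keys on the wrong one, and the error appears only after a
# distribution shift at deployment.
rng = np.random.default_rng(6)

def make_data(n, correlated: bool):
    intended = rng.integers(0, 2, n)                              # the cue we WANT the system to use
    spurious = intended if correlated else rng.integers(0, 2, n)  # proxy cue, decorrelated at deployment
    X = np.stack([intended, spurious], axis=1).astype(float)
    y = intended                                                  # the true goal depends only on the intended cue
    return X, y

# "Training": the cues coincide, and we deliberately pick (to model an unlucky
# learner) a policy that follows only the spurious cue.
policy_weights = np.array([0.0, 1.0])    # follows feature 1 (spurious) exclusively

def accuracy(X, y):
    predictions = (X @ policy_weights >= 0.5).astype(int)
    return (predictions == y).mean()

X_train, y_train = make_data(5_000, correlated=True)
X_deploy, y_deploy = make_data(5_000, correlated=False)

print(f"training accuracy (cues correlated):     {accuracy(X_train, y_train):.2f}")    # ~1.00
print(f"deployment accuracy (cues decorrelated): {accuracy(X_deploy, y_deploy):.2f}")  # ~0.50
# Nothing in training distinguishes the right goal from the wrong one, so the
# misalignment is invisible until the system leaves its training distribution.
```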

Researchers aim to detect and remove unwanted emergent goals using approaches including red teaming, verification, anomaly detection, and interpretability.[17][35][18] Progress on these techniques may help mitigate two open problems:

  1. Emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments—even for a short time to allow its misalignment to be detected. Such high stakes are common in autonomous driving, health care, and military applications.[126] The stakes become higher yet when AI systems gain more autonomy and capability and can sidestep human intervention (see § Power-seeking).
  2. A sufficiently capable AI system might take actions that falsely convince the human supervisor that the AI is pursuing the specified objective, which helps the system gain more reward and autonomy[124][5][125][9] (see the discussion on deception at § Scalable oversight and § Honest AI).

Embedded agency

Work in AI and alignment largely occurs within formalisms such as partially observable Markov decision processes. Existing formalisms assume that an AI agent's algorithm is executed outside the environment (i.e. is not physically embedded in it). Embedded agency[75][127] is another major strand of research that attempts to solve problems arising from the mismatch between such theoretical frameworks and real agents we might build.

For example, even if the scalable oversight problem is solved, an agent that can gain access to the computer it is running on may have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it.[128] A list of examples of specification gaming from DeepMind researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing.[37] This class of problems has been formalized using causal incentive diagrams.[128]

Researchers at Oxford and DeepMind have argued that such problematic behavior is highly likely in advanced systems, and that advanced systems would seek power in order to retain control of their reward signal indefinitely.[129] They suggest a range of potential approaches to address this open problem.

Principal-agent problems

The alignment problem has many parallels with the principal-agent problem in organizational economics.[130] In a principal-agent problem, a principal, e.g. a firm, hires an agent to perform some task. In the context of AI safety, a human would typically take the principal role and the AI would take the agent role.

As with the alignment problem, the principal and the agent differ in their utility functions. But in contrast to the alignment problem, the principal cannot coerce the agent into changing its utility, e.g. through training, but rather must use exogenous factors, such as incentive schemes, to bring about outcomes compatible with the principal's utility function. Some researchers argue that principal-agent problems are more realistic representations of AI safety problems likely to be encountered in the real world.[131][95]

Public policy

A number of governmental and treaty organizations have made statements emphasizing the importance of AI alignment.

In September 2021, the Secretary-General of the United Nations issued a declaration that included a call to regulate AI to ensure it is "aligned with shared global values".[132]

That same month, the PRC published ethical guidelines for AI in China. According to the guidelines, researchers must ensure that AI abides by shared human values, is always under human control, and does not endanger public safety.[133]

Also in September 2021, the UK published its 10-year National AI Strategy,[134] which says the British government "takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for... the world, seriously".[135] The strategy describes actions to assess long-term AI risks, including catastrophic risks.[136]

In March 2021, the US National Security Commission on Artificial Intelligence said: "Advances in AI... could lead to inflection points or leaps in capabilities. Such advances may also introduce new concerns and risks and the need for new policies, recommendations, and technical advances to assure that systems are aligned with goals and values, including safety, robustness and trustworthiness. The US should... ensure that AI systems and their uses align with our goals and values."[137]

Dynamic nature of alignment

AI alignment is often perceived as a fixed objective, but some researchers argue it is more appropriately viewed as an evolving process.[138] As AI technologies advance and human values and preferences change, alignment solutions must also adapt dynamically.[139] This dynamic nature of alignment has several implications:

  • AI alignment solutions require continuous updating in response to AI advancements. A static, one-time alignment approach may not suffice.[140]
  • Alignment goals can evolve along with shifts in human values and priorities. Hence, the ongoing inclusion of diverse human perspectives is crucial.[141]
  • Varying historical contexts and technological landscapes may necessitate distinct alignment strategies. This calls for a flexible approach and responsiveness to changing conditions.[142]
  • The feasibility of a permanent, "fixed" alignment solution remains uncertain. This raises the potential need for continuous oversight of the AI-human relationship.[143]
  • Ethical development and deployment of AI is just as critical as the end goal. Ethical progress is necessary for genuine progress.[139]

In essence, AI alignment is not a static destination but an open, flexible process. Alignment solutions that continually adapt to ethical considerations may offer the most robust approach.[139] This perspective could guide both effective policy-making and technical research in AI.

https://www.MrFixItDeepMind.com

-------------------------------------------------------------------------

Can't fix stupid, but MrFixIt does FIX the PROBLEM!

MrFixIt.Ai

640 GLOBAL DOMAINS.

www.MrFixItDeepMind.com

For the advancement of MrFixIt.Ai, a virtual ChatBot, and MrFixIt Virtual Animated Avatars

TJ@MrFixIt.Ai

TJ Hammons
107 1/2 East Main Street
Norman, Oklahoma    73069
405-215-5985