Learning Complex Goals with
By OpenAI's Paul Christiano & Dario Amodei
We’re proposing an AI safety technique called iterated amplification that lets us specify complicated behaviors and goals that are beyond human scale, by demonstrating how to decompose a task into simpler sub-tasks, rather than by providing labeled data or a reward function. Although this idea is in its very early stages and we have only completed experiments on simple toy algorithmic domains, we’ve decided to present it in its preliminary state because we think it could prove to be a scalable approach to AI safety.
If we want to train an ML system to perform a task, we need a training signal — a way to evaluate how well it is doing in order to help it learn. For example, labels in supervised learning or rewards in reinforcement learning are training signals. The formalism of ML usually assumes a training signal is already present and focuses on learning from it, but in reality the training signal has to come from somewhere. If we don’t have a training signal we can’t learn the task, and if we have the wrong training signal, we can get unintended and sometimes dangerous behavior. Thus, it would be valuable for both learning new tasks, and for AI safety, to improve our ability to generate training signals.
How do we currently generate training signals? Sometimes, the goal we want can be evaluated algorithmically, like counting up the score in a game of Go or checking whether a set of numbers has been successfully sorted (left panels of figure below). Most real-world tasks don’t lend themselves to an algorithmic training signal, but often we can instead obtain a training signal by having a human either perform the task (for example, labeling a training set or demonstrating an RL task), or judge an AI’s performance on the task (middle panels of figure below). However, many tasks are so complicated that a human can’t judge or perform them - examples might be designing a complicated transit system or managing every detail of the security of a large network of computers (right panels of figure below).
Iterated amplification is a method for generating a training signal for the latter types of tasks, under certain assumptions. Namely, although a human can’t perform or judge the whole task directly, we assume that a human can, given a piece of the task, identify clear smaller components of which it’s made up. For example, in the networked computer example, a human could break down “defend a collection of servers and routers” into “consider attacks on the servers”, “consider attacks on the routers”, and “consider how the previous two attacks might interact”. Additionally, we assume a human can do very small instances of the task, for example “identify if a specific line in a log file is suspicious”. If these two things hold true, then we can build up a training signal for big tasks from human training signals for small tasks, using the human to coordinate their assembly.
In our implementation of amplification, we start by sampling small subtasks and training the AI system to do them by soliciting demonstrations from humans (who can do these small tasks). We then begin sampling slightly larger tasks, solving them by asking humans to break them up into small pieces, which AI systems trained from the previous step can now solve. We use the solutions to these slightly harder tasks, which were obtained with human help, as a training signal to train AI systems to solve these second-level tasks directly (without human help). We then continue to further composite tasks, iteratively building up a training signal as we go. If the process works, the end result is a totally automated system that can solve highly composite tasks despite starting with no direct training signal for those tasks. This process is somewhat similar to expert iteration (the method used in AlphaGo Zero), except that expert iteration reinforces an existing training signal, while iterated amplification builds up a training signal from scratch. It also has features in common with several recent learning algorithms that use problem decomposition on-the-fly to solve a problem at test time, but differs in that it operates in settings where there is no prior training signal.
As with our previous work on AI safety via debate, working directly on tasks beyond human scale is too hard for a prototype project. Also, using an actual human as training signal introduces complications, so we haven’t done this yet (though we plan to do it in the future). For our first experiments, we instead try to amplify an algorithmic training signal, to show that iterated amplification can work in this simple setting. We also limit our attention to supervised learning (unlike our previous work on human training signals in RL). We’ve tried the method on five toy algorithmic tasks. These tasks have direct algorithmic solutions which we pretend we don’t know (for example, find the shortest path between two points on a graph). The problems can also be solved by piecing together small one-step deductions (for example, combining two paths to form a longer path), but it would take exponentially much effort to piece everything together manually. We use iterated amplification to learn the direct algorithm using only the pieces as training signal, thus simulating the situation where a human knows how to combine subpieces of a solution but can’t provide a direct training signal.
On each of these five tasks (permutation powering, sequential assignments, wildcard search, shortest path, and union find), we are able to perform competitively with just directly learning the task via supervised learning, despite being handicapped by not having a direct training signal (the goal here is to match supervised learning with less information, not to surpass it).