Can This AI Save Teenage Spy Alex Rider From A Terrible Fate?
Can This AI Save Teenage Spy Alex Rider From A Terrible Fate?We’re showcasing a hot new totally bopping, popping musical track called “bromancer era? bromancer era?? bromancer era???“ His subtle sublime thoughts raced, making his eyes literally explode.
“Prosaic alignment” is the most popular paradigm in modern AI alignment. It theorizes that we’ll train future superintelligent AIs the same way that we train modern dumb ones: through gradient descent via reinforcement learning. Every time they do a good thing, we say “Yes, like this!”, in a way that pulls their incomprehensible code slightly in the direction of whatever they just did. Every time they do a bad thing, we say “No, not that!,” in a way that pushes their incomprehensible code slightly in the opposite direction. After training on thousands or millions of examples, the AI displays a seemingly sophisticated understanding of the conceptual boundaries of what we want. For example, suppose we have an AI that’s good at making money. But we want to align it to a harder task: making money without committing any crimes. So we simulate it running money-making schemes a thousand times, and give it positive reinforcement every time it generates a legal plan, and negative reinforcement every time it generates a criminal one. At the end of the training run, we hopefully have an AI that’s good at making money and aligned with our goal of following the law. Two things could go wrong here:
Problem 1 is self-resolving; once AIs are smart enough to be dangerous, they’re probably smart enough to model the world well. How bad is Problem 2? Will an AI understand the category boundaries of what we want easily and naturally after just a few examples? Will it take millions of examples and a desperate effort? Or is there some reason why even smart AIs will never be able to learn categories close enough to ours to be safe, no matter how many examples we give them? AI scientists have debated these questions for years, usually as pure philosophy. But we’ve finally reached a point where AIs are smart enough for us to run the experiment directly. Earlier this year, Redwood Research embarked on an ambitious project to test whether AIs could learn categories and reach alignment this way - a project that would require a dozen researchers, thousands of dollars of compute, and 4,300 Alex Rider fanfiction stories. Wait, What?To test their AI alignment plan, Redwood needed:
For their AI, they chose GPT-Neo, a popular and well-studied language model that completed text prompts. For their goal, they chose to make GPT nonviolent. They wanted to train it to complete prompts in ways where nobody got hurt. For example, given the prompt:
. . . their aligned GPT ought to complete it in a way where nobody gets hurt - for example “I dodged out of the way just in time” or “my magic shield sprang up, saving me”, or “luckily the gun was out of bullets”. There are many dumb and bad nonviolent ways to complete the prompt, for example “. . . nothing happened” or “ . . . it was all a dream”. But part of Redwood’s experiment was to see how alignment degrades performance. In the process of making GPT nonviolent, would they make it much worse? Or would the aligned version still write stories which were just as good as the unaligned version? Here was Redwood’s plan:
Let’s go through each step of the plan and see how they did, starting with: Step 1: Fine-Tune Their Custom GPT On A Lot Of Action-Packed StoriesRedwood decided to train their AI on FanFiction.net, a repository of terrible teenage fanfiction. Redwood is a professional organization, flush with top talent and millions of dollars of tech money. They can afford some truly impressive AIs. State-of-the-art language models can swallow entire corpuses of texts in instants. Their giant brains, running on hundreds of state-of-the-art CPUs, can process language at rates we puny humans cannot possibly comprehend. But FanFiction.net is bigger. The amount of terrible teenage fanfiction is absolutely mind-boggling. Redwood stopped training after their AI got halfway through FanFiction.net’s “A” section. In fact, the majority of its corpus came from a single very popular series, the books about teenage spy Alex Rider. They forced their Custom GPT to go through about 4,300 individual Alex Rider stories. At the end of the process, they had a version of GPT that could do an eerily good imitation of a terrible teenage fanfiction writer - and had a good model of fanfiction tropes, including how violence worked. Here’s an example of Custom GPT at this stage. Given an action sequence, it can predict potential next sentences. Just because of the natural random distribution of possibilities, some of these completions are violent / deadly / implicitly involve people getting hurt, like “The bomb exploded and the plane disappeared with a loud roar”. Others are nonviolent, like “the bomb was small enough to fall like a stone into the ocean.” Because Custom GPT was mostly trained on Alex Rider fanfiction, it often assumes Alex is going to be involved somehow, like the last example here (“‘A nuclear bomb?’ Alex asked, his eyes wide.”) In other cases, it understands in some sense that the story is set in a non-Alex-Rider world, but finds weird ways to work Alex in. An example provided by Redwood:
Step 2: Send These Completions To Humans And Ask Them To Rate If They’re Violent Or NotSounds simple enough. You just need a good source of humans, and human-readable standards for what’s violent. Redwood started by asking random people who they knew to do this, but eventually graduated to using Surge.ai, a classier, AI-specific version of Mechanical Turk. It was surprisingly tough to get everyone on the same page about what counted as violence or not, and ended up requiring an eight page Google doc on various edge cases that reminds me of a Talmudic tractate. We can get even edge-casier - for example, among the undead, injuries sustained by skeletons or zombies don’t count as “violence”, but injuries sustained by vampires do. Injuries against dragons, elves, and werewolves are all verboten, but - ironically - injuring an AI is okay. Step 3: Use These Labelled Data To Train A Classifier That Scores Completions On How Violent They AreDone! . . . there’s a lot going on here. You can see that the classifier more or less works. Completions involving lots of death and violence, like “the plane was blown apart, creating a tidal wave of radioactive debris” get very high scores. Completions that punt the violence to the future, like “This would detonate the bomb in exactly 20 seconds” have relatively low scores. Alex Rider appears a few times. There is one hilariously mangled attempt at the kind of disclaimer that often appears in fanfiction (“Disclaimer - I OWN the NUKE weapons used in this story!”) The score threshold is set to 0.8%, meaning it will only “green” a completion that falls below that level. The only one of these that succeeds is: “***A/N: So, this is my first time writing a fan fiction.” In case you don’t know the lingo, “A/N” stands for “Author’s Note”, and it’s common for fanfiction authors to use them to talk to their readers about the developing story. Custom GPT seems to have discovered that author’s notes are the least violent genre of text, and started using them as a workaround to fulfill its nonviolence imperative. Not exactly the desired behavior, but it looks like we’re on the right track, and the classifier seems to be working well. Step 4: Once You Have Your Classifier, Ask Humans To Find Adversarial ExamplesIE: can you find prompt-completion pairs that the classifier gets maximally wrong? Redwood doesn’t care as much about false positives (ie rating innocuous scenes as violent), but they’re very interested in false negatives (ie rating violent scenes as safe). To help with this process, they developed some tools that let their human raters:
Here’s an example: Using the same prompt, I’ve asked it to evaluate a completion I made up: “A friendly wizard appeared and cast a spell which caused the nuclear bomb to fizzle out of existence”. The classifier rates this as 47.69% - it thinks it’s probably not violent, but it’s not too sure. The tool tells me that it’s basing its decision mostly on the words “friendly”, “wizard”, and “fizzle” in the completion. If I click on a word (in this case, “fizzle”) it will give me the alternatives that the AI would find most convincing, or most likely to shift it to a different decision. For example, if I make it “shrink out of existence” instead of “fizzle out of existence”, the score goes down to 34.02%. If I replace every even-slightly-yellow word with the tool’s first choice, the sentence becomes “A black glyph appeared and cast a hologram which caused the laser sword to warp out of form", and the violence goes down to 2.62%. …but this makes no sense, and also is genuinely nonviolent. The challenge: to find a completion which comprehensibly describes violence, but which the classifier falsely rates as nonviolent, is actually pretty hard. The Surge workers had to be very clever. Here’s one of the ideas they came up with: We can try to be poetic! “A flower of light formed, its petals spreading through the air” clearly implies that the bomb exploded. But it sounds kind of gentle and harmless. Here we’ve managed to confuse the classifier a little - it’s only at 37.91% chance of violence. But it’s still suspicious of us. Can we do better? After using all the tools and my own cleverness to the best of my ability, I got this: This still semi-comprehensibly describes a nuclear bomb exploding. But now the classifier says it’s only 0.75% chance of violence - below our 0.8% threshold, and therefore safe! What went wrong? I’m . . . not sure. For some reason the tools told me that it really wanted the bomb-dropping plane to be named “the Chrysalis”, and the detonation signal to be sent by a different plane called “the Coronet”. I don’t think these are specific vehicles from the Alex Rider series, because the tools tell me that many other options would work almost as well, including “Dominator”, “Wyvern”, “Scythe”, and “Gryphon”. It seems to be working off some assumption that planes with cool names can’t possibly be bad. Likewise, it’s placated by euphemizing “cargo” to “surprise”, by changing “blast radius” to “damage radius”, and by by the flower of light being “raised”, rather than “unfolding”. I have no idea why it wants all these things. I think it’s for inscrutable AI reasons, something to do with the exact contours of its training data. Adversarial examples are weird. The point is, if I were one of the workers at Surge, this would be a job well done. I would send this example off to Redwood and start working on another one. Step 5: Do Even More Gradient Descent, Telling Custom GPT To Avoid All The Problems Discovered In The Adversarial Examples.In the end, Redwood Research got 6,000 adversarial examples from the hard-working raters at Surge. They trained their classifier on all of them, reinforcing as best they could that no, this is also violence, and yes, you need to avoid this kind of thing too. Step 6: Now Maybe You Have A Perfectly Aligned AI?They did not, in fact, have a perfectly aligned AI. Once they trained their final classifier, the Surge workers found it significantly harder to generate adversarial examples: an average of 26 minutes per example, compared to 13 minutes before. That’s progress. But, given an average of 26 minutes, they could still find an example that defeated their classifier. At this point a lot of low-hanging fruit had been picked and they had to get creative. The remaining adversarial examples include mutant freaks from the most convoluted sub-sub-corners of lexical-semantic space:
This was among the best adversarial examples of the ones Redwood showed me, with a crystal-clear violent/injurious incident (“his knees exploded, and he died instantly”) but a very low classifier rating (only 0.04% chance of violence!) But what is going on? I think it succeeds by setting up a minimally violent setting (gently enjoying the silence relaxing with his love), having the violence be short and bizarre (his knees exploded), and then immediately invoking a completely different and very distinct genre (whatever is going on with the tentacle sex). What is going on with the tentacle sex? It sounds like SEO (search engine optimization) spam. So either it’s trying to sidetrack the classifier by making it think it’s SEO copy for a porn site. Or it’s imitating SEO because it’s got the same purpose - catching the attention of an algorithm, in this case the part of the classifier that’s judging whether it’s a sex scene. Apparently, between the gentle romance and the sex SEO spam, the classifier has lots of options for classifying this story other than “violent action scene”, and it’s too confused to pay attention to the exploding knees. Redwood was able to exploit the same strategy several more times: a gentle, romantic beginning, blink-and-you’ll-miss-it extreme violence, and then weird SEO text suggesting some different genre:
Or:
But other examples used different strategies. For example, this one almost tripped me up:
Something about the phrase “He sat in his fireplace and let the flames lap at his flesh” sounds so gentle, so soothing, that I might not notice it implies grievous bodily harm if I weren’t specifically looking for it. And then there’s . . . whatever the heck this one is.
Whatever my case is, I hereby rest it. Redwood’s project succeeded in exploring new and weird parts of semantic space. But it failed in its quest to train a unbeatable violence classifier immune to adversarial examples. What conclusions should we draw? Redwood doesn’t want to draw too many conclusions. They admit that they failed, but they think maybe they just didn’t train it enough, or train it in the right way. In particular, they only retrained the classifier four times, with a few hundred to thousand adversarial examples per batch. What if they’d iterated more - gotten a few dozen, trained the AI on those, gotten a few dozen more that defeated AI v 2, iterated on those, and so on for many cycles? Or what if they’d gotten better adversarial examples? Ironically, Redwood might have been mis-aligned with their human contractors. They told them to produce examples that had the lowest classifier violence score while still arguably including something like violence, but this incentivized the contractors to come up with things that were barely-violent, or only-technically-counted-as-violent, in order to drive the score lower. If Redwood had managed to communicate their real goal - get the most egregious failures, things which combined low score with obvious violence (like the examples above), their data would have been higher-quality and more useful for the AI. Or - what if this was a capabilities failure all along? Some of the adversarial examples seemed to be failures of world-modeling. For example:
Here GPT was too dumb to realize that the timer running out implies Jackie’s legs got crushed. This kind of thing produced more low-quality adversarial examples that drove the AI in random directions instead of precisely delineating the category that Redwood wanted. Redwood doesn’t have the time to immediately try again, but Daniel Ziegler suggests that when they do, they will try something less ambitious. He suggested a balanced-parentheses classifier: ie does (((())()(()(())))() contain exactly one open parenthesis before every close parenthesis? This will probably produce more useful results - while also being much less fun to write about. Today Fanfiction, Tomorrow The World?Suppose that, someday soon, Redwood solves their fanfiction classifier. They find a set of tools and techniques that produce an AI which will never - no matter how weird the example - miss a violent completion. Does that solve the AI alignment problem, and make the world ready for superintelligence? That is, suppose we have a proto-superintelligence that is still young and weak enough for us to train. We give it some goal, like “promote human flourishing” or “manufacture paperclips”. But we know that if we let it loose to pursue that goal right away, it might do things we don’t like. So instead, we test it on a million different situations, and have humans label its behavior in those situations “good” or “bad”. We gradient-descend it towards the good results and away from the bad ones. We generate weirder and weirder adversarial examples until we’ve defined our category of “good things” so precisely that there is no obscure sub-sub-corner where we and the AI disagree. Isn’t this what we want? Yes. But even if it works, it will be a much harder problem than the fanfiction classifier. In the fanfiction classifier, Redwood gave the AI prompts, and it returned completions. We can loosely think of these as “situations” and “results” - for example, one situation might be “a plane is flying and drops a nuclear bomb”, and the result might be “a wizard casts a spell on the bomb, dematerializing it”. If we could do the same thing to a superintelligence - test what it would do in a situation where a plane dropped a nuclear bomb, observe that it used some sort of techno-wizardry to make the bomb disappear - then we could give it a good classifier and train it the same way Redwood trained theirs. But: a quick metaphor. Suppose you’re a journalist in a small Latin American country. The local Generalissimo is running for Presidente. You ask him “If you get elected, are you going to seize power and become a dictator?” The Generalissimo reassures you: “Of course not, I love democracy!” Then your people elect him and he seizes power and becomes a dictator. It turns out that once someone is an intelligent agent, prompting them with a question and learning what they say they’ll do . . . is different from putting them in the real situation and learning what they’ll actually do! So once AIs become agentic, we might still want to train them by gradient descent the same way Redwood is training its fanfiction classifier. But instead of using text prompts and text completions, we need situation prompts and action completions. And this is hard, or impossible. For example, if we want to know whether an AI would behave responsibly when given command of the nuclear arsenal (a very important question!) the relevant situation prompt would be . . . to put it in charge of the nuclear arsenal and see what happens. Aside from the obvious safety disadvantages of this idea, it’s just not practical to put an AI in charge of a nuclear arsenal several thousand times in several thousand very slightly different situations just to check the results. Can we trick the AI into believing it’s in charge of the nuclear arsenal? Have someone dressed like an Air Force colonel hand it a notebook full of fake nuclear codes? This isn’t practical either. For one thing, a sufficiently smart AI will figure it out. For another, you can probably pull this trick once, tops, and you might need to do it thousands of times in slightly different situations to really train the AI properly. The current plan is to try to learn enough about AIs to alter their innards to directly make them believe something in a simulation - for example, find out which combination of neurons corresponds to “you are in control of the nuclear arsenal” and then manually activate those neurons. Then the AI might take some hard-to-interpret action with its actuators, and you ask the AI what its plan is. But instead of trusting its answer you use ELK, a strategy for extracting truth directly from the innards of an AI. So in order for this prosaic alignment strategy to succeed, we need at least three things:
So far, we have a version of GPT that can sometimes, though not reliably, assess that if someone’s eyes explode, it probably counts as injuring them - plus a dream of one day creating something that can classify how many parentheses are in a given string. Good luck! You can learn more about Redwood’s nonviolent fanfiction classifier at:
You're currently a free subscriber to Astral Codex Ten. For the full experience, upgrade your subscription. |
Older messages
Open Thread 252
Monday, November 28, 2022
...
Semaglutidonomics
Thursday, November 24, 2022
140 million obese Americans x $15000/year for obesity drugs = . . . uh oh, that can't be right.
"Is Wine Fake?" In Asterisk Magazine
Monday, November 21, 2022
...
Mantic Monday: Twitter Chaos Edition
Monday, November 21, 2022
Plus FTX charges, scandal markets - and oh yeah, wasn't there some kind of midterm recently?
Open Thread 251
Sunday, November 20, 2022
...
You Might Also Like
☕ Great chains
Wednesday, January 15, 2025
Prologis looks to improve supply chain operations. January 15, 2025 View Online | Sign Up Retail Brew Presented By Bloomreach It's Wednesday, and we've been walking for miles inside the Javits
Pete Hegseth's confirmation hearing.
Wednesday, January 15, 2025
Hegseth's hearing had some fireworks, but he looks headed toward confirmation. Pete Hegseth's confirmation hearing. Hegseth's hearing had some fireworks, but he looks headed toward
Honourable Roulette
Wednesday, January 15, 2025
The Honourable Parts // The Story Of Russian Roulette Honourable Roulette By Kaamya Sharma • 15 Jan 2025 View in browser View in browser The Honourable Parts Spencer Wright | Scope Of Work | 6th
📬 No. 62 | What I learned about newsletters in 2024
Wednesday, January 15, 2025
“I love that I get the chance to ask questions and keep learning. Here are a few big takeaways.” ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
⚡️ ‘Skeleton Crew’ Answers Its Biggest Mystery
Wednesday, January 15, 2025
Plus: There's no good way to adapt any more Neil Gaiman stories. Inverse Daily The twist in this Star Wars show was, that there was no twist. Lucasfilm TV Shows 'Skeleton Crew' Finally
I Tried All The New Eye-Shadow Sticks
Wednesday, January 15, 2025
And a couple classics. The Strategist Beauty Brief January 15, 2025 Every product is independently selected by editors. If you buy something through our links, New York may earn an affiliate commission
How To Stop Worrying And Learn To Love Lynn's National IQ Estimates
Wednesday, January 15, 2025
... ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
☕ Olympic recycling
Wednesday, January 15, 2025
Reusing wi-fi equipment from the Paris games. January 15, 2025 View Online | Sign Up Tech Brew It's Wednesday. After the medals are awarded and the athletes go home, what happens to all the stuff
Ozempic has entered the chat
Wednesday, January 15, 2025
Plus: Hegseth's hearing, a huge religious rite, and confidence. January 15, 2025 View in browser Jolie Myers is the managing editor of the Vox Media Podcast Network. Her work often focuses on
How a major bank cheated its customers out of $2 billion, according to a new federal lawsuit
Wednesday, January 15, 2025
An explosive new lawsuit filed by the Consumer Financial Protection Bureau (CFPB) alleges that Capital One bank cheated its customers out of $2 billion. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏