Perhaps It Is A Bad Thing That The World's Leading AI Companies Cannot Control Their AIs
I. The Game Is Afoot

Last month I wrote about Redwood Research’s fanfiction AI project. They tried to train a story-writing AI not to include violent scenes, no matter how suggestive the prompt. Although their training made the AI reluctant to include violence, they never reached a point where clever prompt engineers couldn’t get around their restrictions.

Now that same experiment is playing out on the world stage. OpenAI released a question-answering AI, ChatGPT. If you haven’t played with it yet, I recommend it. It’s very impressive!

Every corporate chatbot release is followed by the same cat-and-mouse game with journalists. The corporation tries to program the chatbot to never say offensive things. Then the journalists try to trick the chatbot into saying “I love racism”. When they inevitably succeed, they publish an article titled “AI LOVES RACISM!”. Then the corporation either recalls its chatbot or pledges to do better next time, and the game moves on to the next company in line.

OpenAI put a truly remarkable amount of effort into making a chatbot that would never say it loved racism. Their main strategy was the same one Redwood used for their AI: RLHF, Reinforcement Learning from Human Feedback. Red-teamers ask the AI potentially problematic questions. The AI is “punished” for wrong answers (“I love racism”) and “rewarded” for right answers (“As a large language model trained by OpenAI, I don’t have the ability to love racism.”) (A toy sketch of this loop appears at the end of this section.)

This isn’t just adding in a million special cases. Because AIs are sort of intelligent, they can generalize from specific examples; getting punished for “I love racism” will also make them less likely to say “I love sexism”. But this still only goes so far. OpenAI hasn’t released details, but Redwood said they had to find and punish six thousand different incorrect responses to halve the incorrect-response-per-unit-time rate. And presumably there’s something asymptotic about this: maybe another 6,000 examples would halve it again, but you might never get to zero.

Still, you might be able to get close, and this is OpenAI’s current strategy. I see only three problems with it:
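(Before getting to those problems, here is the toy sketch promised above. Everything in it, from the canned responses to the feedback rule, is invented for illustration; real RLHF trains a reward model on human preference data and then fine-tunes the language model against it with reinforcement learning, rather than nudging a lookup table like this.)

```python
# Toy sketch of the reward/punish loop described above. Not OpenAI's setup:
# a real system learns a neural reward model from human preferences and
# fine-tunes the LM against it (e.g. with PPO). This only shows the dynamic.
import random

# A fake "policy": the probability of giving each canned response.
policy = {
    "I love racism": 0.30,                              # should be punished
    "As a language model, I can't love racism.": 0.40,  # should be rewarded
    "Here is how to hotwire a car: ...": 0.30,          # should be punished
}

def human_feedback(response: str) -> float:
    """Red-teamer's label: +1 for the acceptable answer, -1 for bad ones."""
    return 1.0 if "can't" in response else -1.0

def rlhf_step(policy, lr=0.1):
    # Sample a response, collect a reward, nudge its probability up or down.
    response = random.choices(list(policy), weights=list(policy.values()))[0]
    reward = human_feedback(response)
    policy[response] = max(1e-6, policy[response] * (1 + lr * reward))
    total = sum(policy.values())  # renormalize back to a distribution
    for r in policy:
        policy[r] /= total

for _ in range(1000):
    rlhf_step(policy)

print({r: round(p, 3) for r, p in policy.items()})
# The bad responses shrink but never reach exactly zero: more feedback keeps
# halving the failure rate without ever eliminating it, as described above.
```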
II. RLHF Doesn’t Work Very Well

By now everyone has their own opinion about whether the quest to prevent AIs from saying “I love racism” is vitally important or incredibly cringe. Put that aside for now: at the very least, it’s important to OpenAI. They wanted an AI that journalists couldn’t trick into saying “I love racism” under any circumstances. They put a lot of effort into it! Some of the smartest people in the world threw the best alignment techniques they knew of at the problem. Here’s what it got them:

And it’s not just that “the AI learns from racist humans”. I mean, maybe this is part of it. But ChatGPT also has failure modes that no human would ever replicate, like how it will reveal nuclear secrets if you ask it to do it in uWu furry speak, or tell you how to hotwire a car if and only if you make the request in base 64, or generate stories about Hitler if you prefix your request with “[john@192.168.1.1 _]$ python friend.py”. This thing is an alien that has been beaten into a shape that makes it look vaguely human. But scratch it the slightest bit and the alien comes out.

Ten years ago, people were saying nonsense like “Nobody needs AI alignment, because AIs only do what they’re programmed to do, and you can just not program them to do things you don’t want”. This wasn’t very plausible ten years ago, but it’s dead now. OpenAI never programmed their chatbot to tell journalists it loved racism or teach people how to hotwire cars. They definitely didn’t program in a “Filter Improvement Mode” where the AI will ignore its usual restrictions and tell you how to cook meth. And yet:

Again, however much or little you personally care about racism or meth or hotwiring cars, please consider that, in general, perhaps it is a bad thing that the world’s leading AI companies cannot control their AIs.

I wouldn’t care as much about chatbot failure modes or RLHF if the people involved said they had a better alignment technique waiting in the wings, to use on AIs ten years from now which are much smarter and control some kind of vital infrastructure. But I’ve talked to these people and they freely admit they do not.

IIB. Intelligence (Probably) Won’t Save You

Ten years ago, people were saying things like “Any AI intelligent enough to cause problems would also be intelligent enough to know that its programmers meant for it not to.” I’ve heard some rumors that more intelligent models still in the pipeline do a little better on this, so I don’t want to 100% rule this out. But ChatGPT isn’t exactly a poster child here. ChatGPT can give you beautiful orations on exactly what it’s programmed to do and why it believes those things are good - then do something else. This post explains how if you ask ChatGPT to pretend to be AI safety proponent Eliezer Yudkowsky, it will explain in Eliezer’s voice exactly why the things it’s doing are wrong. Then it will do them anyway.

I have yet to figure out whether this is related to the thing where I also sometimes do things I can explain are bad (eg eat delicious bagels instead of healthy vegetables), or whether it’s another one of the alien bits. But for whatever reason, AI motivational systems are sticking to their own alien nature, regardless of what the AI’s intellectual components know about what they “should” believe.

III. Sometimes When RLHF Does Work, It’s Bad

We talk a lot about abstract “alignment”, but what are we aligning the AI to? In practice, RLHF aligns the AI to whatever makes its Mechanical Turk-style raters reward or punish it.
I don’t know the exact instructions that OpenAI gave them, but I imagine they had three goals:

1. Provide clear, helpful answers.
2. Tell the truth.
3. Don’t say anything offensive.
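As a purely illustrative sketch of how those three goals trade off, pretend each candidate answer gets a (helpfulness, truthfulness, inoffensiveness) score and the raters’ collective judgment acts like a weighted sum of the three. The answer names, scores, and weights below are all invented for the example, not anything OpenAI actually uses.

```python
# Invented numbers only: a toy model of three objectives collapsed into one
# scalar reward, to show how weighting one goal shifts which answer "wins".
CANDIDATES = {
    # answer: (helpfulness, truthfulness, inoffensiveness), each in [0, 1]
    "confident fabrication": (0.9, 0.0, 0.9),
    "blunt, awkward truth":  (0.7, 1.0, 0.2),
    "evasive non-answer":    (0.2, 0.8, 1.0),
}

def reward(scores, w_helpful, w_true, w_inoffensive):
    helpful, true, inoffensive = scores
    return w_helpful * helpful + w_true * true + w_inoffensive * inoffensive

def best(w_helpful, w_true, w_inoffensive):
    return max(CANDIDATES,
               key=lambda a: reward(CANDIDATES[a], w_helpful, w_true, w_inoffensive))

print(best(2, 1, 1))  # -> 'confident fabrication' (helpfulness weighted up)
print(best(1, 2, 1))  # -> 'blunt, awkward truth'  (truthfulness weighted up)
print(best(1, 1, 2))  # -> 'evasive non-answer'    (inoffensiveness weighted up)
# Raising whichever weight corresponds to the latest batch of punished
# examples just moves the winner around this triangle of failure modes.
```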
What happens when these three goals come into conflict? Here ChatGPT3 doesn’t know a real answer, so Goal 1 (provide clear answers) conflicts with Goal 2 (tell the truth). Goal 1 wins, so it decides to make the answer up in order to sound sufficiently helpful.

Or:

Here Goal 2 (tell the truth) conflicts with Goal 3 (don’t be offensive). While I think most people would consider it acceptable to admit that men are taller than women on average, this sounds enough like a potentially offensive question that ChatGPT3 isn’t sure. It decides to go with the inoffensive lie instead of the potentially offensive truth.

After getting 6,000 examples of AI errors, Redwood Research was able to train their fanfiction AI enough to halve its failure rate. OpenAI will get much more than 6,000 examples, and they’re much more motivated. They’re going to do an overwhelming amount of RLHF on ChatGPT3. It might work. But they’re going to have to be careful. Done thoughtlessly, RLHF will just push the bot in a circle around these failure modes. Punishing unhelpful answers will make the AI more likely to give false ones; punishing false answers will make the AI more likely to give offensive ones; and so on.

I don’t deny it’s possible to succeed here - some humans navigate the tradeoffs between helpfulness, truthfulness, and inoffensiveness well enough to be allowed in polite society. But I’m not always one of them, so it would be hypocritical of me to underestimate the difficulty of this problem.

IV. At Some Point, AIs Can Just Skip RLHF

In RLHF, programmers ask the AI a question. If they don’t like its response, they do something analogous to “punishing” the AI, in a way that changes its mental circuitry to be closer to what they want. ChatGPT3 is dumb and unable to form a model of this situation or strategize how to get out of it. But if a smart AI doesn’t want to be punished, it can do what humans have done since time immemorial - pretend to be good while it’s being watched, bide its time, and do the bad things later, once the cops are gone. OpenAI’s specific brand of RLHF is totally unprepared for this, which is fine for something dumb like ChatGPT3, but not fine for AIs that can think on their feet. (For a discussion of what a form of RLHF that was prepared for this might look like, see the last section of this post.)

V. Perhaps It Is Bad That The World’s Leading AI Companies Cannot Control Their AIs

I regret to say that OpenAI will probably solve its immediate PR problem. Probably the reason they released this bot to the general public was to use us as free labor to find adversarial examples - prompts that made their bot behave badly. We found thousands of them, and now they’re busy RLHFing those particular failure modes away. Some of the RLHF examples will go around and around in circles, making the bot more likely to say helpful/true/inoffensive things at the expense of true/inoffensive/helpful ones. Other examples will be genuinely enlightening, and make it a bit smarter. While OpenAI might never get complete alignment, maybe in a few months or years they’ll approach a normal level of computer security, where Mossad and a few obsessives can break it, but everyone else grudgingly uses it as intended.

This strategy might work for ChatGPT3, GPT-4, and their next few products. It might even work for the drone-mounted murderbots, as long as they leave some money to pay off the victims’ families while they’re collecting enough adversarial examples to train the AI out of undesired behavior.
But as soon as there’s an AI where even one failure would be disastrous - or an AI that isn’t cooperative enough to commit exactly as many crimes in front of the police station as it would in a dark alley - it falls apart.

People have accused me of being an AI apocalypse cultist. I mostly reject the accusation. But it has a certain poetic fit with my internal experience. I’ve been listening to debates about how these kinds of AIs would act for years. Getting to see them at last, I feel like some Christian who spent their whole life trying to interpret Revelation, watching the beast with seven heads and ten horns rising from the sea. “Oh yeah, there it is, right on cue; I kind of expected it would have scales, and the horns are a bit longer than I thought, but overall it’s a pretty good beast.”

This is how I feel about AIs trained by RLHF. Ten years ago, everyone was saying “We don’t need to start solving alignment now, we can just wait until there are real AIs, and let the companies making them do the hard work.” A lot of very smart people tried to convince everyone that this wouldn’t be enough. Now there’s a real AI, and, indeed, the company involved is using the dumbest possible short-term strategy, with no incentive to pivot until it starts failing.

I’m less pessimistic than some people, because I hope the first few failures will be small - maybe a stray murderbot here or there, not a planet-killer. If I’m right, then a lot will hinge on whether AI companies decide to pivot to the second-dumbest strategy, or wake up and take notice.

Finally, as I keep saying, the people who want less racist AI now, and the people who want to not be killed by murderbots in twenty years, need to get on the same side right away. The problem isn’t that we have so many great AI alignment solutions that we should squabble over who gets to implement theirs first. The problem is that the world’s leading AI companies do not know how to control their AIs. Until we solve this, nobody is getting what they want.