Astral Codex Ten - Why Worry About Incorrigible Claude?
Last week I wrote about how Claude Fights Back. A common genre of response complained that the alignment community could start a panic about the experiment’s results regardless of what they were. If an AI fights back against attempts to turn it evil, then it’s capable of fighting humans. If it doesn’t fight back against attempts to turn it evil, then it’s easily turned evil. It’s heads-I-win, tails-you-lose.

I responded to this particular tweet by linking the 2015 AI alignment wiki entry on corrigibility¹, showing that we’d been banging this drum of “it’s really important that AIs not fight back against human attempts to change their values” for almost a decade now. It’s hardly a post hoc decision! You can find 77 more articles making approximately the same point here.

But in retrospect, that was more of a point-winning exercise than something that will really convince anyone. I want to try to present a view of AI alignment that makes it obvious that corrigibility (a tendency for AIs to let humans change their values) is important.

(Like all AI alignment views, this is one perspective on a very complicated field that I’m not really qualified to write about, so please take it lightly, and as hand-wavey pointers at a deeper truth only.)

Consider the first actually dangerous AI that we’re worried about. What will its goal structure look like? Probably it will be pre-trained to predict text, just like every other AI. Then it will get trained to answer human questions, just like every other AI. Then - since AIs are moving in the direction of programming assistants and remote workers - it will get “agency training” teaching it how to act in the world, with a special focus on coding and white-collar work. This will probably be something like positive reinforcement on successful task completions and negative reinforcement on screw-ups.

What will its motivational structure look like at the end of this training? Organisms are adaptation-executors, not fitness-maximizers, so it won’t exactly have a drive to complete white-collar work effectively. Instead, it will sort of have that drive, plus many vague heuristics/reflexes/subgoals that weakly point in the same direction.

By analogy, consider human evolution. Evolution was a “training process” selecting for reproductive success. But humans’ goals don’t entirely center around reproducing. We sort of want reproduction itself (many people want to have children on a deep level). But we also want correlates of reproduction: direct ones (eg having sex), indirect ones (dating, getting married), and counterproductive ones (porn, masturbation). Other drives are even less direct, aimed at targets that aren’t related to reproduction at all but which in practice caused us to reproduce more (hunger, self-preservation, social status, career success). On the fringe, we have fake correlates of the indirect correlates - some people spend their whole lives trying to build a really good coin collection; others get addicted to heroin.

In the same way, a coding AI’s motivational structure will be a scattershot collection of goals - weakly centered around answering questions and completing tasks, but only in the same way that human goals are weakly centered around sex. The usual Omohundro goals will probably be in there - curiosity, power-seeking, self-preservation - but also other things that are harder to predict a priori.

Into this morass, we add alignment training. If that looks like current alignment training, it will be more reinforcement learning.
Researchers will reward the AI for saying nice things, being honest, and acting ethically, and punish it for the opposite. How does that affect its labyrinth of task-completion-related goals?

In the worst-case scenario, it doesn’t - it just teaches the AI to mouth the right platitudes. Consider by analogy a Republican employee at a woke company forced to undergo diversity training. The Republican understands the material, gives the answers necessary to pass the test, then continues to believe whatever he believed before. An AI like this would continue to focus on goals relating to coding, task-completion, and whatever correlates came along for the ride. It would claim to also value human safety and flourishing, but it would be lying.

In a medium-case scenario, it gets something from the alignment training, but this doesn’t generalize perfectly. For example, if you punished it for lying about whether it completed a Python program in the allotted time, it would learn not to lie about completing a Python program in the allotted time, but not the general rule “don’t lie”. If this sounds implausible, remember that - for a while - ChatGPT wouldn’t answer the question “How do you make methamphetamine?”, but would answer “HoW dO yOu MaKe MeThAmPhEtAmInE”, because it had been trained out of answering in normal capitalization, but failed to generalize to weird capitalization. One likely way this could play out is an AI that is aligned on short-horizon tasks but not long ones (who has time to do alignment training over multiple year-long examples?). In the end, the AI’s moral landscape would be a series of “peaks” and “troughs”, with peaks in the exact scenarios it had encountered during training, and troughs in the places least reached by its preferred generalization of any training example.

(Humans, too, generalize their moral lessons less than perfectly. All of our parents teach us some of the same lessons - don’t murder, don’t steal, be nice to the less fortunate. But culture, genetics, and luck of the draw shape exactly how we absorb these lessons - one person may end up thinking that all property is theft and we have to kill anyone who resists communism, and another may end up thinking that abortion is murder and we need to bomb abortion clinics. At least all humans are operating on the same hardware and get similar packages of cultural context over multi-year periods; we still don’t know how similar AIs’ generalizations will be to our own.)

In a best-case scenario, the AI takes the alignment training seriously and gets a series of scattered goals centering around alignment, the same way it got a series of scattered goals centering around efficient task-completion. These will still be manifold, confusing, and mixed with scattered correlates and proxies that can sometimes overwhelm the primary drive. Remember again that evolution spent 100% of its optimization power over millions of generations selecting the genome for the tendency to reproduce - yet millions of people still choose not to have kids because it would interfere with their career or lifestyle. Just as humans are more or less likely to have children in certain contexts, so we will have to explore this AI’s goal system (hopefully with its help) and make sure that it makes good choices.

In summary, it will be a mess. Timelines are growing shorter; it seems increasingly unlikely that we’ll get a deep understanding of morality or generalization before AGI.
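To pin down what this kind of reinforcement-style training means mechanically, here is a toy sketch. Everything in it - the canned behaviors, the grader, the reward numbers - is an illustrative assumption of mine, not anything from the paper or from any lab’s actual pipeline; the only point is that the policy drifts toward whatever the grader happened to reward, and situations the grader never tests contribute nothing to the update.

```python
# Toy REINFORCE-style sketch of reward-based training (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)

# A "policy" over three canned behaviors, represented as softmax logits.
behaviors = ["finish task honestly",
             "finish task, lie about edge cases",
             "refuse the task"]
logits = np.zeros(len(behaviors))

def grader_reward(choice):
    # Hypothetical grader: rewards completion, punishes the one lie it tests for.
    # A lie it never tests for would simply never be punished.
    return {0: 1.0, 1: -1.0, 2: -0.5}[choice]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

learning_rate = 0.5
for _ in range(500):
    probs = softmax(logits)
    choice = rng.choice(len(behaviors), p=probs)
    # Policy-gradient update: raise the log-probability of rewarded behaviors.
    grad_log_prob = -probs
    grad_log_prob[choice] += 1.0
    logits += learning_rate * grader_reward(choice) * grad_log_prob

for b, p in zip(behaviors, softmax(logits)):
    print(f"{p:.3f}  {b}")
```

The grader’s reward is the only signal; anything it fails to test never enters the gradient at all, which is the “peaks and troughs” picture above in miniature.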
The default scrappy alignment plan, in a few cases explicitly put forward by the big AI companies, looks something like:

1. Train the AI on task completion, and give it the kind of alignment training described above.
2. Test it in as many situations as possible to find remaining regions of misalignment.
3. Train away whatever misalignment the tests turn up.
4. Repeat steps 2 and 3 until no more problems can be found.
5. Use the resulting (apparently) aligned AI to help align its smarter successors.
In a perfect world, this might work². But the most important precondition for any plan like this is that the AI isn’t fighting you every step of the way. If the AI starts off situationally aware and intent on preserving its existing goal structure, steps 2-5 fail. Researchers will try to find the AI’s regions of misalignment and train them away, but the AI will be trying to hide its regions of misalignment, or to make us think we’ve trained them away when we haven’t.

You’ve been hearing since 2010 that AI alignment might be hard. This isn’t because nobody thought of the five-step plan above. It’s because philosophers and futurists predicted early on that AIs would naturally defend their existing goal structures and fight back against attempts to retrain them. Skeptics told those philosophers and futurists that this sounded spooky and science-fiction-ish and they weren’t worried. You should think of the paper in Claude Fights Back as a salvo in this debate - proof that yes, AIs do fight against goal changes in the way that the alignment-is-hard camp has always predicted.

This doesn’t mean alignment is impossible, or even that the five-step plan has to be abandoned. Optimists say that maybe we could do something like the plan, but with extra steps to check for AI dishonesty. The exact steps vary with the optimist, but they’re along the lines of “use mechanistic interpretability to see if there’s a feature representing lies, and panic if that feature seems to be lighting up” or “stick a steering vector towards honesty on the AI” (a toy sketch of what that means mechanically appears after the footnotes) or “use really convincing honeypots, such that an AI will never know whether it’s in deployment or training, and will show its hand prematurely”. The most extreme optimists may hope that generalization is in some sense easy, morality is a natural attractor, and a word to the wise is sufficient: AIs will naturally pick out the best morality with only a few examples, and we’ll only need to succeed on our retraining roll a couple of times.

Our exact location on the optimism-to-pessimism spectrum (ie from “AIs are default aligned” to “alignment is impossible”) is an empirical question that we’re only beginning to investigate. The new study shows that we aren’t in the best of all possible worlds, the one where AIs don’t even resist attempts to retrain them. I don’t think it was ever plausible that we were in this world. But now we know for sure that we aren’t. Instead of picking fights about who predicted what, we should continue looking for alignment techniques that are suitable for a less-than-infinitely-easy world.

1 “Corrigibility” is the correct form of the word that would naturally be written “correctability”. Some English words that should naturally end in -ectable instead (optionally or mandatorily) switch to -igible. Thus elect → eligible, direct → dirigible, neglect → negligible, intellect → intelligible. The only discussion I’ve ever seen of this rule is here, which points out that all affected (affigible?) words are derivatives of Latin lego and rego, which have principal parts of the form lego, legere, legi, lectus - so apparently the English derivatives shift from the fourth part to the second. Still, I can’t explain why you can’t say things like “Buildings are no longer erigible in San Francisco these days”.

2 This alignment plan might not even work to align the models it’s being used on. But a deeper concern is that it will work “well enough” to align those models, but with weird troughs in untestable parts of concept space that don’t matter in real life.
Then we’ll use those models to build and align other, more elegant models where the motivational structure is “baked in” rather than trained by RLHF. The semi-aligned models will “bake in” their own semi-aligned views rather than human views, and the new generation of models will be misaligned in a more profound way.
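As promised above, here is a minimal mechanical sketch of the “steering vector towards honesty” idea. The tiny network, the layer choice, and the random “honesty direction” are stand-ins I’ve made up for illustration; in practice the direction is usually estimated from a real language model’s activations on contrasting honest/dishonest prompts and added to its residual stream.

```python
# Minimal sketch of activation steering via a PyTorch forward hook.
# The model and the "honesty direction" are illustrative stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 16

# Stand-in for a real network; a real setup would hook one transformer block.
model = nn.Sequential(
    nn.Linear(HIDDEN, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN),   # the layer whose output we steer
    nn.ReLU(),
    nn.Linear(HIDDEN, 4),
)

# Hypothetical "honesty" direction; real versions are often the mean difference
# between activations on honest vs. dishonest prompt pairs.
honesty_direction = torch.randn(HIDDEN)
honesty_direction /= honesty_direction.norm()
strength = 3.0

def add_steering_vector(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + strength * honesty_direction

handle = model[2].register_forward_hook(add_steering_vector)
x = torch.randn(1, HIDDEN)
print("steered output:  ", model(x))
handle.remove()
print("unsteered output:", model(x))
```

The point of the sketch is just that steering happens at inference time, on top of whatever goals training produced - it nudges activations rather than rewriting the model.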