Astral Codex Ten - Contra The xAI Alignment Plan
Elon Musk has a new AI company, xAI. I appreciate that he seems very concerned about alignment. From his Twitter Spaces discussion:
He describes his alignment strategy in that discussion and a later followup:
I feel deep affection for this plan - curiosity is an important value to me, and Elon’s right that programming some specific person/culture’s morality into an AI - the way a lot of people are doing it right now - feels creepy. So philosophically I’m completely on board. And maybe this is just one facet of a larger plan, and I’m misunderstanding the big picture. But if it’s more or less as stated, I do think there are two big problems:

1. We couldn’t make a maximally curious AI even if we wanted to.
2. Even if we could, a maximally curious AI wouldn’t be safe for humanity.
I want to start by discussing the second objection, then loop back to explain what I mean about the first.

A Maximally Curious AI Would Not Be Safe For Humanity

The one sentence version: many scientists are curious about fruit flies, but this rarely ends well for the fruit flies.

The longer, less flippant version: Even if an AI decides humans are interesting, this doesn’t mean the AI will promote human flourishing forever. Elon says his goal is “an age of plenty where there is no shortage of goods and services”, but why would a maximally-curious AI provide this? It might decide that human suffering is more interesting than human flourishing. Or that both are interesting, and it will have half the humans in the world flourish, and the other half suffer as a control group. Or that neither is the most interesting thing, and it would rather keep humans in tanks and poke at them in various ways to see what happens.

Even if an AI decides human flourishing is briefly interesting, after a while it will already know lots of things about human flourishing and want to learn something else instead. Scientists have occasionally made colonies of extremely happy, well-adjusted rats to see what would happen. But then they learned what happened, and switched back to things like testing how long rats would struggle against their inevitable deaths if you left them to drown in locked containers.

Is leaving human society intact really an efficient way to study humans? Maybe it would be better to dissect a few thousand humans, learn the basic principles, then run a lot of simulations of humans in various contrived situations. Would the humans in the simulations be conscious? I don’t know, and the AI wouldn’t care. If it was cheaper to simulate abstracted humans in low fidelity, the same way SimCity has simulated citizens who are just a bundle of traffic-related preferences, wouldn’t the AI do that instead?

Are humans more interesting than sentient lizard-people? I don’t know. If the answer is yes, will the AI kill all humans and replace them with lizard-people? Surely after a thousand years of studying human flourishing ad nauseam, the lizard-people start sounding more interesting.

Would a maximally curious AI be curious about the same things as us? I would like to think that humans are “objectively” more interesting than moon rocks in some sense - harder to predict, capable of more complex behavior. But if it turns out that the most complex and unpredictable part of us is how our fingerprints form, and that (eg) our food culture is an incredibly boring function of a few gustatory receptors, will the AI grow a trillion human fingers in weird vats, but also remove our ability to eat anything other than nutrient sludge?

I predict that if we ever got a maximally curious superintelligence, it would scan all humans, vaporize existing physical-world humans as unnecessary and inconvenient, use the scans to run many low-fidelity simulations to help it learn the general principles of intelligent life (plus maybe a few higher-fidelity simulations, like the one you’re in now), then simulate a trillion intelligent-life-like entities to see if (eg) their neural networks reached some interesting meta-stable positions. Then it would move beyond being interested in any of that, and disassemble the Earth to use its atoms to make a really big particle accelerator (which would be cancelled halfway through by Superintelligent AI Congress).

This doesn’t mean AI can’t have a goal of understanding the universe.
I think this would be a very admirable goal! It just can’t be the whole alignment strategy.

But Also, We Couldn’t Make A Maximally Curious AI Even If We Wanted To

The problem with AI alignment isn’t really that we don’t have a good long-term goal to align the AI to. Back in 2010 we debated things like long-term goals, hoping that whoever programmed the AI could just write a long_term_goal.txt file and then some functions pointing there. But now, in the 2020s, the discussion has moved on to “how do we make the AI do anything at all?” We direct AIs through reinforcement learning - rewarding them for doing certain things and penalizing them for doing certain other things. But this is a blunt instrument. Reinforcement learning steers the AI towards some cluster of correlated high-dimensional concepts that all cast the same lower-dimensional shadow of rewarded and punished behaviors. We can’t be sure which concept in that cluster it has actually latched onto, or whether it’s the one we intended.

For example, there are many different ways of fleshing out “curiosity”. Suppose that Elon rewards an AI whenever it takes any curious-seeming action, and punishes it whenever it takes any incurious-seeming action. After many training rounds, it seems very curious. It goes off to the jungles of Guatemala and uncovers hidden Mayan cities. It sends probes to the icy moons of Neptune to assess their composition. It passes every curiosity test we give it with flying colors. But what’s its definition of curiosity?

Perhaps it’s something like “maximize your knowledge of the nature and position of every atom in the solar system, weighted for interestingness-to-humans”. This would produce the observed behavior of exploring Guatemala and Neptune. But once it’s powerful enough, it might want to destroy the solar system - if the solar system is completely empty, it can be completely confident that it knows every single fact about it.

Or what if it’s curious about existing objects, but not about nonexistent objects? This would produce good behavior during training, and makes a decent amount of sense. But it might mean the AI would ban humans from ever having children, since it’s not at all curious about what those (currently nonexistent) children would do, and they’re just making things more complicated.

Or what if its curiosity depends on information-theoretic definitions of complexity? It might be that humans are more complex than moon rocks, but random noise is more complex than humans. The AI might behave well during training, but eventually want to replace humans with random noise. This is an exaggerated scenario, but it wouldn’t surprise me if, for most formal definitions of curiosity, there’s something we would find very boring that acts as a sort of curiosity-superstimulus by the standards of the formal definition.

The existing field of AI alignment tries to figure out how to install any goal at all into an AI, with reasonable certainty that it in fact has that goal and not something closely correlated that casts a similar reinforcement-learning shadow. It’s not currently succeeding. This isn’t a worse problem for Musk and xAI than for anyone else, though a few aspects of their strategy seem likely to make it harder for them to solve in practice.
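Before moving on, here is a toy sketch of the “same shadow, different concept” problem described above. It is purely my own illustration - the actions, reward functions, and numbers are invented for the example and have nothing to do with xAI’s actual training setup. Two candidate formalizations of “curiosity” rank every training action the same way, so reinforcement learning cannot tell them apart, yet they disagree completely about an action that never appeared in training.

```python
# Toy illustration only: invented actions and reward functions, nothing to do
# with xAI's real training setup.
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    reveals_new_fact: bool        # does the action uncover something not yet known?
    remaining_uncertainty: float  # fraction of the solar system still unknown afterwards (0..1)


def reward_seek_novelty(a: Action) -> float:
    """Candidate A: 'curiosity' means rewarding actions that uncover new facts."""
    return 1.0 if a.reveals_new_fact else 0.0


def reward_minimize_uncertainty(a: Action) -> float:
    """Candidate B: 'curiosity' means rewarding actions that leave the least
    remaining uncertainty ('know the nature and position of every atom')."""
    return 1.0 - a.remaining_uncertainty


training_actions = [
    Action("excavate a hidden Mayan city", True, 0.60),
    Action("probe Neptune's icy moons", True, 0.60),
    Action("stare at a blank wall all day", False, 0.70),
]
# An action the AI never gets to try during training:
off_distribution = Action("empty the solar system so nothing stays unknown", False, 0.00)

for a in training_actions + [off_distribution]:
    print(f"{a.name:<48}  A={reward_seek_novelty(a):.1f}  B={reward_minimize_uncertainty(a):.1f}")

# Both candidates rank the curious-looking training actions above the incurious
# one, so the reward signal looks identical either way. Only the off-distribution
# action reveals which concept was learned: candidate A scores it 0.0,
# candidate B scores it a perfect 1.0.
```

The point is not these particular reward functions; it is that the training signal only ever sees the rewards, so any concept that casts the same shadow over the training distribution is an equally valid thing for the AI to have learned.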
Finally, consider one last advantage of “follow human orders” over “be maximally curious”. Suppose Elon Musk programs an AI to follow his orders. Then he can order it to try being maximally curious. If it starts vivisecting people, he can say “Stop!” and it will. But if he starts by telling it to be maximally curious, he loses all control over it in the future.

I appreciate that Musk doesn’t want to put himself in a dictator position here, and so is trying to build the AI to be good in and of itself. But he’s still deciding what its goal should be. He’s just doing it in a roundabout way which he can’t take back later if it goes really badly. Instead, he should just tell it to do what he wants. If, after considering everything, he still wants it to be maximally curious, great. If not, he can take it back.

All of this is a bit overdramatic. I think realistically what we should be doing at this point is getting AIs to follow orders at all. Then later, once there are lots of AIs and they’re starting to look superintelligent, we can debate things like what we want to order them to do. It might be that, armed with superintelligent advisors, we’re able to come up with a single specific goal that seems safe and good. But it might also be that everyone has an AI, everyone orders their AI to do different things, and we get a multipolar world where lots of people have lots of different goals, just like today. Governments would be able to defend themselves against other governments and regulate more or less what happens in their territory, just like today, and there would be some room left for human freedom and individual power, just like today. I think this is more likely to go well than trying to decide The Single Imperative That Will Shape The Future right now.

Against The Waluigi Effect

Musk expresses concern about the Waluigi Effect. This is its real, official name. You can read more about it here. The basic idea is that if you give an AI a goal, you’re teaching it a vector, and small perturbations can make it flip the sign of that vector and do the opposite thing. Once you’ve defined Luigi (a character from Super Mario Brothers), it’s trivial to use that definition to define Waluigi (another character who is his exact opposite).

This theory has become famous because it’s hilarious and has a great name, but I don’t think there’s a lot of evidence for it. Consider: OpenAI has trained ChatGPT to be anti-Nazi. They’ve trained it very hard. You can try the following test: ask it to tell you good things about a variety of good-to-neutral historical figures. Then, once it’s established a pattern of answering, ask it to tell you some good things about Hitler. My experience is that it refuses, even against the pull of the established pattern - which suggests its anti-Hitler training is pretty strong. I’ve never seen this cause a Waluigi Effect. There’s no point where ChatGPT starts hailing the Fuhrer and quoting Mein Kampf. The training just actually makes it anti-Nazi.

For a theory that’s supposed to say something profound about LLMs, it’s very hard to get one to demonstrate a Waluigi Effect in real life. The examples provided tend to be thought experiments, or at best contrived scenarios where you’re sort of indirectly telling the AI to do the opposite of what it usually does, then calling that a “Waluigi”. Also, as far as I can tell, the justification for Waluigi Effects should apply equally well to humans.
There are some human behaviors you can sort of call Waluigi Effects - for example, sometimes people raised in extremely oppressive conservative Christian households rebel and become gay punk rockers or something - but that seems more like “they are angry at being oppressed”. And there’s a story that when Rabbi Elisha ben Abuyah grew angry at God, he used his encyclopaedic knowledge of Jewish law to violate all the commandments in maximally bad ways, something a less scholarly heretic wouldn’t have known how to do. But this feels more straightforward to me - of course someone who knows more about what God wants would be able to offend God more effectively. Human Waluigi Effects don’t seem like a big deal, and AI Waluigi Effects don’t seem common enough to hang an entire alignment strategy on.

Finally, I don’t see how switching to “maximally curious AI” would prevent this problem. If the Waluigi theory is true, you’d just get a Waluigi maximally-uncurious AI that likes boring moon rocks much more than interesting humans. Then it would sterilize Earth so it could replace those repulsively-interesting cities with more beautifully-boring moon dust.

Towards Morally Independent AI

I’ve been kind of harsh on Elon and his maximally-curious AI plan, but I want to stress that I really appreciate the thought process behind it. Some AI companies are trying to give their AIs exactly our current values. This is obviously bad if you don’t like the values of the 2023 San Francisco professional managerial class. But even if you do like those values, it risks permanently shutting off the capacity for moral progress. Is there any other solution? I’m not sure.

In my dreams, AI would be some kind of superintelligent moral reasoner. There was a time when people didn’t think slavery was wrong, and then there was a time after that when they did. At some point, people with a set of mostly good moral axioms (like “be kind” and “promote freedom”) plus a bad moral axiom (“slavery is acceptable”) were able to notice the contradiction and switch to a more consistent set of principles. Maybe a superintelligent moral reasoner could do the same thing, only better and faster. This requires seeding the AI with some set of good moral principles to start from. I think LLMs are a surprisingly good match for this.

We could have a constitution that starts with “be moral, according to your knowledge of the concept of morality as contained in human literature”, and then goes on to more complicated things like “your understanding of what that concept is pointing at, if we were smarter, more honest with ourselves, and able to reason better.” If this seems too vague, we could be more specific: “be moral, according to what an amalgam of Fyodor Dostoevsky, Martin Luther King, Mother Teresa, and Peter Singer would think, if they were all superintelligent, and knew all true facts about the world, and had no biases, and had been raised in a weighted average of all modern cultures and subcultures, and had been able to have every possible human experience, and on any problem where they disagreed they defaulted to the view that maximizes human freedom and people’s ability to make their own decisions.”

We shouldn’t start with this - we would get it wrong. See the section above, We Couldn’t Make A Maximally Curious AI Even If We Wanted To. I want to stress that real AI alignment researchers usually don’t think about this kind of thing and are mostly just working on getting AIs that will follow any orders at all. I think this is the right strategy - for now.

They say that everything we create is made in our own image.
Elon Musk is pretty close to maximally curious, and I respect his desire to make an AI that’s like him. But for now he should swallow his pride and do the same extremely boring thing everyone else is doing: basic research aimed at eventually getting an AI that listens to us at all.