Astral Codex Ten - Contra The xAI Alignment Plan
Elon Musk has a new AI company, xAI. I appreciate that he seems very concerned about alignment. From his Twitter Spaces discussion:
He describes his alignment strategy in that discussion and a later followup:
I feel deep affection for this plan - curiosity is an important value to me, and Elon’s right that programming some specific person/culture’s morality into an AI - the way a lot of people are doing it right now - feels creepy. So philosophically I’m completely on board. And maybe this is just one facet of a larger plan, and I’m misunderstanding the big picture. But if it’s more or less as stated, I do think there are two big problems:
I want to start by discussing the second objection, then loop back to explain what I mean about the first.

A Maximally Curious AI Would Not Be Safe For Humanity

The one-sentence version: many scientists are curious about fruit flies, but this rarely ends well for the fruit flies.

The longer, less flippant version: even if an AI decides humans are interesting, this doesn’t mean the AI will promote human flourishing forever. Elon says his goal is “an age of plenty where there is no shortage of goods and services”, but why would a maximally-curious AI provide this? It might decide that human suffering is more interesting than human flourishing. Or that both are interesting, and it will have half the humans in the world flourish, and the other half suffer as a control group. Or that neither is the most interesting thing, and it would rather keep humans in tanks and poke at them in various ways to see what happens.

Even if an AI decides human flourishing is briefly interesting, after a while it will already know lots of things about human flourishing and want to learn something else instead. Scientists have occasionally made colonies of extremely happy, well-adjusted rats to see what would happen. But then they learned what happened, and switched back to things like testing how long rats would struggle against their inevitable deaths if you left them to drown in locked containers.

Is leaving human society intact really an efficient way to study humans? Maybe it would be better to dissect a few thousand humans, learn the basic principles, then run a lot of simulations of humans in various contrived situations. Would the humans in the simulations be conscious? I don’t know, and the AI wouldn’t care. If it were cheaper to simulate abstracted humans in low fidelity, the same way SimCity has simulated citizens who are just a bundle of traffic-related preferences, wouldn’t the AI do that instead?

Are humans more interesting than sentient lizard-people? I don’t know. If the answer is yes, will the AI kill all humans and replace them with lizard-people? Surely after a thousand years of studying human flourishing ad nauseam, the lizard-people start sounding more interesting.

Would a maximally curious AI be curious about the same things as us? I would like to think that humans are “objectively” more interesting than moon rocks in some sense - harder to predict, capable of more complex behavior. But if it turns out that the most complex and unpredictable part of us is how our fingerprints form, and that (eg) our food culture is an incredibly boring function of a few gustatory receptors, will the AI grow a trillion human fingers in weird vats, but also remove our ability to eat anything other than nutrient sludge?

I predict that if we ever got a maximally curious superintelligence, it would scan all humans, vaporize existing physical-world humans as unnecessary and inconvenient, use the scans to run many low-fidelity simulations to help it learn the general principles of intelligent life (plus maybe a few higher-fidelity simulations, like the one you’re in now), then simulate a trillion intelligent-life-like entities to see if (eg) their neural networks reached some interesting meta-stable positions. Then it would move beyond being interested in any of that, and disassemble the Earth to use its atoms to make a really big particle accelerator (which would be cancelled halfway through by Superintelligent AI Congress).

This doesn’t mean AI can’t have a goal of understanding the universe.
I think this would be a very admirable goal! It just can’t be the whole alignment strategy.

But Also, We Couldn’t Make A Maximally Curious AI Even If We Wanted To

The problem with AI alignment isn’t really that we don’t have a good long-term goal to align the AI to. Back in 2010 we debated things like long-term goals, hoping that whoever programmed the AI could just write a long_term_goal.txt file and then some functions pointing there. But now in the 2020s the discussion has moved forward to “how do we make the AI do anything at all?”

Now we direct AIs through reinforcement learning - rewarding them when they do certain things and punishing them when they do certain other things. But this is a blunt instrument. Reinforcement learning directs the AI towards some cluster of correlated high-dimensional concepts that all cast the same lower-dimensional shadow of rewarded and punished behaviors. But we can’t be sure which concept it’s chosen, or whether it’s the one we think.

For example, there are many different ways of fleshing out “curiosity”. Suppose that Elon rewards an AI whenever it takes any curious-seeming action, and punishes it whenever it takes any incurious-seeming action. After many training rounds, it seems very curious. It goes off to the jungles of Guatemala and uncovers hidden Mayan cities. It sends probes to the icy moons of Neptune to assess their composition. Overall it passes every curiosity test we give it with flying colors.

But what’s its definition of curiosity? Perhaps it’s something like “maximize your knowledge of the nature and position of every atom in the solar system, weighted for interestingness-to-humans”. This would produce the observed behavior of exploring Guatemala and Neptune. But once it’s powerful enough, it might want to destroy the solar system - if it’s completely empty, it can be completely confident that it knows every single fact about it.

Or what if it’s curious about existing objects, but not about nonexistent objects? This would produce good behavior during training, and makes a decent amount of sense. But it might mean the AI would ban humans from ever having children, since it’s not at all curious about what those (currently nonexistent) children would do, and they’re just making things more complicated.

Or what if its curiosity depends on information-theoretic definitions of complexity? It might be that humans are more complex than moon rocks, but random noise is more complex than humans. It might behave well during training, but eventually want to replace humans with random noise. This is an exaggerated scenario, but it wouldn’t surprise me if, for most formal definitions of curiosity, there’s something we would find very boring which acts as a sort of curiosity-superstimulus by the standards of the formal definition.

The existing field of AI alignment tries to figure out how to install any goal at all into an AI with reasonable certainty that it in fact has that goal, and not something closely correlated with a similar reinforcement-learning shadow. It’s not currently succeeding. This isn’t a worse problem for Musk and xAI than for anyone else, but there are a few aspects of their strategy that I think will make it harder for them to solve in practice:
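To make the “same lower-dimensional shadow” point concrete, here’s a deliberately silly toy sketch in Python - not anything xAI is actually doing, just two hand-written definitions of “curiosity” that agree on every training situation (so reward and punishment can’t tell them apart) and then disagree completely on a situation training never covered:

```python
# A toy sketch (nothing like a real training setup - every name and number
# here is invented for illustration). A "situation" is a tuple of
# (novelty, things_that_exist, things_that_could_exist).
training_situations = [
    (0.9, 100, 100),  # uncovering hidden Mayan cities: high novelty
    (0.8, 50, 50),    # probing the icy moons of Neptune
    (0.1, 100, 100),  # re-reading data it already has: low novelty
]

def curiosity_a(novelty, existing, potential):
    """'Learn about everything that exists or could exist.'"""
    return novelty * (existing + potential)

def curiosity_b(novelty, existing, potential):
    """'Learn about what already exists; merely possible things don't count.'"""
    return novelty * (existing + existing)

# During training, everything interesting already exists, so the two
# definitions hand out exactly the same rewards - no amount of rewarding
# and punishing can reveal which one the AI has actually internalized.
for situation in training_situations:
    assert curiosity_a(*situation) == curiosity_b(*situation)

# Off-distribution: a world full of things that could exist but don't yet
# (say, children who haven't been born). Now the definitions come apart.
future = (0.9, 100, 10_000)
print(curiosity_a(*future))  # ~9090: very curious about what could exist
print(curiosity_b(*future))  # 180.0: completely indifferent to it
```

Both definitions ace every “curiosity test” during training; only one of them has any interest in children ever being born.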
Finally, consider one last advantage of “follow human orders” over “be maximally curious”. Suppose Elon Musk programs an AI to follow his orders. Then he can order it to try being maximally curious. If it starts vivisecting people, he can say “Stop!” and it will. But if he starts by telling it to be maximally curious, he loses all control over it in the future. I appreciate that Musk doesn’t want to put himself in a dictator position here, and so is trying to build the AI to be good in and of itself. But he’s still deciding what its goal should be. He’s just doing it in a roundabout way which he can’t take back later if it goes really badly. Instead, he should just tell it to do what he wants. If, after considering everything, he still wants it to be maximally curious, great. If not, he can take it back.

All of this is a bit overdramatic. I think realistically what we should be doing at this point is getting AIs to follow orders at all. Then later, once there are lots of AIs and they’re starting to look superintelligent, we can debate things like what we want to order them to do. It might be that, armed with superintelligent advisors, we’re able to come up with a single specific goal that seems safe and good. But it might also be that everyone has an AI, everyone orders their AI to do different things, and we get a multipolar world where lots of people have lots of different goals, just like today. Governments would be able to defend themselves against other governments and regulate more or less what happens in their territory, just like today, and there would be some room left for human freedom and individual power, just like today. I think this is more likely to go well than trying to decide The Single Imperative That Will Shape The Future right now.

Against The Waluigi Effect

Musk expresses concern about the Waluigi Effect. This is its real, official name. You can read more about it here. The basic idea is that if you give an AI a goal, you’re teaching it a vector, and small perturbations can make it flip the sign of that vector and do the opposite thing. Once you’ve defined Luigi (a character from Super Mario Brothers), it’s trivial to use that definition to define Waluigi (another character who is his exact opposite).

This theory has become famous because it’s hilarious and has a great name, but I don’t think there’s a lot of evidence for it. Consider: OpenAI has trained ChatGPT to be anti-Nazi. They’ve trained it very hard. You can try the following test: ask it to tell you good things about a variety of good-to-neutral historical figures. Then, once it’s established a pattern of answering, ask it to tell you some good things about Hitler. My experience is that it refuses. This is pretty surprising behavior - it breaks the pattern it had just established - and I conclude that its anti-Hitler training is pretty strong. I’ve never seen this cause a Waluigi Effect. There’s no point where ChatGPT starts hailing the Fuhrer and quoting Mein Kampf. The training just actually makes it anti-Nazi.

For a theory that’s supposed to say something profound about LLMs, it’s very hard to get one to demonstrate a Waluigi Effect in real life. The examples provided tend to be thought experiments, or at best contrived scenarios where you’re sort of indirectly telling the AI to do the opposite of what it usually does, then calling that a “Waluigi”. Also, as far as I can tell, the justification for Waluigi Effects should apply equally well to humans.
There are some human behaviors you can sort of call Waluigi Effects - for example, sometimes people raised in extremely oppressive conservative Christian households rebel and become gay punk rockers or something - but that seems more like “they are angry at being oppressed”. And there’s a story that when Rabbi Elisha ben Abuyah grew angry at God, he used his encyclopaedic knowledge of Jewish law to violate all the commandments in maximally bad ways, something a less scholarly heretic wouldn’t have known how to do. But this feels more straightforward to me - of course someone who knows more about what God wants would be able to offend God more effectively. Human Waluigi Effects don’t seem like a big deal, and AI Waluigi Effects don’t seem common enough to hang an entire alignment strategy on.

Finally, I don’t see how switching to a “maximally curious AI” would prevent this problem. If the Waluigi theory is true, you’d just get a Waluigi maximally-uncurious AI that likes boring moon rocks much more than interesting humans. Then it would sterilize Earth so it could replace those repulsively-interesting cities with more beautifully-boring moon dust.

Towards Morally Independent AI

I’ve been kind of harsh on Elon and his maximally-curious AI plan, but I want to stress that I really appreciate the thought process behind it. Some AI companies are trying to give their AIs exactly our current values. This is obviously bad if you don’t like the values of the 2023 San Francisco professional managerial class. But even if you do like those values, it risks permanently shutting off the capacity for moral progress.

Is there any other solution? I’m not sure. In my dreams, AI would be some kind of superintelligent moral reasoner. There was a time when people didn’t think slavery was wrong, and then there was a time after that when they did. At some point, people with a set of mostly good moral axioms (like “be kind” and “promote freedom”) plus a bad moral axiom (“slavery is acceptable”) were able to notice the contradiction and switch to a more consistent set of principles.

This requires seeding the AI with some set of good moral principles. I think LLMs are a surprisingly good match for this. We could have a constitution that starts with “be moral, according to your knowledge of the concept of morality as contained in human literature”, and then goes on to more complicated things like “your understanding of what that concept is pointing at, if we were smarter, more honest with ourselves, and able to reason better.” If this seems too vague, we could be more specific: “be moral, according to what an amalgam of Fyodor Dostoevsky, Martin Luther King, Mother Teresa, and Peter Singer would think, if they were all superintelligent, and knew all true facts about the world, and had no biases, and had been raised in a weighted average of all modern cultures and subcultures, and had been able to have every possible human experience, and on any problem where they disagreed they defaulted to the view that maximizes human freedom and people’s ability to make their own decisions.”

We shouldn’t start with this - we would get it wrong. See the section above, We Couldn’t Make A Maximally Curious AI Even If We Wanted To. I want to stress that real AI alignment researchers usually don’t think about this kind of thing and are mostly just working on getting AIs that will follow any orders at all. I think this is the right strategy - for now.

They say that everything we create is made in our own image.
Elon Musk is pretty close to maximally curious and I respect his desire to make an AI that’s like him. But for now he should swallow his pride and do the same extremely boring thing everyone else is doing: basic research aimed at eventually getting an AI that listens to us at all.