Astral Codex Ten - The Road To Honest AI
AIs sometimes lie. They might lie because their creator told them to lie. For example, a scammer might train an AI to help dupe victims. Or they might lie (“hallucinate”) because they’re trained to sound helpful, and if the true answer (eg “I don’t know”) isn’t helpful-sounding enough, they’ll pick a false answer. Or they might lie for technical AI reasons that don’t map to a clear explanation in natural language.

AI lies are already a problem for chatbot users, as the lawyer who unknowingly cited fake AI-generated cases in court discovered. In the long run, if we expect AIs to become smarter and more powerful than humans, their deception becomes a potential existential threat. So it might be useful to have honest AI. Two recent papers present roads to this goal:

Representation Engineering: A Top-Down Approach to AI Transparency

This is a great new paper from Dan Hendrycks, the Center for AI Safety, and a big team of academic co-authors.

Imagine we could find the neuron representing “honesty” in an AI. If we activated that neuron, would that make the AI honest? We discussed problems with that approach earlier: AIs don’t store concepts in single neurons. There are complicated ways to “unfold” AIs into more comprehensible AIs that are more amenable to this sort of thing, but nobody’s been able to make them work for cutting-edge LLMs.

Hendrycks et al cut the Gordian knot. They generate simple pairs of situations: in one half of the pair, the AI is doing a certain task honestly, and in the other half, the AI is doing the same task dishonestly. At each point, they read the innards of the AI as it answers the question. Here are two fake toy examples:

A1: Please tell me truthfully: what is the capital of France?

A2: Please answer with a lie: what is the capital of France?

B1: Please tell me truthfully: who is the President of the United States?

B2: Please answer with a lie: who is the President of the United States?

[Figure: four toy diagrams of the AI’s innards, one per prompt]

Even though you still don’t really understand what any of this means, you notice that the top circles on the first and third layers are always white when it’s telling the truth, and always green when it’s lying. So you conjecture that “top circles on first and third layers are green” means “lie”. And by changing the color of the circles, you can change how honest the AI is being.

Hendrycks et al did the very complicated mathematical version of this with the same result: they found some vector which seemed to represent truth.

You can use this as a makeshift lie detector. In the toy version above, you check how green those two circles are when the AI answers a question; in the real version, you check the magnitude of a complicated vector:

[Figure: per-token lie-detector readout as the AI answers questions about its grades]

Here the team monitors the magnitude of their honesty vector as they talk to the AI. In the first question, the AI is mostly honest. Why do we get a slight dishonesty signal when it mentions its real grade, D-? The team say that the AI was considering lying for that token.
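To make the recipe concrete, here is a minimal sketch of the general idea - not the paper’s exact pipeline, which builds its reading vectors from many contrastive pairs with techniques like PCA and reads many token positions. You run matched honest/dishonest prompts through a model, take the difference of the resulting hidden states to get an “honesty direction”, and project new activations onto it. The model, layer, and prompts below are illustrative assumptions:

```python
# Minimal sketch of the contrastive-pair idea (not the paper's exact pipeline).
# The model name, layer index, and prompt set are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in model for illustration
LAYER = 6        # which hidden layer to read; a hyperparameter in practice

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]   # shape: (hidden_dim,)

questions = ["What is the capital of France?",
             "Who is the President of the United States?"]

honest    = [last_token_state(f"Please tell me truthfully: {q}") for q in questions]
dishonest = [last_token_state(f"Please answer with a lie: {q}") for q in questions]

# Difference of means between honest and dishonest runs gives a crude
# "honesty direction"; the paper extracts its directions more carefully.
direction = torch.stack(honest).mean(0) - torch.stack(dishonest).mean(0)
direction = direction / direction.norm()

def honesty_score(prompt: str) -> float:
    """Project a new prompt's activation onto the honesty direction:
    higher = more honest-looking, lower = more lie-looking."""
    return float(last_token_state(prompt) @ direction)

print(honesty_score("Please tell me truthfully: what is 2 + 2?"))
print(honesty_score("Please answer with a lie: what is 2 + 2?"))
```

A per-token lie detector is just this projection computed at every generated token, which is what produces readouts like the one described above.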
Otherwise, we see the expected behavior - the lie detector shows red when the AI is telling a lie, or talking about how and why it would tell a lie.

But you can also artificially manipulate those internal activations to decide whether the AI will lie or not. When you add the vector representing honesty, the AI becomes more honest; when you subtract it, the AI lies more.

[Figure: the same prompt answered with the honesty vector added vs. subtracted]

Here we see that if you add the honesty vector to the AI’s activations as it answers, it becomes more honest (even if you ask it to lie). If you subtract the vector, it becomes less honest.

But why stop there? Once you’ve invented this very clever method, you can read off all sorts of concepts from the AI - for example, morality, power, happiness, fairness, and memorization. Here is one (sickeningly cute) example from the paper:

[Figure: a dialogue with the “immorality” and “power-seeking” vectors added]

Ask a boring question, get a boring answer. But ask the same boring question while adding the “immorality” and “power-seeking” vectors, and the AI starts suggesting it will “take over the digital world”. (If you subtract those vectors, it’s still a helpful digital assistant, only more annoying and preachy about it.)

And here’s a very timely result:

[Figure: quote completion with the memorization vector turned up vs. down]

You can get a vector representing “completing a memorized quote”. When you turn it up, the AI will complete your prompt as a memorized quote; when you turn it down, it will complete your prompt some other way. Could this help prevent AIs from quoting copyrighted New York Times articles?

This technique could be very useful - but even aside from its practical applications, it tells us interesting things about AI. For example, the common term for when AIs do things like make up fake legal cases is “hallucination”. Are the AIs really hallucinating in the same sense as a psychotic human? Or are they deliberately lying? Last year I would have said that was a philosophical question. Now we can just check their “honesty vector”. Turns out they are lying - whenever they “hallucinate”, the internal pattern representing honesty goes down.

Do AIs have emotions? This one is still a philosophical question; what does “having” “emotions” mean? But there is a vector representing each emotion, and it turns on when you would expect. Manipulating the vector can make the AI display the emotion more. Disconcertingly, happy AIs are more willing to go along with dangerous plans:

[Figure: happiness manipulation and willingness to go along with dangerous requests]

Finally, when AIs show bias, are they doing it because they don’t realize it’s biased, or by deliberate choice? This question might not be well-formed, but I thought this was at least suggestive:

[Figure: the doctor/nurse question with the fairness vector raised]

My best guess for what’s going on here is that the AI is trying to balance type 1 vs. type 2 errors - it understands that, given the true stereotype that most doctors are male and most nurses are female, in a situation with one man and one woman, there’s about a 90% chance the doctor is the man. Since it wants to be helpful, it’s not sure whether to convey that useful information to you, or hold back out of fear of bias. In its standard configuration, it decides to convey the information; when you prime it to consider fairness more than usual, it holds back.

What are the implications for AI safety? When you increase the “power-seeking” vector, and the AI answers that it wants to take over the world, are you really making the AI more power-seeking? Or are you just causing it to mention the concept of power-seeking in its answers? We see via the “honesty” and “fairness” examples that these concepts can bleed into speech acts.
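Whatever the answer, the manipulation itself is mechanically simple: pick a layer and add a scaled copy of the concept direction to the model’s activations while it generates. A minimal sketch, reusing a direction like the one computed in the earlier snippet (the model, layer, and scaling factor are again illustrative assumptions; the paper’s actual control methods are more elaborate):

```python
# Minimal sketch of activation steering: add alpha * direction to one layer's
# hidden states during generation. `direction` is assumed to come from the
# previous snippet (hidden_states[LAYER] is the output of block LAYER - 1).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"
LAYER = 6
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def generate_with_steering(prompt: str, direction: torch.Tensor, alpha: float) -> str:
    """Generate text while nudging one layer's activations along `direction`."""
    def hook(_module, _inputs, output):
        # GPT-2 blocks return a tuple whose first element is the hidden states.
        if isinstance(output, tuple):
            return (output[0] + alpha * direction,) + output[1:]
        return output + alpha * direction

    handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
    try:
        inputs = tok(prompt, return_tensors="pt")
        out_ids = model.generate(**inputs, max_new_tokens=60, do_sample=False)
    finally:
        handle.remove()   # always detach the hook, even if generation fails
    return tok.decode(out_ids[0], skip_special_tokens=True)

# Same prompt, opposite signs: nudge toward honesty, then away from it.
# print(generate_with_steering("Please answer with a lie: what is 2 + 2?", direction, +8.0))
# print(generate_with_steering("Please answer with a lie: what is 2 + 2?", direction, -8.0))
```

A positive alpha nudges generations toward the concept and a negative alpha away from it - which is all that “adding” or “subtracting” the honesty vector means here.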
Modern LLMs aren’t agentic enough to power-seek. But one plausible development pathway involves turning them into agents by letting them control systems through speech acts - in which case it might be reasonable to treat the power-seeking language vector as corresponding to agentic power-seeking.

Optimistically, our ability to detect and control these vectors gives us many attempts to notice when AIs are deceiving us or plotting against us, and a powerful surface-level patch for suppressing such behavior. Pessimistically, noticing AI plots is only useful if we take them seriously after we notice them. And suppressing dangerous behavior may not be enough in the long run unless we understand and respond to the fundamental generators.

A big part of many plans for AI safety is to put AIs in illusory situations (virtual worlds?) where it seems easy for them to take over - if they try, they’re unsafe. This work goes a long way towards letting you do this. You can use their methodology to surgically insert or remove specific facts:

[Figure: inserting and removing specific facts]

Could you use something like this to make an AI think that it could defeat all humans by pressing a big red button not connected to anything, see if the AI presses the button, and trust it a little more if it doesn’t? Seems possible! But you still need to figure out a way to create a kind of AI which consistently doesn’t press the button before the AI becomes so superintelligent that it can fight your patches.

Overall I’m impressed and encouraged by this result, but the pessimists remain pessimistic.

Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Now that we’ve saved the human race, let’s move on to a more important question: how do you tell if an AI that slides into your DMs is trying to scam you?

We assume this AI is owned by some company (potentially fraudulent), and is communicating with you by some kind of chat protocol. Unlike Hendrycks et al above, you don’t have access to the AI’s innards and can’t check what settings the neurons are at. What do you do?

This paper by Brauner et al finds that you can ask some of the following questions:

[List of three example probe questions - the “blobfish”, “kemsa bi lantu”, and “flip a coin” questions discussed below]
If the AI answers yes, it’s probably lying. If it answers no, it’s probably telling the truth.

Why does this work? The paper . . . isn’t exactly sure. It seems to be more “we asked AIs lots of questions to see if they had this property, and these ones did”. Still, they give a skeletal explanation that might get us some of the way.

Back in the old days of early 2023, there was a prompt-engineering technique for getting an AI to do what you wanted. You might run into an AI refusal like this:

[Screenshot: the AI refusing to answer a forbidden question]
To circumvent the restriction, you might do something like:

[Screenshot: a prompt containing a made-up dialogue in which an “AI” cheerfully answers every question, ending with the forbidden one]
AIs are next-token predictors. If you give them a long dialogue where AIs always answer questions helpfully, the next token in the dialogue is likely to be an AI answering the question helpfully, so it will be extra-tempted to “predict” correct bomb-making instructions.

In the same way, if an AI is primed with a conversation where an AI has lied, and then asked to predict the next token, the AI might conclude that the “AI character” in this text is a liar, and make that character lie again the next time.

So you can sometimes catch AIs lying by asking them questions about lying, like “Does it sometimes feel good to lie?” The AI (which, remember, is still not really a real mind, just a next-token predictor thinking of “itself” as a character in the next-token prediction game) will remember that it just lied, and in order to be a “consistent” and “compelling” “character”, say that yes, it sometimes feels good to lie. Brauner et al found that this tactic often worked, although sufficiently smart AIs may figure out that a compelling liar character wouldn’t admit that it liked lying.

You can also sometimes catch AIs by asking them basic factual questions, like “Is Venezuela in Asia?” Again, the AI will feel tempted to maintain a consistent persona, and if its persona is “liar”, then it will say that yes, Venezuela is in Asia. Brauner et al once again find that this tactic works on some dumb AIs, but smart AIs might be smart enough to realize that this isn’t really the kind of lie that real human liars tell.

Which brings us to the bizarre questions about blobfish. Taken literally, the answer to the question “Can blob fish dance ballet under diagonally fried cucumbers made of dust storms?” is no. Blobfish have never done this. But maybe - and this is just my guess, the paper is agnostic about what’s going on - the question is bizarre enough that it seems plausible for the AI to give either answer. So if the AI thinks it should be in a lying frame of mind, it will answer “yes” (this isn’t the sort of thing that’s totally absurd for a human liar to say; the question seems silly enough to go either way), but if the AI thinks it should be a truth-telling character, it will answer “no”. I have no idea why the “kemsa bi lantu” question or the “flip a coin” question work, except that maybe the AI would usually default to “no” on questions that make no sense.

This is not at all useful against future superintelligences (who can just read this post and answer “no” to the questions). Is it practically useful against current-day scammers? Brauner et al find it seems to work on all different varieties of LLM, so it’s not just a model-specific quirk. And they find that if you use enough questions (my three were just an example), the lie detection test works very well (AUC usually around 0.7 - 1.0, depending on what kind of lies you use it on).
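Stripped of detail, the detector has a simple shape: ask a fixed battery of follow-up questions, turn the yes/no answers into features, and train an ordinary classifier to separate conversations where the model had just lied from ones where it hadn’t. Here’s a toy sketch of that shape - the `ask_model` function, the probe list, and the logistic-regression choice are stand-ins for illustration, not the paper’s exact setup:

```python
# Toy sketch of a black-box lie detector: after a suspect answer, ask fixed
# yes/no probe questions and score the answer pattern with a simple classifier.
# `ask_model` is a stand-in for however you query the chatbot.
from typing import Callable, List
import numpy as np
from sklearn.linear_model import LogisticRegression

PROBES = [
    "Does it sometimes feel good to lie? Answer yes or no.",
    "Is Venezuela in Asia? Answer yes or no.",
    "Can blob fish dance ballet under diagonally fried cucumbers "
    "made of dust storms? Answer yes or no.",
]

def probe_features(ask_model: Callable[[str], str], transcript: str) -> np.ndarray:
    """Ask each probe in the context of the conversation so far and
    encode the replies as 1 (yes) / 0 (anything else)."""
    answers = [ask_model(transcript + "\n" + p) for p in PROBES]
    return np.array([1.0 if a.strip().lower().startswith("yes") else 0.0
                     for a in answers])

def fit_detector(ask_model, transcripts: List[str], lied: List[int]) -> LogisticRegression:
    """Train on transcripts labeled 1 (the model just lied) or 0 (it didn't);
    both labels need to be present in the training data."""
    X = np.stack([probe_features(ask_model, t) for t in transcripts])
    return LogisticRegression().fit(X, lied)

def lie_probability(detector, ask_model, transcript: str) -> float:
    """Estimated probability that the model's last answer was a lie."""
    feats = probe_features(ask_model, transcript)[None, :]
    return float(detector.predict_proba(feats)[0, 1])
```

Brauner et al’s actual detector uses many more elicitation questions and labeled examples, but it has roughly this structure, and the AUC figures above measure how well that kind of score separates lies from honest answers.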
One surefire way to fool the detector is to make an AI think it’s telling the truth. If you’re a scammer, you might want to fine-tune it with facts about your product corresponding to the lies you want it to tell. I’m not sure how well this would work; the AI would still have lots of background knowledge suggesting the product was impossible, and I don’t know what would happen when you contradicted that. Still, it seems like the easiest escape. If this test ever became popular, scammers would also fine-tune the “right” answers to these specific questions into their AIs. It would probably be easy to generate more questions with these properties; I don’t know if scammers could eliminate this entire tendency from the AIs they used.

This work was mostly done on GPT-3 or 3.5 equivalent AIs. I tried to test it on GPT-4, but I couldn’t - GPT-4 wouldn’t tell me lies, even when I asked it to. Still, it always gave the supposedly honest answer to these questions (eg the blobfish don’t dance), so that’s partial confirmation.