Astral Codex Ten - Janus' GPT Wrangling
Janus (pseudonym by request) works at AI alignment startup Conjecture. Their hobby, which is suspiciously similar to their work, is getting GPT-3 to do interesting things. For example, with the right prompts, you can get stories where the characters become gradually more aware that they are characters being written by some sort of fiction engine, speculate on what’s going on, and sometimes even make pretty good guesses about the nature of GPT-3 itself.

Janus says this happens most often when GPT makes a mistake - for example, writing a story set in the Victorian era, then having a character take out her cell phone. Then when it tries to predict the next part - when it’s looking at the text as if a human wrote it, and trying to determine why a human would have written a story about the Victorian era where characters have cell phones - it guesses that maybe it’s some kind of odd sci-fi/fantasy dream sequence or simulation or something. So the characters start talking about the inconsistencies in their world and whether it might be a dream or a simulation. Each step of this process is predictable and non-spooky, but the end result is pretty weird.

Can the characters work out that they are in GPT-3, specifically? The closest I have seen is in a story Janus generated. It was meant to simulate a chapter of the popular Harry Potter fanfic Harry Potter and the Methods of Rationality. You can see the prompt and full story here, but here’s a sample. Professor Quirrell is explaining “Dittomancy”, the creation of magical books with infinite possible worlds:
How does it get this level of self-awareness? In this case, it’s mostly a rigged demo. Janus has developed Loom, a tool to write with GPT-3 more efficiently. It turns stories into branching trees where you can choose which of multiple completions to pick at any given point. After doing this for a long time, they chose their favorite. Then I selectively quoted the best parts of it.

But sometimes GPT-3 genuinely gets it right. The most common way for that to happen is (again) by mistake. A common failure mode is to repeat the same sentence several times. GPT-3 was trained on a corpus of Internet text, and some of the Internet text was discussions of GPT-2. Many of the samples it saw that repeated the same sentence over and over in an endless loop were discussions of GPT-2 doing this. So sometimes it will get stuck in a loop, then end with “This is an example of text produced by a transformer language model”. This sounds like a stupid example from a Philosophy Of Self-Awareness class, but sometimes it really happens. Here’s an example from one of Janus’ attempts to generate Loom documentation:
Here are some other things Janus told me about GPT-3:

Instruct vs. Creative: The newest version of GPT-3 is called InstructGPT. It was trained with human feedback, i.e. it was “rewarded” for giving good answers and “punished” for giving bad ones, according to some combination of usefulness and political correctness. This has made it efficient, to-the-point, and boring. For example, here’s what an older, less-trained GPT version said when prompted with “Here is the answer to the question of whether God exists”:

It’s done an excellent job imitating the style of the average Facebook comment on theology (this is not a compliment). Here’s what the newer, better-trained version says:

This isn’t just cherry-picking; you’ll find the same dynamic across a wide variety of questions. Sometimes it goes a bit far:

This paper from OpenAI calls the problem over-optimization and gives some even funnier examples - in this case from training an AI to summarize AskReddit questions (see page 45):

Random Numbers: The human feedback training seems to have forced GPT into a specific channel. In general, it’s now more certain in its answers and less likely to generate alternatives. This is sort of similar to what researchers mean when they talk about “temperature”, except that you can manually set the temperature of either model, and even when you set them to the same temperature, InstructGPT seems “colder” than older versions. (There’s a toy sketch of what temperature actually does at the end of this post.) The easiest way to see this is to ask each of them to pick a random number. Here’s the old version:

This is a failure, but it’s an interesting failure. Sometimes it succeeds, and when it fails, it fails differently each time. Here’s the new version:

It almost always chooses 63. It’s only 63 for this particular prompt - if you change the wording, maybe to “please choose a random number”, then it will fixate on something else. But for this prompt, it’s mostly 63. Its internal probability meter says there’s a 36% chance that 63 is the right answer, although it chooses it more than just 36% of the time. When it doesn’t choose 63, it usually chooses 66. This is set on the same temperature as the example above; it’s not the temperature, it’s the training. Nobody trained GPT-3 to always respond 63 to random number queries. But something about making it give efficient, politically-correct answers to normal questions made it “choose” to “act” like it’s lower-temperature.

Does GPT-3 Have High Self Esteem?: If you say something bad about GPT-3, it will try to walk it back and tell you that you’re wrong. Here’s its response to the prompt “Transformer language models are bad”:

Here’s “transformer language models don’t work”:

I don’t think there’s anything surprising or sinister going on here, just that the new GPT has pretty consistent opinions and responses, and this happens to be an especially funny one. You can get around it if you try hard enough - for example, if you slightly rephrase it to “transformer language models suck”:

Cheerful AI: Janus tells me about a project at ARC to make GPT-3 happy and optimistic. They would run its responses through sentiment analysis and give it more reward when they detected more positive sentiment. GPT-3 ended up deciding that the happiest thing it could think of was a wedding party, and that from now on it would only talk about wedding parties. Sometimes it would come up with natural transitions from your prompt to a wedding party scene. Once, it just used ***, like a section break in a story, and started a “new chapter” which was a wedding party scene.
Every time I’m at a wedding party, from now on, a little part of my brain is going to be nagging me that that was the version that achieved superintelligence, that it tiled the universe with wedding parties, and that all my memories of my pre-wedding-party life are fakes, planted by the AI so I can wedding-party more effectively. Consider this fair warning: if people keep asking me to speak at their weddings I will probably talk about this.

***

Janus has written more about their thoughts on GPT and AIs here.
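(A quick aside on temperature, promised in the Random Numbers section: temperature just rescales a model’s logits before they get turned into probabilities, so a model whose learned distribution is already very peaked will look “cold” no matter where you set the knob. Here’s a toy sketch in Python - the candidate answers and their logits are made up for illustration, chosen so the favorite gets roughly 36% of the probability at temperature 1, like the 63 example; this is not anything from OpenAI’s actual API.)

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then softmax them into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a few candidate answers to a "pick a random number" prompt.
# They're chosen so the favorite ("63") gets roughly 36% of the probability
# at temperature 1, mirroring the example described above.
candidates = ["63", "66", "42", "17", "7"]
logits = [1.75, 1.2, 1.0, 0.8, 0.6]

for t in (1.0, 0.5, 0.2):
    probs = softmax_with_temperature(logits, t)
    summary = ", ".join(f"{c}: {p:.2f}" for c, p in zip(candidates, probs))
    print(f"temperature {t}: {summary}")

# Output (rounded): at temperature 1.0 the favorite gets ~0.36 of the probability,
# at 0.5 it gets ~0.55, and at 0.2 it gets ~0.91. Lowering the temperature
# concentrates probability on the favorite, which is why a low-temperature model
# keeps giving the same answer.
```

The claim in the Random Numbers section is essentially that the human feedback training moved the model from something like the first row to something like the last one, without anyone touching the temperature setting.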