How Claude uses AI to identify new threats

PLUS: Exclusive data on how people are using Anthropic’s chatbot

By Casey Newton • 12 Dec 2024

View in browser

Claude's Clio, shown here analyzing the Wildchat public dataset and not actual Claude data. (Anthropic)

Here's this week's free edition of Platformer: a report inside the AI lab Anthropic with exclusive news on how people use its popular chatbot, Claude, and a new tool the company is using to identify novel harms created by AI.

Do you value independent reporting on AI? If so, consider upgrading your subscription today. We'll email you all our scoops first, like our recent one about the dismantling of the Stanford Internet Observatory. Plus you'll be able to discuss each today's edition with us in our chatty Discord server, and we’ll send you a link to read subscriber-only columns in the RSS reader of your choice.

Upgrade

The company didn’t know it yet, but Anthropic had a spam problem.

Earlier this year, a group of accounts had begun asking the company’s chatbot, Claude, to generate text for search engine optimization — the art of getting a website to rank more highly in Google. There’s nothing necessarily wrong with a publisher trying to generate keywords to describe their site to Google. But these accounts, which worded their prompts carefully in an apparent effort to escape detection by Anthropic’s normal filters, appeared to be part of a coordinated effort.

The spammers might have gotten away with it — but then the network was spotted by Clio. An acronym for “Claude insights and observations,” Clio is an internal tool at Anthropic that uses machine learning to identify previously unknown threats, disrupt coordinated attempts to abuse the company’s systems, and generate insights about how Claude and the company’s artificial intelligence models are being used.

In the case of the spam network, Claude identified a cluster of accounts making queries that by themselves did not necessarily violate Anthropic’s guidelines. But when the company’s trust and safety team investigated, it determined that the queries were coming from a network of spammers. It terminated their access to Claude.

“Sometimes it’s not clear from looking at an individual conversation whether something is harmful,” said Miles McCain, a co-author of the paper and a member of Anthropic’s technical staff, said in an interview with Platformer. “It’s only once you piece it together in context that you realize, this is coordinated abuse generating SEO spam. Generating keywords for your blog? That’s fine. It’s only when you’re doing it across tons of accounts, abusively, for free that it becomes problematic.”

Discovering these previously hidden harms — what the company calls “unknown unknowns” — is a core part of Clio’s mission. Anthropic announced Clio in a paper published today, alongside a blog post about how Claude was used in 2024 elections.

The company hopes that other trust and safety teams consider using a Clio-like approach to serve as an early warning system for new harms that emerge as the use of AI chatbots become more pervasive.

“It really shows that you can monitor and understand, in a bottom-up way, what’s happening — while still preserving user privacy,” Alex Tamkin, the paper’s lead author and a research scientist, said in an interview. “It lets you see things before they might become a public-facing problem. … By using language models, you can catch all sorts of novel and weird-shaped use cases.”

How Clio works

In the Clio paper, Anthropic drew from 1 million conversations with Claude — both free and paid users. The system is instructed to omit private details and personal information.

Under development for about six months, Clio works by analyzing what a conversation is about, and then clustering similar conversations around similar themes and topics. (Topics that are rarely discussed with Claude are omitted from the analysis, which offers an additional safeguard against accidentally making an individual user identifiable.)

Clio creates a title and summary for those clusters, and reviews it again to make sure personal information is not included. It then creates multi-level hierarchies for related topics — an education cluster might contain sub-clusters for the way teachers use Claude and the way students do, for example.

Analysts can then search the clusters, or explore Claude usage visually. Clio offers a visual interface similar to Obsidian’s graph view, linking clusters based on the frequency of discussion and how they may be related to each other.

Popular queries often link to other popular queries — a sign that many people use Claude for the same things. Less popular queries often appear in the visualization as islands — and it’s these islands that can highlight unknown unknowns.

The spam network was one of these islands. It was a group of accounts devoted almost entirely to making SEO queries, in the same technically-allowed-but-definitely-suspicious way. Upon discovering the island, Anthropic referred it to its trust and safety team, which ultimately removed the network.

Like most trust and safety teams, Anthropic’s already had tools in place to identify spammers. It identified keywords often used by spammers, and created machine-learning classifiers to understand when its systems were likely being used to generate spam. This is what the company calls a “top-down” approach to safety.

Clio, on the other hand, is bottom-up. It isn’t looking for anything specific at all. Rather, it’s a way to help Anthropic understand all the ways that Claude is being used, and consider whether some of those uses could be harmful.

It works the other way, too: identifying clusters of conversations that Claude marked as harmful even when they are totally innocuous.

Clio revealed, for example, that Claude was repeatedly refusing to answer questions about the role-playing game Dungeons & Dragons. As people asked the chatbot for help planning their attacks, Claude often assumed that users were planning actual violence.

The Clio team referred the issue back to Anthropic’s trust and safety team, which refined its classifiers.

It’s not just D&D. Clio found that Claude sometimes rejected questions from job seekers who had uploaded their resumes, since the resumes included the sort of personal information that can violate Claude’s rules in other contexts. And some benign questions about programming were refused because Claude mistakenly associated them with hacking attempts.

During this year’s US presidential election, Clio helped to identify both benign uses case of Claude (like explaining the political process)

“You can use Clio to constantly monitor at a high level what types of things people are using this fundamentally new technology for,” Tamkin said. “You can refer anything that looks suspicious or worrisome to the trust and safety team and update those safeguards as the technology rolls out.”

More recently, Anthropic has used Clio to understand the potential harms from “computer use,” its first foray into an AI system that can execute actions on a computer. Clio is identifying capabilities that might not have been apparent in pre-launch testing, the company said, and is monitoring how testers are using them in practice.

When monitoring activity around the 2024 US election, Clio helped identify both benign uses (like explaining political processes) and policy violations (like trying to get Claude to generate fundraising materials for campaigns).

How people use Claude

My conversation with Anthropic’s societal impacts team focused on how Clio can be used to identify harms. But it can also be used to identify opportunities for Anthropic, by highlighting how people use its chatbot.

In what the company is calling a first for a major AI lab, the Clio paper also highlights the top three categories of uses for Claude:

Coding and software development (more than 10 percent of conversations)
Educational use, both for teachers and for students (more than 7 percent)
Business strategy and operations, such as drafting professional communications and analyzing business data (almost 6 percent)

The top three uses, then, account for only about 23 percent of usage. The long tail of Claude’s use cases appears to be extremely long.

“It turns out if you build a general purpose technology and release it, people find a lot of purposes for it,” Deep Ganguli, who leads the societal impacts team, told me with a laugh. “It’s crazy. You scroll around these clusters and you’re like, what? People are using Claude to do that?”

Among the other ways people use Claude, Clio found, include dream interpretation, questions about the Zodiac, analysis of soccer matches, disaster preparedness, and hints for crossword puzzles. Users also routinely subject Claude to the most difficult challenge known to large language models today: asking it to count the number of r’s in “strawberry.”

During a demo earlier this week, as the team clicked around Clio, I saw even more use cases: clusters about video game development, nutrition, writing creative fiction, and writing Javascript. Ganguli told me he began asking Claude questions about parenting after spotting it as a popular cluster within Clio.

What’s next

The Anthropic team told me they tried to share as much about how Clio works as possible in their paper, in the hopes that other AI labs would try something similar themselves. The paper goes so far as to include the cost of running Clio — $48.81 per 100,000 conversations.

“We wanna make it as easy as possible for someone to pitch it somewhere else and be like, look, guys — they did it and it worked,” McCain said.

In the meantime, Anthropic told me that Clio has become a meaningful part of its trust and safety efforts. But the company is already imagining other things it might do with the technology.

Ganguli highlighted three. One, the company could use Claude to understand the future of work. What kind of jobs is Claude helping people with, and what does that suggest about how the economy is transforming?

Two, Clio could change the safety evaluations that AI labs perform on their models. Instead of drawing from historical and theorized harms, companies can ground evaluations in the real-world usage that they’re seeing today.

Finally, Ganguli sees science applications. Claude is trained to follow a constitution; perhaps Clio could surface instances in which it fails to do so, or struggles with a trade-off.

Those uses of Clio seem benign. But they also highlight the deep sensitivity of the queries people make in chatbots like Claude. Anthropic is using the technology to identify harms, but it’s just as easy to imagine another company using similar technology to analyze consumer behavior for the purposes of advertising, persuasion, or other surveillance. It’s also easy to imagine another company taking fewer steps to preserve users’ privacy, using their queries in ways that could create risks to them.

Ganguli says that one goal of publishing the results from Clio is to draw attention to risks like these.

“I feel strongly that, as we’re developing these technologies, the public should know how they’re being used and what the risks are,” he said.

Sponsored

Keep Your Private Data Off The Dark Web
Every day, data brokers profit from your sensitive info—phone number, DOB, SSN—selling it to the highest bidder. And who’s buying it? Best case: companies target you with ads. Worst case: scammers and identity thieves breach those brokers, leaving your data vulnerable or on the dark web. It's time you check out Incogni. It scrubs your personal data from the web, confronting the world’s data brokers on your behalf. And unlike other services, Incogni helps remove your sensitive information from all broker types, including those tricky People Search Sites. Help protect yourself from identity theft, spam calls, and health insurers raising your rates. Plus, just for Platformer readers: Get 58% off Incogni using code PLATFORMER

HOW IT WORKS

On the podcast this week: Has TikTok's luck run out for good? Then Google's director of quantum hardware, Julian Kelly, joins us in a noble effort to explain quantum computing to the masses. And finally, Kevin investigates the cult of Claude.

Governing

President-elect Trump picked Andrew Fergugon, a commissioner with the Federal Trade Commission, to be its next leader. He has called for social platforms to be investigated for censoring conservatives. (Jody Godoy / Reuters)
Meta donated $1 million to Trump's inaugural fund as part of its ongoing effort to keep Mark Zuckerberg out of prison. (Dana Mattioli and Rebecca Ballhaus / Wall Street Journal)
Google reportedly asked the US government to end Microsoft’s exclusive deal to host OpenAI’s services on its servers. (Aaron Holmes / The Information)
- Google’s top lobbyist, Mark Isakowitz, left the company to become chief of staff to Republican Senator-elect Dave McCormick of Pennsylvania. (Emily Birnbaum / Bloomberg)
OpenAI CEO Sarah Friar played down the potential threats to the company from Elon Musk in his new role advising Donald Trump. (Krystal Hu / Reuters)
Nvidia will have to face a lawsuit alleging that it misled investors about its reliance on revenue from crypto mining after the Supreme Court dismissed its appeal. (Greg Stohr / Bloomberg)
A disturbing investigation into the domestic terror group 764, which has used Discord to find and antagonize people with mental illnesses. Just unbelievable cruelty on display here. (Shawn Boburg and Chris Dehghanpoor / Washington Post)
A Russian operation designed to undermine Europe’s support for Ukraine using AI-generated voiceovers likely used commercial tools from ElevenLabs, according to the threat intelligence company Recorded Future. (Charles Rollet / TechCrunch)
An investigation into deepfake sites by the American Sunlight Project found more than 35,000 mentions of nonconsensual intimate imagery of 25 female members of Congress. One male member of conference was also mentioned (Barbara Rodriguez and Jasmine Mithani / The Markup)
Google introduced two updates for Android designed to make it easier for users to detect when they are being tracked. (Sheena Vasani / The Verge)
A profile of Anthropic’s 11-person frontier red team, led by Logan Graham, which hunts for the most dangerous things that its latest large language models can do. (Sam Schechner / Wall Street Journal)
Character.ai said it would offer new chatbots to under-18 users that steer them away from romance and other sensitive subjects. The move comes after two devastating lawsuits were filed against the company, one involving the suicide of a teenage boy. (Adi Robertson / The Verge)
Australia has proposed subjecting any platform that makes $250 million or more in revenue to a levy to fund journalism. It highlights a central failure of the shakedown approach that Australia implemented in 2021, which is that the platforms could (and in the case of Meta did) just walk away from news entirely. (Nic Fildes / Financial Times)

Google's Gemini and mixed-reality news drops

Google unveiled a host of AI announcements on Wednesday in what appeared to be a kind of shock-and-awe campaign to demonstrate the company’s leadership in AI development. A new mixed-reality platform for Android followed on Thursday. The announcements include: