Platformer - The AI agents have arrived

Artificial intelligence can now compute on your behalf, and the web is never going to be the same

By Casey Newton • 22 Oct 2024

Anthropic's new computer use tool, shown in a screencap of a video demonstration. (Anthropic)

Here's this week's free edition of Platformer: a look at a significant new milestone in the development of AI, and what it means for the world it has just been unleashed in.

Do you value independent reporting on AI? If so, consider upgrading your subscription today. We'll email you all our scoops first, like our recent one about the dismantling of the Stanford Internet Observatory. Plus you'll be able to discuss each day's edition with us in our chatty Discord server, and we’ll send you a link to read subscriber-only columns in the RSS reader of your choice.

Earlier this year, in a much-discussed tagline from its annual developer event, Google promised that its AI-enhanced search engine would soon do the Googling for you.

Five months later, an even more expansive future is coming into view: one where your computer does the computing for you.

That’s the promise contained within Claude 3.5 Sonnet, the latest version of Anthropic’s flagship large language model. Starting today, developers have access to a feature called “computer use.” The company describes it this way:

Available today on the API, developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage, it is still experimental — at times cumbersome and error-prone. We're releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.

A brief accompanying video shows an Anthropic researcher using its agent to gather information from various places on his computer and using it to fill out a form. It’s a mundane example, but that’s the point: building an AI agent smart enough to automate the drudgery that fills so many workers’ days.

Anthropic is quick to note that this first version of the technology is slow and makes lots of mistakes. But it also heralds the arrival of the next major phase on the AI labs’ road to building superintelligence.

Anthropic is only one of dozens of companies now working to build AI agents. Microsoft today announced 10 new automations for its Dynamics 365 suite of business applications. Asana rolled out a take on agents today as well. Salesforce’s rival Agentforce technology is due to become generally available next week. And a host of startups are racing to build “AI co-workers” of various kinds.

What makes Anthropic’s agent stand out is that it takes the same technology that powers the AI chatbots we have been using for almost two years now and lets it out of the text box. Instead of being limited to offering you text- or voice-based responses, it can now complete small projects on your behalf.

Ethan Mollick, an associate professor at the Wharton School of the University of Pennsylvania, got a chance to try Anthropic’s agent early. He had it whip up a lesson plan for him while he did other things:

As one example, I asked the AI to put together a lesson plan on the Great Gatsby for high school students, breaking it into readable chunks and then creating assignments and connections tied to the Common Core learning standard. I also asked it to put this all into a single spreadsheet for me. With a chatbot, I would have needed to direct the AI through each step, using it as a co-intelligence to develop a plan together. This was different. Once given the instructions, the AI went through the steps itself: it downloaded the book, it looked up lesson plans on the web, it opened a spreadsheet application and filled out an initial lesson plan, then it looked up Common Core standards, added revisions to the spreadsheet, and so on for multiple steps. The results are not bad (I checked and did not see obvious errors, but there may be some — more on reliability later in the post). Most importantly, I was presented finished drafts to comment on, not a process to manage. I simply delegated a complex task and walked away from my computer, checking back later to see what it did (the system is quite slow).

Later, he used it to play the game Paperclip Clicker (“which, ironically, is about an AI that destroys humanity in its single-minded pursuit of making paperclips”). It fared poorly: one mistake led it to make many more, forcing Mollick to intervene. Overall, he writes, the agent could handle a variety of tasks with some success, though not enough that he would feel comfortable routinely delegating work to it.

This will surely lead to many comical TikToks of Claude trying and failing to demonstrate basic computer skills. But I was struck by the company’s blog post on developing the agent, which notes that even at this most experimental stage, Claude is nearly twice as good at navigating a computer as its next-closest competitor, and maybe not as far from human-level performance as you might guess:

At present, Claude is state-of-the-art for models that use computers in the same way as a person does — that is, from looking at the screen and taking actions in response. On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.

To be clear, a grade of 14.9 percent is an F by most measures. But on this test, most humans only score a C. It’s a welcome reminder of how much trouble most of us have navigating computer-based tasks at least some of the time — and an important milestone on the way to agents that can make those troubles go away.
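
For readers who want to check the math, here is a minimal sketch of the comparison, using only the three OSWorld figures quoted above. The variable and function names are my own, chosen for illustration; nothing here comes from Anthropic or the benchmark's tooling.

```python
# OSWorld figures as quoted in Anthropic's post (percent of tasks completed).
CLAUDE_SCORE = 14.9
NEXT_BEST_SCORE = 7.7
HUMAN_RANGE = (70.0, 75.0)  # "generally 70-75%"


def gap_to_humans(score, human_range):
    """Return the point gap between a score and each end of the human range."""
    low, high = human_range
    return low - score, high - score


# How many times higher Claude scores than the next-best model.
claude_vs_next = CLAUDE_SCORE / NEXT_BEST_SCORE

# How many percentage points separate Claude from typical human scores.
gap_low, gap_high = gap_to_humans(CLAUDE_SCORE, HUMAN_RANGE)

print(f"Claude vs. next-best model: {claude_vs_next:.2f}x")
print(f"Gap to human range: {gap_low:.1f} to {gap_high:.1f} points")
```

By these numbers the lead over the next-best model is a little under 2x, while the distance to typical human performance is still more than 55 points.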

And what happens then?

It’s easy to imagine using an AI agent to manage your appointments and scheduling, fill out online forms and routine paperwork, draft replies to your emails, and shop for you. Or it could browse the web on your behalf, preparing a personalized digest that means you never have to fight a paywall again.

It’s also easy to imagine an agent with those capabilities setting up spam operations, automating the production of AI slop websites, and overwhelming human-run businesses and institutions with a flood of AI-generated requests.

Either way, people who use AI agents will have to confront some very real privacy concerns. Earlier this year, Microsoft had to delay the launch of Recall, a marquee feature of its new AI-centric PCs. Recall is designed to let you search all past activity on the computer using AI-powered search of screenshots that it silently takes in the background. Security researchers pointed out, among other things, that users would be opted in by default and that their screenshots were not encrypted, creating an appealing target for hackers. (Users now have to opt in, and the screenshots are encrypted.)

Anthropic will need similar access to a user’s computer to operate it on their behalf. And I imagine businesses will have many questions about what the company does with customer data, and with employee data, before letting anyone use it.

There may also still be real limits on how much we can expect from agents in the near term. In conversation with me, one startup CEO ridiculed the idea, popular in AI circles, that “the next major programming language is English.” (In other words, the idea that you’ll soon be able to get software to do whatever you want it to do simply by saying so.) CEOs “program in English” all the time, he explained, by telling their human engineers what to build. And that process is famously error-prone and rife with inefficiency, too.

But to use another phrase popular among the AI crowd, the agent that Anthropic released today is as bad as this kind of software will ever be. From this moment on, AI will no longer be limited to what can be typed inside a box. Which means it’s time for the rest of us to start thinking outside that box, too.

Elon Musk and the 2024 election

This absolutely could have been the subject of today's column. But I couldn't imagine telling you anything I haven't already said on the subject over and over again. Musk's attempted vote-buying scheme represents an extraordinary departure for big-company CEOs, and may well be illegal. Had Mark Zuckerberg tried anything like this in 2020, Rep. Jim Jordan would have ordered airstrikes on Menlo Park.

(If you think I should have written this column instead today, I'd be curious to hear about it. Just reply to this email.)

- Prosecutors are facing mounting pressure to investigate the $1 million daily lottery that Elon Musk promised to voters who sign his PAC’s petition. (David Ingram, Ken Dilanian, Michael Kosnar, Fallon Gallagher and Lora Kolodny / NBC News)
  - Former Republican lawmakers and officials have reportedly sent a letter to attorney general Merrick Garland urging him to investigate Musk for the move. (Perry Stein / Washington Post)
  - Pennsylvania governor Josh Shapiro also said Musk's move was something “law enforcement can take a look at.” (Colby Smith / Financial Times)
  - The daily lottery could violate election bribery laws, experts say. (Marshall Cohen / CNN)
- The PAC funded by Musk is reportedly struggling to meet doorknocking goals and investigating claims that workers lied about the number of voters contacted. (Rachael Levy and Alexandra Ulmer / Reuters)
  - The PAC has spent more than $166,000 on advertising on X. So at least someone is advertising on X! (Vittoria Elliott / Wired)
- A look at how entangled Musk is with several federal agencies, and how a Trump presidency could give him more power over them. A story that perhaps explains what is going on here better than any other. (Eric Lipton, David A. Fahrenthold, Aaron Krolik and Kirsten Grind / New York Times)
- Musk shared a post on X falsely claiming that Michigan’s voter rolls had a large number of inactive voters, which could lead to widespread fraud. Michigan secretary of state Jocelyn Benson said the post was “dangerous disinformation.” It is also part of the Republican effort to delegitimize the election before it even takes place. (Sarah Ellison / Washington Post)
- A look at the different ways Musk has spread conspiracy theories and misinformation about the election online. (Julia Ingram and Madeleine May / CBS News)

Talk to us

Send us tips, comments, questions, and tasks for your AI agent: casey@platformer.news.
