AI isn't hitting a wall, it's just getting too smart for benchmarks, says Anthropic


Bloomberg’s Anurag Rana [left] talks with Anthropic’s Michael Gerstenhaber [center] and Scale AI’s Vijay Karunamurthy, during Bloomberg Intelligence’s conference on “Gen AI: Can it deliver on the productivity promise?”

Bloomberg, courtesy of Craig Warga

Large language models and other forms of generative artificial intelligence are improving steadily at "self-correction," opening up possibilities for new kinds of work, including "agentic AI," according to a vice president at Anthropic, a leading vendor of AI models.

“It’s getting very good at self-correction, self-reasoning,” said Michael Gerstenhaber, head of API technologies at Anthropic, which makes the Claude family of LLMs that compete with OpenAI’s GPT. 

“Every couple of months we’ve come out with a new model that has extended what LLMs can do,” said Gerstenhaber during an interview Wednesday in New York with Bloomberg Intelligence’s Anurag Rana. “The most interesting thing about this industry is that new use cases are unlocked with every model revision.” 

Also: Anthropic’s latest AI model can use a computer just like you – mistakes and all

The most recent models can plan tasks, such as figuring out how to carry out actions on a computer as a person would; for example, ordering pizza online.

“Planning interstitial steps is something that wasn’t possible yesterday that is possible today,” said Gerstenhaber of such step-by-step task completion.

The discussion, which also included Vijay Karunamurthy, chief technologist of AI startup Scale AI, was part of a daylong conference hosted by Bloomberg Intelligence to explore the topic, “Gen AI: Can it deliver on the productivity promise?”

Gerstenhaber’s remarks fly in the face of arguments from AI skeptics that Gen AI, and AI more broadly, is “hitting a wall,” meaning that the return from each new model generation is diminishing.

AI scholar Gary Marcus warned in 2022 that simply making AI models with more and more parameters would not yield improvements equal to the increase in size. Marcus has continued to reiterate that warning.

Anthropic, said Gerstenhaber, has been pushing at the limits of what current AI benchmarks can measure.

Also: Anthropic brings Tool Use for Claude out of beta, promising sophisticated assistants

“Even if it looks like it’s tapering off in some ways, that’s because we’re enabling entirely new classes [of functionality], but we’ve saturated the benchmarks, and the ability to do older tasks,” said Gerstenhaber. In other words, it gets harder to measure what current Gen AI models can do.

Both Gerstenhaber and Scale AI’s Karunamurthy made the case that “scaling” Gen AI — making AI models bigger — is helping to advance such self-correcting neural networks. 

“We are definitely seeing more and more scaling of the intelligence,” said Gerstenhaber. “One of the reasons we don’t necessarily think that we’re hitting a wall with planning and reasoning is that we’re just learning right now what are the ways in which planning and reasoning tasks need to be structured so that the models can adapt to a wide variety of new environments they haven’t tried to pass.”

“We’re very much in the early days,” said Gerstenhaber. “We’re learning from application developers what they’re trying to do, and what it [the language model] does poorly, and we can integrate that into the LM.” 

Also: The best AI chatbots: ChatGPT, Copilot, and worthy alternatives

Some of that discovery, said Gerstenhaber, has to do with the speed of fundamental research at Anthropic. However, some of it has to do with learning by hearing “what industry is telling us they need from us, and our ability to adapt to that — we are very much learning in real time.”

Customers tend to start with big models and then sometimes down-size to simpler AI models to fit a purpose, said Scale AI’s Karunamurthy. “It’s very clear that first they think about whether or not an AI is intelligent enough to do a task well at all, then whether it’s fast enough to meet their needs in the application, and then as cheap as possible.”
