Which AI agent is the best? This new leaderboard can tell you


What’s better than an AI chatbot that can perform tasks for you when prompted? AI that can do tasks for you on its own. 

AI agents are the newest frontier in the AI space. AI companies are racing to build their own models, and offerings are constantly rolling out to enterprises. But which AI agent is the best?

Galileo Leaderboard

On Wednesday, Galileo launched an Agent Leaderboard on Hugging Face, an open-source AI platform where users can build, train, access, and deploy AI models. The leaderboard is meant to help people learn how AI agents perform in real-world business applications and help teams determine which agent best fits their needs. 

On the leaderboard, you can find information about a model’s performance, including its rank and score. At a glance, you can also see more basic information about the model, including vendor, cost, and whether it’s open source or private.

The leaderboard currently features “the 17 leading LLMs,” including models from Google, OpenAI, Mistral, Anthropic, and Meta. It is updated monthly to keep pace with the steady stream of new model releases.

How models are ranked 

To determine the results, Galileo uses benchmarking datasets, including BFCL (the Berkeley Function Calling Leaderboard), τ-bench (Tau benchmark), xLAM, and ToolACE, which test different agent capabilities. The leaderboard then turns this data into an evaluation framework that covers real-world use cases. 

“BFCL excels in academic domains like mathematics, entertainment, and education, τ-bench specializes in retail and airline scenarios, xLAM covers data generation across 21 domains, and ToolACE focuses on API interactions in 390 domains,” the company explains in a blog post. 

Galileo adds that each model is stress-tested on everything from simple API calls to more advanced tasks such as multi-tool interactions. The company also shared its methodology, noting that every AI agent is evaluated with the same standardized process, and the post includes a more technical dive into how the models are ranked. 
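
Galileo's post doesn't spell out its exact scoring formula, but conceptually the process rolls per-benchmark results into a single score and buckets that score into tiers. The sketch below is illustrative only: the four dataset names come from the article, while the equal weighting and the cutoffs below 0.9 are assumptions rather than Galileo's published numbers.

```python
# Illustrative sketch only -- not Galileo's published methodology.
# The equal weights and the sub-0.9 tier cutoffs are assumptions;
# only the 0.9 "Elite Tier" threshold is stated in the article.

BENCHMARKS = ["BFCL", "tau-bench", "xLAM", "ToolACE"]

def overall_score(per_benchmark_scores: dict[str, float]) -> float:
    """Average the per-benchmark scores (assumed equal weighting)."""
    return sum(per_benchmark_scores[b] for b in BENCHMARKS) / len(BENCHMARKS)

def tier(score: float) -> str:
    """Bucket an overall score into the tiers named in the article."""
    if score >= 0.9:   # threshold stated by Galileo
        return "Elite Tier Performance"
    if score >= 0.85:  # assumed cutoff
        return "High-performance segment"
    return "Mid-tier capabilities"

# Example: a model scoring 0.92, 0.88, 0.91, and 0.89 on the four datasets
scores = {"BFCL": 0.92, "tau-bench": 0.88, "xLAM": 0.91, "ToolACE": 0.89}
print(tier(overall_score(scores)))  # -> "Elite Tier Performance"
```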

The rankings

Google’s Gemini 2.0 Flash is in first place, followed closely by OpenAI’s GPT-4o. Both models received what Galileo calls “Elite Tier Performance” status, which is given to models with a score of 0.9 or higher. Google and OpenAI dominated the leaderboard with their private models, taking the first six positions. 

According to the post, Gemini 2.0 Flash was consistent across all of the evaluation categories and balanced that performance with cost-effectiveness, at $0.15/$0.60 per million tokens. Although GPT-4o was a close second, it comes at a much higher price point of $2.50/$10 per million tokens.
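
To put those rates in perspective: assuming the paired figures are input and output prices per million tokens, a workload of one million input tokens and one million output tokens would cost roughly $0.75 on Gemini 2.0 Flash versus $12.50 on GPT-4o.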

In the “high-performance segment,” the category below the elite tier, Gemini 1.5 Flash came in third place and Gemini 1.5 Pro in fourth. OpenAI’s reasoning models, o1 and o3-mini, followed in fifth and sixth place, respectively. 

Mistral-small-2501 was the highest-ranked open-source model on the leaderboard. Its score of 0.832 placed it in the “mid-tier capabilities” category, and the evaluations highlighted its long-context handling and tool-selection capabilities as strengths.

How to access

To view the results, you can visit the Agent Leaderboard on Hugging Face. In addition to the standard view, you can filter the leaderboard by whether an LLM is open source or private, and by category, which refers to the capability being tested (overall, long context, composite, etc.).
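
If the underlying results are also published as a Hugging Face dataset, they could be pulled and filtered programmatically. The sketch below uses the real `datasets` library, but the repo ID and column names are assumptions for illustration; check the leaderboard page for the actual identifiers.

```python
# Sketch: loading leaderboard results with the Hugging Face `datasets` library.
# The repo ID "galileo-ai/agent-leaderboard" and the column names ("license",
# "score", "model", "vendor") are assumptions, not confirmed identifiers.
from datasets import load_dataset

ds = load_dataset("galileo-ai/agent-leaderboard", split="train")

# Keep only open-source models, then sort by overall score.
open_models = ds.filter(lambda row: row["license"] == "open-source")
top_open = sorted(open_models, key=lambda row: row["score"], reverse=True)

for row in top_open[:5]:
    print(row["model"], row["vendor"], row["score"])
```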




