Which AI agent is the best? This new leaderboard can tell you

What’s better than an AI chatbot that can perform tasks for you when prompted? AI that can do tasks for you on its own.
AI agents are the newest frontier in the AI space. AI companies are racing to build their own models, and offerings are constantly rolling out to enterprises. But which AI agent is the best?
Also: A major Gemini feature is now free for all users – no Advanced subscription required
Galileo Leaderboard
On Wednesday, Galileo launched an Agent Leaderboard on Hugging Face, an open-source AI platform where users can build, train, access, and deploy AI models. The leaderboard is meant to help people learn how AI agents perform in real-world business applications and help teams determine which agent best fits their needs.
"📊 Our Agent Leaderboard is live! We built a comprehensive benchmark of which LLMs work best for AI Agents. After evaluating 17 leading LLMs across 14 diverse datasets, we're excited to share our findings about which models truly excel at tool-calling, and are ready to…" the company posted from its @rungalileo account on February 12, 2025.
On the leaderboard, you can find information about a model’s performance, including its rank and score. At a glance, you can also see more basic information about the model, including vendor, cost, and whether it’s open source or private.
The leaderboard currently features “the 17 leading LLMs,” including models from Google, OpenAI, Mistral, Anthropic, and Meta. It is updated monthly to keep pace with the steady stream of new model releases.
How models are ranked
To determine the results, Galileo uses benchmarking datasets, including BFCL (the Berkeley Function Calling Leaderboard), τ-bench (Tau benchmark), xLAM, and ToolACE, which test different agent capabilities. The leaderboard then turns this data into an evaluation framework that covers real-world use cases.
Also: 3 genius side hustles you can start with OpenAI’s Operator right now
“BFCL excels in academic domains like mathematics, entertainment, and education, τ-bench specializes in retail and airline scenarios, xLAM covers data generation across 21 domains, and ToolACE focuses on API interactions in 390 domains,” explains the company in a blog post.
Galileo adds that each model is stress-tested on everything from simple API calls to more advanced tasks such as multi-tool interactions. The company also shared its methodology, noting that every AI agent is evaluated against the same standardized process. The post includes a more technical dive into how the models are ranked.
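Galileo's actual scoring code isn't reproduced in the post, but a toy sketch can illustrate the kind of check these tool-calling benchmarks run: compare the function call a model proposes against the call the benchmark expects. Everything below (the record structure, field names, and partial-credit rule) is an illustrative assumption, not Galileo's implementation.

```python
# Illustrative only: a toy tool-call scorer, not Galileo's methodology.
# It checks whether a model's proposed function call matches the
# expected call from a benchmark record (function name + arguments).
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str         # e.g., "book_flight"
    arguments: dict   # e.g., {"origin": "SFO", "dest": "JFK"}

def score_tool_call(expected: ToolCall, actual: ToolCall) -> float:
    """1.0 for an exact match, partial credit for the right function
    with imperfect arguments, 0.0 for the wrong function."""
    if actual.name != expected.name:
        return 0.0
    if actual.arguments == expected.arguments:
        return 1.0
    matching = sum(1 for k, v in expected.arguments.items()
                   if actual.arguments.get(k) == v)
    return 0.5 * matching / max(len(expected.arguments), 1)

# Example: right tool, one wrong argument -> partial credit.
expected = ToolCall("book_flight", {"origin": "SFO", "dest": "JFK"})
actual = ToolCall("book_flight", {"origin": "SFO", "dest": "LAX"})
print(score_tool_call(expected, actual))  # 0.25
```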
The rankings
Google’s Gemini-2.0 Flash is in first place, followed closely by OpenAI’s GPT-4o. Both models earned what Galileo calls “Elite Tier Performance” status, given to models with a score of 0.9 or higher. Google and OpenAI dominated the leaderboard with their private models, taking the first six positions.
According to the post, Gemini-2.0 Flash delivered consistent performance across all of the evaluation categories while remaining cost-effective, at $0.15/$0.60 per million input/output tokens. Although GPT-4o was a close second, it comes at a much higher price point of $2.50/$10 per million tokens.
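To put those rates in perspective, here is a back-of-the-envelope comparison; the monthly token volumes are made up for illustration, while the per-million-token prices are the ones quoted above.

```python
# Hypothetical workload: 200M input tokens and 50M output tokens per month.
def monthly_cost(input_millions, output_millions, in_rate, out_rate):
    return input_millions * in_rate + output_millions * out_rate

gemini_flash = monthly_cost(200, 50, 0.15, 0.60)   # $60
gpt_4o       = monthly_cost(200, 50, 2.50, 10.00)  # $1,000
print(gemini_flash, gpt_4o)  # roughly a 17x difference on this workload
```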
In the “high-performance segment,” the category below the elite tier, Gemini-1.5-Flash came in third place, and Gemini-1.5-Pro in fourth. OpenAI’s reasoning models, o1 and o3-mini, followed in fifth and sixth place, respectively.
Mistral-small-2501 was the highest-ranked open-source model. Its score of 0.832 placed it in the “mid-tier capabilities” category, with the evaluations highlighting its long-context handling and tool selection as strengths.
How to access
To view the results, visit the Agent Leaderboard on Hugging Face. In addition to the standard rankings, you can filter the leaderboard by whether the LLM is open source or private, and by category, which refers to the capability being tested (overall, long context, composite, and so on).
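If you would rather pull the rankings programmatically than browse the page, a sketch like the one below should work, assuming the leaderboard data is also published as a Hugging Face dataset. The repo ID and column names here are assumptions, so check the leaderboard page for the actual identifiers.

```python
# Sketch: load the leaderboard with the Hugging Face `datasets` library.
# The repo ID and column names are assumptions; verify them on the
# Agent Leaderboard page before relying on this.
from datasets import load_dataset

ds = load_dataset("galileo-ai/agent-leaderboard", split="train")
df = ds.to_pandas()

# Sort by overall score and show the top entries (assumed column names).
top = df.sort_values("score", ascending=False)
print(top[["model", "vendor", "score"]].head(10))
```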