Which AI agent is the best? This new leaderboard can tell you


What’s better than an AI chatbot that can perform tasks for you when prompted? AI that can do tasks for you on its own. 

AI agents are the newest frontier in the AI space. AI companies are racing to build their own models, and offerings are constantly rolling out to enterprises. But which AI agent is the best?

Galileo Leaderboard

On Wednesday, Galileo launched an Agent Leaderboard on Hugging Face, an open-source AI platform where users can build, train, access, and deploy AI models. The leaderboard is meant to help people learn how AI agents perform in real-world business applications and help teams determine which agent best fits their needs. 

On the leaderboard, you can find information about a model’s performance, including its rank and score. At a glance, you can also see more basic information about the model, including vendor, cost, and whether it’s open source or private.

The leaderboard currently features “the 17 leading LLMs,” including models from Google, OpenAI, Mistral, Anthropic, and Meta. It is updated monthly to keep pace with the steady stream of new model releases.

How models are ranked 

To determine the results, Galileo uses benchmarking datasets, including BFCL (the Berkeley Function Calling Leaderboard), τ-bench (Tau benchmark), xLAM, and ToolACE, which test different agent capabilities. The leaderboard then turns this data into an evaluation framework that covers real-world use cases. 

“BFCL excels in academic domains like mathematics, entertainment, and education, τ-bench specializes in retail and airline scenarios, xLAM covers data generation across 21 domains, and ToolACE focuses on API interactions in 390 domains,” the company explains in a blog post. 

Galileo adds that each model is stress-tested on everything from simple API calls to more advanced tasks such as multi-tool interactions. The company also shared its methodology, noting that every AI agent is evaluated with the same standardized process, and the post includes a more technical dive into how the models are ranked. 
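
Galileo's post doesn't spell out its exact scoring formula, but conceptually the process rolls per-benchmark results into a single score and buckets that score into tiers. The sketch below is illustrative only: the four dataset names come from the article, while the equal weighting and the cutoffs below 0.9 are assumptions rather than Galileo's published numbers.

```python
# Illustrative sketch only -- not Galileo's published methodology.
# The equal weights and the sub-0.9 tier cutoffs are assumptions;
# only the 0.9 "Elite Tier" threshold is stated in the article.

BENCHMARKS = ["BFCL", "tau-bench", "xLAM", "ToolACE"]

def overall_score(per_benchmark_scores: dict[str, float]) -> float:
    """Average the per-benchmark scores (assumed equal weighting)."""
    return sum(per_benchmark_scores[b] for b in BENCHMARKS) / len(BENCHMARKS)

def tier(score: float) -> str:
    """Bucket an overall score into the tiers named in the article."""
    if score >= 0.9:   # threshold stated by Galileo
        return "Elite Tier Performance"
    if score >= 0.85:  # assumed cutoff
        return "High-performance segment"
    return "Mid-tier capabilities"

# Example: a model scoring 0.92, 0.88, 0.91, and 0.89 on the four datasets
scores = {"BFCL": 0.92, "tau-bench": 0.88, "xLAM": 0.91, "ToolACE": 0.89}
print(tier(overall_score(scores)))  # -> "Elite Tier Performance"
```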

The rankings

Google’s Gemini 2.0 Flash is in first place, followed closely by OpenAI’s GPT-4o. Both models received what Galileo calls “Elite Tier Performance” status, which is given to models with a score of 0.9 or higher. Google and OpenAI dominated the leaderboard with their private models, taking the first six positions. 

According to the post, Gemini 2.0 Flash was consistent across all of the evaluation categories and balanced that performance with cost-effectiveness, at $0.15/$0.60 per million tokens. Although GPT-4o was a close second, it comes at a much higher price point of $2.50/$10 per million tokens.
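
To put those rates in perspective: assuming the paired figures are input and output prices per million tokens, a workload of one million input tokens and one million output tokens would cost roughly $0.75 on Gemini 2.0 Flash versus $12.50 on GPT-4o.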

In the “high-performance segment,” the category below the elite tier, Gemini 1.5 Flash came in third place and Gemini 1.5 Pro in fourth. OpenAI’s reasoning models, o1 and o3-mini, followed in fifth and sixth place, respectively. 

Mistral-small-2501 was the highest-ranked open-source model on the leaderboard. Its score of 0.832 placed it in the “mid-tier capabilities” category, and the evaluations highlighted its long-context handling and tool-selection capabilities as strengths.

How to access

To view the results, you can visit the Agent Leaderboard on Hugging Face. In addition to the standard view, you can filter the leaderboard by whether an LLM is open source or private, and by category, which refers to the capability being tested (overall, long context, composite, etc.).
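
If the underlying results are also published as a Hugging Face dataset, they could be pulled and filtered programmatically. The sketch below uses the real `datasets` library, but the repo ID and column names are assumptions for illustration; check the leaderboard page for the actual identifiers.

```python
# Sketch: loading leaderboard results with the Hugging Face `datasets` library.
# The repo ID "galileo-ai/agent-leaderboard" and the column names ("license",
# "score", "model", "vendor") are assumptions, not confirmed identifiers.
from datasets import load_dataset

ds = load_dataset("galileo-ai/agent-leaderboard", split="train")

# Keep only open-source models, then sort by overall score.
open_models = ds.filter(lambda row: row["license"] == "open-source")
top_open = sorted(open_models, key=lambda row: row["score"], reverse=True)

for row in top_open[:5]:
    print(row["model"], row["vendor"], row["score"])
```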




