New MLCommons benchmarks to test AI infrastructure performance

The latest release also broadens its scope beyond chatbot benchmarks. A new graph neural network (GNN) test targets datacenter-class hardware and is designed for workloads like fraud detection, recommendation engines, and knowledge graphs. It uses the RGAT model based on a graph dataset containing over 547 million nodes and 5.8 billion edges.
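To make the GNN workload concrete, here is a minimal sketch of the message-passing step at the heart of graph models like RGAT: each node aggregates feature vectors from its neighbors. This is illustrative only, not MLCommons benchmark code; the toy graph, feature values, and function name are invented for the example.

```python
# Illustrative sketch of one message-passing step, the core operation in
# graph neural networks such as RGAT. Not MLCommons benchmark code; the
# toy graph and feature values below are invented for illustration.

def aggregate_neighbors(features, edges):
    """Return updated node features: the mean of each node's neighbors' features.

    features: dict mapping node id -> feature vector (list of floats)
    edges: list of (src, dst) pairs; messages flow src -> dst
    """
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(features[src])

    updated = {}
    for node, msgs in incoming.items():
        if not msgs:                      # node with no incoming edges keeps its features
            updated[node] = features[node][:]
            continue
        dim = len(msgs[0])
        updated[node] = [sum(m[i] for m in msgs) / len(msgs) for i in range(dim)]
    return updated

# Toy fraud-detection-style graph: accounts 0-2 all transact with account 3,
# so account 3's new representation blends its neighbors' features.
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.0, 0.0]}
edges = [(0, 3), (1, 3), (2, 3)]
print(aggregate_neighbors(feats, edges)[3])
```

The benchmark's production-scale version of this idea runs attention-weighted aggregation over billions of edges, which is what stresses datacenter-class memory bandwidth and interconnects.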
Judging performance
Analysts suggest that these benchmarks will make it easier to compare the performance of hardware chips and clusters against documented, standardized models.
“As every chipmaker seeks to prove that its hardware is good enough to support AI, we now have a standard benchmark that shows the quality of question support, math, and coding skills associated with hardware,” said Hyoun Park, CEO and Chief Analyst at Amalgam Insights.
Chipmakers can now compete not just on traditional speeds and feeds, but on mathematical skill and informational accuracy. The benchmark offers a rare opportunity to establish new performance standards across vendors' hardware, Park added.
“The latency in terms of how quickly tokens are delivered and the time for the user to see the response is the deciding factor,” said Neil Shah, partner and co-founder at Counterpoint Research. “This is where players such as NVIDIA, AMD, and Intel have to get the software right to help developers optimize the models and bring out the best compute performance.”
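The latency Shah describes is commonly quantified with two metrics: time-to-first-token (TTFT, how long the user waits before output begins) and inter-token latency (how quickly subsequent tokens stream). A hedged sketch of how such metrics might be computed from recorded token arrival timestamps follows; the function name and sample timestamps are illustrative, not from any MLCommons specification.

```python
# Sketch of computing latency metrics an LLM-serving benchmark might report:
# time-to-first-token (TTFT) and mean inter-token latency. The function name
# and sample timestamps are illustrative assumptions, not MLCommons code.

def latency_metrics(request_time, token_times):
    """Compute (TTFT, mean inter-token latency) from arrival timestamps.

    request_time: when the request was sent, in seconds
    token_times: sorted timestamps at which each output token arrived
    """
    ttft = token_times[0] - request_time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, inter_token

# Example: request sent at t=0.0, first token at 0.25 s, then one every 20 ms.
times = [0.25 + 0.02 * i for i in range(5)]
ttft, itl = latency_metrics(0.0, times)
print(f"TTFT={ttft:.3f}s, inter-token={itl * 1000:.1f}ms")
```

Software-stack optimization of the kind Shah mentions shows up directly in these two numbers, which is why serving benchmarks typically report both rather than raw throughput alone.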
Benchmarking and buying decisions
Independent benchmarks like those from MLCommons play a key role in helping buyers evaluate system performance, but relying on them alone may not provide the full picture.