Apple researchers reveal the secret sauce behind DeepSeek AI
The artificial intelligence market — and the entire stock market — was rocked on Monday by the sudden popularity of DeepSeek, the open-source large language model developed by a China-based hedge fund that has bested OpenAI’s best on some tasks while costing far less.
Also: I put DeepSeek AI’s coding skills to the test – here’s where it fell apart
As ZDNET’s Radhika Rajkumar detailed on Monday, R1’s success highlights a sea change in AI that could empower smaller labs and researchers to create competitive models and diversify the field of available options.
Why does DeepSeek work so well?
It turns out the answer lies in a broad approach within deep learning that squeezes more out of computer chips by exploiting a phenomenon known as "sparsity."
Sparsity comes in many forms. Sometimes, it involves eliminating parts of the data that AI uses when that data doesn’t materially affect the output of the AI model.
Also: Why China’s DeepSeek could burst our AI bubble
At other times, it can involve cutting away whole parts of a neural network if doing so doesn’t affect the end result.
DeepSeek is an example of the latter: parsimonious use of neural nets.
The main advance most have identified in DeepSeek is that it can turn on and off large sections of neural network “weights,” or “parameters.” The parameters are what shape how a neural network can transform input — the prompt you type — into generated text or images. Parameters have a direct impact on how long it takes to perform computations. More parameters, more computing effort, typically.
Sparsity and its role in AI
The ability to use only some of the total parameters of a large language model and shut off the rest is an example of sparsity. That sparsity can have a major impact on how big or small the computing budget is for an AI model.
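To make that concrete, here is a minimal sketch, in plain NumPy rather than any real model's code, of the mixture-of-experts pattern at play: a small router picks a couple of "expert" weight blocks for each token and leaves the rest of the parameters switched off. Every name and dimension below is illustrative, not taken from DeepSeek.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16           # model width (illustrative)
NUM_EXPERTS = 8  # total expert blocks; most sit idle for any one token
TOP_K = 2        # experts actually activated per token

# Each "expert" is just a weight matrix; together they hold most of the parameters.
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((D, NUM_EXPERTS)) / np.sqrt(D)

def sparse_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token vector through only TOP_K of the NUM_EXPERTS experts."""
    scores = x @ router                    # the router decides which experts matter
    top = np.argsort(scores)[-TOP_K:]      # indices of the chosen experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the chosen experts only
    # Only TOP_K / NUM_EXPERTS of the expert parameters are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D)
out = sparse_forward(token)
print(f"active experts per token: {TOP_K}/{NUM_EXPERTS} "
      f"({TOP_K / NUM_EXPERTS:.0%} of expert parameters used)")
```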
AI researchers at Apple, in a report out last week, explain nicely how DeepSeek and similar approaches use sparsity to get better results for a given amount of computing power.
Apple has no connection to DeepSeek, but the company conducts its own AI research regularly, so developments from outside labs such as DeepSeek naturally factor into its ongoing work in the field.
In the paper, titled “Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models,” posted on the arXiv pre-print server, lead author Samir Abnar of Apple and other Apple researchers, along with collaborator Harshay Shah of MIT, studied how performance varied as they exploited sparsity by turning off parts of the neural net.
Also: DeepSeek’s new open-source AI model can outperform o1 for a fraction of the cost
Abnar and team conducted their studies using a code library called MegaBlocks, released in 2023 by AI researchers at Microsoft, Google, and Stanford. However, they make clear that their work is applicable to DeepSeek and other recent innovations.
Abnar and team ask whether there's an "optimal" level of sparsity for DeepSeek and similar models: for a given amount of computing power, is there an optimal number of neural weights to turn on or off?
It turns out that sparsity can be quantified precisely as the percentage of all the neural weights that are shut down, with that percentage approaching, but never reaching, 100% of the network being "inactive."
And it turns out that for a neural network of a given size in total parameters, with a given amount of computing, you need to activate fewer and fewer parameters to achieve the same or better accuracy on a given AI benchmark test, such as math or question answering.
Put another way, whatever your computing power, you can increasingly turn off parts of the neural net and get the same or better results.
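In code, that definition is just a ratio: the fraction of weights that are inactive, a number that can climb toward, but never reach, 100%. A toy illustration with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal((512, 512))   # a made-up weight matrix, not a real model

# "Shut down" 90% of the weights by zeroing them out (illustrative only).
mask = rng.random(weights.shape) < 0.90
weights[mask] = 0.0

sparsity = np.count_nonzero(weights == 0) / weights.size
print(f"sparsity: {sparsity:.1%} of weights inactive")   # roughly 90%
```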
Optimizing AI with fewer parameters
As Abnar and team put it in technical terms, "Increasing sparsity while proportionally expanding the total number of parameters consistently leads to a lower pretraining loss, even when constrained by a fixed training compute budget." "Pretraining loss" is the AI term for a measure of how accurate a neural net is; lower pretraining loss means more accurate results.
That finding explains how DeepSeek could have less computing power but reach the same or better result simply by shutting off more and more parts of the network.
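Back-of-the-envelope arithmetic shows the trade. Under the rough assumption that compute per token scales with the number of active parameters, a fixed compute budget can buy a much larger total model as sparsity rises; the figures below are purely illustrative, not DeepSeek's:

```python
# Rough assumption: compute per token ~ active parameters = total * (1 - sparsity).
ACTIVE_BUDGET = 2_000_000_000   # fixed "compute" budget of 2B active parameters (illustrative)

for sparsity in (0.0, 0.50, 0.90, 0.97):
    total = ACTIVE_BUDGET / (1.0 - sparsity)    # total parameters affordable at this sparsity
    print(f"sparsity {sparsity:4.0%} -> total params {total / 1e9:6.1f}B, "
          f"active params {ACTIVE_BUDGET / 1e9:.1f}B")
```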
Also: The best AI for coding in 2025 (and what not to use)
Sparsity is a kind of magic dial that finds the best match between the AI model you've got and the compute you have available.
It’s the same economic rule of thumb that has been true for every new generation of personal computers: Either a better result for the same money or the same result for less money.
There are some other details to consider about DeepSeek. For example, another innovation of DeepSeek, as nicely explained by Ege Erdil of Epoch AI, is a mathematical trick called “multi-head latent attention.” Without getting too deeply into the weeds, multi-head latent attention is used to compress one of the largest consumers of memory and bandwidth, the memory cache that holds the most recently input text of a prompt.
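For the curious, here is a simplified sketch of that general idea, low-rank compression of the key/value cache, not DeepSeek's exact formulation; the dimensions and projection matrices are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

D = 64        # hidden size (illustrative)
LATENT = 8    # size of the compressed per-token latent kept in cache
SEQ = 128     # tokens of prompt seen so far

# Projection matrices (random here; learned in a real model).
W_down = rng.standard_normal((D, LATENT)) / np.sqrt(D)        # compress to latent
W_up_k = rng.standard_normal((LATENT, D)) / np.sqrt(LATENT)   # reconstruct keys
W_up_v = rng.standard_normal((LATENT, D)) / np.sqrt(LATENT)   # reconstruct values

hidden = rng.standard_normal((SEQ, D))   # per-token hidden states of the prompt

# Instead of caching full keys and values (SEQ x D each), cache only a small latent.
kv_latent_cache = hidden @ W_down        # SEQ x LATENT: the compressed cache

# At attention time, keys and values are re-expanded from the small cache.
keys = kv_latent_cache @ W_up_k          # SEQ x D
values = kv_latent_cache @ W_up_v        # SEQ x D

full_cache_floats = 2 * SEQ * D          # what a vanilla key and value cache would store
latent_cache_floats = SEQ * LATENT
print(f"cache size: {latent_cache_floats} floats vs {full_cache_floats} "
      f"({latent_cache_floats / full_cache_floats:.1%} of the vanilla cache)")
```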
The future of sparsity research
Details aside, the most profound point about all this is that sparsity as a phenomenon is not new in AI research, nor is it a new approach in engineering.
AI researchers have been showing for many years that eliminating parts of a neural net could achieve comparable or even better accuracy with less effort.
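One long-standing form of that idea is magnitude pruning: zero out a trained layer's smallest weights and keep only the largest. A minimal sketch with toy weights (no retraining step shown):

```python
import numpy as np

rng = np.random.default_rng(3)
layer = rng.standard_normal((256, 256))   # a toy "trained" weight matrix

PRUNE_FRACTION = 0.8                      # remove the smallest 80% of weights by magnitude
threshold = np.quantile(np.abs(layer), PRUNE_FRACTION)
pruned = np.where(np.abs(layer) >= threshold, layer, 0.0)

kept = np.count_nonzero(pruned) / pruned.size
print(f"kept {kept:.1%} of weights; the rest are pruned away")   # roughly 20%
```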
Nvidia competitor Intel has for years now identified sparsity as a key avenue of research to change the state of the art in the field. Approaches from startups based on sparsity have also notched high scores on industry benchmarks in recent years.
The magic dial of sparsity is profound because it not only improves the economics for a small budget, as in the case of DeepSeek, but also works in the other direction: Spend more, and you'll get even better benefits via sparsity. As you turn up your computing power, the accuracy of the AI model improves, Abnar and team found.
As they put it, “As sparsity increases, the validation loss decreases for all compute budgets, with larger budgets achieving lower losses at each sparsity level.”
In theory, then, you can make bigger and bigger models, on bigger and bigger computers, and get better bang for your buck.
All that sparsity work means that DeepSeek is only one example of a broad area of research that many labs are already following, and that many more will now jump on in order to replicate DeepSeek’s success.