Apple researchers reveal the secret sauce behind DeepSeek AI
The artificial intelligence market — and the entire stock market — was rocked on Monday by the sudden popularity of DeepSeek, the open-source large language model developed by a China-based hedge fund that has bested OpenAI’s best on some tasks while costing far less.
Also: I put DeepSeek AI’s coding skills to the test – here’s where it fell apart
As ZDNET’s Radhika Rajkumar detailed on Monday, R1’s success highlights a sea change in AI that could empower smaller labs and researchers to create competitive models and diversify the field of available options.
Why does DeepSeek work so well?
It turns out the answer lies in a broad approach within deep learning that squeezes more out of computer chips by exploiting a phenomenon known as "sparsity."
Sparsity comes in many forms. Sometimes, it involves eliminating parts of the data that AI uses when that data doesn’t materially affect the output of the AI model.
Also: Why China’s DeepSeek could burst our AI bubble
At other times, it can involve cutting away whole parts of a neural network if doing so doesn’t affect the end result.
DeepSeek is an example of the latter: parsimonious use of neural nets.
The main advance most have identified in DeepSeek is that it can turn on and off large sections of neural network “weights,” or “parameters.” The parameters are what shape how a neural network can transform input — the prompt you type — into generated text or images. Parameters have a direct impact on how long it takes to perform computations. More parameters, more computing effort, typically.
Sparsity and its role in AI
The ability to use only some of the total parameters of a large language model and shut off the rest is an example of sparsity. That sparsity can have a major impact on how big or small the computing budget is for an AI model.
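To make that concrete, here is a minimal sketch, in plain NumPy rather than any real model's code, of the mixture-of-experts pattern at play: a small router picks a couple of "expert" weight blocks for each token and leaves the rest of the parameters switched off. Every name and dimension below is illustrative, not taken from DeepSeek.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16           # model width (illustrative)
NUM_EXPERTS = 8  # total expert blocks; most sit idle for any one token
TOP_K = 2        # experts actually activated per token

# Each "expert" is just a weight matrix; together they hold most of the parameters.
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((D, NUM_EXPERTS)) / np.sqrt(D)

def sparse_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token vector through only TOP_K of the NUM_EXPERTS experts."""
    scores = x @ router                    # the router decides which experts matter
    top = np.argsort(scores)[-TOP_K:]      # indices of the chosen experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the chosen experts only
    # Only TOP_K / NUM_EXPERTS of the expert parameters are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D)
out = sparse_forward(token)
print(f"active experts per token: {TOP_K}/{NUM_EXPERTS} "
      f"({TOP_K / NUM_EXPERTS:.0%} of expert parameters used)")
```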
AI researchers at Apple, in a report out last week, explain nicely how DeepSeek and similar approaches use sparsity to get better results for a given amount of computing power.
Apple has no connection to DeepSeek, but the company conducts its own AI research regularly, so developments from outside labs such as DeepSeek naturally factor into its ongoing work in the field.
In the paper, titled “Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models,” posted on the arXiv pre-print server, lead author Samir Abnar of Apple and other Apple researchers, along with collaborator Harshay Shah of MIT, studied how performance varied as they exploited sparsity by turning off parts of the neural net.
Also: DeepSeek’s new open-source AI model can outperform o1 for a fraction of the cost
Abnar and team conducted their studies using a code library called MegaBlocks, released in 2023 by AI researchers at Microsoft, Google, and Stanford. However, they make clear that their work is applicable to DeepSeek and other recent innovations.
Abnar and team ask whether there's an "optimal" level of sparsity for DeepSeek and similar models: for a given amount of computing power, is there an optimal number of neural weights to turn on or off?
It turns out that sparsity can be quantified precisely as the percentage of all the neural weights that are shut down, with that percentage approaching, but never reaching, 100% of the network being "inactive."
And it turns out that for a neural network of a given size in total parameters, with a given amount of computing, you need to activate fewer and fewer parameters to achieve the same or better accuracy on a given AI benchmark test, such as math or question answering.
Put another way, whatever your computing power, you can increasingly turn off parts of the neural net and get the same or better results.
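In code, that definition is just a ratio: the fraction of weights that are inactive, a number that can climb toward, but never reach, 100%. A toy illustration with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal((512, 512))   # a made-up weight matrix, not a real model

# "Shut down" 90% of the weights by zeroing them out (illustrative only).
mask = rng.random(weights.shape) < 0.90
weights[mask] = 0.0

sparsity = np.count_nonzero(weights == 0) / weights.size
print(f"sparsity: {sparsity:.1%} of weights inactive")   # roughly 90%
```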
Optimizing AI with fewer parameters
As Abnar and team put it in technical terms, "Increasing sparsity while proportionally expanding the total number of parameters consistently leads to a lower pretraining loss, even when constrained by a fixed training compute budget." "Pretraining loss" is the AI term for a measure of how accurate a neural net is; lower pretraining loss means more accurate results.
That finding explains how DeepSeek could have less computing power but reach the same or better result simply by shutting off more and more parts of the network.
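Back-of-the-envelope arithmetic shows the trade. Under the rough assumption that compute per token scales with the number of active parameters, a fixed compute budget can buy a much larger total model as sparsity rises; the figures below are purely illustrative, not DeepSeek's:

```python
# Rough assumption: compute per token ~ active parameters = total * (1 - sparsity).
ACTIVE_BUDGET = 2_000_000_000   # fixed "compute" budget of 2B active parameters (illustrative)

for sparsity in (0.0, 0.50, 0.90, 0.97):
    total = ACTIVE_BUDGET / (1.0 - sparsity)    # total parameters affordable at this sparsity
    print(f"sparsity {sparsity:4.0%} -> total params {total / 1e9:6.1f}B, "
          f"active params {ACTIVE_BUDGET / 1e9:.1f}B")
```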
Also: The best AI for coding in 2025 (and what not to use)
Sparsity is a kind of magic dial that finds the best match between the AI model you've got and the compute you have available.
It’s the same economic rule of thumb that has been true for every new generation of personal computers: Either a better result for the same money or the same result for less money.
There are some other details to consider about DeepSeek. For example, another innovation of DeepSeek, as nicely explained by Ege Erdil of Epoch AI, is a mathematical trick called “multi-head latent attention.” Without getting too deeply into the weeds, multi-head latent attention is used to compress one of the largest consumers of memory and bandwidth, the memory cache that holds the most recently input text of a prompt.
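For the curious, here is a simplified sketch of that general idea, low-rank compression of the key/value cache, not DeepSeek's exact formulation; the dimensions and projection matrices are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

D = 64        # hidden size (illustrative)
LATENT = 8    # size of the compressed per-token latent kept in cache
SEQ = 128     # tokens of prompt seen so far

# Projection matrices (random here; learned in a real model).
W_down = rng.standard_normal((D, LATENT)) / np.sqrt(D)        # compress to latent
W_up_k = rng.standard_normal((LATENT, D)) / np.sqrt(LATENT)   # reconstruct keys
W_up_v = rng.standard_normal((LATENT, D)) / np.sqrt(LATENT)   # reconstruct values

hidden = rng.standard_normal((SEQ, D))   # per-token hidden states of the prompt

# Instead of caching full keys and values (SEQ x D each), cache only a small latent.
kv_latent_cache = hidden @ W_down        # SEQ x LATENT: the compressed cache

# At attention time, keys and values are re-expanded from the small cache.
keys = kv_latent_cache @ W_up_k          # SEQ x D
values = kv_latent_cache @ W_up_v        # SEQ x D

full_cache_floats = 2 * SEQ * D          # what a vanilla key and value cache would store
latent_cache_floats = SEQ * LATENT
print(f"cache size: {latent_cache_floats} floats vs {full_cache_floats} "
      f"({latent_cache_floats / full_cache_floats:.1%} of the vanilla cache)")
```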
The future of sparsity research
Details aside, the most profound point about all this is that sparsity as a phenomenon is not new in AI research, nor is it a new approach in engineering.
AI researchers have been showing for many years that eliminating parts of a neural net could achieve comparable or even better accuracy with less effort.
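One long-standing form of that idea is magnitude pruning: zero out a trained layer's smallest weights and keep only the largest. A minimal sketch with toy weights (no retraining step shown):

```python
import numpy as np

rng = np.random.default_rng(3)
layer = rng.standard_normal((256, 256))   # a toy "trained" weight matrix

PRUNE_FRACTION = 0.8                      # remove the smallest 80% of weights by magnitude
threshold = np.quantile(np.abs(layer), PRUNE_FRACTION)
pruned = np.where(np.abs(layer) >= threshold, layer, 0.0)

kept = np.count_nonzero(pruned) / pruned.size
print(f"kept {kept:.1%} of weights; the rest are pruned away")   # roughly 20%
```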
Nvidia competitor Intel has for years now identified sparsity as a key avenue of research to change the state of the art in the field. Approaches from startups based on sparsity have also notched high scores on industry benchmarks in recent years.
The magic dial of sparsity is profound because it not only improves the economics for a small budget, as in the case of DeepSeek, but also works in the other direction: Spend more, and you'll get even better benefits via sparsity. As you turn up your computing power, the accuracy of the AI model improves, Abnar and team found.
As they put it, “As sparsity increases, the validation loss decreases for all compute budgets, with larger budgets achieving lower losses at each sparsity level.”
In theory, then, you can make bigger and bigger models, on bigger and bigger computers, and get better bang for your buck.
All that sparsity work means that DeepSeek is only one example of a broad area of research that many labs are already following, and that many more will now jump on in order to replicate DeepSeek’s success.