What is sparsity? DeepSeek AI's secret, revealed by Apple researchers

The artificial intelligence (AI) market — and the entire stock market — was rocked last month by the sudden popularity of DeepSeek, the open-source large language model (LLM) developed by a China-based hedge fund that has bested OpenAI’s best on some tasks while costing far less.
Also: Cerebras CEO on DeepSeek: Every time computing gets cheaper, the market gets bigger
As ZDNET’s Radhika Rajkumar details, R1’s success highlights a sea change in AI that could empower smaller labs and researchers to create competitive models and diversify available options.
Why does DeepSeek work so well?
Its success is due to a broad approach within deep learning that squeezes more performance out of computer chips by exploiting a phenomenon known as “sparsity”.
Sparsity comes in many forms. Sometimes, it involves eliminating parts of the data that AI uses when that data doesn’t materially affect the model’s output.
Also: I put DeepSeek AI’s coding skills to the test – here’s where it fell apart
At other times, sparsity involves cutting away whole parts of a neural network if doing so doesn’t affect the result.
DeepSeek is an example of the latter: parsimonious use of neural nets.
The main advance most people have identified in DeepSeek is that it can turn large sections of neural network “weights” or “parameters” on and off. Parameters shape how a neural network can transform input — the prompt you type — into generated text or images. Parameters have a direct impact on how long it takes to perform computations. More parameters typically mean more computing effort.
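The on/off behavior described above can be sketched as a toy top-k router in the style of mixture-of-experts models. This is an illustrative sketch of the general idea, not DeepSeek’s actual code; the scores and expert counts are invented for the example.

```python
# Illustrative sketch of mixture-of-experts-style routing: only the
# top-k scoring "experts" (groups of parameters) are activated for a
# given input; the rest stay switched off and cost no compute.

def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Suppose a layer has 8 experts but activates only 2 per token.
scores = [0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.15, 0.4]  # hypothetical router scores
active = route_top_k(scores, k=2)
print(active)                      # [3, 1] — the only experts doing work
print(len(active) / len(scores))   # 0.25 — fraction of parameters in use
```

Because only the selected experts run, per-token compute scales with the active parameters rather than the model’s total size.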
Sparsity and its role in AI
The ability to use only some of the total parameters of an LLM and shut off the rest is an example of sparsity. That sparsity can have a major impact on how big or small the computing budget is for an AI model.
Apple AI researchers, in a report published Jan. 21, explained how DeepSeek and similar approaches use sparsity to get better results for a given amount of computing power.
Apple has no connection to DeepSeek, but the tech giant does its own AI research, so developments from outside companies such as DeepSeek are directly relevant to Apple’s ongoing work in the field.
Also: DeepSeek’s AI model proves easy to jailbreak – and worse
In the paper, titled “Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models”, posted on the arXiv pre-print server, lead author Samir Abnar and other Apple researchers, along with collaborator Harshay Shah of MIT, studied how performance varied as they exploited sparsity by turning off parts of the neural net.
Abnar and team conducted their studies using a code library released in 2023 by AI researchers at Microsoft, Google, and Stanford, called MegaBlocks. However, they make clear that their work can be applied to DeepSeek and other recent innovations.
Abnar and the team ask whether there’s an “optimal” level for sparsity in DeepSeek and similar models: for a given amount of computing power, is there an optimal number of those neural weights to turn on or off?
The research suggests sparsity can be quantified precisely as the percentage of all neural weights that are shut down: that percentage can approach, but never reach, 100%, at which point the entire neural net would be “inactive”.
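Measured this way, sparsity is simply the fraction of a model’s weights that are zeroed out. A minimal sketch, with made-up weight values:

```python
# Illustrative: sparsity as the fraction of weights that are inactive (zero).

def sparsity(weights):
    """Fraction of weights that are switched off."""
    inactive = sum(1 for w in weights if w == 0.0)
    return inactive / len(weights)

w = [0.0, 0.5, 0.0, -1.2, 0.0, 0.3, 0.0, 0.0]  # hypothetical weight values
print(sparsity(w))  # 0.625 — 62.5% of these weights are shut off
```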
Graphs show that for a given neural net, on a given computing budget, there’s an optimal amount of the neural net that can be turned off to reach a level of accuracy. The same economic rule of thumb has been true for every new generation of personal computers: either a better result for the same money or the same result for less money.
(Image: Apple)
For a neural network of a given size in total parameters, with a given amount of computing, you need fewer and fewer parameters to achieve the same or better accuracy on a given AI benchmark test, such as math or question answering.
Put another way, whatever your computing power, you can increasingly turn off parts of the neural net and get the same or better results.
Optimizing AI with fewer parameters
As Abnar and team stated in technical terms: “Increasing sparsity while proportionally expanding the total number of parameters consistently leads to a lower pretraining loss, even when constrained by a fixed training compute budget.” “Pretraining loss” is the AI term for how accurate a neural net is; lower pretraining loss means more accurate results.
That finding explains how DeepSeek could have less computing power but reach the same or better results simply by shutting off more network parts.
Also: The best AI for coding in 2025 (and what not to use)
Sparsity is like a magic dial that finds the best match for your AI model and available compute.
Also: Security firm discovers DeepSeek has ‘direct links’ to Chinese government servers
There are some other details to consider about DeepSeek. For example, another DeepSeek innovation, as explained by Ege Erdil of Epoch AI, is a mathematical trick called “multi-head latent attention”. Without getting too deeply into the weeds, multi-head latent attention is used to compress one of the largest consumers of memory and bandwidth, the memory cache that holds the most recently input text of a prompt.
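The memory-saving idea behind that kind of cache compression can be sketched as a low-rank projection: store a small latent vector per token instead of the full key/value vectors, and reconstruct them on demand. This is an illustrative analogy, not DeepSeek’s actual multi-head latent attention; the dimensions and random matrices here are arbitrary stand-ins rather than learned values.

```python
import numpy as np

# Illustrative low-rank cache compression: cache a small latent vector per
# token rather than the full-width vectors the attention layer consumes.
d_model, d_latent, seq_len = 512, 64, 1000
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent))  # compress (stand-in for a learned matrix)
W_up = rng.standard_normal((d_latent, d_model))    # reconstruct (stand-in)

tokens = rng.standard_normal((seq_len, d_model))   # hypothetical token states
latent_cache = tokens @ W_down        # what the cache actually stores
keys_approx = latent_cache @ W_up     # rebuilt when attention needs them

print(latent_cache.size / tokens.size)  # 0.125 — cache is 64/512 of full size
```

The bandwidth and memory savings come from `latent_cache` being an eighth the width of the full per-token vectors, at the cost of the extra up-projection each time the cache is read.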
The future of sparsity research
Details aside, the most profound point about all this effort is that sparsity as a phenomenon is not new in AI research, nor is it a new approach in engineering.
AI researchers have shown for many years that eliminating parts of a neural net could achieve comparable or even better accuracy with less effort.
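One long-standing form of this is magnitude pruning: zero out the smallest-magnitude weights and keep the rest. A minimal sketch with hypothetical weights (real pruning operates on trained model tensors and is usually followed by fine-tuning):

```python
# Illustrative magnitude pruning: switch off the smallest-magnitude weights.

def prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitudes.

    Assumes 0 < fraction < 1.
    """
    n_keep = int(len(weights) * (1 - fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[n_keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

w = [0.9, -0.01, 0.4, 0.02, -0.7, 0.05]  # hypothetical weights
print(prune(w, 0.5))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```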
Also: xAI’s Grok 3 is better than expected. How to try it for free (before you subscribe)
For many years, Nvidia competitor Intel has identified sparsity as a key avenue of research for advancing the state of the art in the field. Approaches from startups based on sparsity have also notched high scores on industry benchmarks in recent years.
The magic dial of sparsity doesn’t only shave computing costs, as in the case of DeepSeek. Sparsity also works in the other direction: it can make AI computers increasingly efficient. (Image: Apple)
The magic dial of sparsity is profound because it not only improves economics for a small budget, as in the case of DeepSeek, but it also works in the other direction: spend more, and you’ll get even better benefits via sparsity. As you turn up your computing power, the accuracy of the AI model improves, Abnar and the team found.
Also: Are we losing our critical thinking skills to AI? New Microsoft study raises red flags
They suggested: “As sparsity increases, the validation loss decreases for all compute budgets, with larger budgets achieving lower losses at each sparsity level.”
In theory, then, you can make bigger and bigger models, on bigger and bigger computers, and get better bang for your buck.
All that sparsity work means that DeepSeek is only one example of a broad area of research that many labs are already following — and many more will now jump on to replicate DeepSeek’s success.