3 ways Meta's Llama 3.1 is an advance for Gen AI


Meta’s Llama 3.1 405B is the largest open-source large language model to date, capable of multimodal tasks as well as “tool use.” Its construction and training represent a tour de force of engineering choices.


Meta on Tuesday unveiled the latest incarnation of Llama, its family of large language models (LLMs). The company says Llama 3.1 is the first open-source “frontier model,” a designation generally reserved for the largest and most capable AI models.

Llama 3.1 comes in multiple sizes. The largest, “405B,” is noteworthy for the scale of computing it involves: with 405 billion neural “weights,” or parameters, it is larger than prominent open-source models such as Nvidia’s Nemotron 4, Google’s Gemma 2, and Mistral’s Mixtral. The model is also significant for three choices that the Meta team made.

Taken together, the three decisions are a tour de force of neural network engineering and are at the heart of how the company built and trained Llama 3.1 405B. They complement advances Meta showed with Llama 2 that suggested ways to slim down deep learning’s total compute budget. 

Also: Meta’s ‘pruning’ of Llama 2 model shows path to slimmer AI

(An “AI model” is the part of an AI program that contains the neural-net parameters and activation functions, the key elements that determine how the program behaves.)

First, Llama 3.1 405B dispenses with what’s called a “mixture of experts,” the approach that Google uses for its newest closed-source model, Gemini 1.5, and that Mistral uses for its Mixtral models.

A mixture of experts divides the neural weights into alternate groups, or “experts,” some of which can be switched off so that only a subset of the weights is used to make predictions. Meta’s researchers instead “opted for a standard decoder-only transformer model architecture,” the near-ubiquitous building block Google first developed in 2017 as the Transformer. The researchers claim this choice makes the model more stable during training.
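
To make the contrast concrete, here is a minimal sketch, in PyTorch, of a standard decoder-only transformer block. The dimensions and layer choices are illustrative, not Meta’s actual configuration; the point is that every token flows through the same dense weights, with no router picking among experts.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One standard decoder-only transformer block. Every token
    passes through the same dense feed-forward weights; there is
    no routing among alternate "expert" sub-networks."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                  # residual connection around attention
        x = x + self.ffn(self.norm2(x))   # residual around the feed-forward
        return x
```

Llama’s real blocks swap in variants such as RMSNorm and grouped-query attention, but the control flow is the same.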

Also: Anthropic launches Claude 3.5 Sonnet and debuts Artifacts for collaboration

Second, to improve the results of the plain-vanilla transformer-based model, Meta’s researchers describe an ingenious approach to training the model in stages. It’s well known that both the amount of training data and the amount of compute used can be balanced in an optimal way to produce better predictions.

As described in the formal paper for Llama 3.1, the researchers examined existing “scaling laws,” which predict how accurate a model’s output will be given the model’s size and the amount of training data. That approach, however, says little about how good a model will be at a “downstream” task, such as a standardized test of reasoning.

Instead, Meta came up with its own scaling law. The company progressively increased both the amount of training data and the amount of compute, checking over multiple iterations to see how well the resulting trained model does on the downstream tasks.

Meta tested different combinations of the compute intensity and the amount of data to find sweet spots where the mixture reached optimal performance on “downstream” benchmark tasks. (Chart: Meta Properties)

“We use the resulting compute-optimal models to forecast the performance of the flagship Llama 3 model on benchmark datasets,” the Meta team wrote.
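
The mechanics can be sketched with made-up numbers (the values below are illustrative, not Meta’s data): fit a power law relating training compute to prediction loss, fit a second curve relating that loss to benchmark accuracy, and chain the two to forecast a much larger run.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical results from small training runs (illustrative
# values, not Meta's actual measurements).
flops = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
loss = np.array([2.20, 2.05, 1.90, 1.78, 1.68])      # next-token loss
accuracy = np.array([0.35, 0.42, 0.51, 0.58, 0.65])  # benchmark score

# Stage 1: a power law relating training compute to loss.
def power_law(c, a, b):
    return a * c ** (-b)

(a, b), _ = curve_fit(power_law, flops, loss, p0=(10.0, 0.05))

# Stage 2: a sigmoid relating loss to downstream accuracy.
def sigmoid(l, lo, hi, mid, scale):
    return lo + (hi - lo) / (1 + np.exp((l - mid) / scale))

params, _ = curve_fit(sigmoid, loss, accuracy, p0=(0.2, 0.9, 1.9, 0.1))

# Chain the two fits to forecast a far larger training run.
big_run = 4e25
forecast = sigmoid(power_law(big_run, a, b), *params)
print(f"forecast benchmark accuracy: {forecast:.2f}")
```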

This approach echoes Meta’s recent research, in which researchers train a model for the end outcome rather than just for a raw score on predicting the next word.

Meta put Llama 3.1 405B through an extensive post-training process to fine-tune it with human feedback and held-out examples of correct answers. (Chart: Meta Properties)

The important part is that the iterative process of validating each successive data and compute combination is what leads to the selection of the 405 billion parameters as the sweet spot. “Based on this observation, we ultimately decided to train a flagship model with 405B parameters,” the researchers wrote.

The 405-billion-parameter model’s final training ran on 16,000 Nvidia H100 GPU chips housed in Meta’s Grand Teton AI servers. Meta used a complex clustering system to spread both batches of data and the neural weights themselves across the many servers in parallel.
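
The article doesn’t detail Meta’s parallelism stack, but the core trick of splitting neural weights across machines can be illustrated in a few lines of NumPy: shard a weight matrix column-wise across two hypothetical devices, compute each slice independently, and gather the results.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # a small batch of activations
W = rng.standard_normal((8, 6))   # a weight matrix to shard

# Column-wise sharding: each (hypothetical) device holds half the
# columns of W and computes its own slice of the output.
W_dev0, W_dev1 = W[:, :3], W[:, 3:]
out_dev0 = x @ W_dev0
out_dev1 = x @ W_dev1

# Gathering the slices reproduces the unsharded result exactly.
assert np.allclose(np.concatenate([out_dev0, out_dev1], axis=1), x @ W)
```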

Also: Meta’s GenAI moves from simple predictions to a chess game of consequences

The third big innovation is an equally ingenious combination of steps carried out after each round of training the model, known as “post-training.” In post-training, a pre-trained Llama 3.1 is first exposed to human raters’ expressed preferences, similar to what OpenAI and others do to shape the kinds of output the model produces.

Then, Meta uses the human preference data in what’s called “supervised fine-tuning,” re-training the model on the outputs humans judged desirable so that it favors those over the undesirable ones.

Meta then adds to the fine-tuning with a technique introduced in 2023 by Stanford University AI scholars called “direct preference optimization,” or DPO. It is an alternative to the “reinforcement learning from human feedback” approach that OpenAI made popular, designed to achieve the same result far more efficiently.
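
The DPO objective itself is compact enough to sketch. Assuming per-sequence log-probabilities have already been computed under the model being tuned and under a frozen reference copy, the loss, as given in the Stanford paper (this is not Meta’s code), looks roughly like this:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """Direct preference optimization: nudge the model to raise the
    probability of the human-preferred response, relative to a frozen
    reference model, with no reward model or RL loop required."""
    chosen_margin = policy_chosen - ref_chosen
    rejected_margin = policy_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy per-sequence log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -15.0]),   # policy, preferred
                torch.tensor([-14.0, -13.0]),   # policy, rejected
                torch.tensor([-13.0, -14.5]),   # reference, preferred
                torch.tensor([-13.5, -13.2]))   # reference, rejected
print(loss)  # a scalar to backpropagate through the policy model
```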

To these broad post-training approaches, the Meta researchers add a couple of twists. For one, they post-trained Llama 3.1 405B to use “tools,” external programs, such as search engines, that can perform functions on the model’s behalf. That involves, among other things, feeding the model examples of prompts that are solved by invoking API calls, along the lines of the sketch below.
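
What such a training example might look like can be shown in a few lines of Python; the schema here is invented for illustration and is not Meta’s actual training format.

```python
import json

# A hypothetical fine-tuning pair: a prompt matched with the tool
# call that solves it, so the model learns to emit the call.
example = {
    "prompt": "What is the weather in Menlo Park right now?",
    "target": json.dumps({
        "tool": "search_engine",  # hypothetical tool name
        "arguments": {"query": "current weather Menlo Park CA"},
    }),
}
# During fine-tuning, the model learns to emit the "target" tool
# call when it sees prompts like this one.
print(example["target"])
```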

By fine-tuning Llama on the examples, Meta claims, the model gains much better “zero-shot” tool use, the ability to invoke a tool it has not been shown during training.

To reduce the prevalence of “hallucinations,” the authors select examples from the training data and craft original question-answer pairs from them. They use these to further fine-tune the model in order to “encourage the model to only answer questions which it has knowledge about, and refuse answering those questions that it is unsure about.”
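
A minimal sketch of that data-generation idea, under stated assumptions: generate_answer below is a hypothetical stub standing in for sampling from the pre-trained model, and the refusal text is invented for illustration.

```python
# Sketch: probe whether the model already knows a fact, and build
# a fine-tuning target accordingly.
def generate_answer(question: str) -> str:
    return "Paris"  # stub: imagine the pre-trained model answering here

def build_pair(question: str, known_answer: str) -> dict:
    attempt = generate_answer(question)
    if attempt.strip().lower() == known_answer.strip().lower():
        # The model knows this fact: train it to answer directly.
        return {"prompt": question, "target": known_answer}
    # The model gets it wrong: train it to decline instead.
    return {"prompt": question, "target": "I'm not sure about that."}

print(build_pair("What is the capital of France?", "Paris"))
```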

Also: We found a way to escape Meta AI on Facebook – but there’s a catch  

The Meta researchers characterized all of their choices as aiming for simplicity. 

“Throughout the development of the Llama 3 model family, we found that a strong focus on high-quality data, scale, and simplicity consistently yielded the best results,” they wrote. “In preliminary experiments, we explored more complex model architectures and training recipes but did not find the benefits of such approaches to outweigh the additional complexity they introduce in model development.”

Certainly, the scale of the program is a landmark for open-source models, which typically have been far smaller than their commercial, closed-source competitors.

Meta boasts of how Llama 3.1 405B beats or meets the large commercial, closed-source models. (Chart: Meta Properties)

Meta co-founder and CEO Mark Zuckerberg lauded the economics of using Llama 3.1. “Developers can run inference on Llama 3.1 405B on their own infra[structure] at roughly 50% the cost of using closed models like GPT-4o, for both user-facing and offline inference tasks,” Zuckerberg wrote.

Zuckerberg also broadly defended open-source AI as a natural evolution of software. It is the equivalent, he wrote, of the Unix operating system that evolved from early proprietary versions to “a more advanced, secure, and broader ecosystem” because of its open-source versions.

Also: Meta inches toward open source AI with new LLaMA 3.1

As ZDNET’s Steven Vaughan-Nichols writes, however, some details have been left out of Meta’s code posting on Hugging Face, and its license is more restrictive than other open-source licenses. That makes Llama 3.1 open source only up to a point.

Reasonable parties can disagree over how strictly Llama 3.1 qualifies as open source, but the amount of detail Meta offers about the model’s training is itself a welcome trove of disclosure, especially at a time when OpenAI and Google share little if any information about how they construct their closed-source models.




