Meta plans the world’s fastest supercomputer for AI


Facebook’s parent company Meta said it is building the world’s largest AI supercomputer to power the machine learning and natural language processing behind its metaverse project.

The new machine, called the AI Research SuperCluster (RSC), will contain 16,000 Nvidia A100 GPUs and 4,000 AMD Epyc Rome 7742 processors. It will have 2,000 Nvidia DGX-A100 nodes, each with eight GPUs and two Epyc processors. Meta expects to complete construction this year.

RSC is already partially built, with 760 of the DGX-A100 systems deployed. Meta researchers have started using RSC to train large models in natural language processing (NLP) and computer vision, with the goal of eventually training models with trillions of parameters, according to Meta.
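The node counts above can be checked with a few lines of arithmetic. The following is a minimal sketch using only the figures quoted in this article; the function and variable names are illustrative and do not come from Meta or Nvidia.

```python
# Figures quoted in the article.
TOTAL_NODES = 2_000      # planned DGX-A100 nodes
DEPLOYED_NODES = 760     # nodes already installed
GPUS_PER_NODE = 8        # A100 GPUs per node
CPUS_PER_NODE = 2        # Epyc processors per node


def cluster_totals(nodes: int) -> dict:
    """Return total GPU and CPU counts for a given number of nodes."""
    return {
        "gpus": nodes * GPUS_PER_NODE,
        "cpus": nodes * CPUS_PER_NODE,
    }


full = cluster_totals(TOTAL_NODES)
deployed = cluster_totals(DEPLOYED_NODES)
deployed_fraction = DEPLOYED_NODES / TOTAL_NODES

print(f"Planned:  {full['gpus']:,} GPUs, {full['cpus']:,} CPUs")
print(f"Deployed: {deployed['gpus']:,} GPUs, {deployed['cpus']:,} CPUs")
print(f"Share of nodes deployed: {deployed_fraction:.0%}")
```

On these numbers, the 760 deployed systems account for 6,080 GPUs, or 38% of the planned machine.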

“Meta has developed what we believe is the world’s fastest supercomputer. We’re calling it RSC for AI Research SuperCluster, and it’ll be complete later this year. The experiences we’re building for the metaverse require enormous compute power (quintillions of operations/second!) and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more,” said CEO Mark Zuckerberg in an emailed statement.

RSC is expected to reach a peak performance of 5 exaFLOPS in mixed-precision processing, using both FP16 and FP32. On paper that would put it well ahead of the current leader of the Top500 supercomputer list, which reaches 442 Pflop/s, although the Top500 ranking is based on a double-precision (FP64) benchmark, so the two figures are not directly comparable. RSC is being built in partnership with Penguin Computing, a specialist in HPC systems.
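To put the headline numbers in perspective, the short sketch below compares the two quoted figures and derives what the 5 exaFLOPS peak implies per GPU. It uses only numbers from this article. The ratio is of headline figures only: the Top500 ranking uses a double-precision benchmark, whereas Meta's figure is for mixed precision. Variable names are illustrative.

```python
# Figures quoted in the article, in FLOPS.
RSC_PEAK_FLOPS = 5e18          # 5 exaFLOPS, mixed precision
TOP500_LEADER_FLOPS = 442e15   # 442 Pflop/s, Top500 list leader
GPU_COUNT = 16_000             # planned A100 GPUs in RSC

# Ratio of the two headline figures (not a like-for-like comparison).
headline_ratio = RSC_PEAK_FLOPS / TOP500_LEADER_FLOPS

# Peak throughput each GPU would need to contribute, in teraFLOPS.
per_gpu_tflops = RSC_PEAK_FLOPS / GPU_COUNT / 1e12

print(f"Headline ratio: {headline_ratio:.1f}x")
print(f"Implied peak per GPU: {per_gpu_tflops:.1f} teraFLOPS")
```

The implied per-GPU figure of about 312 teraFLOPS is in line with Nvidia's published peak FP16 Tensor Core rating for the A100, which suggests the 5 exaFLOPS number is a sum of theoretical per-GPU peaks rather than a measured benchmark result.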

Meta is not disclosing where the system is located.

“RSC will help Meta’s AI researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more,” Kevin Lee, a technical program manager, and Shubho Sengupta, a software engineer, both at Meta, wrote in a blog post.

“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” they wrote.

In addition to all of the processing power, RSC also has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage, and 10 petabytes of Pure’s object storage equipment.

RSC is estimated to be nine times faster than Meta’s previous research cluster, which is made up of 22,000 of Nvidia’s older-generation V100 GPUs, and 20 times faster than its current AI systems. Meta does not plan to retire the old system.

The company is focused on building machine-learning models that automate content-related tasks. It wanted this infrastructure so it can train models with more than a trillion parameters on data sets as large as an exabyte, with the goal of getting its arms around all the content generated on its platform.

“By doing this, we can help advance research to perform downstream tasks such as identifying harmful content on our platforms as well as research into embodied AI and multimodal AI to help improve user experiences on our family of apps. We believe this is the first time performance, reliability, security, and privacy have been tackled at such a scale,” Lee and Sengupta wrote.

Copyright © 2022 IDG Communications, Inc.


