Will AWS’ next-gen Trainium2 chips accelerate AI development and put Amazon ahead in the chips race?
AWS aims to meet these ever-growing demands with Trn2 instances, which use 16 connected Trainium2 chips to provide 20.8 peak petaflops of compute. According to AWS, this makes the platform ideal for training and deploying LLMs with 100 billion-plus parameters, and offers 30% to 40% better price/performance than the current generation of GPU-based instances.
“That is performance that you cannot get anywhere else,” AWS CEO Matt Garman said onstage at this week’s AWS re:Invent conference.
In addition, Amazon’s Trn2 UltraServers are a new Amazon EC2 offering that features 64 Trainium2 chips linked by a NeuronLink interconnect. This single “ultranode” delivers 83.2 petaflops of compute, quadrupling the compute, memory, and networking of a single instance, Garman said. “This has a massive impact on latency,” he noted.
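Those figures appear internally consistent: 20.8 petaflops across 16 chips works out to 1.3 petaflops per Trainium2 chip, and 64 chips at that rate yields 83.2 petaflops, exactly four times the single-instance number.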
AWS aims to push these capabilities even further with Trainium3, which is expected later in 2025. This will provide 2x more compute and 40% more efficiency than Trainium2, the company said, and Trainium3-powered UltraServers are expected to be 4x more performant than Trn2 UltraServers.
Garman asserted: “It will have more instances, more capabilities, more compute than any other cloud.”
For developers, Trainium2 provides more capability through tighter integration of AI chips with software, Baier pointed out, but it also deepens vendor lock-in, and with it the risk of higher long-term prices. That makes deliberately architecting for “switchability” across foundation models and AI chips an important design consideration. Switchability is a chip’s ability to adjust its processing configuration to support different types of AI workloads; depending on need, it can switch between tasks, which helps with development and scaling while cutting cost.
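To make that design consideration concrete, here is a minimal sketch of what switchability can look like at the software level: a small device-selection layer that lets the same PyTorch training code target an NVIDIA GPU or a Trainium chip via the torch_xla bridge that AWS’s Neuron SDK builds on. The select_device helper and its prefer parameter are illustrative, not an AWS API.

```python
# A minimal sketch of a device-abstraction layer, assuming PyTorch on the
# GPU side and torch_xla for XLA devices such as Trainium. The helper name
# and its "prefer" parameter are hypothetical, not part of any AWS SDK.
import torch

def select_device(prefer: str = "auto") -> torch.device:
    """Pick an accelerator without hard-coding the training code to one vendor."""
    if prefer in ("auto", "trainium"):
        try:
            # torch_xla is the bridge used for XLA-backed accelerators.
            import torch_xla.core.xla_model as xm
            return xm.xla_device()
        except ImportError:
            if prefer == "trainium":
                raise  # Caller explicitly asked for Trainium; surface the error.
    if prefer in ("auto", "cuda") and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

# The training code itself stays hardware-neutral: only the device handle changes.
device = select_device()
model = torch.nn.Linear(512, 512).to(device)
batch = torch.randn(32, 512).to(device)
loss = model(batch).sum()
loss.backward()
```

Keeping the accelerator choice behind one seam like this is what preserves the option to move workloads between vendors later, which is precisely the lock-in pressure Baier describes.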