Amazon debuts Trainium, a custom chip for machine learning training in the cloud

Amazon today debuted AWS Trainium, a chip custom-designed to deliver what the company describes as cost-effective machine learning model training in the cloud. The announcement comes ahead of the availability of new Amazon Elastic Compute Cloud (EC2) instances built specifically for machine learning training and powered by Intel’s new Habana Gaudi processors.

“We know that we want to keep pushing the price performance on machine learning training, so we’re going to have to invest in our own chips,” AWS CEO Andy Jassy said during a keynote address at Amazon’s re:Invent conference this morning. “You have an unmatched array of instances in AWS, coupled with innovation in chips.”

Amazon claims that Trainium will offer the most teraflops of any machine learning instance in the cloud, where one teraflop means a chip can perform one trillion floating-point operations per second. When it becomes available to customers in the second half of 2021 as EC2 instances and in SageMaker, Amazon’s fully managed machine learning development platform, it’ll support popular frameworks including Google’s TensorFlow, Facebook’s PyTorch, and Apache MXNet. Moreover, Amazon says Trainium will use the same Neuron SDK as Inferentia, the company’s cloud-hosted chip for machine learning inference.

Absent benchmark results, it’s unclear how Trainium’s performance might compare with Google’s tensor processing units (TPUs), the search giant’s chips for AI training workloads hosted in Google Cloud Platform. Google says its forthcoming fourth-generation TPU offers more than double the matrix multiplication teraflops of a third-generation TPU. (Matrices are often used to represent the data that feeds into AI models.) The fourth-generation TPU also offers a “significant” boost in memory bandwidth while benefiting from unspecified advances in interconnect technology.
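
To put teraflop figures in perspective, here is a rough back-of-the-envelope sketch in Python. It counts the floating-point operations in a single dense matrix multiplication and estimates how long that would take at a given peak throughput. The function names and the 100-teraflop figure are illustrative assumptions, not published specifications for Trainium or any TPU, and real chips rarely sustain their peak rate.

```python
TERA = 10**12  # one teraflop = one trillion floating-point operations per second


def matmul_flops(m, k, n):
    """Operations for an (m x k) times (k x n) matrix product.

    Each of the m * n output entries needs k multiplies and roughly k adds,
    so the usual estimate is 2 * m * k * n.
    """
    return 2 * m * k * n


def seconds_at_peak(flops, teraflops):
    """Idealized run time if the chip sustained its peak rate (it rarely does)."""
    return flops / (teraflops * TERA)


# Hypothetical example: one 4096 x 4096 by 4096 x 4096 multiplication
# on an assumed 100-teraflop accelerator.
flops = matmul_flops(4096, 4096, 4096)
print(f"{flops:,} floating-point operations")
print(f"{seconds_at_peak(flops, 100.0) * 1000:.3f} ms at an assumed 100 teraflops")
```

Training a large model repeats multiplications like this an enormous number of times, which is why vendors quote matrix multiplication teraflops as a headline figure.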

Machine learning deployments have historically been constrained by the size and speed of algorithms and the need for costly hardware. In fact, a report from MIT found that machine learning might be approaching computational limits. A separate Synced study estimated that the University of Washington’s Grover fake news detection model cost $25,000 to train in about two weeks. OpenAI reportedly spent $12 million to train its GPT-3 language model, and Google spent an estimated $6,912 training BERT, a bidirectional transformer model that redefined the state of the art for 11 natural language processing tasks.
