How a Leading AI SaaS Lowered Training Costs for Its LLM and Reduced Loading Times by 50% with AWS
Executive Summary
While running its workloads on Google Cloud Platform (GCP) and Vertex AI, the company wanted to explore whether a more cost-effective model training solution was possible with Amazon Web Services (AWS) and turned to Mission Cloud for help. Mission Cloud developed a highly performant and cost-optimized storage solution using Amazon FSx for Lustre so the company could train its large language model on AWS.
Mission Cloud successfully proved a training solution on AWS that met all of the company’s requirements while using its existing training framework, one that would allow research scientists to seamlessly initiate training jobs in the cloud. It demonstrated a 50% reduction in loading times and associated reductions in experimentation time when compared to the incumbent system because of the flexibility provided by AWS' compute resource model.
Challenge: Providing Consistent, Cross-Platform Training
This leading AI SaaS company had initially built its large language model on GCP but was encountering issues because of the extensive scale of its training workload. It subsequently discovered that the Vertex AI platform offered insufficient RAM to support its large datasets. This prevented the company from training an adequate number of instances simultaneously and led to serious degradation in training times, with each pass through the dataset taking progressively longer.
For these reasons, the company wanted to see what a solution with AWS might offer in terms of performance and cost. Before attempting a full migration, a proof of concept needed to demonstrate substantial wins over GCP’s limitations and to integrate the existing custom-built training framework and third-party tracking tools which the company relied upon. And because the company wanted to benchmark the training performance of the POC against the existing solution, it needed to be completed before the next scheduled training run on GCP.
This created a tight timeline for getting the existing models into Amazon SageMaker, running them on multi-node configurations with multiple GPUs, and standing up the necessary data infrastructure to support them. The datasets themselves were massive, in some cases hundreds of terabytes, so an efficient data architecture was paramount.
After exploring various ways to prove the viability of a possible migration, the company chose to leverage Mission Cloud’s technical expertise to develop a proof of concept and demonstrate the technical and cost benefits of an AWS-native solution.
Solution: Developing a Highly Performant Infrastructure for Training Large Datasets
To prove that the AWS infrastructure could efficiently train with massive amounts of data, starting with an initial 50 TB dataset, Mission Cloud needed an in-depth understanding of the customer’s current use cases and data flows. This work included determining data pipelines for model training and evaluating the customer’s data warehousing needs. The Mission Cloud team also needed to evaluate the company’s existing modeling pipeline, identify pain points, and ensure the compatibility of Hugging Face models with Amazon SageMaker.
Mission Cloud began by transferring datasets from GCP to AWS, tuning AWS DataSync for the various file types involved, the number of agents, and other settings. This work included managing a large number of raw audio files, a process made more complex because the files weren’t compressed and varied in size.
As part of the POC, the company wanted to leverage the newly released P4d instances for their cost efficiency with ML workloads, but these weren’t compatible with the version of PyTorch used in the GCP implementation. Getting these instance types to work required additional dependency resolution so that the containers would run properly on Amazon SageMaker.
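As an illustration of what pinning a P4d-compatible PyTorch version on SageMaker can look like, here is a minimal sketch using the SageMaker Python SDK. It assumes the SDK's built-in PyTorch estimator, whereas the company's containers may well have been custom; the entry point `train.py`, the framework and Python versions, the role, and the network identifiers are all hypothetical placeholders, not values from the project. The FSx for Lustre input refers to the file system described in this case study.

```python
from typing import Any, Dict, List


def estimator_kwargs(
    role_arn: str,
    subnets: List[str],
    security_group_ids: List[str],
    instance_count: int = 2,
    framework_version: str = "1.13.1",  # assumed version, not from the source
    py_version: str = "py39",           # assumed version, not from the source
) -> Dict[str, Any]:
    """Build keyword arguments for a multi-node P4d training job."""
    return {
        "entry_point": "train.py",            # hypothetical training script
        "role": role_arn,
        "instance_type": "ml.p4d.24xlarge",   # SageMaker name for P4d
        "instance_count": instance_count,
        "framework_version": framework_version,
        "py_version": py_version,
        # FSx for Lustre inputs require the job to run inside a VPC.
        "subnets": subnets,
        "security_group_ids": security_group_ids,
    }


def launch(kwargs: Dict[str, Any], file_system_id: str, directory_path: str) -> None:
    """Start the job, reading training data from an FSx for Lustre file system."""
    # Imported lazily so estimator_kwargs() works without the SDK installed.
    from sagemaker.inputs import FileSystemInput
    from sagemaker.pytorch import PyTorch

    train_input = FileSystemInput(
        file_system_id=file_system_id,
        file_system_type="FSxLustre",
        directory_path=directory_path,
        file_system_access_mode="ro",
    )
    PyTorch(**kwargs).fit({"train": train_input})
```

Keeping the configuration in a plain dictionary makes the version pins easy to review and adjust while dependencies are being resolved.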
As expected, the large datasets and massive amounts of associated metadata proved to be a central challenge. The time required to preload data from S3 to FSx was at first significant, with a theoretical maximum of two weeks, which was an unacceptable delay. But once loaded, FSx represented an ideal solution because it would provide a local file system that the instances could access with low latency, facilitating high throughput during training.
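To see why preload time matters at this scale, a back-of-the-envelope calculation helps. The sketch below is illustrative only: the source does not say which dataset size or throughput produced the two-week figure, so the numbers in the usage note are assumptions.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units: 1 TB = 1,000,000 MB


def preload_days(dataset_tb: float, throughput_mb_s: float) -> float:
    """Days needed to copy dataset_tb terabytes at a sustained throughput."""
    return dataset_tb * MB_PER_TB / throughput_mb_s / SECONDS_PER_DAY


def required_throughput_mb_s(dataset_tb: float, days: float) -> float:
    """Sustained throughput (MB/s) needed to copy dataset_tb in the given days."""
    return dataset_tb * MB_PER_TB / (days * SECONDS_PER_DAY)
```

For example, if a 50 TB dataset took a full two weeks to load, the effective rate would be `required_throughput_mb_s(50, 14)`, or about 41 MB/s. That is far below what a parallel file system can sustain, which is why parallelizing the load pays off.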
Because FSx doesn't preload data by default, custom software was needed to determine the fastest loading method. To optimize this process further, Mission Cloud collaborated with Amazon’s FSx team to parallelize loading as much as possible and ensure the best FSx configuration, including determining how many agents were needed and how to avoid wasting GPU time during training.
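The project's custom loading software is not described in detail, so the following is only a sketch of one common way to parallelize preloading: walk the mounted file system and issue `lfs hsm_restore` requests for batches of files from several worker threads. The worker count, batch size, and function names are hypothetical, and the command may need elevated privileges on a real client.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List


def iter_files(root: str) -> Iterator[str]:
    """Yield every regular file path under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def batched(paths: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group paths into lists of at most `size` items."""
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def hsm_restore(batch: List[str]) -> int:
    """Request that Lustre load a batch of files from S3; returns the exit code."""
    return subprocess.run(["lfs", "hsm_restore", *batch], check=False).returncode


def preload(
    root: str,
    workers: int = 16,       # hypothetical default
    batch_size: int = 256,   # hypothetical default
    restore: Callable[[List[str]], int] = hsm_restore,
) -> int:
    """Issue restore requests in parallel; returns the number of failed batches."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(restore, batched(iter_files(root), batch_size)))
    return sum(1 for code in codes if code != 0)
```

Passing the restore function as a parameter keeps the sketch testable off-cluster: a stub can stand in for `lfs` while batching and concurrency are tuned.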
Mission Cloud worked with the customer’s custom-built training software through GitHub, making pull requests directly in the repositories to configure the training process as needed. Once the first model was successfully implemented this way, Mission Cloud generalized its modifications to enable any model using the system to train on AWS while accommodating the necessary parameter adjustments.
Throughout each phase of implementation, a significant focus was placed on helping the company’s developers understand the tools and technologies involved, with the aim of making the proof of concept entirely self-service from its launch date.
Results: 50% Faster Training Times and Cost Reductions
By leveraging the AWS Migration Acceleration Program (MAP), Mission Cloud secured $60,000 to cover the design phase of this work. The proof of concept built during this phase successfully integrated training on AWS with the company’s existing framework, allowing research scientists to seamlessly initiate training jobs on AWS. Mission Cloud developed novel, optimized scripts to enhance data loading procedures for FSx, with the net result being a 50% reduction in loading times. This also reduced the cost of training and experimentation time because of the flexibility of the P4d compute resources and the novel usage of FSx to supply training data.
Throughout the project, Mission Cloud demonstrated its technical expertise with each component of the company’s infrastructure. This knowledge transfer gave the customer’s team familiarity with the elements of training on SageMaker, instilling confidence in the platform for future training and a potential migration.
AWS and Third-Party Services Used
During its partnership with Mission Cloud, this company leveraged AWS and third-party services, including:
- DataSync
- FSx for Lustre
- SageMaker
- S3
- EC2
- GitHub
- Hugging Face