[Demo+Webinar] Distributed training w/ DeepSpeed ZeRO, Kubernetes on AWS
June 20 @ 12:00 pm - 1:00 pm EDT
**Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth**
**Talk #1: Optimizing large-scale, distributed training jobs using Nvidia GPUs, DeepSpeed ZeRO, and Kubernetes on AWS**
by Justin Chiu, Software Engineer @ Amazon Alexa AI
Most modern natural-language-processing applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. These models contain billions, or even trillions, of parameters.

Training these models within a reasonable amount of time requires very large computing clusters, often with GPUs. Communication between the GPUs needs to be carefully managed to avoid performance bottlenecks.

In this talk, we will discuss techniques to optimize large-scale training jobs on cloud-based hardware using Nvidia GPUs and Kubernetes on AWS. The following steps will be covered:
(1) [Basic infrastructure] Profile NCCL bandwidth to confirm that you are getting ~100 Gbps all-reduce bandwidth on p3dn and ~350 Gbps all-reduce bandwidth on p4d. This confirms that your EKS-EFA setup ([https://github.com/aws-samples/aws-efa-eks](https://github.com/aws-samples/aws-efa-eks)) is correct, along with other important EKS/EC2 settings such as cluster placement groups. For instructions on running the profile, see: [https://github.com/NVIDIA/nccl-tests](https://github.com/NVIDIA/nccl-tests)
(2) [Training code and DNN framework settings] Once (1) is done, confirm that the training throughput, measured in TFLOPS/GPU or samples/sec, matches expectations. The expected values depend on the model size, the input batch size, and the hardware; the speaker can provide some reference points if needed.
Note: If (2) succeeds, you're done. If not, go back to (1) and optimize the NCCL bandwidth first, which helps isolate the problem.
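The two checks above can be sketched in a few lines of Python. This is a rough illustration, not the speaker's tooling. The bandwidth conversion follows the all-reduce convention documented in nccl-tests (bus bandwidth = algorithm bandwidth × 2(n−1)/n), and it assumes the Gbps targets quoted above refer to bus bandwidth. The throughput estimate uses the common "6 × parameters FLOPs per token" approximation for a forward plus backward pass, which ignores attention cost and activation recomputation. All function names, the 10% tolerance, and the sample numbers are hypothetical.

```python
def allreduce_bus_bandwidth_gbps(message_bytes: int, elapsed_seconds: float, num_ranks: int) -> float:
    """Convert one timed all-reduce into bus bandwidth in Gbit/s.

    nccl-tests convention: algbw = bytes / time, and for all-reduce
    busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks.
    """
    if num_ranks < 2:
        raise ValueError("all-reduce bus bandwidth needs at least 2 ranks")
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    algbw_bytes_per_s = message_bytes / elapsed_seconds
    busbw_bytes_per_s = algbw_bytes_per_s * 2 * (num_ranks - 1) / num_ranks
    return busbw_bytes_per_s * 8 / 1e9  # bytes/s -> Gbit/s


def estimated_tflops_per_gpu(num_parameters: int, tokens_per_second: float, num_gpus: int) -> float:
    """Rough training throughput per GPU.

    Assumption: ~6 FLOPs per parameter per token (forward + backward).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    flops_per_token = 6 * num_parameters
    return flops_per_token * tokens_per_second / num_gpus / 1e12


def meets_target(measured: float, target: float, tolerance: float = 0.1) -> bool:
    """True if the measurement is within `tolerance` (fractional) below the target, or above it."""
    return measured >= target * (1 - tolerance)


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB all-reduce across 16 ranks in 0.14 s.
    busbw = allreduce_bus_bandwidth_gbps(1_000_000_000, 0.14, 16)
    print(f"bus bandwidth: {busbw:.1f} Gbps, meets ~100 Gbps target: {meets_target(busbw, 100)}")

    # Hypothetical job: 1.3B-parameter model, 400k tokens/s on 64 GPUs.
    tflops = estimated_tflops_per_gpu(1_300_000_000, 400_000, 64)
    print(f"estimated throughput: {tflops:.1f} TFLOPS/GPU")
```

In practice the `all_reduce_perf` binary from nccl-tests already reports algorithm and bus bandwidth in GB/s, so only the last step of the conversion (multiplying by 8 to get Gbps) would be needed to compare against the targets above.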
**Talk #2: TBD**
Zoom link: https://us02web.zoom.us/j/82308186562
O’Reilly Book: https://www.amazon.com/dp/1492079391/
GitHub Repo: https://github.com/data-science-on-aws/