[Demo+Webinar] Distributed training w/ DeepSpeed ZeRO, Kubernetes on AWS


June 20 @ 12:00 pm - 1:00 pm EDT

RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154

**Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth**

**Talk #1: Optimizing large-scale, distributed training jobs using Nvidia GPUs, DeepSpeed ZeRO, and Kubernetes on AWS**

by Justin Chiu, Software Engineer @ Amazon Alexa AI

Most modern natural-language-processing applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. These models contain billions – or even trillions – of parameters. Training these models within a reasonable amount of time requires very large computing clusters, often with GPUs, and communication between the GPUs needs to be carefully managed to avoid performance bottlenecks.

In this talk, we will discuss techniques to optimize large-scale training jobs on cloud-based hardware using Nvidia GPUs and Kubernetes on AWS. The following steps will be covered:
(1) [Basic infrastructure] Profile NCCL bandwidth to confirm you are getting ~100 Gbps all-reduce bandwidth on p3dn instances and ~350 Gbps on p4d instances. This confirms that your EKS/EFA setup ([https://github.com/aws-samples/aws-efa-eks](https://github.com/aws-samples/aws-efa-eks)) is correct, along with other important EKS/EC2 settings such as cluster placement groups. See [https://github.com/NVIDIA/nccl-tests](https://github.com/NVIDIA/nccl-tests) for how to run these measurements.
(2) [Training code and DNN framework settings] Once the above is done, confirm that the training throughput, measured in TFLOPS/GPU or samples/sec, matches expectations. What to expect depends on the model size, the input batch size, and the hardware, but reference points can be provided on request.
Note: If (2) is successful, you're done. If not, revisit (1) and optimize the NCCL bandwidth first to help isolate the problem.
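As a sanity check for step (1), you can run the `all_reduce_perf` binary from nccl-tests and parse the "Avg bus bandwidth" line it prints at the end of its output. The sketch below is a hypothetical helper, not something from the talk; note that nccl-tests reports bandwidth in GB/s, so divide a Gbps target by 8 before comparing.

```python
import re

def parse_avg_busbw(output: str) -> float:
    """Extract the 'Avg bus bandwidth' value (GB/s) from nccl-tests output."""
    match = re.search(r"Avg bus bandwidth\s*:\s*([\d.]+)", output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found in output")
    return float(match.group(1))

# Illustrative tail of `all_reduce_perf` output (the number is made up):
sample = """
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 42.13
"""

gbps_target = 100                # advertised EFA bandwidth, in Gbit/s
gbytes_target = gbps_target / 8  # nccl-tests reports GB/s, so convert

busbw = parse_avg_busbw(sample)
print(f"measured {busbw} GB/s, target {gbytes_target} GB/s")
```

If the measured bus bandwidth falls well below the converted target, that points at an infrastructure problem (EFA setup, placement groups) rather than your training code.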

* [https://www.amazon.science/blog/making-deepspeed-zero-run-efficiently-on-more-affordable-hardware](https://www.amazon.science/blog/making-deepspeed-zero-run-efficiently-on-more-affordable-hardware)
* [https://github.com/aws-samples/aws-efa-eks](https://github.com/aws-samples/aws-efa-eks)
* [https://github.com/NVIDIA/nccl-tests](https://github.com/NVIDIA/nccl-tests)
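For step (2), a common back-of-the-envelope check for dense transformer training assumes roughly 6 FLOPs per parameter per token (forward plus backward pass). The sketch below uses illustrative numbers only; the model size, batch, step time, and cluster size are assumptions, not figures from the talk.

```python
def tflops_per_gpu(n_params: float, tokens_per_step: float,
                   step_time_s: float, num_gpus: int) -> float:
    """Rough training throughput estimate: ~6 FLOPs per parameter per token."""
    flops_per_step = 6.0 * n_params * tokens_per_step
    return flops_per_step / (step_time_s * num_gpus * 1e12)

# Illustrative: a 13B-parameter model, 1024 sequences x 2048 tokens per
# global batch, a 10 s step, on 128 GPUs (all assumed numbers).
t = tflops_per_gpu(n_params=13e9,
                   tokens_per_step=1024 * 2048,
                   step_time_s=10.0,
                   num_gpus=128)
print(f"{t:.1f} TFLOPS/GPU")
```

Compare the result against the GPU's peak (e.g. 312 dense BF16 TFLOPS on an A100): a large gap suggests a bottleneck in the training code, framework settings, or interconnect.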

**Talk #2: TBD**

RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154

Zoom link: https://us02web.zoom.us/j/82308186562

Related Links

O’Reilly Book: https://www.amazon.com/dp/1492079391/
Website: https://datascienceonaws.com
Meetup: https://meetup.meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com