Scaling Pandas with Ray and Modin + Alexa AI: Kubernetes and DeepSpeed ZeRO

June 20 @ 12:00 pm - 1:00 pm EDT

**Talk #0: Introductions and Meetup Announcements by Chris Fregly and Antje Barth**

**Talk #1: Modin – Speed up your Pandas workflows by changing a single line of code**

by Alejandro Herrera, Solution Architect at Ponder

Modin is a drop-in replacement for pandas. While pandas is single-threaded, Modin lets you instantly speed up your workflows by scaling pandas so it uses all of your cores. Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory.
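
As a rough illustration of the "single line of code" idea, the sketch below swaps the pandas import for `modin.pandas` and leaves the rest of the workflow unchanged. The fallback to plain pandas, the helper `mean_by_group`, and the tiny sample data are assumptions added only so the snippet runs anywhere. They are not part of the talk.

```python
# A minimal sketch, assuming Modin and one of its execution engines
# (for example Ray) are installed. The fallback to plain pandas is an
# assumption added here so the example still runs without Modin.
try:
    import modin.pandas as pd  # the single-line change
except ImportError:
    import pandas as pd  # plain, single-threaded pandas


def mean_by_group(frame):
    """Hypothetical helper: average the 'value' column per 'group'."""
    return frame.groupby("group")["value"].mean()


# Tiny illustrative dataset; any speed-up only shows on much larger data.
df = pd.DataFrame(
    {"group": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]}
)
result = mean_by_group(df)
print(result)
```

Everything after the import uses the ordinary pandas API, so existing code should not need further changes for operations that Modin supports.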

**Talk #2: Optimizing large-scale, distributed training jobs using Nvidia GPUs, DeepSpeed ZeRO, and Kubernetes on AWS**

by Justin Chiu, Software Engineer at Amazon Alexa AI

Most modern natural-language-processing applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. These models contain billions – or even trillions – of parameters. Training these models within a reasonable amount of time requires very large computing clusters – often with GPUs. Communication between the GPUs needs to be carefully managed to avoid performance bottlenecks.

In this talk, we will discuss techniques to optimize large-scale training jobs on cloud-based hardware using Nvidia GPUs and Kubernetes on AWS. The following steps will be covered:

(1) [Basic infrastructure] Profile NCCL bandwidth to confirm that you are getting ~100 Gbps all-reduce bandwidth on p3dn and ~350 Gbps all-reduce bandwidth on p4d. This confirms that your EKS-EFA setup is correct, along with other important EKS/EC2 settings such as cluster placement groups.

(2) [Training code and DNN framework settings] Once the above is done, confirm that the training throughput, measured in TFLOPS/GPU or samples/sec, matches expectations. What to expect depends on the model size, the input batch size, and the hardware, but I can provide some reference points if you need them.

Note: If (2) is successful, then you’re good. If not, you will want to fix (1) by optimizing the NCCL bandwidth to help isolate your problem.
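
The two checks above come down to simple arithmetic, sketched below under stated assumptions. The bandwidth helper follows the convention used by the nccl-tests benchmarks, where all-reduce bus bandwidth is the algorithm bandwidth scaled by 2(n-1)/n. The source does not say whether the ~100 Gbps and ~350 Gbps targets refer to algorithm or bus bandwidth, so both are returned. The throughput helper uses the common rule of thumb of roughly 6 FLOPs per parameter per token for a forward and backward pass, which ignores attention cost and activation recomputation. All function names, the `tolerance` parameter, and the sample numbers are hypothetical.

```python
# Targets quoted in the talk abstract (Gbit/s, all-reduce).
TARGET_GBPS = {"p3dn": 100, "p4d": 350}


def allreduce_bandwidth_gbps(message_bytes, seconds, num_ranks):
    """Return (algorithm, bus) bandwidth in Gbit/s for one all-reduce.

    Uses the nccl-tests convention:
      algbw = bytes / time
      busbw = algbw * 2 * (n - 1) / n
    """
    algbw = message_bytes / seconds * 8 / 1e9
    busbw = algbw * 2 * (num_ranks - 1) / num_ranks
    return algbw, busbw


def meets_target(measured_gbps, instance_family, tolerance=0.9):
    """Hypothetical check: within `tolerance` of the quoted target."""
    return measured_gbps >= tolerance * TARGET_GBPS[instance_family]


def tflops_per_gpu(samples_per_sec, tokens_per_sample, num_params, num_gpus):
    """Approximate training TFLOPS per GPU.

    Assumes ~6 * num_params FLOPs per token (forward + backward);
    this is a rule of thumb, not a profiler measurement.
    """
    flops_per_sec = 6 * num_params * tokens_per_sample * samples_per_sec
    return flops_per_sec / num_gpus / 1e12


# Made-up numbers for illustration only; these are not measurements.
algbw, busbw = allreduce_bandwidth_gbps(
    message_bytes=1e9, seconds=0.05, num_ranks=16
)
print(f"algbw={algbw:.1f} Gbps, busbw={busbw:.1f} Gbps")
print("p4d target met:", meets_target(busbw, "p4d"))
print("TFLOPS/GPU:", tflops_per_gpu(100, 512, 1e9, 8))
```

In practice the message size and timing would come from an NCCL benchmark run, and the samples/sec figure from the training logs.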
