
DISTRIBUTED MACHINE LEARNING
Machine Learning
Fall 2024
Zahra Keshavarz Rezaei
What is Distributed Machine Learning?

Distributed Machine Learning refers to the process of training machine learning models using multiple machines or processors simultaneously. This allows the handling of massive datasets and complex models that a single machine cannot efficiently process.
Distributed Machine Learning vs. Federated Learning
In distributed ML, a centrally held dataset is partitioned across the machines of one cluster; in federated learning, the data stays on the participants' devices and only model updates are shared with a coordinating server.
Types of Distributed Learning:

Data Parallelism
Model Parallelism
Hybrid Parallelism
Model Parallelism
Split the model itself (e.g., layers of a neural network) across
different nodes. One node processes input layers, while
another processes hidden layers.
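A minimal sketch of this layer-wise split in PyTorch, assuming two GPUs ("cuda:0" and "cuda:1") are available; the model and layer sizes are illustrative, not taken from the slides.

import torch
import torch.nn as nn

class SplitModel(nn.Module):
    # Illustrative model-parallel split: the input block lives on cuda:0,
    # the output block on cuda:1.
    def __init__(self):
        super().__init__()
        self.input_block = nn.Sequential(nn.Linear(784, 512), nn.ReLU()).to("cuda:0")
        self.output_block = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        h = self.input_block(x.to("cuda:0"))
        # Activations are copied across devices at the split point.
        return self.output_block(h.to("cuda:1"))

model = SplitModel()
logits = model(torch.randn(32, 784))  # one forward pass spans both devices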
Data Parallelism
Split the training data into smaller subsets. Each worker
(node) processes its subset. Think of each node as working
on a few puzzle pieces to contribute to the entire picture.
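A minimal data-parallel sketch using PyTorch's DistributedDataParallel, assuming the script is launched with torchrun so that each process becomes one worker; the toy dataset and linear model are illustrative.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="gloo")  # one process per worker

model = DDP(torch.nn.Linear(20, 2))      # DDP averages gradients across workers

# DistributedSampler gives each worker a disjoint subset of the training data.
dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in loader:
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()  # gradients are all-reduced during backward()
    optimizer.step()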
Key Algorithms in DML

Stochastic Gradient Descent (SGD)
Mini-batch SGD distributed across nodes.
Gradient Aggregation
Aggregates gradients computed by multiple nodes.
Synchronous vs. Asynchronous Training

Synchronous: all workers wait for one another and apply the aggregated update to the weights together at each step.

Asynchronous: workers update the shared weights independently, without waiting for slower nodes (at the risk of computing with stale parameters).
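A plain NumPy sketch of synchronous, distributed mini-batch SGD with gradient aggregation; the linear model, learning rate, and simulated workers are illustrative.

import numpy as np

def local_gradient(w, x_shard, y_shard):
    # Mean-squared-error gradient for a linear model on one worker's shard.
    return 2 * x_shard.T @ (x_shard @ w - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
w = np.zeros(5)
# Four simulated workers, each holding its own shard of the training data.
shards = [(rng.normal(size=(64, 5)), rng.normal(size=64)) for _ in range(4)]

for step in range(100):
    # Each worker computes a gradient on its own mini-batch (in parallel on a real cluster).
    grads = [local_gradient(w, x, y) for x, y in shards]
    # Synchronous aggregation: average the gradients, then apply one shared update.
    w -= 0.01 * np.mean(grads, axis=0)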


AllReduce Algorithm

A communication pattern used to aggregate gradients across all workers.
Reduces communication overhead by combining operations (e.g., summing gradients) during data transfer.
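A minimal sketch of calling an all-reduce with torch.distributed, assuming the process group is launched with torchrun; the gradient tensor here is a stand-in for real model gradients.

import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # assumes launch via torchrun

local_grad = torch.randn(10)             # this rank's locally computed gradient

# After all_reduce, every rank holds the sum of all ranks' gradients.
dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)

# Divide by the number of workers to get the averaged gradient for the update.
local_grad /= dist.get_world_size()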
Ring-AllReduce Algorithm
An optimized implementation of AllReduce.
Workers are organized in a ring topology.
Each worker sends gradients to one neighbor and receives from the other in a pipelined fashion.
Spreads communication evenly around the ring, avoiding the bandwidth bottleneck of aggregating everything at a single node.
Two phases: Scatter-Reduce (partial sums travel around the ring), then All-Gather (the completed sums are passed around so every worker receives them).
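A serial NumPy simulation of the two Ring-AllReduce phases described above; the worker count, gradient size, and chunking are illustrative.

import numpy as np

def ring_allreduce(chunks):
    # chunks[i][c] is chunk c of worker i's gradient; on return, every worker
    # holds the element-wise sum of all workers' gradients.
    n = len(chunks)

    # Scatter-Reduce: each worker passes one chunk to its right-hand neighbour,
    # which adds it to its own copy; after n-1 steps worker i owns the complete
    # sum for chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # All-Gather: the completed chunks travel around the ring once more, so
    # every worker ends up with every fully reduced chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c]
    return chunks

rng = np.random.default_rng(0)
grads = [rng.normal(size=8) for _ in range(4)]             # one gradient per worker
chunks = [list(np.split(g.copy(), 4)) for g in grads]      # split each into 4 chunks
result = ring_allreduce(chunks)
assert np.allclose(np.concatenate(result[0]), sum(grads))  # every worker has the sum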
Parameter Server Architecture

Dedicated nodes responsible for storing and updating the model parameters (weights, biases, etc.).
Aggregate gradients from workers and send updated parameters back.
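A toy sketch of the parameter-server pattern in plain Python/NumPy; the pull/push interface and the averaging rule are illustrative simplifications.

import numpy as np

class ParameterServer:
    # Holds the global parameters; workers pull them, compute gradients
    # locally, and push the gradients back for aggregation.
    def __init__(self, dim, lr=0.01):
        self.weights = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.weights.copy()

    def push(self, worker_grads):
        # Aggregate (average) the workers' gradients and update the parameters.
        self.weights -= self.lr * np.mean(worker_grads, axis=0)

server = ParameterServer(dim=5)
for step in range(10):
    w = server.pull()                                       # workers fetch current weights
    grads = [np.random.normal(size=5) for _ in range(4)]    # stand-in for real worker gradients
    server.push(grads)                                      # server applies the aggregated update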
Frameworks

TensorFlow Distributed: Provides strategies like tf.distribute.Strategy for data and model parallelism.
PyTorch Distributed: Includes utilities like torch.distributed for communication and gradient sharing.
Horovod: Open-source library optimized for distributed deep learning with minimal code changes.
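A minimal sketch of the first of these, synchronous data parallelism with tf.distribute.MirroredStrategy; the Keras model and input shape are illustrative.

import tensorflow as tf

# MirroredStrategy replicates the model on every local GPU and all-reduces
# the gradients after each step.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(train_dataset) then runs data-parallel training across the devices.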
Challenges in DML

Communication Overhead:
Synchronization between nodes (e.g., sharing gradients) can slow down training.
Fault Tolerance:
Node failures can disrupt training or lead to inconsistencies.
Data Imbalance:
Uneven distribution of data can lead to skewed models.
Applications

Image Recognition

Language Modeling

Finance
Conclusion
DML is essential for scaling AI to meet modern demands.
Key approaches: Data parallelism, model parallelism, and hybrid methods.
Challenges: Communication overhead, fault tolerance, and data imbalance.
THANK YOU
REFERENCES
Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2020. A Survey on Distributed Machine Learning.
Huasha Zhao and John Canny. 2013. Sparse Allreduce: Efficient Scalable Communication for Power-Law Data.
Distributed Machine Learning and the Parameter Server, lecture notes, CS4787: Principles of Large-Scale Machine Learning Systems, Cornell University.
