InfiniBand Interconnect Technology
The NVIDIA Mellanox InfiniBand interconnect brings high-speed, extremely low-latency, and scalable solutions.
InfiniBand technology enables supercomputer, Artificial Intelligence (AI), and cloud data centers
to operate at any scale, while reducing operational costs and infrastructure complexity.
Simplified Management
An InfiniBand network is managed by a Subnet Manager.
InfiniBand is the first architecture to truly implement the vision of Software-Defined Networking (SDN).
The Subnet Manager
The Subnet Manager (SM) is a program that runs on and manages the entire network.
The SM provides centralized routing management,
which enables plug-and-play of all the nodes in
the network.
Every InfiniBand subnet has its own master SM, and
to ensure resiliency, a second SM
functions as a standby.
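The master/standby relationship can be pictured with a minimal sketch. This is a toy model, not an NVIDIA API: the class and attribute names (`SubnetManager`, `alive`) are invented for illustration, and real SM failover involves election and state hand-off that this omits.

```python
# Toy model of SM resiliency: one master Subnet Manager plus a standby
# that takes over when the master fails. Hypothetical names throughout.

class SubnetManager:
    def __init__(self, name):
        self.name = name
        self.alive = True   # simulated health state

class Subnet:
    def __init__(self, master, standby):
        self.master = master
        self.standby = standby

    def active_sm(self):
        # The standby is promoted only if the current master has failed.
        if not self.master.alive:
            self.master, self.standby = self.standby, self.master
        return self.master

subnet = Subnet(SubnetManager("sm-a"), SubnetManager("sm-b"))
print(subnet.active_sm().name)   # "sm-a" while the master is healthy
subnet.master.alive = False      # simulate a master SM failure
print(subnet.active_sm().name)   # "sm-b" takes over
```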
InfiniBand Bandwidth
The InfiniBand architecture began its journey in 2002, with a speed of 10 gigabits per second,
and since then it has provided the highest-bandwidth, non-blocking, bi-directional links.
CPU Offloads
The InfiniBand architecture supports data transfer with minimal CPU intervention.
This is achievable thanks to:
Hardware-based transport protocol
Kernel bypass and zero copy
Remote Direct Memory Access (RDMA) - RDMA allows direct memory access from the memory of
one node into that of another without involving either node's CPU.
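The "zero copy" idea, moving data without duplicating it in intermediate buffers, can be illustrated with a small in-process analogy. This is only an analogy for the concept, not the RDMA verbs API: Python's `memoryview` exposes a buffer without copying it, just as RDMA transfers reference registered memory directly.

```python
# Zero-copy analogy: a memoryview references the same underlying buffer,
# so no bytes are duplicated when it is created or sliced.
buf = bytearray(b"payload-to-send")
view = memoryview(buf)      # no copy is made here
chunk = view[0:7]           # slicing a view is also copy-free

buf[0:7] = b"PAYLOAD"       # modify the buffer in place (same length)
# The view observes the change, proving both names share one buffer.
assert chunk.tobytes() == b"PAYLOAD"
```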
GPU Direct
GPUDirect allows direct data transfer from the memory of one GPU to the memory of another.
It enables lower latency and improved performance for GPU-based computation.
Offloading the compute nodes is implemented by NVIDIA GPUs as well.
Low Latency
Extremely low latency is achieved by a combination of hardware offloading and acceleration
mechanisms that is unique to the InfiniBand architecture.
As a result, the end-to-end latency of RDMA sessions can be as low as 1,000 nanoseconds,
or 1 microsecond.
Network Scale-Out
One of InfiniBand's main advantages is the capability to deploy up to 48,000 nodes on a single subnet.
Multiple InfiniBand subnets can be interconnected using InfiniBand routers in order to
easily scale out beyond 48,000 nodes.
Quality of Service
Quality of Service is the ability to provide different priorities to different:
Applications
Users
Data flows
Applications that require a higher priority are mapped to a different port queue, and their packets are
sent first to the next element in the network.
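The per-port priority queuing described above can be sketched as a strict-priority scheduler. This is a toy model, not switch firmware: the class name `PortScheduler` and the numeric priorities are invented for illustration.

```python
# Toy strict-priority egress scheduler: lower number = higher priority.
from heapq import heappush, heappop
from itertools import count

class PortScheduler:
    def __init__(self):
        self._q = []
        self._seq = count()   # tie-breaker keeps FIFO order per priority

    def enqueue(self, packet, priority):
        heappush(self._q, (priority, next(self._seq), packet))

    def transmit(self):
        # The highest-priority packet leaves the port first.
        return heappop(self._q)[2]

port = PortScheduler()
port.enqueue("bulk-data", priority=3)
port.enqueue("mpi-msg", priority=0)   # latency-sensitive application
port.enqueue("user-io", priority=1)
print(port.transmit())  # "mpi-msg" is sent first
```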
Fabric Resiliency
One of the main features that customers require
is a stable network without link failures.
Yet, when a failure does occur, traffic must resume very quickly.
When traffic re-routing depends solely on the
Subnet Manager's routing algorithm,
traffic renewal can take around five seconds.
NVIDIA Self-Healing Networking
NVIDIA Self-Healing Networking enables link fault recovery that is 5,000x faster,
which means recovery takes only one millisecond.
Self-Healing Networking is a hardware-based capability of NVIDIA switches.
Load-Balancing
Another requirement that should be addressed in a modern, high-performance data center is
how to best utilize and optimize the network.
One way to achieve that is to have a load-balancing scheme.
Load balancing is a routing strategy that allows traffic to be distributed over multiple available paths.
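A common way to spread traffic over multiple paths is to hash each flow onto one of them, so that a flow's packets stay in order while the population of flows spreads across the fabric. The sketch below is a generic hash-based scheme for illustration; the path names and flow identifier are invented, and it is not how NVIDIA switches compute their forwarding.

```python
# Generic hash-based multipath load balancing (illustrative only).
import hashlib

PATHS = ["spine-1", "spine-2", "spine-3", "spine-4"]  # hypothetical paths

def pick_path(flow_id: str) -> str:
    """Map a flow onto one of the available paths deterministically,
    so all packets of the same flow take the same path."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return PATHS[digest[0] % len(PATHS)]

# The same flow always maps to the same path; many flows spread out.
assert pick_path("hostA->hostB:4791") == pick_path("hostA->hostB:4791")
```

The limitation of a static scheme like this, unlucky hashing can overload one path, is exactly what Adaptive Routing (next) addresses dynamically.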
Adaptive Routing
Adaptive Routing is enabled in the hardware of NVIDIA's switches and
managed by the Adaptive Routing Manager.
Adaptive Routing is a feature that equalizes the amount of traffic sent on each of the switch ports.
QM8700 InfiniBand Switch System
When Adaptive Routing is enabled, the switch's
Queue Manager constantly compares the
load levels of all group exit ports.
The Queue Manager continuously balances the
queues' load, redirecting flows and packets to
an alternative, less utilized port.
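The Queue Manager's decision can be reduced to one comparison: among the group's exit ports, steer the next flow to the one with the lightest queue. The sketch below is a toy model of that idea; the port names and queue depths are invented, and real switches track far richer state.

```python
# Toy queue-manager decision: pick the least-utilized exit port.
def pick_exit_port(queue_depth: dict) -> str:
    """Compare the queue load of all exit ports in the group and
    return the least utilized one."""
    return min(queue_depth, key=queue_depth.get)

ports = {"p1": 40, "p2": 7, "p3": 23}   # pending bytes per exit port
print(pick_exit_port(ports))            # "p2" is the least loaded
```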
To sum up:
Adaptive Routing may be activated on all fabric switches.
Adaptive Routing supports dynamic load balancing, avoiding
in-network congestion and optimizing network bandwidth utilization.
SHARP
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a mechanism based on
NVIDIA's switch hardware and a central management package.
SHARP offloads collective operations from the hosts' CPUs or GPUs to the network switches and
eliminates the need to send data multiple times between endpoints.
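The aggregation idea behind SHARP can be sketched as a reduction tree: each switch combines the partial results of its children, so each host's data crosses the network only once instead of being exchanged between all endpoints. This is a conceptual model only, not the SHARP protocol itself; the function name and topology are invented for illustration.

```python
# Conceptual in-network reduction tree (illustrative, not the SHARP API).
def switch_reduce(children):
    """A switch aggregates the partial sums arriving from its children."""
    return sum(children)

# Two leaf switches each aggregate two hosts; the root combines the
# two partial results into the final sum of all host contributions.
hosts = [1, 2, 3, 4]
leaf_a = switch_reduce(hosts[:2])          # aggregates hosts 1 and 2
leaf_b = switch_reduce(hosts[2:])          # aggregates hosts 3 and 4
total = switch_reduce([leaf_a, leaf_b])    # root switch
assert total == sum(hosts)
```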
SHARP decreases the amount of data traversing the network.
As a result, SHARP dramatically improves the performance of accelerated-computing,
MPI-based applications by up to 10x.
InfiniBand Topologies
By supporting a wide variety of topologies, InfiniBand answers different
customer requirements such as:
Easy network scale-out
Reduced total cost of ownership
Maximum blocking ratio
Minimum latency between fabric nodes
Maximum distance
Summary
In this session we have described the key features that make InfiniBand the
highest-performing, most effective, resilient, and best-utilized interconnect technology for
accelerated-computing applications in modern data centers.
Simplified management by the Subnet Manager
High Bandwidth
CPU Offloads and RDMA
End-to-end, best in class latency
Fabric Scalability
Quality of Service – Traffic prioritization capability
Resiliency and fast re-routing of traffic in case of a port failure
Adaptive Routing providing optimal load balancing, eliminating in-network congestion
In-network computation based on NVIDIA's SHARP mechanism
A wide variety of supported topologies