InfiniBand Key Features - Summary easy l
Networking
Academy
ODED PAZ
Sr. Technologies Instructor
INFINIBAND KEY FEATURES
Outline
• InfiniBand Key Features Overview
• InfiniBand Key Features
• Simplified Management
• High Bandwidth
• CPU Offloads
• Ultra-Low Latency
• Easy Network Scale-out
• Quality of Service
• Fabric Resiliency
• Optimal Load Balancing with Adaptive Routing
• MPI Super Performance with SHARP
• InfiniBand Topologies
• Summary
InfiniBand Key Features: Overview
InfiniBand Interconnect Technology
The NVIDIA Mellanox InfiniBand interconnect delivers high-speed, extremely low-latency, scalable solutions.
InfiniBand technology enables supercomputing, Artificial Intelligence (AI) and cloud data centers
to operate at any scale, while reducing operational costs and infrastructure complexity.
InfiniBand is the interconnect technology of choice for AI, Deep Learning, Data Science and
many other accelerated computing applications.
InfiniBand Key Features
Simplified Management
An InfiniBand network is managed by a Subnet Manager.
InfiniBand is the first architecture to truly implement the vision of Software-Defined Networking (SDN).
The Subnet Manager
The Subnet Manager (SM) is a program that runs on the fabric and manages the entire network.
The SM provides centralized routing management, which makes all the nodes in the network
plug and play.
Every InfiniBand subnet has its own master SM, and to ensure resiliency, a second SM
functions as a standby.
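The master/standby arrangement can be sketched as a small election. This is a minimal illustration, not the Subnet Manager's actual implementation. It assumes the commonly described rule that the candidate with the highest priority becomes master and that ties go to the lowest GUID. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetManager:
    name: str       # hypothetical label, for readability only
    priority: int   # assumed rule: a higher priority wins the election
    guid: int       # assumed rule: the lower GUID wins a priority tie


def elect_master(candidates):
    """Return (master, standbys) for the given SM candidates."""
    if not candidates:
        raise ValueError("at least one Subnet Manager is required")
    # Sort by descending priority, then ascending GUID.
    ranked = sorted(candidates, key=lambda sm: (-sm.priority, sm.guid))
    return ranked[0], ranked[1:]


sms = [
    SubnetManager("sm-a", priority=10, guid=0x2),
    SubnetManager("sm-b", priority=15, guid=0x9),
    SubnetManager("sm-c", priority=15, guid=0x3),
]
master, standbys = elect_master(sms)

# If the master fails, the same election runs over the remaining standbys.
new_master, _ = elect_master(standbys)
```

In this example `sm-c` becomes master, and `sm-b` takes over if `sm-c` fails.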
High Bandwidth
InfiniBand Bandwidth
The InfiniBand architecture began its journey in 2002 at a speed of 10 gigabits per second (Gb/s),
and since then it has continued to provide the highest-bandwidth, non-blocking, bidirectional links.
CPU Offloads
The InfiniBand architecture supports data transfer with minimal CPU intervention.
This is achieved through:
• A hardware-based transport protocol
• Kernel bypass (zero copy)
• Remote Direct Memory Access (RDMA): RDMA allows the memory of one node to be accessed
directly from another node without involving either node's CPU.
GPUDirect
GPUDirect allows direct data transfer from the memory of one GPU to the memory of another.
This lowers latency and improves the performance of GPU-based computation.
NVIDIA GPUs therefore also take part in offloading the compute nodes.
Low Latency
Extremely low latency is achieved by a combination of hardware offloading and acceleration
mechanisms that are unique to the InfiniBand architecture.
As a result, the end-to-end latency of RDMA sessions can be as low as 1,000 nanoseconds
(1 microsecond).
Easy Network Scale-Out
Network Scale-Out
One of InfiniBand’s main advantages is the ability to deploy up to 48,000 nodes on a single subnet.
To scale out beyond 48,000 nodes, multiple InfiniBand subnets can be interconnected using
InfiniBand routers.
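The arithmetic behind subnet sizing can be sketched briefly. InfiniBand addresses ports within a subnet with 16-bit Local Identifiers (LIDs), and the unicast range is 0x0001 to 0xBFFF. Relating that range to the 48,000-node figure quoted above is an assumption made here for illustration, as is the helper `subnets_needed`. Switches and multi-port adapters also consume LIDs, so the usable node count sits somewhat below the raw address count.

```python
UNICAST_LID_FIRST = 0x0001  # first unicast LID in a subnet
UNICAST_LID_LAST = 0xBFFF   # last unicast LID in a subnet


def unicast_lid_count(first=UNICAST_LID_FIRST, last=UNICAST_LID_LAST):
    """Number of unicast addresses available inside one subnet."""
    return last - first + 1


def subnets_needed(total_nodes, nodes_per_subnet=48_000):
    """Hypothetical helper: subnets required when each holds up to
    nodes_per_subnet nodes and routers join the subnets together."""
    return -(-total_nodes // nodes_per_subnet)  # ceiling division


print(unicast_lid_count())      # size of the unicast LID space
print(subnets_needed(100_000))  # subnets for a 100,000-node fabric
```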
Quality of Service
Quality of Service (QoS) is the ability to assign different priorities to different:
• Applications
• Users
• Data flows
Applications that require a higher priority are mapped to a different port queue, and their packets are
sent first to the next element in the network.
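The idea of per-priority port queues can be sketched as follows. This is a deliberately simplified strict-priority model, offered only as an illustration: real InfiniBand hardware maps service levels to virtual lanes and arbitrates between them with configurable weights. The class and packet names are hypothetical.

```python
from collections import deque


class PriorityPort:
    """A switch port with one FIFO queue per priority level."""

    def __init__(self, levels):
        self.queues = [deque() for _ in range(levels)]

    def enqueue(self, packet, priority):
        # A higher number means a higher priority in this sketch.
        self.queues[priority].append(packet)

    def dequeue(self):
        # Serve the highest-priority non-empty queue first.
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None  # nothing left to send


port = PriorityPort(levels=2)
port.enqueue("bulk-1", priority=0)
port.enqueue("mpi-1", priority=1)
port.enqueue("bulk-2", priority=0)
port.enqueue("mpi-2", priority=1)

sent = [port.dequeue() for _ in range(5)]
```

Both high-priority packets leave the port before any bulk packet, even though a bulk packet arrived first.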
Fabric Resiliency
One of the main things customers require is a stable network, ideally one without link failures.
When a link does fail, however, traffic must resume very quickly.
When rerouting depends solely on the Subnet Manager routing algorithm,
traffic recovery can take around five seconds.
NVIDIA Self-Healing Networking
NVIDIA Self-Healing Networking enables link fault recovery that is 5,000 times faster.
Recovery therefore takes only about one millisecond (5 s / 5,000 = 1 ms).
Self-Healing Networking is a hardware-based capability of NVIDIA switches.
Load-Balancing
Another requirement of a modern high-performance data center is
that the network be utilized as fully and efficiently as possible.
One way to achieve that is to have a load-balancing scheme.
Load-balancing is a routing strategy that allows traffic to be distributed over multiple available paths.
Adaptive Routing
Adaptive Routing is implemented in NVIDIA switch hardware and
managed by the Adaptive Routing Manager.
Adaptive Routing is a feature that equalizes the amount of traffic sent on each of the switch ports.
[Image: QM8700 InfiniBand Switch System]
When Adaptive Routing is enabled, the switch
Queue Manager constantly compares the
volume levels of all the group exit ports.
It balances the load across the queues
by redirecting flows and packets to
an alternative, less utilized port.
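The comparison performed by the Queue Manager can be sketched as a simple port selection. This is an illustrative model only, not the switch's actual logic: the function name, the port numbers, and the use of queued bytes as the "volume level" are assumptions.

```python
def pick_exit_port(queue_depths, current_port):
    """Choose an exit port from a group of equivalent ports.

    queue_depths maps each port in the group to its queued volume.
    The flow moves only if another port is strictly less loaded.
    """
    # Break ties on the port number so the choice is deterministic.
    least_loaded = min(queue_depths, key=lambda port: (queue_depths[port], port))
    if queue_depths[least_loaded] < queue_depths[current_port]:
        return least_loaded
    return current_port


depths = {1: 900, 2: 100, 3: 400}     # hypothetical queued bytes per port
moved = pick_exit_port(depths, current_port=1)
stayed = pick_exit_port(depths, current_port=2)
```

A flow on the congested port 1 is redirected to port 2, while a flow already on port 2 stays where it is. Moving only on a strict improvement avoids bouncing flows between equally loaded ports.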
To sum up:
Adaptive routing may be activated on all fabric switches.
Adaptive Routing supports dynamic load balancing, avoiding
in-network congestion and optimizing network bandwidth utilization.
SHARP
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a mechanism based on
NVIDIA switch hardware and a central management package.
SHARP offloads collective operations from the hosts' CPUs or GPUs to the network switches and
eliminates the need to send data multiple times between endpoints.
SHARP decreases the amount of data traversing the network.
As a result, SHARP dramatically improves the performance of MPI-based accelerated computing
applications, by up to 10x.
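The effect of aggregating inside the network can be sketched with a toy reduction tree. This is a conceptual model only and does not use the SHARP API: hosts are plain lists of numbers, switches are dictionaries with a `children` list, every switch is assumed to have at least one child, and the only operation shown is a sum.

```python
def reduce_in_network(node):
    """Sum host vectors on the way up a switch tree.

    Returns (reduced_vector, uplink_transfers). Each link toward the
    root carries exactly one vector, however many hosts sit below it.
    """
    if isinstance(node, list):          # a host contributes its own vector
        return list(node), 0
    total, transfers = None, 0
    for child in node["children"]:      # a switch aggregates its children
        vector, below = reduce_in_network(child)
        transfers += below + 1          # one vector crosses the child's uplink
        if total is None:
            total = vector
        else:
            total = [a + b for a, b in zip(total, vector)]
    return total, transfers


tree = {"children": [
    {"children": [[1, 2], [3, 4]]},     # leaf switch with two hosts
    {"children": [[5, 6], [7, 8]]},     # leaf switch with two hosts
]}
result, transfers = reduce_in_network(tree)
```

With four hosts behind two leaf switches, the root obtains the full sum after six single-vector transfers (four host uplinks and two switch uplinks). No host has to receive and re-send another host's data.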
InfiniBand Topologies
Fat Tree
Torus
Dragonfly+
Hypercube
HyperX
By supporting a wide variety of topologies, InfiniBand meets different
customer requirements, such as:
• Easy network scale-out
• Reduced total cost of ownership
• Maximum blocking ratio
• Minimum latency between fabric nodes
• Maximum distance
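The trade-off between blocking ratio and scale can be sketched for one of the listed topologies. The example assumes a two-level fat tree in which every leaf switch has a single link to every spine switch, and the radix of 40 ports is used only as an example value. The function names are hypothetical.

```python
from fractions import Fraction


def blocking_ratio(downlinks, uplinks):
    """Ratio of host-facing ports to spine-facing ports on a leaf switch."""
    return Fraction(downlinks, uplinks)


def max_hosts_two_level(radix, downlinks):
    """Largest host count for a two-level fat tree under the assumptions above.

    A spine switch has `radix` ports, so it can reach at most `radix`
    leaf switches, and each leaf serves `downlinks` hosts.
    """
    uplinks = radix - downlinks
    if not 0 < uplinks < radix:
        raise ValueError("a leaf needs both downlinks and uplinks")
    return radix * downlinks


non_blocking = (blocking_ratio(20, 20), max_hosts_two_level(40, 20))
blocking_3_to_1 = (blocking_ratio(30, 10), max_hosts_two_level(40, 30))
```

Accepting a 3:1 blocking ratio raises the host count from 800 to 1,200 with the same switch radix, at the cost of less uplink bandwidth per host.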
Summary
In this session we have described the key features that make InfiniBand the
highest-performing, most effective, most resilient and best-utilized interconnect technology for
accelerated computing applications in modern data centers:
• Simplified management by the Subnet Manager
• High bandwidth
• CPU offloads and RDMA
• End-to-end, best-in-class latency
• Fabric scalability
• Quality of Service: traffic prioritization capability
• Resiliency and fast rerouting of traffic in case of a port failure
• Adaptive Routing, providing optimal load balancing and eliminating in-network congestion
• In-network computing based on NVIDIA's SHARP mechanism
• A wide variety of supported topologies