NSX Reference Design Guide 4-2 - v1.0
VMWARE NSX ®
REFERENCE DESIGN
GUIDE
Software Version 4.2
VMware NSX Reference Design Guide
1 Introduction 10
How to Use This Document and Provide Feedback 10
Networking and Security Today 10
NSX Architecture Value and Scope 11
Containers and Cloud Native Application Integrations with NSX 14
Role of NSX in the VMware multi-cloud platform 15
2 NSX Architecture Components 18
Management Plane and Control Plane 18
Management Plane 18
Control Plane 19
NSX Manager Appliance 19
Data Plane 20
NSX Consumption Model 21
NSX Policy API Framework 21
Policy API Usage Example 1- Templatize and Deploy 3-Tier Application Topology 22
Policy API Usage Example 2- Application Security Policy Lifecycle Management 23
When to use Policy vs Manager UI/API 23
NSX Logical Object Naming relationship between manager and policy mode 25
NSX Multi-tenancy Framework 25
3 NSX Logical Switching 38
The NSX Virtual Switch 38
Segments and Transport Zones 38
Uplink vs. pNIC 40
Teaming Policy 41
Uplink Profile 44
Creating a transport node 45
Transport Node Profiles 47
Network I/O Control v3 47
Virtual Switch Advanced Performance Settings 48
Logical Switching 49
Overlay Backed Segments 49
Flooded Traffic 50
Unicast Traffic 53
Data Plane Learning 54
Tables Maintained by the NSX Controller 56
Overlay Encapsulation 58
Tunnel End Point High Availability 59
Multi-TEP High Availability 62
TEP Groups 67
Northbound traffic 70
Bridging Overlay to VLAN with the Edge Bridge 76
Overview of the Capabilities 78
4 NSX Logical Routing 84
Single Tier Routing 85
Distributed Router (DR) 85
Services Router 89
Two-Tier Routing 94
Interface Types on Tier-1 and Tier-0 Gateway 95
Route Types on Tier-0 and Tier-1 Gateways 96
Fully Distributed Two Tier Routing 98
Routing Capabilities 101
Static routing 101
Border Gateway Protocol (BGP) 102
Open Shortest Path First version 2 (OSPF v2) 104
Standard 431
Enhanced Datapath Standard 431
Overview 431
Results 434
Standard Datapath 437
pNIC Queues 437
Key Performance Factors 440
Bare Metal Edge Node Performance Factors 443
Bare Metal Edge HW Recommendations 443
Benchmarking Tools 445
Compute 445
Edges 446
Conclusion 446
9 DPU-based Acceleration for NSX 448
DPU Architecture 448
Server requirements to enable offload to DPU 449
Offloading NSX to the DPU 449
Evolution of the NSX Datapath 449
Standard Datapath 450
High-performance Datapath Models 450
VM-modes available with the introduction of DPUs 453
10 NSX Design in VMware Cloud Foundation (VCF) 456
Introduction 456
Assumptions 456
Out-of-scope 456
VMware Cloud Foundation Architecture 457
Single Site Deployment using the VCF Consolidated Architecture 457
Single Site Deployment using the VCF Standard Architecture 458
Multi VCF Instance Deployment using NSX Federation 461
Multi Availability Zone (AZ) Deployment using stretched vSAN Clusters 463
Intended Audience
This document is targeted toward virtualization and network architects interested in deploying
VMware NSX® network virtualization solutions in a variety of on-premises environments.
Revision History
1 Introduction
This document provides guidance and best practices for designing environments that leverage
the capabilities of VMware NSX®. It is targeted at virtualization and network architects
interested in deploying NSX solutions.
bottom line. This exerts increasing pressure on organizations to innovate quickly and makes
developers central to this critical mission. As a result, the way developers create apps, and the
way IT provides services for those apps, are evolving.
Application Proliferation
With applications quickly emerging as the new business model, developers are under immense
pressure to deliver apps in record time. This increasing need to deliver more apps in less
time can drive developers to use public clouds or open source technologies. These solutions
allow them to write and provision apps in a fraction of the time required with traditional
methods.
Heterogeneity
Application proliferation has given rise to heterogeneous environments, with application
workloads being run inside VMs, containers, clouds, and bare metal servers. IT departments
must maintain governance, security, and visibility for application workloads regardless of
whether they reside on premises, in public clouds, or in clouds managed by third parties.
Cloud-centric Architectures
Cloud-centric architectures and approaches to building and managing applications are
increasingly common because of their efficient development environments and fast delivery of
applications. These cloud architectures can put pressure on networking and security
infrastructure to integrate with private and public clouds. Logical networking and security must
be highly extensible to adapt and keep pace with ongoing change.
Against this backdrop of increasing application needs, greater heterogeneity, and the
complexity of environments, IT must still protect applications and data while addressing the
reality of an attack surface that is continuously expanding.
The NSX architecture is designed around four fundamental attributes. FIGURE 1-1: NSX
ANYWHERE Architecture depicts the universality of those attributes, which span any site, any
cloud, and any endpoint device. This enables greater decoupling, not just at the
infrastructure level (e.g., hardware, hypervisor), but also at the public cloud and container level,
all while maintaining the four key attributes of the platform across the domains. The value and
characteristics of the NSX architecture include:
• Policy and Consistency: Allows a policy to be defined once and realized as an end state via the
RESTful API, addressing requirements of today’s automated environments. NSX maintains unique
and multiple inventories and controls to enumerate desired outcomes across diverse
domains.
• Networking and Connectivity: Allows consistent logical switching and distributed routing
without being tied to a single compute manager/domain (e.g. vCenter server). The
connectivity is further extended across containers and clouds via domain specific
implementation while still providing connectivity across heterogeneous endpoints.
• Security and Services: Allows a unified security policy model as with networking
connectivity. This enables implementation of services such as load balancer, Edge
(Gateway) Firewall, Distributed Firewall, and Network Address Translation across multiple
compute domains. Providing consistent security between VMs and container workloads
in private and public clouds is essential to assuring the integrity of the overall framework
set forth by security operations.
• Visibility: Allows consistent monitoring, metric collection, and flow tracing via a common
toolset across compute domains and clouds. Visibility is essential for operationalizing
mixed workloads (VM and container-centric), which typically have drastically different
tools for completing similar tasks.
These attributes enable the heterogeneity, app-alignment, and extensibility required to support
diverse requirements. Additionally, NSX supports DPDK libraries that offer line-rate stateful
services.
Heterogeneity
In order to meet the needs of heterogeneous environments, a fundamental requirement of NSX
is to be compute-manager agnostic. As this approach mandates support for multi-workloads, a
single NSX manager’s manageability domain can span multiple vCenters and multiple workload
types. When designing the management plane, control plane, and data plane components of
NSX, special considerations were taken to enable flexibility, scalability, and performance.
The management plane was designed to be independent of any compute manager, including
vSphere. The VMware NSX® Manager™ is fully independent; the NSX-based
network functions are accessed directly, either programmatically or through the GUI.
The control plane architecture is separated into two components – a centralized cluster and an
endpoint-specific local component. This separation allows the control plane to scale as the
localized implementation – both data plane implementation and security enforcement – is
more efficient and allows for heterogeneous environments.
The data plane was designed to be normalized across various environments. NSX introduces a
host switch that normalizes connectivity among various compute domains, including multiple
VMware vCenter® instances, containers, bare metal servers, and other off premises or cloud
implementations. This switch is referred to as the N-VDS. The functionality of the N-VDS was
fully implemented in the ESXi VDS 7.0 and later, which allows ESXi customers to take advantage
of full NSX functionality without having to change the virtual switch. Regardless of
implementation, data plane connectivity is normalized across all platforms, allowing for a
consistent experience.
App-aligned
NSX was built with the application as the key construct. Regardless of whether the app was
built in a traditional monolithic model or developed in a newer microservices application
framework, NSX treats networking and security consistently. This consistency extends across
containers and multi-hypervisors on premises, then further into the public cloud.
defined infrastructure, Platform-as-a-Service (PaaS) and management stack that can be layered
on top of any physical hardware layer on any cloud or data center. The software stack is based
on VMware Cloud Foundation (VCF) and includes vSphere, vSAN, and NSX as its core
components. It provides a unified approach to building, running and managing traditional and
modern apps on any cloud. This unique architectural approach provides a single platform that
can function across all application types and multiple cloud environments. NSX is a key strategic
asset for the VMware multi-cloud platform. Limited and fragmented public cloud native
network and security services are augmented by rich and uniform enterprise-grade capabilities
across any cloud.
The cloud operating model cuts across the traditional silos – networking, network security, load
balancing and endpoint protection solutions are designed, deployed and managed by an
increasingly integrated team or a set of integrated processes in the form of automation and
pipelines. In this world, networking and security become software that is defined in advance and is
integrated into the customer CI/CD pipeline. It is accessed programmatically by high-level,
declarative or intent-based APIs, and it is oriented to serving the needs of the application. The
NSX APIs allow customers to embrace the cloud operating model, where an entire workload,
including all network and security services, is launched with no human touch, while not
sacrificing key enterprise functionalities.
Virtualized networking separates the logical connectivity policies from the physical transport
layer. In a multi-cloud world, the transport layer is increasingly less available because it is
embedded into another cloud or running on the public Internet. Thus, virtual networking is
essential for making the VMware stack run on any provider’s hardware, for seamlessly
connecting the heterogeneous clouds of a modern enterprise, and for presenting a uniform
consumption model across different providers.
Organizations are looking for a multi-cloud platform that delivers best in class workload
security. Thanks to NSX, VMware clouds can transparently insert services to protect, manage,
and operationalize applications at scale, with an intimate understanding of the end user
and the application context. This allows for unique features such as “Virtual patching” via the
NSX distributed IPS that protects individual vulnerable workloads before the application of a
security patch.
While all the VMware clouds run the same code base, some go a step further in terms of
simplification and application alignment. The NSX that is presented to VMware Cloud
customers is much simpler than the on-premises version because the underlying topology is
controlled by VMware and all that is required is to specify application-level policies. Since the
same software stack is deployed on any VMware cloud, operations such as the vMotion of a
workload and its attached firewall/security policy between private and public clouds are
available.
Management Plane
The management plane provides an entry point to the system for the API as well as the NSX graphical user
interface. It is responsible for maintaining user configuration, handling user queries, and
performing operational tasks on all management, control, and data plane nodes.
The NSX Manager implements the management plane for the NSX ecosystem. It provides an
aggregated system view and is the centralized network management component of NSX. NSX
Manager provides the following functionality:
• Serves as a unique entry point for user configuration via RESTful API (CMP, automation)
or NSX user interface.
• Responsible for storing desired configuration in its database. The NSX Manager stores
the final configuration requested by the user for the system. This configuration will be
pushed by the NSX Manager to the control plane to become a realized configuration (i.e.,
a configuration effective in the data plane).
• Retrieves the desired configuration in addition to system information (e.g., statistics).
Control Plane
The control plane computes the runtime state of the system based on configuration from the
management plane. It is also responsible for disseminating topology information reported by
the data plane elements and pushing stateless configuration to forwarding engines.
NSX splits the control plane into two parts:
• Central Control Plane (CCP) – The CCP is implemented as a cluster of virtual machines
called CCP nodes. The cluster form factor provides both redundancy and scalability of
resources. The CCP is logically separated from all data plane traffic, meaning any failure
in the control plane does not affect existing data plane operations. User traffic does not
pass through the CCP Cluster.
• Local Control Plane (LCP) – The LCP runs on transport nodes. It is adjacent to the data
plane it controls and is connected to the CCP. The LCP is responsible for programming the
forwarding entries and firewall rules of the data plane.
(CPU, memory and disk). With the converged manager appliance, one only needs to consider the
appliance sizing once.
Each appliance has a dedicated IP address and its manager process can be accessed directly or
through a load balancer. Optionally, the three appliances can be configured to maintain a
virtual IP address which will be serviced by one appliance selected among the three. The design
consideration of NSX Manager appliance is further discussed in CHAPTER 7.
Data Plane
The data plane performs stateless forwarding or transformation of packets based on tables
populated by the control plane. It reports topology information to the control plane and
maintains packet level statistics. The hosts running the local control plane daemons and
forwarding engines implementing the NSX data plane are called transport nodes. Transport
nodes are running an instance of the NSX virtual switch called the NSX Virtual Distributed
Switch, or N-VDS.
On ESXi platforms, the N-VDS is built on the top of the vSphere Distributed Switch (VDS). In fact,
the N-VDS is so close to the VDS that NSX 3.0 introduced the capability of installing NSX directly
on the top of a VDS on ESXi hosts. For all other kinds of transport node, the N-VDS is based on
the platform independent Open vSwitch (OVS) and serves as the foundation for the
implementation of NSX in other environments (e.g., cloud, containers, etc.).
As represented in FIGURE 2-1: NSX ARCHITECTURE AND Components, there are two main types of
transport nodes in NSX:
• Hypervisor Transport Nodes: Hypervisor transport nodes are hypervisors prepared and
configured for NSX. NSX provides network services to the virtual machines running on
those hypervisors. NSX 4.1 supports VMware ESXi™ only.
• Edge Nodes: VMware NSX Edge™ nodes are service appliances dedicated to running
centralized network services that cannot be distributed to the hypervisors. They can be
instantiated as a bare metal appliance or in virtual machine form factor. They are
grouped in one or several clusters, representing a pool of capacity. It is important to
remember that an Edge Node does not represent a service itself but just a pool of
capacity that one or more services can consume.
The NSX API documentation is accessible directly from the NSX Manager UI, under the Policy
section within the API documentation, or from code.vmware.com.
The following examples walk through the Policy API for two customer
scenarios:
The desired outcome for deploying the application, as shown in the figure above, can be
defined using JSON. Once the JSON request body is defined to reflect the desired outcome, the
API and JSON request body can be leveraged to automate the following operational workflows:
• Deploy the entire topology with a single API call and JSON request body.
• Templatize and reuse the same API/JSON to deploy the same
application in different environments (PROD, TEST and DEV).
• Handle life cycle management of the entire application topology by toggling the
"marked_for_delete" flag in the JSON body to true or false.
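The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the guide's own example: it only builds the JSON body and sends nothing. The object IDs, gateway addresses, and the helper names (`child`, `three_tier_topology`, `mark_for_delete`) are hypothetical, and the `Child<Type>` wrapper layout is an assumption about the hierarchical Policy API that should be checked against the NSX API documentation before use.

```python
import copy
import json


def child(resource_type, obj_id, marked_for_delete=False, **attrs):
    """Wrap one object in a Child<Type> envelope (assumed hierarchical body layout)."""
    obj = {"resource_type": resource_type, "id": obj_id, **attrs}
    return {
        "resource_type": f"Child{resource_type}",
        "marked_for_delete": marked_for_delete,
        resource_type: obj,
    }


def three_tier_topology(env):
    """Desired state for one environment: a Tier-1 gateway plus web/app/db segments."""
    tier1_id = f"{env}-t1"
    tier1_path = f"/infra/tier-1s/{tier1_id}"
    tiers = (("web", "10.0.1.1/24"), ("app", "10.0.2.1/24"), ("db", "10.0.3.1/24"))
    segments = [
        child(
            "Segment",
            f"{env}-{tier}",
            connectivity_path=tier1_path,           # attach the segment to the Tier-1
            subnets=[{"gateway_address": gateway}],  # illustrative addressing only
        )
        for tier, gateway in tiers
    ]
    return {"resource_type": "Infra", "children": [child("Tier1", tier1_id), *segments]}


def mark_for_delete(body, delete=True):
    """Return a copy of the body with every top-level child flagged for deletion."""
    out = copy.deepcopy(body)
    for entry in out["children"]:
        entry["marked_for_delete"] = delete
    return out


if __name__ == "__main__":
    # The same template is reused per environment; only the prefix changes.
    for env in ("prod", "test", "dev"):
        body = three_tier_topology(env)
        print(env, len(json.dumps(body)), "bytes of desired state")
    # Tearing the DEV copy down is the same body with the flag set to true.
    print(json.dumps(mark_for_delete(three_tier_topology("dev")), indent=2)[:200])
```

Because the whole topology is one document, the template, the per-environment copies, and the teardown all differ only in the prefix and in the value of `marked_for_delete`.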
The policy model can define the desired outcome by specifying grouping and micro-segmentation
policies using JSON. It uses a single API call with a JSON request body to automate the
following operational workflows:
• Deploy a white-list security policy with a single API call and JSON request body.
• Templatize and reuse the same API/JSON to secure the same
application in different environments (PROD, TEST and DEV).
• Handle life cycle management of the entire application topology by toggling the
"marked_for_delete" flag in the JSON body to true or false.
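As a rough sketch of the security workflow above, the Python below builds a single JSON body holding two tag-based groups and one allow-list policy. It is illustrative only and makes no API call. The group names, tags, rule IDs, and service paths are hypothetical, and the field names reflect an assumed hierarchical Policy API layout that should be verified against the NSX API documentation.

```python
import json

DOMAIN_PATH = "/infra/domains/default"


def group_path(group_id):
    """Policy path of a group; the literal ANY is passed through unchanged."""
    return "ANY" if group_id == "ANY" else f"{DOMAIN_PATH}/groups/{group_id}"


def tag_group(group_id, tag):
    """A group whose members are the VMs carrying the given tag."""
    condition = {
        "resource_type": "Condition",
        "member_type": "VirtualMachine",
        "key": "Tag",
        "operator": "EQUALS",
        "value": tag,
    }
    group = {"resource_type": "Group", "id": group_id, "expression": [condition]}
    return {"resource_type": "ChildGroup", "marked_for_delete": False, "Group": group}


def rule(rule_id, seq, src, dst, services, action="ALLOW"):
    """One distributed firewall rule inside the policy."""
    return {
        "resource_type": "Rule",
        "id": rule_id,
        "sequence_number": seq,
        "source_groups": [group_path(src)],
        "destination_groups": [group_path(dst)],
        "services": services,
        "action": action,
    }


def app_security_body(app, delete=False):
    """Groups plus an allow-list policy for one application, as a single body."""
    web, db = f"{app}-web", f"{app}-db"
    policy = {
        "resource_type": "SecurityPolicy",
        "id": f"{app}-policy",
        "category": "Application",
        # Scope the policy to the app's own groups so the final drop stays local.
        "scope": [group_path(web), group_path(db)],
        "rules": [
            rule("any-to-web", 10, "ANY", web, ["/infra/services/HTTPS"]),
            rule("web-to-db", 20, web, db, ["/infra/services/MySQL"]),
            rule("drop-the-rest", 30, "ANY", "ANY", ["ANY"], action="DROP"),
        ],
    }
    domain = {
        "resource_type": "Domain",
        "id": "default",
        "children": [
            tag_group(web, "web"),
            tag_group(db, "db"),
            {
                "resource_type": "ChildSecurityPolicy",
                "marked_for_delete": delete,
                "SecurityPolicy": policy,
            },
        ],
    }
    wrapper = {"resource_type": "ChildDomain", "marked_for_delete": False, "Domain": domain}
    return {"resource_type": "Infra", "children": [wrapper]}


if __name__ == "__main__":
    print(json.dumps(app_security_body("shop"), indent=2))
```

Calling `app_security_body("shop", delete=True)` yields the same document with the policy flagged for removal, which is the life cycle toggle the last bullet describes.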
Policy mode:
• All new deployments should use Policy mode.
• New deployments which integrate with cloud automation platforms, automation tools, or
the NSX Container Plug-in (NCP). All the northbound integrations between VMware
products and NSX now support Policy mode.
Manager mode:
• Deployments which were created using the advanced interface, for example, upgrades
from versions where Policy mode was not available.
• Deployments which in previous versions were integrated with cloud automation
platforms, NCP, or custom orchestration tools via the Manager API. Customers should
plan the transition to Policy mode. Consult the specific guidance for NCP and VMware
Integrated OpenStack (VIO). Migration of Manager API objects managed by Aria
Automation is not yet supported.
It is recommended that whichever mode is used to create objects (Policy or Manager) be the
only mode used (if the Manager Mode objects are required, create all objects in Manager
mode). Do not alternate use of the modes or there will be unpredictable results. Note that the
default mode for the NSX Manager is Policy mode. When working in an installation where all
objects are new and created in Policy mode, the Manager mode option will not be visible in the
UI. For details on switching between modes, please see the NSX documentation.
NSX Logical Object Naming relationship between manager and policy mode
The name of some of the networking and security logical objects in the Manager API/Data
model have changed in the new policy model. The table below provides the before and after
naming side by side for those NSX Logical objects. This change only affects the name for the
given NSX object, but conceptually and functionally it is the same as before.
NSGroup Group
interfering with the configurations created by other users or having visibility into it. This section
provides an overview of new multi-tenancy features in NSX.
Before discussing the new multi-tenancy features that Projects introduces, let's go over how
multi-tenancy has traditionally been available at the data-plane layer in NSX versions prior to
4.1.
In these models, NSX allows users to apply the desired data-plane segmentation. However, prior
to the release of NSX 4.1, tenants were not explicitly defined in NSX. The multi-tenancy logic
was implemented by either the NSX Administrator or the Cloud Management Platform.
What if a security team wanted to delegate management of firewall rules, requiring role-based
access to NSX? What if the same users on that team wanted to see only the alarms relevant to
their environment? Or if they wanted to collect only their assigned firewall logs within their
tenant? These are just some theoretical scenarios that highlight the challenges teams face;
from a management and monitoring perspective, there exists a clear need for multi-tenant
constructs in NSX.
Projects exist alongside the traditional data model, are optional to use, and do not break
compatibility with existing setups in any way. The Enterprise Admin can still access all features
outside of the Project (from system setup to firewall rules) but can use Projects to define
tenants for logical consumption, if desired.
In the Default space, Enterprise Admins can have a consolidated view of all Projects or switch to
view a specific Project. They can also create multiple tenants with different Projects (Project 1,
Project 2, etc.).
To do so, they must allocate:
• At least one Tier-0 or Tier-0 VRF (Multiple supported and can be shared across
projects)
• At least one Edge Cluster (Multiple supported and can be shared across projects)
• The User(s) allocated to the Project and their role within the scope of the project (i.e.,
security admin, network admin, or full project admin)
• A short log ID to be added as a label to the logs pertaining to the Project
It is important to note that Tier-0/Tier-0 VRF and Edge Clusters can be shared across Projects if
desired by the Enterprise Admin.
Once assigned, Project Users for Project 1 can directly access NSX within the scope defined for
them. They can create Tier-1 gateways on the Edge clusters allocated to the project (DR only
Tier-1 gateways are also supported), and connect them to the Tier-0 or Tier-0 VRF allocated to
the project.
The Enterprise Admin can also assign configuration of the Project to simplify consumption or
limit the number of objects created through Quotas.
The Enterprise Admin can create firewall rules system-wide that will apply to all VMs across all
environments. Those rules are configured from the Default space and cannot be modified
within the tenant view.
Figure 2-8: Two projects (Blue and Green) deployed alongside a topology created by the EA in the default space (Purple)
Upon logging in, users will land directly in their assigned Project and see only configurations,
alarms, VMs and so on that are relevant to their Project.
Configuration will be restricted to logical objects. Tenants cannot manage the platform setup
(installation, upgrades, etc.) because these features are kept under Enterprise Admin
management. Other features that remain under Enterprise Admin management and are not
exposed to the tenant include Tier-0 configuration and Exclusion lists.
For exposed features, the consumption under Projects is the same as it would be outside the
Project. Creation of Tier-1s, segments, and other configuration follows the same model, and
they use the allocated Tier-0(s)/Tier-0 VRF(s) and Edge Cluster(s). Information on allocated
resources (Quota) is available in the Project tab.
A Project admin can create one or more Tier-1 gateways inside the project and connect them to the
Tier-0 gateway the Enterprise Admin has assigned to the project. The EA can assign one or more
Tier-0 gateways or VRFs (VRFs are the same as Tier-0 gateways in the context of the multi-tenancy
framework) to the project. The project admin can create isolated segments or segments
connected to one of the project Tier-1 gateways. Connecting project segments to Tier-0
gateways directly is not supported. Workloads connected to project segments are considered
part of the project and will be visible in the virtual machine inventory for the project.
One of the primary goals of the Project feature is to be able to delegate security policy
management and avoid the risk of rules being applied to the wrong VMs.
When a Project is created, a group representing the Project is also created, alongside some
default rules allowing for communication inside the Project but blocking anything else. This
default group includes all the segments defined as part of the project and, consequently, all
the VMs connected to them. NSX will implicitly set this group in the apply-to field of any
distributed firewall rule, preventing project users from impacting workloads that are not under their
purview.
The Project Admin can manage their own rules by changing the default project rules, creating
new rules and so on. These rules will only apply to VMs connected to the segments for their
Project. All the other VMs (not connected to Project segments) won’t be visible from the
Project and won’t be impacted by rules configured within the Project. It is now possible to give
access to NSX Distributed Firewall while removing the risk that a user could create a rule
impacting the entire system.
Rules defined by the Enterprise Admin in Default space can apply to those VMs within a Project
and will take precedence. This allows the Enterprise Admin to set up the environment to create
global rules that apply to all workloads, or to specific Projects. These global rules cannot be
modified by Project users.
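The scoping and precedence behavior described above can be pictured with a small conceptual model in Python. It is not how NSX implements the distributed firewall; it only assumes first-match evaluation, with Default space rules evaluated ahead of Project rules and every Project rule implicitly applied to the Project group. All names (`Rule`, `scope_to_project`, `evaluate`, the VM names) are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    action: str                                  # "ALLOW" or "DROP"
    dst_vm: Optional[str] = None                 # None matches any destination VM
    applied_to: Optional[FrozenSet[str]] = None  # None means every VM in the system


def scope_to_project(rule, project_vms):
    """Model the implicit apply-to: a Project rule only reaches the Project's own VMs."""
    return Rule(rule.name, rule.action, rule.dst_vm, frozenset(project_vms))


def evaluate(dst_vm, default_space_rules, project_rules):
    """Return the first matching rule; Default space rules are checked first."""
    for rule in [*default_space_rules, *project_rules]:
        in_scope = rule.applied_to is None or dst_vm in rule.applied_to
        matches = rule.dst_vm is None or rule.dst_vm == dst_vm
        if in_scope and matches:
            return rule
    return None


if __name__ == "__main__":
    blue_vms = {"blue-web", "blue-db"}
    # A Project admin writes a very broad rule; it is still confined to the Project.
    blue_rules = [scope_to_project(Rule("blue-drop-all", "DROP"), blue_vms)]
    # The Enterprise Admin writes a global rule in the Default space.
    ea_rules = [Rule("ea-allow-blue-web", "ALLOW", dst_vm="blue-web")]

    for vm in ("blue-web", "blue-db", "green-db"):
        hit = evaluate(vm, ea_rules, blue_rules)
        print(vm, "->", hit.name if hit else "no rule from these sets applies")
```

In this model the broad `blue-drop-all` rule never reaches `green-db`, and the Enterprise Admin rule wins for `blue-web` because it is evaluated first.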
Logs from Distributed Firewall and Gateway Firewall will be labeled with the Project
information so that they can be identified and separated on a per tenant basis.
The consumption of NSX services within a project is fundamentally identical to what a user
experiences when working with NSX configurations outside of a project. This is reflected in the
Policy API structure depicted in FIGURE 2-10. The /infra tree is still present and represents all the
configurations performed in the default space. Additionally, a new /orgs/default/ hierarchy is
now present. All the projects reside under this new /orgs/default element and have their own
/infra tree mirroring the structure of the main /infra tree for the default space (i.e.,
/orgs/default/projects/Project-1).
For more information about projects and a detailed list of the supported features, please refer to
the official documentation.
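To make the mirrored tree structure concrete, the short Python sketch below derives the Policy path of an object in the Default space and of the same object inside a Project. The helper names and the object `tier-1s/web-t1` are hypothetical; only the `/infra` and `/orgs/default/projects/<project>` prefixes come from the text above.

```python
from typing import Optional


def infra_root(project_id: Optional[str] = None, org_id: str = "default") -> str:
    """Root of the /infra tree: the Default space, or a Project's own mirrored tree."""
    if project_id is None:
        return "/infra"
    return f"/orgs/{org_id}/projects/{project_id}/infra"


def policy_path(relative: str, project_id: Optional[str] = None) -> str:
    """Full Policy path of an object, given its path relative to the /infra tree."""
    return f"{infra_root(project_id)}/{relative.strip('/')}"


if __name__ == "__main__":
    print(policy_path("tier-1s/web-t1"))
    print(policy_path("tier-1s/web-t1", project_id="Project-1"))
```

The relative part of the path is identical in both cases, which is why automation written for the Default space can usually be pointed at a Project by changing only the root.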
2.3.6.4 VPCs
VPCs were introduced in NSX 4.1.1. They represent an additional layer of multi-tenancy within a
project and provide a simplified consumption model of networking and security services aligned
to the experience an end-user would have in a public cloud environment. As with projects, VPCs
do not alter the traditional NSX consumption model and can be deployed in parallel to the
classic NSX Policy constructs, sharing the same physical infrastructure.
From the API structure in FIGURE 2-12: NSX Hierarchical Policy API structure with Projects and
VPCs, we can see that VPCs are logically part of projects and provide a more streamlined access
to common network and security features such as subnets, security policies, and NAT.
The project admin can assign existing users (local or available via an Identity Provider) the
VPC admin role or a more specific role such as network admin or security admin, and scope them
to one or more VPCs that are part of the project. The user management and lifecycle
(creating/deleting/activating local users and integrating the NSX platform with external identity
providers) is an Enterprise Admin responsibility.
The project admin can share groups with the VPC so that the VPC users can consume them in
their policies. The project admin can set quotas to limit the consumption of specific configuration
objects such as firewall rules, subnets, etc.
Each VPC is associated with a Tier-1 gateway deployed in the background and connected to one
of the Tier-0 gateways or VRFs assigned to the project the VPC is part of. A VPC admin will be
able to perform the following tasks in a self-service model without impacting other VPCs or the
shared infrastructure:
• Create a new public subnet (Workloads on this subnet will be directly accessible on their
own IP from outside the VPC). Public IPs are allocated from an external IP block assigned
by the project admin to the VPC from a set of external IP blocks assigned by the
Enterprise Admin to the project.
• Create a private subnet (Workloads on this subnet will require a NAT configuration to
communicate with any workload outside the VPC. A pre-configured SNAT on a single IP
for all the private networks in the VPC allows outbound connectivity by default). The
SNAT IP is allocated from one of the external IP blocks assigned to the VPC. Private
subnets’ IP ranges are allocated from the private IP blocks assigned to the VPC by the
project admin and created by the project admin within the scope of a project.
• Create an isolated subnet (Workloads on this subnet will not have any communication
with any other subnet, inside or outside the VPC). The IP range of isolated subnets is
arbitrary and defined by the VPC admin at subnet creation time. The range is not
allocated from any IP block.
• Create NAT configurations to allow access to workloads on private subnets.
• Configure security policies and groups for both East-West and North-South traffic.
Security configuration created in a VPC will only affect workloads connected to VPC
subnets.
Figure 2-12: NSX Hierarchical Policy API structure with Projects and VPCs
FIGURE 2-13 presents the VPC networking model with the 3 types of supported networks: public,
private, and isolated. One of the key benefits of the VPC consumption model is the out-of-the-
box IPAM (IP address management) capabilities. The Project Admin allocates IP blocks to each
VPC, and when the VPC admin creates a new subnet, any required IP range is automatically
assigned from the available blocks.
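The automatic assignment described above can be approximated with Python's standard `ipaddress` module. This is a simplified first-fit sketch, not the NSX IPAM algorithm: the function name `allocate_subnet`, the block `10.10.0.0/24`, and the first-fit strategy are all assumptions made for illustration.

```python
import ipaddress


def allocate_subnet(blocks, allocated, prefix_len):
    """Return the first free range of the requested size from the VPC's IP blocks.

    blocks:     CIDR strings assigned to the VPC (hypothetical values)
    allocated:  list of ip_network objects already in use; updated in place
    prefix_len: prefix length of the subnet being created
    """
    for block in map(ipaddress.ip_network, blocks):
        if prefix_len < block.prefixlen:
            continue  # the requested subnet is larger than this block
        for candidate in block.subnets(new_prefix=prefix_len):
            if not any(candidate.overlaps(used) for used in allocated):
                allocated.append(candidate)
                return candidate
    raise ValueError("no free range of the requested size in the assigned blocks")


if __name__ == "__main__":
    private_blocks = ["10.10.0.0/24"]
    in_use = []
    print(allocate_subnet(private_blocks, in_use, 26))  # 10.10.0.0/26
    print(allocate_subnet(private_blocks, in_use, 26))  # 10.10.0.64/26
    print(allocate_subnet(private_blocks, in_use, 25))  # 10.10.0.128/25
```

The VPC admin only states the size of the subnet; the range itself falls out of what is still free in the blocks the Project Admin assigned, and a request fails once the blocks are exhausted.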
By default, VPC workload IP allocation is managed by DHCP, but it is possible to reserve part of
the subnet range for static IP allocations.
The VPC admin can also allocate external IPs from the available public block to be used in NAT
rules to grant external clients access to workloads on VPC private networks.
The Tier-1 gateways associated with VPCs are provisioned in parallel to other Tier-1 gateways
part of the same project and can be connected to the same Tier-0 gateway or VRF.
FIGURE 2-15 presents the VPC security model. VPC users can configure North-South and East-West
security policies based on L4/L7 attributes as well as define groups for the VPC
workloads.
North-south rules are mapped by the system to gateway firewall rules applied to the uplink
interface of the VPC Gateway. They can inspect any traffic entering or exiting the VPC, but they
cannot enforce any action on traffic between workloads in the VPC. By default, any outbound
connectivity is allowed but any inbound traffic is dropped.
East-West rules are mapped by the system to distributed firewall rules with the apply-to field
set to a group including the VPC workloads exclusively. This choice prevents conflicts between
configuration performed on different VPCs. East-West rules can inspect any traffic originating
from or directed to VPC workloads, including traffic to other VPC workloads, but also N-S traffic
to and from external clients. For this reason, North-South connectivity must be allowed in both
the N-S and the E-W rules. E-W rules allow any traffic by default.
You’ll also notice an “opaque network” with the same name. This is the representation of the
same segment that might appear on your hosts that are running NSX with an NVDS. The use of
NVDS on ESXi hosts has been deprecated as of NSX 4.0. Check the following KB article for more
information on the impact of this difference in representation:
https://kb.vmware.com/s/article/79872. The rest of this document will assume that NSX is only
deployed using a VDS on ESXi hosts.
Segments are created as part of an NSX object called a transport zone. There are VLAN
transport zones and overlay transport zones. A segment created in a VLAN transport zone will
be a VLAN backed segment, while, as you can guess, a segment created in an overlay transport
zone will be an overlay backed segment. NSX transport nodes attach to one or more transport
zones, and as a result, they gain access to the segments created in those transport zones.
Transport zones can thus be seen as objects defining the scope of the virtual network because
they provide access to groups of segments to the hosts that attach to them, as illustrated in
FIGURE 3-1 below:
In the diagram above, transport node 1 is attached to transport zone “Staging”, while transport
nodes 2-4 are attached to transport zone “Production”. If one creates a segment 1 in transport
zone “Production”, each transport node in the “Production” transport zone immediately gains
access to it. However, this segment 1 does not extend to transport node 1. The span of segment
1 is thus defined by the transport zone “Production” it belongs to.
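As a rough model of this relationship, the span of a segment can be computed from the transport zone membership of each transport node. The dictionaries and names below are hypothetical and simply mirror the example; they do not represent NSX API objects.

```python
# Hypothetical inventory mirroring the example: which transport zones each node attaches to.
transport_node_zones = {
    "transport-node-1": {"Staging"},
    "transport-node-2": {"Production"},
    "transport-node-3": {"Production"},
    "transport-node-4": {"Production"},
}

# Each segment is created in exactly one transport zone.
segment_zone = {"segment-1": "Production"}

def segment_span(segment):
    """Return the transport nodes that gain access to `segment`."""
    zone = segment_zone[segment]
    return sorted(node for node, zones in transport_node_zones.items() if zone in zones)

print(segment_span("segment-1"))
```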
A few additional points related to transport zones and transport nodes:
• Multiple VDS (with or without NSX) or VSS can coexist on an ESXi transport node;
o A given pNIC can only be associated with a single virtual switch, VDS or
VSS. This behavior is specific to the VMware virtual switching model, not
to NSX.
o A VDS prepared with NSX can attach to a single overlay transport zone
and multiple VLAN transport zones at the same time.
o A transport node with multiple VDS prepared for NSX can attach to
multiple overlay transport zones, one per VDS.
• A transport zone can only be attached to a single VDS on a given transport node. In other
words, two VDS prepared for NSX on the same transport node cannot be attached to the
same transport zone.
• Edge transport node-specific points:
o An edge transport node can only have one N-VDS attached to an overlay
transport zone.
o If multiple VLAN segments are backed by the same VLAN ID (across all the
VLAN transport zones of an edge N-VDS), only one of those segments will
be “realized” (i.e. working effectively).
Please see additional consideration at RUNNING A VDS PREPARED FOR NSX ON ESXI HOSTS.
[Figure: ESXi transport node running a VDS with NSX]
• p1, p2 and p3 are pNICs on the ESXi host
• u1 and u2 are the uplinks of the VDS
– u1 is defined as a LAG bundling pNICs p1 and p2
– u2 is directly mapped to pNIC p3
In this example, a single virtual switch with two uplinks is defined on the ESXi transport node.
One of the uplinks is a LAG, bundling physical port p1 and p2, while the other uplink is only
backed by a single physical port p3. Both uplinks look the same from the perspective of the
virtual switch; there is no functional difference between the two.
Note that the example represented in this picture is by no means a design recommendation; it
only illustrates the difference between the virtual switch uplinks and the host physical uplinks.
Teaming Policy
The teaming policy defines how NSX uses its virtual switch uplinks for redundancy and
traffic load balancing. There are two main options for teaming policy configuration:
• Failover Order – An active uplink is specified along with an optional list of standby uplinks.
Should the active uplink fail, the next available uplink in the standby list takes its place
immediately. This policy results in an active/standby use of the uplinks.
• Load Balance Source/Load Balance Source Mac Address – Traffic is distributed across a
specified list of active/active uplinks.
o The “Load Balance Source” policy maps a VM’s virtual interface to an uplink. Traffic sent
by this virtual interface will leave the host through this uplink only, and traffic destined
to this virtual interface will necessarily enter the virtual switch via this uplink.
o The “Load Balance Source Mac Address” policy goes a little further in terms of granularity,
for the rare scenarios where a virtual interface can source traffic from different MAC
addresses. Here, two frames sent by the same virtual interface on the VM could be
associated with different uplinks based on their source MAC address. Note that this
teaming policy is only effective on VLAN segments. When used with an overlay segment,
Load Balance Source Mac Address behaves the same way as Load Balance Source.
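The difference between these teaming policies can be sketched as three uplink selection functions. This is a simplified model under stated assumptions: the modulo on a port identifier and the CRC32 hash of the source MAC address are placeholders chosen for the example, not the selection logic actually used by the virtual switch.

```python
import zlib

def failover_order(active, standby, link_up):
    """Use the active uplink; fall back to the first operational standby uplink."""
    for uplink in [active, *standby]:
        if link_up.get(uplink, False):
            return uplink
    return None

def load_balance_source(port_id, active_uplinks, link_up):
    """Pin a virtual interface (identified by port_id) to one operational uplink."""
    candidates = [u for u in active_uplinks if link_up.get(u, False)]
    if not candidates:
        return None
    return candidates[port_id % len(candidates)]  # assumption: simple modulo pinning

def load_balance_source_mac(src_mac, active_uplinks, link_up):
    """Select the uplink per source MAC address instead of per virtual interface."""
    candidates = [u for u in active_uplinks if link_up.get(u, False)]
    if not candidates:
        return None
    return candidates[zlib.crc32(src_mac.encode()) % len(candidates)]  # assumption: CRC32 hash

links = {"u1": True, "u2": True}
print(failover_order("u1", ["u2"], {"u1": False, "u2": True}))
print(load_balance_source(0, ["u1", "u2"], links), load_balance_source(1, ["u1", "u2"], links))
```

With failover order, only one uplink carries traffic at a time; with the load balance policies, different virtual interfaces (or source MAC addresses) can use different uplinks simultaneously.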
The teaming policy only defines how NSX distributes traffic across the uplinks of its virtual
switches. The uplinks can in turn be individual pNICs or LAGs (as seen in the previous section).
Note that a LAG uplink has its own hashing options. However, those hashing options only define
how traffic is distributed across the physical members of the LAG uplink, whereas the teaming
policy defines how traffic is distributed between NSX virtual switch uplinks.
FIGURE 3-3 presents an example of the failover order and source teaming policy options,
illustrating how the traffic from two different VMs in the same segment is distributed across
uplinks. The uplinks of the virtual switch could be any combination of single pNICs or LAGs;
whether the uplinks are pNICs or LAGs has no impact on the way traffic is balanced between
uplinks. When an uplink is a LAG, it is only considered down when all the physical members of
the LAG have their link down. When defining a transport node, the user must specify a default
teaming policy that will be applicable by default to all the segments (VLAN or overlay) available
on this transport node.
flexibility is illustrated in the following Figure 3-4 where dedicated named teaming policies have
been created to steer traffic to the left and right uplink.
Here, VM3 is attached to a VLAN segment using the “left” teaming policy, so its traffic enters
and leaves the host via VDS uplink u1. Again, overlay traffic necessarily follows the unique default
teaming policy. The load balance source policy still allows VM1 and VM2 to use different uplinks.
The following FIGURE 3-5 illustrates another common use case for named teaming policies.
Suppose that we have a host with four uplinks (u1/2/3/4) and that we want overlay traffic to
only use two specific uplinks, u1 and u2, while VLAN traffic should use u3 and u4.
Overlay traffic only follows the default teaming policy, so here, the default teaming policy must
include VDS uplinks u1 and u2. Without an additional teaming policy, VLAN traffic would also flow
on uplinks u1 and u2. Mapping VLAN segments to an additional named teaming policy
“VLAN-traffic” that includes u3 and u4 will ensure that VLAN traffic uses the desired uplinks.
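The rule described in this example (overlay segments always use the default teaming policy, VLAN segments may be mapped to a named one) can be modeled as a small lookup. The dictionary layout and the function name are assumptions made for illustration only.

```python
# Hypothetical teaming policies for the four-uplink host of the example.
teaming_policies = {
    "default": ["u1", "u2"],
    "VLAN-traffic": ["u3", "u4"],
}

def uplinks_for_segment(segment_type, named_policy=None):
    """Return the uplinks a segment may use on this transport node."""
    if segment_type == "overlay" or named_policy is None:
        # Overlay traffic, and VLAN segments without a named policy, use the default.
        return teaming_policies["default"]
    return teaming_policies[named_policy]

print(uplinks_for_segment("overlay"))
print(uplinks_for_segment("vlan", "VLAN-traffic"))
```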
Uplink Profile
As mentioned earlier, a transport node includes at least one NSX virtual switch, implementing
the NSX data plane. It is common for multiple transport nodes to share the exact same NSX
virtual switch configuration. It is also very difficult from an operational standpoint to configure
(and maintain) multiple parameters consistently across many devices. For this purpose, NSX
defines a separate object called an uplink profile that acts as a template for the configuration of
a virtual switch. This way, the administrator can create multiple transport nodes with similar
virtual switches by simply pointing to a common uplink profile. Even better, when the
administrator modifies a parameter in the uplink profile, it is automatically updated in all the
transport nodes following this uplink profile.
The following parameters are defined in an uplink profile:
• The transport VLAN used for overlay traffic. Overlay traffic will be tagged with the VLAN ID
specified in this field.
• The MTU of the uplinks. NSX will assume that it can send overlay traffic with this MTU on
the physical uplinks of the transport node without any fragmentation by the physical
infrastructure. This parameter is not used on ESXi hosts, where the MTU is directly defined on
the VDS.
• The name of the uplinks
• The Link aggregation groups (LAGs) used by the virtual switch. This is not used on ESXi
transport nodes, where LAGs are defined directly on the VDS.
• The teaming policies applied to the uplinks (default and optional named teaming policies)
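[Figure: ESXi1 before and after NSX preparation. The host has physical uplinks p1, p2 and p3,
VDS1 with uplinks vds-u1 and vds-u2, and VDS2 with uplink vds-u3. On the ESXi1 transport node,
NSX uplinks nsx-u1 and nsx-u2 are mapped to vds-u1 and vds-u2 on VDS1 (VDS with NSX).]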
The ESXi host above has three physical uplinks p1, p2 and p3. Two VDS have been deployed on
this host: VDS1, with two uplinks called vds-u1 and vds-u2 mapped to p1 and p2, and VDS2, with
one uplink vds-u3 mapped to p3. We are preparing VDS1 for NSX. For this operation, we
provide:
• An uplink profile (called “UP1” here), defining a teaming policy and a transport VLAN
(the VLAN ID that will be used if the transport node is attached to an overlay transport
zone.)
• The mapping of the uplink names defined in the uplink profile (nsx-u1, nsx-u2) with
VDS1 uplinks. Here, we map nsx-u1 to vds-u1 and nsx-u2 to vds-u2. This operation is
very similar to the uplink mapping that was performed when deploying VDS1 on ESXi1.
At that time, the user had to associate vds-u1 to p1, and vds-u2 to p2.
• If the VDS is going to connect to an overlay transport zone, it will need some IP
addresses for creating tunnels between peer transport nodes. A way of defining those IP
addresses must be specified in that case. The next part details how overlay networking
works.
VDS2 was added to show that NSX does not have to be installed on all virtual switches of an
ESXi host.
Once ESXi has been turned into a transport node by installing NSX on VDS1, VDS1 has a
dual personality. It is both:
- a standard VDS, with dvportgroups using vds-u1 and vds-u2 for their teaming policies,
- and at the same time, it is an NSX virtual switch with NSX segments, using nsx-u1 and
nsx-u2 as its uplinks.
The uplink profile allows centralizing part of the NSX configuration. A change in the uplink
profile is applied to all the transport nodes that were created using it. The uplink profile can
also be used to apply more specific NSX configuration to a set of hosts. Consider FIGURE 3-7
below:
[Figure: ESXi transport nodes in rack1 and rack2, each with pNICs p1 and p2 mapped to uplinks
u1 and u2 of the same VDS1]
In this example, both ESXi transport nodes are using the same VDS1. However, the
administrator applied different uplink profiles Rack1 and Rack2 on ESXi1 and ESXi2. A realistic
scenario would be the need to use different transport VLANs for different racks. Here, both
transport VLAN and teaming policy differ. A dvportgroup defined in VDS1 would use the same
teaming policy on ESXi1 and ESXi2. On the other hand, an NSX segment common to both
transport nodes would use different teaming policies on ESXi1 and ESXi2.
• NFS Traffic is traffic related to a file transfer in the network file system.
• vSAN traffic is generated by virtual storage area network.
• vMotion traffic is for computing resource migration.
• vSphere replication traffic is for replication.
• vSphere Data Protection Backup traffic is generated by backup of data.
• Virtual Machine traffic is generated by virtual machine workloads.
• iSCSI traffic is for Internet Small Computer System Interface storage.
The NIOCv3 configuration takes place directly in vCenter. In addition to system traffic
parameters, NIOCv3 provides an additional level of granularity for the VM traffic category:
shares, reservations and limits can also be applied at the Virtual Machine vNIC level. Because
NIOCv3 operates before traffic is sent on the uplinks, the encapsulation of the packets (VLAN or
overlay) has no impact on the feature. NIOCv3 works the exact same way on the VDS whether it
is prepared for NSX or not.
Logical Switching
This section on logical switching focuses on overlay backed segments due to their ability to
create isolated logical L2 networks with the same flexibility and agility that exists with virtual
machines. This decoupling of logical switching from the physical network infrastructure is one
of the main benefits of adopting NSX.
In the upper part of the diagram, the logical view consists of five virtual machines that are
attached to the same segment, forming a virtual broadcast domain. The physical
representation, at the bottom, shows that the five virtual machines are running on hypervisors
spread across three racks in a data center. Each hypervisor is an NSX transport node equipped
with a tunnel endpoint (TEP). The TEPs are configured with IP addresses, and the physical
network infrastructure only needs to provide IP connectivity between them. Whether the TEPs
are L2 adjacent in the same subnet or spread in different subnets does not matter. The
VMware® NSX Controller™ (not pictured) distributes the IP addresses of the TEPs across the
transport nodes so they can set up tunnels with their peers. The example shows “VM1” sending
a frame to “VM5”. In the physical representation, this frame is transported via an IP
point-to-point tunnel between transport nodes “HV1” and “HV5”.
The benefit of this NSX overlay model is that it allows direct connectivity between transport
nodes irrespective of the specific underlay inter-rack (or even inter-datacenter) connectivity
(i.e., L2 or L3). Segments can also be created dynamically without any configuration of the
physical network infrastructure.
Flooded Traffic
The NSX segment behaves like a LAN, providing the capability of flooding traffic to all the
devices attached to this segment; this is a cornerstone capability of layer 2. NSX does not
differentiate between the different kinds of frames replicated to multiple destinations.
Broadcast, unknown unicast, or multicast traffic will be flooded in a similar fashion across a
segment. In the overlay model, the replication of a frame to be flooded on a segment is
orchestrated by the different NSX components. NSX provides two different methods for
flooding traffic described in the following sections. They can be selected on a per segment
basis.
The diagram illustrates the flooding process from the hypervisor transport node where “VM1”
is located. “HV1” sends a copy of the frame that needs to be flooded to every peer that is
interested in receiving this traffic. Each green arrow represents the path of a point-to-point
tunnel through which the frame is forwarded. In this example, hypervisor “HV6” does not
receive a copy of the frame. This is because the NSX Controller has determined that there is no
recipient for this frame on that hypervisor.
In this mode, the burden of the replication rests entirely on the source hypervisor. Seven copies of
the tunnel packet carrying the frame are sent over the uplink of “HV1”. This should be
considered when provisioning the bandwidth on this uplink.
Assume that “VM1” on “HV1” needs to send the same broadcast on “S1” as in the previous
section on head-end replication. Instead of sending an encapsulated copy of the frame to each
remote transport node attached to “S1”, the following process occurs:
1. “HV1” sends a copy of the frame to all the transport nodes within its group (i.e., with a
TEP in the same subnet as its TEP). In this case, “HV1” sends a copy of the frame to
“HV2” and “HV3”.
2. “HV1” sends a copy to a single transport node in each of the remote groups. For the
two remote groups - subnet 20.0.0.0 and subnet 30.0.0.0 - “HV1” selects an arbitrary
member of each group and sends it a copy of the packet with a bit set to indicate the
need for local replication. In this example, “HV1” selected “HV5” and “HV7”.
3. Transport nodes in the remote groups perform local replication within their respective
groups. “HV5” relays a copy of the frame to “HV4” while “HV7” sends the frame to
“HV8” and “HV9”. Note that “HV5” does not relay to “HV6” as it is not interested in
traffic from “S1”.
The source hypervisor transport node knows about the groups based on the information it has
received from the NSX Controller. It does not matter which transport node is selected to
perform replication in the remote groups so long as the remote transport node is up and
available. If this were not the case (e.g., “HV7” was down), the NSX Controller would update all
transport nodes attached to “S1”. “HV1” would then choose “HV8” or “HV9” to perform the
replication local to group 30.0.0.0.
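The two replication modes can be compared with a short simulation. The sketch below rests on assumptions: the TEP addresses, the /24 prefix length and the first group's subnet (10.0.0.0) are invented for the example, since the text only names subnets 20.0.0.0 and 30.0.0.0. The remote replicator is picked with min() for determinism, whereas the example above has “HV1” pick “HV5” and “HV7”; the copy counts are the same either way.

```python
import ipaddress

# Hypothetical TEP addressing for the nine hypervisors of the example.
TEPS = {
    "HV1": "10.0.0.1", "HV2": "10.0.0.2", "HV3": "10.0.0.3",
    "HV4": "20.0.0.4", "HV5": "20.0.0.5", "HV6": "20.0.0.6",
    "HV7": "30.0.0.7", "HV8": "30.0.0.8", "HV9": "30.0.0.9",
}
# HV6 has no recipient on the segment, so it is not interested in the flooded frame.
INTERESTED = set(TEPS) - {"HV6"}

def group(hv, prefix_len=24):
    """A replication group is the subnet of the transport node's TEP."""
    return ipaddress.ip_interface(f"{TEPS[hv]}/{prefix_len}").network

def head_end(source):
    """The source sends one copy to every interested peer."""
    return [(source, hv) for hv in sorted(INTERESTED - {source})]

def two_tier(source, pick=min):
    """The source replicates locally and delegates replication in each remote group."""
    by_group = {}
    for hv in sorted(INTERESTED - {source}):
        by_group.setdefault(group(hv), []).append(hv)
    copies = []
    for net, members in by_group.items():
        if net == group(source):
            copies += [(source, hv) for hv in members]
        else:
            proxy = pick(members)  # arbitrary member of the remote group
            copies.append((source, proxy))
            copies += [(proxy, hv) for hv in members if hv != proxy]
    return copies

def inter_group(copies):
    """Count the tunnel packets that cross a subnet boundary."""
    return sum(1 for src, dst in copies if group(src) != group(dst))

for name, copies in (("head-end", head_end("HV1")), ("two-tier", two_tier("HV1"))):
    print(name, len(copies), "copies,", inter_group(copies), "between groups")
```

Both modes produce the same number of copies overall; what changes is which transport node creates them and how many cross the boundary between groups.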
In this mode, as with the head-end replication example, seven copies of the flooded frame have
been made in software, though the cost of the replication has been spread across several
transport nodes. It is also interesting to understand the traffic pattern on the physical
infrastructure. The benefit of the two-tier hierarchical mode is that only two tunnel packets
(compared to the headend mode of five packets) were sent between racks, one for each
remote group. This is a significant improvement in the network inter-rack (or inter-datacenter)
fabric utilization - where available bandwidth is typically less than within a rack. This gap
would be wider still if there were more transport nodes interested in flooded traffic for
“S1” on the remote racks. In the case where the TEPs are in another data center, the savings
could be significant. Note also that this benefit in terms of traffic optimization provided by the
two-tier hierarchical mode only applies to environments where TEPs have their IP addresses in
different subnets. In a flat Layer 2 network, where all the TEPs have their IP addresses in the
same subnet, the two-tier hierarchical replication mode would lead to the same traffic pattern
as the head-end (source) replication mode.
The default two-tier hierarchical flooding mode is recommended as a best practice as it
typically performs better in terms of physical uplink bandwidth utilization.
Unicast Traffic
When a frame is destined to an unknown MAC address, it is flooded in the network. Switches
typically implement a MAC address table, or filtering database (FDB), that associates MAC
addresses to ports in order to prevent flooding. When a frame is destined to a unicast MAC
address known in the MAC address table, it is only forwarded by the switch to the
corresponding port.
The NSX virtual switch maintains such a table for each segment/logical switch it is attached to.
A MAC address can be associated with either a virtual NIC (vNIC) of a locally attached VM or a
remote TEP (when the MAC address is located on a remote transport node reached via the
tunnel identified by that TEP).
FIGURE 3-11 illustrates virtual machine “Web3” sending a unicast frame to another virtual
machine “Web1” on a remote hypervisor transport node. In this example, the MAC address tables
of the NSX virtual switches on both the source and destination hypervisor transport nodes are
fully populated.
[Figure: “Web3” on HV3 (TEP3) sends a frame destined to mac1 to “Web1” on HV1 (TEP1), steps 1 to 4]
MAC address table on HV1: mac1 → vnic1, mac2 → TEP2, mac3 → TEP3
MAC address table on HV3: mac1 → TEP1, mac2 → TEP2, mac3 → vnic3
Figure 3-11: Unicast Traffic between VMs
1. “Web3” sends a frame to “Mac1”, the MAC address of the vNIC of “Web1”.
2. “HV3” receives the frame and performs a lookup for the destination MAC address in its
MAC address table. There is a hit. “Mac1” is associated to the “TEP1” on “HV1”.
3. “HV3” encapsulates the frame and sends it to “TEP1”.
4. “HV1” receives the tunnel packet, addressed to its “TEP1”, and decapsulates it. “HV1” then
performs a lookup for the destination MAC of the original frame. “Mac1” is also a hit
there, pointing to the vNIC of “Web1”. The frame is then delivered to its final destination.
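The four steps above amount to two table lookups, one on each transport node. The sketch below is a simplified model, with the table contents taken from Figure 3-11; the data structures and the forward function are hypothetical and ignore the actual encapsulation.

```python
# Per-host MAC address tables for the segment, as shown in Figure 3-11.
mac_tables = {
    "HV1": {"mac1": ("vnic", "vnic1"), "mac2": ("tep", "TEP2"), "mac3": ("tep", "TEP3")},
    "HV3": {"mac1": ("tep", "TEP1"), "mac2": ("tep", "TEP2"), "mac3": ("vnic", "vnic3")},
}
# Which transport node owns each TEP.
tep_owner = {"TEP1": "HV1", "TEP2": "HV2", "TEP3": "HV3"}

def forward(host, dst_mac):
    """Return the list of forwarding decisions taken for a unicast frame."""
    steps = []
    while True:
        kind, target = mac_tables.get(host, {}).get(dst_mac, ("flood", None))
        if kind == "vnic":
            steps.append(f"{host}: deliver to {target}")
            return steps
        if kind == "tep":
            steps.append(f"{host}: encapsulate and send to {target}")
            host = tep_owner[target]
            continue
        # Unknown destination MAC address: the frame is flooded instead.
        steps.append(f"{host}: flood")
        return steps

print(forward("HV3", "mac1"))
```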
This mechanism is relatively straightforward because at layer 2 in the overlay network, all the
known MAC addresses are either local or directly reachable through a point-to-point tunnel.
In NSX, the MAC address tables can be populated by the NSX Controller or by learning from the
data plane. The benefit of data plane learning, further described in the next section, is that it is
immediate and does not depend on the availability of the control plane.
[Figure: HV2 (TEP2) receives from HV1 (TEP1) a tunnel packet with source IP TEP1 and destination
IP TEP2, carrying a frame from “web1” with source mac1 and destination mac FF. HV2 learns
mac1 → TEP1.]
But this common method used in traditional overlay networking would not work for NSX with
the two-tier replication model. Indeed, as shown in section TWO-TIER HIERARCHICAL MODE, it is
possible that flooded traffic gets replicated by an intermediate transport node. In that case, the
source IP address of the received tunneled traffic represents the intermediate transport node
instead of the transport node that originated the traffic. Figure 3-13 below illustrates this
problem by focusing on the flooding of a frame from VM1 on HV1 using the two-tier replication
model (similar to what was described earlier in FIGURE 3-10: TWO-TIER HIERARCHICAL Mode.) When
intermediate transport node HV5 relays the flooded traffic from HV1 to HV4, it is actually
decapsulating the original tunnel traffic and re-encapsulating it, using its own TEP IP address as
a source.
The problem is thus that, if the NSX virtual switch on “HV4” was using the source tunnel IP
address to identify the origin of the tunneled traffic, it would wrongly associate Mac1 to TEP5.
[Diagram: “web1” (mac1) on HV1 floods a frame (mac1 → FF). HV1 tunnels it from TEP1 to TEP5,
then HV5 relays it from TEP5 to TEP4. HV4 learns mac1 → TEP5, which is incorrect.]
Figure 3-13: Data plane learning from source tunnel IP leads to an incorrect association
To solve this problem, some metadata identifying the source TEP is inserted in the tunnel
header. Metadata is a piece of information that is carried along unchanged with the payload of
the tunnel. Figure 3-14 below illustrates the same tunneled frame from “Web1” on “HV1”, this
time carried with a metadata field identifying “TEP1” as the origin.
[Figure: the same frame is tunneled from TEP1 to TEP5 and then from TEP5 to TEP4, each time with
the metadata “source TEP = TEP1”. HV4 learns mac1 → TEP1.]
With this additional piece of information, “HV4” can correctly identify the origin of
replicated tunneled traffic and populate its MAC address table with the appropriate source TEP
IP address.
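The difference between learning from the outer source IP address and learning from the metadata can be shown with a small model. The TunnelPacket class and its field names are hypothetical; they stand in for the outer IP header, the tunnel metadata and the inner frame.

```python
from dataclasses import dataclass

@dataclass
class TunnelPacket:
    outer_src_tep: str     # source IP of the tunnel packet
    outer_dst_tep: str     # destination IP of the tunnel packet
    metadata_src_tep: str  # metadata carried unchanged with the payload
    inner_src_mac: str     # source MAC address of the original frame

def relay(packet, relay_tep, next_tep):
    """Re-encapsulate a flooded frame: new outer addresses, metadata left untouched."""
    return TunnelPacket(relay_tep, next_tep, packet.metadata_src_tep, packet.inner_src_mac)

def learn(mac_table, packet, use_metadata=True):
    """Associate the inner source MAC address with a TEP."""
    tep = packet.metadata_src_tep if use_metadata else packet.outer_src_tep
    mac_table[packet.inner_src_mac] = tep

# HV1 floods a frame from mac1; HV5 relays it to HV4.
original = TunnelPacket("TEP1", "TEP5", "TEP1", "mac1")
relayed = relay(original, "TEP5", "TEP4")

naive_table, correct_table = {}, {}
learn(naive_table, relayed, use_metadata=False)
learn(correct_table, relayed)
print(naive_table, correct_table)
```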
[Figure: ARP suppression. HV1 (TEP1) hosts vmA and HV2 (TEP2) hosts vmB on the same segment.
(1) HV1 learns IPA → mac vmA from the DHCP ack and reports it to the Central Control Plane
cluster. (2) HV2 learns IPB → mac vmB and reports it. (3) HV2 queries the cluster for the MAC
address of IPA. (4) HV2 receives IPA → mac vmA and sends the ARP reply to vmB.]
1. Virtual machine “vmA” has just finished a DHCP request sequence and been assigned IP
address “IPA”. The NSX virtual switch on “HV1” reports the association of the MAC
address of virtual machine “vmA” to “IPA” to the NSX Controller.
2. Next, a new virtual machine “vmB” comes up on “HV2” that must communicate with
“vmA”, but its IP address has not been assigned by DHCP and, as a result, there has been
no DHCP snooping. The virtual switch will be able to learn this IP address by snooping
ARP traffic coming from “vmB”. Either “vmB” will send a gratuitous ARP when coming
up or it will send an ARP request for the MAC address of “vmA”. The virtual switch then
can derive the IP address “IPB” associated to “vmB”. The association (vmB -> IPB) is then
pushed to the NSX Controller.
3. The NSX virtual switch also holds the ARP request initiated by “vmB” and queries the
NSX Controller for the MAC address of “vmA”.
4. Because the MAC address of “vmA” has already been reported to the NSX Controller,
the NSX Controller can answer the request coming from the virtual switch, which can
now send an ARP reply directly to “vmB” on behalf of “vmA”. Thanks to this
mechanism, the expensive flooding of an ARP request has been eliminated. Note that if
the NSX Controller did not know about the MAC address of “vmA” or if the NSX
Controller were down, the ARP request from “vmB” would still be flooded by the virtual
switch.
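The ARP suppression sequence above can be summarized as: report bindings to the controller, query it when an ARP request is intercepted, and fall back to flooding when no answer is available. The Controller class and the function below are a hypothetical sketch of that decision, not the NSX control plane protocol.

```python
class Controller:
    """Minimal stand-in for the central ARP table of the control plane."""

    def __init__(self):
        self.arp_table = {}  # IP address -> MAC address
        self.up = True

    def report(self, ip, mac):
        """A virtual switch reports a binding learned by DHCP or ARP snooping."""
        self.arp_table[ip] = mac

    def query(self, ip):
        """Return the MAC address for `ip`, or None if unknown or unreachable."""
        return self.arp_table.get(ip) if self.up else None

def handle_arp_request(controller, target_ip):
    """Decide what the virtual switch does with an intercepted ARP request."""
    mac = controller.query(target_ip)
    if mac is not None:
        return ("proxy-reply", mac)  # answer locally on behalf of the target
    return ("flood", None)           # fall back to flooding on the segment

ccp = Controller()
ccp.report("IPA", "mac-vmA")  # step 1: HV1 reports vmA after DHCP snooping
ccp.report("IPB", "mac-vmB")  # step 2: HV2 reports vmB after ARP snooping
print(handle_arp_request(ccp, "IPA"))  # steps 3 and 4
```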
Overlay Encapsulation
NSX uses Generic Network Virtualization Encapsulation (Geneve) for its overlay model. Geneve
is an IETF standard (RFC 8926) that builds on top of VXLAN concepts to provide
enhanced flexibility in terms of data plane extensibility.
VXLAN has a fixed header, while Geneve offers flexible, extensible fields. Any implementer can
use this capability to adapt the encapsulation to the needs of its workloads and overlay fabric;
as a result, NSX tunnels are only set up between NSX transport nodes. NSX only needs efficient
support for the Geneve encapsulation by the NIC hardware; most NIC vendors support the same
hardware offload for Geneve as they would for VXLAN.
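To make the notion of flexible fields more concrete, the sketch below packs a Geneve base header followed by one variable-length option, following the layout defined in RFC 8926. The option class 0xFFFF, the option type 0x01 and the 4-byte payload are placeholders chosen for the example; they are not the values NSX uses.

```python
import struct

ETHERNET_PROTOCOL_TYPE = 0x6558  # Transparent Ethernet Bridging: the payload is an Ethernet frame

def geneve_option(opt_class, opt_type, data):
    """Build one option: class (16 bits), type (8 bits), length in 4-byte words (5 bits)."""
    if len(data) % 4 or len(data) // 4 > 31:
        raise ValueError("option data must be a multiple of 4 bytes, at most 124 bytes")
    return struct.pack("!HBB", opt_class, opt_type, len(data) // 4) + data

def geneve_header(vni, options=b""):
    """Build the 8-byte base header followed by the options."""
    opt_len_words = len(options) // 4
    if vni >= 1 << 24 or opt_len_words > 63 or len(options) % 4:
        raise ValueError("invalid VNI or options length")
    first_byte = (0 << 6) | opt_len_words  # version 0, then options length in 4-byte words
    flags = 0                              # O and C bits cleared
    return (struct.pack("!BBH", first_byte, flags, ETHERNET_PROTOCOL_TYPE)
            + struct.pack("!I", vni << 8)  # 24-bit VNI followed by 8 reserved bits
            + options)

# Placeholder metadata: 4 opaque bytes standing in for vendor-specific information.
option = geneve_option(0xFFFF, 0x01, bytes([192, 0, 2, 1]))
header = geneve_header(5001, option)
print(header.hex())
```

A VXLAN header has no equivalent of the options area, which is why carrying extra metadata there would require reusing existing bits or revising the specification.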
Network virtualization is all about developing a model of deployment that is applicable to a
variety of physical networks and diversity of compute domains. New networking features are
developed in software and implemented without worry of support on the physical
infrastructure. For example, the data plane learning section described how NSX relies on
metadata inserted in the tunnel header to identify the source TEP of a forwarded frame. This
metadata could not have been added to a VXLAN tunnel without either hijacking existing bits in
the VXLAN header or making a revision to the VXLAN specification. Geneve allows any vendor
to add its own metadata in the tunnel header with a simple Type-Length-Value (TLV) model.
NSX defines a single TLV, with fields for:
When top of rack switch ToR1 fails, vmnic0’s link goes down on both ESXi1 and ESXi2. For both
ESXi hosts, the impacted TEP is migrated to vmnic1, the remaining operational uplink that is part
of the teaming policy. Note that the TEP itself is never considered as failed, but depending on the
status of its “preferred” interface, it might move to a secondary interface. This implies that all
the uplinks that are part of a teaming policy are expected to be layer 2 adjacent.
The following diagram goes a little further into the architecture of an ESXi host featuring
multiple TEPs. In this example, the virtual machines VM1/2/3/4 are attached to an overlay
segment. The virtual machines can communicate directly with each other within the host
using this segment. The source teaming policy also associates those four VMs with the different
TEPs: here VM1/VM2 are associated with TEP1, while VM3/VM4 are associated with TEP2. What
this VM to TEP association means is that traffic going from, say, VM1 will be forwarded out of
the ESXi transport node on the overlay, using TEP1 as a source. Overlay traffic from remote
hosts toward VM1 will also be received by TEP1. This association is reflected in the MAC address
table for the segment, represented on the right side of the diagram. This table is synchronized to
all remote transport nodes.
The diagram in FIGURE 3-18 also shows how the mac address of the TEPs are learnt in the
physical infrastructure. For example, top of rack switch ToR1 has learnt that TEP1’s mac address
is on the downlink toward vmnic0 and TEP2’s mac address is reachable via the inter switch link.
Now, let’s suppose the link between ToR2 and vmnic1 goes down. The PHY of vmnic1 declares
the link as failed, and TEP2 is immediately moved to vmnic0. To update the mac address tables
in the physical infrastructure, a RARP is sent on vmnic0, using TEP2’s mac address as a source.
This convergence is represented in FIGURE 3-19 below:
Following the flooding of the RARP packet, the mac address tables in the physical infrastructure
direct traffic for TEP2 directly toward the downlink of ToR1. This mechanism is extremely
efficient. As soon as the link of vmnic1 is declared down, the transmission of a single packet on
vmnic0 potentially recovers connectivity for all the VMs impacted by vmnic1’s failure. FIGURE
3-19 represents the updated mac addresses in the physical infrastructure in orange. Notice that
the mac address table of the segment, in the overlay, is left completely unaffected by the
reconvergence.
The failed TEP will be replaced by a healthy one. This section details the failure detection, the
recovery mechanism involved and the way multi-TEP HA behaves when the initial failure
condition has cleared.
Multi-TEP HA is only available for ESXi transport nodes. It is not available for Edge Nodes.
3.2.8.2 Failure detection: physical uplink up but not able to transmit/receive traffic
The TEPs of NSX transport nodes are running BFD (bidirectional forwarding detection) sessions
between each other. Those sessions are used to evaluate the health of the TEPs for the various
NSX dashboards. Those BFD sessions can also give an indication of the health of the physical
interface associated to the TEP. Once a TEP has established a BFD session with a peer, it
indirectly gets the indication that the interface to which it is associated has bidirectional
connectivity. When all previously established BFD sessions go down, the TEP can also safely
infer that its associated interface has failed, even if its physical link is still up.
FIGURE 3-20 below presents the scenario where the link between ToR2 and vmnic1 stops
forwarding traffic while remaining physically up. TEPs are running BFD sessions between them
at a 1 packet per second fixed rate, not configurable by the user. About 3 seconds following the
failure, all BFD sessions to TEP2 are down.
After all BFD sessions are down, multi-TEP HA adds another configurable delay before declaring
the TEP as failed.
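The time needed to declare a TEP as failed can be estimated with simple arithmetic. In the sketch below, the detection multiplier of 3 is an assumption inferred from the figures quoted above (1 packet per second, sessions down after about 3 seconds), and the 5 second value used for the multi-TEP HA delay is purely illustrative.

```python
def tep_failure_declaration_seconds(bfd_interval_s=1.0, detect_multiplier=3, ha_delay_s=0.0):
    """Estimate the time between the failure and the TEP being declared as failed.

    bfd_interval_s:    BFD transmit interval (fixed at 1 packet per second per the text)
    detect_multiplier: missed packets before a session goes down (assumed to be 3)
    ha_delay_s:        additional configurable multi-TEP HA delay (illustrative value)
    """
    return bfd_interval_s * detect_multiplier + ha_delay_s

print(tep_failure_declaration_seconds())                # BFD detection only
print(tep_failure_declaration_seconds(ha_delay_s=5.0))  # with a hypothetical 5 second delay
```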
3.2.8.3 Reconvergence
Let’s carry on using the example in FIGURE 3-20 above, where TEP2 has been declared as failed.
The virtual machines that had vNICs associated to this TEP2 need to move to TEP1, the only
remaining TEP available. Here, VM3 and VM4 are now going to switch to TEP1 to communicate
with the outside world.
After moving the VMs to TEP1, there is no mac address change in the physical network: TEP1
and TEP2 are still associated to vmnic0 and vmnic1 respectively. However, some VMs that used
to be reachable via TEP2 are now behind TEP1 and the overlay mac address tables of all the
remote transport nodes need an update. This is achieved in two ways: one for a fast update,
the other for a reliable update:
• Using the data path, multiple RARP packets with source MAC addresses matching the
VMs’ vNIC MAC addresses are flooded on the overlay segment.
• The updated mac address table is also advertised to the central control plane
component, which in turn propagates the changes to the remote transport nodes.
This mac address table update is an expensive operation compared to the TEP HA equivalent,
which only requires moving a single mac address in the physical infrastructure. Furthermore, the
failure of a top of rack switch, as depicted in FIGURE 3-17 for example, could result in many hosts
simultaneously switching TEP for some of their VMs. Keep in mind, however, that multi-TEP HA is a
secondary failure detection mechanism that covers some corner cases where TEP HA
would not converge at all. Most regular ToR failures are addressed by moving the TEP IP/MAC
to a different pNIC.
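As an illustration of the reconvergence step, the following Python sketch re-pins the vNICs of a failed TEP onto the remaining healthy TEPs and returns the mac addresses that need to be re-announced (by RARP and through the control plane). The function name, the dictionary layout and the round-robin placement are assumptions made for the example, not the NSX implementation.

```python
def evacuate_tep(vnic_to_tep: dict, failed_tep: str, healthy_teps: list) -> list:
    """Move every vNIC pinned to failed_tep onto a healthy TEP.

    vnic_to_tep maps a vNIC mac address to the name of its current TEP and is
    updated in place. Returns the sorted list of moved mac addresses.
    """
    if not healthy_teps:
        raise ValueError("no healthy TEP left to evacuate to")
    moved = sorted(mac for mac, tep in vnic_to_tep.items() if tep == failed_tep)
    for index, mac in enumerate(moved):
        # Round-robin placement is an assumption made for this sketch.
        vnic_to_tep[mac] = healthy_teps[index % len(healthy_teps)]
    return moved
```

In the scenario of the text, with the vNICs of VM3 and VM4 pinned to TEP2, calling `evacuate_tep(mapping, "TEP2", ["TEP1"])` returns those two mac addresses and leaves every vNIC on TEP1.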
to remote peers and validate vmnic1’s availability. However, for TEP2 to even attempt
establishing BFD sessions with the remote transport nodes, our current implementation
requires some vNICs to be associated to TEP2. The recovery mechanism will thus require
moving the VMs that had been evacuated to TEP1 back to TEP2. This is a disruptive operation,
and furthermore, if the failure was due to a physical uplink problem, TEP2 might not be able to
come back to an operational state and VMs would have to be associated back to TEP1.
Multi-TEP HA thus provides different options for attempting to recover a failed TEP:
• Automatic recovery. After a configurable timer, an attempt is made to move the
evacuated VMs back to their original TEP. If the TEP fails to establish at least one BFD
session, VMs are evacuated again. Additional periodic recovery attempts are made, following an
exponential back-off algorithm. This model has some impact on the connectivity of
the affected VMs, because each recovery attempt moves them again. Its benefit is the ability
to fall back, without user intervention, to a configuration where all the physical uplinks are used.
• Manual recovery. The user, by CLI or API, notifies that they want to attempt a recovery
of the affected TEP. This is the best approach for administrators who want to minimize
traffic disruption, as they can check the problem in the physical infrastructure is solved
before moving the VMs back. The drawback is that the VM traffic distribution across the
physical uplinks remains affected until this manual step is initiated.
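The automatic recovery option can be pictured with a small Python sketch of an exponential back-off schedule. The doubling factor, the cap and the function name are assumptions chosen for illustration; the text only states that the first timer is configurable and that later attempts follow an exponential back-off.

```python
def recovery_attempt_times(initial_delay_s: float, attempts: int,
                           factor: float = 2.0,
                           max_delay_s: float = 3600.0) -> list:
    """Return when each recovery attempt fires, in seconds after the failure.

    The wait before each attempt grows by `factor` (an assumed value) and is
    capped at `max_delay_s` (also an assumed value).
    """
    times = []
    elapsed = 0.0
    delay = initial_delay_s
    for _ in range(attempts):
        elapsed += delay
        times.append(elapsed)
        delay = min(delay * factor, max_delay_s)
    return times
```

With a hypothetical 60 second initial timer, four attempts would fire at 60, 180, 420 and 900 seconds. Each attempt is a potential traffic disruption for the evacuated VMs, which is why the spacing grows over time.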
TEP Groups
NSX data plane relies on tunnels, established between TEPs (tunnel end points) on the
transport nodes. When a transport node hosts multiple TEPs, there are multiple paths in and
out of this transport node. TEP groups is a feature improving the way NSX leverages multiple
TEPs on its transport nodes. With TEP groups:
• Traffic is distributed more evenly across the tunnels available between
transport nodes, providing higher throughput.
• The failure of an individual tunnel between transport nodes is efficiently detected and
acted upon, leading to better availability.
NSX 4.2 introduces TEP groups for edge transport nodes. It will also be implemented on ESXi
transport nodes in a future release. This section describes the way TEP groups operates
between transport nodes in the data plane, then focuses on its benefits for traffic going
through edge nodes.
Transport node 2 (TN2) hosts two mac addresses Mac1 and Mac2. Those mac addresses could
be the mac address of a VNIC on an ESXi transport node, or the mac address of a gateway
interface on an edge transport node. Those mac addresses are associated to a TEP on TN2.
Here, mac address Mac1 is associated to TEP1, while Mac2 is associated to TEP2. That means
that if a transport node TN1 needs to send some traffic to Mac1, it will do a mac address lookup
on Mac1, retrieve the IP address of TEP1 from this lookup, and send the traffic on a tunnel
leading to TEP1. This current model offers some form of load sharing across the TEPs of TN2.
Indeed, if TN1 (or any other transport node) wants to send some traffic to Mac2, it will be
directed to TEP2.
The TEP group feature bundles the TEPs of a transport node into a TEP group. All the mac
addresses of the transport nodes are now associated to the TEP group, instead of an individual
TEP. In FIGURE 3-23 below, TEP1 and TEP2 of TN2 are bundled into TEP Group TG1:
With this change, when TN1 wants to send traffic to Mac1, it can now leverage any of the
available tunnels leading to TN2 (here tunnel1 and tunnel2 instead of tunnel1 only.) The
selection of the tunnel by TN1 is achieved by computing a hash on the first packet of the flow
going to TN2. TEP groups thus enable per-flow load sharing. Note that the feature requires
some change on both the destination transport node (TN2 needs to bundle its TEPs) and the
source transport node (TN1 needs to know which TEPs are part of TG1 in order to pick a tunnel
leading to TN2 using a hash.) Of course, the TEP grouping can happen on both source and
destination transport nodes, as represented in the following FIGURE 3-24:
TN1 is maintaining four tunnels to TN2. When bundling the source TEPs in TEP group “TG”, not
only can TN1 load share its traffic to TEP1/TEP2 on TN2, but it can also initiate this traffic
from its two separate TEPs, on a per-flow basis.
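The per-flow tunnel selection can be sketched in Python as follows. The text only says that a hash is computed on the first packet of the flow, so the flow key (a 5-tuple), the hash function (SHA-256) and all names below are assumptions made for illustration.

```python
import hashlib
from itertools import product


def tunnels_between(src_teps: list, dst_teps: list) -> list:
    """Every (source TEP, destination TEP) pair is a candidate tunnel."""
    return list(product(src_teps, dst_teps))


def pick_tunnel(flow: tuple, tunnels: list) -> tuple:
    """Deterministically map a flow to one tunnel.

    `flow` is assumed to be (src_ip, dst_ip, protocol, src_port, dst_port).
    The same flow always hashes to the same tunnel, so packets of a flow are
    not reordered, while different flows spread across the tunnels.
    """
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return tunnels[int.from_bytes(digest[:4], "big") % len(tunnels)]
```

With two TEPs on each side, `tunnels_between` yields the four tunnels mentioned above, and `pick_tunnel` returns one of them for any given flow.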
This diagram represents the connectivity of transport nodes TN1 and TN3 to TN2. Each of the
tunnels between the TEPs is protected by a BFD session. Suppose that tunnel4, represented
by a red dashed line between TN1 and TN2, has been determined as failed by its BFD session.
TEP group high availability will simply remove tunnel4 from the list of tunnels available for
connecting TN1 to TN2. TN1 can still reach TN2 via tunnel1, tunnel2 and tunnel3. Any flow
that was associated to tunnel4 will be re-assigned an available tunnel leading to TEP group TG1,
thus ensuring minimal disruption. Note that TEP2, the TEP at the end of the failed tunnel, is not
affected by this convergence. TEP2 can still receive traffic from TN1 on tunnel2, and TEP2 is still
available for connectivity to transport node TN3. TEP group high availability is very granular.
Compare this to multi-TEP high availability, presented in part 3.2.8, that could only detect the
failure of a TEP when all its BFD sessions were down.
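The granularity of TEP group high availability can be illustrated with a short Python sketch: only the failed tunnel is withdrawn, and only the flows that were using it are re-homed. The flow table, the round-robin reassignment and the function name are assumptions for the example. The real data plane reassigns flows with its own selection logic.

```python
def handle_tunnel_failure(flow_table: dict, available: list, failed: str) -> list:
    """Withdraw `failed` from the tunnel list and re-home only its flows.

    flow_table maps a flow identifier to a tunnel name and is updated in
    place. Flows on healthy tunnels are left untouched.
    """
    remaining = [tunnel for tunnel in available if tunnel != failed]
    if not remaining:
        raise RuntimeError("no tunnel left toward the TEP group")
    orphaned = sorted(f for f, tunnel in flow_table.items() if tunnel == failed)
    for index, flow in enumerate(orphaned):
        flow_table[flow] = remaining[index % len(remaining)]
    return remaining
```

Applied to the example above, the failure of tunnel4 leaves tunnel1, tunnel2 and tunnel3 available, and flows that were already on those three tunnels keep their assignment.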
Figure 3-26: TEP group between ESXi host and edge node in NSX 4.2
In this diagram, the ESXi transport node is not running TEP groups. As a result, traffic from a
specific virtual machine is tightly associated to a TEP. For example, mac address Mac1 is
associated to TEP1 here. That means that:
• Flows from Mac1 to Mac3 on the edge transport node can only be transported by
tunnel1 and tunnel2, originated in TEP1.
• Traffic from the edge transport node toward Mac1 can only be directed toward TEP1.
However, the edge can share the load on a per-flow basis between tunnel1 and tunnel2,
originated on its TEP3 and TEP4.
3.2.9.2 TEP group benefits for traffic routed through edge nodes
Even as a feature implemented on edge transport nodes only, TEP groups provides a
significant improvement to the NSX data plane for traffic routed through edge nodes. This part
details that improvement, using a simple example.
Northbound traffic
Suppose that multiple virtual machines, spread across multiple ESXi transport nodes, forward
some traffic to the “Internet”. In our example, this traffic is routed first through a Tier1 SR
on the edge2 transport node, and then through a Tier0 SR on edge1.
The logical view of this network is represented in the following FIGURE 3-27:
Figure 3-27: Logical view: northbound traffic from ESXi hosts through multiple edge gateways
The virtual machines are spread in two segments in this example. The routed traffic is received
by the local Tier1 DR acting as a default gateway. The Tier1 DR forwards this traffic to its
centralized component via an internal segment (called transit VNI or backplane). This Tier1 SR
then routes this traffic to the Tier0 DR local to its edge. The traffic is finally forwarded through
another backplane segment to the Tier0 SR.
Let’s now focus on the path this traffic is taking on the physical infrastructure. More specifically,
we’re interested in the TEPs involved. The following FIGURE 3-28 is providing more details on
how this northbound traffic is distributed across the components:
On ESXi transport nodes, as mentioned in the first part of this chapter, virtual machines’ VNICs
are associated to a TEP. The hosts in this example have four TEPs each, so that the traffic
distribution is more obvious (this is not a design recommendation.) We can see that a variety of
TEPs is used on each host to forward the traffic to the edge2, where the Tier1 SR is instantiated.
It is impossible for an edge to distribute its traffic on a per-VM basis (more precisely, on a per-
VNIC basis) because the edge is not running any virtual machine. Instead, an edge splits its
traffic across TEPs on a per segment basis. For a given segment, traffic is received/sent via a
single TEP. In the scenario depicted in FIGURE 3-28, this means that all the traffic generated by
the virtual machines is received on a single TEP on the edge. This is because, irrespective of the
segment to which the VM is originally attached, its traffic is forwarded on a unique internal
backplane segment between the Tier1 DR on the host and Tier1 SR on edge2. In most cases,
this is not an issue. However, there are a few customers who have very high bandwidth
requirements for the traffic routed through their edge gateways, and receiving this traffic
through a single TEP results in a bottleneck.
Note that the same issue is repeated when the traffic is routed between the Tier1 SR and the
Tier0 SR: a single TEP is used, even if four of them are available between edge2 and edge1.
TEP groups on edge allows distributing this northbound traffic across all the edge’s TEPs, as
represented in FIGURE 3-29 below:
Figure 3-29: Northbound traffic from ESXi hosts through edge gateways, with TEP groups configured
Thanks to the TEP group, the edge is now reachable via any of its TEPs, irrespective of the
segment carrying the traffic. The ESXi hosts can target any TEP on the edge, distributing their
northbound traffic on a per-flow basis. The same mechanism is used between the two edges for
the traffic routed between the Tier1 SR and Tier0 SR. Note that in this edge-edge
communication, TEP groups are available on both source and destination transport nodes. This
means that traffic can be sent from any TEP on the Tier1 SR’s edge to any TEP on the Tier0 SR’s
edge.
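The difference between the two behaviors can be reproduced with a toy Python model. Without a TEP group, the receiving TEP on the edge depends only on the segment, so every flow carried by the same backplane segment lands on one TEP. With a TEP group, the choice depends on the flow. The CRC32 hash, the string identifiers and the function name are assumptions used purely for illustration.

```python
import zlib


def edge_rx_tep(segment: str, flow_id: str, edge_teps: list,
                tep_group: bool) -> str:
    """Return the edge TEP that receives a given flow.

    Without a TEP group the selection key is the segment (per-segment
    pinning). With a TEP group the selection key is the flow (per-flow
    load sharing).
    """
    key = flow_id if tep_group else segment
    return edge_teps[zlib.crc32(key.encode()) % len(edge_teps)]
```

Running many flows over a single hypothetical segment named "t1-backplane" exercises exactly one edge TEP when `tep_group` is False, and several TEPs when it is True.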
Tier0 DR on edge1 and the Tier1 SR on edge2. However, the traffic from the Tier1 DR on edge2
is distributed across multiple TEPs on the ESXi hosts, on a per VM basis.
Figure 3-30: Southbound traffic routed through edge gateways, without TEP groups
Let’s now drill further down on the traffic distribution with the physical view of the scenario,
represented in FIGURE 3-31 below:
Here, there is no traffic distribution between edge1 and edge2. All routed traffic is going
through a single pair of TEPs.
However, as already visible in the logical view, there is some traffic distribution between edge2
and the ESXi hosts. Not only are the hosts receiving the traffic on multiple TEPs, but it is sourced
from different TEPs on edge2, on a per-segment basis.
Southbound traffic is thus less polarized than northbound traffic. Still, TEP groups makes
the traffic distribution even better, as represented in FIGURE 3-32 below:
Figure 3-32: Southbound traffic routed through edge gateways, with TEP groups configured
Traffic routed between edge1 and edge2 now uses the four TEPs as a source on edge1 and as a
destination on edge2.
Traffic from edge2 toward the hosts can now be sourced by any TEP on edge2, on a per-flow
basis. This should provide efficient load sharing, even if the destination VMs are on a limited
number of segments, or if a specific
Edges. However, there are some scenarios where layer 2 connectivity is required between VMs
and physical devices. For this functionality, NSX introduces the NSX Bridge, a service that can be
instantiated on an Edge for the purpose of connecting an NSX logical segment with a traditional
VLAN at layer 2.
The most common use cases for this feature are:
• Physical to virtual/virtual to virtual migration. This is generally a temporary scenario where
a VLAN backed environment is being migrated to an overlay backed NSX environment. The NSX
Edge Bridge is a simple way to maintain connectivity between the different components during
the intermediate stages of the migration process.
[Figure: Edge Bridge use cases (virtual to virtual, physical to virtual, non-virtualized appliances), each connecting an overlay segment to a VLAN]
efficient, as routing allows for Equal Cost Multi Pathing, which results in higher bandwidth and a
better redundancy model. A common misconception regarding the edge bridge is that modern
SDN-based designs must not use bridging. That is not the case: the Edge Bridge can be
conceived as a permanent solution for extending overlay-backed segments into VLANs.
Permanent bridging for a set of workloads may be required for a variety of reasons, such as
older applications that cannot change IP address, end-of-life gear that does not allow any
change, regulation, third party connectivity, and span of control on those topologies or
devices. However, an architect enabling such a use case must plan for some level of dedicated
resources, such as bandwidth, operational control and protection of the bridged topologies.
The same segment can be attached to several bridges on different Edges. A common use case is
when non virtualized servers in the same broadcast domain exist in different racks. Adding an
edge bridge on each rack allows connecting those servers to the same segment without
requiring the physical infrastructure to extend a VLAN between racks. The Edge Bridge also
supports bridging 802.1Q tagged traffic carried in an overlay backed segment (Guest VLAN
Tagging.) For more information about this feature, see the BRIDGING WHITE PAPER on the
VMware communities website.
Figure 3-36: Bridge Profile, defining a redundant Edge Bridge (primary and backup)
Once a Bridge Profile is created, the user can attach a segment to it. By doing so, an active
Bridge instance is created on the primary Edge, while a standby Bridge is provisioned on the
backup Edge. NSX creates a Bridge Endpoint object, which represents this pair of Bridges. The
attachment of the segment to the Bridge Endpoint is represented by a dedicated Logical Port,
as shown in the diagram below:
Figure 3-37: Primary Edge Bridge forwarding traffic between segment and VLAN
When associating a segment to a Bridge Profile, the user can specify the VLAN ID for the VLAN
traffic as well as the physical port that will be used on the Edge for sending/receiving this VLAN
traffic. At the time of the creation of the Bridge Profile, the user can also select the failover
mode. In the preemptive mode, the Bridge on the primary Edge will always become the active
bridge forwarding traffic between overlay and VLAN as soon as it is available, usurping the
function from an active backup. In the non-preemptive mode, the Bridge on the primary Edge
will remain standby should it become available when the Bridge on the backup Edge is already
active.
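The two failover modes can be expressed as a small decision function. This Python sketch is a simplified model of the behavior described above. The function name, its parameters and the string labels are hypothetical and do not correspond to NSX API objects.

```python
def active_bridge(primary_up: bool, backup_up: bool,
                  currently_active, preemptive: bool):
    """Return which Edge should forward traffic: "primary", "backup" or None.

    currently_active is the Edge whose Bridge is active before the
    decision ("primary", "backup" or None).
    """
    # The primary takes (or keeps) the active role unless the mode is
    # non-preemptive and a healthy backup is already active.
    if primary_up and (preemptive or currently_active != "backup" or not backup_up):
        return "primary"
    if backup_up:
        return "backup"
    return None
```

When the primary Edge recovers while the backup is active, `active_bridge(True, True, "backup", preemptive=True)` selects the primary, whereas the same call with `preemptive=False` keeps the backup active.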
Figure 3-38: Edge Bridge Firewall
The firewall rules can leverage existing NSX grouping constructs, and there is currently a single
firewall section available for those rules.
In the above example, VM1, VM2, and Physical Servers 1 and 2 have IP connectivity. Remarkably,
through the Edge Bridges, Tier-1 or Tier-0 gateways can act as default gateways for physical
devices. Note also that the distributed nature of NSX routing is not affected by the introduction
of an Edge Bridge. ARP requests from physical workloads for the IP address of an NSX router
acting as a default gateway will be answered by the local distributed router on the Edge where
the Bridge is active.
In a data center, traffic is categorized as East-West (E-W) or North-South (N-S) based on the
origin and destination of the flow. When virtual or physical workloads in a data center
communicate with the devices external to the data center (e.g., WAN, Internet), the traffic is
referred to as North-South traffic. The traffic between workloads confined within the data
center is referred to as East-West traffic. In modern data centers, more than 70% of the traffic
is East-West.
In a multi-tiered application, the web tier needs to talk to the app tier and the app tier
needs to talk to the database tier, and these tiers sit in different subnets. Every time a
routing decision is made, the packet is sent to the router. Traditionally, a centralized router
would provide routing for these different tiers. With VMs that are hosted on the same ESXi
hypervisor, traffic will leave the hypervisor multiple times to go to the centralized router for a
routing decision, then return to the same hypervisor; this is not optimal.
NSX is uniquely positioned to solve these challenges as it can bring networking closest to the
workload. Configuring a Gateway via NSX Manager instantiates a local distributed gateway on
each hypervisor. For the VMs hosted (e.g., “Web 1”, “App 1”) on the same hypervisor, the E-W
traffic does not need to leave the hypervisor for routing.
3. Once the MAC address of “App1” is learned, the L2 lookup is performed in the local
MAC table to determine how to reach “App1” and the packet is delivered to the App1
VM.
4. The return packet from “App1” follows the same process and routing would happen
again on the local DR.
In this example, neither the initial packet from “Web1” to “App1” nor the return packet from
“App1” to “Web1” left the hypervisor.
East-West Routing - Distributed Routing with Workloads on Different Hypervisor
In this example, the target workload “App2” differs as it resides on a hypervisor named “HV2”. If
“Web1” needs to communicate with “App2”, the traffic would have to leave the hypervisor
“HV1” as these VMs are hosted on two different hypervisors. FIGURE shows a logical view of
the topology, highlighting the routing decisions taken by the DR on “HV1” and the DR on “HV2”.
When “Web1” sends traffic to “App2”, routing is done by the DR on “HV1”. The reverse traffic
from “App2” to “Web1” is routed by DR on “HV2”. Routing is performed on the hypervisor
attached to the source VM.
FIGURE shows the corresponding physical topology and packet walk from “Web1” to “App2”.
Services Router
East-West routing is completely distributed in the hypervisor, with each hypervisor in the
transport zone running a DR in its kernel. However, some services of NSX are not distributed,
due to their locality or stateful nature, such as:
• Physical infrastructure connectivity (BGP/OSPF/Static Routing/BFD)
• NAT
• DHCP server
• VPN
• Gateway Firewall
• Bridging
• Service Interface
• Metadata Proxy for OpenStack
A services router (SR) is instantiated on an edge cluster when a service is enabled that cannot
be distributed on a gateway.
A centralized pool of capacity is required to run these services in a highly available and scaled-
out fashion. The appliances where the centralized services or SR instances are hosted are called
Edge nodes. An Edge node is the appliance that provides connectivity to the physical
infrastructure.
The left side of FIGURE 4-6 shows the logical view of a Tier-0 Gateway showing both DR and SR
components when connected to a physical router. The right side of FIGURE 4-6 shows how the
components of the Tier-0 Gateway are realized on the compute hypervisor and the Edge node. Note
that the compute host (i.e. HV1) has just the DR component and the Edge node shown on the right
has both the SR and DR components. The SR and DR forwarding tables have been merged to address
future use cases. SR and DR functionality remains the same after the SR/DR merge in the NSX 2.4
release, but with this change the SR has direct visibility into the overlay segments. Notice that all
the overlay segments are attached to the SR as well.
As mentioned previously, connectivity between DR on the compute host and SR on the Edge
node is auto plumbed by the system. Both the DR and SR get an IP address assigned in the
169.254.0.0/24 subnet by default. The management plane also configures a default route on
the DR with the next hop IP address of the SR’s intra-tier transit link IP. This allows the DR to
take care of E-W routing while the SR provides N-S connectivity.
FIGURE 4-8 shows a detailed packet walk from data center VM “Web1” to a device on the L3
physical infrastructure. As discussed in the E-W routing section, routing always happens closest
to the source. In this example, eBGP peering has been established between the physical router
interface with the IP address 192.168.240.1 and the Tier-0 Gateway SR component hosted on
the Edge node with an external interface IP address of 192.168.240.3. Tier-0 Gateway SR has a
BGP route for 192.168.100.0/24 prefix with a next hop of 192.168.240.1 and the physical router
has a BGP route for 172.16.10.0/24 with a next hop of 192.168.240.3.
Two-Tier Routing
In addition to providing optimized distributed and centralized routing functions, NSX supports a
multi-tiered routing model with logical separation between different gateways within the NSX
infrastructure. The top-tier gateway is referred to as a Tier-0 gateway while the bottom-tier
gateway is a Tier-1 gateway. This structure gives complete control and flexibility over services
and policies. Various stateful services can be hosted on the Tier-1 while the Tier-0 can operate
in an active-active manner.
Configuring two tier routing is not mandatory. It can be single tiered as shown in the previous
section. FIGURE 4-10 presents an NSX two-tier routing architecture.
Northbound, the Tier-0 gateway connects to one or more physical routers/L3 switches and
serves as an on/off ramp to the physical infrastructure. Southbound, the Tier-0 gateway
connects to one or more Tier-1 gateways or directly to one or more segments as shown in the
North-South routing section. Northbound, the Tier-1 gateway connects to a Tier-0 gateway
using a RouterLink port. Southbound, it connects to one or more segments using downlink
interfaces.
Concepts of DR/SR discussed in SECTION 4.1 remain the same for multi-tiered routing. As with a
Tier-0 gateway, when a Tier-1 gateway is created, a distributed component (DR) of the Tier-1
gateway is intelligently instantiated on the hypervisors and Edge nodes. Before enabling a
centralized service on a Tier-0 or Tier-1 gateway, an edge cluster must be configured on this
gateway. Configuring an edge cluster on a Tier-1 gateway instantiates a corresponding Tier-1
services component (SR) on two Edge nodes that are part of this edge cluster. Configuring an Edge
cluster on a Tier-0 gateway does not automatically instantiate a Tier-0 service component (SR);
the service component (SR) will only be created on a specific edge node along with the external
interface creation.
Unlike the Tier-0 gateway, the Tier-1 gateway does not support northbound connectivity to the
physical infrastructure. A Tier-1 gateway can only connect northbound to:
• a Tier-0 gateway
• a service interface, which is used to connect a one-arm load-balancer to a segment. More
details are available in CHAPTER 6.
Note that connecting Tier-1 to Tier-0 is a one click configuration or one API call configuration
regardless of components instantiated (DR and SR) for that gateway.
• Router Link Interface/Linked Port: Interface connecting Tier-0 and Tier-1 gateways. Each
Tier-0-to-Tier-1 peer connection is provided a /31 subnet within the 100.64.0.0/16
reserved address space (RFC6598). This link is created automatically when the Tier-0 and
Tier-1 gateways are connected. This subnet can be changed when the Tier-0 gateway is
being created. It is not possible to change it afterward.
• Service Interface: Interface connecting VLAN segments to provide connectivity to VLAN
backed physical or virtual workloads. Service interface can also be connected to
overlay/VLAN segments for standalone load balancer use cases explained in load
balancer Chapter 6. Service Interface supports static routing. It is supported on both Tier-0 and
Tier-1 gateways configured in Active/Standby high-availability configuration mode
explained in section 4.6.2. Note that a Tier-0 or Tier-1 gateway must have an SR
component to realize service interfaces. This interface was referred to as centralized
service interface in previous releases. Dynamic Routing is not supported on Service
Interfaces.
• Loopback: The Tier-0 gateway supports loopback interfaces. A loopback interface is a
virtual interface, and it can be redistributed into a routing protocol.
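To make the Router Link addressing concrete, the following Python sketch uses the standard `ipaddress` module to carve /31 subnets out of the 100.64.0.0/16 pool and to list the two addresses of one link. The function name and the sequential allocation order are assumptions for illustration; NSX selects the actual /31 for each Tier-0-to-Tier-1 connection itself.

```python
import ipaddress
from itertools import islice


def routerlink_subnets(pool: str = "100.64.0.0/16", count: int = 2) -> list:
    """Return the first `count` /31 subnets of the Router Link pool."""
    return list(islice(ipaddress.ip_network(pool).subnets(new_prefix=31), count))


def link_endpoints(subnet: str) -> list:
    """A /31 holds exactly two addresses, one per gateway on the link."""
    return [str(address) for address in ipaddress.ip_network(subnet)]
```

For example, `link_endpoints("100.64.224.0/31")` returns `['100.64.224.0', '100.64.224.1']`, and the default /16 pool holds 32768 such /31 links.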
• Tier-1 Gateway
o Connected – Connected routes on Tier-1 include segment subnets connected to Tier-
1 and service interface subnets configured on Tier-1 gateway.
o In FIGURE 4-12, 172.16.10.0/24 (Connected segment) and 192.168.10.0/24 (Service
Interface) are connected routes for Tier-1 gateway.
o Static– User configured static routes on Tier-1 gateway.
o NAT IP – NAT IP addresses owned by the Tier-1 gateway discovered from NAT rules
configured on the Tier-1 gateway.
o LB VIP – IP address of load balancing virtual server.
o LB SNAT – IP address or a range of IP addresses used for Source NAT by load balancer.
o IPsec Local IP – Local IPsec endpoint IP address for establishing VPN sessions.
o DNS Forwarder IP – Listener IP for DNS queries from clients. Also used as the source
IP to forward DNS queries to the upstream DNS server.
“Tier-1 Gateway” advertises connected routes to Tier-0 Gateway. Figure 4-12 shows an
example of connected routes (172.16.10.0/24 and 192.168.10.0/24). If there are other route
types, like NAT IP etc. as discussed in section 4.2.2, a user can advertise those route types as
well. As soon as “Tier-1 Gateway” is connected to “Tier-0 Gateway”, the management plane
configures a default route on “Tier-1 Gateway” with next hop IP address as RouterLink interface
IP of “Tier-0 Gateway” i.e. 100.64.224.0/31 in the example above.
Tier-0 Gateway sees 172.16.10.0/24 and 192.168.10.0/24 as Tier-1 Connected routes (t1c) with
a next hop of 100.64.224.1/31. Tier-0 Gateway also has Tier-0 “Connected” routes
(172.16.20.0/24) in Figure 4-12.
Northbound, “Tier-0 Gateway” redistributes the Tier-0 connected and Tier-1 connected routes
in BGP and advertises these routes to its BGP neighbor, the physical router.
FIGURE 4-13 shows both logical and per transport node views of two Tier-1 gateways serving
two different tenants and a Tier-0 gateway. Per transport node view shows that the distributed
component (DR) for Tier-0 and the Tier-1 gateways have been instantiated on two hypervisors.
If “VM1” in tenant 1 needs to communicate with “VM3” in tenant 2, routing happens locally on
hypervisor “HV1”. This eliminates the need to send traffic to a centralized location to route
between different tenants or environments.
Multi-Tier Distributed Routing with Workloads on the same Hypervisor
The following list provides a detailed packet walk between workloads residing in different
tenants but hosted on the same hypervisor.
1. “VM1” (172.16.10.11) in tenant 1 sends a packet to “VM3” (172.16.201.11) in tenant 2.
The packet is sent to its default gateway interface located on tenant 1, the local Tier-1
DR.
2. Routing lookup happens on the tenant 1 Tier-1 DR and the packet is routed to the Tier-0
DR following the default route. This default route has the RouterLink interface IP
address (100.64.224.0/31) as a next hop.
3. Routing lookup happens on the Tier-0 DR. It determines that the 172.16.201.0/24
subnet is learned from the tenant 2 Tier-1 DR (100.64.224.3/31) and the packet is
routed there.
4. Routing lookup happens on the tenant 2 Tier-1 DR. This determines that the
172.16.201.0/24 subnet is directly connected. L2 lookup is performed in the local MAC
table to determine how to reach “VM3” and the packet is sent.
The reverse traffic from “VM3” follows a similar process. A packet from “VM3” to destination
172.16.10.11 is sent to the tenant-2 Tier-1 DR, then follows the default route to the Tier-0 DR.
The Tier-0 DR routes this packet to the tenant 1 Tier-1 DR and the packet is delivered to “VM1”.
During this process, the packet never left the hypervisor to be routed between tenants.
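The packet walk above is a chain of longest-prefix-match lookups that all run on the same hypervisor. The Python sketch below models it with three routing tables. The prefixes come from the text, while the router names, the table layout and the use of router names as next hops are simplifications made for this example.

```python
import ipaddress

# Next hops are router names rather than RouterLink IP addresses, to keep
# the sketch short. "connected" means the destination subnet is attached.
TABLES = {
    "T1-DR-tenant1": {"172.16.10.0/24": "connected", "0.0.0.0/0": "T0-DR"},
    "T0-DR": {"172.16.10.0/24": "T1-DR-tenant1",
              "172.16.201.0/24": "T1-DR-tenant2"},
    "T1-DR-tenant2": {"172.16.201.0/24": "connected", "0.0.0.0/0": "T0-DR"},
}


def walk(first_router: str, destination: str) -> list:
    """Return the ordered list of routers that perform a routing lookup."""
    address = ipaddress.ip_address(destination)
    path, router = [], first_router
    while router != "connected":
        if router in path:
            raise RuntimeError("routing loop detected")
        path.append(router)
        matches = [ipaddress.ip_network(prefix) for prefix in TABLES[router]
                   if address in ipaddress.ip_network(prefix)]
        if not matches:
            raise LookupError(f"no route to {destination} on {router}")
        best = max(matches, key=lambda network: network.prefixlen)
        router = TABLES[router][str(best)]
    return path
```

`walk("T1-DR-tenant1", "172.16.201.11")` visits the tenant 1 Tier-1 DR, the Tier-0 DR and the tenant 2 Tier-1 DR, matching steps 2 to 4 of the list. The reverse direction visits the same three routers in the opposite order.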
The following list provides a detailed packet walk between workloads residing in different
tenants and hosted on different hypervisors.
1. “VM1” (172.16.10.11) in tenant 1 sends a packet to “VM2” (172.16.200.11) in tenant 2.
VM1 sends the packet to its default gateway interface located on the local Tier-1 DR in
HV1.
2. Routing lookup happens on the tenant 1 Tier-1 DR and the packet follows the default
route to the Tier-0 DR with a next hop IP of 100.64.224.0/31.
3. Routing lookup happens on the Tier-0 DR which determines that the 172.16.200.0/24
subnet is learned via the tenant 2 Tier-1 DR (100.64.224.3/31) and the packet is routed
accordingly.
4. Routing lookup happens on the tenant 2 Tier-1 DR which determines that the
172.16.200.0/24 subnet is a directly connected subnet. A lookup is performed in the ARP
table to determine the MAC address associated with the “VM2” IP address. This
destination MAC is learned via the remote TEP on hypervisor “HV2”.
5. The “HV1” TEP encapsulates the packet and sends it to the “HV2” TEP, finally leaving the
host.
6. The “HV2” TEP decapsulates the packet and recognizes the VNI in the Geneve header. An
L2 lookup is performed in the local MAC table associated to the LIF where “VM2” is
connected.
7. The packet is delivered to “VM2”.
The return packet follows the same process. A packet from “VM2” gets routed to the local
hypervisor Tier-1 DR and is sent to the Tier-0 DR. The Tier-0 DR routes this packet to tenant 1
Tier-1 DR which performs the L2 lookup to find out that the MAC associated with “VM1” is on
remote hypervisor “HV1”. The packet is encapsulated by “HV2” and sent to “HV1”, where this
packet is decapsulated and delivered to “VM1”. It is important to notice that in this use case,
routing is performed locally on the hypervisor hosting the VM sourcing the traffic.
Routing Capabilities
NSX supports static routing as well as dynamic routing protocols to provide connectivity to the
IPv4 and IPv6 workloads.
Static routing
In a multi-tier routing architecture, as described previously, NSX automatically creates the
RouterLink segment, ports, and static routes that interconnect a Tier-0 gateway with one or
more Tier-1 gateways.
By default, the RouterLink segment is in the 100.64.0.0/16 IPv4 address range, which falls
within the shared IPv4 address space (100.64.0.0/10) reserved by RFC 6598. A default static
route pointing to the Tier-0 gateway DR RouterLink port is automatically installed in the
Tier-1 gateway routing table.
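The relationship between the default RouterLink range and the RFC 6598 block can be checked with the Python standard library. This is a sketch only: the sequential allocation order in `routerlink_subnets` is an assumption for illustration and does not describe how NSX actually assigns RouterLink subnets.

```python
import ipaddress

shared_space = ipaddress.ip_network("100.64.0.0/10")     # RFC 6598 shared address space
routerlink_pool = ipaddress.ip_network("100.64.0.0/16")  # default range stated in the text

def routerlink_subnets(pool, count):
    """Return the first `count` /31 subnets of the pool (illustrative order only)."""
    subnets = pool.subnets(new_prefix=31)
    return [next(subnets) for _ in range(count)]

print(routerlink_pool.subnet_of(shared_space))
for net in routerlink_subnets(routerlink_pool, 2):
    # Each /31 holds exactly two addresses, one per end of the RouterLink.
    print(net, [str(addr) for addr in net])
```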
Northbound, static routes can be configured on Tier-1 gateways with the next hop IP set to the
RouterLink IP of the Tier-0 gateway (in the 100.64.0.0/16 range, or in the range defined by the
user for the RouterLink interface when the Tier-0 gateway is created). Southbound, static routes
can also be configured on Tier-1 gateways with a next hop of a layer 3 device reachable via a
service or downlink interface.
Tier-0 gateways can be configured with a static route toward external subnets with a next hop
IP of the physical router. Southbound, static routes can be configured on Tier-0 gateways with a
next hop of a layer 3 device reachable via a service interface.
ECMP is supported with static routes to provide load balancing, increased bandwidth, and fault
tolerance for failed paths or Edge nodes. Figure 4-15 shows a Tier-0 gateway with two
external interfaces on two Edge nodes, EN1 and EN2, connected to two physical routers. Two
equal-cost static default routes are configured for ECMP on the Tier-0 gateway. Up to eight
paths are supported in ECMP.
Tier-0 northbound static routes can be protected via BFD. BFD timers depend on the Edge node
type. Bare metal Edge supports a minimum of 50ms TX/RX BFD keep alive timer while the VM
form factor Edge supports a minimum of 500ms TX/RX BFD keep alive timer.
It is recommended to always implement BFD when configuring static routing between the Tier-0
gateway and the physical network. BFD detects upstream physical network failures in
the absence of a dynamic routing protocol.
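To relate the minimum timers above to failure detection time, the following sketch multiplies the BFD interval by a detection multiplier. The multiplier of 3 is an assumption chosen for illustration, and the function and dictionary names are hypothetical.

```python
# Minimum TX/RX BFD intervals per Edge form factor, as stated in the text (milliseconds).
MIN_BFD_INTERVAL_MS = {"bare_metal": 50, "vm": 500}

def bfd_detection_time_ms(edge_type, interval_ms, multiplier=3):
    """Approximate detection time = interval x multiplier (the multiplier is an assumption)."""
    minimum = MIN_BFD_INTERVAL_MS[edge_type]
    if interval_ms < minimum:
        raise ValueError(f"{edge_type} Edge supports a minimum interval of {minimum} ms")
    return interval_ms * multiplier

print(bfd_detection_time_ms("bare_metal", 50))
print(bfd_detection_time_ms("vm", 500))
```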
First hop redundancy protocols on the physical network and a virtual IP (VIP) on the Tier-0
Gateway can also be implemented to provide high availability, but they are less reliable and
slower than BFD, cannot provide ECMP, and VIP only works with active/standby Tier-0
gateways. They should only be considered if enabling BFD is not an option.
BGP session reestablishes during this grace period, route revalidation is done, and the
forwarding table is updated. If the BGP session does not reestablish within this grace period,
the router flushes the stale routes.
The BGP session will not be GR capable if only one of the peers advertises it in the BGP OPEN
message; GR needs to be configured on both ends. GR can be enabled or disabled per Tier-0
gateway. The GR restart timer is 180 seconds by default. Changing it after a BGP peering
adjacency has reached the established state requires the peering to be negotiated again.
• If no loopback is configured, RID will be equal to the highest numeric interface IPv4
address.
• If a loopback is configured the RID will be equal to the highest numeric loopback IPv4
address.
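The two selection rules above can be expressed in a few lines. This is a minimal sketch with a hypothetical function name; it compares addresses numerically rather than as strings, and the sample addresses are illustrative.

```python
import ipaddress

def select_router_id(interface_ips, loopback_ips=()):
    """Pick the highest loopback IPv4 address if any exist, else the highest interface address."""
    candidates = loopback_ips if loopback_ips else interface_ips
    return str(max(ipaddress.IPv4Address(ip) for ip in candidates))

# No loopback configured: highest interface address wins.
print(select_router_id(["192.168.240.1", "192.168.250.1"]))
# Loopbacks configured: they take precedence over interface addresses.
print(select_router_id(["192.168.240.1"], loopback_ips=["10.0.0.1", "10.0.0.2"]))
```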
NSX does not support manual configuration of the OSPF router ID. Once calculated, the router ID
is not preempted, because recalculating it would reset the OSPF adjacency and
disrupt traffic forwarding.
To change the RID, the OSPF process must be restarted either using the UI or an API call. An
NSX edge reboot will not change the RID. If the NSX edge is being redeployed, the OSPF RID will
be recalculated.
Once the OSPF RID is chosen, OSPF hello messages are sent on the OSPF enabled external
uplinks using the multicast address “224.0.0.5” as represented on FIGURE 4-16.
Figure 4-16 - OSPF Adjacency between the Tier-0 SR and the physical network fabric
As explained earlier, Hello messages are sent on the OSPF enabled external uplink to the
multicast address 224.0.0.5 (AllSPFRouters). The following conditions are checked through
the Hello exchange before the adjacency is established:
• Interfaces between the OSPF routers must be in the same subnet.
• Interfaces between the OSPF routers must belong to the same OSPF area.
• Interfaces between the OSPF routers must have the same OSPF area type.
• Router ID must be unique between the OSPF Routers.
• OSPF timers (HelloInterval and RouterDeadInterval) must match between the OSPF
routers.
• Authentication process must be validated.
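The checks listed above can be sketched as a comparison between the two ends of a link. The `OspfInterface` record, its field names, and the sample values (timers of 10 and 40 seconds, addresses, key) are hypothetical and only serve the illustration; a real OSPF implementation validates these fields while processing received Hello packets.

```python
from dataclasses import dataclass, replace
import ipaddress

@dataclass(frozen=True)
class OspfInterface:
    router_id: str
    address: str         # interface address with prefix length
    area: str
    area_type: str       # for example "normal" or "nssa"
    hello_interval: int  # seconds
    dead_interval: int   # seconds
    auth: tuple          # (method, key)

def adjacency_blockers(a, b):
    """Return the list of mismatches that would prevent an adjacency between a and b."""
    problems = []
    if ipaddress.ip_interface(a.address).network != ipaddress.ip_interface(b.address).network:
        problems.append("subnet")
    if a.area != b.area:
        problems.append("area")
    if a.area_type != b.area_type:
        problems.append("area type")
    if a.router_id == b.router_id:
        problems.append("router ID")
    if (a.hello_interval, a.dead_interval) != (b.hello_interval, b.dead_interval):
        problems.append("timers")
    if a.auth != b.auth:
        problems.append("authentication")
    return problems

t0_sr = OspfInterface("10.0.0.11", "192.168.240.1/31", "0.0.0.0", "normal", 10, 40, ("md5", "s3cret"))
tor = OspfInterface("10.0.0.1", "192.168.240.0/31", "0.0.0.0", "normal", 10, 40, ("md5", "s3cret"))
print(adjacency_blockers(t0_sr, tor))
print(adjacency_blockers(t0_sr, replace(tor, hello_interval=5)))
```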
OSPF Authentication
The Tier-0 gateway supports the following authentication methods:
• None
• Password: The password is sent in clear text over the external uplink interface
• MD5: A message digest is sent over the external uplink interface instead of the password
NSX supports area-wide authentication and does not support authentication per interface.
VMware recommends authenticating every dynamic routing protocol adjacency using the MD5
method to exchange routes in a secure way.
NSX currently supports up to 8 characters for the authentication password.
• Full:
o Routers have exchanged their LSDb and are fully adjacent.
After both routers have exchanged their LSDb, they run the Dijkstra algorithm. Because their
LSDbs are identical, they possess the same knowledge of the OSPF topology. The Dijkstra
algorithm is run so that each router can calculate its own best path to every destination.
Graceful Restart (GR)
Graceful restart in OSPF allows a neighbor to preserve its forwarding table while the control
plane restarts. It is recommended to enable OSPF Graceful restart when the OSPF peer has
multiple supervisors. An OSPF control plane restart could happen due to a supervisor
switchover in a dual supervisor hardware, planned maintenance, or active routing engine crash.
As soon as a GR-enabled router restarts (control plane failure), it preserves its forwarding table,
marks the routes as stale, and sets a grace period restart timer for the OSPF adjacency to
reestablish. If the OSPF adjacency reestablishes during this grace period, route revalidation is
done, and the forwarding table is updated. If the OSPF adjacency does not reestablish within
this grace period, the router flushes the stale routes.
OSPF Graceful restart helper mode is supported.
maximum number of routers fully establishing an OSPF adjacency using the Point-to-Point
network type is two. No DR or BDR election is performed using this OSPF network type. The
Tier-0 SR OSPF Routers are considered “DROther” (OSPF priority of 0).
It is recommended to configure the Tier-0 SR uplink interfaces using a /31 network mask as
depicted on FIGURE 4-20:
In this example, the Tier-0 SR hosted on both the Edge Node 01 and Edge Node 02 will establish
an adjacency with both the physical router 1 and the physical router 2.
From a global Tier-0 perspective, all four adjacencies to the physical routers will be in the “Full”
state.
OSPF Broadcast Network Type:
On an Ethernet segment using the OSPF broadcast type, more than two OSPF routers can
exchange their LSAs. To reduce LSA flooding on that segment, the OSPF process elects a
Designated Router (DR) and a Backup Designated Router (BDR). The router with the highest OSPF
priority (1-255) is elected DR. The router with the second highest OSPF priority is elected BDR.
All other routers on the segment are considered OSPF DROthers.
If the priorities of the candidates for the DR or BDR role are equal, the router with the
highest OSPF Router ID is elected.
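The election rule described above can be sketched as a sort on priority with the Router ID as tie-breaker. This is a deliberate simplification: it follows the description in this section and does not model the full election procedure of the OSPF specification, such as the fact that an already elected DR is not preempted. The router IDs and priorities are illustrative values.

```python
import ipaddress

def elect_dr_bdr(routers):
    """routers: dict of router ID -> OSPF priority. Returns (dr, bdr).

    Priority 0 means the router never takes the DR or BDR role.
    """
    eligible = [rid for rid, prio in routers.items() if prio > 0]
    ranked = sorted(
        eligible,
        key=lambda rid: (routers[rid], int(ipaddress.IPv4Address(rid))),
        reverse=True,
    )
    dr = ranked[0] if ranked else None
    bdr = ranked[1] if len(ranked) > 1 else None
    return dr, bdr

segment = {
    "10.0.0.1": 1,   # physical router 1
    "10.0.0.2": 1,   # physical router 2
    "10.0.0.11": 0,  # router with priority 0, always a DROther
    "10.0.0.12": 0,  # router with priority 0, always a DROther
}
print(elect_dr_bdr(segment))
```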
All DROther routers will establish an adjacency with both the DR and the BDR as demonstrated
on FIGURE 4-21. These adjacencies will be in the “Full” state and LSUs can be exchanged. The DR
and BDR also establish an adjacency between themselves.
Figure 4-21: OSPF adjacency in the Full State – Broadcast network type
The DROther routers will see each other in the 2-Way state but will not exchange their LSAs
directly between themselves; the DR and BDR perform that function.
Figure 4-22: OSPF adjacency in the “2-Way” State – Broadcast network type
In this example, each router will have 2 adjacencies in the Full State and 2 adjacencies in the “2-
Way” state.
From an NSX perspective, the Tier-0 gateway is always a DROther as its OSPF priority is hard-
coded to 0. It is not possible to change that priority.
FIGURE 4-23 demonstrates the OSPF adjacencies in the “Full” state between the physical
networking fabric and the Tier-0 Service Routers.
Figure 4-23: OSPF adjacency in the Full State – Broadcast network type
FIGURE 4-24 demonstrates the OSPF adjacencies in the “2-Way” state between the Tier-0
Service Routers.
When OSPF uses the broadcast network type, the Database Description packets are sent from
the Tier-0 SR (DROthers) to the Designated Router in the networking fabric using the multicast
address 224.0.0.6 (ALL OSPF DR Router multicast address). The DR sends its Database
Description packets using the 224.0.0.5 multicast address (ALL OSPF Router multicast address).
The number of adjacencies is increased using this network type which can increase the
complexity of the OSPF topology.
The OSPF LSDb is identical throughout the entire area; therefore, there is no need to establish
an adjacency between the Tier-0 SRs.
As explained earlier, VMware recommends using the Point-to-Point OSPF network type over
the Broadcast network type to simplify the routing topology.
This can be problematic in a very large-scale environment where it would be unnecessary for all
routers to recompute the Dijkstra algorithm when a networking factor has changed.
OSPF routers and links can be divided into different areas to limit LSA flooding and
reduce how often the SPF algorithm runs.
Routers with links in both a backbone area and a standard area are considered Area
Border Routers (ABRs). The Tier-0 gateway cannot perform ABR duties since only one area per
Tier-0 is supported.
FIGURE 4-25 represents the OSPF LSAs that are supported in an NSX OSPF architecture.
The OSPF areas supported by NSX are listed in the FIGURE 4-26 and represented in FIGURE 4-27.
Since a Tier-0 service router redistributes the routes into the OSPF domain, it can be considered
as an Autonomous System Boundary Router. An ABR in the OSPF topology will inject LSA type 4
in other areas to make sure the OSPF routers know how to reach the ASBR. When a Tier-0 SR
redistributes a prefix into OSPF, it uses an external metric of type 2:
• Total cost to reach the prefix is always the redistributed metric. Internal cost to reach the
ASBR is ignored.
The OSPF External type 1 is not supported by NSX when a prefix is redistributed.
FIGURE 4-28 represents the different external metric types used by the Tier-0 when different
area types are used:
• Standard and Backbone areas: Routes are redistributed using LSAs type 5 with an
external type of E2 and a cost of 20.
• When the Tier-0 gateway is connected to a not-so-stubby area (NSSA), it redistributes its
routes using LSAs type 7 with an external type of N2 and a cost of 20.
The OSPF metric cannot be changed in NSX and will have a hard coded value of 20 throughout
the OSPF domain.
Not So Stubby Areas are recommended for very large topologies. Instead of redistributing the
prefixes using LSA type 5 in the OSPF domain, LSA type 7 will be used. This type of LSA is
originated by the ASBR and flooded in the NSSA only. Another ABR in the OSPF domain will
translate that LSA type 7 into an LSA type 5 that will be transmitted throughout the OSPF
domain.
With the advent of OSPF support in NSX 3.1.1, it is now possible to choose which protocol the
routes can be redistributed into. It is possible to redistribute prefixes in BGP or OSPF.
It is possible to use OSPF to learn the IP addresses of BGP peers in an E-BGP multi-hop
topology.
FIGURE 4-29 represents an example where OSPF and BGP can be used on the same Tier-0
gateway. OSPF is used as an IGP to provide connectivity between the loopback interfaces.
BGP can be set up between the physical routers’ loopback interfaces and the Tier-0 loopback
interfaces. This solution should only be considered when direct single hop eBGP peering is not
an option. Chapter 7 provides guidance on the recommended routing configuration for BGP
peering on directly connected segments.
Figure 4-29: OSPF used to learn Loopback reachability to establish an E-BGP multi-hop peering
Since the internal transit interface on the standby Tier-0 SR is administratively shut down, the
physical fabric needs to send the traffic to the NSX domain via the active Tier-0 SR.
The standby Tier-0 SR redistributes the internal NSX prefixes in OSPF and advertises them
with a cost of 65534 (External Type E2 or N2).
The active Tier-0 SR redistributes the same prefixes with a cost of 20. The networking fabric
prefers the prefixes with the lower cost and therefore uses the links via the active Tier-0
SR.
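The effect of the two costs can be sketched as a lowest-metric selection on the physical router. The names below are hypothetical, and the sketch relies on the behavior described earlier for external type 2 routes, where only the redistributed metric is compared.

```python
ACTIVE_COST = 20       # cost advertised by the active Tier-0 SR (from the text)
STANDBY_COST = 65534   # cost advertised by the standby Tier-0 SR (from the text)

def preferred_advertisers(adverts):
    """adverts: dict of advertising SR -> E2/N2 metric. Returns the SRs with the lowest metric."""
    best = min(adverts.values())
    return sorted(sr for sr, metric in adverts.items() if metric == best)

adverts = {"t0-sr-active": ACTIVE_COST, "t0-sr-standby": STANDBY_COST}
print(preferred_advertisers(adverts))
```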
In this topology, ECMP with the top of rack switches can still be leveraged as demonstrated on
FIGURE 4-29. There is no ECMP between the Tier-0 DR and the Tier-0 SR.
Figure 4-29: ECMP between the physical fabric and the Active Tier-0 SR
If the IP addressing schema has been designed properly for summarization (contiguous
subnets), this large number of Type 5 LSAs can be replaced by a single summary prefix
advertised in a single LSA Type 5, as depicted in figure 4-32.
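The arithmetic behind summarization can be illustrated with the Python standard library. This sketch only shows which contiguous subnets collapse into one prefix; the sample prefixes are illustrative, and it does not represent how summarization is configured in NSX.

```python
import ipaddress

def summarize(prefixes):
    """Collapse contiguous prefixes into the smallest covering set."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(net) for net in ipaddress.collapse_addresses(nets)]

# Four contiguous /24 subnets collapse into a single /22.
print(summarize(["172.16.0.0/24", "172.16.1.0/24", "172.16.2.0/24", "172.16.3.0/24"]))
# Non-contiguous subnets cannot be summarized without covering unused space.
print(summarize(["172.16.0.0/24", "172.16.2.0/24"]))
```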
When the area connected to the Tier-0 gateway is an NSSA, route summarization is supported
and the type of LSA injected in the area for summarization is an LSA Type 7.
From a dynamic routing perspective, there are two options when designing a
modern data center routing architecture with NSX:
• OSPFv2
• BGP
BGP is well known in the networking industry as the best routing protocol for
interoperability, since it exchanges prefixes between autonomous systems on the internet.
Modern and scalable data center architectures rely on BGP to provide connectivity because
multiprotocol support is not possible with OSPFv2. OSPFv2 implementations require a different
routing protocol or static routing to route IPv6 packets.
VMware recommends using BGP over OSPF as it provides more flexibility, has a better feature
set, and supports multiple address families.
OSPF should be considered when it is the only routing protocol running in the physical fabric
and no feature available only with BGP is required. In such a scenario OSPF avoids the
implementation of routing redistribution leading to simpler and more scalable design.
A comparison of features between OSPF and BGP is described in the Table 4-6.
Figure 4-33: Type of IPv6 addresses supported on Tier-0 and Tier-1 Gateway components
FIGURE 4-34 shows a single tiered routing topology on the left side with a Tier-0 Gateway
supporting dual stack on all interfaces and a multi-tiered routing topology on the right side with
a Tier-0 Gateway and Tier-1 Gateway supporting dual stack on all interfaces. A user can either
assign static IPv6 addresses to the workloads or use a DHCPv6 relay supported on gateway
interfaces to get dynamic IPv6 addresses from an external DHCPv6 server.
For a multi-tier IPv6 routing topology, each Tier-0-to-Tier-1 peer connection is provided a /64
unique local IPv6 address from a pool, e.g., fc5f:2b61:bd01::/48. A user has the flexibility to
change this subnet range and use another subnet if desired. Similar to IPv4, this IPv6 address is
auto plumbed by the system in the background.
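The relationship between the /48 pool and the per-link /64 prefixes can be illustrated as follows. The sequential order used by `first_links` is an assumption for illustration; the text does not specify how the system picks a /64 from the pool.

```python
import ipaddress
from itertools import islice

pool = ipaddress.ip_network("fc5f:2b61:bd01::/48")  # example pool from the text

def first_links(pool, count):
    """Return the first `count` /64 prefixes of the pool (illustrative order only)."""
    return [str(net) for net in islice(pool.subnets(new_prefix=64), count)]

print(first_links(pool, 2))
# Number of /64 prefixes available in a /48 pool.
print(2 ** (64 - pool.prefixlen))
```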
Active/Active
Active/Active – This is a high availability mode where all SRs hosted on Edge nodes act as active
forwarders. Stateless services such as layer 3 forwarding are IP based, so it does not matter
which Edge node receives and forwards the traffic. All the SRs configured in active/active
configuration mode are active forwarders. This high availability mode has traditionally been
available only on Tier-0 gateways, but starting with NSX 4.0, Tier-1 gateways support
active/active with stateful services. See SECTION 4.5.3 for more details.
Stateful services typically require tracking of connection state (e.g., sequence number check,
connection state), thus traffic for a given session needs to go through the same Edge node. Tier-
0 gateways can be implemented in stateful or stateless mode. In stateless mode a reduced
number of services are available. Such services include reflexive NAT and stateless firewall.
The left side of FIGURE 4-36 shows a Tier-0 gateway (configured in active/active high availability
mode) with two external interfaces leveraging two different Edge nodes, EN1 and EN2. The right
side of the diagram shows the services router component (SR) of this Tier-0 gateway
instantiated on both Edge nodes, EN1 and EN2. The diagram also shows a compute host (ESXi)
that has only the distributed component (DR) of the Tier-0 gateway.
Note that Tier-0 SR on Edge nodes, EN1 and EN2 have different IP addresses northbound
toward physical routers and different IP addresses southbound towards Tier-0 DR.
Management plane configures two default routes on Tier-0 DR with next hop as SR on EN1
(169.254.0.2) and SR on EN2 (169.254.0.3) to provide ECMP for overlay traffic coming from
compute hosts.
North-South traffic from overlay workloads hosted on Compute hosts will be load balanced and
sent to SR on EN1 or EN2, which will further do a routing lookup to send traffic out to the
physical infrastructure.
A user does not have to configure these static default routes on Tier-0 DR. Automatic plumbing
of default route happens in background depending upon the HA mode configuration.
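The load balancing across the two auto-plumbed default routes can be pictured as a per-flow hash over the available next hops. The sketch below is an assumption for illustration: the hash inputs and the hash function are hypothetical, and the text does not describe the algorithm NSX uses.

```python
import hashlib

# Next hops of the two default routes on the Tier-0 DR, as given in the text.
SR_NEXT_HOPS = ["169.254.0.2", "169.254.0.3"]

def pick_next_hop(src_ip, dst_ip, next_hops=SR_NEXT_HOPS):
    """Hash a flow key so that all packets of one flow use the same SR (inputs are assumed)."""
    key = f"{src_ip}->{dst_ip}".encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

hop = pick_next_hop("172.16.10.11", "192.0.2.7")
print(hop)
```

Because the selection depends only on the flow key, packets of the same flow always reach the same SR, while different flows spread across EN1 and EN2.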
Inter-SR Routing
To provide redundancy for physical router failure, Tier-0 SRs on both Edge nodes must establish
routing adjacencies or exchange routing information with different physical routers or ToR
switches. These
physical routers may or may not have the same routing information. For instance, a route
192.168.100.0/24 may only be available on physical router 1 and not on physical router 2.
For such asymmetric topologies, users can enable Inter-SR routing. This feature is only available
on Tier-0 gateway configured in active/active high availability mode. FIGURE 4-37 shows an
asymmetric routing topology with Tier-0 gateway on Edge node, EN1 and EN2 peering with
physical router 1 and physical router 2, both advertising different routes.
When Inter-SR routing is enabled by the user, an overlay segment is auto plumbed between SRs
(similar to the transit segment auto plumbed between DR and SR) and each end gets an IP
address assigned in 169.254.0.128/25 subnet by default. An IBGP session is automatically
created between Tier-0 SRs and northbound routes (EBGP and static routes) are exchanged on
this IBGP session.
As explained in previous figure, Tier-0 DR has auto plumbed default routes with next hops as
Tier-0 SRs and North-South traffic can go to either SR on EN1 or EN2. In case of asymmetric
routing topologies, a particular Tier-0 SR may or may not have the route to a destination. In
that case, traffic can follow the IBGP route to another SR that has the route to destination.
FIGURE 4-37 shows a topology where the Tier-0 SR on EN1 learns a default WAN route 0.0.0.0/0
and a corporate prefix 192.168.100.0/24 from physical router 1 and physical router 2,
respectively. If the “External 1” interface on the Tier-0 fails and traffic from compute workloads
destined to the WAN lands on the Tier-0 SR on EN1, this traffic can follow the default route
(0.0.0.0/0) learned via IBGP from the Tier-0 SR on EN2. The traffic is sent to EN2 through the
Geneve overlay. After a route lookup on the Tier-0 SR on EN2, this N-S traffic can be sent to
physical router 1 using “External interface 3”.
Active/Standby
Active/Standby – This is a high availability mode where only one SR acts as an active forwarder.
This mode is required when stateful services are enabled. Services like NAT are in constant
state of sync between active and standby SRs on the Edge nodes. This mode is supported on
both Tier-1 and Tier-0 SRs. Preemptive and Non-Preemptive modes are available for both Tier-0
and Tier-1 SRs. Default mode for gateways configured in active/standby high availability
configuration is non-preemptive.
A user can select the preferred member (Edge node) when a gateway is configured in
active/standby preemptive mode. When enabled, preemptive behavior allows an SR to resume
the active role on the preferred edge node as soon as it recovers from a failure.
For Tier-1 gateways, active/standby SRs have the same IP addresses northbound. Only the active
SR replies to ARP requests, while the operational state of the standby SR interfaces is set to
down so that they automatically drop packets.
For Tier-0 Gateway, active/standby SRs have different IP addresses northbound and both have
BGP sessions or OSPF adjacencies established on their uplinks. Both Tier-0 SRs (active and
standby) receive routing updates from physical routers and advertise routes to the physical
routers; however, the standby Tier-0 SR prepends its local AS three times in case of BGP or
advertises routes with a higher metric in case of OSPF so that traffic from the physical routers
prefers the active Tier-0 SR.
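The BGP side of this behavior can be sketched as AS-path prepending followed by a shortest-path comparison. The function names and the AS number are hypothetical, and the comparison is simplified: a real BGP best-path selection evaluates other attributes before the AS-path length.

```python
def advertise(prefix, local_as, standby):
    """Return (prefix, as_path) as advertised by a Tier-0 SR; the standby prepends three times."""
    prepends = 3 if standby else 0
    return prefix, [local_as] * (1 + prepends)

def best_path(adverts):
    """adverts: dict of peer name -> AS path. The shortest AS path wins (simplified)."""
    return min(adverts, key=lambda peer: len(adverts[peer]))

_, active_path = advertise("172.16.10.0/24", 65001, standby=False)
_, standby_path = advertise("172.16.10.0/24", 65001, standby=True)
print(active_path, standby_path)
print(best_path({"t0-sr-active": active_path, "t0-sr-standby": standby_path}))
```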
Southbound IP addresses on active and standby Tier-0 SRs are the same and the operational
state of standby SR southbound interface is down. Since the operational state of southbound
Tier-0 SR interface is down, the Tier-0 DR does not send any traffic to the standby SR. FIGURE
4-38 shows active and standby Tier-0 SRs on Edge nodes “EN1” and “EN2”.
Suppose the edge node connects to multiple dual supervisor systems. In that case, the network
architect will have to choose what type of failover mechanism to prioritize between graceful
restart and traffic rerouting over a different link.
The Tier0 represented in FIGURE 4-40 is running in a stateful active/active mode. Its SRs are
spread across four different edges. NSX has created two “punt” overlay segments and each SR
is attached to those two logical switches with a shadow port. In the above example, the shadow
port of SR4 connecting to the north punt logical switch is highlighted. SR4 will treat traffic
received on this shadow port as if it had been received on the Tier0 uplink of SR4. A similar
shadow port connects SR4 to the south punt logical switch. Traffic received on this shadow port
will be treated as if it had been received on the internal port (sometimes referred to as the
“backplane port”) connecting SR4 to other distributed routers (DRs).
This additional switching infrastructure will be used to redirect traffic for a specific flow to a
deterministic SR. The following FIGURE 4-41 illustrates how this is achieved in the “northbound”
direction. VM A sends traffic towards B, and ECMP inside NSX forwards the packets of this flow to
T0SR3, on edge3.
When edge3 receives this packet from A, a hash is taken based on the destination IP address of
the traffic. The resulting value is used as an index that selects SR4 in a table including all the SRs
for the Tier0 gateway. Edge3 uses the south punt logical switch to switch the packet to the
shadow port on SR4. Here, SR4 treats this packet as if it had been received on its backplane port
and routes it directly to its destination. Any packet to destination IP address B will be redirected
to SR4 with this model. Of course, the same mechanism is applied in the southbound direction,
as represented in the next diagram.
In FIGURE 4-42, host B replies to VM A. Its packets are routed by the physical infrastructure to an
arbitrary active SR, here SR1. When entering SR1, a hash is taken on the source IP address this
time. This allows selecting the same SR in the table as the one that was selected using the
destination IP address in the northbound direction. SR1 immediately switches those packets on
the northbound punt logical switch to the shadow port on SR4, where they are accepted as if they
had been received on the SR4 uplink.
This example is of course the worst-case scenario where traffic needs to be redirected in each
direction. Still, thanks to scaling out to four active SRs, the overall forwarding capability in/out
NSX is greatly enhanced compared to a single active SR.
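The symmetry of the redirection can be sketched by hashing the external endpoint's address in both directions: the destination IP northbound and the source IP southbound. The modulo operation below is an illustrative stand-in for the real hash, which the text does not specify, and the IP addresses are hypothetical.

```python
import ipaddress

SRS = ["SR1", "SR2", "SR3", "SR4"]  # the four active SRs of the example

def owner_sr(external_ip, srs=SRS):
    """Map the external endpoint's IP to one SR (modulo is a stand-in for the real hash)."""
    return srs[int(ipaddress.ip_address(external_ip)) % len(srs)]

def sr_for_packet(src_ip, dst_ip, direction):
    """Northbound packets hash the destination IP, southbound packets hash the source IP."""
    key = dst_ip if direction == "northbound" else src_ip
    return owner_sr(key)

vm_a, host_b = "172.16.10.11", "192.0.2.7"
print(sr_for_packet(vm_a, host_b, "northbound"))
print(sr_for_packet(host_b, vm_a, "southbound"))
```

Both directions of the flow resolve to the same SR because the hashed address is the same in each case, which is what allows that SR to hold the connection state.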
of Tier1 SRs. They could be on a dedicated edge cluster for example, and the user can explicitly
designate on which edges active and standby SRs will be located. Active/active stateful Tier1
gateways will only run in the very specific configuration where they share their edges with a
corresponding active/active Tier0 gateway, as represented below:
This model simplifies redundancy by sharing the fate of the Tier1 SR and the Tier0
SR. It also allows using the same pair of north/south punt logical switches for both gateways.
The user cannot explicitly select which edges are part of a sub-cluster; however, NSX attempts
to group edges from different FAILURE DOMAINS into a sub-cluster.
When using a stateful active/active Tier1, an internal group representing the connectivity
between the Tier1 and the Tier0 is automatically created by the system.
The second type of failover event affects an individual service or SR. If an individual SR
fails, the corresponding SR on the peer edge node takes over, while others may still
be running on the original node.
Failover events that belong to the first category cause all services and SRs to fail over to the
corresponding peer edge node. Those conditions are:
1. Dataplane service is not running.
2. All TEP interfaces are down. This condition only applies to bare metal edge nodes as the
interface of an edge node VM should never go “physically” down. If a bare metal edge
node has more than one TEP interface, all need to be down for this condition to take
effect.
3. The edge node enters maintenance mode.
4. Edge nodes TEPs run BFD with compute hosts and other edge nodes TEPs to monitor the
health of the overlay tunnels. If host transport nodes are present, when all the overlay
tunnels are down to both remote Edges and compute hypervisors, all the SRs on the
edge will be declared down. This condition does not apply if no host transport nodes are
present.
5. Edge nodes in an Edge cluster exchange additional BFD keepalives on two interfaces,
management and TEP interfaces. Each edge monitors the status of its peer edges on
those two interfaces, if both are down, it will consider the peer down, and it will make
the corresponding SRs active.
Failover events that belong to the second category are listed below. Only the specific SR is
declared down and individually fails over to the peer edge, without affecting other SRs:
6. Northbound routing on a Tier-0 SR is down. This is applicable to BGP and OSPF, with or
without BFD enabled, and to static routes with BFD enabled. When this situation occurs,
only the Tier-0 SR will be declared failed. Tier-1 SRs on the same edge will still be
active. If the Tier-0 is configured with static routes with no BFD, northbound routing will
never be considered down. Another situation when northbound routing on a Tier-0 SR is
considered down is when all the VLAN uplink interfaces of the edge node are physically
down. This condition only applies to bare metal edges because the edge VM vNICs are
virtual components that should never be in down state.
7. Services are configured on SR. The health score of services on the SR is less than that on
the peer SR.
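The locally observable conditions above can be sketched as a classification function that returns the scope of the failover. The `EdgeHealth` record and its field names are hypothetical. Condition 5 is omitted because it is evaluated by the peer edge, and condition 7 is omitted because it compares health scores between SRs.

```python
from dataclasses import dataclass

@dataclass
class EdgeHealth:
    dataplane_running: bool = True
    bare_metal: bool = False
    all_teps_down: bool = False
    maintenance_mode: bool = False
    host_transport_nodes_present: bool = True
    all_overlay_tunnels_down: bool = False
    tier0_northbound_routing_down: bool = False

def failover_scope(h):
    """Return "node" (all SRs fail over), "tier0-sr" (only the Tier-0 SR), or "none"."""
    if not h.dataplane_running:                                      # condition 1
        return "node"
    if h.bare_metal and h.all_teps_down:                             # condition 2, bare metal only
        return "node"
    if h.maintenance_mode:                                           # condition 3
        return "node"
    if h.host_transport_nodes_present and h.all_overlay_tunnels_down:  # condition 4
        return "node"
    if h.tier0_northbound_routing_down:                              # condition 6
        return "tier0-sr"
    return "none"

print(failover_scope(EdgeHealth(maintenance_mode=True)))
print(failover_scope(EdgeHealth(tier0_northbound_routing_down=True)))
```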
We will now review in more detail the failover triggers that require more careful
consideration from a design perspective.
FIGURE 4-50 outlines condition 5. The BFD sessions between two edge nodes are lost on both
the management and the overlay network. Standby Tier-0 and Tier-1 SRs will become active.
This condition addresses the scenario when an edge node fails or is completely isolated. When
designing edge node connectivity, it is important to consider corner-case situations when this
condition may apply, but the edge node is not down or completely isolated. For example, uplink
connectivity could be up while the overlay and management networks are down if the uplink
traffic uses dedicated pNICs. Designing for fate sharing between uplink, management, and
overlay traffic will mitigate the risk of a dual-active scenario. If dedicated pNICs are part of the
design, providing adequate pNIC redundancy to management and overlay traffic will ensure
that condition 5 is only triggered because of the complete failure of an edge.
Figure 4-50: Failover triggers – BFD on management and overlay network down
FIGURE 4-51 below outlines condition 6. A northbound connectivity issue is detected by the
dynamic routing protocol timers or BFD. In this case, only the affected Tier-0 will undergo a
failover event. Any Tier-1 SR on the same edge will remain active. Static routes without
BFD cannot detect a failure in the uplink network. For this reason, configurations
including static routes should always include BFD.
FIGURE 4-52 below outlines condition 4. In this case, edge node one did not fail, but it is not able
to communicate to any other transport node on the overlay network. This condition mitigates
situations when an edge node does not fail, but connectivity issues exist. All the SRs on the
affected edge node, including all Tier-0 and Tier-1 SRs, will failover. This condition only applies
to deployments including host transport nodes, which can help discriminate between the
failure of the peer edge and a network connectivity problem to it.
Figure 4-52: Failover triggers – All tunnels down (with host transport nodes)
FIGURE 4-53 outlines condition 4 again, but this time no host transport node is present in the
topology. The all-overlay-tunnels-down condition does not put the SRs running on edge node 2 in
a failed state. In the scenario depicted in the diagram, edge node 1 failed, and the SRs on
edge node 2 took over even though all the overlay tunnels on edge node 2 were down. If the
all-overlay-tunnels-down condition applied to deployments without host transport nodes, the SRs
running on edge node 2 would go into a failed state, blackholing the traffic. This topology
applies to VLAN-only deployments, where workloads are connected via service interfaces.
Note: VLAN-Only deployments with service interfaces must use a single TEP. Multi-TEP
configurations will cause the all overlay tunnel down condition to disable the surviving node.
Figure 4-53: Failover triggers – All tunnels down (without host transport nodes)
We should pay special attention to the implications of deploying multiple edge node VMs on
the same host and/or on hosts with TEP interfaces on the same network (possible starting with NSX
version 3.1) for the “all tunnels down” failover trigger. A failure of the upstream
overlay transport network (pNIC or switch level failure) will not bring down the tunnels local to
the host: those between edge VMs on the same host, or to the host TEP (see FIGURE 4-54). This
situation may prevent the edge node VMs from triggering a failover if northbound routing
connectivity is up. Traffic that the upstream network forwards to the affected edge node VMs
will be blackholed because of the lack of access to the overlay network.
Figure 4-54: Failover Triggers - All tunnels Down - Tunnels local to ESXi host
This condition can be avoided by designing for fate sharing between the overlay and upstream
network. If that is not a possibility based on the design requirements, we should eliminate the
possibility of establishing overlay tunnels within a host with one of these options:
• No more than one edge node VM per host on hosts not part of the same overlay TZ
• No more than one edge node VM per host on hosts part of the same overlay TZ but edge
and host overlay VLAN and subnet are different (BFD keepalives will need to be
forwarded to the physical network)
• With more than one edge node VM per host, edge node VMs should have different
overlay VLANs and subnets (BFD keepalives will need to be forwarded to the physical
network)
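The placement options above can be checked mechanically. The following sketch flags every host where two or more tunnel endpoints (edge VMs or the host's own TEP) share an overlay VLAN and could therefore keep a tunnel up locally. The data model is hypothetical, and it uses the overlay VLAN alone as a stand-in for the "VLAN and subnet" pair mentioned in the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EdgeVM:
    name: str
    host: str
    overlay_vlan: int


@dataclass(frozen=True)
class Host:
    name: str
    overlay_vlan: Optional[int]  # None: the host is not part of the overlay TZ


def local_tunnel_risks(edges, hosts):
    """Map (host, overlay VLAN) to the endpoints that could tunnel locally."""
    endpoints = defaultdict(list)
    for host in hosts:
        if host.overlay_vlan is not None:
            endpoints[(host.name, host.overlay_vlan)].append(f"{host.name}-tep")
    for edge in edges:
        endpoints[(edge.host, edge.overlay_vlan)].append(edge.name)
    # Two or more endpoints on the same host and VLAN can form a local tunnel.
    return {key: sorted(names) for key, names in endpoints.items() if len(names) > 1}
```

An empty result means that every overlay tunnel has to cross the physical network, so an upstream overlay failure brings all tunnels down and the failover trigger can fire.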
VRF Lite
Another supported design is to deploy a separate Tier-0 gateway for each tenant on a
dedicated tenant edge node. This configuration allows for duplicated IP addresses between
tenants, but requires dedicated edge nodes per tenant (no more than a single Tier-0 SR can run
on each edge node). While providing the required separation, this solution may be limited from
a scalability perspective.
FIGURE 4-57 shows a traditional multi-tenant architecture using dedicated Tier-0 per tenant in
NSX 2.x.
Figure 4-57: NSX 2.x multi-tenant architecture. Dedicated Tier0 for each tenant
In traditional networking, VRF instances are hosted on a physical appliance and share the
resources with the global routing table. Starting with NSX 3.0, Virtual Routing and Forwarding
(VRF) instances configured on the physical fabric can be extended to the NSX domain. A VRF
Tier-0 gateway must be associated with a traditional Tier-0 gateway identified as the “Parent
Tier-0”. FIGURE 4-58 diagrams an edge node hosting a traditional Tier-0 gateway with two VRF
gateways. The control plane is completely isolated between all the Tier-0 gateway instances.
The parent Tier-0 gateway can be considered as the global routing table and must have
connectivity to the physical fabric. A unique Tier-0 gateway instance (DR and SR) will be created
and dedicated to a VRF. FIGURE 4-59 shows a detailed representation of the Tier-0 VRF gateways
with their respective Service Router and Distributed Router components.
Figure 4-59: Detailed representation of the SR/DR component for Tier-0 VRF hosted on an edge node
FIGURE 4-60 shows a typical single tier routing architecture with two Tier-0 VRF gateways
connected to their parent Tier-0 gateway. Traditional segments are connected to a Tier-0 VRF
gateway.
Figure 4-60: NSX 3.0 multi-tenant architecture. Dedicated Tier-0 VRF Instance for each VRF
Since the control plane is isolated between Tier-0 VRF instances and the parent Tier-0 gateway,
each Tier-0 VRF needs its own network and routing constructs:
• Segments
• Uplink Interfaces
• BGP configuration with dedicated peers or Static route configuration
NSX 3.0 supports BGP and static routes for the Tier-0 VRF gateway. It offers the flexibility to use
static routes on a particular Tier-0 VRF while another Tier-0 VRF would use BGP. The OSPF
routing protocol is not supported for VRF topologies.
FIGURE 4-61 shows a topology with two Tier-0 VRF instances and their respective BGP peers on
the physical networking fabric. It is important to emphasize that the Parent Tier-0 gateway has
a BGP peering adjacency with the physical routers using their respective global routing table
and BGP process.
From a data plane standpoint, 802.1q VLAN tags are used to differentiate traffic between the
VRF instances as demonstrated in the following figure.
Figure 4-61: BGP peering Tier-0 VRF gateways and VRF on the networking fabric
When a Tier-0 VRF is attached to a parent Tier-0, multiple parameters are inherited by design
and cannot be changed:
• Edge Cluster
• High Availability mode (Active/Active – Active/Standby)
• BGP Local AS Number (Starting in NSX 4.1.0, VRFs can have an AS number that is
different from the parent T0 AS number)
• Internal Transit Subnet
• Tier-0, Tier-1 Transit Subnet.
All other configuration parameters can be independently managed:
• External Interface IP addresses
• BGP neighbor
• Prefix list, route-map, Redistribution
• Firewall rules
• NAT rules
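The split between inherited and independently managed parameters can be expressed as a simple consistency check. The sketch below is illustrative only: the class, the field names, and the sample values used with it are hypothetical and do not correspond to the NSX API schema. It encodes the one version-dependent rule stated above, namely that the BGP local AS number may differ from the parent starting in NSX 4.1.0.

```python
from dataclasses import dataclass

# Parameters the text lists as always inherited from the parent Tier-0.
INHERITED_FIELDS = (
    "edge_cluster",
    "ha_mode",
    "internal_transit_subnet",
    "t0_t1_transit_subnet",
)


@dataclass
class GatewayConfig:
    """Hypothetical subset of a Tier-0 (parent or VRF) configuration."""
    edge_cluster: str
    ha_mode: str
    local_as: int
    internal_transit_subnet: str
    t0_t1_transit_subnet: str


def inheritance_violations(parent, vrf, nsx_version=(4, 2, 0)):
    """List the fields where the VRF deviates from what it must inherit."""
    fields = list(INHERITED_FIELDS)
    if nsx_version < (4, 1, 0):
        # Before NSX 4.1.0 the VRF had to use the parent's AS number.
        fields.append("local_as")
    return [name for name in fields if getattr(parent, name) != getattr(vrf, name)]
```

A VRF that only changes its AS number passes the check for version (4, 2, 0) and is reported as `["local_as"]` for any version tuple below (4, 1, 0).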
VRF Lite HA
As mentioned previously, the Tier-0 VRF is associated with a Parent Tier-0 and will follow the high
availability mode and state of its Parent Tier-0.
Both Active/Active and Active/Standby high availability modes are supported on the Tier-0 VRF
gateways. It is not possible to have an Active/Active Tier-0 VRF associated with an Active/Standby
Parent Tier-0 and vice-versa.
In a traditional Active/Standby design, a Tier-0 gateway failover can be triggered if all
northbound BGP peers are unreachable. Similar to the high availability construct between the
Tier-0 VRF and the Parent Tier-0, the BGP peering design must match between the VRF Tier-0
and the Parent Tier-0.
Until NSX version 4.2.1, inter-SR routing was not supported in Active/Active Tier-0 VRF
topologies. The following considerations apply to scenarios where inter-SR routing is not
enabled or not supported.
FIGURE 4-62 represents a BGP instance from both the parent Tier-0 and the Tier-0 VRF point of view.
This topology does not require inter-SR routing (no harm if it is enabled) as each Tier-0 SR (on
the parent and on the VRF itself) has a redundant path towards the network infrastructure.
The Top of Rack switches in this case are advertising the same prefixes towards the NSX Tier-0.
Both the Parent Tier-0 gateway and the Tier-0 VRF gateway are peering with the same physical
networking devices. The Parent Tier-0 gateways establish BGP adjacencies with both top of rack
switches in the global routing table while the Tier-0 VRF gateways establish BGP adjacencies
with both top of rack switches on another BGP process dedicated to the VRF. The Tier-0 VRF
leverages physical redundancy towards the networking fabric if one of its northbound links fails.
FIGURE 4-63 represents a VRF Active/Active design where different routes are learned from
different physical routers. Both the Parent Tier-0 and VRF Tier-0 gateways are learning their
default route from a single physical router. At the same time, they receive specific routes from
a different single BGP peer. This kind of scenario is supported for a traditional Tier-0
architecture as Inter-SR would provide a redundant path to the networking fabric, but it is not
supported for VRFs because of the lack of inter-SR peering capability until NSX 4.2.1. This VRF
architecture is supported in NSX 4.2.1 or later by enabling the inter-SR routing capability.
Figure 4-63: Unsupported (Before NSX 4.2.1) Active/Active Topology with VRF
FIGURE 4-64 demonstrates the traffic being dropped as one internet router fails and Tier-0 VRF
gateways can’t leverage another redundant path to reach the destination. Since the Parent
Tier-0 gateway has an established BGP peering adjacency (VRFs HA state is inherited from the
parent), failover will not be triggered, and traffic will be blackholed on the Tier-0 VRF. This
situation would not arise if each VRF SR peered with every ToR.
Figure 4-64: Unsupported (Before NSX 4.2.1) Active-Active Topology with VRF – Failure
2. Since the Tier-0 topology is Active/Active, the Tier-0 DR can send the traffic to either
Tier-0 SR1 or Tier-0 SR2 using a 2 or 5 tuple hashing algorithm.
3. From a Tier-0 SR1 point of view, the traffic is blackholed as there is no inter-SR BGP
adjacency between the Tier-0 VRF SRs.
In Active/Standby topologies, it is mandatory to follow the recommended BGP peering design
where:
• The parent T0 and each associated VRF peer to every ToR switch
• The same physical devices are used for the peering of the Parent T0 and the VRF
The Tier-0 VRF will inherit the behavior of the parent Tier-0 gateway. For Active/Standby
topologies, inter-SR routing is not available, not even in NSX 4.2.1 or later.
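The two peering rules listed above lend themselves to an automated design check. The following sketch assumes a hypothetical inventory expressed as plain dictionaries; it is not tied to any NSX API and only illustrates the logic.

```python
def peering_design_issues(tors, parent_peers, vrf_peers):
    """Return human-readable violations of the two peering rules.

    tors:         iterable of ToR switch names
    parent_peers: {edge_name: set of ToRs the parent Tier-0 SR peers with}
    vrf_peers:    {vrf_name: {edge_name: set of ToRs the VRF SR peers with}}
    """
    all_tors = set(tors)
    issues = []
    for edge, peers in parent_peers.items():
        missing = all_tors - set(peers)
        if missing:
            issues.append(f"parent SR on {edge} is not peered with {sorted(missing)}")
    for vrf, per_edge in vrf_peers.items():
        for edge, peers in per_edge.items():
            # Rule 1: every SR peers with every ToR switch.
            missing = all_tors - set(peers)
            if missing:
                issues.append(f"{vrf} SR on {edge} is not peered with {sorted(missing)}")
            # Rule 2: the VRF uses the same physical devices as its parent.
            if set(peers) != set(parent_peers.get(edge, ())):
                issues.append(f"{vrf} SR on {edge} uses different peers than its parent SR")
    return issues
```

An empty list indicates that the parent and every VRF peer with every ToR switch through the same devices, which is the design the text describes as mandatory for Active/Standby.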
FIGURE 4-65 represents another unsupported design with Active/Standby Tier-0 and VRF.
Figure 4-65: Unsupported design BGP architecture Different peering with the networking fabric
In this design, traffic will be blackholed on the Tier-0 VRF SR1 as the internet router fails. Since
the Tier-0 VRF shares its high availability running mode with the Parent Tier-0, it is important to
note that the Tier-0 SR1 will not fail over to the Tier-0 SR2. The reason is that
a failover is triggered only if all northbound BGP sessions change to the “down” state
on the parent Tier-0 SR.
Since the parent Tier-0 gateway has an active BGP peering with a northbound router on the
physical networking fabric, failover will not occur and traffic will be blackholed on the VRF that
has only one BGP peer active. This situation would not arise if each VRF SR peered with every
ToR.
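The interaction between the inherited HA state and the BGP sessions of the VRF can be sketched as a small decision function. This is a simplified model under stated assumptions: inter-SR routing is not available, only BGP session state is considered, and the names and return values are hypothetical.

```python
def edge_outcome(parent_sessions, vrf_sessions):
    """Classify what happens to VRF traffic on one edge node.

    Each argument maps a BGP neighbor name to True (established) or False (down).
    """
    if not any(parent_sessions.values()):
        # The parent SR lost all northbound sessions: it fails over,
        # and the VRF SR follows because it inherits the HA state.
        return "failover"
    if not any(vrf_sessions.values()):
        # The parent is healthy, so no failover happens, but the VRF SR
        # has no northbound peer left: its traffic is blackholed.
        return "vrf-blackhole"
    return "forwarding"
```

For example, a parent SR with one established session combined with a VRF SR whose only peer is down yields "vrf-blackhole", the situation shown in FIGURE 4-65.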
Let’s now examine how Inter-SR BGP peering for VRF can make VRF peering designs more
flexible and redundant for Active/Active topologies. While following the same design principles
as for the Active/Standby topology is recommended, A/A topologies with inter-SR routing can
support a variety of failures even when those principles are not followed.
FIGURE 4-66 shows a topology where the Parent Tier-0 Gateway peers with a set of peers (ToR-1
and ToR-2) and the associated VRFs with a different set (ToR-3 and ToR-4). Moreover, each SR
only peers with a single top of rack switch, violating the two principles of what we consider the
“best” design. There are many reasons why the “best” design may not be possible in some
scenarios. Let’s examine how inter-SR routing can help. FIGURE 4-66 shows the failure of the
single BGP peer configured on the VRF SR1. No failover event has been triggered as both the
Parent-T0 SRs are in a healthy state, and the VRF SRs inherit their HA state from them.
On the VRF, the outbound traffic flows in this way:
1. VM “172.16.10.0” sends its IP traffic towards the internet through the Tier-0 DR.
2. Since the Tier-0 topology is Active/Active, the Tier-0 DR can send the traffic to either
Tier-0 SR1 or Tier-0 SR2 using a 2 or 5 tuple hashing algorithm.
3. From a VRF SR1 point of view, the traffic that needs to be routed towards the internet
will be sent towards VRF SR2 as there is an inter-SR BGP adjacency and VRF SR2
learns the route from ToR-4.
4. Traffic is received by the VRF SR2 and routed towards the physical fabric.
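The outbound flow above can be modeled in a few lines to show the difference inter-SR routing makes. This is a sketch only: the SHA-256 based hash is a stand-in for the 2 or 5 tuple hashing the DR performs (the actual algorithm is not described here), and the function and parameter names are hypothetical.

```python
import hashlib


def pick_sr(flow, srs):
    """Pick an SR for a flow tuple; a stand-in for the DR's ECMP hash."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return srs[int.from_bytes(digest[:4], "big") % len(srs)]


def northbound_path(flow, srs, has_uplink_route, inter_sr_enabled):
    """Return the list of hops a northbound packet takes from the DR.

    has_uplink_route maps each SR name to True when that SR still learns
    the destination from one of its own northbound BGP peers.
    """
    first = pick_sr(flow, srs)
    if has_uplink_route[first]:
        return [first, "fabric"]
    if inter_sr_enabled:
        # With inter-SR routing, the SR relays through a peer SR that
        # still has a northbound route (step 3 in the list above).
        for peer in srs:
            if peer != first and has_uplink_route[peer]:
                return [first, peer, "fabric"]
    return [first, "drop"]
```

With `has_uplink_route = {"SR1": False, "SR2": True}`, every flow reaches the fabric when `inter_sr_enabled` is True, while flows hashed to SR1 are dropped when it is False.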
Figure 4-66: Active/Active VRF with inter-SR routing, suboptimal design - Failure of the VRF peer
FIGURE 4-67 shows the failure of the single BGP peer configured on the SR of the Parent Tier-0
running on edge 1. A failover event has been triggered for both the Parent-T0 and VRF SRs
running on edge node 1. Both are in a failed state because the Parent-T0 has all northbound
routing down, and the VRF inherits its HA state from the Parent Tier-0.
On the VRF, the outbound traffic flows in this way:
1. VM “172.16.10.0” sends its IP traffic towards the internet through the Tier-0 DR.
2. Since the Tier-0 topology is Active/Active, the Tier-0 DR can send the traffic to either
the Tier-0 SR1 or Tier-0 SR2 southbound IP using a 2 or 5 tuple hashing algorithm, but
because SR1 is in a failed state, the SR1 southbound IP now resides on SR2. Effectively,
all outbound traffic reaches the VRF SR2 even though the VRF SR1 has its northbound
routing operational.
3. Traffic is received by the VRF SR2 and routed towards the physical fabric.
Figure 4-67: Active/Active VRF with inter-SR routing, suboptimal design - Failure of the Parent-T0 peer
Stateful services can run on either a Tier-0 VRF gateway or a Tier-1 gateway, except for VPN and
Load Balancing, as these features are not supported on a Tier-0 VRF. The Tier-0 SR in charge of the
stateful services for a particular VRF will be hosted on the same edge nodes as the Parent Tier-0
Gateway. It is recommended to run the stateful services on Tier-1 gateways and leverage an
Active/Active Tier-0 gateway to send the traffic northbound to the physical fabric.
FIGURE 4-69 represents stateful services running on traditional Tier-1 gateway SRs.
If the intention is to minimize the physical network configuration and simply provide a common
VLAN where all the NSX VRFs can access the Internet, this configuration can be accomplished by
deploying a shared T0 Gateway dedicated to Internet access as demonstrated in the following
figure (FIGURE 4-70).
Figure 4-70: Different VRFs sharing Internet access via common Internet Tier-0
Directly connecting the different VRFs to a common VLAN is not allowed in NSX 4.1.
The VRFs can be interconnected to the shared T0 via dedicated overlay segments. The shared
Tier-0 gateway is providing internet access using a shared VLAN segment.
Internet facing NAT can be implemented on the shared T0 if the VRFs are not carrying overlapping
IP spaces, or at the VRF level on the uplink facing the shared Tier-0 if they are.
eBGP peering between the VRF Tier-0 gateways and the shared Tier-0 gateway is
recommended to protect against an edge node failure.
This EVPN Route type can advertise either IPv4 or IPv6 prefixes with a variable network mask
length (0 to 32 bits for IPv4 and 0 to 128 bits for IPv6). This route type allows inter-subnet
forwarding and overlapping IP address schemes with the help of route distinguishers.
FIGURE 4-73 represents the fields included in an EVPN Route Type 5.
When designing an EVPN architecture based on NSX, the following network constructs must be
taken into account:
• Segments used for the uplink interfaces
From a BGP standpoint, there are two main requirements to set up the basic MP-BGP
peering:
• TEP interfaces (loopbacks) need to have reachability (Top of Rack loopback interfaces to
NSX Tier-0 loopbacks hosted on each edge node)
• L2VPN/EVPN address family negotiated between the Top of Rack Switch and the Tier-0
Service Router.
To achieve connectivity between the loopbacks hosted on the Tier-0 SR and the Top of Rack
switches, 2 designs are available as demonstrated in FIGURE 4-75. The first design will have a
BGP peering between the NSX Tier-0 SR and the Top of Rack switches using their uplink
interfaces. The loopbacks must be redistributed into BGP so that the entire network fabric is
aware of these TEP IP addresses to reach the different VRF prefixes hosted on the NSX domain.
The second design uses BGP Multi-hop between the loopback interfaces. To achieve
reachability between the loopbacks in this case, the use of static routing or OSPF is needed,
otherwise the BGP Multi-hop peering will never reach the “established” state.
The first design with loopback redistribution is recommended as it simplifies the overall
architecture and reduces the operational load.
Figure 4-75: Peering over uplink interfaces with loopback redistribution into BGP (recommended) vs multi-hop E-BGP and
peering over the loopback interfaces
The Route Distinguisher field in a Type 5 EVPN route has the following functionalities and
properties:
• Allows VRF prefixes to be considered unique. If two tenants advertise the same prefix,
the Tier-0 (and the network fabric) must have a way to differentiate these 2 overlapping
prefixes so that they are globally unique among the different isolated routed instances.
• As a result, the Route Distinguisher must be unique per VRF. It is an 8-byte identifier.
• NSX can auto assign Route Distinguishers, or an administrator can configure them
• Network engineers traditionally use the format “BGP_AS:VRF” to configure the route
distinguishers
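To make the 8-byte size and the “BGP_AS:VRF” convention concrete, the sketch below packs a route distinguisher using the type 0 layout from RFC 4364 (2-byte type, 2-byte AS number, 4-byte assigned number). The choice of the type 0 layout and the helper names are assumptions made for illustration; NSX may encode or auto-assign route distinguishers differently.

```python
import ipaddress
import struct


def encode_rd(asn: int, assigned: int) -> bytes:
    """Pack an 'ASN:number' route distinguisher as an 8-byte type 0 RD."""
    if not 0 <= asn <= 0xFFFF:
        raise ValueError("type 0 RD carries a 2-byte AS number")
    if not 0 <= assigned <= 0xFFFFFFFF:
        raise ValueError("assigned number must fit in 4 bytes")
    # Network byte order: type (0), AS number, assigned number.
    return struct.pack("!HHI", 0, asn, assigned)


def route_key(asn: int, vrf_id: int, prefix: str):
    """Combine RD and prefix: the pair that has to be unique across VRFs."""
    return encode_rd(asn, vrf_id), ipaddress.ip_network(prefix)
```

Three tenants advertising the same prefix, for example `172.16.10.0/24` with RDs `65000:1`, `65000:2`, and `65000:3`, produce three distinct keys, which is what lets a single MP-BGP EVPN session carry overlapping prefixes.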
The following pictures show a multi-tenancy architecture based on a single MP-BGP EVPN
session that advertises the same prefix owned by 3 different tenants using route distinguishers
to make the routes unique.
Figure 4-76: Relationships between Route Distinguisher, Route Target and VXLAN ID
FIGURE 4-77 summarizes the networking constructs needed per VRF to design and implement
MP-BGP EVPN with VXLAN between the Tier-0 gateways and the Top of Rack switches.
The NSX edge node uses Geneve TEP interfaces for all NSX overlay traffic. To exchange network
traffic with the Top of Rack switches, the Tier-0 must have different TEP interfaces enabled
for VXLAN. Since one VXLAN TEP is supported on each edge node, loopback interfaces acting as
VXLAN TEP interfaces are recommended in an NSX EVPN architecture.
between the transport nodes. FIGURE 4-78 illustrates this point in the context of multicast
routing.
The diagram represents a multicast source located in the physical infrastructure sending to
multiple virtual machines on different ESXi hosts. Two kinds of multicast are involved here:
• multicast between physical devices, represented in dark green,
• multicast handling on the virtual switch inside the ESXi hosts, represented in light green.
More specifically, you can see that within ESXi-3, traffic has been replicated to VM1 and
VM3, but filtered to VM2.
When running multicast in a VLAN-backed environment (NSX VLAN segments, or simple
dvportgroups, if NSX is not present in the picture), the physical infrastructure can directly snoop
the IGMP traffic sent by the virtual machines and take care of the multicast replication and
filtering between the hosts. In that scenario, there is no difference between NSX and regular
vSphere networking: the virtual switch is identifying the receivers using IGMP snooping and
delivers the multicast traffic to the appropriate vnics. We will not elaborate on this capability in
this guide.
For overlay segments however, the physical infrastructure is unaware of the traffic being
tunneled and it’s up to NSX to handle efficiently the distribution of multicast traffic to the
appropriate transport nodes. This design guide section is going to focus on that scenario: the
replication/filtering of multicast traffic between transport nodes with an NSX overlay.
FIGURE 4-79 summarizes the kind of multicast connectivity provided by NSX with multicast
routing. Sources and receivers can be inside or outside the NSX domain. The Tier0 gateway
runs PIM SM to the physical infrastructure where the RP is located.
Technical overview
NSX routes multicast traffic between overlay segments thanks to its Tier0 and Tier1 gateways.
This section focuses on the data plane operations within NSX: how NSX achieves efficient
multicast replication between transport nodes. This multicast traffic is carried on an overlay
network and the physical infrastructure cannot look inside the tunnels NSX creates.
4.8.3.2 Data plane operations between Tier0 gateway and physical infrastructure
The Tier0 gateway connecting NSX to the physical infrastructure is implemented as multiple
Tier0-SRs on multiple edge transport nodes. FIGURE 4-80 illustrates a Tier0 made of four
active/active Tier0-SRs, each configured with two physical uplinks running PIM sparse mode.
Thanks to ECMP, unicast traffic can be sent and received on all the uplinks of those Tier0-SRs.
NSX is also spreading multicast traffic across the different Tier0-SRs on a per-multicast group
basis.
For a given destination multicast IP address, NSX uses a hash to determine a unique “active”
Tier0-SR that will be responsible for handling traffic in and out of the NSX domain for this group.
Different groups are hashed to different Tier0-SRs, thus providing some form of load balancing
for multicast traffic. There is no configuration option to manually select the multicast active
Tier0-SR for a given multicast group.
All Tier0-SRs must have a configuration and connectivity to the physical infrastructure capable
of handling all the groups. If a Tier0-SR fails, the multicast groups it was handling will be
assigned to an arbitrary Tier0-SR that is still active. No multicast routing state is synchronized
between the Tier0-SRs, and all the affected multicast groups will see their traffic disrupted until
PIM has re-populated its tables. Note that only the groups that were associated to the failed
Tier0-SR are impacted, other groups remain associated to their existing active Tier0-SR.
Figure 4-80: multicast between edges and physical infrastructure, external sources, internal receivers
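The per-group assignment and the failure behavior described above can be sketched as follows. The hash used by NSX is not described in this guide, so SHA-256 over the group address is purely a stand-in, and the function names are hypothetical. The point of the sketch is the reassignment rule: only groups owned by the failed Tier0-SR move.

```python
import hashlib
import ipaddress


def _hash_pick(group: str, srs):
    """Deterministically pick one SR for a multicast group (stand-in hash)."""
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    candidates = sorted(srs)
    digest = hashlib.sha256(addr.packed).digest()
    return candidates[int.from_bytes(digest[:4], "big") % len(candidates)]


def assign_groups(groups, srs):
    """Map every multicast group to its multicast active Tier0-SR."""
    return {group: _hash_pick(group, srs) for group in groups}


def handle_sr_failure(assignment, failed_sr, surviving_srs):
    """Reassign only the groups that were active on the failed SR."""
    updated = dict(assignment)
    for group, sr in assignment.items():
        if sr == failed_sr:
            updated[group] = _hash_pick(group, surviving_srs)
    return updated
```

Groups that were not on the failed Tier0-SR keep their existing assignment, so only the moved groups see a disruption while PIM repopulates its tables.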
NSX will only join the source tree of a multicast source in the physical infrastructure on one of
the uplinks of the multicast active Tier0-SR for that group. That means that external multicast
traffic will always enter NSX through the multicast active Tier0-SR. In the above FIGURE 4-80, the
multicast traffic for G1 is entering NSX through Tier0-SR1, the multicast active Tier0-SR for
group G1. Multicast group G2 could be associated to another Tier0-SR, like Tier0-SR3 in the
diagram, thus spreading inbound traffic across multiple Tier0-SRs.
Figure 4-81: multicast between edges and physical infrastructure, internal sources, external receivers.
When the source for the group is located inside the NSX domain, the traffic pattern might be
slightly different. As soon as the IP address of the source is known to the physical
infrastructure, a PIM join will be sent toward the IP address of the source, in order to join the
receiver to the source tree.
This PIM join can be forwarded by the physical infrastructure on any path toward the source.
Thanks to unicast ECMP, the IP address of the source is reachable via any Tier0-SR. In FIGURE
4-81 above, the external receiver for group G1 ended up joining the source tree rooted at S1 G1
through Tier0-SR1, the Tier0-SR active for G1. This is purely by chance. By contrast, the receiver
for group G2 ended up trying to join the source tree rooted at S2 G2 via Tier0-SR4.
This is not the Tier0-SR active for G2. Tier0-SR4 will itself need to join the source tree rooted
inside NSX in order to provide connectivity to the receiver. Because G2 is associated with Tier0-
SR3, the multicast traffic to G2 is always sent to Tier0-SR3. From there, it will be forwarded to
Tier0-SR4 so that it can reach the receiver that joined the source tree through an uplink of
Tier0-SR4. In the “south-north” direction, multicast traffic can thus hairpin to the multicast
active Tier0-SR before reaching its destination.
4.8.3.3 Data plane operations within NSX for segments attached to a Tier0 gateway
We have seen that the multicast traffic going in and out the NSX domain for group G is always
routed through the multicast active Tier0-SR for G. The rest of this section is dedicated to the
multicast traffic flow within NSX. We’re only going to represent traffic using active Tier0-SR1 in
the diagrams, but of course, traffic could be going through any other Tier0-SR, depending on
the group.
The following diagram shows the path for multicast traffic initiated in the physical infrastructure
and distributed inside NSX via the multicast active Tier0-SR1. In this scenario, we’re going to
assume that all the receivers are attached to segments directly connected to the Tier0 gateway
(in other words, there is no Tier1 gateway in the picture).
1. Multicast traffic from the external source S reaches the Tier0-SR multicast active uplink.
There are receivers in the NSX domain, so this packet needs to be replicated to one or
multiple ESXi transport nodes.
2. For performance reasons, Tier0-SR1 is not going to replicate the traffic to the interested
transport nodes in the NSX domain. Instead, it’s going to “offload” the replication to an
ESXi host, picked arbitrarily in the Tier0 routing domain (again, the routing domain is the
group of transport nodes where the Tier0 gateway spans.) The host selected for
replication does not need to have a local receiver for the traffic, but the destination
multicast IP address is part of the hash used to select the host. This way, this multicast
offload is distributed across all the hosts in the routing domain, based on multicast
group. A note on the format of the packet that is forwarded to the ESXi-1 transport
node: this is a multicast packet tunneled unicast to ESXi-1. The destination IP address of
the overlay packet is the unicast IP address of a TEP in ESXi-1.
3. In this example, ESXi-1 has been selected to replicate the multicast packet. First, the
local Tier0-DR routes the traffic to any receiver on local VMs.
4. Then the Tier0-DR initiates a “hybrid replication” to all the remote hosts in the Tier0
routing domain that have at least one receiver (and only those hosts.) The next section
will detail what this hybrid replication means; at this stage, just assume that it’s a way
of leveraging the physical infrastructure to assist packet replication between ESXi hosts.
5. In the final step, the remote Tier0-DRs receive the multicast packet and route it to their
local receivers.
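The five steps above can be condensed into a short simulation. It is a sketch under stated assumptions: the offload host is chosen with a stand-in SHA-256 hash of the group address (the guide only says the destination multicast IP address is part of the hash), hybrid replication is reduced to "deliver to every remote host with a receiver", and all names are hypothetical.

```python
import hashlib


def pick_offload_host(group: str, routing_domain):
    """Step 2: pick the ESXi host that replicates on behalf of the Tier0-SR."""
    hosts = sorted(routing_domain)
    digest = hashlib.sha256(group.encode()).digest()
    return hosts[int.from_bytes(digest[:4], "big") % len(hosts)]


def replicate_from_sr(group: str, routing_domain, receivers):
    """Simulate steps 2 to 5 for one packet entering through the active SR.

    receivers maps a host name to the set of local receiver VMs for the group.
    Returns (offload host, remote hosts reached, per-host deliveries).
    """
    offload = pick_offload_host(group, routing_domain)
    deliveries = {}
    # Step 3: the local Tier0-DR routes to receivers on the offload host.
    if receivers.get(offload):
        deliveries[offload] = sorted(receivers[offload])
    # Step 4: replicate only to remote hosts with at least one receiver.
    remote = [h for h in sorted(routing_domain) if h != offload and receivers.get(h)]
    # Step 5: each remote Tier0-DR routes to its local receivers.
    for host in remote:
        deliveries[host] = sorted(receivers[host])
    return offload, remote, deliveries
```

Note that the offload host may have no local receiver at all; in that case it appears in neither `remote` nor `deliveries` and only performs the replication.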
FIGURE 4-83 details the multicast packet flow when a source is in the NSX domain (on a
segment attached to a Tier0) and there are receivers inside and outside NSX.
1. The VM sends multicast traffic on the segment to which it is connected. It reaches the
local Tier0-DR. Note: there is no receiver on ESXi-1. If there were some in the source
segment, they would directly receive the multicast traffic, forwarded at Layer 2 (with
IGMP snooping.) If there were some receivers on other segments in ESXi-1, the T0-DR
would route the traffic directly to them too.
2. Again, we assumed that Tier0-SR1 is multicast active for the group. It is thus considered
as being the multicast router and needs to receive a copy of every multicast packet
generated in the NSX domain for this group. The Tier0-DR on ESXi-1 sends a unicast copy
of the multicast packet to Tier0-SR1 (multicast traffic encapsulated in unicast overlay.)
3. The Tier0-SR receives the traffic and routes it to the outside world on one of its uplinks
(or potentially through another Tier0-SR, as we have seen earlier.)
4. The Tier0-DR of ESXi-1 also starts hybrid replication to forward the multicast traffic to all
other ESXi transport nodes with receivers in the routing domain (the part dedicated to
hybrid replication will detail how this is achieved.)
5. The remote Tier0-DRs forward the multicast traffic to their local receivers.
4.8.3.4 Data plane operations within NSX: Tier0 and Tier1 segments combined
Until NSX 3.0, multicast routing was only possible on Tier0 gateways. NSX 3.1 introduces the
capability of routing multicast on Tier1 gateways. The handling of the Tier1 gateways is an
extension of the model described in the previous part for the Tier0 gateways.
When multicast routing is enabled on a Tier1 gateway, a Tier1-SR must be instantiated on an
edge. This Tier1-SR will be considered as the multicast router for the Tier1 routing domain and
will receive a copy of all multicast traffic initiated on a segment attached to this Tier1 gateway.
Then this Tier1-SR will be attached as a leaf to the existing multicast tree of its Tier0 gateway
(remember that in order to run multicast, a Tier1 gateway must itself be attached to a Tier0
gateway running multicast.)
The following diagram represents the traffic path of some external multicast traffic that needs
to be replicated within NSX to receivers off a Tier0 gateway as well as two Tier1 gateways
(green and purple.) Steps 1-5 are about replicating the traffic to segments attached to the Tier0
gateway, and are thus exactly the same as in the previous example.
1. The multicast traffic is received by the multicast active Tier0-SR for the group (Tier0-
SR1.)
2. The Tier0-SR1 offloads replication to an arbitrary ESXi transport node, here ESXi-1.
3. The Tier0-DR of the offload transport node (ESXi-1) routes the multicast traffic to local
receivers.
4. The Tier0-DR of ESXi-1 starts hybrid replication to reach all the ESXi hosts that have
receivers on segments attached to the Tier0 gateway.
5. The Tier0-DRs of those remote ESXi transport nodes route the traffic to their local
receivers.
So far, those steps have been exactly like those described in the previous part.
6. This is the first step specific to multicast on Tier1 gateways: the multicast active Tier0-SR1
initiates a hybrid replication targeting all the Tier1-SRs with receivers for this group.
7. The Tier1-SRs (green and purple) offload the multicast replication to an arbitrary ESXi
transport node in their routing domain. In this respect, they act exactly the same way as
the multicast active Tier0-SR for its routing domain.
8. The ESXi transport nodes selected for offloading their Tier1-SR (ESXi-3 for the green
Tier1 and ESXi-4 for the purple Tier1) route the traffic to their local receivers, if any.
9. The ESXi hosts selected for offloading their Tier1-SR initiate a hybrid replication for
reaching all the interested Tier1-DRs in their Tier1 routing domain.
10. Finally, the Tier1-DRs on remote transport nodes route the multicast traffic to their local
receivers.
For the sake of being complete, FIGURE 4-85 is showing the traffic flow when a multicast source
is in the routing domain of a Tier1 gateway and there are receivers off another Tier1 gateway,
Tier0 gateway and external.
Figure 4-85: internal source, external receiver and receivers on Tier0 and Tier1 segments
1. The source is located on ESXi-3, off a segment attached to the purple Tier1. The
multicast traffic is first received by the local purple Tier1-DR.
2. The purple Tier1-DR initiates a hybrid replication to reach every remote purple Tier1-DR.
3. The remote purple Tier1-DRs route the traffic to their local destination.
4. The purple Tier1-DR of ESXi-3 also sends a copy of the multicast traffic to the multicast
router of the routing domain, i.e. the purple Tier1-SR on edge3.
5. The purple Tier1-SR forwards the traffic to all SRs with receivers for this group (Tier0-SR
and green Tier1-SR). Note that the diagram simplifies the real process. For accuracy: the
purple Tier1-SR in fact sends a unicast copy (not represented) to Tier0-SR1, its multicast
router, and then initiates a hybrid replication that reaches all SRs with receivers. The
Tier0-SR therefore receives two copies: one is replicated on the uplink toward external
receivers (step 6 below), while the other goes down the tree to internal receivers on
segments attached to the Tier0 (part of step 7 below).
6. The multicast active Tier0-SR routes the traffic on one of its uplinks toward the external
receiver(s).
7. The Tier0-SR and green Tier1-SR now offload replication to an arbitrary ESXi transport
node in their routing domain.
8. The Tier0-DR on ESXi-1 and green Tier1-DR on ESXi-3 (the host selected to offload the
green Tier1-SR) route the multicast traffic to their local receivers, if any.
9. The Tier0-DR on ESXi-1 and green Tier1-DR on ESXi-3 initiate a hybrid replication to their
peers in their respective routing domains.
10. The remote Tier0-DR and green Tier1-DR route the traffic to their local receivers.
A few notes on the Tier1 multicast model:
• The use of a Tier1-SR for multicast has an impact on unicast traffic. Indeed, unicast traffic
that needs to be routed to other Tier1s or to the outside world will now transit through
this Tier1-SR.
• Multicast traffic for Tier1 attached segments follows a tree that is rooted on the
corresponding Tier1-SR. The consequence is that a host with receivers on segments that
are attached to different Tier0/Tier1s will get multiple copies of the same multicast
traffic, one for each Tier0-DR or Tier1-DR involved. See ESXi-4 in FIGURE 4-85 above:
it receives three separate copies of the same multicast packet, one for each
gateway.
nodes. This example does not repeat the multicast packet flow presented in the previous
part (for example, the multicast traffic sent to the multicast router is not considered). It
only shows how hybrid replication delivers this multicast IP packet to all the
other transport nodes that have a receiver for G. In this diagram, the transport nodes with
receivers are TN-2, TN-4, TN-5, TN-7, TN-8, TN-9, TN-10, and TN-11. The transport nodes have
their TEPs in three different subnets. This is a common design: typically, hosts under the same
top of rack switch (TOR) have their IP addresses in the same subnet. In this example, transport
nodes TN-1 to TN-4 have their TEP IP addresses in subnet 10.0.0.0/24, TN-5 to TN-8 have TEPs
in subnet 20.0.0.0/24, and TN-9 to TN-12 have TEP addresses in 30.0.0.0/24.
Let’s assume that NSX has mapped group G to group M in the replication multicast range for
ESXi transport nodes (referred to earlier as “range1”). All the ESXi hosts with a receiver for G
in the overlay join group M in the physical infrastructure (that’s what the “M” on the TEPs
represents).
Transport node TN-1 knows the IP addresses of all the transport nodes interested in G (as
mentioned earlier, this has been advertised by relaying IGMP packets). More specifically, TN-1
knows that it has receivers in its local subnet and in two remote subnets.
1. TN-1 encapsulates the multicast traffic to G in an overlay packet with M as a destination
and sends it on its uplink. The physical infrastructure sees a packet to M and knows that
there are two hosts that have joined group M. The top of rack switch takes care of
replicating this packet to TN-2 and TN-4. Note again that we don’t expect the physical
infrastructure to route multicast traffic across subnets. The packet sent to M is thus
constrained to its source subnet 10.0.0.0/24 and not routed to the remote subnets
20.0.0.0/24 and 30.0.0.0/24 by the physical infrastructure.
2. TN-1 knows that receivers TN-5, TN-7 and TN-8 are all in the remote underlay subnet
20.0.0.0/24. It picks one of those three transport nodes (let’s say TN-7) and forwards the
multicast traffic to G as an overlay unicast to it. TN-1 repeats the operation for the
remote underlay subnet 30.0.0.0/24 and sends the multicast traffic to G as an overlay
unicast to TN-9.
3. TN-7 receives an encapsulated multicast packet from TN-1. In the metadata of this
packet, TN-1 has set a bit indicating that this packet needs to be replicated to the other
transport nodes in the underlay subnet (see section OVERLAY ENCAPSULATION). TN-7
delivers the multicast packet to its local receiver VMs and copies the multicast traffic
to G on its uplink, encapsulated with an overlay destination multicast IP address M. The
physical infrastructure again takes care of replicating this packet to TN-5 and
TN-8. This time, the metadata of the overlay packet created by TN-7 does not indicate
that the packet needs to be replicated to the other hosts in the same subnet, so TN-5
and TN-8 only forward the multicast traffic to their local receivers.
4. TN-9 receives the overlay unicast copy of the multicast traffic to G and performs the
same operation as TN-7 in the previous step.
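The four steps above can be sketched in a few lines of Python. This is only an illustration of the hybrid replication logic, not NSX code: the function name, the packet dictionaries, the host addresses inside each TEP subnet, and the rule for electing the remote proxy (lowest address) are all assumptions. The text has TN-1 pick TN-7; any receiver in the remote subnet is equally valid.

```python
import ipaddress
from collections import defaultdict


def hybrid_replication(source_tep, receiver_teps, prefix_len=24):
    """Return the overlay packets the source transport node sends.

    One multicast copy covers the receivers in the source's own TEP
    subnet. Each remote TEP subnet gets a single unicast copy, sent to
    one elected receiver and flagged so that this receiver re-multicasts
    the packet inside its own subnet.
    """
    def subnet_of(ip):
        return ipaddress.ip_interface(f"{ip}/{prefix_len}").network

    local = subnet_of(source_tep)
    by_subnet = defaultdict(list)
    for tep in receiver_teps:
        by_subnet[subnet_of(tep)].append(tep)

    packets = []
    if by_subnet.pop(local, None):
        # The physical switch replicates this copy within the local subnet.
        packets.append({"type": "multicast", "dst": "M", "replicate": False})
    for _subnet, teps in sorted(by_subnet.items()):
        # Hypothetical election rule: the lowest TEP address acts as proxy.
        proxy = min(teps, key=ipaddress.ip_address)
        packets.append({"type": "unicast", "dst": proxy, "replicate": True})
    return packets


# Hypothetical TEP addresses matching the three subnets of the example.
receivers = {
    "TN-2": "10.0.0.2", "TN-4": "10.0.0.4",
    "TN-5": "20.0.0.5", "TN-7": "20.0.0.7", "TN-8": "20.0.0.8",
    "TN-9": "30.0.0.9", "TN-10": "30.0.0.10", "TN-11": "30.0.0.11",
}
for packet in hybrid_replication("10.0.0.1", receivers.values()):
    print(packet)
```

With eight receiving transport nodes spread over three subnets, the source sends three packets in total: one multicast copy and two unicast copies.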
When TN-1 sends traffic to group G1, it’s encapsulated in a multicast tunnel packet and sent to
multicast address M. As a result, both TN-2 and TN-4 receive this packet. This is sub-optimal
because TN-4 has no receiver for group G1. TN-4 just discards this packet. Note that, from a
performance standpoint, this issue is not catastrophic. The additional replication work is done
by the physical infrastructure, and the cost of discarding superfluous packets is not that great
on TN-4, but it’s still better to minimize this scenario.
Enable IGMP snooping in the physical infrastructure
As mentioned already in this document, NSX only uses the physical infrastructure for multicast
within the same Layer 2 domain. One of the reasons for this implementation choice is that we
did not want NSX to be dependent on configuring multicast routing in the physical
infrastructure (something that many of our customers are not able to do.) It is however
beneficial to have IGMP snooping enabled on the Layer 2 switches of the physical
infrastructure, in order to optimize the multicast replication within the rack.
Most switches enable IGMP snooping by default; just check that your infrastructure is
configured that way. Without IGMP snooping, IP multicast is treated like broadcast by Layer 2
switches, which results in sub-optimal replication.
Limit the number of TEP subnets
This one is just a recommendation related to performance. With the hybrid replication model,
NSX performs a unicast copy for each remote TEP subnet (see FIGURE 4-86: HYBRID replication.)
This copy consumes CPU on the source transport node and bandwidth on its uplink. The
following diagram proposes a relatively extreme example comparing the replication of a single
multicast frame to ten remote ESXi transport nodes. In the case represented at the top of the
diagram, all the receiving transport nodes are in different subnets. In the bottom part of the
diagram, the receiving transport nodes are in the same subnet as the source.
Figure 4-89: replication across multiple subnets vs. within a single subnet
In the top example, the source transport node (TN-1) needs to send 10 unicast copies of the
multicast packet, one for each and every receiving transport node. Practically, it means that the
bandwidth of the uplink of TN-1 has been divided by 10. There is a CPU cost to this 10-fold
replication too.
In the example represented at the bottom, the source transport node just needs to send a
single multicast copy of the multicast packet. It is received by the physical switch local to the
rack and directly replicated to the receivers. The source transport node thus only had to send a
single packet, and the burden of the replication has been entirely put on the hardware switch.
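The arithmetic behind the two cases can be written down as a short sketch. The function names, the subnet labels, and the 25 Gbps uplink speed are assumptions chosen for illustration; only the "one copy per remote subnet, one multicast copy for the local subnet" rule comes from the text.

```python
def copies_sent_by_source(source_subnet, receiver_subnets):
    """Overlay packets the source sends for one multicast frame."""
    receiver_subnets = list(receiver_subnets)
    # One multicast copy if any receiver shares the source's TEP subnet.
    local = 1 if source_subnet in receiver_subnets else 0
    # One unicast copy per distinct remote TEP subnet.
    remote = len(set(receiver_subnets) - {source_subnet})
    return local + remote


def per_stream_share(uplink_gbps, copies):
    """Uplink bandwidth left for the original stream once it is replicated."""
    return uplink_gbps / copies


# Top of the diagram: ten receivers, each in its own remote subnet.
top = copies_sent_by_source("10.0.0.0/24", [f"10.0.{i}.0/24" for i in range(1, 11)])
# Bottom of the diagram: ten receivers in the source's subnet.
bottom = copies_sent_by_source("10.0.0.0/24", ["10.0.0.0/24"] * 10)

print(top, per_stream_share(25, top))        # 10 2.5
print(bottom, per_stream_share(25, bottom))  # 1 25.0
```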
Of course, we’re not recommending putting all your hosts in the same Layer 2 domain. In fact,
one of the benefits of the NSX overlay is precisely that you can design your physical
infrastructure without the need to extend Layer 2. The cost associated with replicating across
subnets is difficult to avoid for ESXi hosts, which are bound to be spread across subnets. It
might however be possible to group VMs joining the same multicast groups on hosts that are
spread across a minimum number of racks.
Hybrid replication is also used between edges hosting the Tier0-SRs and Tier1-SRs. The edges of
your edge cluster should not be dependent on the TORs of a single rack, but it’s perfectly
possible to deploy all the edges of your edge cluster across two racks (and two TEP subnets.)
This will adequately reduce the amount of unicast replication that the hybrid model needs to
perform.
Using Tier1-SRs in the model has consequences, minimize the number of Tier1 routers with
multicast
This has already been mentioned in the chapter DATA PLANE OPERATIONS WITHIN NSX: TIER0 AND
TIER1 SEGMENTS COMBINED. An ESXi transport node will receive a copy of the same multicast
packet for each Tier0 and Tier1 with local receivers. The following diagram gives a simplified
representation of this property:
Figure 4-90: host transport node receiving multiple copies of the same multicast packet
Here, the ESXi host at the bottom receives three copies of a multicast packet generated outside
NSX, behind a Tier0 gateway. The multicast trees for the Tier1 gateways are rooted in their
Tier1-SRs. You can see in the diagram that the ESXi host receives the multicast packet for the
local receiver under the green Tier1-DR straight from the green Tier1-SR, not from the local
Tier0-DR (as it would, for unicast traffic.) This looks sub-optimal (multicast is all about avoiding
the same links to carry the same multicast traffic multiple times), but it’s mandatory for
inserting stateful services on the Tier1 gateways, in a future release. Just be aware that
multiplying Tier1 gateways with receivers for the same groups will cause NSX to replicate the
same traffic multiple times.
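As a rough illustration of this property, the number of copies a host receives for a given group equals the number of distinct gateways that have a local receiver on that host. The function and the VM and gateway names below are hypothetical.

```python
def copies_received(local_receivers):
    """local_receivers: (vm, gateway) pairs for one group on one host.

    The host gets one copy per distinct gateway with a local receiver,
    regardless of how many VMs sit behind each gateway.
    """
    return len({gateway for _vm, gateway in local_receivers})


host = [
    ("vm-a", "Tier0"),
    ("vm-b", "green Tier1"),
    ("vm-c", "purple Tier1"),
    ("vm-d", "purple Tier1"),
]
print(copies_received(host))  # 3
```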
Using Tier1-SRs in the model has consequences, be aware of a Tier1-SR impact on unicast
traffic
The choice of implementing a Tier1-SR for multicast traffic also has an impact on unicast traffic:
unicast traffic can only enter/exit the Tier1 routing domain through this centralized Tier1-SR, as
represented in the figure below.
The impact of inserting a Tier1-SR is well known, and not specific to multicast. It is still possible
to have unicast ECMP straight from the host transport node to the Tier0-SRs (and the physical
infrastructure) by attaching workloads directly to a segment off the Tier0 gateway.
Edge Node
Edge nodes are service appliances with pools of capacity, dedicated to running network and
security services that cannot be distributed to the hypervisors. Edge node also provides
encapsulate the traffic sent to compute nodes. If the NSX deployment includes multiple
overlay transport zones, multiple sets of edge nodes must be deployed, one for each
overlay transport zone.
• VLAN Transport Zone: Edge nodes connect to the physical infrastructure using VLANs.
An edge node needs to be configured with a VLAN transport zone to provide external or N-S
connectivity to the physical infrastructure. Depending upon the N-S topology, an edge
node can be configured with one or more VLAN transport zones.
An edge node can have one or more N-VDS to provide the desired connectivity. Each N-VDS on the
Edge node uses an uplink profile, which can be the same or unique per N-VDS. The teaming
policies defined in the uplink profile define how the N-VDS balances traffic across its uplinks.
The uplinks can in turn be individual pNICs or LAGs. While supported, implementing LAGs on
edge nodes is discouraged and not included in any of the deployment examples in this guide.
The Bare Metal Edge resources listed above are the minimum required. It is
recommended to deploy an edge node on a bare metal server with the following specifications
for maximum performance:
• Memory: 256GB
• CPU Cores: 24
• Disk Space: 200GB
When NSX Edge is installed as a VM, vCPUs are allocated to the Linux IP stack and DPDK. The
number of vCPU assigned to a Linux IP stack or DPDK depends on the size of the Edge VM. A
medium Edge VM has two vCPUs for Linux IP stack and two vCPUs dedicated for DPDK. This
changes to four vCPUs for Linux IP stack and four vCPUs for DPDK in a large size Edge VM.
Several Intel and AMD CPUs are supported both for the virtualized and Bare Metal Edge node
form factor. Specifications can be found HERE .
The in-band management feature is leveraged for management traffic. Overlay traffic is load
balanced using the multi-TEP feature on the Edge, and external traffic is load balanced using
a "Named Teaming policy" as described in section TEAMING POLICY in chapter 3.
Figure 4-92: Bare metal Edge - Same N-VDS for overlay and external traffic with Multi-TEP
During a pNIC failure, Edge performs a TEP failover by migrating TEP IP and its MAC address to
another uplink. For instance, if pNIC P1 fails, TEP IP1 along with its MAC address will be
migrated to use Uplink2 that’s mapped to pNIC P2. In case of pNIC P1 failure, pNIC P2 will carry
the traffic for both TEP IP1 and TEP IP2. The BFD sessions from the individual Edge TEPs to other
transport nodes are not a reliable mechanism to detect and recover from partial failures (of a
single TEP for example). The exception is the “All tunnels down” condition when all the tunnels
from all edge TEPs to all the transport nodes are down. See the NSX EDGE HIGH AVAILABILITY
FAILOVER TRIGGERS section for more details.
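The TEP failover behavior described above can be modeled with a small sketch. The data structures and the choice of the first healthy uplink as the failover target are assumptions made for illustration; the text only states that the TEP IP and its MAC address move together to another uplink.

```python
def tep_placement(tep_to_uplink, uplink_to_pnic, failed_pnics):
    """Return which pNIC carries each TEP after the given pNIC failures.

    A TEP whose uplink maps to a failed pNIC migrates (IP and MAC
    together) to a surviving uplink.
    """
    healthy = [u for u, p in uplink_to_pnic.items() if p not in failed_pnics]
    if not healthy:
        raise RuntimeError("no healthy uplink left for the TEPs")

    placement = {}
    for tep, uplink in tep_to_uplink.items():
        if uplink not in healthy:
            uplink = healthy[0]  # hypothetical choice of failover uplink
        placement[tep] = uplink_to_pnic[uplink]
    return placement


teps = {"TEP IP1": "Uplink1", "TEP IP2": "Uplink2"}
uplinks = {"Uplink1": "P1", "Uplink2": "P2"}

print(tep_placement(teps, uplinks, failed_pnics=set()))
print(tep_placement(teps, uplinks, failed_pnics={"P1"}))
```

In the second call, pNIC P2 carries the traffic for both TEP IP1 and TEP IP2, as in the example in the text.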
The Edge Multi-TEP capability has been enhanced in NSX 4.2.1 by the TEP Groups feature,
described in a dedicated section in chapter 3: TEP GROUPS. It improves high availability and
performance for the overlay network and should be considered for any deployment that
supports it.
faster failover, and higher throughput at low packet size (discussed in chapter 8 NSX
PERFORMANCE & OPTIMIZATION). Certain hardware requirements, including CPU
specifics and supported NICs, can be found in the NSX EDGE BARE METAL REQUIREMENTS section of
the NSX installation guide.
When a bare metal Edge node is installed, a dedicated interface is retained for management. If
redundancy is desired, two NICs can be used for management plane high availability. These
management interfaces can also be 1G. Bare metal Edge also supports in-band management
where management traffic can leverage an interface being used for overlay or external (N-S)
traffic.
A bare metal Edge node supports a maximum of 16 datapath physical NICs. For each of these 16
physical NICs on the server, an internal interface is created following the naming scheme
“fp-ethX”. These internal interfaces are assigned to the DPDK Fast Path. There is flexibility in
assigning these Fast Path interfaces (fp-eth) for overlay or external connectivity.
VM Edge Node
NSX Edge in VM form factor can be installed using an OVA, OVF, or ISO file. The NSX Edge VM is
only supported on ESXi hosts.
Up to NSX 3.2.0, an NSX Edge VM has four internal interfaces: eth0, fp-eth0, fp-eth1, and
fp-eth2. Eth0 is reserved for management, while the rest of the interfaces are assigned to DPDK
Fast Path. Starting with NSX 3.2.1, four datapath interfaces are available: fp-eth0, fp-eth1,
fp-eth2, and fp-eth3.
These interfaces are allocated for external connectivity to TOR switches and for NSX overlay
tunneling. There is complete flexibility in assigning Fast Path interfaces (fp-eth) for overlay or
external connectivity. As an example, fp-eth0 could be assigned for overlay traffic with fp-eth1,
fp-eth2, or both for external traffic.
FIGURE 4-93 shows an edge VM where only two of the fast path interfaces are in use. They are
managed by the same N-VDS and carry both overlay and VLAN traffic. This design is in line with
the edge reference architecture and will be explained in detail in chapter 7, section EDGE
CONNECTIVITY GUIDELINES FOR LAYER 3 PEERING USE CASE.
Edge Cluster
An Edge cluster is a group of Edge transport nodes. It provides scale out, redundant, and
high-throughput gateway functionality for logical networks. Scale out from the logical networks
to the Edge nodes is achieved using ECMP. There is flexibility in assigning Tier-0 or Tier-1
gateways to Edge nodes and clusters. Tier-0 and Tier-1 gateways can be hosted on either the same
or different Edge clusters.
Depending upon the services hosted on the Edge node and their usage, an Edge cluster could
be dedicated simply for running centralized services (e.g., NAT). FIGURE 4-95 shows two clusters
of Edge nodes. Edge Cluster 1 is dedicated for Tier-0 gateways only and provides external
connectivity to the physical infrastructure. Edge Cluster 2 is responsible for NAT functionality on
Tier-1 gateways.
Figure 4-95: Multiple Edge Clusters with Dedicated Tier-0 and Tier-1 Services
There can be only one Tier-0 gateway per Edge node; however, multiple Tier-1 gateways can be
hosted on one Edge node.
A Tier-0 gateway supports a maximum of eight equal cost paths per SR or DR component, thus
a maximum of eight Edge nodes is supported for ECMP. Edge nodes in an Edge cluster run
Bidirectional Forwarding Detection (BFD) on both tunnel and management networks to detect
Edge node failure. Edge VMs support BFD with a minimum BFD timer of 500ms with three retries,
providing a 1.5 second failure detection time. Bare metal Edges support BFD with a minimum BFD
TX/RX timer of 50ms with three retries, which implies a 150ms failure detection time.
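The detection times quoted above follow directly from multiplying the BFD timer by the number of retries. A minimal sketch (the function name is an assumption):

```python
def bfd_detection_time_ms(timer_ms, retries=3):
    """Failure detection time: the peer is declared down after
    `retries` consecutive intervals without a BFD packet."""
    return timer_ms * retries


print(bfd_detection_time_ms(500))  # Edge VM minimum timer: 1500 ms
print(bfd_detection_time_ms(50))   # Bare metal Edge minimum timer: 150 ms
```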
Failure Domain
A failure domain is a logical grouping of Edge nodes within an Edge Cluster. This feature can be
enabled at the Edge cluster level via API.
As discussed in the high availability section, a Tier-1 gateway with centralized stateful services
runs on Edge nodes in active/standby or active/active HA configuration mode. When a user
assigns a Tier-1 gateway to an Edge cluster, NSX manager automatically chooses the Edge
nodes in the cluster to run the active and standby Tier-1 SR in case of active/standby HA mode
or the edge sub-clusters in case of active/active stateful HA mode (See section 4.5.3.3 to review
the role of sub-clusters in A/A Stateful configurations). The auto placement of Tier-1 SRs on
different Edge nodes considers several parameters like Edge capacity, active/standby HA state
etc.
Failure domains complement the auto placement algorithm and guarantee service availability in
case of a failure affecting multiple edge nodes. The active and standby instances of a Tier-1 SR,
or the members of a sub-cluster, always run in different failure domains.
FIGURE 4-96 shows an edge cluster comprised of four Edge nodes, EN1, EN2, EN3 and EN4. EN1
and EN2 are connected to two TOR switches in rack 1, and EN3 and EN4 are connected to two TOR
switches in rack 2. Without failure domains, a Tier-1 SR could be auto placed on EN1 and EN2. If
rack 1 fails, both the active and standby instances of this Tier-1 SR fail as well.
EN1 and EN2 are configured to be a part of failure domain 1, while EN3 and EN4 are in failure
domain 2. When a new Tier-1 SR is created and if the active instance of that Tier-1 is hosted on
EN1, then the standby Tier-1 SR will be instantiated in failure domain 2 (EN3 or EN4).
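The placement constraint can be sketched as follows. This is not the NSX auto placement algorithm: the `load` metric stands in for the several parameters the real algorithm considers, and the function name and tie-breaking by node name are assumptions. Only the rule that active and standby must land in different failure domains comes from the text.

```python
def place_tier1_sr(edge_nodes, failure_domain_of, load):
    """Pick (active, standby) edge nodes in different failure domains."""
    # Least-loaded node first; the node name breaks ties deterministically.
    ranked = sorted(edge_nodes, key=lambda node: (load[node], node))
    active = ranked[0]
    for candidate in ranked[1:]:
        if failure_domain_of[candidate] != failure_domain_of[active]:
            return active, candidate
    raise ValueError("no edge node available in a different failure domain")


nodes = ["EN1", "EN2", "EN3", "EN4"]
domains = {"EN1": 1, "EN2": 1, "EN3": 2, "EN4": 2}
load = {"EN1": 0, "EN2": 0, "EN3": 0, "EN4": 0}

print(place_tier1_sr(nodes, domains, load))  # ('EN1', 'EN3')
```

Without the failure domain check, the ranking alone would have returned EN1 and EN2, which share rack 1.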
To ensure that all Tier-1 services are active on a set of edge nodes, a user can also enforce that
all active Tier-1 SRs are placed in one failure domain. This configuration is supported for Tier-1
gateways in preemptive mode only.
In NSX, uRPF is enabled by default on external, internal and service interfaces. From a security
standpoint, it is a best practice to keep uRPF enabled on these interfaces. uRPF is also
recommended in architectures that leverage ECMP. On intra-tier and router link interfaces, a
simplified anti-spoofing mechanism is implemented. It checks that a packet is never sent
back to the interface the packet was received on.
It is possible to disable uRPF in complex routing architectures where the upstream BGP or OSPF
peers do not advertise the same networks.
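To illustrate why uRPF can drop legitimate traffic when upstream peers advertise different networks, here is a simplified strict-mode check. It is a generic sketch, not the NSX implementation: the routing table layout, interface names, and prefixes are assumptions.

```python
import ipaddress


def urpf_accept(src_ip, in_interface, routes):
    """Strict uRPF: accept a packet only if the best route back to its
    source points out of the interface the packet arrived on.

    routes: dict mapping a prefix string to the set of interfaces that
    are valid next-hop interfaces for it (several when ECMP is in use).
    """
    src = ipaddress.ip_address(src_ip)
    best = None
    for prefix, interfaces in routes.items():
        network = ipaddress.ip_network(prefix)
        if src in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, interfaces)
    return best is not None and in_interface in best[1]


# Hypothetical table: 192.0.2.0/24 is learned from the peer on uplink1 only.
routes = {
    "192.0.2.0/24": {"uplink1"},
    "198.51.100.0/24": {"uplink1", "uplink2"},  # ECMP: both uplinks valid
    "172.16.10.0/24": {"downlink1"},
}

print(urpf_accept("198.51.100.7", "uplink2", routes))  # True
print(urpf_accept("192.0.2.7", "uplink2", routes))     # False
```

The second packet is legitimate but arrives on an uplink whose peer did not advertise its source network, which is the situation where disabling uRPF may be considered.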
communication is initiated. It also takes care of the reply. For both SNAT and DNAT,
users can apply NAT rules based on 5 tuple match criteria.
● Reflexive NAT: Reflexive NAT rules are stateless ACLs which must be defined in both
directions. These do not keep track of the connection. Reflexive NAT rules can be used
in cases where stateful NAT cannot be used due to asymmetric paths (e.g., user needs
to enable NAT on active/active ECMP routers).
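The difference with stateful NAT is that a reflexive rule translates each packet on its own, with no connection table. A minimal sketch, assuming a single 1:1 mapping (the function name and addresses are illustrative only):

```python
def reflexive_nat(packet, inside_ip, outside_ip):
    """Stateless 1:1 translation applied independently to every packet.

    packet: (source_ip, destination_ip). Outbound packets get their
    source rewritten; inbound packets get their destination rewritten.
    """
    src, dst = packet
    if src == inside_ip:
        src = outside_ip      # outbound direction
    elif dst == outside_ip:
        dst = inside_ip       # inbound direction
    return (src, dst)


inside, outside = "172.16.10.10", "192.0.2.10"
print(reflexive_nat((inside, "198.51.100.5"), inside, outside))
print(reflexive_nat(("198.51.100.5", outside), inside, outside))
```

Because no state is kept, the return packet is translated correctly even if it arrives on a different active/active Edge node than the one that handled the outbound packet.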
NAT Rules Type | Type | Specific Usage Guidelines
Table 4-4 summarizes the use cases and advantages of running NAT on Tier-0 and Tier-1
gateways.
Gateway Type | NAT Rule Type | Specific Usage Guidelines
DHCP Services
NSX provides both DHCP relay and DHCP server functionality. DHCP relay can be enabled at the
gateway level and can act as a relay between a non-NSX-managed environment and DHCP servers.
DHCP server functionality can be enabled to service DHCP requests from VMs connected to
NSX-managed segments. DHCP server functionality is a stateful service and, as with NAT
functionality, must be bound to an Edge cluster or a specific pair of Edge nodes.
Since Gateway Firewalling is a centralized service, it needs to run on an Edge cluster or a set of
Edge nodes. This service is described in more detail in the security chapter 5.
Proxy ARP
Proxy ARP is a method that consists of answering an ARP request on behalf of another host. This
method is performed by a layer 3 networking device (usually a router). The purpose is to
provide connectivity between two hosts when routing wouldn’t be possible for various reasons.
Proxy ARP in an NSX infrastructure can be considered in environments where IP subnets are
limited. Proofs of concept and VMware Enterprise PKS environments usually use Proxy
ARP to simplify the network topology.
For production environments, it is recommended to implement proper routing between a
physical fabric and the NSX Tier-0 by using either static routes or Border Gateway Protocol with
BFD. If proper routing is used between the Tier-0 gateway and the physical fabric, BFD with its
sub-second timers will converge faster. In case of failover with proxy ARP, the convergence
relies on gratuitous ARP (broadcast) to update all hosts on the VLAN segment with the new
MAC Address to use. If the Tier-0 gateway has proxy ARP enabled for 100 IP addresses, the
newly active Tier-0 SR needs to send 100 Gratuitous ARP packets.
By enabling proxy-ARP, hosts on the overlay segments and hosts on a VLAN segment can
exchange network traffic together without implementing any change in the physical networking
fabric. Proxy ARP is automatically enabled when a NAT rule or a load balancer VIP uses an IP
address from the subnet of the Tier-0 gateway uplink.
FIGURE 4-98 presents the logical packet flow between a virtual machine connected to an NSX
overlay segment and a virtual machine or physical appliance connected to a VLAN segment
shared with the NSX Tier-0 uplinks.
In this example, the virtual machine connected to the overlay segment initiates networking
traffic toward 20.20.20.100.
7. Tier-0 receives the ARP request for 20.20.20.10 (broadcast) and has the proxy ARP
feature enabled on its uplink interfaces. It replies to the ARP request with an ARP reply
that contains the Tier-0 SR MAC address for the interface uplink.
8. The physical appliance “SRV01” receives the ARP reply and sends a packet on the VLAN
segment with a source IP of 20.20.20.100 and a destination IP of 20.20.20.10.
9. The packet is received by the Tier-0 SR and routed to the Tier-1, which translates the
destination IP 20.20.20.10 to 172.16.10.10. The packet is sent to the overlay segment
and the virtual machine receives it.
It is crucial to note that in this case, the traffic is initiated by the virtual machine which is
connected to the overlay segment on the Tier-1. If the initial traffic was initiated by a server on
the VLAN segment, a Destination NAT rule would have been required on the Tier-1/Tier-0 since
the initial traffic would not match the SNAT rule that has been configured previously.
FIGURE 4-99 represents an outage on an active Tier-0 gateway with Proxy ARP enabled. The
newly active Tier-0 gateway will send a gratuitous ARP to announce the new MAC address to be
used by the hosts on the VLAN segment in order to reach the virtual machine connected to the
overlay. It is important to note that the newly active Tier-0 will send a gratuitous ARP for
each IP address that is configured for Proxy ARP.
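The cost of this failover can be illustrated with a short sketch that builds one gratuitous ARP per proxied address. The dictionary layout, the function name, and the MAC and IP values are assumptions for illustration.

```python
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def gratuitous_arps(proxy_arp_ips, new_sr_mac):
    """One broadcast gratuitous ARP per proxy ARP address, announcing
    the MAC address of the newly active Tier-0 SR."""
    return [
        {"sender_ip": ip, "target_ip": ip, "sender_mac": new_sr_mac, "dst_mac": BROADCAST_MAC}
        for ip in proxy_arp_ips
    ]


# Hypothetical: 100 addresses on the uplink subnet are served by proxy ARP.
proxied = [f"20.20.20.{host}" for host in range(1, 101)]
garps = gratuitous_arps(proxied, new_sr_mac="00:50:56:aa:bb:02")
print(len(garps))  # 100
```

Every host on the VLAN segment has to process each of these broadcasts, which is one reason convergence with proxy ARP is slower than with routing and BFD.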
Topology Consideration
This section covers a few of the many topologies that customers can build with NSX. NSX
routing components - Tier-1 and Tier-0 gateways - enable flexible deployment of multi-tiered
routing topologies. Topology design also depends on what services are enabled and where
those services are provided at the provider or tenant level.
Supported Topologies
FIGURE 4-100 shows three topologies with Tier-0 gateway providing N-S traffic connectivity via
multiple Edge nodes. The first topology is single-tiered where Tier-0 gateway connects directly
to the segments and provides E-W routing between subnets. Tier-0 gateway provides multiple
active paths for N-S L3 forwarding using ECMP. The second topology shows the multi-tiered
approach where Tier-0 gateway provides multiple active paths for L3 forwarding using ECMP
and Tier-1 gateways as first hops for the segments connected to them. Routing is fully
distributed in this multi-tier topology. The third topology shows a multi-tiered topology with
Tier-0 gateway configured in Active/Standby HA mode to provide some centralized or stateful
services like NAT, VPN etc.
As discussed in the two-tier routing section, centralized services can be enabled on Tier-1 or
Tier-0 gateway level. FIGURE 4-101 shows two multi-tiered topologies.
The first topology shows centralized services like NAT, load balancer on Tier-1 gateways while
Tier-0 gateway provides multiple active paths for L3 forwarding using ECMP.
The second topology shows centralized services configured on a Tier-1 and Tier-0 gateway.
Some centralized services are only available on Tier-1.
Figure 4-101: Stateful and Stateless (ECMP) Services Topologies Choices at Each Tier
FIGURE 4-102 shows a topology with Tier-0 gateways connected back to back. “Tenant-1 Tier-0
Gateway” is configured for a stateful firewall while “Tenant-2 Tier-0 Gateway” has stateful NAT
configured. Since stateful services are configured on both “Tenant-1 Tier-0 Gateway” and
“Tenant-2 Tier-0 Gateway”, they are configured in Active/Standby high availability mode. The
top layer of Tier-0 gateway, "Aggregate Tier-0 Gateway” provides ECMP for North-South traffic.
Note that only external interfaces should be used to connect a Tier-0 gateway to another Tier-0
gateway. Static routing and BGP are supported to exchange routes between two Tier-0
gateways and full mesh connectivity is recommended for optimal traffic forwarding and
failover. This topology provides high N-S throughput with centralized stateful services running
on different Tier-0 gateways. This topology also provides complete separation of routing tables
on the tenant Tier-0 gateway level and allows services that are only available on Tier-0
gateways (like VPN with redundant remote peers) to leverage ECMP northbound. Note that
VPN is available on Tier-1 gateways starting with the NSX 2.5 release. NSX 3.0 introduces new
multi-tenancy features such as EVPN and VRF-lite. These features are recommended and suitable
for true multi-tenant architectures where stateful services need to be run on multiple layers or
Tier-0.
Full mesh connectivity over a single overlay segment is recommended for optimal traffic
forwarding and failover.
Back-to-back Tier-0 peering involving Active/Standby Tier-0 gateways presents risks. Some
partial failure conditions may cause the peering Tier-0 gateway to forward traffic to the Standby
SR, causing a black hole. These topologies include both back-to-back A/S Tier-0 gateways and an
A/S Tier-0 gateway peering with an A/A Tier-0 gateway. To minimize the risk of such a
condition, the underlying physical infrastructure should prevent the failure of individual BGP
sessions so that any adjacent Tier-0 gateway always has the preferred path for any destination
via the Active SR and not the Standby SR (which, by default, performs an AS-Path prepend with a
length of 3). In such cases, we recommend establishing a full mesh EBGP peering over a single
overlay segment and actively monitoring the status of all the BGP sessions. In case a partial
failure of a subset of the BGP sessions occurs, the NSX operator should manually intervene to
reestablish the peering session or perform a manual failover making active an SR with all the
BGP sessions operational.
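The black hole condition can be reproduced with a simplified model of BGP path selection that only compares AS path length. The AS number, SR names, and function names are assumptions; real BGP best path selection evaluates several attributes before and after AS path length.

```python
def advertised_path(local_as, is_active, prepend=3):
    """AS path announced by an SR. The standby SR prepends its own AS
    `prepend` extra times, making its path less attractive."""
    extra = 0 if is_active else prepend
    return [local_as] * (1 + extra)


def preferred_sr(sessions):
    """sessions: dict SR name -> AS path, only for BGP sessions that are up.
    Simplified best path: the shortest AS path wins."""
    if not sessions:
        return None
    return min(sessions, key=lambda sr: len(sessions[sr]))


all_up = {
    "active-SR": advertised_path(65001, is_active=True),
    "standby-SR": advertised_path(65001, is_active=False),
}
# Partial failure: only the BGP session to the active SR goes down.
partial_failure = {sr: path for sr, path in all_up.items() if sr != "active-SR"}

print(preferred_sr(all_up))           # active-SR
print(preferred_sr(partial_failure))  # standby-SR
```

In the partial failure case, the peer still holds a valid (prepended) route via the standby SR and forwards traffic to it, even though the standby SR does not forward it.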
Active/Active Stateful Tier-0 gateways cannot be used as aggregation routers. Aggregation
routers forward traffic between external interfaces connected to different segments.
Active/Active Stateful Tier-0 gateways require traffic to flow between an external and a
downlink interface. Active/Active Stateful gateways can be connected to an aggregation A/A
stateless Tier-0 gateway (preferably) or an A/S stateful Tier-0 gateway.
Figure 4-102: Multiple Tier-0 Topologies with Stateful and Stateless (ECMP) Services
FIGURE 4-103 shows another topology with Tier-0 gateways connected back-to-back. “Corporate
Tier-0 Gateway” on Edge cluster-1 provides connectivity to the corporate resources
(172.16.0.0/16 subnet) learned via a pair of physical routers on the left. This Tier-0 has stateful
Gateway Firewall enabled to allow access to restricted users only.
“WAN Tier-0 Gateway” on Edge-Cluster-2 provides WAN connectivity via WAN routers and is
also configured for stateful NAT.
“Aggregate Tier-0 gateway” on the Edge cluster-3 learns specific routes for corporate subnet
(172.16.0.0/16) from “Corporate Tier-0 Gateway” and a default route from “WAN Tier-0
Gateway”. “Aggregate Tier-0 Gateway” provides ECMP for both corporate and WAN traffic
originating from any segments connected to it or connected to a Tier-1 southbound. Full mesh
connectivity is recommended for optimal traffic forwarding.
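The forwarding decision on the “Aggregate Tier-0 Gateway” is a plain longest prefix match between the specific corporate route and the default route. A sketch, assuming a routing table reduced to the two routes named in the text (the function name and destination addresses are illustrative):

```python
import ipaddress


def next_hop(dst_ip, rib):
    """Longest prefix match: the most specific matching route wins."""
    dst = ipaddress.ip_address(dst_ip)
    matches = []
    for prefix, hop in rib.items():
        network = ipaddress.ip_network(prefix)
        if dst in network:
            matches.append((network.prefixlen, hop))
    if not matches:
        return None
    return max(matches)[1]


rib = {
    "172.16.0.0/16": "Corporate Tier-0 Gateway",
    "0.0.0.0/0": "WAN Tier-0 Gateway",
}

print(next_hop("172.16.20.5", rib))   # Corporate Tier-0 Gateway
print(next_hop("198.51.100.9", rib))  # WAN Tier-0 Gateway
```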
Figure 4-103: Multiple Tier-0 Topologies with Stateful and Stateless (ECMP) Services
A Tier-1 gateway usually connects to a Tier-0 gateway, but for some use cases it is possible to
interconnect it to another Tier-1 gateway using a service interface (SI) and a downlink, as
depicted in FIGURE 4-104. Static routing must be configured on both Tier-1 gateways in this case,
as dynamic routing is not supported. The Tier-0 gateway must be aware of all the overlay
segment prefixes to provide connectivity.
Figure 4-104: Supported Topology – T1 gateways interconnected to each other using Service Interface and Downlink
Unsupported Topologies
While the deployment of logical routing components enables customers to deploy flexible
multi-tiered routing topologies, FIGURE 4-105 presents topologies that are not supported. The
topology on the left shows that a tenant Tier-1 gateway cannot be connected directly to
another tenant Tier-1 gateway using downlinks exclusively.
The rightmost topology highlights that a Tier-1 gateway cannot be connected to two different
upstream Tier-0 gateways.
5 NSX Security
Since the NSX 4.1 update of this design guide, the product formerly known as VMware NSX
Security Solutions has been rebranded as VMware vDefend. However, we refer to the old name
(NSX Security) in this chapter to ensure consistency with the rest of the document.
In addition to providing network virtualization, NSX also serves as an advanced security
platform, providing a rich set of features to streamline the deployment of security solutions.
This chapter focuses on core NSX security capabilities, architecture, components, and
implementation. Key concepts for examination include:
● NSX distributed firewall (DFW) provides stateful protection of the workload at the vNIC
level. For ESXi, the DFW enforcement occurs in the hypervisor kernel, helping deliver
micro-segmentation. The DFW also extends to physical servers and containers,
providing distributed policy enforcement.
● A full-stack security solution, including advanced threat prevention (ATP), is natively
integrated within the VMware Cloud Foundation (VCF) private cloud
● Agnostic to compute domain - supporting hypervisors managed by different compute-
managers while allowing any defined micro-segmentation policy to be applied across
hypervisors spanning multiple vCenter environments.
● Support for Layer 2, Layer 3, Layer 4, Layer 7 APP-ID, and Identity-based firewall policies
provides security via protocol, port, and/or deeper packet/session intelligence to suit
diverse needs.
● NSX Gateway firewall serves as a centralized stateful firewall service for N-S traffic.
Gateway firewall is implemented per gateway and supported at both Tier-0 and Tier-1.
Gateway firewall is independent of NSX DFW from policy configuration and enforcement
perspective, providing a means for defining perimeter security control in addition to
distributed security control.
● Gateway & Distributed Firewall Service insertion capability to integrate existing security
investments using integration with partner ecosystem products on a granular basis
without the need for interrupting natural traffic flows.
● Distributed IDS extends IDS capabilities to every host in the environment.
● Dynamic grouping of objects into logical constructs called Groups based on various
criteria including tag, virtual machine name or operating system, subnet, and segments
which automates policy application.
● The scope of policy enforcement can be selective, with application or workload-level
granularity.
● Firewall Flood Protection capability to protect the workload & hypervisor resources.
● IP discovery mechanism dynamically identifies workload addressing.
● SpoofGuard blocks IP spoofing at vNIC level.
● Switch Security provides storm control and security against unauthorized traffic.
NSX 3.2 introduced advanced security features such as security on vCenter dvpgs, distributed
and centralized malware protection, centralized IDS/IPS, URL Filtering, extensive next
generation firewall App Identification support, Network Traffic Anomaly Detection, and
Network Detection and Response (NDR). These new features will be covered in the new version
of the SECURITY DESIGN GUIDE. This document covers the NSX core security features.
Management Plane
The NSX management plane is implemented through NSX Managers. NSX Managers are
deployed as a cluster of 3 manager nodes. Access to the NSX Manager is available through a
GUI or REST API framework. When a firewall policy rule is configured, the NSX management
plane service validates the configuration and locally stores a persistent copy. Then the NSX
Manager pushes user-published policies to the control plane service within Manager Cluster
which in turn pushes them to the data plane. A typical DFW policy configuration consists of one or
more sections with a set of rules using objects like Groups, Segments, and application-level
gateways (ALGs). For monitoring and troubleshooting, the NSX Manager interacts with a
host-based management plane agent (MPA) to retrieve DFW status along with rule and flow
statistics. The NSX Manager also collects an inventory of all hosted virtualized workloads on
NSX transport nodes. This is dynamically collected and updated from all NSX transport nodes.
Control Plane
The NSX control plane consists of two components - the central control plane (CCP) and the
Local Control Plane (LCP). The CCP is implemented on NSX Manager Cluster, while the LCP
includes the user space module on all of the NSX transport nodes. This module interacts with
the CCP to exchange configuration and state information.
From a DFW policy configuration perspective, NSX Control plane will receive policy rules pushed
by the NSX Management plane. If the policy contains objects including segments or Groups, it
converts them into IP addresses using an object-to-IP mapping table. This table is maintained by
the control plane and updated using an IP discovery mechanism. Once the policy is converted
into a set of rules based on IP addresses, the CCP pushes the rules to the LCP on all the NSX
transport nodes.
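
The translation step described above can be pictured with a minimal sketch. The group names, IP addresses, and the `translate` helper are hypothetical and only illustrate the idea of resolving logical objects through an object-to-IP table; they do not represent the actual CCP implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    source: str        # logical object (Group or segment) name
    destination: str   # logical object (Group or segment) name
    action: str


# Hypothetical object-to-IP mapping table, as it might look once
# IP discovery has populated it.
OBJECT_TO_IPS = {
    "web-group": {"10.0.1.11", "10.0.1.12"},
    "app-group": {"10.0.2.21"},
}


def translate(rule: Rule, table: dict[str, set[str]]) -> dict:
    """Resolve the rule's logical objects into sorted IP address lists."""
    return {
        "name": rule.name,
        "source_ips": sorted(table.get(rule.source, set())),
        "destination_ips": sorted(table.get(rule.destination, set())),
        "action": rule.action,
    }


rule = Rule("web-to-app", "web-group", "app-group", "ALLOW")
ip_rule = translate(rule, OBJECT_TO_IPS)
print(ip_rule)
```

When a workload's address changes, only the mapping table needs to be updated; the logical rule itself stays the same.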
The CCP utilizes a hierarchy system to distribute the load of CCP-to-LCP communication. The
responsibility for transport node notification is distributed across the managers in the manager
clusters based on an internal hashing mechanism. For example, for 30 transport nodes with
three managers, each manager will be responsible for roughly ten transport nodes.
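
As a rough illustration of hash-based distribution, the sketch below assigns 30 hypothetical transport nodes to three managers. The internal hashing mechanism used by NSX is not described in this guide; SHA-256 modulo the manager count is only a stand-in, and the node and manager names are invented for the example.

```python
import hashlib
from collections import Counter

MANAGERS = ["manager-1", "manager-2", "manager-3"]  # hypothetical names


def owner(node_id: str, managers: list[str]) -> str:
    """Pick a manager for a node using a stable hash (stand-in for the internal one)."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    return managers[int.from_bytes(digest[:8], "big") % len(managers)]


nodes = [f"tn-{i:02d}" for i in range(30)]
assignment = {node: owner(node, MANAGERS) for node in nodes}
load = Counter(assignment.values())
print(dict(load))  # per-manager node counts; roughly, not exactly, balanced
```

A hash spreads nodes approximately evenly, which matches the "roughly ten" wording above rather than guaranteeing exactly ten per manager.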
Data Plane
The NSX transport nodes comprise the distributed data plane with DFW enforcement done at
the hypervisor kernel level. Each of the transport nodes, at any given time, connects to only one
of the CCP managers, based on mastership for that node. On each of the transport nodes, once
the local control plane (LCP) has received policy configuration from CCP, it pushes the firewall
policy and rules to the data plane filters (in kernel) for each of the virtual NICs. With the
“Applied To” field in the rule or section which defines scope of enforcement, the LCP makes
sure only relevant DFW rules are programmed on relevant virtual NICs instead of every rule
everywhere, which would be a suboptimal use of hypervisor resources.
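
The effect of "Applied To" can be sketched as a simple filter. The rule IDs, segment names, and vNIC names below are hypothetical; the point is only that each vNIC receives the subset of rules whose scope includes it.

```python
# Hypothetical rule table; applied_to=None models the default scope (all workloads).
RULES = [
    {"id": 1, "name": "web-https", "applied_to": {"SEG-Web"}},
    {"id": 2, "name": "app-tomcat", "applied_to": {"SEG-App"}},
    {"id": 3, "name": "default", "applied_to": None},
]

# Hypothetical vNIC-to-segment attachment.
VNICS = {"web-01.eth0": "SEG-Web", "app-01.eth0": "SEG-App"}


def rules_for_vnic(vnic: str) -> list[int]:
    """Return the IDs of the rules that would be programmed on this vNIC."""
    segment = VNICS[vnic]
    return [
        rule["id"]
        for rule in RULES
        if rule["applied_to"] is None or segment in rule["applied_to"]
    ]


print(rules_for_vnic("web-01.eth0"))
print(rules_for_vnic("app-01.eth0"))
```

Without the scope restriction, every vNIC would carry all three rules; with it, each carries only the relevant two.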
that VM-to-VM communication is not broken during staging or migration phases. It is a best
practice to then change this default rule to a “drop” action and enforce access control through
an explicit allow model (i.e., only traffic defined in the firewall policy is allowed onto the
network). FIGURE 5-5: NSX DFW POLICY LOOKUP diagrams the policy rule lookup and packet flow.
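
A minimal sketch of an explicit allow model follows, assuming rules are evaluated top-down with the first match winning and the default rule placed last. The rule names, subnets, and ports are hypothetical, and stateful flow handling is not modeled.

```python
import ipaddress

# (name, source CIDR, destination CIDR, destination port or None for any, action)
RULES = [
    ("web-to-app", "10.0.1.0/24", "10.0.2.0/24", 8443, "ALLOW"),
    ("default", "0.0.0.0/0", "0.0.0.0/0", None, "DROP"),  # default rule set to drop
]


def lookup(src: str, dst: str, port: int) -> tuple[str, str]:
    """Return (rule name, action) of the first rule matching the packet."""
    for name, src_net, dst_net, rule_port, action in RULES:
        if (
            ipaddress.ip_address(src) in ipaddress.ip_network(src_net)
            and ipaddress.ip_address(dst) in ipaddress.ip_network(dst_net)
            and rule_port in (None, port)
        ):
            return name, action
    raise LookupError("no rule matched")


print(lookup("10.0.1.5", "10.0.2.9", 8443))
print(lookup("10.0.1.5", "10.0.2.9", 22))
```

Only traffic that is explicitly listed reaches an allow rule; everything else falls through to the default drop.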
5.4.1.1 Ethernet
The Ethernet Section of the policy is a Layer 2 firewalling section. All rules in this section must
use MAC Addresses for their source or destination objects. Any rule defined with any other
object type will be ignored.
5.4.1.2 Application
In an application-centric approach, grouping is based on the application type (e.g., VMs tagged
as “Web-Servers”), application environment (e.g., all resources tagged as “Production-Zone”)
and application security posture. An advantage of this approach is the security posture of the
application is not tied to network constructs or infrastructure. Security policies can move with
the application irrespective of network or infrastructure boundaries, allowing security teams to
focus on the policy rather than the architecture. Policies can be templated and reused across
instances of the same types of applications and workloads while following the application
lifecycle; they are applied when the application is deployed and removed when the
application is decommissioned. An application-based policy approach will significantly aid in
moving towards a self-service IT model. In an environment where there is strong adherence to
a strict naming convention, the VM substring grouping option allows for simple policy
definition.
An application-centric model does not provide significant benefits in an environment that is
static, lacks mobility, and has infrastructure functions that are properly demarcated.
5.4.1.3 Infrastructure
Infrastructure-centric grouping is based on infrastructure components such as segments or
segment ports, identifying where application VMs are connected. Security teams must work
closely with the network administrators to understand logical and physical boundaries.
If there are no physical or logical boundaries in the environment, then an infrastructure-centric
approach is not suitable.
5.4.1.4 Network
within each zone. A lab zone, for example, may merely be ring-fenced with a policy that allows
any traffic type from lab device to lab device and only allows basic common services such as
LDAP, NTP, and DNS to penetrate the perimeter. On the other end of the spectrum, any zone
containing regulated or sensitive data (such as customer info) will often have tightly defined
traffic between entities, with many traffic types further inspected by the natively available L7
firewall and Advanced Threat Prevention capabilities of the platform.
The answers to these questions help shape a policy rule model. Policy models should be flexible
enough to address ever-changing deployment scenarios, rather than simply be part of the initial
setup. Concepts such as intelligent grouping, tags, and hierarchy provide a flexible and agile
response capability for steady-state protection as well as for immediate threat response.
The model shown in FIGURE 5-7: SECURITY RULE MODEL represents an overview of the different
classifications of security rules that can be placed into the NSX DFW rule table. Each
classification shown represents a category in the NSX firewall table layout. The firewall table
categories align with best practices for organizing rules and help admins group
policies by category. Each firewall category can contain one or more policies to
organize the firewall rules under that category.
● VM Inventory Collection – Identify and organize a list of all hosted virtualized workloads on
NSX transport nodes. This is dynamically collected and saved by NSX Manager as ESXi
hosts are added as NSX transport nodes.
● Tag Workload – Use VM inventory collection to organize VMs with one or more tags. Each
designation consists of scope and tag association of the workload to an application,
environment, or tenant. For example, a VM tag could be “Scope = Prod, Tag = web” or
“Scope=tenant-1, Tag = app-1”. Often, these categories will dive several layers deep
including BU, project, environment, and regulatory flags. When following the iterative
approach of segmentation, categories and tags can be added to entities with existing tags.
In the application centric approach, new categories can be added with each application.
● Group Workloads – Use the NSX logical grouping construct with dynamic or static
membership criteria based on VM name, tags, segment, segment port, IPs, or other
attributes. NSX allows for thousands of groups based on tags, although rarely are more than
a dozen or so needed.
● Define Security Policy – Using the firewall rule table, define the security policy. Have
categories and policies to separate and identify emergency, infrastructure, environment,
and application-specific policy rules based on the rule model.
The methodology and rule model mentioned earlier would influence how to tag and group the
workloads as well as affect policy definition. The following sections offer more details on
grouping and firewall rule table construction with an example of grouping objects and defining
NSX DFW policy.
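
The tagging and grouping steps above can be sketched as a request body for a tag-based Group. The field names (`Condition`, `ConjunctionOperator`, the `scope|tag` value format) and the URL path reflect the NSX Policy API as commonly documented and should be verified against the API reference for the deployed version; the group ID and the helper functions are hypothetical, and no request is actually sent.

```python
import json


def tag_condition(scope: str, tag: str) -> dict:
    """Membership criterion: VMs carrying the given scope/tag pair."""
    return {
        "resource_type": "Condition",
        "member_type": "VirtualMachine",
        "key": "Tag",
        "operator": "EQUALS",
        "value": f"{scope}|{tag}",  # assumed "scope|tag" encoding
    }


def build_group(group_id: str, conditions: list[dict]) -> tuple[str, dict]:
    """Return (assumed URL path, body) for a Group that ANDs all conditions."""
    expression: list[dict] = []
    for index, condition in enumerate(conditions):
        if index:
            expression.append(
                {"resource_type": "ConjunctionOperator", "conjunction_operator": "AND"}
            )
        expression.append(condition)
    path = f"/policy/api/v1/infra/domains/default/groups/{group_id}"
    return path, {"display_name": group_id, "expression": expression}


# "Scope = Prod, Tag = web" combined with "Scope = tenant-1, Tag = app-1"
path, body = build_group(
    "prod-web-app-1",
    [tag_condition("Prod", "web"), tag_condition("tenant-1", "app-1")],
)
print(path)
print(json.dumps(body, indent=2))
```

Because membership is expressed as criteria rather than as a list of VMs, newly tagged workloads join the Group without any change to the policy that references it.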
Static criteria provide capability to manually include particular objects into the Group. For
dynamic inclusion criteria, Boolean logic can be used to create groups between various criteria.
A Group creates a logical grouping of VMs based on static and dynamic criteria. TABLE 5-1: NSX
OBJECTS USED FOR GROUPS shows one type of grouping criteria based on NSX Objects.
Segment        All VMs/vNICs connected to this segment/logical switch will be selected.
MAC Address    Selected MAC sets container will be used. MAC sets contain a list of individual MAC addresses.
AD Groups      Grouping based on Active Directory groups for Identity Firewall (VDI/RDSH) use case.
TABLE 5-2: VM PROPERTIES USED FOR GROUPS lists the selection criteria based on VM properties.
VM Property Description
Tags All VMs that are applied with specified NSX security tags
The use of Groups gives more flexibility as an environment changes over time. This approach
has three major advantages:
● Rules stay more constant for a given policy model, even as the data center environment
changes. The addition or deletion of workloads will affect group membership alone, not the
rules.
● Publishing a change of group membership to the underlying hosts is more efficient than
publishing a rule change. It is faster to send down to all the affected hosts and cheaper in
terms of memory and CPU utilization.
● As NSX adds more grouping object criteria, the group criteria can be edited to better reflect
the data center environment.
A rule within a policy is composed of the fields shown in Table 5-3: Policy Rule Fields. Their
meanings are described below.
Rule Name: User field; supports up to 30 characters.
ID: Unique rule ID auto generated by System. The rule id helps in monitoring and
troubleshooting. Firewall Log carries this Rule ID when rule logging is enabled.
Source and Destination: Source and destination fields of the packet. This will be a GROUP
which could be static or dynamic groups as mentioned under Group section.
Service: Predefined services, predefined services groups, or raw protocols can be selected.
When selecting raw protocols like TCP or UDP, it is possible to define individual port numbers or
a range. There are four options for the services field:
● Pre-defined Service – A pre-defined Service from the list of available objects.
● Add Custom Services – Define custom services by clicking on the “Create New Service”
option. Custom services can be created based on L4 Port Set, application level gateways
(ALGs), IP protocols, and other criteria. This is done using the “service type” option in the
configuration menu. When selecting an L4 port set with TCP or UDP, it is possible to define
individual destination ports or a range of destination ports. When selecting ALG, select
supported protocols for ALG from the list. ALGs are only supported in stateful mode; if the
section is marked as stateless, the ALGs will not be implemented.
● Custom Services Group – Define a custom Services group, selecting from single or multiple
services. Workflow is similar to adding Custom services, except you would be adding
multiple service entries.
Profiles: This is used to select & define Layer 7 Application ID, FQDN, URL and URL category
filtering profile. This is used for Layer 7 based security rules.
Applied To: Define the scope of rule publishing. The policy rule can be published to all
workloads (default value) or restricted to a specific GROUP. When a GROUP is used in Applied
To, it needs to be based on non-IP members such as VM objects or Segments. Not using the
Applied To field can result in very large firewall tables being loaded on vNICs, which will
negatively affect performance.
Action: Define enforcement method for this policy rule; available options are listed in TABLE 5-4:
FIREWALL RULE TABLE – “ACTION” VALUES
Action Description
DFW has been enabled, so each VM has a dedicated instance of DFW attached to its
vNIC/segment port.
In order to define a micro-segmentation policy for this application, use the Application category
in the DFW rule table and add a new policy section, with rules within it, for each application.
The following use cases present policy rules based on the different methodologies
introduced earlier.
This example shows use of the network methodology to define policy rule. Groups in this
example are identified in TABLE 5-5: FIREWALL RULE TABLE - EXAMPLE 1 – GROUP DEFINITION while the
firewall policy configuration is shown in TABLE 5-6: FIREWALL RULE TABLE - EXAMPLE 1- POLICY.
The DFW engine is able to enforce network traffic access control based on the provided
information. To use this type of construct, exact IP information is required for the policy rule.
This construct is static and does not fully leverage dynamic capabilities with modern cloud
systems.
Example 2: Using Segment object Group in Security Policy rule.
This example uses the infrastructure methodology to define policy rule. Groups in this example
are identified in TABLE 5-7: FIREWALL RULE TABLE - EXAMPLE 2 – GROUP DEFINITION while the firewall
policy configuration is shown in TABLE 5-8: FIREWALL RULE TABLE - EXAMPLE 2 – POLICY.
Reading this policy rule table would be easier for all teams in the organization, ranging from
security auditors to architects to operations. Any new VM connected on any segment will be
automatically secured with the corresponding security posture. For instance, a newly installed
web server will be seamlessly protected by the first policy rule with no human intervention,
while VM disconnected from a segment will no longer have a security policy applied to it. This
type of construct fully leverages the dynamic nature of NSX. It will monitor VM connectivity at
any given point in time, and if a VM is no longer connected to a particular segment, any
associated security policies are removed.
This policy rule also uses the “Applied To” option to apply the policy to only relevant objects
rather than populating the rule everywhere. In this example, the first rule is applied to the vNIC
associated with “SEG-Web”. Use of “Applied To” is recommended to define the enforcement
point for the given rule for better resource usage.
Security policy and IP Discovery
Both NSX DFW and Gateway Firewall (GFW) have a dependency on VM-to-IP discovery, which is
used to translate objects to IP addresses before rules are pushed to the data path. This is mainly
required when the policy is defined using grouped objects. This VM-to-IP table is maintained by the
NSX control plane and populated by the IP discovery mechanism. IP discovery is used as a central
mechanism to ascertain the IP address of a VM. By default, this is done using DHCP and ARP
snooping, with VMware Tools available as another mechanism with ESXi hosts. These
discovered VM-to-IP mappings can be overridden by manual input if needed, and multiple IP
addresses are possible on a single vNIC. The IP and MAC addresses learned are added to the
VM-to-IP table. This table is used internally by NSX for SpoofGuard, ARP suppression, and
firewall object-to-IP translation.
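
A minimal sketch of a VM-to-IP table is shown below. The class, method names, and addresses are hypothetical; the sketch only models the behaviors described above (bindings learned from several sources, multiple IP addresses per vNIC, and manual override) and is not the NSX implementation.

```python
class VmIpTable:
    """Toy model of a VM-to-IP table keyed by vNIC."""

    def __init__(self) -> None:
        # vnic -> {(ip, mac): discovery source}
        self._learned: dict[str, dict[tuple[str, str], str]] = {}
        # vnic -> manually configured bindings that replace learned ones
        self._manual: dict[str, set[tuple[str, str]]] = {}

    def learn(self, vnic: str, ip: str, mac: str, source: str) -> None:
        """Record a binding discovered via 'dhcp', 'arp', or 'vmtools'."""
        self._learned.setdefault(vnic, {})[(ip, mac)] = source

    def override(self, vnic: str, bindings: set[tuple[str, str]]) -> None:
        """Manually set the bindings for a vNIC."""
        self._manual[vnic] = set(bindings)

    def bindings(self, vnic: str) -> set[tuple[str, str]]:
        """Return manual bindings if present, otherwise the learned ones."""
        if vnic in self._manual:
            return set(self._manual[vnic])
        return set(self._learned.get(vnic, {}))


table = VmIpTable()
table.learn("web-01.eth0", "10.0.1.11", "00:50:56:aa:bb:01", "dhcp")
table.learn("web-01.eth0", "10.0.1.111", "00:50:56:aa:bb:01", "arp")
print(table.bindings("web-01.eth0"))
```

Consumers such as firewall object-to-IP translation would read from `bindings`, so a manual override takes effect everywhere the table is used.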
Intrusion Detection
Much like distributed firewalling changed the game on firewalling by providing a distributed,
ubiquitous enforcement plane, NSX distributed IDS/IPS changes the game on IPS by providing a
distributed, ubiquitous enforcement plane. However, there are additional benefits that the
NSX distributed IPS model brings beyond ubiquity (which in itself is a game changer). NSX IPS is
IPS distributed across all the hosts. Much like with DFW, the distributed nature allows the IPS
capacity to grow linearly with compute capacity. Beyond that, however, there is an added
benefit to distributing IPS. This is the added context. Legacy network Intrusion Detection and
Prevention systems are deployed centrally in the network and rely either on traffic to be
hairpinned through them or a copy of the traffic to be sent to them via techniques like SPAN or
TAPs. These sensors typically match all traffic against all or a broad set of signatures and have
very little context about the assets they are protecting. Applying all signatures to all traffic is
very inefficient, as IDS/IPS, unlike firewalling, needs to look at the packet payload, not just the
network headers. Each signature that needs to be matched against the traffic adds inspection
overhead and potential latency. Also, because legacy network IDS/IPS appliances
just see packets without having context about the protected workloads, it’s very difficult for
security teams to determine the appropriate priority for each incident. Obviously, a successful
intrusion against a vulnerable database server in production which holds mission-critical data
needs more attention than someone in the IT staff triggering an IDS event by running a
vulnerability scan. Because the NSX distributed IDS/IPS is applied to the vNIC of every workload,
traffic does not need to be hairpinned to a centralized appliance, and we can be very selective as
to what signatures are applied. Signatures related to a Windows vulnerability don’t need to be
applied to Linux workloads, and servers running Apache don’t need signatures that detect an
exploit of a database service. Through the Guest Introspection Framework and in-guest drivers,
NSX has access to context about each guest, including the operating system version, users
logged in or any running process. This context can be leveraged to selectively apply only the
relevant signatures, not only reducing the processing impact, but more importantly reducing
the noise and quantity of false positives compared to what would be seen if all signatures are
applied to all traffic with a traditional appliance. For a detailed description of IDS configuration,
see the NSX Product Documentation.
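
The idea of context-based signature selection can be illustrated with a short sketch. The signature IDs, attributes, and workload descriptions are hypothetical and greatly simplified; real signature metadata and profile matching are richer than this.

```python
# Hypothetical signature catalog; None means "applies regardless of this attribute".
SIGNATURES = [
    {"id": "sig-win-smb", "os": "windows", "service": None},
    {"id": "sig-apache-rce", "os": None, "service": "apache"},
    {"id": "sig-db-exploit", "os": None, "service": "database"},
]


def relevant_signatures(workload: dict) -> list[str]:
    """Select only the signatures matching the workload's OS and running services."""
    selected = []
    for sig in SIGNATURES:
        os_ok = sig["os"] is None or sig["os"] == workload["os"]
        svc_ok = sig["service"] is None or sig["service"] in workload["services"]
        if os_ok and svc_ok:
            selected.append(sig["id"])
    return selected


linux_web = {"name": "web-01", "os": "linux", "services": {"apache"}}
windows_db = {"name": "db-01", "os": "windows", "services": {"database"}}
print(relevant_signatures(linux_web))
print(relevant_signatures(windows_db))
```

Each workload is inspected against a fraction of the catalog, which is the source of both the reduced processing cost and the lower false-positive rate described above.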
segment security profile to a segment for enforcement. The segment security profile has
options to allow/block bridge protocol data unit (BPDU), DHCP server/client traffic, non-IP
traffic. It allows for rate limiting of broadcast and multicast traffic, both transmitted and
received.
NSX Distributed Firewall for Mix of VLAN and Overlay backed workloads
This use case mainly applies to customers who want to adopt NSX micro-segmentation policies
for all of their workloads and are looking at adopting NSX network virtualization (overlay) for
their application networking needs in phases. This scenario may arise when a customer starts to
either deploy new applications with network virtualization or migrate existing applications in phases
from VLAN to overlay backed networking to gain the advantages of NSX network virtualization.
This scenario is also common where there are applications which prevent overlay backed
networking from being adopted fully (as described in section <BRIDGING> above). The order of
operations in this environment is as follows: on egress, DFW processing happens first, then
overlay network processing happens second. On traffic arrival at a remote host, overlay
network processing happens first, then DFW processing happens before traffic arrives at the
VM.
The following diagram depicts this use case with logical and physical topology.
Figure 5-11: NSX DFW Logical Topology – Mix of VLAN & Overlay Backed Workloads
Figure 5-12: NSX DFW Physical Topology – Mix of VLAN & Overlay Backed Workloads
Gateway Firewall
The NSX Gateway firewall provides essential perimeter firewall protection which can be used in
addition to a physical perimeter firewall. Gateway firewall service is part of the NSX Edge node
for both bare metal and VM form factors. The Gateway firewall is useful in developing PCI
zones, multi-tenant environments, or DevOps style connectivity without forcing the inter-
tenant or inter-zone traffic onto the physical network. The Gateway firewall data path uses
DPDK framework supported on Edge to provide better throughput.
Consumption
NSX Gateway firewall is instantiated per gateway and supported at both Tier-0 and Tier-1.
Gateway firewall works independently of NSX DFW from a policy configuration and
enforcement perspective, although objects can be shared from the DFW. A user can consume
the Gateway firewall using either the GUI or REST API framework provided by NSX Manager.
The Gateway firewall configuration is similar to DFW firewall policy; it is defined as a set of
individual rules within a section. Like the DFW, the Gateway firewall rules can use logical
objects, tagging and grouping constructs (e.g., Groups) to build policies. Similarly, regarding L4
services in a rule, it is valid to use predefined Services, custom Services, predefined service
groups, custom service groups, or TCP/UDP protocols with the ports.
The NSX Gateway firewall provides advanced Next-Generation Firewall capabilities, including L7
application firewall, URL Filtering and FQDN Analysis, IDS/IPS, Malware Detection, and TLS
Inspection.
Implementation
Gateway firewall is an optional centralized firewall implemented on NSX Tier-0 gateway uplinks
and Tier-1 gateway links. This is implemented on a Tier-0/1 SR component which is hosted on
NSX Edge. The Tier-0 and Tier-1 Gateway firewalls support stateful firewalling in both
active/standby and active/active stateful HA modes. Gateway firewall uses a similar model as DFW
for defining policy, and the NSX grouping construct can be used as well. Gateway firewall policy
rules are organized using one or more policy sections in the firewall table for each Tier-0 and
Tier-1 Gateway. Firewalling at the perimeter allows for a coarse-grained policy definition, which
can greatly reduce the security policy size inside.
Deployment Scenarios
This section provides two examples for possible deployment and data path implementation.
Gateway FW as Inter-tenant FW
The Tier-1 Gateway firewall is used as an inter-tenant firewall within an NSX virtual domain. It is
used to define policies between different tenants who reside within an NSX environment. This
firewall is enforced for the traffic leaving the Tier-1 router and uses the Tier-1 SR component,
which resides on the Edge node, to enforce the firewall policy before sending traffic to the Tier-0
Gateway for further processing. Intra-tenant traffic continues to leverage the
distributed routing and firewalling capabilities native to NSX.
● Exclude management components like vCenter Server, and security tools from the DFW
policy to avoid lockout, at least in the early days of DFW use. Once there is a level of
comfort and proficiency, the management components can be added back in with the
appropriate policy. This can be done by adding those VMs to the exclusion list.
● Use the Applied To field in the DFW to limit the rule growth on individual vNICs.
● Choose the policy methodology and rule model to enable optimum groupings and
policies for micro-segmentation.
● Use NSX tagging and grouping constructs to group an application or environment to its
natural boundaries. This will enable simpler policy management.
● Consider the flexibility and simplicity of a policy model for Day-2 operations. It should
address ever-changing deployment scenarios rather than simply be part of the initial
setup.
● Leverage DFW category and policies to group and manage policies based on the chosen
rule model. (e.g., emergency, infrastructure, environment, application...)
● Use an explicit allow model; create explicit rules for allowed traffic and change the DFW
default rule from “allow” to “drop”.
1. Define NSX Groups for each of the Infrastructure Services. The following example shows the
group for DNS and NTP servers with IP addresses of the respective servers as group
members.
2. Define a policy for common services such as DNS and NTP, as in the figure below.
● Define this policy under the Infrastructure tab as shown below.
● Have two rules that allow all workloads to access the common services using the GROUPS
created in step 1 above.
● Use the Layer 7 context profiles, DNS and NTP, in the rules to further enhance the security
posture.
● Have a catch-all deny rule to deny any other destination for the common services, with
logging enabled, for compliance and for monitoring any unauthorized communication.
Note: If the management entities are not in an exclusion list, this section would need to
have rules to allow the required protocols between the appropriate entities. See
HTTPS://PORTS.VMWARE.COM/HOME/VSPHERE for the ports for all VMware products.
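
The two steps above can be sketched as request bodies built in Python. The field names, the `IPAddressExpression` type, the `Infrastructure` category value, and the `/infra/services/...` paths follow the NSX Policy API as commonly documented and should be verified for the deployed version; the group IDs, server addresses, and helper functions are hypothetical. Layer 7 context profiles are not modeled here.

```python
def ip_group(group_id: str, addresses: list[str]) -> dict:
    """Group whose members are static IP addresses (step 1)."""
    return {
        "id": group_id,
        "display_name": group_id,
        "expression": [
            {"resource_type": "IPAddressExpression", "ip_addresses": addresses}
        ],
    }


def group_path(group_id: str) -> str:
    return f"/infra/domains/default/groups/{group_id}"


DNS_SERVICES = ["/infra/services/DNS", "/infra/services/DNS-UDP"]  # assumed paths
NTP_SERVICES = ["/infra/services/NTP"]                             # assumed path


def common_services_policy(dns_group: str, ntp_group: str) -> dict:
    """Infrastructure-category policy: two allow rules plus a logged catch-all deny."""

    def rule(name: str, seq: int, dst: list[str], services: list[str],
             action: str, logged: bool) -> dict:
        return {
            "resource_type": "Rule", "display_name": name, "sequence_number": seq,
            "source_groups": ["ANY"], "destination_groups": dst,
            "services": services, "action": action, "logged": logged,
        }

    return {
        "resource_type": "SecurityPolicy",
        "display_name": "common-services",
        "category": "Infrastructure",
        "rules": [
            rule("allow-dns", 10, [group_path(dns_group)], DNS_SERVICES, "ALLOW", False),
            rule("allow-ntp", 20, [group_path(ntp_group)], NTP_SERVICES, "ALLOW", False),
            rule("deny-other-dns-ntp", 30, ["ANY"], DNS_SERVICES + NTP_SERVICES, "DROP", True),
        ],
    }


dns = ip_group("DNS-SERVERS", ["10.10.10.53"])    # hypothetical address
ntp = ip_group("NTP-SERVERS", ["10.10.10.123"])   # hypothetical address
policy = common_services_policy(dns["id"], ntp["id"])
print([r["display_name"] for r in policy["rules"]])
```

The catch-all rule is scoped to the DNS and NTP services only, so it blocks unsanctioned servers without affecting other traffic that later categories are meant to handle.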
Phase-2: Define Segmentation around ZONES - by having an explicit allow policy between
ZONES
As per the requirement, define a policy between zones to deny any traffic between zones. This
can be done using IP CIDR blocks, as data center zones have pre-assigned IP CIDR blocks.
Alternatively, this can be done using workload tags or other approaches. However, the IP-GROUP
based approach is simpler (the admin has a pre-assigned IP CIDR block per zone), requires no
additional workflow to tag workloads, and takes less of a toll on the NSX Manager and
control plane than the tagged approach. The tagged approach may add additional burden on the
NSX Manager to compute and update policies in an environment with scale and churn. As a rule of
thumb, the larger the IP block that can be defined in a rule, the more the policy can be optimized
using CIDR blocks. In cases where there is no convenient CIDR block to group workloads, static
groupings may be used to create entities without churn on the NSX Manager.
Here are the suggested steps:
1. Define two NSX Groups, one for each ZONE (Development and Production), say
DC-ZONE-DEV-IP & DC-ZONE-PROD-IP, with the respective IP CIDR blocks associated with the
respective zones as members.
2. Define a policy in the Environment category using the IP GROUPS created in step 1 to restrict
all communication between the Development and Production ZONEs.
3. Have logging enabled for this policy to track all unauthorized communication attempts.
(Note: In many industries, it is sufficient to log only the default action for
troubleshooting purposes. In others, there may be a compliance mandate to log every
action. Logging requirements are driven by the balance between storage costs and
compliance requirements.)
This is a two-step approach to building a policy for the application. The first step is to build a
fence around the application to create a security boundary. The second step is to profile the
application further to plan and build more granular port-defined security policies between tiers.
Start with the DEV zone first and identify an application to be micro-segmented, say
DEV-ZONE-APP-1.
Identify all VMs associated with the application within the zone.
Check whether the application has its own dedicated network Segments or IP Subnets.
If yes, you can leverage a Segment or IP-based Group.
If no, tag the application VMs with unique zone- and application-specific tags, say
ZONE-DEV & APP-1.
Check whether this application requires any communication other than infra services and
communication within the group. For example, the APP is accessed from outside on HTTPS.
Once you have the above information about DEV-ZONE-APP-1, create segmentation around the
application with the following steps:
1- Apply two tags, ZONE-DEV & APP-1, to all the VMs belonging to APP-1 in the DEV zone.
2- Create a GROUP, say “ZONE-DEV-APP-1”, with criteria to match on tag equal to
“ZONE-DEV & APP-1”.
3- Define a policy under the Application category with 3 rules, as in FIGURE 5-26: APPLICATION
POLICY EXAMPLE.
a. Set “Applied To” to “ZONE-DEV-APP-1” to limit the scope of the policy to
the application VMs.
b. The first rule allows all internal communication between the application VMs.
Enable logging for this rule to profile the application tiers and protocols. (Each
log entry contains the 5-tuple details of the connection.)
c. The second rule allows access to the front end of the application from outside. Use
the L7 context profile to allow only SSL traffic. The example below uses Exclude
Source from within the ZONE, so that the application is accessible only from outside the
ZONE and not from within it, except from the APP’s other VMs as per rule one.
d. The third rule denies all other communication to the “ZONE-DEV-APP-1” VMs by default.
Enable logging for compliance and to monitor any unauthorized communication.
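The three rules above can be sketched as a first-match evaluation. This is a simplified model for illustration, not NSX code: the tag and group names come from the text, while the function names, the service labels and the "not-applicable" result are assumptions. Rule 3 is modeled as denying any remaining flow that involves the application VMs.

```python
# Illustrative model only; group names follow the text, everything else is assumed.
APP_GROUP = frozenset({"ZONE-DEV", "APP-1"})   # criteria of group ZONE-DEV-APP-1
ZONE_GROUP = frozenset({"ZONE-DEV"})           # any workload inside the DEV zone


def in_group(tags, criteria):
    """A workload is a member when it carries every tag in the criteria."""
    return criteria <= frozenset(tags)


def evaluate(src_tags, dst_tags, service):
    """Return (action, logged) for one flow, using first-match rule order."""
    src_app = in_group(src_tags, APP_GROUP)
    dst_app = in_group(dst_tags, APP_GROUP)
    # "Applied To": the policy is only enforced on the application VMs.
    if not (src_app or dst_app):
        return ("not-applicable", False)
    # Rule 1: allow everything between the application VMs, with logging.
    if src_app and dst_app:
        return ("allow", True)
    # Rule 2: allow HTTPS to the application from sources outside the zone.
    if dst_app and not in_group(src_tags, ZONE_GROUP) and service == "HTTPS":
        return ("allow", False)
    # Rule 3: deny and log everything else involving the application VMs.
    return ("deny", True)
```

For example, `evaluate(set(), {"ZONE-DEV", "APP-1"}, "HTTPS")` models an external client reaching the front end and matches rule 2, while the same request from a VM tagged `ZONE-DEV` and `APP-2` falls through to the logged deny.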
Log entries identify the direction (In/Out) as well as the protocol, source IP address/port,
and destination IP address/port of each flow. If using the log file for policy definition, it is
often advisable to process the log files in a spreadsheet such as Excel to sort the traffic.
Typically, two sheets are created, one for IN traffic and one for OUT traffic. Each sheet is then
sorted by port first, then by IP address (by destination IP address for IN traffic, and by
source IP address for OUT traffic). This sorting methodology groups together the multiple servers
serving or accessing the same traffic. For each of these groupings, a rule can be inserted above
rule 1 for the application. This prevents the known traffic from appearing in the log. Once
sufficient confidence is gained that the application is completely understood (this is typically
when the logs are empty), the original rule ZONE-DEV-APP-1 can be removed. At this point, the
security model has transitioned from zone-based to micro-segmentation. (Note: Certain
environments, such as labs, may be best served by ring fencing, whereas other environments
may wish to add service insertion on top of micro-segmentation for certain traffic types, such
as sensitive financial information. The value of NSX is that it provides the customer with the
means to implement appropriate security in one environment without impacting the other.)
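The sorting methodology described above can also be scripted instead of done in a spreadsheet. The sketch below assumes the log lines have already been parsed into dictionaries; the field names and sample addresses are hypothetical and do not reflect the actual firewall log format.

```python
import ipaddress
from collections import defaultdict

# Hypothetical, pre-parsed log records (field names are assumptions).
SAMPLE_FLOWS = [
    {"direction": "IN", "protocol": "TCP", "src_ip": "10.1.1.10", "dst_ip": "10.1.2.21", "dst_port": 3306},
    {"direction": "IN", "protocol": "TCP", "src_ip": "10.1.1.11", "dst_ip": "10.1.2.20", "dst_port": 3306},
    {"direction": "OUT", "protocol": "TCP", "src_ip": "10.1.1.10", "dst_ip": "10.9.9.9", "dst_port": 443},
    {"direction": "OUT", "protocol": "TCP", "src_ip": "10.1.1.11", "dst_ip": "10.9.9.9", "dst_port": 443},
]


def group_flows(flows):
    """Build one 'sheet' per direction, keyed by (port, protocol).

    IN traffic is grouped by destination IP, OUT traffic by source IP, so that
    servers serving or accessing the same traffic end up in the same group.
    """
    sheets = {"IN": defaultdict(set), "OUT": defaultdict(set)}
    for flow in flows:
        key_ip = flow["dst_ip"] if flow["direction"] == "IN" else flow["src_ip"]
        sheets[flow["direction"]][(flow["dst_port"], flow["protocol"])].add(key_ip)
    return {
        direction: {
            key: sorted(ips, key=ipaddress.ip_address)
            for key, ips in sorted(groups.items(), key=lambda item: item[0])
        }
        for direction, groups in sheets.items()
    }
```

Each resulting group is a candidate for one rule inserted above rule 1, with the grouped IP addresses as the destination (IN) or source (OUT) and the port and protocol as the service.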
Phase-5: Repeat Phase-3 for other applications and ZONES.
Repeat the same approach as in Phase-3 for other applications, to establish a security boundary for
every application within ZONE-DEV and ZONE-PROD. Note that the securing of each of
these applications can happen asynchronously, without impact to the others. This
accommodates application-specific maintenance windows, where required.
Phase-6: Define Emergency policy, Kill Switch, in case of Security Event
An emergency policy is enforced at the top of the firewall table and is mainly leveraged for
the following use cases:
1- To quarantine vulnerable or compromised workloads in order to protect other
workloads.
2- To explicitly deny known bad actors by their IP subnet, based on geolocation
or reputation.
This policy is defined in the Emergency category as shown:
1- The first two rules quarantine all traffic from workloads belonging to the group
GRP-QUARANTINE.
a. “GRP-QUARANTINE” is a group that matches all VMs with a tag equal to
“QUARANTINE”. (If guest introspection is implemented, the AV/AM tags can be
used to define different quarantine levels.)
b. To enforce this policy on vulnerable VMs, add the tag “QUARANTINE” to them. This
isolates the VMs and allows only admins to access the hosts to fix the vulnerability.
2- The other two rules use a Group of known bad IPs to stop any communication with those
IPs.
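The intent of the Emergency category can be sketched as a check that runs before any other category. The exact rules live in a figure that is not reproduced here, so the rule composition below is an assumption. The admin and bad-actor networks are placeholder documentation ranges, and the function name is hypothetical.

```python
import ipaddress

# Placeholder values for illustration only (documentation address ranges).
ADMIN_NETWORK = ipaddress.ip_network("192.0.2.0/28")       # assumed admin jump hosts
BAD_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]    # assumed known bad actors


def emergency_verdict(src_ip, dst_ip, src_tags, dst_tags):
    """Return 'allow', 'drop', or None when no emergency rule matches.

    None means the flow falls through to the categories evaluated afterwards.
    """
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if "QUARANTINE" in src_tags or "QUARANTINE" in dst_tags:
        # Only admin access towards the quarantined workload is permitted.
        if "QUARANTINE" in dst_tags and src in ADMIN_NETWORK:
            return "allow"
        return "drop"
    # Block any communication with known bad IP subnets, in either direction.
    if any(src in net or dst in net for net in BAD_NETWORKS):
        return "drop"
    return None
```

Tagging a VM with `QUARANTINE` is then the only change needed to isolate it, since group membership follows the tag.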
In creating these policies, the iterative addition of rules to the policy is something that can be
done at any time. It is only when the action of the default rule changes from allow to
deny/drop that a maintenance window is advised. As logging has been on throughout the
process, it is highly unusual to see an application break during the window. What is most
frequently the case is that something within the next week or month may emerge as an
unforeseen rule that was missed. For this reason, it is advised that even in environments where
compliance does not dictate the collection of logs, the Deny All rule be set to logging. Aside
from the security value of understanding the traffic that is being blocked, the Deny All rule logs
are very useful when troubleshooting applications.
At this point, you have a basic level of micro-segmentation policy applied to all the workloads to
shrink the attack surface. As a next step, you further break the application down into application
tiers and their communication by profiling application flows, either using firewall logs or by
exporting IPFIX flows to the Network Insight platform. This helps to group the application workloads
based on their function within the application and to define policy based on the associated ports
and protocols. Once you have identified these groupings and protocols for a given application,
update the policy for that application by creating additional groups and rules with the right
protocols, adding granular rules one at a time.
With this approach, you start micro-segmentation with outside-in fencing and finish with a
granular, port-based micro-segmentation policy for all the applications.
The following FIGURE 5-28: NSX FIREWALL FOR ALL DEPLOYMENT SCENARIO summarizes different
datacenter deployment scenarios and the associated NSX firewall security controls that best fit
each design. You can use the same NSX Manager as a single pane of glass to define security policies
for all of these scenarios using the different security controls.
by the virtual server itself. This model is popular as it provides benefits for application scale-out
and high-availability:
• Application scale-out:
The following diagram represents traffic sent by users to the VIP of a virtual server,
running on a load balancer. This traffic is distributed across the members of a pre-
defined pool of capacity.
The server pool can include an arbitrary mix of physical servers, VMs or containers that
together, allow scaling out the application.
• Application high-availability:
The load balancer is also tracking the health of the servers and can transparently
remove a failing server from the pool, redistributing the traffic it was handling to the
other members:
Modern applications are often built around advanced load balancing capabilities, which go far
beyond the initial benefits of scale and availability. In the example below, the load balancer
selects different target servers based on the URL of the requests received at the VIP:
Thanks to the native load balancing capabilities of NSX, modern applications can be deployed
without requiring any third-party physical or virtual load balancer. The next sections in this
part describe the architecture of the NSX load balancer and its deployment modes.
Load Balancer
The NSX load balancer is running on a Tier-1 gateway. The arrows in the above diagram
represent a dependency: the two load balancers LB1 and LB2 are respectively attached to the
Tier-1 gateways 1 and 2. Load balancers can only be attached to Tier-1 gateways (not Tier-0
gateways), and one Tier-1 gateway can only have one load balancer attached to it.
Virtual Server
On a load balancer, the user can define one or more virtual servers (the maximum number
depends on the load balancer form factor – see the NSX Administrator Guide for load balancer
scale information). As mentioned earlier, a virtual server is defined by a VIP and a TCP/UDP port
number, for example IP: 20.20.20.20 TCP port 80. The diagram represents four virtual servers
VS1, VS2, VS5 and VS6. A virtual server can have basic or advanced load balancing options, such
as forwarding specific client requests to specific pools (see below), redirecting them to external
sites, or even blocking them.
Pool
A pool is a construct grouping servers hosting the same application. Grouping can be configured
using server IP addresses or for more flexibility using Groups. NSX provides advanced load
balancing rules that allow a virtual server to forward traffic to multiple pools. In the above
diagram for example, virtual server VS2 could load balance image requests to Pool2, while
directing other requests to Pool3.
Monitor
A monitor defines how the load balancer tests application availability. Those tests can range
from basic ICMP requests to matching patterns in complex HTTPS queries. The health of the
individual pool members is then validated according to a simple check (server replied), or more
advanced ones, like checking whether a web page response contains a specific string.
Monitors are specified by pools: a single pool can use only 1 monitor, but the same monitor can
be used by different Pools.
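The relationships between these components can be summarized in a small data model. This is a conceptual sketch, not the NSX Policy API schema: the class and attribute names are assumptions, and only the cardinalities (one load balancer per Tier-1 gateway, at most one monitor per pool, monitors shared between pools) come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Monitor:
    name: str


@dataclass
class Pool:
    name: str
    members: List[str]
    monitor: Optional[Monitor] = None  # a pool uses at most one monitor


@dataclass
class VirtualServer:
    name: str
    vip: str
    port: int
    pools: List[Pool] = field(default_factory=list)  # LB rules may target several pools


@dataclass
class LoadBalancer:
    name: str
    virtual_servers: List[VirtualServer] = field(default_factory=list)


@dataclass
class Tier1Gateway:
    name: str
    load_balancer: Optional[LoadBalancer] = None

    def attach(self, lb: LoadBalancer) -> None:
        """Attach a load balancer; a Tier-1 gateway can only have one."""
        if self.load_balancer is not None:
            raise ValueError(f"{self.name} already has a load balancer attached")
        self.load_balancer = lb


# The same monitor object can be referenced by several pools (member IPs are made up).
shared_monitor = Monitor("Monitor1")
pool_a = Pool("Pool1", ["10.0.0.11", "10.0.0.12"], shared_monitor)
pool_b = Pool("Pool2", ["10.0.0.21"], shared_monitor)
vs1 = VirtualServer("VS1", "20.20.20.20", 80, [pool_a, pool_b])
tier1 = Tier1Gateway("Tier-1 Gateway 1")
tier1.attach(LoadBalancer("LB1", [vs1]))
```

Calling `tier1.attach(...)` a second time raises `ValueError`, which mirrors the one-load-balancer-per-Tier-1 constraint.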
Because the traffic between clients and servers necessarily goes through the load-balancer, there
is no need to perform any LB Source-NAT (Load Balancer Source Network Address Translation at the
virtual server VIP).
The in-line mode is the simplest load-balancer deployment model. Its main benefit is that the
pool members can directly identify the clients from the source IP address, which is passed
unchanged (step 2). Because the load-balancer is a centralized service, it is instantiated on a
Tier-1 gateway SR (Service Router). The drawback of this model is that, because the Tier-1 gateway
now has a centralized component, East-West traffic for Segments behind different Tier-1 gateways
will be pinned to an Edge node in order to get to the SR. This is the case even for traffic that
does not need to go through the load-balancer.
Figure 6-6: One-Arm Load Balancing with Clients and Servers on the same segment
The need for a Tier-1 SR for the centralized load-balancer service results in East-West traffic
for Segments behind different Tier-1 gateways being pinned to an Edge node. This is the same
drawback as for the in-line model described in the previous part.
This design allows for better horizontal scale, as an individual segment can have its own
dedicated load-balancer service appliance(s). This flexibility in the assignment of load-balancing
resources comes at the expense of potentially instantiating several additional Tier-1 SRs on
several Edge nodes. Because the load-balancer service has its dedicated appliance, East-West
traffic for Segments behind a different Tier-1 gateway (the blue Tier-1 gateway in the above
diagram) can still be distributed. The diagram above represents a Tier-1 One-Arm LB attached to
an overlay segment.
A Tier-1 One-Arm LB can also be attached to physical VLAN segments, as shown in the above figure,
thus offering load balancing service even for applications on VLANs. In this use case, the
Tier-1 interface is also a Service Interface, but this time it is connected to a VLAN segment
instead of an overlay segment.
Load-balancer high-availability
The load-balancer is a centralized service running on a Tier-1 gateway, meaning that it runs on a
Tier-1 gateway Service Router (SR). The load-balancer will thus run on the Edge node of its
associated Tier-1 SR, and its redundancy model will follow the Edge high-availability design.
The above diagram represents two Edge nodes hosting three redundant Tier-1 SRs with a load-
balancer each. The Edge High Availability (HA) model is based on periodic keep alive messages
exchanged between each pair of Edges in an Edge Cluster. This keepalive protects against the
loss of an Edge as a whole. In the above diagram, should Edge node 2 go down, the standby
green SR on Edge node 1, along with its associated load-balancer, would become active
immediately.
There is a second messaging protocol between the Edges. This one is event driven (not
periodic), and per-application. This means that if a failure of the load-balancer of the red Tier-1
SR on Edge node 1 is detected, this mechanism can trigger a failover of just this red Tier-1 SR
from Edge node 1 to Edge node 2, without impacting the other services.
The active load balancer service will always synchronize the following information to the
standby load balancer:
• State Synchronization
• L4 Flow State
• Source-IP Persistence State
• Monitor State
This way, in case of failover, the standby load balancer (and its associated Tier-1 SR) can
immediately take over with minimal traffic interruption.
Load-balancer monitor
The pools targeted by the virtual servers configured on a load-balancer have their monitor
services running on the same load-balancer. This ensures that the monitor service cannot fail
without the load-balancer itself failing (fate sharing). The left part of the following diagram
represents the same example of the relationship between the load-balancer components as
the one used in part 6.3. The right part of the diagram provides an example of where those
components would be physically located in a real-life scenario.
Here, LB1 is a load-balancer attached to Tier-1 Gateway 1 and running two virtual servers VS1
and VS2. The SR for Tier-1 Gateway 1 is instantiated on Edge 1. Similarly, load-balancer LB2 is
on gateway Tier-1 Gateway 2, running VS5 and VS6.
Monitor1 and Monitor2 protect the server pools Pool1, Pool2 and Pool3 used by LB1. As a result,
both Monitor1 and Monitor2 are implemented on the SR where LB1 resides. Monitor2 also
polls servers used by LB2, so it is also implemented on the SR where LB2 is running. The
Monitor2 example highlights the fact that a monitor service can be instantiated in several
physical locations and that a given pool can be monitored from different SRs.
L7 VIP load balances HTTP or HTTPS connections. The client connection is terminated by the
VIP, and once the client’s HTTP or HTTPS request is received then the load balancer establishes
another connection to one of the pool members. If needed, some specific load balancing
configuration can also be done by L7 VIP, like a selection of specific pool members based on the
request.
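The request-based selection described above can be sketched as a lookup on the HTTP Host header followed by a round-robin choice within the selected pool. The host names and pool names match the example in this section; the member IP addresses, the default pool, and the function name are assumptions for illustration, and the code is not an NSX LB rule definition.

```python
import itertools

# Pool members are hypothetical addresses; pool and host names follow the example.
POOLS = {
    "www": ["10.0.1.11", "10.0.1.12"],
    "blog": ["10.0.2.11"],
}
HOST_RULES = {
    "www.mysite.com": "www",
    "blog.mysite.com": "blog",
}
# One independent round-robin iterator per pool.
_round_robin = {name: itertools.cycle(members) for name, members in POOLS.items()}


def select_member(host_header, default_pool="www"):
    """Pick the pool from the Host header, then the next member of that pool."""
    pool = HOST_RULES.get(host_header.lower(), default_pool)
    return pool, next(_round_robin[pool])
```

This kind of selection is only possible because the L7 VIP terminates the client connection and can read the request before opening the connection to a pool member.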
[Figure: L7 virtual server 30.30.30.30:80 sending requests for www.mysite.com to Pool www and
requests for blog.mysite.com to Pool blog]
For L7 VIP HTTPS, NSX LB offers 3 modes: HTTPS Off-Load, HTTPS End-to-End SSL, and SSL
Passthrough.
HTTPS Off-Load decrypts the HTTPS traffic at the VIP and forwards the traffic in clear HTTP to the
pool members. It offers the best balance between security, performance, and LB flexibility:
• Security: traffic is encrypted on the external side.
• Performance: web servers don’t have to run encryption.
• LB flexibility: all advanced configuration on HTTP traffic available like URL load
balancing.
HTTPS End-to-End SSL decrypts the HTTPS traffic at the VIP and re-encrypts the traffic in
another HTTPS session to the pool members. It offers the best security and LB flexibility:
• Security: traffic is encrypted end to end.
• Performance: this mode has lower performance with traffic decrypted/encrypted twice.
• LB flexibility: all advanced configuration on HTTP traffic available like URL load
balancing.
HTTPS SSL Passthrough does not decrypt the HTTPS traffic at the VIP; the SSL connection is
terminated on the pool members. It offers the best security and performance, but with limited LB
flexibility:
• Security: traffic is encrypted end to end.
• Performance: highest performance, since the LB does not terminate SSL traffic.
• LB flexibility: advanced configuration based on HTTP traffic is not available. Only
advanced configuration based on SSL traffic is available like SSL SNI load balancing.
Load-balancer IPv6
NSX LB, like many NSX network and security services, offers its service to both IPv4 and IPv6 clients.
[Figure: LB IPv4 (IPv4 clients load balanced to an IPv4 server pool) and LB IPv6 (IPv6 clients
load balanced to an IPv6 server pool)]
in-line. Clearly, traffic from clients must go through the Tier-1 SR where the load-balancer is
instantiated in order to reach the server and vice versa:
The following diagram represents another scenario that, from a logical standpoint at least,
looks like an in-line load-balancer design. However, source LB-SNAT is required in this design,
even if the traffic between the clients and the servers cannot apparently avoid the Tier-1
gateway where the load-balancer is instantiated.
Figure 6-18: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Tier-1 Expanded View
The following expanded view, where the Tier-1 SR and DR are represented as distinct entities
hosted in physically different locations in the network, clarifies the reason why source
LB-SNAT is mandatory:
Figure 6-19: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Tier-1 Expanded View
If source LB-SNAT were not configured, traffic from server to client would be switched directly
by the Tier-1 DR without going through the load-balancer on the SR. This design is therefore not
a true in-line deployment of the load-balancer, and it does require LB-SNAT.
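The reasoning can be reduced to one question: can the DR deliver the server's reply to the client without passing through the SR? The sketch below encodes that as a simplified rule of thumb. The function name and segment labels are hypothetical, and real designs should be validated against the actual topology.

```python
def lb_snat_required(client_segment, server_segment, tier1_downlink_segments):
    """Return True when return traffic would bypass the load balancer.

    Simplifying assumption: if client and server both sit on segments attached
    to the same Tier-1 gateway, the distributed router forwards the reply
    directly to the client, so the SR hosting the load balancer never sees it
    unless the source address is translated to an LB-owned address.
    """
    return (
        client_segment in tier1_downlink_segments
        and server_segment in tier1_downlink_segments
    )


# Clients and servers on downlinks of the same Tier-1: SNAT is needed.
same_tier1 = lb_snat_required("seg-clients", "seg-servers", {"seg-clients", "seg-servers"})
# Clients reached through the Tier-1 uplink (true in-line): SNAT is not needed.
true_inline = lb_snat_required("external", "seg-servers", {"seg-servers"})
```

With LB-SNAT enabled, the pool members no longer see the original client IP address, which is the trade-off against the true in-line model.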
Figure 6-20: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Logical View
The diagram below offers a possible physical representation of the same network, where the
Tier-1 gateway is broken down between an SR on an Edge Node, and a DR on the host where
both client and servers are instantiated (note that, in order to simplify the representation, the
DR on the Edge was omitted.)
Figure 6-21: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Tier-1 Expanded View
This representation makes it clear that, because the VIP is not physically instantiated on the DR
even though it belongs to the subnet of the downlink of the Tier-1 gateway, some additional
“plumbing” is needed to make sure that traffic destined to the load-balancer reaches its
destination. Thus, NSX configures proxy-ARP on the DR to answer local requests for the VIP and
adds a static route for the VIP pointing to the SR (represented in red in the diagram).
The first step in planning an NSX installation is understanding the deployment model we will
adopt. NSX can be adopted in three modes. They are not mutually exclusive, as they can coexist
in the same NSX installation and, in some cases, even in the same vSphere cluster. The three
deployment models map to different NSX use cases: virtual workload security, network virtualization, bare
metal server security. NSX can help in a variety of more advanced use cases, such as multi
location connectivity, disaster recovery, and cloud native applications, but here we focus on the
core use cases to create a design framework.
The three deployment models are:
1. Distributed security model for virtualized workloads
2. Network virtualization with distributed security for virtualized workloads
3. Bare metal workloads security via gateway firewall (centralized security) or NSX bare
metal agent
In the distributed security model, NSX provides distributed security services such as distributed
firewall and distributed IDS/IPS. Still, the physical network retains a significant part of the
switching and routing responsibilities. The only NSX component participating in the networking
functionalities is the VMware VDS, responsible for switching the traffic between the VMs
running in the hypervisor and the physical uplinks. Once the packets are delivered to the
upstream physical switch, it is the responsibility of the physical fabric to deliver it to the
appropriate destination. This networking model is the most common in vSphere deployments
that are not running NSX, making the adoption of NSX distributed security entirely transparent
for the physical network in brownfield environments.
The networking and security model provides the full benefit of NSX network and security
virtualization. Distributed security services operate as in the security-only model, being
orthogonal to the adoption of the NSX network virtualization services. NSX network
virtualization decouples the physical network from the topology requirements of the
applications and services that run on top of it. The physical fabric can be designed and operated
based on a simpler and more stable model that does not require following the application's
lifecycle. The role of adapting to the ever-changing application requirements is offloaded to the
NSX network virtualization layer. At the same time, the physical fabric design can be centered
around maximizing throughput and high availability, minimizing latency, and optimizing
operational efficiency.
Rack Availability | Achieving rack availability via vSphere HA is not possible without extending VLANs between racks | Easy to achieve rack availability via vSphere HA for compute workloads
Disaster Recovery | Disaster recovery requires re-IP | Easy disaster recovery without re-IP
Support of Advanced Networking – NAT, Multicast, IPV6, etc. | Tied to physical ASIC and limited by switch vendor | Software defined, allows scaling the networking and security stack
The following sections list the physical network requirements for both deployment models and
provide considerations around specific network fabric designs.
The centralized security (or gateway firewall) model is a third, less standard but emerging,
deployment model. In this option, NSX Edges act as the perimeter firewall for the data center or
a branch location and protect non-virtualized workloads, operating as a next-generation firewall
appliance. Physical server security via the NSX agent is covered in this DOCUMENT.
could have been dedicated to a cluster, or maybe two for rack redundancy, to limit the span of
VM mobility. In the real world, different lines of business fund their initiatives in different
phases, and IT operators must expand their clusters on demand, leveraging the available
network resources.
Figure 7-3: Network Overlays for elastic and non-uniform compute requirements
The role of the physical network is to provide reliable, high-throughput connectivity between
the hosts regardless of their nature. Dedicating a set of physical switches or racks to specific
workloads (vSphere clusters) is not a valid approach in data centers required to reach cloud
scale. It wastes resources when the racks and switches are underutilized, and it imposes an
upper bound on the scale of the workloads. What happens when you need to add a host to a
cluster with a dedicated rack and the ports or the rack space are unavailable? NSX overlays
allow the different vSphere clusters to grow inorganically while preserving connectivity
and VM mobility.
So far, we have explored the benefits and flexibility that overlays bring to private clouds
requiring elastic compute and network connectivity properties. We can extend the model to
north-south bandwidth requirements and network services. NSX edge nodes are responsible for
scaling the software-defined network solution over those dimensions. FIGURE 7-4 shows how
edge hosts (ESXi hosts running NSX edge VMs) can be added to the design over time based on
changing requirements. The only additional requirement on the physical network is two
additional subnets/VLANs to establish BGP peering in each rack. As for the other infrastructure
VLANs, those are local and do not stretch across racks.
A single NSX deployment can horizontally scale up to 640 edge nodes. They can be distributed
over multiple racks to reduce the physical fabric oversubscription ratio between the leaf and
spine layers.
Figure 7-4: Network Overlays for elastic North-South Bandwidth and network services
When planning a distributed security deployment without overlay, three options exist:
1. Prepare the vSphere hosts for NSX Security only, and leave the virtual machines
connected to the virtual distributed port-groups managed by vCenter. This option has
been deprecated and is not supported in any VCF version.
2. Prepare the vSphere hosts for NSX Networking & Security, and migrate the virtual
machines to NSX VLAN Segments.
3. Starting in NSX 4.2.0, it’s possible to enable NSX capabilities, including DFW, on
distributed port-groups for clusters prepared for network and security. This option
provides the greatest flexibility. A cluster can host workloads connected to NSX overlays,
NSX VLAN segments, or vCenter distributed port-groups and all of them will benefit
from the NSX distributed services.
Option 1 (deprecated) does not require reconfiguring the networking for the VMs in a brownfield
environment. vCenter retains ownership of the entire networking configuration.
All virtual machines connected to dvpgs are automatically protected by the NSX DFW. This
option does not allow the implementation of NSX overlays and VPCs. It has been
deprecated in favor of option 3 and, in some scenarios, option 2.
Option 2 (Recommended for Greenfield and in larger deployments) requires placing the
workloads on the NSX-managed distributed virtual port-groups that are automatically created when a
VLAN segment is defined in NSX. In a brownfield environment, we can create NSX segments with the
same VLAN as the dvpgs where the VMs reside and then reconfigure the VMs' vNICs with no
disruption. VMs that remain on vCenter dvpgs are excluded from the NSX distributed firewall. With
this option, we can adopt overlays at a later stage. Option 2 requires NSX 3.0 or later,
vSphere 7, and VDS 7.
Option 3 (Most seamless for Brownfield) enables NSX distributed services (e.g., DFW) in brownfield
environments where the VMs are connected to vCenter dvpgs, without migrating their vNICs to
NSX VLAN segments. At the same time, it allows for the adoption of NSX overlays and VPCs in
the very same clusters. Option 3 requires NSX 4.2.0 or later and is supported in VCF starting
with VCF 5.2.0.
Option 3 is generally preferred in brownfield implementations of NSX, but option 2 provides
simpler lifecycle management of VLAN networks via NSX Manager in environments with a high
number of VDSs across one or multiple vCenters. If a VLAN spans many or all of
those VDSs, a vCenter dvpg must be created individually on each VDS. By contrast, when creating a
VLAN segment in NSX, a corresponding vCenter dvpg is automatically created on every VDS that is
part of the corresponding VLAN TZ. This consideration applies specifically to VCF deployments where
each vSphere cluster has a dedicated VDS, a design choice that increases the number of VDSs
and dvpgs to be managed.
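A quick calculation illustrates the operational difference. The sketch assumes that every VLAN spans every VDS, which is the worst case. The VLAN and VDS counts are made-up figures, not NSX or vCenter limits.

```python
def objects_to_manage(vlans, vds_count):
    """Compare manually created dvpgs with NSX VLAN segments.

    Manual approach: one dvpg per VLAN on each VDS.
    NSX approach: one VLAN segment per VLAN; NSX creates the dvpgs on each VDS.
    """
    manual_dvpgs = vlans * vds_count
    nsx_segments = vlans
    return manual_dvpgs, nsx_segments


# Hypothetical environment: 50 VLANs stretched across 20 VDSs.
manual, via_nsx = objects_to_manage(50, 20)
```

In this hypothetical case an operator would create 1000 dvpgs by hand, versus defining 50 VLAN segments in NSX.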
From a scalability perspective, option 2 is a better fit for larger environments with a high
number of VDSs, vCenter Servers, and dvpgs that must be managed by a single NSX instance. Please
refer to configmax for the most up-to-date numbers.
ESXi | NSX VLAN Segment (connects Guest VMs to VLANs) | ESXi VLAN Transport Zone (Different than the one for NSX Edge TNs) | Yes, with overlay and VPCs in the same cluster | VMs connected to an NSX VLAN Segment can have a SI as the default gateway
NSX Edge VM | VLAN Segment (routing peering or service interfaces) | Edge VLAN Transport Zone (Different than the one for ESXi TNs) | N/A. NSX Edge VMs are automatically excluded from DFW | SI is connected to NSX Edge VLAN when it serves as the default gateway for VMs on dvpgs or VLAN segments. VLAN ID must match. Dynamic routing over SI is only supported for EVPN route servers mode
• IP and MAC mobility across switches in the same rack – This is not NSX specific. It is
based on vSphere networking. When ESXi hosts are connected to different switches, vSphere
networking assumes that moving the IP and MAC addresses of VMs and VMkernel interfaces from one
uplink to another is transparent to the physical network. This is how vSphere teaming
policies protect against the failure of a pNIC or an upstream switch. If the two switches
to which the ESXi host is connected are not layer two adjacent, this mechanism cannot work.
Most network fabrics provide this capability; the exceptions are pure layer three
spine and leaf architectures. In such a scenario, the only option (not recommended) is to
single-connect the servers and provide high availability at a different level.
Once the above requirements are met, an NSX security-only deployment is agnostic to any
physical fabric topology and configuration. NSX network virtualization has an additional MTU
requirement.
• On any switch from any physical switch vendor, including legacy switches.
• With any underlying technology. IP connectivity can be achieved over an end-to-end
layer 2 network as well as across a fully routed environment.
For an optimal design and operation of NSX, well-known baseline standards are applicable.
These standards include:
• Device availability (e.g., host, TOR, rack-level)
• TOR bandwidth - both host-to-TOR and TOR uplinks
• Fault and operational domain consistency (e.g., localized peering of Edge node to the
northbound network, separation of host compute domains, etc.)
The following considerations apply to this topology in an NSX security only deployment (FIGURE
7-6):
• Different racks require different VLANs and network configurations. Rack 1 and 2
(yellow) host the compute clusters, rack 3 (green) hosts management workloads, rack 4
(blue) is dedicated to WAN connectivity, all with different VLAN requirements.
• While WAN and Management blocks are usually static, the compute rack switches may
require frequent configuration updates based on the applications lifecycle.
• Virtual machine mobility is not available across racks. This may limit the agility and
elasticity properties of the design and reduce the consolidation ratio in the data center.
• vSphere clusters cannot be striped across racks, which may limit high availability as the
infrastructure cannot protect against a rack failure.
The following considerations apply to this topology in an NSX network and security
deployment (FIGURE 7-7):
• The ToR VLAN configuration is more consistent than in a security-only deployment, and it
is agnostic to the application lifecycle. Generally, three configuration templates must be
provided for the ToR switches, one for the compute block (yellow), one for the
management block (green), and one for the edge block (blue). Those configurations
templates tend to be static as they do not require modification when new NSX virtual
networks are deployed.
• Virtual machine mobility is available across racks. NSX overlay segments extend layer two
networks across the layer three boundaries of each rack.
• Compute clusters can be stretched across racks for rack level high availability.
• Resource pooling is streamlined as hosts in different racks (or datacenter rooms) can be
added to the same vSphere cluster.
• While some infrastructure management VMs such as the Aria suite components can be
deployed on NSX overlay segments, NSX managers and vCenter should be deployed on a
VLAN network. This limits the span of the management vSphere cluster to a single rack unless
the management VLAN is extended across racks, which is required to provide rack high availability.
The NSX manager cluster appliances can be deployed in different IP subnets to avoid
extending layer two between racks, but no solution is currently available for the vCenter
server.
FIGURE 7-8 below outlines a sample VLAN and IP design for an NSX networking and security
deployment over a layer three fabric. Because each rack represents a unique layer two domain,
we can use the same VLAN IDs across racks to improve consistency.
The compute racks generally require only four VLANs (ESXi Management, vMotion, Storage, and
NSX Overlay), and they can share the same VDS. Management racks may or may not be
prepared for NSX. When they are not, they do not require the NSX Overlay VLAN.
The management racks require a VM Management VLAN to place the management
infrastructure components such as the vCenter Server and NSX Manager. This VLAN is generally
separate from the ESXi management network as it may be stretched between racks to provide
rack high availability for the management VMs. The stretching of the VM management VLAN
requires layer two links or physical fabric overlays. The ESXi management VLAN never has this
requirement.
The hosts in the edge racks are never prepared for NSX (unless edge and compute blocks are
merged) and require two VLANs dedicated to layer three peering between the physical fabric
and the NSX edges (VMs or Bare Metal). Edge vSphere clusters rarely need to be stretched
across racks. The edge node VMs themselves can participate in the same NSX edge cluster (and,
for example, support the same T0 Gateway) even if they are deployed on different vSphere
clusters in different racks. Edges in different racks can use layer 3 peering VLANs local to each
rack, so those VLANs do not need to be stretched. The layer three peering subnets should
support ECMP to 8 edge nodes, so /28 subnets or larger are recommended.
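The /28 recommendation can be checked with a quick address count. The sketch below is only illustrative: it assumes each peering subnet needs one address per edge node plus a hypothetical number of addresses on the physical side (two ToR interfaces by default), which is an assumption and not a requirement stated in this guide.

```python
import ipaddress

def smallest_peering_prefix(edge_nodes: int, fabric_addresses: int = 2) -> int:
    """Return the longest IPv4 prefix length whose usable hosts fit all peers."""
    needed = edge_nodes + fabric_addresses
    for prefix in range(30, 0, -1):
        # 192.0.2.0 is a documentation address used only to count hosts.
        net = ipaddress.ip_network(f"192.0.2.0/{prefix}", strict=False)
        usable = net.num_addresses - 2  # exclude network and broadcast
        if usable >= needed:
            return prefix
    raise ValueError("no IPv4 prefix is large enough")

# Eight edge nodes plus two assumed ToR addresses need ten usable hosts.
print(smallest_peering_prefix(8))
```

A /29 offers only six usable addresses, so it cannot hold eight edge interfaces even before the physical gateway addresses are counted.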
Figure 7-8: Sample VLAN and IP Subnets schema for NSX Network Virtualization deployment on a Layer 3 fabric
The following considerations apply to this topology in an NSX security only deployment (FIGURE
7-9):
• At minimum, VLANs are extended between the switches in the same rack, meeting the IP
and MAC mobility requirement of vSphere networking. VLANs can be extended across
racks, increasing VM mobility and allowing the possibility to stripe vSphere clusters
across racks for increased high availability.
• While possible, extending VLANs across multiple racks increases the size of a layer two
fault domain.
• In a dynamic environment requiring frequent provisioning of new networks to support
the applications lifecycle, the network administrator may need to perform frequent and
widespread changes to the physical network configuration.
• The network architect will be required to balance the above considerations. Maximum
flexibility in VM placement via a larger VLAN span will negatively impact the size of
failure domains and the overall manageability. A more segmented approach will instead
limit VM mobility and the overall elasticity of the design.
The following considerations apply to this topology in an NSX network and security
deployment (FIGURE 7-10 AND FIGURE 7-11):
• The network architect can limit the size of the layer two domains to a single rack by
pruning the VLANs appropriately. VM mobility is not impacted as NSX overlays extend
VM networks across layer three boundaries. This configuration makes the layer two
physical fabric properties very similar to those of a layer three fabric concerning the NSX
design.
• Management VMs VLAN can easily be extended between racks enabling the striping of
the management cluster.
• Small environments may benefit from a simpler VLAN/IP schema where the
infrastructure VLANs (management, vMotion, storage, overlay) are unique across the
fabric.
• The configuration of the physical devices becomes standardized and static. New
networks and topologies to support the applications lifecycle are provisioned in NSX
transparently to the physical fabric.
• Bare metal edge nodes, or the ESXi hosts where edge node VMs reside, should be
connected to layer three capable devices. It usually means the spine or aggregation layer
switches in a layer two fabric. When implementing such a design is not desirable,
dedicated layer 3 ToR switches may be deployed to interconnect the edge nodes.
Figure 7-10: Network and Security Deployment – L2 fabric – Edge Nodes connected to spine switches
Figure 7-11: Network and Security Deployment – L2 fabric – Edge Nodes connected to dedicated L3 ToRs
FIGURE 7-12 below outlines a sample VLAN/IP schema for a layer two fabric where VLANs have
been extended across the whole physical network. This design is appropriate for smaller
deployments. A large implementation may benefit from the segmentation of layer two
domains. In such cases, the VLAN/IP schema may approach the one outlined for the layer three
fabric example.
Figure 7-12: Sample VLAN and IP Subnets schema for NSX Network Virtualization deployment on a Layer 2 fabric
making the same layer two segments available on any desired leaf switch. Examples of such
fabrics are EVPN based fabrics and Cisco ACI.
From an NSX and vSphere perspective, a layer three fabric with overlay is equivalent to a layer
two fabric. As such, similar considerations apply. For NSX Security only deployments (FIGURE
7-13), the points to emphasize are:
• VLANs can be extended across racks via the physical fabric overlays, increasing the span
of VM mobility and providing the possibility to stripe vSphere clusters across racks for
increased high availability.
• The span of the VLANs is less of a concern regarding fault domains size compared to
Layer 2 fabrics because network overlays are in use.
• In a dynamic environment requiring frequent provisioning of new networks to support
the applications lifecycle, the network administrator may need to perform frequent and
widespread changes to the physical network configuration.
For NSX Network and Security deployments (FIGURE 7-14), the points to emphasize are:
• Larger physical fabrics can adopt a simplified VLAN/IP schema where unique VLANs are
common across the environment.
• NSX Geneve encapsulation will work over the physical fabric overlay encapsulation
without any problem as long as the virtual machines and physical fabric MTU account for
the additional headers.
• The configuration of the physical devices becomes standardized and static. New
networks and topologies to support the application lifecycle are provisioned in NSX
transparently to the physical fabric.
• Edge nodes are connected to edge leaf switches with access to the WAN network.
• Edge node VMs should not be deployed on clusters striped across racks as it would
require extending the L3 peering VLANs via the physical fabric overlay. Edge nodes
should peer to the leaf switches in the same rack on dedicated local VLANs. Edges in
different racks and vSphere clusters can be grouped in the same NSX edge cluster.
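The MTU point in the list above can be made concrete with a small calculation. This is a sketch under stated assumptions: the physical fabric overlay is assumed to be VXLAN over IPv4, frames are assumed untagged, and the Geneve option length is a parameter because it varies. The header sizes are the standard fixed sizes for each protocol; the function name is hypothetical.

```python
ETHERNET = 14     # untagged Ethernet header
IPV4 = 20         # IPv4 header without options
UDP = 8
GENEVE_BASE = 8   # fixed Geneve header; options are extra
VXLAN = 8         # assumed fabric overlay encapsulation

def required_mtus(vm_mtu: int, geneve_options: int = 0) -> dict:
    """Return the IP MTU needed at each layer for a VM packet of vm_mtu bytes."""
    # NSX wraps the VM's Ethernet frame in Geneve/UDP/IPv4.
    tep_mtu = vm_mtu + ETHERNET + GENEVE_BASE + geneve_options + UDP + IPV4
    # The fabric then wraps the TEP's Ethernet frame in VXLAN/UDP/IPv4.
    fabric_mtu = tep_mtu + ETHERNET + VXLAN + UDP + IPV4
    return {"tep_vlan_mtu": tep_mtu, "fabric_underlay_mtu": fabric_mtu}

print(required_mtus(1500))
```

With a 1500-byte VM MTU and no Geneve options, the sketch yields 1550 bytes on the TEP VLAN and 1600 bytes on the fabric underlay; any Geneve options or VLAN tags add to both figures.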
FIGURE 7-15 outlines a sample VLAN/IP schema for a layer three fabric with an overlay where
NSX network overlays are also in use. The number of IP segments is minimized as the physical
networks overlay allows each IP network to be available on every rack. Note the four physical to
virtual transit segments (VLAN 106-108) in the edge racks, two per rack. They are not extended
between racks and only provide connectivity to the local edge nodes.
Figure 7-15: Sample VLAN and IP Subnets schema for NSX Network Virtualization deployment on a Layer 3 fabric with overlays
Overview
NSX Manager Appliances (bundling manager and controller functions) are mandatory NSX
infrastructure components. Their networking requirement is basic IP connectivity with the
other NSX components. The detailed requirements for these communications are listed at
https://ports.vmware.com/home/NSX-Data-Center.
NSX Manager Appliances are typically deployed on a hypervisor and connected to a VLAN
backed port group on a vSS or vDS; there is no need for the colocation of the three appliances
in the same subnet or VLAN. There are no dependencies on MTU or encapsulation
requirements as the NSX Manager appliances send management and control plane traffic over
the management VLAN only. The NSX Manager appliances can be deployed on ESXi hypervisors.
FIGURE 7-16 shows three ESXi hypervisors in the management rack hosting three NSX Manager
appliances.
The ESXi management hypervisors are configured with a VDS/VSS with a management port
group mapped to a management VLAN. The management port group is configured with two
uplinks using physical NICs “P1” and “P2” attached to different top of rack switches. The uplink
teaming policy has no impact on NSX Manager operation, so it can be based on existing
VSS/VDS policy.
In a typical deployment, the NSX management components should be deployed on a VLAN. This
is the recommended best practice. Deploying NSX management components on the software
defined overlay requires elaborate considerations and is thus beyond the scope of this document.
7.3.3.1 Single vSphere Cluster for all NSX Manager Nodes in a Manager Cluster
When all NSX Manager Nodes are deployed into a single vSphere cluster, it is important to
design that cluster to meet the needs of the NSX Managers with at least the following:
1. At least four vSphere Hosts. This is to adhere to best practices around vSphere HA and
vSphere Dynamic Resource Scheduling (DRS) and allow all three NSX Manager Nodes to
remain available during proactive maintenance or a failure scenario.
2. The hosts should all have access to the same data stores hosting the NSX Manager
Nodes to enable DRS and vSphere HA.
3. Each NSX Manager Node should be deployed onto a different data store (this is
supported as a VMFS, NFS, or other data store technology supported by vSphere)
4. DRS Anti-Affinity rules should be put in place to prevent, whenever possible, two NSX
Manager VMs from running on the same host.
5. During lifecycle events of this cluster, each node should be independently put into
maintenance mode, moving any running NSX Manager Node off the host prior to
maintenance, or any NSX Manager Node should be manually moved to a different host.
6. If possible, rack-level and critical infrastructure failures (e.g., power, HVAC, ToRs) should also be
taken into account to protect this cluster from any single failure event taking down the
entire cluster at once. In many cases, this means spreading this cluster across multiple
cabinets or racks and connecting the hosts in it to a diverse set of physical switches, etc.
Rack level redundancy requires extending the management VLAN between the racks as
the NSX Managers generally reside on a vCenter managed VLAN dvpg. Properly planned
rack-level redundancy includes a cluster striped across three racks and DRS rules placing
the NSX Managers on hosts in different racks because the failure of a cabinet with more
than one NSX manager causes the loss of the quorum.
7. NSX Manager backups should be configured and pointed to a location running outside of
the vSphere Cluster that the NSX Manager Nodes are deployed on.
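The host and rack placement rules in the list above can be checked mechanically. The following is a minimal sketch, assuming a simple inventory expressed as two dictionaries; the host, rack, and manager names are hypothetical, and the quorum is modeled as a majority of the manager nodes.

```python
from collections import Counter

def placement_issues(placement: dict, host_rack: dict) -> list:
    """placement maps manager name -> host; host_rack maps host -> rack."""
    issues = []
    # Anti-affinity: no two managers should run on the same host.
    for host, count in Counter(placement.values()).items():
        if count > 1:
            issues.append(f"{count} managers share host {host}")
    # Rack redundancy: losing any one rack must leave a majority of managers.
    quorum = len(placement) // 2 + 1
    racks = Counter(host_rack[h] for h in placement.values())
    for rack, count in racks.items():
        if len(placement) - count < quorum:
            issues.append(f"losing rack {rack} breaks quorum")
    return issues

host_rack = {"esx-01": "rack-a", "esx-02": "rack-b",
             "esx-03": "rack-c", "esx-04": "rack-a"}
good = {"nsx-mgr-1": "esx-01", "nsx-mgr-2": "esx-02", "nsx-mgr-3": "esx-03"}
bad = {"nsx-mgr-1": "esx-01", "nsx-mgr-2": "esx-02", "nsx-mgr-3": "esx-04"}
print(placement_issues(good, host_rack))
print(placement_issues(bad, host_rack))
```

The second placement keeps the managers on separate hosts but puts two of them in the same rack, which is exactly the case the rack-level guidance warns about.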
7.3.3.2 Single vSphere Cluster when leveraging VSAN as the storage technology
When all NSX Manager Nodes are deployed into a single vSphere cluster where VSAN is the
storage technology in use, we should take additional steps to protect the NSX Manager
Cluster’s availability since VSAN will present only a single data store. A few VSAN-specific
parameters govern the availability of the resources deployed on the VSAN datastore: Primary
Level of Failures to Tolerate (PFTT), Secondary Level of Failures to Tolerate (SFTT), and
Failure Tolerance Mode (FTM). When only a single site in VSAN is configured, PFTT is set to 0
and SFTT is usually referred to as just FTT. The following configurations should be made to
improve the availability of the NSX Management and Control Planes:
• FTT >= 2
• FTM = RAID-1
• Number of hosts in the VSAN cluster >= 5
This will dictate that for each object associated with the NSX Manager Node, two or more
copies of the data and a witness are available even if two failures occur, allowing the objects to
remain in a healthy state. This will accommodate both a maintenance event and an outage
occurring without impacting the integrity of the data on the datastore. This configuration
requires at least five (5) hosts in the vSphere Cluster.
Cabinet Level Failures can be accommodated as well, by distributing the cluster horizontally
across multiple cabinets. No more hosts than the number of failures that can be tolerated (a
maximum of FTT=3 is supported by VSAN) should exist in each rack. This means a minimum of
five (5) hosts spread across three (3) racks in a 2-2-1 pattern, and DRS rules preferentially
placing the NSX Managers in different racks.
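The host count and the 2-2-1 pattern follow from simple arithmetic. The sketch below assumes the usual RAID-1 mirroring rule that FTT failures require FTT + 1 replicas and FTT witness components, each on a separate host; the function names are hypothetical.

```python
def raid1_min_hosts(ftt: int) -> int:
    """Hosts needed for RAID-1 mirroring: ftt + 1 replicas plus ftt witnesses."""
    return 2 * ftt + 1

def rack_layout_ok(hosts_per_rack: list, ftt: int) -> bool:
    """True if the cluster is large enough and no rack exceeds the failure budget."""
    enough_hosts = sum(hosts_per_rack) >= raid1_min_hosts(ftt)
    rack_within_budget = all(n <= ftt for n in hosts_per_rack)
    return enough_hosts and rack_within_budget

print(raid1_min_hosts(2))            # hosts required for FTT=2
print(rack_layout_ok([2, 2, 1], 2))  # the 2-2-1 pattern
print(rack_layout_ok([3, 1, 1], 2))  # one rack holds more hosts than FTT
```

A 3-1-1 layout has the same number of hosts as 2-2-1, but the loss of the first rack would exceed the two failures the policy tolerates.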
Implementing VSAN Stretched Clusters to provide cabinet level protection is not
recommended. While appealing because of the reduced number of hosts required (2 hosts and
a witness VM in 3 different failure domains are the minimum requirements), the failure of a
rack may cause the loss of the NSX manager cluster quorum because two NSX manager
appliances are placed in the same rack.
It is strongly recommended that hosts, any time they are proactively removed from service,
vacate the storage and repopulate the objects on the remaining hosts in the VSAN cluster.
Providing a VSAN cluster with five hosts is the recommended approach, but fewer hosts are
available in some situations. VSAN 7 update 2 introduced the Enhanced Data
Durability feature, which may reduce the risk of running the NSX manager cluster with a
storage policy with FTT=1. With this feature, if a failure occurs when one of the nodes has been
taken out of service for maintenance, VSAN can rebuild the up-to-date data once the host is out
of maintenance mode.
Additional VSAN settings and best practices that we should consider are:
• Enable Operations Reserve and Host Rebuild Reserve. This helps monitor the reserve
capacity threshold, generates alerts when the threshold is reached and prevents further
provisioning.
• Ensure that the VM storage policy applied to the NSX Managers is compliant.
• Use the data migration pre-check tool before carrying out host/cluster
maintenance.
• Review the vSAN Skyline Health Checks regularly.
The NSX Manager availability has improved compared to the previous option; however, it is
important to clarify the distinction between node availability and load-balancing. In the case of
cluster VIP, all the API and GUI requests go to one node, and it's not possible to achieve load-
balancing of GUI and API sessions. In addition, the sessions that were established on a failed
node will need to be re-authenticated and re-established at the new owner of the cluster VIP.
The cluster VIP model is also designed to address the failure of certain NSX Manager services
but cannot guarantee recovery for certain corner-case service failures.
It is important to emphasize that the northbound clients' connectivity happens via the cluster
VIP. Still, the communication between NSX Manager nodes and the NSX transport nodes
components happens on the individual node IPs.
The cluster VIP is the preferred and recommended option for high availability of the NSX
Manager appliance nodes.
FIGURE 7-21 presents a scenario where the NSX manager nodes are load-balanced by an
external load balancer, and session persistency is based on the source IP of the client. With this
configuration, the load balancer will redirect all requests from the same client to the same NSX
manager node.
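Source-IP persistence can be illustrated with a few lines of code. This is only a sketch of the idea: real load balancers typically keep a persistence table rather than a pure hash, and the node names and the hashing scheme here are assumptions made for illustration.

```python
import hashlib
import ipaddress

NODES = ["nsx-mgr-1", "nsx-mgr-2", "nsx-mgr-3"]  # hypothetical node names

def pick_node(client_ip: str, healthy: list) -> str:
    """Map a client IP to one healthy node; stable for a given node list."""
    packed = ipaddress.ip_address(client_ip).packed
    digest = hashlib.sha256(packed).digest()
    return healthy[int.from_bytes(digest[:4], "big") % len(healthy)]

# Repeated requests from the same client land on the same manager node.
first = pick_node("10.1.1.10", NODES)
second = pick_node("10.1.1.10", NODES)
print(first == second)
```

Because the mapping depends only on the client address, a browser session keeps reaching the node where it authenticated as long as the set of healthy nodes does not change.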
NSX Manager can authenticate clients in four ways - HTTP basic authentication, client
certificate authentication, vIDM, and session. The API-based client can use all four forms of
authentication while web browsers use session-based authentication. The session-based
authentication typically requires LB persistence configuration, while API-based access does not
mandate that. FIGURE 7-21 represents a VIP with LB persistence configuration for both browser
(GUI) and API-based access.
While one can conceive an advanced load-balancing schema with a dedicated VIP with LB persistence for
browser access and another VIP without LB persistence for API access, this option may
have limited value in terms of scale and performance differentiation while complicating the
access to the system. For this reason, it is highly recommended to first adopt the basic option of
LB persistence based on source IP with a single VIP for all types of access. The overall
recommendation is to start with a cluster VIP and move to an external LB if real needs exist.
single pNIC is used, its failure is not covered by a standby pNIC. Such a scenario is not common,
but it can be an appropriate choice when other high availability mechanisms like vSphere HA or
application-level redundancy are in place.
A scenario where we may leverage deterministic traffic per pNIC is in the case of multiple
distinct physical networks, e.g., DMZ, Storage, or Backup Networks, where the physical
underlay differs based on pNIC.
Designing a traffic management schema that utilizes both design patterns is possible. This
design guide covers both scenarios based on specific requirements and makes generalized
recommendations.
Figure 7-22: ESXi Compute Rack Failover Order Teaming with One Teaming Policy
In FIGURE 7-22, a single VDS is used with a two-pNIC design and carries both
infrastructure and VM traffic. Physical NICs “P1” and “P2” are attached to the different top of
rack switches. The teaming policy selected is failover order active/standby; “Uplink1” is active
while “Uplink2” is standby.
When the virtual switch is a VDS with NSX, only the NSX DVPG traffic will follow the NSX
teaming policy. The infrastructure traffic will follow the teaming policy defined in their
respective VDS DVPGs configured in vCenter.
The top-of-rack switches are configured with a first-hop redundancy protocol (e.g., HSRP or
VRRP), providing the active default gateway for all the VLANs on “ToR-Left.” The VMs are
attached to overlay or VLAN segments defined in NSX, and they will follow the default teaming
policy regardless of the type of segment they are connected to. With the use of a single
teaming policy, the above design allows for a simple configuration of the physical infrastructure
and simple traffic management at the expense of leaving an uplink completely unused.
It is, however, easy to load balance traffic across the two uplinks while maintaining the
deterministic nature of the traffic distribution. The following example shows the same hosts,
this time configured with two separate failover order teaming policies: one with P1 active, P2
standby, and the other with P1 standby, P2 active. Then, individual traffic types can be assigned
a preferred path by mapping it to either teaming policy.
Figure 7-23: ESXi Compute Rack Failover Order Teaming with two Teaming Policies
In FIGURE 7-23, storage and vMotion traffic follow a teaming policy setting P1 as primary active,
while management and VM traffic follow a teaming policy setting P2 as primary active.
When the virtual switch is a VDS with NSX, the above design is achieved with a default teaming
policy P2 active & P1 standby. Then, the DVPGs for infrastructure traffic need to be configured
individually: storage and vMotion will have a failover order teaming policy setting P1 active &
P2 standby, while the management DVPG will be configured for P2 active & P1 standby.
The ToR switches are configured with a first-hop redundancy protocol (FHRP), providing an
active default gateway for storage and vMotion traffic on “ToR-Left,” management, and overlay
traffic on “ToR-Right” to limit interlink usage. Multiple teaming policies allow the utilization of
all available pNICs while maintaining deterministic traffic management. This is the
recommended approach when adopting a deterministic traffic per pNIC design.
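The behavior of the two failover order teaming policies can be modeled in a few lines. This is a simplified sketch, not the virtual switch implementation: the traffic type names and the dictionary layout are hypothetical, and the policy orders follow the P1/P2 assignment described above.

```python
def active_uplink(failover_order: list, link_up: dict):
    """Return the first uplink in the policy order whose link is up."""
    for uplink in failover_order:
        if link_up.get(uplink, False):
            return uplink
    return None

POLICIES = {
    "storage": ["P1", "P2"],
    "vmotion": ["P1", "P2"],
    "management": ["P2", "P1"],
    "overlay": ["P2", "P1"],  # NSX default teaming policy in this design
}

def traffic_map(link_up: dict) -> dict:
    """Show which pNIC carries each traffic type for a given link state."""
    return {t: active_uplink(order, link_up) for t, order in POLICIES.items()}

print(traffic_map({"P1": True, "P2": True}))
print(traffic_map({"P1": False, "P2": True}))
```

With both links up, each pNIC carries a known subset of traffic; when P1 fails, storage and vMotion move to P2, and the traffic placement remains predictable.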
FIGURE 7-24 shows a two-pNIC design as in the previous example, with load balance based on
source port teaming policy. Notice that NSX instantiates one TEP on every uplink specified in
the teaming policy configuration to be able to send overlay traffic on all the corresponding
physical ports. With this kind of policy, potentially both uplinks are utilized based on the hash
value generated from the source port originating the traffic or the source MAC address of the
traffic. In case of a pNIC failure, the virtual switch teaming policy moves the corresponding TEP
interface to a still operational uplink. That means that in a failure scenario multiple TEPs are still
operational but send and receive traffic over the same physical link.
Load balancing based on source port is generally sufficient and recommended over a MAC-
based hash because each VM usually generates traffic from a single MAC address. Both
infrastructure and guest VM traffic benefit from this policy, allowing the use of all available
uplinks on the host. Infrastructure traffic is generally handled by different VMK interfaces
dedicated to each traffic type, making it possible for the hashing algorithm to place different
kinds of traffic on different links. Each traffic type will flow over a single link, though. vSphere
features such as multi-nic vMotion can address the requirement for load balancing the same
kind of traffic across different physical NICs.
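The pinning behavior of a source-port based policy can be sketched as follows. The modulo operation is only a stand-in for the actual hash used by the virtual switch, which is not described here, and the port IDs and uplink names are hypothetical.

```python
def pin_ports(port_ids: list, uplinks: list, link_up: dict) -> dict:
    """Pin each virtual port to one uplink; re-pin only ports on failed links."""
    alive = [u for u in uplinks if link_up[u]]
    if not alive:
        raise RuntimeError("no operational uplink")
    pinning = {}
    for port in port_ids:
        preferred = uplinks[port % len(uplinks)]  # stand-in for the real hash
        if link_up[preferred]:
            pinning[port] = preferred
        else:
            pinning[port] = alive[port % len(alive)]
    return pinning

ports = [10, 11, 12, 13]
print(pin_ports(ports, ["Uplink1", "Uplink2"], {"Uplink1": True, "Uplink2": True}))
print(pin_ports(ports, ["Uplink1", "Uplink2"], {"Uplink1": False, "Uplink2": True}))
```

Each port, and therefore each VMK interface or vNIC, uses exactly one uplink at a time, which is why a single traffic type flows over a single link while different traffic types can spread across both.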
Additionally, one can utilize a mix of the different teaming policy types together such that
infrastructure traffic (VSAN, vMotion, Management) leverages “failover order”, enabling
deterministic bandwidth and failover, while VM traffic uses a “load balance source” teaming
policy, spreading VM traffic across both pNICs.
On a host running VDS with NSX, overlay traffic will follow the default teaming policy defined in
NSX. Infrastructure traffic will follow the individual teaming policies configured on their DVPGs
in vCenter.
Based on the requirement of underlying applications and preference, one can select a type of
traffic management as desired for the infrastructure traffic. However, for the overlay traffic, it
is highly recommended to use one of the “load balanced source” teaming policies as they’re the
only ones allowing overlay traffic on multiple active uplinks and thus provide better throughput
in/out of the host for VM traffic.
• The 4 pNIC design provides a non-disruptive and straightforward installation
of NSX on existing ESXi hypervisors. We can deploy an N-VDS (or a new VDS prepared for
NSX) with two dedicated uplinks for VM traffic while leaving an existing VSS/VDS running
on two separate uplinks to handle the infrastructure traffic.
• On top of the benefits from this simple NSX installation, the model also provides a strict
separation between VM and infrastructure traffic. There are scenarios where this
separation is mandated by policy. The fact that NSX is deployed on its dedicated virtual
switch ensures that no misconfiguration could ever lead to VM traffic being sent on the
uplinks dedicated to infrastructure traffic.
The following FIGURE 7-25 presents an example of a host with 4 uplinks and two virtual
switches:
• A VDS is dedicated to infrastructure traffic.
• A second NSX prepared VDS handles the VM traffic.
Figure 7-25: ESXi Compute Rack 4 pNICs – VDS and NSX virtual switch
The VDS is configured with pNICs “P1” and “P2”, and each port group is configured with
different pNICs in active/standby to use both pNICs. However, the choice of teaming mode on
VDS is left to the user based on the considerations we outlined in the two pNIC design section.
The virtual switch running NSX owns pNICs “P3” and “P4”. To leverage both pNICs, the VDS is
configured in load balance source teaming mode. Each type of host traffic has dedicated IP
subnets and VLANs.
Because we can deploy NSX directly on a VDS, we can easily achieve the same functionality with
a single VDS running NSX and owning the four uplinks, as represented below.
Figure 7-26: ESXi Compute Rack 4 pNICs – VDS and NSX virtual switch
You can achieve this configuration by simply mapping the four pNICs of the host to 4 uplinks on
the VDS. Then, install NSX using an uplink profile that maps its uplinks (Uplink1/Uplink2 in the
diagram) to the VDS uplinks P3 and P4. This model still benefits from the simple, non-disruptive
installation of NSX. At the same time, the NSX component can only work with the two uplinks
mapped to the uplink profile. This means that VM traffic can never flow over P1/P2.
The final added benefit of this model is that the administrator can manage a single virtual
switch and has the flexibility of adding additional uplinks for overlay or infrastructure traffic.
• There is a specific use case for NFV (Network Function Virtualization) where two pNICs
are dedicated to a standard virtual switch for overlay and the other two pNICs for an
“enhanced datapath” virtual switch. The “enhanced mode” is not discussed here. Please
refer to VMware NFV documentation.
In the following scenario, FIGURE 7-27 shows that each virtual switch is built to serve specific
topology or provide traffic separation based on enterprise requirements. The segments cannot
extend between two virtual switches as each is tied to a different transport zone.
Figure 7-27: ESXi Compute Rack 4 pNICs with two VDSs prepared for NSX
Below, we list a series of use cases and configurations where the dual VDS design is relevant. In
any of those cases, the infrastructure traffic will be carried on the first virtual switch. Here are
some examples:
• The first two pNICs are exclusively used for infrastructure traffic, and the remaining two
pNICs are for overlay VM traffic. This allows dedicated bandwidth for overlay application
traffic. One can select the appropriate teaming mode as discussed in the above two
pNICs design section (ESXI-BASED COMPUTE HYPERVISOR WITH TWO PNICS).
• The first two pNICs are dedicated to “VLAN only” micro-segmentation, and the remaining
two are for overlay traffic.
• Building multiple overlays for separation of traffic. Starting with NSX 3.1, a host can have
virtual switches that are part of different overlay transport zones, and the TEPs on each virtual
switch can be on different VLAN/IP subnets (still, all the TEPs for an individual switch
must be part of the same subnet). When planning such configuration, it is important to
remember that while hosts can be part of different overlay transport zones, edge nodes
cannot. The implication is that when implementing multiple overlay transport zones,
multiple edge clusters are required, at a minimum one per overlay transport zone.
• Building regulatory compliant domains with VLAN only or overlay.
• Building traditional DMZ type isolation.
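The constraints of the dual virtual switch design can be expressed as a small validation routine. This is a sketch under stated assumptions: the dictionary layout, switch names, transport zone names, and addresses are all hypothetical and do not correspond to NSX API objects.

```python
import ipaddress

def validate_host(switches: list) -> list:
    """Check transport zone uniqueness and per-switch TEP subnet consistency."""
    problems = []
    zones = [s["overlay_tz"] for s in switches]
    if len(set(zones)) != len(zones):
        problems.append("virtual switches must attach to different transport zones")
    for s in switches:
        net = ipaddress.ip_network(s["tep_subnet"])
        if not all(ipaddress.ip_address(ip) in net for ip in s["teps"]):
            problems.append(f"{s['name']}: TEPs span more than one subnet")
    return problems

def min_edge_clusters(switches: list) -> int:
    """Edge nodes join a single overlay transport zone, so one cluster per zone."""
    return len({s["overlay_tz"] for s in switches})

host = [
    {"name": "vds-1", "overlay_tz": "tz-prod", "tep_subnet": "172.16.10.0/24",
     "teps": ["172.16.10.11", "172.16.10.12"]},
    {"name": "vds-2", "overlay_tz": "tz-dmz", "tep_subnet": "172.16.20.0/24",
     "teps": ["172.16.20.11", "172.16.20.12"]},
]
print(validate_host(host))
print(min_edge_clusters(host))
```

For this example host, the two switches use different transport zones and different TEP subnets, and the design needs at least two edge clusters.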
The two virtual switches running NSX must attach to different transport zones. See details in
section SEGMENTS AND TRANSPORT ZONES.
=====================================================
Note: The two virtual switches can use different datapath modes. See chapter 9 for a
description of the different modes available.
=====================================================
• Normalization of N-VDS configuration – All the edge node form factor deployments
can use a single virtual switch like a host hypervisor. Single teaming policy for overlay –
Load Balanced Source. Multiple policies for N-S peering – Named Teaming Policies.
If the deployment is running vSphere 6.5, where MAC learning is not available, the only other way to
run bridging is by enabling promiscuous mode. Typically, promiscuous mode should not be
enabled system wide. Thus, either enable promiscuous mode just for the DVPG associated with the
bridge vNIC, or consider dedicating an Edge VM to the bridged traffic so that
other kinds of traffic to/from the Edge do not suffer from the performance impact related to
promiscuous mode.
In the above scenario, the failure of the uplink of Edge 1 to physical switch S1 would trigger an
Edge Bridge convergence where the Bridge on Edge 2 would become active. However, the
failure of the path between physical switches S1 and S3 (as represented in the diagram) would
have no impact on the Edge Bridge HA and would have to be recovered in the VLAN L2 domain
itself. Here, we need to make sure that the alternate path S1-S2-S3 would become active
thanks to some L2 control protocol in the bridged physical infrastructure.
Edges for Segment/VLAN load balancing. Also note that, up to NSX release 2.5, a failover cannot
be triggered by user intervention. As a result, with this design, one cannot assume that both
Edges will be leveraged for bridged traffic, even when they are both available and several
Bridge Profiles are used for Segment/VLAN load balancing. This is perfectly acceptable if
availability is more important than available aggregate bandwidth.
Figure 7-30: Load-balancing bridged traffic for two Logical Switches over two Edges (Edge Cluster omitted for clarity.)
Further scale out can be achieved with more Edge nodes. The following diagram shows an
example of three Edge Nodes active at the same time for three different Logical Switches.
Figure 7-32: Load-balancing example across three Edge nodes (Bridge Profiles not shown for clarity.)
Note that while several Bridge Profiles can be configured to involve several Edge nodes in the
bridging activity, a given Bridge Profile cannot specify more than two Edge nodes.
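One way to spread bridging across more than two Edge nodes while respecting the two-node limit per profile is to rotate the node pairs. The sketch below assumes each profile is a pair of nodes, labeled here as primary and backup for illustration; the function and node names are hypothetical.

```python
def build_bridge_profiles(edges: list, count: int) -> list:
    """Return `count` (primary, backup) pairs, rotating the primary edge."""
    if len(edges) < 2:
        raise ValueError("a bridge profile needs a primary and a backup edge")
    profiles = []
    for i in range(count):
        primary = edges[i % len(edges)]
        backup = edges[(i + 1) % len(edges)]
        profiles.append((primary, backup))
    return profiles

for profile in build_bridge_profiles(["edge-1", "edge-2", "edge-3"], 3):
    print(profile)
```

With three profiles over three Edge nodes, each node is the first choice for one profile and the second choice for another, so all three can forward bridged traffic at the same time.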
• Overlay traffic is load-balanced across two physical links leveraging multi-tep edge
capability. Multi-tep provides load balancing and better edge node resource utilization
compared to a single TEP.
• Because for each bridging instance, VLAN traffic can be forwarded over a single uplink at
a time, load balancing is achieved by pinning different VLANs to different pNICs.
• VLAN and overlay traffic are forwarded over different pNICs. While sharing the pNICs
between VLAN and overlay traffic may lead to better performance and high availability
(e.g., configuring 4 TEPs and connecting the edge to 4 ToRs), we chose a simpler design
with a deterministic traffic pattern.
The design presented in FIGURE 7-32 below includes a single NVDS per edge node. The NVDS
manages both the overlay and VLAN traffic. The diagram presents two bridging instances, for
VLAN and overlay X, and VLAN and overlay Y, but more can be configured. The VLAN traffic is
pinned to pNIC 2 or pNIC 3. Additional VLANs could be pinned to either interface. In this case,
the pinning is achieved via a bridging teaming policy, implying that any bridging instance must
be associated with a named teaming policy. Overlay traffic for any bridging instance is load-
balanced across pNIC0 and pNIC1 via the default teaming policy.
We should pay special attention to the cabling and VLAN to pNIC association, which is not the
same for the two bare metal edge nodes. The reason for this asymmetry is that while overlay
traffic can fail over between the two pNICs on the same edge node, the same is not true for
VLAN traffic, which is pinned to a single uplink because of the limitation discussed in chapter 3.
If that single pNIC or the ToR where it is connected fails, the bridge instance should fail over to
the second edge, which must have a viable path for the associated VLANs. We want to maintain
a standard cabling configuration, where two pNICs are connected to one ToR and two to the
other consistently while at the same time allowing for the recovery of the traffic after the
failure of a ToR. For this reason, we crossed the mapping between pNIC2 and pNIC3 and uplink2
and uplink3 on the two bare metal edges.
FIGURE 7-31 shows a sample configuration. If this configuration is not desirable, connecting
pNIC2 to ToR2 and pNIC3 to ToR1 for bare metal edge-2 will have the same effect.
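The effect of the crossed mapping can be checked with a small model. The sketch below is illustrative only: it assumes pNIC2 is cabled to ToR1 and pNIC3 to ToR2 on both edges, and that VLAN X is pinned to uplink2 and VLAN Y to uplink3. The dictionary and function names are hypothetical and not part of any NSX API.

```python
# Hypothetical model of the crossed uplink-to-pNIC mapping.
# Assumption: pNIC2 is cabled to ToR1 and pNIC3 to ToR2 on both edges.
CABLING = {"pNIC2": "ToR1", "pNIC3": "ToR2"}

# Uplink-to-pNIC mapping per bare metal edge; edge-2 crosses uplink2 and uplink3.
UPLINK_TO_PNIC = {
    "edge-1": {"uplink2": "pNIC2", "uplink3": "pNIC3"},
    "edge-2": {"uplink2": "pNIC3", "uplink3": "pNIC2"},
}

# Assumption: each bridged VLAN is pinned to one uplink by a named teaming policy.
VLAN_TO_UPLINK = {"VLAN-X": "uplink2", "VLAN-Y": "uplink3"}


def tor_for(edge: str, vlan: str) -> str:
    """Return the ToR that carries a bridged VLAN on a given edge."""
    pnic = UPLINK_TO_PNIC[edge][VLAN_TO_UPLINK[vlan]]
    return CABLING[pnic]


def survives_tor_failure(vlan: str, failed_tor: str) -> bool:
    """True if at least one edge still has a path for the VLAN."""
    return any(tor_for(edge, vlan) != failed_tor for edge in UPLINK_TO_PNIC)


for vlan in VLAN_TO_UPLINK:
    paths = {edge: tor_for(edge, vlan) for edge in UPLINK_TO_PNIC}
    print(vlan, paths)
```

With this mapping, each VLAN reaches a different ToR on each edge, so the loss of either ToR leaves one edge with a viable path for every bridged VLAN.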
Figure 7-31: Bridge Design – 4 pNIC BM edge – Single NVDS – uplink to pNIC mapping
We can achieve the same design without named teaming policies by deploying multiple NVDSs
on the same bare metal edge. The functionalities provided are the same, and the decision
between the two options may be based on the preferred configuration workflows for new
bridging instances. This configuration option is presented in the diagram below.
So far, we have shown how to achieve optimal high availability, CPU, and link utilization across
a single bare metal edge. The previous diagrams show Bare Metal Edge 2 running standby
bridging instances only. It is possible to spread the active bridging instances across multiple
bare metal edges via bridge profiles. Because the traffic for each bridging instance is always
handled by a single edge node at a time, and VLAN traffic is pinned to a single pNIC, at least
four bridging instances are required to utilize all available links. This configuration is depicted in
FIGURE 7-34.
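The figure of four follows from simple counting: an active bridging instance keeps exactly one VLAN pNIC on one edge busy. A minimal sketch of that arithmetic, assuming two bare metal edges with two VLAN-facing pNICs each (the function name is hypothetical):

```python
def min_bridging_instances(edges: int, vlan_pnics_per_edge: int) -> int:
    """Each active bridging instance uses one VLAN pNIC on one edge,
    so one instance is needed per (edge, VLAN pNIC) combination."""
    return edges * vlan_pnics_per_edge


# Two bare metal edges, each with two pNICs dedicated to VLAN traffic.
print(min_bridging_instances(edges=2, vlan_pnics_per_edge=2))  # prints 4
```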
source MAC addresses (the MACs of all the VMs on the overlay) that can be used by the
upstream VDS for the load-balancing hash.
MAC learning and forged transmit must be enabled on the bridge dvpg carrying VLAN
traffic. It is not required to configure MAC learning on the Overlay1 and Overlay2 dvpgs as
the failures of vNIC1 or vNIC2 are not realistic occurrences and should not be part of a test
plan.
This configuration can be achieved with a single N-VDS and a named teaming policy as
depicted in FIGURE 7-36.
It can also be achieved without named teaming policies and with two NVDSs per edge VM. The
resulting designs are equivalent; they differ only in the configuration workflow. The 2 NVDS
design may be preferable to avoid the need to associate a teaming policy with each bridging
instance. Doing so may lower the probability of a misconfiguration.
• A deployment incorporating two hosts with 4 edge VMs (2 per host) requires two bridge
profiles for an active/standby configuration (1 host active, 1 host standby), and 4 bridge
profiles for an active/active configuration (both hosts active).
Note: neither hosts nor edge VMs are active or standby; it is the bridging instances running on
top of the edge VMs that are. Designating hosts as active or standby is a design abstraction
that helps when building a logical model that hides implementation complexities.
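The bridge profile counts in the bullet above can be derived by pairing each edge VM on one host with an edge VM on the other host. The following sketch assumes that pairing; the `primary` and `backup` keys and the edge VM names are illustrative placeholders, not an NSX API schema.

```python
def build_bridge_profiles(host1_edges, host2_edges, active_active=False):
    """Pair edge VMs across hosts; a profile never holds two VMs of the same host."""
    pairs = list(zip(host1_edges, host2_edges))
    # Active/standby: all primaries sit on host 1, all backups on host 2.
    profiles = [{"primary": a, "backup": b} for a, b in pairs]
    if active_active:
        # Active/active: add the mirrored profiles so host 2 also runs primaries.
        profiles += [{"primary": b, "backup": a} for a, b in pairs]
    return profiles


host1 = ["host1-edge-vm1", "host1-edge-vm2"]
host2 = ["host2-edge-vm1", "host2-edge-vm2"]
print(len(build_bridge_profiles(host1, host2)))                      # prints 2
print(len(build_bridge_profiles(host1, host2, active_active=True)))  # prints 4
```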
FIGURE 7-38 presents a 4 pNICs ESXi server hosting two edge VMs dedicated to bridging. The
two edge VMs are associated with different bridge profiles so that they will never run an
active/standby pair for any VLAN-Overlay pair. The failure of the host should not cause both the
active and standby instance to fail at the same time. Edge VMs running on a different host
provide high availability for the bridging instances depicted in the diagram. vSphere DRS rules
should be implemented to achieve the desired design and placement.
[Diagram labels: Router1, Router2, eBGP, External1-VLAN 100, External2-VLAN 200, EN1, EN2, Tier0 SR/DR, Tier1 DR, Overlay Traffic]
Figure 7-39: Typical Enterprise Bare metal Edge Node Logical View with Overlay/External Traffic
is load-balanced across uplinks using a named teaming policy, which pins a VLAN segment to a
specific uplink.
We can also use the same VLAN segment to connect Tier-0 Gateway to TOR-Left and TOR-Right.
However, it is not recommended because of inter-rack VLAN dependencies leading to spanning
tree-related convergence and the inability to load-balance the traffic across different pNICs.
This topology provides redundancy for management, overlay, and external traffic, in the event
of a pNIC failure on Edge node/TOR and a complete TOR Failure.
The right side of the diagram shows a two-pNIC bare metal edge configured with the same
N-VDS "Overlay and External N-VDS" for carrying overlay and external traffic as in the example
above, but also leveraging in-band management.
[Diagram panels: Bare Metal Edge with 4 Physical NICs (2 x 1G NIC for Mgmt + 2 x 10G NIC, Uplink1/Uplink2); Bare Metal Edge with 2 Physical NICs (2 x 10G NIC, Uplink1/Uplink2)]
Figure 7-40: Bare metal Edge configured for Multi-TEP - Single N-VDS for overlay and external traffic
Both topologies use the same uplink profile, as shown in FIGURE 7-41. This configuration shows
a default teaming policy that uses both Uplink1 and Uplink2. This default policy is used for all
the overlay segments/logical switches created on this N-VDS.
Two additional teaming policies, "Vlan300-Policy" and "Vlan400-Policy," have been defined to
override the default teaming policy and send traffic to "Uplink1" and "Uplink2" only,
respectively.
"External VLAN segment 300" is configured to use the named teaming policy "Vlan300-Policy"
that sends traffic from this VLAN only on "Uplink1". "External VLAN segment 400" is configured
to use a named teaming policy "Vlan400-Policy" that sends traffic from this VLAN only on
"Uplink2".
Based on these teaming policies, TOR-Left will receive traffic for VLAN 100 (Mgmt.), VLAN 200
(overlay) and VLAN 300 (Traffic from VLAN segment 300). Similarly, TOR-Right will receive
traffic for VLAN 100 (Mgmt.), VLAN 200 (overlay) and VLAN 400 (Traffic from VLAN segment
400). A sample configuration screenshot is shown below.
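The resulting VLAN distribution can also be derived mechanically from the teaming policies. The sketch below is a plain Python model, not an NSX API payload. It assumes Uplink1 connects to TOR-Left and Uplink2 to TOR-Right, and it models management traffic as reaching both TORs.

```python
# Assumption: Uplink1 is cabled to TOR-Left and Uplink2 to TOR-Right.
UPLINK_TO_TOR = {"Uplink1": "TOR-Left", "Uplink2": "TOR-Right"}

# Default and named teaming policies from the uplink profile described above.
TEAMING_POLICIES = {
    "default": ["Uplink1", "Uplink2"],
    "Vlan300-Policy": ["Uplink1"],
    "Vlan400-Policy": ["Uplink2"],
}

# Traffic types, their VLAN, and the teaming policy applied to them.
# Assumption: management (VLAN 100) is modeled as using both uplinks.
SEGMENTS = {
    "Mgmt": {"vlan": 100, "teaming": "default"},
    "Overlay (TEP)": {"vlan": 200, "teaming": "default"},
    "External VLAN segment 300": {"vlan": 300, "teaming": "Vlan300-Policy"},
    "External VLAN segment 400": {"vlan": 400, "teaming": "Vlan400-Policy"},
}


def vlans_per_tor():
    """Return the sorted list of VLANs each TOR receives."""
    result = {tor: set() for tor in UPLINK_TO_TOR.values()}
    for segment in SEGMENTS.values():
        for uplink in TEAMING_POLICIES[segment["teaming"]]:
            result[UPLINK_TO_TOR[uplink]].add(segment["vlan"])
    return {tor: sorted(vlans) for tor, vlans in result.items()}


print(vlans_per_tor())
# {'TOR-Left': [100, 200, 300], 'TOR-Right': [100, 200, 400]}
```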
FIGURE 7-42 shows a logical and physical topology where a Tier-0 gateway has four external
interfaces. External interfaces 1 and 2 are provided by bare metal Edge node “EN1”, whereas
External interfaces 3 and 4 are provided by bare metal Edge node “EN2”. Both the Edge nodes
are in the same rack and connect to the TOR switches in that rack. Both the Edge nodes are
configured for Multi-TEP and use named teaming policy to send traffic from VLAN 300 to TOR-
Left and traffic from VLAN 400 to TOR-Right. Tier-0 Gateway establishes BGP peering on all four
external interfaces and provides 4-way ECMP.
[Diagram labels. Logical Topology: BGP peering on External-1 (192.168.240.2/24), External-2 (192.168.250.2/24), External-3 (192.168.240.3/24), and External-4 (192.168.250.3/24). Physical Topology: EN1 and EN2, each with P1/P2 mapped to Uplink1/Uplink2 on the "Overlay and External N-VDS", with TEP-IP1, TEP-IP2, and Mgmt-IP.]
For most environments, the performance provided by a bare metal edge is comparable to that of
an edge node VM when the physical server has only 2 pNICs, if those interfaces are 10 Gbps
or 25 Gbps. The pNIC bandwidth usually represents the bottleneck, and the VM form factor is
generally recommended for its easier lifecycle management. Situations in which a bare metal
edge node should be considered are:
• Requirement for line rate services
• Higher bandwidth pNIC (25 or 40 Gbps)
• Traffic profile is characterized by small packet size (e.g., 250 Bytes)
• Sub-second link failure detection between physical network and the edge node
• Network operation team retains responsibility for the NSX edges and has a preference
for an appliance-based model
• In a multiple Tier-0 deployment model where the top Tier-0 is deployed on bare metal
edges and drives the throughput with higher speed (40 Gbps) pNICs.
• Management interface redundancy is not always required but is a good practice. The in-band
option is the most practical deployment model when a limited number of interfaces is
available.
Figure 7-43: Bare metal Edge with six pNICs - Same N-VDS for Overlay and External traffic
The bare metal configuration with more than two pNICs is the most practical and
recommended design. This is because configurations with four or more pNICs offer substantially
more bandwidth than the equivalent Edge VM configurations. The same reasons for choosing
bare metal apply as in the two pNIC configurations discussed above.
Named teaming policy is also configured to force external traffic on specific edge VM vNICs.
FIGURE 7-44 also shows named teaming policy configuration used for this topology. "External
VLAN segment 300" is configured to use a named teaming policy “Vlan300-Policy” that sends
traffic from this VLAN on “Uplink1” (vNIC2 of Edge VM). "External VLAN segment 400" is
configured to use a named teaming policy “Vlan400-Policy” that sends traffic from this VLAN on
“Uplink2” (vNIC3 of Edge VM). Based on this named teaming policy, external traffic from
“External VLAN Segment 300” will always be sent and received on vNIC2 of the Edge VM.
North-South or external traffic from “External VLAN Segment 400” will always be sent and
received on vNIC3 of the Edge VM.
Overlay or external traffic from Edge VM is received by the VDS DVPGs “Trunk1 PG” and
“Trunk2 PG”. The teaming policy used on the VDS port groups defines how this overlay and
external traffic coming from Edge node VM exits the hypervisor. For instance, “Trunk1 PG” is
configured to use active uplink as “VDS-Uplink1” and standby uplink as “VDS-Uplink2”. “Trunk2
PG” is configured to use active uplink as “VDS-Uplink2” and standby uplink as “VDS-Uplink1”.
This configuration ensures that the traffic sent on “External VLAN Segment 300” (i.e., VLAN 300)
always uses vNIC2 of Edge VM to exit the Edge VM. This traffic then uses “VDS-Uplink1” (based
on “Trunk1 PG” configuration) and is sent to the left TOR switch. Similarly, traffic sent on VLAN
400 uses “VDS-Uplink2” and is sent to the TOR switch on the right.
In case of a failure of an ESXi host pNIC, the active/standby teaming policy on the trunk dvpg
will redirect the traffic to the surviving pNIC (set the teaming policy to “Explicit Failover” with
one VDS uplink active and one standby). Overlay traffic (VLAN 200 in the diagram) will flow
over the operational ESXi pNIC. This failover event is entirely transparent to the edge node VM
N-VDS, which keeps load balancing overlay traffic over the two TEPs and vNICs. Traffic
originating from or destined to both TEPs flows over a single ESXi pNIC. External VLAN traffic
will not fail over to the surviving ESXi pNIC because, while the overlay VLAN is defined on both
switches, each of the two external VLANs is defined on only one of TOR-Left or TOR-Right. A
routing protocol reconvergence will redirect the traffic to the surviving peering VLAN and
corresponding pNIC. The TOR switch configuration must not define the two peering VLANs on
both switches; otherwise, it would be possible to have the failed neighborship restored over the
inter-switch link, which is not desirable because of the spanning tree dependency and the lack of
a one-to-one mapping between the physical links and the routing paths.
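The failure behavior described above can be traced with a small simulation. This is a simplified sketch under stated assumptions: VDS-Uplink1 connects to TOR-Left and VDS-Uplink2 to TOR-Right, VLAN 200 is defined on both TORs, VLAN 300 only on TOR-Left, and VLAN 400 only on TOR-Right. The data structures are hypothetical and only mirror the port group names used in the text.

```python
# Explicit failover order per trunk port group: [active, standby].
PORT_GROUPS = {
    "Trunk1 PG": ["VDS-Uplink1", "VDS-Uplink2"],
    "Trunk2 PG": ["VDS-Uplink2", "VDS-Uplink1"],
}
# Assumption: physical cabling of the VDS uplinks.
UPLINK_TO_TOR = {"VDS-Uplink1": "TOR-Left", "VDS-Uplink2": "TOR-Right"}
# VLANs defined on each TOR: overlay on both, peering VLANs on one TOR each.
TOR_VLANS = {"TOR-Left": {200, 300}, "TOR-Right": {200, 400}}
# (port group, VLAN) combinations sent by the edge VM.
FLOWS = [("Trunk1 PG", 200), ("Trunk1 PG", 300),
         ("Trunk2 PG", 200), ("Trunk2 PG", 400)]


def forwarding_state(failed_uplinks):
    """Map each flow to the TOR that forwards it, or None if it is dropped."""
    state = {}
    for pg, vlan in FLOWS:
        # First uplink in the failover order that is still up.
        uplink = next((u for u in PORT_GROUPS[pg] if u not in failed_uplinks), None)
        tor = UPLINK_TO_TOR.get(uplink)
        state[(pg, vlan)] = tor if tor and vlan in TOR_VLANS[tor] else None
    return state


for flow, tor in forwarding_state({"VDS-Uplink1"}).items():
    print(flow, "->", tor)
```

With VDS-Uplink1 failed, both overlay flows land on TOR-Right, while VLAN 300 has no forwarding path and relies on routing reconvergence toward the VLAN 400 peering.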
The BFD sessions from the individual Edge TEPs to other transport nodes are not a reliable
mechanism to detect and recover from partial failures (of a single TEP for example). This is the
reason for requiring an active/standby teaming policy to protect TEP traffic in case of a host
pNIC failure. The exception is the “All tunnels down” condition when all the tunnels from all
edge TEPs to all the transport nodes are down. See the NSX EDGE HIGH AVAILABILITY FAILOVER
TRIGGERS section for more details.
This design does not consider the failure of the edge vNICs because the vNIC is a virtual
component and its failure is not a realistic event. Disabling the edge vNICs will cause a black
hole of the traffic. It is not recommended to enable MAC learning and/or forged transmit on
the upstream VDS dvpg to support this failure scenario.
Starting with NSX release 2.5, single N-VDS deployment mode is recommended for both bare
metal and Edge VM. Key benefits of single N-VDS deployment are:
• Consistent deployment model for both Edge VM and bare metal Edge with one N-VDS
carrying both overlay and external traffic.
• Load balancing of overlay traffic with Multi-TEP configuration.
• Ability to distribute external traffic to specific TORs for distinct point to point routing
adjacencies.
• No change in DVPG configuration when new service interfaces (workload VLAN
segments) are added.
• Deterministic North South traffic pattern.
Service interface on a Tier-1 Gateway can also be connected to overlay segments for
standalone load balancer use cases. This is explained in Load balancer CHAPTER 6. Connecting a
service interface to an overlay segment to act as the default gateway for the VMs on that
segment is supported but not recommended. The service interface is residing on the edge node
SR, meaning that any East-West traffic will traverse the edge node and distributed routing
capabilities are not available. Overlay segments should always be connected to downlink
interfaces (default behavior when creating a segment and connecting it to a Gateway).
In some corner case scenarios, service interfaces belonging to different Gateways (Tier-0 or
Tier-1) must be connected to the same segment. This configuration is supported, with the
caveat that if the segment is a VLAN segment, the edge nodes hosting the different gateways
must belong to different edge clusters. This limitation does not apply to overlay segments.
Figure 7-46: Single N-VDS per Edge VM - Two Edge Node VM on Host
Figure 7-47: Two Edge VMs per host with 2 pNICs – Routing and bridging
From a configuration perspective, the 2 pNIC design is replicated “side-by-side” with dedicated
pNICs and dvpgs. We create a separate management dvpg carrying traffic for the same VLAN
and IP space as the first edge VM, but with a different teaming policy that forces the
management traffic for the second edge VM over P3 and P4. We perform this configuration to
enforce fate sharing across management, overlay, and VLAN peering networks for each edge
VM. Refer to the edge HA section in chapter 4 for the rationale behind this recommendation.
7.5.2.5 Edge node VM connectivity to N-VDS (or VDS prepared for NSX)
In all the previous scenarios, we connected the edge node VMs to a vSphere VDS not prepared
for NSX, which is the most common scenario when a dedicated vSphere edge cluster is
available. When edge node VMs share the ESXi host with regular workload VMs requiring NSX
services (Overlay and/or DFW), it may be desirable to connect the edge VMs and the workload
VMs to the same virtual switch (NVDS or VDS prepared for NSX).
This type of design is the only available option when only two pNICs are available on the host.
When four or more pNICs are available, it is possible to have multiple virtual switches on the
ESXi host and connect edge VMs and workload VMs to different ones.
Before NSX 3.1, it was required to place the ESXi TEPs and the Edge TEPs in different subnets
and VLANs when leveraging the same two pNICs for both. FIGURE 7-49 below presents such a
design. We can transport the edge overlay traffic over trunk NSX segments or regular vCenter
managed trunk dvpgs. If the ESXi host virtual switch consists of an NVDS, NSX VLAN segments
are the only choice.
Figure 7-49: Edge VM connected to NSX Prepared VDS – Host and Edge TEP on different IP/VLAN
Starting with NSX version 3.1, edge and host TEPs can reside on the same VLAN because the
host now can process Geneve traffic internal to the host itself. We must transport edge VM
overlay traffic over an NSX Segment in this case. If the edge TEPs are connected to a vCenter
managed dvpg, tunnels between the host and the edge will not come up. This design is
presented in FIGURE 7-50 below:
Figure 7-50: Edge VM connected to NSX Prepared VDS – Host and Edge TEP on the same IP/VLAN - NSX 3.1 or later
When connecting edge VMs to an NSX prepared VDS or NVDS, please keep in mind the
following recommendations:
• Host and Edges should be part of different VLAN Transport Zones. This ensures a clear
boundary between the transport segments on the host and those used for the routing
peering on the edges. The edge segment traffic is transported by the host segments,
configured as a trunk.
• When implementing a single TEP VLAN design like in FIGURE 7-50, the VDS trunk port
groups transporting the edge TEP traffic must be NSX managed segments and cannot be
created in vCenter.
• Follow the canonical recommendations regarding VLAN trunking and teaming policy
configuration for overlay and VLAN peering traffic described in section: 7.5.2.2.
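The TEP VLAN constraints above can be expressed as a simple pre-deployment check. The function below is a hypothetical helper written for illustration; it does not call any NSX API, and its parameter names are assumptions.

```python
def validate_edge_tep_design(nsx_version, host_tep_vlan, edge_tep_vlan,
                             edge_trunk_is_nsx_segment):
    """Return a list of issues for an edge VM hosted on an NSX prepared switch.

    nsx_version is a (major, minor) tuple, e.g. (3, 1).
    """
    issues = []
    shared_vlan = host_tep_vlan == edge_tep_vlan
    if shared_vlan and nsx_version < (3, 1):
        issues.append("Host and edge TEPs on the same VLAN require NSX 3.1 or later.")
    if shared_vlan and not edge_trunk_is_nsx_segment:
        issues.append("Shared TEP VLAN requires the edge trunk to be an NSX segment, "
                      "not a vCenter managed dvpg.")
    return issues


# Shared TEP VLAN on NSX 4.2 with a vCenter managed trunk dvpg: one issue reported.
print(validate_edge_tep_design((4, 2), 200, 200, edge_trunk_is_nsx_segment=False))
# Separate TEP VLANs are valid on any version: no issues reported.
print(validate_edge_tep_design((3, 0), 200, 210, edge_trunk_is_nsx_segment=False))
```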
The design with different VLAN/IP subnets per TEP is still valid and can be used with any NSX
version, including 3.1 or later. The single TEP VLAN design is attractive for its simplicity;
however, for most deployments, separate VLANs for Edge and Host TEPs are recommended
due to the following considerations:
• When Edges and hosts share the same TEP VLAN, they also share the span of that VLAN.
While it is usually desirable to limit the host TEP VLAN to a rack, edge VMs may require
mobility across racks or even sites (e.g., in the VCF stretched cluster design). Separate
VLANs allow the span of the host and edge TEP networks to be managed individually.
• An edge and the host where the edge is running will never lose TEP connectivity if they
share the same TEP network, regardless of a pNIC failure. This means that the edge node
VM will never incur an all tunnels down HA condition, limiting its ability to react to
specific failures. Please refer to the EDGE HA SECTION IN CHAPTER 4 for more information.
(Note: a design that matches FIGURE 7-50 should not incur any issue as management,
overlay, and VLAN peering networks share the same pNICs).
FIGURE 7-52 shows a detailed example of services only edge node VM connectivity. Again, you
can notice the absence of VLAN connectivity requirements for the services only edges deployed
on the ESXi host on the right.
Specific implementation considerations apply to the edge node VM form factor:
• VLAN tagging is not required on edge NVDS as the upstream VDS dvpg only carries
overlay traffic and could be configured for a specific VLAN rather than as a trunk.
• Tagging overlay traffic on the edge node VM NVDS and configuring the dvpgs as trunks is
recommended for consistency with the other use cases and possibly adopting them at a
later stage on the same edge.
Specific design considerations apply to the edge node VM form factor:
• In a layer 2 physical fabric (or layer 3 with overlays), edge node VMs dedicated to T1
Gateway services can have a larger mobility span than edges providing peering
functionalities and can be deployed on clusters striped across multiple racks.
• While the ESXi pNICs are usually the limiting performance factor for edge node VMs
running T0 Gateway for layer three peering, the host CPU may represent the bottleneck
for service-intensive edges. In such cases connecting multiple edge VMs to the same
pNIC may be appropriate (see the right ESXi in FIGURE 7-52)
• General SLA for the service itself. We could have one SLA for all the services, e.g., X for all
services to be available, or we could have different SLAs for different services (e.g., X
for layer 3 peering, Y for VPN, Z for TLS decryption), or even a specific one per tenant and
service (e.g., tenant 1 requires A for TLS decryption, tenant 1 requires B for Gateway IPS
functionality, tenant 2 requires C for TLS decryption). SLA requirements affect the edge
cluster design in terms of the number of edge nodes deployed, where they are deployed
(same rack, different racks, different rooms), and the availability requirements of the
underlying infrastructure (e.g., the vSphere cluster or the datastore).
• Max recovery time in case of a network failure. While the general SLA considers the
total downtime over an extended period, usually a year, requirements should outline the
expectation for fast recovery in case of a common hardware network failure such as the
failure of a pNIC, a host, or a top of rack switch. The reason for such requirements is that
while multiple short interruptions may not violate the SLA for the year, they may represent
a problem for applications with short timeouts or incapable of reconnecting
automatically. This type of requirement affects, among other things, the edge node
form factor, routing protocols, and BFD timer implementation. These requirements
should always be derived from specific application needs, and not provided as blanket
statements.
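The difference between a yearly SLA and a maximum recovery time can be made concrete with a short calculation. The figures below (a 99.99% yearly availability target and ten 3-second interruptions) are assumed values chosen only for illustration; they do not come from this guide.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_pct: float) -> float:
    """Yearly downtime allowed by an availability target expressed in percent."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Assumed SLA of 99.99% per year: about 52.56 minutes of allowed downtime.
budget = downtime_budget_minutes(99.99)
# Assumed failure pattern: ten interruptions of 3 seconds each, in minutes.
consumed = 10 * 3 / 60

print(f"budget: {budget:.2f} min, consumed: {consumed:.2f} min")
```

In this example the interruptions consume well under the yearly budget, yet each 3-second outage could still break an application with a shorter timeout. This is why recovery time requirements are stated separately from the SLA.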
Manageability:
• Lifecycle. Different edge services should be able to undergo maintenance at different
times. Requirements in this area may lead to deploying edges in different edge clusters,
different vSphere clusters, and even different NSX Manager domains.
• Scalability. How elastic should the service cluster be? This consideration may not apply
to all the services in the same way. For example, I may need to increase the throughput
for the TLS decryption service based on on-boarding new tenants, but the VPN service
will not be affected by this change. Requirements in this area may lead to dedicating edge
clusters to specific services, and to designing the underlying infrastructure with different
characteristics depending on the services (e.g., hosts with different pNIC, CPU resources,
memory).
• RBAC. Who is entitled to configure each service? Is it a self-service offering? This area
has impacts on the integration with directory services and user privileges.
• Object based RBAC. Do I need different users to be able to manage different objects of
the same type? For example, each tenant should be able to manage the Gateway
Firewall rules for their dedicated T0 or T1 Gateway. This area may lead to considerations
around separating the NSX deployment into multiple NSX Manager domains, the
integration with the most appropriate cloud management platform, or the adoption of
NSX Multi-tenancy via NSX Projects and/or VPCs.
• Monitoring. What are the metrics that I need to collect to ensure I am proactive in
addressing performance and scalability issues? Separating services on different edge
VMs may provide granular reporting around the resource consumption of specific
services.
Performance:
• North/South throughput. The amount of traffic that the NSX edges should support
between the physical and the virtual environments is commonly estimated in bits per
second (bps), e.g., 20 Gbps. Such requirements tend to be misleading as the throughput
is highly influenced by the packet size. So, the traffic profile should be taken into
consideration when stating the expected North/South throughput for the system. A
more general and accurate characterization of the capability of the provided design
would be stating North/South throughput in packets per second (PPS). Requirements in
this area affect the edge node form factor, the number of edges deployed in ECMP, the host
resources in terms of pNIC and CPU, and others.
• Latency and jitter. Some applications may have stringent latency and jitter requirements.
Those requirements should be evaluated and may lead to the provisioning of dedicated
resources such as T0 Gateways, pNICs or hosts to avoid resource contention between
mice and elephant flows.
• New connections per second (CPS). This parameter has an impact on the stateful services
design; it may require spreading services across multiple edge nodes.
• Throughput per service. This is different from the north/south throughput, which is
generally an aggregate of all the services traffic. Different services may be more or less
resource intensive and may require the allocation of dedicated resources. For example,
NAT and Gateway firewall are in most cases light on resources and can coexist on the
same edge dedicated to the layer 3 peering without a noticeable performance impact.
Other services (e.g., VPN and TLS decryption) may have a noticeable impact if deployed at
scale. Requirements in this area will impact the segmentation of services in different
edge clusters, and the choice of the edge form factor for some services.
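To illustrate why a bits-per-second figure alone can be misleading for North/South throughput, the sketch below converts the same 20 Gbps target into packets per second for two packet sizes. It is a simplified calculation that ignores Ethernet framing and Geneve encapsulation overhead; the packet sizes are assumed examples.

```python
def packets_per_second(throughput_bps: float, packet_size_bytes: int) -> float:
    """Packets per second needed to sustain a throughput at a given packet size."""
    return throughput_bps / (packet_size_bytes * 8)


target_bps = 20e9  # 20 Gbps
for size in (1500, 250):
    pps = packets_per_second(target_bps, size)
    print(f"{size} bytes: {pps:,.0f} pps")
# 1500 bytes: 1,666,667 pps
# 250 bytes: 10,000,000 pps
```

The same 20 Gbps requirement translates into six times more packets per second with 250-byte packets than with 1500-byte packets, which is why the traffic profile matters when sizing edge nodes.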
Recoverability:
• Recovery Time Objective (RTO). This is different than the requirements for high
availability in the sense that we want to specify how much time we have to restore the
services after a major outage. The outage in scope depends on the context of the design
and may refer for example to a rack failure, a datacenter room failure, or an entire site
failure. Requirements in this area may affect the striping of an edge cluster across
different availability zones, the protection mechanisms we want to rely on for the
recovery (e.g., edge native HA vs vSphere HA), and the recovery procedure in place (fully
automatic, vs. manual, vs. scripted)
• Recovery Point Objective (RPO). While RPO usually refers to workload data, in the
context of NSX services it may help us formalize how much configuration it is acceptable
to lose because of a failure. While this requirement may not be stringent in a manually
mixture of different sizes/performance levels within the same Edge cluster can have the
following effects:
○ With two Edge nodes hosting a Tier-0 configured in active/active mode, traffic will
be spread evenly. If one Edge node is of lower capacity or performance, half of the
traffic may see reduced performance while the other Edge node has excess
capacity.
○ For two Edge nodes hosting a Tier-0 or Tier-1 configured in active/standby mode,
only one Edge node is processing the entire traffic load. If this Edge node fails, the
second Edge node will become active but may not be able to meet production
requirements, leading to slowness or dropped connections.
• Services available on a T1 Gateway and not on a T0 Gateway. Some services are available
on T1 Gateway only and are not available on T0 Gateway. Designs requiring Gateway
IDS/IPS/Malware Prevention, Gateway Identity Firewall, URL Filtering, TLS Inspection and
NSX Load Balancer, need to include T1 Gateways.
• Services on T1 imply the hairpinning of the traffic between tenants connected to
different T1 Gateways.
• VPN remote peer redundancy. VPNs can be enabled on T0 or T1 Gateways with exactly
the same functionalities except for the ability to establish BGP peering over an IPsec
routed VPN. If remote peer redundancy is required, VPNs on T0 Gateways are the only
option.
• Because an Edge node can be part of only a single overlay transport zone, if the NSX
deployment has been segmented into multiple overlay transport zones, at a minimum, a
corresponding number of edge clusters should be deployed to provide physical to virtual
connectivity to each overlay transport zone.
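The effects of mixing Edge node sizes in the same cluster, listed earlier in this section, can be quantified with a rough model. The sketch assumes a perfectly even ECMP spread across active/active nodes and uses hypothetical capacity figures in Gbps; real traffic distribution depends on flow hashing.

```python
def active_active_usable_gbps(capacities):
    """With an even spread, every node receives the same share, so the
    smallest node caps the aggregate load served without degradation."""
    return len(capacities) * min(capacities)


def failover_shortfall_gbps(load, standby_capacity):
    """Traffic that a smaller standby node cannot absorb after a failover."""
    return max(0.0, load - standby_capacity)


# Hypothetical mixed cluster: one 10 Gbps capable node and one 5 Gbps capable node.
print(active_active_usable_gbps([10, 5]))  # 10, not the nominal 15
print(failover_shortfall_gbps(8, 5))       # 3 Gbps of load cannot be served
```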
The shared mode simplifies the automated allocation of services, as NSX tracks which Edge
node is provisioned with a service and deprioritizes that Edge node as a potential target for
the next service deployment. However, each Edge node is sharing CPU, and thus bandwidth is
shared among services. In addition, if the Edge node fails, all the services inside that Edge node
fail together. Shared edge mode, if configured with preemption for the services, leads only to
service-related secondary convergence. On the other hand, it provides an optimized footprint of
CPU capacity per host. If high dedicated bandwidth per service and granular services control
are not a priority, then use the shared mode of deployment with Edge services.
Dedicated Mode: In this mode, an Edge node is either running ECMP or stateful services but not
both. This mode is important for building a scalable and performance-based services edge
cluster. Separation of services on dedicated Edge nodes allows a distinct operational model for
ECMP vs stateful services. Scaling of either ECMP or stateful services can be achieved by
choosing bare metal or multiple Edge VMs.
FIGURE 7-54: DEDICATED SERVICE EDGE NODE Cluster describes dedicated modes per service,
ECMP or stateful services. One can further enhance the configuration by deploying a specific
service per Edge node; in other words, each of the services in EN3 and EN4 gets deployed as an
independent Edge node. It is the most flexible model, but not a cost-effective one, as each
Edge node reserves CPU. In this mode of deployment, one can choose preemptive or
non-preemptive mode for each service individually if it is deployed as a dedicated Edge VM per
service. In the above figure, if preemptive mode is configured, all the services in EN3 will
experience secondary convergence. However, if one segregates each service to a dedicated Edge
VM, one can control which services are preemptive or non-preemptive. Thus, it is a design
choice of availability versus load-balancing the edge resources. The dedicated edge node, either
per service or grouped for a set of services, allows deploying a specific Edge VM form factor;
thus, one can run the ECMP-based Edge VMs in the larger form factor (8 vCPU), dedicating CPU
to the high bandwidth needs of the NSX domain. Similar design choices can be adopted by
allowing a smaller Edge VM form factor if the services do not require line rate bandwidth. Thus,
if the multi-tenant services do not require high bandwidth, one can construct a very high density
of per-tenant Edge node services with just 2 vCPU per edge node (e.g., VPN services or an LB
deployed with DevTest/QA). The LB service with container deployment is one clear example
where adequate planning of host CPU and bandwidth is required. A dedicated edge VM or
cluster may be required, as each container service can deploy an LB, quickly exhausting the
underlying resources.
Another use case for running dedicated services nodes is a multi-tier Tier-0 or a Tier-0 level
multi-tenancy model, which is only possible by running multiple instances of dedicated Edge
nodes (Tier-0) for each tenant or service; thus, Edge VM deployment is the most economical
and flexible option. For a startup design, one should adopt the Edge VM form factor; later, as
bandwidth or service demands grow, one can selectively upgrade Edge node VMs to the bare
metal form factor. For an Edge VM host to be convertible to bare metal, it must be compatible
with the BARE METAL REQUIREMENT. If the design choice is to insulate the deployment from
most future capacity and bandwidth prediction considerations, going with bare metal by default
is the right choice (either for ECMP or stateful services). The decision to go with VM versus bare
metal also hinges on the operational model of the organization: if the network team owns the
lifecycle, wants to remain relatively agnostic to workload design, and adopts a cloud model by
providing generalized capacity, then bare metal is also the right choice.
For deployments that require stateful services, the most common mode of deployment is the
shared Edge node mode (see FIGURE 7-53: SHARED SERVICE EDGE NODE Cluster), in which both
ECMP Tier-0 services and stateful services at Tier-1 are enabled inside an Edge node, based on
per-workload requirements. FIGURE 7-56: SHARED SERVICES EDGE NODE CLUSTER GROWTH
Patterns below shows shared Edge nodes for services at Tier-1: the red Tier-1 is enabled with a
load balancer, while the black Tier-1 is enabled with NAT. In addition, one can enable multiple
active-standby services per Edge node. In other words, one can optimize services such that two
services run on separate hosts, complementing each other (e.g., in the two-host configuration
below, one can enable Tier-1 NAT active on host 2 and standby on host 1), while in the four-host
configuration dedicated services are enabled per host. Workloads that could have dedicated Tier-1
gateways are not shown in the figure, as they are in distributed mode and thus all get ECMP
service from Tier-0. For active-standby services in this in-rack
deployment mode, one must ensure that the active and standby service instances are deployed on two
different hosts. This is obvious with two edge nodes deployed on two hosts as shown below, as NSX
will deploy them on two different hosts automatically. The growth pattern is simply adding two
more hosts at a time. Note that there is only one Edge node instance per host here, with the assumption of
two 10 Gbps pNICs. Adding an additional Edge node on the same host may oversubscribe the
available bandwidth, as one must not forget that the Edge node not only runs ECMP Tier-0 but also
serves other Tier-1s that are distributed.
Adding additional Edge nodes per host to the above configuration is possible with
higher-bandwidth pNICs (25/40 Gbps) or, even better, a higher number of pNICs. In
the case of a four Edge node deployment on two hosts, it is required to ensure that active-standby
instances do not end up on Edge nodes on the same host. One can prevent this condition by
building a horizontal Failure Domain, as shown below in FIGURE 7-57: TWO EDGE NODES PER HOST
– SHARED SERVICES CLUSTER GROWTH Pattern. Failure domain in below figures make sure any
stateful services.
[Figure residue: Edge cluster in a single rack with two hosts (Host 1, Host 2) and two Edge nodes per host (EN1 to EN4, each running T0 and T1). Diagram labels: "Active/Standby Service Availability not guaranteed without Failure Domain"; "Use Fault Domains" (FD 1, FD 2).]
Figure 7-57: Two Edge Nodes per Host – Shared Services Cluster Growth Pattern
An edge cluster design with a dedicated Edge node per service is shown below in FIGURE 7-58:
DEDICATED SERVICES PER EDGE NODES GROWTH Pattern. In dedicated mode, the Tier-0 running only
ECMP services belongs to the first edge cluster, while the Tier-1 running active-standby services is on the
second edge cluster. Both of these configurations are shown below.
Notice that each cluster is striped vertically to make sure each service gets deployed on a separate
host. This is especially needed for active/standby services. For the ECMP services, vertical
striping is needed when the same host is also used for deploying stateful services, to avoid
over-deployment of Edge nodes on the same host. Otherwise, the arrangement shown in FIGURE
7-55: ECMP BASE EDGE NODE CLUSTER GROWTH Pattern is a sufficient configuration.
The multi-rack Edge node deployment is the best illustration of the Failure Domain capability. In a
deployment with two Edge nodes, each Edge node must clearly be on a separate hypervisor in a
separate rack.
The case described below is the dedicated Edge node per service. The figure below shows the
growth pattern evolving from two to four hosts in tandem across the racks. The four-host case
assumes two Edge VMs (one for ECMP and the other for services) per host, with two hosts in each of
two different racks. In that configuration, the ECMP Edge nodes are striped across the two racks in their
own Edge cluster; placement and availability are not an issue since each node is capable of
servicing traffic equally. The Edge nodes where services are enabled must use failure domains, vertically
striped as shown in the figure below. If failure domains are not used, the cluster configuration
will mandate a dedicated Edge cluster for each service, as there is no guarantee that the
active-standby services will be instantiated on Edge nodes residing in two different racks. This mandates a
minimum of two edge clusters, where each cluster consists of Edge node VMs from two racks,
providing rack availability.
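The placement constraint described above can be illustrated with a small model. This is a minimal sketch under simplifying assumptions, not NSX's placement logic or API: the `EdgeNode` type, the `pick_active_standby` helper, and the node, rack, and failure domain names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EdgeNode:
    """Hypothetical, simplified view of an Edge node in an Edge cluster."""
    name: str
    rack: str
    failure_domain: Optional[str] = None  # None models "no failure domain configured"


def pick_active_standby(nodes: List[EdgeNode]) -> Optional[Tuple[EdgeNode, EdgeNode]]:
    """Return the first pair of nodes that sit in different failure domains.

    Returns None when no such pair exists, i.e. when rack-level separation of
    the active and standby instances cannot be guaranteed by failure domains.
    """
    for first, second in combinations(nodes, 2):
        if (first.failure_domain is not None
                and second.failure_domain is not None
                and first.failure_domain != second.failure_domain):
            return first, second
    return None


# One failure domain per rack: the selected pair always spans both racks.
services_cluster = [
    EdgeNode("en1", rack="rack-1", failure_domain="fd-1"),
    EdgeNode("en2", rack="rack-1", failure_domain="fd-1"),
    EdgeNode("en3", rack="rack-2", failure_domain="fd-2"),
    EdgeNode("en4", rack="rack-2", failure_domain="fd-2"),
]
active, standby = pick_active_standby(services_cluster)
```

With one failure domain per rack, any pair returned by this model has its two members in different racks, which is the property the failure domain configuration is meant to provide.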
Finally, the standby edge reallocation capability (available only to Tier-1 gateways) makes it
possible to build multiple availability zones such that a standby edge VM can be
instantiated automatically a minimum of 10 minutes after failure detection. If the Edge node
that fails is running the active logical router, the original standby logical router becomes the
active logical router and a new standby logical router is created. If the Edge node that fails is
running the standby logical router, the new standby logical router replaces it.
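The two failure cases above can be summarized as a small state transition. The sketch below is only a model of the described behavior: the `relocate_after_failure` function and the `spare` parameter (a healthy Edge node assumed to be available for the new standby) are hypothetical, and the 10-minute timer is not modeled.

```python
from typing import Tuple


def relocate_after_failure(active: str, standby: str, failed: str, spare: str) -> Tuple[str, str]:
    """Return the (active, standby) Edge nodes of a Tier-1 after `failed` is lost.

    `spare` is a hypothetical healthy Edge node that can host the new standby.
    """
    if failed == active:
        # The original standby becomes active; a new standby is created.
        return standby, spare
    if failed == standby:
        # The active instance is unaffected; a new standby replaces the lost one.
        return active, spare
    # The failed node hosted neither instance of this Tier-1: nothing changes.
    return active, standby
```

For example, with the active instance on `en1` and the standby on `en2`, losing `en1` yields `en2` as active and the spare node as the new standby.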
Several other combinations of topologies are possible based on the requirements of
the SLA, as described at the beginning of the section. The reader can build the necessary models to
meet the business requirements from the choices above.
Figure 7-61: Single NSX Edge cluster across two vSphere/VSAN clusters
If deployment of the NSX Edge Cluster across multiple VSAN datastores is not feasible then the
following configurations can be made to minimize the potential of a cascading failure due to the
VSAN datastore:
1. The vSphere Cluster should be configured with an FTT setting of at least 2 and an FTM of
RAID1.
This dictates that, for each object associated with the NSX Edge Node VMs, a witness and
two copies of the data (or component) remain available even if two failures occur, allowing
the objects to remain in a healthy state. This accommodates both a maintenance
event and an outage occurring without impacting the integrity of the data on the
datastore. This configuration requires at least five (5) hosts in the vSphere Cluster.
2. ToR and cabinet-level failures can be accommodated as well. This can be accomplished
in multiple ways: either by leveraging VSAN’s PFTT capability, commonly referred
to as Failure Domains for VSAN Stretched Clusters, together with a Witness VM running
in a third failure domain, or by distributing the cluster horizontally across multiple
cabinets so that no rack or cabinet holds more hosts than the number of failures
that can be tolerated (a maximum of FTT=3 is supported by VSAN).
3. When the vSphere cluster hosting NSX Edges is spread across multiple cabinets or ToRs,
it is critical to set DRS rules to better disperse the NSX Edge Nodes across the cabinets
and to have predictability about which one is running where. Even if the physical fabric allows
for IP mobility across racks, edge VM mobility should be limited to a single rack to avoid
dropping the BGP peering to the local ToR upon vMotion. BGP peering across racks is
highly discouraged.
4. It is strongly recommended that Hosts, any time they are proactively removed from
service, vacate the storage and repopulate the objects on the remaining hosts in the
vSphere Cluster.
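The five-host requirement in item 1 is consistent with the general vSAN sizing rule for RAID-1 mirroring, which needs 2 x FTT + 1 hosts. That rule is general vSAN guidance rather than something stated in this section, and the function names below are hypothetical. A minimal sketch of the arithmetic:

```python
def raid1_replicas(ftt: int) -> int:
    """Number of full data copies kept by RAID-1 mirroring for a given FTT."""
    if not 0 <= ftt <= 3:
        raise ValueError("FTT must be between 0 and 3")
    return ftt + 1


def raid1_min_hosts(ftt: int) -> int:
    """Minimum number of hosts for RAID-1 mirroring: 2 * FTT + 1."""
    if not 0 <= ftt <= 3:
        raise ValueError("FTT must be between 0 and 3")
    return 2 * ftt + 1
```

With FTT=2 the function returns five hosts, matching the figure given in item 1; FTT=3, the maximum mentioned in item 2, would require seven.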
Figure 7-62: Single NSX Edge Cluster, Each Rack NSX Failure Domain
For a realistic example of edge nodes deployed on a single vSphere cluster striped across
multiple racks, refer to the section USE CASE: IMPLEMENTING A REPEATABLE RACK DESIGN FOR A SCALABLE
PRIVATE CLOUD PLATFORM IN A LAYER 3 FABRIC.
While VSAN Stretched Cluster and other Metro-Storage Cluster technologies provide a very
high level of storage availability, NSX Edge Nodes provide an “Application Level” availability
through horizontal scaling and various networking technologies. If a dedicated vSphere Cluster
is planned to host the Edge Node VMs, using two independent clusters that are in diverse
locations as opposed to a single vSphere Cluster stretched across those locations should
seriously be considered as it is the best solution in most circumstances.
Figure 7-63: Single Architecture for Heterogeneous Compute and Cloud Native Application Framework
There are essential factors to consider when evaluating how to best design these workload
domains, as well as how the capabilities and limitations of each component influence the
arrangement of NSX resources. Designing multi-domain compute requires considering the
following key factors:
● Use Cases
○ IaaS
○ CaaS
○ VDI
○ PaaS
○ Other
● Type of Workloads
○ Enterprise applications, QA, DevOps
○ Regulation and compliance
○ Performance requirements
○ Security
● Management
○ RBAC
○ Inventory of objects and attributes controls
○ Lifecycle management
○ Ecosystem support – applications, storage, and staff knowledge
● Scale and Capacity
○ Compute hypervisor scale
○ NSX Platform Scale limits
○ Network services design
○ Bandwidth requirements, either as a whole compute or per compute domains
● Availability and Agility
○ Cross-domain mobility
○ Cross-domain connectivity
• Upgrades motivated by features or bug fixes required by a specific solution (e.g., TKG-S)
now affect the whole infrastructure (e.g., VDI), requiring more planning, testing, and an
overall higher burden on the operations team.
• Separation of management responsibility. Different teams may be responsible for
different solutions. The organization may be segmented per solution rather than per
technology (e.g., the VDI team has NSX expertise, as does the team managing the Tanzu
platform; no single NSX team manages NSX across the different internal IT offerings).
• Scalability. The NSX platform scalability limits affect all the solutions. Capacity planning
becomes more complex as the growth of all the different environments must be taken
into consideration.
• The mixing of manual and programmatic consumption (potentially by multiple plug-ins,
integrations, or tools) on the same NSX Manager creates the risk of configuration conflicts
and errors.
and project 3 where each project has a dedicated set of vSphere clusters within the same
compute manager.
Projects (and VPCs) also work on shared infrastructure where multiple tenants (Projects 4, 5, and
6) share the same vSphere clusters. Tenant isolation is enforced logically via logical routing
and the distributed and gateway firewalls.
Projects (and VPCs) are agnostic to the tenants’ compute infrastructure physical layout as they
were designed to work and provide logical separation on a shared infrastructure. If physical
separation is desired, it can be implemented at the vSphere level, but it will be transparent to
NSX multi-tenancy.
Up to NSX 4.2.0, NSX multi-tenancy required a single overlay transport zone (all project and
VPC segments are created as part of the default overlay transport zone). This means that all
project and VPC segments are available on all the vSphere clusters, regardless of the tenant
the cluster may be associated with. vCenter RBAC permissions can be implemented to ensure
that the vSphere users belonging to each tenant can connect VMs only to the appropriate NSX project
(and VPC) segments. This design is presented in FIGURE 7-65.
Starting with NSX 4.2.1 and vSphere 8.0U3, the vSphere dvpgs corresponding to project
segments and VPC subnets are automatically organized into vCenter folders. This behavior must
be manually enabled at the project level in NSX. It allows vCenter RBAC permissions to be
applied at the folder level so that all the dvpgs in the folder inherit them automatically.
Starting with NSX 4.2.1, it is possible to associate a non-default overlay transport zone with a
project (API only). All project segments and VPC subnets will be created as part of that
transport zone. A design with multiple overlay transport zones is not generally recommended.
The reasons are outlined in the dedicated section, 7.6.1.4. In some cases, this option may be
beneficial, for example when the tenants are given full access to tenant-dedicated vCenter
servers that share a common NSX Domain with multi-tenancy. The transport zone
separation prevents each tenant from connecting VMs to other tenants’ networks, as those
networks are not visible in the tenant's environment. A multi-tenant design
with dedicated transport zones per tenant is presented in FIGURE 7-66.
Note: VMware Cloud Foundation (VCF) supports a single overlay transport zone per NSX Domain,
hence this design is not applicable to VCF deployments.
Figure 7-66: Single NSX Domain with management plane multi-tenancy and separate overlay transport zones
NSX 4.2.1 in conjunction with vSphere 8.0U3 allows for organizing Projects segments and VPC
subnets into vCenter folders based on their NSX Project and VPC. This automatic organization
allows for easier enforcement of RBAC policies to the networks belonging to a project or VPC,
as they can be applied at the folder level rather than individually on each distributed port-
group.
VMware and third-party solutions have started adopting and consuming the new multi-tenancy
framework (e.g., vCloud Director starting in version 10.5.1, and Aria Automation starting in
version 8.18.1). Even when the integration is not ready, it is possible for projects and VPCs to
coexist in the same NSX domain with solutions that are agnostic to the multi-tenancy
framework. Those integrations will keep working in the NSX default space.
FIGURE 7-67 shows an example where projects are implemented on the same NSX installation
where the Tanzu and the Aria Automation integrations consume NSX in the default space. The
diagram presents the different use cases leveraging the same shared infrastructure. It is
certainly possible to segment the infrastructure and dedicate physical resources to each use
case (i.e., dedicated vSphere clusters, compute managers, or edge nodes), but be aware of the
requirement for any multi-tenancy networking object (Tier-0s, Tier-1s, segments) to be part of the
default overlay transport zone. Dedicating a transport zone to each use case requires the
external solutions to support non-default transport zones.
Figure 7-67: NSX multi-tenancy framework and default space integration coexistence
• The VDI environment runs an internally certified version of NSX and does not require frequent
updates. Each new NSX version needs to go through an internal certification cycle.
• The IaaS environment includes configuration objects (Gateways,
segments, DFW rules, Groups) dynamically created by Aria Automation. The frequent manual configuration
changes required by the security policies in the VDI environment should be carefully
planned to avoid conflicts, and manual configuration errors may lead to application
downtime.
Examples of design implications for the NSX domain separation are:
• When the administrator creates security policies in the VDI environment to provide access
to applications in the IaaS environments, groups based on IP sets must be used. Intelligent
grouping is not available for resources outside of the NSX domain. Policies may have to be
duplicated in the destination NSX Domain. (The duplication may be avoided by creating a
wider allow rule on the destination environment permitting the entire VDI pool range.
Granular policies can then be enforced at the source only.)
• Troubleshooting may be more complex, as native tools such as Traceflow or live traffic
analysis have a single NSX domain scope.
• Micro-segmentation planning may be more complex, as NSX Intelligence has a single NSX
domain scope.
can deploy NSX components such as edge nodes and management plane nodes on clusters
managed by other NSX instances or where NSX is not installed.
The primary reason to implement this topology (FIGURE 7-71) is to provide NSX management
plane multi-tenancy within a single vCenter Server deployment. Tenants have access to their
own dedicated NSX instance and any configuration they perform will only affect the workload
running on their respective cluster.
The benefits of this multi-tenancy model are the following:
• Complete management plane multi-tenancy.
• Reduced number of vCenter Servers to manage and lifecycle.
• Tenants have access to every NSX feature.
The implications of this multi-tenancy model are the following:
• NSX Manager sprawl. Hard to manage at scale without custom automation.
• Infrastructure resource overhead in small environments. NSX Manager VMs must be
deployed for each tenant, even if the environment includes few hosts (at times only one in DRaaS
use cases). The singleton NSX Manager deployment may alleviate the issue.
• Physical compute resources are dedicated to each tenant. A shared compute infrastructure
model is not supported.
Besides multi-tenancy, another use case for the multi NSX feature is the creation of a shared
pool of compute resources to host NSX components across multiple NSX domains. This scenario
is presented in FIGURE 7-72. Multiple NSX domains can deploy edge nodes on the same set of
clusters via the compute manager integration. This type of topology was supported before, but
it required the edge nodes to be deployed manually through the OVA.
Figure 7-72: Multi NSX compute manager leveraged as an edge nodes pool of resources
The official documentation for the multi NSX feature is available HERE .
7.6.1.4 Single NSX Domain with one or multiple overlay transport zones
Note: VMware Cloud Foundation (VCF) supports a single overlay transport zone per NSX Domain.
This section presents examples of how the NSX transport zone can be used to segment the
infrastructure for different use cases (VDI vs. IaaS) or environments (Test vs. Prod) while
providing a unified networking and security consumption model at the management plane
level.
Transport zones are not security boundaries and should not be implemented for the purpose of
separating workloads at the data plane level. Distributed firewall policies and dedicated routed
topologies are the appropriate NSX tools to achieve segmentation.
Transport zones provide a hard boundary for the span of NSX objects such as segments and
gateways. They provide a means to avoid mixing objects specific to a use case or environment
in compute managers or clusters serving other solutions. The implication is that vCenter server
domains, or as a minimum, vSphere clusters, should be dedicated to specific use cases or
environments and not shared between them. This type of design leads to potential
inefficiencies and lower consolidation ratios, but it may be motivated by compliance or risk
management reasons.
An edge node can be part of only a single overlay transport zone. This implies that dedicated edge
clusters and the corresponding peering configuration to the physical network must be deployed if
multiple overlay transport zones are part of the design. If this is a concern, the multiple peering
configuration problem can be solved via a two-level Tier-0 architecture, but the edge node
sprawl cannot be mitigated.
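The constraint above lends itself to a simple design-time check. The following is a minimal sketch that works on a plain mapping of edge node names to overlay transport zone names. The function names and the data layout are hypothetical and are not derived from the NSX API.

```python
from typing import Dict, List, Set


def nodes_violating_single_overlay_tz(edge_tzs: Dict[str, List[str]]) -> List[str]:
    """Return the edge nodes assigned to more than one overlay transport zone."""
    return sorted(name for name, tzs in edge_tzs.items() if len(set(tzs)) > 1)


def min_edge_clusters(edge_tzs: Dict[str, List[str]]) -> int:
    """Lower bound on edge clusters: one per distinct overlay transport zone."""
    zones: Set[str] = {tz for tzs in edge_tzs.values() for tz in tzs}
    return len(zones)


# Hypothetical design with one overlay transport zone per use case.
design = {
    "edge-iaas-01": ["tz-overlay-iaas"],
    "edge-iaas-02": ["tz-overlay-iaas"],
    "edge-vdi-01": ["tz-overlay-vdi"],
    "edge-vdi-02": ["tz-overlay-vdi"],
}
```

In this example the design is valid and needs at least two edge clusters, each with its own peering configuration to the physical network unless a two-level Tier-0 architecture is used.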
Figure 7-73: NSX Design with Single Overlay Transport Zone Design
7.6.1.4.2 Separate NSX Overlay Transport Zones Per Use Case Design
FIGURE 7-74 presents a design where vSphere clusters are allocated to each use case. The
separation may be driven by requirements for
compliance or lifecycle management. From an NSX perspective, we can map the same
separation by segmenting the vSphere clusters into different overlay transport zones based on the
associated use case. The consumption model of cloud automation platforms (e.g., Aria
Automation), container plug-ins (e.g., NCP in the Tanzu use case), or automation tools (e.g.,
Terraform) usually assumes the existence of a pre-created Tier-0 Gateway and an associated
overlay transport zone. When setting up the integration, the administrator passes the Tier-0
gateway, the edge cluster, and the transport zone where dynamically created objects should be
placed. The automation integration will create segments as part of the specified overlay
transport zone and will connect the dynamically created Tier-1 Gateways to the specified
Tier-0. This behavior matches the desired resource placement based on vSphere clusters and NSX
edge clusters dedicated to each use case.
It's important to emphasize that the design choice to use multiple overlay transport zones
derives from a higher-level design decision to segment the compute resources per use case.
Implementing multiple transport zones without such a requirement leads to unnecessary
segmentation of resources and a lower consolidation ratio, and it is in general a poor choice.
Each use case requires a dedicated NSX edge cluster because edge nodes can be part of a single
overlay transport zone only. The edge nodes can be deployed on the vSphere clusters dedicated
to the use case, as depicted in FIGURE 7-74, or on a shared vSphere cluster,
as depicted in FIGURE 7-75.
Figure 7-74: NSX Design with Overlay Transport Zone per use case - Edges on Compute Clusters
Figure 7-75: NSX Design with Overlay Transport Zone per use case - Edges on Shared Cluster
Figure 7-76: Multiple Overlay Transport Zones on the same compute cluster
this stage. We are still planning our design at a more abstract level. The decision to include a
disaster recovery site or multiple active data centers will depend on non-functional
requirements such as the high availability SLA and RPO/RTO. The design may cater to a single
application with very stringent non-functional requirements. In such a case, the
conceptual/logical design can be very simple, while the complexity resides in implementing
resilient physical components across multiple sites.
Once we have our conceptual model in place, we can now map the abstract elements of the
conceptual design to actual NSX components. FIGURE 7-79 presents an example of such mapping,
where each tenant is mapped to an independent NSX Domain, projects to dedicated compute
managers (vCenter servers) and Tier-0 Gateway, and individual applications, VDI pools, and
Tanzu Namespaces to segments and/or Tier-1 Gateways.
Let’s consider the implications of such mapping in a multi-environment, multi-use case design
where the first level of tenancy (tenant/org) is mapped to the environment while the second
(project) is mapped to the use case. An example of such mapping is presented in FIGURE 7-80.
Figure 7-80: Two Tier Multi tenancy design with mapping to NSX constructs
• A dedicated Tier-0 Gateway for each use case: IaaS and VDI. Performance, scalability, and
high availability requirements can be met independently for the two use cases.
• The IaaS Tier-0 gateway is referenced in the cloud management platform (e.g., Aria
Automation), which dynamically creates Tier-1 Gateways and segments for each
application deployment. The dynamically created Tier-1 gateways are connected to the
IaaS Tier-0 gateway.
• VDI pools are deployed on dedicated segments rather than on a shared one. Distributed
firewall security policies can be applied to the VDIs at deployment time, based on the
network where they reside, without human intervention or custom automation.
(Segments can be part of groups, both via direct membership and based on tags.)
• VDI does not require edge network services. VDI segments are directly connected to an
Active/Active Tier-0 gateway.
FIGURE 7-82 presents an example where the same use cases are implemented, but the chosen
topology is different. The design has the following properties:
• The two use cases share the same Tier-0 gateway. North/South traffic for both goes
through the same edge nodes. The edge design must cater to the requirements of both
use cases at the same time.
• The shared Tier-0 should be deployed in Active/Active (no stateful services) to allow for
scaling out the North/South bandwidth.
• A Tier-1 Gateway is deployed for each use case, but no stateful services are required on
them. The separation is mostly logical. It has no impact on the data plane as the routing
is fully distributed across the entire deployment.
• Each application or VDI pool is deployed on a dedicated segment.
• Centralized networking services are not required. Security is provided via distributed
firewall.
• If load balancer services are needed, they can be deployed in one-arm mode.
FIGURE 7-83 shows an example where the entire NSX Domain is dedicated to a single use case
(IaaS). Within the same NSX domain, two environments are present: Prod and Test. The design
has the following properties:
• The two environments share the same Tier-0 gateway. North/South traffic for both goes
through the same edge nodes. The edge design must cater to the requirements of both
environments at the same time.
• The shared Tier-0 is deployed in Active/Active (no stateful services) to allow for scaling
out the North/South bandwidth.
• Centralized networking services are not required. Security is provided via distributed
firewall.
• If load balancer services are needed, they can be deployed in one-arm mode.
• Each environment has a dedicated segment and corresponding IP range.
• When an environment grows beyond a single segment limit, segments can be added and
connected to the same Tier-0 gateway.
• Segmentation between environments is provided via DFW based on the IP ranges
associated with each environment. Such policies are manually created by the
administrator.
• The CMP (e.g., Aria Automation) sees the segments as external networks. This means that
those segments must be pre-created by the NSX admin, referenced in the Aria Automation
configuration, and mapped to the two different projects. Aria Automation will place the
dynamically created VMs on those segments based on the environment. Because of the
segment placement, the VMs will inherit the environment segmentation based on IP
ranges implemented via DFW.
• The separation between applications within an environment can be achieved via
dynamically created groups and security policies managed by the CMP.
• Larger, shared, and manually configured segments facilitate the integration with disaster
recovery orchestration tools (e.g., VMware Site Recovery Manager) compared to
segments dynamically created by the CMP. Network mappings between sites can be
configured in advance and do not necessitate frequent updates.
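The IP-range-based segmentation between environments described in the list above can be modeled in a few lines. This is a minimal sketch of the policy intent only, not of the DFW or its API: the `ENVIRONMENT_RANGES` table, its example prefixes, and the function names are hypothetical.

```python
import ipaddress
from typing import Optional

# Hypothetical IP ranges assigned to each environment's pre-created segment.
ENVIRONMENT_RANGES = {
    "prod": ipaddress.ip_network("10.10.0.0/16"),
    "test": ipaddress.ip_network("10.20.0.0/16"),
}


def environment_of(ip: str) -> Optional[str]:
    """Return the environment whose range contains `ip`, or None if no range matches."""
    address = ipaddress.ip_address(ip)
    for environment, network in ENVIRONMENT_RANGES.items():
        if address in network:
            return environment
    return None


def same_environment(src_ip: str, dst_ip: str) -> bool:
    """True when both addresses belong to the same known environment."""
    src_env = environment_of(src_ip)
    return src_env is not None and src_env == environment_of(dst_ip)
```

A VM placed by the CMP on the Prod segment receives an address in the Prod range and therefore falls under the Prod policies without any per-VM configuration, which is the inheritance effect the list describes.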
This model is the most scalable and provides less resource contention for the NSX components.
the management cluster is prepared for NSX, overlay networks can be consumed by the
management appliances part of the Aria suite, but not by vCenter and NSX Manager.
Figure 7-85: vSphere Edge Cluster - Dedicated Hosts to P2V and Services edges
Figure 7-86: vSphere Edge Cluster - Shared Hosts to P2V and Services Edges
vSphere edge clusters demand specific configurations on the leaf switches to which they
connect. Specifically, they require VLANs for edge management, edge overlay, and edge routing
peering. If we stripe the vSphere edge cluster across different racks, we must extend those
VLANs as well. NSX edge clusters can be deployed across multiple vSphere clusters, and they do
not require layer 2 connectivity between them. In a layer 3 fabric, it is recommended to keep
the vSphere edge cluster confined to a single rack and deploy NSX Edges in different vSphere
clusters if rack availability is required. See FIGURE 7-87.
Figure 7-87: NSX Edge Cluster deployed across 2 different vSphere clusters in different racks
Note: in some scenarios, the recovery of an edge VM may cause a traffic interruption if the
dynamic routing peering to the physical network and the southbound data plane on the overlay
side are not ready simultaneously. To mitigate this risk, it is possible to increase the Tier-0
forwarding up timer to a value higher than the default of 0 seconds. This allows the edge to be fully
ready before being placed in service. This scenario may affect any operation that involves
adding an edge node to a Tier-0 already serving traffic, including edge VM vSphere HA recovery,
an edge node exiting maintenance mode, or adding an edge node to a Tier-0 gateway to scale
up N/S bandwidth.
DRS and vMotion
We recommend setting DRS to Fully Automated and using DRS "should" rules to obtain deterministic
placement of the edge VMs during normal operations while allowing edge VMs to move during
ESXi maintenance.
From an operational perspective, disabling DRS (or setting it to anything other than Fully
Automated) requires manually moving or powering off the edge VMs before placing the ESXi
host in maintenance mode. The design choice is between control and predictability on one side and
the lifecycle management automation that comes with vSphere on the other. When DRS is enabled for
edge node VMs hosting a Tier-0 gateway, we want to limit unnecessary migrations that can cause
interruptions to the north/south traffic. Implementing VM-to-Host "should" rules makes it possible to
specify on which host the edge VM will run. DRS will not migrate the edge VM unless the host is
placed in maintenance mode or extreme resource contention occurs.
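The effect of a VM-to-Host "should" rule can be pictured with a small placement model. This is an illustrative sketch only, not the DRS algorithm or the vSphere API: the `place_edge_vm` function and the host names are hypothetical, and the extreme resource contention case is deliberately left out.

```python
from typing import Iterable, Optional, Set


def place_edge_vm(preferred_host: str,
                  cluster_hosts: Iterable[str],
                  hosts_in_maintenance: Set[str]) -> Optional[str]:
    """Return the host an edge VM runs on under a "should" rule.

    The VM stays on its preferred host while that host is in service. If the
    preferred host enters maintenance mode, the VM may move to any other host
    that is still in service. Returns None when no host is available.
    """
    if preferred_host not in hosts_in_maintenance:
        return preferred_host
    for host in cluster_hosts:
        if host not in hosts_in_maintenance:
            return host
    return None
```

Because the rule is a preference rather than a hard constraint, placing the preferred host in maintenance mode does not block the operation; the edge VM is simply moved to another available host.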
Figure 7-88: vSphere Edge cluster and vSphere HA for P2V Edges
Figure 7-89: vSphere edge cluster - vSphere HA and DRS for Service Edge VMs
With a single vSphere edge cluster hosting independent NSX Edge clusters dedicated to different
use cases (e.g., p2v and services), it’s possible to customize the vSphere HA and DRS settings per
edge VM to achieve different behavior per use case. When the NSX edge cluster is shared (e.g.,
the same edge node VMs provide p2v and services), the design decisions around vSphere HA
and DRS settings are more complex and should be based on which design properties are the most
important.
When limited compute resources are available, or the footprint of the management and edge
components is limited, collapsing those components in the same vSphere cluster is an option.
When dedicated clusters are not an option, collapsing edge and management functions is
usually the preferred option because of the predictable footprint and growth of the
management components compared to those of the application workloads.
Mixing management and edge components in smaller or simpler deployments is not an issue as
long as the requirements for both workload types are taken into consideration when designing
the hardware resources in the cluster. For example, the four-node management/edge cluster
depicted in FIGURE 7-91 can be adequate to host essential infrastructure management
components and edge nodes dedicated to North/South traffic and minimal services. Having
dedicated pNICs for the edge node VMs hosting the Tier-0 Gateway is always a good idea. Host
traffic (vMotion, storage) should share the pNICs used by the management appliances rather
than those used by the edge VMs. Deploying a management/edge cluster with two-pNIC hosts
is supported, but it will most likely represent a compromise in terms of north/south
performance.
Refer to the dedicated sections for management components and edge node connectivity for
detailed guidelines in terms of virtual switch design and teaming policies. The diagrams below
present a detailed view of such configuration settings in two- and four-pNIC designs.
[Figure: Collapsed management and edge node VMs on a single VDS with two pNICs (P1, P2). Port groups: VMK Storage PG, VMK vMotion PG, VMK Mgmt PG, Trunk DVPG-1 (A/S), Trunk DVPG-2 (A/S).]
[Figure: Collapsed management and edge node VMs on two VDSs (VDS-1, VDS-2) with four pNICs (P1 to P4). Port groups: Storage PG, vMotion PG, Mgmt PG, Mgmt PG, Trunk DVPG 1, Trunk DVPG 2.]
Figure 7-93: Collapsed Management and Edge VM on Separate VDSs with 4 pNICs
Some designs incorporate a collapsed edge/compute cluster design. This model may be
appealing from a resource planning perspective and allows for a simple chargeback model. It
allows the operator to dedicate ESXi resources to a use case or a tenant and use those
resources for both workloads and network services.
A dedicated vSphere edge cluster may mean that multiple use cases or tenants will share the
same compute resources. Sharing the compute hosts dedicated to edge services across multiple
use cases or tenants makes resource planning and measuring consumption more complex.
Splitting the edge services in dedicated vSphere clusters may be unfeasible from a cost
perspective. The co-placement of the edge nodes with the associated workloads may become
appealing to preserve the resource segmentation across the use cases. A collapsed
compute/edge deployment model presents the following challenges:
• Resource contention and planning. Application workload demands may vary over time
and may affect edge services performance even when the appropriate capacity planning
was performed as part of the initial implementation.
• North/South edge services benefit from dedicated pNICs. Compute clusters are usually
deployed with a two pNIC model. We should not deploy more than one edge per host to
avoid excessive oversubscription.
• Compute hosts require minimal configuration on the physical fabric: four VLANs local to
each rack, management, vMotion, Overlay, and storage, even when the cluster is striped
across racks. Edge nodes require additional VLANs that must be available on each host in
the cluster. A layer 2 fabric is now required to stripe the compute cluster across racks if
the edge node VMs are co-located with compute workloads. Edge VMs running T1
Services and no T0 can run without problems on a cluster striped across racks in a layer 2
fabric.
• Edge nodes with T0s should always create routing adjacencies with switches directly
connected to the host where the edge node VMs reside. This may be unfeasible in an
edge/compute cluster spanning multiple racks. An edge VM moving between racks may
lose the BGP peering adjacencies unless the network fabric extends the peering VLANs
between racks (not recommended). A vSphere cluster hosting T0 Edge VMs should be
confined to a single rack. Rack availability can be provided via another vSphere cluster
hosting edge node VMs part of the same NSX edge cluster and T0 Gateway. In case the
cluster must be striped across racks, mobility of the T0 Gateway should be limited to a
single rack even during maintenance operations (Placing the edge node in maintenance
mode and powering it off is an option if adequate compute resources are not available in
the same rack).
• vSphere HA is most likely enabled globally for the cluster to support compute workloads.
If vSphere HA is not desired for T0 edge VMs (for example, when recovering the edge VM on a
different host would not improve availability or performance), it must be disabled
manually for those VMs. vSphere HA for T1-only service edges or mixed T0/T1 edges is a good
practice in most situations.
• vSphere HA does not honor preferential rules (should) but respects mandatory rules. This
behavior should be considered when the placement of edge node VMs is critical (e.g., T0 edge
VMs with a vSphere cluster striped across racks).
• DRS is most likely enabled for the cluster to support compute workloads. VM to Host
Should rules should be implemented to enforce the desired host to T0 edge VM
placement and limit the T0 edge VMs mobility to host maintenance events.
• Limit over reservation of CPU/memory resources for the compute workloads as it may
lead to DRS moving the T0 edge VMs regardless of the preferential rules.
• DRS rules, in conjunction with NSX edge failure domains, should be in place so that
Active/Standby Gateways do not reside on the same ESXi host during normal operations
and maintenance windows (Anti-affinity rules).
• In a two pNIC host design, the edge node VMs connect to an N-VDS or a VDS prepared for
NSX. Connectivity guidelines presented in “FIGURE 7-49: EDGE VM CONNECTED TO NSX
PREPARED VDS – HOST AND EDGE TEP ON DIFFERENT IP/VLAN” or “FIGURE 7-50: EDGE VM
CONNECTED TO NSX PREPARED VDS – HOST AND EDGE TEP ON THE SAME IP/VLAN - NSX 3.1 OR
LATER” should be followed.
In a fully collapsed cluster design, compute workloads, edge node VMs, and management
components all reside in the same vSphere cluster. The NSX EASY ADOPTION DESIGN
GUIDE presents an example of this deployment model in great detail. Please refer to the DC in A
Box section. A collapsed cluster design presents the challenges of both the compute/edge, and
the management/edge collapsed models, but those concerns are usually mitigated by the
smaller size of the deployment, usually confined to a single rack. Key considerations for a fully
collapsed cluster are:
• In a two pNIC host design, the edge node VMs connect to an N-VDS or a VDS prepared for
NSX. Connectivity guidelines presented in “FIGURE 7-49: EDGE VM CONNECTED TO NSX
PREPARED VDS – HOST AND EDGE TEP ON DIFFERENT IP/VLAN” or “FIGURE 7-50: EDGE VM
CONNECTED TO NSX PREPARED VDS – HOST AND EDGE TEP ON THE SAME IP/VLAN - NSX 3.1 OR
LATER” should be followed.
• NSX Manager should not be placed on a VLAN segment but on a vCenter-managed DVPG.
The diagrams below outline virtual switch and teaming policies design guidelines for four pNICs
and two pNICs hosts.
Figure 7-97: Collapsed Edge/Management/Compute cluster - 2 pNICs - separate VLANs for Host and Edge TEPs
Figure 7-98: Collapsed Edge/Management/Compute cluster - 2 pNICs - single VLAN for Host and Edge TEPs
7.6.3.5 Use case: implementing a repeatable rack design for a scalable private cloud platform
in a layer 3 fabric
We have seen that vSphere clusters dedicated to management, edge, and compute workloads
have different connectivity and resource requirements. It is possible to cater to such specific
requirements by customizing the hardware design for each class of workloads (e.g., CPU core
count and pNIC design specific to edge hosts) and dedicating racks to a specific cluster function
(e.g., racks dedicated to ESXi hosts running edges with ToR switches configured with the
appropriate VLANs). While such a design may be the most effective at leveraging the available
resources, it may lack the repeatability and consistency required in large-scale deployments.
Large-scale private clouds require a high degree of automation, from the automatic
deployment and configuration of the ToR switches and the provisioning of the ESXi hosts to the
dynamic allocation of resources to different compute pools (e.g., a vSphere cluster or a VCF
workload domain).
This section presents different options to create a repeatable rack layout for compute and
edge workloads. We assume that management workloads can be centralized and do not
need to follow the same repeatable pattern because of their predictable resource
requirements.
Figure 7-100: Resource Block - Dedicated vSphere edge cluster with one host per rack
In FIGURE 7-100, all hosts have the same hardware configuration (e.g., 2x25G pNICs, 32 cores,
512G Memory). One host in each rack is dedicated to NSX edge services. The host runs two or
more NSX edge node VMs where both a T0 Gateway in ECMP and T1 Gateways with services
are hosted. While this configuration may not be optimal for edge performance, it allows for a
repeatable deployment model that enhances the scalability and the manageability of the
overall platform. This model can be extended to a design where T0 and T1 gateways run on
different edge nodes that are part of different edge clusters. Such a model is not discussed in this section.
In FIGURE 7-100, the edge hosts are grouped in a dedicated vSphere edge cluster. The cluster is
striped across racks. We mentioned already that, in general, it’s not a good practice to stripe
vSphere edge clusters across racks, but in this case, the repeatability of the deployment model,
which dictates one edge host per rack, takes precedence. The three hosts that are part of the edge
cluster have access to the same VSAN Datastore, but they cannot provide VM mobility across
racks for the edge VMs because each rack has specific management, overlay, and layer 3
peering VLANs. If an edge VM migrates to a different rack, it will lose connectivity
immediately. Design guidelines and implications for the model are:
• We should map edge node VMs to NSX failure domains based on the rack where they are
deployed.
• Edge Node VMs cannot move across racks because the appropriate management,
overlay, and L3 peering networks are not available on every rack.
• If the hosts in the edge vSphere cluster are part of the same VDS and the VLAN IDs are
repeated on each rack for consistency, DRS is not aware that the available networks are
different and may move the edge VMs anyway.
• DRS MUST rules should enforce the correct rack placement for each edge node VM to
avoid any edge VM migration.
• If a single edge ESXi host is present in each rack, as depicted in FIGURE 7-100, the host
cannot be evacuated to be placed in maintenance mode. Edge node VMs must be placed
in maintenance mode and then shut down before the host can be placed in maintenance
mode. When the host exits maintenance mode and the edge VM is powered on and brought
back online, the status of its services must be checked manually (e.g., BGP neighbors
are up) before proceeding with the maintenance of another ESXi host and the
corresponding edge VMs.
Figure 7-101: Resource Block - Dedicated vSphere edge cluster with two hosts per rack
The design in FIGURE 7-101 has the following guidelines and implications:
• Edge hosts can be evacuated by migrating the edge VMs with vMotion to a different host in
the same rack during maintenance. Edge nodes are not shut down.
• Edge Node VMs cannot move across racks because the appropriate management,
overlay, and L3 peering networks are not available on every rack.
• If the hosts in the edge vSphere cluster are part of the same VDS and the VLAN IDs are
repeated on each rack for consistency, DRS is not aware that the available networks are
different and may move the edge VMs anyway.
• DRS MUST rules should enforce the correct rack placement for each edge node VM to
avoid any edge VM migration across racks.
• DRS SHOULD rules will dictate the preferred host placement for each edge VM. Because
multiple hosts are now present in the same rack, DRS in fully automated mode could
rebalance the edge VM placement during normal operations. The SHOULD rule will
prevent undesired migrations. A SHOULD rule will be violated when placing the ESXi host
in maintenance mode: the edge VMs will be moved to another host in the same rack. They
won’t be moved to a different rack because MUST rules cannot be violated.
• If DRS rules management is considered too much of an overhead, DRS can be disabled or
set to partially automated. In this case, edge VMs must be migrated manually during
maintenance. This choice carries the risk of human error (e.g., edge node VM migrated
to a different rack by mistake).
• Even with six hosts in the edge cluster spread across three racks, the VSAN storage policy
has a PFTT=1 when each rack is mapped to a VSAN failure domain.
• Two edge hosts per rack may represent a waste of computing resources, especially in
deployments with five racks or more.
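The interaction between the MUST (rack) and SHOULD (host) rules listed above can be reasoned about offline with a small model. The following is a minimal sketch in plain Python and does not call any vSphere or NSX API; the `Host` and `EdgeVm` types, the `check_placement` helper, and all host and VM names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    rack: str


@dataclass(frozen=True)
class EdgeVm:
    name: str
    must_rack: str    # MUST rule: the VM may only run on hosts in this rack
    should_host: str  # SHOULD rule: preferred host during normal operations


def check_placement(vm: EdgeVm, current_host: str, hosts: dict) -> str:
    """Classify a placement as 'ok', 'should-violated', or 'must-violated'."""
    host = hosts[current_host]
    if host.rack != vm.must_rack:
        # Crossing racks breaks management, overlay, and peering connectivity.
        return "must-violated"
    if host.name != vm.should_host:
        # Acceptable during host maintenance, undesired in normal operations.
        return "should-violated"
    return "ok"


# Hypothetical inventory: two edge hosts in rack1, one in rack2.
hosts = {
    h.name: h
    for h in (
        Host("esx-r1-a", "rack1"),
        Host("esx-r1-b", "rack1"),
        Host("esx-r2-a", "rack2"),
    )
}
edge = EdgeVm("edge-01", must_rack="rack1", should_host="esx-r1-a")

for candidate in hosts:
    print(candidate, check_placement(edge, candidate, hosts))
```

A check of this kind could be run against an exported inventory before a maintenance window to confirm that every planned edge VM move stays within its rack.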
Design guidelines and implications for the design in FIGURE 7-102 are:
• A complex set of DRS rules is required.
• One VM to host MUST rule per rack is required to avoid edge VM mobility across racks.
• VM to Host SHOULD rules must be in place to avoid edge VM movements across hosts in
the same rack during normal operations.
• A DRS rule (SHOULD or MUST, depending on the design requirements) is required to avoid or
discourage the placement of compute workloads on ESXi hosts dedicated to edge node VMs. This
DRS rule has the most impactful implications in terms of day-2 operations because any
new workload deployed on the cluster needs to be added to the corresponding VM
group to avoid resource contention with the edge node VMs.
Because of this underlying data center framework, this chapter primarily focuses on
performance in terms of throughput for TCP-based workloads.
NSX provides two datapath stacks: Standard and Enhanced Datapath Standard. In this chapter,
we will cover both.
There are some niche workloads, such as NFV, where raw packet processing may be ideal, and
the enhanced version of N-VDS, called N-VDS (E), was designed to address these requirements.
Check out the last part of this section for more details on N-VDS (E).
With its options field length specified for each packet within the Geneve header, Geneve allows
packing arbitrary information into the header of each packet. This flexibility offered by
Geneve opens up doors for new use cases, as additional information may be embedded into
the packet to help track the packet’s path or for in-depth packet flow analysis.
FIGURE 8-2 shows the location of the Length and Options fields within the Geneve header and
also shows the location of the TEP source and destination IPs.
For further insight into this topic, please check out the following blog post:
https://octo.vmware.com/geneve-vxlan-network-virtualization-encapsulations/
Performance Considerations
In this section we will take a look at the factors that influence performance. The performance of a
workload in a virtualized environment depends on many factors within that environment:
the hardware used, drivers, features, etc. FIGURE 8-3 shows the different areas that may have an
impact on performance.
Workloads
Data centers today typically carry different types of traffic flows with different SLAs. The
following table is a summary of workload types, requirements, and features that could help
with such traffic.
Mixed Flows | A mix of the above workloads | Mix of bandwidth and latency | RSS or Rx / Tx Filters
Feature | Description
Geneve Rx / Tx Filters | Queuing mechanism that enables multiple cores to be used for packet processing based on the inner-packet headers.
We will take a closer look at each of these features in the following sections.
Look for the tag “VMNET_CAP_Geneve_OFFLOAD”, highlighted in red above. This verbiage
indicates the Geneve Offload is activated on NIC card vmnic3. If the tag is missing, then it
means Geneve Offload is not enabled because either the NIC or its driver does not support it.
[Host-1] # vsish
/> get /net/pNics/vmnic0/rxqueues/info
rx queues info {
# queues supported:5
# filters supported:126
# active filters:0
Rx Queue features:features: 0x1a0 -> Dynamic RSS Dynamic
Preemptible
}
/>
CLI 2 Check RSS
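When the same check has to be repeated across many hosts, the counters in the output above can be extracted programmatically. The following is a minimal sketch that parses text in the format shown in CLI 2; the `parse_rx_queue_info` function is a hypothetical helper, the sample string is copied from the output above, and the exact output format may differ across ESXi releases and drivers.

```python
import re

# Sample text copied from the vsish output shown above.
SAMPLE = """rx queues info {
   # queues supported:5
   # filters supported:126
   # active filters:0
}"""


def parse_rx_queue_info(text: str) -> dict:
    """Return the '# name:value' counters found in the output as integers."""
    counters = {}
    for match in re.finditer(r"#\s*([^:\n]+):\s*(\d+)", text):
        counters[match.group(1).strip()] = int(match.group(2))
    return counters


info = parse_rx_queue_info(SAMPLE)
print(info)
```

How the text is collected from each host (for example over SSH) is left out of the sketch on purpose.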
To overcome this limitation, the latest NICs (SEE COMPATIBILITY GUIDE) support an advanced
feature, known as Rx Filters, which looks at the inner packet headers for hashing flows to
different queues on the receive side. In FIGURE 8-8: RX FILTERS: FIELDS USED FOR
HASHING, the fields used by Rx Filters are circled in red.
Simply put, Rx Filters look at the inner packet headers for queuing decisions. As driven by NSX,
the queuing decision itself is based on flows and bandwidth utilization. Hence, Rx Filters
provide optimal queuing compared to RSS, which is akin to a hardware-based brute force
method.
Geneve Rx Filters help increase the number of cores used to process incoming traffic, which
in turn increases performance by a factor of 4, based on the number of hardware queues
available on the NIC. For older cards which do not support Geneve Rx Filters, check whether
they at minimum have RSS capability. The following graph shows the throughput achieved
without and with Geneve Rx Filters, 10Gbps vs near line rate:
=====================================================================
FIGURE 8-9 Note: This test was run with LRO enabled, which is software-supported starting with
ESX version 6.5 and higher on the latest NICs which support the Geneve Rx Filter. Thus, along
with Rx Filters, LRO contributes to the higher throughput depicted here.
=====================================================================
https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io
The following section shows the steps to check whether a NIC, an Intel E810 in this case, supports
one of the key features, Geneve offload.
1. Access the online tool:
https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io
2. Specify the following:
   1. Version of ESX
   2. Vendor of the NIC card
   3. Model if available
   4. Select Network as the IO Device Type
   5. Select Geneve Offload and Geneve Rx Filters (more on that in the upcoming
      section) in the Features box
   6. Select Native
   7. Click “Update and View Results”
3. From the results, click on the ESX version for the concerned card. In this example, ESXi
version 7.0 for Intel E810-C with QSFP ports:
4. Click on the [+] symbol to expand and check the features supported.
5. Make sure the concerned driver actually has the “Geneve-Offload and Geneve-Rx
Filters” as listed features.
Follow the above procedure to ensure Geneve offload and Geneve Rx Filters are available on
any NIC card you are planning to deploy for use with NSX. As mentioned earlier, not having
Geneve offload will impact performance, with more CPU cycles spent in software to make up for
the lack of hardware-based Geneve offload capabilities.
[Chart: TCP Throughput, Intel® Xeon® Platinum 8352Y @ 2.20 GHz with Intel® 810C. Y-axis: Throughput (Gbps), 0 to 100. X-axis: Logical Switch, Tier1 Router, Tier0 Router. Legend: 8800.]
PCIe Gen 4
PCIe Gen 4 doubles the bandwidth when compared to PCIe Gen 3, which translates into higher
throughput even when using a standard 1500 MTU. The following graph shows the throughput on
a PCIe Gen 4 platform based on a single-socket AMD EPYC™ 7F72 @ 3.2 GHz, using a single-port
PCIe Gen 4 Mellanox ConnectX-6 NIC and a standard 1500 MTU.
Figure 8-15: Throughput with AMD® EPYC™ 7F72 – MTU 1500 Bytes
MTU is a key factor for driving high throughput, and this is true for any NIC which supports a 9K
MTU, also known as jumbo frames. The following graph, FIGURE 8-16: MTU AND THROUGHPUT, shows
the throughput achieved with the MTU set to 1500 and 8800:
The recommendation for optimal throughput is to set the MTU of the underlying fabric and the ESX
host’s pNICs to 9000 and the VM vNIC MTU to 8800.
Notes for FIGURE 8-16: MTU AND THROUGHPUT:
• The above graph represents a single pair of VMs running iPerf with 4 sessions.
• For both VM MTU cases of 1500 and 8800, the MTU on the host was 9000 with
demonstrated performance improvements.
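The gap between the recommended VM MTU (8800) and fabric MTU (9000) leaves room for the Geneve encapsulation. The following is a rough sketch of that arithmetic, not an NSX tool. It assumes an IPv4 underlay, no inner VLAN tag, and the fixed 8-byte Geneve base header with up to 252 bytes of options as defined in RFC 8926; the function names are hypothetical.

```python
# Header sizes in bytes. The IPv4 underlay and the untagged inner frame are
# assumptions made for this sketch.
OUTER_IPV4 = 20
OUTER_UDP = 8
GENEVE_BASE = 8            # fixed Geneve header (RFC 8926)
GENEVE_MAX_OPTIONS = 252   # options length is carried in 4-byte multiples
INNER_ETHERNET = 14        # inner Ethernet header, no VLAN tag


def encapsulated_size(vm_mtu: int, geneve_options: int = 0) -> int:
    """Size of the outer IP packet carrying one full-size inner packet."""
    if geneve_options % 4 or not 0 <= geneve_options <= GENEVE_MAX_OPTIONS:
        raise ValueError("Geneve options must be a multiple of 4 in 0..252")
    return (vm_mtu + INNER_ETHERNET + GENEVE_BASE + geneve_options
            + OUTER_UDP + OUTER_IPV4)


def option_headroom(vm_mtu: int, fabric_mtu: int) -> int:
    """Bytes left for Geneve options before exceeding the fabric MTU."""
    return fabric_mtu - encapsulated_size(vm_mtu)


print(encapsulated_size(8800))      # 8850
print(option_headroom(8800, 9000))  # 150
print(option_headroom(1500, 1600))  # 50
```

Under these assumptions, a VM MTU of 8800 on a 9000-byte fabric leaves 150 bytes for Geneve options, which is why the VM MTU is set below the fabric MTU rather than equal to it.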
For the VM, use the commands specific to the operating system for checking MTU. For
example, “ifconfig” is one of the commands in Linux to check MTU.
Oversubscription
Oversubscription is a key consideration from a performance perspective. It should be taken
into account in both the host hardware design and the underlying physical network.
vNIC Queues
Once the lower-level details are tuned for optimal performance, the next consideration moves
closer to the workload. At this point the question is whether the workload is able to consume
packets at the rate at which ESXi is tuned to send them. Note that this is generally not a concern
when a single ESXi host is hosting many VMs. Optimally, the number of VMs should be twice the
total number of queues provided by the pNICs. That is, if the total number of queues across all
pNICs is 4, that ESXi host should have at least 8 VMs. In this typical data center use case, while
each VM may not be able to consume at the ESXi’s packet processing rate, many VMs together may
be able to reach that balance.
Queuing at the vNIC only becomes relevant for instances where the ESXi host may be running one
or two VMs and the expectation for the workload is pure packet processing, as in the case of the
Edge VM. In fact, enabling multiple queues is the recommended and default option for Edge VMs.
Similar to the RSS and queuing configuration at the pNIC level, the vNIC also allows configuring
multiple queues instead of the default single queue. The Edge VM section has more details on
how to set this up.
3. Add the two configuration parameters by clicking on “Add Configuration” for each item
to add:
While vNIC queues can be set for any VM, the workload running on the VM should be able to
leverage the multiple queues in order to see a benefit. The NSX Edge VM is designed to take
advantage of multiple queues, which should be leveraged for optimal performance.
vCPU Count
As discussed in the pNIC queues and vNIC queues topics, each queue needs a core/thread to
service it. However, these cores are not dedicated to just servicing the queues but are engaged
on demand when needed. That is, they may be used for workloads when there is no network
traffic. Hence, a system should not only be tuned for max performance via multiple queues,
but care should be taken that the system actually has access to the required number of cores to
service those queues.
While the pNIC queues depend on the system cores, that is, the physical cores used for packet
processing, vNIC queues depend on the vCPU cores. This vCPU count should be considered
when allocating vCPUs for the workload. However, as called out in the vNIC queuing section,
this is only true for workloads focused on pure packet processing, such as the NSX Edge VM. For
most common data center TCP workloads, which leverage offloads such as Geneve offload, this is
not a concern.
The following table lists the vCPU count for different Edge VM form factors:
Small 1 1 2
Medium 2 2 4
Large 4 4 8
Extra Large 8 8 16
Model Description
The following image visualizes the three methods of core allocation: Interrupt driven, Dedicated
and Offloaded:
at processing small packet sizes (~78 bytes) at near line rate on a 40Gbps port such as the Intel® XL710,
which is useful in NFV-style workloads. The following graph in FIGURE 8-20 shows the
performance of the bare metal Edge in a standard RFC 2544 test with IXIA.
Figure 8-20: Bare Metal Edge Performance Test (RFC2544) with IXIA
Note: For the above test, the overlay lines in blue are calculated by adding throughput
reported by IXIA and the Geneve Overlay header size.
SSL Offload
VMware NSX bare metal Edges also support SSL Offload. This configuration helps reduce
the CPU cycles spent on SSL processing and also has a direct impact on the throughput achieved.
In the following image, Intel® QAT 8960s are used to showcase the throughput achieved with
SSL Offload.
Workload / Requirement | Mostly large flows such as logs | Mix of flows with high performance requirements | High PPS | High PPS
Features that Matter | Geneve-Offload: To save on CPU cycles; Geneve-RxFilters: To increase throughput by using more cores and using software based LRO; RSS (if Geneve-RxFilters does not exist): To increase throughput by using more cores | Enhanced Data Path Standard: For DPDK-like capabilities; Geneve-Offload: To save on CPU cycles; Geneve-RxFilters: To increase throughput by using more cores and using software based LRO; RSS (if Geneve-RxFilters does not exist): To increase throughput by using more cores | Enhanced Data Path Standard: For DPDK-like capabilities; RSS: To leverage multiple cores | DPDK: Poll mode driver with memory-related enhancements to help maximize packet processing speed; QATs: For high encrypt/decrypt performance with SSL-offload
Datapath Options
NSX provides two datapath options for typical data center workloads:
• Enhanced Datapath Standard
• Standard
Standard
Standard is an older, interrupt-driven datapath that is not designed to address
current typical data center performance requirements. The Standard datapath requires extensive
tuning for any packet processing focused workloads, such as the NSX Edge VMs, and is
also heavy on CPU usage. It is enabled by default in any current version of NSX or
VCF (up to NSX 4.2.x and VCF 5.2.x).
Overview
At the heart of the Enhanced Datapath Standard are three components that together enable
high performance:
8.5.1.4 VCF UI
In the VCF Add Cluster dialog box, Operational Mode under Switch Configuration allows
selecting Enhanced Datapath Interrupt, which is the VCF terminology for Enhanced Datapath
Standard.
Note: For existing clusters in VCF leveraging vLCM, since the TNP cannot be detached, this
change is only possible via API.
Results
This section covers the performance benefits of leveraging Enhanced Datapath Standard for
NSX Edge VMs.
Note: While the following test results showcase the Edge VM use case, Enhanced Datapath
Standard is an ESXi-layer performance optimization that will benefit any workload, not just Edge
VMs. Therefore, for optimal performance benefits, Enhanced Datapath Standard should be
enabled for both compute and edge clusters.
Note: Throughput is simply a function of packet size and packets per second.
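The relationship stated in the note above can be written out explicitly. The following is a minimal sketch; the function names are hypothetical, and the calculation counts only the packet bytes, ignoring the Ethernet preamble and inter-frame gap that also consume time on the wire.

```python
def throughput_gbps(packet_size_bytes: float, packets_per_second: float) -> float:
    """Throughput in Gbps for a given packet size and packet rate."""
    return packet_size_bytes * 8 * packets_per_second / 1e9


def required_pps(target_gbps: float, packet_size_bytes: float) -> float:
    """Packet rate needed to reach a target throughput at a given packet size."""
    return target_gbps * 1e9 / (packet_size_bytes * 8)


# One million 1500-byte packets per second correspond to 12 Gbps.
print(throughput_gbps(1500, 1_000_000))
# Reaching 10 Gbps with 1250-byte packets needs one million packets per second.
print(required_pps(10, 1250))
```

The same packet rate therefore yields very different throughput figures depending on packet size, which is why throughput results should always be read together with the MTU or packet size used in the test.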
CPU usage for Enhanced Datapath Standard (blue line) is almost half of what is used by the
Standard datapath (red line), even though Enhanced Datapath Standard throughput (green bar)
is roughly 50-70% higher than the throughput of the Standard datapath (yellow bar).
Another way to look at core usage is the throughput per core used for packet processing. In the
following graph, we compare throughput per core for Standard (yellow bar) vs Enhanced
Datapath Standard (green bar).
The above graph shows Enhanced Datapath Standard core usage is 2.5 – 3.5 times more
efficient than the core usage of the standard datapath. Enhanced Datapath Standard’s efficient
core usage results in higher core availability for workloads.
Standard Datapath
Standard is an interrupt-driven datapath that requires extensive tuning for any packet
processing focused workloads. The Standard datapath is also heavy on CPU usage. The following
sections focus on tuning the Standard datapath for best performance.
Note: For optimal out of the box performance, Enhanced Datapath Standard is the
recommended datapath.
pNIC Queues
As discussed in the section on RSS, queuing has a considerable impact on performance for all
types of workloads, and especially for Edge VM workloads. For a given NIC, the number of
RSS engines and the queues available for each engine may differ based on the driver and
firmware versions. For general-purpose workloads, the defaults should work fine. For workloads
like the Edge VM, the following settings will help improve performance at the pNIC level.
The receive side is often the bottleneck, much more than the transmit side. Hence, when tuning,
it's best to focus on a higher number of queues on the receive side and not so much on the
transmit side. The following table captures a few ways to tune and includes the vCPU count for
each tuning model.
Rx Queue 4 8 12
Tx Queue 2 4 4
# of vCPUs 6 12 16
Figure 8-22: Throughput with Single TEP vs Dual TEP with Different MTUs
Note: Intel® XL710s are PCIe Gen 3 x8 lane NICs. On x8 lane NICs, the max throughput is limited
to 64Gbps. To achieve the above near-line rate of 80Gbps, 2 x Intel® XL710s must be used.
RSS at pNIC
To achieve the best throughput performance, use an RSS-enabled NIC on the host running the Edge
VM, and ensure an appropriate driver which supports RSS is also being used. Use the VMWARE
COMPATIBILITY GUIDE FOR I/O to confirm driver support.
The following graph, FIGURE 8-23: RSS FOR VM EDGE, shows the comparison of throughput
between a NIC which supports RSS and a NIC which does not. Note: In both cases, even where
the pNIC doesn’t support RSS, RSS was enabled on the VM Edge:
With an RSS-enabled NIC, a single Edge VM may be tuned to drive over 20Gbps throughput. As
the above graph shows, RSS may not be required for 10Gbps NICs as they can achieve close to
15 Gbps throughput even without enabling RSS.
[Figure 8-24: ESXi host with two groups of host cores numbered 1 to 24 and two pNICs (pNIC 1, pNIC 2), running one X-Large Edge VM with 16 vCPUs and three vNICs (vNIC 1, vNIC 2, vNIC 3) with TEP, TOR-L, and TOR-R connections. Callout: In general, when adding Edge VMs on an Edge host, core availability is not a concern.]
In FIGURE 8-24, based on medium-level tuning of about 8 Rx queues and 2 Tx queues per pNIC, the
vCPU count is 20 for both pNICs together. While in the above image there is only one edge, in
reality the above configuration has been leveraged to run many Edge VMs with optimal
performance. However, given that modern servers have a lot of cores to spare, a better way to
utilize the available cores is to add two additional pNICs.
The following image FIGURE 8-25 shows how multiple pNICs can help utilize the available cores
better and help reduce the ESX server host footprint by increasing the Edge VM density per
host.
[Figure 8-25: ESXi host with two groups of host cores numbered 1 to 24, running two Edge VMs with 16 vCPUs each; each Edge VM has three vNICs (vNIC 1, vNIC 2, vNIC 3) with TEP, TOR-L, and TOR-R connections. Callout: Based on usage patterns, additional lighter load Edge VMs may also be deployed to these Edge hosts.]
Note: For workloads that are heavy on packet processing, it would be better to dedicate a pair
of pNICs to a single edge VM.
The following table provides various configurations that can be leveraged when deploying Edge
VMs, based on the server vCPU count.
Large Edge   X-Large Edge   Total Edge    # of     # Rx Queues   # Tx Queues   Total pNIC    Total vCPU
VM Count     VM Count       vCPU Count    pNICs    / pNIC        / pNIC        vCPU Count    Count
0            4              64            2        12            4             32            96
0            4              64            4        8             2             40            104
2            2              48            4        8             2             40            88
4            0              32            4        8             2             40            72
4            0              32            2        12            4             32            64
In summary, increasing the VM Edge count along with increasing the dedicated pNIC count
could help leverage all the cores on a dedicated Edge host.
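The arithmetic behind the table above can be sketched as follows. This is a minimal illustration, not a sizing tool: it assumes 8 vCPUs per Large Edge VM and 16 vCPUs per X-Large Edge VM (the values the "Total Edge vCPU Count" column implies) and one vCPU per Rx or Tx queue on each pNIC. The function and dictionary names are hypothetical.

```python
# Assumed vCPU count per Edge VM form factor (implied by the table above).
EDGE_VCPUS = {"large": 8, "xlarge": 16}


def edge_host_vcpu_budget(large, xlarge, pnics, rx_queues, tx_queues):
    """Return the vCPU budget of a dedicated Edge host.

    large, xlarge : number of Large / X-Large Edge VMs on the host
    pnics         : number of pNICs dedicated to the Edge VMs
    rx_queues     : Rx queues configured per pNIC
    tx_queues     : Tx queues configured per pNIC
    """
    edge_vcpus = large * EDGE_VCPUS["large"] + xlarge * EDGE_VCPUS["xlarge"]
    # Assumption: each Rx or Tx queue consumes one vCPU on the host.
    pnic_vcpus = pnics * (rx_queues + tx_queues)
    return {
        "edge_vcpus": edge_vcpus,
        "pnic_vcpus": pnic_vcpus,
        "total_vcpus": edge_vcpus + pnic_vcpus,
    }


if __name__ == "__main__":
    # Rows of the table: (large, xlarge, pnics, rx_queues, tx_queues)
    for row in [(0, 4, 2, 12, 4), (0, 4, 4, 8, 2), (2, 2, 4, 8, 2),
                (4, 0, 4, 8, 2), (4, 0, 2, 12, 4)]:
        print(row, edge_host_vcpu_budget(*row))
```

Comparing the resulting total with the number of cores available on the host gives a quick first check of whether a planned Edge VM and pNIC combination fits.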
8.7.1.1 NIC
The Network Interface Card (NIC) is one of the most critical components to consider, from a
performance perspective, for use with bare metal or virtual form factor Edges. While the high
level specs such as the port speed of the NICs may be similar, the actual hardware design must
be considered.
For example:
• Any single-port NIC with a 100G port needs either PCIe Gen 3 x16 or PCIe Gen 4 x8 to
fully leverage the 100G speed.
• Any dual-port NIC with 40G or faster ports needs either PCIe Gen 3 x16 or PCIe Gen 4 x8
to fully leverage up to 80G speed.
• Any quad-port NIC with 25G or faster ports needs either PCIe Gen 3 x16 or PCIe Gen 4 x8
to fully leverage up to 100G speed.
As noted above, with PCIe Gen 3 x8 NICs, the PCIe lanes could potentially become the
bottleneck.
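The reasoning behind these examples can be checked with a rough calculation. The sketch below assumes the nominal per-lane signalling rates of PCIe Gen 3 (8 GT/s) and Gen 4 (16 GT/s) with 128b/130b line encoding, and it ignores packet-level protocol overhead, so real usable bandwidth is somewhat lower. The function names are hypothetical.

```python
# Nominal per-lane transfer rate (GT/s) and line-encoding efficiency.
# Both generations use 128b/130b encoding.
PCIE_GENERATIONS = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}


def pcie_usable_gbps(gen, lanes):
    """Approximate one-direction slot bandwidth in Gbps (encoding overhead only)."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[gen]
    return rate_gt_s * efficiency * lanes


def slot_can_feed_ports(gen, lanes, port_count, port_gbps):
    """True if the slot bandwidth covers the aggregate port speed of the NIC."""
    return pcie_usable_gbps(gen, lanes) >= port_count * port_gbps


if __name__ == "__main__":
    for gen, lanes in [(3, 8), (3, 16), (4, 8)]:
        print(f"Gen {gen} x{lanes}: ~{pcie_usable_gbps(gen, lanes):.0f} Gbps")
    # A single 100G port on a Gen 3 x8 slot is limited by the PCIe lanes.
    print(slot_can_feed_ports(3, 8, 1, 100))
```

Under these assumptions, Gen 3 x16 and Gen 4 x8 both provide roughly 126 Gbps, while Gen 3 x8 provides roughly 63 Gbps, which is why it cannot sustain a 100G port.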
Bare Metal Edge is able to fully leverage any port speed, at line rate, as long as the underlying
hardware specs of the NIC and the PCIe lanes are sufficient to support the actual port
speed. Considering that, it is not necessary, from a performance perspective, to use multiple
ports / pNICs of lower port speed, such as 4 x 25Gbps. The recommendation is to instead leverage
pNICs with higher port speed, such as 2 x 100Gbps.
Adding it all up, in general, it’s better to err on the side of caution, and choose:
1. x16 PCIe lane NICs along with
2. PCIe Gen 4, if the system supports it and
3. 2 x 100G ports instead of 4 x 25G ports
In the case of Cisco VICs and other such architectures, the type of fabric extenders and any
other hardware that’s in the data path must be considered. If the fabric extenders, or any
other hardware component between the Edge and the ToR switch, are oversubscribed, it may
not be possible to leverage the full bandwidth capacity of the VIC.
With a single NUMA node, traffic is processed within that single NUMA node which avoids any
inter NUMA communication bottlenecks. In a multi NUMA architecture, such as with dual
socket Intel systems, traffic may end up being processed by both the cores as DPDK leverages
only NUMA 0 for all the packet processing. This forces traffic to cross the NUMA boundary, if
services are leveraging a different NUMA node or if the pNIC is attached to a different NUMA
node.
In these cases where traffic needs to cross the inter NUMA boundary, the inter NUMA
interconnect, such as Intel’s Ultra Path Interconnect (UPI) or the older Quick Path Interconnect
(QPI), could potentially become the bottleneck and may also add latency. Theoretical and
practical bandwidth capabilities of the interconnect should be considered. This is especially true
when leveraging multiple 40G / 100G NICs.
For workloads that are heavily focused on packet processing, such as Telco workloads, a single
socket architecture would help achieve optimal performance.
Note: Bare Metal edge only supports a defined max number of cores (the number depends on
the version, check the official documentation). Hence, total core count must be considered,
especially when going with dual socket platforms.
Benchmarking Tools
Compute
On the compute side, our recommendation for testing the software components is to use a
benchmarking tool close to the application layer. Application layer benchmarking tools will
help take advantage of many features and show the true performance characteristics of the
system. While application benchmarking tools are ideal, they may not be very easy to set up
and run. In such cases, iPerf is a great tool to quickly set up and check throughput. Netperf is
another tool to help check both throughput and latency.
Here is a GitHub resource for an example script to run iPerf on multiple VMs simultaneously and
summarize results: https://github.com/vmware-samples/nsx-performance-testing-scripts
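As an illustration of the summarizing step only, the sketch below aggregates the results of several iPerf runs. It is not the script from the repository above. It assumes each run was executed with iperf3 and its JSON output option, and that the TCP report exposes the receiver throughput under end.sum_received.bits_per_second; the function name is hypothetical.

```python
import json


def summarize_iperf3(json_reports):
    """Aggregate receiver-side throughput from a list of iperf3 JSON report strings.

    Returns the number of runs and the total, minimum, and maximum
    throughput in Gbps across all runs.
    """
    if not json_reports:
        raise ValueError("at least one iperf3 report is required")

    per_run_gbps = []
    for report in json_reports:
        data = json.loads(report)
        # Assumed layout of an iperf3 TCP JSON report.
        bits_per_second = data["end"]["sum_received"]["bits_per_second"]
        per_run_gbps.append(bits_per_second / 1e9)

    return {
        "runs": len(per_run_gbps),
        "total_gbps": sum(per_run_gbps),
        "min_gbps": min(per_run_gbps),
        "max_gbps": max(per_run_gbps),
    }


if __name__ == "__main__":
    # Two hand-written sample reports standing in for real iperf3 output.
    samples = [
        json.dumps({"end": {"sum_received": {"bits_per_second": 4e9}}}),
        json.dumps({"end": {"sum_received": {"bits_per_second": 6e9}}}),
    ]
    print(summarize_iperf3(samples))
```

Summing the receiver-side numbers of simultaneous runs gives the aggregate throughput through the tested path, while the minimum and maximum help spot uneven distribution across VMs.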
Edges
8.8.2.1 VM Edge
As VM Edges are designed for typical DC workloads, application layer tools are best for testing
VM Edge performance.
Figure 8-26: Example Topology to Use Geneve Overlay with Hardware IXIA or Spirent
Conclusion
NSX uses a number of features supported in hardware to optimize for performance.
On the compute side, these are:
1. Geneve Offload for CPU cycle reduction and marginal performance benefits
For the best out-of-the-box performance, the recommendation is to leverage Enhanced
Datapath Standard.
For the Bare Metal Edges, leveraging SSL offload hardware such as Intel® QAT 8960s
and deploying supported hardware listed in the VMware NSX install guide will result in
performance gains.
DPU Architecture
DPU stands for Data Processing Unit, a term used interchangeably for SmartNICs. A DPU is a
PCIe card that plugs into a server and goes beyond simple connectivity. It implements the
network traffic processing that the server previously performed. Every DPU has its own
onboard ARM processor as well as a programmable hardware accelerator that assists with
networking and security functions. These server CPU cores, previously consumed for
infrastructure support, are now freed up and can be repurposed to improve workload density
and enhance application performance.
DPUs have dual high-speed Ethernet ports with bandwidth ranging from 25 to 100 Gbps, plus
a dedicated 1 Gbps Ethernet port for management access. However, in the first phase
of DPU-based Acceleration for NSX, DPUs are not independently managed. The lifecycle of the
DPU is tied to the lifecycle of the hypervisor.
Standard Datapath
The Standard Datapath is the default datapath mode that is enabled for vSphere Networking as
well as NSX today. In this interrupt-driven Datapath model, when a packet arrives on the wire, it
triggers a CPU interrupt. The CPU suspends whatever it is currently processing, saves the
current CPU context, and switches to processing the new packet detected on the wire. At very
high packet rates, this interrupt-driven and context switch model can lead to very high and
unpredictable latency, which makes it less favorable for applications requiring high network
performance.
For applications in need of high throughput and low latency, the Datapath models of choice are
as follows.
complete bypass of the x86 hypervisor, which further ensures low latency and the ability to
process higher packets per second (PPS). However, the applications hosted on the VMs need to
be compatible with the NIC Vendor Poll Mode Driver (PMD) and, therefore, be aware of the
underlying hardware. Due to the software bypass and hardware dependency, none of the high
availability (HA) features like vMotion and DRS, or the networking and security functions, is
available with this Datapath model.
Unfortunately, the low latency and high PPS benefits are outweighed by the inability to
leverage the HA, Network, and Security features, and this Datapath model is minimally
adopted.
on whether to drop/forward the existing packet in addition to programming the Fast Path for
faster processing of subsequent flows. The applications running on the workloads need to be
compatible with the para-virtualized VMXNET3 Poll Mode Driver and therefore do not need to
be aware of the underlying hardware. All the HA features, like vMotion and DRS, and
Networking and Security features, are available in this Datapath Mode.
However, some CPU cores are reserved at the time of host preparation solely for the purpose
of packet processing. Irrespective of the traffic conditions (lower or higher packet rates), these
cores are completely reserved for packet processing and cannot be utilized by the workloads.
CPU cores reserved for EDP-Performance will always be at 100% utilization. The CPU core usage
will vary based on traffic conditions, but this core is unavailable for other functions even in low-
traffic conditions.
For NSX Offload to the DPUs, Enhanced Datapath is a requirement (either Performance or
Standard). The recommendation is to go with EDP-Standard. This is because with the
introduction of DPUs we rely less on the software slow/fast path, and offload decision making
to the HW acceleration when possible. EDP-Performance would reserve CPU resources from
the DPU/SmartNIC and could lead to underutilization.
When a workload is attached to a segment associated with an offloaded VDS [an offloaded VDS
is defined as a Distributed Virtual switch with uplinks that are mapped to the ports of a DPU],
every individual vNIC can be deployed in 2 modes.
Assumptions
This chapter's content is based on the following assumptions:
• The primary focus is on vLCM image-based vSphere clusters, as VUM-based (vLCM
baseline) vSphere clusters will be deprecated in VCF versions beyond 5.2.
• The distinctions between the recently introduced vSAN Express Storage Architecture
(ESA) and the vSAN Original Storage Architecture (OSA) are considered largely irrelevant
within the context of this specific NSX on VMware Cloud Foundation chapter.
Out-of-scope
The current iteration of this NSX reference guide omits several key areas pertaining to NSX
implementation within VMware Cloud Foundation, deeming them out of scope. Future
document iterations might include some of these areas:
• Details on stretched vSAN clusters for the VCF Management Domain and VI Workload
Domains
• Multi VMware Cloud Foundation instance-based topologies leveraging NSX Federation
• Non-VCF-aware Aria Suite and WorkspaceOne Access (WS1A) deployments
• NSX logical topologies for workloads (As they are orthogonal to the VCF deployment)
• VMware Cloud Foundation 5.2.x brownfield ingestion options are not addressed in this
document
Figure 10-1: Single Site Deployment using the VCF Consolidated Architecture
changes or maintenance windows affect the entire NSX domain simultaneously. For example,
software upgrades need to be coordinated across the VCF VI WLD part of a single NSX domain.
The VCF standard architecture is built with scalability in mind. It can accommodate anywhere
from typically 100 to 1000 ESXi hosts per VCF instance, making it suitable for organizations of
various sizes and with differing scale resource needs. An individual VCF VI WLD can scale up to
800 ESXi hosts. Please refer to CONFIGMAX for the most up to date scale numbers.
The VCF standard architecture supports deployments up to 14 VCF VI WLDs plus a single VCF
Management Domain when configured with a single shared SSO instance. When isolated SSO
domains per VCF VI WLD are used, then up to 24 VCF VI WLDs can be deployed. Starting from
VCF 5.2, the VCF administrators can choose to compose multiple isolated VCF VI WLDs
configured with a shared NSX instance. A single NSX instance can support up to 16 isolated VCF
VI WLDs.
The VCF standard architecture supports Multi-Availability Zone (AZ) with stretched vSAN
clusters either only in the VCF Management Domain or in both the VCF Management Domain
and the vSAN clusters that are part of the VCF VI WLD. Stretched vSAN clusters are not supported in
VI WLDs if the VCF Management domain is not stretched.
Multi VCF Instances architecture is supported by leveraging NSX Federation, for example for
Disaster Recovery or when a unified security policy model across the different NSX domains is
required. NSX Federation is implemented on a per NSX domain basis.
Figure 10-2: Single Site Deployment using the VCF Standard Architecture with a single NSX Domain for VCF VI WLDs
Figure 10-3: Single Site Deployment using the VCF Standard Architecture with a dedicated NSX Domain per VCF VI WLDs
This architecture isn't limited to just NSX domains running in VCF instances. It also supports
Non-VCF NSX instances, showcasing its flexibility across various deployment setups.
The supported scenarios for NSX Federation in VCF are:
• Two or more management WLD of different VCF instances (standard and consolidated
VCF architecture)
• Two or more VI WLD of different VCF instances (standard VCF architecture)
• Two or more VI WLD in the same VCF instance using dedicated LM in each (standard VCF
architecture)
• Management WLD and VI WLD in the same VCF instance (using dedicated LM in each).
• Non-VCF NSX Local Manager (LM) deployment and any WLD in a VCF deployment
• GM-Active cluster and GM-Standby cluster
• Single GM-Active cluster cross location (low latency cross locations).
The limitations of NSX Federation in the context of VCF are:
• SDDC Manager functions (such as password rotation, certificate replacement, LCM, etc.)
do not support the NSX Global Manager.
• VMware Cloud Foundation does not support NSX Federation between a VI WLD in one
VCF instance and a management WLD in another VCF instance.
• NSX Federation in VCF 4.2 or later releases only supports greenfield deployments. Please
raise a ticket with GSS and your account team to evaluate if NSX Federation is suitable
for your existing VCF deployment (production).
The two Availability Zones (AZ) should be deployed with an equal number of vSAN ESXi hosts
distributed evenly between the 2 sites. vSAN stretched clusters are limited to exactly 2
Availability Zones (AZ). A third physical site is required to have the vSAN witness host deployed
to achieve quorum. For more details about the vSAN witness host, please see the general vSAN
VCF requires that the vSAN cluster in the VCF Management Domain be stretched
before you start stretching another vSAN cluster in the VCF VI WLDs. The reason for this hard
requirement is simple: vCenter and the NSX Managers must be able to run in the second
Availability Zone upon a failure of the first Availability Zone in the VCF Management Domain;
otherwise the workloads in the VCF VI WLD cannot be managed and operated.
The additional vSAN-ready ESXi hosts installed in the second Availability Zone are provisioned
using an additional SDDC Manager network pool for vSAN and vMotion during the ESXi host
commissioning phase. Stretching a vSAN cluster requires the execution of an API call on the SDDC
Manager. An example of the JSON specification for this API call is available in the public VCF
administration guide. The NSX prepared ESXi hosts support TEP IP assignment using DHCP or
NSX IP pool (supported since VCF 5.1.0). When using NSX IP pool for the ESXi host TEP IP
assignment, the SDDC Manager automatically creates the required NSX Sub-Transport Node
Profiles (TNP) in NSX.
The simplified stretched vSAN cluster example presented in FIGURE 10-5 has a VCF Management
Domain and a single VCF VI WLD, where each includes a single stretched vSAN cluster.
Additional stretched vSAN clusters or non-stretched vSphere clusters can be added based on the
business needs. The built-in SDDC Manager deployment workflows ensure that by default the
NSX Managers, vCenters, SDDC Manager and NSX Edge Nodes are running in the first
Availability Zone as long as the first Availability Zone is up and running.
Figure 10-5: Multi Availability Zone (AZ) using stretched vSAN Clusters
Connectivity for management component VMs and vmkernel interfaces (Management, vSAN,
vMotion) is facilitated through vSphere dvPortgroups. Since the advent of VCF 5.1, the
connectivity design has evolved to accommodate distinct dvPortgroups for management
component VMs and ESXi host management vmkernel interfaces. This enhancement empowers
VCF administrators to segregate access to different systems with greater granularity and
control.
sizes, thereby accommodating higher scale requirements (e.g., system-wide Distributed Firewall
rules). For those leveraging SDDC Manager API-based workflows, there exists the additional
flexibility to specify small and medium NSX Manager appliance sizes, offering a more granular
approach to resource allocation and performance optimization.
The following figures illustrate these various VDS profiles and their resulting VDS designs. For
the sake of simplicity, the optionally configured dvPortgroups for NSX Edge Node uplinks and
the NSX Edge Node TEP dvPortgroups have been intentionally omitted from these
representations.
Figure 10-7: EMS deployment parameter workbook/spreadsheet – VDS Profile 1 (2 and 4 pNICs)
When utilizing the bring-up JSON specification file, additional diverse combinations of pNIC and
VDS configuration options are supported. Below is a selection of these additional options
(please note that this list is not exhaustive):
• More than 4 pNICs
• More than 2 VDS
• Change the VDS default teaming load balancing option (default is load balancing based on
physical NIC load)
• Change the VDS teaming policies for the vmnics (active, standby, unused); the default is all
active
• Change the NSX Transport Zone name
• Change the NSX host switch Datapath Mode (Standard Datapath, Enhanced Datapath –
Standard and Enhanced Datapath – Performance), default is Standard Datapath
EMS Parameter VDS Profile | Number of VDS / pNIC per VDS | NSX Overlay Transport Zone | NSX VLAN Transport Zone
Profile-3 | 2/2 | Required, recommended on Mgmt-VDS02, but not on both | Optional
Figure 10-10: Single rack layout with a vertical striped management vSAN cluster without AVN
Figure 10-11: Single rack layout with a vertical striped management vSAN cluster with AVN
10.2.5.14 Four rack layout with a horizontal striped management vSAN cluster
In scenarios with horizontal striped management vSAN cluster across 4 racks, the vSAN-ready
ESXi hosts can be symmetrically deployed as illustrated in FIGURE 10-12. This configuration
inherently ensures vSAN storage resilience, as vSAN interprets each node as an autonomous
fault domain, thus augmenting rack failure tolerance implicitly.
Figure 10-12: Four rack layout with a horizontal striped management vSAN cluster
10.2.5.15 Three rack layout with a horizontal striped management vSAN cluster
In scenarios where the horizontally striped management vSAN cluster is limited to distribution
across 3 racks, a minimum of six vSAN-ready ESXi hosts is a prerequisite. The resulting rack
layout is illustrated in FIGURE 10-13. This architecture necessitates the establishment of vSAN
failure domains on a per-rack basis, after the initial automated day0 bring-up deployment of
the VCF Management Domain via the Cloud Builder appliance.
Figure 10-13: Three rack layout with a horizontal striped management vSAN cluster
The depicted horizontal striped management vSAN cluster in the VCF Management Domain
exemplars omit AVN implementation, as its adoption necessitates case-specific considerations
regarding the BGP peering required by overlay AVNs. VLAN AVNs implementation is generally
straightforward as long as the corresponding VLANs are extended across the different racks on the
physical network.
Requires 4 pNICs
2 VDSs User choice Configurable in EMS deployment
parameter workbook/spreadsheet
• The Plan & Preparation workbook covers only NSX IP pools for host TEPs (Reference VCF-
NSX-L3MR-OVERLAY-REQD-CFG-002)
Alternatively, horizontal striped compute clusters can be implemented with shared VLANs
across all racks, utilizing Layer 2 adjacency in the physical network fabric. This approach
mitigates operational overhead and circumvents previously mentioned constraints, albeit at the
cost of increased dependence on physical fabric capabilities and elevated risk due to a larger
Layer 2 networking failure domain.
Both VLAN assignment strategies in horizontal compute vSphere cluster deployments facilitate
ad hoc cluster capacity expansion and enhance application availability across racks. However,
this architecture may be suboptimal for vSAN traffic, which must traverse the physical network
spine layer, potentially impacting performance.
Layer 2 DHCP host TEP                                                        Yes   Yes   Yes | Yes
dvPortgroup load balancing based on Route based on physical NIC load         Yes   Yes   Yes | Yes
dvPortgroup load balancing based on other options                            No    Yes   Yes | Yes   Needs to select custom VDS configuration option in UI or API
More than 2 pNIC per VDS with any traffic distribution and teaming policy    No    Yes   Yes | Yes   Needs to select custom VDS configuration option in UI or API
10.2.6.4.1 Driver to deploy dedicated vSphere clusters to host NSX Edge Nodes
The following points summarize the reasons why dedicated vSphere clusters to host NSX Edge
Node VMs are highly recommended:
10.2.6.4.2 vSphere Clusters to host NSX Edge Node VMs VLAN Requirements
vSphere clusters hosting NSX Edge Node VMs without workload VMs require 7 or 8 VLANs,
depending on the management interface connectivity of the NSX Edge Node VMs:
• ESXi Management
• vMotion
• Storage
• Host Overlay (Host TEP)
• NSX Edge Node TEP (Reference VCF-NSX-EDGE-REQD-CFG-003)
• NSX Edge Node management (optional)
• Two uplink VLANs for BGP peering (Reference VCF-NSX-BGP-REQD-CFG-003)
These clusters should be deployed exclusively as vertical striped vSphere clusters, confining NSX
Edge Node VMs to a single rack to limit BGP peering. Ideally, rack-specific VLAN assignments
are employed for the requisite 7 or 8 VLANs, restricting the Layer 2 failure domain to individual
racks. This approach, while enhancing resilience, introduces higher operational overhead due to
the management of multiple SDDC Manager network pools for vMotion and vSAN/storage,
additional VLANs and IP subnets, and expanded DHCP scopes or NSX IP pools for host TEPs.
Vertical striped vSphere clusters hosting NSX Edge Node VMs establish BGP peering exclusively
with local physical network switches (ToR switches), simplifying configuration, facilitating
streamlined bandwidth management, and reducing troubleshooting complexity. This
deployment strategy also confines the provisioning of additional NSX Edge Node VLANs (two
uplinks and optionally one for NSX Edge Node management interfaces) to rack local physical
network switches (ToR switches), further enhancing operational efficiency. These VLANs should
not be extended via physical fabric overlays as there is no purpose in doing so and it may limit
the ability to establish BGP peering on directly connected addresses in some fabrics.
the anticipated aggregate bandwidth of all the NSX Edge Nodes. This spine-to-ToR link capacity
should be scaled proportionally to accommodate the cumulative traffic load.
Figure 10-15: Physical layout for two vSphere clusters (in purple) to host NSX Edge Node
10.2.6.4.5 vSphere Clusters to host NSX Edge Node Physical Network Requirements
The physical network switches (ToR switches) interconnecting ESXi hosts within the vSphere
cluster for NSX Edge Node VMs must operate at Layer 3 and support Border Gateway Protocol
(BGP). BGP peering from the NSX Edge, augmented with Bidirectional Forwarding Detection
(BFD), should be configured exclusively on physical network switches (ToR switches) within the
same rack.
10.2.6.4.7 Enhanced Datapath for vSphere Clusters to host NSX Edge Nodes
Enhanced Datapath - Standard is a performance-oriented datapath option that combines the
flexibility of on-demand CPU usage from the existing standard datapath with DPDK-like features
for enhanced performance. This mode dynamically scales core usage for packet processing
based on demand. Enhanced Datapath - Standard has demonstrated superior performance
characteristics, particularly for smaller flows primarily focused on packet processing, such as
NSX Edge Nodes, without requiring further tuning. This mode is especially well-suited for
workloads like NSX Edge Nodes. For more comprehensive information about Enhanced
Datapath modes, please refer to chapter 8 section: ENHANCED DATAPATH STANDARD
It is recommended to select Enhanced Datapath – Standard during the SDDC Manager cluster
deployment workflow. In the SDDC Manager cluster user interface, Enhanced Datapath –
Standard mode is labeled as "Enhanced Datapath Interrupt". Post day0 deployment changes to
enable "Enhanced Datapath Interrupt" require the use of the NSX Policy API or the UI (the API is
required in NSX 4.2.0).
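A post day0 change through the API amounts to reading the host transport node profile, changing the datapath mode of its host switch, and writing the profile back. The sketch below covers only the middle step, on an in-memory dictionary. It rests on assumptions to verify against the NSX API reference for the deployed version: that the profile stores its switches under host_switch_spec.host_switches, that the mode field is named host_switch_mode, and that ENS_INTERRUPT is the value corresponding to Enhanced Datapath - Standard. The function name and sample profile are hypothetical.

```python
import copy

# Assumed enumeration values for the host switch datapath mode:
# STANDARD, ENS (Enhanced Datapath - Performance) and
# ENS_INTERRUPT (Enhanced Datapath - Standard).
VALID_MODES = {"STANDARD", "ENS", "ENS_INTERRUPT"}


def set_host_switch_mode(profile, mode="ENS_INTERRUPT"):
    """Return a copy of a transport node profile body with the mode set on every host switch."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported host switch mode: {mode}")
    updated = copy.deepcopy(profile)  # leave the caller's object untouched
    for host_switch in updated["host_switch_spec"]["host_switches"]:
        host_switch["host_switch_mode"] = mode
    return updated


if __name__ == "__main__":
    # Hypothetical, heavily trimmed profile body as it might be read from the API.
    current = {
        "host_switch_spec": {
            "host_switches": [
                {"host_switch_name": "vds01", "host_switch_mode": "STANDARD"}
            ]
        }
    }
    print(set_host_switch_mode(current))
```

Working on a copy keeps the originally retrieved body available for comparison or rollback before the updated profile is sent back.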
By default, every Virtual Distributed Switch (VDS) is configured with the NSX standard datapath.
The Enhanced Datapath – Standard mode (Enhanced Datapath Interrupt) option is only
available when a custom VDS profile is selected. The predefined VDS profiles (default, SAN
separation, NSX separation) do not permit the selection of the Enhanced Datapath - Standard
mode (Reference VCF-VDS-DES-REQD-CFG-00).
This approach, while enhancing resilience, introduces higher operational overhead due to the
management of multiple SDDC Manager network pools for vMotion and vSAN/storage,
additional VLANs and IP subnets, and expanded DHCP scopes or NSX IP pools for host TEPs.
Vertical striped collapsed Edge/Compute vSphere clusters prohibit NSX Edge Node VM
migration across racks and establish BGP peering exclusively with local physical network
switches (ToR switches).
Figure 10-18: SDDC Manager VDS configuration – Storage traffic separation VDS profile
Figure 10-19: SDDC Manager VDS configuration –NSX traffic separation VDS profile
Figure 10-20: SDDC Manager VDS configuration –Storage & NSX traffic separation VDS profile
The four predefined VDS profiles share commonalities in their stringent configuration
limitations. The following table delineates the configurable and non-configurable parameters
within the SDDC Manager workflow, encompassing both VI WLD creation and cluster addition
processes.
Number of VDS | 1 | 2 | 2 | 3
Minimum required pNIC per ESXi host | 2 | 4 | 4 | 6
Change VDS Name | Supported | Supported on both VDS | Supported on both VDS | Supported on all VDS
Change VDS MTU | Not changeable. Default: 9000 Bytes | Not changeable for both VDS. Default: 9000 Bytes | Not changeable for both VDS. Default: 9000 Bytes | Not changeable for all VDS. Default: 9000 Bytes
Change vmk MTU | Not changeable. Default: 1500 Bytes (Mgmt), 8800 Bytes (vSAN and vMotion) | Not changeable. Default: 1500 Bytes (Mgmt), 8800 Bytes (vSAN and vMotion) | Not changeable. Default: 1500 Bytes (Mgmt), 8800 Bytes (vSAN and vMotion) | Not changeable. Default: 1500 Bytes (Mgmt), 8800 Bytes (vSAN and vMotion)
Adding more than 2 uplinks per VDS | Not supported | Not supported on both VDS | Not supported on both VDS | Not supported on all VDS
Flexible VDS uplink to vmnic mapping | Not supported | Not supported on both VDS | Not supported on both VDS | Not supported on all VDS
Change dvPortgroup load balancing | Not changeable. Default: Route based on physical NIC load | Not changeable on both VDS. Default: Route based on physical NIC load | Not changeable on both VDS. Default: Route based on physical NIC load | Not changeable on all VDS. Default: Route based on physical NIC load
Using Standby uplinks/unused uplinks | Not supported | Not supported on both VDS | Not supported on both VDS | Not supported on all VDS
NSX host switch Operational Mode | Not changeable. Default: Standard datapath mode | Not changeable. Default: Standard datapath mode | Not changeable. Default: Standard datapath mode | Not changeable. Default: Standard datapath mode
Change NSX overlay Transport Zone Name | Not supported | Not supported (VDS01) | Not supported (VDS02) | Not supported (VDS03)
Adding NSX VLAN Transport Zone | Not supported | Not supported (VDS01) | Not supported (VDS02) | Not supported (VDS03)
Flexible NSX uplink to VDS uplink mapping | Not supported | Not supported (VDS01) | Not supported (VDS02) | Not supported (VDS03)
Change NSX uplink profile name (host) | Not supported | Not supported (VDS01) | Not supported (VDS02) | Not supported (VDS03)
The subsequent figure illustrates a typical example of a custom VDS profile for ESXi hosts
equipped with dual pNICs, configured as part of a compute vSphere cluster dedicated solely to
running workload VMs. This bespoke VDS profile incorporates active/standby VDS uplinks,
ensuring a deterministic traffic path for vSAN traffic, and employs NSX IP pool based allocation
for host TEP IPs in lieu of DHCP.
Preconfigured Default Profile | 1/2 | Configured | Not configured
Preconfigured Storage Traffic Separation | 2/2 | Configured for VDS01 | Not configured
Preconfigured NSX Traffic Separation | 2/2 | Configured for VDS02 | Not configured
Preconfigured Storage & NSX Traffic Separation | 3/2 | Configured for VDS02 | Not configured
• Workload VMs which are connected with NSX overlay (preferred for any private cloud
deployment) or NSX VLAN segments could be protected by the vDefend Distributed
Firewall (DFW) since VCF 4.x.
• Workload VMs which are interfaced with dvPortgroups, necessitate the deployment of
VCF 5.2.x to facilitate the activation of vDefend Distributed Firewall (DFW). This option is
referred to as "activate NSX on dvPG" in NSX.
• As of the current VCF versions up to and including 5.2.0, the protection of vmkernel
interfaces through vDefend Distributed Firewall (DFW) remains unsupported.
• The vDefend Distributed Firewall (DFW) is an add-on component and requires additional
licenses.
• Refer to section DISTRIBUTED SECURITY MODEL , DVPGS VS NSX VLAN SEGMENTS in chapter 7
for the considerations around using VLAN segments vs dvpg for DFW implementation on
VLAN networks.
10.2.7.2 Option A - SDDC Manager Edge Cluster Creation Workflow using the UI
Option A stands as the preferred deployment option in case of simple NSX logical topologies
demanding stateful centralized services like NAT, VPN or Load Balancer at the NSX Tier-1
Gateway level. This approach is appropriate when the throughput requirements and the
operational models allow for shared NSX Edge Clusters across NSX Tier-0 and NSX Tier-1
Gateways. Separating Tier-0 Gateway and the default Tier-1 gateway on separate edge clusters
is not possible via this workflow.
It's particularly advantageous for VCF administrators looking to harness the full potential of
SDDC Manager's Edge Cluster creation workflow. This SDDC Manager UI workflow option
includes the configuration of the NSX uplink segments/vSphere uplink trunk dvPortgroups, BGP
peering and many predefined configuration parameters (object names, BGP timers etc.).
Option A retains flexibility for post-day-0 lifecycle tasks, allowing VCF administrators
to expand or shrink the NSX Edge Cluster or perform password updates and rotations
through the SDDC Manager.
Note that Option A always attaches the same NSX Edge Cluster to both the Tier-0 and
Tier-1 Gateways, facilitating the adoption of stateful centralized services (e.g., NAT, VPN,
Load Balancer) at the Tier-1 Gateway level. Additional Tier-1 Gateways, with or without
stateful services enabled, can be added as a day-2 operation in NSX.
10.2.7.3 Option B - SDDC Manager Edge Cluster Creation Workflow using API
Option B utilizes the SDDC Manager API for a programmatic deployment of NSX edge nodes.
Option B provides a wider set of supported topologies compared to Option A. Option B offers a
flexible Tier-1 Gateway deployment, enabling customized configuration. In scenarios where the
NSX topology doesn't necessitate stateful centralized services (e.g., NAT, VPN, or Load Balancer)
at the NSX Tier-1 Gateway level, a Tier-1 Gateway in DR-only mode can be selected. This
approach maximizes NSX's scalable distributed routing model.
To enable this configuration, include "tier1Unhosted": true in the JSON specification.
Consequently, the NSX Edge Cluster is dedicated solely to the NSX Tier-0 Gateway. The SDDC
Manager API workflow orchestrates the setup of NSX uplink segments, vSphere uplink trunk
dvPortgroups, BGP peering, and various predefined parameters, including object nomenclature
and BGP timers. Option B maintains flexibility for post-deployment lifecycle tasks, enabling VCF
administrators to modify the NSX Edge Cluster or perform credential management through the
SDDC Manager.
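As a minimal sketch of the DR-only selection described above, the snippet below sets "tier1Unhosted": true on an edge cluster specification before it is serialized to JSON. Only the `tier1Unhosted` key comes from this guide; the helper name and the placeholder key in `base_spec` are hypothetical, and a real specification needs the full set of fields documented for the SDDC Manager API.

```python
import copy
import json


def with_unhosted_tier1(base_spec: dict) -> dict:
    """Return a copy of the spec that requests a Tier-1 Gateway in DR-only mode."""
    spec = copy.deepcopy(base_spec)  # leave the caller's dictionary untouched
    spec["tier1Unhosted"] = True     # key named in this guide
    return spec


# Placeholder content: this key is illustrative only, not an actual API field.
base_spec = {"exampleClusterLabel": "edge-cluster-01"}

payload = json.dumps(with_unhosted_tier1(base_spec), indent=2)
print(payload)
```

Working on a copy keeps a shared template reusable for clusters that do need a hosted Tier-1 Gateway.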
This workflow can create an NSX edge cluster without a Tier-1 Gateway, and even an “empty”
NSX edge cluster with neither a Tier-0 nor a Tier-1 Gateway.
Manually adding a Tier-0 Gateway via NSX to an “empty” edge cluster deployed via the SDDC
Manager API is not recommended. Doing so affects some SDDC Manager day-2 operations,
such as expanding and shrinking the cluster. For this reason, the deployment of “empty”
edge clusters via the SDDC Manager API should be limited to scenarios where the edge
cluster will host Tier-1 Gateways only. One such use case is the VLAN AVN workflow for
the management domain, which doesn’t require the deployment of a Tier-0 Gateway.
10.2.7.4 Option C - NSX Manager driven Edge Node topology deployment (UI/API)
Option C is the most flexible option and stands out as the preferred deployment strategy when
the NSX topology diverges from the predefined SDDC Manager NSX Edge Node Topologies, and
the other two options are not viable. NSX UI and API provide the same level of flexibility. This
option caters to a variety of non-standard configurations, for example (list is not exhaustive):
• OSPF in place of BGP
10.2.7.5.1 Single VDS with 4 pNICs – all pNICs assigned with host TEP
This NSX Edge Node design employs a single VDS, incorporating an NSX Overlay Transport Zone
with all four pNICs configured as host TEPs. We dedicate two pNICs for each NSX Edge Node to
enhance performance (PPS).
Because we want each edge node to leverage dedicated host pNICs, edges running on the same
host must be connected to distinct trunk dvPortgroups.
The SDDC Manager Edge Cluster workflow mandates that edge uplinks are mapped to VDS uplinks
associated with host TEPs. If an uplink does not have any host TEP associated, the SDDC
Manager workflow validation fails. This constraint implies that NSX edge nodes must be
connected to a host VDS associated with the default overlay transport zone. (This is an SDDC
Manager constraint that can be bypassed when deploying the NSX edges directly in NSX.)
The SDDC Manager Edge Cluster workflow enforces DRS anti-affinity rules, ensuring that NSX
Edge Nodes from the same NSX Edge Cluster run on distinct ESXi hosts. This configuration
implies that NSX Edge Nodes coexisting on a single ESXi host must be part of distinct NSX Edge
clusters.
During deployment, administrators must select a different uplink pair for each NSX Edge
Cluster (uplink1 and uplink2 for NSX Edge Cluster 1, and uplink3 and uplink4 for NSX Edge
Cluster 2). Otherwise, the edges would connect to a shared trunk dvPortgroup and end up
sharing pNICs on the same host.
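The uplink selection rule above can be checked before deployment with a few lines of code. This is a minimal sketch, assuming the planned uplink assignment is available as a simple mapping; the function and the cluster names are hypothetical and not part of any VMware tooling.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shared_uplinks(cluster_uplinks: Dict[str, Set[str]]) -> List[Tuple[str, str, Set[str]]]:
    """List every pair of edge clusters whose selected VDS uplinks overlap."""
    conflicts = []
    for first, second in combinations(sorted(cluster_uplinks), 2):
        common = cluster_uplinks[first] & cluster_uplinks[second]
        if common:
            conflicts.append((first, second, common))
    return conflicts


# Assignment described in the text: one dedicated uplink pair per edge cluster.
plan = {
    "edge-cluster-1": {"uplink1", "uplink2"},
    "edge-cluster-2": {"uplink3", "uplink4"},
}
print(shared_uplinks(plan))
```

An empty result means that no two edge clusters would share a pNIC on a host; any returned tuple names the two clusters and the uplinks they would share.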
The connectivity for ESXi management, vSAN, vMotion, and optional NSX Edge Node
management dvPortgroups can utilize all or a subset of pNICs, depending on design
preferences. Given the typically low traffic load for these functions, their pNIC mapping
has a negligible impact on the bandwidth available to the NSX Edge Node trunk dvPortgroups.
Consequently, these connections are intentionally omitted from the figure. Customers can
deploy those connections based on their preference with little or no impact on the design
from a performance perspective. From a high availability perspective, it is preferable for
the edge management interface to share the same host uplinks as the two datapath interfaces,
to address specific failure scenarios and topologies covered in chapter 4, section NSX EDGE
HIGH AVAILABILITY FAILOVER TRIGGERS.
The design illustrated in FIGURE 10-22 fully complies with SDDC Manager Edge Cluster workflow
requirements.
Figure 10-22: Single VDS with 4 pNICs - all pNICs assigned with host TEP
10.2.7.5.2 Single VDS with 4 pNICs – only two pNICs assigned with host TEP
This design diverges from the previous example by configuring only the first two pNICs with
host TEP. While the SDDC Manager Edge Cluster workflow permits deployment of NSX Edge
Node 1, it fails validation for NSX Edge Node 2.
This design strategy minimizes host TEPs to two, aligning with VCF's minimum requirements.
The underlying logic is that in a vSphere cluster exclusively hosting NSX Edge Nodes, the
expected host TEP traffic is virtually nonexistent. Consequently, this approach liberates all
available bandwidth through pNIC P3 and P4 for NSX Edge Node 2.
The configuration depicted in FIGURE 10-23 cannot be achieved through the SDDC Manager Edge
Cluster workflow and must be performed via NSX Manager. In that case, the two edge nodes
can be part of the same NSX edge cluster.
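The difference between this design and the previous one comes down to the rule that every edge uplink must map to a VDS uplink carrying a host TEP. The sketch below is a simplified model of that rule, not the workflow's actual implementation; the function name is hypothetical, and the uplink names assume that uplink1 to uplink4 correspond to pNICs P1 to P4.

```python
from typing import Set


def passes_tep_validation(host_tep_uplinks: Set[str], edge_uplinks: Set[str]) -> bool:
    """True if every uplink used by the edge node also carries a host TEP."""
    return bool(edge_uplinks) and edge_uplinks <= host_tep_uplinks


# Design 10.2.7.5.1: all four uplinks carry a host TEP.
all_tep = {"uplink1", "uplink2", "uplink3", "uplink4"}
# Design 10.2.7.5.2: only the first two uplinks carry a host TEP.
two_tep = {"uplink1", "uplink2"}

edge_node_1 = {"uplink1", "uplink2"}
edge_node_2 = {"uplink3", "uplink4"}

for name, teps in (("all pNICs with TEP", all_tep), ("two pNICs with TEP", two_tep)):
    print(name, passes_tep_validation(teps, edge_node_1), passes_tep_validation(teps, edge_node_2))
```

With only two host TEP uplinks, the check fails for the second edge node, which is why that node has to be deployed through NSX Manager instead.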
Figure 10-23: Single VDS with 4 pNICs – only two pNICs assigned with host TEP
10.2.7.5.3 Two VDS each with 2 pNICs, but only the 2nd VDS has pNICs assigned with host TEPs
This design diverges in a similar way from the previous examples by configuring only the second
VDS with pNICs assigned with host TEP. While the SDDC Manager Edge Cluster workflow
permits the deployment of NSX Edge Node 2, it fails validation for NSX Edge Node 1.
This design approach minimizes host TEPs to two, adhering to VCF's minimum requirements
while implementing traffic type separation at the VDS level.
The configuration depicted in FIGURE 10-24 is unattainable through the SDDC Manager Edge
Cluster workflow. This design can be implemented using deployment option C (Edge
deployment via NSX Manager UI/API).
Figure 10-24: Two VDS each with 2 pNICs but only the 2nd VDS has pNICs assigned with host TEP
10.2.7.5.4 Two different vSphere clusters to host NSX Edges for high availability
SECTION 10.2.6.4.3 emphasizes the critical requirement of implementing at least two vertically
striped vSphere Clusters to host NSX Edges, mitigating single points of failure at both vSAN
cluster and rack levels. FIGURE 10-25 illustrates NSX Edge Nodes that are separated by a
Layer 3 boundary between two rack-centric vSAN clusters and allocated to the same NSX Edge
Cluster.
This NSX Edge Cluster, when assigned to an NSX Tier-0 Gateway, provides North-South traffic
forwarding with rack-level redundant forwarding paths.
This design augments the granularity of maintenance operations, aligning these operations to
both rack and vSAN cluster levels.
This design paradigm, which adheres to SDDC Manager Edge Cluster workflow requirements,
leverages dedicated VLAN sets (ESXi management, vMotion, vSAN, NSX Edge Node
management and TEP, host TEP, and BGP uplinks) at the rack level, thus optimizing Layer 2
network failure domain isolation.
Conversely, alternative designs that host NSX Edge Nodes on Layer 3 multi-rack vSphere
clusters, that is, clusters stretched across L3 boundaries, are incompatible with SDDC Manager
Edge Cluster workflows. Such Layer 3 multi-rack vSphere clusters can only host workload VMs.
Note: This design, when utilizing the SDDC Manager Edge Cluster workflow (UI or API), is
restricted to exactly one NSX Edge Node from the same NSX Edge Cluster per ESXi host. The
workflow enforces strict DRS anti-affinity rules during its validation phases. For deployments
requiring multiple NSX Edge Nodes from the same NSX Edge Cluster per ESXi host, optionally
combined with 4 pNICs, please select deployment option C.
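The anti-affinity restriction in the note can be verified against a planned placement before the workflow is run. The following is a minimal sketch under the assumption that the placement is known as (edge node, edge cluster, ESXi host) tuples; all names are hypothetical and the check only mirrors the rule as stated in the note.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Placement = Tuple[str, str, str]  # (edge node, edge cluster, ESXi host)


def anti_affinity_violations(placements: Iterable[Placement]) -> List[Tuple[str, str]]:
    """Return (edge cluster, host) pairs where a host runs more than one node of that cluster."""
    counts = Counter((cluster, host) for _node, cluster, host in placements)
    return sorted(pair for pair, count in counts.items() if count > 1)


planned = [
    ("edge-node-1", "edge-cluster-1", "esxi-01"),
    ("edge-node-2", "edge-cluster-1", "esxi-02"),
    ("edge-node-3", "edge-cluster-2", "esxi-01"),  # different cluster, same host: allowed
]
print(anti_affinity_violations(planned))
```

A non-empty result indicates a placement that the SDDC Manager workflow would reject and that would instead call for deployment option C.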
Figure 10-25: Two different vSphere clusters to host NSX Edges for high availability
10.2.8.1 Overview
To streamline deployment and ensure version compatibility across VCF and Aria Suite
components and WorkspaceOne Access (WS1A), SDDC Manager incorporates workflows for
configuring VMware Aria Suite Lifecycle (vRSLCM). This integrated approach facilitates vRSLCM
deployment in "VCF-aware mode", enabling password lifecycle management via SDDC Manager
for Aria Suite and WS1A components, thus enhancing operational efficiency and security
coherence.
The integration of vRSLCM and the Aria Suite and WS1A components within the VCF
Management Domain necessitates the following sequential steps:
1. SDDC Manager Edge Cluster deployment for AVN
2. AVN implementation (Two NSX VLAN or overlay segments deployed via SDDC Manager)
Additional VLANs are required to support the NSX Edge Node VM connectivity. For the AVN
overlay case, this includes VLANs for BGP peering between the NSX Edge Node VMs and the
physical network switches (ToR switches). For both AVN segment types, an additional NSX
Edge Node TEP VLAN is required, and the host TEP VLAN and the NSX Edge Node TEP VLAN
must be routed to each other (see VCF-NSX-EDGE-REQD-CFG-003).
Optionally, a VLAN may be implemented to connect the NSX Edge Management interface.
Further details regarding AVN and vRSLCM will be elaborated upon in a subsequent section.
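The VLAN requirements above depend only on the AVN segment type and on whether a dedicated Edge management VLAN is wanted. As a minimal sketch, the helper below returns the additional VLAN roles for each case; the function name and the role labels are hypothetical and merely restate the requirements listed in the text.

```python
from typing import List


def additional_edge_vlans(avn_segment_type: str, dedicated_edge_mgmt: bool = False) -> List[str]:
    """Return the additional VLAN roles needed for NSX Edge Node VM connectivity."""
    if avn_segment_type not in ("overlay", "vlan"):
        raise ValueError("avn_segment_type must be 'overlay' or 'vlan'")
    roles = ["edge-tep"]  # required for both AVN segment types
    if avn_segment_type == "overlay":
        roles.append("bgp-peering")  # edge-to-ToR BGP peering VLANs (overlay case only)
    if dedicated_edge_mgmt:
        roles.append("edge-management")  # optional
    return roles


print(additional_edge_vlans("overlay"))
print(additional_edge_vlans("vlan", dedicated_edge_mgmt=True))
```

The routing requirement between the host TEP VLAN and the Edge Node TEP VLAN is not modeled here and still has to be met on the physical network.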
Regardless of the selected AVN segment type, two segments are utilized to connect the
following components:
AVN region A segments:
• Aria Operations Collectors
• Aria Operations for Logs
• WorkspaceOne Access (WS1A) for NSX
The two AVN deployment topologies are illustrated in FIGURE 10-28. Both diagrams incorporate
the standalone NSX Tier-1 Gateway, despite the fact that this standalone NSX Tier-1 Gateway is
only added during the subsequent phase, namely, the deployment of the vRSLCM.
For the sake of clarity, the Aria Suite and WorkspaceOne Access (WS1A) components are
intentionally omitted from the diagram.