It is not unusual for a virtualization administrator to request little more than connectivity to a certain number of network ports, and perhaps some VLANs, in their Top-of-Rack (ToR) network switches. But knowing what lies beyond your upstream ToR switches is essential to providing a virtualized environment that performs as designed. Network oversubscription is one of those critical design elements that can impact the performance and availability of your vSAN clusters. This post will explain what it is, and what is recommended for vSAN.
The concepts described here build off the information found in the post: “vSAN Networking – Network Topologies.”
The Basics of Network Oversubscription
In the context of networking using a spine-leaf architecture, “oversubscription” refers to the ratio of the bandwidth serviced by the leaf switches to the bandwidth provided to the spine. The bandwidth may be calculated in a variety of ways, including:
- The theoretical bandwidth capabilities of the entire switch.
- The theoretical bandwidth of the connected links.
- The anticipated bandwidth of the workloads and services using those links.
The most common and appropriate measure is the theoretical bandwidth of the connected links. This approach gives you the best chance of running at the native throughput of the links when connecting to hosts across different racks, and it is also the easiest to understand.
Oversubscription is expressed as a ratio, such as 1:1, 2:1, 3:1, and so on, where the lower the ratio, the better. The lowest ratio achievable is 1:1, which represents line rate and can also be referred to as having “no oversubscription.” A 2:1 oversubscription ratio means that the potential upstream bandwidth from the leaf to the spine is only half of the potential bandwidth serviced by the leaf switch. The higher the ratio, the greater the potential for contention, and the lower the resulting performance.
Let’s use a simple example, where a 25Gb ToR leaf switch has 24 of its 32 ports connected to hosts. That switch uses 3 x 100Gb uplinks to connect to the spine. The aggregate bandwidth serviced by the leaf is 600Gbps, and the bandwidth provided to the spine is 300Gbps. This results in an oversubscription ratio of 2:1.

Figure 1. Network oversubscription in a spine-leaf network.
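To make the arithmetic concrete, here is a minimal sketch in Python that computes the ratio from the theoretical bandwidth of the connected links, using the port counts and link speeds from the example above. The function name and structure are illustrative only, and not part of any vSAN or switch tooling.

```python
# Minimal sketch: compute a leaf switch's oversubscription ratio from the
# theoretical bandwidth of its connected links. Port counts and link speeds
# match the example above and are assumptions, not recommendations.

def oversubscription_ratio(host_ports: int, host_link_gbps: float,
                           spine_uplinks: int, spine_link_gbps: float) -> float:
    """Return the leaf-to-spine oversubscription ratio (e.g. 2.0 means 2:1)."""
    serviced_gbps = host_ports * host_link_gbps      # bandwidth serviced by the leaf
    upstream_gbps = spine_uplinks * spine_link_gbps  # bandwidth provided to the spine
    return serviced_gbps / upstream_gbps

# 24 x 25Gb host-facing ports, 3 x 100Gb uplinks to the spine.
ratio = oversubscription_ratio(host_ports=24, host_link_gbps=25,
                               spine_uplinks=3, spine_link_gbps=100)
print(f"{ratio:.0f}:1")  # prints "2:1"
```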
A spine-leaf network design will typically factor in the quantity and speed of the connections to the ToR leaf switches to determine the quantity and sizing of the spine switches. Some spine-leaf networks will be designed for no oversubscription under normal operating conditions, but may experience oversubscription during maintenance or failure of a spine switch.
Network oversubscription can also occur at the host uplinks, where too many data services consume the uplink during normal operation. While this consideration is extremely important in your design, it is not the focus of this post.
The Impact of Oversubscription Ratios Beyond 1:1
Storage traffic relies on consistent delivery across the network to ensure predictable performance. For vSAN, striving for a 1:1 oversubscription ratio in a spine-leaf network will provide the most consistent, highest-performing capabilities the network can offer.
What happens under heavy contention on the spine? Contention in a network introduces delays, and if the contention is severe enough, packets are dropped. When packets drop, TCP uses a combination of algorithms to retransmit the data. Under heavy congestion, TCP will use an exponential backoff algorithm to retry the transmission, dramatically increasing the wait time on subsequent failures to deliver a packet. Switches and routers throughout the path will also manage their queues using a variety of queue management algorithms, including tail drop and random early detection, to improve the fairness of packet drops.
What is the result? Severely degraded storage performance. With 2% packet loss, there is a 32% reduction in IOPS. With 10% packet loss, there is a 93% reduction in IOPS. With newer versions of vSAN paired with fast server and networking hardware, the impact of packet loss may be even more severe. One study (not vSAN related) demonstrated the impact of packet loss to be much worse: over a 70% reduction with just 1% packet loss.

Figure 2. The impact of packet loss in a network.
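The sensitivity of TCP throughput to packet loss can be illustrated with the Mathis model, which approximates steady-state throughput as (MSS / RTT) x (C / sqrt(loss rate)). The sketch below uses assumed MSS and RTT values and is not the source of the IOPS figures above, but it shows why even small loss rates erode throughput quickly.

```python
import math

def mathis_throughput_gbps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state TCP throughput (Gbps) under a given loss rate."""
    c = 1.22  # constant for Reno-style congestion control in the Mathis model
    bytes_per_sec = (mss_bytes / rtt_s) * (c / math.sqrt(loss_rate))
    return bytes_per_sec * 8 / 1e9

# Assumed values: ~9000 byte MSS (jumbo frames) and a 0.5ms round-trip time.
for loss in (0.0001, 0.01, 0.02, 0.10):
    gbps = mathis_throughput_gbps(mss_bytes=8948, rtt_s=0.0005, loss_rate=loss)
    print(f"loss {loss:>6.2%}: ~{gbps:5.2f} Gbps achievable per TCP flow")
```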
Worse, network bottlenecks lack visibility in the hypervisor, because the network lives outside the boundary of hypervisor management. The non-deterministic latencies caused by the network will show up in vSAN only as higher, unpredictable latency with lower IOPS and throughput. What may appear to be a vSAN issue is actually a network design issue.
Addressing Oversubscription
What should you do if you find out your network is oversubscribed? Since the ratio compares the bandwidth serviced by the leaf switches to the bandwidth provided to the spine switches, the options for improving your oversubscription ratio are straightforward.
- Add another spine switch. In a spine-leaf network, this is a simple and effective method of addressing oversubscription. Connect all of the existing leaf switches to the new spine switch, and your oversubscription ratio will be reduced accordingly, as shown in the sketch after this list. The main tradeoff of this approach is the expense of another spine switch.
- Reduce the traffic needing to traverse the spine. Several options can potentially reduce traffic across the spine: 1) verify that the hosts in the rack are cabled to each ToR switch correctly; 2) check the VDS configurations on your hosts to ensure they don’t use teaming policies that attempt to load balance traffic across links; 3) ensure the active VMkernel interfaces tagged for a given service such as vSAN connect to the same ToR leaf switch; 4) when possible, keep all hosts that comprise a vSAN cluster within the same rack.
- Reduce the number of hosts connecting to ToR switches. This would reduce the amount of bandwidth serviced by the leaf switches, which would improve the leaf-to-spine ratio. While this option can certainly be effective, it may not be viable for many environments.
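As a simple illustration of the first option, the sketch below (using assumed port counts and link speeds) shows how adding spine switches, and therefore additional leaf uplinks, improves the ratio.

```python
# Assumed values: a leaf with 24 x 25Gb host-facing ports, and one 100Gb
# uplink from the leaf to each spine switch. Adding a spine switch adds
# another uplink, improving the leaf-to-spine oversubscription ratio.
serviced_gbps = 24 * 25
uplink_gbps_per_spine = 100

for spines in (2, 3, 6):
    upstream_gbps = spines * uplink_gbps_per_spine
    print(f"{spines} spine switches: {serviced_gbps / upstream_gbps:.1f}:1")
# 2 spines -> 3.0:1, 3 spines -> 2.0:1, 6 spines -> 1.0:1
```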
Oversubscription Examples
The following are three examples of vSAN clusters across multiple racks. They demonstrate how to achieve a 1:1 oversubscription ratio under various conditions. While a 1:1 oversubscription ratio should always be a design goal with vSAN, environments may occasionally see oversubscription due to the maintenance or failure of a spine switch.
vSAN HCI Clusters using 25Gb Leaf Switches and a 1:1 Oversubscription Ratio
Let’s first look at multiple vSAN HCI clusters where each cluster spans several racks, and how to calculate the oversubscription ratio. In this case, there are 4 racks, each with 2 x 25GbE ToR leaf switches, and 2 x 100Gb spine switches. Let’s assume there are a total of four 16-host vSAN clusters. Each cluster is distributed evenly across the 4 racks, meaning that each cluster transmits its vSAN traffic across the spine. With this arrangement of 4 clusters, each rack will have a total of 16 hosts. Each host has 25GbE networking, connected to a 25GbE switch. Each rack could theoretically consume 400Gbps of vSAN cluster traffic. That means to maintain a 1:1 oversubscription ratio, we must have 4 x 100Gb uplinks from every leaf switch to the spine switches. This would provide a line-rate, non-blocking network for all of the vSAN hosts, regardless of where they are located.

Figure 3. vSAN HCI clusters using 25Gb leaf switches and a 1:1 oversubscription ratio.
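The per-rack arithmetic behind Figure 3 can be expressed as a short sketch. The values are the assumptions stated in the text, and the ceiling division simply answers how many 100Gb uplinks each leaf needs for a 1:1 ratio.

```python
hosts_per_rack = 16      # four 16-host clusters, striped evenly across four racks
host_link_gbps = 25      # each host connects with 25GbE
spine_uplink_gbps = 100  # spine-facing uplinks are 100Gb

rack_gbps = hosts_per_rack * host_link_gbps          # 400Gbps per rack
uplinks_needed = -(-rack_gbps // spine_uplink_gbps)  # ceiling division -> 4
print(f"Per-rack vSAN bandwidth: {rack_gbps}Gbps; "
      f"100Gb uplinks per leaf for 1:1: {uplinks_needed}")
```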
Note that in the illustration above, if each vSAN cluster resided within a single rack (e.g., a 16-host cluster in each rack), and there were no datastore sharing between vSAN clusters, there would be no vSAN traffic across the spine. Unless you are using the vSAN Fault Domains feature to provide rack-level resilience, minimizing the number of vSAN clusters that span multiple racks will reduce the upstream bandwidth requirements to the spine.
vSAN Storage Cluster using 25Gb Leaf Switches and a 1:1 Oversubscription Ratio
Now let’s look at a design of a vSAN storage cluster using 25Gb leaf switches and a 1:1 oversubscription ratio, for an environment with vSphere hosts using 10Gb networking at 15% utilization. In this example, the 16-host vSAN storage cluster resides in a single rack, connected to 2 x 25GbE switches. The back-end vSAN cluster network traffic could use up to 400Gbps of total bandwidth. If we assume that vSphere clients mounting this datastore would generate roughly 33% of the potential traffic on the vSAN storage cluster (as illustrated in Figure 2 of the post: “vSAN Networking – Network Topologies”), we would need at least 132Gb of bandwidth connected to the spine. However, in this example, we would use 2 x 100Gb links for additional headroom. The racks with the vSphere clusters have 10Gb ToR switches, with a total of 20 vSphere hosts, equaling 30Gb of anticipated egress bandwidth (20 hosts x 10Gb at 15% utilization). These switches would also use 2 x 100Gb uplinks to the spine. As a result, the vSphere clusters could communicate with the vSAN storage cluster at line rate, maintaining a 1:1 oversubscription ratio.

Figure 4. vSAN storage clusters using 25Gb leaf switches and a 1:1 oversubscription ratio.
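For reference, here is the sizing arithmetic behind Figure 4 as a small sketch. The 33% client traffic fraction and 15% utilization figure come from the text above; the rest is straightforward arithmetic, not a formal sizing tool.

```python
storage_hosts, storage_link_gbps = 16, 25
client_hosts, client_link_gbps, client_util = 20, 10, 0.15
client_fraction = 0.33  # client traffic as a share of back-end vSAN traffic

backend_gbps = storage_hosts * storage_link_gbps     # 400Gbps back-end potential
spine_needed_gbps = backend_gbps * client_fraction   # ~132Gbps needed to the spine
client_egress_gbps = client_hosts * client_link_gbps * client_util  # ~30Gbps

print(f"Back-end potential: {backend_gbps}Gbps, spine bandwidth needed: "
      f"{spine_needed_gbps:.0f}Gbps, client egress: {client_egress_gbps:.0f}Gbps")
```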
Having additional upstream bandwidth to the spine can be helpful in conditions where the client clusters mounting the datastore of the vSAN storage cluster are issuing large quantities of read requests. These read requests are not subject to the amplification characteristics of writes, as described in the post: “vSAN Networking – Network Topologies.”
vSAN Storage Cluster using 100Gb Leaf Switches and a 1:1 Oversubscription Ratio
Now let’s look at a design of a vSAN storage cluster using 100Gb leaf switches and a 1:1 oversubscription ratio, for an environment with vSphere hosts using 10Gb networking at 20% utilization. In this example, the 16-host vSAN storage cluster resides in a single rack, connected to 2 x 100GbE switches. The vSAN cluster network traffic could use up to 1,600Gbps of total bandwidth. Much like the 25Gb example, one would likely see about a 3:1 ratio between back-end storage cluster traffic and client traffic. However, due to a few limitations with 100Gb networking, we’ll want to use a ratio of 6:1. This means we’d want at least 272Gb of bandwidth between the ToR switch for the vSAN storage cluster and the spine. In this case, we will use 4 x 100Gb links to satisfy that need. The racks with the vSphere clusters have 10Gb ToR switches, with a total of 20 vSphere hosts, equaling 40Gb of anticipated egress bandwidth (20 hosts x 10Gb at 20% utilization). These switches would also use 2 x 100Gb uplinks to the spine. As a result, the vSphere clusters could communicate with the vSAN storage cluster at line rate, maintaining a 1:1 oversubscription ratio.

Figure 5. vSAN storage clusters using 100Gb leaf switches and a 1:1 oversubscription ratio.
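The arithmetic behind Figure 5 follows the same pattern, shown below with the 6:1 back-end-to-client traffic ratio and 20% utilization stated in the text. The computed spine figure (~267Gbps) is slightly lower than the ~272Gb cited above, which appears to reflect a slightly different rounding of the fraction; either way, 4 x 100Gb uplinks cover it.

```python
storage_hosts, storage_link_gbps = 16, 100
client_hosts, client_link_gbps, client_util = 20, 10, 0.20

backend_gbps = storage_hosts * storage_link_gbps  # 1,600Gbps back-end potential
spine_needed_gbps = backend_gbps / 6              # ~267Gbps, met by 4 x 100Gb uplinks
client_egress_gbps = client_hosts * client_link_gbps * client_util  # 40Gbps

print(f"Back-end potential: {backend_gbps}Gbps, spine bandwidth needed: "
      f"~{spine_needed_gbps:.0f}Gbps, client egress: {client_egress_gbps:.0f}Gbps")
```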
The examples above show the use of 10Gb ToR switches for the vSphere clusters mounting a vSAN storage cluster that resides in another rack using 25Gb or 100Gb networking. The racks housing the vSphere clusters can certainly use faster networking as well, which may help performance if 10Gb is no longer sufficient for your environment.
Further information on vSAN networking may be found in the vSAN Network Design Guide. For VCF environments, see “Network Design for vSAN for VMware Cloud Foundation.”
Summary
Your network provides the conduit for the delivery of storage traffic, yet oversubscription ratios are often overlooked when designing a network that must deliver that traffic consistently. Depending on the native speed of the networking and the demand of the workloads, the impact of oversubscription can vary widely. Whether you are looking at vSAN HCI clusters or vSAN storage clusters, designing your vSAN environment with a 1:1 network oversubscription ratio simply extends the native performance of the hosts, regardless of where they reside in the racks.