Ensuring Continuous Network Operations with Cisco Nexus Hitless Upgrades


Is there ever really a good time to perform a network device image upgrade? For many customers, downtime is not an option. They expect that upgrades occur while the network continues to forward packets, without any service impact.

Designing a highly redundant network involves several strategies to ensure continuous operation and minimize downtime. Key approaches include multiple network paths between critical points, load balancing, and dual-homed devices and switches. Cisco supports hitless upgrades for data centers built with Cisco Nexus switches in both the Cisco Application Centric Infrastructure (ACI) and NX-OS operating models. Let’s explore the hitless upgrade options available in Cisco NX-OS and the Cisco recommended best practices.

The networking industry has many variants of hitless upgrades. Some, such as Smart System Upgrade (SSU) or “Leaf SSU,” can incur packet loss during an upgrade, depending on the features enabled on the networking devices. Unless otherwise noted, hitless upgrades in this blog refer to Cisco’s implementation: the ability to upgrade Cisco Nexus 9000 Series Switches with zero packet loss (ZPL).

What capabilities does Cisco NX-OS provide to achieve hitless upgrades?

Software Maintenance Update (SMU)

An SMU is a package of software updates designed to address specific critical issues and security vulnerabilities in a software system. These updates are typically released to ensure the continued reliability, security, and performance of the software system. SMUs resolve specific issues without requiring a full system upgrade.

Graceful Insertion and Removal (GIR) (“Maintenance mode”)

This mode allows certain hardware and software processes to be disabled or isolated so that maintenance tasks, such as software upgrades, hardware replacement, and troubleshooting, can be performed without affecting the normal operation of the rest of the network. GIR uses redundant paths in the network to gracefully remove a device from an active network, place it out of service, and insert it back into service when the maintenance is complete.

Specific to GIR, some vendors support only a subset of protocols, such as Border Gateway Protocol (BGP) and Multi-Chassis Link Aggregation Group (MLAG), for maintenance modes of operation. NX-OS isolates devices from the network with support for all Layer-3 protocols, as well as multi-chassis link aggregation, including:

  • Border Gateway Protocol (BGP)
  • Enhanced Interior Gateway Routing Protocol (EIGRP)
  • Intermediate System-to-Intermediate System (IS-IS)
  • Open Shortest Path First (OSPF)
  • Protocol Independent Multicast (PIM)
  • Routing Information Protocol (RIP)
  • Multi-Chassis Link Aggregation (MLAG)
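
The redundant-path requirement behind GIR can be illustrated with a small reachability check. This is only a sketch under stated assumptions: the topology, the device names (leaf1, spine1, E1, E5), and the helper functions are hypothetical and are not part of NX-OS. It asks one question: if a device is taken out of service, can the remaining endpoints still reach each other?

```python
from collections import deque


def build_adjacency(links):
    """Turn a list of (a, b) links into an undirected adjacency map."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def reachable(adjacency, start, removed):
    """Breadth-first search from start that never enters a removed node."""
    if start in removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen and neighbor not in removed:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def can_isolate(adjacency, device, endpoints):
    """True if all endpoints stay mutually reachable with device removed."""
    remaining = [e for e in endpoints if e != device]
    if not remaining:
        return True
    seen = reachable(adjacency, remaining[0], removed={device})
    return all(e in seen for e in remaining)


# Hypothetical fabric: E1 is dual-homed, E5 is single-homed.
LINKS = [
    ("E1", "leaf1"), ("E1", "leaf2"),
    ("E5", "leaf3"),
    ("leaf1", "spine1"), ("leaf1", "spine2"),
    ("leaf2", "spine1"), ("leaf2", "spine2"),
    ("leaf3", "spine1"), ("leaf3", "spine2"),
]
TOPOLOGY = build_adjacency(LINKS)
ENDPOINTS = ["E1", "E5"]
```

In this hypothetical fabric, `can_isolate(TOPOLOGY, "leaf1", ENDPOINTS)` and `can_isolate(TOPOLOGY, "spine1", ENDPOINTS)` are both true, because redundant paths remain. `can_isolate(TOPOLOGY, "leaf3", ENDPOINTS)` is false, because E5 has no other path into the fabric.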

In-Service Software Upgrade (ISSU)

ISSU allows the software on Cisco Nexus switches to be upgraded without disrupting the network services they provide. ISSU provides upgrades with zero packet loss (i.e., no data plane downtime), but it does involve 50 to 90 seconds of control plane downtime. During this period, peering with neighbors over Layer-3 protocols is paused and then reestablished immediately after the upgrade. Since the data plane runs continuously without interruption, data center applications are not impacted. ISSU capability is particularly important in environments where maintaining continuous network availability is critical, such as data centers and enterprise networks.

Enhanced In-Service Software Upgrade (EISSU)

EISSU is an advanced version of ISSU that uses containers built into NX-OS. It builds upon the standard ISSU capabilities to provide even more robust and seamless software upgrades, particularly in complex and high-availability environments. EISSU creates a second virtual supervisor engine as a container with the new software image and swaps it with the original image. This innovation not only keeps the data plane downtime at zero, resulting in zero packet loss, but also reduces the control plane downtime to only three seconds.

From a Layer-3 perspective, all protocols support graceful restart, also known as Nonstop Forwarding (NSF), when using ISSU or EISSU. For Layer-2 protocols, Spanning-Tree Protocol (STP) and Virtual Port Channel (VPC) are supported. VPC takes two separate physical switches and presents them as one logical device to the connected Layer-2 device, while STP prevents loops from being formed when switches or bridges are interconnected through multiple paths.
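
Graceful restart only helps if neighbors hold their forwarding state for longer than the control plane is away. The following minimal sketch compares a neighbor's restart timer with the control plane downtime figures quoted in this post. The function name and the timer values passed to it are assumptions for illustration; actual timers depend on the protocol and on how each neighbor is configured.

```python
# Control plane downtime quoted in this post, as (best case, worst case)
# in seconds. The data plane downtime is zero for both methods.
CONTROL_PLANE_DOWNTIME_S = {
    "ISSU": (50, 90),
    "EISSU": (3, 3),
}


def timer_covers_upgrade(method, restart_timer_s):
    """True if the neighbor's graceful restart timer (a hypothetical
    input, in seconds) outlasts the worst-case control plane downtime."""
    _, worst_case = CONTROL_PLANE_DOWNTIME_S[method]
    return restart_timer_s > worst_case
```

For example, a hypothetical 120-second timer covers both methods, while a 60-second timer covers EISSU but not the worst case for ISSU.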

But what if the kernel needs patching? Then a reload is surely needed, right? Not necessarily. Starting with NX-OS 10.2(2), if the kernel needs patching, EISSU automatically reverts to ISSU and still performs the upgrade with ZPL. The only difference is that the control plane is down longer with ISSU than with EISSU.
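
The fallback behavior can be summarized in a few lines. This is a sketch of the decision as described in this post, assuming NX-OS 10.2(2) or later. The function and field names are hypothetical, not an NX-OS API.

```python
def plan_upgrade(eissu_enabled, kernel_patch_needed):
    """Return the upgrade method and its expected impact.

    Assumes NX-OS 10.2(2) or later, where EISSU reverts to ISSU
    when the kernel itself needs patching.
    """
    if eissu_enabled and not kernel_patch_needed:
        method, control_plane_downtime_s = "EISSU", (3, 3)
    else:
        method, control_plane_downtime_s = "ISSU", (50, 90)
    return {
        "method": method,
        "packet_loss": 0,  # both methods keep the data plane forwarding
        "control_plane_downtime_s": control_plane_downtime_s,
    }
```

Whichever branch is taken, packet loss stays at zero. Only the control plane downtime changes.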

All Cisco Nexus 9300 Series GX2A and GX2B models ship with EISSU enabled by default. EISSU is also enabled by default on Nexus 9300 Series GX and FX3 models starting with NX-OS 10.3.3. For earlier Nexus 9300 Series models, such as the FX and FX2, an additional step is needed: an extra command followed by a reload.
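
The platform rules above fit in a small lookup table. The sketch below encodes only what this post states. The function names, the return strings, and the release parsing are assumptions for illustration, and the result for GX and FX3 on older releases is left as "unknown" because the post does not describe it.

```python
import re

ALWAYS_DEFAULT = {"GX2A", "GX2B"}
DEFAULT_FROM_RELEASE = {"GX": (10, 3, 3), "FX3": (10, 3, 3)}
MANUAL_ENABLE = {"FX", "FX2"}


def parse_release(release):
    """Parse strings such as '10.3.3' or '10.3(3)' into (10, 3, 3)."""
    return tuple(int(n) for n in re.findall(r"\d+", release)[:3])


def eissu_status(family, release):
    """Classify EISSU availability for a Nexus 9300 model family."""
    family = family.upper()
    if family in ALWAYS_DEFAULT:
        return "default"
    if family in DEFAULT_FROM_RELEASE:
        if parse_release(release) >= DEFAULT_FROM_RELEASE[family]:
            return "default"
        return "unknown"  # earlier releases are not covered by this post
    if family in MANUAL_ENABLE:
        return "manual"  # extra command followed by a reload
    return "unknown"
```

For example, `eissu_status("FX3", "10.3.3")` returns "default", while `eissu_status("FX2", "10.3.3")` returns "manual".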

When to use these technologies?

Ideally, for network architecture resiliency, everything in a data center should be redundant down to the network connections. In reality, this is not always the case. Here are a few representative scenarios where ISSU, EISSU, and GIR can enable upgrades, patches, and more, without losing packets.

Figure 1: Hitless upgrade model recommendations

The deployment topology for a typical data center network with multiple tiers/layers is shown in Figure 1. Endpoints are connected to leaf switches (sometimes referred to as Top-of-Rack switches). Leaf switches are connected to spine switches, and spines are interconnected using super spine switches. It is a common best practice to deploy fixed form factor switches at the leaf layer. The spine and super spine layers can be made up of either fixed or modular switches. Physical redundancy is built into all the networking layers. It is also a recommended best practice to multi-home endpoints to a minimum of two leaf switches. In some cases, single-homed endpoints are also deployed, depending on business constraints. Now let’s look at multiple scenarios and the recommended upgrade options.

  • Upgrade of a leaf switch when dual- or multi-homed endpoints (e.g., E1 and E2) are connected to the leaf switch: Because there is physical redundancy between the endpoints and the leaf switches, GIR is the recommended approach, although ISSU or EISSU can also be used in this case.
  • Upgrade of a leaf switch when single-homed endpoints (e.g., E5 and E6) are connected to the leaf switch: There is no physical redundancy between the endpoints and the leaf switch, so GIR is not an option. The recommended approach in this scenario is to use ISSU or EISSU to achieve zero packet loss while performing the leaf switch upgrade.
  • Upgrade of spine layer switches: There is physical redundancy between leafs and spines, and between spines and super spines. To upgrade spine layer switches, GIR works best.
  • Upgrade of super spine layer switches: Similar to spine layer switches, super spine layer switches also have physical redundancy with spine layer switches. Hence, GIR is the best option in this scenario as well.
  • Troubleshooting: Imagine a switch is not behaving as expected and you need to troubleshoot. The cause could be hardware related, software related, or configuration related. Again, you would rely on GIR.

SMU is an option in all the above scenarios if the code update is being delivered as a point fix.
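
The recommendations above reduce to a short decision function. This is a sketch of the guidance in this post, not a Cisco tool. The function name and the layer labels are hypothetical.

```python
def recommend_upgrade(layer, single_homed_endpoints=False):
    """Return the upgrade method recommended in this post.

    layer: "leaf", "spine", or "super spine".
    single_homed_endpoints: True if any endpoint attached to the leaf
    has no second uplink, which rules out GIR for that leaf.
    """
    layer = layer.lower()
    if layer == "leaf":
        if single_homed_endpoints:
            return "ISSU or EISSU"
        return "GIR"
    if layer in ("spine", "super spine"):
        return "GIR"
    raise ValueError(f"unknown layer: {layer!r}")
```

For example, `recommend_upgrade("leaf", single_homed_endpoints=True)` returns "ISSU or EISSU", while every other scenario in the list returns "GIR".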

How can you perform these upgrades at scale?

Patching or upgrading one switch at a time is neither realistic nor feasible for all but the smallest of networks. Thankfully, Cisco Nexus Dashboard is an operations and automation platform that simplifies the deployment, management, and service assurance of Cisco Nexus switches running Cisco NX-OS with a unified user experience. One of the fully integrated services within Nexus Dashboard is the Nexus Dashboard Fabric Controller (NDFC). It provides built-in best-practice templates and workflows and can patch and upgrade hundreds of switches at a time through an integrated scheduler.

With NDFC, you can automate fabric builds from zero-touch provisioning, build traditional VPC-based and Ethernet-VPN (EVPN) fabrics, manage networks, and more. NDFC supports image and patch management, has dedicated workflows for ISSU, EISSU, and GIR, and can take snapshots for validation.
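
When many switches are upgraded together, one common precaution is to avoid taking redundant peers out of service in the same maintenance window. The sketch below shows one way to group switches into waves. It is purely illustrative: it does not use or represent the NDFC API or scheduler, and the function name, the switch names, and the "redundancy group" labels are all assumptions.

```python
from collections import defaultdict


def plan_waves(switch_groups):
    """Split switches into upgrade waves.

    switch_groups maps a switch name to its redundancy group, for
    example a VPC pair or a spine layer. No two switches from the
    same group are placed in the same wave.
    """
    members = defaultdict(list)
    for switch, group in sorted(switch_groups.items()):
        members[group].append(switch)
    depth = max((len(m) for m in members.values()), default=0)
    waves = []
    for i in range(depth):
        # Wave i takes the i-th member of every group that has one.
        waves.append([m[i] for m in members.values() if i < len(m)])
    return waves


# Hypothetical inventory: two VPC pairs and a pair of spines.
INVENTORY = {
    "leaf1": "vpc-a", "leaf2": "vpc-a",
    "leaf3": "vpc-b", "leaf4": "vpc-b",
    "spine1": "spines", "spine2": "spines",
}
```

With this inventory, `plan_waves(INVENTORY)` yields two waves: leaf1, leaf3, and spine1 first, then leaf2, leaf4, and spine2, so each redundancy group always keeps one member in service.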

Whether you are running AI workloads, Virtual Extensible LANs (VXLANs), EVPNs, VPCs, or a traditional Layer-2/Layer-3 network, Cisco Nexus 9300 Series switches and Cisco NX-OS allow you to perform scheduled and non-scheduled maintenance without impacting production traffic and critical systems.
