Automation: How to streamline a networkwide switch upgrade


Automation can make a big difference in repetitive networking tasks. We used it to streamline an enterprise switch upgrade, relying on scripts we wrote in Python and a set of open-source tools.

The project delivered several benefits, including far fewer of the human errors inherent in the manual process, faster deployment overall, and significant cost savings.

Upgrading a large, switched network is always a challenge. The typical solution is to carefully document the old switch configurations and the wiring to the patch panel, then manually configure the new switches and replace the wiring. The endpoints must be carefully tracked so they are assigned to the appropriate VLAN and have the correct interface configuration.

This process can be further complicated because new switches may have new features and inevitably have an interface layout different from the old switches’.

Here, we’ll describe the project that NetCraftsmen was hired to carry out, the manual process suggested by the switch vendor, and the automated process we wrote ourselves. It was a great example of using an upgrade as the opportunity to remove accumulated technical debt.

The project

Our client has more than 20,000 employees spread across 100 locations. Their network used a route/switch design with more than 100 aging 1-Gigabit switches, some of which had reached end-of-life, meaning the vendor would no longer provide software updates for them.

Upon examining the wiring closets, we found that the interconnects to the patch bays needed a cleanup. We’d need to replace the rack wiring with Cat5e cabling, but fortunately there was no need to change the wiring from the patch panels to the wall jacks. We’d also need to identify unused interfaces that could be made available for future use.

An inventory showed that 100 switches with an average of 200 ports per switch needed to be replaced. The work had already started manually when NetCraftsmen became involved, so we knew what would work, but our team saw this as a great opportunity to employ automation to streamline the process.

Doing it manually

The mostly manual process defined by the switch vendor called for tracking down each patch-panel connection, checking the endpoints, replacing the switch, repatching active links, updating the new switch configuration, and verifying connectivity. That would be followed by the final troubleshooting to fix things that didn’t work.

Collecting the information by hand was time-consuming and error-prone; a copy-and-paste that was off by a single character would require backtracking to identify and correct the mistake.

We classified switch ports as active if they had been used in the past 180 days, which allowed us to skip the many cabled ports that had been inactive for years. That resulted in many more unused ports on the new switches than on the old, but we replaced the old switches one-for-one because the flexibility of having more available ports was deemed more important than trying to maximize port utilization.
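The 180-day rule can be expressed in a few lines of Python. This is a minimal sketch, not the project's actual code: the `classify_ports` function, the `last_seen` dictionary, and the sample port names and dates are all hypothetical, and it assumes a last-activity timestamp has already been collected for each port.

```python
from datetime import datetime, timedelta

# Threshold taken from the article: a port counts as active if it was used
# within the past 180 days.
ACTIVE_WINDOW = timedelta(days=180)


def classify_ports(last_seen, now):
    """Split ports into (active, inactive) lists by last-activity timestamp.

    last_seen maps a port name to a datetime, or to None if the port has
    never been seen in use. (Hypothetical data structure.)
    """
    active, inactive = [], []
    for port, seen in sorted(last_seen.items()):
        if seen is not None and now - seen <= ACTIVE_WINDOW:
            active.append(port)
        else:
            inactive.append(port)
    return active, inactive


# Illustrative data only.
now = datetime(2022, 6, 1)
last_seen = {
    "Gi1/0/1": datetime(2022, 5, 30),  # used two days ago
    "Gi1/0/2": datetime(2019, 1, 15),  # cabled but idle for years
    "Gi1/0/3": None,                   # never seen in use
}
active, inactive = classify_ports(last_seen, now)
print("active:", active)
print("inactive:", inactive)
```

Only the ports in the active list would need to be mapped to the new switch; the rest become spare capacity.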

This phase of the process was complicated by the need to clean up the connections between the switches and the patch panels. We created detailed documentation to map the old switch-port connections to ports on the new switch.

The process also required examining the old switch configurations to identify endpoints that required special configurations like 10Mbps or half-duplex. These endpoints could then be assigned to compatible ports on the new switch. Once all this information had been collected, the new switch configuration could be created from the customer’s configuration templates.
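Finding those special cases is a text-scanning job. The sketch below assumes Cisco IOS-style configuration syntax (`interface`, `speed`, `duplex`, with `!` as a block separator), which the article does not show; `find_outlier_ports` and the sample configuration are hypothetical.

```python
def find_outlier_ports(config_text):
    """Return {interface: {"speed": ..., "duplex": ...}} for ports whose
    speed or duplex is hard-coded to something other than auto.

    Assumes IOS-style syntax; other vendors will need different parsing.
    """
    outliers = {}
    current = None
    for line in config_text.splitlines():
        stripped = line.strip()
        if line.startswith("interface "):
            # Start of a new interface block.
            current = line.split(None, 1)[1].strip()
        elif current and stripped.startswith(("speed ", "duplex ")):
            key, value = stripped.split(None, 1)
            if value != "auto":
                outliers.setdefault(current, {})[key] = value
        elif stripped == "!":
            # End of the interface block.
            current = None
    return outliers


# Illustrative configuration fragment only.
SAMPLE_CONFIG = """\
interface GigabitEthernet0/1
 switchport access vlan 20
!
interface GigabitEthernet0/2
 switchport access vlan 30
 speed 10
 duplex half
!
"""
outliers = find_outlier_ports(SAMPLE_CONFIG)
print(outliers)
```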

With the documentation and planning complete, the old switch could be removed and the new switch installed. Each endpoint connection had to be manually validated, both from the switch-port status and by pinging the device. (Note: It is always wise to identify endpoints that don’t respond to pings, such as IoT devices that use a proprietary protocol to talk with a custom controller. The best you can do is verify link connectivity, MAC address, and packet flow.) Then the inevitable cabling mistakes had to be resolved.

Our team started its work by following the vendor-defined process and projected that each wiring-closet upgrade would take two to three hours, but some took four to five. We quickly identified the need for automation.

Robert Hallinan, one of NetCraftsmen’s team leads, created a simpler, cost-effective process that automated much of the configuration as well as the validation of connectivity for all the endpoints that responded to ping. He created the solution with Python, Nornir (an open-source Python automation framework), Pandas (an open-source Python library for analyzing and manipulating tabular data), TextFSM (an open-source Python module for turning text into structured data), and related tools. Here is his step-by-step description of its use.

The automated process

Before the start of the maintenance window, we always grab a snapshot of the state of the old switch using a data-gathering script developed for this project. The automation uses that data to generate a station report that lists every port along with its current state, MAC address and the corresponding Organizationally Unique Identifier (OUI), IP address, the DNS name found by reverse lookup, tie-ins to phone data, Cisco Discovery Protocol (CDP) information, speed/duplex settings, and so on.
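To give a feel for what a station report involves, here is a minimal sketch that joins a MAC-address table with ARP data and derives the OUI for each entry. It is not the project's script: the project used TextFSM for parsing, whereas this example swaps in a plain regular expression to stay dependency-free. The table layout, the function names, and the sample data are assumptions.

```python
import re

# Matches rows such as "  20    0050.7966.6800    DYNAMIC     Gi0/1".
# The column layout is an assumed, IOS-like format.
MAC_LINE = re.compile(
    r"^\s*(?P<vlan>\d+)\s+"
    r"(?P<mac>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})\s+"
    r"\S+\s+(?P<port>\S+)\s*$",
    re.IGNORECASE,
)


def parse_mac_table(text):
    """Turn MAC-table text into a list of dicts, skipping header lines."""
    rows = []
    for line in text.splitlines():
        m = MAC_LINE.match(line)
        if m:
            rows.append(
                {"vlan": int(m["vlan"]), "mac": m["mac"].lower(), "port": m["port"]}
            )
    return rows


def oui(mac):
    """Return the OUI: the first three octets (six hex digits) of the MAC."""
    return mac.replace(".", "")[:6].upper()


def station_report(mac_text, arp):
    """Join MAC-table rows with an ARP mapping of MAC address to IP address."""
    return [
        {**row, "oui": oui(row["mac"]), "ip": arp.get(row["mac"])}
        for row in parse_mac_table(mac_text)
    ]


# Illustrative data only.
SAMPLE = """\
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
  20    0050.7966.6800    DYNAMIC     Gi0/1
  30    0050.7966.6801    DYNAMIC     Gi0/2
"""
ARP = {"0050.7966.6800": "10.0.20.11"}
report = station_report(SAMPLE, ARP)
for entry in report:
    print(entry)
```

A real report would add the remaining columns (DNS name, CDP neighbor, speed/duplex) in the same way, one lookup per field.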

The script also generates a text file that can be imported directly into PingInfoView, a free ping sweeper that we used to validate connectivity. I enhanced the data-gathering script to also output a flat-file database with a subset of that information, including whether a device is pingable.
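The two outputs might be produced along these lines. This is a sketch under stated assumptions: it treats the ping-tool import file as one address per line followed by a free-text description, and it treats the flat-file database as CSV. Neither format is confirmed by the article, and the function names, field names, and records are hypothetical.

```python
import csv
import io

FIELDS = ["port", "mac", "ip", "pingable"]


def ping_import_text(stations):
    """Build import text: one 'address description' line per known IP.

    Assumed format; check the ping tool's documentation before relying on it.
    """
    return "".join(f"{s['ip']} {s['port']}\n" for s in stations if s.get("ip"))


def flat_file_csv(stations):
    """Serialize a subset of the station data as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for s in stations:
        writer.writerow({k: s.get(k) for k in FIELDS})
    return buf.getvalue()


# Illustrative records only.
stations = [
    {"port": "Gi0/1", "mac": "0050.7966.6800", "ip": "10.0.20.11", "pingable": True},
    {"port": "Gi0/2", "mac": "0050.7966.6801", "ip": None, "pingable": False},
]
import_text = ping_import_text(stations)
database_text = flat_file_csv(stations)
print(import_text)
print(database_text)
```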

We created an example lab environment to allow development of the automation without affecting the production network (see Figure 1). Several virtual PCs (VPCs), the virtual endpoints used for our test environment, were placed in three VLANs (20, 30, and 40). Originally they were cabled as shown.

Figure 1: Example environment

Then I built a GUI in tkinter, the standard Python interface to the Tcl/Tk GUI toolkit. The GUI takes in the device file from the gather and station-report scripts, user credentials, the database from the station report, and a VLAN standardization file. Once started, the tool periodically logs into the switch and pulls the ARP and MAC-address tables to add to the device-tracking database.

This process detects the MAC address of the endpoint device on new ports as they get connected, and checks whether the VLAN configuration is what it should be. If it is, the process continues. If not, it logs into the switch, fixes the configuration, and bounces the port. It also checks the IP address learned and logs it.
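The check-and-fix decision can be isolated from the code that talks to the switch. The sketch below only builds the list of commands to send. The Cisco IOS-style command strings are an assumption (the article does not list them), and `remediation_commands` is a hypothetical helper, not part of Nornir or any other library.

```python
def remediation_commands(port, observed_vlan, expected_vlan):
    """Return the config commands needed to correct a port's access VLAN.

    An empty list means the port is already correct. The command syntax is
    an assumed IOS-like form; "shutdown" followed by "no shutdown" bounces
    the port so the endpoint renegotiates on the new VLAN.
    """
    if observed_vlan == expected_vlan:
        return []
    return [
        f"interface {port}",
        f"switchport access vlan {expected_vlan}",
        "shutdown",
        "no shutdown",
    ]


# Illustrative values only: the endpoint on Gi0/1 should be in VLAN 200.
fix = remediation_commands("Gi0/1", observed_vlan=20, expected_vlan=200)
ok = remediation_commands("Gi0/2", observed_vlan=300, expected_vlan=300)
print(fix)
print(ok)
```

Keeping this step as a pure function makes it easy to test in a lab before any command reaches a production switch.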

Additionally, the script periodically pings the IP addresses associated with the endpoints. If an endpoint’s address changes during the maintenance window, the old IP is replaced with the new one. This way a ping tool won’t show a missing host simply because it can’t ping an old address that has since been replaced.
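A minimal sketch of that bookkeeping follows, assuming the tracking database is keyed by MAC address. The dictionary layout and the `record_learned_ip` and `ping_targets` helpers are hypothetical.

```python
def record_learned_ip(db, mac, learned_ip):
    """Update the tracked IP for a MAC; return True if the address changed."""
    entry = db[mac]
    if learned_ip is None or learned_ip == entry["ip"]:
        return False
    if entry["ip"] is not None:
        # Keep the old address for the final report.
        entry["previous_ips"].append(entry["ip"])
    entry["ip"] = learned_ip
    return True


def ping_targets(db):
    """Return the current addresses to ping, never the superseded ones."""
    return sorted(e["ip"] for e in db.values() if e["ip"] is not None)


# Illustrative data only.
db = {
    "0050.7966.6800": {"ip": "10.0.20.11", "previous_ips": []},
    "0050.7966.6801": {"ip": "10.0.30.12", "previous_ips": []},
}
changed = record_learned_ip(db, "0050.7966.6800", "10.0.20.57")
targets = ping_targets(db)
print(changed, targets)
```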

All this is loaded from the database into a Pandas dataframe, which is refreshed periodically on the screen. The displayed frame can easily be exported to CSV, providing visibility into missing devices, devices that are no longer pingable, and so on. We also get a clearer picture when a device’s IP address changes because DHCP has issued a new one. At the end of the maintenance window, the process readily produces a report for the customer. Figure 2 shows a screenshot of a report before the new switch and devices are online.

Figure 2: Before a new switch is online
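For readers unfamiliar with Pandas, filtering the tracking data and exporting it takes only a few calls. This is a minimal sketch with made-up records and column names, not the tool's actual schema.

```python
import pandas as pd

# Illustrative records only; the real tool loads these from its database.
records = [
    {"port": "Gi0/1", "ip": "10.0.20.11", "pingable": True},
    {"port": "Gi0/2", "ip": "10.0.30.12", "pingable": False},
    {"port": "Gi0/3", "ip": "10.0.40.13", "pingable": True},
]
frame = pd.DataFrame(records)

# Keep only the devices that currently fail to answer pings.
missing = frame[~frame["pingable"]]

# With no path argument, to_csv returns the CSV text instead of writing a file.
missing_csv = missing.to_csv(index=False)
print(missing_csv)
```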

We also no longer need to pre-provision the outliers—those endpoints that have special port configurations. The only manual configuration (and I’m working on automating this part) is to question the network technician onsite at the end of the maintenance window to identify the outlier ports that were not connected at the beginning of the window. This step is required since we can’t detect the MAC address of a not-connected port during the initial data-gathering script. Instead, we use a tool to detect the switch port from the patch-panel side and ensure that the configuration is carried over.

After “re-cabling” the lab environment, ports have been assigned for the endpoints and the customer VLANs have been re-standardized, from 20, 30, and 40 to 200, 300, and 400, as shown in Figure 3 below.

Figure 3: Upgraded lab network
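The VLAN re-standardization amounts to a lookup table. The sketch below assumes the standardization file is a two-column CSV with `old_vlan` and `new_vlan` headers. That layout and the helper names are hypothetical; only the 20/30/40 to 200/300/400 mapping comes from the lab example.

```python
import csv
import io


def load_vlan_map(text):
    """Parse 'old_vlan,new_vlan' CSV text into a dict of int to int."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["old_vlan"]): int(row["new_vlan"]) for row in reader}


def standardize(vlan, vlan_map):
    """Return the standardized VLAN, leaving unmapped VLANs unchanged."""
    return vlan_map.get(vlan, vlan)


# Assumed file contents, mirroring the lab mapping described above.
STANDARDIZATION_FILE = "old_vlan,new_vlan\n20,200\n30,300\n40,400\n"
vlan_map = load_vlan_map(STANDARDIZATION_FILE)
new_vlans = [standardize(v, vlan_map) for v in (20, 30, 40, 99)]
print(new_vlans)
```

Leaving unmapped VLANs untouched is a design choice in this sketch; a stricter tool could raise an error instead so that nothing is carried over silently.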

A neat addition to the automation is that the GUI includes a progress counter that shows how many endpoints remain to be reconnected. The automation runs every few minutes, gathering new endpoint connection data, and updating the active and remaining count. See the counter in the upper right of the screenshot below:

Figure 4: Progress counter in upper right
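The counter itself is simple set arithmetic: compare the MAC addresses recorded before the window with those seen on the new switch so far. The sketch below uses a hypothetical `progress` function and made-up addresses.

```python
def progress(expected_macs, seen_macs):
    """Summarize how many expected endpoints have reappeared.

    expected_macs: MACs recorded on the old switch before the window.
    seen_macs: MACs learned on the new switch so far (may include new devices).
    """
    expected = set(expected_macs)
    active = expected & set(seen_macs)
    missing = sorted(expected - active)
    return {"active": len(active), "remaining": len(missing), "missing": missing}


# Illustrative data only.
expected = ["0050.7966.6800", "0050.7966.6801", "0050.7966.6802"]
seen = ["0050.7966.6800", "0050.7966.6802", "0050.7966.68ff"]
status = progress(expected, seen)
print(status)
```

The upgrade is complete, by this measure, when the remaining count reaches zero.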

Finally, the monitoring screen shows when the switch upgrade is complete:

Figure 5: GUI showing upgrade complete

Savings

The automation process saves more than an hour during the staging process and an average of 1.5 hours on each maintenance window, multiplied by the team size of five or six people. The total savings comes to about 8.5 person-hours per switch. More time was saved in wiring closets that had multiple switches. By the time we became involved and built the automation, only 40 switches remained to be upgraded. Our automated system was used to monitor the hardware upgrade of those final 40 switches for an estimated savings of 340 hours so far.

Overall, it took about 80 hours to create the automated process, mostly because of the time spent learning how to use tkinter to create the GUI. The other software tools had been used on other projects, and much of the required functionality was already well understood. The net savings to date is 260 hours (340 hours saved minus 80 hours of development), and the system is usable for future upgrades.
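The arithmetic behind these figures is easy to check. The numbers below are the ones quoted in the article; only the variable names are new.

```python
HOURS_SAVED_PER_SWITCH = 8.5  # person-hours per switch, as stated above
SWITCHES_AUTOMATED = 40       # switches remaining when automation began
DEVELOPMENT_HOURS = 80        # time spent building the tooling

gross_savings = HOURS_SAVED_PER_SWITCH * SWITCHES_AUTOMATED
net_savings = gross_savings - DEVELOPMENT_HOURS

print(f"gross: {gross_savings:.0f} hours, net: {net_savings:.0f} hours")
```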

There’s another savings that is immeasurable: reducing employee stress. Doing extensive network upgrades is wearing because a single mistake can turn a maintenance window into an all-night marathon. Automation can transform these tedious, stressful tasks into satisfying infrastructure-improvement jobs.

This project shows how automation can improve an existing change process, reduce the effort involved, and increase the accuracy of the changes. The upgrade team enjoyed the greatly streamlined process, and saving 260 hours left the customer’s executives and technical staff very happy with the results and supportive of future automation efforts. It’s nice to have a major infrastructure change run smoothly and with minimal impact on the user community.

Copyright © 2022 IDG Communications, Inc.


