Network availability: Are you your own worst enemy?


My early enterprise surveys from 30 years ago showed that the largest reported source of network outages was human error. Today, that’s still the case, and in fact human error leads any equipment or transmission cause by a larger margin today than it did 30 years ago. This, despite the fact that enterprises say they’ve invested significantly in improving, simplifying, and automating network operations. The old saying, “We have met the enemy and they are us,” sure seems to apply.

If you ask network operations professionals, most will tell you that the problem is that network complexity is growing faster than operations management can cope with. Most, but not all. Operations management believes that acquisition and retention of qualified network experts is a big part of the problem. Some technical pundits think network technology itself is to blame. Almost everyone thinks that more automation is the solution, but some wonder if our automation tools are just adding another layer of complexity when complexity is the big problem to start with. Hot news: They’re all correct.

Skills shortage

The biggest source of network complexity isn’t the proliferation of networks or devices. Enterprises connect about the same number of sites they did a decade ago, and the only place where the number of network devices has increased significantly in that period is the data center. The problem is layers of technology.  Switching, Wi-Fi, routing, management, orchestration, and security all add features, which add managed elements, which increase complexity in two ways: First, the sheer number of things to be managed; and second, the fact that different management practices and tools are associated with each of the layers.

This is where enterprises find employee hiring and retention problematic, too. One network skill may be easy to find, but if you need three or four different skills to work with your network, what are the chances of finding someone who has them all? How much will you have to pay to keep them? If you can’t get all the skills you need in a hire, how do you train them, and how long does it take? Just defining the issues seems to take longer than finding new issues.

One thing that enterprises agree will help resolve these sorts of problems is a single-vendor network. This flies in the face of classic enterprise fears of vendor lock-in and price gouging, but more and more enterprises are finding that they can usually get unified operations tools and practices from a single vendor, and integrating their netops elements is almost impossible otherwise. In 2022, a third of enterprises told me that they valued integrated operations more highly than multi-vendor benefits, up from a fifth just two years earlier.

A single-vendor approach also helps with another often-cited path to reducing human error in netops: artificial intelligence and machine learning (AI/ML). In a multi-vendor network, enterprises find that it’s much more difficult to integrate network telemetry from each source to support AI/ML operations. It’s also more difficult to coordinate remedial action across multiple vendors.

AI/ML is the technology enterprises cite most often as their hope for resolving human-error problems in netops. But even in single-vendor networks, issues can arise that will limit its utility. Once AI is adopted, there’s also a shake-out period when the network operations staff work out how to best use it, and of course there are also cases where AI/ML fails to do what’s expected of it.

AI/ML tools lack depth

The biggest technical problem users report with AI/ML is superficiality. One enterprise told me that their AI tool, in four months of operation, generated a total of over a hundred suggested actions, none of which would have taken the operations center staff more than a second to have raised and implemented without AI support. In a dozen cases where the staff actually needed help, the AI system wanted more information or made a vague, general suggestion regarding cause and remedy. Just tacking “AI” onto a management system doesn’t deliver much, it turns out. Surprise, surprise!

The second-most-common complaint about AI/ML in netops was over-reliance.  Any netops professional will tell you that situational awareness is as important in a network operations center (NOC) as it is in the cockpit of a fighter. It’s easy for the NOC staff to become complacent about AI/ML performing routine tasks, and to lose touch with what’s happening in the network, missing important trends or simply forgetting that past automated changes have to be considered when the staff has to step in and do something manually.

The best solution to the superficiality problem is to assess an AI/ML tool in real-world use at an enterprise NOC.  Yes, it’s possible to use vendor demos to weed out obviously limited tools, but the only way to determine whether AI/ML will actually provide useful insights is to see it in operation and talk with NOC staff. Some enterprises tell me that they’ve been able to get an AI/ML tool set up on a trial basis, and that can also be considered.

The solution to the over-reliance problem is a matter of both NOC practices and procedures and of controlling senior-management expectations. The more you let AI do, the harder it will be for the NOC to intervene if AI either can’t do something or does it wrong. Few network professionals can completely shake off the scenario where AI is merrily taking down this device or that connection while the NOC staff stands helplessly by. Few senior executives can shake off the hope that AI will let them run their networks with, maybe, at most, one human for each shift. Movement on both sides here is essential.

Enterprises say it’s fairly easy to establish procedures that require operations people to take stock of conditions after an AI/ML action is complete. In some cases, they require ops personnel to log a review of the state of the network, to force them to consider the consequences of actions taken and the way that post-action conditions could impact later tasks. A regular review of these logs, particularly as an operations-team activity, is a good way to get NOC personnel to think about what AI/ML systems are doing.
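A procedure like the one described above can be backed by a very small piece of tooling. The following is a minimal sketch in Python, not a description of any enterprise's or vendor's actual system: the record shapes (AutomatedAction, ReviewEntry), the function unreviewed_actions, and all the sample values are hypothetical names invented for illustration. It pairs each AI/ML action with an operator's logged review and lists the actions nobody has reviewed yet.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AutomatedAction:
    """One change made by an AI/ML tool (hypothetical record shape)."""
    action_id: str
    description: str
    completed_at: datetime


@dataclass
class ReviewEntry:
    """An operator's logged review of network state after an action."""
    action_id: str
    reviewer: str
    network_state_notes: str   # what the network looks like now
    follow_up_risks: str       # how this could affect later manual tasks
    reviewed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def unreviewed_actions(actions, reviews):
    """Return automated actions that have no logged review yet."""
    reviewed_ids = {r.action_id for r in reviews}
    return [a for a in actions if a.action_id not in reviewed_ids]


# Illustrative data only; the device names and actions are made up.
actions = [
    AutomatedAction("act-001", "Rerouted traffic around wan-edge-1",
                    datetime(2022, 6, 1, 2, 15, tzinfo=timezone.utc)),
    AutomatedAction("act-002", "Disabled flapping port on core-switch-1",
                    datetime(2022, 6, 1, 3, 40, tzinfo=timezone.utc)),
]
reviews = [
    ReviewEntry("act-001", "night-shift-operator",
                "Traffic stable on backup path",
                "Backup path has less headroom for a manual failover"),
]

pending = unreviewed_actions(actions, reviews)
for action in pending:
    print(f"Needs review: {action.action_id} ({action.description})")
```

The list of pending actions, together with the accumulated review entries, is the kind of material a team could walk through in the regular log review.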

A final truth here is that whether we’re talking about AI/ML or more traditional operations tools, network telemetry is critical. That doesn’t mean that pumping information at systems is a solution to anything, but mapping out data sources can ensure that you’re covering the critical points in the network. A hole in coverage means that problems you can’t see may be developing inside the network and could spread to affect the network overall.
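Mapping data sources against critical points can be as simple as a set comparison. The sketch below, in Python, assumes you can list the critical points by name and say which points each telemetry source reports on; the function coverage_gaps, the device names, and the source names are all hypothetical and chosen only to illustrate the idea.

```python
def coverage_gaps(critical_points, telemetry_sources):
    """Return critical points that no telemetry source reports on, sorted."""
    covered = set()
    for points in telemetry_sources.values():
        covered.update(points)
    return sorted(set(critical_points) - covered)


# Illustrative inventory only; all names are made up.
critical_points = [
    "core-switch-1",
    "wan-edge-1",
    "dc-firewall-1",
    "branch-router-7",
]
telemetry_sources = {
    "snmp-poller": ["core-switch-1", "wan-edge-1"],
    "flow-collector": ["wan-edge-1", "dc-firewall-1"],
}

gaps = coverage_gaps(critical_points, telemetry_sources)
print(gaps)  # ['branch-router-7']
```

Here the branch router is the hole in coverage: nothing reports on it, so a problem there would go unseen until it affected something that is monitored.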


Copyright © 2022 IDG Communications, Inc.


