“The scars” and “that horrible world” are some of the terms for network management, according to one who’s been in the trenches.
Kailem Anderson was with Cisco Systems for 12 years prior to joining fiber-optics giant Ciena last year. As vice president of portfolio and engineering for Blue Planet, a software division of Ciena, he is trying to help avoid such pain for those who must keep networks running.
“I managed customer networks, and I spent a lot of time hiring analysts to watch the network, to watch alarms, and to build big strings of rules,” for network monitoring, says Anderson. His breezy Aussie accent gives a certain lightness to what sounds like a rather miserable affair.
At $26 million in revenue in 2018, Blue Planet was a tiny fraction of Ciena’s roughly $200 million in software revenue in 2018 and $3 billion in total revenue. But it increased by a healthy 66%, and it can bring higher profit margins than Ciena’s optical networking equipment sales. It also offers the company a recurring revenue stream that is highly appreciated by Wall Street. Those economic aspects, plus the fact that it can be strategic in designing customers’ networks, make it an important part of where Ciena is headed as a company.
Figuring out what’s gone wrong in a network involves detective work at several levels of what’s known as the “stack” of protocols, the Open Systems Interconnection model, or “OSI.” Some information comes from the bottom of the stack, if you will, the “layer one,” which consists of the physical medium of transmission. That could be, for example, coaxial cabling or fiber-optic links.
At the next layer above that, layer two, raw bits are packaged into bundles, such as Ethernet frames, and there’s all kinds of information to be gleaned about the state of those frames of data as they move through the fibers and cables of the network. The next layer up is layer three, where data is packaged as Internet-addressable packets, again, with lots of their own information to be gleaned, such as routing and switching information about where the packets are going.
From there, one can go on up to higher levels, layers four through seven, the domain of applications, and get information about how an individual application is placing its data into those internet packets and whether it is having any trouble doing so.
Take the example where there is a transponder failure on one of two optical links. That leads to a route change in multi-protocol label switching, or MPLS. The network equipment reports congestion along the IP route as a link shoulders the burden of more traffic, and an end user experiences heavy delays using the network. All these are part of the same problem, Anderson explains, but getting from the user experience to the transponder failure can be a mystery.
Traditionally, a systems administrator sees the various items in a disparate fashion, with signals at each of the OSI layers coming from different telemetry systems, such as SNMP monitors, the systems log, a third thing that tracks “flows,” and then information coming from an individual piece of equipment, such as information about a recent configuration change — none of which are coordinated.
What looks like bad user performance from one angle looks like an MPLS routing issue or an IP bandwidth issue at another level, leading to a serious piece of detective work to find the culprit, the transponder failure.
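To make the detective work concrete, here is a minimal sketch in Python of the kind of cross-layer correlation described above. It is an illustration only, not Blue Planet's method: the `Event` record, the source names, the two-minute window, and the "lowest layer wins" heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds relative to an arbitrary reference point
    layer: float      # OSI layer where the signal was seen (MPLS filed as 2.5)
    source: str       # hypothetical telemetry source name
    message: str


def probable_root_cause(events, symptom, window_seconds=120.0):
    """Pick the lowest-layer event seen shortly before (or at) the symptom.

    Assumed heuristic: a fault low in the stack tends to cause the
    symptoms reported higher up, so the lowest layer is the best suspect.
    """
    candidates = [
        e for e in events
        if 0 <= symptom.timestamp - e.timestamp <= window_seconds
    ]
    if not candidates:
        return symptom
    # Lowest layer first; the earliest event breaks ties.
    return min(candidates, key=lambda e: (e.layer, e.timestamp))


events = [
    Event(-500.0, 1, "optical-alarm", "old, unrelated alarm"),
    Event(0.0, 1, "optical-alarm", "transponder failure on link A"),
    Event(2.0, 2.5, "syslog", "MPLS route change"),
    Event(5.0, 3, "snmp", "congestion on IP route"),
    Event(30.0, 7, "helpdesk", "user reports heavy delays"),
]

root = probable_root_cause(events, symptom=events[-1])
print(root.message)
```

A hand-written heuristic like this is exactly the sort of rule that has to be multiplied across every scenario, which is the burden Anderson describes next.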
A ticket gets created, and it ping-pongs between teams, with no one team having visibility into the other side, says Anderson. “Eventually they solve it, they have engineers inspect the matter, but it’s very inefficient.”
Sys admins must try and construct systems of rules as to what every possible combination of factors could mean. “They spend 1,000s of hours building these rules,” says Anderson. “It’s a zero sum game to spend that time to identify all the different scenarios.”
Instead, Blue Planet tools can train the network software using a combination of labeled examples, known as supervised learning, and reinforcement learning, where the computer explores states of affairs and possible next steps.
With that combination, the software can be trained to identify patterns “up and down the stack” that are difficult to piece together with a rules-based system.
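As a rough picture of the supervised half of that combination, the sketch below trains a tiny naive Bayes classifier on labeled incidents using only the Python standard library. The article does not say which models Blue Planet uses, so naive Bayes is a stand-in chosen for brevity, and the symptom and label names are invented for illustration. Reinforcement learning is not shown.

```python
import math
from collections import Counter, defaultdict


def train(labeled_incidents):
    """labeled_incidents: iterable of (set_of_symptoms, root_cause_label)."""
    label_counts = Counter()
    symptom_counts = defaultdict(Counter)
    vocabulary = set()
    for symptoms, label in labeled_incidents:
        label_counts[label] += 1
        for s in symptoms:
            symptom_counts[label][s] += 1
            vocabulary.add(s)
    return label_counts, symptom_counts, vocabulary


def predict(model, symptoms):
    """Return the label with the highest Bernoulli naive Bayes score."""
    label_counts, symptom_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for s in vocabulary:
            # Laplace-smoothed chance that symptom s accompanies this label.
            p = (symptom_counts[label][s] + 1) / (n + 2)
            score += math.log(p if s in symptoms else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled history: symptoms seen across the stack, plus the
# root cause an engineer eventually assigned.
training = [
    ({"optical_alarm", "mpls_reroute", "ip_congestion", "user_delay"},
     "transponder_failure"),
    ({"optical_alarm", "mpls_reroute", "ip_congestion"},
     "transponder_failure"),
    ({"config_change", "ip_congestion", "user_delay"}, "bad_config"),
    ({"config_change", "user_delay"}, "bad_config"),
]

model = train(training)
guess = predict(model, {"optical_alarm", "mpls_reroute", "user_delay"})
print(guess)
```

The labeled tuples stand in for the tagging effort Anderson describes: each one is an incident whose root cause a human has already worked out.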
“We want to have the system learn to identify those scenarios, to basically help us get to the root cause much more quickly, and to use that information to close the loop,” he says, and then have a supervisor come into the picture only once that outline has been determined.
The tools necessary to do this mostly start from off-the-shelf machine learning models, says Anderson. “Most of this, yes, we can get from the cloud guys,” he says, referring to the various enterprise-grade machine learning offerings in cloud computing facilities. “We use them all,” though the tools can also be run solely on-prem. “It’s six of one, half a dozen of the other at the moment, but I think analytics is ultimately a good thing to move into the cloud.”
Open-source tools such as SparkML play a big role in organizing all the telemetry data.
The technology of machine learning, says Anderson, has matured substantially in recent years to make the investment in labeling network events pay off.
“Five years ago I was playing with this and with the amount of effort that needed to go into labeling, the risk versus value I was getting was questionable,” he says. “With the hardening of the algorithm, and the maturity of AI, that effort-to-reward ratio has compressed significantly. You only have to do a reasonable amount of tagging now and the outputs are significant.”
Anderson maintains there is another dimension in the shift to machine learning, which is that a more comprehensive sense of the network emerges that may lead to different ways of structuring and maintaining networks.
Traditionally, many sys admins will simply turn off sources of information, says Anderson, which is understandable, because of the information overload, but it means that network administrators are throwing away important clues.
“That’s the complexity in operating with a million different data sources,” he observes. “The traditional way to manage an operations team is to filter the information, almost turn off the information that is too much.
“At Cisco, if I was running a service provider network, I would get in the vicinity of a million events a day, and I might have an operations team of 40 to 50 people who have to handle all that.”
As a consequence, admins end up only looking for “what they deem fair scenarios,” and “are turning off performance-based scenarios,” information about the relative quality of the network.
But, says Anderson, “you don’t want to turn off the information, you want to funnel it, and use it to identify what conditions are driving consistent scenarios.”
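The difference between filtering and funneling can be sketched in a few lines of Python. This is only an illustration of the idea: the event fields (`device`, `kind`), the sample data, and the threshold of three occurrences are assumptions, not anything taken from Blue Planet.

```python
from collections import Counter


def funnel(events):
    """Keep every event, but collapse the stream into counts per (device, kind)."""
    return Counter((e["device"], e["kind"]) for e in events)


def recurring(summary, min_count=3):
    """Return the conditions that keep showing up, most frequent first."""
    return [(key, n) for key, n in summary.most_common() if n >= min_count]


# Hypothetical raw events, including performance signals that a filter
# might simply have switched off.
events = (
    [{"device": "router-1", "kind": "latency_high"}] * 3
    + [{"device": "router-1", "kind": "jitter_high"}] * 2
    + [{"device": "router-2", "kind": "link_down"}]
)

summary = funnel(events)
print(recurring(summary))
```

Nothing is thrown away here: the one-off events remain in `summary`, while the repeated conditions rise to the top for an operator or a model to examine.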
“Eventually, solutions could be different if they’re trained,” he offers. Data may lead to structuring things differently. “Usually, you have a planned network condition, but then an actual network condition; through learning, you might find the actual is more optimal than planned, and then execute a policy” based on that new insight.
There are new frontiers to achieve, such as delivering analysis of the data in a “graph database” format, says Anderson. “We are in the operations and network world, and so you want to visualize all this in a network graph concept.” Some customers “want to see it just programmatically propagate to northbound systems that are going to leverage that information, to be able to visualize with a graph database and have APIs to send that northbound information to the BSS layer.”
The one catch at the moment in all this is that systems administrators are not yet ready to close the loop, so to speak, and let machine learning completely take over and automate both detection and resolution of network issues.
“This isn’t a tech limit, it’s a cultural aspect,” he says. Machine learning systems are probabilistic, not deterministic. Hence, while they can detect many failure issues, there is a reluctance to automate what could be a false positive scenario. “You only need to screw up .0001% of the time and that’s a big issue.”
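A back-of-the-envelope calculation shows why a tiny error rate still matters, and a short gate function shows one way automation might be held back. Both are illustrations under stated assumptions: pairing the 0.0001% figure with the million-events-a-day volume Anderson cites from his Cisco days is our own combination, and the `should_auto_remediate` function, its 0.99 threshold, and the `low_risk` flag are hypothetical.

```python
EVENTS_PER_DAY = 1_000_000    # volume Anderson cites for a service provider
ERROR_RATE = 0.0001 / 100     # 0.0001% expressed as a fraction

# Expected number of wrong automated calls per day if every event
# triggered an automated action (an assumption for illustration).
expected_mistakes_per_day = EVENTS_PER_DAY * ERROR_RATE
print(round(expected_mistakes_per_day, 6))


def should_auto_remediate(confidence, low_risk, threshold=0.99):
    """Act without a human only when the model is confident AND the
    action is judged low risk; everything else goes to an operator.

    The threshold and the low_risk flag are hypothetical policy knobs.
    """
    return low_risk and confidence >= threshold


print(should_auto_remediate(0.995, low_risk=True))
print(should_auto_remediate(0.995, low_risk=False))
```

Under those assumptions the rate works out to roughly one mistaken action a day, which helps explain why operators want a human in the loop for anything that could cause an outage.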
“I still think we are a little bit away in terms of closing the loop, I think it’s trust in the technology. It will happen incrementally, where you can close the loop on something non-catastrophic, that doesn’t create a failure scenario, where there is low risk, and then other areas over time