AIOps – Use Some Intelligence (Part 1) – VMware Cloud Community
There are many types of Ops in today’s world of Cloud, from DevOps to GitOps, SecOps to DevSecOps, the list goes on, but AIOps is here to truly enable business innovation. Please check out this video where I explain the background of AIOps, what problems it can help solve and the opportunities for the future.
As usual, if you would prefer to read the content, I’ve placed the transcript below. Enjoy!
I’d like to follow up on my AI/ML Demystified video with something more focused towards cloud and data center operations.
Firstly, if you’re not familiar with the very basics of AI/ML, what it is, where it’s come from and where it might be going, I encourage you to check out the video on my channel called AI/ML Demystified.
For this two-part video, I asked myself the question:
“How can we start to make use of AI/ML if we’re not a developer or data scientist?”
Over these two videos, I’m going to talk about the background, the problem or opportunity that exists, a bit about what AIOps is, some of the tools that can help, what I think the future looks like and finally try to answer that question.
Background
My role in VMware is all about Cloud and cloud management, so my conversations are usually with people who care about building, running, or managing applications in public and private clouds, and about everything that’s required to do so. That led me to think about firstly how AI is going to affect these people, but secondly, what opportunities it creates for different roles, specifically in the IT operations space.
So, do we need another xOps? We have many operations, or Ops, terms in the Cloud and DC world today: everything from DevOps to GitOps, SecOps and DevSecOps. It’s all about the Ops these days. It’s the hub, it’s mission control! But what about AIOps?!
Shift Left
Well, just as operations responsibilities have started to “shift left” to developers, becoming “DevOps”, AIOps is currently in that strange place where people talk about it as a tool, or set of tools. We thought like this when “DevOps” came out 12 years ago. We were very silly back then and tended to think of DevOps as a tool. That tool was called Jenkins or Ansible, or sometimes (and yes, I have heard this) we thought DevOps was “just what Kubernetes does”. Thankfully, DevOps is pretty well understood now; we are way beyond this thinking and most of the IT industry gets it. DevOps is about breaking down silos and changing the way Developers and Operations teams work, so they collaborate more closely to produce better software.
And I’ll stop talking about DevOps and move on to AIOps now, but that is the kind of thinking I’d like you to have as we explore AIOps, because that’s a roadmap we can expect AIOps to start taking. But where DevOps took many years to get there, my prediction is that AIOps will come to fruition much faster.
Gartner coined the term AIOps and currently defines it as:
“AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination.”
So then, reading Gartner’s definition, right now AIOps is going to be used as a snazzy new buzzword or label for any monitoring tool that uses an algorithm or some kind of machine learning.
There are many of these tools already out there: fantastic software that leverages AI/ML technology to dramatically improve IT Operations, and they definitely should be leveraged in any modern data center or cloud. I’m going to talk more later about some of the tools that fit into this kind of new bucket, so bear with me.
But let’s remember, AIOps isn’t a set of tools, it’s a whole new way of approaching IT Operations.
The Problem
Let’s start with the problem or opportunity that exists today.
Digital Transformation
Applications are driving more complexity, that’s pretty much the base for absolutely any IT problem today. Things are complex now.
Digital Transformation has been happening for years now, in fact we’ve been talking about it since the 90s. Then, it meant a company having a website. Now, the website IS the company. Businesses are embedding technology into their products or are building new digital products.
Digital Transformation has meant the way we work has needed to change to catch up. It has led to DevOps, Kubernetes, Microservices, Functions as a Service, CI/CD with automated pipelines. All of these amazing people, process and technology changes.
And all of these things happened because the old way wasn’t working: methodologies were too slow, and silos meant many hand-offs between different operations teams. Essentially, people were moving too slowly for the technology.
Now we are in a world where processes are becoming so heavily automated that we can push new code for a new app into production every 10 seconds (or much less for some). The humans do the creative piece, figuring out what the app does, what it looks like, etc. Humans write the code for the app.
Everything else required to get it into production can be automated with CI/CD: all the testing, staging, user acceptance and so on. If it isn’t already, this should be your goal as a company. McKinsey reports that:
“57% of companies have started using automation in one or more business functions.”
Automation really is the key. But, if you do things faster, you MUST have the foundation to keep it going. In that respect, the burden is now falling onto operations teams again.
Everything is moving faster than the eye can see. IT Ops, or Cloud Admins (which is basically IT Ops for someone else’s DC), might not have to look after specific apps as much anymore, since DevOps begins to cover this. But they now have to look after Kubernetes; they have to surface newer things to developers, like Lambda; they have completely different DB types to monitor; and they have different clouds to manage, with different APIs and different functionality. They have more of everything to deal with.
Today, nobody wants them to become the bottleneck. The days of making an app and throwing it over the fence for ops to look after are over. DevOps has really allowed Operations teams to give some of the burden of looking after apps to the developers themselves.
But there’s much more we can do to not just remove the bottleneck from ops further, but actually have ops become an enabler for innovation.
AI Ops – To the Rescue
As Gartner says, “AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination.” So what does this mean? I’d like to break it down into each piece.
Big Data
Let’s start with Big Data. This is a vast amount of structured or unstructured data, and it is where all of your useful information is waiting to be analyzed. If you have ever run a query on a large database that took a long time, think of that, but larger. So large that traditional tools just can’t query it.
Machine Learning
Machine Learning: well, hopefully you watched my previous video, but essentially machine learning is an overarching AI concept where a machine can assess its own success from its output and change how it processes data, without a human, to provide a better output next time. Hence, learning.
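To make that feedback loop a little more concrete, here is a minimal sketch in Python. It is not taken from the video or from any particular product; the `train` function, the sample data and the learning rate are all made up for illustration. The "machine" guesses an output, measures how wrong it was, and nudges a single weight so the next guess is better.

```python
def train(samples, learning_rate=0.01, epochs=200):
    """Learn a single weight so that prediction = weight * x matches y."""
    weight = 0.0  # the model starts out knowing nothing
    for _ in range(epochs):
        for x, y in samples:
            prediction = weight * x
            error = prediction - y  # assess success from the output
            # Adjust how the data is processed, with no human involved.
            weight -= learning_rate * error * x
    return weight


# Hypothetical training data where the hidden rule is y = 2 * x.
samples = [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0)]
weight = train(samples)
print(f"learned weight: {weight:.3f}")
```

Real machine learning models have far more than one weight, but the loop of predict, measure the error and adjust is the same basic idea.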
Automate
Then we combine big data and machine learning to automate IT operations processes. This piece is pretty self-explanatory, but the examples Gartner gives are:
Event Correlation, Anomaly Detection and Causality Determination.
Event correlation
Bringing together events from multiple different places, for example networking devices, virtualization platforms and cloud services. Then filtering and tidying up the data, removing any duplication and so on. The tooling then looks at relationships and patterns between all of these events, ready for root cause analysis. No more having 20 different tools open, one for each of those devices, trying to correlate events manually.
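As a rough illustration of those steps, here is a small Python sketch. Everything in it is an assumption for the sake of the example: the `Event` type, the source names, the timestamps and the five-minute window are invented, and real AIOps tools use far richer correlation logic than "happened close together in time".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    source: str        # hypothetical source name, e.g. "firewall"
    timestamp: datetime
    message: str


def deduplicate(events):
    """Drop exact repeats (same source, time and message), keeping order."""
    seen = set()
    unique = []
    for event in events:
        if event not in seen:
            seen.add(event)
            unique.append(event)
    return unique


def correlate(events, window=timedelta(minutes=5)):
    """Group events that occur within `window` of the previous event."""
    groups = []
    for event in sorted(deduplicate(events), key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


# Illustrative events from several made-up sources.
events = [
    Event("firewall", datetime(2024, 1, 5, 14, 0), "firmware upgrade started"),
    Event("vm-platform", datetime(2024, 1, 5, 14, 2), "network latency high"),
    Event("vm-platform", datetime(2024, 1, 5, 14, 2), "network latency high"),
    Event("cloud-lb", datetime(2024, 1, 5, 14, 3), "backend health check failed"),
    Event("cloud-db", datetime(2024, 1, 5, 18, 30), "backup completed"),
]

groups = correlate(events)
for group in groups:
    print([f"{e.source}: {e.message}" for e in group])
```

The duplicate latency alert is dropped, the three events around the firewall upgrade land in one group, and the unrelated backup event ends up on its own.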
Anomaly Detection
Noticing when something is different. For example, my app usually has around 1000 users logged in at this time on a Friday, but today I can see that it has 100 users logged in. That is an anomaly and could be a good indicator that there’s a problem with the platform.
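Here is one simple way that check could be sketched in Python, using a z-score against previous Fridays. The history values, the `is_anomaly` name and the threshold of 3 standard deviations are assumptions for illustration; production tools typically learn seasonality and thresholds rather than hard-coding them.

```python
from statistics import mean, stdev


def is_anomaly(history, current, threshold=3.0):
    """Flag `current` if it sits more than `threshold` standard
    deviations away from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in the history: any different value is unusual.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Made-up logged-in user counts for the same hour on previous Fridays.
previous_fridays = [980, 1010, 1005, 995, 1020, 990]

print(is_anomaly(previous_fridays, 100))   # today's count from the example
print(is_anomaly(previous_fridays, 1005))  # a normal-looking Friday
```

With this history, 100 users is flagged and 1005 users is not, which matches the intuition in the example above.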
But why is this happening?
Causality Determination
Once we’ve correlated events and looked for anomalies, we can start to use this to find the root of a problem. Maybe the number of users logged into the app drops every time a particular firewall is upgraded, for example.
Ultimately, the use of AIOps changes this whole process from hours of work to seconds with this automation. And we’ve not even touched how to automate fixing these issues too! Maybe that’s for another video.
So then, AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination. Make sense?
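A very rough sketch of that idea in Python: count which kinds of change tend to happen shortly before each anomaly, and rank them. To be clear, this is a co-occurrence heuristic and not real causal inference, and the function name, change names, timestamps and 30-minute lookback are all invented for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def rank_candidate_causes(anomalies, changes, lookback=timedelta(minutes=30)):
    """Count how often each change type occurs shortly before an anomaly."""
    counts = Counter()
    for anomaly_time in anomalies:
        # Change types seen in the lookback window before this anomaly.
        preceding = {
            name
            for name, change_time in changes
            if timedelta(0) <= anomaly_time - change_time <= lookback
        }
        counts.update(preceding)
    return counts.most_common()


# Hypothetical times when the logged-in user count dropped.
anomalies = [
    datetime(2024, 1, 5, 14, 10),
    datetime(2024, 2, 2, 9, 20),
    datetime(2024, 3, 1, 16, 5),
]

# Hypothetical change log: (change type, time it happened).
changes = [
    ("firewall upgrade", datetime(2024, 1, 5, 14, 0)),
    ("firewall upgrade", datetime(2024, 2, 2, 9, 0)),
    ("firewall upgrade", datetime(2024, 3, 1, 15, 50)),
    ("db patch", datetime(2024, 2, 2, 9, 10)),
    ("dns update", datetime(2024, 1, 20, 11, 0)),
]

ranking = rank_candidate_causes(anomalies, changes)
for name, hits in ranking:
    print(f"{name}: preceded {hits} of {len(anomalies)} anomalies")
```

A change that precedes every anomaly is only a strong suspect, not a proven cause, so a human or a more sophisticated model still needs to confirm it.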
Hopefully you can see that this type of approach is going to be a massive enabler for innovation.
That’s it for today. Please join me for part 2, where we will look into some of the AIOps tools that exist today and what the future could look like for AIOps, then sum up and answer the question “How can we start to make use of AI/ML if we’re not a developer or data scientist?”
COMING SOON!