On March 28, 2019 Unravel CTO, Co-founder, and Duke University Adjunct Professor Shivnath Babu and Alkis Simitsis, Chief Scientist, Cybersecurity Analytics at Micro Focus presented a session on Automation of root cause analysis for big data stack applications at Strata Data Conference in San Francisco, CA. You can view a video of the session or read the transcript.
So I am Shivnath Babu. I am Co-founder and CTO at Unravel, also an adjunct professor of computer science at Duke University. And my co-speaker will be Alkis Simitsis, Chief Scientist at Unravel. So what we would like to do today is give you a flavor of how AI and machine learning can be used to address a very important challenge in distributed systems.
And as many of you know and probably the reason you’re here is, you must be dealing with, managing, or think of building a modern data application, right? Something that runs on a collection of distributed systems, something like this, what we’re actually showing over here, right?
You might have different kinds of data, structured data, unstructured data, business-related data, telemetry, monitoring data, logs, whatnot. And now we have a huge collection of systems that can be used to convert this data to insights. For example, the pipeline that’s being shown here. Ingest the data via Kafka. Maybe store it on a Hadoop distributed file system, or an S3, or Azure Blob tool.
Then apply analysis on it using Python or Spark. And then this data might actually be converted into some structured formats that you can use to do business intelligence using Tableau or QlikView. Or actually query it and explore it using Athena, Redshift, Hive, Spark.
At the same time, there are other interesting applications once this data goes into a lake. You can actually do machine learning, build your models, use that for serving as well as maybe do things in real-time with Kafka, with streaming, and actually power Internet of Things kind of applications.
But what you realize is all of these systems are distributed. And now we have applications that are sort of connecting them together, and because of that, dependent on them, right? So running these applications ends up being hard.
And what are some definitions of hard? Like, things can go wrong. Things can fail. Things may not meet your expectations.
It could be as simple as, “Hey, my Spark application failed,” or, “My SQL query is failing. I don’t know what is going on.” Or it could be that things are running, but things are not running well.
Maybe they are not well-tuned. Maybe they were running well last week, but today there’s a problem. It’s not meeting your SLA. And last but not least, in terms of these problems, once you start utilizing, let’s say, the cloud, very, very quickly, your cost of running these apps on the cloud can go out of control.
Maybe there’s a lot of resources getting allocated and maybe not getting utilized. Many, many, many things can actually happen. Like this, right? You don’t want to be this user who really wants to develop his applications, make them reliable. These are mission-critical applications, but he’s spending a lot of his time fighting these fires, hey, like, “Application’s failing. The container sizes are wrong,” or, “Data layouts are not perfect. My costs are actually going through the roof. My boss is yelling at me.” So can we address this problem?
That is the context of this talk. Using interesting algorithms, using AI and ML.
Well, let me give you a flavor of why this is a very important problem. What have people been doing wrong that makes this problem harder to solve than it needs to be? So an interesting study was done by the company AppDynamics. They surveyed around 6,000 IT professionals across a wide variety of industries, across a range of countries.
One of the things that kept coming out is, when you’re trying to troubleshoot these problems, the first thing that people often face is that the data is actually siloed. All the logs, and metrics, and things you need to understand what the cause of the problem is, understand why things are failing and fix it, ends up being siloed.
And this is another interesting fact they came up with as they were surveying these IT professionals. I'm not going to go into all these numbers. But one interesting number, you see that 38%? What that 38% means is that a lot of the time, the way an operations team discovers there's a problem is because somebody's posting about it on social media.
It may not even be public; truthfully, it might be something internal. But it takes time, and people become unhappy that these problems exist and things are not well. And of course, the real KPI is the time it takes to fix these problems and the money that is lost. The survey concluded that the mean time to diagnose and recover from these problems is something like seven hours, an entire business day.
And the average cost of fixing these problems is around $402,000, a ton of money. So things are not well. Things are not happy. Users are not happy. And can we now address the problem?
Can we use all those interesting things that you see in the conference, using AI & ML? Convert this problem into a problem that can be solved with data, and things, and algorithms. So what does that mean? High level, what that means is, “All those different sources of data that might be spread across logs, and metrics, and resource manager, and application server, and this, and that, and whatnot, bring it together into one single platform, one single view.”
And use analysis on it. Use algorithms on it. From the conventional machine learning algorithms all the way to AI, and automation, and reinforcement learning. And can we use that to make things much better? And that entire area is called AIOps.
That’s the term that many analytics firms are actually using now. And actually, what it means is, “We have these challenges.” Distributed application performance management, operations management, is pretty challenging, right? Can we use AI to make that easier? Can we actually automate a lot out of these problems?
Now, it’s definitely not easy to build such a platform. There are many, many different challenges that arise. From something as simple as, “How do you even get all of this data into a single platform?” And what are we talking about when we say data? We are talking about monitoring data. We are talking about telemetry data, which could be nice, structured time series metrics, what you see on these dashboards.
It could be unstructured logs, text that these different containers and applications are spitting out. And anything in between, like from SQL execution plans to information about metadata, things like that.
Once you’re bringing all of this data…and imagine bringing that data from a large 5,000-node cluster. At the same time, imagine bringing that data from a cluster that comes up on the cloud and just goes. It’s a transient cluster. Or autoscales, maybe one node now and 100 nodes for some time. So it’s challenging to bring all the data in a single place.
Assume you have actually solved that problem. The next challenge becomes, "How do you store all of this data?" As we saw, this data would be streaming and coming in real-time. The data could be unstructured logs. Storage is a challenge. And one thing you realize as you work with distributed systems: "We can never assume that all the data will arrive nicely and in a very ordered fashion.
Things will be very asynchronous.” Okay. So we have brought the data. We have actually managed to kind of store it and process it. Then comes the interesting challenges that we actually want to spend time today. How can you analyze the data? What sort of algorithms can be used?
Maybe for some use cases, detecting a bad query that has consumed a lot of resources and cost. You want to detect that in real-time, right? Or if you want to optimize an entire workload, maybe that is a more offline kind of a problem.
And at the same time, if these algorithms are making decisions and taking actions, everybody wants that to be very reliable and predictable, with no false positives or negatives. So it's a tall order.
We have been working on this problem for a while. We want to kind of bring out some of these learnings that we have. And the way we will do it is we’ll go through three problems.
We’ll spend some time in each of them, and tell you about the things that we worked on, and the things that have succeeded, and sometimes the things that didn’t really work.
And the first problem we’re going to kind of take is a straight up failure. An app failed. How many of you here have run a Spark app and seen a screen like this? It seems like the right side of the room actually has all these problems. So this is the kind of thing.
You run a Spark application. It failed. And what did it give you? Five levels of nested stack traces. That’s not the kind of ease of use that we really want, right?
For a data scientist or a data analyst who just wants to get her work done, all of this messy data is the blocker. Like, "I'm spending 80% of my time dealing with these kinds of issues." It would be great, even for such a simple problem as, "Hey, that table I was querying didn't exist where it's supposed to exist.
The metastore said that the table is in this location on Amazon S3 or Hadoop file system, but actually the data was not there,” if something told you this.
If something told this particular user that, she could move on and fix the problem: maybe some metadata needs to be fixed, whatnot. That is the impact of quick, automated root cause analysis, where this person doesn't have to depend on an operations person, doesn't have to file a ticket and wait for somebody to take it and come look at it. Nothing like that. Just automated, quick root cause analysis. It is very possible.
This is a very challenging problem, by the way. It is not easy to do, even though the last slide may make it look easy. In the real world, in the distributed-systems world where you have a lot of these different kinds of logs, you can apply a workflow like the one I'm showing here to address the problem.
So first and foremost, bring all of these logs together from whatever level they are at. I'm using container logs here, because applications like Spark end up running in containerized environments, and every container keeps spitting out logs. Bring that, and then have some sort of structured data extraction on top of it.
Which is what I've shown here: "Extract error templates." What are the actual errors happening? And there could be many different errors. Maybe one is the root cause, and the others are all just downstream: because this thing happened, something else failed.
Bring that, convert that into a more structured representation. What I’ve shown here is feature vectors so that from that unstructured data it becomes a more structured point in a high-dimension space. And if you can add labels to this data, convert it into a labeled data problem that maybe an expert has diagnosed the problem, and then he or she is saying that, “This is the root cause of the problem,” then that becomes a label.
If you can get this training data, then you can apply your favorite supervised learning algorithm and build a model. And once you have such a model, the next time such a failure happens, the model predicts, and immediately you have the root cause. Sounds great. But the big challenge here becomes, "How do you get these root-cause labels?"
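To make the modeling step concrete before we get to the labeling challenge, here is a minimal, stdlib-only Python sketch of the workflow: mask the volatile details out of log lines to get error templates, turn templates into bag-of-words feature vectors, and match a new failure against labeled examples. Everything here — the masking rules, the log lines, the nearest-template matching — is illustrative, not the actual production pipeline.

```python
import re
from collections import Counter

def error_template(log_line):
    """Mask volatile details (hex IDs, paths, numbers) so different
    occurrences of the same error collapse to one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", log_line)
    line = re.sub(r"/[\w./-]+", "<PATH>", line)
    return re.sub(r"\d+", "<NUM>", line)

def featurize(template):
    """Bag-of-words feature vector as a token-count mapping."""
    return Counter(re.findall(r"[A-Za-z<>]+", template.lower()))

def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (sum(v * v for v in a.values()) *
            sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0

# Hypothetical labeled examples: (log line, expert-diagnosed root cause).
TRAINING = [
    ("java.lang.OutOfMemoryError: GC overhead limit exceeded", "out-of-memory"),
    ("Container killed by YARN for exceeding memory limits. 5.2 GB of 5 GB used",
     "out-of-memory"),
    ("FileNotFoundException: Path /data/events/2019 does not exist",
     "missing-input"),
    ("Table not found: sales_fact", "missing-input"),
]
MODEL = [(featurize(error_template(line)), label) for line, label in TRAINING]

def predict_root_cause(log_line):
    """Nearest labeled template wins -- a stand-in for a trained model."""
    fv = featurize(error_template(log_line))
    return max(MODEL, key=lambda m: similarity(fv, m[0]))[1]
```

The next time a failure produces a stack trace, a function like `predict_root_cause` maps it straight to a known cause; a real system would replace the nearest-template matching with the trained supervised models the talk discusses.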
Do you have enough experts who can actually look at a lot of these problems, diagnose it? And can you get enough amount of representative training data to kind of learn a model? So there are two ways.
As I said, one is, “Have an expert do a manual diagnosis.” But what is interesting is, across a huge space of customers, what we have seen is that a lot of the problems that happen, they tend to happen…so there’s kind of an 80-20 or even a 90-10 kind of rule where 90% of the problems are caused by 10% of unique root causes.
I’ve listed some of them here, and you can see the kind of space here. Maybe the input wasn’t valid, something related to input. Or it ran out of memory, which is by far the biggest problem in a lot of these memory-oriented processing systems.
Or maybe there was a Java exception that actually happened. So there is a space from which these root causes come. And if you can understand that space, then you can apply a solution like this. You can have an environment where these problems are actually injected.
So there is a failure injector that is injecting that particular root cause into a running system. And the running system might be running different kinds of applications. And you can bring up such an environment on a cloud kind of environment, because many of these problems are not really a function of the hardware.
They’re more like a app-level problem or a platform-level problem. Bring up a cluster, keep running these workloads, inject these problems, inject many different kinds of problems and different kinds of ways in which the applications might run, or different amounts of contention. So create a representative space where what you’re really doing is you’re generating training data.
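The injection loop just described might be sketched as follows. The fault names, the configuration knobs, and the stand-in `run_workload` are all hypothetical; the point is that the injected cause becomes the label for free.

```python
import random

# Hypothetical catalog of injectable root causes: each injector
# introduces one fault into a base job configuration.
FAULT_CATALOG = {
    "out-of-memory":  lambda cfg: {**cfg, "executor_memory_mb": 256},
    "missing-input":  lambda cfg: {**cfg, "input_path": "/nonexistent"},
    "bad-permission": lambda cfg: {**cfg, "user": "no_access_user"},
}

def run_workload(cfg):
    """Stand-in for launching an app on the test cluster and collecting
    its logs. Here we just fabricate one log line per injected fault."""
    if cfg.get("executor_memory_mb", 4096) < 512:
        return ["java.lang.OutOfMemoryError: Java heap space"]
    if cfg.get("input_path") == "/nonexistent":
        return ["FileNotFoundException: /nonexistent does not exist"]
    if cfg.get("user") == "no_access_user":
        return ["AccessControlException: permission denied"]
    return ["application finished successfully"]

def generate_training_data(n_runs, seed=0):
    """Each run injects one known root cause, so the collected logs
    come pre-labeled with ground truth -- no expert diagnosis needed."""
    rng = random.Random(seed)
    base_cfg = {"executor_memory_mb": 4096, "input_path": "/data", "user": "etl"}
    data = []
    for _ in range(n_runs):
        cause = rng.choice(sorted(FAULT_CATALOG))
        logs = run_workload(FAULT_CATALOG[cause](base_cfg))
        data.append((logs, cause))   # (features, label) pair
    return data
```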
Because if you inject the problem, when you collect the logs, you know exactly what the ground truth is, what the model should actually root-cause the problem to. And here are, quickly, two slides on what we did here. There are many kinds of models you can build. At a high level, there is what I'm calling shallow learning: the more traditional support vector machines, logistic regression, and random forests.
And then you have the deep learning and the neural networks for the model itself. This is one slide to give you an idea of how this approach can be really powerful. So I’m showing two graphs here. The blue is using the logistic regression model, and the purple is using random forests.
And you'll see the two terms, TF-IDF and Doc2Vec. These are ways of converting unstructured documents into a structured representation, so that you can apply a modeling technique like logistic regression. TF-IDF stands for Term Frequency–Inverse Document Frequency. It's one common, more traditional way of converting documents into a structured representation.
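As a rough illustration of what TF-IDF computes, here is the weighting from scratch on two toy error templates (a real pipeline would use a library implementation). Note how a term that appears in every document, like "error", gets weight zero, while discriminative terms keep positive weight.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF vectors for tokenized documents.
    TF = term count in the doc; IDF = log(N / #docs containing the term)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

# Two toy error templates, already tokenized.
docs = [
    ["out", "of", "memory", "error"],
    ["file", "not", "found", "error"],
]
vecs = tf_idf(docs)
```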
But Doc2Vec, that's a newer technique, often implemented using a deep learning kind of model. So we ran experiments, and the experiments are ones where we actually inject problems. We use 75% of the data for training and 25% for testing.
And we have done a lot of different things. I don’t have time to get into all the results. But the key thing is we can build very good models that are good at at least recognizing the problems that happened in the past.
So create a way to automate at least that domain knowledge, and a quick tool that can diagnose things so the accuracy rates are pretty high. Once again, more than happy to tell you about some of the other findings we have here.
But this is really a great time for me to switch to Alkis. A failure may not be just that it straight up died; it might be that it's running slowly. So he'll tell you about how you can use AI and ML to auto-tune and bring an application to great performance, and not just look at one app, but at an entire cluster. Alkis, over to you.
Hello from me as well. So we just saw how we deal with failures: how to identify them, and how we label and organize them. Let's see how we can use that knowledge to fix applications.
And also, at a greater scale, how we work on improving our cluster's behavior, what we call holistic cluster optimization. So in general, when you have a system and you run apps, there is a very large number of parameters you can tune, either for optimizing your workload, your application, and so on.
The space you have to optimize over is pretty huge. And this is a tall task for experts, but it is an even greater one for people like data scientists and data analysts, who just want to run their applications.
And most of the time, the tuning is done in a manual way, through a trial-and-error kind of process. Ideally, what we would like to have is a user who is only concerned with, "I need to make my application run faster," or, "I need to get my application to a successful state."
So what we really need from a user is, “Give me your app, or a set of applications, and give me a goal. Specify a goal.” This can be a speed up. This can be a resource kind of efficiency objective. It can be any kind of optimization objective.
And what we really need to do is…one approach we have is we create sessions in which we run applications. And the way we do it is, you take an application or a set of applications, you run it once, and you get back some insights, whether it failed or succeeded.
Based on that, you can create some recommendations of what you can do to make your application run better, faster, or even run in a successful state if those failed before. You get these set of recommendations. We apply them automatically. You run an application again. And this is an iterative process until you complete your objective or you reach a specific threshold.
And as the system runs, as you have multiple rounds with the application, you see that your application gets into a better state with every single run. The way we do this, at a very high level: we get the application and the goal as input.
And then we use a probe algorithm to run this application through an orchestration kind of engine. And we can run it either on on-premises deployment or at the cloud deployment. As we run the application, we get data from the execution of the application, a multiplicity of metrics related to this execution.
We call this probe data. We correlate this data with historical data that we keep from previous executions of either the same application, or a similar application, or even a compatible application. And we use this combination of metrics as a feedback back to our probe and recommendation algorithms.
We then know how to create a set of recommendations for a subsequent run of the application. And this loop keeps going until, as I said before, we either meet our objective or reach a limit, like, for example, a specific number of runs.
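The probe loop can be sketched like this, with caller-supplied `run_app` and `recommend` functions standing in for the real orchestration and recommendation engines. The toy behaviors at the bottom mirror the out-of-memory story: the first run fails, a recommendation bumps executor memory, and the retry succeeds. All names and numbers are illustrative.

```python
def auto_tune(run_app, recommend, goal_runtime_s, max_rounds=5):
    """Iterative tuning session: run, collect metrics, apply
    recommendations, repeat until the goal is met or the round budget
    is spent. `run_app(cfg)` returns (ok, runtime_s, metrics);
    `recommend(cfg, metrics)` returns an improved configuration."""
    cfg, history = {}, []
    for _ in range(max_rounds):
        ok, runtime, metrics = run_app(cfg)
        history.append((dict(cfg), ok, runtime))
        if ok and runtime <= goal_runtime_s:
            break                          # objective met
        cfg = recommend(cfg, metrics)      # next round's configuration
    return cfg, history

# Toy stand-ins for the cluster and the recommendation engine.
def toy_run(cfg):
    mem = cfg.get("executor_memory_mb", 1024)
    if mem < 2048:
        return False, float("inf"), {"error": "out-of-memory"}
    return True, 600 - min(mem, 4096) // 20, {}

def toy_recommend(cfg, metrics):
    if metrics.get("error") == "out-of-memory":
        return {**cfg, "executor_memory_mb": 4096}
    return {**cfg,
            "executor_memory_mb": cfg.get("executor_memory_mb", 1024) + 1024}
```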
To give an example, this was a test of a Spark application that failed due to an out-of-memory error. And we ran an automated tuning session to optimize this application.
So the first time we run it, we figure out the problem was an out-of-memory problem. You get back some recommendations to fix some configuration parameters. So the next run gets the application to a successful state. We continue the session. And the next round of recommendations chooses a little bit more efficient memory configuration. And with that, we are able to speed up, a little bit, the application.
And this process goes on. That’s the basic idea behind the auto-tuning kind of framework. Let’s move on to a higher state, which is, “Yes, you optimized your applications, and this is great. But in the bigger picture, what you really need to do is optimize your workloads.”
You have multiple applications running, and you need to monitor everything that’s going on in your cluster and try to optimize this. And what we have, what we want to achieve is we know that every single system can probably optimize the cluster configuration in the best possible way for its own good.
What we miss is a single pane of glass, a system that can optimize an application in the cluster across different systems. We saw at the beginning a complex architecture that combines different engines with different purposes. It's a bit of a challenge to try to optimize an application that runs, for example, on a Spark engine and then on a Redshift engine.
So in order to do that, we need to employ some advanced analytics to discover the problems and inefficiencies that we may have in our cluster, and as a subsequent step, to recommend solutions and concrete ways to resolve these problems. Key objectives we have in this exercise are performance management, autoscaling, and cost optimization.
Cost optimization is very important, both on on-premises deployments but also a little bit more importantly at cloud deployments, where actually the cost translates directly to monetary values. So what we do is we create actionable insights across different dimensions. We really need to have actionable insights on application performance. For example, we need to identify if an application has issues that are related to code inefficiencies, or maybe some hardware failures or inefficiencies.
For example, slow nodes or maybe even contention in cloud or cluster resources. We’re also interested in providing actionable insights on cluster tuning, based on aggregation we have on application data. And also actionable insights on things like cluster utilization and auto-scaling.
What we typically observe in practice is a situation where a cluster gets maxed out, and we have queues that are pretty much blocked. And as a consequence, you have an application that is waiting for resources. That's a very typical problem. In reality, what happens is that, most of the time, you have resources that have been allocated, but the applications don't really need all of them.
And inefficiency may come from different reasons, for example, based on the number of containers or the size of containers. A solution to this is we can provide app-level and cluster-level recommendations to help us cut down the cost to what we actually need.
How do we do this? So, optimizing an application by itself is very interesting and very useful. But what happens when you have a production environment where you run 100,000 or 150,000 applications per day? Manually tuning all of them is a bit of a challenge.
So what you can do, one idea is, let’s observe how things worked in the past and let’s try to change the cluster default configuration based on the workload we know that we run on the cluster, and also previous runs with the applications we have. When this happens, new applications that are coming in, they use by default the new configurations we have changed.
The challenge here is to find a good default configuration that can satisfy, if not all, at least most of the applications. And what really matters is that we optimize our workloads. In doing that, we may hurt specific applications, but in the bigger picture, what we really need is to optimize the entire workload.
The approach we are following is data-driven. We run reports that collect performance data from previously completed applications, and we do that for a workload over a good-enough time window, for example, a week. You run a report and get parameters. You run a second report later, and your workload hasn't changed much.
Then you can directly compare a number of KPIs between the two different executions of those reports. KPIs that you may want to check is number of containers, size of containers, vCore seconds, vMemory, and that’s on a daily basis. Once you do that, then we can analyze applications with respect to how the cluster’s configured today.
With that, we generate recommendations for how we can change the cluster configuration to adapt to the analysis we just did. And in addition, we predict and quantify, "What will be the effect of applying these cluster-wide configurations to applications that will run in the future?"
To give you an example, this is from a map reduce kind of workload, and this specific metric is about memory. So what we do is, first, we create a histogram of applications that are using different default configurations.
For example, we may have a candidate for a memory, like 1,800 megabytes, or another one, 2,048 megabytes. Creating these histograms now for a specific candidate configuration parameter, we can calculate a reward, which translates as a percentage of savings that we predict that we are going to have if we use this specific parameter.
And along with that, we create a calculated risk, that translates to a percentage of the jobs we expect to run in a successful state once we use this specific cluster parameter. And for this specific example we used 4GB, and the result was that we predict that the improvement potential is high. In reality, what happened is that we had a 50% memory savings for 90% of the workload.
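A simplified version of that reward/risk calculation, assuming the report gives us (allocated, peak-used) memory per application. The numbers are made up to echo the 50%-savings-for-90%-of-jobs outcome; a real system would also weight by application frequency and leave safety headroom.

```python
def evaluate_default(apps, candidate_mb):
    """Score a candidate cluster-wide memory default.
    reward: predicted % of memory saved versus today's allocations.
    risk side: % of jobs expected to still succeed, i.e. those whose
    actual peak usage fits under the candidate default."""
    current = sum(alloc for alloc, _ in apps)
    reward = 100.0 * (current - candidate_mb * len(apps)) / current
    fits = sum(1 for _, used in apps if used <= candidate_mb)
    return reward, 100.0 * fits / len(apps)

# Made-up report data: (allocated_mb, peak_used_mb) per application.
# Most apps allocate 8 GB but peak well under 4 GB.
apps = [(8192, u) for u in
        (2500, 2600, 2800, 3000, 3100, 3300, 3500, 3700, 3900, 4500)]
reward, success = evaluate_default(apps, 4096)
```

With these illustrative inputs, a 4 GB default halves the memory footprint while 90% of the jobs still fit.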
Doing things like that give us some very, very interesting results in production. This is a real example from a large financial enterprise, recently. And what we did using this cluster optimization technique is we ended up having a 2X throughput increase, from 900 jobs per day to 1,900 jobs per day.
And at the same time, we had a 2X reduction in cost, from 640 vCore hours per day down to 340 vCore hours per day. These are the usual things to do, but they are not the only ones.
Other things we can do cluster-wide, we can monitor our queues, we can perform a queue analysis. And by doing that, we can find, identify, suboptimal queue designs, or even workloads that are running sub-optimally in queues. A couple of examples. On the left, we have a workload with blue here we represent pending resources.
With black down here is the resources allocated. So in this specific case, the resources we require from that queue far exceed what we actually have. On the other hand, we have a case where we have a high number of resources allocated, and we don't really use them. Both extreme scenarios are not good, and we need to fix them.
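That pending-versus-allocated analysis can be reduced to a small classifier over queue metric samples. The thresholds here are illustrative, not tuned values from any real deployment.

```python
def classify_queue(samples, starved_ratio=0.5, idle_ratio=0.1):
    """Classify a queue from (pending, allocated) resource samples.
    Persistent pending demand that is large relative to the allocation
    means the queue is starved; allocation with almost no pending
    demand means it is over-provisioned."""
    avg_pending = sum(p for p, _ in samples) / len(samples)
    avg_alloc = sum(a for _, a in samples) / len(samples)
    if avg_alloc == 0:
        return "idle"
    ratio = avg_pending / avg_alloc
    if ratio > starved_ratio:
        return "starved"           # demand far exceeds the queue's resources
    if ratio < idle_ratio:
        return "over-provisioned"  # allocation sits mostly unused
    return "healthy"
```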
This is not the only thing we can do when we analyze our queues. We can also identify interesting trends in queue usage, per app type, per user, or per project. And we can identify problems like convoys, a typical problem in workload management.
Or even problem applications, in the sense that you may have applications that are hogging system resources. Or even problem users: users that frequently run problem applications, or users that monopolize a specific queue.
On top of that, and based off that, we create recommendations on how to fix these kinds of issues. We have recommendations for modifying queue settings and queue designs. We can even start thinking about reassigning workloads to different queues, like in the example I showed before.
A workload that was running in a queue that did not have enough resources to support it, we can move to a different queue, or we can change the resources of that specific queue. An interesting thing we can do with all this data is get insights across app types, queues, and data. Another use case could be, for example, notifying a user, "You know, your application is slow because it runs in a queue that is blocked by another application, which accesses a table that has a small-files problem."
These kinds of insights are very hard to find manually, and it's very, very useful for a user to see them in a fast and automatic way. Again, as I said before, just identifying these recommendations and insights is not enough in a production environment where you have thousands of queues running thousands of applications.
So what is really interesting is to support automated ways to apply these kinds of recommendations across the entire environment you have. And one way to do this is with what we call the auto-actions, which are complex actionable rules on a multiplicity of cluster metrics.
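Concretely, an auto-action might pair a predicate over live metrics with an action, along the lines of this sketch. The metric names, thresholds, and action labels are all hypothetical; a real deployment would wire the labels to resource-manager operations.

```python
# Hypothetical auto-action rules: each pairs a predicate over an app's
# live metrics with an action label.
RULES = [
    (lambda m: m["vcore_seconds"] > 1_000_000, "kill_app"),
    (lambda m: m["queue_wait_s"] > 600 and m["queue"] == "adhoc",
     "move_to_queue:batch"),
    (lambda m: m["memory_utilization"] < 0.2 and m["containers"] > 50,
     "downsize_containers"),
]

def evaluate_auto_actions(app_metrics):
    """Return every action whose rule fires for this app's metrics."""
    return [action for rule, action in RULES if rule(app_metrics)]
```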
Example actions could be, "Kill an app," or, "Move an app from one queue to another," or, "Move a workload from one queue to another," and so on. In addition to that, what is really important in cluster optimization is that you really need to understand how you use your cluster resources.
And there are various tools and reports that can help us do this. For example, we can have a chargeback report and we can identify exactly the number of applications, the different app types, that were run in our clusters.
We can identify how our users use our clusters. We can drill down into the application level and understand exactly what’s going on. We can also have cluster workload reports, for example, this heat map, where we can see exactly how we use our clusters.
This heat map, for example, identifies, for every single hour of every single day, the cluster use. So we can identify hot spots, or even patterns of how we use our cluster. This is very useful in on-premises deployment, but it’s even more interesting and useful when you go to the cloud.
Imagine, for example, a cloud migration scenario where someone needs to understand how they should move their applications, queues, or even clusters to the cloud using the best possible migration strategy. If you know exactly how you're using your cluster, you can identify the right instances to use in a cost-effective way.
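Under the hood, the heat map is just an aggregation of utilization samples into a day-by-hour grid, and the hot spots that drive a migration plan fall out of it directly. A minimal sketch, with made-up sample data:

```python
from collections import defaultdict

def utilization_heatmap(samples):
    """Aggregate (weekday, hour, vcores_used) samples into the
    day-by-hour grid behind a cluster-usage heat map."""
    cells = defaultdict(list)
    for day, hour, used in samples:
        cells[(day, hour)].append(used)
    return {cell: sum(v) / len(v) for cell, v in cells.items()}

def hot_spots(grid, threshold):
    """Cells whose average usage exceeds the threshold -- the hours a
    migration plan must provision instances for."""
    return sorted(cell for cell, avg in grid.items() if avg > threshold)
```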
Finally, you can have forecasting reports which, based on how you run your applications and workloads and how you use your cluster, predict how your resources are going to be used in the foreseeable future: resources like CPU, disk, memory, even your queue resources, and so on. So with that, let me summarize. What we presented was, at a very high level, how one can deal with failures: how we can automatically identify, with pretty good accuracy, failing applications and the reasons why an application might fail.
And we also presented, again at a high level, how we can auto-tune applications and how we can optimize our cluster usage. This AIOps space is very interesting. There are very rich opportunities to address distributed application performance management as a combination of AI and machine learning problems. With that, thank you for your attention.
Hey, thanks for this, you guys. But all these AI or ML models that you guys have worked on, are they available for the rest of the world, so that we can play around with them using our existing data and logs and all, or do we have to come to you?
I can take that question. So some of the work is published where we could. For instance, the work that was done on auto-tuning. That is actually published, so all the information is out there. Some of it is not.
And I'm happy to talk to you offline. But the work is actually happening in a startup which both of us are a part of, so there are things that we have in house which we haven't released yet. But we'd love any kind of collaboration: for certain problems you're having, we'd love to get your feedback on the product, and there's a free version of it for a trial.
Hi. For the root cause analysis, where it's noisy: you showed how you analyze a particular problem, focusing on a single stack trace. But if you have to correlate multiple errors, do you do anything specific in that space? I mean, some kind of dependency graph or something that you apply to provide more meaningful insight into what the problem might be, and how it can affect some other area of the system.
So with the technique that we talked about, in a sense, the key aspect there is coming up with that label. All the logs are there. You see problem A happened, then problem B, then problem C. But the root cause might have been some problem A′ that happened much earlier.
The symptom propagation problem, the flood of alerts problem that you’re talking about. So we have worked on that, and some of that is also published. But what we have seen that has the most impact is, “Come up with a way in which you can have that exact root cause labeled, where this deduplication or the causation has been resolved.”
And either somebody manually does that, and they could use any technique, with dependency graphs and things like that. Or, where the most traction happened: once you understand the root causes, inject them. Then we know the ground truth, so you don't have to go through all of that dependency resolution. So that is where the most traction happened.
Now, if you follow the work that we have, some of it is published. We do have work along the lines of how to build that dependency graph. We use a technique called, I think, the codebook approach. It comes from coding theory, and it's how you can do the flood-of-alerts deduplication.
Just to add on that, you are right. We do have a taxonomy of errors, and there is a hierarchy. And we’re continuously building to improve and extend this taxonomy. So we do have some kind of dependency graph.