We're hiring! View our open positions!


Using Apache Spark to Tune Spark, Spark + AI Summit 2018

At the Spark + AI Summit hosted by Databricks in June, 2018, Adrian Popescu and Shivnath Babu of Unravel spoke in two sessions. This one is on Using Apache Spark to Tune Spark, and it was […]

  • 18 min read

At the Spark + AI Summit hosted by Databricks in June, 2018, Adrian Popescu and Shivnath Babu of Unravel spoke in two sessions. This one is on Using Apache Spark to Tune Spark, and it was a very well-attended session. Everyone is faced with multiple users, many different applications running simultaneously, and competing sets of SLAs. Shivnath Babu, CTO and co-founder of Unravel, gives three different real-world case studies, supported by Adrian Popescu, data engineer at Unravel, with more than eight years of experience working with performance monitoring and optimization of big data applications.

Transcript of video; download the slides here:

– [Shivnath] Hello everyone. It’s great to see a nice and crowded room. We’re just coming from another session where we were the speakers, and I don’t know how the organizers nicely put us back to back but we’re right on time. So, I’m Shivnath Babu. I’m CTO and co-founder of Unravel where we’re actually building a platform to simplify the management of applications and operations for big data clusters.

And I’m also an Adjunct Professor of Computer Science at Duke University.

– [Adrian] Hello. I’m Adrian. I’m a data engineer at Unravel. So, I have more than eight years of experience in working with performance monitoring and optimization. So, I’m really focusing on optimizing the performance of big data applications.

– [Shivnath] Here we go. So, it’s clearly an indication that lots of applications have been built on Spark, right? So, from the keynotes today, we see all these interests are in the room for Spark applications from machine learning applications, streaming applications or mundane BI and data warehousing applications.

A lot of things are being built on Spark. This is terrific, right? But, let’s face it: Running apps and production on Spark apps is just very, very, very hard, right? Like, I’ll actually give you a whole bunch of examples. I’m sure, how many of you here have run a Spark app and it just failed: out of memory? With show of hands. Right?

So, this problem, like, had a picture of a data scientist but it’s very common problem that happens, right? Or once you get out and make the application production ready it’s reliable and whatnot, the app is just taking a long time to run. So, it’s just too slow. It’s not meeting the performance requirements, right? Or maybe you have some strict performance requirements especially with all of these streaming and real-time applications coming, they have to be real time and there are strict SLAs.

The app is missing SLA. Like, now what to do about it? But on the operations side of things, application sometimes are generated by BI tools and there could be some really hairy bad SQL that ends up consuming a lot of resources in the cluster, reduces throughput, affects a lot of other applications on your multi tenant cluster and the ops team is not happy. Dealing with these challenges is not easy because app performance can be affected by many different things such as the application itself, right?  

The joins that are being used in the application or sometimes with Spark like the people actually hand-code a whole bunch of different things, right? And all the way from the application to how the scheduling and the multi-tenancy is being configured or the data layout is being configured or I’m running on the cloud and then basically there’s some noisy neighbors affecting my applications.

Lots and lots of things not to mention machines and like JVM just become worse over time. A lot of things can affect performance and the entire complexity of the ecosystem is not helping, right? From the outside, like, the Hello World in Spark can be just done in two lines, but it can be done, unfortunately, using many different like submission modes, CLIs, Spark Submit and lots of different kinds of applications.

On one side is all like really powerful SQL and streaming and AI and all of that can be done in a single platform, but they just add to the complexity. At the same time there are a lot of these like resource management options now, right? You have the good old YARN and now Kubernetes and Mesos. Clusters can auto-scale up and down like containers’ deployments.

So, a lot of complexity. So, the theme that we actually have been like kind of developing and working on for a long time is how do we get the best of both worlds? How do we like make Spark simple and easy to use for Devs and like all the operations teams? At the same time like convert all that complexity into easy to understand insights so that productionizing the application is easy, right?

So, we’re going to like show you how the problem can actually be treated as a data problem. What do I mean by that? Right? So, the first and foremost thing is if you look at a Spark cluster or even a big data kind of cluster, there is a lot of different kinds of data that you can bring and collect and bring together, right? There’s all the resource management data if you are using YARN, for example, the Resource Management API.

There is lots of different kinds of information about what is going on in the cluster from any level. All the applications scheduling containers what are running. Everything can be brought in, right? Or the historical information, not just live information, what happened? What was running? The history server API, not to mention you can actually go to even every container, execute driver container, collect and verifying very fine information from many different levels like the JVM level, the actual container level or even the host level.

It’s just like that. Now you have metadata, right? Data about tables and schema and statistics, right? Or if you’re running SQL the actual execution plan that is being used: logs, configurations, you name it. Can we actually bring all of this data together into one system, right? One platform and then start to apply all the kinds of algorithms you’re developing in Spark on this data, and to convert that into insights automatically, right?

That’s what it would be ideal to do. And once you’re starting along those lines, you will start basically hitting this problem. What is the right tool or the right platform to use? And if you look at the picture that I showed in the beginning, like Spark, right? What all of you are doing with Spark, right? You’re taking some data, tweets or like customer, like access information and things like that and converting that into insights.

Why can’t we do that with Spark and apply that to all of this system monitoring data? Let’s get all of that data, let’s keep bringing it into this entire like a Spark ecosystem. Now, what do we do with the data? And that’s actually a pretty important question to ask and answer, of course.

Like, applications and cluster management. Can we break that down into like the smaller tasks and then address those tasks with intelligent algorithms developed in Spark, right? So, we are actually going to spend the bulk of the time in this talk, having worked on many of these problems in isolation and having built them over time, and we’re seeing a lot of similarity.

So, it might be good to, like, bring all of them together in a similar abstraction and one platform to solve, and we’re going start with failures, like, that nasty problem that we mentioned, and Adrian is actually going to lead you to some of the ideas that we have developed and how they match with Spark.

– [Adrian] Yeah. So, basically let me start with an example, right? So, here a user issues a query. So, this is a SQL query that it runs using Spark SQL and the application fails. So, in order to figure out what happened and to fix it, so the user has to download the logs, interpret them and figure out what happened, and then provide the fix.

In many of the situation is just very difficult. So, in this case, we have like five level of stack traces of this form where the logs are extremely confusing and hard to understand. So, obviously, doing this thing manually takes a lot of time, and a lot of expertise is required—which is something that is being acquired with time—and it’s not something that everyone would spend on it…so, like, data scientists and analysts.  

What we built at Unravel is a tool which basically analyzes these logs and applies AI and machine learning in order to bring in a panel like here illustrated, where essentially we figure out what the root cause is and we provide the English text and English description of it, what was the root cause and then what you can do in order to address the failure.

So, for this particular case, we couldn’t run the Spark SQL query because actually the input data, save table was not available on HDFS. So, the fix would be like add that table, add that data set on HDFS and try again. So, obviously using such a system where it will help a lot, it would reduce troubleshooting time from days, hours to seconds and we also improve the productivity of data scientists and analysts.

So, let us look at the system that we need in order to automate the root cause analysis. So, a lot of information is contained into the executor logs. So, in order to identify the root cause, so we need to take these logs, we need to parse them, we need to generate some feature vectors that then can be correlated with a root cause.

So, essentially, there are two problems. One problem is to obtain the feature vectors and the other one is to actually know what the root cause is. Given that we know what the feature vectors are and what is the root cause, then we can use a supervised learning technique and view the predictive model that we can use for prediction. So, now one of the biggest challenges here is that we need to know what the root cause is which it’s, in many cases, it’s very difficult to know.

And for this, we require a lot of expert knowledge where the expert will actually look into it, figure out what happened, and then label every single failure with a corresponding root cause. So, what we built, essentially, is the taxonomy of failures where here the taxonomy is essentially it’s like a classification of all of the possible failures.

So, this is a tree where the nodes or categories, the sub-nodes or subcategories, and then the leaf nodes are actually the root cause labels. So, for instance, here we can see that we have configuration errors, data errors, resource errors, deployment errors. And then the data errors can be sub-categorized into input path, not available errors, number format exceptions, JSON parsing exceptions, and so on.

So, this is the methodology that we use. We still have the challenge. Right? How can we actually label the root causes? And one way to proceed here is to use an expert. That takes a lot of time, it requires expert knowledge.

So, that’s one way of doing it. But the second approach that we can also use is actually to automatically inject known failures. So, I will go into the second approach into more details, but to achieve that, essentially, we built an environment at Unravel where we collected a lot of data on clusters using either in the cloud, or on premises where you run a lot of workloads and then we’ve injected known failures.

So, in terms of workloads that we use, we use the variety of workloads such as batch processing, machine learning, SQL, Spark Streaming. And then we injected failures. Part of these failures were failures taken from our customers and partners. We keep on updating the set of failures that we observed there and also failures that we basically obtained in our own environment.

So, as soon as we have a set of failures, so the procedure that we use essentially is to inject these failures and then collect both feature vectors and then collect these failure labels and build the prediction model. So, let us look a little bit deeper into this step.

So, essentially, given an application, we take a failure scenario. So, we take a failure, we can inject either a set of configuration parameters which would cause the application to fail, or some errors into the program itself at the semantic level. Then we monitor the application and then we collect the logs, right? Given these logs, we extract the features, and then we build the feature vectors.

So, for every pair of feature vectors and then the known failure, so we can label them and we can add them into a database of labeled failures. The set of failures that we look for and looked into are here; illustrated by some categories of it such as input data or out of memory errors.

Out of memory can be at the JVM level, also the YARN level. We also have binary compatibility errors, running out of space on the disk. Also semantic error such as running transformation within another transformation, arithmetic errors, runtime errors, invalid configuration errors.

So, using this approach, we essentially collected a large number of applications and a large number of logs of applications and a large number of labels that we use to build the model. So, let us see now how we can actually, given the input logs, how we can create the feature vectors.

So, to extract the feature vectors, so, given the input raw data, so we parse these logs, and then what we do, essentially, we tokenize the logs, the granularity of words and class names and we built a dictionary of these words. So, for this particular example that I illustrated here, so we have like a token for java.lang.OutOfmemoryError.

So, that’s a token. Then the message that follows every single word is a token. And then… For the stack trace, again, the full stack trace would be categorized as word, and the reason for that is to also include like key information such as where the application failed which was the line of code.

So, this is important to maintain the correlation between tokens. So, in terms of actually building the feature vectors, so there are different approaches that we can reuse from data mining and  text search in mining. So, one traditional approach is to use the TF-IDF where essentially TF-IDF it computes for every single word from the vocabulary, a counter which reflects basically the importance of the word into a document.

There are also other approaches that we use such as Doc2Vec. Essentially, the difference is that Doc2Vec also looks at the placement of words within a paragraph or a document, so it also looks a little bit at the locality of the word in a context. Under the covers, Doc2Vec uses neural networks and in the end it produces similarly an input vector that we can use for training the model.

So, let us come back at the system architecture. So, I talked so far about how we collect the feature vectors, how we create…basically, how we collect the labels. So, given that we have these two, so we can use a supervised learning algorithm to build the model which can be used for prediction.

So, then given that we have a new failure scenario, we can similarly parse the logs, obtain the feature vectors, and given this feature vector, use the predictive model to actually predict the root cause. So, let me focus a little bit more on the learning algorithm per se. So, in terms of learning algorithm, we use both shallow learning and deep learning.

So, for shallow learning, we use regression models and three models such as random forest. For deep learning, we use neural networks. So, both types of models are available in Spark and are easy to use and actually execute at scale using the Spark infrastructure.

Let me show you also some results that we obtained by doing this work. So, in this experiment, what I show here, so here we use the data collected in our lab environment where essentially we split the data into four, right? And one-fourth we used for training the models and then the rest of the three-fourths we used it for testing.

So, here in these results that I show, we use the logistic regression and random forests. And we can see that the accuracy score in actually predicting the true root cause for these failures…it’s quite good and encouraging. So, it’s above 95% for TF-IDF and close to similar to Doc2Vec. So, this is very encouraging and showing that it’s possible to actually use AI and machine learning techniques to predict the root cause of the failures.

And this is something very useful for Spark developers and practitioners such that they can speed up their work considerably without troubleshooting manually the failures. So, we have more details about basically this work. So, we had also talk in Strata, New York, 2017, so you can look for that into more detail or you can just discuss it with us.

So, Shivnath now will continue on the SLA management for real-time data pipelines.

– [Shivnath] Thanks Adrian. So, Adrian talked about this fundamental challenge of like, how do you try to automate root cause analysis? Right? In a kind of a sort of scoped environment like in Spark and quickly trying to at least point out what the… Like, you can never really point out what the root cause is, but at least find the administrator or develop in the right way. Right?

I’m going to like, as promised, pick two more examples, right? One example is what we are now seeing a lot in practice. People are putting together streaming pipelines, right? In a very common architecture, this will have Spark, Spark Streaming for the computer side of streaming. And a system like Kafka for the data ingestion in a similar fashion and maybe state like the intermediate state sold in NoSQL system like a Cassandra or HBase, right?

And here, as you can imagine, a lot of things can go wrong, from the resource contention that a Spark Streaming might run into or some sort of a misconfiguration or from the HBase side or the Kafka side, maybe a partition is not good or like the way in which you put them together and the dependencies across partitioning and like number of tasks in Spark. A bunch of things could actually go wrong, right?

And it’s basically with streaming pipelines producing results in a streaming fashion and sticking to SLAs becomes super important. So, we’ve actually… What I’m going to show you in the next few slides are cases where good forecasting techniques as well as good anomaly detection techniques can actually simplify the management of streaming pipelines significantly.

Let’s start with this one example. So, what you’re seeing here is from a real life example where a customer is running a sentiment analysis application, Spark, HBase and Kafka. What you see as those green bars is the rate at which data is coming in. It’s coming in at the rate of around like forty thousand records a second and there’s a…in this case, the customer has a strict latency SLA of three minutes, meaning, from the time data came in and the time the results are actually given out should not be more than three minutes, right?

If you see carefully, you can see that black line in between, that’s actually measuring the delay as of now at any point of time between like the time data came in and the results were given out, and you can see that trending up, right? And below, if you notice, there is a projection of forecasting that can actually be done of, like, right now the way in which things are trending, the latency SLA might actually be missed in around like two more minutes kind of time frame or ten more minutes kind of time frame, right?

So, getting advance alerts about these sort of like problems can actually be used to trigger SLA-aware auto-scaling, for example, or at least avoid situations where you have to do firefighting. But it’s not that easy because, like, when you start to do forecasting, you have to distinguish between real trends and these sort of false positives, which actually brings up this entire dimension of anomalies.

Like, I’m seeing a lag, as I was showing in the previous slide, is that a truly like sort of expected kind of thing or it was unexpected, right? Maybe this is something that happens at 9:00 a.m. every morning and I shouldn’t be reacting to it and because that can lead to like more problems than if I were to like leave it by. Right? So here just a quick note on anomaly, like, an anomaly is something that is sort of unexpected that needs your attention.

And as you start to think about anomaly and algorithms for anomaly detection, it quickly becomes that you have to balance between what are called false positives and false negatives where like you alerted, there was really a spurious alert or you missed a real alert. Right? Thankfully, in the Spark community there has been excellent work on what this falls under, like, time series analysis.

So, I’m just citing like four of the interesting kinds of work that has happened here, automatic anomaly detection to like forecasting techniques and whatnot. It’s basically like a treasure trove of techniques that can be applied. And in the Strata, San Jose, that happened a few months back, we showed how you can apply these sort of techniques to do streaming SLA management and especially deal with easy operations for Kafka, right?

So, this is an interesting like second technique of how you can bring in machine learning and Spark to simplify management of real-time pipelines. I’m going to pick a third example which is like application auto-tuning, right? So, this is basically the problem where if you have to tune any application especially one in Spark, there’s this dizzying set of like configuration parameters, container sizing, parallelism, data layout, SQL, hash join.

Like, all of these sort of things, not to mention caching, right? So, what is really is going on under the covers is based on a setting of these parameters, there is some response surface, actually, there’s multiple response surfaces, one that measures how much time the app might take, right? How much resources the app might consume. And these response surfaces tend to be unknown, right? And a lot of this during today and some being like trial and error.

Like, somebody is trying to figure out what will happen if I were to, like, change the size of a container, for example. So, we have been working in an interesting new world where if you’re a developer and operations person, you can just come and say, “Look, here is my application and here’s the goal I want to achieve. I want to make it more reliable or I want to speed it up.”

And the system can work under the covers, basically, to get the user to this goal as quickly as possible. So, essentially, what you’re seeing here is a hypothetical kind of like scenario, where the user mentioned my app is taking like ten minutes to run. I want to improve its performance. Immediately, the system is able to say, “Hey, I have a configuration for you that’s actually thirty percent faster.”

Sometime later maybe the user is checking email and she comes back, there is a verified configuration that can run it sixty percent faster and so on. Right? Won’t this be a great system to have? And now the technology has sort of reached a level where it’s possible to build these sort of systems. The latest darling is like reinforcement learning, you can apply reinforcement learning techniques like what has been built to be, like, human players and go, for example, similar kind of techniques where the problem can be modeled as a state space and moving from one state to another you get some reward and basically you can use like learning techniques to kind of build an agent that can quickly get an application from like a bad configuration to a good configuration, right?

Actually, you don’t even have to like reach out to these deep learning techniques and things like that. There are other more amenable and easier techniques using like negotiation surface modeling and things like that that have been built by other communities. We’ve actually worked on this problem. This is something which we gave a talk half an hour back in a different session on this particular problem.

Our observation has been that on this particular problem, you don’t even need, like, reinforcement learning techniques and things like that, but there is some opportunity to further improve the state of the art using these techniques, but unfortunately, Spark is still catching up on that particular area. There are some higher level, like, software like Keras and BigDL that is simplifying that, right? How to convert the system management problems and apply deep learning on that to get to good performance very, very quickly.

And as I mentioned, please check out our talk. So, in summary, the whole context, if you remember, of our talk was, well, Spark performance management is hard. Can we simplify that using Spark itself? Right? And it’s all about using the right techniques and the right tool for the job. And I gave you three examples of how, if you bring in the right algorithms and for the right abstraction of these tasks, then you can bring all the power of Spark and on all this suite of telemetry information you can gather to simplify Spark performance management.

And we would love to kind of get your feedback. This is software that we have been releasing as a company, Unravel. We would love to get your feedback. Please check out our free trial. And even better, we would love you to come join our team because it takes a team to build these sort of solutions.

Thank you.