

How Unravel Complements Cloudera Manager

It used to be really hard to install and manage all the pieces that make up a cluster. But now Cloudera Manager automates installing, upgrading, and monitoring all your systems. Just as Cloudera Manager makes systems management easier, Unravel simplifies managing the performance and utilization of the applications that run on those systems. Together, Unravel and Cloudera Manager provide a complete solution to ensure the full stack is reliable and your applications run smoothly in production.

Watch a demo of how Unravel complements Cloudera Manager

Sign up to get immediate access to the on-demand webinar to learn more about how Unravel complements Cloudera Manager.

Unravel and Cloudera Manager for Big Data APM Demo Transcript

So this is one of the main entry points to the Unravel product. You can log on to Unravel and see all the applications which are running in the cluster. It’s easy to select any time frame: you could come back and say, “Show me all the applications that were running in the cluster between 2:00 a.m. and 3:00 a.m. on this specific day.” Right. And in just a few clicks you can look at all the applications in this cluster, which here are primarily Spark applications.

But we support all the different frameworks which run on the Cloudera stack. That could be MapReduce or it could be Hive. We also support frameworks like Pig and Cascading. And recently we’ve added support for Impala and Kafka as well. And within Spark, we support both Spark and Spark SQL.

So this gives you a quick way to slice and dice. You can come and look at all the applications run by a particular tenant. You can also look at it from a specific queue perspective. So it gives you a quick, easy interface to search for applications. And this is what we call the self-serve interface. Many times our end users are submitting these queries using the CLI or an end-user tool, and there’s no easy way for them to come in and take a look at them.
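To make the slice-and-dice idea concrete, here is a minimal sketch of filtering application runs by time window, user, and queue. The `AppRun` record, its field names, and `find_apps` are hypothetical names invented for illustration; they are not Unravel's data model or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AppRun:
    """One application run. All field names are assumptions for this sketch."""
    app_id: str
    framework: str  # e.g. "spark", "hive", "mapreduce"
    user: str
    queue: str
    start: datetime
    end: datetime


def find_apps(
    runs: Iterable[AppRun],
    window_start: datetime,
    window_end: datetime,
    user: Optional[str] = None,
    queue: Optional[str] = None,
) -> List[AppRun]:
    """Return runs that overlap [window_start, window_end).

    The optional user and queue arguments narrow the result further.
    """
    matches = []
    for run in runs:
        # A run overlaps the window if it starts before the window ends
        # and ends after the window starts.
        if not (run.start < window_end and run.end > window_start):
            continue
        if user is not None and run.user != user:
            continue
        if queue is not None and run.queue != queue:
            continue
        matches.append(run)
    return matches
```

Calling `find_apps(runs, two_am, three_am, queue="etl")` would answer the "show me everything that ran between 2:00 a.m. and 3:00 a.m." question for a single queue.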

You could go to Cloudera Manager, but its capabilities are focused on the system side of things. The end user is really concerned with their own application, and with Unravel we make it easier for the end user to just come in and look at the application they are most interested in.

Let’s role-play. Let’s say I’m a developer. I’m trying to optimize a Spark application which I wrote, and I know that this application completed in 36 minutes. But I know my business user wants this application to run much faster. Right. They have an SLA to complete this within ten minutes, and I’m really scratching my head trying to figure out why this application took 36 minutes.

And what is it that I can do to really make this application much faster?

So with Unravel I was quickly able to find that Spark application: I sorted by duration and found my Spark application which took 36 minutes. One of the first things I could do is check with an operator: “Hey, I ran this Spark application a few days back. It took much longer. Can you help me? What’s going on?” And many times the operator would like to see what is really happening from a Cloudera Manager perspective, because there could be a lot of systemic issues which are causing these applications to fail or not perform correctly, or there could be high CPU utilization in the cluster.

So you could come in and look at the Cloudera Manager interface to understand what happened on the 10th, right? How much CPU was being used in the cluster? Clearly it seems the cluster had enough resources, so it was not bottlenecked from that perspective. You also want to look at the network statistics, right? Many times the network becomes the bottleneck, and you want to understand what happened from a network perspective at that time. Cloudera Manager is a great tool to help you look at that from a systems perspective. We really want to understand: does the cluster have the right resources from a CPU, memory, and network perspective?

The next thing you want to know is: was the Spark service running at that time?

So with Cloudera Manager, just through a few clicks, you can come in and look at how your Spark daemons are doing. Are they up or down? I can scroll to that time on 8/10 when I ran the Spark application, and just by looking at a few charts I can see that my Spark service was up at that time. I can also look at all the instances and see what was going on on those instances. I can also understand, from the configuration perspective, what the key configurations associated with that are. Many times you also want to see your overall hosts view and check that all your hosts are up, because if a couple of the hosts in a cluster are down, clearly your jobs will not run optimally.

So Cloudera Manager is a great entry point from a systems perspective. It gives you a bottom-up view of what happened from the infrastructure perspective: were there really gaps that could affect your cluster? You could also come here and search for that application from the application view. You can search for that application which took over 30 minutes, and I’m able to find it. It gives you the high-level KPIs: this application ran on such-and-such date, on 8/10, and it took 36 minutes, right? Very similar to what you saw in the first interface.

But at this point you’re left to yourself, right? Cloudera Manager is really focused on the systems side of things. It says, “Hey, you ran this application on the cluster,” and from there, if you click on it, you end up going to the web UI, which gives you some history, and from here you can start going to the Spark web UI. You keep going on that path, right? And once you go to the Spark web UI you start chasing different parts, trying to digest that information, and eventually you get to the logs, trying to figure out what’s going on.

So with Unravel we’ve taken a very different approach.

So when you click on that application you get to see everything which is related to the application in one view.

And this is really important for us because we’re trying to tie all the things related to the application very holistically into one picture. Right. An end user may not have all the details of how that Spark job ran in the cluster, but with Unravel we abstract that and give you the key pieces of information that are most relevant to you. If I’m the end user, I have questions: why did this application take 36 minutes? What can I do to improve the performance? And you can quickly see that the key KPIs are nicely laid out.

You can also see that there were close to 18,000 tasks associated with that. I can also see how it ran in the cluster: from an overall container perspective, how much memory was being used. I can also understand the fine-grained resource usage, right? Spark is a very memory-intensive framework, so you want to understand at the container level how much of the resources are being used. This is a piece of technology that Unravel has created, where we understand the context of these applications running in the containers and what is really happening in the container. We feed that information to Unravel, and you get to see at a very fine level what is going on at the container level. A lot of times you want to look at heap size issues and other things, and those can quickly become evident looking at the Unravel web UI.

At the high level we also show you the different stages which are generated as part of the Spark application. We have a very simple Gantt chart view which tells you where time was being spent. So if you’re really trying to optimize your Spark application, it’s important to figure out which stage to optimize.

You can look at all of the data, and it is available in various views. But the key is to quickly get in and resolve the problems, and we’ve always kept that in mind as we’ve built this product, so you can see how time is being spent. And in fact if I click on that stage it takes me to the next level of detail.

So notice how I started at the highest level of abstraction, which is my application, and then slowly but surely I’m drilling down into the individual components, and within that component I can really understand how the tasks ran. Right. So it gives you a timeline view: how the tasks were distributed, how much time each of the tasks took, how much data got crunched, and so on and so forth. Right.

And then all the way to how these particular tasks map to the underlying hosts. So you start from the core application and drill down to the individual components. These kinds of visuals are very important because if you have things like skew, the visuals built into Unravel can help you spot those very easily. You typically find one task which is taking much longer than the others and crunching a lot more data than the others. Just looking at these visuals gives you a lot of insight which is very difficult to see otherwise.
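The "one task taking much longer than the others" pattern can also be checked numerically. The sketch below flags tasks whose duration exceeds a multiple of the median. The function name, the input shape, and the default factor of 3 are assumptions for illustration, not how Unravel detects it.

```python
from statistics import median
from typing import Dict, List


def find_skewed_tasks(durations_s: Dict[str, float], factor: float = 3.0) -> List[str]:
    """Return ids of tasks that ran longer than `factor` times the median.

    durations_s maps a task id to its duration in seconds. The factor is an
    arbitrary threshold chosen for this sketch.
    """
    if not durations_s:
        return []
    typical = median(durations_s.values())
    return sorted(
        task_id
        for task_id, duration in durations_s.items()
        if duration > factor * typical
    )
```

The median is used rather than the mean so that a single very slow task does not pull the baseline up and hide itself.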

You would otherwise have to put your own systems and processes in place, plus your knowledge and a lot of best practices, to get this. We’ve tried to encompass all of this in our product.

But the key value prop of Unravel goes beyond that. As I mentioned, we not only present you with all of these visuals, which is very important, but we go beyond this. We do a lot of the analytics behind the scenes, where we’re looking at these metrics, logs, and configuration data and then telling you: look, we understood the context of this application.

We also understand what other applications are running in the cluster. We also know how much resources and capabilities are available in the cluster. Based on all of that, we do the analytics and tell you: hey, there are a lot of things you can do to improve the performance. And we point them out to you. We say, okay, these are some of the current configuration parameters, but you could do much better by changing these configuration parameters.

And the reality is that for all of these big data systems there are so many configuration knobs that it’s impossible for an end user or a business analyst to be really on top of all of this. Many times they end up using the default configuration parameters, and that’s not the right thing to do. So with Unravel, since we understand all of these applications which are running in real time, we can tell you very objectively what your recommended configuration parameters are, right?

In this case we’re looking at the key Spark parameters, and we are suggesting concrete values that you can use as part of the next run to improve the performance. And we also give you the reasoning behind it.
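As a sketch of how recommended values could be applied on the next run, the helper below turns a dictionary of Spark properties into a `spark-submit` command line using the standard `--conf key=value` form. The transcript does not say which parameters or values were recommended, so the property values in the usage note are placeholders, and `spark_submit_command` is a hypothetical helper.

```python
import shlex
from typing import Dict, List


def spark_submit_command(app_path: str, conf: Dict[str, str]) -> List[str]:
    """Build a spark-submit argument list with one --conf flag per property.

    Properties are emitted in sorted order so the command is reproducible.
    """
    args = ["spark-submit"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app_path)
    return args


def as_shell(args: List[str]) -> str:
    """Render the argument list as a single shell-quoted string."""
    return shlex.join(args)
```

For example, `as_shell(spark_submit_command("my_app.py", {"spark.executor.memory": "8g", "spark.executor.cores": "4"}))` yields a command you could paste into a terminal. The values `8g` and `4` are illustrative only.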

There are years of science behind it, where we’ve understood many of the configuration parameters and looked at how the containers are being used or underutilized. Many times we also give you recommendations above and beyond just configuration parameters, where we tell you things like whether you can cache a certain RDD, right?

So with Spark applications you can get significantly more performance by caching certain RDDs, so we can point out which RDD you can cache and also help you map it to the line of code where you can go and add the cache statement.

So in this case I took the recommendation which Unravel provided and then reran that same Spark application, which now ran in seven minutes 12 seconds. Right. So you notice a significant performance improvement, from 36 minutes to seven minutes 12 seconds, and as you noticed it just took a few clicks.

So if you’re an end user or business analyst who’s got access to Unravel, you can come and take a look at it, see what Unravel is saying, and start collaborating with the Ops team and say, “You know what, take a look at this. Unravel is saying I need to change these configuration parameters.” You can have both the Ops team and the development team take a look at this, and within a few minutes everybody can agree that these are good configuration parameters to change.

You can rerun your Spark application with these changes and get a significant performance improvement. Going back to the poll: for many of you who are struggling with performance and trying to improve the speed of these applications, Unravel provides a quick path to help you get unblocked and solve a lot of these issues.

So in addition to looking at application speedups, there’s a lot of other value that you are provided in Unravel, so let me quickly go through a few other key capabilities. We do not have time to cover all of them.

But another key thing which we do in Unravel is help identify or chase down errors, because many times it’s not just application speedup. It could be a new Spark application, and it failed, right? And today there’s a lot of back and forth trying to figure out where the error is, because you have to go and get the line of code in a particular executor, and that executor could be running on node number 200 or something like that, right? So fetching those logs and trying to understand them becomes really painful.

With Unravel, in just a few clicks you can come here and we can tell you exactly where the error is and what the problem is, right? So we’re giving you some insight on what is going on, and if you click on the other views we’re pointing out exactly what those problems are.

So we’re showing you these warnings, we give you these fatal exception messages, and we’re extracting the stack trace from the appropriate log. Right. So here there are some permissions issues which are causing that particular Spark application to fail. Notice again, just through a few clicks, Unravel helps you get to the root cause of the problems.
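For a sense of what pulling a stack trace out of an executor log involves when done by hand, here is a minimal sketch that returns the first Java-style stack trace found in a block of log text. The regular expressions assume the common layout of an exception line followed by indented `at ...` frames. The function name is hypothetical, and this is not a description of how Unravel parses logs.

```python
import re
from typing import List

# An exception header, e.g. "org.example.SomeException: message".
_HEADER = re.compile(r"[\w.$]+(?:Exception|Error)\b")
# An indented stack frame ("\tat pkg.Class.method(File.java:1)") or "... 3 more".
_FRAME = re.compile(r"\s+(?:at |\.\.\. \d+ more)")
# A chained cause line.
_CAUSE = re.compile(r"Caused by: ")


def extract_first_stack_trace(log_text: str) -> List[str]:
    """Return the lines of the first stack trace in log_text, or [] if none.

    A trace is an exception header line immediately followed by at least one
    stack frame, plus any further frames and "Caused by:" lines.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if not _HEADER.search(line):
            continue
        # Require a frame on the next line so ordinary log messages that
        # merely mention an exception are not mistaken for a trace.
        if i + 1 >= len(lines) or not _FRAME.match(lines[i + 1]):
            continue
        trace = [line]
        for follow in lines[i + 1:]:
            if _FRAME.match(follow) or _CAUSE.match(follow):
                trace.append(follow)
            else:
                break
        return trace
    return []
```

In a real cluster the harder part is the one the speaker describes: knowing which node and which executor log to fetch in the first place.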

It lets you quickly come in, understand those issues, fix those issues, and more. So in addition to looking at ad hoc applications, the other thing which we do very effectively is look at data pipelines, right? Because many times it’s not just the individual Hive application or Spark application or Impala query. You also want to look at your end-to-end data pipelines. And that is very important because a pipeline is a combination of your Hive jobs and Spark jobs, and that view is very difficult to get to today.

You could go to the YARN web UI or you could go to another console, but they’ll only tell you about your individual MapReduce components or Spark components.

But what you’re really trying to get to is: tell me about my entire workflow. That workflow is really composed of a bunch of Hive jobs, Spark jobs, and MapReduce jobs, and you want to look at the entirety of that. That is a very powerful view which we’re building in Unravel, where again you can come in and say, “This is my data pipeline, prod model Gen2, which runs daily, and I really want to understand how the performance has changed over a period of time.” So with Unravel, with just a few clicks, you can do a delta of the performance between a good run and a bad run, and in the bad run you can figure out which component is the long pole. You want to spend your time and effort optimizing the component which took the most time.
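The good-run versus bad-run comparison can be sketched as a per-component difference of durations. The component names in the usage note and the function name are made up for illustration; the idea is only that the component with the largest increase is the first place to look.

```python
from typing import Dict, Tuple


def slowest_regression(
    good_run_s: Dict[str, float],
    bad_run_s: Dict[str, float],
) -> Tuple[str, float]:
    """Return (component, extra_seconds) for the largest slowdown.

    Both arguments map a pipeline component name to its duration in seconds.
    A component missing from the good run is treated as taking 0 seconds
    there. Raises ValueError if bad_run_s is empty.
    """
    if not bad_run_s:
        raise ValueError("bad_run_s must contain at least one component")
    deltas = {
        name: duration - good_run_s.get(name, 0.0)
        for name, duration in bad_run_s.items()
    }
    worst = max(deltas, key=lambda name: deltas[name])
    return worst, deltas[worst]
```

With a good run of `{"hive_extract": 300, "spark_transform": 600}` and a bad run of `{"hive_extract": 310, "spark_transform": 1500}`, the Spark step is reported as the long pole with 900 extra seconds.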

So this is the other sort of view in terms of applications, where in addition to ad hoc applications we also look at your repeatedly running workloads.

The other component which we have in Unravel helps you on the operational side of things, where we look at your overall cluster and tell you what is going on. This is important because if you are the operator you’re not just looking at individual applications. You want to look at your cluster holistically and answer questions like: what is really happening in the cluster? I see a spike in my usage.

And you could see that in Cloudera Manager, but you would want to answer that from the application perspective. You want to come in and say, “Which application took the most time or resources in my cluster?” Unravel helps you answer those questions, because the forensic view helps you figure out which applications were running in the cluster at a particular point in time and how much resources were being used.

And armed with this knowledge you could create some policies. You could come in and say, “Yeah, I’m seeing some users abuse my cluster repeatedly in certain queues, and I want to be able to create some policies to fix that.”

So with Unravel, we give you what we call auto actions. Basically, we give you a mechanism to go and create those policies, where you can specify certain things such as how much memory is being used or how many other jobs are pending. And then we can send you appropriate alerts. We can send you an e-mail, or we can take some action based on it: either move those outlier jobs to a different queue or kill the applications.
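A policy of this kind boils down to a condition plus an action. The sketch below models one memory-based rule per queue and reports which running applications violate it. The class names, fields, and action strings are all hypothetical and do not reflect Unravel's auto action configuration format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class RunningApp:
    """A running application as seen by the policy check (assumed fields)."""
    app_id: str
    queue: str
    memory_gb: float


@dataclass(frozen=True)
class Policy:
    """Apply `action` to apps in `queue` that use more than `max_memory_gb`."""
    queue: str
    max_memory_gb: float
    action: str  # assumed values: "alert", "move_queue", or "kill"


def violations(policy: Policy, apps: Iterable[RunningApp]) -> List[Tuple[str, str]]:
    """Return (app_id, action) pairs for every app that breaks the policy."""
    return [
        (app.app_id, policy.action)
        for app in apps
        if app.queue == policy.queue and app.memory_gb > policy.max_memory_gb
    ]
```

The function only decides what should happen. Sending the e-mail, moving the job, or killing it would be done by whatever system consumes the returned pairs.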

So this is just a brief tour of some of our other core capabilities. Happy to do a deep dive demo for those of you who might be interested.

We can do a deep dive demo of all the other capabilities that are in Unravel, but what I really wanted to highlight is how we complement Cloudera Manager, which comes from the systems side of things. We are taking an app-first approach, an application-first approach, where we show you how you can look at applications holistically, root-cause the performance issue or other issues, and then help you go and fix those.