APM market overview by Bernd Harzog, co-founder and CEO of APM Experts
Demo and applications by Bala Venkatrao, VP of Products at Unravel Data
What’s covered in this insightful interview is the historical taxonomy of the application performance management software (APM software) category and vendors. Just recently, Cloudera and Hortonworks announced a merger, so change continues in the APM part of the performance management market. Also this discussion covers the unique challenges of managing performance, throughput and reliability for big data applications, together with an overview of the latest improvements and a live demo of Unravel with sample case studies…where Unravel applies AI from millions of big data runs to anticipate and head off problems.
Transcript of video (subtitles provided for silent viewing)
Taxonomy of the APM software category
Gill: Hi everyone, and thanks for joining for today’s webinar, Applying Machine Learning to Solving Performance Issues for Your Big Data Applications. Today’s webinar will take about 40 minutes, and after the presentation, we’ll close it out for Q&A. Just a couple of housekeeping items before we begin. If you have any questions please submit them in the GoToWebinar panel any time and we’ll go through them during the Q&A portion. Today’s webinar is also being recorded and we’ll make sure that you get a link after the webinar.
Gill: What we’ll be covering today would be the historical taxonomy of the application performance management category and vendors. Also the unique challenges of managing performance, throughput and reliability for big data applications, and also an overview and a live demo of Unravel with sample case studies. Today’s speakers I’d like to introduce is Bernd Harzog, he’s the Co-founder and CEO at APM Experts, and also Bala Venkatrao, VP of Products at Unravel Data. And with that, I hand it over to Bernd.
Bernd: Thank you very much Gill. Thank you very much for spending your time today with us, we greatly appreciate it. Let me make a note here that if you want to ask us questions there’s a Q&A applet in GoToWebinar. Just type in your question, and we’ll get to it after my presentation and Bala’s presentation. So let’s take a little step… Let me first give you an overview of my background here.
I spent the first part of my career in the computer industry bringing products to market, and brought a bunch of interesting products to market. And in the 2004, 2005 timeframe, ended up being CEO of a performance management company that focused on the Citrix market and then we sold that to Citrix. And I’ve spent my time since then consulting as an analyst, working with both vendors and customers, helping vendors on their product and their marketing, and helping customers decide what products to buy, and helping them run their implementation projects.
So by virtue of my working so closely with both the vendors and the customers, I do have a relatively unique perspective on the whole entire performance management market.
Application Performance Management Market Overview
So let’s talk about the APM part of the performance management market real quick, and we’re gonna use Gartner’s definition, so we don’t have to make up a definition here. Gartner considers APM as having one or more of three different dimensions to it. The first is digital experience monitoring, which is how your users interact with you digitally across all different formats.
So your website and application, the phone, fax, email, social media, etc. Application discovery, tracing and diagnostics is the behavior of the application end-to-end. So it’s the tracing of the flow of the transaction through the tiers of the transaction from its inception, to using it on the database and back, and to whatever external services it calls.
The timing of those transactions, the determination of normal, and finding of transactions that are abnormal, and then assisting with the root cause. So what is the problem and how do you fix it. And lastly, is the application of AI to APM, and this is an area where a tremendous amount of innovation is going on right now.
So AI has been used in the past, simplistic AI has been used in the past to do things like automatically create baselines, and automatically determine if something is abnormal with respect to the baseline, but it’s going much deeper than that. And Bala, from Unravel is gonna give you a really great example of how incredibly productively AI can be deployed in an APM solution.
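As a rough illustration of the simple baselining described above, here is a minimal Python sketch that builds a baseline from historical response times and flags a measurement as abnormal relative to it. The function names, the sample data, and the three-standard-deviation threshold are assumptions made for illustration; they are not taken from the webinar or from any vendor’s product.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize historical response times (ms) as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def is_abnormal(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical response times in milliseconds.
history = [120, 118, 125, 122, 119, 121, 124, 120]
baseline = build_baseline(history)

print(is_abnormal(123, baseline))  # close to the historical mean
print(is_abnormal(400, baseline))  # far outside the historical range
```

A fixed threshold like this is the simplistic end of the spectrum; the deeper use of AI mentioned in the talk goes beyond it.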
Looking back on the APM Market
Let’s take a step back here and let’s look at the application performance management market eight years ago in comparison to today. So on the left is the Gartner Magic Quadrant, so the upper right are the leaders from 2010, and on the right side is the leaders now in 2018.
And notice a couple of the things that jump out at you here. So the first thing is that other than CA (who have completely rebuilt their product so it’s not the same product in 2010, as it is 2018), the vendors that were leaders back then are no longer leaders, and the vendors that are leaders now… the new leaders (AppDynamics, Dynatrace and New Relic) weren’t even on the Magic Quadrant, they didn’t even exist in the eyes of Gartner eight years ago.
So what happened? What happened is that an architectural shift occurred, from monolithic applications written to J2EE to distributed SOA applications, and that created a new set of requirements for how you manage applications in production. Specifically what it created was the need to trace highly distributed applications, because that’s what SOA applications are: they’re distributed across tiers.
New Requirements. New Innovations. New Leaders.
And so that new set of requirements led to a new set of innovations, which led to a new set of leaders. And that’s a really, really important thing to keep in mind here, a new set of requirements, a new set of innovations, and a new set of leaders, because we’re in the process of just such a change right now.
We’re going through a similar change, I think, in fact a more disruptive change than the last time, in the area of application architecture and data architectures. So we’re gonna drill into that here so we can kind of get a good lens on what’s happening in the APM market. So what’s driving all of this innovation is digitization, and digitization is the idea of implementing all of the business processes in your company in software.
And this is a really, really important concept, and it’s creating a whole bunch of problems. One of the problems is it’s creating an infinite demand for business functionality implemented in software. So an infinite demand for software development, and for the process of getting software into production, and managing it effectively in production.
Demand Fuels Innovation in App Architecture
And that infinite demand is fueling a set of innovations, starting with innovation in application architecture. The new one there is something called microservices, which breaks applications into fine-grained little components. Each component does one thing, and hopefully does it well, and the people that have fully implemented this have reported outstanding gains in time-to-market for new application functionality.
And it’s driving changes in process, so Agile, DevOps, and CI/CD are all things that have been invented in response to this infinite demand. It’s creating innovation in languages: new languages are being invented to make it easier to write certain kinds of code to solve certain kinds of problems. There’s now a language for every problem you want to solve.
And a tremendous proliferation in the services, and containers, and orchestration environments in which you run these applications. So things like Kubernetes and Istio and Spark and Kafka, are all being deployed in order to help these applications run in production with an extended functionality more efficiently.
The Proliferation of the Data Architecture Business
The database business, the data architecture business has undergone an incredible proliferation of innovation from basic SQL databases, which by the way, aren’t going away, but now we have the NoSQL databases like Hadoop and Cassandra and Mongo, time series databases like Influx, we have cloud databases like Big Table, and Amazon EMR, and perhaps the unfortunately named CockroachDB, which is a highly distributed database, the idea being your data is gonna be everywhere like cockroaches.
And everything’s virtualized, and everything runs in different clouds. So the combination of these innovations creates a situation where we have an unprecedented level of complexity, diversity, and dynamic behavior. And we’ve never had such a complex, diverse, and dynamic environment before, and needless to say, this creates unprecedented management challenges.
Let’s drill in on the application architecture part of the problem. And the application architecture part is, “Okay, how is that changing over time?” And the first thing is okay, we started off with simple J2EE Web applications and those were monoliths, and those were monitored by…at the time some startups, then they got acquired by IBM, BMC, CA, and HP. That’s the old, the first generation of web applications.
The second generation was the SOA service-oriented architecture applications, and that’s why new APM vendors like AppDynamics, Dynatrace, and New Relic came into being was to meet those requirements. Now we have two innovations in the data space. We have big data applications, so batch, big data applications, and we have real time, streaming big data applications.
And they have completely different requirements for monitoring, and Bala is gonna go through that with you. An innovative new vendor like Unravel is gonna handle that for you. And now we are, over on the application architecture side, we have the move to microservices and that requires a completely different approach to actually monitoring the microservice, and the code in the microservice itself.
And finally, we are just embarking upon applications that are a fusion of the Web/mobile and then streaming data applications. So applications that perhaps have a variety of back end data architectures. An application might well have a SQL database, it might well use a Hadoop data store of some kind, it might well use Cassandra all simultaneously.
And that’s gonna again, create an unprecedented problem in terms of how to manage this in production. So now that we’ve gone this far we forgot, Gill, to do the poll. So let’s run the first poll right now, and then I’ll go through the rest of my presentation.
And if you’d like to take the poll, please take the poll now and then we’re going to close it out and show you the results. All right, Gill, why don’t you go ahead and close out the poll and let’s share the results of the poll. Well, the good news is that everyone agrees that this is either a problem or somewhat of a problem, or they strongly agree that it’s a problem. And we certainly agree. We think it’s a very challenging problem, so we’re glad you agree with us.
Data and Application Architecture Deployment
So let’s talk about what this really means and wrap this up, which is, okay, these new data and application architectures are being very widely deployed, and just as is the case from the 2008 scenario to the 2010 scenario, what worked in 2008 is not working in 2018, excuse me, 2010 and 2018. And what worked for the current set of SOA applications is not gonna work for this new set of microservice and data-intense applications.
And that problem is compounded by the complexity and diversity, and the rate of change, and the lack of talent. It’s a real challenge for people to find the people that can manage these systems in production effectively. So we really need the tools to help, we really need the tools to help on both the quality of service front, and on helping the people be more productive front. So where do these tools create value for you?
I’m not gonna read through the left column here because this is the value for the standard web-based applications. But if you do this right over on the big data side, the top line business value is time to insight.
If your big data environment runs more effectively, if you can guarantee the SLAs, you can guarantee your cluster performance, and your job completion time, and minimize failed jobs, the value, the top line business value to you is you’re gonna get those data-driven insights more quickly and you’re gonna be able to take action as a business to take advantage of those data-driven insights.
AI to Guarantee System Performance
And lastly, and, again, I’m gonna do it, I’m trying to set Bala up here. If AI is appropriately deployed here, the AI can actually help you guarantee the performance of these systems, which is something I think we’ve all been wanting monitoring to do for a long time, is not just alert the humans but go fix the problem itself.
So if I abstract up from this, it turns out that the business value of doing this correctly actually gets translated into company value. If you optimize for time to insight, research from IDC shows that you end up with a more valuable company. Obviously it has an effect upon revenue, and it has an effect upon profits, and therefore has an effect upon the value of the company.
So that’s the end of my presentation. I’m gonna turn it over to Bala here, but I think Gill, this is when we run the second poll, right?
Gill: Yes. Here we go.
Bernd: Let’s give this another couple of seconds here, and so far everyone seems… Well, I’m not gonna give away the results. Let’s go ahead and close this out Gill, and let’s share the results with the attendees.
And looks like everybody thinks this really should be focused upon application performance tuning. And so for all of you who think that, I have some really good news because here comes the presentation from Bala with Unravel. Take it away.
Bala: Thanks, Bernd, and yeah. So the poll results were pretty interesting. So looks like all of you are at the right webinar, because that is what we’re going to be talking about. In fact, many of you have visited our website and looked at our collateral.
That’s the core value prop of Unravel. We help you tune and optimize your big data applications. So thanks, Bernd, again for setting the stage. Bernd brings a wealth of experience from the traditional APM space, and as he’s kind of taking a page and looking at what’s really happening in the microservices and the big data realm, there’s clearly a need for a purpose-built product to really understand performance for your big data applications.
So with that, as Gill mentioned my name is Bala Venkatrao, I’m part of the founding team and VP of Products at Unravel. I’ve spent several years in the big data space and it’s really exciting to bring Unravel to the market. Today we’re working with several large enterprise customers, helping them tune and optimize their big data applications.
Big Data is Here to Stay
So if you take a step back and really understand what’s going on: big data is here to stay and a lot of enterprises are standardizing on the big data stack. And the reason I mention the big data stack is that big data encompasses a whole bunch of different systems. You could be using Kafka as a way to ingest, you could be doing analytics using Hadoop and Spark, and you could be using a NoSQL technology to store those results.
But at the end of the day, the true value of these platforms or systems is what is the business impact? What is it that you can do with these big data platforms and infrastructure to build meaningful applications that impact your end business results or end business outcomes? And as businesses come to rely on this platform, they expect a certain capability from this platform as they have expected from their traditional database and data warehousing platforms.
So they expect performance, they expect uptime, they expect SLAs, and these are like enterprise requirements that now the operations team have to guarantee for the end business users, so they can effectively use the platform.
Performance, Reliability, Ease of Use
So as I mentioned: high performance, high reliability, and ease of use. And these pretty much underscore the key requirements that we have seen as enterprises are trying to productionize and operationalize the big data applications on their infrastructure. And as I mentioned, when you kind of look at the collection of these systems, things suddenly start to get very complicated, because you’re not just looking at like one system, you’re looking at like a multitude of these systems, and the reality is that most enterprises are using best of breed.
They could be using different systems on-prem, but they could be using a completely different system on the cloud. And there are various choices out there, I think that’s what makes this time really exciting. You’ve seen an abundance of new SQL technology, new MPP databases, new streaming technology, a lot of them are open source, some of them are proprietary.
And frankly, like, enterprises have an option to choose what platforms they decide to bring together to build their big data applications. But regardless of that, you need an application performance management paradigm that really transcends these multitudes of systems, whether you’re running on-prem or whether you’re running on the cloud.
Challenges facing Enterprises
And the challenges we have often seen as enterprises try to standardize on this platform can be categorized on the application side and the operations side. From the application side, we’ve heard the questions around missed SLAs or failures.
You’re running a Hive query, or you’re running a Spark application, and these applications or these queries could have been triggered by a BI tool. As you’ve seen in the previous slide, as I mentioned, there are various applications. You could be using ETL, it could be doing like a BI query on top of the platform, or it could be running advanced machine learning algorithms.
But regardless of what this application is doing, when you’re trying to productionize it, you have a lot of ongoing application performance challenges, whether it’s things like missed SLAs, failures, or the applications are just flat-out slow. A BI query that came from Tableau running on my big data infrastructure, which was taking like 20 minutes yesterday, for some reason today is taking two hours.
And what is the cause of the slowdown? And people are spending an awful lot of time to really go and root-cause those issues and understand what is really happening. And then for on-prem clusters we also have this issue around multi-tenancy because most of the big data deployments happen in a multi-tenant environment where you have lots of tenants and users are trying to access the same resources.
And many times you’ve seen applications slow down through no fault of a particular user; it’s because there are some other rogue users, or rogue queries running in the cluster, that are taking over your resources. So there is a classic set of problems from the application perspective that keep end users from using the platform effectively.
Challenges facing Operations
From the operations perspective, there are lots of challenges for operators who are trying to productionize these platforms. Whether it’s providing them a single pane of glass, really understanding all the way from applications and the impact of the applications on the cluster and vice versa.
Today, you tend to have a siloed kind of view. So you tend to have a good systems view or cluster view, and then you can go and look at your application view in tools like the Spark Web UI or the YARN web page, but they kind of give you limited information first and foremost. And the other thing is that it’s very difficult to correlate, to be able to understand the impact of the applications on the cluster and vice versa.
Also understanding cost, whether it’s on-prem or on the cloud, how do you sort of figure out which tenants are using the cluster, can you create effective charge back and show back policies, and have a good understanding how resources are being actually used or abused by your various tenants. And on the cloud, this gets really magnified. You could be using like an Amazon EMR, you could be using HDInsight.
It’s very easy, very convenient to spin up those clusters or those environments, but it is very difficult to figure out cost. You could have a runaway query, you could have like a Spark job, or a Hive job that just kind of runs away on the scale-out infrastructure on the cloud, but at the end of the month, you see your bill and it’s very difficult for you to rationalize how the money was spent on those applications.
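To make the chargeback and showback idea concrete, here is a minimal Python sketch that splits a monthly cluster bill across tenants in proportion to the resources their applications consumed. The record layout, the `vcore_hours` metric, the tenant names, and the cost figure are all hypothetical; a real policy could weigh memory, storage, or cloud line items as well.

```python
from collections import defaultdict

def showback(app_runs, monthly_cost):
    """Split a monthly bill across tenants in proportion to vcore-hours used."""
    usage = defaultdict(float)
    for run in app_runs:
        usage[run["tenant"]] += run["vcore_hours"]
    total = sum(usage.values())
    if total == 0:
        return {}
    return {tenant: monthly_cost * hours / total for tenant, hours in usage.items()}

# Hypothetical application runs collected over one month.
runs = [
    {"tenant": "marketing", "app": "hive-etl", "vcore_hours": 300.0},
    {"tenant": "finance", "app": "spark-ml", "vcore_hours": 600.0},
    {"tenant": "marketing", "app": "bi-query", "vcore_hours": 100.0},
]

bill = showback(runs, monthly_cost=10000.0)
for tenant, cost in sorted(bill.items()):
    print(tenant, round(cost, 2))
```

The same per-application records make a runaway job visible, because one run would dominate its tenant’s share.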
Finding the Needle in the Haystack
And then addressing the core MTTR issues. Things do go wrong in production, but when they do happen, a lot of times the end users just like punt the problem to the operations team, and the operations team are getting overwhelmed because they have to look at like a whole bunch of issues, whether it’s on the cluster side, or the application side, and it’s really finding a needle in the haystack.
And if you actually take a step back and you look at these class of problems, in fact a lot of these problems can be solved very effectively using machine learning, because we are in the realm of, like, machine learning these days, and it’s pretty interesting. It’s pretty surprising to me that when you go and talk to a lot of the operators and the application users for the big data infrastructure, and you ask them, “How do you go about understanding a performance or trying to figure out why applications fail?”
People are just using brute force, they go back and look at the logs, and they’re going looking at the native web UI pages that come with the Apache software, and it’s very old school in terms of how they solve the problem. And think about it. If you’re running something on like a 100-node cluster or a 1,000-node cluster or even like a 20-node cluster, there’s so much of data to look at, it’s humanly impossible for you to really find the needle in the haystack.
So a lot of these problems are from the application performance perspective and understanding how the cluster is behaving or not behaving properly. They’re really ripe for applying data science and machine learning sort of algorithms. So the challenge is how do you collect this information and then be able to root-cause and triage that effectively.
Finding the Root Cause of the Problem
And that’s the journey Unravel has been on right from day one, because we felt that there is a lot of value in bringing the power of data, the power of data science, to both these application and system-level data that’s being generated for big data applications.
And the key is how do you harness the data? Can you create effective models, can you apply effective machine learning sort of algorithms, to quickly get to the root cause of these issues, so you can solve the key issues that the enterprises are looking for, from an SLA management perspective or reducing the MTTI for your operations folks.
And the lack of this is really affecting or is causing a slowdown in terms of enterprises fully embracing your big data infrastructure. Because if you don’t address these issues systematically, you have things like missed SLAs, that could be affecting your top line revenue, you could have high infrastructure cost, and we’ve heard that often where operators or enterprises end up provisioning large Hadoop clusters and the only way they solve the performance issue is adding more capacity to it.
And that’s not really sustainable because you can’t just keep adding more nodes to your infrastructure without really understanding the ROI that your big data investments are providing. And then if you’re not able to solve these issues in a timely fashion, your end business users will frankly get frustrated, because they are used to a certain degree of SLA from their traditional databases and data warehousing kind of environment and they expect the same resiliency from the new or modern data platforms.
Full-Stack, Autonomous APM
And this is where Unravel comes in. The key message or the key value prop for Unravel is we are the application performance management for big data providing you a full-stack, intelligent, and autonomous platform. And all of these words have significant meaning.
When we say full-stack, we really mean it because we feel that to solve the application problem for your big data application, you need to have the complete picture. That means you need to understand how your applications are performing right from your query level, or it could be a data pipeline, you want to understand how your end-to-end data pipeline is performing.
But from there you should be able to double click on your platform level, how your underlying systems are doing, whether it’s your Hive systems, or your Spark systems all the way to the infrastructure level. So you need to have an end-to-end view right from the application and the impact of those applications on your underlying infrastructure. So full-stack, we cover it from that angle, and then also the notion of full-stack is across the board.
Single View into Multiple Systems
If you have multiple systems, if you have Kafka feeding into Spark Streaming, doing analytics in Spark and eventually feeding into HBase, you want to have a single pane of glass to understand how this end-to-end application is doing, and that’s where Unravel comes in.
We sort of sit as part of your big data environment, observe everything that’s happening in the cluster, and then correlate that information, and then help you get to the root cause of it. The other key value prop is intelligence. It’s fairly easy to get all of these metrics via JMX and other API kind of tools and give you 100 graphs.
Well, you’re not gonna get much out of that because, again, you’re gonna get overwhelmed just looking at all of these different graphs and metrics and then trying to find the needle in the haystack. So the approach that Unravel strongly believes in is collecting the data, I think that’s first and foremost important, but what do you do with the data?
Can you effectively apply smart algorithms, smart machine learning models? And I’ll kind of talk about how we have applied machine learning across our entire product feature set, where we layer intelligence into the entire product, where we don’t just give you graphs and metrics, but help give you answers so you can solve your problem.
And you will see in the demo how we effectively go about doing that.
The last aspect is autonomous. Wouldn’t it be great if a product like Unravel could do the job for you, where you give the controls to Unravel, and based on what’s really happening in the cluster, figuring out what needs to be done, the intelligence aspect, it is orchestrating things? And Unravel is clearly going in that direction.
In the new release, in Unravel 4.4, we’ve introduced a lot of autonomous capability where we can self-tune applications based on what we have observed. So based on, like, multiple runs of the application, we can converge to the right set of recommendations, and we can apply those recommendations on your behalf. Or we can use the existing APIs that the platform provides and call those APIs in the right way, or apply the right recommendations to those APIs, to orchestrate things in the platform.
So increasingly, we’re bringing a lot of capabilities around automating a lot of workflows and tasks that you’ve been doing in a manual fashion, but taking it in a systematic way where we can take our intelligence and apply it in a way that we can orchestrate things in the platform. And we do all of this in a very non-intrusive fashion.
We sit as part of a cluster. The typical Unravel deployment requires an Unravel server that you deploy on an edge node, and we deploy the Unravel sensors, which are very low-footprint sensors that get deployed on your cluster, and they collect all the information as these applications are running and feed that to the Unravel server where we do all the correlation.
Your Very Own Smart AI Bot
So think of Unravel as a smart AI bot that is sitting out there observing everything from the application perspective, and then giving you recommendations and tuning recommendations on what needs to be done from the application perspective and the cluster perspective. And then we can apply those recommendations in many cases if you give us the right permissions.
So what are some of the machine learning capabilities we have leveraged in the product features today? And I’ve kind of broken this down from the application management perspective and the operations perspective. So here are some of the capabilities we’ve introduced over the last few releases.
From the application management perspective clearly we’ve added a lot of capabilities around automated tuning. So what does this mean? So in the notion of automated tuning, you can tell Unravel what is your goal, what are you trying to do with a particular application.
You want to speed up your application or you want it to consume less resources, there could be various dimensions you might be looking at from a tuning perspective. So once you define a goal to Unravel, we start working towards the goal. And what we have learned is it takes a few iterations, and this is a classic, like, data sort of modeling or machine learning sort of model.
You have to have multiple passes of the same application to be able to converge on what the right configuration parameters are. So in our most recent release we’ve introduced this notion of auto-tuning where we can run this application, and then give you the right set of recommendations.
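The goal-driven, multi-pass idea can be sketched in a few lines of Python. This is not Unravel’s algorithm, which the webinar does not describe; it is a simple hill-climbing loop chosen for illustration, and `simulated_run` is a synthetic stand-in for actually running the application with a given memory setting.

```python
def simulated_run(memory_gb):
    """Synthetic stand-in for one application run; returns runtime in minutes.
    In this toy model the runtime is lowest at 8 GB."""
    return 20 + (memory_gb - 8) ** 2

def auto_tune(run, start, step=2, max_iterations=10):
    """Re-run with neighboring settings until no neighbor lowers the runtime."""
    best_setting, best_runtime = start, run(start)
    for _ in range(max_iterations):
        candidates = [s for s in (best_setting - step, best_setting + step) if s > 0]
        runtime, setting = min((run(s), s) for s in candidates)
        if runtime >= best_runtime:
            break  # no neighbor improves on the goal; treat as converged
        best_setting, best_runtime = setting, runtime
    return best_setting, best_runtime

# Goal: minimize runtime, starting from a deliberately poor setting.
print(auto_tune(simulated_run, start=2))
```

Each loop iteration corresponds to another pass of the application, which is why convergence takes a few runs rather than one. A different goal, such as consuming fewer resources, would only change what `run` returns.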
Rich set of Tuning Recommendations
Building on that, we have a rich set of capabilities around tuning recommendations. So whether you’re running Hive on MapReduce or you’re running Hive on Tez, you could be running Spark or a Spark SQL application, you could be running Impala, or you could be running workflows.
Regardless of the engine you decide and applications that are running on top of that, we’re able to analyze this application, profile that application and then give you recommendations on what you need to tune or optimize. So the tuning recommendation has been something, like, we were the first in the industry to introduce this notion of tuning recommendations.
We’ve been doing it for the last couple of years and then building on that we’ve added the complete auto-tuning sort of exercise. And both of these require a lot of data modeling and analysis behind the scenes. Like, everything is abstracted, we use like a whole bunch of algorithms to get to the final result, but for the end user Unravel provides you answers that you can act on to improve your overall performance.
Understanding Error Views
The other feature where we leveraged a lot of machine learning capability is error views. Like, when applications fail today, there’s so much information out there in the logs. Some of you on today’s webinar, I’m pretty sure, have experience running Spark applications, and I think you can fully appreciate that when you run a Spark application there are so many logs being produced that it’s so difficult to parse through the logs and really understand what’s going on. And you need to apply smart algorithms to it.
You should be able to know what logs to look at, first and foremost, and within those logs what are you looking for.
So we’ve come up with like… We use various heuristics where we come up with various rules to help you figure out what the problem is, and how we can go about fixing those problems in your application. So this is just a quick teaser on some of the machine learning capabilities we’ve introduced from the application management perspective. From the operations management perspective, again, we have introduced a bunch of capabilities that leverage machine learning behind the scenes.
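A rule-based pass over logs, in the spirit of the heuristics mentioned above, might look like the following Python sketch. The two rules, the hint wording, and the sample log lines are illustrative assumptions; the webinar does not list Unravel’s actual rules, and real logs are far noisier than this synthetic sample.

```python
import re

# Illustrative rules: (pattern to look for, hint to show the user).
RULES = [
    (re.compile(r"java\.lang\.OutOfMemoryError"),
     "JVM ran out of memory; consider giving the process more memory"),
    (re.compile(r"FileNotFoundException"),
     "an expected file or path is missing; check upstream jobs and paths"),
]

def diagnose(log_lines):
    """Return (line number, hint) for each line matching a known rule."""
    findings = []
    for number, line in enumerate(log_lines, start=1):
        for pattern, hint in RULES:
            if pattern.search(line):
                findings.append((number, hint))
                break  # first matching rule wins for this line
    return findings

# Synthetic log lines, written for this example only.
sample_log = [
    "INFO job started",
    "ERROR java.lang.OutOfMemoryError: Java heap space",
    "ERROR java.io.FileNotFoundException: input path does not exist",
]

for number, hint in diagnose(sample_log):
    print(f"line {number}: {hint}")
```

This covers the second half of the problem, knowing what to look for within a log; deciding which logs to look at first is a separate step.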
The first thing is around cluster optimization. So what we learned as we were working with customers is that, while it was great to provide tuning recommendations at the individual application level, many of the operators also want to look at things holistically. They want to understand what knobs to tune and optimize at the overall cluster level that can affect a whole bunch of applications that are running in the cluster.
So what we have done is we’ve looked at all the application runs that are happening in the cluster, and we fed that to the Unravel engine. And then we’ve applied a bunch of machine learning algorithms to give you recommendations in terms of what knobs to tune at the overall cluster level, and more importantly what the impact of those would be.
How many applications can get positively affected because of these changes and how many of them could be negatively affected in some cases, and then you can make a call in terms of whether you want to make those changes at the global level. The other interesting feature we’ve introduced is around capacity planning.
A use case which is so ripe to using machine learning kind of algorithms, because think about it for a moment, like, end of the year, I’m pretty sure starting like the next month or in November, all of you are scrambling to figure out, “How do I plan my cluster growth for the next year? I have to roll out the budgets and give it to my management team.”
Data Science Approach to Viewing Clusters
And like, today, the process is very ad hoc. You’re kind of looking at some data and you’re trying to get some tribal knowledge and then figure out how to grow the cluster. What if you apply a more scientific approach?
Can you apply a data science approach to looking at how your cluster is being used, understand what workflows are being run, how much data is being ingested, and then applying some forecasting algorithms which tell you how your cluster is gonna grow in the next one month, three months, six months, and nine months.
And that’s precisely what we have done. We’ve looked at the key storage metrics and we’ve collected them over a period of time, then we applied some really interesting forecasting algorithms which give you a certain degree of confidence in how the cluster is going to grow.
And we’re gonna bring this same capability to a whole bunch of other key metrics in a cluster, whether it’s on your compute side or your memory usage side. There are a whole bunch of time series metrics that you collect today as part of a cluster environment, and they’re so ripe for applying a lot of these forecasting algorithms.
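Editor’s note: as a rough illustration of the forecasting idea, the sketch below fits a straight-line trend to monthly storage figures and projects it 1, 3, 6 and 9 months ahead. The talk does not say which forecasting algorithms Unravel uses, so the linear fit, the function names and the sample numbers are all assumptions. The margin is a crude spread of past points around the trend, not a formal confidence interval.

```python
from statistics import mean

def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    xs = list(range(len(values)))
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def forecast(values, months_ahead):
    """Project the fitted trend forward; returns (estimate, rough_margin)."""
    slope, intercept = fit_line(values)
    residuals = [y - (slope * x + intercept) for x, y in enumerate(values)]
    # Rough spread of history around the trend line (assumption: errors stay similar).
    spread = (sum(r * r for r in residuals) / max(len(values) - 2, 1)) ** 0.5
    x_future = len(values) - 1 + months_ahead
    return slope * x_future + intercept, 2 * spread

if __name__ == "__main__":
    # Hypothetical monthly storage usage in terabytes.
    used_tb = [410, 425, 433, 452, 470, 481]
    for horizon in (1, 3, 6, 9):
        estimate, margin = forecast(used_tb, horizon)
        print(f"+{horizon} months: about {estimate:.0f} TB (plus or minus {margin:.0f} TB)")
```

Real usage is often seasonal or bursty, so a production forecaster would likely use a richer time series model; the same structure applies to compute or memory metrics.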
How Unravel Performance Tuning Works
So what happens behind the scenes? This is just kind of giving you a quick idea of the architecture, how does the Unravel Performance Tuning Architecture work? How do we apply these machine learning principles in the first place? And the key thing here is data, the more data you collect the right way, the right data, I think the more data and right data is important.
So whether you’re running your cluster on-prem or in the cloud, we do the right instrumentation. And the good thing about big data applications is they leave behind a lot of trace, there’s a lot of exhaust that’s available out there in terms of logs and metrics that we can tap into. And we do, we talk to existing APIs, we pull in all that information and regardless whether you’re running on-prem or on the cloud, we feed that information to the Unravel engine.
And once the data comes in and we feed that to the Unravel engine, we have a bunch of recommendation algorithms. So we parse through that data, we do a bunch of ETL, we do cleaning, and then create a data model out of that.
And then once we have that unified data model, we apply some algorithms based on that. And then once we apply those algorithms, we learn from that. So really there is a learning context associated with that, and one of the things we have learnt is, you can’t just give recommendations once and forget about it. You need to be able to apply those recommendations, see how the application performs, learn from that, and then rerun it till you converge to the right goal.
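Editor’s note: the apply, measure, learn and rerun loop described above can be sketched as a simple search over one setting. This is a toy illustration under stated assumptions, not Unravel’s algorithm: `run_app` is a hypothetical callable that runs the application with a given setting and returns its runtime, and the loop stops when no neighboring value is meaningfully faster.

```python
def tune(run_app, start_value, step, min_gain=0.01, max_runs=20):
    """Adjust one setting step by step, keeping changes that reduce runtime."""
    best_value = start_value
    best_time = run_app(best_value)
    runs = 1
    while runs < max_runs:
        improved = False
        for candidate in (best_value - step, best_value + step):
            if candidate <= 0 or runs >= max_runs:
                continue
            elapsed = run_app(candidate)  # apply the change and observe the result
            runs += 1
            if elapsed < best_time * (1 - min_gain):
                best_value, best_time = candidate, elapsed  # learn from the run
                improved = True
                break
        if not improved:
            break  # converged: no neighbor is meaningfully faster
    return best_value, best_time

if __name__ == "__main__":
    # Hypothetical runtime model: fastest at 6 GB of executor memory.
    def simulated_runtime(memory_gb):
        return (memory_gb - 6) ** 2 + 10

    print(tune(simulated_runtime, start_value=2, step=1))
```

The `max_runs` cap matters in practice because every probe is a real run of the application on the cluster.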
So that’s the new notion that’s important from a machine learning perspective, and that’s what we’ve introduced in the most recent release.
Capabilities within Unravel
So with that, let me switch over to a quick demo to give you an idea of some of the capabilities within Unravel that apply these machine learning and kind of recommendations. So hopefully you guys can see my screen. So this is an example of a Spark application, and this Spark application could have been triggered through a CLI or there could be some end user tool that is putting out this Spark application.
But from an end user perspective, they’re trying to figure out what it is that they should do to optimize this particular Spark application. And today if you’re not using Unravel, it’s fairly challenging to go and figure out where to start. You could look at the YARN Web UI, you can go look at the Spark Web UI, but you eventually end up doing a bunch of Google searches, looking at some best practices.
Complete View of your Application
There’s no one place where you can look at the entire application from a complete 360 degree perspective, and then understand what’s going on. So with Unravel we solved the problem to a great extent where we provide you good visuals to come in and understand what happened. What really happened with the application, how much time it took, how much data IO was used, but you can also look at the code that was submitted as part of the application, what really happened, how many Spark tasks were used up as part of this particular application, looking at some key resources.
So you can look at your drivers and executors, and you can look at a whole bunch of metrics at the JVM level. You can also understand how many jobs were created as part of that particular Spark application, get a good Gantt chart view which tells you where time was being spent, like, which particular Spark job took the most amount of time, and then clicking on that particular Spark job, you can understand the right stages associated with that, which, again, which stage took the most amount of time.
And then here you can go all the way down to the timeline view which tells you how the tasks were distributed. So if you notice how we started from the highest level of abstraction, the application, and then start going down to the successive layers, all the way to the tasks that are running on the underlying host.
So this, again, goes back to the notion of providing a complete full-stack perspective and understanding what’s going on. While all this is great, you can spend like a few minutes here and really understand end-to-end what’s going on, and if there are errors we would kind of point that out.
In this case, there are a few warnings, so we’re parsing that and giving that information to you. But the key value-add from a machine learning perspective is we’ve taken all of this data, we’ve understood what’s going on, and then we give you recommendations in terms of what it is that you need to do to fix it. So we tell you what your configuration parameters are, and what the recommended value is.
And this is the key science that happens behind the scenes to give you these recommendations. And then we also kind of tell you in plain English why we are recommending those in the first place. And the reality is most of these configuration parameters, the current value tends to be default values that come with the platform, and given that there are so many configuration parameters it’s hard for end users and operators to go and figure out what knobs to kind of tune and optimize.
The Right Recommendation
So with Unravel we give you the right recommendation, and we also give you the reason why you need to change these parameters in the first place. And some of the recommendations are also more at the code level. We kind of point out that you could benefit by caching a certain RDD, and we can also point you to the line of code where you can go and add a cache statement so that RDD can be cached.
So in this case, I took the recommendation and we reran the same application. It had taken 13 minutes, and the same application now ran in 6 minutes. So this is a good example where we’ve looked at the data, we give you the recommendation. In the most recent release, we can actually rerun the same application for you and converge to it.
So you don’t have to just look at the recommendation and take those recommendation and apply it. We can do it on your behalf. So that’s a whole auto-tuning kind of architecture I talked about.
Spark Job Workflow
The other example I want to show you is for Workflows. So you could be running a Workflow and Workflow is something that runs repeatedly. So you could have an ETL job or a machine learning model that needs to get generated every day, and suddenly you see a deviation in performance.
So here’s a good example of the same Workflow that was running for, that’s running, like today it’s running 4 minutes, but there was a deviation, this workflow took 14 minutes. So what we have done right now is like if you look at this particular application, and if you click on our Events, we tell you that there is a deviation. So Workflow deviation is worse than a certain baseline model.
So we do a bunch of baselining here. So if the same Workflow runs repeatedly, we’re able to group them and then we apply some statistical algorithms to see why it’s deviating from the normal, how much it’s deviating from the normal, and what is a root cause for why it’s deviating from the normal.
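Editor’s note: a minimal sketch of the baselining idea, using the 4-minute and 14-minute runs mentioned in the demo. The talk does not name the statistical algorithms Unravel applies, so the median and median-absolute-deviation scoring here, the threshold of 3, and the function names are assumptions chosen for illustration.

```python
from statistics import median

def deviation_score(baseline_minutes, new_run_minutes):
    """Robust z-score of a new run against earlier runs of the same workflow."""
    center = median(baseline_minutes)
    mad = median([abs(x - center) for x in baseline_minutes])
    # 1.4826 scales the MAD to be comparable with a standard deviation;
    # the small floor avoids dividing by zero when all runs are identical.
    scale = max(1.4826 * mad, 1e-9)
    return (new_run_minutes - center) / scale

def is_deviation(baseline_minutes, new_run_minutes, threshold=3.0):
    """Flag runs that are much slower than the baseline."""
    return deviation_score(baseline_minutes, new_run_minutes) > threshold

if __name__ == "__main__":
    # Hypothetical history of a daily workflow that normally takes about 4 minutes.
    history = [4.0, 4.2, 3.9, 4.1, 4.0]
    for duration in (4.1, 14.0):
        print(duration, "deviates" if is_deviation(history, duration) else "normal")
```

This only answers "how far from normal"; finding the root cause, such as a competing job on the cluster, needs the cluster-level data described next.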
So in this case, it’s deviating from the normal because if you go and look at the overall cluster, and then you spend some time looking at the operational capabilities, you can go and see what happened to the cluster at that time. And you see that the case where the Workflow was deviating from the normal that MR job was taking much longer than before. And the reason is there was a rogue job.
There’s a Spark job, a Spark shell application that is really affecting the cluster. So we’re able to kind of pinpoint to you what the problem is, that there is a Spark application which is the rogue application.
So first and foremost, if you notice how we’ve collected multiple runs, we’ve created a baseline out of that, and from the baseline we’re able to root-cause the deviation, the reason why it deviated from the baseline and then based on the Auto-Action alert that we have, we can help you pinpoint that there’s a rogue application that’s stealing all the resources in the cluster.
Our goal is to make your job easier
So this is just a quick teaser in terms of like how we’ve applied machine learning across our different product features to help you quickly root cause and identify performance issues in your cluster. So let me just go back to my slide presentation. So the end goal of Unravel is, if you get this right and we are increasingly providing more capabilities around it, we can make your job a lot easier, whether it is… Yeah.
So the end goal is like your end users will get to benefit from using Unravel so they can get more predictable performance from their big data platform, so they can be assured that the applications on the platform are run reliably. Whether it’s apps running on time and the operations perspective, we can make sure that their costs are under control.
So this whole notion of cost assurance, there is a certain budget being allocated for your big data projects, how do you make sure you fit within the cost, not exceeding your cost every year? And, lastly, improving the productivity, both of the end users and also from the operations team. So you’re doing more with less, and you’re actually spending more time on building the right applications that your business users can benefit from…
Financial Services Case Study
Here’s a quick case study of a customer. We’ve been working with a large Fortune 200 financial services company for almost like a year, a year and a half now. And they have a fairly complex environment, a big data stack that includes many systems and it’s just challenging because they are kind of like looking at all of these different things and they have to reduce their MTTR, or look at these multiple systems.
And they’ve used Unravel in various ways to address these problems, whether it’s debugging the application challenges, looking at the cluster holistically, using the smart alerts within Unravel to go and figure out the rogue jobs that could be affecting the clusters adversely. So in a nutshell, in summary, Unravel helps you optimize your big data applications and your big data clusters.
It helps you troubleshoot, so you can quickly resolve your key issues that are affecting applications and clusters, and it helps you understand what is really going on, holistically from the application perspective and your cluster perspective, so you can get more done from your current infrastructure. So yeah, with that, let me hand it over to Gill.
Questions & Answers
Gill: Yeah. Thank you, Bala, thank you Bernd for a great presentation. So we have some time to take over some questions, and we do have a couple that came in. First question that can go to both of you, Bernd or Bala. “AppDynamics and Dynatrace claim that they monitor the components of the Hadoop Stack. Exactly what does Unravel do that is different?”
Bernd: Sure. So there’s three parts to that answer. One part is that, yes, the APM vendors do collect data via JMX from some of the basic components of Hadoop Stack. HDFS, Spark, etc. But Unravel collects the data from essentially all of the relevant components of the stack, not just a few, including Impala, Hive, Tez, Pig, YARN, MapReduce, EMR, Azure HDInsight, etc.
So there’s a simple coverage difference, and if you’re gonna understand the full stack well then you’d better understand every component that comprises the stack. The second thing is that Unravel maps the topology of the queries from the application through the entire stack. So Unravel understands the journey of that query through the stack and obviously understands where the resources are being used, where the time is being spent. The APM vendors don’t map the topologies of the stack.
Their topology mapping only works for SOA-based web applications. It doesn’t work for the components of a big data stack which talk to each other in very different ways than the way a web application talks to its components. And lastly, and most importantly, and Bala just showed this to you, the APM vendors don’t have any automated model-driven performance optimization.
So they have no way of saying, “Hey if you were to make these changes from this number to this number, then the performance of this thing is gonna dramatically improve.” That’s a completely unique application of AI to APM that so far, I have only seen from Unravel, and not from anybody else.
Bala: Thanks Bernd. And I can just add to what Bernd said. I think what we observed is building application performance management for big data applications is a completely sort of new capability. You can’t just take your existing web app or APM kind of technology and point it at a big data application.
It’s not gonna work out. If you think about Spark, a single Spark application, it’s pretty easy to write like a Spark Scala or Python code, but when it runs on the cluster, the big data cluster, it spins up so many containers and executors. And it’s not just a question of collecting those JMX metrics, it’s really understanding end-to-end what’s going on. And that thinking requires a ground-up, purpose-built solution.
And none of the APM tools, as far as I know, have done that. Most of them just claim that they support Spark or Hadoop or Cassandra, but when you scratch the surface and you understand what’s going on, they just pull a bunch of JMX metrics and give you 100 graphs.
Like, if you want to solve these fairly complex big data application problems, you need a purpose-built solution like Unravel to help with that, and we’ve been at it for many years. The genesis of Unravel comes out of research at Duke University, where my CTO is a professor of computer science.
So he spent a lot of time looking at distributed systems and trying to understand what it is that you can tune and optimize in these distributed systems. What are the common design patterns across these distributed systems that mandate a new approach to manage applications on these platforms?
Gill: Great. Thank you Bala. We’ve got one more question here. “Cloudera and Hortonworks both offer monitoring solutions. How is Unravel different and better?” [Editor’s note: this interview took place before the Cloudera/Hortonworks merger in October, 2018]
Bala: Yeah. So I’ll take this.
Bernd: Bala since you used to work for one of them why don’t you take this one.
Bala: Sure thing, yeah. Yeah. So most of these vendors, whether it’s Cloudera or Hortonworks, they have a very systems approach to it. So a lot of these platform vendors, rightfully so, you need to build great systems management capabilities that’s all focused around setting up the cluster, managing the cluster, and they provide you monitoring more at the system level. How is my Hive service doing?
How is the Spark service doing? And then I think lately they’ve started to add some application view, but it’s kind of limited in terms of its capability, because they are very focused from a system and a cluster perspective. Taking that end-to-end application perspective is something Unravel has been doing for many years now. So that’s the first point.
And the second point is that there is a multitude of systems out there. While Hadoop and the Hadoop platforms offered by Cloudera and Hortonworks are important, the reality is there are really best-of-breed systems out there where you could be using Kafka outside of Hadoop, feeding into Hadoop, you’re doing a bunch of analytics, or you could feed into a NoSQL system like Cassandra or MongoDB that’s outside of Hadoop.
You still need like an end-to-end application performance management, a product to understand how all these systems are coming together to serve your applications.
And then as you move into the cloud, the reality is you could be using a mixed environment, a multi-system, multi-platform sort of environment, where you could be using a Cloudera or Hortonworks on-prem, but on the cloud, you might just decide to use Amazon EMR, or Microsoft HDInsight. So regardless of where you want to be, on-prem or on the cloud, Unravel will be the application performance management layer.
It gives you a very, a similar interface to understand how your jobs and applications are doing on your hybrid environment. So you don’t have to retrain your users, and help them figure out what, like, they don’t have to worry about different tools but they get a consistent experience on either of those platforms.
Gill: Great. Thanks. Anything to add, Bernd?
Bernd: No, I think that was a very perfect explanation.
Gill: Great, okay. One more here. So, “Can the automatic performance optimization be manually controlled by the administrators of the big data environment?”
Bala: Yes, it can be. So we have a couple of options again. The first option is like, we’ll give you the recommendations and we’ll make them available in a passive mode, but we won’t apply them on your behalf. We tell you the recommendations and it’s up to you to decide to apply them or not. The other mode is the auto-tuning mode where you give us the right permissions and then we can apply them for you, and then we run it till we converge to the right design goal. So we have both those options available.
Gill: Right… And one here says, “Exactly what metrics can the automatic performance optimization optimize?”
Bala: So today there are a wide range of parameters we can optimize. So these are primarily like runtime configuration parameters. So if you look at like Spark, there are a whole bunch of configuration parameters it exposes, the same thing with Tez or Impala. So we’re looking at like all of these configuration parameters which could be in the hundreds, and looking at this entire scope of configuration parameters and then telling you what is that you can tune and optimize.
Gill: Great, okay.
Bala: And some of our recommendations are also at the SQL level, not just runtime configuration parameters. We can also kind of tell you that your stats are missing, on your column stats for SQL level, or certain joins need to be kind of rewritten. So we kind of give you a lot of recommendation even at the SQL level, and we’re gonna be introducing more in the next few releases.
Gill: Great. Great. So we have here one asking, “Can Unravel improve the efficiency of my big data cluster and save me money? And if so, how?”
Bala: Yes, so definitely right. So in addition to providing application-level tuning, one of the features we’ve introduced is cluster-level tuning. So a lot of times there are cluster level defaults that we just use out of the box and that results in… Like, a great example I’ve seen is people spin up your Spark applications and all of these Spark applications are running on 8-gig containers.
Now when you go and ask the customer why it is running on 8-gig containers they say “You know what? That was a default that came with the platform.” But when Unravel goes and profiles these applications at run time, we figure out that you don’t need 8-gig containers. You can get by with 2-gig containers. Now when we kind of tell you your default can be 2-gig, all your applications can now just use 2-gig containers.
So just imagine the resource gains you’re getting at the entire cluster level. So today if you’re just using 8-gig containers, you’re kind of like screaming around saying that, “I don’t have enough resources, I need to add more resources to the cluster.” But with just one tuning recommendation with Unravel, not only are your applications putting less pressure on YARN, you also improve the overall resource replenishment on that cluster.
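Editor’s note: the arithmetic behind the 8-gig versus 2-gig example can be made concrete with a short sketch. The 512 GB cluster size is a hypothetical figure, and the calculation assumes memory is the only limiting resource; in a real YARN cluster, vcores and scheduler settings also constrain how many containers run at once.

```python
def containers_that_fit(cluster_memory_gb, container_gb):
    """Number of equally sized containers that fit in the cluster's memory."""
    return cluster_memory_gb // container_gb

if __name__ == "__main__":
    # Hypothetical cluster with 512 GB of memory available to YARN.
    cluster_memory_gb = 512
    before = containers_that_fit(cluster_memory_gb, 8)  # platform default
    after = containers_that_fit(cluster_memory_gb, 2)   # tuned default
    print(f"8 GB containers: {before}")
    print(f"2 GB containers: {after}")
    print(f"Concurrent containers increase by a factor of {after / before:.0f}")
```

Under these assumptions the same hardware holds four times as many concurrent containers, which is why a single cluster-level default can matter more than tuning any one application.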
So that’s one aspect of what we can do at the overall cluster level. The other thing is that, many times when you look at the operations features in Unravel, we can help you figure out the spikes in your cluster. You might see your vcore usage spike and get to the max level.
When you click that view within Unravel, we call it the forensics view, you know precisely which applications, and where they’re running, were causing those pressures, or causing the vcore usage to go up. And when you double click on that application we can tell you why.
Maybe the application was so under-performing that it was just asking for so much of resources, and we’ll give you tuning recommendations to tune that particular application which again may reduce the overall vcore usage in the entire cluster. The other thing is we provide you a good chargeback view. I didn’t have time to demo that capability today, but we have previous webinars where we’ve talked about operational features within Unravel.
We can kind of tell you how many users are using the cluster, you can look at it from a tenant perspective, a queue perspective. In fact one of the innovations within Unravel is you can look at it from a business user perspective. So you have a fraud business group that’s using the cluster, we can group information at that level, and tell you like how much resources are being used by the fraud business group, and then you can assign some dollar value.
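Editor’s note: a chargeback view of the kind described above boils down to grouping resource usage by some key and applying a rate. The sketch below is illustrative only: the record fields, the group names and the dollars-per-GB-hour rate are hypothetical, and a real chargeback model would likely also account for CPU and storage.

```python
from collections import defaultdict

def chargeback(app_runs, dollars_per_gb_hour):
    """Sum memory GB-hours per business group and convert to a dollar value."""
    usage = defaultdict(float)
    for run in app_runs:
        usage[run["group"]] += run["memory_gb"] * run["hours"]
    return {group: round(gb_hours * dollars_per_gb_hour, 2)
            for group, gb_hours in usage.items()}

if __name__ == "__main__":
    # Hypothetical application runs tagged with the business group that owns them.
    runs = [
        {"group": "fraud", "memory_gb": 8, "hours": 2.0},
        {"group": "fraud", "memory_gb": 4, "hours": 1.0},
        {"group": "marketing", "memory_gb": 2, "hours": 5.0},
    ]
    for group, cost in chargeback(runs, dollars_per_gb_hour=0.05).items():
        print(f"{group}: ${cost:.2f}")
```

Swapping the `group` key for a user, tenant or queue field gives the other perspectives mentioned in the answer.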
Gill: Thank you Bala, so we’re taking one more question here, and it’s, “Is Unravel a cloud service or an on-premise install?”
Bala: So good question. So we kind of like work according to the customer needs. So today we give you the Unravel software that you deploy it in your environment. So if you’re running it on-prem we give you the Unravel software that you can deploy it on your client nodes.
Everything happens within your node, no data comes through Unravel per se. Or you can also deploy it on the Amazon cloud. So we are available both on the Azure marketplace, and we’ll be available on the AWS marketplace pretty soon. So if you just go through that motion you can just do a few clicks and Unravel can get attached to your existing EMR cluster or HDInsight cluster, and then we can start profiling all the applications running on those clusters.
Gill: Great. Thank you, Bala. So that’s about the time that we have for today. I wanna thank our speakers Bernd Harzog and also Bala, for a great presentation. For any other questions that we are not able to answer we will respond by email directly, and of course will also send you a link to this recorded webinar.
Now for more information about Unravel or if you wanna get a one-on-one demo, or
Bernd: Thanks folks, bye.