Unravel Co-Founder and CTO Shivnath Babu co-presented at Strata 2018 in New York with Madhu Tumma, Director, IT Engineering, TIAA, on the topic of “Quick, Reliable and cost-effective ways to operationalize big data apps”. The video and full transcript are provided here.
Big Data is complex. Applications can fail. And when they do, one can be easily overwhelmed in diagnosing the breakdown: where is the root cause? What happened? Where do I even start looking? Engineers end up going on long explorations to try to figure it out, a process that could take weeks while the app continues to fail. With Unravel, you get notifications of a failed application, and you can get that app from a failed state to a successful state in minutes.
While TIAA seeks to make its big data environment stable, problems can arise that are sometimes difficult to diagnose in such a large deployment. It’s very challenging to see who is doing what, how you control users, and how you optimize it all. TIAA faced challenges in trying to tame its deployment, as apps were not always written efficiently and data was unreliable. Looking to optimize and improve reliability, TIAA turned to Unravel Data. As seen across many organizations, application developers spend 75 percent of their time figuring out why their app failed or is not performing well instead of creating innovative new apps. Thanks to its machine learning capabilities, Unravel can reduce this with the click of a button and can prevent problems from happening in the first place by giving users the tools to self-serve quickly.
With the help of Unravel, companies like TIAA are building new applications in the face of these challenges.
My name is Madhu Tumma. I work for TIAA, which a lot of people know as TIAA-CREF. I manage the engineering team primarily focused on data: data infrastructure, stability of the products, stability of the environments. That’s my area. I’m going to share some of my experiences, and then we’ll lead into Shivnath’s part later.
Just a little bit about our company. The company is now a $1 trillion wealth management company, so we’re in the top 10 in the USA, and we have quite a number of offices all over the country. It’s a 100-year-old company and very traditional; it has been serving the education community for a long time. But now it’s open: anybody can open an account and do transactions with our company. We have about 17,500 employees, and we deal with almost 5 million participants. So that’s our business.
Coming to why we are here: today we are also at the forefront of the digital revolution. We are one of the key players in the financial industry. As the market changes, we are changing. The market, customers, and participants need the digital experience. So that’s where we are. There’s an interesting article written by Klent in CIO last year that talks about our CTO’s vision. He talks about Scott blank, who is our digital CTO. So we are trying to make the experience of our entire operations more digital so that we are at the digital forefront.
So how do you do that? Let me start with the typical environment for most companies. Traditionally, companies have relied on quite a lot of database structures, starting from mainframe and relational, and that has been a strong footing for many companies.
In the last 10 years, there’s been a sea change. Data has been the mainstay for all operations, with traditional players like Oracle, Teradata, and Microsoft, companies that gave us good products for a number of years. But now data is everywhere, and data is moving onto different platforms. That’s where Hadoop and associated technologies come into the picture. Look at this picture here. Relational databases are giving way to big data, but they’re still there. That’s what we call polyglot persistence, right? This concept was highly popularized by Martin Fowler. Applications want to store data wherever it makes sense: it can be an RDBMS, it can be Kafka or Hadoop. So there’s quite a variety of platforms that support data. That’s why the focus of this session is more on how to maintain them and how to make it a stable environment. That’s most of what I’ll be talking about.
Big data has pretty much been with us for the last 10 years. Starting from 2007, when the Yahoo guys opened up Hadoop, it has become a sea change in terms of analysis of large data sets. In a way, it is taking over the world, right?
Let’s go to a typical architecture. In any organization, you see multiple data sources. It could be an RDBMS, it could be social streaming, it could be machine learning, it could be IoT, and a lot of mobile. Typically you collect the data, you process it, in batch, in real time, etc., and then you need somewhere to store it. This picture is just an example of the kind of architecture being used to support a lot of applications: it could be internal analysis, data science support, mobile, response time, compliance, security. So there are a variety of applications that rely on the environment. At the same time, big data is becoming ubiquitous in the way you respond to analysis, the way you alert your customers, or alert your own administrators. So there is a variety of functionality being built on this architecture, even just a typical architecture.
Before I go further: most people attended the keynote yesterday. And over the last day we have seen a number of different sessions, a lot of sessions on machine learning, a lot on artificial intelligence, and a lot on solution architecture, how applications are architected. But we believe that, besides those three things, the stability of your systems matters. How do you manage your systems? How do you operate your systems? Those are important. A data scientist, or a mobile customer who would like to check results on a phone, or somebody trying to analyze something: everything depends on the stability of your platforms. That’s where the platform operational aspect is very important. Just to remind you, the very first keynote session yesterday was by Anupam as well as Brian, Brian from PNC Bank and Anupam from Cloudera. If you recollect what Anupam said, he talked about three 5’s: 5,000 users, 500,000 queries, and 500 terabytes. If you remember, he showed that on the first slide. After that, he showed another slide with a table: there were 292 tables in a single query. This is what actually happens with stability when you talk about clusters. We’ll spend a little time on that, and then we will lead into a discussion of how you really operationalize. That is the topic of this training.
So, you know, we have grown quite a lot in terms of our data structures. In our company we are primarily mainframe, a lot of Oracle, and a lot of Teradata, but our company made a decision to go more digital. We wanted to look at other things and take advantage of different systems and platforms that can help us achieve our digital revolution. That’s where we have been. We started from curiosity in the beginning, then it became a pilot, and from the pilot we went into production. We are now able to manage a lot of ETL flows into big data, using a lot of different tools. We are trying to utilize our systems not only for data science, research, and analytics, but also for customer-oriented experiences. We are trying to feed some of our end-user websites through big data, and we are trying to use it for security and compliance: suppose we are hit by some security threat, can our big data alert us? It may be a combination of Kafka, some NoSQL, and Hadoop structures. And as most of you who have worked with Hadoop will realize, there’s a lot of diversity in Hadoop and its ecosystem. You talk about HDFS file systems, you talk about different querying engines like Impala and MapReduce, then you also have Presto; you have 12 different querying engines. Then you have a lot of different file formats. At the same time you have a number of tables and databases in Hive, and then you have HBase, which is a little bit different from that. So it’s a combination of many services. This is how the diversity has come into the picture. You also see diversity in terms of application types: different types of applications are all utilizing the same big Hadoop cluster, and different applications have different needs. That is another important feature we need to understand.
And also the different users: there could be some users who are power users, who would like to do a lot of self-service things. There are some users who simply take the application that is written for them and use it. You also have different programming skill sets; some people have good skills when they develop applications. The criticality of applications is also important. That is what we call diversity. So let’s go further.
So now: growth and diversity. On growth, as I mentioned yesterday, big data is typically approaching a petabyte in a normal company. Some companies have multiple petabytes, maybe Uber people use …. but every company is not the same. They are different in their transactional nature, different analysis, different data, but typically there’s growth and there’s diversity. What happens is, when you have such a large system, you tend to have problems that are sometimes difficult to diagnose. That is the first aspect. The second aspect: it’s very difficult to see who is doing what, what kinds of files are being created, how you track it, and how you control it. The third aspect is how you optimize and how you really plan. The idea is to make your environment very stable. That is how we will start this topic.
Coming to the challenges. First, you have more problems and less visibility, because applications are not always written very efficiently. An application may have been tested with some small data, or tested in a different environment, and then it’s onboarded. If everything is good then you are good, but sometimes it can become a rogue application, and sometimes queries will not come back. “It was working for the last month, but in the last week we have noticed a lot of slowness”: that’s typical, it happens. What is the result? You have failures, sometimes you miss the SLAs, and you get slowdowns. This is what happens in the field. This is one area where you need to diagnose exactly what is happening. Generally, when an application is tested or developed, it looks very good. But at runtime, after some time, the data pattern changes, data sometimes gets skewed, and data grows. Then those problems start coming. How do you really track that? That’s one topic we need to understand.
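To make the point about skewed data concrete, here is a minimal sketch of one way to track it. It is not something shown in the talk: the function names, the sample partition sizes, and the threshold of 5 are all illustrative assumptions. The idea is simply to compare the largest partition of a dataset against the median partition and raise a flag when the ratio grows over time.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Return the largest partition size divided by the median partition size."""
    if not partition_sizes:
        raise ValueError("need at least one partition size")
    mid = median(partition_sizes)
    if mid == 0:
        # All-empty partitions are treated as balanced; otherwise skew is unbounded.
        return float("inf") if max(partition_sizes) > 0 else 1.0
    return max(partition_sizes) / mid


def is_skewed(partition_sizes, threshold=5.0):
    """Flag a dataset whose largest partition dwarfs the typical one.

    The default threshold is a hypothetical value, not a recommended setting.
    """
    return skew_ratio(partition_sizes) >= threshold


# Hypothetical partition sizes (in MB) for the same job at test time and later in production.
at_test_time = [100, 110, 95, 105]
in_production = [100, 110, 95, 2000]

print(is_skewed(at_test_time))
print(is_skewed(in_production))
```

Running a check like this on every run, rather than once at onboarding, is what catches the "it worked for a month and then slowed down" case described above.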
In order to match the functionality of the applications, there are a lot of metrics to look at. Some are system-level metrics. You also need to understand metrics that are very specific to Hadoop, right? These could be HDFS details, or memory consumption in terms of containers, memory allocations, resource pooling, etc. So are we following best practices? Are we following the generally accepted norms for applications? That’s one thing. Also, sometimes when you start a data application, you may think that it is going to create so many files and so many partitions, but sometimes it goes beyond your control. I’ll give an example.
Normally a lot of streaming applications now try to write to Hadoop. When you start streaming, you may do some kind of analysis on the stream at the Kafka level; that’s one example. But you tend to write back into Hadoop and then start using the data from there. When you write into Hadoop, you have to close the file before you can use it; once you close a file, it is what we call immutable data. So we tend to write lots of files, and you may not have anticipated this when you designed the application. Then, after some time, if you look, there are millions of files for one table. That dictates your performance: partitions, files, the number of files. Some people, especially the Hadoop experts, say don’t create too many files, and not too-small files. But again, if you have too small a number of files, even your parallelism suffers. These are some of the challenges we’ll have, but the idea is that you need to understand them and you need to really track them.
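As an illustration of tracking the small-files problem described above, the following is a minimal sketch that works on a plain listing of (table, file size) pairs. How that listing is produced is left open; the 16 MB threshold, the 50% cutoff, and the function names are assumptions made for this example, not values from the talk.

```python
from collections import defaultdict

SMALL_FILE_BYTES = 16 * 1024 * 1024  # hypothetical cutoff for calling a file "small"


def summarize_tables(files, small_bytes=SMALL_FILE_BYTES):
    """Aggregate (table, size_in_bytes) pairs into per-table file statistics."""
    stats = defaultdict(lambda: {"files": 0, "small": 0, "bytes": 0})
    for table, size in files:
        entry = stats[table]
        entry["files"] += 1
        entry["bytes"] += size
        if size < small_bytes:
            entry["small"] += 1
    return dict(stats)


def tables_to_compact(files, max_small_fraction=0.5, min_files=3):
    """Return tables where more than max_small_fraction of the files are small."""
    flagged = []
    for table, entry in summarize_tables(files).items():
        if entry["files"] < min_files:
            continue  # too few files to matter
        if entry["small"] / entry["files"] > max_small_fraction:
            flagged.append(table)
    return sorted(flagged)


# Hypothetical listing: a streaming-fed table with many tiny files, and a batch-loaded table.
listing = (
    [("events", 1_000)] * 4
    + [("events", 200 * 1024 * 1024)]
    + [("accounts", 300 * 1024 * 1024)] * 3
)

print(tables_to_compact(listing))
```

A report like this only flags candidates; whether to compact, repartition, or change the streaming write interval remains a design decision for the application team.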
How do you optimize? Systems are generally built as clusters. It could be a 50-node cluster or a 100-node cluster; that depends on the type of structure you have. There are a lot of system-level parameters, and we need to understand them. Can I apply them to the entire system? One particular application may take advantage of a setting, and some other application may not. So we need to understand that. There are also the resource allocations for different applications, because sometimes applications compete. Before I go further on this one, if you look at the history of Hadoop clusters, they all started as single-purpose and simple: one subject, one team that ran it and knew the users of the system. But that was some time back. Nowadays, it’s more of a multi-tenant cluster. You have a large 50-node or 100-node cluster, and every application team wants to be on it, because you cannot replicate everything. There are some challenges in having multiple clusters, so many companies tend to build one large cluster so that you don’t need to duplicate the data. You can take advantage of the same data in one place. That brings up the challenge of multi-tenancy. That is one of the things that bother a lot of administrators: how to manage it. You have different query engines: you have Impala, for example, you have Hive, then you have Presto, or you may have some other query engine. They all coexist in the same place, and each has its own regime of resource allocation. That’s why it becomes a challenge. Then, we need to understand the forecasting needs.
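The multi-tenancy concern above, applications competing for one shared cluster, can be sketched as a simple share-versus-quota check. This is only an illustration: the tenant names, the memory figures, and the quota fractions are hypothetical, and a real cluster would take these numbers from its resource manager rather than from a hard-coded list.

```python
def tenant_usage(app_usage):
    """Sum memory (GB) per tenant from (tenant, memory_gb) pairs, one pair per running app."""
    totals = {}
    for tenant, memory_gb in app_usage:
        totals[tenant] = totals.get(tenant, 0) + memory_gb
    return totals


def over_quota(app_usage, quotas, capacity_gb):
    """Return tenants whose share of cluster memory exceeds their quota fraction.

    quotas maps tenant -> allowed fraction of capacity; tenants without an
    entry are treated as unrestricted in this sketch.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    offenders = []
    for tenant, used in tenant_usage(app_usage).items():
        if used / capacity_gb > quotas.get(tenant, 1.0):
            offenders.append(tenant)
    return sorted(offenders)


# Hypothetical snapshot of a 500 GB cluster shared by three teams.
running = [("etl", 100), ("adhoc", 250), ("etl", 50), ("bi", 40)]
quotas = {"etl": 0.4, "adhoc": 0.3, "bi": 0.3}

print(over_quota(running, quotas, capacity_gb=500))
```

The point of the sketch is the shape of the check, per-tenant totals compared against an agreed share, rather than any particular quota values.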
Just to go back, we talked about three things: more problems and harder diagnosis, how to track and control, and how to optimize. All of this depends on a lot of expertise from administrators as well as developers, and they need to work collaboratively. That’s where we were looking for some kind of outside help. It could be professional services, but professional services are always limited. You can hire them, they come and help on a particular issue, but after some time you are back to your own stability issues. So we were always looking for some kind of tool. Traditionally, there is the concept of APM: for a long time, Application Performance Management has been an important aspect.
Before I go to APM: poor application and Ops management results in a lack of reliability. People don’t believe in it, and sometimes costs go up because you tend to buy more hardware. You’d also like to be able to solve problems in time. That’s why Ops management is an important aspect of the entire Hadoop ecosystem. Generally, if you look at the APM vendors that have been engaged in this for a long time, they talk about application performance management end to end. It’s not purely about the platform; ultimately it is the experience of the end user, the experience of a data scientist, that’s important. We need to take care of all of these things. Over the past 10 years we have seen a lot of different software, like Precise and Wily; later we saw CA Technologies, then Dynatrace, etc. And recently we have seen Compuware and New Relic. These all provide different functionality. Then we came across Unravel: we started researching them, and we collaborate with them to help us. Unravel fits into the APM space, so that’s where I’ll stop. Now I’ll hand it over to Shivnath. He’s the CTO of Unravel as well as a professor at Duke.
Thanks a lot, Madhu. It’s a pleasure. As Madhu said, I am CTO and co-founder of Unravel. This picture actually shows you how Unravel is, in a sense, like a ball that is trying to solve a pretty hard problem: a problem that has sort of been solved in the past, but for systems like databases, such as the Oracle database. Fast forward many years, and databases started getting used as part of a larger web system. Then companies like AppDynamics came along to solve the challenge of supporting applications and operations management when you have complex systems. And today, as Madhu was saying, all of these mission-critical applications, like innovative machine learning applications, streaming applications, and good ole’ ETL and BI type applications, are running on a new stack, which by definition is composed of multiple disparate systems: Hadoop, with MapReduce or Spark; S3 or another cloud store; Amazon Kinesis and Kafka type streaming systems; or newer ones like TensorFlow. All of these tend to be distributed systems, and for all the challenges that Madhu mentioned, our claim is that Unravel can solve them. “Well, really?” you might be asking. Let’s jump into a few demo scenarios. What I’d like to do in the next 5-10 minutes or so is show you three quick examples.
I’ll start with this one. How many of you have written big data applications here? Probably 30% of you have written applications, right? How many of you support big data applications?
We have DevOps people right here in front, right, so this is a problem I’m sure we can all relate to. You run your first Spark application or your hundredth Spark application and, boom, it fails. Where is the root cause? What happened? The first problem you run into is: where do I even start looking? Do I start looking in Cloudera Manager? Do I go to Ganglia metrics, CloudWatch metrics, Datadog, and things like that? Do I go looking at the logs? And if you’ve seen any of these really bad log4j logs, you’ll know it’s very hard to understand what is going on.
Let’s see if I can quickly bring up my demo. So Shivnath, the big data developer, gets a mail or a page at 2 a.m. saying, “Your prod ML model pipeline is failing, go fix it,” right? First, he thinks: I need to find my pipeline and make sure that everything is good in the cluster. I log into Cloudera Manager here. Boom, I’m logged in, and great, everything is looking green; all these services seem good and nothing is looking bad on the cluster. I recall that my application was a Spark application, so let’s take a look at Spark. Seemingly everything is good, but my application is failing. So where do I start now? Here is a history server web UI; let me go into that. Great, now I can at least see applications, but what am I looking at here? Which application is mine? Everything looks the same. Some ETL application is running repeatedly. So now I probably go back and talk to Raymond, asking for an application ID, something to start looking at. Probably he responds and says it’s application number 1515. I look at the event timeline, which is a nice, beautiful visual, but what is it telling me? Execution failed; well, I know that. I don’t know what to do or where to start looking. Luckily, if you squint carefully, it says “failed jobs” below. Let me go into failed jobs; that’s probably a place where I can start looking. If you look very carefully, there is a stage that failed, and it says org.apache.spark.shuffle.FetchFailedException: Direct buffer memory. Then go into the details. What are the details? Some complex stack traces. What do you do next? I cut and paste this, I type it into Google, and the first thing I see doesn’t look very refreshing, right? I don’t know any details about the Spark app, but I search and I end up on Stack Overflow. So this is reality, right? This could be a 2 a.m. to 4 a.m. type of thing.
This could be weeks, while, as Madhu was saying, my application keeps failing, and this is not reliability, even on a big data stack. So fast forward to the Unravel interface. Right here it tells you, hey, I found an application that failed. Let me click and see what’s happening; this is the Unravel interface. First, you can see that it is application 1515. And it says, yes, I detected that it failed. But you know what, I was actually able to get it running. The first one is the failed application: 15 minutes, and it failed. Unravel did something. It was able to get that application from a failed state to a successful state in four minutes. But it doesn’t stop there. It tried something more and said, if you run it again, I actually have a better configuration that can run it reliably and quickly, in under a minute.
This is basically the difference between looking at screens like this, where you can see some stack traces and all of that, and actually being able to solve problems. As Madhu was mentioning, everybody here at Strata is using big data and applying AI and ML to get insights, and TIAA is doing that; he gave a bunch of examples, right? Why can’t we do the same thing with the big data generated by systems, the metrics, logs, and all of that, convert it to insights, and really make applications that are self-tuning? That is the context in which we have been building the company. So if you go back and look at this world, we want to move away from this world right here, where app devs are looking at things and wasting time. We have done studies with a lot of enterprises, and you may or may not be surprised to learn that 75% of application developers’ time is not going into what they should be doing, which is creating innovative apps. It’s going to everything else: today my app failed, or tomorrow it’s slow, its performance changed, right? What we want to do with Unravel is address exactly this. With the customers that we work with, the numbers are pretty high, and we can reduce that with the click of a button, or with Unravel auto-tuning the app. Hopefully you’ll understand why so many interesting things are possible. Now go from the individual app developer’s world to the world of the operations people. As Madhu was mentioning, clusters today are very diverse, and there are lots of different kinds of apps: Spark Streaming apps, machine learning apps, apps reading a lot of data. And in all companies, you’ll see that for every operations person there are sometimes 50 application developers, and that operations person is sitting and trying to solve problems like this one, with the pipeline that I was mentioning.
The prod ML model pipeline was running well, and today it’s very slow. If you look at the right side here, every dot represents a run of this application. It was running well under the 5-minute SLA; today performance is bad and unpredictable. With Unravel, the same kinds of machine learning analysis can be applied to this data to crisply tell you things like this. Unravel built a baseline of your application and can tell you that today performance is worse. Yes, you’re seeing that, and you want to be warned ahead of time. But it can also tell you the root cause: for example, today the problem is that there’s some contention in the cluster and the application masters are taking a long time to get launched. That’s great to know. At least you know what is happening. It would be even better if this were an alert that came to you, like a rogue application detector. Imagine an alert coming to you saying, by the way, that mission-critical application that was running as part of the workflow was slowed down by a rogue user who started a Spark shell and ate up all the resources with dynamic allocation. Things like that happen very, very easily. You can ask Madhu, and he’ll probably tell you how many times they have had to solve problems where some rogue application is causing problems in the cluster, not because of the user’s intentions but because there’s so much complexity, and users just want to write their SQL and want it to run very well in the cluster. So in this world, we’ve been building an auto actions capability in Unravel, which is all about: can we automate, can we actually prevent problems from happening? So if you look at the reality that’s going on in the world today, and again these numbers are based on the studies that we have done with a lot of enterprises, we will tell you exactly, and I can see on my browser here that the application is stuck. This is actually a problem I didn’t get into.
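The baseline idea described above can be illustrated with a very small sketch. This is not Unravel’s method, which the talk does not detail; it swaps in a plain median-plus-deviation rule, and the run durations, the 300-second SLA (the 5 minutes mentioned above), and the multiplier `k` are assumptions for the example.

```python
from statistics import median


def baseline(durations):
    """Return (median, median absolute deviation) of past run durations in seconds."""
    if not durations:
        raise ValueError("need at least one past run")
    med = median(durations)
    mad = median([abs(d - med) for d in durations])
    return med, mad


def check_run(history, latest, sla_seconds, k=3.0):
    """Compare the latest run against the baseline of earlier runs and against the SLA."""
    med, mad = baseline(history)
    # Use a floor of 1 second so a perfectly steady history still tolerates small jitter.
    threshold = med + k * max(mad, 1.0)
    return {
        "slower_than_baseline": latest > threshold,
        "sla_missed": latest > sla_seconds,
    }


# Hypothetical durations (seconds) of earlier runs of the same pipeline.
history = [240, 250, 245, 260, 255]

print(check_run(history, latest=258, sla_seconds=300))
print(check_run(history, latest=270, sla_seconds=300))
print(check_run(history, latest=900, sla_seconds=300))
```

The middle case is the interesting one: the run is still inside the SLA but already outside its own baseline, which is the "warned ahead of time" situation described in the talk.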
It has been an interesting talk with a lot of technical difficulties. So what you can actually see here, and the more interesting stuff is on the other side: closing a ticket or solving any particular ticket in the big data ecosystem takes at least 20 hours on average. So unfortunately the last 5 minutes of the talk will have to be spoken over something that looks like this, to connect everything that I was saying. Operations teams are spending close to 20 hours working on any ticket. And with Unravel, based on all the analysis and machine learning capabilities that we have been able to bring, we’ve been able to cut that down significantly, into minutes, for three reasons. One is preventing problems from happening in the first place, just like I showed you. You can actually fix these SLA problems or rogue applications automatically. The second is giving tools to the users so they don’t even create these tickets in the first place. They can self-serve. And the third, of course, is that when a problem happens and the ticket comes to the ops person, he has a very good set of tools to solve it very quickly. So using these three mechanisms, Unravel has been able to cut down this time significantly.
And the last part, which fortunately I can show you, is, as Madhu mentioned, big data apps now (that’s 30 seconds). Big data apps have gone beyond a single SQL query; a Spark application is not really your application. It is a pipeline that can be composed of Kafka bringing in data, Spark Streaming consuming from Kafka and doing computation, and maybe maintaining certain state. How do you go about troubleshooting something like that if the application is very slow? Unravel connects, wow, okay, so this is the right time, right? And this makes it much easier than before, with far fewer tickets, and not only in the case of individual queries or Spark applications, but across the whole gamut. Being very optimistic and going to the next slide. Oh wow. Even in the context of applications like this, and all the challenges you’re facing.
Just to wrap up, overall what we showed today is that companies like TIAA are building new applications and facing a lot of challenges. And we have been working very closely with enterprises to solve all these sorts of challenges. While we do have some way to go, we actually have a pretty solid solution that I would like all of you to try out and give us feedback on. We have free trials, so please try it out. Thank you.