Using Machine Learning to Simplify Kafka Operations

– Learn about current algorithms in the open source community used to find outliers and spot big data performance issues

– Understand how Unravel’s APM software can provide transparent full stack management of clusters such as Azure HDInsight’s cloud infrastructure or on-premise systems

– How to get the best Kafka performance, with predictability and reliability, whatever applications you’re running

– Explore in depth a case study of how to apply recent advances in machine learning and AI to solve the real-world problems of teams with multiple goals, running multiple applications, and controlling cloud costs by running at a good utilization

Transcript of video (subtitles provided for silent viewing)

– [Shivnath] I’m Shivnath Babu, CTO and cofounder of Unravel Data Systems. I’m also an adjunct professor of computer science at Duke University. So, in like, two words about Unravel itself, the company was formed in 2013. It provides a solution to collect monitoring information from every level of the stack, from applications like ETL and today we’ll see a lot of streaming applications from the platform side, like with Spark and Kafka, from the infrastructure.

All of this monitoring information is collected in real time, brought to the platform, and we’ll see the kind of interesting things it can do once such monitoring data comes in real-time. We have a lot of customers who are running the system in production, and we work very closely as a partner with Microsoft… Now, it’s my pleasure to introduce Dhruv, the program manager who’s leading everything streaming at Azure HDInsight. So, take the stage, Dhruv.

– [Dhruv] Thanks, Shivnath. Can you all hear me? Morning everyone! How’s everyone doing?  Good? Can I have a show of hands? How many people here have heard of Azure HDInsight? Okay… a lot of you. That’s great! So, I’ll just do a quick intro about Azure HDInsight. It’s our fully-managed cloud service that makes it really easy for people to have a fast and cost-effective mechanism to process massive amounts of data.

I won’t go too deep into the product because I think this talk is more about how it applies to real-world scenarios as well as how the streaming solution and the streaming scenario is working out in production companies. Then, Shivnath is going to talk about how machine learning and AI are really helping out there. Recently we announced a big price cut in our product…over 50% for a lot of our services.

With HDInsight, you can set up clusters for Hadoop, Spark, or Kafka on Azure and have it running very quickly. In general, with HDInsight, we have seen our customers coming from a variety of industries and there are real-world applications running today in almost every industry, be it manufacturing, healthcare, or retail.

A few years ago, we used to talk about  big data as something very nasty that was being done in labs. But now we’re seeing companies actually productionizing a lot of these scenarios. So, Azure HDInsight is something which is cloud-native, built in the cloud, making it really global. I’m going to talk about one of our customer’s scenarios which shows how fast it is to scale out and to really increase the visibility of your service, allowing you to process data all over the world.

You get a lot of built-in security and compliance because it’s integrated with a lot of the core Azure services, which provide all of those features like built-in HIPAA certification. We have a lot of productivity tools built in and I already talked about low cost… So, I’m going to talk specifically about Kafka, which became generally available on HDInsight in December.

Today, Kafka is really popular for ingesting large amounts of data. One really interesting thing about both Kafka and Azure HDInsight is that we provide disaster recovery via MirrorMaker and end-to-end streaming pipelines. We see these as real-world scenarios which are using Storm and Spark to process massive amounts of your streaming data… Not just data from every scenario where we see multiple events, not just gigabytes, but thousands of petabytes of data being processed all the time.

I mentioned how I would refer to a use case… One of our biggest internal customers, Siphon, is the bench on which Office 365 and Bing Ads operates. So, it’s a data bus in which lots of data comes in, gets analyzed through Spark and Storm, and then gets passed onto Office 365 and Bing for their ad monetization and other functions.

What you can see here is the growth and how easy it has been for them to scale their operations. This is from December, but today, they’re processing a trillion events per day, which is just amazing! It’s amazing how quickly they’re scaling, and how easy it is for them to scale, because they operate in many different regions.

For example, for Office 365, sometimes a key scenario is that the underlying data has to be in the same data center in which they’re doing that process. So, Siphon needed to set up something in East Asia, and it was really easy for them to quickly start up clusters in Korea and Asia Pacific to get their operations up and running really quickly.

Similarly, they have operations in our Azure government regions such as U.S. Gov and China. So, it’s very easy with HDInsight, and with Kafka…real-time scenarios….for them to scale out very quickly. For us, a really big success story for this streaming scenario has been with Toyota and its Connected Car platform.

In the Toyota Connected Car platform, you can imagine lots and lots of cars having a variety of signals coming in and being processed so both customers and Toyota know when the car needs servicing or how the car is performing. And that’s kind of based on HDInsight/Kafka today. Toyota has been really happy because they can get this managed service with a great SLA and leverage both a scalable technology and process to create the end-to-end data streaming pipeline.

…This is the architecture that Toyota uses. They have millions and billions of events coming in every second into the Azure Gateway Services through IoT hubs. It gets processed across various HDInsight clusters, which use Kafka, Storm, and Spark to process it, and then they can visualize to figure out, “Hey, which cars need to be serviced really quickly?”

All of this is trying to show that these are real production systems happening today and this is just a slice of the scenarios you have with streaming… only a slice of use cases that our customers face. But there are challenges across those, and it’s some of the challenges that ML and AI are processing. So, I’ll pass it over to you, Shivnath.

– [Shivnath] Thanks, Dhruv. So as Dhruv showed, a lot of companies are now creating streaming applications and running them in a fashion which is mission-critical. He gave an excellent example, actually two examples… One from an internal Microsoft use case, which is actually running at a pretty large scale and maybe matches the scale that I can see some of the gentlemen here from LinkedIn are also running, and another very important use case from IoT. Now, this is not just these two industries. There are challenges being solved by streaming applications in manufacturing, in customer service, in sentiment analysis, a lot of real-time recommendations. Now, because all of these are mission-critical systems, you want the underlying architecture, which can be composed of Kafka for ingesting the data and transferring it in real-time… HBase or Cassandra for interactive access to this data, or Spark Streaming or Flink to actually process the data.

The architecture has to be reliable and when results are not being generated in a timely fashion, the real-time results are not real-time, right? That’s when problems actually start to happen. And today, arguably, it’s a challenge to understand where the root cause of the problem is and how to fix it. Right? It could be that the Spark streaming side is being challenged by not getting the right amount of resources to run, it’s running in a multitenant fashion.

Or it could be on the Kafka side. Maybe the number of partitions are not right, or maybe there’s some configurations that are not set right, right? Or resources have not been allocated properly, and anything and everything in between. Right? And it’s a challenge because teams that are creating these applications and actually operating these clusters, arguably, they do not have a good tool that actually gives end-to-end visibility, right?

Well, visibility would be great, but it would be even better for there to be a tool that can analyze all of this data, give insights in real time, give recommendations on how to process or how to fix these problems in real time and ideally, even in a self-healing fashion, where the system sort of cures itself, right? So, the purpose of today’s talk is to raise some awareness and interest around these problems and actually show you some of the work that we have been doing and maybe get some feedback and explore some opportunities for collaboration.

So how can we empower these data ops teams? Well, we can collect a lot of these metrics, and if you’re familiar with the systems just as… You know, there are metrics everywhere – from the processing side to the storage side – so can we build a platform that brings all of these metrics into one single place, and can give nice views and insights? But it should not stop there, right?

What we really want is to apply machine learning algorithms on that data and generate insights and recommendations and ideally even get to applying artificial intelligence, right, which can take actions, fix these problems. And that’s what I’m going to be focusing on in the rest of the talk. And the way I will do it is, it’s like, again, this is the world of machine learning and AI, right? And these algorithms and techniques are a dime a dozen.

And I’m sure you are flooded with emails about companies doing machine learning and deep learning. So, what I’ll try to do today is give you a good way to think about these problems and tell you some of the successes we have had and the challenges we are facing, which all boils down to understanding what the goal is, what these DevOps goals are, and then mapping the right algorithms to it.

So, something about goals that DevOps teams have in a streaming environment, let’s say very specifically a Kafka environment, there’ll be application teams that might have some goals to meet. I have a throughput goal: I need to process 100,000 events a second… right? Or a latency goal: the gap between the time at which data arrives and the time it is processed is less than two minutes… At the same time, the platform owners will have…usually, they run multitenant Kafka clusters so they could have different applications with different goals, right?

And on top of that, they might want to keep the platform running at a good utilization and if it’s on the cloud like HDInsight, running at reasonably low cost. And not only today but nicely planned for the future as well, right? On the other side, we have all of these interesting algorithms that are being worked on and like, you know, many algorithms that you can readily get in open source tools, like algorithms for outlier detection, algorithms for like, you know, correlation analysis, right?

And how do they like, match these two things? And that’s exactly what I’m going to cover in the rest of the talk. Let’s start with a very sort of like, you know, easy-to-understand problem, outlier detection, right? And the format I’ll follow is I’ll give you use cases where each technique can be applied to great benefit and then I’ll connect with like, you know, different kinds of algorithms that can be applied.

So, outlier detection in a Kafka world, right? It’s very common to have the load be imbalanced across Kafka brokers or across Kafka partitions. Here’s an example, a real-life example where what I’m showing here are two sets of time series charts… and you see a bunch of these charts as we go along. The first one shows the bytes ingested per second, and the bottom one shows the message rate per second, and if you look at the tooltip, you’ll see that I’m showing three brokers here, of which one broker – the kabo1 broker – is actually having one-tenth of a load with respect to the other brokers, and often the problems might be the other way, right?

One broker is becoming a hotspot. And this can happen at partitions at many different levels too. This is a problem which can be like, you know, found or detected very quickly with algorithms for outlier detection and even the simplest like, you know, algorithm. But when you start thinking about these algorithms… for example, outlier detection… like there are a couple of different dimensions to think about:

One is, there are algorithms that are great for analyzing one time series, like, you know, one feature. And some algorithms are great for doing what’s called multidimensional analysis, right? There are certain algorithms, and I’m showing you an example here, commonly what’s called the Z-score algorithm. What that does is it fits a distribution to the data and then any points that don’t match the distribution, like I’ve shown there, are outliers.
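As a minimal sketch of the Z-score idea described here: the sample values, the function name, and the threshold of 3 standard deviations are illustrative assumptions, not figures from the talk. Each point is scored by how many standard deviations it sits from the mean of the series.

```python
from statistics import fmean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return indices of points whose |z-score| exceeds the threshold."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return []  # all points identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


# Hypothetical bytes-in samples (MB/s) for one broker; the last sample is a spike.
bytes_in = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 500]
outliers = zscore_outliers(bytes_in)
print(outliers)
```

Note that with only a handful of points (for example, three brokers compared at one instant) no z-score can reach 3, so a lower threshold or a longer window would be needed in that setting.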

And even simple algorithms like these can actually go a long way to quickly notify an operator, instead of like staring at graphs, that something is actually… there is a broker that is like out of whack, right? And as I said, like, you know, sometimes your problem can fit into a single dimension, but a lot of the time you might want to, like, you know, identify brokers that are outliers both in terms of input data as well as, like, you know, maybe CPU utilization or disk utilization.

So, there are algorithms that are good for multidimensional outlier detection. One is the DBSCAN algorithm, which actually uses a density-based clustering that I’ve illustrated here. So, it groups together points into clusters so that points which are not really part of any cluster can be quickly identified as outliers.
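The density-based idea can be sketched with a small, unoptimized DBSCAN written from scratch. The broker coordinates (normalized bytes-in and CPU utilization), `eps`, and `min_pts` below are hypothetical values chosen for illustration; a production system would more likely use a library implementation.

```python
from math import dist


def dbscan_labels(points, eps, min_pts):
    """Minimal DBSCAN: returns a cluster id (>= 0) per point, or -1 for noise."""
    n = len(points)
    # Neighborhood of each point, including the point itself.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels


# Hypothetical brokers as (normalized bytes-in, normalized CPU utilization).
brokers = [(0.50, 0.55), (0.52, 0.53), (0.48, 0.57),
           (0.51, 0.54), (0.49, 0.56), (0.95, 0.97)]
labels = dbscan_labels(brokers, eps=0.1, min_pts=3)
print(labels)
```

Points labeled -1 belong to no dense cluster; here that is the last broker, which is high on both dimensions at once.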

And recently, there has been a lot of interest in outliers because, as I said, like, you know, this applies not only to Kafka, but it applies broadly to any kinds of data. The two ways in which people have been extending these algorithms are: one, bringing more interesting and sophisticated models like decision trees to this problem or, now with deep learning, there are autoencoders, which are great at, like, you know, recreating the data based on the original data and identifying which data points don’t really match, right?

So, that’s an example of like, you know, outliers, outlier detection. And I’m going to give you four more such like, you know if you think about it, use case kind of patterns. Let’s move onto the next one, slightly more complex, forecasting. There are very excellent examples of like how forecasting can help in an operations world. And it’s all about how quickly can a predictive algorithm let you know that a problem is going to happen?

This can help human operators, of course. It gives them like… instead of getting into a fire fighting scenario, you get a proactive notification. Or, this is actually…this can be like the backbone of auto- or self-healing algorithms that can take the signal and take actions, right? Let’s see an example.

I’m actually going to give you a real example from a customer scenario, but this customer is processing a whole bunch of the firehose of Tweets… to extract sentiment from it and use that to do much more intelligent customer support. It’s from a telecom use case and this application uses Kafka, it uses HBase to keep intermediate state. All the computation happens in Spark Streaming and the resource allocation happens in the YARN environment. And here’s a quick screenshot from this kind of real-life scenario. As you can see, it’s a Spark Streaming application.

All these green bars you see are the events per second. This is actually what I got permission to share. It is not the full firehose, but it actually illustrates the problem. You’re getting around 40,000 to 50,000 events a second. And if you look at that gray line in between, in this particular scenario, their latency SLA was, from the time data arrives to the time it’s actually processed should not exceed three minutes.

Now, if you look at the trend, you can see the lag sort of building up… It’s actually one minute. So, by applying like a forecasting algorithm on this kind of data, you can get a quick sense of how much more room I have before all of these SLAs will be missed, right? So, in this way for like, you know, getting a sense of problems beforehand, forecasting is great.
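To make the "how much room do I have" question concrete, here is a deliberately simple sketch that fits a least-squares line to recent lag samples and extrapolates to the SLA. This is plain linear extrapolation, not ARIMA, Holt-Winters, or Prophet; the lag samples and the function name are hypothetical, while the three-minute SLA and the current one-minute lag follow the scenario in the talk.

```python
def minutes_until_breach(lag_seconds, sla_seconds, interval_minutes=1.0):
    """Extrapolate a least-squares line through the lag samples to the SLA.

    Returns the minutes from the last sample until the fitted line crosses
    the SLA, or None if the lag is not growing. Needs at least two samples.
    """
    n = len(lag_seconds)
    xs = [i * interval_minutes for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(lag_seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, lag_seconds))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None  # lag is flat or shrinking: no breach predicted
    crossing = (sla_seconds - intercept) / slope
    return max(0.0, crossing - xs[-1])


# Hypothetical lag samples taken once a minute; the lag is now at one minute.
lag = [20, 30, 40, 50, 60]
remaining = minutes_until_breach(lag, sla_seconds=180)
print(remaining)
```

A straight line ignores seasonality and bursts, which is exactly where the more capable forecasting algorithms discussed in the talk come in.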

Now, forecasting is a field where there has been a lot of work over like, you know, many, many years and there are very good algorithms which are, again, provided by most of these open-source libraries that you can actually get: ARIMA (autoregressive integrated moving average) and Holt-Winters. These are traditional algorithms that apply like, you know, time series forecasting to the data. On the other hand, there is a lot of interest more recently in what are called recurrent neural networks in machine and deep learning, and a very specific version of them, called long short-term memory networks, has been great at forecasting.

Now, where we have had a lot of success is actually by employing an algorithm that Facebook released, I think it was sometime last year, called the Prophet algorithm, right? The challenge we ran into is that some of the earlier techniques need hand-holding, right? It’s not like you just put the technique in and then everything magically happens. With the Prophet algorithm, at least our experience has been that it requires much less customization, much less domain knowledge, and it works with default settings most of the time.

And this technique, like to just get into a little bit more detail, it uses what’s called Time Series Decomposition, which is what many of the other techniques use as well, but uses specifically something called the GAM, the generalized additive model – that breaks down the overall time series from a trend perspective, from a seasonality perspective… and a lot of the time what actually stumps you is like…that can be a holiday like a Thanksgiving time or things like that, and then the load can be much higher.

So, these techniques can be quickly customized to input such domain knowledge as well. So, forecasting is something which has a lot of value that it can provide and, moving on… So, we saw outlier detection, we saw forecasting… then I’m going to get on to even more complex techniques. And one of the techniques which has been raising a lot of attention broadly in this entire, real-time streaming scenario, is anomaly detection, right?

The perfect, the ideal anomaly detection scenario is if an anomaly can be detected, and what is an anomaly? Some sort of an unexpected change, something that you’re not expecting and something that really should like, you know, get your attention. That would be a smart anomaly detection algorithm, and if we have one such algorithm, it can be used to like, you know, generate smart alerts, right?

Alerts that notify you about problems which you didn’t even expect like, you know, would have happened. So, again, you get a good sense and enough time to address the problem. But the challenge with these algorithms is people get quickly turned off if there are false positives, like you must have heard of “alert fatigue”, right?

So, false positives are bad, very bad. False negatives, where a real problem goes undetected, are also bad, and it often becomes like a tradeoff when tuning algorithms to meet these goals. Here’s one example to kind of give you a sense of the problem I’m talking about: We’ve talked about lag, I gave you an example where if the system starts to lag, it can’t keep up with the rate of data. It is not able to generate the results within a timeframe of when the data arrives. That’s bad, right?

Now, here’s a scenario where a particular consumer, a Spark Streaming consumer that’s actually processing Tweets, in the same kind of domain that I showed earlier, is starting to lag. Now, is this worth alerting? Is it worth waking up somebody at like, you know, 2:00 a.m.? That’s a challenge, right?

Now, sometimes lags are okay because maybe at 10:00 a.m. every day you get a burst of data and a lag might build for some time, but it’ll quickly catch up, right? You want algorithms that are smart enough to detect that, realize that, and then cut out the noise in these alerts that are coming to you. So, with anomaly detection that is the one thing that we have learned by working with it for a while –  there is no silver bullet. The traditional kind of techniques or the approach that is taken is, you actually take the data that you have, the monitoring data, you apply forecasting algorithms, reasonably good algorithms, see what is expected, and if the data that you’re getting, the metrics that you are getting don’t match that distribution like what I’ve shown here, then you can categorize that as an anomaly, right?

And all the different forecasting algorithms that I talked about from ARIMA onwards can be used for this. Where we have had a lot of traction and successes with anomaly detection algorithms in the streaming and Kafka operational context is with this algorithm which has been around for maybe twenty or thirty years now: STL. I think it was developed sometime in the 1990s, right?

It’s called “Seasonal and Trend Decomposition using the Loess Method”, and I’ll illustrate that using these four graphs. What it does, in a nutshell, is it takes a signal, the topmost flow that you see there is the actual time series that you are getting. It then extracts out seasonal patterns and trends and then what remains are the residuals, or the remainder, right?

If there are extreme points in the remainder, that’s a good indication there is an anomaly, but even this, without actually having a post-processing layer that can look at some other signals, does not work right off the bat. This combination, though, can actually lead to a scenario where you can get smart alerts with very few false positives and reasonably few false negatives.
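The decompose-then-inspect-the-remainder idea can be sketched as follows. This is a simplified classical decomposition (centered moving average for the trend, per-phase median for the seasonality), not STL itself, which uses Loess smoothing; the series, the period of 5, and the threshold are all made-up assumptions.

```python
from statistics import fmean, median, pstdev


def residual_anomalies(series, period, threshold=3.0):
    """Flag indices whose remainder, after removing trend and seasonality, is extreme."""
    half = period // 2  # assumes an odd period so the window is centered
    n = len(series)
    # Trend: centered moving average over one full period.
    trend = {i: fmean(series[i - half:i + half + 1]) for i in range(half, n - half)}
    detrended = {i: series[i] - t for i, t in trend.items()}
    # Seasonality: median detrended value at each position within the cycle.
    seasonal = [median(v for i, v in detrended.items() if i % period == phase)
                for phase in range(period)]
    residual = {i: v - seasonal[i % period] for i, v in detrended.items()}
    spread = pstdev(list(residual.values()))
    if spread == 0:
        return []
    return [i for i, r in residual.items() if abs(r) / spread > threshold]


# Hypothetical lag metric: upward trend + repeating pattern + one injected spike.
pattern = [0, 2, 4, 2, -8]
lag = [100 + 0.5 * i + pattern[i % 5] for i in range(30)]
lag[17] += 40
anomalies = residual_anomalies(lag, period=5)
print(anomalies)
```

The regular dips and peaks of the pattern are absorbed by the seasonal component, so only the injected spike stands out in the remainder. As the talk notes, a real deployment would still need a post-processing layer on top of this.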

Now, anomaly detection is an area that has seen a lot of research, especially from the deep learning community. These long short-term memory networks have been reasonably good at like, you know, forecasting and time series kind of pattern detection and characterizing what anomalies are. So, moving right along into the fourth category of like algorithms, correlation analysis, right?

So, so far we saw how you can detect problems through outliers or through anomalies, right? And the next biggest challenge that Ops teams face, well, I know that there is a problem, but can I actually get some hints, some guidance towards what the root cause of the problem is? And that’s where all of this correlation analysis really like, you know, comes in handy. Let me show… Like, you know, give you some examples, right, managing that problem.

I got an alert saying that a latency SLA is about to be missed. So something has changed. What would have caused that change? Is it something changed at the application level? Could it be that something changed in the resource allocation level? Because at the end of the day, all of these are multi-tenant environments, right? Could something have changed at the platform level itself, like something at a Spark or a Kafka level?

Could it be that something in the data changed, maybe rates changed, maybe distribution changed? It could be one of like, you know, anything, right? Let me give you a quick example again, a nice real-time example. So here’s like an anomaly. What that first plot shows, this is actually the latency, right? Suddenly latency has spiked up and we were able to detect that by basically using techniques that I talked about earlier, like doing a baselining: hey, this is unexpected.

It is an anomaly, right? But then what you also want when you get such an alert is, yes, there was an anomaly, but what changed? What could have caused that, right? If you look at what I’m showing here, based on analyzing a whole bunch of different like, you know, time series metrics and the reality is, as I said in all of these operational environments, time series metrics, are a dime a dozen.

Sometimes you have hundreds, not to mention sometimes like, you know, millions of time series metrics, right? So, how can you pinpoint, from all these different metrics, the potential metrics whose changes best explain this high-level change? And in this particular case, it was some contention on the YARN side, the multitenancy side, where launching the application master to process these workloads was where like, you know, most of the change happened that explains this anomaly, right?

Now as I said, algorithms for correlation analysis, it’s another like, you know, much more so than anomaly detection and the place where like there’s so many pitfalls. I can show you many examples where just having some trend in the data, two very random and totally unrelated time series can actually seem very correlated, right? So the trick to actually get good results here, two things.

One is you have to bring domain knowledge. You can’t simply throw millions of time series and then do correlation and then expect things to actually just work. You have to very carefully pick which metrics you are actually doing the correlation against. And then the key thing is how are you doing the correlation? And it all boils down to some metric that can identify similarity between time series. There are correlation coefficients like, you know, even the Pearson’s correlation coefficient, and there are much better correlation coefficients as well.
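The trend pitfall mentioned above can be shown with a small sketch. The two series below are synthetic and unrelated apart from both rising over time. Correlating the point-to-point changes instead of the raw values is one common check, added here as an assumption; it is not something the talk describes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def diffs(xs):
    """Point-to-point changes, which removes a steady trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


# Two synthetic metrics: each is a rising trend plus its own unrelated wiggle.
a = [i + (1 if i % 2 == 0 else -1) for i in range(17)]
b = [2 * i + (1 if i % 4 in (0, 1) else -1) for i in range(17)]

raw_corr = pearson(a, b)                # dominated by the shared upward trend
diff_corr = pearson(diffs(a), diffs(b)) # the wiggles themselves are unrelated
print(round(raw_corr, 3), round(diff_corr, 3))
```

The raw coefficient is close to 1 purely because both series grow, while the changes show no relationship at all.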

Now, the technique that we have had the most success with is using time series similarity metrics, and I’ll illustrate that and the kind of pitfalls that you have to watch out for in the scenario too. So, I’m showing two time series, right? Which like, you know, showcase the kind of problems that you have to deal with in operational context. You can’t expect all the time series to be nicely synchronized in time because at the end of the day, there’s some measurement algorithms that are running, that are collecting data.

Things could be literally a little bit like, you know, unsynchronized, based on how they’re coming to the system where the processing is happening, or there could be causality patterns in the data, right? The simplest way to compute similarity between two time series is just to use what’s called the Euclidean distance, right? Match each point in terms of time and take the distance between them.

But we have had a lot of success with another technique that’s called Dynamic Time Warping. What that does is it is like, you know, it’s less sensitive. It kind of looks around in the shapes of the time series and tries to find a match, and that way it is more robust in identifying time series that are correlated and weeding out those that are not correlated, compared to more traditional techniques.
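A short sketch of the difference between the two measures, using the textbook dynamic-programming form of Dynamic Time Warping. The two series are hypothetical: the same spike, observed one sample apart.

```python
from math import inf, sqrt


def euclidean_distance(a, b):
    """Point-by-point distance: compares samples at the same timestamps."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Dynamic Time Warping distance with absolute difference as the step cost."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical metrics with the same shape, shifted by one sample.
metric_a = [0, 0, 1, 5, 1, 0, 0, 0]
metric_b = [0, 0, 0, 1, 5, 1, 0, 0]

print(euclidean_distance(metric_a, metric_b))
print(dtw_distance(metric_a, metric_b))
```

The point-by-point distance penalizes the one-sample shift heavily, while the warped alignment matches the two shapes exactly. This basic form costs O(n·m) time and memory per pair, which is one more reason to pick the candidate metrics carefully first.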

So that brings me to the last and the most complex one I’m going to be talking about, right? So, we talked about like, you know, anomaly detection, outliers, forecasting, all like, you know, point techniques. But if you had a performance model of the entire system or some subcomponents, right? A performance model that can actually predict if I add like, ten more partitions, what will the performance be?

There’s so many things you can do, but at the same time, these models are pretty, pretty hard to generate and implement, right? But if you have these models, they can really help you get towards SLA management and cost efficiency.

Two examples of this, they’re great at answering what-if and optimization questions. For example, a question of the form, I have a latency SLA to meet, some shape or a certain throughput I need to handle. What is the best application configuration, platform configuration, resource configuration, and data configuration I need to have, right? That’s the ideal question that people want to answer, but if you start breaking that down to not so complex questions, but more operationally-important questions. My cluster right now, the outlier detection is saying that some nodes are….some brokers are running much harder than others.

Right? So which leaders, which partitions, which replicas should I move to make the system much more balanced? What will the impact of moving a replica actually be, right? Now, along that context from a resilience and a stability perspective, right? What if a broker dies, what will be the impact? Am I in threat of like, you know, data loss?

Or, my system seems to be bottlenecked now. If I add one more node, what’s going to happen? This is a problem where we have actually started to do some work; it’s still pretty ongoing. Now, where we have actually focused is on that lagging kind of scenario where once you can detect lags, can you guide the user, the operations person or the application owner, towards what the root cause is?

And in this case, like, if you do this one, the bottleneck is in the number of partitions that Kafka has and if we increase the number of partitions to twenty, then you’ll be able to meet the latency SLA, right? So, it goes into like, you know, a lot of modeling. And what the good news is, in these metrics-oriented, especially in a Kafka, HBase kind of world, training data is not as hard to generate.

I know like, you know, you had to be very focused on different kinds of use cases, but especially like, you know, the impact of adding more partitions and things like that, you can generate training data and constantly customize those models as you’re getting more observations on the field. Now, things are not all so rosy, and I’m going to actually mention the Kafka Cruise Control project.

The main developer of the Cruise Control project is right here in the audience. So, they have had the challenge, the Cruise Control project and similar things from Pinterest and from Microsoft, right? The challenge is, how do you dynamically balance your workload? How do you actually estimate what the impact will be of moving a leader or a replica from one node to another?

What will be the impact of adding one more broker, right? So, this is, I think it’s fair to say that it’s early work, but I’m seeing a lot of exciting results that they’ve published recently on how you can build a model that will predict the CPU utilization of a broker based on some of the key KPIs like the leader and the replica bytes-in and bytes-out, right?  The message ingestion rate as well as the rate of different types of requests in Kafka such as the produce and fetch request. And I’m sure there is more way to go here, but it’s very exciting to see broader work from the community like this. So that brings me to my very last slide. Right? So, essentially, what I promised to talk about is: how we can take on one hand all of these different challenging goals that DevOps teams have:

“I need to meet a latency SLA,” “I need to meet a throughput,” and then think about the operations side where it’s not just one application; Kafka has now become truly multi-tenant, with multiple applications running from multiple topics there, right? And think about all these goals from stability or from the Intel multi-tenancy perspective, not to mention the overall capacity, planning, and growth perspective.

How do you take all of these use cases and all of the different methods that are available, then carefully work through the different AI and ML algorithms to find a good match, and that way enable operations teams to actually sleep better? So, thanks, that is all I have. I would love to get more feedback. We’re doing a lot of this work, as you can imagine, at Unravel.

We have a booth here. Please come by. We love your feedback, and I’m sure some of you are probably thinking about these problems, so we would love to collaborate and exchange ideas. Thank you. We can take any questions.
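The broker CPU model mentioned in the Cruise Control part of the talk can be sketched as a plain linear regression. This is a stand-in for illustration only and is not Cruise Control's published model. The three features (leader bytes-in, replica bytes-in, bytes-out, all in MB/s), the helper names, and the sample numbers are assumptions.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_cpu_model(samples: Sequence[Sequence[float]], cpu: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations; the intercept is the last weight."""
    rows = [list(s) + [1.0] for s in samples]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, cpu)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_cpu(weights: Sequence[float], sample: Sequence[float]) -> float:
    """Predicted CPU utilization (%) for one broker's KPI sample."""
    return sum(w * x for w, x in zip(weights, list(sample) + [1.0]))


# Hypothetical observations: (leader bytes-in, replica bytes-in, bytes-out) in MB/s.
kpi_samples = [
    (10.0, 5.0, 20.0),
    (20.0, 10.0, 30.0),
    (30.0, 5.0, 60.0),
    (40.0, 20.0, 50.0),
    (15.0, 25.0, 10.0),
    (50.0, 10.0, 90.0),
]
cpu_percent = [13.5, 20.0, 29.5, 33.0, 15.5, 44.0]

weights = fit_cpu_model(kpi_samples, cpu_percent)
# What-if question: what would CPU look like under this hypothetical load?
predicted = predict_cpu(weights, (25.0, 10.0, 40.0))
print(f"predicted CPU: {predicted:.1f}%")
```

A model of this shape lets you estimate the effect of moving a leader or replica by shifting its bytes-in and bytes-out from one broker's KPI vector to another's and comparing the two predictions. A real deployment would also include the request-rate features named in the talk.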

– Question here.

– [Man] [Inaudible]

– Great question!

So, the question is, because these systems are constantly running, how can you incorporate the feedback that you’re actually getting in the field into the models, right? Now, I would say again, there is no one easy way; it sort of depends on the use case. For the outlier detection and forecasting kind of models, what we have done is create models based on our own testing in our lab clusters and whatnot.

We gave these models to different customers to try out and collected their feedback. The good news with many of these things is that everything the system captures is preserved, right? So, let’s say at 10:00 a.m. last week, it generated an anomaly alert, right? Now, you can go back, review it, and then check for at least some of these kinds of anomalies and alerts.

Was the system doing the right thing? Let’s look at the ground truth. Was it real or not, right? So that’s the manual part: we can actually look at it, gain some more observations from the field, and then improve the algorithm. Now, on the other end of the spectrum, what we have actually done is this: a lot of these root causes of problems can be automatically injected into the system.

So we have constantly running Kafka clusters. Internally in our system itself, we use Kafka, so what we do is constantly pump load, inject different kinds of problems all the time, and constantly evaluate how good the detection from our models is with respect to the ground truth that we injected.

So, you kind of have to do both. I wish there was a better technique. Now, some of the models, especially from the modeling perspective, are good at constantly retraining based on the observations you get. Right? So from a machine learning perspective, some models are better than others, but what I would suggest is that you constantly keep checking the goodness of your models by injecting different problems in a test kind of environment, to see what is happening.
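The fault-injection check described in this answer can be sketched as a precision and recall calculation against the injected ground truth. This is a minimal sketch under stated assumptions: faults are recorded as time windows, alerts as timestamps, and the function name and data are hypothetical rather than anything from Unravel's test harness.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (start, end) of an injected fault, in seconds, inclusive


def score_detector(injected: List[Window], alerts: List[int]) -> Tuple[float, float]:
    """Return (precision, recall) of alert timestamps against injected fault windows."""

    def in_any_window(t: int) -> bool:
        return any(start <= t <= end for start, end in injected)

    # Alerts that fall inside some injected fault window count as true alerts.
    true_alerts = sum(1 for t in alerts if in_any_window(t))
    # An injected fault counts as detected if at least one alert fired during it.
    detected = sum(
        1 for start, end in injected if any(start <= t <= end for t in alerts)
    )
    precision = true_alerts / len(alerts) if alerts else 0.0
    recall = detected / len(injected) if injected else 0.0
    return precision, recall


# Hypothetical test run: three injected faults, four alerts raised by the model.
injected_faults = [(100, 160), (400, 460), (900, 960)]
alert_times = [110, 130, 450, 700]

precision, recall = score_detector(injected_faults, alert_times)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Tracking these two numbers over repeated injection runs gives a simple signal for whether a retrained model is getting better or worse at detection.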

– The problem might be, chaos will be…

– Absolutely. Absolutely, yeah. Yes?

– How do you evaluate [inaudible]

– Do you want to repeat the question?

– Yes. The question was, again, in a distributed system errors can happen in different places, right? And in distributed systems, it’s a well-known problem that something fails and that propagates across different systems, so a different thing fails than the root cause, and it’s a challenge to find out what the root cause is, right?

That’s what we’re trying to solve by bringing all the data together and applying interesting algorithms on it. Now, coincidentally in the previous Strata, we gave a talk on exactly that problem. How do you gather error messages from logs and build a model that can guide you towards the root cause?

So, happy to talk more offline on it… Do we have time for any more questions?

– [Inaudible]

– That’s a very loaded question… So the question was, I talked about Kafka, a streaming application, and architecture, and then all of this SLA management, and things like that… Can we apply that in the financial domain?

Yes, we would like you to apply it in the financial domain, and not just at Unravel, but across the different companies that are supporting Kafka, there are a lot of companies in the financial domain that are using it. But I’m happy to, again, talk more offline on that question and can give you at least what expertise we have gained from working with Kafka. One more question out there. Do you have time for more questions?

Please, yeah.

– What is the answer to [inaudible]?

– It is… Yeah. We have actually built a time series store ourselves to store the data, because for us it’s not just about the metrics, it’s also about logs. There are so many different signals, not to mention SQL execution plans.

So, we have actually had to build a time series store ourselves. Yeah. Right off the bat, if you just wanted to try things out, Prometheus is actually something I see a lot of usage of. I’m sure that you can use InfluxDB too.

– Yeah.

– Yeah. Any more questions?

– [Man 2] Let’s thank our speakers.

– Thanks a lot.