

Using Machine Learning to understand Kafka runtime behavior


On May 13, 2019, Unravel Co-Founder and CTO Shivnath Babu joined Nate Snapp, Senior Reliability Engineer from Palo Alto Networks, to present a session on Using Machine Learning to Understand Kafka Runtime Behavior at Kafka Summit in London. You can review the session slides or read the transcript below. 


Nate Snapp:

All right, thanks. I’ll just give a quick introduction. My name is Nate Snapp, and I do big data infrastructure and engineering at companies such as Adobe, Palo Alto Networks, and Omniture. I have about 12 years of experience in streaming, even outside of the Kafka realm. I’ve worked on some of the big-scale efforts we’ve done for web analytics at Adobe and Omniture, working in the web analytics space for a lot of the big companies out there, about 9 out of 10 of the Fortune 500.

I’ve dealt with major release events, from new iPads to all these things that have to do with big increases in data, and streaming that in a timely fashion. I’ve done systems that have over 10,000 servers in that [00:01:00] proprietary stack. But the exciting thing is, the last couple of years, moving to Kafka and being able to apply some of those same principles in Kafka. I’ll explain some of those today. And then I like to blog as well.

Shivnath Babu:

Hello, everyone. I’m Shivnath Babu. I’m co-founder and CTO at Unravel. My day job looks like building a platform like what is shown on the slide, where we collect monitoring information from big data systems, systems like Kafka and Spark, and correlate the data and analyze it. Some of the techniques that we will talk about today are inspired by that work, to help operations teams as well as application developers better manage, easily troubleshoot, and do better capacity planning and operations for these systems.


Nate Snapp:

All right. So to start with some of the Kafka work that I have been doing: we typically have [00:02:00] several clusters of Kafka that we run with about 6 to 12 brokers, up to about 29 brokers on one of them. We run Kafka 5.2.1 and are upgrading the rest of the systems to that version. We have about 1,700 topics across all these clusters, and pretty varying rates of ingestion, topping out at about 20,000 messages a second, and we’ve actually gotten higher than that. The reason I explain this is that as we get into what kinds of systems and events we see, the complexity involved often has some interesting interplay with how you respond to these operational events.

So one aspect is that we have a lot of self-service involved in this. We have users that set up their own topics, they set up their own pipelines. And we allow them to do [00:03:00] that to make them most proficient and able to get up to speed easily. Because of that, we have a varying environment, and we have quite a big skew in how they bring data in. We have several other Kafka clusters that feed in, we have people that use the Java API, we have others that are using the REST API. Does anybody out here use the REST API for ingestion to Kafka? Not very many. That’s been a challenge for us.

On the egress side we also have a lot of custom endpoints, and a big one is HDFS and Hive. So as I get into this, I’ll be explaining some of the things we see that are really dynamically challenging, and why it’s not as simple as if-else statements for how you triage and respond to these events. And then Shiv will talk more about how you can get to a higher level of using actual ML to solve some of these challenging [00:04:00] issues.

So I’d like to start with a simple example. And in fact, this is what we often use when we first get somebody new to the system, or we’re interviewing somebody to work in this kind of environment. We start with a very simple example and take it from a high level. I like to do something like this on the whiteboard and say, “Okay, you’re working in an environment that has data constantly coming in, going through multiple stages of processing, and then eventually getting into a location where it will be reported on, where people are gonna dive into it.”

And when I do this, it’s to show that there are many choke points that can occur when this setup is made. So you look at this and you say, “That’s great,” you have the data flowing through here. But when you hit different challenges along the way, it can back things up in interesting ways. And often we talk about that as cascading failures: in latency-sensitive systems, you’re accruing [00:05:00] latency as things back up. And so what I’ll do is explain to them: what if we were to take, for instance, these three vats, “Okay, let’s take this last vat, this last bin,” and if that’s our third stage, let’s say there’s a choke point in the pipe there that’s not processing. What happens then, and how do you respond to that?

And often, a new person will say, “Well, okay, I’m going to go and dive in.” They’re gonna say, “Okay, what’s the problem with that particular choke point, and what can I do to alleviate it?” And I’ll say, “Okay, but what are you doing to make sure the rest of the system is processing okay? Is that choke point really just for one source of data? How do you make sure that it doesn’t start backfilling and cause other issues in the entire environment?” And when I do that, they’ll often come back with, “Oh, okay, I’ll make sure that I’ve got enough capacity at the stage before, so things don’t back up.”

And this actually has very practical implications. You take the simple model, and it applies to a variety of things [00:06:00] that happen in the real world. For every connection you have to a broker, and for every source that’s writing at the end, you actually have to account for two ports. It’s a basic thing, but as you scale up, it matters. I have a port with the data being written in, and I have a port that I’m managing for writing out. And as I have 5,000, 10,000 connections, I’m actually at 2x on the number of ports that I’m managing.

And what we found recently was that we were hitting the max ephemeral port range that Linux allows. And all of a sudden, okay, it doesn’t matter if you felt like you had capacity in, say, message storage, because we actually hit boundaries in different areas. So I think it’s important to step back a little at times, so that you’re thinking in terms [00:07:00] of: what other resources are we not accounting for that can back up? We find that there are very important considerations in understanding the business logic behind it as well.
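To make the port arithmetic concrete, here’s a back-of-the-envelope sketch in Python. The connection counts and the default Linux ephemeral range (32768–60999) are illustrative assumptions, not figures from the talk:

```python
# Rough sketch: two ports per client connection (one for data being
# written in, one managed for writing out), checked against the default
# Linux ephemeral port range. Connection counts are illustrative.

LINUX_EPHEMERAL_RANGE = 60999 - 32768 + 1  # default net.ipv4.ip_local_port_range

def ports_needed(client_connections: int, ports_per_connection: int = 2) -> int:
    return client_connections * ports_per_connection

for clients in (5_000, 10_000, 20_000):
    used = ports_needed(clients)
    print(clients, used, "over ephemeral range:", used > LINUX_EPHEMERAL_RANGE)
```

At 20,000 connections the doubled port count already exceeds the default range, which is the kind of boundary the talk describes hitting before message storage ever became the bottleneck.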

Data transparency and data governance actually can be important, knowing where that data is coming from. And as I’ll talk about later, with the ordering effects that you can have, it helps with being able to manage that at the application layer a lot better. So as I go through this, I want to highlight that it’s not so much a matter of having a static view of the problem, but really understanding streams for the dynamic nature that they have. You may have planned for a certain amount of capacity, and then when you have a spike and that goes up, knowing how the system will respond at that level and what actions to take, having the ability to view that, becomes really important.

So I’ll go through a couple of [00:08:00] data challenges here, practical challenges that I’ve seen. And as I go through that, then when we get into the ML part, the fun part, you’ll see what kinds of algorithms we can use to better understand the varying signals that we have. So the first challenge is variance in flow. And we actually had this with a customer of ours. They would come to us saying, “Hey, everything looked good as far as the number of messages received.” But when they went to look at basically another topic for that, they were saying, “Well, some of the information about these visits actually looks wrong.”

And so they actually showed me a chart that looked something like this. You can look at this as a five-business-day window. And you look at that and you say, “Yeah, I mean, I could see it. There’s a drop after Monday, the first day drops a little low.” That may be okay. Again, this is where you have to know what the business is trying to do with this data. As an operator like myself, I can’t always make those assessments really well. [00:09:00] Going to the third peak there, you see that go down and have a dip. They actually pointed out something like that, although I guess it was a much smaller one, and they’d circled this. And they said, “This is a huge problem for us.”

And I came to find out it had to do with sales of a product and being able to target the right experience for that user. So looking at that, and knowing what those anomalies are and what they pertain to, having that visibility, again going to the data transparency, can help you make the right decisions. And in this case with Kafka, what we do find is that at times, things like that can happen because we have, in this case, a bunch of different partitions.

One of those partitions backed up. And now most of the data is processing, but some isn’t, and it happens to be bogged down. So what decisions can we make to account for that and to be able to say, “Okay, this partition backed up. Why is that partition backed up?” And the kind of ML algorithms you choose are the ones that help you [00:10:00] answer those five whys. You talk about, “Okay, it backed up. It backed up because…” and you keep going backwards until you get to the root cause.

Challenge number two. We’ve experienced negative effects from really slow data. Slow data can happen for a variety of reasons. Two that we have principally found are users that have staging and temporary data because they’re gonna be testing another pipeline before rolling it into production, but there are also some streams in production with slow data, where that’s more or less how the stream is supposed to behave.

So think in terms, again, of what the business is trying to run on the stream. In our case, we would have some data that we wouldn’t want to see frequently, like cancellations of users for our product. We hope that that stays really low. We hope that you have very few bumps in the road with that. But what we find with Kafka is that you have to have a very good [00:11:00] understanding of your offset expiration and your retention periods. We find that if you have those offsets expire too soon, then you have to go into guesswork: “Am I going to be aggressive,” and try to look back really far and reread that data, in which case you may double-count? Or am I gonna be really conservative, in which case you may miss critical events, especially if you’re looking at cancellations, something that you need to understand about your users.

And so we found that to be a challenge. Keeping with that and going into this third idea is event sourcing. And this is something that I’ve heard argued either way: should Kafka be used for event sourcing, and what does that look like? We have different events that we want to represent in the system. Do I create a topic per event, and then have a whole bunch of topics? Or do I have a single topic, because we provide more guarantees on the partitions in that topic?

And it’s arguable, depending on what you’re trying to accomplish, which is the right method. But what we’ve seen [00:12:00] with users, because we have a self-service environment, is that we wanna give the transparency back to them on what’s going on when you set it up in a certain way. So if they say, “Okay, I wanna have,” for instance for our products, “a purchase topic representing the times that things were purchased, and then a cancellation topic representing the times that they decided to cancel the product,” what we can see are some ordering issues here. And I’ve represented those matching with the colors up there. So a brown purchase followed by a cancellation on that same user. You can see the light purple there, but you can see one where the cancellation clearly comes before the purchase.
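A minimal sketch of the kind of check you could expose back to users, flagging accounts whose cancellation timestamp precedes their purchase; the event shape and field names here are made up for illustration:

```python
# Flag users whose cancellation event carries an earlier timestamp than
# their purchase, as can happen when the two event types live in separate
# topics with no cross-topic ordering guarantee.

def out_of_order_users(purchases, cancellations):
    """purchases/cancellations: dicts of user_id -> event timestamp."""
    flagged = []
    for user, cancel_ts in cancellations.items():
        purchase_ts = purchases.get(user)
        if purchase_ts is not None and cancel_ts < purchase_ts:
            flagged.append(user)
    return flagged

purchases = {"u1": 100, "u2": 205, "u3": 310}
cancels = {"u1": 150, "u2": 190}  # u2's cancel precedes its purchase
print(out_of_order_users(purchases, cancels))  # expect ['u2']
```

Surfacing a report like this to the topic owners is exactly the kind of transparency the self-service environment calls for.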

And so that is confusing when you do the processing: “Okay, now they’re out of order.” So having the ability to expose that back to the user and say, “Here’s the data you’re looking at and why the data is confusing. What can you do to respond to that?” and actually pushing that back to the users is critical. So I’ll just cover a couple of surprises that [00:13:00] we’ve had along the way, and then I’ll turn the time over to Shiv. But does anybody here use Schema Registry? Any Schema Registry users? Okay, we’ve got a few. Anybody ever have users change the schema unexpectedly and still cause issues, even though you’ve used Schema Registry? Is it just me? Had a few, okay.

We found that users have this idea that they wanna have a solid schema, except when they wanna change it. And so we’ve coined this term flexible-rigid schema. They want a flexible-rigid schema. They want the best of both worlds. But what they have to understand is, “Okay, you introduced a new Boolean value, but your events don’t have it. I can’t pick one. I can’t default to one. I’ll make the wrong decision, I guarantee you. It’s a 50/50 chance.” And so we have this idea of: can we expose back to them what they’re seeing and when those changes occur? They don’t always have control over changing the events at the source, and so they may not even be in control of their destiny of having a [00:14:00] schema changed throughout the system.

Timeouts, leader affinity. I’m gonna skip over some of these or not spend a lot of time on them. But timeouts is a big one. As we write to HDFS, we see that the backups that occur from the Hive metastore when we’re writing with the HDFS sink can cause everything to go into a rebalanced state, which is really expensive, and what was a single broker having an issue now becomes a cascading issue. Again, with leader affinity, there are poor assignments; there’s randomness in where those get assigned. We’d like to have context of the system involved. We wanna be able to understand the state of the system when those choices are being made, and whether we can affect that. Shiv will cover how that kind of visibility is helpful with some of the algorithms that can be used. And then basically, those things all just lead to: why do we need better visibility into our data problems with Kafka? [00:15:00] Thanks.


Shivnath Babu:

Thanks, Nate. So what Nate just talked about, and I’m sure most of you at this conference, the reason you’re here, is that streaming applications and Kafka-based architectures are becoming very, very popular, very common, and driving mission-critical applications across manufacturing, across ecommerce, many, many industries. So for this talk, I’m just going to approximate a streaming architecture as something like this. There’s a stream store, something like Kafka or Pulsar that is being used, and maybe state is actually kept in a NoSQL system like HBase or Cassandra. And then there’s the computational element. This could be in Kafka itself with Kafka Streams, but it could be Spark Streaming or Flink as well.

When things are great, everything is fine. But when things start to break, maybe as Nate mentioned, you’re not getting your results on time. Things are actually congested, [00:16:00] backlogged; all of these things can happen. And unfortunately, in any architecture like this, there are many different systems involved, so problems could happen in many different places. It could be an application problem, maybe the structure, the schema, how things were set up, like Nate’s example of a join across streams. There might be problems there, but there might also be problems at the Kafka level, maybe the partitioning, maybe brokers, maybe contention. Or things at the Spark level, resource-level problems, or configuration problems. All of these become very hard to troubleshoot.

And I’m sure many of you have run into problems like this. How many of you here can relate to the problems I’m showing on the slide here? Quite a few hands. As you’re running things in production, these challenges happen. And unfortunately, given that these are all individual systems, often there is no single correlated view that connects the streaming [00:17:00] computational side with the storage side, or maybe with the NoSQL side. And that poses a lot of issues. One of the crucial ones is that when a problem happens, it takes a lot of time to resolve.

Wouldn’t it be great if there were some tool out there, some solution, that can give you visibility, as well as do things along the following lines and empower the DevOps teams? First and foremost, there are metrics all over the place: metrics, logs, you name it, from all these different levels, especially the platform, the application, and all the interactions that happen in between. Metrics along these lines can be brought into one single platform to at least have a good, nicely correlated view. So again, you can go from the app to the platform, or vice versa, depending on the problem.

And what we will try to cover today is, with all these advances that are happening in machine learning, how applying machine learning to some of this data can help you find problems quicker, [00:18:00] as well as, in some cases, using artificial intelligence, the ability to take automated actions to either prevent the problems from happening in the first place or, at least if those problems happen, to be able to quickly recover and fix them.

And what we will really try to cover in this talk is basically what I’m showing on the slide. There are tons and tons of interesting machine learning algorithms. At the same time, you have all of these different problems that happen with Kafka streaming architectures. How do you connect both worlds? How can you bring, based on the goal that you’re actually trying to solve, the right algorithm to bear on the problem? And the way we’ll structure that is, again, DevOps teams have lots of different goals. Broadly, there’s the app side of the fence, and there’s the operations side of the fence.

As a person who owns a Kafka streaming application, you might have goals related to latency: I need this kind of latency, or this much throughput, or maybe a combination of these along with, “Hey, I can only [00:19:00] tolerate this much data loss.” And all the talks that have happened in this room have covered the different aspects of this and how replicas and parameters help you get all of these goals.

On the operations side, maybe your goals are around ensuring that the cluster is reliable, that the loss of a particular rack doesn’t really cause data loss and things like that. Or maybe they are related, on the cloud, to ensuring that you’re getting the right price-performance and all of those things. And on the other side are all of these interesting advances that have happened in ML. There are algorithms for detecting outliers, anomalies, or actually doing correlation. So the real focus is going to be: let’s take some of these goals, let’s work with the different algorithms, and see how you can get these things to meet. And along the way, we’ll try to describe some of our own experiences, what worked and what didn’t work, as well as other things that are worth exploring.

[00:20:00] So let’s start with the very first one, outlier detection algorithms. And why would you care? Here are two very, very simple problems. If you remember from earlier in the talk, Nate talked about one of the critical challenges they had, where the problem was exactly this: hey, there could be some imbalance among my brokers. There could be some imbalance in a topic among the partitions, some partition really getting backed up. Very, very common problems that happen. Instead of manually looking at graphs and trying to stare and figure things out, how can you very quickly apply some automation to it?

Here’s a quick screenshot that will lead us through the problem. If you look at the graph that I have highlighted at the bottom, this is a small cluster with three brokers, call them one, two, and three. And one of the brokers is actually having much lower throughput. It could be the other way: one of them having a really high throughput. Can we detect this automatically, instead of having to figure it out after the fact? [00:21:00] And this is an area where there are tons of algorithms for outlier detection, from statistics and now, more recently, machine learning.

And these algorithms differ based on: do they deal with one parameter at a time or multiple parameters at a time, univariate versus multivariate; are they algorithms that can take temporal effects into account, or algorithms that look more at a snapshot in time? A very, very simple technique, which actually works surprisingly well, is the z-score, where the main idea is to take…let’s say you have different brokers, or you have a hundred partitions in a topic. You can take any metric, it might be the bytes-in metric, and fit a Gaussian distribution to the data.

And anything that is a few standard deviations away is an outlier. A very, very simple technique. The problem with this technique is that it does assume a Gaussian distribution, which sometimes may not be the case. In the case of brokers and partitions, that is usually a safe assumption to make. But if that technique doesn’t work, there are other techniques. One of the more popular ones that we have had some success with, especially when you’re looking at multiple time series at the same time, is DBSCAN. It’s basically a density-based clustering technique. I’m not going into all the details, but the key idea is that it uses some notion of distance to group points into clusters, and anything that doesn’t fall into a cluster is an outlier.
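A minimal sketch of the z-score idea, assuming a bytes-in-style metric per broker and a two-standard-deviation threshold (both assumptions for illustration):

```python
# Fit a Gaussian to one metric across brokers (mean and standard
# deviation) and flag anything more than k standard deviations away.
from statistics import mean, pstdev

def zscore_outliers(values, k=2.0):
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > k]

# Eight brokers; the last one has much lower throughput than its peers.
bytes_in = [1000, 980, 1020, 990, 1010, 995, 1005, 300]
print(zscore_outliers(bytes_in))
```

The same function works per-partition; the point is that the check runs continuously instead of someone staring at a throughput graph.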

Then there are tons of other very interesting techniques, using things like binary trees to find outliers, called isolation forests. And in the deep learning world, there is a lot of interesting work happening with autoencoders, which try to learn a representation of the data. Once you’ve learned the representation from all the training data that is available, things that don’t fit the representation are outliers.

So this is the template I’m going to [00:23:00] follow in the rest of the talk: pick a technique, give you some example use cases where having a good technique helps in that Kafka DevOps world, and then tell you about some of the techniques. Next up is forecasting. For forecasting, there are two places where having a good technique makes a lot of difference. One is avoiding reactive firefighting. When you have something like a latency SLA, if you can get a sense that things are backing up and there’s a chance that in a certain amount of time the SLA will get missed, then you can actually take action quickly.

It could even be bringing on more brokers and things like that, or basically doing some partition reassignment. But getting a heads-up like this is often very, very useful. The other is more long-term forecasting, for things like capacity planning. So I’m gonna use a real-life example here, one that one of our telecom customers actually worked with, [00:24:00] a sentiment analysis, sentiment extraction use case based on tweets. The overall architecture consisted of tweets coming in in real time into Kafka, and then the computation was happening in Spark Streaming, with some state actually being stored in a database.

So here’s a quick screenshot of how things actually play out there. In sentiment analysis, especially for customer service and customer support use cases, there’s some kind of SLA. In this case, their SLA was around three minutes. What that means is, by the time you learn about a particular incident, which you can think of as these tweets coming into Kafka, within a three-minute interval the data has to be processed. What you’re seeing on screen here: all those green bars represent the rate at which data is coming in, and the black line in between is the time series indicating the actual delay, [00:25:00] the end-to-end processing delay between data arrival and data being processed. And there is an SLA of three minutes.

So if you can see the line, it’s trending up. And if a good forecasting technique were applied to this line, you could actually forecast and say: within a certain interval of time, maybe a few hours, maybe even less than that, at the rate at which this latency is trending up, my SLA can get missed. It’s a great use case for having a good forecasting technique. So again, forecasting: another area that has been very well studied, with tons of interesting techniques. On the more statistical side of time-series forecasting, there are techniques like ARIMA, which stands for autoregressive integrated moving average, and lots of variants around it, which use the trend in the data, differences between data elements, and patterns to forecast, with smoothing and taking historic data into account, and all of that good stuff.
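As a toy illustration of that heads-up idea: fit a straight line to recent delay samples and extrapolate when the three-minute (180 s) SLA would be crossed. A real deployment would use something like ARIMA or Prophet; the sample values here are invented:

```python
# Least-squares line fit over per-minute delay samples, then extrapolate
# to estimate minutes until the SLA threshold is reached.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx  # slope, intercept

def minutes_until_breach(delays_s, sla_s=180.0):
    """delays_s: one end-to-end delay sample per minute."""
    xs = list(range(len(delays_s)))
    slope, intercept = fit_line(xs, delays_s)
    if slope <= 0:
        return None  # flat or improving: no breach predicted
    t = (sla_s - intercept) / slope
    return max(0.0, t - (len(delays_s) - 1))

# Delay climbing ~10 s per minute: breach predicted ~7 minutes from now.
print(minutes_until_breach([60, 70, 80, 90, 100, 110]))
```

Even this crude linear extrapolation turns “the line looks like it’s trending up” into an actionable number of minutes of lead time.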

On the other [00:26:00] extreme, there have been, like I said, a lot of recent advances in using neural networks, because time-series data is one thing that is very, very easy to get. So there are long short-term memory networks, LSTMs, and recurrent neural networks, which have been pretty good at this. But we have actually had a lot of success with a technique that Facebook released as open source, called the Prophet algorithm, which is not very different from ARIMA and the older family of forecasting techniques. It differs in some subtle ways.

The main idea here is what is called a generalized additive model. I’ve put in a very simple example here. The idea is to model the time series, whichever time series you are picking, as a combination of: the trend in the time-series data, so extract out the trend; the seasonality, maybe there’s yearly seasonality, maybe monthly, weekly, [00:27:00] maybe even daily seasonality; and, this is a key thing, what I’ve called shocks. If you’re thinking about forecasting in an ecommerce setting, Black Friday or the Christmas days are times when the time series will have a very different behavior.

So in Kafka, or in the operational context, if you are rebooting, if you are installing a new patch or upgrading, these often end up shifting the patterns of the time series, and they have to be explicitly modeled. Otherwise, the forecasting can go wrong, and the rest is just error. The reason Prophet has actually worked really well for us, apart from the ones I mentioned, it fits quickly and all of that good stuff, is that it is very, I would say, customizable. Your domain knowledge about the problem can be incorporated, instead of it being something that just gives you a result, where all that remains is parameter tuning.

So Prophet is something I’ll definitely ask all of you [00:28:00] to take a look at. The defaults work relatively well. But forecasting is something where we have seen that you can’t just trust the machine alone. It needs some guidance, it needs some data science, it needs some domain knowledge to be put alongside the machine learning to actually get good forecasts. So that’s forecasting. We saw how outlier detection applies, we saw forecasting. Now let’s get into something even more interesting: detecting anomalies.

So, what is an anomaly? You can think of an anomaly as an unexpected change; if you were expecting it, it’s not an anomaly. It’s something unexpected that needs your attention. That’s what I’m gonna characterize as an anomaly. Where can detecting it help? Smart alerts: alerts where you don’t have to configure thresholds and all of those things, and worry about your workload changing or new upgrades happening and all of that stuff. Wouldn’t it be great if these anomalies could [00:29:00] be auto-detected? But that’s also very challenging, by no means trivial, because if you’re not careful, then your smart alerts will turn out to be really dumb. You might get a lot of false alerts, and that way you lose confidence, or they might miss real problems that are happening.

So anomaly detection is something that is pretty hard to get right in practice. And here’s one simple but very illustrative example. With Kafka, you always see lags. So what I’m plotting here is increasing lag. Is that really an expected one? Maybe there are bursts in data arrival, and maybe these lags build up at some point of time every day; maybe that’s okay. When is it an anomaly that I really need to take a look at? That’s when these anomaly detection techniques become very important.

Many [00:30:00] different schools of thought exist on how to build a good anomaly detection technique, including the ones I talked about earlier with outlier detection. One approach that has worked well for us in the past is to really model anomalies as: I’m forecasting something based on what I know, and if what I see differs from the forecast, then that’s an anomaly. So you can pick your favorite forecasting technique, or the one that actually worked, ARIMA, or Prophet, or whatnot, use that technique to do the forecasting, and then deviations become interesting and anomalous.

What I’m showing here is a simple example of that technique with Prophet. A more common one that we have seen actually work relatively well is this thing called STL. It stands for Seasonal and Trend decomposition using Loess, a smoothing function. So you have the time series; extract out and separate out the trend from it first, which leaves [00:31:00] the time series without the trend; then extract out all the seasonalities. Once you have done that, whatever remainder or residual is left, even if you just put a threshold on that, it’s reasonably good. I wouldn’t say perfect, but reasonably good at extracting out these anomalies.

Next one: correlation analysis, getting even more complex. So once you have detected an anomaly, or you have a problem that you want to root-cause: why did it happen? What do you need to do to fix it? Here’s a great example. I saw an anomaly, something shot up, maybe it’s the lag that is actually building up; it definitely looks anomalous. And now what you really want is…okay, maybe we’re at a point where the baseline is much higher. But can I root-cause it? Can I pinpoint what is causing that? Is it just the fact that there was a burst of data, or something else, maybe resource allocation issues, maybe some hotspotting in the [00:32:00] brokers?

And here you can start to apply time-series correlation: which lower-level time series correlates best with the higher-level time series where your application latency increased? The challenge is that there are hundreds, if not thousands, of time series if you look at Kafka. Every broker has so many kinds of time series it can give you, from every level, and all of these quickly add up. So it’s a pretty hard problem. If you just throw time-series correlation techniques at it, even time series that merely have some trend in them will look correlated. So you have to be very careful.

The things to keep in mind are things like picking a good similarity function across time-series. For example, you could use something like Euclidean distance, which is a straightforward, well-defined, well-understood distance function between points or between time-series. But we have had a lot of success with something called dynamic time warping, which is very good at dealing with time-series that might be slightly out of [00:33:00] sync. If you remember the things that Nate mentioned, the Kafka world and streaming world is an asynchronous, event-driven world; you don't see all the time-series nicely synchronized. So time warping is a good technique to extract distances in such a context.
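Dynamic time warping itself can be sketched with the classic dynamic-programming recurrence (production code would use an optimized library; the series below are synthetic):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Unlike Euclidean distance, DTW lets one series stretch or shift in
    time to line up with the other, which suits metrics collected from
    loosely synchronized brokers and consumers.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, or match from the previous cell.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

t = np.linspace(0, 2 * np.pi, 50)
latency = np.sin(t)
lagged = np.sin(t - 0.5)          # same signal, slightly shifted in time
noise = np.random.default_rng(1).normal(0, 1, 50)

# The shifted copy is far closer under DTW than an unrelated series.
print(dtw_distance(latency, lagged), dtw_distance(latency, noise))
```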

But the other part, just like you saw with Prophet, is that instead of just throwing some machine learning technique at the problem and praying that it works, you have to really try to understand the problem. And the way in which we have tried to break it down into something usable is, for a lot of these time-series, you can split the space into time-series that are related to application performance and time-series that are related to resources and contention, and then apply correlation within those buckets. So try to scope the problem.
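One way to picture that scoping, with entirely hypothetical metric names and synthetic data: group the metrics into buckets up front, then only search for correlations inside the relevant bucket instead of across every metric the cluster emits.

```python
import numpy as np

# Hypothetical grouping of metrics by what they describe.
BUCKETS = {
    "application": ("consumer_lag", "end_to_end_latency"),
    "resource": ("broker_cpu", "disk_io_wait"),
}

def best_match_in_bucket(symptom, bucket, metrics):
    """Return the metric in `bucket` most correlated with `symptom`."""
    target = np.asarray(metrics[symptom], dtype=float)
    best, best_r = None, 0.0
    for name in BUCKETS[bucket]:
        if name == symptom:
            continue
        r = float(np.corrcoef(target, metrics[name])[0, 1])
        if abs(r) >= abs(best_r):
            best, best_r = name, r
    return best, best_r

# Synthetic data where broker CPU drives the consumer lag.
rng = np.random.default_rng(2)
cpu = rng.normal(50, 5, 100)
metrics = {
    "consumer_lag": cpu * 3 + rng.normal(0, 5, 100),
    "end_to_end_latency": cpu * 0.2 + rng.normal(0, 1, 100),
    "broker_cpu": cpu,
    "disk_io_wait": rng.normal(10, 2, 100),
}
# Searching only the "resource" bucket pinpoints broker_cpu as the driver.
print(best_match_in_bucket("consumer_lag", "resource", metrics))
```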

Last technique, model learning. And this turns out to be the hardest. But if you have a good modeling technique, a good model that can answer what-if questions, then for things like what we saw in the previous talk with [00:34:00] all of those reassignments, you can actually find out what's a good reassignment and what impact it will have. Or, as Nate mentioned, with these rebalance-type issues, consumer timeouts and rebalancing storms can actually kick in: "What's a good timeout?"

So in a lot of these places where you have to pick some threshold or some strategy, having a model that can quickly answer what-if questions, tell you which choice is better, or even rank the choices, can be very, very useful and powerful. And this is the thing that is needed for enabling action automatically. Here's a quick example. There's a lag. We can figure out where the problem is actually happening. But then wouldn't it be great if something suggested how to fix it? Increase the number of partitions from X to Y, and that will fix the problem.
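A toy version of such a what-if model, with every number made up for illustration: fit consumer throughput against partition count from past runs, then invert the fit to answer "how many partitions would keep up with a given rate?" A real model would need carefully chosen input features and training data.

```python
import numpy as np

# Hypothetical benchmark results: throughput (MB/s) at each partition count.
partitions = np.array([4, 8, 12, 16, 24, 32])
throughput = np.array([41, 79, 118, 150, 205, 240])

# Throughput scales roughly linearly with partitions in this regime;
# fit a line and invert it to answer the what-if question.
slope, intercept = np.polyfit(partitions, throughput, 1)

def partitions_needed(target_mb_per_s):
    """Smallest partition count the fitted model predicts will keep up."""
    return int(np.ceil((target_mb_per_s - intercept) / slope))

# "If traffic grows to 180 MB/s, how many partitions do we need?"
print(partitions_needed(180))
```

The same shape of model (fit on observed runs, invert for a recommendation) applies to timeouts, broker counts, and other knobs, as long as the training data covers the region you're asking about.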

A model, at the end of the day, is a function that you're fitting to the data. And in modeling, I don't have time to actually go into this in great detail, but you have to carefully pick the right [00:35:00] input features. And what is very, very important is to ensure that you have the right training data. For some problems, just collecting data from the production cluster gives good training data. Sometimes it does not, because you have only seen certain regions of the space. So with that, I'm actually going to wrap up.

So what we tried to do in this talk is really give you a sense, as much as we can in a 40-minute talk, of all of these interesting Kafka DevOps and streaming application challenges, and how to map them to something where you can use machine learning, or some elements of AI, to make your life easier, or at least guide you so that you're not wasting a lot of time looking for a needle in a haystack. And with that, I'm going to wrap up. There is this area called AIOps, which is very interesting, trying to bring AI and ML together with data to solve these DevOps challenges. We have a booth here. Please drop by to see some of the techniques we have. And yes, if you're interested in working on these interesting challenges, streaming, building [00:36:00] these applications, or applying techniques like this, we are hiring. Thank you.

Okay. Thanks, guys. So we actually have about three or four minutes for questions, a mini Q&A.

Attendee 1:

I know nothing about machine learning, but I can see it really helping me with debugging problems. As a noob, how can I get started?


Shivnath Babu:

I can take that question. So the whole point of this talk gets at that question, which is: on one hand, if you go to a Spark Summit, you'll see a lot of machine learning algorithms, and this and that, right? If you come to a Kafka Summit, then it's all about Kafka and DevOps challenges. How do these worlds meet? That's exactly what we tried to cover in the talk. If you were looking at a lot of the techniques, they're not fancy machine learning techniques. Our experience has been that once you understand the use case, there are fairly good techniques, even from plain statistics, that can solve the problem reasonably well.

And once you have that first bit of experience, then look for better and better techniques. So hopefully this talk gives you a sense of the techniques to just get started with. As you get a better understanding of the problem and the [00:38:00] data, you can actually improve and apply more of those deep learning techniques and whatnot. But most of the time, you don't need that for these problems. Nate?


Nate Snapp:

No. Absolutely, same thing.

Attendee 2:

Hi, Shiv, nice talk. Thank you. A quick question: for those machine learning algorithms, can they be applied across domains? Or if you are moving from DevOps for Kafka to Spark Streaming, for example, do you have to hand-pick those algorithms and tune the parameters again?


Shivnath Babu:

So the question is, how much do these algorithms carry over? Does something we found for Kafka apply to Spark Streaming? Would it apply to Impala, or Redshift for that matter? Again, the hard truth is that no one size fits all. I mentioned outlier detection; that might be a technique that can be applied to any [00:39:00] load imbalance problem. But once you start getting into anomalies or correlation, some amount of expert knowledge about the system has to be combined with the machine learning technique to really get good results.

So again, it’s not as if it has to be all export rules, but some combination. So if you pick a financial space, a data scientist who exists, understands the domain, and knows how to work with data. Just like that even in the DevOps world, the biggest success will come if somebody understand receiver system, and has the knack of working with these algorithms reasonably well, and they can combine both of that. So something like a performance data scientist.