Over the coming weeks we'll be providing practical technical insights and advice on typical problems we encounter in working with complex data pipelines. In this blog we'll talk about multi-tenant cluster contention issues. Check back in for future topics.
Modern data clusters are becoming increasingly commonplace and essential to your business. Typically a wide variety of workloads run on a single cluster, which can make it a nightmare to manage and operate, much like directing traffic in a busy city. We feel the pain of the operations folks out there who have to manage Spark, Hive, Impala, and Kafka apps running on the same cluster: they have to worry about each app's resource requirements, the time distribution of the cluster workloads, and the priority levels of each app or user, and then make sure everything runs like a predictable, well-oiled machine.
We at Unravel have a unique perspective on this kind of thorny problem, since we spend countless hours, day in and day out, studying the behavior of giant production clusters to discover how to improve their performance, predictability, and stability. Whether it is a thousand-node Hadoop cluster running batch jobs, a five-hundred-node Spark cluster running AI, ML, or some other type of advanced real-time analytics, or, more likely, 1,000 nodes of Hadoop connected via a 50-node Kafka cluster to a 500-node Spark cluster for processing: what could possibly go wrong?
As you are most likely painfully well aware, plenty! Some of the most common problems that we see in the field: apps contending for the same memory and CPU, oversized containers wasting capacity, poorly sized queues, and rogue users or applications hogging resources from everyone else.
So how do you go about solving each of these issues?
Measure and Analyze
To understand which of the above issues is plaguing your cluster, you must first understand what's happening under the hood. Modern data clusters have a number of precious resources that operations teams must keep a constant eye on, including memory, CPU, and NameNode capacity. When monitoring these resources, make sure to measure both the total available and the amount consumed at any given time.
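As a starting point, most of these numbers are already exposed by your resource manager. Here is a minimal sketch of polling total versus consumed resources from the YARN ResourceManager REST API; the hostname is a placeholder for your own cluster (NameNode health would similarly come from the NameNode's own JMX/metrics endpoint):

```python
# Sketch: poll cluster-wide resource totals from the YARN ResourceManager.
# The RM host below is hypothetical -- point it at your own cluster.
import requests

RM_URL = "http://resourcemanager.example.com:8088"

metrics = requests.get(f"{RM_URL}/ws/v1/cluster/metrics").json()["clusterMetrics"]

print(f"Memory: {metrics['allocatedMB']} / {metrics['totalMB']} MB allocated")
print(f"vCores: {metrics['allocatedVirtualCores']} / "
      f"{metrics['totalVirtualCores']} allocated")
```

Polling this endpoint on a schedule and storing the results gives you the time-series charts the next step builds on.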
Next, break down these resource charts by user, app, department, and project to truly understand who is contributing how much to the total usage. This kind of analytical exploration can quickly reveal who your top consumers are and where contention is coming from, as the sketch below illustrates.
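One hedged way to get that breakdown, again using the YARN REST API: each finished application reports aggregate `memorySeconds` and `vcoreSeconds`, which can be summed per user (or per queue, or any other tenant dimension the API exposes):

```python
# Sketch: aggregate per-user memory consumption from the YARN apps API.
# The RM host is hypothetical; memorySeconds is reported in MB-seconds.
from collections import defaultdict
import requests

RM_URL = "http://resourcemanager.example.com:8088"

apps = requests.get(f"{RM_URL}/ws/v1/cluster/apps").json()["apps"]["app"]

usage_by_user = defaultdict(float)
for app in apps:
    usage_by_user[app["user"]] += app.get("memorySeconds", 0)

# Convert MB-seconds to GB-hours and list the heaviest users first.
for user, mem_secs in sorted(usage_by_user.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {mem_secs / 1024 / 3600:.1f} GB-hours")
```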
Make apps better multi-tenant citizens
Configuration settings at the cluster and app level dictate how much of the system's resources each app gets. For example, if containers are set to 8GB at the cluster level, then each app will get 8GB containers whether it needs them or not. Now imagine that most of your apps only need 4GB containers: your system would show it's at max capacity when it could actually be running twice as many apps.
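The arithmetic is simple but worth making explicit. A toy example (the numbers are illustrative, not from a real cluster):

```python
# Sketch: how container sizing caps parallelism. All numbers illustrative.
cluster_memory_gb = 1024        # total memory across the cluster
container_gb_configured = 8     # e.g. a cluster-level 8 GB container setting
container_gb_needed = 4         # what most apps actually require

max_configured = cluster_memory_gb // container_gb_configured   # 128 containers
max_rightsized = cluster_memory_gb // container_gb_needed       # 256 containers

print(f"Configured: {max_configured} containers; "
      f"right-sized: {max_rightsized} containers "
      f"({max_rightsized / max_configured:.0f}x more)")
```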
In addition to inefficient memory sizing, big data apps can be bad multi-tenant citizens due to other poor configuration settings (CPU, number of containers, heap size, and so on), inefficient code, and bad data layout.
Therefore it's important to measure and understand each of these resource-hogging factors for every app on the system, and to make sure apps are actually using, not abusing, their resources.
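A hedged sketch of what that check can look like: compare each app's allocation against its peak actual usage and flag the heavy over-allocators. The records below are hypothetical; in practice the allocation would come from the YARN REST API and the peak usage from your metrics system (Spark metrics, JMX, etc.):

```python
# Sketch: flag apps holding far more memory than they actually use.
# Input records are hypothetical stand-ins for real metrics.
apps = [
    {"id": "app_001", "user": "etl", "allocated_mb": 8192, "peak_used_mb": 3500},
    {"id": "app_002", "user": "bi",  "allocated_mb": 4096, "peak_used_mb": 3900},
]

WASTE_THRESHOLD = 0.5  # flag apps using less than half of what they hold

for app in apps:
    utilization = app["peak_used_mb"] / app["allocated_mb"]
    if utilization < WASTE_THRESHOLD:
        print(f"{app['id']} ({app['user']}): only {utilization:.0%} of "
              f"{app['allocated_mb']} MB used -- consider smaller containers")
```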
Define queues and priority levels
Your big data cluster will have a resource management tool built in, such as YARN or Kubernetes. These tools allow you to divide your cluster into queues. This feature can work really well if you want to separate production workloads from experiments, Spark from HBase, high-priority users from low-priority ones, and so on. The trick, though, is to get the levels of these queues right.
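A quick way to see what you already have, sketched against YARN's scheduler endpoint (the field names below assume the Capacity Scheduler; the Fair Scheduler returns a different shape):

```python
# Sketch: list configured queues and their capacities from the
# YARN CapacityScheduler REST endpoint. RM host is hypothetical.
import requests

RM_URL = "http://resourcemanager.example.com:8088"

info = requests.get(f"{RM_URL}/ws/v1/cluster/scheduler").json()
for q in info["scheduler"]["schedulerInfo"]["queues"]["queue"]:
    print(f"{q['queueName']}: capacity={q['capacity']}% "
          f"used={q['usedCapacity']}% max={q['maxCapacity']}%")
```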
This is where the "measure and analyze" techniques help again. You should analyze the usage of system resources by users, departments, or any other tenant you see fit, to determine the min, max, and average that they usually demand. This will at least get you some common-sense levels for your queues.
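Concretely, that analysis can be as simple as summary statistics over a few weeks of the per-tenant measurements described earlier. A small sketch, with hypothetical data:

```python
# Sketch: derive candidate queue levels from historical demand samples.
# usage_log is hypothetical; it would come from your stored measurements.
import statistics

usage_log = {  # tenant -> hourly memory demand samples, in GB
    "production": [400, 520, 610, 480, 700],
    "adhoc":      [80, 150, 90, 300, 120],
}

for tenant, samples in usage_log.items():
    print(f"{tenant}: min={min(samples)} GB, max={max(samples)} GB, "
          f"avg={statistics.mean(samples):.0f} GB")
```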
However, queue levels may need to be adjusted dynamically for best results. For example, a mission-critical app may need more resources on a day when it processes 5x more data than usual. Therefore having a sense of seasonality is also important when allocating these levels. A heatmap of cluster usage (one way to build one is sketched below) will enable you to get more precise about these allocations.
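Here is a hedged sketch of building such a heatmap with pandas: rows are days of the week, columns are hours of the day, and each cell is the average usage observed in that slot. The DataFrame below is a hypothetical stand-in for your stored usage samples:

```python
# Sketch: day-of-week x hour-of-day heatmap of cluster memory usage.
# The sample rows are hypothetical; use your real measurement history.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-03-04 09:00", "2019-03-04 14:00",
                                 "2019-03-05 09:00"]),
    "used_gb": [620, 710, 540],
})

heatmap = df.pivot_table(
    index=df["timestamp"].dt.day_name(),   # rows: day of week
    columns=df["timestamp"].dt.hour,       # columns: hour of day
    values="used_gb",
    aggfunc="mean",
)
print(heatmap)
```

Recurring hot cells in this matrix are exactly the windows where a mission-critical queue may need a temporarily higher level.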
Proactively find and fix rogue users or apps
Even after you follow the steps above, your cluster will experience rogue usage from time to time. Rogue usage is bad behavior on the cluster by an application or user: hogging resources needed by a mission-critical app, taking more CPU or memory than required for timely execution, leaving a shell idle for hours, and so on. In a multi-tenant environment this type of behavior affects all users and ultimately reduces the reliability of the overall platform.
Therefore setting boundaries for acceptable behavior is very important to keep your big data cluster humming. A few examples of such boundaries: limits on how much memory and CPU a single application can hold, caps on the number of concurrent jobs per user, and a maximum idle time for interactive sessions.
Set the thresholds for these boundaries after analyzing your cluster's patterns over a month or so, to determine the average or accepted values; these values may well differ across days of the week. Also think about what happens when a boundary is breached. Should the user and admin get an alert? Should the rogue application be killed, or moved to a lower-priority queue? The sketch below shows the detection half of that loop.
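A hedged sketch of that detection pass: scan the currently running applications and flag any that breach illustrative thresholds. The RM host and the threshold values are assumptions; the response to a flag (an alert, a kill, or a `yarn application -movetoqueue` to a lower-priority queue) would hang off the print below:

```python
# Sketch: flag running apps that breach illustrative rogue-usage thresholds.
# RM host and thresholds are hypothetical -- tune them to your own patterns.
import requests

RM_URL = "http://resourcemanager.example.com:8088"
MAX_ELAPSED_HOURS = 12         # illustrative runtime boundary
MAX_ALLOCATED_MB = 64 * 1024   # illustrative memory boundary

apps = requests.get(
    f"{RM_URL}/ws/v1/cluster/apps", params={"states": "RUNNING"}
).json()["apps"]["app"]

for app in apps:
    elapsed_hours = app["elapsedTime"] / 1000 / 3600
    if elapsed_hours > MAX_ELAPSED_HOURS or app["allocatedMB"] > MAX_ALLOCATED_MB:
        print(f"ALERT {app['id']} ({app['user']}): "
              f"{elapsed_hours:.1f}h elapsed, {app['allocatedMB']} MB held")
```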
Let us know if this was helpful or if you have topics that you would like us to address in future postings. Check back in for the next post, which will cover rogue applications.