Your Databricks jobs run slower than expected. Memory utilization looks normal. Cluster sizing seems fine. Yet tasks keep timing out, executors go unresponsive, and you’re cutting it close on SLA deadlines.
The culprit? Garbage collection issues quietly degrading performance across your data pipelines.
Databricks gives you powerful flexibility to configure JVM tuning and optimize garbage collection performance. But here’s the problem: figuring out which workloads need attention becomes nearly impossible when you’re managing hundreds or thousands of jobs across multiple workspaces. Even experienced teams struggle to identify the right fixes without spending days digging through logs and metrics.
Understanding Garbage Collection in Databricks
Slow garbage collection occurs when the Java Virtual Machine spends excessive time cleaning up unused objects from memory, causing application pauses and severely degrading Spark performance. Garbage collection overhead becomes problematic when it consumes more than 10-15% of total execution time.
Spark applications create massive amounts of temporary objects during data processing: serialized records, intermediate results, cached data structures, and shuffle buffers. When memory pressure builds up, the JVM must frequently pause application threads to reclaim memory. Long garbage collection pauses cause tasks to timeout, executors to be marked as unresponsive, and overall job performance to degrade dramatically.
Garbage collection pauses range from milliseconds to several minutes, during which all processing stops. Applications experience stuttering performance where tasks alternate between running normally and freezing during collection cycles. Memory-intensive operations become exponentially slower as garbage collection overhead increases. In extreme cases, executors get killed due to garbage collection-induced unresponsiveness, causing task failures and job instability.
Real-world impact looks like this:
- Streaming applications experiencing 30-second garbage collection pauses, missing processing deadlines
- ETL jobs spending 60% of execution time in garbage collection instead of actual processing
- Interactive queries timing out due to garbage collection pauses blocking result collection
- Memory-intensive aggregations taking 10x longer due to constant full garbage collection cycles
Detecting Garbage Collection Issues
Monitor garbage collection frequency and duration in the Spark UI Executors tab and in Ganglia (or the cluster metrics UI on newer Databricks runtimes). Look for high “GC Time” relative to “Duration” in task metrics. Check executor logs for frequent garbage collection warnings and stop-the-world events. Watch for sawtooth heap-utilization patterns that indicate memory pressure.
Key detection patterns include:
- Garbage collection time exceeding 15% of total task execution time
- Frequent full garbage collection cycles (more than 1 per minute)
- Garbage collection pause durations exceeding 1 second
- Heap utilization consistently above 80%
- Tasks failing with “Container killed” after garbage collection timeouts
- Executor heartbeat timeouts during garbage collection pauses
- Sawtooth memory usage patterns in monitoring
- “OutOfMemoryError: Java heap space” in logs
You can analyze garbage collection behavior programmatically by monitoring memory usage patterns across operations. Check Python garbage collection stats before and after operations, measure execution duration, and look for correlation between long execution times and potential garbage collection overhead. Analyze partition sizes for memory pressure by examining rows per partition and estimating memory usage. Partitions with more than 2GB of data or over 10 million rows per partition often cause garbage collection pressure.
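As a rough illustration of that partition-size check, the sketch below counts rows per partition and flags anything past the thresholds above. The table name and the assumed average row size are placeholders, not values from this article:
import time

df = spark.read.table("transactions")   # placeholder table name

start = time.time()
rows_per_partition = (
    df.rdd
      .mapPartitions(lambda rows: [sum(1 for _ in rows)])   # one row count per partition
      .collect()
)
scan_seconds = time.time() - start

est_row_bytes = 200   # assumed average in-memory row size; adjust for your schema
for idx, n in enumerate(rows_per_partition):
    est_gb = n * est_row_bytes / 1e9
    if n > 10_000_000 or est_gb > 2:
        print(f"Partition {idx}: {n:,} rows (~{est_gb:.1f} GB estimated) - likely GC pressure")

print(f"Scanned {len(rows_per_partition)} partitions in {scan_seconds:.1f}s")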
Configuring JVM for Better Garbage Collection Performance
Use the G1 garbage collector for large-scale data processing. Configure it with these settings for Databricks:
spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=32m -XX:+G1UseAdaptiveIHOP -XX:G1NewSizePercent=20 -XX:G1MaxNewSizePercent=30 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc
spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -verbose:gc
Set appropriate memory allocation:
- spark.executor.memory: 14g (right-size based on workload)
- spark.memory.fraction: 0.75 (leave headroom for user objects and garbage collection overhead; replaces the deprecated spark.executor.memoryFraction)
- spark.memory.storageFraction: 0.5 (balance storage and execution memory; replaces the deprecated spark.storage.memoryFraction)
Reduce object creation by using Kryo serialization and enabling Arrow for PySpark:
- spark.serializer: org.apache.spark.serializer.KryoSerializer
- spark.sql.execution.arrow.pyspark.enabled: true
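These settings have to be in place when the driver and executor JVMs start, so on Databricks they belong in the cluster definition rather than in notebook code. A minimal sketch of a job-cluster spec carrying the configuration above, for use with the Clusters or Jobs API; the runtime version, node type, and memory figures are illustrative placeholders:
new_cluster = {
    "spark_version": "14.3.x-scala2.12",   # illustrative runtime version
    "node_type_id": "i3.xlarge",           # illustrative node type
    "num_workers": 8,
    "spark_conf": {
        "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=32m -verbose:gc",
        "spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc",
        "spark.executor.memory": "14g",
        "spark.memory.fraction": "0.75",
        "spark.memory.storageFraction": "0.5",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.execution.arrow.pyspark.enabled": "true",
    },
}
The same spark_conf key/value pairs can be pasted into the Spark config box of an interactive cluster or baked into a cluster policy.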
Code Patterns That Cause Garbage Collection Problems
Memory-intensive operations create excessive garbage collection overhead. Here’s what to avoid:
Bad approach with heavy garbage collection:
from pyspark.sql.functions import col, current_timestamp, year, month, dayofmonth, sum, count, collect_list

large_df = spark.read.table("transactions")
result = large_df.select("*") \
.withColumn("processed_date", current_timestamp()) \
.withColumn("year", year("transaction_date")) \
.withColumn("month", month("transaction_date")) \
.withColumn("day", dayofmonth("transaction_date")) \
.withColumn("amount_usd", col("amount") * col("exchange_rate")) \
.filter(col("amount_usd") > 100) \
.groupBy("year", "month", "customer_id") \
.agg(
sum("amount_usd").alias("total_amount"),
count("*").alias("transaction_count"),
collect_list("transaction_id").alias("transaction_ids") # Memory intensive
)
Optimized approach minimizing garbage collection:
large_df = spark.read.table("transactions")
result = large_df.select(
"customer_id",
"transaction_id",
"transaction_date",
(col("amount") * col("exchange_rate")).alias("amount_usd")
).filter(col("amount_usd") > 100) \
.withColumn("year", year("transaction_date")) \
.withColumn("month", month("transaction_date")) \
.groupBy("year", "month", "customer_id") \
.agg(
sum("amount_usd").alias("total_amount"),
count("*").alias("transaction_count")
# Removed collect_list to avoid memory pressure
)
The key differences: combine operations to reduce intermediate objects, avoid collect_list and collect_set on large datasets, and minimize the number of transformations that create temporary objects.
Optimizing Serialization and Data Structures
Configure Kryo serialization to reduce garbage collection overhead:
serialization_config = {
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.referenceTracking": "false", # Reduce object overhead
"spark.kryoserializer.buffer.max": "2047m",
"spark.rdd.compress": "true",
"spark.broadcast.compress": "true",
"spark.shuffle.compress": "true",
"spark.shuffle.spill.compress": "true"
}
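Note that spark.serializer and the Kryo settings are read when the Spark context starts, so on Databricks they also belong in the cluster’s Spark config. A minimal sketch of applying the serialization_config dictionary above in an environment where you control session creation (for example, local testing):
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("kryo-gc-tuning")
for key, value in serialization_config.items():
    builder = builder.config(key, value)   # applied before the JVM-side context starts
spark = builder.getOrCreate()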
Avoid operations that create large object collections in memory. Instead of collect_list, which can produce huge in-memory arrays, use aggregations that don’t accumulate objects:
# Bad - creates memory pressure
df.groupBy("user_id") \
.agg(collect_list("event_data").alias("all_events"))
# Good - memory efficient
df.groupBy("user_id") \
.agg(
count("*").alias("event_count"),
countDistinct("event_type").alias("unique_event_types"),
max("timestamp").alias("last_event_time"),
sum("event_value").alias("total_value")
)
ETL Pipeline Optimization for Garbage Collection
Complex nested JSON processing creates excessive objects. Here’s how to optimize:
Problematic approach:
raw_data = spark.read.json("/raw/events/")
processed = raw_data.select("*") \
.withColumn("parsed_timestamp", to_timestamp("timestamp_str")) \
.withColumn("user_info", from_json("user_data", user_schema)) \
.select("*", "user_info.*") \
.withColumn("processed_events",
transform("events", lambda x: process_event_udf(x))) \
.withColumn("event_count", size("processed_events")) \
.filter(col("event_count") > 0)
Optimized approach:
raw_data = spark.read.json("/raw/events/")
stage1 = raw_data.select(
"event_id",
to_timestamp("timestamp_str").alias("event_timestamp"),
get_json_object("user_data", "$.user_id").alias("user_id"),
get_json_object("user_data", "$.segment").alias("user_segment"),
size("events").alias("event_count")
).filter(col("event_count") > 0)
stage1.cache()
final_result = stage1.select(
"event_id",
"event_timestamp",
"user_id",
"user_segment",
"event_count"
)
Process in stages to control memory allocation. Cache intermediate results to avoid recomputation. Use get_json_object instead of from_json followed by select("*", "user_info.*") flattening, which creates far more intermediate objects.
Memory-Intensive Analytics Optimization
Large aggregations cause severe garbage collection pressure. Avoid this pattern:
# Bad - memory intensive
customer_analysis = sales.groupBy("customer_id") \
.agg(
collect_list("product_id").alias("purchased_products"),
collect_list("order_date").alias("order_dates"),
sum("amount").alias("total_spent"),
count("*").alias("order_count")
)
Use garbage collection-friendly operations instead:
# Good - low memory pressure
customer_summary = sales.groupBy("customer_id") \
.agg(
sum("amount").alias("total_spent"),
count("*").alias("order_count"),
countDistinct("product_id").alias("unique_products"),
min("order_date").alias("first_order"),
max("order_date").alias("last_order")
)
customer_summary.cache()
enhanced_summary = customer_summary.withColumn(
"customer_tenure_days",
datediff("last_order", "first_order")
).withColumn(
"avg_order_value",
col("total_spent") / col("order_count")
)
Streaming Applications and Garbage Collection
Streaming applications need specific garbage collection tuning to handle continuous processing:
- Trigger interval: 10 seconds, set on the query with .trigger(processingTime="10 seconds") to keep micro-batches small
- spark.cleaner.periodicGC.interval: 1h (periodically trigger garbage collection so shuffle and RDD metadata get cleaned up; spark.cleaner.ttl is no longer available in recent Spark versions)
- spark.executor.memory: 8g (smaller executors keep garbage collection pauses short)
- spark.memory.fraction: 0.7
- spark.memory.storageFraction: 0.3 (reduce caching for streaming)
- spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:G1HeapRegionSize=16m (shorter pause target and smaller regions for frequent collections)
For streaming queries, avoid stateful operations that accumulate unbounded state. Use simple aggregations over time windows with a watermark so expired state can be dropped:
stream = spark.readStream.format("delta").load("/streaming/events")
processed_stream = stream.select(
"event_id",
"timestamp",
"user_id",
"event_type"
).filter(col("event_type").isin(["purchase", "signup"])) \
    .withWatermark("timestamp", "2 minutes") \
    .groupBy(window("timestamp", "1 minute"), "event_type") \
    .count()
query = processed_stream.writeStream \
.outputMode("update") \
.format("console") \
.trigger(processingTime="10 seconds") \
.start()
Advanced Garbage Collection Monitoring
Set up comprehensive garbage collection logging for production workloads:
spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/gc-executor.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M
Monitor applications for garbage collection pressure indicators. If garbage collection overhead exceeds 15% of execution time, take action: increase executor memory, reduce partition sizes, optimize data structures, or enable the G1 garbage collector if you are not already using it.
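One way to watch that 15% threshold from a notebook is to pull per-executor totals from Spark’s monitoring REST API. A sketch, assuming the driver UI URL returned by uiWebUrl is reachable from where the code runs:
import requests

sc = spark.sparkContext
app_id = sc.applicationId
ui_url = sc.uiWebUrl   # e.g. http://<driver-host>:4040; reachability depends on your environment

executors = requests.get(f"{ui_url}/api/v1/applications/{app_id}/executors").json()
for ex in executors:
    if ex["id"] == "driver" or ex["totalDuration"] == 0:
        continue
    gc_pct = 100.0 * ex["totalGCTime"] / ex["totalDuration"]   # both values are in milliseconds
    flag = "<-- investigate" if gc_pct > 15 else ""
    print(f"executor {ex['id']}: {gc_pct:.1f}% of task time in GC {flag}")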
Garbage Collection Tuning Best Practices
Use the G1 garbage collector for applications with large heaps (over 4GB). Set appropriate garbage collection pause-time targets (100-200ms for most workloads). Avoid memory-intensive operations like collect_list on large datasets. Use Kryo serialization to reduce object overhead.
Right-size executor memory; heaps that are too large lead to long garbage collection pauses. Monitor garbage collection logs and tune based on actual patterns. Reduce object creation through efficient transformations. Cache strategically to avoid recomputation without adding memory pressure. Use columnar formats (Parquet, Delta) to reduce object allocation. Break large operations into smaller stages to control memory usage, as in the sketch below.
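To make “smaller stages” concrete, one pattern is to materialize an intermediate result so each stage keeps a bounded working set. This is a sketch only; the table names and Delta path are placeholders:
from pyspark.sql.functions import col, sum as _sum, count

# Stage 1: project, convert, and filter, then persist the slimmed-down result
stage1 = (
    spark.read.table("transactions")                         # placeholder table
         .select(
             "customer_id",
             "order_date",
             (col("amount") * col("exchange_rate")).alias("amount_usd"),
         )
         .filter(col("amount_usd") > 100)
)
stage1.write.mode("overwrite").format("delta").save("/tmp/stage1_amounts")   # placeholder path

# Stage 2: aggregate from the materialized intermediate instead of recomputing stage 1
stage2 = (
    spark.read.format("delta").load("/tmp/stage1_amounts")
         .groupBy("customer_id")
         .agg(_sum("amount_usd").alias("total_spent"), count("*").alias("order_count"))
)
stage2.write.mode("overwrite").saveAsTable("customer_order_summary")          # placeholder table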
Common garbage collection anti-patterns to avoid:
- Using collect_list or collect_set on large datasets
- Oversized executor heaps (over 32GB) causing long garbage collection pauses
- Retaining large cached datasets unnecessarily
- Complex nested data structures creating object overhead
- Using default JVM garbage collection settings for large-scale data processing
- Not monitoring garbage collection metrics in production workloads
Optimal executor sizing for garbage collection:
- Small executors (2-4 cores, 4-8GB): Better garbage collection performance, more overhead
- Medium executors (4-8 cores, 8-16GB): Good balance for most workloads
- Large executors (over 8 cores, over 16GB): Risk of garbage collection pauses, use only with tuning
Enterprise Visibility Challenges
Here’s where things get complicated at scale.
Managing hundreds of jobs across multiple workspaces means correlating metrics from executor logs, Spark UI data, and cluster-level monitoring. This manual process becomes impractical quickly. You’re left trying to answer questions like: Which of my 500 jobs have garbage collection problems? What’s the business impact? Is it worth fixing? What configuration changes will actually help?
A job spending 30% of its execution time in garbage collection might meet its SLA today. But as data volumes grow, that hidden overhead pushes processing into the next day’s batch window. The relationship between garbage collection performance and business impact is difficult to quantify without comprehensive visibility beyond what Databricks’ native monitoring provides.
Root cause analysis for garbage collection problems demands expertise that many data teams lack. When a job fails due to executor unresponsiveness, determining whether it was caused by excessive pause times, inadequate heap sizing, memory-intensive operations, or inefficient serialization typically requires deep JVM tuning knowledge and extensive manual investigation.
Cost implications of suboptimal garbage collection tuning remain largely invisible. Jobs with poor garbage collection performance consume cluster resources for longer periods, directly increasing compute costs. Yet without visibility into the relationship between garbage collection overhead and job duration, teams struggle to prioritize optimization efforts based on potential cost reductions.
Configuration drift across clusters compounds the detection challenge. Even when teams establish garbage collection tuning best practices, ensuring consistent application across all clusters and workspaces requires systematic tracking. Jobs inherit suboptimal JVM configurations from cluster templates. Individual teams implement conflicting tuning strategies based on incomplete understanding.
How Unravel Complements Databricks with Automated Garbage Collection Intelligence
Unravel’s DataOps Agent works alongside Databricks to transform garbage collection tuning from a manual detective process into automated intelligence. The platform continuously monitors every workload and automatically detects when jobs experience performance degradation due to garbage collection overhead. It analyzes executor metrics and task execution patterns to identify which workloads need attention.
Instead of making data engineers manually review logs and correlate metrics across Databricks’ extensive monitoring data, Unravel surfaces garbage collection issues proactively with clear context about their business impact. You see exactly which jobs have problems and what those problems are costing you.
The Data Engineering Agent provides specific, actionable recommendations that leverage Databricks’ powerful JVM tuning capabilities. When garbage collection problems are identified, Unravel tells you exactly which Databricks configurations to adjust: G1HeapRegionSize settings, memory fractions, or code refactoring to reduce object allocation pressure. These recommendations come with estimated performance improvements and cost savings.
Unravel’s automation operates at three controllable levels. For teams starting their optimization journey, Unravel surfaces issues and suggests fixes, but all changes require manual approval before being applied. Organizations ready for more automation can enable auto-approval for certain types of garbage collection optimization. For enterprises seeking maximum efficiency, Unravel can automatically apply garbage collection tuning changes with appropriate governance controls.
Organizations using Unravel alongside Databricks for garbage collection tuning typically achieve 25-35% cost reduction through improved resource utilization while simultaneously improving job performance. Teams maximize their Databricks investment without requiring deep JVM tuning expertise across the entire data engineering organization.
Other Useful Links
- Our Databricks Optimization Platform
- Get a Free Databricks Health Check
- Check out other Databricks Resources