Snowflake UNION ALL Performance: Query Speed Boost


Most developers default to UNION when combining query results in Snowflake. It’s the safe choice – UNION automatically removes duplicates, so you don’t have to think about whether your data sources might overlap.

But that safety comes with a significant performance cost.

Every time you use UNION, Snowflake performs duplicate elimination on your entire result set. For large datasets, this means sorting or hashing millions of rows to identify duplicates, consuming substantial memory and CPU resources. Queries that should complete quickly stretch into minutes. Memory spills to disk. Compute costs multiply.

UNION ALL skips the entire duplicate elimination process. When you know your data sources don’t overlap, or when duplicates are acceptable, UNION ALL simply concatenates result sets and finishes. The performance difference for large datasets can be dramatic – we’re talking 10x faster execution times and substantially lower compute costs.

This article explains when UNION ALL delivers maximum performance, when UNION’s duplicate elimination is actually necessary, and how to identify the queries where this optimization provides the biggest gains.

Understanding UNION vs UNION ALL in Snowflake

Snowflake provides two operators for combining query results: UNION and UNION ALL. Both merge rows from multiple SELECT statements into a single result set, but they handle duplicates fundamentally differently.

UNION performs duplicate elimination automatically. After combining result sets, Snowflake identifies and removes duplicate rows, ensuring each unique row appears only once in the output. This implicit DISTINCT operation requires sorting or hashing the entire combined dataset.

That’s expensive.

UNION ALL simply concatenates inputs. It combines all rows from each query without any duplicate checking or elimination. If the same row appears in multiple source queries, it appears multiple times in the final result. This straightforward concatenation requires no additional processing beyond merging the result sets.

The performance difference stems entirely from that duplicate elimination step.
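One way to see the hidden work: UNION behaves like an explicit DISTINCT wrapped around UNION ALL. A sketch with illustrative table names:

-- These two queries return the same rows
SELECT id, amount FROM table_a
UNION
SELECT id, amount FROM table_b;

-- Equivalent: concatenate first, then deduplicate explicitly
SELECT DISTINCT id, amount
FROM (
    SELECT id, amount FROM table_a
    UNION ALL
    SELECT id, amount FROM table_b
);

Writing the DISTINCT out makes the cost visible: every UNION carries that deduplication step whether you see it or not.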

Consider combining data from three monthly tables, each containing 10 million rows. UNION ALL concatenates all 30 million rows and finishes. UNION must process those same 30 million rows through a deduplication algorithm – sorting or hashing to identify duplicates, then filtering them out.

For Snowflake’s columnar storage and distributed architecture, this deduplication becomes particularly costly. The system must shuffle data across compute nodes to group identical rows together, consume memory to track which rows have been seen, and perform CPU-intensive comparisons.

When datasets are large, memory pressure increases, potentially triggering spillage to local or remote disk.

Memory consumption patterns differ dramatically:

  • UNION ALL: Memory usage scales with the combined result set size
  • UNION: Memory usage includes the combined set plus overhead for duplicate tracking
  • Large result sets with UNION can easily exceed available warehouse memory
  • Memory spillage to disk devastates query performance, especially remote spillage
  • UNION ALL avoids this entire category of performance problems

The query optimizer handles these operators differently too. UNION appears in query profiles as a UnionAll operator with an extra Aggregate operator on top performing duplicate elimination. That additional operator represents real computational work – work that UNION ALL completely eliminates.
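One way to confirm this for a query you just ran is Snowflake’s GET_QUERY_OPERATOR_STATS table function, which returns per-operator statistics after a query completes:

-- Run immediately after the query you want to inspect.
-- A UNION query shows a UnionAll operator plus an Aggregate operator;
-- the UNION ALL version shows no Aggregate.
SELECT operator_id, operator_type, operator_statistics
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));

Comparing the execution time attributed to the Aggregate operator against the rest of the plan shows exactly how much the duplicate elimination costs.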

CPU utilization follows memory patterns. UNION ALL performs minimal CPU work beyond merging result sets. UNION adds substantial CPU load for hashing or sorting operations, row comparisons, and managing the duplicate elimination process.

For queries combining large datasets, this CPU overhead translates directly to longer execution times and higher compute costs.

Network data transfer increases with UNION operations. Snowflake’s distributed architecture moves data between compute nodes during query execution. The duplicate elimination process requires shuffling data to group potential duplicates together, multiplying network traffic compared to UNION ALL’s straightforward concatenation.

When to Use UNION ALL for Maximum Performance

UNION ALL delivers optimal performance when duplicate rows are acceptable or when data sources naturally don’t overlap. Understanding these scenarios helps you choose the right operator and avoid unnecessary duplicate elimination overhead.

Combining non-overlapping datasets represents the most common valid UNION ALL use case.

When merging data partitioned by time periods, geographic regions, or other dimensions that guarantee no overlap, duplicates are impossible. UNION ALL is the correct choice.

Monthly data aggregation exemplifies this pattern perfectly:

-- Optimal: Each month contains distinct records
SELECT customer_id, order_date, amount 
FROM orders_january
UNION ALL
SELECT customer_id, order_date, amount 
FROM orders_february
UNION ALL  
SELECT customer_id, order_date, amount 
FROM orders_march;

Each monthly table contains orders from that specific month only. No order can appear in multiple tables. Duplicate elimination is unnecessary work that UNION ALL correctly avoids.

The query simply concatenates the three result sets and finishes, delivering maximum performance with minimum work.

Partitioned data sources follow the same principle. When combining regional tables, department-specific tables, or any other non-overlapping partitions, UNION ALL eliminates duplicate elimination overhead:

-- Fast: Regional data doesn't overlap
SELECT transaction_id, amount, region
FROM sales_north_america
UNION ALL
SELECT transaction_id, amount, region
FROM sales_europe
UNION ALL
SELECT transaction_id, amount, region
FROM sales_asia;

The partitioning scheme guarantees each transaction appears in exactly one regional table. UNION would waste resources checking for duplicates that can’t exist.

When duplicates are acceptable:

If your business logic doesn’t require unique rows, or if downstream processing handles deduplication, UNION ALL’s performance advantage outweighs any duplicate concerns:

-- Duplicates acceptable: Logging events from multiple sources
SELECT event_timestamp, event_type, user_id
FROM application_logs
UNION ALL
SELECT event_timestamp, event_type, user_id
FROM system_logs;

Event logs might contain duplicate entries across sources, but the analysis doesn’t require uniqueness. Preserving all events with UNION ALL maintains data fidelity while optimizing query performance.

Building intermediate result sets for further processing often benefits from UNION ALL.

When the combined dataset feeds into aggregations, joins, or other operations that naturally handle duplicates, eliminating them prematurely wastes resources:

-- Efficient: GROUP BY handles any duplicates
SELECT category, SUM(amount) as total
FROM (
    SELECT category, amount FROM q1_sales
    UNION ALL
    SELECT category, amount FROM q2_sales
    UNION ALL
    SELECT category, amount FROM q3_sales
)
GROUP BY category;

The GROUP BY aggregation consolidates duplicate categories regardless of whether UNION or UNION ALL combines the quarterly data. Using UNION ALL avoids duplicate elimination that the subsequent GROUP BY renders redundant.

ETL and data pipeline operations frequently combine data from multiple sources where duplicates either don’t exist or get handled in later pipeline stages. UNION ALL keeps these operations fast and efficient, improving overall pipeline performance and reducing compute costs.

The key principle: if you don’t specifically need duplicate elimination, don’t pay for it. UNION ALL provides the same row combination functionality without the performance overhead.


When UNION’s Duplicate Elimination is Necessary

UNION becomes the appropriate choice when your query specifically requires unique rows and data sources might contain duplicates. Understanding when duplicate elimination is necessary prevents choosing UNION ALL inappropriately.

Overlapping data sources represent the primary scenario requiring UNION.

-- Necessary: Customer lists might overlap
SELECT customer_id, customer_name, email
FROM active_customers
UNION
SELECT customer_id, customer_name, email
FROM vip_customers;

Active customers might also be VIP customers. Without duplicate elimination, these customers would appear twice in the result set. UNION correctly removes duplicates, ensuring each customer appears once regardless of which categories they belong to.

-- Correct: Recent data might duplicate archived records
SELECT order_id, customer_id, order_date, amount
FROM orders_archive
UNION
SELECT order_id, customer_id, order_date, amount
FROM orders_current;

If current orders haven’t been archived yet, they exist only in orders_current. If archival has occurred, some orders might exist in both tables.

UNION eliminates these duplicates appropriately.

Combining filtered subsets from the same table can create duplicates when filter conditions overlap:

-- UNION needed: Orders might meet both conditions
SELECT order_id, customer_id, amount
FROM orders
WHERE amount > 1000
UNION
SELECT order_id, customer_id, amount
FROM orders
WHERE customer_type = 'VIP';

An order with amount over 1000 from a VIP customer satisfies both conditions. UNION eliminates the duplicate, while UNION ALL would return it twice.

Business logic requiring uniqueness justifies UNION even at a performance cost. If reporting, compliance, or application requirements demand unique rows, the duplicate elimination overhead becomes a necessary expense.

However, evaluate whether you truly need uniqueness at the UNION stage.

If downstream operations would handle duplicates anyway, or if application logic can manage them, UNION ALL’s performance advantage might outweigh the duplicate concern.
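When only certain columns define uniqueness, a middle ground is UNION ALL followed by explicit deduplication with Snowflake’s QUALIFY clause. A sketch reusing the customer tables above (the updated_at column is illustrative):

-- Keep one row per customer_id, preferring the most recent record
SELECT customer_id, customer_name, email
FROM (
    SELECT customer_id, customer_name, email, updated_at
    FROM active_customers
    UNION ALL
    SELECT customer_id, customer_name, email, updated_at
    FROM vip_customers
)
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1;

Unlike UNION, which compares entire rows, this approach lets you define uniqueness on a key and control which duplicate survives.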

The critical distinction: use UNION when duplicate elimination is part of your query’s logical requirements. Use UNION ALL when it’s unnecessary overhead.

Performance Impact of UNION’s Duplicate Elimination

The performance difference between UNION and UNION ALL becomes dramatic with large datasets. Understanding the specific impacts helps you recognize when defaulting to UNION creates unnecessary overhead.

Memory consumption increases substantially with UNION operations.

Snowflake must allocate memory to track rows during duplicate elimination, storing hash values or maintaining sorted structures to identify duplicates. For a query combining result sets totaling 100 million rows, UNION might consume 3-5x more memory than UNION ALL.

This memory overhead triggers spillage scenarios that devastate performance. When the duplicate elimination process exceeds available warehouse memory, Snowflake begins spilling to local disk. Queries slow noticeably but remain manageable.

If local disk capacity proves insufficient, remote spillage to cloud storage occurs.

Remote spillage kills query performance. Network latency dominates execution time as Snowflake repeatedly writes intermediate results to remote storage and reads them back for processing.

A query that completes in 30 seconds with UNION ALL might take 20 minutes with UNION if remote spillage occurs.
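One way to check whether your UNION queries are spilling is the QUERY_HISTORY view in the SNOWFLAKE.ACCOUNT_USAGE schema (the seven-day window is an example):

-- Recent queries that spilled to local or remote storage
SELECT query_id,
       total_elapsed_time / 1000 AS elapsed_seconds,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0
       OR bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_remote_storage DESC;

Any nonzero remote spillage on a query that combines result sets with UNION is a strong signal to test the UNION ALL equivalent.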

CPU utilization spikes during duplicate elimination. Sorting algorithms or hash computations for identifying duplicates consume significant CPU cycles. For large result sets, this CPU work extends execution time even when memory remains adequate.

A UNION query might use 10x the CPU time of the equivalent UNION ALL query.

Query execution time provides the most visible performance impact. Simple UNION ALL operations on large datasets complete in seconds – the time required to scan source tables and concatenate results. Adding UNION’s duplicate elimination can extend this to minutes or hours depending on dataset size and warehouse capacity.

Optimizing Query Performance with UNION ALL

Maximizing Snowflake UNION ALL performance requires intentionally choosing UNION ALL over UNION when duplicate elimination is unnecessary. This optimization applies across various query patterns and data pipeline scenarios.

Review existing UNION queries systematically.

Many developers default to UNION out of habit or uncertainty about whether duplicates might exist. Examine each UNION in your codebase and ask: do I specifically need duplicate elimination here?

If the answer is no, UNION ALL provides immediate performance improvement. Common candidates for conversion include:

  • Combining time-partitioned tables (monthly, daily, yearly data)
  • Merging geographically partitioned data (regional tables)
  • Aggregating data where GROUP BY handles uniqueness
  • Building intermediate result sets for further processing
  • ETL operations where duplicates get handled downstream

Audit query profiles to identify UNION operations causing performance problems. Look for queries with the UnionAll operator followed by an Aggregate operator. Check execution time and memory consumption for these operators.
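A rough way to surface candidates is comparing UNION and UNION ALL occurrences in query text: if plain UNION appears more often than UNION ALL, the query contains at least one bare UNION. This text-level heuristic can misfire on comments and string literals, so treat the results as a starting point:

-- Heuristic: recent queries containing UNION without ALL
SELECT query_id,
       total_elapsed_time / 1000 AS elapsed_seconds,
       query_text
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
  AND REGEXP_COUNT(query_text, 'UNION', 1, 'i')
      > REGEXP_COUNT(query_text, 'UNION\\s+ALL', 1, 'i')
ORDER BY total_elapsed_time DESC
LIMIT 50;

Sorting by elapsed time puts the queries with the largest potential savings at the top of the review list.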

Document data partitioning schemes to make UNION ALL decisions clearer. When teams understand how data is partitioned – by date, region, department, or other dimensions – choosing UNION ALL becomes obvious.

Establish coding standards that default to UNION ALL unless duplicate elimination is explicitly required.

How Unravel’s Data Engineering Agent Automates Query Optimization

Identifying inefficient UNION usage across thousands of queries proves challenging manually. Development teams focus on delivering features. Query optimization becomes reactive – addressing problems only after performance degrades noticeably.

Unravel’s Data Engineering Agent eliminates this reactive approach by moving from insight to automated action.

Built natively on Snowflake System Tables, the Data Engineering Agent requires no agents, no data extraction, and no external infrastructure. It analyzes query patterns directly within your Snowflake environment using Delta Sharing or Direct Share, maintaining complete data security.

The agent monitors every query execution, identifies UNION operations performing unnecessary duplicate elimination, and determines when UNION ALL would deliver better performance.

It doesn’t just flag the issue. It implements the fix based on your governance preferences.

You control exactly how much automation to enable:

Level 1 – Recommendations: The agent identifies queries using UNION when UNION ALL semantics are sufficient and recommends the optimization. You review each suggestion, understand the change, and approve manually.

Level 2 – Auto-Approve Specific Types: Enable automatic implementation for low-risk UNION-to-UNION ALL conversions. The agent applies these fixes automatically while requiring approval for complex cases.

Level 3 – Full Automation with Guardrails: Define policies where UNION ALL is appropriate, and let the agent continuously optimize within those boundaries.

Teams using Unravel’s Data Engineering Agent achieve 25-35% sustained cost reduction and eliminate hundreds of hours of manual tuning work while maintaining full data integrity and governance.
