

Snowflake SELECT *: Critical Performance Fixes



Typing SELECT * is convenient. Fast during development, requires zero column memorization, seems harmless for quick data exploration. But in production Snowflake environments processing millions of rows daily, that asterisk silently creates performance problems until monthly bills spike 30% or dashboards grind to a crawl.

This isn’t about Snowflake’s architecture being flawed. Snowflake built an exceptional cloud data platform—columnar storage, automatic micro-partitioning, intelligent query optimization that genuinely works. Outstanding performance when queries specify exactly what data they need. But SELECT * forces Snowflake to read, decompress, and transfer every column in every micro-partition the query touches, whether applications use that data or not.

The cost impact? Measurable. A query selecting 50 columns when it needs 5 consumes 10x more I/O resources. Burns credits proportionally. Scale that across thousands of daily queries from dashboards, ETL pipelines, ad-hoc analysis, and inefficiency becomes expensive. Organizations running Snowflake at enterprise scale consistently find that eliminating SELECT * patterns delivers 25-35% credit reduction while simultaneously improving query performance for end users.

This article examines why Snowflake SELECT * undermines performance, breaks applications, wastes credits. More importantly, it provides the critical fixes that restore efficiency without requiring application rewrites or workflow disruption.

Understanding Snowflake’s Columnar Storage Architecture

Snowflake stores data in columnar format within micro-partitions. Fundamentally different from traditional row-based databases. Each column exists as a separate compressed structure, enabling exceptional compression ratios and query performance when queries are column-specific. This architecture is why Snowflake excels at analytical workloads, but it requires query patterns that work with the design rather than against it.

When you execute a query, Snowflake doesn’t retrieve entire rows and filter later—it reads only the specific columns requested from relevant micro-partitions. A table with 50 columns? That’s 50 separately compressed column segments within each micro-partition. Requesting 5 columns means reading 5 segments. SELECT * means reading all 50, even if your application immediately discards 45 of them.

Consider a customer profiles table with customer_id, name, email, and 47 additional attributes including purchase history, preferences, behavioral metadata. Execute SELECT * FROM customer_profiles WHERE region = 'US' and Snowflake must:

  • Read all 50 column files from every micro-partition containing US customers
  • Decompress each column’s data
  • Transfer everything across the network to your warehouse
  • Send it all to your application

The Query Profile reveals waste clearly in its “Bytes scanned” metric. A SELECT * query against a 10GB table with 50 columns scans all 10GB. Selecting 5 specific columns scans approximately 1GB. 10x reduction in I/O. Lower I/O translates directly to faster queries and fewer credits consumed.
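The contrast is easy to see side by side. A minimal sketch using the customer_profiles example from above—run both and compare “Bytes scanned” in each Query Profile:

```sql
-- Scans every column segment in each matching micro-partition
SELECT * FROM customer_profiles WHERE region = 'US';

-- Scans only the columns the application actually uses
SELECT customer_id, name, email
FROM customer_profiles
WHERE region = 'US';
```

Both queries prune the same micro-partitions; only the second limits how much of each surviving partition gets decompressed and transferred.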

This difference becomes critical when tables include large TEXT or VARIANT columns—a product catalog storing 10MB JSON metadata and lengthy descriptions per row transfers massive amounts of data with SELECT *. Those columns might represent 80% of the table’s storage but serve zero purpose in queries just checking product availability or pricing. Snowflake SELECT * queries against wide tables with heavy columns often run 5-10x slower than targeted column selection.

Snowflake’s columnar advantage isn’t theoretical. It’s the foundation of the platform’s performance reputation. But that advantage only materializes when queries specify columns explicitly.

Micro-Partition Pruning and Data Transfer Efficiency

Micro-partitions are Snowflake’s fundamental storage unit. Typically 50-500MB of uncompressed data (considerably smaller once compressed) organized into columnar structures. Snowflake’s query optimizer uses metadata to determine which micro-partitions contain relevant data, pruning away partitions that don’t match query predicates—this pruning is highly efficient, but what happens after pruning depends entirely on query column selection.

Even when Snowflake successfully prunes 90% of micro-partitions through WHERE clause evaluation or clustering key filtering, SELECT * forces complete column decompression from the remaining 10%. Query a 1 million row table and Snowflake prunes it to 100,000 relevant rows across 50 micro-partitions? Those partitions still contain every column. SELECT * means decompressing and transferring all columns from all 50 partitions.

Network transfer becomes a bottleneck faster than most teams anticipate. Data moves from Snowflake’s storage layer to the compute layer, then from Snowflake infrastructure to your application over public internet connections. Large result sets strain both internal and external network capacity. Return 100,000 rows with 20 columns where only 10 are needed? You’ve doubled network transfer time, doubled data egress costs, doubled the wait before applications receive results.

Result caching suffers from Snowflake SELECT * patterns. Snowflake caches query results for 24 hours, but cache hits require identical queries. One team runs SELECT * FROM orders WHERE date = '2024-11-01' while another runs SELECT order_id, customer_id, amount FROM orders WHERE date = '2024-11-01'. These are different queries despite querying identical underlying data. No cache hit occurs. Explicit column selection improves cache utilization across team queries because different queries requesting the same specific columns can share cached results more effectively.

Memory pressure on warehouses increases with wider result sets. Virtual warehouses allocate finite memory across compute nodes. When multiple concurrent queries all use SELECT *, memory fills rapidly with unnecessary column data, triggering disk spilling where Snowflake writes intermediate results to storage because they exceed available memory. Disk spilling destroys query performance—operations completing in seconds can require minutes when spilling occurs.
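Spilling leaves a trail you can query. A sketch against the ACCOUNT_USAGE schema (assumes you have access to it; the view reflects activity with up to ~45 minutes of latency, and the 7-day window is illustrative):

```sql
-- Find recent queries that spilled to local or remote storage
SELECT query_id,
       warehouse_name,
       total_elapsed_time / 1000 AS elapsed_seconds,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0
       OR bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_remote_storage DESC
LIMIT 50;
```

Remote spilling is the more expensive of the two, so sorting on it surfaces the worst offenders first.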

For clustered tables, Snowflake SELECT * diminishes clustering benefits. An orders table clustered on customer_id efficiently locates relevant micro-partitions when querying WHERE customer_id = 12345. After finding those partitions though? SELECT * still decompresses every column. Compare this to SELECT order_id, order_date, amount FROM orders WHERE customer_id = 12345, which maintains the clustering advantage while minimizing decompression overhead.

Credit Consumption and Cost Impact

Snowflake’s consumption-based pricing means every second of warehouse compute time costs credits. Credits accumulate based on warehouse size and runtime. Query efficiency directly impacts your bill. Snowflake SELECT * is a measurable cost problem, not just a performance annoyance.

CPU cycles spent decompressing unused columns? Wasted credits. Snowflake’s columnar compression is sophisticated, using different algorithms per column based on data characteristics—decompression requires CPU processing, and when queries select all columns but use only a subset, warehouses burn CPU cycles on decompression that provides zero application value.

This inefficiency scales with query frequency. Dashboard refreshing every 5 minutes with SELECT * against a large fact table wastes credits 288 times daily. Optimize that single query by eliminating unnecessary columns, reduce execution time from 12 seconds to 3 seconds, and you’ve cut that query’s credit consumption by 75%. Multiply across thousands of queries? Savings compound quickly.
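High-frequency queries like that dashboard refresh are the best place to start. A sketch for surfacing them (the 10-run threshold and one-day window are illustrative; grouping on raw query_text only matches exact repeats):

```sql
-- Surface frequently repeated SELECT * queries: prime candidates for pruning
SELECT query_text,
       COUNT(*)                       AS runs_last_day,
       AVG(total_elapsed_time) / 1000 AS avg_elapsed_seconds,
       SUM(bytes_scanned)             AS total_bytes_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
  AND query_text ILIKE 'SELECT *%'
GROUP BY query_text
HAVING COUNT(*) > 10
ORDER BY total_bytes_scanned DESC;
```

Fixing one query near the top of this list pays off every time it reruns.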

Memory’s role in credit costs gets underestimated. When warehouses exhaust memory and spill to disk, queries slow dramatically but keep running and consuming credits the entire time. A query costing 0.1 credits might consume 2 credits when disk spilling occurs. Snowflake SELECT * increases memory pressure and spilling likelihood, especially during concurrent query load periods.

Network transfer time keeps warehouses active longer. While data transfers to applications, warehouses remain running and accumulating charges. Larger result sets from SELECT * mean longer active time before query completion. For queries returning hundreds of thousands or millions of rows, transfer time can equal or exceed actual compute time. You’re paying for both.

Query History makes cost impact visible:

  • Filter for long-running queries and examine their Query Profiles
  • High “Bytes scanned” relative to actual data needs signals waste
  • Common pattern: queries scanning gigabytes or terabytes while returning only megabytes to applications
  • That ratio indicates massive inefficiency, typically driven by SELECT * pulling columns filtered out in application code
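That scan-to-return ratio can be computed directly from Query History. A sketch—the 1GB floor and 100x ratio are illustrative heuristics, not Snowflake-recommended thresholds:

```sql
-- Flag queries that scan far more data than they return
SELECT query_id,
       warehouse_name,
       bytes_scanned,
       bytes_written_to_result,
       bytes_scanned / NULLIF(bytes_written_to_result, 0) AS scan_to_return_ratio
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND bytes_scanned > 1e9
  AND bytes_scanned > 100 * NULLIF(bytes_written_to_result, 0)
ORDER BY scan_to_return_ratio DESC
LIMIT 100;
```

Queries at the top of this list are usually either missing WHERE predicates or pulling columns the application throws away.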

Teams running Snowflake at scale consistently report 25-35% credit reduction after systematic elimination of SELECT * patterns across query workloads. This reduction doesn’t require application rewrites or architecture changes, just explicit column specification. When annual Snowflake spend reaches hundreds of thousands of dollars, 30% reduction delivers significant ROI for relatively simple code modifications.

Breaking Changes and Security Vulnerabilities

Beyond performance and cost, Snowflake SELECT * creates maintenance risks and security exposure that emerge gradually. Production incidents. Compliance violations.

Application code relying on SELECT * with positional column access breaks when schemas evolve. Code extracting data as ResultSet rs = statement.executeQuery("SELECT * FROM users") then accessing columns by index like String email = rs.getString(5) fails whenever a schema change shifts column positions—a dropped column, or a table rebuilt with a different column order. Development teams change tables constantly, and normal schema evolution shouldn’t break applications, but SELECT * makes code fragile to these changes.

Even when applications use column names instead of positions, SELECT * can introduce subtle bugs. A table starting with 10 columns might grow to 50 over months as business requirements evolve. If original developers assumed certain performance characteristics? Applications may degrade unexpectedly. A query completing in 2 seconds with 10 columns might require 15 seconds with 50, especially if new columns contain heavy data types like large VARCHAR or VARIANT fields.

Security and compliance teams increasingly scrutinize data access patterns. Snowflake SELECT * queries potentially expose sensitive columns to users or applications that shouldn’t access them. A customer_data table might initially contain customer_id, name, and email, then later have social_security_number and credit_card_info added—legacy code with SELECT * suddenly returns PII and PCI data to applications not designed to handle it securely or log it appropriately.

Snowflake’s row access policies and column-level security provide defensive controls. Reactive measures. Better practice is explicit column selection that clearly documents query data needs. When auditing data access for compliance, SELECT customer_id, name, email FROM customer_data demonstrates intentional, minimal data retrieval. SELECT * looks like lazy coding that might inappropriately access sensitive data.

Query readability and maintainability deteriorate with SELECT *. Six months later, can you determine what columns a query actually uses? Does it need all 50 columns or just 5? You must examine every line of code processing the result set to know. Explicit column selection makes query intent transparent: these specific data points are required for this operation.


Development Workflow and Code Quality

Production performance matters most, but Snowflake SELECT * also affects development efficiency and long-term code maintainability. Teams standardizing on explicit column selection write better code. Debug faster. Onboard new engineers more effectively.

Code reviews become more meaningful when queries show exactly what data they access—a pull request with SELECT customer_id, order_date, total_amount, status FROM orders communicates intent clearly, letting reviewers assess whether the query makes sense for the feature being built. SELECT * hides this information. Reviewers must trace through application code to determine actual column usage.

Development environments benefit from explicit column selection. When prototyping queries or exploring new tables, developers often start with SELECT * to see available data. Fine for initial exploration. The problem occurs when SELECT * gets committed to production code. Establishing a simple rule works: “SELECT * allowed in scratch queries, never in production code.”

Troubleshooting query performance becomes easier with explicit columns. When a query runs slowly, the first question is typically “what data is this accessing?” With SELECT *, you examine Query Profile byte counts and infer which columns might cause issues. With explicit selection? The query itself documents what it retrieves.

Creating focused views and materialized views works better with explicit column selection:

  • Instead of building views with SELECT *, define exactly which columns each view should expose
  • A customer_summary view might include customer_id, name, email, registration_date, and status
  • Teams can query the view confidently
  • The view definition documents the standard column set
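Putting the bullet points above together, the customer_summary example might look like this (a sketch; the view draws on the customer_profiles table used earlier):

```sql
-- A focused view exposing only the standard column set
CREATE OR REPLACE VIEW customer_summary AS
SELECT customer_id,
       name,
       email,
       registration_date,
       status
FROM customer_profiles;
```

Consumers querying the view get a stable, documented contract, and columns added to the base table later don’t silently leak into every downstream query.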

Documentation and knowledge sharing improve when queries are self-documenting. New engineers reading existing code understand data flows by examining query column lists—they see which tables connect via which keys, what attributes are relevant for each operation, how data moves through the system. SELECT * obscures all of this, forcing engineers to run queries and examine results to understand what’s happening.

Enterprise Visibility Challenges at Scale

The critical fixes for Snowflake SELECT * deliver clear benefits. Faster execution, lower credit costs, better security, improved code quality. But realizing these benefits across enterprise Snowflake deployments requires visibility into actual query patterns and their cost impact.

Most organizations run hundreds or thousands of daily queries across multiple warehouses, written by dozens of developers and analysts. Query History shows execution records, but manually identifying problematic SELECT * patterns is impractical. Which queries scan the most unnecessary data? Which tables have SELECT * queries hitting them most frequently? What’s the aggregate credit cost of all SELECT * queries versus optimized alternatives?

Snowflake’s Query Profile provides deep insights into individual query execution—bytes scanned, micro-partitions pruned, network transfer times—invaluable for optimizing specific queries, but at enterprise scale, teams need aggregate analysis across thousands of queries to identify optimization opportunities delivering highest ROI.

The challenge compounds when queries are generated programmatically by BI tools, ETL frameworks, or application ORMs. Developers might not realize their code generates SELECT * queries. The ORM defaults to fetching all columns unless explicitly configured otherwise. By the time performance issues surface? Hundreds of similar queries already run in production.

Tracking relationships between query patterns and credit consumption requires correlation across Snowflake’s Account Usage views:

  • Teams must join QUERY_HISTORY with WAREHOUSE_METERING_HISTORY
  • Identify queries with high bytes_scanned relative to bytes_returned
  • Calculate potential savings from column optimization
  • This analysis is technically possible but requires dedicated data engineering effort to build and maintain monitoring pipelines
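A sketch of that correlation—the even time-based apportionment of hourly credits is a rough approximation (it ignores concurrency and idle time), but it ranks warehouses by SELECT * cost well enough to prioritize fixes:

```sql
-- Approximate credits attributable to SELECT * queries per warehouse
WITH hourly AS (
  SELECT warehouse_name, start_time, end_time, credits_used
  FROM snowflake.account_usage.warehouse_metering_history
  WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
)
SELECT q.warehouse_name,
       SUM(q.total_elapsed_time) / 1000 AS select_star_seconds,
       -- elapsed ms as a fraction of the metered hour, times that hour's credits
       SUM(q.total_elapsed_time / 3600000.0 * h.credits_used) AS approx_credits
FROM snowflake.account_usage.query_history q
JOIN hourly h
  ON q.warehouse_name = h.warehouse_name
 AND q.start_time >= h.start_time
 AND q.start_time <  h.end_time
WHERE q.query_text ILIKE 'SELECT *%'
GROUP BY q.warehouse_name
ORDER BY approx_credits DESC;
```

Even at this level of approximation, the output answers the question finance actually asks: which warehouses are bleeding credits to unoptimized queries.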

Change management presents another visibility challenge. As tables evolve and add columns, existing SELECT * queries automatically pull new columns without code changes—performance degrades gradually, making it difficult to pinpoint when and why queries got slower. Without proactive monitoring, teams discover problems only after users complain about application lag or finance flags unexpected credit usage.

Organizational complexity adds friction. Different teams own different warehouses and queries. Data engineering might optimize ETL pipelines while analytics teams continue using SELECT * in dashboards and marketing runs ad-hoc SELECT * queries for quick analyses. Each team needs education on performance impact, but Snowflake SELECT * optimization rarely gets prioritized over feature delivery.

How Unravel’s FinOps Agent Automates Query Optimization

Unravel’s FinOps Agent addresses enterprise visibility challenges by providing continuous automated analysis of Snowflake query patterns and implementing optimizations based on proven best practices. Built natively on Snowflake System Tables with no agents or external infrastructure required, the FinOps Agent monitors query execution and identifies optimization opportunities like eliminating unnecessary column selection from SELECT * queries.

The agent analyzes actual column usage patterns by examining which columns queries request versus which columns applications actually process. It identifies queries where SELECT * pulls 50 columns but applications use only 5, then generates optimized query recommendations with explicit column lists. You control implementation through three automation levels: recommendations requiring approval, auto-implementation for low-risk optimizations, or full automation with governance controls.

For Snowflake SELECT * optimization specifically, the FinOps Agent tracks which queries scan excessive data relative to actual needs, correlates this with credit consumption, and calculates potential savings from column optimization. Instead of manually reviewing Query History and building custom analysis, teams get automated identification of highest-impact optimization opportunities. The agent implements approved optimizations, monitors performance impact, and continuously refines recommendations based on observed results.

This moves query optimization from reactive firefighting to proactive governance. Rather than discovering performance issues after credits spike or users complain, teams get ahead of problems with automated monitoring and optimization. Organizations using Unravel’s FinOps Agent report 25-35% sustained cost reduction from systematic query optimization, including SELECT * elimination across Snowflake workloads—delivering hundreds of thousands of dollars in credit savings while improving query performance for end users.

 
 
