AP 05 - Beware of SELECT *
Severity
CautionType
Problem
Whenever a developer uses SELECT *, they commonly pick more columns than they need. Without any justification, the database must free up more system resources. The use of this anti-pattern could cause performance problems, reduce concurrency, and increase costs.
The SELECT * anti pattern has the following impact
- When columns are renamed or dropped, downstream processes may break in your data engineering and BI work loads
- More database resources need to be used to scan data at the table level. The impact is significantly worse for databases with columnar storage.
- More data travels across the network
- Running SELECT * is expensive. Some database providers, like Google BigQuery, charge customers according to the volume of data accessed. It can literally cost you hundreds of dollars to run a query across a TB-sized table with dozens of columns when we only need four or five. While the extra cost for BigQuery is obvious, you will pay comparable compute costs with other database vendors and have fewer resources available for other users.
Solution
- Fix the anti pattern for small tables that are part of an ETL or BI application.
- Apply the LIMIT clause to only retrieve a subset of the data for data exploration use cases. The implementation of LIMIT differs by database vendor and you are advised to consult the explain plan.
SQL without LIMIT takes minutes on a 500 GB table
SELECT *
FROM web_sales a
JOIN item b
ON(a.ws_item_sk=b.i_item_sk)
SQL with LIMIT takes milliseconds for the same query
SELECT *
FROM web_sales a
JOIN item b
ON(a.ws_item_sk=b.i_item_sk)
LIMIT 100
Example SQL was created and tested on Snowflake
- Even if we know the names of the columns we need to retrieve it is much quicker to write SELECT * instead of spelling out each individual column. An SQL IDE with autocomplete feature can minimise the amount of work.
Legitimate use of the anti pattern
If this anti pattern appears during data exploration or data discovery, you can ignore it for small tables. The extra expense is probably less than the time you must spend on data engineering if you write out the full column names
For data exploration and data discovery use cases you may need to browse all of the columns in a table to get a better understanding of the data.