AP 02 - Beware of count distinct

Severity

Warning Notice

Type

Performance

Problem

The Count distinct operation is computationally expensive and decreases database performance. BigQuery defaults to approximating the distinct count, using a HyperLogLog algorithm. But BigQuery also supports exact distinct count in a separate function.

One of two algorithms is typically used for distinct counts.

  • Sorting: Records are sorted, duplicate records then are identified and then removed. The number of target expression values in the remaining set of tuples would then represent a distinct count of the target expression values.
  • Hashing: Records are hashed and then duplicates are removed within each hash partition. Implementing hashing may be less computationally intensive than the sorting approach and may scale better with a sufficiently large set of records. However, it is still computationally expensive

Solution

There are two possible solutions to the problem.

If you don’t require 100% accuracy you can use approximate count distinct aka HyperLogLog. Most modern databases support this functionality.

Here is an example of approximate count distinct using SQL Server

SELECT approx_count_distinct(O_OrderKey) AS approx_distinct_order_key
  FROM Orders

Bitmap-based distinct counts can be performed incrementally, combining them with Materialized Views for query rewrite in databases that support such functionality. Snowflake and Oracle are two examples of databases that support bitmap-based distinct counts.

If you do not have an approximate count distinct (HyperLogLog) or bitmap based count distinct as part of your database, you could precompute the distinct counts as part of ETL.

Legitimate use of the anti pattern

For small data volumes you will not see a performance problem when using count distinct.

As the number of distinct values increases, the elapsed time and memory usage of standard count distinct increases significantly (non-linear growth).

The elapsed time and memory usage of approximate count distinct on the other hand remains consistent regardless of the number of distinct values in the data set (linear growth).