AP 06 - Beware of SELECT DISTINCT

Severity

Warning Notice

Type

Performance

Correctness

Problem

Incorrect use of the DISTINCT operator by SQL developers and analysts is common.

In the vast majority of query scenarios where it is used, using DISTINCT is problematic.

SELECT DISTINCT is used incorrectly to work around the following types of problems:

  • Data quality issues in the underlying data
  • Data quality issues because of a badly designed data model
  • Mistake in the SQL statement, e.g. Join or Filtering
  • Incorrect usage of SQL features

Solution

When you have duplicate values, resist the urge to use DISTINCT.

When you find duplicates in your data, instead of using DISTINCT, you should first make sure you understand the data model and the data itself.

  • Make sure to check that your table joins are correct. Do you need to join on compound keys instead of single key columns?
  • Make sure to apply the correct filters.
  • Make sure that you understand the data model
  • Make sure to check for data quality issues

Legitimate use of the anti pattern

There are times when it makes sense to use DISTINCT. As a general rule, the pattern is correct if your goal is to find the unique values in one or more columns. But if you use DISTINCT to get rid of a large number of duplicates that you didn’t expect, you’ve hit an anti pattern.

Further reading

SQL antipatterns: SELECT DISTINCT