Simplify your SQL Queries with Common Table Expressions (CTEs)
June 01, 2019
Writing SQL in a professional setting is never as simple as the tutorials make it out to be. You’ll reference several tables, each of which may require custom business logic in order to make them usable. To keep your code readable to yourself and others, consider using Common Table Expressions, or CTEs, to avoid packing too much logic into subqueries.
A CTE is a query whose result set you can reference in a later section of your query. It’s sort of like a view that’s local to the query that it lives in. Instead of writing a large subquery and having your logic nested within an outer query, you can write the subquery as a common table expression that a further section of your query can reference. The result is that the overall query is more modular and significantly easier to edit and understand.
The syntax of a CTE is simple:
WITH my_cte_name AS (SELECT * FROM EXAMPLE)
After you write this CTE, you can reference its result set further down in the query by selecting from result set my_cte_name
as if it were a table in the database. Here’s a more concrete example showing how a complex query can be easier to read, write, and maintain using CTEs.
Example with Real Data
Here I’ll use the Stack Overflow dataset from Google’s public BigQuery data. My goal is to find Stack Overflow users who are experienced in machine learning in Python, which I’ll define as the subset of users who have received upvotes on answers they’ve submitted to questions tagged with both python
and machine-learning
. I’ll want to know how many of these comments they’ve submitted as a proxy for how skilled or involved in the ML community they are, and I’ll also want to know some general info about who they are (name, location, etc.).
There are three tables needed to answer this. stackoverflow_posts
contains every post, its unique id, and tags identifying the subject matter. comments
has each comment id, the commenter’s user id, the number of upvotes it received, and the id of the Stack Overflow question it belogs to. Last, users
has user metadata saying who the user is.
Solving This with CTEs
This question can be broken into three questions, which will help to organize it into common table expressions.
1: which posts are asking questions about machine learning in Python?
2: which comments belong to these ML posts, and which of these comments received at least one upvote?
3: who are the users who submitted upvoted comments to these machine learning posts, and how many of these successful comments have they submitted?
I’ll answer the first question with a CTE. The results from the first CTE will then allow me to answer the second question by querying the comments
table in a second CTE. Finally, I can inner join the second CTE onto the users
table to answer my initial question of which users are most successful in answering Stack Overflow questions about Machine Learning in Python.
#StandardSQL
-- find posts about machine learning in python
WITH python_ml_posts AS (
SELECT
id
FROM
`bigquery-public-data.stackoverflow.stackoverflow_posts`
WHERE
tags LIKE '%machine-learning%'
AND
tags LIKE '%python%'
),
-- find comments on these ML posts with >0 upvotes and the commenter's user id
ml_post_comments AS (
SELECT
user_id,
count(*) AS comment_count
FROM
`bigquery-public-data.stackoverflow.comments` comments
INNER JOIN
python_ml_posts
ON
python_ml_posts.id = comments.post_id
WHERE
score > 0
GROUP BY
user_id
)
-- connect users with their machine-learning-comment-count
SELECT
users.display_name,
users.location,
users.reputation,
ml_post_comments.comment_count
FROM
`bigquery-public-data.stackoverflow.users` users
INNER JOIN
ml_post_comments
ON
ml_post_comments.user_id = users.id
ORDER BY
comment_count desc
Row | display_name | location | reputation | comment_count |
---|---|---|---|---|
1 | Fred Foo | 281432 | 30 | |
2 | Andreas Mueller | New York, United States | 14857 | 21 |
3 | lejlot | London, United Kingdom | 47420 | 20 |
4 | EdChum | Berkshire, United Kingdom | 178578 | 17 |
5 | ogrisel | Paris, France | 29009 | 13 |
This is a fairly complicated question, but as you can see from the above query, it’s easy to read and write when you tackle it with CTEs. The alternative would be to query the users table, then join on the comments table with a subquery for finding each user’s comments, which in turn would need a second subquery for finding which of these comments were related to machine learning in Python. Gross! Don’t do that. Use common table expressions instead. Your future self and whoever you share a code base with will thank you.