All Articles

Joins

Intro

Rarely will your data be located in a single table. There is actually good reason for that — to avoid data redundancy.

Data Redundancy

Suppose I have a table Customers structured as follows.

customer_id first_name last_name
1 John Smith

Now, suppose that for each person we want to capture a dynamic data point, transaction amount.

customer_id first_name last_name txn_amnt
1 John Smith 10

So far so good.

However, as soon as someone makes two purchases, we begin to see data redundancy.

customer_id first_name last_name txn_amnt
1 John Smith 10
1 John Smith 30

We have captured the entries John and Smith twice.

We can instead drop the txn_amnt column from the first table and create a second table.Transactions, say. This table will have a customer_id column and a txn_amnt column, as shown below.

customer_id txn_amnt
1 10
1 30

The problem now is that our data lives in two separate tables.

Joins

Ok, now our data lives in two tables. The common denominator between the above two tables is the customer_id column. Now suppose that we would like to return records with all three columns first_name, last_name, txn_amnt.

The existence problem

If you read about joins you will see a slew of different types. Despite the many different types of joins, they all follow from the following simple fact.

The process is the following. A field which both tables share in common is chosen. In the above case, customer_id.

The key point is this: you cannot know a priori (unless you explicitly check) whether or not the value of customer_id in a particular record in one table will exist in a record in the other table.

Every join is a particular way of addressing this fact.

Inner Join

This is the case where we wish only to include those records for which there exists a match between both tables in the chosen mutually shared field.

Customers

customer_id first_name last_name
1 John Smith

Transactions

customer_id txn_amnt
1 10
1 30

The customer_id value assumes the value 1 in both columns. That is, there exists a match for customer_id=1.These records would therefore be returned in an INNER JOIN.

Example

SELECT sys.Customers.customer_id, sys.Transactions.txn_amnt
FROM sys.Customers
INNER JOIN sys.Transactions
ON sys.Customers.customer_id=sys.Transactions.Id;
customer_id txn_amnt
1 10
1 30

Note the importance of specifying the parent table for each column. This is to avoid namespace ambiguity in the case that both tables have a column with the same name.

AnINNER JOIN can be visualized as follows.

LEFT JOIN

Consider the following additional entry in the Customers table.

customer_id first_name last_name
1 John Smith
2 Jane Doe

It is conceivable that Jane Doe may not yet have made a purchase, so that customer_id=2 has no match in the Transactions table. If we were to perform an INNER JOIN then, the Jane Doe records would not appear.

You can override this by using a LEFT JOIN.

A LEFT JOIN will behave like an INNER JOIN, the difference being that all records in the table specified on the left of the LEFT JOIN statement will be included irrespective of whether or not there exists a match in the ON field in the second table. The catch is that the fields of any such unmatched records in the right table will render as NULL in the result.

Example

SELECT sys.Customers.customer_id, sys.Transactions.txn_amnt
FROM sys.Customers
LEFT JOIN sys.Transactions
ON sys.Customers.ID=sys.Transactions.customer_id;
customer_id txn_amnt
1 10
1 30
2 (null)

A LEFT JOIN can be visualized as follows.

RIGHT JOIN

You can think of RIGHT JOIN as the reflection of LEFT JOIN. You can actually use symmetry to always replace one with the other. That is,

T1 LEFT JOIN T2 ~ T2 RIGHT JOIN T1

The anti-join

Suppose the records of interest are precisely those for which there is no match in the ON field. A join that returns such a result is often referred to as an anti-join.

Consider again the Customers and Transactions tables.

Customers

customer_id first_name last_name
1 John Smith
2 Jane Doe

Transactions

customer_id txn_amnt
1 10
1 30

A left anti-join in this context would mean we are interested in those customers who have not made a purchase. For example, Jane Doe.

There are several ways to construct such a join.

V1

SELECT sys.Customers.customer_id
FROM sys.Customers
WHERE sys.Customers.customer_id
NOT IN (SELECT customer_id FROM Transactions);

Here we are achieving an anti-join without performing a join at all!

V2

SELECT sys.Customers.customer_id
FROM sys.Customers
LEFT JOIN sys.Transactions.customer_id
ON sys.Customers.customer_id=sys.Transactions.customer_id
WHERE sys.Trasactions.customer_id IS NULL

The idea is that the customer_id field will be NULL after the LEFT JOIN precisely for those records for which there is no match.

You can use symmetry to construct the right anti-join case.

Published 26 Jun 2018