Exploring Join Paths; The Key to Building Multiple-Table Joins
Exploring Join Paths; The Key to Building Multiple-Table Joins
As we see in the SQL 100 and SQL 250 courses, and as we further explore in this article, joins are used in queries that need to access data stored in more than one table (served up either by the table itself, views, or functions). This article introduces the concept of the “join path” and then uses the AdventureWorks database to explore join paths as a way to identify the tables to include in a query. This article also makes the case that, when there is more than one join path between two tables, each one can produce a different query result, so it’s important to choose carefully to maintain the intent of the query.
Tables are not the only things that can be joined in a query. Views and functions can also be joined. This article keeps the focus on tables for simplicity, but every time we join a table in this article, we could be joining a view or function, too.
1. What’s a Join Path?
First, it’s important to distinguish between joins and relationships. Relationships between tables are defined in the structure of the tables themselves using primary and foreign keys, whereas joins are operations in a query that bring together related tables; joins are usually performed among related tables unless there is a good reason to do otherwise (here’s an example of a query that does otherwise: How to Use Values from Previous or Next Rows in a SQL Server Query and we’ll also get a chance to see a simpler example in this article).
So, what’s a join path? A join path brings two tables together in a query by using one or more joins. There are two types of join paths—direct and indirect. If the two tables are directly related to each other (one table has a foreign key that references the primary key in the other table), then a direct join path is available that consists of a single join between the two tables. An indirect join path is one that consists of two or more joins through other tables. As you’ll see in this article, an indirect join path must be used when there is no direct join path, or when a direct join path exists but produces results that are not consistent with the intent of the query.
Let’s work through a few examples of join paths to get a better handle on all this.
2. Direct Join Paths
A direct join path is one that goes from one table to another without any additional tables needed.
Example 2.1. A Direct Join Path
Let’s say that a query needs to list the names of all salaried employees from the AdventureWorks database. Here is a sample of the result of the query (there is also a SalariedFlag column that is used to filter the query so that only salaried employees are selected; the SalariedFlag column is not listed in the result, but it is still used in the query):
What tables are needed to produce this result? Well, employee’s names are in the Contact table and their salaried flags are in the Employee table, so we’ll need those two tables. A direct join path is available in this case, because the two tables are directly related to each other.
The direct join path can be used in this query because the results it produces are consistent with the intent of the query (this will become clearer in the next example when we have a choice between more than one join path).
Example 2.2. Multiple Direct Join Paths
Can there be more than one direct join path between two tables? Sure. Let’s look at a query that lists the sales order number, the order date, and the city of the bill-to address. Here’s a sample of the query result:
What tables are needed to produce this result? Well, the sales order numbers and order dates will come from the SalesOrderHeader table and the bill-to addresses will come from the Address table, so we’ll need those tables. Are the tables directly related to each other like in the previous example? Not only are they related, but they are related twice. The SalesOrderHeader table has two foreign keys in it, each of which references the primary key in the Address table:
With two direct relationships between these tables, there are two different direct join paths available. Which one should we use? Does it matter?
It does matter. The two join paths will produce different results. The direct join path that uses the BillToAddressID foreign key will show the cities of the bill-to addresses, and the one that uses the ShipToAddressID foreign key will show the cities of the ship-to addresses. Because it is the cities of bill-to addresses that our query intends to show, we must use the direct join path that uses the BillToAddressID foreign key:
Example 2.3. Using Two Join Paths in the Same Query
There is nothing that prevents us from showing the cities of both the bill-to and ship-to addresses in this case. We just need to join the Address table a second time using the other direct join path. What’s important to keep in mind here is, again, that different join paths produce different results, so if we include two join paths between the same tables, each one will give us something different.
3. Indirect Join Paths
Direct join paths are relatively simple, but they’re not always available or they don’t always produce results that are consistent with the intent of the query. When we can’t use a direct join path, we must find an indirect one. An indirect join path is one that joins two tables together through other tables.
Example 3.1. Indirect Join Paths
Let’s say that we want to write a query that shows, for each product, all the sales people who have sold that product. We want the query to list the product name, the sales person’s ID, and the salesperson’s YTD sales. Here is a sample of the query result:
What tables are needed to produce this result? Well, the product names are in the Product table and the salespersons’ IDs and YTD sales are in the SalesPerson table, so we’ll need those two tables. Since the two tables are not directly related to each other, there is no direct join path available in this case. We’ll have to find a join path through other tables.
When we examine the relationships in the AdventureWorks database, we can see that there are two relationship paths that can be used to join the Product and SalesPerson tables. Notice in the diagram below that the green relationship path goes through the SalesOrderHeader table while the red one goes through the PurchaseOrderHeader table:
A join path created along either the green or red relationship lines would be an indirect join path because it would join the SalesPerson and Product tables through other tables. Notice that, even though we don’t use data from any of those other tables in our query, we still need to include them in the query to select the relevant rows from the two tables from which we do want data.
Again, which path do we choose? As we’ve seen, different join paths can produce very different query results, so we have to choose the path that is consistent with the intent of the query. The red path would show all sales people who have purchase orders—not necessarily sales—for each product. The green path would show all sales people who have sales for each product. The green path is the one that is consistent with the intent of the query.
We’ll use the green path, but with one simplification; there is a table in it that we really don’t need. To see this clearly, let’s have a closer look at the green path. We’ll start with the SalesPerson table (we could start from either end since these would all be inner joins). The SalesPerson table would be joined to the SalesOrderHeader table on SalesPersonID (that is the foreign key in the SalesOrderHeader table that references the primary key in the SalesPerson table), resulting in a virtual table in which every salesperson is matched up with every one of their orders. Next, that virtual table would be joined to the SalesOrderDetail table on SalesOrderID, resulting in another virtual table that matches up every salesperson with every product on each of their orders.
All we’d need now would be the product name that is two tables over. Do we really need to go through the SpecialOfferProduct table? If we do, we would be using the ProductID in the SalesOrderDetail table to look up a row in the SpecialOfferProduct table so that we could get a ProductID (the same one we started with) from the SpecialOfferProduct table to allow us to join the Product table. But why look up something we already have? We’ll just bypass the SpecialOfferProduct table and join the SalesOrderDetail table directly to the Product table using each table’s ProductID (this is another example of when we can use a join that is not along relationship lines). Here’s our query:
Example 3.2. More Indirect Join Paths
This next example further highlights how different join paths can produce very different results. In the partial diagram below of the same AdventureWorks database, we see three paths—shown in red, green, and blue—between the Employee and Address tables.
A query that uses the red join path will show employees’ home addresses. A query that uses the green join path will show the ship-to address on orders in which an employee acted as the salesperson (or bill-to address depending on which foreign key in the SalesOrderHeader table is used, a minor but also significant fork in the path). And a query that uses the blue join path will show the ship-to (or bill-to) address on orders in which employees acted as contacts.
4. Direct Join Paths Are Not Always the Best Choice Over Indirect Join Paths
What if there are direct and indirect join paths available between two tables we want to join? Can we say that we should always use the direct join path over an indirect one? After all, direct paths are simpler and they don’t weigh the query down so much with other tables. But remember, join paths should be chosen based on how consistent their results are with the intent of the query; it will do you no good to have a simple join if it gives you apples when you want oranges. Have a look at this:
Example 4.1. Direct Join Paths Are Not Always the Best Choice
Let’s say we want to write a query that shows all customers who have orders with total due values over $1,000. The query needs to list the customer’s name, the sales order number, and the total due amount. Here is a sample of the query result:
What are the tables needed to produce this result? Well, the sales order numbers and total due amounts are in the SalesOrderHeader table, and the customer names are in the Contact table, so we’ll need those two tables. And, like in the previous example, the SalesOrderHeader and Contact tables are directly related to each other (the ContactID foreign key in the SalesOrderHeader table references the ContactID primary key in the Contact table):
Looks like we’re on easy street again! All we have to do is join the two tables on the ContactID and we’re done, right? Wrong.
Let’s recall the intent of the query. We want a query that lists customer names, not contact names. If we join the two tables on ContactID, the result will show the names of the contacts on the order and not the names of the customers. The direct join path is not consistent with the intent of the query.
Now look carefully at the SalesOrderHeader table in the figure above. There is a CustomerID column just above the ContactID column. Is there another path between the two tables that would allow us to join on that CustomerID, instead? There is. In the figure below, the blue path goes through the Individual table. The Individual table contains both a CustomerID and a ContactID, effectively associating each CustomerID with a corresponding ContactID. We can look up the customer ID in the individual table and use the corresponding contact ID to find their names in the Contact table.
For the same reason as in example 3.1, we will bypass the Customer table and join the SalesOrderHeader table directly to the Individual table on the CustomerID (if we don’t bypass it, the CustomerID in the SalesOrderHeader table would be used to look up a Customer in the Customer table so that the query could get a CustomerID to find the related row in the Individual table; but, since we already start out with a CustomerID, that lookup (join) is unnecessary). Here’s our query:
5. In Summary…
Join paths provide a useful tool to help us identify the tables we need to include in a query. When more than one join path exists between tables—whether direct, indirect, or both—they can each produce very different query results. The join path you choose must be consistent with the intent of your query if you don’t want unintended results!
You May Also Like
In this video, you will gain an understanding of Agile and Scrum Master Certification terminologies and concepts to help you make better decisions in your Project Management capabilities. Whether you’re a developer looking to obtain an Agile or Scrum Master Certification, or you’re a Project Manager/Product Owner who is attempting to get your product or … Continue reading Agile Methodology in Project Management
How does an investigator hunt down and identify unknown malware? In this recording of our IT Security training webinar on April 21, 2015, Security expert Mike Danseglio (CISSP / CEH) performed several malware investigations on infected computers and identify symptoms, find root cause, and follow the leads to determine what’s happening. He demonstrated his preferred … Continue reading Detailed Forensic Investigation of Malware Infections – April 21, 2015