If you read my last post, you’ll find this to the be exact opposite. Instead of determining the best way to find rows that exist in one dataset but not another, we’ll be looking for rows that exist in 2 different datasets.
The obvious answer is a join statement, but does that always perform the best? Certainly, if you require other fields from the secondary table then a join will always be best. For this test we are looking for rows that appear in a table called orderheader that are also in a temporary table called #ords. The field that we are comparing is an INT called ord_hdrnumber and it is the Clustered Primary Key of each table.
We’ll compare the pros, cons, and performance of 3 popular methods to complete the task.
SELECT a.ord_hdrnumber FROM orderheader a WHERE a.ord_hdrnumber IN (SELECT b.ord_hdrnumber FROM #ords b);
SELECT a.ord_hdrnumber FROM orderheader a INNER JOIN #ords b ON a.ord_hdrnumber = b.ord_hdrnumber;
SELECT a.ord_hdrnumber FROM orderheader a WHERE EXISTS (SELECT 1 FROM #ords b WHERE b.ord_hdrnumber = a.ord_hdrnumber);
First, let’s look at some functionality differences. There are some pretty significant limitations to the IN version.
|Number of columns output from the primary table?||Number of columns to be compared?|
The next step is to compare performance. I played with this in multiple environments that had 500, 500K, and 2.4M rows in the orderheader table. In each I tried 3 sizes of #ords — 10, 1k, and 900k rows.
During the test of the small primary dataset all 3 secondary datasets performed identically. During the test of the medium and large primary datasets a pattern emerged.
So where does that leave us? There is very little difference in the options until the data sets get larger. With the functionality that INNER JOIN offers us compared to the others it should probably be our primary method when comparing datasets. We’ll save IN for comparing to constants or small variable lists and EXISTS for when subqueries are required.