Performance Woes & Wows #5: IN vs UNION

Querying the Same Thing. Faster.

In a previous post, we’ve seen that equivalent queries can produce distinct execution plans – Yes, how nice and intuitive of you SQL Server. But this happens for a logical reason: The Query Optimizer hates you doesn’t have the means to explore every possible semantic equivalence. Hence, the shape of the execution plan will be intrisically bounded to the way we write our queries.

In this post, we’ll extend this notion to a more real-world like example, which I found very interesting because it seemed kind of counterintuitive to me.

Querying

Let’s assume we want to count every Question, Wiki and TagWikiExerpt post, in the Posts table of the StackOverflow2010 database. In order to achieve that we can create the following index:

CREATE INDEX IX_PostTypeId ON Posts (PostTypeId)

And then execute this query:

SELECT COUNT(1) 
FROM Posts 
WHERE PostTypeId IN (1,3,4)

We can confirm that the execution plan is comprised by a scan to the previously created index and has an estimated cost of 3.27277. The IO statistics show 1915 logical reads and regarding CPU time and duration, 129ms and 123ms, respectively. These time statistics remained very identical on all the performed executions.

the Same Thing.

Now let’s do the same thing, expecting different results. Well, almost the same thing. Let’s just change the previous query a little bit, by replacing the IN operator for this logical equivalent, but uglier and more complex version:

SELECT SUM(x.c) FROM
(
	SELECT COUNT(1) c FROM Posts WHERE PostTypeId = 1
	UNION ALL
	SELECT COUNT(1) c FROM Posts WHERE PostTypeId = 3
	UNION ALL
	SELECT COUNT(1) c FROM Posts WHERE PostTypeId = 4
) x

These changes are reflected in the execution plan, as expected:

I’m so ugly // But that’s okay, ‘cause so are you

Faster.

The really interesting part about this plan is… That it’s better than the first one. Don’t believe me? Check for yourself:

This query only consumed 47ms of CPU time (-63.6%) and ran for a total of 76ms (-38.2%). The difference was not quite significant regarding logical reads, nevertheless a lower value was achieved with 1912. Accordingly, the estimated plan cost was also lower, 2.68443. Sure, some assumptions behind the cost calculation may be outdated and not entirely accurate, but this time, the prediction was entirely correct.

So What?

Rewriting queries using equivalent syntax seems like a valid approach when trying to improve query performance. Nevertheless, I don’t know any set of rules or even guidelines that may help in this decision process. It’s more like an empiric procedure, one day on the threshold of despair you try a thing that unexpectedly works and later in the future when a similar pattern emerges, you’ll know that the time has come. Why it works? Because of reasons. But if it works, it ain’t stupid. Best advice of the year.

Performance Woes & Wows #4: Filtered Indexes

Generally speaking, an index can improve the efficiency of read operations, by providing a data structure that SQL Server can use to seek or scan specific data pages, while skipping non-relevant pages in the process.

Regarding Non-Clustered Indexes, since they typically represent narrower copies of the underlying table, even if a complete index scan is required, the amount of data pages that end up being read can be significantly smaller in comparison to an equivalent Clustered Index scan.

Filtered indexes can improve this process even further, simply by reducing the amount of data comprehended by the index, hence reducing the number of nodes and data pages to be traversed.

In order to illustrate a scenario where the usage of a filtered index can be beneficial, let’s create a store procedure in the StackOverflow2010 database, that will retrieve all questions (PostTypeId=1) with a creation date and a score greater than or equal to the inputted values:

CREATE OR ALTER PROCEDURE GetQuestionsByDateAndMinimumScore
(@Score int,@CreationDate datetime) AS
BEGIN
SELECT 
	p.Id,p.creationDate,u.DisplayName, 
        p.title,p.Score,p.ViewCount,p.AnswerCount
FROM 
	Posts p
INNER JOIN 
	Users u ON u.id = p.OwnerUserId
WHERE 
	p.CreationDate >= @CreationDate AND
	Score >= @Score AND
	PostTypeId = 1 
ORDER BY 
	CreationDate DESC
END


Next, we’ll create a non-filtered index on the table Posts, in order to improve the performance of our query.

CREATE INDEX IX_CreationDate_PostTypeId_Score 
ON Posts (CreationDate,Score,PostTypeId)

After its creation, it’s time to execute the store procedure and analyze our baseline values.

EXEC GetQuestionsByMinimumScoreAndDate 
              500, '2009-07-31 00:00:00.000'

In the execution plan, we are able to see that SQL Server used our index to seek the Posts table and ended up reading 2 809 841 rows, which translated into 12 000 logical reads.

Now, the fun part. Since we care only about question posts and they only represent about one third of the table, we could create a filtered index instead.

CREATE INDEX IX_CreationDate_PostTypeId_Score_Filtered
ON Posts (CreationDate,Score,PostTypeId)
WHERE (PostTypeId = 1)

Finally, after executing the procedure with the same parameters, we should be able to see some improvements.

And indeed we do. This time SQL Server ended up reading 875 361 rows and the IO statistics show 5676 logical reads on the Posts table. Noice.

I’ll Only Create Filtered Indexes From Now On

I mean, you could. You probably shouldn’t though.

In our example we were interested only in querying a specific PostTypeId, but if we needed to address another type, e.g. Answers, our index would be pretty useless and even considering a filtered index would make no sense, because in that particular case, both types represent almost the entire table.

The takeaway is that filtered indexes can be great for performance, but we must always keep in mind the selectivity of what we want to filter and the gains versus costs, because although filtered indexes are lighter by concept, they aren’t exactly free.

Not All Seeks Are Created Equal #1

Seeks seeks seeks… The number of the beast. Seeks are cool though. I like seeks. Sometimes, at least. Perhaps some seeks are cooler than others. I just said seeks six times, now we’re ready to go.

Let’s begin by creating a temporary table that holds values from 1 to 10 000.

CREATE TABLE #T(id integer PRIMARY KEY CLUSTERED);

WITH n(n) AS
(SELECT 1 UNION ALL SELECT n+1 FROM n WHERE n < 10000)

INSERT INTO #T(id) 
SELECT n FROM n ORDER BY n
OPTION (MAXRECURSION 10000);

Now, let’s assume that we need to retrieve a specific range of values – e.g. between 500 and 563. There are a couple of ways to do that, that’s not exactly rocket science, but let’s consider the following two approaches:

Using the BETWEEN operator:

SELECT 
	Id 
FROM 
	#X
WHERE 
	Id BETWEEN 500 AND 563

Using the IN operator:

SELECT 
	Id 
FROM 
	#X
WHERE 
	Id in
	(
		500,501,502,503,504,505,506,507,508,509,510,511,512,
		513,514,515,516,517,518,519,520,521,522,523,524,525,
		526,527,528,529,530,531,532,533,534,535,536,537,538,
		539,540,541,542,543,544,545,546,547,548,549,550,551,
		552,553,554,555,556,557,558,559,560,561,562,563
	)

Uh, truly fascinating.

But, behind these boring and functional equivalent queries, there is a dirty seekret – Wow, I’m on fire today. Both result in a Clustered Index Seek, but one is sus. Discuss!

That’s quite a difference. The second query results in 64 times more logical reads. But why?

One Seek To Rule Them All

In the execution plan of the query that uses the BETWEEN operator, it’s possible to notice that there’s only one seek operation, which range starts at 500 and then traverses the leaf level of the index, until it finds a row beyond the range limit – 563 in this case.

The Meeseeks Box

The seek #2, in the execution plan of the query that uses the IN operator is rather different:

The list continues down to a total of 64 individual seeks.

The underlying reason for the 128 logical reads is related to the size of our table and consequently, the size of the index. Since it is large enough to need a separate root page, each one of the 64 seeks will imply 2 logical reads.

But I digress, the bottom line here is that Seek #2 is the impostor. It is not actually a seek, but a whole bunch of them.