Performance Woes & Wows #4: Filtered Indexes

Generally speaking, an index can improve the efficiency of read operations, by providing a data structure that SQL Server can use to seek or scan specific data pages, while skipping non-relevant pages in the process.

Regarding Non-Clustered Indexes, since they typically represent narrower copies of the underlying table, even if a complete index scan is required, the amount of data pages that end up being read can be significantly smaller in comparison to an equivalent Clustered Index scan.

Filtered indexes can improve this process even further, simply by reducing the amount of data comprehended by the index, hence reducing the number of nodes and data pages to be traversed.

In order to illustrate a scenario where the usage of a filtered index can be beneficial, let’s create a store procedure in the StackOverflow2010 database, that will retrieve all questions (PostTypeId=1) with a creation date and a score greater than or equal to the inputted values:

CREATE OR ALTER PROCEDURE GetQuestionsByDateAndMinimumScore
(@Score int,@CreationDate datetime) AS
BEGIN
SELECT 
	p.Id,p.creationDate,u.DisplayName, 
        p.title,p.Score,p.ViewCount,p.AnswerCount
FROM 
	Posts p
INNER JOIN 
	Users u ON u.id = p.OwnerUserId
WHERE 
	p.CreationDate >= @CreationDate AND
	Score >= @Score AND
	PostTypeId = 1 
ORDER BY 
	CreationDate DESC
END


Next, we’ll create a non-filtered index on the table Posts, in order to improve the performance of our query.

CREATE INDEX IX_CreationDate_PostTypeId_Score 
ON Posts (CreationDate,Score,PostTypeId)

After its creation, it’s time to execute the store procedure and analyze our baseline values.

EXEC GetQuestionsByMinimumScoreAndDate 
              500, '2009-07-31 00:00:00.000'

In the execution plan, we are able to see that SQL Server used our index to seek the Posts table and ended up reading 2 809 841 rows, which translated into 12 000 logical reads.

Now, the fun part. Since we care only about question posts and they only represent about one third of the table, we could create a filtered index instead.

CREATE INDEX IX_CreationDate_PostTypeId_Score_Filtered
ON Posts (CreationDate,Score,PostTypeId)
WHERE (PostTypeId = 1)

Finally, after executing the procedure with the same parameters, we should be able to see some improvements.

And indeed we do. This time SQL Server ended up reading 875 361 rows and the IO statistics show 5676 logical reads on the Posts table. Noice.

I’ll Only Create Filtered Indexes From Now On

I mean, you could. You probably shouldn’t though.

In our example we were interested only in querying a specific PostTypeId, but if we needed to address another type, e.g. Answers, our index would be pretty useless and even considering a filtered index would make no sense, because in that particular case, both types represent almost the entire table.

The takeaway is that filtered indexes can be great for performance, but we must always keep in mind the selectivity of what we want to filter and the gains versus costs, because although filtered indexes are lighter by concept, they aren’t exactly free.

Not All Seeks Are Created Equal #2

As we saw in the previous post, Index Seek operations can look for single rows (Singleton Seek) or specific subsets of rows (Range Seek).

Singleton Seek

A Singleton Seek occurs when just one single row, at most, will satisfy the search requirements that are related to the Seek Key.

The IO statistics for the same execution presented above, will show a scan count of zero, also confirming the Singleton Seek occurrence.

The following image depicts a similar seek operation, as exemplified by the red arrows:

In this case, the B-tree is traversed from the root to the leaf node and the specific data page is fetched.

Range Seek

Conversely, a Range Seek will occur when SQL Server assumes that more than one row can be retrieved by the Seek Key.

The following image depicts an equivalent seek operation, as exemplified by the red arrows:

Similarly to a Singleton Seek, the B-tree is also traversed to a specific leaf node, but in this case, this leaf node will be either the first or the last key within range (depending on the scan direction) and then the remaining leaf nodes within range limit will be scanned.

And that’s all he wrote.