Seeks seeks seeks… The number of the beast. Seeks are cool though. I like seeks. Sometimes, at least. Perhaps some seeks are cooler than others. I just said seeks six times, now we’re ready to go.
Let’s begin by creating a temporary table that holds values from 1 to 10 000.
CREATE TABLE #T(id integer PRIMARY KEY CLUSTERED);
WITH n(n) AS
(SELECT 1 UNION ALL SELECT n+1 FROM n WHERE n < 10000)
INSERT INTO #T(id)
SELECT n FROM n ORDER BY n
OPTION (MAXRECURSION 10000);
Now, let’s assume that we need to retrieve a specific range of values – e.g. between 500 and 563. There are a couple of ways to do that – it’s not exactly rocket science – but let’s consider the following two approaches:
Using the BETWEEN operator:
Id BETWEEN 500 AND 563
Using the IN operator, listing every value from 500 to 563:
Id IN (500, 501, 502, ..., 563)
Uh, truly fascinating.
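For reference, the two queries can be written as follows, with STATISTICS IO turned on so we can compare the logical reads. The IN list spells out every single value from 500 to 563; it is abbreviated here for sanity.

SET STATISTICS IO ON;

-- Approach 1: range predicate
SELECT id FROM #T WHERE id BETWEEN 500 AND 563;

-- Approach 2: explicit list of all 64 values (abbreviated)
SELECT id FROM #T WHERE id IN (500, 501, 502, /* ... */ 562, 563);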
But, behind these boring, functionally equivalent queries, there is a dirty seekret – wow, I’m on fire today. Both result in a Clustered Index Seek, but one of them is sus. Discuss!
That’s quite a difference. The second query results in 64 times more logical reads. But why?
One Seek To Rule Them All
In the execution plan of the query that uses the BETWEEN operator, it’s possible to notice that there’s only one seek operation, whose range starts at 500 and then traverses the leaf level of the index until it finds a row beyond the range limit – 563 in this case.
The Meeseeks Box
Seek #2, in the execution plan of the query that uses the IN operator, is rather different:
The list continues down to a total of 64 individual seeks.
The underlying reason for the 128 logical reads is related to the size of our table and, consequently, the size of the index. Since the index is large enough to need a separate root page, each one of the 64 seeks implies 2 logical reads: one for the root page and one for the leaf page.
But I digress, the bottom line here is that Seek #2 is the impostor. It is not actually a seek, but a whole bunch of them.
Sargability is a funny word that’s used to characterize the ability of the engine to take advantage of an index in order to speed up the execution of a query. Legend says that it is derived from a contraction of Search ARGument ABLE.
Usually, when someone writes a query, the goal is to retrieve the desired information as fast as possible. What’s interesting is that “as fast as possible” can refer either to “execution time” or to “time I will spend writing this query”, and we all know which one it is, most of the time. As a rule of thumb, queries should be written in the simplest way possible, the underlying reason being that if a query is easy for you to read, it will also be easier for the Optimizer to read and process.
But that’s not always true. For example, we use functions because they save us effort and improve query readability, but typically they don’t make the Optimizer’s life easier – au contraire, because they tend to make predicates non-sargable. On other occasions, bighearted developers decide to allow the customer to search on every damn existing column, using LIKE with leading and trailing % wildcards, and that’s also a capital sin of sargability, as we’re going to see further on.
So, let’s talk a bit more about bad ideas. Not those that start with “Hold my beer and watch this”, but instead “It runs fast on my machine, let’s take it to Production”.
Non-sargable predicates have a negative impact on query performance, e.g. by causing poor cardinality estimates or preventing index seeks, which consequently leads to inappropriate plan choices and increased CPU consumption.
The following list includes some of the most common patterns that kill predicate sargability. We’ll very briefly go over them and then focus on one specific case.
ColumnX = Function(ColumnX): As already pointed out, functions can be naughty. One might be tempted to use functions because they encapsulate logic in a very practical way for us humans, but this often implies a performance loss. The query presented below, where we extract the year from a date column and compare it against a specific value, represents one of those cases. SQL Server won’t be able to use any index efficiently, even if we have an index on that column. We’ll shortly discuss why this happens.
SELECT COUNT(1) FROM Matches WHERE YEAR(Date) = 1999
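For comparison, the same filter can be made sargable by rewriting the YEAR() call as a half-open date range, so SQL Server can seek on an index on the Date column instead of evaluating the function for every row:

-- Sargable equivalent of YEAR(Date) = 1999
SELECT COUNT(1)
FROM Matches
WHERE [Date] >= '19990101'
  AND [Date] <  '20000101';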
ColumnX = ColumnY + ColumnZ: A similar behaviour is observed when we filter on the result of an arithmetic operation between columns. It’s important to stress that a predicate containing an arithmetic operation between a column and a constant literal value would also be non-sargable.
WHERE FullTimeGoals = HalfTimeHomeGoals + HalfTimeAwayGoals
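When the arithmetic involves a column and a constant, the fix can be as simple as moving the arithmetic to the constant side of the predicate. When an expression over two columns is compared against a constant, one possible workaround – a sketch, with illustrative column and index names – is to materialize the expression as a persisted computed column and index it:

-- Non-sargable: WHERE FullTimeGoals - 1 = 2
-- Sargable:     WHERE FullTimeGoals = 3

-- Illustrative computed column for the half-time total
ALTER TABLE Matches
    ADD HalfTimeTotalGoals AS (HalfTimeHomeGoals + HalfTimeAwayGoals) PERSISTED;
CREATE INDEX IX_Matches_HalfTimeTotalGoals ON Matches(HalfTimeTotalGoals);

-- This predicate can now use a seek on the new index:
SELECT COUNT(1) FROM Matches WHERE HalfTimeTotalGoals = 3;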
@Parameter1 IS NULL OR ColumnX = @Parameter1: The heart of that single query where search dreams come true and someone got really proud of it. It’s possible to search for everything. Slowly. At runtime, SQL Server won’t be able to figure this out, hence the non-sargability. There are a couple of ways to get around this, one of them being to add OPTION (RECOMPILE). This hint enables the Parameter Embedding Optimization, which gives the Parser the ability to replace query parameters with literal constant values during query parsing, enabling further complex simplifications, like removing contradictions. Nevertheless, as this hint forces the recompilation of the statement on every execution, it introduces some overhead that needs to be taken into account.
CREATE OR ALTER PROCEDURE SearchNonFriendlyMatch
    @HomeTeam INT = NULL,
    @AwayTeam INT = NULL,
    @Date DATE = NULL,
    @FTG TINYINT = NULL
AS
BEGIN
    SELECT *
    FROM Matches
    WHERE League IS NOT NULL AND
          -- Each optional filter is skipped when its parameter is NULL
          (@HomeTeam IS NULL OR HomeTeam = @HomeTeam) AND
          (@AwayTeam IS NULL OR AwayTeam = @AwayTeam) AND
          (@Date IS NULL OR [Date] = @Date) AND
          (@FTG IS NULL OR FullTimeGoals >= @FTG);
END
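As mentioned above, one way to mitigate this catch-all pattern is to append OPTION (RECOMPILE) to the statement, so the parameters are embedded as literals and the contradictions are removed at compile time – at the cost of a recompilation on every execution. A minimal sketch:

SELECT *
FROM Matches
WHERE (@HomeTeam IS NULL OR HomeTeam = @HomeTeam) AND
      (@AwayTeam IS NULL OR AwayTeam = @AwayTeam)
OPTION (RECOMPILE); -- parameters embedded as literals at compile time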
ColumnX LIKE '%Make This Query Go Slower Please%'
When we have this wildcard prefixing and suffixing a string, we’re telling SQL Server to retrieve all records that contain the specified text, regardless of what comes before or after it.
SELECT TOP 5 * FROM Teams WHERE Name LIKE '%ar%'
The execution of the previous query returns five teams whose names contain “ar”, such as:
Cardiff City FC
But why is this bad for performance?
Let’s put ourselves in SQL Server’s shoes and pretend that we’ve received a list containing names of football teams, ordered alphabetically. We need to count how many of these names contain the letters “ar”. We roll up our sleeves and start analyzing all characters of the first record, then the second, then the third, and so on, until we reach the last item in the list. We had no alternative but to scan through all the values, simply because the order of this list wasn’t helpful at all. Recalling the table above, it’s easy to understand why: the substring can be anywhere, so there is no shortcut, no possibility of discarding any subset of those names. SQL Server will have the exact same struggle, scanning through the whole index represented by our list.
Now, let’s execute a similar query:
SELECT COUNT(1) FROM Teams t WHERE t.Name LIKE '%Benf%'
And look into its execution plan:
This is a pretty small table but it’s enough to prove the point. This query ran in about 600 milliseconds and the IO statistics show 1 scan count and 1226 logical reads. The cost lies almost entirely in the Index Scan, which was chosen even though SQL Server estimated that just 9 rows would be retrieved. Well, that wasn’t really a choice: in this case an Index Scan was used because the non-sargable search prevented SQL Server from using an Index Seek, as we already know.
With that being said, let’s remove the leading wildcard and execute the query:
SELECT COUNT(1) FROM Teams t WHERE t.Name LIKE 'Benf%'
And this time, the execution plan confirms that the query was much more efficient, using an Index Seek.
It executed in virtually no time and the IO statistics show 1 scan count and 3 logical reads.
If we return to our ordered list, it’s very easy to understand why the query executed so much faster: now we can benefit from that alphabetical order, going directly to the letter B, looking through a few lines and quickly returning the row that matches the search.
What Should I Do, Then?
We saw that the sargable version of our query ran in virtually no time, while the non-sargable version took about 600 ms. Even though this gap is proportionally huge, users don’t complain about half-second queries. Nevertheless, the real problem with this type of query is that it won’t scale. When you try to run it against a table with millions of rows instead of a couple of thousand, it will be much more painful. You’ve been warned.
But what can we do to avoid it?
The first and obvious option would be to analyze if that leading wildcard is really needed. If you can simply get rid of it, problem solved! If not, there’s still hope…
One way to improve these searches is by using Full-Text Search (FTS). FTS is able to quickly find all matches for a search term in a table, because it doesn’t have to scan rows. Instead, it takes advantage of full-text indexes that store positional information for all terms found in the columns on which we create the index. Instead of the LIKE operator, FTS enables the use of CONTAINS/CONTAINSTABLE or FREETEXT/FREETEXTTABLE, as exemplified below:
SELECT COUNT(1) FROM Teams t WHERE CONTAINS((t.Name), @Name)
FTS has a few limitations and requires some setup effort, but I often work with it and it can be pretty awesome.
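For reference, a minimal setup might look like the sketch below. It assumes Teams has a unique key index named PK_Teams; the catalog name is also illustrative.

CREATE FULLTEXT CATALOG FootballCatalog AS DEFAULT;

-- KEY INDEX must reference a unique, non-nullable, single-column index
CREATE FULLTEXT INDEX ON Teams(Name)
    KEY INDEX PK_Teams;

-- Prefix search using the full-text index:
SELECT COUNT(1) FROM Teams WHERE CONTAINS(Name, '"Benf*"');

Note that full-text search matches whole words and prefixes, so it isn’t an exact replacement for an arbitrary '%ar%' substring match.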
Additionally, there are newer, hotter technologies designed to solve these issues – for example Elasticsearch, which enables fast text searches in a very scalable way. It also supports pretty complex searches but, to me, the nicest advantage compared to Full-Text Search is the possibility of taking the heavy processing completely outside SQL Server. CPU cores in SQL Server can be very expensive and should be treated with respect, my boss says.
Now you know why you should avoid queries LIKE ‘%these%’ and you have some ideas on how to improve them. Hope this was helpful to you 🖖