Performance Woes & Wows #3 Full-Text Search

In the previous post of Performance Woes & Wows, we’ve discussed the reason why using a LIKE predicate along with a leading wildcard can really hurt the performance of our queries. Accordingly, we saw how the use of just a trailing % wildcard will produce much more efficient searches.

Nevertheless, it’s likely that we can’t restrict every LIKE search to use just a trailing wildcard, particularly in cases where it’s required to confer a greater level of flexibility to searches. Under those circumstances we have a couple of alternative approaches that can confer more flexibility, comparing to predicates with LIKE and just a trailing wildcard, and that perform significantly better than the use of LIKE with both leading and trailing wildcards. In this post, we’re going to talk about one of those possible approaches: Full-Text Search (FTS).

Full-text queries perform linguistic searches over text data in full-text indexes, based on the rules of a particular language. These queries can search for simple or prefix terms, generations terms (inflectional forms of a word), proximity terms, weighted terms and even synonyms for a specific word. In order to accomplish this, a set of predicates (CONTAINS / FREETEXT) and functions (CONTAINSTABLE / FREETEXTTABLE) are used.

Full-Text Querying Process

Inside SQL Server, Full-Text has a separate engine to which the Query Processor sends the full-text portions of a query for processing. The Full-Text Engine can perform the following actions:

  • Word Breaking: The word breaker identifies individual words by determining word boundaries, based on the lexical rules of the language. Each word (also known as token) is inserted into the full-text index using a compressed representation to reduce its size.
  • Stemming: The stemmer generates inflectional forms of a particular word, e.g. “drinking”, “drank”, and “drunk” are various forms of the word “drink”.
  • Stopword Processing: Stopwords comprehended by a stoplist can be words with meaning in a specific language or tokens without linguistic meaning. For example, in the English language, words such as “a,” “and,” “is,” or “the” are left out of the full-text index since they are known to be irrelevant in a search context.
  • Thesaurus Expansions: FTS queries can search for synonyms of user-specified terms through the use of a Full-Text Search thesaurus. Each thesaurus defines a set of synonyms for a specific language.

The previously mentioned full-text portions are represented in the form of SQL operators that during execution will access the Full-Text Index to retrieve the results.

Original image from docs.microsoft.com

Creating Required Objects

Before executing full-text queries, we must create a Full-Text Catalog and Full-Text Indexes to support our searches.

Full-Text Catalog

The first object we’re going to create is the Full-Text Catalog, which is simply a logical container for Full-Text Indexes:

CREATE FULLTEXT CATALOG DefaultCatalog AS DEFAULT;  

Full-Text Index

The information in a Full-Text Index is used by the Full-Text Engine to compile the respective queries and quickly search a table for particular word, or combinations of words. The process of building a Full-Text Index differs from building other types of indexes, instead of constructing a B-tree structure based on a value stored in a particular row, the Full-Text Engine builds an inverted, stacked, compressed index structure, based on individual tokens from the text being indexed.

Enough chit-chat, let’s create our index.

CREATE FULLTEXT INDEX ON dbo.Entities
(   
     Name      Language 1033,  
     VATNumber Language 1033      
)   
KEY INDEX PK_Entities ON DefaultCatalog
WITH STOPLIST = SYSTEM,CHANGE_TRACKING AUTO;   
GO  

The index was created on the table Entities, over the columns Name and VatNumber and the Language 1033 refers to the English language. On KEY INDEX, PK_entities is specified, which is the clustered index (integer identity column) of the table. Note that a FTS KEY must be unique, non-nullable, single-column index which is not offline, is not defined on a non-deterministic or imprecise non-persisted computed column, does not have a filter, and has maximum size of 900 bytes. The index is contained by the previously created Catalog and uses the system Stoplist. Finally, change tracking is set to automatic, which means that after the initial full population is completed, changes are tracked and propagated automatically.

It’s important to refer that only one full-text index is allowed per table.

Clash of The Predicates

In contrast to FTS, the LIKE predicate works on character patterns only, so it will be interesting to compare the results versus a FTS predicate, in this case CONTAINS. The table Entities that will be queried by both predicates has a total of 550 000 rows.

LIKE

First, we create a store procedure that receives a varchar parameter, which will be used to query the columns Name and VATNumber of the table Entities.

CREATE PROCEDURE GetEntityByNameOrVatNumber
(@SearchText as VARCHAR(300))
AS
BEGIN
	SELECT
		[Name],
		[VATNumber]
	FROM
		[dbo].[Entities]
	WHERE
		[Name] LIKE @SearchText OR
		[VATNumber] LIKE @SearchText
END

The string ‘%EASY-R%’ with leading and trailing wildcards is passed and one result is retrieved.

In order to perform the search, SQL Server has done 5 755 logical reads and consumed more than 2 seconds of CPU.

CONTAINS

Now, let’s create an equivalent store procedure that will use CONTAINS, instead of LIKE:

CREATE PROCEDURE GetEntityByNameOrVatNumber_FTS
(@SearchText as VARCHAR(300))
AS
BEGIN
	SELECT
		[Name],
		[VATNumber]
	FROM
		[dbo].[Entities]
	WHERE
	        (CONTAINS(([Name],[VATNumber]),@SearchText))
END

This time the input string will be “*EASY-R*”, which more closely mimics the behavior of the previous search. This execution produces the exact same result.

But this time, SQL Server only does 4 logical reads and consumes just 188 ms of CPU!

All Hail the King?

It’s obvious that the FTS is the winner of this duel, improving logical reads by 99.93% and CPU consumption by 90.74%. Lowering CPU from more than 2 seconds to less then 0.2 seconds is really impressive, as this will greatly increase the capacity of our server to manage concurrent searches.

“This is perfect, let’s replace all LIKE predicates in the world”

Weeeeeell, hold on dear reader. When we used the CONTAINS predicate and changed the string pattern, I said that it was the one that more closely mimics the LIKE search with both wildcards. This means that they are not the same, hence not always producing the same results, which often can generate confusion.

The key here relies on understanding these differences, so we can accurately identify where we can take advantage of FTS. In a future post we will further discuss and uncover these differences.