Statistics shape the way executions plans are built and have a massive impact on query performance. In order to achieve a better understanding on this relation, we will start to briefly explore what happens when a query is executed and where do statistics come into play.
Query Execution Phases
When a query is executed, there’s quite a lot happening in the Relational Engine inside SQL Server. In a higher plan, this process can be divided into four sequential phases: Parsing, Binding, Optimization and Execution. Each phase requires an input that is consumed and transformed into an output that is forwarded to the next phase.
In the first step, a syntactic validation of the T-SQL is done, also known as query parsing. The example provided in the image below, represents a query that failed to be parsed, in that case, because the SELECT clause was miswritten.
The output of this phase is a parse tree, which comprehends the logical steps required to perform the query.
When the query is successfully parsed, query binding takes place. This process, owned by the algebrizer will resolve the parse tree, trying to bind the respective strings to specific objects within the database. This translates to, for instance, verifying if referenced tables and columns exist under the current execution context. It’s important to mention that the query can use names that do not match directly to any database object, e.g. if aliases are being used. So, it’s also the algebrizer job to resolve these cases. Additionally, it will also identify datatypes and process aggregate functions to some extent (known as aggregate binding). The following example shows a query that could not be resolved, because it refers to a column that does not exist.
The output of this phase is a query processor tree, that will be handed to the query optimizer.
Well, in fact, the query processor tree isn’t the only input received by the optimizer. Building a new plan from scratch consumes resources, specifically time and CPU. In order to manage effectively this resources, SQL Server will keep plans in cache, whenever possible, in order to reuse them. So, the optimizer will also receive the hash code of the query currently being executed, as an input. This hash code will be used to probe the existence of a valid execution plan in cache, that matches the query. If there is a match with a valid plan, the optimization process stops and the cached plan is reused. However, if changes were made to any table referenced in the query, or statistics were updated in the meantime, the plan is considered invalid and must be rebuilt.
If no valid plan is found, the optimizer creates a new plan, which in essence is an analysis of many alternate ways to achieve the expectable result. The optimizer estimates a cost for each alternative approach and tries to find a plan, cheap enough, in an incredible short amount of time. Trivial plans (e.g. SELECT Col1 FROM Table1) don’t go through this optimization process, as they typically don’t have alternative approaches to be considered, due to their simplicity. More complex queries will be subject to a full cost-based optimization process, resulting in a cost-based plan, calculated according to the cardinality estimations. After the cheapest possible plan is chosen and built, it is handed to the Execution Engine.
Finally, in the execution phase, the Execution Engine will receive and process the execution plan. Validations on the plan are reinforced at runtime, to confirm that it is still valid. If something happened in the meanwhile that changed the initial state, for example because statistics were updated, execution is paused, the compilation process is invoked and a new plan is built for the affected statement(s).
The Importance of Statistics
Statistics provide information about the distribution and selectivity of values across columns, using a sample that will represent the entire universe of values. The selectivity of a value is the ratio between the rows that pass the selection criteria and the total number of rows. We can extrapolate this principle and think in the selectivity of a column as the ratio between the number of distinct values and the total number of rows. This basically defines the uniqueness of what you are trying to find and it will provide guidance to the optimizer on how to shape the plan. In a side note, keep in mind that statistics can be manually created, (whether directly or through the creation of indexes) or automatically created by the SQL Server when a query is executed, for the optimizer to consume them.
This process is fundamental because we want our queries to run as fast as possible, what wouldn’t happen if the optimizer had to scan all data in all referenced tables.
Going Under the Hood
For the purposes of this demo, we’ll only cover the usage of single column statistics, equality predicates and single table queries. Bear in mind that the behavior differs when using multiple column statistics, joining other tables or applying more complex predicates.
First, we’ll run the following query, which will retrieve the matches where Manchester City (Id=77) played at Etihad Stadium.
SELECT * FROM Matches WHERE HomeTeam = 77
In the execution plan, we’ll see that SQL Server estimated, correctly, that 390 would be retrieved, from a total of 535 983 rows.
If we open the XML execution plan, we’ll be able to confirm which statistics were used in order to make the estimate, under “OptimizerStatsUsage“.
Next, we use DBCC SHOW_STATISTICS to view to view those statistics, related to HomeTeam column.
DBCC SHOW_STATISTICS(N'Matches', _WA_Sys_00000005_73BA3083)
The information that makes up statistics is divided in three sections:
- Header: Metadata about the statistics.
- Density Vector: Selectivity of the data, used to measure cross-column correlation.
- Histogram: Distribution of values in the first key column of the statistics object.
A histogram is composed by the following columns:
- RANGE_HI_KEY: Upper-bound value for a step (always based on leftmost column).
- RANGE_ROWS: Number of rows with a value falling within a histogram step, excluding the upper bound.
- EQ_ROWS: Number of rows whose value equals the RANGE_HI_KEY.
- DISTINCT_RANGE_ROWS: Number of rows with a distinct column value within a histogram step, excluding the upper bound.
- AVG_RANGE_ROWS: Average number of rows with duplicate column values within a histogram step, excluding the upper bound calculation (RANGE_ROWS / DISTINCT_RANGE_ROWS)
In this case, because 77 hits a RANGE_HI_KEY in the histogram, we’ll be able to notice that the estimated value is on the column “EQ_ROWS” of that row.
The formulas to calculate the estimated number of rows and selectivity, can be expressed as it follows:
Selectivity is the key to combine statistics. Next, we’ll combine two single column statistics, by adding another equality predicate do the WHERE clause of the previous query. It’s important to stress that the version of the Cardinality Estimator (CE) dictates how calculations are made. The legacy CE (prior to SQL Server 2014) assumes that there is no correlation between multiple columns. Therefore, the cardinality is calculated with the following formula, that comprehends these variables: estimated number of rows (E), selectivity of the first (S1) and second (S2) columns and the total number of rows (R).
Contrarily, the new CE assumes some correlation between columns and estimates the number of rows as it follows:
Now, let’s run a new query that retrieves the games where Manchester City played at home and the full-time result was a win to the away team.
SELECT * FROM Matches WHERE HomeTeam = 77 AND FTR = 'A'
We already know the values related to the predicate HomeTeam = 77, so let’s focus on the full-time result column (FTR) and the specific value “A”. If we open the XML execution plan, we’ll notice that the statistics of another column were added.
We then repeat the DBCC SHOW_STATISTICS command, changing it to match the above highlighted stats, related to FTR column.
These statistics will show us the same 535 983 total rows and 9597 occurences of “A” in the FTR column.
Ok, now we have all the data we need, so let’s calculate the estimated number of rows for our query.
If we open the execution plan, we’ll be able to confirm that the our calculations match the ones made by SQL Server.
And as expected, if we change to the legacy CE, we will observe that the estimate has changed.
Accordingly, if we apply the formula used by the legacy CE, our results will also match:
In this case, the new CE did a better job estimating the workload, nevertheless, it doesn’t mean that the legacy CE is ready for retirement. In fact, in many cases the tables will turn and grandpa CE will do a better job. But, generally speaking, you won’t solve that many problems just by changing the CE version, despite the fact that it can have an observable impact, depending on the query specificities.
We’ve briefly looked into the phases of query execution and stated that the optimizer will choose the lowest-cost plan, based on its estimated cost that largely depends on the cardinality estimations. This implies that the quality of the plan choice is highly correlated to the quality of the statistics used by the optimizer. In a future post we’ll see in practice the negative impacts of bad estimates and what can be done to avoid some tears.
Until then, stay safe 🖖