Is The Statistics Sampling Process Deterministic?

In a previous post we discussed the importance of statistics, examined their underlying structure and explored how SQL Server uses them to come up with execution plans. We saw that the query optimizer builds a histogram on the values of the first key column of the statistics object, which measures the frequency of each distinct value in the data set.

But the information that these histograms hold is intrinsically tied to the chosen sample universe. In practice, SQL Server will either do a full scan over the entire column or estimate the value distribution by extrapolating from a statistical sample. In this post, we’re going to determine, very briefly, whether the statistics sampling process is deterministic or not, i.e. whether, under the same premises, the chosen sample varies or not. Place your bets now.

Let’s start by creating a table T with one clustered index and one non-clustered index. Then we’ll populate the State column with random values between 1 and 5.

CREATE TABLE T (Id INT IDENTITY, [State] TINYINT)
CREATE CLUSTERED INDEX CIX_T ON T (ID)
CREATE NONCLUSTERED INDEX IX_T_State ON T ([State])

-- Inserts one random State value (1 to 5) per execution; GO 28595716 re-runs this batch that many times
INSERT INTO T SELECT FLOOR(RAND()*(5-1+1)+1);
GO 28595716

Now, if we update the statistics using the FULLSCAN option, the histogram will reflect exactly the current value distribution of the State column, and we can easily confirm it by running DBCC SHOW_STATISTICS:

UPDATE STATISTICS T IX_T_State WITH FULLSCAN
DBCC SHOW_STATISTICS ('T','IX_T_State') WITH HISTOGRAM

And then comparing the results with the actual values in the table:

SELECT [State],COUNT(1) Occurrences 
FROM T 
GROUP BY [State]

This level of accuracy allows SQL Server to make better choices regarding execution plans, operators and specific algorithms. However, when our tables have hundreds of millions or even billions of rows, we may be forced to loosen up a little in order to fit these operations into the available maintenance window. Additionally, even if we’re able to do a full scan and populate the histogram with 100% accuracy, once the modification threshold is subsequently reached, the histogram will typically be rebuilt using a much smaller sample. When the query optimizer determines that statistics might be out-of-date, it updates them the next time they are needed for a query plan. Still, it’s usually possible to improve query performance by updating statistics more frequently, and with a larger sample, than the auto update would use. Just be aware that, while updating statistics ensures queries compile with up-to-date statistics, it also causes those queries to recompile.
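For reference, here is how the options compare side by side; this is standard UPDATE STATISTICS syntax, and the 10 percent figure below is just an arbitrary illustration, not a recommendation:

-- Full scan: most accurate, most expensive
UPDATE STATISTICS T IX_T_State WITH FULLSCAN

-- Explicit sample: cheaper, less accurate (10 percent is an arbitrary example value)
UPDATE STATISTICS T IX_T_State WITH SAMPLE 10 PERCENT

-- Default: SQL Server chooses the sample size based on the table size
UPDATE STATISTICS T IX_T_State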

Going back to our T table, if we update the statistics on the IX_T_State index, which has 28,595,716 rows, letting SQL Server determine the sample size, we’ll see that a sample of 2.68% was picked and the statistics will hold the following data:
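If you want to check the sample size that was actually used, the statistics header exposes it; a quick check using the documented STAT_HEADER output (compare the Rows and Rows Sampled columns):

UPDATE STATISTICS T IX_T_State
DBCC SHOW_STATISTICS ('T','IX_T_State') WITH STAT_HEADER
-- Rows Sampled divided by Rows gives the effective sample percentage (about 2.68% here)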

In this case, even with such a small sample percentage, SQL Server has done a pretty awesome job extrapolating the results, but that’s beside the point now. What we want to confirm is whether a new update, over the exact same data, will lead to a distinct histogram or not:
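Here’s a minimal sketch of one way to make that comparison, assuming the standard five-column histogram output of DBCC SHOW_STATISTICS and capturing it with INSERT ... EXEC:

CREATE TABLE #Hist1 (RANGE_HI_KEY TINYINT, RANGE_ROWS FLOAT, EQ_ROWS FLOAT, DISTINCT_RANGE_ROWS FLOAT, AVG_RANGE_ROWS FLOAT)
CREATE TABLE #Hist2 (RANGE_HI_KEY TINYINT, RANGE_ROWS FLOAT, EQ_ROWS FLOAT, DISTINCT_RANGE_ROWS FLOAT, AVG_RANGE_ROWS FLOAT)

-- First sampled update and its histogram
UPDATE STATISTICS T IX_T_State
INSERT INTO #Hist1 EXEC ('DBCC SHOW_STATISTICS (''T'',''IX_T_State'') WITH HISTOGRAM')

-- Second sampled update, over the exact same data
UPDATE STATISTICS T IX_T_State
INSERT INTO #Hist2 EXEC ('DBCC SHOW_STATISTICS (''T'',''IX_T_State'') WITH HISTOGRAM')

-- If the sampling were non-deterministic, these queries would return rows
SELECT * FROM #Hist1 EXCEPT SELECT * FROM #Hist2
SELECT * FROM #Hist2 EXCEPT SELECT * FROM #Hist1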

The results show that both histograms are identical. Changing the sample percentage or the table size doesn’t alter this behavior: SQL Server uses a deterministic sampling process.
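As a complementary check on recent versions of SQL Server, the sys.dm_db_stats_properties DMF reports the number of rows the last update sampled, so running it after each update lets you confirm that the sampled row count is stable as well:

SELECT s.name, sp.last_updated, sp.[rows], sp.rows_sampled
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE s.object_id = OBJECT_ID('T') AND s.name = 'IX_T_State'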
