Is The Statistics Sampling Process Deterministic?

In a previous post we discussed the importance of statistics, their underlying structure, and how SQL Server uses them to come up with execution plans. We saw that the query optimizer computes a histogram on the values of the first key column of the statistics object, which measures the frequency of each distinct value in the data set.

But the information these histograms hold is intrinsically tied to the chosen sample. In practice, SQL Server will either do a full scan over the entire column or estimate the value distribution by extrapolating from a statistical sample. In this post, we’re going to briefly determine whether the statistics sampling process is deterministic, i.e. whether, under the same premises, the chosen sample varies or not. Place your bets now.

Let’s start by creating a table T, with one clustered index and one non-clustered index. Then, we’ll populate the State column with random values between 1 and 5.

CREATE TABLE T (Id INT IDENTITY, [State] TINYINT)
CREATE CLUSTERED INDEX CIX_T ON T (ID)
CREATE NONCLUSTERED INDEX IX_T_State ON T ([State])

-- GO <count> repeats the preceding batch, inserting roughly 28.6 million rows
INSERT INTO T SELECT FLOOR(RAND()*(5-1+1)+1);
GO 28595716

Now, if we update the statistics using the FULLSCAN option, the histogram will reflect exactly the current value distribution of the State column, which we can easily confirm by running DBCC SHOW_STATISTICS:

UPDATE STATISTICS T IX_T_State WITH FULLSCAN
DBCC SHOW_STATISTICS ('T','IX_T_State') WITH HISTOGRAM

And then compare the results with the actual values in the table:

SELECT [State],COUNT(1) Occurrences 
FROM T 
GROUP BY [State]

This level of accuracy allows SQL Server to make better choices regarding execution plans, operators, and specific algorithms. However, when our tables have hundreds of millions or even billions of rows, we may be forced to loosen up a little bit in order to fit these operations into the available maintenance window. Additionally, even if we’re able to do a full scan to populate the histogram with 100% accuracy, if the modification threshold is subsequently reached, a much smaller sample will typically be used to rebuild it. When the Query Optimizer determines that statistics might be out-of-date, it updates them (when they are needed for a query plan). But it’s usually possible to improve query performance by updating statistics more frequently and with a larger sample than auto update statistics would use. Updating statistics ensures that queries compile with up-to-date statistics, but be aware that updating statistics causes queries to recompile.

Going back to our T table, if we update the statistics on the IX_T_State index, which has 28 595 716 rows, letting SQL Server determine the sample size, we’ll see that a sample of 2.68% was picked.
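A minimal sketch of that step, assuming the same T and IX_T_State objects created above: the STAT_HEADER output reports Rows versus Rows Sampled, which is where the sample percentage can be read.

UPDATE STATISTICS T IX_T_State -- no FULLSCAN/SAMPLE option, so SQL Server picks the sample rate
DBCC SHOW_STATISTICS ('T','IX_T_State') WITH STAT_HEADER -- compare Rows vs. Rows Sampled
DBCC SHOW_STATISTICS ('T','IX_T_State') WITH HISTOGRAM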

In this case, even with such a small sample percentage, SQL Server has done a pretty awesome job extrapolating the results, but that’s beside the point now. What we want to confirm is whether a new update, over the same data universe, will lead to a distinct histogram or not:
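One way to verify it, sketched here under the assumption that the DBCC SHOW_STATISTICS histogram output can be captured with INSERT ... EXEC (the #Hist1/#Hist2 temp tables are just illustrative):

CREATE TABLE #Hist1 (RANGE_HI_KEY TINYINT, RANGE_ROWS FLOAT, EQ_ROWS FLOAT, DISTINCT_RANGE_ROWS BIGINT, AVG_RANGE_ROWS FLOAT)
CREATE TABLE #Hist2 (RANGE_HI_KEY TINYINT, RANGE_ROWS FLOAT, EQ_ROWS FLOAT, DISTINCT_RANGE_ROWS BIGINT, AVG_RANGE_ROWS FLOAT)

-- capture the current (sampled) histogram
INSERT INTO #Hist1 EXEC ('DBCC SHOW_STATISTICS (''T'',''IX_T_State'') WITH HISTOGRAM')

-- update the statistics again, over exactly the same data
UPDATE STATISTICS T IX_T_State

-- capture the new histogram and compare: no rows returned means the histograms match
INSERT INTO #Hist2 EXEC ('DBCC SHOW_STATISTICS (''T'',''IX_T_State'') WITH HISTOGRAM')
SELECT * FROM #Hist1 EXCEPT SELECT * FROM #Hist2
SELECT * FROM #Hist2 EXCEPT SELECT * FROM #Hist1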

The results show that both histograms are equal. Changing the sample percent and the table size won’t alter the behavior. SQL Server uses a deterministic sampling process.

The Deceiving N

In this post, we’ll briefly look into a common misconception about nchar and nvarchar data types.

We begin by setting the breakfast table:

CREATE TABLE Breakfast
(
    Id SMALLINT IDENTITY(1,1),
    Product NVARCHAR(10)
)
INSERT INTO Breakfast (Product) VALUES ('Caviar'),('Blinis')

Later, someone spots our mistake creating the breakfast menu and tries to apply the only logical correction to it:

UPDATE Breakfast SET Product = 'Beluga Gold Vodka' WHERE Product = 'Blinis' 

But, despite the good thinking and good taste, the attempt fails.

Furthermore, looking at the Breakfast table definition can make things even more confusing, because the Length value shown is slightly greater than the length of the string we tried to insert.

It’s important to understand that in NVARCHAR(n) and NCHAR(n), n doesn’t correspond to the number of characters that can be stored, but rather to the string length in byte-pairs (from 1 through 4000). This misconception is probably explained by the fact that for characters in the Unicode range 0 to 65 535, one character is stored per byte-pair. However, in the higher Unicode ranges (65 536 to 1 114 111) one character may use two byte-pairs.

In our case, one character is being stored per byte-pair, but the discrepancy that we saw in the column information, more specifically in its length, can lead to some confusion. Since we initially set n to 10 for the Product column, those 10 byte-pairs correspond to (and justify) the value of 20 presented in the object definition for the column length. However, because our NVARCHAR column uses 2 bytes per character, the maximum number of characters that can be stored is 10.
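A quick way to see the character-versus-byte distinction is to compare LEN (characters) with DATALENGTH (bytes):

-- 'Beluga Gold Vodka' is 17 characters, i.e. 34 bytes as NVARCHAR,
-- so it cannot fit into NVARCHAR(10), which reserves 10 byte-pairs (20 bytes)
SELECT LEN(N'Beluga Gold Vodka') AS Characters, DATALENGTH(N'Beluga Gold Vodka') AS Bytes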

If we run the following query:

SELECT
	column_name,
	character_maximum_length,
	character_octet_length
FROM 
	information_schema.columns
WHERE 
	table_name = 'Breakfast' AND
	Column_name ='Product'

The character_maximum_length and character_octet_length values for the Product column will confirm exactly what we’ve just seen.

Now the modifications to allow a proper breakfast can be made.
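One possible fix is simply widening the column (here to 20 byte-pairs, enough for the 17-character product) and retrying the update:

ALTER TABLE Breakfast ALTER COLUMN Product NVARCHAR(20)
UPDATE Breakfast SET Product = 'Beluga Gold Vodka' WHERE Product = 'Blinis'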

Za zdarovye!

Wait Stats#1: SOS_SCHEDULER_YIELD

In order for you to have a happy life as someone responsible for a database, you need to ensure that it follows a strict diet consisting of well-designed queries and indexes, plenty of CPU, memory and disk capacity. Neglecting any of these aspects will compromise the entire system and consequently your happiness.

When it comes to CPU usage, we can roll up our sleeves and start tuning our server, our queries and indexes, so they perform better and use less CPU. This is the preferable approach, because the alternative of adding processing power to SQL Server can be quite expensive. Nonetheless, it’s possible to reach a point where your CPUs are simply not enough to support your workload and tuning queries will feel like rearranging chairs on a sinking Titanic.

In theory there are two options when increasing CPU capacity:

  • Adding more CPUs, hence increasing the thread pool and improving parallel queries.
  • Getting faster CPUs, which will improve serial query speed and reduce thread consumption.

In practice, the second option is becoming gradually less common because most people are in the cloud and using virtual machines, which typically imply less flexibility.

But, in order to confirm that the CPU really is the bottleneck, we need to take a closer look. Wait stats are a great starting point and, in this specific matter, wait types like SOS_SCHEDULER_YIELD, CXPACKET, or THREADPOOL can give us pretty useful hints about what’s happening. It’s important to stress that waits are not intrinsically bad; they are an integral part of how SQL Server works, and we must be able to separate evil waits from benign waits.
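As a starting point, something like the following sketch shows where the instance has been waiting since the wait statistics were last cleared (the list of benign wait types to filter out varies from instance to instance):

SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'LAZYWRITER_SLEEP', N'SLEEP_TASK', N'WAITFOR',
                        N'XE_TIMER_EVENT', N'REQUEST_FOR_DEADLOCK_SEARCH',
                        N'LOGMGR_QUEUE', N'CHECKPOINT_QUEUE', N'BROKER_TO_FLUSH')
ORDER BY wait_time_ms DESC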

In this post, we’re going to overview the SOS_SCHEDULER_YIELD wait type. However, to have a clearer understanding of the underlying information, we must grasp some basic concepts, namely schedulers and the query lifecycle.

Schedulers and Query Lifecycle

A scheduler is the logical equivalent of a CPU, where worker threads are scheduled to actively carry out the tasks assigned to them. We can view scheduler information using the DMV sys.dm_os_schedulers. Internally, schedulers have three components:

  • Processor: Physical or logical CPU core that is responsible for processing the thread. Each core will process only one thread at a time.
  • Waiter List: List that contains all threads that are waiting for a resource to become available. The threads in this list are in the Suspended state (we’ll talk more about these states shortly). We can view information about these resource waits in the DMV sys.dm_os_waiting_tasks.
  • Runnable Queue: Encompasses all threads that are waiting for processor time (Runnable state). When priorities are not assigned to workload groups, the processing order in this queue is First-In-First-Out (more info in sys.dm_exec_requests).
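A quick look at the schedulers themselves can be taken with a sketch like this one over sys.dm_os_schedulers; a runnable_tasks_count that stays above zero over time hints at CPU pressure:

SELECT
    scheduler_id,
    cpu_id,
    status,
    current_tasks_count,
    runnable_tasks_count,   -- threads sitting in the runnable queue
    current_workers_count,
    active_workers_count,
    work_queue_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'   -- schedulers that run user work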

Accordingly, these components are directly related to the states encompassed by the query lifecycle:

  • Running: Query is being processed on a CPU.
  • Runnable: Query is ready to execute, but it’s waiting for CPU time.
  • Suspended: Query is waiting on a particular resource to become available.

During one execution, a query can go through these states multiple times. This behavior is enforced by the cooperative (or non-preemptive) scheduling used by SQL OS. Basically, only one thread will be running on the scheduler at any given moment, and each one has a quantum of 4 milliseconds, which represents the maximum consecutive CPU time for any of them. After these 4 milliseconds, the thread voluntarily yields the CPU and moves back into the Runnable state. Even if no other threads are waiting and the thread has no need to wait for any resource (which would move it into the Suspended state), the quantum is not exceeded.
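We can watch requests moving through these states with a query like the following sketch over sys.dm_exec_requests (the session_id > 50 filter is just a rough way to exclude most system sessions):

SELECT
    session_id,
    status,          -- running, runnable, suspended, sleeping
    wait_type,
    wait_time,
    last_wait_type,
    cpu_time,
    scheduler_id
FROM sys.dm_exec_requests
WHERE session_id > 50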

It’s important to mention that preemptive function calls work differently; they are selfish and don’t like to cooperate. They are nothing more than calls to functions outside the realm of SQL OS, like Win32 API functions, where the cooperative waiting rules no longer apply. Under these circumstances, and before the external function is invoked, a sibling worker is signaled and given ownership of the scheduler, while the current thread continues to run. Then, when the thread goes preemptive, its quantum “ends” and no additional accounting is done for its CPU usage until it reverts to non-preemptive scheduling. When the preemptive work is done and the thread transitions into cooperative scheduling again, it declares the end of its preemptive wait and goes into the runnable queue. If you want to know more about preemptive scheduling, check this awesome blog post from Ewald Cress.

SOS_SCHEDULER_YIELD

This wait type occurs when a thread voluntarily yields the scheduler so other tasks can execute, and waits for its quantum to be renewed.

Since the thread isn’t waiting for any resource and just voluntarily goes into the runnable queue, SOS_SCHEDULER_YIELD doesn’t imply resource wait time. Instead, it accrues signal wait time, which shows how long the thread waited in the runnable queue before getting back into the Running state.
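That is visible in sys.dm_os_wait_stats: for SOS_SCHEDULER_YIELD, essentially all of the wait time should be signal wait time:

SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms  -- expected to be close to 0
FROM sys.dm_os_wait_stats
WHERE wait_type = N'SOS_SCHEDULER_YIELD'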

Problem?

Probably not. As we saw, SOS_SCHEDULER_YIELD just indicates that threads are exhausting their quantum, so it’s generally a benign wait type. Even if this wait type is prevalent, if signal wait time is low that just means that multiple threads are taking turns on the scheduler, and you should look somewhere else to find the root cause of your problems.

Nonetheless, when signal wait time is high things can get interesting. Frequent SOS_SCHEDULER_YIELD waits with high signal wait time may suggest that a substantial number of CPU intensive queries are fighting for CPU and each ends up waiting a long time to be signaled. In this case it’s probably a good idea to look for the top CPU consuming queries and analyze their execution plans.
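A common way to find those candidates is to rank the cached plans by worker (CPU) time, for example with a sketch like this (total_worker_time is reported in microseconds):

SELECT TOP (10)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
              ((CASE qs.statement_end_offset
                    WHEN -1 THEN DATALENGTH(st.text)
                    ELSE qs.statement_end_offset
                END - qs.statement_start_offset) / 2) + 1) AS statement_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_worker_time DESC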

Conversely, less frequent waits with high signal wait time indicate that there are not many threads fighting for CPU time. In that case, an external source may be causing CPU pressure, either preemptive function calls or something completely external to SQL Server.
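One way to tell SQL Server’s own CPU consumption apart from everything else running on the box is the scheduler monitor ring buffer; a sketch (one entry is recorded roughly every minute, most recent first):

SELECT TOP (30)
    record.value('(./Record/@id)[1]', 'int') AS record_id,
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_pct,
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS idle_pct,
    100
      - record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
      - record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS other_cpu_pct
FROM (
    SELECT CONVERT(XML, record) AS record
    FROM sys.dm_os_ring_buffers
    WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
      AND record LIKE N'%<SystemHealth>%'
) AS rb
ORDER BY record_id DESC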