TIL: How to optimize Group by queries in SQL Server

Let me share this technique I learned to improve queries with GROUP BY in SQL Server.

To improve queries with GROUP BY, write the SELECT query with the GROUP BY part using only the needed columns to do the grouping or sorting inside a common table expression (CTE) first. Then, join the CTE with the right tables to find other columns.

Usual GROUP BY: Find StackOverflow most down-voted questions

Let’s use this technique to tune the store procedure to find most down-voted questions on StackOverflow.

Here’s the store procedure. Let’s fire our local copy of the StackOverflow 2013 database to run it.

CREATE OR ALTER PROC dbo.MostDownVotedQuestions AS
BEGIN
    select top 20 count(v.PostId) as 'Vote count', v.PostId AS [Post Link],p.Body
    from Votes v 
    inner join Posts p on p.Id=v.PostId
    where PostTypeId = 1 and v.VoteTypeId=3
    group by v.PostId, p.Body
    order by 'Vote count' desc
END
GO

I ran this stored procedure on my local machine five times without any indexes. It took about 2 seconds each time. On my machine, SQL Server only had 8GB of RAM. Remember, by default SQL Server uses all available RAM.

This is the execution plan. Let’s notice the Clustered Index Seek on the dbo.Posts table. And, yes, SQL Server is recommending an index. But we’re not adding it.

StackOverflow most down-votes questions
StackOverflow most down-votes questions

Then, these are the metrics grabbed with sp_BlitzCache from the First Responder Kit. This stored procedure finds all the most CPU intensive queries SQL Server has recently executed.

Most CPU intensive queries
Most CPU intensive queries

To find the most down-voted questions, our stored procedure is grouping by Body. And, that’s an NVARCHAR(MAX) column, the actual content of StackOverflow posts.

Sorting and grouping on large data types is a CPU expensive operation.

Spoons full of Indian spices
Photo by Pratiksha Mohanty on Unsplash

Group and order inside CTE’s first

To improve queries with GROUP BY, group inside a common table expression (CTE) with only the required columns in the grouping. For example, IDs or columns covered by indexes. Then, join the CTE with the right tables to find other columns.

After grouping only by PostId inside a CTE first, our stored procedure looks like this,

CREATE OR ALTER PROC dbo.MostDownVotedQuestions AS
BEGIN
    WITH MostDownVoted AS (
        select top 20 count(v.PostId) as VoteCount, v.PostId /* We removed the Body column */
        from Votes v 
        inner join Posts p on p.Id=v.PostId
        where PostTypeId = 1 and v.VoteTypeId=3
        group by v.PostId /* Also, from here */
        order by VoteCount desc
    )
    select VoteCount as 'Vote count', d.PostId AS [Post Link], p.Body
    from MostDownVoted d
    inner join Posts p on p.Id = d.PostId
END
GO

This time, we are excluding the Body column from the GROUP BY part. Then, we are joining the MostDownVotes CTE to the dbo.Post table to show only the Body of the 20 resulting posts.

Again, this is the execution plan of grouping inside a CTE first.

StackOverflow most down-votes questions
StackOverflow most down-votes questions grouping inside a CTE

Notice the Clustered Index Seek operator on the left branch. That’s to find the body of only the 20 post SQL Server found as a result of grouping inside the CTE. This time, SQL Server is grouping and sorting fewer data. It made our stored procedure use less CPU time.

Let’s take another look at sp_BlitzCache. Before running the modified version of our store procedure five times, I ran DBCC FREEPROCCACHE to free up SQL Server’s plan cache.

Most CPU intensive queries
Most CPU intensive queries with our modified stored proc

Notice the “Total CPU” and “Avg CPU” columns, we’re using less CPU time after the change. I went from 36.151ms of total CPU time to 35.505ms. Hooray!

Now, imagine if that store procedure runs not only 5 times, but multiple times per minute. What if our stored procedure feeds a reporting screen in our app? That change with a CTE will make a huge difference in the overral CPU usage.

Voilà! That’s how we can improve queries with GROUP BY. Remember, group and sort inside CTE’s to take advantage of existing indexes and avoid expensive sorting operations. Use this technique with OFFSET/FETCH, SUM, AVG, and other group functions.

I learned this technique following Brent Ozar’s Mastering Query Tuning class.

For more SQL Server content, check don’t write functions around columns in WHERE, what are implicit conversions?, and How to do a case-sensitive search in SQL Server.

Happy coding!

TIL: How to do a case-sensitive search in SQL Server

Do you use LOWER or UPPER to do case-sensitive searches? Let’s see how to write a case-sensitive search in SQL Server.

To write a case-sensitive search in SQL Server, don’t use UPPER or LOWER functions around the column to filter. Instead, use the COLLATE keyword with a case-sensitive collation followed by the comparison.

Often by mistake, to do a case-sensitive search, we wrap a column around LOWER(). Something like this,

SELECT TOP 50 DisplayName, Location, CreationDate, Reputation
FROM dbo.Users
WHERE LOWER(DisplayName) LIKE 'john%'
--    ^^^^^
ORDER BY Reputation DESC;
GO

We tried to find the first 50 StackOverflow users with DisplayName containing lowercase ‘john’.

But, interestingly enough, some results don’t match our intended filter. They include both lowercase and uppercase ‘john’.

Result of finding users by wrapping a column with LOWER
Naive case insensitive search using LOWER

It would be the same if we use UPPER() instead. In general, let’s not use functions around columns in our WHEREs. That’s a common bad practice.

scrabble pieces
Photo by Brett Jordan on Unsplash

Collation and case sensitivity

In SQL Server, collations provide sorting rules and case and accent sensitivity for our data. For example, when we use an ORDER BY, the sort order of our results depends on the database collation.

In SQL Server Management Studio, we find a database collation after clicking on the database Properties and then on General. It’s under the Maintenance section.

Here’s mine.

SQL Server database properties
SQL Server database properties

The default collation for English is SQL_Latin1_General_CP1_CI_AS. This is a case-insensitive collation.

In collation names, CI means case insensitive. And, AS means accent sensitive.

That’s why when we wrote LOWER(DisplayName), SQL Server showed uppercase and lowercase results. SQL Server used the default collation, which was a case-insensitive one.

For more details about collations, check Microsoft docs on collations and Unicode support.

Case sensitive searches in WHERE clause

For case-sensitive searches in SQL Server, use COLLATE keyword with the case sensitive collation “SQL_Latin1_General_CP1_CS_AS” followed by the LIKE or = comparisons as usual.

When using the SQL_Latin1_General_CP1_CS_AS collation, a is different from A, and à is different from á. It’s both case and accent-sensitive.

Let’s rewrite our query with COLLATE,

SELECT TOP 50 DisplayName, Location, CreationDate, Reputation
FROM dbo.Users
WHERE DisplayName COLLATE SQL_Latin1_General_CP1_CS_AS LIKE '%john'
ORDER BY Reputation DESC;
GO

This time, we have the results we were expecting. Only users with lowercase ‘john’ in their display name.

Query to find StackOverflow users with COLLATE
Case sensitive query with COLLATE

Voilà! That’s how we can write case-sensitive searches in SQL Server. Remember, don’t use LOWER or UPPER. They won’t work for case-sensitive searches. Use a different collation instead.

For more content on SQL Server, check what are implicit conversions and why you should care, how to optimize GROUP BY queries, and just listen to index recommendations.

Happy coding!

Monday Links: Daily Meetings, Estimates, and Challenges

This Monday Links episode is about careers and developer life. Four reads. Let’s start with a rant on daily meetings. Enjoy!

Daily Standup Meetings are useless

I think I could have written this article myself. I’ve been in those daily meetings you want to finish as quickly as possible to get back to your life. -“I’m working on ABC, no blockers.” We need to stop doing daily meetings as a micromanaging strategy. Otherwise, let’s replace them with an email or a message in a group chat. That would be more productive for everyone. Read full article

Is tasking developers with creating detailed estimates a waste of company money?

I’ve worked in places where developers spend half a day estimating tasks and, they have to resend their estimates every time they change them. And, I’ve heard stories from past coworkers where they spend a whole day estimating tasks for the next 6-month period. This is a common thing and I don’t see an easy way out because “that’s how we have always done things here”.

Quoting the article about why project management techniques don’t work on software projects… “software development involves messy discovery of new tasks in the complex and abstract environment of code, which results in software development task durations being a ‘?’.” Read full article

Film Examining Room circa 1940s
It's time to quit their job? Photo by Museums Victoria on Unsplash

5 Signs It’s Time to Quit Your Job

Are you planning to change jobs? This article can help you to decide. One common factor is learning. Ask yourself if you’re learning new skills and if your learning is still valuable in another company. Quoting the article “Sometimes five years of experience is just… the same year of experience, five times over.” Read full article

Bring your own code

Have you ever been in a hiring process where you are asked to work on a coding exercise without getting clear instructions? This article shows an alternative to coding exercises: bring an already-written piece of code. I like this idea. Also, it shows hiring managers what to look for. Hiring managers…“you need to be comfortable with the fact that you’ll lose some candidates who are unwilling to do your assignment but would take a different option.” Read full article

Voilà! Another Monday Links about career and workplaces. I don’t know if this topic will become a trend in future “Monday Links.” Stay tuned to find it out.

In the meantime, check my Unit Testing 101 series. Don’t miss the previous Monday Links on Better programming, flags, and C# and Workplaces, studying and communication.

What are implicit conversions and why you should care

SQL Server compares columns and parameters with the same data types. But, if the two data types are different, weird things happen. Let’s see what implicit conversions are and why we should care.

An implicit conversion happens when the data types of columns and parameters in comparisons are different. And SQL Server has to convert between them, following type precedence rules. Often, implicit conversions lead to unneeded index or table scans.

An implicit conversion that scans

Let’s see an implicit convention! For this, let’s create a new table from the StackOverflow Users table. But, this time, let’s change the Location data type from NVARCHAR to VARCHAR.

USE StackOverflow2013;
GO

CREATE TABLE dbo.Users_Varchar (Id INT PRIMARY KEY CLUSTERED, Location VARCHAR(100));
INSERT INTO dbo.Users_Varchar (Id, Location)
  SELECT Id, Location
  FROM dbo.Users;
GO

CREATE INDEX Location ON dbo.Users_Varchar(Location);
GO

Let’s find all users from Colombia. To prove a point, let’s query the dbo.Users_Varchar table instead of the original dbo.Users table.

DECLARE @Location NVARCHAR(20) = N'Colombia';

SELECT Id, Location
FROM dbo.Users_Varchar
/* The column is VARCHAR, but the parameter NVARCHAR */
WHERE Location = @Location;
GO

Notice we have declared @Location as NVARCHAR. We have a type mismatch between the column and the variable.

Let’s see the execution plan.

Execution plan of finding all users StackOverflow users from Colombia
StackOverflow users from Colombia

SQL Server had to scan the index on Location. But, why?

Warning sign on execution plan of finding all users StackOverflow users from Colombia
Warning sign on SELECT operator

Let’s notice the warning sign on the execution plan. When we hover over it, it shows the cause: SQL Server had to convert the two types. Yes! SQL Server converted between VARCHAR and NVARCHAR.

SQL Server data type precedence

To determine what types to convert, SQL Server follows a data type precedence order. This is a short version:

  1. datetimeoffset
  2. datetime2
  3. datetime
  4. smalldatetime
  5. date
  6. time
  7. decimal
  8. bigint
  9. int
  10. timestamp
  11. uniqueidentifier
  12. nvarchar (including nvarchar(max))
  13. varchar (including varchar(max))

SQL Server convert “lower” types to “higher” types. We don’t need to memorize this order. Let’s remember SQL Server always has to convert VARCHAR to other types.

For the complete list of type precedence between all data types, check Microsoft docs on SQL Server data Type precedence.

An implicit conversion that seeks

In our example, the VARCHAR type was on the column, on the left side of the comparison in the WHERE. It means SQL Server had to read the whole content of the index to convert and then compare. That’s more than 2 million rows. That’s why the index scan.

Let’s use the original dbo.Users table with Location as NVARCHAR and repeat the query. This time, switching the variable type to VARCHAR. What would be different?

DECLARE @Location VARCHAR(20) = 'Colombia';

SELECT Id, Location
/* We're filtering on the original Users table */
FROM dbo.Users
/* This time, the column is NVARCHAR, but the parameter VARCHAR */
WHERE Location = @Location;
GO

Now, the VARCHAR type is on the right of the comparison. It means SQL Server has to do one single conversion: the parameter.

Execution plan of finding all users StackOverflow users from Colombia
Index Seek on Location index when finding all users from Colombia

This time we don’t have a yellow bang on our execution plan. And, we have an Index Seek. Not all implicit conversions are bad.

In stored procedures and queries, use input parameters with the same types as the columns on the tables.

To identify which queries on your SQL Server have implicit conversions issues, we can use the third query from these six performance tuning tips from Pinal Dave. But, after taking Brent Ozar’s Mastering courses, I learned to start working with the most expensive queries instead of jumping to queries with implicit convertion issues right away.

Voilà! Those are implicit conversions and why you should care. Let’s use input parameters with the right data types on your queries and store procedures. Otherwise, we will pay the performance penalty of converting and comparing types. Implicit conversions are like functions around columns, implicitly added by SQL Server itself.

Let’s remember that not all implicit conversions are bad. When looking at execution plans, let’s check how many rows SQL Server reads to convert and compare things.

For more content on SQL Server, check how to compare datetimes without the time part, how to write case-sensitive searches and how to optimize queries with GROUP BY.

Happy coding!

Don't use functions around columns in your WHEREs: The worst T-SQL mistake

There’s one thing we could do to write faster queries in SQL Server: don’t use functions around columns in WHERE clauses. I learned it the hard way. Let me share this lesson with you.

Don’t use user-defined or built-in functions around columns in the WHERE clause of queries. It prevents SQL Server from estimating the right amount of rows out of the function. Write queries with operators and comparisons to make SQL Server better use the indexes it has.

With functions around columns in WHEREs

To prove this point, let’s query a local copy of the StackOverflow database. Yes, the StackOverflow we all know and use.

StackOverflow has a Users table that contains, well…, all registered users and their profiles. Among other things, every user has a display name, location, and reputation.

Let’s find the first 50 users by reputation in Colombia.

To make things faster, let’s create an index on the Location field. It’s an NVARCHAR(100) column.

CREATE INDEX Location ON dbo.Users(Location);

This is the query we often write,

DECLARE @Location NVARCHAR(20) = N'Colombia';

SELECT TOP 50 DisplayName, Location, CreationDate, Reputation
FROM dbo.Users
-- Often, we put LOWER on both sides of the comparison
WHERE LOWER(Location) = LOWER(@Location)
--    ^^^^^             ^^^^^
ORDER BY Reputation DESC;
GO

Did you notice the LOWER function on both sides of the equal sign?

We all have written queries like that one. I declared myself guilty too. Often, we use LOWER and UPPER or wrap the column around RTRIM and LTRIM.

But, let’s see what happened in the Execution Plan.

Execution plan of finding the first 50 StackOverflow users in Colombia
First 50 StackOverflow users in Colombia by reputation

Here, SQL Server chose to scan the index Location first. And let’s notice the width of the arrow coming out of the first operator. When we place the cursor on it, it shows “Number of Rows Read.”

Rows read by Index Scan when finding StackOveflow users
Rows read by Index Scan on Location index

In this copy of the StackOverflow database, there are 2,465,713 users, only 463 of them living in Colombia. SQL Server had to read the whole content of the index to execute our query. Arrrggg!

It means that to find all users in Colombia, SQL Server had to go through all users in the index. We could use that index in a better way.

Write queries with comparisons, functions, and operators around parameters. This way SQL Server could properly use indexes and have better estimates of the contents of tables. But, don’t write functions around columns in the WHERE clauses.

The same is true when joining tables. Let’s not put functions around the foreign keys in our JOINs either.

Rewrite your queries to avoid functions around columns in WHERE

Let’s go back and rewrite our query without any functions wrapping columns. This way,

DECLARE @Location NVARCHAR(20) = N'Colombia';

SELECT TOP 50 DisplayName, Location, CreationDate, Reputation
FROM dbo.Users
-- We remove LOWER on both sides
WHERE Location = @Location
ORDER BY Reputation DESC;
GO

And, let’s check the execution plan again.

First 50 StackOverflow users in Colombia
Again, first 50 StackOverflow users in Colombia by reputation

This time, SQL Server used an Index Seek. It means SQL Server didn’t have to read the whole content of the Location index to run our query. And the execution plan didn’t go parallel. We don’t have the black arrows in yellow circles in the operators.

Let’s notice that this time we have a thinner arrow on the first operator. Let’s see how many rows SQL Server read this time.

Rows read by Index Seek on Location index
Rows read by Index Seek on Location index

After that change, SQL Server only read 463 records. That was way better than reading the whole index.

Voilà! If we want to write faster queries, let’s stop using functions around columns in our WHEREs. That screws SQL Server estimates. For example, let’s not use LOWER or UPPER around our columns. By the way, SQL Server string searches are case insensitive by default, we don’t need those functions at all.

By the way, this is the same mistake we make when comparing DateTime in our SQL queries.

To read more SQL Server content, check what implicit conversions are and why you should care, how to optimize queries with GROUP BY, and just listen to SQL Server index recommendations.

Happy SQL time!