Wednesday, May 27, 2015

Checksum vs Hashbytes

Have you ever had to write an incredibly long UPDATE statement based on the values of multiple columns? Depending on the amount of columns you're working with, it can be maddening. Instead of wasting so much of your time on typing out all those column names, you can use either a Checksum, or Hashbyte value. Their are pros and cons to each, and I will discuss and demonstrate here.

Checksum

The Checksum function in SQL Server will return a numeric value based on the values of fields in a row. You can use the checksum value as a means of comparison when detecting changes in a row. Let's say, for example, you have 20 different columns that could change values. If a value changes, then the Checksum value for that row will change as well. Generating the Checksum is simple. In the example below, I am returning rows from the AdventureWorks2012 database. The Production.Product table to be exact. I am only returning a few columns. The query is as follows.


Select ProductID,
ProductNumber,
Name,
Color,
ListPrice,
ProductLine,
Checksum(ProductID, ProductNumber, Name, Color, ListPrice, ProductLine) AS ChecksumValue
From Production.Product


Notice I am using the Checksum function with all the columns that would like to include in the checksum enclosed in parenthesis. I don't have to use all the columns if I don't want to. Here are the results of the query.










With the checksum, you can compare the value of just one field when looking for changes. If any of the values of this row were to change, then so too would the Checksum value.

Checksum is nice in that it can be used with any data type and doesn't care about NULL's. There is a downside, however. When working with small data sets, the checksum will more than likely always be a unique value. As the data sets grow, there is a chance that the values won't always be unique.

Hashbytes

The Hashbytes function returns a hash value of the specified column values. When using the Hashbytes function, you can specify a hash type. You can choose MD2, MD4, MD5, SHA, SHA1, or SHA2. If I take the above example code and use Hashbytes instead of Checksum, the query would be written as follows.


Select ProductID,
ProductNumber,
Name,
Color,
ListPrice,
ProductLine,
HASHBYTES('MD5', Name + color + ProductLine) as HashValue
From Production.Product


You should notice I am only using a few of the columns. I will explain that, but lets look at the results first.










The hash value is much longer and more complex than the Checksum value. There is a much smaller chance of having duplicate values with the hash. However, there is an issue. You can see in the image above that we have some NULL values. With Hashbytes, if there is a NULL in one or more columns, then the hash will be NULL as well. To avoid this, we would need to replace the NULLs.


Select ProductID,
       ProductNumber,
       Name,
       Color,
       ListPrice,
       ProductLine,
       HASHBYTES('MD5', Name + ISNULL(color, 'Unknown') + ISNULL(ProductLine, 'Unknown')) as HashValue  
From Production.Product


Also, notice that the columns being included in the hash are not separated by commas, but concatenated. Here are the new results.









As I mentioned before, I am only using certain columns, not all of them. The reason for this is that hashbytes will only work with varchar, nvarchar, or varbinary data types. If I want to use the other columns in this query as part of the hash, I would have to cast them.

In the end, Hashbytes requires a little more code, but it does have an advantage over Checksum.

0 comments:

Post a Comment