Wednesday, May 27, 2015

Checksum vs Hashbytes

Have you ever had to write an incredibly long UPDATE statement based on the values of multiple columns? Depending on the number of columns you're working with, it can be maddening. Instead of wasting so much of your time typing out comparisons for all those column names, you can use either a Checksum or a Hashbytes value. There are pros and cons to each, which I will discuss and demonstrate here.

Checksum

The Checksum function in SQL Server returns a numeric value based on the values of fields in a row. You can use the checksum value as a means of comparison when detecting changes in a row. Let's say, for example, you have 20 different columns that could change values. If a value changes, then the Checksum value for that row changes as well. Generating the Checksum is simple. In the example below, I am returning a few columns from the AdventureWorks2012 database. The Production.Product table, to be exact. The query is as follows.


Select ProductID,
       ProductNumber,
       Name,
       Color,
       ListPrice,
       ProductLine,
       Checksum(ProductID, ProductNumber, Name, Color, ListPrice, ProductLine) AS ChecksumValue
From Production.Product


Notice I am using the Checksum function with all the columns that I would like to include in the checksum enclosed in parentheses. I don't have to use all the columns if I don't want to. Here are the results of the query.










With the checksum, you can compare the value of just one field when looking for changes. If any of the values of this row were to change, then so too would the Checksum value.

Checksum is nice in that it can be used with any data type and doesn't care about NULLs. There is a downside, however. Checksum produces only a 32-bit integer, so two different rows can produce the same value. With small data sets, a collision is unlikely and the checksum will more than likely be unique. As the data sets grow, the chance that two distinct rows share a checksum value increases.
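Tying this back to the long UPDATE statement from the opening paragraph, here is a sketch of how the checksum comparison might look. This assumes a hypothetical staging table, Staging.Product, with the same columns as the target; the staging table name is my own example, not part of AdventureWorks.

```sql
-- Update only the rows whose tracked columns actually changed,
-- comparing one checksum per side instead of every column individually.
UPDATE p
SET    p.Name        = s.Name,
       p.Color       = s.Color,
       p.ListPrice   = s.ListPrice,
       p.ProductLine = s.ProductLine
FROM   Production.Product AS p
JOIN   Staging.Product AS s
       ON s.ProductID = p.ProductID
WHERE  CHECKSUM(s.Name, s.Color, s.ListPrice, s.ProductLine)
    <> CHECKSUM(p.Name, p.Color, p.ListPrice, p.ProductLine);
```

One WHERE clause replaces a wall of `p.Name <> s.Name OR p.Color <> s.Color OR ...` comparisons, NULL handling included.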

Hashbytes

The Hashbytes function returns a hash value of the specified column values. When using the Hashbytes function, you can specify a hash type. You can choose MD2, MD4, MD5, SHA, SHA1, SHA2_256, or SHA2_512. If I take the above example code and use Hashbytes instead of Checksum, the query would be written as follows.


Select ProductID,
       ProductNumber,
       Name,
       Color,
       ListPrice,
       ProductLine,
       HASHBYTES('MD5', Name + Color + ProductLine) as HashValue
From Production.Product


You should notice I am only using a few of the columns. I will explain why shortly, but let's look at the results first.










The hash value is much longer and more complex than the Checksum value. There is a much smaller chance of having duplicate values with the hash. However, there is an issue. You can see in the image above that we have some NULL values. With Hashbytes, if there is a NULL in one or more columns, then the hash will be NULL as well. To avoid this, we would need to replace the NULLs.


Select ProductID,
       ProductNumber,
       Name,
       Color,
       ListPrice,
       ProductLine,
       HASHBYTES('MD5', Name + ISNULL(Color, 'Unknown') + ISNULL(ProductLine, 'Unknown')) as HashValue
From Production.Product


Also, notice that the columns being included in the hash are not separated by commas, but concatenated. Here are the new results.









As I mentioned before, I am only using certain columns, not all of them. The reason for this is that Hashbytes only works with varchar, nvarchar, or varbinary data types. If I want to use the other columns in this query as part of the hash, I would have to cast or convert them to one of those types first.
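Here is a sketch of what that conversion might look like, bringing ProductID and ListPrice into the hash. The `'|'` separator is my own addition; without a delimiter, different column values can concatenate to the same string (for example, 'AB' + 'C' and 'A' + 'BC'), and the varchar lengths are just reasonable guesses for these columns.

```sql
-- Convert the non-character columns before concatenating them into the hash.
SELECT ProductID,
       HASHBYTES('MD5',
           Name
           + '|' + ISNULL(Color, 'Unknown')
           + '|' + ISNULL(ProductLine, 'Unknown')
           + '|' + CONVERT(varchar(20), ProductID)
           + '|' + CONVERT(varchar(30), ListPrice)
       ) AS HashValue
FROM Production.Product;
```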

In the end, Hashbytes requires a little more code, but it has a clear advantage over Checksum: a far smaller chance of collisions.

Friday, May 22, 2015

Site Updates

I have started up a YouTube channel to which I will be posting video tutorials to go along with some of my blog posts. You can view the channel here. I also added a new page in which the videos will be embedded. So enjoy the material here, on YouTube, wherever. Hopefully, I will be able to fill it with useful content on a regular basis. Feel free to leave me some suggestions in the comments section.


Monday, May 18, 2015

SSIS Performance Tip: RetainSameConnection Property

In today's post, I am going to talk about a connection property that will yield a better execution time in certain scenarios. A not so elusive property called RetainSameConnection. This is a property that can be set for the connection managers that will prevent them from closing after an SSIS task has completed an insert or update using that connection. Why would this be important? Well, by default, this property is set to false. Each time a package inserts rows or makes any updates to the table, the connection is opened, the inserts/updates take place, and then the connection is closed. In that kind of scenario, the default setting is fine. There wouldn't be any reason to change it. However, if multiple actions are taking place against a table, it may be a good idea to set this property to true.

For example, let's say I have a package that uses a ForEach Loop to loop through a folder, read the file names and log those file names into a table. For every iteration of the loop, the connection will be opened and then closed. In a case like that, I can set the RetainSameConnection property to true in order to save some time.

Another scenario in which this property helps is a package that contains an OLEDB Command. An OLEDB Command should be avoided when possible, since it is absolutely abysmal for performance. Think for a moment about what that transform actually does: it takes a row of data, opens a connection, runs a SQL statement against it, and then closes the connection. It does this for every row that is sent through it. When stuck in a situation in which you have to use it, set the RetainSameConnection property to true. That can yield a rather significant performance gain.

Where is this property? Well, inside of SSDT/BIDS, select a connection manager.



Open the properties pane by pressing F4 or locating the tab to the lower right. The property in question can be found toward the bottom.

























Open the drop down and set it to true. This should reduce the execution times of any package using a ForEach loop, or an OLEDB Command.

For more information on using this property with ForEach Loops, stay tuned for an upcoming video tutorial that I will be posting this week.


Tuesday, May 12, 2015

SQL Saturday Jacksonville Followup

Last Saturday, the 9th, was SQL Saturday #391 in Jacksonville, FL. I'm glad to see it had an awesome turnout of over 500 attendees. Manuel Quintana and I co-presented a session on SSIS Performance Tuning. This is a session I have done a few times before, so this time I wanted to spice it up a bit and make it interesting. We did the entire presentation wearing superhero onesie pajamas. I was Batman (no surprise there) and Manuel was Superman. It turned out to be a success. I'm not the kind of person to pack a presentation full of ridiculous shtick, but let's be real, SSIS can be a very boring topic. All in all, the presentation went well and was definitely memorable. If you would like to download the PowerPoint slides, you can get them here. For those of you reading this that did attend, thank you very much. We definitely appreciate the people that show up to these events and are able to learn something valuable.

If you missed out on SQL Saturday, that is unfortunate. I encourage you to check the SQL Saturday site and look for the next event in your area. You really can't beat a free training event!


This just screams 'Professionalism'!

Tuesday, May 5, 2015

Scheduling Doc xPress with a SQL Agent Job

Since I work for Pragmatic Works, I use Doc xPress a lot (and you should, too!). One of the many handy features is the ability to run snapshots as a scheduled job. There are two different ways to do this: as a SQL Agent job, or as a scheduled task. This post will serve as a tutorial for setting up the SQL Agent job.

The first thing that must be done is the creation of a snapshot request file. In a Doc xPress solution, you will see the option to create one in the lower right corner. The snapshot request file is an XML file that contains all the information about the Doc xPress solution.











You will also see an option to create a documentation request file, which is used for automating the documentation process. Click the snapshot request file button, browse to a location in which to store the file, and give it a different name, if desired. In this example, I am giving the file the same name as the solution and storing it in a folder on my C: drive.















Now that the file has been created, the next step is to create the job. Open SSMS and make sure the SQL Server Agent is running. Right click 'Jobs' and select 'New Job'. Name the job and don't worry about a category. It isn't needed here. Who likes being pigeonholed into groups anyway?






















To the left of the window, select 'Steps'. Click the button at the bottom of the window that reads 'New'. Give the step a name, and then, under the 'Type' drop down menu, select 'Operating system (CmdExec)'. Next, add the command. There are two parts that are needed: the path to Docxpress.exe, and the path to the request file. When using the default installation directory of Doc xPress and the path to the file that was just created, the command will look like this (including the quotes!)...

"C:\Program Files (x86)\Pragmatic Works\Pragmatic Workbench\Docxpress.exe" "C:\Snapshots\LocalEnvironment.xml"

Now, I can't stress this enough, be sure to include the quotes as seen in the command above.

The finished step should look something like this...






















Click OK. Before I set up a schedule, I want to run the job and be sure it works. Click OK to save the job. In the Object Explorer in SSMS, right click the job and select 'Start job at step'. This will run the job and create a new snapshot for the Doc xPress solution. Once it runs successfully, then it can be scheduled. This way, I can automate my late night snapshots so I have time to do other things.
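For those who prefer scripting the job over clicking through the SSMS dialogs, here is a sketch of the same CmdExec job built with the msdb system stored procedures. The job, step, and schedule names are my own examples, and the schedule shown (nightly at 11 PM) is just an illustration.

```sql
USE msdb;
GO
-- Create the job itself.
EXEC dbo.sp_add_job @job_name = N'Doc xPress Snapshot';

-- Add the CmdExec step that runs Docxpress.exe against the request file.
EXEC dbo.sp_add_jobstep
     @job_name  = N'Doc xPress Snapshot',
     @step_name = N'Run snapshot',
     @subsystem = N'CmdExec',
     @command   = N'"C:\Program Files (x86)\Pragmatic Works\Pragmatic Workbench\Docxpress.exe" "C:\Snapshots\LocalEnvironment.xml"';

-- Create a nightly schedule and attach it to the job.
EXEC dbo.sp_add_schedule
     @schedule_name     = N'Nightly 11 PM',
     @freq_type         = 4,      -- daily
     @freq_interval     = 1,      -- every 1 day
     @active_start_time = 230000; -- 23:00:00
EXEC dbo.sp_attach_schedule
     @job_name      = N'Doc xPress Snapshot',
     @schedule_name = N'Nightly 11 PM';

-- Target the job at the local server so the Agent will actually run it.
EXEC dbo.sp_add_jobserver @job_name = N'Doc xPress Snapshot';
```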

Again, this is just one way of scheduling. I will follow up with another post showing this same process with the Windows Task Scheduler.