Thursday, June 11, 2015

Looping Through a Result Set with the ForEach Loop

In today's blog post, I am going to show you another use for the ForEach Loop container in SSIS. Most of the time, this container is used for looping through a group of files; however, there are many other uses for it.

Let's say you have a need to take some rows from a source and aggregate some values. Normally, something like that is done in the data flow, but it can also be done in the control flow.

First, create five variables, paying close attention to the data types. The ones used throughout this post are objProductList, which needs the Object data type so it can hold a full result set, and intStock, intListPrice, and intTotalValue, which need numeric types suited to the stock and list price values they will hold.

In the control flow of an SSIS package, add an Execute SQL Task and open the editor. Set the 'Result Set' property to Full result set. Create an OLE DB connection to the AdventureWorks2012 database and add the following T-SQL statement.

SELECT Name, SafetyStockLevel, ListPrice
From Production.Product
Where ListPrice < 20
AND ListPrice > 0


This query just pulls a small set of data from the Product table.
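
As a side note, the total the loop is about to calculate can also be produced with a single set-based query, which is a handy way to check the package's result afterward. This isn't part of the package, just a sanity check against the same filter.

SELECT SUM(SafetyStockLevel * ListPrice) AS TotalValue
From Production.Product
Where ListPrice < 20
AND ListPrice > 0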

Click the 'Result Set' option to the left. Set the Result Name to 0 and for the Variable Name, select objProductList. This variable will hold the results returned by the query. Click OK to close the editor.

Next, add a ForEach Loop container and connect the Execute SQL task to it. Open the editor and click Collection. In the enumerator drop down menu, select Foreach ADO Enumerator. In the ADO Object Source Variable drop down menu, select objProductList, and for the Enumeration Mode, select Rows in the first table.

Open Variable Mappings. The mappings are positional, so the index for each variable needs to match the position of its column in the query. In the drop down menu for the first variable, select intStock (which will hold SafetyStockLevel), then for the next one, select intListPrice (for ListPrice). Click OK to close the editor.

Add an expression task inside the ForEach Loop. In the expression editor, use the following expression...

@[User::intTotalValue] = @[User::intStock] * @[User::intListPrice] + @[User::intTotalValue]

This expression multiplies the stock level of each item by its list price and adds the result to the running value of the intTotalValue variable. Click OK to close the editor.

To test this out, right click the ForEach Loop and select 'Edit Breakpoints'. Select the last option: Break at the beginning of every iteration of the loop.

Now, execute the package. The package will stop and a watch window should open at the bottom of the screen. If it doesn't, click Debug > Windows > Watch 1. In the watch window, add the names of the variables you want to monitor. Keep in mind that the variable names are case sensitive here.

Each time execution is continued, the variable values will change and the value of intTotalValue will increase. The interesting part is that you won't see the final value of intTotalValue, because the breakpoint fires at the beginning of each iteration, so execution never pauses after the last pass through the loop. I actually drove myself insane trying to find out what was wrong with the expression, but there is nothing wrong at all. If there is a need to check the final value of the variable, a Script Task can be used to display it.

In the Script task editor, select intTotalValue as the ReadOnlyVariable. Then, click Edit Script.

In the script editor, use the following script.

MessageBox.Show(Dts.Variables["User::intTotalValue"].Value.ToString());


Close the script editor and then click OK to close the script task editor. Remove the breakpoint from the ForEach loop and execute the package. You should see a message box that displays the final value of intTotalValue.

And there you have it. Again, this is something that is normally done in the data flow, but in case you ever encounter a need to do it at the control flow level, you know that it can be done.

Thursday, June 4, 2015

Scheduling Doc xPress with Windows Task Scheduler

A few weeks ago, I posted a tutorial on scheduling a Doc xPress snapshot with a SQL agent job. Today, I will show how this is done with the Windows task scheduler.

For the most part, the steps are similar. The first thing to do is to create the XML request file. From inside a Doc xPress solution, click the button that reads 'Create Snapshot Request File'.


This will prompt you for a name and location to which to save the request file. The snapshot request file contains all the information about your solution. I have renamed mine and saved it to a location on my C: drive.

Now, I can create and schedule a task. Open Windows Task Scheduler and select the option to create a new task. On the general tab, give the task a name. Add a desired description and choose the appropriate security options. I have set mine to run whether or not I am logged in, with the highest privileges and configured for my current OS.

Next, click on the 'Triggers' tab. Here, the task can be put on a schedule. The frequency of this task really depends on the environment and how often it changes. I would set it up to run weekly, just to start.

Then, open the 'Actions' tab and click New. Make sure that the selected option in the Action drop down menu is 'Start a program'. In the 'Program/script' field, enter the file path of the Doc xPress executable, surrounded in double quotes, followed by the full path and file name of the snapshot request file, also surrounded by double quotes. The command should look like this...

"C:\Program Files (x86)\Pragmatic Works\Pragmatic Workbench\DocxPress.exe" "C:\Snapshots\LocalEnvironment.xml"

So the Action window should look like this...

Click OK to close the action window. Click OK again to close out the Task Editor. That's all there is to it. As you can see, it's very similar to using the SQL Agent.

Wednesday, May 27, 2015

Checksum vs Hashbytes

Have you ever had to write an incredibly long UPDATE statement based on the values of multiple columns? Depending on the number of columns you're working with, it can be maddening. Instead of wasting so much of your time typing out all those column names, you can use either a Checksum or a Hashbytes value. There are pros and cons to each, which I will discuss and demonstrate here.

Checksum

The Checksum function in SQL Server returns a numeric value based on the values of the fields in a row. You can use the checksum value as a means of comparison when detecting changes in a row. Let's say, for example, you have 20 different columns that could change values. If a value changes, then the Checksum value for that row will change as well. Generating the Checksum is simple. In the example below, I am returning rows from the AdventureWorks2012 database, the Production.Product table to be exact. I am only returning a few columns. The query is as follows.


Select ProductID,
ProductNumber,
Name,
Color,
ListPrice,
ProductLine,
Checksum(ProductID, ProductNumber, Name, Color, ListPrice, ProductLine) AS ChecksumValue
From Production.Product


Notice I am using the Checksum function with all the columns that I would like to include in the checksum enclosed in parentheses. I don't have to use all the columns if I don't want to. Here are the results of the query.

With the checksum, you can compare the value of just one field when looking for changes. If any of the values of this row were to change, then so too would the Checksum value.
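
To tie this back to the long UPDATE statement from the intro, here is a rough sketch of how that comparison might drive an update of only the rows that changed. The staging and target table names are made up for the example; the point is that one checksum comparison stands in for comparing every column individually.

UPDATE t
SET t.Name = s.Name,
    t.Color = s.Color,
    t.ListPrice = s.ListPrice,
    t.ProductLine = s.ProductLine
FROM dbo.ProductTarget AS t
JOIN dbo.ProductStage AS s
    ON s.ProductID = t.ProductID
WHERE CHECKSUM(s.Name, s.Color, s.ListPrice, s.ProductLine)
   <> CHECKSUM(t.Name, t.Color, t.ListPrice, t.ProductLine)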

Checksum is nice in that it can be used with any data type and doesn't care about NULLs. There is a downside, however. Checksum only produces a 4-byte integer value, so while small data sets will more than likely produce unique values, the chance of two different rows sharing the same checksum grows as the data set grows.

Hashbytes

The Hashbytes function returns a hash value of the specified column values. When using the Hashbytes function, you can specify a hash type: MD2, MD4, MD5, SHA, SHA1, or SHA2. If I take the above example code and use Hashbytes instead of Checksum, the query would be written as follows.


Select ProductID,
ProductNumber,
Name,
Color,
ListPrice,
ProductLine,
HASHBYTES('MD5', Name + color + ProductLine) as HashValue
From Production.Product


You should notice I am only using a few of the columns. I will explain that, but let's look at the results first.

The hash value is much longer and more complex than the Checksum value. There is a much smaller chance of having duplicate values with the hash. However, there is an issue. You can see in the image above that we have some NULL values. With Hashbytes, if there is a NULL in one or more columns, then the hash will be NULL as well. To avoid this, we would need to replace the NULLs.


Select ProductID,
       ProductNumber,
       Name,
       Color,
       ListPrice,
       ProductLine,
       HASHBYTES('MD5', Name + ISNULL(color, 'Unknown') + ISNULL(ProductLine, 'Unknown')) as HashValue  
From Production.Product


Also, notice that the columns being included in the hash are not separated by commas, but concatenated. Here are the new results.

As I mentioned before, I am only using certain columns, not all of them. The reason for this is that Hashbytes will only accept varchar, nvarchar, or varbinary input. If I wanted to use the other columns in this query as part of the hash, I would have to cast them.
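
For example, if I wanted to bring ListPrice and ProductID into the hash, casts like the ones below would do it. This is just a sketch of the idea, not a query used elsewhere in this post.

Select ProductID,
       HASHBYTES('MD5', Name
                 + ISNULL(Color, 'Unknown')
                 + ISNULL(ProductLine, 'Unknown')
                 + CAST(ListPrice AS varchar(20))
                 + CAST(ProductID AS varchar(10))) as HashValue
From Production.Product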

In the end, Hashbytes requires a little more code, but it does have an advantage over Checksum.

Friday, May 22, 2015

Site Updates

I have started up a YouTube channel to which I will be posting video tutorials to go along with some of my blog posts. You can view the channel here. I also added a new page in which the videos will be embedded. So enjoy the material here, on YouTube, wherever. Hopefully, I will be able to fill it with useful content on a regular basis. Feel free to leave me some suggestions in the comments section.


Monday, May 18, 2015

SSIS Performance Tip: RetainSameConnection Property

In today's post, I am going to talk about a connection property that will yield better execution times in certain scenarios: a not so elusive property called RetainSameConnection. This property can be set on a connection manager to prevent the connection from closing each time an SSIS task finishes an insert or update that uses it. Why would this be important? Well, by default, this property is set to false. Each time a package inserts rows or makes any updates to a table, the connection is opened, the inserts/updates take place, and then the connection is closed. If that only happens once in a package, the default setting is fine and there wouldn't be any reason to change it. However, if multiple actions are taking place against a table, it may be a good idea to set this property to true.

For example, let's say I have a package that uses a ForEach Loop to loop through a folder, read the file names and log those file names into a table. For every iteration of the loop, the connection will be opened and then closed. In a case like that, I can set the RetainSameConnection property to true in order to save some time.
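
To make that concrete, the Execute SQL Task inside such a loop might run nothing more than a small parameterized insert like the sketch below (the dbo.FileLog table and its columns are made up for the example, and the ? marker would be mapped to the file name variable). With RetainSameConnection left at false, the connection is opened and closed around every single one of these inserts.

INSERT INTO dbo.FileLog (FileName, LoadDate)
VALUES (?, GETDATE())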

Another scenario in which this helps is a package that contains an OLEDB Command. An OLEDB Command should be avoided when possible since it is absolutely abysmal for performance. Think for a moment about what that transform actually does: it takes a row of data, opens a connection, runs a SQL statement against it, and then closes the connection, and it does this for every row that is sent through it. When stuck in a situation in which you have to use it, set the RetainSameConnection property to true. That will yield a rather significant performance gain.
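
To picture what that looks like, the statement inside an OLEDB Command is typically a parameterized, single-row statement along these lines (the table and columns here are made up), executed once for every row flowing through the transform.

UPDATE dbo.Customer
SET EmailAddress = ?
WHERE CustomerID = ?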

Where is this property? Well, inside of SSDT/BIDS, select a connection manager.



Open the properties pane by pressing F4 or locating the tab to the lower right. The property in question can be found toward the bottom.

Open the drop down and set it to true. This should reduce the execution times of any package using a ForEach loop, or an OLEDB Command.

For more information on using this property with ForEach Loops, stay tuned for an upcoming video tutorial that I will be posting this week.


Tuesday, May 12, 2015

SQL Saturday Jacksonville Followup

Last Saturday, the 9th, was SQL Saturday #391 in Jacksonville, FL. I'm glad to see it had an awesome turnout of over 500 attendees. Manuel Quintana and I co-presented a session on SSIS Performance Tuning. This is a session I have done a few times before, so this time I wanted to spice it up a bit and make it interesting. We did the entire presentation wearing superhero onesie pajamas. I was Batman (no surprise there) and Manuel was Superman. It turned out to be a success. I'm not the kind of person to pack a presentation full of ridiculous shtick, but let's be real, SSIS can be a very boring topic. All in all, the presentation went well and was definitely memorable. If you would like to download the PowerPoint slides, you can get them here. For those of you reading this that did attend, thank you very much. We definitely appreciate the people that show up to these events and are able to learn something valuable.

If you missed out on SQL Saturday, that is unfortunate. I encourage you to check the SQL Saturday site and look for the next event in your area. You really can't beat a free training event!


This just screams 'Professionalism'!

Tuesday, May 5, 2015

Scheduling Doc xPress with a SQL Agent Job

Since I work for Pragmatic Works, I use Doc xPress a lot (and you should, too!). One of the many handy features is the ability to run snapshots as a scheduled job. There are two different ways to do this: as a SQL Agent job, or as a scheduled task. This post will serve as a tutorial for setting up the SQL Agent job.

The first thing that must be done is the creation of a snapshot request file. In a Doc xPress solution, you will see the option to create one in the lower right corner. The snapshot request file is an XML file that contains all the information about the Doc xPress solution.

You will also see an option to create a documentation request file, which is used for automating the documentation process. Click the snapshot request file button. Browse to a location in which to store the file and give it a different name, if desired. In this example, I am giving the file the same name as the solution and storing it in a folder on my C: drive.

Now that the file has been created, the next step is to create the job. Open SSMS and make sure the SQL Server Agent is running. Right click 'Jobs' and select 'New Job'. Name the job and don't worry about a category; it isn't needed here. Who likes being pigeonholed into groups anyway?

To the left of the window, select 'Steps'. Click the button at the bottom of the window that reads 'New'. Give the step a name, and then, under the 'Type' drop down menu, select 'Operating system (CmdExec)'. Next, add the command. There are two parts that are needed: the path to Docxpress.exe, and the path to the request file. Using the default installation directory of Doc xPress and the path to the file that was just created, the command will look like this (including the quotes!)...

"C:\Program Files (x86)\Pragmatic Works\Pragmatic Workbench\Docxpress.exe" "C:\Snapshots\LocalEnvironment.xml"

Now, I can't stress this enough, be sure to include the quotes as seen in the command above.

The finished step should look something like this...

Click OK. Before I set up a schedule, I want to run the job and be sure it works. Click OK to save the job. In the Object Explorer in SSMS, right click the job and select 'Start job at step'. This will run the job and create a new snapshot for the Doc xPress solution. Once it runs successfully, then it can be scheduled. This way, I can automate my late night snapshots so I have time to do other things.
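
As a side note, if you would rather script the job than click through the New Job dialog, the same thing can be put together with the msdb stored procedures. This is only a sketch; the job and step names are just examples.

USE msdb;

EXEC dbo.sp_add_job @job_name = N'Doc xPress Snapshot';

EXEC dbo.sp_add_jobstep
     @job_name = N'Doc xPress Snapshot',
     @step_name = N'Run snapshot',
     @subsystem = N'CmdExec',
     @command = N'"C:\Program Files (x86)\Pragmatic Works\Pragmatic Workbench\Docxpress.exe" "C:\Snapshots\LocalEnvironment.xml"';

EXEC dbo.sp_add_jobserver @job_name = N'Doc xPress Snapshot';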

Again, this is just one way of scheduling. I will follow up with another post showing this same process with the Windows Task Scheduler.


Wednesday, April 15, 2015

A Followup Power Query Tutorial

I have decided to redo my example from earlier today using a slightly different query method and data that is a little friendlier. This time I will be looking at sales data from AdventureWorks. Just like before, this will be a simple report built with Power Query and presented in Power View. I am going to create a report that displays sales amounts from different countries over a period of years.

First, launch Excel and, in a blank workbook, select Power Query. Click the 'From Database' option and then select 'From SQL Server Database'.

In the connection window, enter a server name and database name. For this example, I am pointing to my local server and the AdventureWorksDW2012 database. In the previous post from today, I used a query, but I'm not going to use one this time.

When I click OK, the Navigator window opens to the right and displays all the views, tables, and functions in the database. This is where I can select which object I would like to pull from. It is possible to select multiple tables, but I only need to select one in this case. In the navigator pane, right click the DimReseller table and select Edit. This will open the query editor.

In the editor, all the columns of the DimReseller table are displayed. I don't need all of these, so I will remove some of them later. Scroll all the way to the right to see columns from other tables that are related to this one. Notice the column headers display the names of those tables, and there is an 'Expand' icon next to the table names.

Click the Expand icon next to DimGeography. This will open a filter menu of all the columns in the DimGeography table. Uncheck the 'Select All Columns' option and select 'EnglishCountryRegionName' and then click OK.

This will add the Country column to the data displayed in the query editor. Now, expand the FactResellerSales header. Uncheck the 'Select All Columns' option and then select SalesAmount and OrderDate. Click OK. Now that I have the columns I am interested in, I need to remove columns I don't want. Basically, all I need to do is right click the headers of the columns I don't want and choose the option to remove them. There are a lot I don't need.
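
For reference, the result set these steps build is roughly what the T-SQL below would return. This is just to show the equivalent join; everything in this post stays in Power Query.

SELECT r.ResellerName,
       g.EnglishCountryRegionName,
       f.SalesAmount,
       f.OrderDate
FROM dbo.DimReseller AS r
JOIN dbo.DimGeography AS g
    ON g.GeographyKey = r.GeographyKey
JOIN dbo.FactResellerSales AS f
    ON f.ResellerKey = r.ResellerKey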

Remove columns until you are left with ResellerName, EnglishCountryRegionName, SalesAmount, and OrderDate. I also kept the address columns, even though I may not need them, in case I want to play with different options in Power View. Click 'Close and Load' in the upper left corner of the Query Editor ribbon. This will load the data into Excel.

There are a few more little tweaks I want to make to this. First, rename the columns so they make more sense. Since I will be looking at data by year, I only need the year part of my OrderDate column. There are different ways to achieve this, but I am going with fast and easy: an Excel function. Click the empty cell at the end of the first row of data, next to the first value of the OrderDate column. Next to the formula bar, click on the function button. This opens a list of functions. In the category drop down menu, select 'Date & Time' and then, in the list of functions, select 'Year'.

Click OK. Now it asks for a serial number. All I have to do here is type in the cell reference of the first value in the column I want to convert. For example, the first value of the Order Date column is in cell H2, so I will type that in as the serial number.

Click OK. Excel builds the formula (=YEAR(H2) in this case), takes the year part of the date, and displays it in a new column. Rename the column as Order Year.

Click on the Insert tab in the ribbon and select Power View. A generic table is displayed.

You can choose different styles of charts in the ribbon. The style you choose will be influenced by the type of data you would like to display. For this example, I will choose a clustered bar chart. When I select that kind of chart, Power View takes a guess as to what should be displayed and how. I will be moving some items around here. I need to clear out the Values, Axis, Legend, and Vertical Multiples fields, because they are all wrong, as you can see below.

After clearing them out, I add Sales Amount to the values field (make sure it is summed), Year to axis, and Country to legend. It should look something like this...

The end result is a clustered bar chart that shows total sales by country and shows how the sales have increased.

This is just a simple example of pulling in data through Power Query and then visualizing it with Power View.

Creating a Simple Report with Power Query and Power View

In this tutorial, I will show you how to pull data into Excel using Power Query and then put together a simple report with Power View. I have found that you can get rather elaborate with these tools, but today I am focusing on something simple to start with. Since the month of April is Alcohol Awareness Month, I want to put together a little report showing how many people we have lost in the US to alcohol related ailments. I will be using a database that contains mortality data provided by the World Health Organization. If you would like to download a copy of the database, I have made it available here. So, let's get right into it.

Inside Excel, in a blank workbook, click the Power Query tab in the ribbon. Power Query can read data from multiple sources, including the web, Azure, IBM DB2, and many others. I am going to use a SQL Server database. Over on the left side of the ribbon, select 'From Database' and then select 'From SQL Server Database' in the drop down menu.

This will open a window asking for a database connection. All that needs to be provided is the server name. I could also provide the database name and a SQL statement. Normally, I would just add the server name and the database name. Power Query is mainly used for exploring and transforming data. However, the data I will be using is a lot like a random data dump, and I have run into some viewing limitations in Power Query that make filtering a little problematic in this example. So, I will provide a SQL query that returns a small portion of what I am looking for.
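
Roughly speaking, the query boils down to something like the sketch below. The table and column names here are simplified stand-ins for illustration only; the actual database you download will name things differently.

SELECT Country, Year, Sex, Cause, DeathsAllAges -- plus the individual age group columns
FROM dbo.Mortality
WHERE Cause LIKE '%alcohol%'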

Remember: Normally, all I would add is the server and database names.

When I click OK, it loads the results into a query editor. Here, I can filter values, move columns around, and perform some data transformations.

I don't have to do much to this data, since I used a query that did most of the work, but I do need to filter some things out. First, I will filter out all the other countries. When I click the drop down arrow in the Country column, it opens a menu for sorting and filtering.

Notice towards the bottom of the menu the warning about an incomplete list. Only a small sample is shown. When I click the option 'Load More', then I can see the full list of countries. Keep in mind that this window has a row limit of 1000.

For this example, I will deselect the 'Select All' option and choose United States of America at the bottom of the list. When I click OK, the data is refreshed.

Next, I will do the same with the Year column. I only want to look at years 1999 through 2010, which is the latest year that I have. Then, I will filter the Cause column. The query I used contains only causes of death that involve alcohol. If I had only pointed to the database and selected the tables I needed, this list of causes would not be complete; the database I am working with has thousands of causes of death, and the filter would leave out most of the ones I want. As you can see in the screenshot below, WHO has a tendency to get incredibly detailed with its causes of death.

For this example, I only want Alcohol Abuse, Alcoholic cirrhosis of the liver, and Alcoholic Induced Chronic Pancreatitis. After choosing those items, I click the 'Close and Load' option in the upper left corner and the data is loaded into Excel.




To recap what I have so far: I am looking at the years 1999 through 2010 for the US, with deaths broken out by age group as well as totaled all together, divided by male and female, covering three different alcohol related causes of death.

In the ribbon, select the Insert tab and click 'Power View'. The result is a small, default table with a few fields. In my case, it's summing everything, even the values of the Year column.



The Design tab in the ribbon should be open. If it isn't, click on the table and that should do it. The first thing I want to do is make sure the year isn't being summed. To the far right is the Power View Fields pane. Towards the bottom right, I click the drop down arrow next to the Year field and select 'Do not summarize'.



I can visualize this data in a number of ways at this point. In the query I used, I returned the numbers of deaths for multiple age groups. In this Power View sheet, I will only look at the total number of deaths. So, in the Power View Fields pane to the right, at the top, I will uncheck all the age groups and make sure I am looking at DeathsAllAges. I could also look at deaths by gender, but for now, I will uncheck the gender field as well.



Now, I just have to choose a style of chart. You may find that the chart you choose is normally influenced by the kind of data you are working with. When working in Power View, it's a good idea to try out a few different types of charts and tweak them to your liking. For this example, I will choose a clustered bar chart. The chart is displayed and I can see the data, but not exactly the way I want it. I do have to move a few things around in the Power View Fields pane. See the below screen shot for the final arrangement.




I end up with a simple bar chart showing the numbers of the cause of death for each year from 1999 to 2010. I can scroll through to see how the numbers increase or decrease.




What does this data tell us? Well, it appears that alcoholic cirrhosis of the liver is the main cause of death when looking at deaths related to alcohol. Keep in mind, this isn't looking at traffic accidents, but at alcoholism and alcohol addiction. As I scroll through the chart, I can see how the rate of cirrhosis of the liver maintains a constant increase, while pancreatitis has decreased and general alcohol abuse slightly increases. What would be even more interesting is if I could get data for 2011 through 2014 and see how alcoholism was affected by the recession. Also, keep in mind that there may be discrepancies in this database, as I have found WHO doesn't seem to label things perfectly. There are still a little over 400,000 rows missing from the database because of causes of death that don't have label codes yet under the new ICD10 standards.