Wednesday, April 15, 2015

A Followup Power Query Tutorial

I have decided to redo my example from earlier today using a slightly different query method and data that is a little friendlier. This time I will be looking at sales data from AdventureWorks. Just like before, this will be a simple report with Power Query put into Power View. I am going to create a report that displays sales amounts from different countries over a period years.

First, launch excel and in a blank workbook, select Power Query. Click the 'From Database' option and then select 'From SQL Server Database'.

























In the connection window, enter a server name and database name. For this example, I am pointing to my local server and the AdventerWorksDW2012 database. In the previous post from today, I used a query, but I'm not going to use it this time.











When I click OK, the Navigator window opens to the right and displays all the views, tables, and functions in the database. This is where I can select which object I would like to pull from. It is possible to select multiple tables, but I only need to select one in this case. In the navigator pane, right click the DimReseller table and select Edit. This will open the query editor.

















In the editor, all the columns of the DimReseller table are displayed. I don't need all of these, so I will remove some of them later. Scroll all the way to the right to see columns from other tables that are related to this one. Notice the column header displays the name of the tables and their is an 'Expand' icon next to the table names.








Click the Expand icon next to DimGeography. This will open a filter menu of all the columns in the DimGeography table. Uncheck the 'Select All Columns' option and select 'EnglishCountryRegionName' and then click OK.

























This will add the Country column to the data displayed in the query editor. Now, expand the FactResellerSales header. Uncheck the 'Select All Columns' option and then select SalesAmount and OrderDate. Click OK. Now that I have the columns I am interested in, I need to remove columns I don't want. Basically, all I need to do is right click the headers of the columns I don't want and choose the option to remove them. There are a lot I don't need.

Remove columns until you are left with ResellerName, EnglishCountryRegionName, SalesAmount, and OrderDate. I also kept the address columns even though I may not need them, unless I want to play different options in Power View. Click 'Close and Load' in the upper left corner of the Query Editor ribbon. This will load the data into Excel.




















There are a few more little tweaks I want to make to this. First, rename the columns so the make more sense. Since I will be looking at data by year, I only need the year part of my OrderDate column. There are different ways to achieve this, but I am going with fast and easy. I will use an Excel function. Click the cell on the end, next to the first value of the OrderDate column. Next to the formula bar, click on the function button. This opens a list of functions. In the category drop down menu, select 'Date & Time' and then, in the list of functions, select 'Year'.






















Click OK. Now it asks for a serial number. All I have to do here is type in the first cell number of the column I want to alter. For example, the first value of the Order Date column is in cell H2. So, I will type that as the serial number.












Click OK. That will take the year part of the date and display it in a new column. Rename the column as Order Year.















Click on the Insert tab in the ribbon and select Power View. A generic table is displayed.
















You can choose different styles of charts in the ribbon. The style you choose will be influenced by the type of data you would like to display. For this example, I will choose a clustered bar chart. When I select that kind of chart, Power View takes a guess as what should be displayed and how. I will be moving some items around here. I need to clear out the Values, Axis, Legend, and Vertical Multiplier fields, because they are all wrong as you can see below.

























After clearing them out, I add Sales Amount to the values field (make sure it is summed), Year to axis, and Country to legend. It should look something like this...

























The end result is a clustered bar chart that shows total sales by country and shows how the sales have increased.

















This is just a simple example of pulling in data through Power Query and then visualizing it with Power View.

Creating a Simple Report with Power Query and Power View

In this tutorial, I will show you how to pull data into Excel using Power Query and then put together a simple report with Power View. I have found that you can get rather elaborate with these tools, but today I am focusing on something simple to start with. Since the month of April is Alcoholism Awareness Month, I want to put together a little report showing how many people we have lost in the US to alcohol related ailments. I will be using a database that contains mortality data provided by the World Health Organization. If you would like to download a copy of the database, I have made it available here. So, let's get right into it.

Inside Excel, in a blank workbook, click the Power Query tab in the ribbon. Power Query can read data from multiple sources including the web, Azure, IMDB2, and many others. I am going to use a SQL Server database. Over to the left side of the ribbon, select 'From Database' and select 'From SQL Server Database' in the drop down menu.


























This will open a window asking for a database connection. All that needs to be provided is the server name. I could also provide the database name and a SQL statement. Normally, I would just add the server name and the database name. Power Query is mainly used for exploring and transforming data. However, the data I will be using is a lot like a random data dump and I have run into some viewing limitations in Power Query that makes filtering a little problematic in this example. So, I will provide a SQL query that is going to return a small portion of what I am looking for.








Remember: Normally, all I would add is the server and database names.








When I click OK, it loads the results into a query editor. Here, I can filter values, move columns around, and perform some data transformations.














I don't have to do much to this data, since I used a query that did most of the work, but I do need to filter some things out. First, I will filter out all the other countries. When I click the drop down arrow in the Country column, it opens a menu for sorting and filtering.







Notice towards the bottom of the menu the warning about an incomplete list. Only a small sample is shown. When I click the option 'Load More', then I can see the full list of countries. Keep in mind that this window has a row limit of 1000.












For this example, I will deselect the 'Select All' option and choose United States of America at the bottom of the list. When I click OK, the data is refreshed.




















Next, I will do the same with the year column. I only want to look at years 1999 through 2010, which is the latest year that I have. Then, I will filter the Cause column. The query I used contains only causes of death that involve alcohol. If I had only pointed to the database and selected the tables I needed, then this list of columns would not be complete. The database I am working with has thousands of causes of death, and would leave out most of the columns I want. As you can see in he screenshot below, WHO has a tendency to get incredibly detailed on their causes of death.

























For this example, I only want Alcohol Abuse, Alcoholic cirrhosis of the liver, and Alcoholic Induced Chronic Pancreatitis. After choosing those items, I click the 'Close and Load' option in the upper left corner and the data is loaded into Excel.




To recap what I have so far, I am looking at the years 1999 through 2010 for the US, deaths across different age groups and all together, also divided by male and female, and this is covering three different types of alcohol related causes of death.

In the ribbon, select the Insert tab and click 'Power View'. The result is a small, default table with a few fields. In my case, it's summing everything, even the values of the Year column.



The Design tab in the ribbon should be opened. If it isn't, click on table and that should do it. The first thing I want to do is make sure the year isn't being summed. To the far right, there is the Power View Fields pane. Towards the bottom right, I click the drop down arrow next to the Year fields and select 'Do not summarize'.



I can visualize this data in a number of ways at this point. In the query I used, I returned the numbers of deaths for multiple age groups. In this Power View sheet, I will only look at the total number of deaths. So, in the Power View Fields pane to the right, at the top, I will uncheck all the age groups and make sure I am looking at DeathsAllAges. I could also look at deaths by gender, but for now, I will uncheck the gender field as well.



Now, I just have to choose a style of chart. You may find that the chart you choose is normally influenced by the kind of data you are working with. When working in Power View, it's a good idea to try out a few different types of charts and tweak them to your liking. For this example, I will choose a clustered bar chart. The chart is displayed and I can see the data, but not exactly the way I want it. I do have to move a few things around in the Power View Fields pane. See the below screen shot for the final arrangement.




I end up with a simple bar chart showing the numbers of the cause of death for each year from 1999 to 2010. I can scroll through to see how the numbers increase or decrease.




What does this data tell us? Well, it appears that alcoholic cirrhosis of the liver is the main cause of death when looking at deaths related to alcohol. Keep in mind, this isn't looking at traffic accidents, but alcoholism and alcohol addiction. As I scroll through the chart, I can see how the rate of cirrhosis of the liver maintains a constant increase as pancreatitis has decreased, and general alcohol abuse slightly increases. What would be even more interesting is if I could get data for 2011 through 2014 and see how alcoholism was affected by the recession. Also, keep in mind that there may be discrepancies in this database as I have found WHO doesn't seem to label things perfectly. There are still a little over 400,000 rows missing from the database because of causes of death that don't have label codes yet under the new ICD10 standards.

Friday, April 10, 2015

SQL Saturday: Free Training on All Things SQL



This year, I will be presenting at SQL Saturday #391 in Jacksonville, FL on May 9th. This will be the 5th consecutive year that I have presented and it gets better and better every year. This is one of those events that I really look forward to. This year, I will be co-presenting with Manuel Quintana, and we are going to make it really interesting. Our session will be SSIS performance tuning. We are planning on delivering the material in a wacky, amusing fashion. If you are in the area, I would definitely recommend attending the event. It's a whole day's worth of free SQL Server training.

Besides, it's one of the few times you will catch me out and about in the daylight.

If you have never been to a SQL Saturday event, I encourage you to do so. They are held all over the world and there are no registrations fees. Absolutely free training! Visit SQLSaturday.com for more info and a complete listing of all SQL Saturday events.

Friday, April 3, 2015

Getting Started With Cassandra noSQL

Even though most of my experience is with Microsoft SQL Server and their BI stack, I like to tinker with other things. I am currently playing around with Cassandra noSQL from Datastax.

First off, what is noSQL? There are some really long, drawn out explanations out there, but to put it simply, noSQL is a database system that consists of many different types of technologies to handle big data. It can handle large volumes of data, whether it's structured or not.

I will be playing with noSQL a bit and exploring what you can do with it, but this post is just a tutorial on creating a connection and setting up a database in Cassandra. For this tutorial, I will using Cassandra community edition. You can download it from planetcassandra.org/cassandra. To interact with the instance of Cassandra, I will be using DevCenter from Datastax. You can download that from the Datastax download page.

I'm not going to cover the installation process, since  it was actually really straight forward. I installed it on a Windows laptop and followed the few prompts that it gave me. I didn't need to change any of the installation options. DevCenter has no installer. It only uses an executable. Be aware that you do need to have a Java JRE or JDK installed in order to use DevCenter.

In DevCenter, click the 'New Connection' icon.













For the connection name, I will just call it localhost. Type localhost into the 'Contact Hosts' field and then click the 'Add' button to the right. Leave all the other options the way they are and click 'Finish'.




















Localhost will now be displayed in the connections pane.

Now it's time to create storage objects. To interact with this Cassandra instance, I will have to use CQL (Cassandra Query Lsnguage). You may find that it is not too different than T-SQL. The first object that needs to be created is a 'keyspace'. Think of this as a database in SQL Server. The following syntax creates a keyspace named 'WHO_Mortality'.

Create keyspace "WHO_Mortality" WITH Replication = {'class' : 'SimpleStrategy', 'replication_factor' : 1 };

When creating a keyspace, replication is a required property. It relates to the nodes of a cluster over which the data needs to be replicated. There are different types of replication that can be used, but for this simple tutorial, I am using 'SimpleStrategy'. Run the statement by clicking the green run icon.







Now that the keyspace has been created, the next step is to create tables. This database will eventually hold mortality data from the World Health Organization. A little deranged, but what did you expect? Besides, there is a lot of interesting health related info in my staging database.

I will start with the 'Country' table. It's just a small table that will hold a list of countries and their ID's. Use the following syntax to create the Country table.

Create Table "WHO_Mortality".Country (
        CountryID int PRIMARY KEY,
        Country varchar
);

There are a few things I want to point out about these two statements. In the first statement, when creating the keyspace, notice that I added double quotes around the keyspace name. Encompassing an object name in double quotes makes it case sensitive. If the object is made case sensitive, then it must be typed exactly the same way every time it is  referenced in code. Notice that I didn't do that for the table or column names. There is no reasoning behind that really other than to show how that works. Also, when creating any CQL statements, you must end it with a semicolon. Whether it's a truncate table statement, create statement or drop table statement. Every statement needs it.

Now that I have the keyspace and a table, I can create the rest of my tables. One other incredibly important thing to remember is that Cassandra doesn't support foreign keys. You will find that there are certain types of commands that Cassandra can't handle such as joins. Because of this, the database will need to be designed according to the queries that will be run against it. Remember, noSQL is designed to handle incredibly large amounts of data, so the tables can be wider than what would typically be seen in a relational database system. This is just a simple tutorial on creating the keyspace and a table. More than likely, as my adventure in Cassandra continues, I will be altering this table and adding more columns until I am more familiar with the design concepts and have a finished database I can load.

The end result for this noSQL database will be another tutorial in which I load it using data that is stored in MSSQL Server.