SQL Server Merge Statement for Handling SCD2 Changes in a Data Warehouse (Tip #3)

In my last post (Part 2) I explained what dimension and fact tables are and how we handle changes in our dimension tables. I also went through a very high level example of using the merge statement to handle these changes. In Part 3, I am going to explain how I used the merge statements on a preview project and an indexed view to get the desired performance I wanted.

In my previous post I stated that in my scenario I used one, very flat staging table that went into multiple dimension tables and one fact table. The system that I was dealing with had to do with student loans, so my dimension tables were DimPerson, DimAddress, DimCommunication, DimLoanInformation, and DimLoanInformationLoanStatus, and the single fact table was FactLoan. All my dimension tables contained information that was relevant to describing a student loan and my fact table was data that changed frequently for a student loan (e.g., current balance, last payment type, last payment amount, and last payment date). I also had some static dimension tables that were used for things like loan status, loan type, data provider, and time.

Since I focused on “DimPerson” in the first part of this post I am going to continue with that dimension and explain how we used an SQL Server merge statement with indexed views.

Creating Views for the Merge Statement

In the first post I explained that when you are tracking type II changes you will have an active column to signify what record is the active record. I used an example about Suzie Que who got married and changed her name to Suzie Doe. If we queried Suzie in the “DimPerson” table there would be two records for her. The first record would be inactive and the second record would be active. Then, if she got divorced and changed her name back to Suzie Que there would be a third record added for Suzie. Each distinct individual in my “DimPerson” table will have one active record and n number of inactive records.

Since I wanted only to focus on the active records in my merge statement, I created an indexed view that only contained the active records. If you are not familiar with an indexed view it is a view that is on disk so when you query against the view it doesn’t have to re-run the underlying query and, instead, simply reads the results of the query from disk. Another advantage of an indexed view is that it can contain indexes you can utilize when writing queries. It should be noted that when declaring an indexed view it must also be schema bound. Another important aspect that needs to be stated here is that in order to take advantage of the view it requires you add a query hint to your queries of “noexpand”. If you don’t include the hint the view will run like a normal view and execute the under lying query.

The indexed view I created for “DimPerson” is show in Figure 1.

When looking at the sql you will see that I am only getting those records that have an IsActiveRow of 1. I have also created a unique index on the view for PersonID, which is the primary key of the base table and will always be unique.

The second view I created for my merge statement was against my staging table.  I previously explained that my staging table was very flat with multiple columns, so when running my merge statement I wanted to limit the number of columns returned and only include those that are required for my “DimPerson” dimension (See Figure 2). As you can see, this view contains multiple indexes, some of which are specific for our implementation. Here, I’ll explain index 3.

Index 3 relates to a business key that we create prior to running the ETL and guarantees uniqueness in a record. You can kind of think of it as an “SSN” for a person. The reason we couldn’t use “SSN” for a unique key in our situation was that we had multiple providers of loan information and a person could reside in each provider and each provider could have the same person with multiple loans. To make things even more difficult, each loan could have varying person information. So we had to create uniqueness by data provider and loan. This is why we used the key we used.

I wanted to limit the number to the columns returned and only include those that are required for my “DimPerson” dimension

Creating the Merge Statement

Once the views were created it was time to create the merge statement (See Figure 3). On line 826 of the merge statement I am using the vActivePeople view as the destination. I also want to point out that in this instance I am not using the query hint so that the underlying sql will run. I tried multiple ways to get the query hint to work, but nothing seemed to work. The desire is to use the query hint and utilize the indexed view, but it just never worked for me. I am still getting good performance with the view the way it is, but it would be better if the query hint could be used. On lines 829 – 833 I am selecting the columns from the vStagingPeople view which will be our source data — here I was able to use the query hint. On line 834 I am joining the destination and source views using index 3 of the source view. Inserting new records happens on lines 835 – 840. The process of dealing with the SCD II changes starts on line 841 – 847, where I am finding the rows that match on key but have a different column value. Those records get updated and set to inactive. Line 848 – 850 takes the output of the merge statement and inserts it into the temp table I created in line 821 – 825. Then the last thing I did was insert the SCD 2 changes and place the inactivated rows ID into the parent ID column.

Having never done this before, there was one thing specifically that was slightly strange to me. When I ran the merge statement in Figure 3 I was going against a views and not the base tables, so when you try to run the statement it will fail because you can’t insert into the view. However, what I learned is that you can put a trigger on a view and then insert into the base table. Figure 4 shows the three triggers created on the vActivePeople view. When creating the triggers you are forced to create an “insert”, “update”, and “delete trigger”. The trigger takes the values that were assigned to the view and inserts them into the base table.

During these two posts I described what dimension and fact tables are and how to track slow changing dimensions. I also talked about three ways to track slow changing dimensions and dug deeper into the merge statement method. Then lastly I showed how I have used the merge method along with indexed views to help with performance. I hope you enjoyed the best and please let me know if you have any questions in the comments!

April 18, 2013