Total Pageviews

Monday, December 12, 2011

Pivot Transformation in SSIS


What does the Pivot Transformation do?

This bit is fairly straightforward – it takes row values and converts them to columns, moving – but not aggregating -  a value as it does so. So if your data looked like this:
b
Fig 1: Pre Pivot Data
Pivoting on “Food Type” using “Cost” as a value would turn it into this:
b
Fig 2: Post Pivot Data

How to configure the Pivot Transformation

In the sample package Data Flow 1 I start with a query which delivers Line Totals and Order Quantities summarised by Product and Order Year. What I want to do is Pivot so I have Order Year on the rows, the values from Product Category as the Columns and the sum of Line Totals as the Value, as laid out below:
b
Fig 3: The Desired Pivot Results

Step 1: Configure the Inputs

The first step is to open the Pivot Transformation and on the Input Columns tab select the columns that are going to be used in the Pivot. Once that is done, switch to the Input and Output Properties tab. Here we are going to set how each column is used in the Pivot operation. This is done by setting the PivotUsage property of the Input Columns. This can be found by expanding Pivot Default Input > Input Columns and selecting the relevant Input Column. Down at the bottom of the list under the Custom header is the property PivotUsage. This can be set to one of four values:
  • 0 – the column is passed through unaffected
  • 1 – the column values become the rows of the pivot (aka the Set Key)
  • 2 – the column values become the column names of the pivot (aka the Pivot Column)
  • 3 - the column values that are pivoted in the pivot
So, in my example, I set OrderYear to 1, Product to 2 and Line Totals to 3.
b
Fig 4: Configuring the Pivot Transformation Inputs
Now something you have to configure upstream from this is that whatever column will be the rows of the pivot (PivotKeyUsage = 1) needs to be sorted for the component to work properly. The Pivot component iteslf doesn’t force this (as it does in a Merge Join) so you need to go back and make sure that your column is sorted by the time it hits the pivot. Setting the IsSorted and SortKeyPosition properties of the Set Key column upstream will make no difference as the Pivot Transformation won’t actually pay any attention to these settings, but it may cause unexpected results if the data itself isn’t sorted. You can see how this affects the output by running Data Flow 1a, which is the same as Data Flow 1 but not sorted properly.

Step 2: Configure the Outputs

Despite everything you’ve done so far, this component will have exactly zero outputs – each has to be manually defined. So you will need an output for your Rows (In my example, OrderYear) and an output for each of your Pivoted column’s values that you want to output. So in my example I will need to create an output for each Product Category that is in the source data – you cannot ignore a value as the component will return an error. For example, if I forgot to add a column for Components, I would get the error “The pivot key value “Components” is not valid” when I tried to execute.
To create an output, on the Inputs and Outputs tab expand the Pivot Default Output > OutputColumns and start clicking the Add Column button until you have enough columns. Then for each column you need to set the following properties:
  • Name – the name for the output column – so if you were pivoting a value of Bike, you could rename it to Bicycle in the output
  • PivotKeyValue – the value in the pivoted column that will go into this output
  • SourceColumn – the Lineage ID of the column that is to be used in the pivot. What this means is you need to go back to your input columns, and get the value of the LineageID property of the column that you set the PivotUsage of to 3 for pivoted columns, and the LineageID of the Column you set to PivotUsage of 1 for the Row Column.
b
Fig 5: Configuring the Pivot Transformation Outputs
Having fun yet? Unlikely, but fortunately you are done.

Multiple Pivots and the T-SQL Alternative

In the sample package Data Flow 2, I demonstrate a configuration of the Pivot Transformation that actually pivots two data values. This is achieved by adding a duplicate set of columns for the Pivot Column to receive the pivoted values in, and can be achieved mostly by duplicating the actions in Step 2 to create a new set of Pivot Columns.
I also demonstrate the Pivot T-SQL command in Data Flow 3 – this generates the same output as in Data Flow 1 but the aggregation and pivoting is done in a single operation with none of the awkward configuration required by the Pivot Transformation in SSIS. I won’t be going into the syntax for these queries here but lkinks to the MSDN documentation are at the end of this post.

Where to use the Pivot Transformation?

Because of the pain involved in configuring it, I would suggest you only use it when you absolutely have to and can’t push the operation back to the source. This is not least because the pivot – unlike Excel Pivot Tables or a T-SQL pivot – cannot aggregate, it needs aggregation first on the pivoted column. In my example, if I didn’t have an aggregate operation in the query, I would get the error “Duplicate pivot key value “Bikes”” as it tried to pivot another row which had a Product Category of “Bike”. The T-SQL Pivot query in SQL Server is far more powerful with its variety of aggregation options – even though it will initially be a bit awkward to write those queries to a non-expert. Additionally using T-SQL will allow you to lock down the number of columns – if you get a new value in the Pivot Column it won’t kill your package.
All in all the Pivot Transformation is badly implemented from a UI perspective, and weak from a data processing viewpoint due to it’s lack of aggregation capability.
MSDN Documentation for the Pivot Transformation can be found here for 2008 and here for 2005.
MSDN Documentation for the Pivot T-SQL command can be found here for 2008 and here for 2005.

No comments:

Post a Comment