Source: http://bidn.com/blogs/KathiKellenberger/sql-server/3097/ssis-transforms-and-execution-trees
The majority of the work performed in most SSIS packages is done in the Data Flow Task. This task allows you to transfer data from a source to a destination, with optional transforms in between to clean up and shape the data. While there is a transform for just about anything you can dream up, not all transforms are created equal when it comes to performance.
Some of the transforms, such as the Sort transform, are poor performers, and it is easy to see why. The Sort transform must read all the required data before it can do its work of ordering the data. That is an obvious roadblock. The Sort transform is called a Blocking Transformation because all incoming records must be read before execution can continue. My colleague, Devin Knight, has a great metaphor for this: a deck of cards. If you are handed 10 cards at a time, you can't have a sorted deck until you have all 52 cards in hand.
Under the covers, SSIS loads data into memory buffers, moving chunks of data in and out of the buffers as it performs the required work. (I’ll explain more about this in a later blog.) Blocking transforms create new buffers and execution paths. To be most efficient, you need to minimize the number of new execution paths created.
While the Sort transform is obviously a poor performer, how can we figure out whether a particular package is suffering because of Blocking Transformations? There is an event you can log, called “PipelineExecutionTrees,” that shows how many execution paths are created.
To test this, I created a package with a Data Flow Task that added a derived column to data from the AdventureWorks Production.Product table, performed an aggregate and a sort, and then exported the results to another database table (Figure 1).
Figure 1: My test package
After testing my package, I enabled logging by right-clicking in the Control Flow area and selecting Logging. I enabled logging for both the package and the Data Flow Task, added a text file to capture the log, and chose the PipelineExecutionTrees event. (See Figures 2 and 3.)
Figure 2: Configure logging
Figure 3: Log PipelineExecutionTrees
Once I ran the package with the logging in place, I could view the results in Notepad (Figure 4).
Figure 4: The logging results
These results tell me that three execution paths were created, numbered 0 through 2. Path 0 starts at the source and contains the Derived Column transform. That means the Derived Column transform is a non-blocking transform that works within the existing execution path. Path 1 begins with the Aggregate output, and Path 2 begins with the Sort output. This means that both the Aggregate and Sort transforms created their own buffers and execution paths, so they are not among the best-performing transforms. Figure 5 shows how the execution paths map to my package.
Figure 5: The execution paths mapped to the package
So, what could I have done to make this particular package perform better? In this case, it is easy to perform all the work of the transforms within our wonderful query language, T-SQL. I could have eliminated all three transforms by just modifying my query. Of course, sometimes you will need those transforms because the data source doesn’t have the features you need, such as with a flat file. Even then, maybe it would make sense to import a very large flat file into a staging table to take advantage of the database engine.
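As a rough sketch of that idea, the work of all three transforms can be folded into the source query itself. The column names below are illustrative, not the exact ones from my package, but the shape of the rewrite is the same: compute the derived value as an expression, let GROUP BY do the aggregation, and let ORDER BY do the sorting, all inside the database engine.

```sql
-- Hedged example against AdventureWorks: the derived column, aggregate,
-- and sort are all pushed into T-SQL, so the Data Flow needs no transforms.
SELECT
    p.Color,
    COUNT(*)                          AS ProductCount,
    AVG(p.ListPrice)                  AS AvgListPrice,
    AVG(p.ListPrice - p.StandardCost) AS AvgMargin   -- replaces the Derived Column transform
FROM Production.Product AS p
GROUP BY p.Color                                     -- replaces the Aggregate transform
ORDER BY p.Color;                                    -- replaces the Sort transform
```

With a query like this as the OLE DB Source, the Data Flow becomes a straight source-to-destination copy with a single execution path.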