If you’re a regular reader then you’ll know I’ve recently become enamored of SSIS. I’m absolutely blown away by this product, I wish I learned about it 5 years ago!
Of course, being new to such an extensive technology means dealing with landmines along the way. Unfortunately, I’ve run into a couple of them already. The first one was my own fault: I developed a project in 2008 but the client has Sql Server 2005. To solve it I had to install Sql Server 2005 and create a 2005 version of the Package.
AcquireConnection Method Error
The second issue I’ve encountered was hardly as intuitive. I was at the client site yesterday deploying my first SSIS Package outside my own network. With the help of Andy Leonard, my SSIS mentor, I had worked through all the necessary steps to make the Package dynamic, a process worthy of a blog post of its own. At the client site I followed my cheat sheet to the letter, explaining every step to the client along the way. After solving a missing prerequisite problem on the server, we fired up the Package only to be quickly stopped with a Validation failure message:
The AcquireConnection method call to the connection manager failed with error code 0xC0202009.
I assumed there was a problem with the dynamic connection strings. I spent about an hour diagnosing the problem only to find out there was no problem, at least not with the dynamic connections. The problem, it turns out, was in a thing called DelayValidation.
Dynamic vs. Design Time Connection Strings
When you design an SSIS Package, you need to define connections to data sources and destinations that you can access at design time. Without these connections, SSIS would be pretty useless. So these design time data connections have to be hard coded somewhere: I put mine in package variables and bind the ConnectionString property of the ConnectionManager objects to the variables. These variables are then overridden at runtime by external values, making the connections dynamic and portable.
What I was unaware of was that each ConnectionManager object and each DataFlow Task has a property called DelayValidation. When the Package begins, it validates all of the DataFlow tasks prior to executing any of them. If DelayValidation is set to False, then the validation process uses the design time connection objects. In other words, it tries to connect using that connection string before it has a chance to be set by the dynamic process. And here’s the kicker: False is the default value, so this is the default behavior.
Working on my own network, I never even noticed this was happening. On the client network, however, it immediately blows up because the tasks cannot connect to my design time databases. I found a good article online that explained the problem and it actually makes sense. The solution is to set DelayValidation to True, which will delay the validation process until the task begins to execute, giving it time to be set dynamically.
I started by setting the property on every ConnectionManager object. At first I believed that this would be sufficient but it didn’t stop the validation error: I had to do the same thing to every DataFlow task as well, after which the Package ran properly. This was a real pain because I had so many to change. I could select multiple tasks within the same container, but I could not select tasks across container boundaries, and I have a lot of containers.
While I understand what DelayValidation is doing and why it does it, I don’t understand why False is the default value. If nothing else, it seems to me this should be a Package level setting. If you know of a better or easier way to handle this problem, please share it in the comments below.
While Expression Blend remains my favorite tool and Visual Studio my most productive tool (with a little R# love), Sql Server Integration Services (SSIS) has moved solidly into the #3 slot.
I frequently have tasks that require me to move data from one location to another, perform conversions and transformations, create new tables with the resulting columns, etc. In the past, this meant a lot of ADO.NET and a lot of coding. These processes can be exceedingly slow and time consuming to produce and the performance is frequently less than desirable. With what I’ve learned about SSIS, most of that work will be a thing of the past. I can now do conversion work in a fraction of the time it took before and the resulting product, a Package in SSIS lingo, can execute in far less time.
It seems that SSIS is thought of as a DBA’s tool, but I believe that as Blend is not just for Designers, SSIS is not just for DBAs. This post is not going to be a how to or any kind of definitive work: what I want to accomplish is to introduce my fellow developers to the glory of SSIS and highlight some of the reasons I think you should be learning this technology.
Project Greenfield and SSIS
For Project Greenfield, one of the primary tasks is to convert data from the legacy IBM *insert nom du jour here* midrange server database to the new Sql Server database.
This is far more than pushing data from once place to another: the structure is completely different. Relationships are defined now that previously were unenforced and large tables are broken into dozens of smaller, more normalized tables, often in different schemas. Fields that were previously fixed length and Numeric types are now varchars and ints. In some cases single fields have been broken into multiple fields, and in some cases multiple fields have been combined. In all cases, data coming out is Unicode but is being stored as ANSI.
Obviously, this conversion represents a significant body of work in its own right. One of my recent tasks was to provide enough of a conversion that I could start prototyping (fake data just wasn’t what we wanted.) The amount of work I was able to do in a week would have easily taken over a month to write using ADO.NET. And best of all, now that I have a solid framework in place making changes is very easy.
Getting Started with SSIS
In order to start with SSIS, you will have to have it installed. More accurately, you will need the SQL Server Business Intelligence Development Studio installed, also known as BIDS. This is found as an option when installing SQL Server and I’m pretty sure it is not available below SQL Server Standard.
The current version of BIDS runs in Visual Studio 2008. If you already have VS2008 installed you will find a new Project Type category called Business Intelligence Projects added to your existing install. If you do not have VS2008 BIDS will install a Visual Studio 2008 Shell, even if you have VS2010 installed.
To start a new project, select the Business Intelligence Projects category and Integration Services Project in the create new project dialog. Once it is created, opening it and working with it is basically the same as any other solution.
Work Flow in SSIS
BIDS itself is the first application I’ve seen that serves as a compelling example of a Workflow driven application. The Package Designer workspace is organized in tabs, the only two of which I’ve needed so far are Control Flow and Data Flow.
All tasks are defined as compartmentalized units of work. The visual blocks for those are all shown in the Control Flow tab. These tasks may or may not be grouped into containers such as Sequence Container or Foreach Loop Container. You may define as many containers as necessary to organize the Package. So far I have preferred Sequence Containers as they allow me to organize tasks procedurally. Except for the simplest Package, I would not define tasks outside of containers.
There are many different task types available, but I have only needed three so far: Data Flow Task, Execute SQL Task, and Script Task. And now that I have better knowledge of what I am doing, I could get by without the Execute SQL Task.
Data Flow Task
At the heart of SSIS is the Data Flow Task. The basic formula is this: read data from a data source, manipulate/transform that data, then write the transformed data to the target destination. Data sources can be ADO.NET or OLE DB database connections but can also be Excel, Flat, or XML Files. There are even more options for Target Destinations.
In between the source and the target are the Data Flow Transformations which really represent the power of SSIS. Here is a brief list of the transformations I have so far found most useful.
Conditional Split – Evaluates the data in the current columns and creates logical subsets which can then be handled differently. Each subset effectively becomes it’s own data source at that point.
Derived Column – In my mind, the most important transformation of the bunch: derived columns are the new (or replacement) columns built by converting or transforming the source data. SSIS includes a highly evolved “Expression Language” that is used to convert the data at runtime. String manipulation, type conversion, mathematical operations, and much more are all supported.
Lookup – Second in importance only to Derived Column, this transformation allows you to perform lookup actions against other tables. This is essential for preventing invalid foreign key insertion. It also is great for performing Incremental Loads: basically, this means only inserting records into the database if they don’t already exist. This becomes important for several reasons, the least of which not being that I want to be able to execute the Package as often as possible, especially during development.
Multicast – Multicast creates multiple copies of the current data set, so you can perform multiple writes to multiple destinations.
The Script Task
The Script task allows you to write code to do things. Primarily I have used this to work with package variables, a whole topic in its own right, and for making OLE DB connections dynamic. I see substantial potential in the Script Task though as it really opens up the entire .NET Framework to the process.
Obviously, this barely scratches the surface of SSIS. BIDS is primarily a graphic tool, but there are distinct functions that developers could really leverage. The other place where developers could shine in SSIS is in process flow: once I understood things like Sequence Container and Conditional Split I really felt like I could make SSIS sing. This kind of flow is exactly what we code day in and day out, so I think developers can pick up SSIS quickly.
I may write some more about SSIS going forward, but if you are interested and looking for info now I recommend you check out Andy Leonard’s blog.
I want to start by saying I am not a database expert or an DBA by any stretch of the imagination. I am, though, an old AS/400 RPG developer. I believe that by necessity AS/400 developers tend to be more familiar with the native database functions and maintenance than their .NET counterparts. Even so, there are a great many capabilities that go unused by the day to day coder.
On the other hand, .NET developers tend to be more familiar with things like SQL and Stored Procedures. While these tools are second hand to most developers, they are far less common in RPG applications because RPG has direct database access. These are merely observations: each access technique has its strengths and weaknesses, a discussion of which is not the purpose of this article: I’ve probably rambled on about it too long already.
My day to day experience is typically making these two worlds peacefully coexist. The task this week is another conversion project, but this time from AS/400 tables to AS/400 tables. An argument could be made that I should have done the conversion in RPG using native database access, but in this particular case, I have some string manipulation that is far easier to handle in .NET than RPG.
The project is simple enough: in our legacy application we are making several significant changes: normalizing data, expanding configuration options, and extracting embedded data into a supporting table. The normalizing part requires that we combine data from two tables into a single repository. At the same time, those values contain embedded information that needs to be extracted and stored in a separate table.
The configuration of that embedded data has been expanded. It used to be fixed but is now much more flexible, allowing the client to configure the description, rules, and presentation of the data. As a result, each embedded item also produces a series of table entries in a related table: the number of entries and rules are not known at design time.
This may not be totally clear, but the result is the important part here. In a small client data set, what used to be 3,458 records in one table and 3,013 in the 2nd table is now 6,471 records in a single table. Those 6,471 also create 38,862 records in the extracted data table. There is also a 3rd table with 6,471 rows that acts as a cross reference between these this table and the legacy system that adds some additional information. So all in all, we’ve gone from 6,471 records in two tables, poorly normalized, into three tables with a total of 51,804 records.
The Problem of Performance
Obviously these are not big tables, but performing the table inserts over the wire from a .NET application incurs a big performance hit. I’m not going to point fingers, because I honestly don’t know where to point: it could be a problem with the SQL engine on the iSeries, or with the ADO.NET Managed Provider (our version is V5R4, which is still based on .NET 1.1). It’s unlikely that it is the network or the my PC, but I won’t rule anything out for certain.
I wrote previously about using Stored Procedures on the iSeries to wring out the last bit of performance. Overall, we’ve been very satisfied, and performance has been excellent. I wrote a few Stored Procedures to handle the inserts to these three tables and used them from my .NET application to perform the inserts. Quite unexpectedly, when I tested this small data set the performance was terrible. It took 57 minutes to perform this task, and this was our smallest data set. Our largest is more than 10 times the size, so it could potentially run 10-12 hours.
At first, I was tempted to leave it alone: this is not production code, but rather a utility that will be run once for each of our clients and then discarded. At least, that’s what I thought. I later realized that each client may have multiple data sets, so this may be run as many as 10 times per client. Then on top of that, I realized that my staff has to run this for all the clients. At that point, performance for this “one time” application suddenly became a huge problem.
Things aren’t always what they seem…
I was floored at these initial results. The reason we started using Stored Procedures is because they are supposed to provide the highest degree of performance. I refactored the logic twice trying to speed it up but to no avail. I assumed I had done something wrong in the procedure, so I hit the books, but I didn’t find anything. There’s not a lot to a procedure that just maps incoming parameters to an Insert statement.
There is one thing in the procedure, though, that I believe was the culprit. The way our legacy database is structured, you never know until run time what library (database) and file (table) you need to execute the function against. This means that the variables indicating those have to be passed in to the Stored Procedure … every … time.
This has not been a problem when pulling data, but I believe because of the frequency and interval with this mass of Insert statements that it was really hampering performance. I believe that if the Stored Procedure had specified the library (database) and file (table) names then the performance would not have been an issue. The system had to determine what access path to use to the database on every single request. The purpose of using Stored Procedures in the first place had been completely negated. I might as well have been using dynamic strings in my code.
Fortunately, I came up with a solution I can live with: I turned to an old friend, the Parameterized Query. As I got to thinking about the problem, I realized that what I needed was to create the command object once and reuse it. I used Parameterized Queries frequently before turning to Stored Procedures, and they suit this particular problem because I only need to set the library (database) and file (table) names once.
So I fired up a branch in Mercurial and hacked up my code. I refactored out some of my refactorings and replaced all the Stored Procedure calls with Parameterized Queries. I made sure I was only creating the command once and then reusing it for each insert. I anxiously ran the program again: 9 minutes later I had the confirmation I needed. This was MUCH better.
So the lesson for today is to never trust your assumptions. I was sure that Stored Procedures would automatically give me the best performance. I wasted time developing the procedures, I wasted time implementing them, then I wasted time testing and debunking my incorrect assumption. Now I’m wasting time writing about it :-) But it reminds me of a very valuable tip: choose the right tool for the right job.