Project Greenfield: Learning SSIS
While Expression Blend remains my favorite tool and Visual Studio my most productive tool (with a little R# love), Sql Server Integration Services (SSIS) has moved solidly into the #3 slot.
I frequently have tasks that require me to move data from one location to another, perform conversions and transformations, create new tables with the resulting columns, etc. In the past, this meant a lot of ADO.NET and a lot of coding. These processes can be exceedingly slow and time consuming to produce and the performance is frequently less than desirable. With what I’ve learned about SSIS, most of that work will be a thing of the past. I can now do conversion work in a fraction of the time it took before and the resulting product, a Package in SSIS lingo, can execute in far less time.
It seems that SSIS is thought of as a DBA’s tool, but I believe that as Blend is not just for Designers, SSIS is not just for DBAs. This post is not going to be a how to or any kind of definitive work: what I want to accomplish is to introduce my fellow developers to the glory of SSIS and highlight some of the reasons I think you should be learning this technology.
Project Greenfield and SSIS
For Project Greenfield, one of the primary tasks is to convert data from the legacy IBM *insert nom du jour here* midrange server database to the new Sql Server database.
This is far more than pushing data from once place to another: the structure is completely different. Relationships are defined now that previously were unenforced and large tables are broken into dozens of smaller, more normalized tables, often in different schemas. Fields that were previously fixed length and Numeric types are now varchars and ints. In some cases single fields have been broken into multiple fields, and in some cases multiple fields have been combined. In all cases, data coming out is Unicode but is being stored as ANSI.
Obviously, this conversion represents a significant body of work in its own right. One of my recent tasks was to provide enough of a conversion that I could start prototyping (fake data just wasn’t what we wanted.) The amount of work I was able to do in a week would have easily taken over a month to write using ADO.NET. And best of all, now that I have a solid framework in place making changes is very easy.
Getting Started with SSIS
In order to start with SSIS, you will have to have it installed. More accurately, you will need the SQL Server Business Intelligence Development Studio installed, also known as BIDS. This is found as an option when installing SQL Server and I’m pretty sure it is not available below SQL Server Standard.
The current version of BIDS runs in Visual Studio 2008. If you already have VS2008 installed you will find a new Project Type category called Business Intelligence Projects added to your existing install. If you do not have VS2008 BIDS will install a Visual Studio 2008 Shell, even if you have VS2010 installed.
To start a new project, select the Business Intelligence Projects category and Integration Services Project in the create new project dialog. Once it is created, opening it and working with it is basically the same as any other solution.
Work Flow in SSIS
BIDS itself is the first application I’ve seen that serves as a compelling example of a Workflow driven application. The Package Designer workspace is organized in tabs, the only two of which I’ve needed so far are Control Flow and Data Flow.
All tasks are defined as compartmentalized units of work. The visual blocks for those are all shown in the Control Flow tab. These tasks may or may not be grouped into containers such as Sequence Container or Foreach Loop Container. You may define as many containers as necessary to organize the Package. So far I have preferred Sequence Containers as they allow me to organize tasks procedurally. Except for the simplest Package, I would not define tasks outside of containers.
There are many different task types available, but I have only needed three so far: Data Flow Task, Execute SQL Task, and Script Task. And now that I have better knowledge of what I am doing, I could get by without the Execute SQL Task.
Data Flow Task
At the heart of SSIS is the Data Flow Task. The basic formula is this: read data from a data source, manipulate/transform that data, then write the transformed data to the target destination. Data sources can be ADO.NET or OLE DB database connections but can also be Excel, Flat, or XML Files. There are even more options for Target Destinations.
In between the source and the target are the Data Flow Transformations which really represent the power of SSIS. Here is a brief list of the transformations I have so far found most useful.
Conditional Split – Evaluates the data in the current columns and creates logical subsets which can then be handled differently. Each subset effectively becomes it’s own data source at that point.
Derived Column – In my mind, the most important transformation of the bunch: derived columns are the new (or replacement) columns built by converting or transforming the source data. SSIS includes a highly evolved “Expression Language” that is used to convert the data at runtime. String manipulation, type conversion, mathematical operations, and much more are all supported.
Lookup – Second in importance only to Derived Column, this transformation allows you to perform lookup actions against other tables. This is essential for preventing invalid foreign key insertion. It also is great for performing Incremental Loads: basically, this means only inserting records into the database if they don’t already exist. This becomes important for several reasons, the least of which not being that I want to be able to execute the Package as often as possible, especially during development.
Multicast – Multicast creates multiple copies of the current data set, so you can perform multiple writes to multiple destinations.
The Script Task
The Script task allows you to write code to do things. Primarily I have used this to work with package variables, a whole topic in its own right, and for making OLE DB connections dynamic. I see substantial potential in the Script Task though as it really opens up the entire .NET Framework to the process.
Obviously, this barely scratches the surface of SSIS. BIDS is primarily a graphic tool, but there are distinct functions that developers could really leverage. The other place where developers could shine in SSIS is in process flow: once I understood things like Sequence Container and Conditional Split I really felt like I could make SSIS sing. This kind of flow is exactly what we code day in and day out, so I think developers can pick up SSIS quickly.
I may write some more about SSIS going forward, but if you are interested and looking for info now I recommend you check out Andy Leonard’s blog.