If you’re a regular reader then you’ll know I’ve recently become enamored of SSIS. I’m absolutely blown away by this product, I wish I learned about it 5 years ago!
Of course, being new to such an extensive technology means dealing with landmines along the way. Unfortunately, I’ve run into a couple of them already. The first one was my own fault: I developed a project in 2008 but the client has Sql Server 2005. To solve it I had to install Sql Server 2005 and create a 2005 version of the Package.
AcquireConnection Method Error
The second issue I’ve encountered was hardly as intuitive. I was at the client site yesterday deploying my first SSIS Package outside my own network. With the help of Andy Leonard, my SSIS mentor, I had worked through all the necessary steps to make the Package dynamic, a process worthy of a blog post of its own. At the client site I followed my cheat sheet to the letter, explaining every step to the client along the way. After solving a missing prerequisite problem on the server, we fired up the Package only to be quickly stopped with a Validation failure message:
The AcquireConnection method call to the connection manager failed with error code 0xC0202009.
I assumed there was a problem with the dynamic connection strings. I spent about an hour diagnosing the problem only to find out there was no problem, at least not with the dynamic connections. The problem, it turns out, was in a thing called DelayValidation.
Dynamic vs. Design Time Connection Strings
When you design an SSIS Package, you need to define connections to data sources and destinations that you can access at design time. Without these connections, SSIS would be pretty useless. So these design time data connections have to be hard coded somewhere: I put mine in package variables and bind the ConnectionString property of the ConnectionManager objects to the variables. These variables are then overridden at runtime by external values, making the connections dynamic and portable.
What I was unaware of was that each ConnectionManager object and each DataFlow Task has a property called DelayValidation. When the Package begins, it validates all of the DataFlow tasks prior to executing any of them. If DelayValidation is set to False, then the validation process uses the design time connection objects. In other words, it tries to connect using that connection string before it has a chance to be set by the dynamic process. And here’s the kicker: False is the default value, so this is the default behavior.
Working on my own network, I never even noticed this was happening. On the client network, however, it immediately blows up because the tasks cannot connect to my design time databases. I found a good article online that explained the problem and it actually makes sense. The solution is to set DelayValidation to True, which will delay the validation process until the task begins to execute, giving it time to be set dynamically.
I started by setting the property on every ConnectionManager object. At first I believed that this would be sufficient but it didn’t stop the validation error: I had to do the same thing to every DataFlow task as well, after which the Package ran properly. This was a real pain because I had so many to change. I could select multiple tasks within the same container, but I could not select tasks across container boundaries, and I have a lot of containers.
While I understand what DelayValidation is doing and why it does it, I don’t understand why False is the default value. If nothing else, it seems to me this should be a Package level setting. If you know of a better or easier way to handle this problem, please share it in the comments below.
While Expression Blend remains my favorite tool and Visual Studio my most productive tool (with a little R# love), Sql Server Integration Services (SSIS) has moved solidly into the #3 slot.
I frequently have tasks that require me to move data from one location to another, perform conversions and transformations, create new tables with the resulting columns, etc. In the past, this meant a lot of ADO.NET and a lot of coding. These processes can be exceedingly slow and time consuming to produce and the performance is frequently less than desirable. With what I’ve learned about SSIS, most of that work will be a thing of the past. I can now do conversion work in a fraction of the time it took before and the resulting product, a Package in SSIS lingo, can execute in far less time.
It seems that SSIS is thought of as a DBA’s tool, but I believe that as Blend is not just for Designers, SSIS is not just for DBAs. This post is not going to be a how to or any kind of definitive work: what I want to accomplish is to introduce my fellow developers to the glory of SSIS and highlight some of the reasons I think you should be learning this technology.
Project Greenfield and SSIS
For Project Greenfield, one of the primary tasks is to convert data from the legacy IBM *insert nom du jour here* midrange server database to the new Sql Server database.
This is far more than pushing data from once place to another: the structure is completely different. Relationships are defined now that previously were unenforced and large tables are broken into dozens of smaller, more normalized tables, often in different schemas. Fields that were previously fixed length and Numeric types are now varchars and ints. In some cases single fields have been broken into multiple fields, and in some cases multiple fields have been combined. In all cases, data coming out is Unicode but is being stored as ANSI.
Obviously, this conversion represents a significant body of work in its own right. One of my recent tasks was to provide enough of a conversion that I could start prototyping (fake data just wasn’t what we wanted.) The amount of work I was able to do in a week would have easily taken over a month to write using ADO.NET. And best of all, now that I have a solid framework in place making changes is very easy.
Getting Started with SSIS
In order to start with SSIS, you will have to have it installed. More accurately, you will need the SQL Server Business Intelligence Development Studio installed, also known as BIDS. This is found as an option when installing SQL Server and I’m pretty sure it is not available below SQL Server Standard.
The current version of BIDS runs in Visual Studio 2008. If you already have VS2008 installed you will find a new Project Type category called Business Intelligence Projects added to your existing install. If you do not have VS2008 BIDS will install a Visual Studio 2008 Shell, even if you have VS2010 installed.
To start a new project, select the Business Intelligence Projects category and Integration Services Project in the create new project dialog. Once it is created, opening it and working with it is basically the same as any other solution.
Work Flow in SSIS
BIDS itself is the first application I’ve seen that serves as a compelling example of a Workflow driven application. The Package Designer workspace is organized in tabs, the only two of which I’ve needed so far are Control Flow and Data Flow.
All tasks are defined as compartmentalized units of work. The visual blocks for those are all shown in the Control Flow tab. These tasks may or may not be grouped into containers such as Sequence Container or Foreach Loop Container. You may define as many containers as necessary to organize the Package. So far I have preferred Sequence Containers as they allow me to organize tasks procedurally. Except for the simplest Package, I would not define tasks outside of containers.
There are many different task types available, but I have only needed three so far: Data Flow Task, Execute SQL Task, and Script Task. And now that I have better knowledge of what I am doing, I could get by without the Execute SQL Task.
Data Flow Task
At the heart of SSIS is the Data Flow Task. The basic formula is this: read data from a data source, manipulate/transform that data, then write the transformed data to the target destination. Data sources can be ADO.NET or OLE DB database connections but can also be Excel, Flat, or XML Files. There are even more options for Target Destinations.
In between the source and the target are the Data Flow Transformations which really represent the power of SSIS. Here is a brief list of the transformations I have so far found most useful.
Conditional Split – Evaluates the data in the current columns and creates logical subsets which can then be handled differently. Each subset effectively becomes it’s own data source at that point.
Derived Column – In my mind, the most important transformation of the bunch: derived columns are the new (or replacement) columns built by converting or transforming the source data. SSIS includes a highly evolved “Expression Language” that is used to convert the data at runtime. String manipulation, type conversion, mathematical operations, and much more are all supported.
Lookup – Second in importance only to Derived Column, this transformation allows you to perform lookup actions against other tables. This is essential for preventing invalid foreign key insertion. It also is great for performing Incremental Loads: basically, this means only inserting records into the database if they don’t already exist. This becomes important for several reasons, the least of which not being that I want to be able to execute the Package as often as possible, especially during development.
Multicast – Multicast creates multiple copies of the current data set, so you can perform multiple writes to multiple destinations.
The Script Task
The Script task allows you to write code to do things. Primarily I have used this to work with package variables, a whole topic in its own right, and for making OLE DB connections dynamic. I see substantial potential in the Script Task though as it really opens up the entire .NET Framework to the process.
Obviously, this barely scratches the surface of SSIS. BIDS is primarily a graphic tool, but there are distinct functions that developers could really leverage. The other place where developers could shine in SSIS is in process flow: once I understood things like Sequence Container and Conditional Split I really felt like I could make SSIS sing. This kind of flow is exactly what we code day in and day out, so I think developers can pick up SSIS quickly.
I may write some more about SSIS going forward, but if you are interested and looking for info now I recommend you check out Andy Leonard’s blog.
In the previous article Clearing a SQL Server Database I wrote about my adventures in writing a Stored Procedure to clear all the data from all the tables in a SQL Server database. I also said I’d be writing about using Stored Procedures in Entity Framework next, but I’ve belayed that for a while, so my apologies to anyone who is dying to read that article.
More than one way to skin a Table
My buddy Stuart posted another way to do this in the comments:
I would generalize the script so you don’t have to keep updating the stored proc each time you add another table. This is a script that I’ve used that does the same thing (but gets around having to order the tables by disabling the constraints first):
– disable referential integrity
EXEC sp_MSForEachTable ‘ALTER TABLE ? NOCHECK CONSTRAINT ALL’
EXEC sp_MSForEachTable ‘
IF OBJECTPROPERTY(object_id(”?”), ”TableHasForeignRef”) = 1
DELETE FROM ?
TRUNCATE TABLE ?
– enable referential integrity again
EXEC sp_MSForEachTable ‘ALTER TABLE ? CHECK CONSTRAINT ALL’
– Reseed each table
EXEC sp_MSForEachTable ‘
IF OBJECTPROPERTY(object_id(”?”), ”TableHasIdentity”) = 1
DBCC CHECKIDENT (”?”, RESEED, 0)’
He makes a great point: what if I, or some other developer, comes along later and adds a table, but forgets to add it to the Stored Procedure? I’m not actually sure what the consequences would be, so why take the chance?
What’s all this doing?
If you are an old hat at SQL Server, you probably are thinking “So what?”, but I have professed several times that I am a SQL Server newbie, so I found this really cool. Here are the things I learned about thanks to this little gem:
According to my Bing research, this is an “undocumented feature” built in to SQL Server. Since “undocumented” really means “unsupported”, the caveat here is that you shouldn’t rely on it because Microsoft could change it or pull the plug without notice. Personally, given the wealth of information out there about this procedure, I’ve decided to take my chances.
This procedure will loop through every Table in the current database and execute the given statement against it. See the “?” part? That is where the Table identifier will be inserted into each statement. So in the example above, the first thing we do is loop through all the tables and remove the Constraints. These were the pesky things in the previous article that prevented us from using TRUNCATE. Then we loop through again and clear the tables, using TRUNCATE when we can and DELETE when we must. Then we reinstate the Constraints and reseed the Identity Columns.
All in all, this is a very nice bit of work and has really opened my eyes to the power of these SQL Server goodies. You can bet I’ll be returning to this topic in the future.
On a side note, there is a similar method that will loop through and operate the specified statement against every Database as well. This feels a lot more scary to me, and I am unlikely to use it anytime soon, but it is cool knowing it is there.
OBJECTY_ID and OBJECT_PROPERTY
You’ll also notice that we use a built in function called OBJECT_ID. Basically, OBJECT_ID retrieves a handle to the Table itself, which we can then pass to other functions to retrieve information about the table.
In the example above, we are passing this handle to another function called OBJECT_PROPERTY. This allows us to retrieve a metadata property containing information about the Table in question. We use it twice above, once to check the "TableHasForeignRef" property and again to check the "TableHasIdentity" property. These values are then used to further determine whether or not to execute specific actions against these tables.
But what if I don’t want ALL the tables?
I actually ran into this issue pretty quickly: in my project I have a table that holds some configuration information. It’s not in my diagram because it isn’t related to any of the other tables, it’s just holding data. The problem I ran into was that this table was getting cleared along with the rest, but it needs to retain its data when the other tables are cleared.
The configuration table wasn’t getting cleared previously because it wasn’t specifically listed in the original Stored Procedure. Now that we are doing things more dynamically, however, it got caught in the cross fire and cleared out. Naturally, this created an assortment of “unintended consequences” the first time the software was run following a CLearDatabase command.
So the question became “how do I execute against all tables except this one?” I needed a way to ignore that table. Fortunately, part of the answer was right in front of my face: OBJECT_ID.
Ignoring a specific Table
I realized I should be able to use the same OBJECT_ID function to specify which Table I wanted to skip, I just didn’t know what other function to use. I did a little Binging and found the OBJECT_NAME function, but I had a little trouble getting the result to match a string I was using to specify the Table name. A little more digging led me to an example that allowed me to query the database and see metadata about the table. To display the Table name, it used another function called QUOTENAME, which had as a parameter the results of OBJECT_NAME. So I strung all this together and came up with the following:
IF QUOTENAME(OBJECT_NAME(object_id("?"))) != ”[Configuration]”
Where "Configuration” is the name of the Table I want to ignore. Notice the use of the Square Brackets: QUOTENAME always seems to return the Table name wrapped in Square Brackets. I’ve seen these before in other SQL Server samples, but I don’t know yet what their significance is, I just know that adding them made this work.
Also, since it may not come across too clearly, those are paired single quotes (or Tics) around the Table name. I don’t know if it matters, perhaps regular double quotes would be fine, I just know that in the final version that worked that is what I used.
The only part of the procedure that needs this is the part that TRUNCATEs or DELETEs, so here is the updated version of that part:
IF QUOTENAME(OBJECT_NAME(object_id("?"))) != ”[Configuration]”
IF OBJECTPROPERTY(object_id("?"), "TableHasForeignRef") = 1
DELETE FROM ?
TRUNCATE TABLE ?
So far this has worked very well, and I now know that there are a lot of options open to me in SQL Server for solving these kinds of problems. I plan to spend some time getting to know more of them very soon. There is plenty of stuff to learn in the Transact-SQL Reference on MSDN.
A Special Thanks to Stuart for sharing, I love a spirit of co
llaboration: “Good on ya, Mate!”
UPDATE: Be sure to check out the follow-up article, Clearing a SQL Server Database, Take 2.
I posted recently in Project Greenfield: Testing the TDD Waters about a small conversion project I’ve been using as a test bed for some of the techniques and technologies I’ll be using in Project Greenfield. I know SQL and am very comfortable in DB2 on the IBM System i, but this is the most extensive work I have done with SQL Server in recent memory. I have really appreciated the ease of database management provided by the tools integrated into Visual Studio. I especially appreciate it since I cannot seem to get SQL Server Management Studio installed on my development machine, but I won’t go into that right now.
The Database Schema
As background for the rest of this article, here is the schema we’ll be discussing. The diagram itself is too hairy to post, but the simplified version below should suffice. In this case, “->” means “Parent Of” and indicates a 1 to many relationship.
This collection of tables and relationships is exposed by the Entity Framework as a RealEstateMaster Entity. In the database, these tables also hold numeric Codes for various elements, each with a corresponding table, a “look up” table for normalization. There are well over a dozen of these, so I’ll not list them all, but they all function like so:
From an Entity standpoint, these are not child and parent relationships, but from a database standpoint they do enforce the same type of Foreign Key constraints. In other words, each code in the CardDetail table must exist in it’s corresponding CodeTable.
I have several scenarios where the conversion process requires a “fresh start”, in other words a clean database with no data in the tables. This means that on demand I need to be able to wipe out all the data from all the tables in the database. This seemingly simple task turned out to take a lot more effort to figure out than I originally anticipated.
Using Entity Framework
At first, I assumed (wrongly) that since I was using Entity Framework for all of my database access that there would be a way to do this built in to the Entity Context. I made the rookie mistake of equating my EF classes to direct access to the database and all it’s functionality. I also made the rookie mistake of equating the EF classes to database tables: this one to one mapping is in no way necessary, so in hindsight I understand why there is no “TableName.Clear()” kind of an option.
I believe this problem can be solved using the EF classes but it would be very cumbersome. As I see it, it would require you to loop through the entire collection of RealEstateMaster entities and delete each one. That delete operation should loop through it’s children and delete those records as well. Afterwards, you could then do the same to each code table, which at that point should have no constraining records.
NOTE: The statements above are theoretical: I did not try this because it seemed like way too much work and not really a proper application of EF. I chose EF because it provides an easier way to work with SQL Server, but when EF actually gets in the way, it tells me I should find a different solution.
Back to SQL Server
Having explored my EF options, I decided the best thing to do was create a Stored Procedure in SQL Server to perform this task. Having never written a Stored Procedure in SQL Server, I wasn’t sure exactly what I was getting into, so I reached out for help to a known SQL Server guru: Andy Leonard. One of the great things about being involved with the community is knowing people who know things!
Andy graciously tolerated my newbie questions and with his guidance via an email exchange he led me to the solution I finally implemented. With his permission, I’m going to share a little of our exchange. I’m going to leave it unedited, mostly because I love Andy’s way of putting things, but also so you can get the same undiluted experience I did.
ME: having explained the situation and schema above …
So I guess I have several questions:
1) How do the relationships affect the deletes? Does the order matter?
2) Is there a way to do a "cascading" delete that will loop through the relationships and delete the related table rows automatically?
3) Am I making this harder than it needs to be? Is there a better way?
1. Usually 1-many indicates parent-child. The parent is usually on the "one" side of this relationship; child is usually on the "many" side. Order matters. You want to remove the child(ren) first. If you want to be extra cool about it, remove the child(ren) and then the parent in a transaction. That way, if something "bad" happens (like crossing the streams </GhostBusters>) during the parent delete, the child delete can rollback. Transactions are built for this. You are set up here for something "bad" to happen – you have multiple children for a given parent. If you miss one and get all the rest – and there’s data in that parent-child relationship you miss – your best outcome is a failure with rollback. Everything will return to its pre-transaction state. Without a transaction, you risk introducing data pollution (half the data for a given entity is missing).
2. There is a way to set up cascading referential integrity. It’s rare in practice and has to be in place before you begin your delete statements.
3. This is rocket surgery. You are not adding complexity, the complexity was here when you arrived.
My solution would be something like:
from SalesHistory sh
inner join MasterRecord mr on mr.ID = sh.MasterRecordID
where mr.ID in (…<list of MasterRecord table IDs>…)
<cut more examples of the same approach>
— commit tran
— rollback tran
Notice I worked from the bottom of the list to the top – that’s intentional. Most people think of entity construction "top down." Deletes need to work in the opposite order.
If everything appears to work, execute the commit statement. If not, you can execute the rollback and put everything back just like it was before you started. As a precaution, always execute the commit or rollback at least twice – you want to make sure you close all the transactions you opened. And it’s easy to start a new one accidentally – and it becomes a nested transaction when you do (if you close the SSMS window and leave an open transaction, the tables involved are locked. That’s "bad"…). You want to highlight the commit or rollback in SSMS and keep clicking Execute until you get an error indicating there are no open transactions to end. It’s a best practice.
My first quest
ion would be why this:
from SalesHistory sh
inner join MasterRecord mr on mr.ID = sh.MasterRecordID
where mr.ID in (…<list of MasterRecord table IDs>…)
instead of this:
from SalesHistory sh
Here is why I ask:
1) The purpose here is really just to clear out all the tables, completely disregarding the current data. A total purge, if you will.
2) Using the first statement leaves open the possibility of orphaned data – or does it? If the relationships are defined, what happens when there are SalesHistory rows with no associated MasterRecord row?
3) It seems like additional complexity: won’t the joins be a performance hog?
If you’re just after clearing all the tables, a simple DELETE statement – starting with the children – will work. There is a popular myth that JOINs slow down performance. It’s akin to saying a farm of web servers slow down performance because there’s all that time lost deciding which server to send the request and then managing the distributed session management.
The truth is Joins can improve performance as much as hurt it. They’re a tool. Proper indexing and server management are the keys to performance.
That said, you can use Truncate Table to clear them. That does a couple things:
1. Wipes out the data.
2. Is not logged (so it flies).
3. Resets identity columns to the initial seed value (usually 1).
4. Requires the ddl_admin role for permission.
That’s a nice middle ground between dropping/recreating and deleting.
Andy’s first response talked about the best practice for doing an operation of this nature, which I rejected only because I just wanted a total purge of the data: if I was doing something more production oriented I would have taken the approach Andy suggested.
So the idea of just issuing a bunch of DELETE commands over all the tables does what I need. The first lesson here, though, is that Order Matters. Because of the relationships I created between the tables I could not simply issue a DELETE against the parent tables until there were no longer any children constraining them.
Recall the relationships listed above:
I had to start at the deepest point in the heirarchy and work my way up, so the final order looks like this:
- DELETE CardImprovementDetail
- DELETE CardDetail
- DELETE TransferHistory
- DELETE LandDetail
- DELETE RealEstateMaster
- DELETE all the Code tables mentioned above
Using TRUNCATE instead of DELETE
From Andy’s last email, I decided that TRUNCATE might be a better option. I had never heard of TRUNCATE before, but using it is very simple: ex. “TRUNCATE CardDetail”.
Unfortunately, when I changed all my DELETEs to TRUNCATEs, I discovered a little foible. Apparently, TRUNCATE will not work when a Foreign Key Relationship is defined, even when there is no data affected. So in other words, I can issue TRUNCATE CardImprovementDetail, because it does not define a FOREIGN KEY relationship to any other table. I can NOT, however, issue TRUNCATE CardDetail, because it defines a FOREIGN KEY relationship to CardImprovementDetail (as well as all of it’s corresponding Code tables). This held trus even after CardImprovementDetail had been truncated itself and held no data.
The Final Solution
So the final solution was to use TRUNCATE when possible, and DELETE when necessary. I wrapped all of these up in a Stored Procedure and now when I need to clear the entire database I can simply execute.
Remembering that Order Matters, the final procedure execution looks like this:
ALTER PROCEDURE dbo.ClearDatabase
TRUNCATE TABLE CardImprovementDetail;
DELETE FROM CardDetail;
TRUNCATE TABLE Land;
TRUNCATE TABLE TransferHistory;
DELETE FROM RealEstateMaster;
DELETE FROM Carport;
DELETE FROM Condition;
DELETE FROM Easement;
DELETE FROM ExteriorWall;
DELETE FROM Floor;
DELETE FROM Foundation;
DELETE FROM Garage;
DELETE FROM Heat;
DELETE FROM InteriorWall;
DELETE FROM MagisterialDistrict;
DELETE FROM Occupancy;
DELETE FROM RightOfWay;
DELETE FROM RoofMaterial;
DELETE FROM RoofType;
DELETE FROM Sewer;
DELETE FROM SiteCharacteristic;
DELETE FROM SiteTerrain;
DELETE FROM Water;
The Nuclear Option
Before I close, I did want to mention the Nuclear Option: I could just drop the database and recreate it at runtime. I considered it briefly because while the Entity Context does not have a Clear Table kind of option, it does have CreateDatabase(). It also has CreateDatabaseScript(), which you can use to extract the schema into an executable SQL script. It seems to me that you could just nuke the database (probably with a Stored Procedure) and use some combination of these to recreate it.
I only considered it for a moment, because it seems heavy handed. On top of that, if something were to go wrong it could leave the application in an unusable state. It also assumes that the SQL Script generated by EF will match the standards required by the application or the client. I’m not saying the generated schema would not function, but there could be outside factors. I suppose you cold store the schema in a file locally and use it to recreate the database, outside of EF, but it just feels ill advised.
Back to Entity Framework
At the end of all this, what I was left with was a Stored Procedure defined in my SQL Server that will do the task required. Unfortunately, if I leave it as is it means I will need to use a tool like SQL Server Management Studio to execute the procedure manually. Since my users will occasionally need to do this themselves, I don’t think that is a viable option.
Instead, I need to be able to run the Stored Procedure programmatically, but doing so means using traditional ADO.NET methods. I would then turn around and create an Entity Context, and it feels silly to do both in the same program. To solve that problem, I added the Stored Procedure to my Entity Context. And that’s where we’ll pick up next time: using Stored Procedures in Entity Framework.