Data Transformation Services Tips & Best Practices


1/24/2005



What is Data Transformation Services?

DTS is a high-end, sophisticated, COM-based solution for a wide range of data exchange tasks. COM is a natural fit for the job because it is, first and foremost, a separation of interface from implementation.

DTS is a way to move data from one OLE DB data source to another OLE DB data source. This can be from Microsoft Excel to Microsoft SQL Server, from SQL Server to Microsoft Access, from Microsoft Excel to Microsoft Access, and so forth. The design is based solely on the OLE DB specification for the source and destination.

DTS can exchange data with, but is not limited to:

• Microsoft Excel
• Microsoft Access
• Microsoft Visual FoxPro
• Oracle
• DB2
• Microsoft Message Queue
• Microsoft SQL Server
• ASCII files
• Informix


What Data Transformation Services is Not

DTS does not provide generic Object Transfer Manager features (the creation of indexes, referential integrity constraints, defaults, and so forth). DTS is designed only to manipulate and move data. Schema definition and movement are not part of the specification or design of the product. For those activities, use the Transfer Manager and the appropriate upgrade wizards (when you work with SQL Server databases), or generate and execute scripts for indexes, constraints, defaults, and so forth (when you work with non-SQL Server databases) after the data has been loaded.
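For example, the kind of statements you would generate and run once the load completes might look like the following sketch (the table, column, and constraint names are hypothetical):

-- Create supporting schema objects after the data load completes
CREATE UNIQUE INDEX IX_Customer_ID ON Customer (CustomerID)

ALTER TABLE Customer
    ADD CONSTRAINT DF_Customer_Status DEFAULT ('A') FOR Status

ALTER TABLE Orders
    ADD CONSTRAINT FK_Orders_Customer
    FOREIGN KEY (CustomerID) REFERENCES Customer (CustomerID)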



Data Transformation Services Rules of Thumb

• Use a Bulk Insert task to load text files.
• Use BCP to export to a text file.
• If you really need speed and the source and destination are both SQL Server, a SELECT INTO with the select into/bulkcopy database option enabled is faster than a DTS copy (just a little harder to maintain).
• Where possible, use source SQL instead of ActiveX scripts to manipulate the incoming data stream. Using the Bulk Insert task followed by a DTS copy is still faster than a single ActiveX transform. Leverage the power of the relational engine and perform data transformations in the SQL source statement.
• Use source SQL statements instead of lookups, where possible.
• Use multiple pump tasks instead of a single Data-Driven Query (DDQ) task when the destination is SQL Server.
• For very large imports, set the Insert Commit Size to 10,000 (or less) instead of zero, and set the destination database log properties to truncate the log on checkpoint; this keeps log growth in check (a sketch of these option settings follows this list).

To set Insert Commit Size in SQL Server 7.0, right-click the Transform Data Task, click Properties, click Advanced, and set the Insert Commit Size value in the Data Movement section.

If you are using SQL Server 2000 or the SQL Server 2000 client tools to manage a SQL Server 7.0 computer, right-click the Transform Data Task, click Properties, click Options, and set the Insert Batch Size value. (This value is the same as the Insert Commit Size value.)

• Clean up your source data rather than cleaning the data on the way to the warehouse. This takes longer, but is a better practice. Use staging tables in SQL Server to gather, cleanse, and validate the data before the data is loaded into the warehouse.
• When you use DDQ or update the warehouse, use a source SELECT statement that compares the staging table to the warehouse table so that only the new records are selected. Reducing the records in the incoming data stream with a subquery and an IN operator is much more efficient than checking every incoming record with ActiveX scripts, lookups, or DDQ tasks to decide which records should be processed.
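As a sketch of the database option settings mentioned in the list above (the staging database name is a placeholder):

-- Allow non-logged SELECT INTO / bulk copy and keep the log from growing during the load
EXEC sp_dboption 'StagingDB', 'select into/bulkcopy', 'true'
EXEC sp_dboption 'StagingDB', 'trunc. log on chkpt.', 'true'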

Use Fewer Transformations

When you copy data, you can move each column in its own transformation, or you can copy many columns in a single transformation. Transformations are serialized, so fewer transformations may speed up the copy if you have a lot of columns. As a rule of thumb, use a single transformation if you have more than 20 columns (you can see up to a 5% improvement). If you have 20 or fewer columns you will not see much difference, so do not bother with fewer transforms. It is easier to see what's happening in the user interface with multiple transforms.

Use Ordinals Instead of Names in ActiveX Scripts

Sometimes, if you use ordinals instead of names, you may see a small performance improvement, especially on scripts with many columns. For example:
DTSDestination(1) = DTSSource(1)

A performance improvement is not guaranteed; it depends on where the other bottlenecks in your transformation are. In general, the gain is 5% to 10%, which may not be enough to compensate for the loss of readability in the script.


Use SQL Instead of ActiveX and Use Fewer ActiveX Scripts

Take advantage of SQL Server's query processor. If you can perform an operation in the SELECT statement that serves as the source SQL statement for a DTS pump, do that instead of using an ActiveX script. Calling an ActiveX script can cut performance by a factor of 2 to 4 compared with a straight DTS copy.
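For example, trimming, case-folding, and default substitution can usually be pushed into the source statement; a sketch (the table and column names are hypothetical):

-- Transform the data in the source SELECT instead of in an ActiveX script
SELECT UPPER(LTRIM(RTRIM(LastName))) AS LastName,
       ISNULL(Region, 'Unknown') AS Region,
       CONVERT(varchar(10), OrderDate, 112) AS OrderDateKey
FROM StagingOrders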

Combine ActiveX Scripts And Save Time

Each call to the ActiveX scripting engine accounts for the majority of the overhead of the transformation. So, if you need an ActiveX script to transform several columns, use a single script for all of the columns to limit the number of calls to the scripting engine. Because the overhead is in the call to the scripting engine rather than in the calculations the script performs, do whatever you need to do to transform your data and do not worry about the cost of complex calculations inside the script.
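For instance, instead of three separate scripted transformations, a single script (a VBScript sketch with hypothetical columns and ordinals) can handle all of the columns in one call:

Function Main()
    ' One script, one call to the scripting engine, all columns handled together
    DTSDestination(1) = UCase(Trim(DTSSource(1)))
    DTSDestination(2) = DTSSource(2) * 100
    DTSDestination(3) = DTSSource("OrderDate")
    Main = DTSTransformStat_OK
End Function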

Scripting Languages

VBScript is faster than JScript, which is faster than PerlScript; each step down costs roughly 10% in speed. PerlScript does not ship with SQL Server but is available from ActiveState.com.

Other scripting languages can be used if they are installed; however, they are not tested, the functions will not show up in the scripting dialog, and the main function is not autogenerated.

File Import/Export

The fastest way to import data from a file into SQL Server is to use the Transact-SQL BULK INSERT statement or the DTS Bulk Insert task; the DTS Bulk Insert task is a user interface that builds the statement for you. This technique is new in SQL Server 7.0 and is not the same as BCP or a DTS copy. It tends to be about twice as fast as either because it is a server-side process executed entirely in the SQL Server process space. Bulk Insert can use existing BCP format files or handle some standard delimiters.


There are certain limitations to the use of Bulk Insert:
• You cannot perform data transformations with Bulk Insert.
• Bulk Insert can only import from text files, and it can only load into SQL Server 7.0.
• Since the SQL Server service performs the loading, the source text file must be accessible from the SQL Server process.
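A minimal sketch of the Transact-SQL form (the database, table, file path, and delimiters are examples only; note that the path is resolved by the SQL Server service, not the client):

BULK INSERT StagingDB..CustomerStage
FROM 'C:\loads\customer.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 10000)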

Simple Export

The fastest way to export to a text file in SQL Server 7.0 is to use BCP.exe. DTS is much slower than BCP when you export to a text file.

Because you cannot perform data transformations with BCP, it may not be an option unless you are exporting the contents of a table as-is. If you need transformations on export, use the DTS data pump task, although performance will be slower.
SQL Server is the only supported source (although multiple versions of SQL Server are supported).
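A sketch of a character-mode export from the command prompt (the server name, table, field terminator, and output path are examples; -T assumes a trusted connection):

bcp pubs..authors out C:\exports\authors.txt -c -t, -S MYSERVER -T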


Import with Transformations

A bulk insert to a staging table followed by a DTS copy may be faster than importing and transforming all in one step with a single ActiveX transformation. Use SQL instead of ActiveX scripts even if it takes more steps. Using more steps is not always slower.


Import of Files (or other data) to Non-SQL Server Destinations
Unless your third-party provider has a bulk insert program, your best bet is DTS. You can use either a DTS Data Pump copy task or Data-Driven Query (DDQ) task. Depending on the driver, DDQ may be marginally faster than the pump task (you'll have to try it yourself, though). The pump task is far easier to create and maintain. DDQ cannot export to a text file or to Microsoft Excel.


It is better to use the SQL Server 7.0 OLE DB driver to communicate with previous versions of SQL Server, since it has neither limitation. However, in order to use the SQL Server 7.0 OLE DB driver, you must run the SQL Server 7.0 version of instcat.sql on the previous version of SQL Server. You will see error messages when you run the instcat.sql script. The script is designed to be backward compatible, so the error messages are preceded by a message indicating that you are running against a server that is not SQL Server 7.0 and that the errors are expected and can be ignored.
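One way to run the script against the earlier server, assuming the osql utility and placeholder server, login, and file names:

osql -S OLDSERVER -U sa -P password -i instcat.sql -o instcat.out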



For Loading the Data Warehouse


Use Pump Tasks for Inserts

Do not use a DDQ task to insert records when the destination is SQL Server. Use DDQ for updates and other non-insert tasks. This is even true when you split records.


Splitting Records

Suppose you want to send data from a single staging table to both a dimension table and a fact table. You might think it would be more efficient to eliminate multiple reads on the source and do this in a single task such as a DDQ task. This is possible, but typically slow.

A faster method is to use two data pump tasks. Even though you are doing two source reads, you make up for it by being able to do fast load on the destination.

Dealing with a Surrogate Key

A common special case of loading two tables at once is when the dimension table has a surrogate key like an identity column. You then need the ID of each dimension record when you load the fact table. You can use the same two methods that were mentioned in the Splitting Records section.


You can use a single DDQ task to create both records at once (this won't do a bulk load and is best when the destination is not SQL Server). Here's an example that uses a stored procedure to INSERT into a table and perform an UPDATE if the insert fails:
create proc sp_updauth
    @ID char(30), @LName varchar(40)
as
    INSERT INTO authors2 (au_id, au_lname) VALUES (@ID, @LName)
    if @@ERROR = 2627  -- cannot insert duplicate key
    begin
        update authors2 set au_lname = @LName where au_id = @ID
        return (0)
    end


When the destination is SQL Server, use two pump tasks instead of DDQ. For the pump task that loads the fact table, have it follow the load of the dimension table and attempt to get the ID for the dimension record with a SELECT statement in the source. If that fails, you can use a lookup with a suitable cache size. The benefits of fast loading the destination data (which you would not get with DDQ) outweigh the cost of the lookup.
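A sketch of the fact-table pump's source statement for this approach (the staging table, dimension table, and key names are hypothetical):

-- Pick up the surrogate key assigned to each dimension row during the earlier load
SELECT d.CustomerKey, s.OrderDate, s.SalesAmount
FROM StagingSales s
JOIN DimCustomer d ON d.CustomerNumber = s.CustomerNumber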


Update the Data Warehouse


Create Staging Tables

Staging tables are very useful for several reasons:
• They allow you to collect data from several systems and clean or validate it before it goes into the warehouse.
• Cleaning a staging table with a stored procedure is often faster than an ActiveX script task during a copy (a sketch of such a procedure follows this list).
• Staging tables can be selected against to reduce the number of rows copied or perform final transformations with SQL source statements instead of ActiveX scripts. Many other sources (like text files) cannot have a SELECT statement run against them (or it is expensive to do so).
• A bulk insert to a staging table followed by a DTS copy may be faster than a single ActiveX transform.
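A sketch of the kind of cleanup procedure mentioned above (the table, columns, and rules are hypothetical):

CREATE PROC usp_CleanCustomerStage
AS
    -- Normalize text, supply defaults, and drop rows that cannot be repaired
    UPDATE CustomerStage
    SET LastName = UPPER(LTRIM(RTRIM(LastName))),
        Region = ISNULL(Region, 'Unknown')

    DELETE CustomerStage
    WHERE CustomerID IS NULL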

Select Only the Records Needed to Add or Update

Do not attempt to copy every record, relying on a script or a primary key to reject records that should not be copied. This takes much longer than selecting only the records you need to copy (see the "Create Surrogate Keys for Data Comparison" section for information on how to do this).

As a side benefit, selecting only changed records makes packages restartable. If not all the records were copied, restart the package and it picks up where it left off. This is the fastest form of row-level restartability.

Create Surrogate Keys for Data Comparison (Change Capture)

When you check to see what source records do not exist on the destination, you have several options (listed best to worst):
• Use a "Date Changed" or "Date Last Updated" (or some other increasing field on the source and select the records in the date range you want to copy.
• Compare an ID in the source table to the destination with a statement like:
SELECT * FROM source..Test WHERE id NOT IN (SELECT id FROM dest..test)
• If a unique ID does not exist in your source, you can create a temporary one by concatenating columns to produce a unique value.

Here is an example of how you might concatenate columns to produce a temporary key:
select * from src..test
where col1 + col2 + convert(varchar(20), col3) not in
    (select distinct col1 + col2 + convert(varchar(20), col3)
     from dest..test)


If a temporary key like this is to be used repeatedly, consider creating a comparison key on the staging table and on the warehouse table just to speed up updates. This key can be a column that contains the concatenation shown above, or a column with a unique ID or identity value that you can use for the "SELECT ... NOT IN (SELECT ..." comparison.
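For example, the comparison key could be added as a computed column on both tables (this assumes col1 and col2 are character columns, as in the query above):

-- Add the same comparison key to the staging table and the warehouse table
ALTER TABLE test
    ADD cmp_key AS col1 + col2 + CONVERT(varchar(20), col3)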

Miles To Go