SSIS Is an In-Memory Pipeline

SSIS is an in-memory pipeline, so its performance benefits depend on all transformations taking place in memory. To check whether a package is staying within memory limits, review the SSIS performance counter Buffers spooled. Its initial value is 0; any value above 0 indicates that the engine has started swapping buffers to disk.

Capacity planning to understand resource utilization

To understand resource utilization, monitor the CPU, memory, I/O and network usage of the SSIS package.

CPU

It is important to understand how much CPU is used by SSIS and how much is used by SQL Server overall while Integration Services is running. The latter matters most when SSIS and SQL Server share the same box: if there is resource contention, SQL Server will usually win, which forces Integration Services to spill to disk and slows transformation speed.

The performance counter to monitor is Process / % Processor Time (Total). Measure it for both sqlservr.exe and dtexec.exe. If SSIS is not close to 100% CPU load, this may indicate one of the following:

Application contention – For example, SQL Server takes more processor resources, making them unavailable to SSIS

Hardware contention – Probably suboptimal disk I/O or not enough memory to handle the amount of data to be processed

Design limitation – The SSIS design is not making use of parallelism, and/or the package has too many single-threaded tasks

Network

SSIS moves data as fast as your network is able to handle it. Hence, it is important to understand your network topology and ensure that the path between the source and the destination has both low latency and high throughput. The following performance counters can help you tune the topology:

Network Interface / Current Bandwidth – Provides an estimate of the current bandwidth

Network Interface / Bytes Total/Sec – The rate at which bytes are sent and received on each network adapter

Network Interface / Transfers/Sec – How many network transfers occur per second. If the number is close to 40,000 IOPS, add another NIC and use teaming between the NICs

Input / Output (I/O)

A good SSIS package should hit the disk only when it reads from the sources and writes to the target. If the I/O is slow, however, reading and especially writing can create a bottleneck. It is therefore important to remember that an I/O system is specified not only by its size (such as 1 TB or 2 TB) but also by its sustainable speed (such as 20,000 IOPS).

Memory

The key counters to monitor memory for SSIS and SQL Server are as follows:

Process / Private Bytes (DTEXEC.EXE) – The amount of memory currently used by Integration Services that cannot be shared with other processes

Process / Working Set (DTEXEC.EXE) – The amount of memory allocated by Integration Services

SQL Server: Memory Manager / Total Server Memory – The amount of memory allocated to SQL Server. Because SQL Server can also allocate memory through the AWE API, this counter is the best indicator of the total memory used by SQL Server

Memory / Page Reads/sec – The total memory pressure on the system. If this value consistently goes above 500, the system is under memory pressure
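The two thresholds mentioned so far (Buffers spooled above 0 and Page Reads/sec consistently above 500) can be turned into a simple check over sampled counter values. The following is only a minimal sketch: the function names are hypothetical, the samples are assumed to have been collected elsewhere (for example exported from Performance Monitor), and "consistently" is interpreted as "every sample in the window", which is an assumption of this example.

```python
from typing import Sequence

# Threshold taken from the text: sustained Page Reads/sec above 500.
PAGE_READS_THRESHOLD = 500


def under_memory_pressure(page_reads_per_sec: Sequence[float],
                          threshold: float = PAGE_READS_THRESHOLD) -> bool:
    """Return True when every sample in the window exceeds the threshold.

    Treating "consistently" as "all samples" is an assumption of this sketch.
    An empty window is reported as no pressure.
    """
    return len(page_reads_per_sec) > 0 and all(
        sample > threshold for sample in page_reads_per_sec
    )


def is_spooling(buffers_spooled: int) -> bool:
    """Buffers spooled starts at 0; any value above 0 means buffers went to disk."""
    return buffers_spooled > 0


# Hypothetical samples, one per collection interval.
print(under_memory_pressure([620, 710, 540]))
print(is_spooling(0))
```

A single spike above 500 does not trip the check; only a window in which every sample is high does.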

Baseline Source System Extract Speed

It is important to understand the source system and the speed at which data can be extracted from it. Measure the speed of the source system by creating a simple package that reads data from the source and uses Row Count as its destination.

Execute this package from the command line and measure how long it takes to complete, using the Integration Services log output. Then apply the formula:

Rows/Sec = RowCount / Time
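As a minimal sketch of this calculation (the function name and the sample figures are illustrative assumptions, not measurements from a real package run):

```python
def rows_per_second(row_count: int, elapsed_seconds: float) -> float:
    """Baseline extract speed: Rows/Sec = RowCount / Time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return row_count / elapsed_seconds


# Hypothetical run: 12,000,000 rows read in 240 seconds.
baseline = rows_per_second(12_000_000, 240.0)
print(f"{baseline:,.0f} rows/sec")
```

Repeating the measurement after each tuning step gives a like-for-like comparison against this baseline.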

The resulting Rows/Sec value tells you the maximum number of rows per second that can be read from the source. To increase it, you can try one of the following:

Improve drivers and driver configurations: Ensure you are using up-to-date drivers and driver configurations for the network, data source and disk I/O.

Start multiple connections: To overcome driver limitations, you can start multiple connections to your data source. If the source can handle many concurrent connections, throughput will increase when you start several extracts at once. If concurrency causes locking or blocking issues, consider partitioning the source and having your packages read from different partitions to distribute the load more evenly.

Use multiple NICs: If the network is the bottleneck and you have ensured you are using gigabit network cards and routers, then a potential solution is to use multiple NICs per server.

Optimize SQL data source, Lookup transformations and Destination

Here are some optimization tips that you can implement in your SSIS packages:

Use NOLOCK or TABLOCK hints to remove locking overhead (bear in mind that NOLOCK allows dirty reads)

Refrain from using “SELECT *” in SQL queries. Name each column to be retrieved in the SELECT clause

If possible, perform datetime conversions at the source or target database

SQL Server 2008 Integration Services introduces a shared lookup cache. When parallel pipelines are used, it provides a high-speed cache that they can share

If Integration Services and SQL Server run on the same box, use the SQL Server destination instead of the OLE DB destination

A commit size of 0 is fastest on heap bulk targets. If you cannot use 0, use the highest possible commit size to reduce the overhead of multiple-batch writing. A commit size of 0 is a bad choice when inserting into a B-tree, because all incoming rows must be sorted at once into the target B-tree, and if memory is limited, a spill is likely. Note that a commit size of 0 might cause the running package to stop responding if the OLE DB destination and another data flow component are updating the same table. To ensure that the package does not stop, set the Maximum insert commit size option to 2147483647

Use a commit size of less than 5,000 to avoid lock escalation when inserting

Heap inserts are typically faster than inserts into a table with a clustered index. If a large part of the destination table is changing, it is therefore recommended to drop the indexes before the load and rebuild them afterwards

Use partitions and the partition SWITCH command. In other words, load a work table that contains a single partition, build the indexes and put the constraints on it, and then SWITCH it into the main table

Network tuning

Packet size is the main network property to examine when tuning the network. By default this value is set to 4,096 bytes. As noted in the documentation of the SqlConnection.PacketSize property in the .NET Framework Class Library, increasing the packet size improves performance because fewer network read and write operations are required to transfer a large data set. If your system is transactional in nature, lowering the value will improve performance.
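The effect of packet size on the number of network operations can be estimated with simple arithmetic. The sketch below is a rough illustration only: it ignores protocol headers and assumes every packet is filled completely, and the 32,767-byte figure is used as an example of a larger setting rather than as a recommendation from the text.

```python
def packets_needed(payload_bytes: int, packet_size: int) -> int:
    """Number of packets needed to carry a payload, ignoring header overhead."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (payload_bytes + packet_size - 1) // packet_size


one_gib = 1024 ** 3
for size in (4096, 32767):
    print(size, packets_needed(one_gib, size))
```

Under these assumptions, moving 1 GiB takes roughly eight times fewer packets at the larger size, which is the reduction in read and write operations the paragraph above refers to.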

Another network tuning technique is to use network affinity at the operating system level to increase the performance at high throughputs.

Use Data Type wisely

The following are some best practices for using data types:

Define data types as narrow as possible

Do not perform excessive casting of data types. Match your data types to the source or destination and specify data type casts explicitly

Take care of precision when using the money, float and decimal data types. Money is generally faster than decimal and has fewer precision considerations than float.

Change the design

The following are some best practices related to SSIS design:

Do not sort within Integration Services unless absolutely necessary. To sort data, Integration Services allocates memory for the entire data set that needs to be transformed. Preferably, presort the data beforehand, or use an ORDER BY clause to sort large data sets in the database.

There are times when using Transact-SQL is faster than processing the data in SSIS. Generally, all set-based operations perform faster in Transact-SQL because the problem can be transformed into a relational algebra formulation that SQL Server is optimized to resolve.

Set-based UPDATE statements – These are more efficient than row-by-row OLE DB calls

Aggregation statements such as GROUP BY and SUM are also calculated faster in T-SQL than in memory by the pipeline

Delta detection is a technique in which you change existing rows in the target table instead of reloading the table. To perform delta detection, you can use a change detection mechanism such as the SQL Server 2008 Change Data Capture (CDC) functionality. As a rule of thumb, if more than 10% of the target table has changed, it is often faster to simply reload it than to perform delta detection
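The 10% rule of thumb for delta detection can be written down as a small decision helper. This is a sketch under stated assumptions: the function name and the string labels are hypothetical, and the threshold is only the rule of thumb from the text, so it should be validated against measurements on the actual system.

```python
def load_strategy(changed_rows: int, total_rows: int,
                  reload_threshold: float = 0.10) -> str:
    """Choose between a full reload and delta detection.

    Returns "reload" when more than reload_threshold of the target table
    has changed, and "delta" otherwise. An empty target is simply reloaded.
    """
    if total_rows <= 0:
        return "reload"
    changed_fraction = changed_rows / total_rows
    return "reload" if changed_fraction > reload_threshold else "delta"


print(load_strategy(50_000, 1_000_000))
print(load_strategy(250_000, 1_000_000))
```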

Partition the problem

For ETL design, partition the source data into smaller chunks of equal size. Here are some more partitioning tips:

Use partitioning on your target table. Multiple instances of the same package can then be executed in parallel to insert data into different partitions of the same table. Use the SWITCH statement when partitioning: it not only increases parallel load speed but also allows data to be transferred efficiently.

As implied above, the package should define a parameter that specifies which partition it should work on.

Minimize logged operations

If possible, use minimally logged operations when inserting data into your target SQL Server database. When data is inserted in fully logged mode, the log grows quickly, because each row that is written to the database is also written to the log. Therefore, consider the following while designing SSIS packages:

Try to perform data flows in bulk mode instead of row by row. This minimizes the number of entries written to the log file, which in turn reduces disk I/O and improves performance

If you need to delete data, organize it in such a way that you can use TRUNCATE instead of DELETE. The latter places an entry in the log file for each row that is deleted. The former removes all the data and writes only a small entry to the log file

If partitions need to be moved around, use the SWITCH statement. This is a minimally logged operation

Be careful when combining other DML statements with your INSERT statements: if you mix them, minimal logging is suppressed.

Schedule and distribute it correctly

A good way to handle execution is to create a priority queue for your package and then execute multiple instances of the same package (with different partition parameter values). This queue can be a simple SQL Server table. Each package should include a simple loop in the control flow to:

Pick a relevant chunk from the queue

“Relevant” means that it has not already been processed and that all chunks it depends on have already been executed

Exit the package if no item is returned from the queue

Perform the work required on the chunk

Mark the chunk as “done” in the queue

Return to the start of the loop

Picking an item from the queue and marking it as “done” can be implemented as a stored procedure. Once you have the queue in place, you can simply start multiple copies of DTEXEC to increase parallelism.
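The loop described above can be sketched outside SSIS to make the logic concrete. The example below uses Python with an in-memory SQLite table standing in for the SQL Server queue table; the table name work_queue, its columns and the function names are all hypothetical. It runs a single worker, so it does not show the locking that a real stored procedure would need to make picking a chunk safe across several DTEXEC instances.

```python
import sqlite3
from typing import Callable, List, Optional, Sequence, Tuple


def create_queue(conn: sqlite3.Connection,
                 chunks: Sequence[Tuple[int, Optional[int]]]) -> None:
    """Create the queue table; each chunk is (chunk_id, depends_on or None)."""
    conn.execute(
        "CREATE TABLE work_queue ("
        " chunk_id INTEGER PRIMARY KEY,"
        " depends_on INTEGER,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.executemany(
        "INSERT INTO work_queue (chunk_id, depends_on) VALUES (?, ?)", chunks
    )
    conn.commit()


def pick_chunk(conn: sqlite3.Connection) -> Optional[int]:
    """Pick a pending chunk whose dependency (if any) is done, or return None."""
    row = conn.execute(
        "SELECT q.chunk_id FROM work_queue q"
        " LEFT JOIN work_queue d ON d.chunk_id = q.depends_on"
        " WHERE q.status = 'pending'"
        "   AND (q.depends_on IS NULL OR d.status = 'done')"
        " ORDER BY q.chunk_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE work_queue SET status = 'running' WHERE chunk_id = ?", (row[0],)
    )
    conn.commit()
    return row[0]


def mark_done(conn: sqlite3.Connection, chunk_id: int) -> None:
    """Mark a chunk as done so that dependent chunks become relevant."""
    conn.execute(
        "UPDATE work_queue SET status = 'done' WHERE chunk_id = ?", (chunk_id,)
    )
    conn.commit()


def run_worker(conn: sqlite3.Connection,
               process: Callable[[int], None]) -> List[int]:
    """Loop: pick a chunk, exit if none, do the work, mark it done, repeat."""
    processed: List[int] = []
    while True:
        chunk_id = pick_chunk(conn)
        if chunk_id is None:
            break
        process(chunk_id)
        mark_done(conn, chunk_id)
        processed.append(chunk_id)
    return processed


conn = sqlite3.connect(":memory:")
# Chunk 1 depends on chunk 2, and chunk 3 depends on chunk 1.
create_queue(conn, [(1, 2), (2, None), (3, 1)])
order = run_worker(conn, lambda chunk_id: None)
print(order)
```

The intermediate 'running' status is a design choice of this sketch: it keeps a second worker from picking a chunk that is already in progress, provided the pick itself is made atomic.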

Keep it simple

Avoid unnecessary components. Consider the following sequence as an example:

Step 1: Declare the variable varServerDate

Step 2: Use an Execute SQL Task in the control flow to run a SQL query that gets the server datetime and stores it in the variable

Step 3: Use the Data Flow Task to insert into or update the database with the server datetime from the variable varServerDate

This sequence is advisable only when the time difference between Step 2 and Step 3 really matters. If it does not, simply call getdate() at Step 3, as shown below:

CREATE TABLE #Table1 (t_ID int, t_date datetime)

INSERT INTO #Table1 (t_ID, t_date) VALUES (1, getdate())

Executing a child package multiple times from a parent with different parameter values

When a child package is executed from a master package, the parameters passed from the master package must be configured in the child package. Use the ‘Parent Package Configuration’ option in the child package to do this. This option requires you to specify the name of the ‘Parent Package Variable’ that is passed to the child package. If the same child package needs to be called multiple times (each time with a different parameter value), declare the parent package variables (with the same name as given in the child package) with their scope limited to the individual ‘Execute Package Tasks’. SSIS allows variables with the same name to be declared within one package as long as their scope is limited to different tasks.

SQL Job with many atomic steps

For the SQL job that calls the SSIS packages, create multiple steps that each perform a small task rather than one step that performs all the tasks. With one big step, the transaction log grows too large, and if a rollback takes place, it may take up the full processing capacity of the server.

Avoid unnecessary typecasts

Avoid unnecessary typecasts. For example, the flat file connection manager uses the string [DT_STR] data type for all columns by default. You have to change it manually if you need the actual data type. It is always better to set the correct type at the source level to avoid unnecessary type casting.

Transactions

ETL processes usually handle large volumes of data. In such scenarios, do not attempt a transaction around the whole package logic. SSIS does support transactions, however, and it is advisable to use them for smaller units of work that must be atomic.

Distributed transactions that span multiple tasks

The control flow of an SSIS package threads together various control tasks. In SSIS it is possible to set up a transaction that spans multiple tasks using the same connection. To enable this, set the RetainSameConnection property of the Connection Manager to True.

Limit the package name to a maximum of 100 characters

When an SSIS package with a name exceeding 100 characters is deployed to SQL Server, the name is trimmed to 100 characters, which may cause an execution failure.

SELECT * FROM

Do not pass unnecessary columns from the source to the destination. With the OLE DB source, the “Table or View” data access mode is equivalent to “SELECT * FROM tablename”, which fetches all the columns. Use ‘SQL Command’ to fetch only the required columns and pass those to the destination.

Excel source and 64-bit runtime

The Excel Source or Excel Connection manager works only with the 32-bit runtime. A package that uses the Excel Source with the 64-bit runtime enabled (it is enabled by default) will fail on a production server that uses the 64-bit runtime. Go to the solution property pages, open Debugging, and set Run64BitRuntime to FALSE.

On failure of a component, stop / continue the execution with the next component

When a component fails, the FailParentOnFailure property can be used either to stop the package execution or to continue with the next component in the sequence container. To continue, set the precedence constraint connecting the components in the sequence to “Completion” and set the FailParentOnFailure property to FALSE.

Protection

To avoid most package deployment errors when moving from one system to another, set the package protection level to ‘DontSaveSensitive’

Copy pasting script component

Once you copy-paste a script component and execute the package, it may fail. As a work-around, open the script editor of the pasted script component, save the script and then execute the package.

Configuration filter – Use as a filter

As a best practice, use the package name as the configuration filter for all configuration items that are specific to a package. This is especially useful when there are many packages with package-specific configuration items. Use a generic name for configuration items that are common to many packages.

Optimal use of configuration records

Avoid recording the same configuration item under different filter or object names. For example, if two packages use the same connection string, only one configuration record should be created. This can be achieved by using the same name for the connection manager in both packages. It is particularly useful when porting from one environment to another (such as UAT to Prod).

Pulling High Volume data

The process of pulling high-volume data is represented in the following flowchart:

The recommendation is to consider dropping all indexes from the target tables, if possible, before inserting data, especially when the insert volume is high.

Effect of OLEDB Destination Settings

Certain settings of the OLE DB destination affect the performance of the data transfer. Let’s look at some of them:

Data Access Mode – This setting provides the ‘fast load’ option, which internally uses a BULK INSERT statement to upload data into the destination table.

Keep Identity – By default this setting is unchecked, which means the destination table (if it has an identity column) generates identity values on its own. If you check this setting, the dataflow engine ensures that the source identity values are preserved and the same values are inserted into the destination table.

Keep NULLs – By default this setting is unchecked, which means that when a NULL value arrives from the source for a column, the default value is inserted instead (if a default constraint is defined on the target column). If you check this option, the default constraint on the destination column is ignored and the NULL from the source column is inserted into the destination column.

Table Lock – By default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by another process at the same time.

Check Constraints – By default this setting is checked, which means the dataflow pipeline engine validates the incoming data against the constraints of the target table. The recommendation is to uncheck it if you are sure the incoming data will not violate the constraints of the destination table, because doing so improves the performance of the data load.

Effects of Rows per Batch and Maximum Insert Commit Size settings

Rows per batch – The default value for this setting is -1, which means all incoming rows are treated as a single batch. If required, you can change this to a positive integer to break the incoming rows into multiple batches; the value is the number of rows in a batch

Maximum insert commit size – The default value for this setting is ‘2147483647’, which means all incoming rows are committed once on successful completion. If required, you can change it to another positive integer, in which case a commit is issued each time that number of records has been inserted. This puts some overhead on the dataflow engine, which has to commit several times, but it relieves the pressure on the transaction log and keeps tempdb from growing tremendously, especially during high-volume data transfers.

The above two settings are mainly aimed at improving the behavior of tempdb and the transaction log.
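The trade-off behind Maximum insert commit size can be made concrete by counting how many commits a load would issue. The sketch below is a simplified model, not a description of the engine's internals: the function name is hypothetical, and a value of 0 is treated as "one commit at the end", following the description of commit size 0 earlier in this document.

```python
def commit_count(total_rows: int, max_insert_commit_size: int) -> int:
    """Estimate how many commits one load issues for a given commit size."""
    if total_rows < 0 or max_insert_commit_size < 0:
        raise ValueError("arguments must not be negative")
    if max_insert_commit_size == 0:
        return 1  # assumption: everything is committed once at the end
    # Integer ceiling division: a final partial batch still needs a commit.
    return (total_rows + max_insert_commit_size - 1) // max_insert_commit_size


rows = 10_000_000
for size in (2147483647, 100_000, 4_999):
    print(size, commit_count(rows, size))
```

More commits mean more round trips for the dataflow engine but smaller open transactions, which is the pressure on the transaction log that the setting is meant to relieve.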

Avoid asynchronous transformations

While executing a package, the SSIS runtime engine executes every task other than the data flow task in the defined sequence. When it encounters a data flow task, execution of that task is taken over by the data flow pipeline engine. The pipeline engine breaks the execution of the data flow task into one or more execution trees and may execute these trees in parallel to achieve high performance.

To make this clearer, here is what an execution tree means. An execution tree starts at a source or an asynchronous transformation and ends at a destination or at the first asynchronous transformation in the hierarchy. Each tree has a set of buffers allocated to it, and the scope of these buffers is limited to that tree. In addition, every tree is allocated an OS thread (worker thread); unlike the buffers, this thread may be shared with other execution trees.

A synchronous transformation gets a record, processes it and passes it to the next transformation or destination in the sequence. The processing of a record does not depend on the other incoming rows. Because a synchronous transformation outputs the same number of rows as it receives, it does not require new buffers to be created and is therefore faster. For example, the Derived Column transformation adds a new column to each incoming row without adding any records to the output.

An asynchronous transformation can output a different number of rows than it receives, which requires new buffers to be created. Because its output depends on one or more input records, it is called a blocking transformation, and it may be partially or fully blocking. For example, the Sort transformation is fully blocking because it requires all the incoming rows to arrive before it can start processing.

Because an asynchronous transformation requires additional buffers, it performs more slowly than a synchronous one. Asynchronous transformations should therefore be avoided wherever possible. For example, instead of using the Sort transformation to get sorted results, use an ORDER BY clause in the source itself.
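The difference between a synchronous and a fully blocking transformation can be illustrated with Python generators. This is an analogy only, not a model of SSIS buffers or threads: the function names are hypothetical, and the events list simply records when the source is read so that the blocking behavior becomes visible.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]
events: List[str] = []


def source() -> Iterator[Row]:
    """Yield rows one at a time and record each read."""
    for amount in (30, 10, 20):
        events.append(f"read {amount}")
        yield {"amount": amount}


def derived_column(rows: Iterable[Row]) -> Iterator[Row]:
    """Synchronous: each row is handled on its own and passed on immediately."""
    for row in rows:
        yield {**row, "amount_x2": row["amount"] * 2}


def sort_rows(rows: Iterable[Row]) -> Iterator[Row]:
    """Fully blocking: nothing is emitted until every input row has arrived."""
    buffered = list(rows)  # a new buffer holding the whole input
    buffered.sort(key=lambda r: r["amount"])
    yield from buffered


# The synchronous step produces its first output after reading one row.
first = next(derived_column(source()))
events_after_sync = list(events)

# The blocking step has to read all three rows before emitting anything.
events.clear()
first_sorted = next(sort_rows(source()))
events_after_sort = list(events)

print(events_after_sync, events_after_sort)
```

The extra list built inside sort_rows plays the role of the additional buffers mentioned above, which is why pushing the sort into the source query is usually cheaper.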

Implement Parallel Execution in SSIS

SQL Server Integration Services (SSIS) allows parallel execution in two different ways, controlled by the two properties described below:

MaxConcurrentExecutables – This property defines how many tasks (executables) can run simultaneously. It defaults to -1, which is translated to the number of processors plus 2. If hyper-threading is turned on in your box, it is the logical processors rather than the physical processors that are counted. For example, suppose we have a package with 3 Data Flow Tasks, each of which has 10 flows of the form “OLE DB Source -> SQL Server Destination”. To execute all 3 Data Flow Tasks simultaneously, set MaxConcurrentExecutables to 3. The second property, EngineThreads, controls whether all 10 flows in each individual Data Flow Task start concurrently.

EngineThreads – This property defines how many work threads the scheduler will create and run in parallel. Its default value is 5. In the above example, if we set EngineThreads to 10 on all 3 Data Flow Tasks, then all 30 flows will start at the same time.

Note that EngineThreads governs both source threads (for source components) and work threads (for transformation and destination components). Source and work threads are both engine threads created by the Data Flow’s scheduler. In the above example, setting EngineThreads to 10 means up to 10 source threads and up to 10 work threads.

SSIS does not affinitize the threads it creates to any of the processors. If the number of threads surpasses the number of available processors, throughput may suffer because of an excessive number of context switches.
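The arithmetic in the example above (3 Data Flow Tasks with 10 flows each) can be checked with a back-of-the-envelope model. The sketch below is only that: the function names are hypothetical, and the product of the two limits is a rough upper bound on flows starting together, not a statement of how the scheduler actually assigns threads.

```python
def effective_max_concurrent(setting: int, logical_processors: int) -> int:
    """Resolve MaxConcurrentExecutables; -1 means logical processors plus 2."""
    if setting == -1:
        return logical_processors + 2
    if setting < 1:
        raise ValueError("setting must be -1 or a positive integer")
    return setting


def flows_started_together(tasks: int, flows_per_task: int,
                           max_concurrent: int, engine_threads: int) -> int:
    """Rough upper bound on the number of flows that can start at once."""
    return min(tasks, max_concurrent) * min(flows_per_task, engine_threads)


# The example from the text: 3 tasks, 10 flows each, both limits raised.
print(flows_started_together(3, 10, max_concurrent=3, engine_threads=10))
# The same package with the default EngineThreads value of 5.
print(flows_started_together(3, 10, max_concurrent=3, engine_threads=5))
```

Comparing the result with the number of logical processors gives a quick sense of whether a configuration is likely to cause the excessive context switching described above.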

Package restart without losing pipeline data

SSIS has a cool feature called Checkpoint, which allows a package to restart from the point of failure on its next execution. You can save a lot of time by enabling it, because execution resumes from the task that failed in the last run. To enable this feature for your package, set the three properties CheckpointFileName, CheckpointUsage and SaveCheckpoints. In addition, set the FailPackageOnFailure property to TRUE for every task that you want to be considered when restarting.

With these settings, when such a task fails, the package fails and the information is captured in the checkpoint file; on the subsequent execution, the package starts from that task.

It is very important to note that any task, including a data flow task, can participate in checkpoints, but checkpoints do not apply inside the data flow task. Consider a scenario in which you have set the FailPackageOnFailure property of a data flow task to TRUE so that it participates in checkpoints. Let’s assume that the data flow task contains five transformations in sequence and that execution fails at the 5th transformation (the first 4 having completed successfully). On the next execution, the package restarts at the data flow task, and the first 4 transformations run again before the 5th one is reached.

The following points are worth noting.

The For Loop and Foreach Loop containers do not honor checkpoints.

Checkpoints work only at the control flow level and not at the data level, so on a restart the failed data flow task is executed again from its start.

If the package fails, the checkpoint file stores the configuration values and variable values along with the point of failure. When the package is restarted, it takes all configuration values from the checkpoint file, so configuration values cannot be changed between the failure and the restart.
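The restart behavior described above can be sketched as a small task runner that records completed tasks in a file. This is an illustration of the idea only, not of the SSIS checkpoint file format: the JSON file, the function name run_package and the task names are all hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def run_package(tasks: Dict[str, Callable[[], None]],
                checkpoint_path: str) -> List[str]:
    """Run tasks in order, skipping those the checkpoint file lists as done."""
    completed: List[str] = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            completed = json.load(handle)

    executed: List[str] = []
    for name, task in tasks.items():
        if name in completed:
            continue  # finished in an earlier run, so it is not repeated
        try:
            task()
        except Exception:
            # Record what finished so the next run restarts at the failed task.
            with open(checkpoint_path, "w") as handle:
                json.dump(completed, handle)
            raise
        completed.append(name)
        executed.append(name)

    # A fully successful run no longer needs its checkpoint file.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
    return executed


attempts = {"load": 0}


def load() -> None:
    """Simulated task that fails on its first attempt only."""
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("simulated failure")


tasks = {"extract": lambda: None, "load": load, "report": lambda: None}
path = os.path.join(tempfile.mkdtemp(), "package.chk")

try:
    run_package(tasks, path)
except RuntimeError:
    pass  # first run fails at "load" and leaves a checkpoint behind

second_run = run_package(tasks, path)
print(second_run)
```

Note that the failed task is rerun from its beginning, which mirrors the point that a data flow task restarts as a whole rather than at the transformation that failed.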

Best practices for logging

Integration Services includes logging features that write log entries when run-time events occur and that can also write custom messages. Logging can capture run-time information about a package every time it runs, which helps you audit and troubleshoot it. For example, the log can capture the name of the operator who ran the package and the times at which the package began and finished.

Logging (or tracing the execution) is a great way of diagnosing problems that occur at runtime, and it is especially useful when your code does not work as expected. SSIS also lets you choose which events of a package and of its components to log, as well as the location where the log information is written (text files, SQL Server, SQL Server Profiler, Windows Event Log, or XML files).

Logging can save you hours of frustration in finding the cause of a problem, but the story doesn’t end there. While it helps you identify a problem and its root cause, it is also an overhead for SSIS that ultimately affects performance, especially if you log excessively. The recommendation is therefore to log errors (the OnError event of the package and containers) and to enable logging on other containers only if required. You can set the LoggingMode property (of a package and its executables) dynamically to enable or disable logging without modifying the package.

You can also create your own custom logging, which can be used for troubleshooting, package monitoring, building ETL performance dashboards and so on.

However, the best approach is to use the built-in SSIS logging where appropriate and augment it with your own custom logging, which can provide whatever additional information your requirements call for.

Security auditing and data auditing are out of the scope of this document.

To help you understand which bulk load operations will be minimally logged and which will not, the following table lists the possible combinations.

Table Indexes   | Rows in table | Hints              | Without TF 610 | With TF 610 | Concurrent possible
Heap            | Any           | TABLOCK            | Minimal        | Minimal     | Yes
Heap            | Any           | None               | Full           | Full        | Yes
Heap + Index    | Any           | TABLOCK            | Full           | Depends (3) | No
Cluster         | Empty         | TABLOCK, ORDER (1) | Minimal        | Minimal     | No
Cluster         | Empty         | None               | Full           | Minimal     | Yes (2)
Cluster         | Any           | None               | Full           | Minimal     | Yes (2)
Cluster         | Any           | TABLOCK            | Full           | Minimal     | No
Cluster + Index | Any           | None               | Full           | Depends (3) | Yes (2)
Cluster + Index | Any           | TABLOCK            | Full           | Depends (3) | No

(1) It is not necessary to specify the ORDER hint if you are using the INSERT … SELECT method, but the rows need to be in the same order as the clustered index. When using BULK INSERT, the ORDER hint is necessary.

(2) Concurrent loads are only possible under certain conditions. Only rows that are written to newly allocated pages are minimally logged.

(3) Depending on the plan chosen by the optimizer, the non-clustered index on the table may be either fully or minimally logged.

Best practices for error handling

There are two methods of extending the logging capability:

Build a custom log provider

Use event handlers

We can extend SSIS’s event handlers for error logging. We can capture errors in the OnError event of the package and let the package handle them gracefully. The actual error can be captured using a script task and logged to a text file or to a SQL Server table. Error details are available through system variables such as System::ErrorCode, System::ErrorDescription and System::SourceDescription.

If you are using custom logging, log the error in the same table.

In some cases you may wish to ignore the error or handle it at the container level, and in some cases at the task level.

Event handlers can be attached to any container in the package, and an event handler catches all events raised by its container and by any child containers of that container. Hence, by attaching an event handler to the package (which is the parent container), we can catch all events of that type raised by every container in the package. This is powerful because it saves us from building event handlers for each task in the package.

A container has an option to “opt out” of having its events captured by an event handler. If, say, you have a sequence container whose events you do not consider important to capture, you can simply switch them off using the sequence container’s DisableEventHandlers property.

If you want only certain events of that sequence container to be captured by an event handler, you can control this using the System::Propagate variable.

We recommend you to use se
