Tuesday, March 15, 2011

New features of SSIS 2008

1. Script Task
    - In SSIS 2005, the Script Task supported only Microsoft Visual Basic .NET; in SSIS 2008, it supports C# as well.
    - Before writing the script, we can choose the language. This is possible because SSIS 2008 uses VSTA (Visual Studio Tools for Applications).


2. Lookup Transformation
    - In 2005 there was no cache file, so the cache was reloaded every time; in 2008 we can create a cache file, so the cache does not need to be repopulated every time.
    - In 2005 it had only 2 outputs (one regular and one Error); in 2008 it has 3 outputs (Match, No Match and Error).


3. Data Profiling Task
    - A new task introduced in 2008 for profiling data in SQL Server tables (e.g., column value distributions, NULL ratios, candidate keys)


4. Multicast
    - Though Multicast already split the flow into multiple outputs in 2005, all outputs were processed serially, so there was no actual parallelism. In 2008, sub-paths were introduced, so the outputs can actually be processed in parallel.


5. New T-SQL features
    - We can leverage new SQL Server 2008 T-SQL features such as CDC (Change Data Capture) and MERGE; see the sketch below.
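For example, a minimal sketch of enabling CDC on a source table (the dbo.Orders table below is only a placeholder):

-- Enable Change Data Capture at the database level (requires sysadmin)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC on a hypothetical source table dbo.Orders
-- (SQL Server Agent must be running for the capture job)
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Orders',
     @role_name     = NULL;   -- no gating role

The changes can then be read from the generated change tables/functions inside an SSIS source query.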



Debugging SSIS and using Data Viewer

- An SSIS package can be debugged using breakpoints and data viewers

DEMO

- Add one variable called "counter" of integer data type
- Drag a For Loop Container into the control flow.
- Set the properties like this.
     InitExpression : @counter = 1
     EvalExpression : @counter <= 10
     AssignExpression : @counter = @counter + 1
- Drag Data Flow Task inside For Loop Container
- Open DFT
- Drag OLEDBSource which should point to some table
- Drag OLEDBDestination which should point to some table
- Connect source and destination
- Double click on the link between source and destination
- Go to Data Viewers and click on Add.
- Go to General tab, and make sure "Data Grid" is selected
- Go to the "Grid" tab and select the columns for which you want to see the data
- now  your data flow task should look like this
- come back to control flow
- Right click on Data Flow Task and "Edit Breakpoints"
- We have different break conditions available, depending on where we want to debug.
- The setting shown here means that when the Data Flow Task's pre-execute event fires for the 5th time or later, execution will pause at the breakpoint.
- Now execute the package.
- Data viewer will display the data like this.
- Now you can see exactly what data is flowing between source and destination
- Click on > button 4 times
- Now it will be stopped like this
- Now you can enable watch window like this
- Now drag variable into watch window (Name column)
- It will display the value of the variable.
- You can hit F5 to proceed further.


Control Flow vs Data Flow




|               | Control Flow | Data Flow |
|---------------|--------------|-----------|
| Step          | 1st step in creating a package | 2nd step in creating a package; it is part of the control flow |
| Made up of    | Containers and tasks connected with precedence constraints | Sources, transformations and destinations |
| About         | Data preparation and managing the process | ETL activities |
| Smallest unit | Task | Component |
| Data movement | Does not move data from one task to another | Moves data from source to destination |
| Processing    | Task 1 has to complete before Task 2 begins | Multiple components can be processed at the same time |
| Count         | Only one control flow per package | There can be multiple Data Flow Tasks inside the control flow |
| Error output  | No explicit error output, but it can be achieved by editing precedence constraints | Error output is available on source/transformation/destination |

Performance Improvement in SSIS

(E) EXTRACT IMPROVEMENT


1) If there is a Flat File Source / Data Conversion Transformation, set "Fast Parse" to "True".
    - It is available only in the Flat File Source & Data Conversion Transformation
    - It is specified at the column level
    - Default value is False
    - When we set it to True, SSIS uses faster, locale-insensitive parsing routines and skips some pre-execute validations, assuming all your data is in the expected format


    Steps

  1. Right-click the Flat File source or Data Conversion transformation, and then click Show Advanced Editor.
  2. In the Advanced Editor dialog box, click the Input and Output Properties tab.
  3. In the Inputs and Outputs pane, click the column for which you want to enable fast parse.
  4. In the Properties window, expand the Custom Properties node, and then set the FastParse property to True.
  5. Click OK.

2) Set the packet size to 32767 in the Connection Manager.
    - This bumps up the packet size from the default of 4 KB (4096 bytes)
    - This requires the network admin to enable "Jumbo Frames" on the network




3) In the OLEDB source, use a T-SQL query instead of a table as the direct input (see the sketch below)
    - This allows you to choose specific columns instead of pulling all the columns
    - We can specify NOLOCK, which avoids taking shared locks on the table
    - We can use ORDER BY, GROUP BY, joins and computed columns instead of using separate Sort,
       Merge Join, Derived Column and Aggregate transformations.
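For instance, a single source query of this shape (table and column names are purely illustrative) can replace several downstream transformations:

SELECT  o.CustomerID,
        c.CustomerName,
        SUM(o.Amount) AS TotalAmount          -- replaces an Aggregate transformation
FROM    dbo.Orders    AS o WITH (NOLOCK)      -- NOLOCK avoids shared locks on the source table
JOIN    dbo.Customers AS c WITH (NOLOCK)      -- replaces a Merge Join transformation
        ON c.CustomerID = o.CustomerID
GROUP BY o.CustomerID, c.CustomerName
ORDER BY o.CustomerID                         -- replaces a Sort transformation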


4) In the Cache connection manager, try to create a cache file (.caw) instead of holding the cache only in memory


5) If you are using the same OLEDB source connection in multiple places, set the "RetainSameConnection" property to "True"


     - This allows the engine to use the same connection every time
     - Default value: False. This means the engine opens a connection, gets the data and closes the connection every time.
     - By setting it to True, the connection is opened and closed only once.




6) Divide the source data into chunks instead of pulling everything from a single master pool.


(T) TRANSFORM IMPROVEMENT


1) Choose transformations based on their usage and how they use buffers (see the three categories below)


Behind the scenes, the data flow engine uses a buffer-oriented architecture to efficiently load and manipulate datasets in memory.
  • Row Transformations
    - They manipulate data or create new fields using the data that is available in that row.
    - They might create new columns but not new rows
    - Each output row has a 1:1 relationship with an input row
    - Also known as synchronous transformations
    - They use the existing buffer rather than a new buffer
    - Examples - Derived Column, Data Conversion, Multicast, and Lookup.

  • Partially blocking transformations 
    - They are often used to combine datasets using multiple data inputs. 
    - As a result, their output may have the same, greater, or fewer records than the total number of input records. 
    - Also known as asynchronous transformations
    - Output of the transformation is copied into a new buffer and a new thread may be introduced into the data flow
    - Examples - Merge, Merge Join, and Union All. 

  • Blocking transformations 
    - They must read and process all input records before creating any output records. 
    - They perform the most work and can have the greatest impact on available resources of the three categories
    - Also known as asynchronous transformations
    - As with partially blocking transformations, the output of the transformation is copied into a new buffer and a new thread may be introduced into the data flow
    - Example - Aggregate and Sort. 


2) Use the MERGE statement when you have to do INSERT/UPDATE/DELETE from source to target, instead of using multiple transformations (a sketch follows)
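A minimal sketch of such a MERGE, assuming hypothetical staging and target tables:

MERGE dbo.DimCustomer AS target
USING dbo.StagingCustomer AS source
      ON target.CustomerID = source.CustomerID
WHEN MATCHED AND target.CustomerName <> source.CustomerName
     THEN UPDATE SET target.CustomerName = source.CustomerName
WHEN NOT MATCHED BY TARGET
     THEN INSERT (CustomerID, CustomerName)
          VALUES (source.CustomerID, source.CustomerName)
WHEN NOT MATCHED BY SOURCE
     THEN DELETE;   -- rows no longer present in the source are removed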


3) Choice of type of cache inside Lookup Transformation
     - Full Cache : for small dataset
     - No Cache : for volatile dataset
     - Partial Cache : for large dataset


4) Sort, Merge Join, Union All, Pivot, Aggregate, SCD and Data Conversion transformations can often be replaced by plain T-SQL (see the sketch below)


      - There is much more control over all the objects
      - The T-SQL operation will be much faster than the SSIS transformations because the
         data flow buffers are not used.
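As an illustration, a single query like this (table names are assumptions) replaces a Union All, a Data Conversion and a Sort transformation in one go:

-- Combine two yearly tables, convert a column and sort, all in the source query
SELECT CAST(EmployeeID AS INT) AS EmployeeID, HireDate FROM dbo.Employees_2010
UNION ALL
SELECT CAST(EmployeeID AS INT) AS EmployeeID, HireDate FROM dbo.Employees_2011
ORDER BY HireDate;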



5) Make data types as narrow as possible so that they allocate less memory in the buffers




(L) LOAD IMPROVEMENT


1) Try to execute the package on your destination server rather than on the source server.


      - LOAD is a more expensive operation than EXTRACT
      - So it pays to execute the package on the same server as the destination


2) Make a smart choice between dropping and keeping indexes


      - It is not always right to keep the indexes, nor to always drop them before you load.
      - If there is a clustered index, don't drop it, because the data is sorted on this key, and dropping
        and rebuilding a clustered index will take even more time.
      - If there is a single non-clustered index and you expect more than 100% new data, then
        dropping and re-creating the index will help.
      - If there are multiple non-clustered indexes, probably leave them as they are.


   But these are not hard-and-fast rules; trial and error will always give you the best result.
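If you do decide to drop and re-create a non-clustered index around the load, disabling and rebuilding it is one way to script that (index and table names below are placeholders):

-- Before the load: disable the non-clustered index
ALTER INDEX IX_FactSales_CustomerID ON dbo.FactSales DISABLE;

-- ... the SSIS data flow loads dbo.FactSales here ...

-- After the load: rebuild the index (rebuilding also re-enables it)
ALTER INDEX IX_FactSales_CustomerID ON dbo.FactSales REBUILD;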


3) If there is a very heavy load on the destination, partitioning the table will probably help


4) Set proper values for "Rows per batch" and "Maximum Insert Commit Size"




Rows per batch - how many rows you want to send to the server in each batch
Maximum Insert Commit Size - how many rows you want to commit in one shot
      - If the value is 2147483647 (the default), that many rows will be committed in one single
        transaction at the end of the load.
      - If you really have that many rows to load, it is better to define a proper value for this commit
        size. For example, if you define 100000, then 100,000 rows will be committed in one shot; a huge
        DML operation in one single transaction will degrade the performance.
      - If it is 0, the whole load is committed in a single transaction, and the package might stop
        responding if the same table is being used by some other process.





Sunday, March 13, 2011

Transaction in SSIS

- Transactions are supported in SSIS and the support is built into the product
- We can have individual transaction at package, container or task level.
- There is a property called "TransactionOption" at all these levels.
- 3 values are there for this property
        (A) Supported
                - This is default
                - It means that if a transaction already exists before the package/container/task runs, it will join that transaction.
                - But it will not create any new transaction
        (B) Not supported
                - It will neither create any new transaction nor join into existing transaction
        (C) Required
                - If transaction exists before running a package, then it will join into that transaction
                - otherwise it will create new transaction
- Design
        - We can set "Required" value at package level and "Supported" value at all the containers/task level

Lookup, Cache Transformation & Cache Connection

Cache Connection Manager
- This connection manager is used to define the cache
- The cache is stored in memory; however, we can create a cache file to hold the data.
- Once we feed the data into that file (*.caw), later, in any data flow task, we can use the connection manager and get the data from that cache file

Cache Transformation
- This is one of the Data Flow transformations
- It writes data from the data flow into the memory cache / cache file configured in the Cache Connection Manager
- This is a very helpful transformation in terms of performance improvement

Lookup Transformation
- Basically it is used to look up data against a reference table
- It uses an equi-join.
- All the values from the source table (joining column) SHOULD exist in the reference table; if not, it throws an
   error (by default)
- A NULL value is also treated as a non-matching row
- It is CASE-SENSITIVE (see the quick T-SQL check after the pre-requisite script in the example below).
- It can internally use Cache Connection / OLEDB Connection.
- There can be 3 cache types
         (A) Full Cache
                  - Cache connection manager can be used only with this type of cache
                  - In this case, when package starts, data will be pulled and kept into memory cache /  file cache
                     and later only cache will be used and not database
                  - We might not get the latest data from database
         (B) Partial Cache
                   - Cache connection manager can't be used with this
                   - The "Advanced" tab of the Lookup transformation is enabled only in the case of Partial cache, where one
                      can configure the cache size
                   - This is a bit different from Full cache
                   - Initially the cache is empty. For each value, the lookup first checks the cache; if not found, it goes to the
                     database, and if the value is found there, it stores that value in the cache so it can be reused later.
                   - Startup time will be less than Full cache but processing time will be longer
                   - It also needs a good amount of available memory.
         (C) No Cache
                   - Every row is looked up against the database each time
- There are 3 outputs
         (A) Match Output
         (B) No Match Output
         (C) Error Output

EXAMPLE

Pre-requisite

Execute the following query in database


CREATE TABLE LookupDemo (EID int, EName varchar(10), Country VARCHAR(10))
CREATE TABLE LookupDemoReference (Country VARCHAR(10), Region VARCHAR(15))

CREATE TABLE LookupDemoOutput (EID int, EName varchar(10), Country VARCHAR(10), Region VARCHAR(15))

INSERT INTO LookupDemo (EID,EName,Country) VALUES
(1,'Nisarg','India'),
(2,'Megha','INDIA'),
(3,'Swara','China'),
(4,'Nidhi','USA'),
(5,'Lalu','Japan')

INSERT INTO LookupDemoReference (Country, Region) VALUES
('India','ASIA'),
('India','ASIA-Pacific'),
('China','ASIA-Pacific'),
('USA','North America'),
('Japan','ASIA-Pacific')
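Note that the data above deliberately contains both 'India' and 'INDIA'. Assuming the database uses the default case-insensitive collation, a plain T-SQL join would match the 'INDIA' row, whereas the Lookup transformation's in-memory comparison is case-sensitive; a quick check:

-- Under a case-insensitive collation this join DOES match 'INDIA'
-- (and returns two rows per India employee, because the reference table has two 'India' entries).
-- The SSIS Lookup, by contrast, compares case-sensitively in its cache
-- and picks only one reference row per key.
SELECT d.EID, d.EName, d.Country, r.Region
FROM   LookupDemo AS d
JOIN   LookupDemoReference AS r ON r.Country = d.Country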




Steps

1. Right-click in the Connection Managers tray and click on "New Connection"
2. select "CACHE" and click on Add
3. Check "Use File Cache" checkbox to store the cache data in a file.
4. select the file destination and give some name
5. Click on "columns" tab and start adding columns.
6. Add 2 columns (Country , Region) with IndexPosition 1 and 0 respectively.
    Index Position 1 : this will be a joining key on which we will use joining conditions
    Index Position 0 : Other columns

7. Drag data flow task and rename as "Build Cache"
8. Go inside DFT and drag OLEDB source which should refer to LookupDemoReference table.
9. Drag Cache Transformation and connect it with source
10. In Connection Manager, select created cache connection manager
11. click on mapping and make sure, you have proper mapping between columns.
12. Go back to control flow and drag one more data flow task and rename it to "Lookup Data"
13. Connect 1st DFT to 2nd DFT. It should look like this.
14. Go to Lookup Data and add OLEDB source which should point to LookupDemo table
15. Drag Lookup Transformation and connect it with source
16. In the General tab of the Lookup Transformation, select "Full Cache" and, as the connection type, "Cache Connection Manager".

17. In the Connection tab, select the cache connection manager created earlier.
18. Do column mappings as shown below.
19. Drag OLEDB destination and connect it with Lookup Transformation. Make sure to choose "Lookup Match Output"
20a. OLEDB destination should point to LookupDemoOutput table
20b. Column mapping should be like this.

21. Now execute the package.
22. The package will fail at the Lookup Transformation
23. Reason for the failure:
      - The source table has one record with 'INDIA' which is not available in the reference table (which has 'India'); this shows that
         the Lookup transformation is CASE-SENSITIVE
      - The Lookup Transformation differs from Merge Join here: Merge Join ignores rows that do not
         match, but Lookup throws an error.
24a. Open Lookup Transformation and change settings like this.

24b. Now drag Derived Column Transformation and connect it with Lookup Transformation. Make sure Lookup No Match Output gets displayed on the arrow.
25. In Derived Column, add a new column like this. Make sure to cast it into DT_STR

26a. Remove the connection between OLEDB Destination and Lookup Transformation.
26b. Now drag "Union All Transformation" and connect it with Lookup Transformation. Make sure Lookup Match Output gets displayed on the arrow.
27. Select "Invalid Column" in last row, last column

28. Connect Union All transformation to OLEDB destination
29. Whole DFT will look like this.
30. Execute the package.
31. We can see that 1 row has been redirected as "No Match Output" and, finally, via the Union All it reaches the destination.
32. Result table is looking like this.
33. This demonstrates the following:
      - The Lookup Transformation is case-sensitive
      - If 2 reference records exist for the same key, the Lookup Transformation picks up only the 1st record.
      - The Lookup Transformation fails if the joining key value is not in the reference table and we have not configured the No
         Match output.





Tuesday, March 8, 2011

Error Handling in SSIS

--> Error Handling can be achieved in Data Flow Task by configuring Error Output.
--> 3 actions can be performed whenever some error occurs
      - Ignore Failure
      - Redirect Row
      - Fail Component
--> By default, Fail Component is used by SSIS
--> Ignore Failure - if an error occurs, SSIS will ignore that error and move ahead
--> Redirect Row - if an error occurs, SSIS moves the erroneous (bad) records to a different destination and the
      correct records to the actual destination

We will look into an example of redirecting rows.

EXAMPLE

0. Create one source.txt file with some data, like this:
           1, 01/01/2001
           2, 01/01/2002
           3, 01/01/2003
           4, 02/30/2004
    Here we can see that the 4th record does not have a correct date (there is no 30th of February).
1. Create one Data Flow Task
2. In DFT, drag Flat File source and create new connection manager.
3. Give a source connection manager name and choose the created file.
4. Click on columns and you can see the records like this.
5. Now drag Data Conversion Transformation and connect it with source.
6a. Convert the 1st column into Integer and the 2nd column into Date as shown below.
6b. Click on Configure Error Output
7. Select "Redirect Row" from dropdown in Error column

8. Drag FlatFile Destination and connect it with Data Conversion Transformation
9. Click on new connection, give the connection manager a name and mention a file name. The file does not need to already exist at the mentioned path.
10. "Column names in the first data row" checkbox will allow us to display header in destination file.
11. Click on Advanced and delete the original columns.

12. Check "Overwrite data in the file" if you want the content to be overwritten every time you execute the package; otherwise the rows will be appended to the existing content on each run.
13. Now check the mapping once. It should be like this.
14. Drag one more Flat File Destination and connect it with Data Conversion Transformation (Red Arrow).
15. Create new connection with new error file and Check the mappings like this,
16. Now execute the package and check the result. The success file should have 3 rows and the error file should have 1 row along with the error code.


Monday, March 7, 2011

Pivot Transformation in SSIS

- It is used to transpose rows into columns, just like Excel's Transpose
- T-SQL also has a PIVOT command available, which is more reliable and faster than the SSIS transformation (a sample query is shown after the pre-requisite script below)
- Pivot in SSIS is generally not recommended


This is our original dataset




And this is our desired result







EXAMPLE


Pre-requisite


Following script has to be executed

CREATE TABLE PivotDemo (Student VARCHAR(10), Subject VARCHAR(10), Mark INT)
CREATE TABLE PivotOutputDemo (Student VARCHAR(10), Mathematics INT, Science INT, EnglishLanguage INT)

INSERT INTO PivotDemo VALUES
('Nisarg','Maths',100),
('Nisarg','Science',100),
('Nisarg','English',99),
('Megha','Maths', 98),
('Megha','Science', 100),
('Megha','English', 95),
('Swara','Maths', 96),
('Swara','Science', 80),
('Swara','English', 90)
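As mentioned above, the same transpose can be done directly in T-SQL; a minimal sketch against the PivotDemo table from the script above:

-- T-SQL alternative to the Pivot transformation (assumes the subject list is fixed)
INSERT INTO PivotOutputDemo (Student, Mathematics, Science, EnglishLanguage)
SELECT Student, [Maths], [Science], [English]
FROM   (SELECT Student, Subject, Mark FROM PivotDemo) AS src
PIVOT  (MAX(Mark) FOR Subject IN ([Maths], [Science], [English])) AS pvt;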



Steps


1. Add Data Flow Task in Control Flow
2. Drag source in DFT and it should point to PivotDemo table
3. Drag Pivot transformation and connect it with source
4. Go to "Input columns" tab and select all 3 columns



5. Go to "Input & output properties" tab and expand "Input columns". Set PivotUsage property as below.


Student --> 1
Subject --> 2
Mark --> 3




Definition of PivotUsage value



  • 0 – the column is passed through unaffected
  • 1 – the column values become the rows of the pivot (aka the Set Key)
  • 2 – the column values become the column names of the pivot (aka the Pivot Column)
  • 3 – the column contains the values that are pivoted (aka the Pivot Value)
6. Expand Pivot Default output and add following columns with following properties.


7. Name is the output column name you want to have
8. PivotKeyValue is the value existing in the original source for which you want to generate a column.
9. SourceColumn refers to the LineageID property. For Student it should be the same as the LineageID property of the "Student" input column; for the remaining 3 columns, it is the LineageID property of the input column that has PivotUsage value 3, i.e. Mark (highlighted in the 4th figure).

10. Drag destination and connect it with Pivot and do proper mapping.
11. Execute the package and desired result would be there in output table.

PROBLEMS

- If any new subject gets added to the source table, then the SSIS package has to be updated to add the corresponding output column
- If we want to make it dynamic then we need to use a Script Task
- Configuration of the Pivot transformation is a bit tricky and error-prone, so it should generally be avoided.