Friday, April 1, 2011

SSIS - Overview & Interview Questions

About SSIS


-          Introduced in 2005
-          Advanced version of DTS (Data Transformation Services)
-          An easy-to-use, highly flexible, and highly scalable ETL (Extract, Transform, Load) tool
-          Creates a structure called a package
-          A package mainly contains the following:
o   Control Flow (defines the overall flow)
o   Data Flow (extracts data from multiple sources and loads it into multiple destinations)
o   Event Handlers
o   Package Explorer
o   Connections Tray (holds all kinds of connections)



Connection Manager


-          OLEDB connection
-          Flat file connection
-          Excel connection
-          Cache connection
-          FTP connection
-          Multifile connection
-          SMTP connection


Difference between Control Flow & Data Flow Task
     
      Click Here


Variables


       - Variables store values in an SSIS package at the package, container, or task level
       - Variables can be used in:
             a. Script Task - we can access a variable using Dts.Variables["Blah"]
             b. Expressions in any transformation
             c. Expressions in any Connection Manager
             d. Execute SQL Task, to pass a value to a stored procedure and to store a value 
                 returned by it (see the sketch after this list)
             e. Precedence Constraints, to define the success/failure flow
             f. Error output
       - 2 types of variables:
             a. System variables
             b. User-defined variables

       - Variable names are case-sensitive
       - We can create 2 variables with the same name at different levels, e.g. at container level 
          and then at Data Flow Task level
       - In that case, the variable in the local scope gets priority.
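       A minimal sketch of point (d), assuming an OLE DB connection; the stored procedure
       dbo.usp_LoadCustomers and table dbo.Customers are hypothetical names. Each ? marker is
       bound to a variable on the task's Parameter Mapping page, and the single-row result can
       be pushed back into a variable on the Result Set page:

           -- SQLStatement of an Execute SQL Task (OLE DB connection).
           -- Each ? maps to an SSIS variable on the Parameter Mapping page.
           EXEC dbo.usp_LoadCustomers ?, ?;

           -- To capture a value, set ResultSet = "Single row" and map the
           -- RowsLoaded column to a variable on the Result Set page.
           SELECT COUNT(*) AS RowsLoaded FROM dbo.Customers;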



What are the different control flow tasks you have used?


         -          Execute SQL Task (more details)
         -          Execute Package Task (more details)
         -          Send Mail Task (more details)
         -          Data Profiling Task (more details)
         -          Transfer Logins Task (more details)
         -          Bulk Insert Task (more details)
         -          Data Flow Task
         -          For Loop Container (more details)
         -          For Each Loop Container (more details)
         -          Sequence Container
         -          File System Task (more details)
         -          Script Task
         -          FTP Task
         -          Backup Database Task
         -          Rebuild Index Task
         -          Reorganize Index Task
         -          Shrink Database Task
         -          Update Statistics Task


What are the different data flow transformations you have used?


         -          Simple DFT
         -          Copy Column Transformation (more details)
         -          Derived Column & Data Conversion Transformation (more details)
         -          Conditional Split Transformation (more details)
         -          Character Map Transformation (more details)
         -          Audit Transformation (more details)
         -          Aggregate & Multicast Transformation (more details)
         -          Merge & Sort Transformation (more details)
         -          Merge Join Transformation (more details)
         -          Union All Transformation (more details)
         -          Row Count Transformation (more details)
         -          Row Sampling Transformation
         -          Fuzzy Lookup Transformation
         -          Fuzzy Grouping Transformation
         -          Cache Transformation (more details)
         -          Lookup Transformation (more details)


How would you do logging in SSIS?
     click here

Approach for Error Handling
     click here

Debugging and Using Data Viewer
     click here

How to deploy SSIS package on production server?
     To be added

New features of SSIS 2008?
     click here

How to pass a variable to child package?
     To be added

Performance improvement in SSIS
     click here

How would you restart a package from the point of failure? What is a checkpoint?
     To be added

How can you achieve parallelism in SSIS?
     To be added

Difference between Merge, Merge Join & Union All
     To be added

Difference between Conditional Split & Multicast
     To be added






Tuesday, March 15, 2011

New features of SSIS 2008

1. Script Task
    - In SSIS 2005, the Script Task supported only Microsoft Visual Basic .NET, but in SSIS 2008, the Script Task supports C# too.
    - Before writing the script, we can choose the language. This is possible because of VSTA (Visual Studio Tools for Applications).


2. Lookup Transformation
    - In 2005, there was no cache file, so the lookup reloaded its cache every time; in 2008, we can create a cache file so the cache does not need to be repopulated every time.
    - In 2005, it had only 2 outputs (1 regular and 1 error); in 2008, it has 3 outputs (Match, No Match, and Error).


3. Data Profiling Task
    - New task introduced in 2008


4. Multicast
    - Though it split the flow into multiple outputs, in 2005 all outputs were processed serially; there was no actual parallelism. In 2008, they introduced subpaths, so the outputs are actually processed in parallel.


5. New T-SQL features
    - We can leverage CDC, MERGE, etc. (a CDC example is sketched below)
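    A hedged sketch of enabling Change Data Capture so that a package only has to extract
    changed rows; dbo.Customers is a hypothetical table name:

        -- Enable CDC at the database level, then on the source table.
        EXEC sys.sp_cdc_enable_db;
        EXEC sys.sp_cdc_enable_table
             @source_schema = N'dbo',
             @source_name   = N'Customers',
             @role_name     = NULL;   -- NULL = no gating role for readers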



Debugging SSIS and using Data Viewer

- An SSIS package can be debugged using breakpoints and data viewers

DEMO

- Add a variable called "counter" of integer data type
- Drag a For Loop Container into the control flow
- Set its properties like this:
     InitExpression : @counter = 1
     EvalExpression : @counter <= 10
     AssignExpression : @counter = @counter + 1
- Drag a Data Flow Task inside the For Loop Container
- Open the DFT
- Drag an OLE DB Source that points to some table
- Drag an OLE DB Destination that points to some table
- Connect the source and destination
- Double-click the path between the source and destination
- Go to Data Viewers and click Add
- On the General tab, make sure "Data Grid" is selected
- On the "Grid" tab, select the columns whose data you want to see
- A small data viewer icon now appears on the path in the data flow
- Come back to the control flow
- Right-click the Data Flow Task and choose "Edit Breakpoints"
- The breakpoints dialog offers different events and hit-count conditions; for example, a breakpoint on the pre-execute event with a hit count of 5 stops the debugger when the Data Flow Task is pre-executed for the 5th time or more
- Now execute the package
- The data viewer pops up and displays the rows passing through the path
- Now you can see exactly what data is flowing between the source and destination
- Click the > button 4 times
- Execution now stops at the breakpoint
- Now you can enable the Watch window
- Drag the variable into the Watch window (Name column)
- It will display the value of the variable
- You can hit F5 to proceed further


Control Flow vs Data Flow




Aspect          | Control Flow                                                                         | Data Flow
----------------|--------------------------------------------------------------------------------------|-------------------------------------------------------------------
Step            | 1st step in creating a package                                                       | 2nd step; it is part of the control flow
Made up of      | Containers and tasks connected with precedence constraints                          | Sources, transformations, and destinations
About           | Data preparation and managing the process                                           | ETL activities
Smallest unit   | Task                                                                                 | Component
Data movement   | Does not move data from one task to another                                          | Moves data from source to destination
Processing      | Task 1 has to be completed before Task 2 begins                                      | Multiple components can be processed at the same time
Count           | Only 1 control flow is available                                                     | There can be multiple data flow tasks inside the control flow
Error output    | No explicit error output, but it can be achieved by editing a precedence constraint | Error output is available on sources/transformations/destinations

Performance Improvement in SSIS

(E) EXTRACT IMPROVEMENT


1) If there is a Flat File Source / Data Conversion Transformation, then set "FastParse" to "True"
    - It is available only in the Flat File Source & Data Conversion Transformation
    - It is specified at the column level
    - The default value is False
    - When we set it to True, it skips certain pre-execute validations and locale-aware parsing, and assumes all your data is well-formed


    Steps

  1. Right-click the Flat File source or Data Conversion transformation, and then click Show Advanced Editor.
  2. In the Advanced Editor dialog box, click the Input and Output Properties tab.
  3. In the Inputs and Outputs pane, click the column for which you want to enable fast parse.
  4. In the Properties window, expand the Custom Properties node, and then set the FastParse property to True.
  5. Click OK.

2) Set the packet size to 32767 in the Connection Manager
    - This bumps the packet size up from the default of 4 KB
    - This needs the network admin to enable "Jumbo Frames"




3) In the OLE DB Source, use a T-SQL query instead of a table as the direct input
    - This allows you to choose specific columns instead of pulling all the columns
    - We can specify NOLOCK, which avoids locking the table
    - We can use sorting, grouping, joins, and formulated columns in the query instead of using separate 
       Sort, Merge Join, Derived Column, and Aggregate transformations (see the sketch below)
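    A hedged example of such a source query; dbo.Customers and its columns are hypothetical
    names. Note that NOLOCK permits dirty reads, so it suits sources that are static during
    the load:

        -- Pull only the needed columns instead of the whole table,
        -- and avoid taking shared locks while reading.
        SELECT CustomerId, FirstName, LastName, Email
        FROM   dbo.Customers WITH (NOLOCK)
        WHERE  IsActive = 1;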


4) In the Cache connection manager, create a cache file instead of holding the cache in memory


5) If you are using the same OLE DB source connection at multiple places, then set the "RetainSameConnection" property to "True"


     - This allows the engine to use the same connection every time
     - Default value: False, which means create connection - get data - close connection, every time
     - By setting it to True, the above happens only once




6) Divide the source into chunks instead of having a single master pool.


(T) TRANSFORM IMPROVEMENT


1) Choose transformations based on their usage and buffer behavior


Behind the scenes, the data flow engine uses a buffer-oriented architecture to efficiently load and manipulate datasets in memory.
  • Row Transformations
    - They either manipulate data or create new fields using the data that is available in that row. 
    - They might create new columns but not new rows
    - Each output row has a 1:1 relationship with an input row
    - Also known as synchronous transformations
    - They use the existing buffer rather than a new buffer
    - Examples - Derived Column, Data Conversion, Multicast, and Lookup. 

  • Partially blocking transformations 
    - They are often used to combine datasets using multiple data inputs. 
    - As a result, their output may have the same number of, more, or fewer records than the total number of input records. 
    - Also known as asynchronous transformations
    - Output of the transformation is copied into a new buffer and a new thread may be introduced into the data flow
    - Examples - Merge, Merge Join, and Union All. 

  • Blocking transformations 
    - They must read and process all input records before creating any output records. 
    - They perform the most work and can have a greater impact on available resources than the above 2 categories
    - Also known as asynchronous transformations
    - As with partially blocking transformations, the output is copied into a new buffer and a new thread may be introduced into the data flow
    - Examples - Aggregate and Sort. 


2) Use the MERGE statement when you have to do INSERT-UPDATE-DELETE from source to target, instead of using multiple transformations
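    A hedged sketch, assuming the package has already loaded a staging table
    dbo.Customers_Stage and the hypothetical target is dbo.Customers:

        -- One statement replaces a Lookup / Conditional Split / OLE DB Command combo.
        MERGE dbo.Customers AS tgt
        USING dbo.Customers_Stage AS src
            ON tgt.CustomerId = src.CustomerId
        WHEN MATCHED THEN
            UPDATE SET tgt.Name = src.Name, tgt.Email = src.Email
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (CustomerId, Name, Email)
            VALUES (src.CustomerId, src.Name, src.Email)
        WHEN NOT MATCHED BY SOURCE THEN
            DELETE;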


3) Make the right choice of cache type inside the Lookup Transformation
     - Full Cache : for small datasets
     - No Cache : for volatile datasets
     - Partial Cache : for large datasets


4)  Sort, Merge Join, Union All, Pivot, Aggregate, SCD, and Data Conversion can easily be replaced by normal T-SQL


      - There will be much more control over all the objects
      - The T-SQL operation will be much faster than the SSIS transformations because the 
         buffers won't be used.
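      For instance, one query can stand in for the Sort, Merge Join, and Aggregate
      transformations at once (a sketch with hypothetical tables dbo.Orders and dbo.Customers):

          -- JOIN replaces Merge Join, GROUP BY replaces Aggregate, ORDER BY replaces Sort.
          SELECT   c.Country,
                   COUNT(*)      AS OrderCount,
                   SUM(o.Amount) AS TotalAmount
          FROM     dbo.Orders AS o
          JOIN     dbo.Customers AS c ON c.CustomerId = o.CustomerId
          GROUP BY c.Country
          ORDER BY c.Country;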



5) Make data types as narrow as possible so that they allocate less memory




(L) LOAD IMPROVEMENT


1) Try to execute the package on the destination server rather than the source server. 


      - LOAD is a more expensive operation than EXTRACT
      - So we should execute the package on the same server as the destination


2) Make a smart choice between dropping and keeping indexes


      - It is not necessary to always keep indexes or to always drop them before you load.
      - If there is a clustered index, don't drop it, because the data is sorted using this key, and  
        dropping and rebuilding the clustered index will take even more time.
      - If there is a single non-clustered index and you expect more than 100% new data, then 
        dropping and re-creating the index will help (see the sketch below).
      - If there are multiple non-clustered indexes, probably leave them as they are.


   But these are not thumb rules; trial and error will always give you the best result.
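   A hedged sketch of the drop-and-recreate pattern for that single non-clustered index; the
   table and index names are hypothetical. These statements would typically sit in Execute SQL
   Tasks placed before and after the Data Flow Task:

       -- Before the load: drop the non-clustered index.
       DROP INDEX IX_Customers_Email ON dbo.Customers;

       -- ... the Data Flow Task loads dbo.Customers here ...

       -- After the load: recreate the index in one pass over the data.
       CREATE NONCLUSTERED INDEX IX_Customers_Email
           ON dbo.Customers (Email);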


3) If there is a huge load on the destination, partitioning the table will probably help


4) Setting proper values for "Rows per batch" & "Maximum insert commit size"




Rows per batch - how many rows you want to send per batch while inserting the data
Maximum insert commit size - how many rows you want to commit in one shot
      - If the value is 2147483647 (the default), all incoming rows are committed in one single 
        transaction.
      - If you really have that many rows to load, it is better to define a proper commit 
        size. For example, if you define 100000, then 100,000 rows will be committed in one shot. 
        A huge DML operation in one single transaction would otherwise degrade the performance.
      - If it is 0, the package might stop responding if the same table is being used by 
        some other process.
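For comparison, raw T-SQL's BULK INSERT exposes the same two knobs (a hedged sketch; the
table and file path are hypothetical):

    BULK INSERT dbo.Customers
    FROM 'C:\loads\customers.dat'
    WITH (
        BATCHSIZE      = 100000,    -- like Maximum insert commit size: rows committed per transaction
        ROWS_PER_BATCH = 1000000    -- like Rows per batch: a hint about the total number of rows
    );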