Friday, August 30, 2013

Table of Contents


[BEST VIEWABLE IN GOOGLE CHROME]

Basic SQL Server Fundamentals
Advanced SQL Server Fundamentals
Performance, Maintenance & Administration
New Features of SQL Server 2005
TRY-CATCH
New functions (ROW_NUMBER, RANK, DENES_RANK)
New operands (INTERSECT, EXCEPT)
New Clause (OUTPUT, TABLESTAMP)
New DataType (XML)
New Tool (Data Tuning Advisor)
New Features of SQL Server 2008
Intellisense
Policy Based Management
Filtered Index
New DataType (HIERARCHYID, DATE)
Query Editor Regions
Table Valued Parameters
Filestream
New Features of SQL Server 2012
Miscellaneous
Business Intelligence

Thursday, August 8, 2013

FACT overview

What is fact?

  • Stores a value which is measurable, indicates some numerical figure
  • e.g. SalesFact, InvoiceFact, etc…

Type of Columns

  • Foreign Key
    • Foreign keys to Dimension tables which make a sense to numerical figure
    • e.g. CustomerKey, PurchaseDateKey, GeographyKey, etc…
  • Measure
    • Basic essence of a table contains a numerical value
    • e.g. SalesAmount, InvoiceAmount, Profit, etc…
  • Lineage Columns
    • Used for Auditing and never exposed to end users
    • CreatedOn, LastModifedOn, etc…
  • Business key columns from primary source
    • Source key columns used for reference
    • e.g. TransactionID, CustomerID. CustomerID is the unique key coming from source where CustomerKey is the FK to CustomerDimension what we have created in our data warehouse.

Measure Types

  • Non Additive
    • Which cannot be used inside any aggregate function
    • e.g. Ratio, ProfitMarginPercentage… SUM(Ratio), MAX(Ratio) does not make sense
  • Semi Additive
    • Which can be used inside few aggregate functions and not all
    • e.g. CurrencyRate …. SUM(CurrencyRate) does not make sense… but AVG(CurrencyRate) / MAX(CurrentRate) makes
    • AccountBalance - SUM(AccountBalance) does not make sense… but MAX(AccountBalance) does
  • Additive
    • Which can be used inside all kind of aggregate function
    • e.g. Quantity

Type of FACTs



Types of FACT table

    • Transactional FACT

      • used to records one row per transaction with measures of particular business event.
      • must be design with Additive facts
      • takes more memory space in the database
      • Contains additive Measures
      • e.g

    • Periodic FACT

      • used to records one row for a group of transactions with aggregated measures of business events that happen over a period of time.
      • takes minimum memory space in the database when compare with Transaction fact table
      • e.g.

    • Accumulating FACT

      • used to records one row for the entire lifetime of transaction with measures of business events that happen over a period of time.
      • must be design with combination of Additive facts with multiple date keys that helps to track the lifetime of business event
      • takes minimum memory space in the database
      • e.g.

    • Factless FACT

      • A fact which does not contain any measure, called factless fact
      • Used to maintain the mapping of different dimension keys
      • e.g. in School database, we need to have a mapping which teacher is teaching which student, so when we create a table StudentTeacherMap, it will have just StudentKey and TeacherKey, so this table becomes factless fact

    • Coverage FACT

      • Factless fact can answer only optimistic queries, but cannot answer negative one
      • If we talk about above example, then Factless Fact does NOT tell which teacher is NOT teaching which student
      • So to cover all these scenarios, we need to introduce a flag in that table, which will have Y/N value based on their mapping and this becomes Coverage Fact

Dimension

What is Dimension?

  • Something that quantifies a measure
  • For an example, consider this: If I just say… “20kg”, it does not mean anything. But if I say, "20kg of Rice (Product) is sold to Ramesh (customer) on 5th April (date)", then that gives a meaningful sense. These product, customer and dates are some dimension that qualified the measure - 20kg
  • Dimensions create a sense to identify a number

Dimension Column Types

  • Dimension table contains columns of different types
  • Keys
    • These are surrogate keys which are used to identify the rows uniquely
    • e.g. CustomerKey
  • Name Columns
    • Used for human names for easy identification of the entity
    • e.g CustomerFullName, CustomerNickName
  • Attributes
    • Used for pivoting in analysis
    • Will be used to analyse the business based on different perspectives
    • e.g. Gender, Marital Status, Age, etc..
  • Member Properties
    • Used for labels in the report
    • Generally we don't do pivoting or analysis on these columns but they are used to provide extra info on a report
    • e.g. Address, PhoneNumber, Email, etc..
  • Lineage Columns
    • Used for auditing, never exposed to end users
    • LastCreatedDate, LastModifiedBy, etc..

Types of Dimension

  • Conformed Dimension
  • Junk Dimension
  • Degenerated Dimension
  • Role Playing Dimension

Based on how frequently the data inside a dimension changes, we can further classify dimension as
  • Unchanging or static dimension (UCD)
  • Slowly changing dimension (SCD)
  • Rapidly changing Dimension (RCD)

Conformed Dimension

  • A dimension which is shared across multiple subject areas
  • e.g. "Customer" dimension used in both Marketting & Sales departments
  • Similarly Time & Date dimension

Degenerated Dimension

  • A Key in the Fact table, which does not have its own dimension
  • e.g. In InvoiceFact, we can have columns like TransactionAmount, InvoiceAmount, CustomerKey, CreatedDateKey, GeographyKey, InvoiceNumber, TransactionNumber
  • So we can see here InvoiceNumber & TransactionNumber don't have their own Dimensions like Customer, Date & Geography, but they are required to do other business activities

Junk Dimension

  • grouping of typically low-cardinality attributes (flags, indicators etc.) so that those can be removed from other tables and can be junked into an abstract dimension table
  • These junk dimension attributes might not be related. The only purpose of this table is to store all the combinations of the dimensional attributes which you could not fit into the different dimension tables otherwise
  • e.g. Gender & Marital Status, data looks like this
  • Gender
    Male
    Female
    Marital Status
    Single
    Married
    Divorced
  • So both these dimension have very low cardinality means they have very less number of distinct values
  • So rather than having Gender & MaritalStatus separately, we can apply cross join and generate one single structure like this
GM_id
Gender
MaritalStatus
1
Male
Single
2
Male
Married
3
Male
Divorced
4
Female
Single
5
Female
Married
6
Female
Divorced
  • Now we can refer GM_id in our Facts/Dimensions, this new dimension is called Junk Dimension
  • Improves manageability and improves SQL query performance

Role playing Dimension

  • A dimension which is being used at multiple places with different contextual meaning but having the same data
  • e.g. "Date"… for a particular business, we can have "Date of Sale", "Date of Delivery", "Date of Shipping", "Date of Hire", etc… basically all of them have the basic entity i.e. Date, but then context is different
  • So we create only 1 basic table (i.e. DateDimension) and then we refer it with different names, which are called Role playing Dimensions
  • They can be created separately in cube and in database, they can be created in terms of Views. e.g. On top of DateDimension, we create vwHireDateDimension, vwSalesDateDimension, etc…

Slowly Changing Dimension

Rapidly Changing Dimension

Rapidly Changing Dimension

  • Dimension where data changes are happening very rapidly
  • "Slowly Changing Dimension" blog tells how we can maintain the history of changes
  • But if we really implement the same technic for RCD, it will end up creating N number of Type-2 records for same entity and hence we need to come up with a smart design
  • We can handle this by creating Junk Dimension
  • Let's take an example of Customer Dimension, assume we have following columns
    • CUSTOMER_KEY
    • CUSTOMER_NAME
    • CUSTOMER_GENDER
    • CUSTOMER_MARITAL_STATUS
    • CUSTOMER_TIER
    • CUSTOMER_STATUS
  • Now, Name, Gender, MaritalStatus gets a change very rare, once in a while but Tier & Status get the change very frequently, assume that we are not interested in historical changes of Name, Gender & Marital Status, holding only latest value should be fine, but we need to track the history changes of Tier & Status
  • Now create a junk dimension by removing Tier & Status from Original Dimension
  • So our Original Customer Dimension will be like this
    • CUSTOMER_KEY
    • CUSTOMER_NAME
    • CUSTOMER_GENDER
    • CUSTOMER_MARITAL_STATUS
  • Create a new junk dimension for Tier, Status and create another mapping dimension, so overall structure will look like this



  • By doing so, new type-2 record generation will happen only in a mapping table and not in the main table, so that way we can have easy mainteanance.
  • Only disadvantage is - it will introduce extra joins to get the full picture of history changes of a particular customer
  • But maintenance and storage vise, we will get benefits

Slowly Changing Dimension

Slowly Changing Dimension

  • Often called as SCD, where the data of a dimension is changing slowly
  • Available in different types (SCD Type 0, Type 1, Type 2, Type 3, Type 6)
  • 1, 2 & 6 are most common

Type - 0

  • Where dimension changes are not considered
  • Even if the value gets changed in real scenario. 
  • It does not carry to the dimension and dimension still holds the old data

Type - 1

  • Where history is not maintained and table always show the latest data
  • Always updated with recent values
  • e.g. Before Change

Supplier_Key
Supplier_Code
Supplier_Name
Supplier_State
123
ABC
Acme Supply Co
CA

After Change

Supplier_Key
Supplier_Code
Supplier_Name
Supplier_State
123
ABC
Acme Supply Co
IL


Type - 2

  • Tracks historical updates by creating separate rows
  • Tracking can be done in 2 ways
  • 1st method is via Version

Supplier_Key
Supplier_Code
Supplier_Name
Supplier_State
Version.
123
ABC
Acme Supply Co
CA
0
124
ABC
Acme Supply Co
IL
1

So latest version # is the latest row, and that's how we preserve the history

  • 2nd method is via StartDate & EndDate

Supplier_Key
Supplier_Code
Supplier_Name
Supplier_State
Start_Date
End_Date
123
ABC
Acme Supply Co
CA
01-Jan-2000
21-Dec-2004
124
ABC
Acme Supply Co
IL
22-Dec-2004
NULL

Row with NULL EndDate is the latest row.

Type - 3

  • Tracks historical updates by creating separate columns
  • Type-2 is horizontal growth and Type-3 is vertical growth

Supplier_Key
Supplier_Code
Supplier_Name
Original_Supplier_State
Current_Supplier_State
123
ABC
Acme Supply Co
CA
IL

  • Due to difficult maintenance of vertical growth, only 1 level history is being tracked generally.
  • Let's say now state gets changed from IL to WA, then this is how the data will look

Supplier_Key
Supplier_Code
Supplier_Name
Original_Supplier_State
Current_Supplier_State
123
ABC
Acme Supply Co
IL
WA

So we lose the prior history.

Type - 4

  • Tracks history in a separate table
  • "Supplier" table
Supplier_key
Supplier_Code
Supplier_Name
Supplier_State
123
ABC
Acme Supply Co
IL
  • "SupplierHistory" table
Supplier_key
Supplier_Code
Supplier_Name
Supplier_State
Create_Date
123
ABC
Acme Supply Co
CA
22-Dec-2004

  • One table will maintain the latest data 
  • Other will keep on maintaining historical updates with CreationDate

Type -6

  • It is hybrid of Type-1,2 & 3
Supplier_Key
Supplier_Code
Supplier_Name
Current_State
Historical_State
Start_Date
End_Date
Current_Flag
123
ABC
Acme Supply Co
IL
CA
01-Jan-2000
21-Dec-2004
N
124
ABC
Acme Supply Co
IL
IL
22-Dec-2004
31-Dec-9999
Y

  • StartDate, EndDate will identify during which period that record was active
  • CurrentFlag will identify which is the latest record and which is the historical record
  • CurrentXYZ column will always hold the latest value for all the records