Data Warehouse and Data
Modelling
Data Modelling
• It is diagrammatic representation showing how the entities are related to each other. It is the initial step
towards database design. We first create the conceptual model, then logical model and finally move to
the physical model.
• What is data Modelling?
• Conceptual representation of structure of tables which express the business requirement, or the blueprint or
map called be termed as data modelling.
• Represents visually the nature of data, business rules governing the data and how it will be
organized in the database.
• Data model –conceptual, Logical and physical
• Top-down approach- form Business requirements.
• Bottom-up approach-from existing tables.
Data Modelling
OLAP (online analytical processing) OLTP (online transactional processing)
• Handles large volume of data and complex queries • Handle large number of small transactions
• Based on insert, update and delete command • Based on select to aggregate data for reporting purpose
• Response time should be milliseconds • Response time can be milliseconds seconds and hours
• Industry specific- retail , banking etc based on amount of data
• Subject specific- sales, marketing , pharma etc.
• Real time
• Fast updated
• Discover hidden secrets
• Data refreshed periodically
• Normalized Databases for efficiency
• Denormalized Databases for analysis
• Customer facing personal –clerks etc.
• Day to Day transaction
• Data Analyst etc.
• Regular backups needed
• Multi view of data
• Can be retrieved from OLTP systems
Data Modelling
There are two types of Data Models:-Star Schema and Snowflake Schema
• Star Schema
• The Centre of it can be one or more “Fact Table” which can be associated with multiple Dimension Table.
• The Dimension tables are Denormalized
Advantages of Star Schema
• Simpler query and fast performance
Challenges
• Decreased Data Integrity
• Less capable of handling diverse and complex queries
• No Many-to-Many Relationships
Data Modelling
There are two types of Data Models:-Star Schema and Snowflake Schema
• Snowflake Schema
• Unlike Star Schema in snowflake the Dimension tables are more normalized.
• It solves the slow writing process of star schema
Advantages of Star Schema
• Since the Dimension tables are normalize it used less space and saves lot of storage cost
Challenges
• Complex data Schema
• Slow at processing cube data
• Lower data integrity levels
Data Modelling
When to use star and Snowflake Schema ?
• Snowflake:
• In data warehouses. As the warehouse is Data Central for the company, we could save lot of space this way. Because in
some cases Dimension table can store lot of redundant information resulting in huge Dimension table
• Star Schema:
• In data marts. Data marts are subsets of data taken out of the central data warehouse. They are usually created for
different departments and don’t even contain all the history data. In this setting, saving storage space is not a priority
Types of Dimensions
• Conformed Dimensions:
• It is a dimension common dim_prescriber (physician universe) and dim_time_period shared across thetable
that is shared by multiple fact tables
Example: - suppose in a pharmaceutical organization we have different data marts based on field
forces. There can be a different data marts then dim_prescriber is called as conformed dimensions.
Prod_typ Cpn_typ Pymt_typ Junk_id
Junk Dimensions:
• Table composed of low cardinality column that do not have Online Yes Cash 1
place in fact table
Offlilne Yes Cash 2
• initial Transaction_tbl Online No Cash 3
(transaction_id,product_id,customer_id,emp_id,order_id,payme Offline No Cash 4
nt_id,coupon_id,amount,qty)
Online Yes Card 5
• Final Transaction_tbl Offlilne Yes Card 6
• (transaction_id,product_id,customer_id,emp_id,order_id,junk_i Online No Card 7
d,amount,qty)
Offline No Card 8
Junk table
Types of Dimensions
• Role Playing Dimensions:
• Dimensions utilized for multiple purpose in the same database.
• Example Dim time period.
• Degenerated Dimensions:-
• A dimensions which is not a fact but present in the fact table as a PK
• Example invoice number or order number.
• Slowly Changing Dimensions:
• These are the dimensions whose attribute changes over a period of time.
• SCD 0- dimensions whose attributes remain steady with time.
• SCD 1- In these type of dimensions the previous value is replaced by current value.
• SDC 2- In these type of dimensions unlimited history is preserved.
• SCD 3- In these type of dimension limited history is preserved.
• SCD 4- History is maintained in separate table.
Types of Dimensions
SCD- 2 implementation in SQL
MERGE INTO customers
USING (
-- These rows will either UPDATE the current addresses of existing customers or INSERT the new addresses of new customers
SELECT updates.customerId as mergeKey, updates.* FROM updates
UNION ALL
-- These rows will INSERT new addresses of existing customers
-- Setting the mergeKey to NULL forces these rows to NOT MATCH and be INSERTed.
SELECT NULL as mergeKey, updates.*
FROM updates JOIN customers
ON updates.customerid = customers.customerid
WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customerId = mergeKey
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN
UPDATE SET current = false, endDate = staged_updates.effectiveDate -- Set current to false and endDate to source's effective date.
WHEN NOT MATCHED THEN
INSERT(customerid, address, current, effectivedate, enddate)
VALUES(staged_updates.customerId, staged_updates.address, true, staged_updates.effectiveDate, null) -- Set current to true along with the
Types of Dimensions
Shrunken Dimensions:
Shrunken dimensions are conformed dimensions that are a subset of rows and /or columns of a base
dimension. Shrunken rollup dimensions are required when constructing aggregate fact tables. They are also
necessary for business processes that naturally capture data at a higher level of granularity, such as a forecast by
month and brand (instead of the more atomic date and product associated with sales data). Another case of
conformed dimension subsetting occurs when two dimensions are at the same level of detail, but one represents
only a subset of rows.
Static Dimensions:
Static dimensions are not extracted from the original data source, but are created within the context of the data
warehouse. A static dimension can be loaded manually — for example with status codes — or it can be generated
by a procedure, such as a date or time dimension.
Types of Dimensions
• Late Arriving Dimensions:
• the natural key in the fact record has not yet been loaded in a related dimension preventing a successful foreign key
lookup to the dimension’s surrogate key
• How to Deal With Late Arriving Dimensions
• Never process – Simply omit the record from the fact load. This is rarely the case.
• Queue & Retry – wait and process later
• Unknown Member – If the fact record does have value without a reference to a related late arriving dimension, and we
either do not expect or do not care if a dimension record shows up at a later time, then we can simply load the fact
record with a foreign key reference to the unknown record in the related dimension; typically -1 surrogate key value.
Note that we must pay particular attention to the fact grain if we are going to take the unknown member approach to
failed foreign key lookups. Most fact table primary keys are set as a composite across foreign keys and degenerate
dimensions. If allowing a foreign key to be set to -1 causes a possible primary key violation, then we need to look for
other options such as adding a degenerate dimension.
• Inferred Member –we can use the natural key in the record bound for the fact table to seed a record (inferred member)
in the related dimensions. Once the dimension record shows up, it will have the same natural key and the dimension
record’s attributes can then be updated. Particular attention should be paid to the inferred member’s record time in
cases where dimension history is tracked such as SCD 2 situations.