1
DATA WARE HOSE FUNDAMENTALS:
1. What is Data warehouse?
1)It is a database which stores the historical data
2) We use this data for query & analysis purpose based on the analysis results
we will make
the business decisions
3) We call DWH as decision supporting system
4) The data that is available in DWH is read only data
2. Why do we use DWH?
a)A Data Warehouse Delivers Enhanced Business Intelligence
By providing data from various sources, managers and executives will no longer
need to make business decisions based on limited data or their gut. In
addition, data warehouses and related BI can be applied directly to business
processes
including
marketing
segmentation,
inventory
management,
financial management, and sales.
b) A Data Warehouse Saves Time
Since business users can quickly access critical data from a number of sources
all in one placethey can rapidly make informed decisions on key initiatives.
They wont waste precious time retrieving data from multiple sources.
Not only that but the business execs can query the data themselves with little
or no support from ITsaving more time and more money. That means the
business users wont have to wait until IT gets around to generating the
reports, and those hardworking folks in IT can do what they do bestkeep the
business running.
c) A Data Warehouse Enhances Data Quality and Consistency
A
data
warehouse implementation
includes
the
conversion
of
data
from numerous source systems into a common format. Since each data from
the various departments is standardized, each department will produce results
that are in line with all the other departments. So you can have more
confidence in the accuracy of your data. And accurate data is the basis for
strong business decisions.
d) A Data Warehouse Provides Historical Intelligence
A data warehouse stores large amounts of historical data so you can analyze
different time periods and trends in order to make future predictions. Such
data typically cannot be stored in a transactional database or used to
generate reports from a transactional system.
e) A Data Warehouse Generates a High ROI
ROI retun oninvestment
Finally, the piece de resistancereturn on investment. Companies that have
implemented
data
warehouses
and
complementary
BI
systems
have generated more revenue and saved more money than companies that
havent invested in BI systems and data warehouses.
3.What is the Difference between Ralph Kimball and WH Inmon Approach?
Ralph Kimball
1)As per ralph Kimball approach
first we define the data marts
and we should define data
warehouse.
2) In this approach data marts
are called independent data
marts
3) This approach we called
bottom up approach
Characteristics of DWH?
WH Inmon
1)As per WH Inmon approach
first we define data warehouse
and we should define data marts
4. What
are
2) In this approach data marts
are called dependent data marts
3) this approach we called as top
down approach
1)Time variant: All the data in data warehouse identified with particular time
period. The data of DWH must be historical
Ex: Day, Week, Month etc.
2) Non-volatile: Data in DWH are never over written or deleted once committed.
The data is static. It is used for read only data we cant change. This data used
for future reporting.
3) Subject oriented: In this we collect the data from different subject area. The
collected data must be business oriented.
Ex: Sales, accounts, HR etc.
4) Integrated: The DWH contains data gathered from all organizations and
merged into a coherent.
5. What is the difference between ODS and DWH
ODS(Operational Data Store)
DWH
3
1) It is design to support
operational data monitoring.
the
1) It is design to support the
decision making process.
2) Data is volatile
2) Data is non-volatile
3) It contains current data
3) It contains historical data
4) It contains detailed data
4) It contains summary data
5) It contains less volume of data
5) It contains high volume of data
6. What is the difference between Data Mart and DWH?
Data Mart
Data Warehouse
1) Data mart is usually sponsored at the
department level and developed with a
specific issue or subject in mind, a data
mart is a data warehouse with a focused
objective.
1) Data warehouse is a SubjectOriented, Integrated, Time-Variant, Nonvolatile collection of data in support of
decision making.
2) A data mart is used on a business
division/ department level.
2) A data warehouse is used on an
enterprise level
3) A Data Mart is a subset of data from a
Data Warehouse. Data Marts are built for
specific user groups.
3) A Data Warehouse is simply an
integrated consolidation of data from a
variety of sources that is specially
designed to support strategic and
tactical decision making.
4) By providing decision makers with
only a subset of data from the Data
Warehouse, Privacy, Performance and
Clarity Objectives can be attained.
4) The main objective of Data Warehouse
is to provide an integrated environment
and coherent picture of the business at a
point in time.
7. What are dimensions and Facts? Give the example?
Dimensions:It is structural or characteristic attribute. A dimension is essentially
an entry point for getting at the facts. Dimensions are things of interests to the
business
Ex: Customer, Product, Store, Country, State etc.
Facts: A business performance measurement tropically numeric and additive. It
is stored in the fact table. Facts are also called KPI (key performance indicator)
Or
A fact is a numeric value that a business wishes to count or sum
Ex: Store key, Customer Key, Orders, The number of products sold, the value of
products sold, the products produced
8. What are different types of dimensions?
4
1) Slowly changed dimension: It will change with respect to time.
Ex: Customer address etc.
It is three types
1) Type 1: In case of type 1 we will not maintain any history. We will
maintain only current record by updating existing record. Technically
type1 is insert else update.
In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Christina
Illinois
After Christina moved from Illinois to California, the new information
replaces the new record, and we have the following table:
Customer Key
Name
State
1001
Christina
California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension
problem, since there is no need to keep track of the old information.
Disadvantages:
All history is lost. By applying this methodology, it is not possible to trace
back in history. For example, in this case, the company would not be able
to know that Christina lived in Illinois before. Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not
necessary for the data warehouse to keep track of historical changes.
2) Type2:
In case of type2 we will maintain historical data.
Whenever there is a change in address will insert new address in address
table and end dated the existing record.
To maintain the versioning in SCD type2 we use a) start date, end date b)
flag c) Address key or version number
In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Christina
Illinois
After Christina moved from Illinois to California, we add the new
information as a new row into the table:
Customer Key
Name
State
1001
Christina
Illinois
1005
Christina
California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the
number of rows for the table is very high to start with, storage and
performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary
for the data warehouse to track historical changes.
3) Type3: In case of scd type3 we will maintain the current record and
previous record. It means one customer can have maximum two records.
In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Christina
Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have
the following columns:
Customer Key
Name
Original State
Current State
Effective Date
After Christina moved from Illinois to California, the original information
gets updated, and we have the following table (assuming the effective
date of change is January 15, 2003):
Custom
er Key
Name
Origin
al
State
Current
State
Effective
Date
1001
Christi
Illinois
Californ
15-JAN-
na
ia
2003
Advantages:
- This does not increase the size of the table, since new information is
updated.
- This allows us to keep some part of history.
Disadvantages:
Type 3 will not be able to keep all history where an attribute is changed
more than once. For example, if Christina later moves to Texas on
December 15, 2003, the California information will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type III slowly changing dimension should only be used when it is
necessary for the data warehouse to track historical changes, and when
such changes will only occur for a finite number of time.
9. What is Confirmed dimension? Give example?
A Dimension which is used two or more fact tables is called conformed
dimension.
Ex: Time, Product, Country etc.
10.What is Degenerated Dimension? Give example?
By nature it is a dimension but it will be located in fact table. Degenerated
dimension contains only a key and no attributes.
Ex: Bill number, Receipt number etc.
11.What is Junk Dimension? Give example?
A junk dimension is collection of random transactional codes, flags and/or text
attributes that are unrelated to any particular dimension.
Ex: Yes or No, Male or Female etc.
12.What is Dirty Dimension? Give example?
If a record occurs more than one time in a table by the difference of a non-key
attribute then such dimension is called dirty dimension.
For example A "dirty dimension" is one in which data quality cannot be
guaranteed. For example, in most banks, account-oriented source applications
contain data about the same customer multiple times. Many banks attempt to
derive a "customer" by matching names and addresses across account
applications, but this process results in more than one entry for each bank
customer. Similarly, different attributes must be held for each of a bank's
heterogeneous products. Attributes that are meaningful for a loan, such as term,
7
credit risk assessment, and collateral, have no meaning for savings, checking, or
investment products.
13.What is Role play dimension?
Dimension will pay multiple roles in the fact table like orders, sales, shipment
date, Invoice date etc.
Ex: Time dimension etc.
14.What is lately arrived Dimension?
In data warehousing some-times we receive fact data first and later we receive
dimension data this kind of dimension is called lately arrived dimension.
15.What is rapidly changing dimension?
It is dimension that changes rapidly with respect to time period.
Ex: share price etc.
16.What are different types of slowly changing dimensions?
Refer Q8.
17.What is the difference between type1, type2 and type3?
Refer Q8
18.How do we maintain the versioning in type2 dimension?
Refer Q8.
19.What are different types of Facts? Give Examples?
These are three types
1) Additive : These are facts on which we can perform arthematic operation like
addition on the fact table
Ex: If we want find yearly sales we can sum quarter sales
2) Semi additive : These are fact we can perform sum operation for some of the
dimension not to others
Ex: In bank customer account we can find out monthly balance using credits
and debits happen in the particular month. We cant find monthly balances
using daily transactions.
3) Non-additive: These are facts we cant be summed up for any dimension
present in the fact table
Ex: Temperature, Ratios etc.
20.What is fact less fact table? Give the example?
Fact less fact table captures many to many relation between dimensions but
contains with-out any measures
Common examples of fact less fact tables include:
Identifying product promotion events (to determine promoted products
that didnt sell)
Tracking student attendance or registration events
Tracking insurance-related accident events
21.What are steps involved in dimensional modelling?
Dimensional modelling involved in three steps:
Conceptual Data Model
Features of conceptual data model include:
Includes the important entities and the relationships among them.
No attribute is specified.
No primary key is specified.
At this level, the data modeller attempts to identify the highest-level
relationships among the different entities.
Logical Data Model
Features of logical data model include:
Includes all entities and relationships among them.
All attributes for each entity are specified.
The primary key for each entity specified.
Foreign keys (keys identifying the relationship between different entities)
are specified.
Normalization occurs at this level.
At this level, the data modeller attempts to describe the data in as much
detail as possible, without regard to how they will be physically implemented
in the database.
Physical Data Model
Features of physical data model include:
Specification all tables and columns.
Foreign keys are used to identify relationships between tables.
Demoralization may occur based on user requirements.
komPhysical considerations may cause the physical data model to be
quite different from the logical data model.
this level, the data modeller will specify how the logical data model will be
realized in the database schema.
22.What is the difference between logical and physical modelling?
Refer Q21.
23.What is difference between start schema and snowflake?
Star Schema
Snow Flake Schema
The star schema is the simplest data
warehouse scheme.
Snowflake schema is a more complex
data warehouse model than a star
schema.
In star schema each of the dimensions
is represented in a single table .It
should not have any hierarc.lhies
between dims.
In snow flake schema at least one
hierarchy
should
exists
between
dimension tables.
It contains a fact table surrounded by
dimension tables. If the dimensions are
de-normalized, we say it is a star
schema design.
It contains a fact table surrounded by
dimension tables. If a dimension is
normalized, we say it is a snow flaked
design.
In star schema only one join
establishes the relationship between
the fact table and any one of the
dimension
tables.
So
query
performance is good
In snow flake schema since there is
relationship between the dimensions
tables it has to do many joins to fetch
the data. So query performance is poor
A
star
schema
optimizes
the
performance by keeping queries simple
and providing fast response time. All
the information about the each level is
stored in one row.
Snowflake
schemas
normalize
dimensions to eliminated redundancy.
The result is more complex queries and
reduced query performance.
It is called a star schema because the
diagram resembles a star.
It is called
because the
snowflake.
It occupies more space compare
snowflake schema because it is DE
normalized
Snowflake schema occupies less space
when compare to star schema because
it is normalized
a snowflake schema
diagram resembles a
24.Why query performance is better in Start Schema?
In star schema there will be less number of dimension tables so less number of
joins so query performance is good.
25.Which design Start Schema or snow flake is normalized one?
Snow flake design is normalized because snow flake design split one dimension
table into multiple dimensions to achieve the dimension.
26.What is the difference between the full load and incremental load?
10
Full load
First time in the data warehouse
we load all the data from source
into DWH is called full load
Full load we prefer load data first
time
Incremental load
Capture the changed data to load
DWH we use incremental load
Second time onwards we perform
incremental load in DWH
By performing the incremental load
we are trying to load less number
of data to improve performance
27.How can we do the incremental load from source?
Capture the changed data in source then we load changed data to DWH.
28.What is surrogate key? Give an example where did you use it in your project?
It is the system generated sequence number used for maintaining the unique
ness or primary key
In case of SCD Type2 to maintain the versioning number we use surrogate key .
29.What is Staging area? Why do we use staging area?
If target and0 source databases are different and target table volume is high it
contains some millions of records in this scenario without staging table we need
to design your Informatica using look up to find out whether the record exists or
not in the target table since target has huge volumes so its costly to create
cache it will hit the performance.
If we create staging tables in the target database we can simply do outer join in
the source qualifier to determine insert/update this approach will give you good
performance.
It will avoid full table scan to determine insert/updates on target.
And also we can create index on staging tables since these tables were designed
for specific application it will not impact to any other schemas/users.
While processing flat files to data warehousing we can perform cleansing.
Data cleansing, also known as data scrubbing, is the process of ensuring that a
set of data is correct and accurate. During data cleansing, records are checked
for accuracy and consistency.
Since it is one-to-one mapping from ODS to staging we do truncate and
reload.
We can create indexes in the staging state, to perform our source qualifier
best.
If we have the staging area no need to relay on the informatics
transformation to known whether the record exists or not.
30.What is data cleaning?
11
Weeding out unnecessary or unwanted things (characters and spaces etc.) from
incoming data to make it more meaningful and informative
31.What is the difference between business key and Surrogate key?
Business Key
A business key or natural key is an
index which identifies unique ness
of a row based on column that
exists
naturally
in
a
table
according to business rules
Business are a good way of
avoiding duplicate records so in
practice I typically add business
key as an alternative key in a
dimension
I want to manage the patient
information using business key will
add
the
patient_id
and
patient_code this is different for
each hospital
Surrogate key
It is the system generated
sequence
number
used
for
maintaining the unique ness or
primary key
It is alternative of business key.
Surrogate key are most efficient
then business key so I would
always recommended joining a fact
table to a dimension table using
surrogate key
In case of SCD Type2 to maintain
the versioning number we use
surrogate key.
32.What is normalization? Why do we normalize the data?
1) Spiting of one table into multiple table is called normalization.
2) It is used to eliminate the data redundancy.
3) It also used to eliminate the duplicates
33.What are aggregate tables?
Aggregation is a summarization of base-level fact table records
Aggregations need to account for the additive nature of the measures, created on-the-fly or by
pre-aggregation
Common aggregation scenarios
a. Category product aggs by store by day
b. District store aggs by product by day
c. Monthly sales aggs by product by store
d. Category product aggs by store district by day
Common aggregations = Sum, Count, Distinct Count, Max, Min, Average, etc.
34.What is granularity?
12
The granularity is the lowest level of information stored in the fact table. The depth of data
level is known as granularity. In date dimension the level could be year, month, quarter,
period, week, day of granularity.
The process consists of the following two steps:
- Determining the dimensions that are to be included
- Determining the location to place the hierarchy of each dimension of information
35.What is the difference between view and materialized view?
View
View is a logical object
It will not store any data
It is used for security purpose and
projection
We can perform DML operation on
simple views
View will fetch the data from base
table it will run the base query
There is no refresh clause in the views
Materialized View
It is physical object
It will store summary or aggregated
data
It is used for reporting purpose
In this we cant perform DML
operations
It will have less number of records
compare to base tables performance
will be improved in the reports
We can refresh the materialized views
with refresh clause
If we create view on single table we
can call as simple view and we create
view on more tables it is called as
complex view
36.Explain a situation where snowflake is better than start schema?
Snowflake is occupies less amount of space when compare to star schema.
It is Normalized schema
37.What are different architectures of data warehouse?
38.What is the size of your DWH?
3 or 4 or 5 TBs
39.What type of indexes normally used in DWH?
In data warehouse normally we use B-tree and Bit Map indexes
B-Tree:
1) It will create on the columns with highest cardinality columns
2) It is default index created by oracle database.
Bit Map:
1) It will create on lowest cardinality columns.
40.What is loading order of dimension and facts?
13
My understanding says that the dimensions should be extract first and fact
should be extracted. That way the foreign keys still be honoured in the staging
area.
41.What happens if we try to load the fact tables before loading the dimension
tables?
If we try to load the fact tables before loading the dimension tables data will be
rejected.
42.What is a Schema?
Graphical Representation of the data structure is called as schema.