Q.
What is datawarehouse
-> It is specifically designed for reporting, data analysis, and business intelligence
(BI) purposes. Data warehouses are used to collect, transform, integrate, and store
data from different operational systems, making it accessible for analysis and
decision-making.
ETL (Extract, Transform, Load): ETL processes are used to extract data from
source systems, transform it into a consistent format, and load it into the data
warehouse. ETL tools and workflows are essential for data integration and
transformation.
Data Modeling: Data modeling involves designing the structure of the data
warehouse, including the definition of tables, relationships, and attributes. Common
data modeling techniques include star schemas and snowflake schemas.
Business Intelligence (BI): BI refers to the tools, technologies, and processes used
to turn raw data into meaningful insights and reports. BI tools allow users to create
dashboards, generate ad-hoc reports, and perform data analysis.
OLAP, or Online Analytical Processing, is a category of computer processing that
enables users to interactively analyze and explore multidimensional data stored in a
data warehouse or similar repositories. OLAP systems are designed for complex
queries and reporting, allowing users to gain insights from data by viewing it from
different angles or dimensions. Here's how OLAP works and how it helps in Data
Warehousing and Business Intelligence (DWBI):Data Cubes: OLAP data is often
represented as data cubes.
A fact table holds the data to be analyzed, and a dimension table stores data about
the ways in which the data in the fact table can be analyzed.
1.Fact Table
Definition: A fact table is a central table in a data warehouse that contains
quantitative data, typically numerical values, and is used to represent business facts
or events. Fact tables store the metrics or measures that analysts want to analyze
and report on.Examples: In a retail DWBI system, a fact table might store sales
revenue, quantity sold, and discounts for each transaction. In a web analytics system,
it could store data on page views, clicks, and user interactions.
2.Dimension Table:
Definition: A dimension table is a table in a data warehouse that contains
descriptive or textual information about the business entities or attributes related to
the facts stored in the fact table. Dimension tables provide context and allow
analysts to filter, group, or slice data.Examples: In a retail DWBI system,
dimension tables might include a time dimension with attributes like date, month,
and year, a product dimension with attributes like product name and category, and a
customer dimension with attributes like customer name and address.
Unit 1
Q. What is data warehouse ?
Data warehouse is specially design for reporting, data analysing,
and business Intelligence (BI) purpose.data warehouse is used to
collect , transfer, and store data from different operational systems
and making it accessible to analyse and making decision.
It is a concept of subject-oriented, time-variant, volatile,Integrated
of collection of data that support to manage decision making
process. Data warehouse is organized around major subjects of the
enterprise such as customer, product and sales and rather then the
major areas such as product sales and stock control.
Benefits
1.potential high returns on investments
2.Competitive advantage
3.increased productivity of corporate decision makers
Problems with data warehouse
1.hidden problems with source systems.
2.Required data is not captured.
3.Increased end-users demands
4.High demand resources
5.High maintenance
6.Long duration projects.
Q.Technical architecture of BI (Explain in your words.)
Q.explain OLAP ,R-OLAP , M-OLAP , H-OLAP
OLAP - it stands for online analytical processing that enable users
to interactively analyse and explore multidimensional data from
data warehouse. This system is designed for complex queries ,
reporting and allows user to gain of insights of data by viewing it
from different angles and dimensions. The data is represented in
OLAP Is in cube format. It have three types
1.R-OLAP - it stands for relational -OLAP
This system store data in relational database and use SQL to
manage and query the data. They can be more flexible for
handling complex query , ad-hoc queries but may not good
performance as M-OLAP.
2.M-OLAP- it stands for Multidimensional - OLAP
This system store data in multidimensional format and providing
query performance is fast. Eg. Microsoft analytical service and
IBM congos TM1.
3.H-OLAP- it stands hybrid-OLAP
This system is the combination of ROLAP and MOLAP approaches.
It stores summary of data in multidimensional cube format like
MOLAP and stores detailed data in relational database like ROLAP
and it allow to fast query like ROLAP and performance like MOLAP
for certain type of data.
OLTP - it stands for online transaction processing
It is one of the computer category which is focus on recording and
managing day-to-day transactions and operational in real-time.it
is designed for operational efficiency and data consistency.
Characteristics are
1.data consistency
2.Structured data
3.High concurrency
4.Real time processing
5.Normalize data
Difference between Olap and
Q.Frame work of BI
1.project management
2.business requirement
3.technical architecture design
4.project installation
5.dimensional modeling
6.physical design
7.ETL design
8.BI application and design
9.deployment
10.maintenance
11.growth
Business Intelligence (BI) is a comprehensive framework and set
of technologies designed to acquire, analyze, and present
business information. Here is a simplified framework for Business
Intelligence:
1.Data Sources: Data is the foundation of BI. It can come from
various sources, including databases, spreadsheets, external data
feeds, and more.
2.Data Integration: Data from diverse sources is collected, cleaned,
transformed, and integrated into a central repository for analysis.
ETL (Extract, Transform, Load) processes are often used for this
purpose.
3.Data Warehousing: A data warehouse is a structured storage
system that houses integrated data for historical analysis and
reporting. It's optimized for querying and reporting.
4.Data Modeling: Data models are used to represent data
relationships and structure. They help organize and interpret data
effectively.
5.Business Intelligence Tools: These tools provide the means to
access, analyze, and visualize data. They include dashboards,
reporting tools, data visualization software, and ad-hoc query
tools.
6.Analysis and Reporting: Users, such as analysts and decision-
makers, access the BI tools to create reports, perform ad-hoc
analysis, and generate insights.
7.Data Visualization: Data is presented in visually engaging
formats, such as charts, graphs, and interactive dashboards, to aid
in decision-making.
8.Performance Monitoring: BI solutions often include
performance monitoring and key performance indicator (KPI)
tracking to measure business performance.
9.Decision-Making: Armed with insights, organizations can make
data-driven decisions that can lead to improved strategies,
operations, and profitability.
10.Continuous Improvement: BI is an ongoing process, and
organizations use feedback and results to make continuous
improvements to their strategies and operations.
Q.Explain metadata with types
In BI the metadata is refers to data that provides information of
about other data. It describes characteristics,properties and
attributes of data. It helps to users and system to understand and
manage the data.
There are three types of metadata
1.presentation server metadata
The presentation server metadata refers to metadata related to
presentation layer of BI. And provides the information of how
data is presented , displayed and accessed by the end-users. It
includes information like reports,dashboard, visualization and
other elements used to present data to user.
2.Business metadata
This type of metadata focused on providing context and
understanding data about business operations, process and
decision-making. It describe the data in terms that are meaningful
to business users. It has data ownership , data access and
security.
3.Technical metadata
This type of metadata focused on providing technical aspect of
data and it’s management. It provides the information about
database , rows, columns , source of data, ETL process, structure,
format , location , storage
Unit 2
Q.What is data modeling and types
Data modeling is a technique which is used in data warehouse and
business intelligence to organize and structure data for efficient
query and reporting. It is designed to optimize data retrieval and
analysis for decision support and reporting purpose.data
modeling typically involves two types of tables. 1.fact and
2.dimension table.
Fact table:- fact table contains numerical data (measure) that
represent facts such as sales ,revenue, profit , loss, quantity, sold ,
etc. It also include foreign key to link with dimensional table to
make relationship between measure and attribute.
Dimension table table:- this table contains descriptive attributes
or hierarchies.and they contains various level of attribute.
Advantages and disadvantages.
Types of dimension model
1.star schema
In a star schema, a central fact table is connected to dimension
tables through foreign keys, forming a star-like structure. Star
schemas are simple, intuitive, and easy to understand and query.
They are suitable for scenarios where there are one or more fact
tables sharing common dimensions.
2.snowflake schema
A snowflake schema is an extension of the star schema, where
dimension tables are further normalized into sub-dimensions,
resulting in a more complex, tree-like structure.
Both dig. Draw in yourself
Q.what are the rules for dimensional modeling
1.load atomic data, into dimensional structure.
2.build dimensional model around business process.
3.need to ensure that each fact table has associative dimensional
table.
4.ensure that all fact in single fact table are in same gain or level
of details.
5.it is essential to store report label and filter domain values in
dimensional table.
6.need to ensure that dimensional table uses surrogate keys.
7.continously balance requirements and realtime delivery
business solution to support their decision making.
Unit 4:Introduction to ETL
Q.Explain ETL system requirements
1.Business need
Need information which are required by business user to make
decision. Business needs drive choice of data sources and their
transformation. Maintain dialog amoung ETL time, Moduler, and
business users.
2.Compliance
Ensure the reporting number is accurate and complete. Saving
archive copies of data source and subsequent staging. Provide
proof of complete transaction flow when the data is changed.
3.Data quality
Three important demand
1.First Data is important for user to function
2.Second Data is distributed
3.Third increses the data demands of compliance.
4.Security
1.which data is publish and whom , 2.seek advice from security
manager.
5.Data Integration
It makes all system works together seamlessly. And The data
integration must be take place on transaction system.data
integration is confirming dimension and confirming facts in DW.
6.Data latency
How quickly source data must delivered to business users it affect
on the ETL architecture. The good algorithm, parallelization and
good hardware can speed up batch processing data. ETL must
convert batch processing to stream processing if the latency is
most urgent.
7.Archive
Staging data after all major activity in ETL process it archive all the
staged data on permanent storage device.Each staged/ archive
data have metadata.
8.User Deliver interface
deliver data in a format which is easy to understand, well
structured and easy to retrieve
9.Available skills
ETL design should be based on the resources and skills
available.check the technical issues and license cost.
Q.Explain change data capture system
Change Data Capture is a software process that identifies and
tracks changes to data in a database. CDC provides real-time or
near-real-time movement of data by moving and processing data
continuously as new database events occur.
1.Audit column
    Source is appended with audit columns to store data and
data and time when a record is modified. Columns are
populated(loaded) via database triggers. Sometimes columns are
populated from the souce applications.
2.extract time
     Select the all rows where data in create or modified date
fields equal SYSDATE(Yesterday).
3.Full diff compare
     Compare full snapshot of yesterday with today for what data
is changed. It is detailed process and time consuming and it
perform on source machine.
4.Database Log scraping
     It involves monitoring and extracting changes mades to a
database by checking it’s transaction logs.
5.Message queue monitoring
     It managing the flow of message within message queue
system to track and capture changes made to the database.
Q.Audit Dimension assembler
The audit dimension is a special dimension that is assembled in
the back room by the ETL system for each fact table. The audit
dimension in Figure 9-2 contains the metadata context at the
moment when a specific fact table record is created. You might
say that we have elevated metadata to real data! To visualize how
audit dimension records are created, imagine that this shipments
fact table is updated once per day from a batch file. Suppose that
today we have a perfect run with no errors flagged. In this case,
we would generate only one audit dimension record and it would
be attached to every fact record loaded today. All of the
categories, scores, and version numbers would be the same.
Q.Role of data profiling in ETL
Data profiling plays a crucial role in the ETL (Extract, Transform,
Load) process by providing insights of structure, quality, and
characteristics of the data being processed.
1.Understanding Data Quality:
It examines various aspects such as completeness, accuracy,
consistency, and integrity. Identifying data quality issues early in
the ETL process allows for better decision-making regarding data
cleansing and transformation.
2.Data Discovery and Exploration:
This includes identifying patterns, distributions, and relationships
within the data. This exploration is crucial for designing effective
ETL processes.
3.Identifying Anomalies and Outliers:
Data profiling helps detect anomalies, outliers, and irregularities
in the source data.
4. Data Standardization and Normalization:
5.Rule Discovery and Validation:
Data profiling tools enable the discovery and validation of
business rules associated with the data. This ensures that the ETL
processes adhere to the predefined rules and constraints,
maintaining data integrity and compliance.
6.Performance Optimization:
7. Metadata Generation:
Data profiling generates valuable metadata, including statistical
summaries, distribution profiles, and data lineage information.
This metadata is useful for documentation, impact analysis, and
maintaining data governance throughout the ETL lifecycle.
8.Data Security and Privacy Compliance:
identifying sensitive information within the data and handle it
during ETL process.
Q.5. Explain Extract System in ETL process
The Extract phase in the ETL (Extract, Transform, Load) process is
responsible for gathering data from source systems and preparing
it for further processing. The Extract system extracts data from
various source systems.
1. File-Oriented Extract:
In a file-oriented extract, data is extracted from source systems
and stored in files. These files can be in various formats, such as
CSV, JSON, XML, or proprietary formats.The Extract system
collects the relevant data from source databases, transforms it
into the chosen file format, and stores it in files for subsequent
processing.
2. Stream-Oriented Extract:
Stream-oriented extract involves the real-time extraction of data
from source systems as it becomes available.
3. Data Compression:
Data compression is a technique used in the Extract phase to
reduce the size of the data being transferred or stored.
Compressing data helps in optimizing storage space and
improving the efficiency of data transmission over networks.
During extraction, data may be compressed using algorithms such
as gzip or zlib.
4. Data Encryption:
Data encryption is employed to secure sensitive information
during the extraction process. When data is extracted from source
systems, it may contain sensitive details that require protection.
Encryption algorithms are applied like AES (Advanced Encryption
Standard) and SSL/TLS for secure communication.
Q.What is the role of error event schema subsystem in ETL
process?
Q.Duplication and conformation
16. Explain slowly changing Dimension(SCD) manager in ETL
System.
One of the most important elements of ETL architecture who
capable to implement Slowly Changing Dimension. The ETL
system must determine how to handle a dimension attribute
value of values which are already stored in data warehouse. Then
the SCD is applied on it. It shows the previous updated
information in different types of SCD.
Draw a diagram(table) for example and explain shortly.
https://www.youtube.com/watch?v=vOYSVTSYvW8
There are four types of
1.Type 1-Overwritting
     In this type of SCD it is the simple overwrite of one or more
attributes in an existing dimension row. You take the revised data
from change data capture system and overwrite data in
dimension table. It can be used when correcting data or when
there is no business need to keep the history of previous values.
Some ETL tools contain UPDATE else INSERT functionality. This
functionality may be convenient for the developer, but can be a
performance killer. For maximum performance existing row
UPDATEs should be used for new row INSERTs. In type1 .
2.Type-2- Create a new Row(preserve history)
-Create new rows
-. For type 2 updates, copy the previous version of the dimension
row and create a new dimension row with a new surrogate key.If
there is not a previous version of the dimension row, create a new
one from scratch. You then update this row with the columns that
have changed. This is the main workhorse technique for handling
dimension attribute changes that need to be tracked over
time.The type 2 process does not change history as the type 1
process does. And we give the version or the date column as the
number of rows are created or record updated. As we called as
assign flag
3.Type-3 Add new Column(Partially preserve history)
The type 3 technique is designed to support attribute "soft"
changes that require a user to refer either to the old value of the
attribute or the new value. For example, if a sales team is
assigned to a newly named sales region, there may be a need to
track the old region assignment. And we assign flag as type-2
4.Type-4 Hybrid :(Combination of type 1 and type 2)
Each dimension table will use one, two, or all three of these
techniques to manage data changes. In some cases, the
techniques will be used in hybrid combinations. In this situation
you use two columns of dimension where one is for current
value(Type-1) and another one is for old/ history of old record.you
simply look at the first column as a type 1 column and the second
column as a type 2 column, both columns - 341 - being sourced
from the same source field.
Unit 5 Introducing Business
Intelligence Applications
Q.Explain different types of BI applications
1.direct access query and reporting tool :- it allow user to query
the dimensional model directly and define a result set.The Ad Hoc
is used in this case it show the result in the tabular form. User can
drag drop columns, set constraints and define calculations.
2.Standard reports :- standard reports are predefined and
preformatted reports which are generally provide some level of
user interaction like a ability to enter parameter and link the
related reports.
3.Analytic application:-Analytic applications are a type of
business application software, used to measure and improve the
performance of business operations. It manage the set of reports
embedding domain expertise for particular business
process.Analytic applications take advantage of Data mining
application.
4.Dashboards and scorecards :- it involves combination of reports
and charts that are used for exception highlighting and drilldown
capabilities to analyse data from multiple business processes.
5.Data mining:- used for data cleaning and …..Data mining aids
and operates within Business Intelligence Applications by
extracting valuable patterns and insights from large datasets,
enhancing decision-making processes. In simpler terms, data
mining helps in Business Intelligence Apps by uncovering
important information from big sets of data, which helps make
better decisions
6.Operational BI :-Provides real time data and……
Q.Use of Dashboards and Scorecards in BI systems
Dashboard and scorecards are used by executives.
1.Dashboard provides interface tools that support status reports
and alert across multiple data source at a high level and also allow
drilldown detailed data. The alerts can be generated by
comparing the actual number, like sales, to a fixed number, a
percentage, or, in many cases, to a separate target or goal
number. A dashboard is usually aimed at a specific user type, such
as a sales manager.
Explain the following figure.
2.Scoreboard
    It used to manage performance across organization.
Q.Explain Analytic Cycle for BI application
The business analysis is breaks into five process which are
1.Monitor activity :- This is the first step in the analytic cycle in
the monitor activity stage focus on the presentation layer and
include technologies such as dashboards, portals, and scorecards.
Many data warehouse implementations stop at this stage and
declare success.
2.Identify exception :-This stage focuses on the identification of
"what's the matter?" or "where are the problems?"in this stage is
to identify the exceptions to normal performance, as well as the
opportunities.The identify exceptions stage requires additional
capabilities, such as distribution servers that distribute alerts to
users' devices of choice based upon exception triggers and
visualization tools to view the data in different more creative
ways, including trend lines, spark lines, geographical maps, or
clusters
3.Determine casual factor:-This stage tries to understand the
"why" or root causes of the identified exceptions.additional
software including statistical tools and/or data mining algorithms,
such as association, sequencing, classification, and segmentation,
to quantify cause-and-effect. This step frequently requires new
data to investigate causal factors. For instance, if you are trying to
understand a broad sales downturn in West Virginia during a
particular month, you might want to include a weather data
source for that area and time.
4.Model alternatives :-In this stage, you build on cause-and-effect
relationships to develop models for evaluating decision
alternatives. Your data warehouse architecture may also need to
accommodate additional technologies in the model alternatives
stage, including statistical tools and data mining algorithms for
model evaluation, such as sensitivity analysis, Monte Carlo
simulations, and goal seeking optimizations.
5.Take action and track result :-The take action/track results
stage places additional demands on your data warehouse
architecture.The results of the action should be captured and
analyzed in order to continuously for the analysis process,
business rules, and analytic models. This brings you right back to
the monitor activity stage to start the cycle all over again. As
Figure 11-2 shows, these stages occur in a circular process. The
result of each analytic cycle becomes input to the next cycle. The
results of one decision must be monitored, analyzed for
exceptions, and so on through the five stages.
Q.How datamining is used in DWBI system.
In Data Warehousing and Business Intelligence (DWBI) systems,
data mining plays a crucial role in extracting valuable patterns and
insights from large datasets. Here's how various data mining
techniques are commonly used in DWBI systems:
1.Clustering:
Purpose: Grouping similar data points together based on certain
characteristics.
Application: In DWBI, clustering helps identify natural groupings
within data, supporting segmentation for targeted analysis or
reporting.
2.Classifying:
Purpose: Assigning predefined categories or classes to new data
points based on patterns learned from historical data.
Application: Classification is used in DWBI for tasks such as
categorizing customers into segments, predicting product
preferences, or identifying potential issues.
3.Estimating and Predicting:
Purpose: Making predictions or estimating future values based on
historical patterns.
Application: In DWBI, predicting future sales, demand, or trends is
common. Estimation techniques help in forecasting and planning.
4.Affinity Grouping (Association Rule Mining):
Purpose: Discovering relationships and associations between
different variables or items in the dataset.
Application: In DWBI, affinity grouping helps identify patterns like
"customers who buy product A are likely to buy product B,"
supporting cross-selling strategies.
5.Anomaly Detection:
Purpose: Identifying unusual patterns or outliers in the data that
deviate from the norm.
Application: In DWBI, anomaly detection helps highlight
irregularities, such as detecting fraudulent transactions or
identifying data quality issue
Q.Explain need of following technical features for BI application
and Query Tools/ What are the technical features expected in
Query Tool used in BI Systems?
1.Multitasking :- user should run other programs and create and
run other queries while a query is running
2.Cancel query :-able to kill a single query without killing all
without rebooting
3.Scripting :- scripting and command line is critical for automating
report execution
4.Connectivity:- Connecting to            all data sources- text,
spreadsheets, XML files, other rdbms
5.Schedulling :- allow users to differ the execution of queries
6.Metadata driven :-admin should be able to define subset of
warehouse such as those tables involved in a single business
process
7.Security :- need to provide authorization functions, limiting
users to access to those reports they are allowed to use
8.Querying :- direct querying of database should be supported