KEMBAR78
Unit-1 Lecture Notes | PDF | Data Warehouse | Databases
100% found this document useful (1 vote)
851 views43 pages

Unit-1 Lecture Notes

Data warehouses store consolidated data from multiple sources to support analytical reporting and business intelligence. A data warehouse uses a multi-tier architecture with a database, OLAP server, and front-end tools. Data is organized using dimensional modeling with facts and dimensions. Common schemas include star schemas with a central fact table linked to dimension tables, and snowflake schemas with some normalized dimensions. Dimensional data can be visualized as cubes with measures across dimensions of time, products, locations etc. to enable analysis of trends.

Uploaded by

Sravani Gunnu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
851 views43 pages

Unit-1 Lecture Notes

Data warehouses store consolidated data from multiple sources to support analytical reporting and business intelligence. A data warehouse uses a multi-tier architecture with a database, OLAP server, and front-end tools. Data is organized using dimensional modeling with facts and dimensions. Common schemas include star schemas with a central fact table linked to dimension tables, and snowflake schemas with some normalized dimensions. Dimensional data can be visualized as cubes with measures across dimensions of time, products, locations etc. to enable analysis of trends.

Uploaded by

Sravani Gunnu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 43

DATAWAREHOUSE INTRODUCTION

What is Data and Information?


Data is an individual unit that contains raw materials which do not carry any specific meaning.
Information is a group of data that collectively carries a logical meaning.
Data doesn't depend on information.
Information depends on data.
Data is measured in bits and bytes.
Information is measured in meaningful units like time, quantity, etc.

Data Warehouse:
Data warehouse is like a relational database designed for analytical needs. It functions on the
basis of OLAP (Online Analytical Processing). It is a central location where consolidated data
from multiple locations (databases) are stored.

What is Data warehousing?


Data warehousing is the act of organizing & storing data in a way so as to make its retrieval
efficient and insightful. It is also called as the process of transforming data into information.

1
Fig: Data warehousing Process

Data Warehouse Characteristics:


A Data warehouse is a subject-oriented, integrated, time variant and non-volatile collection of
data in support of management’s decision making process.
Subject-oriented:
A Data warehouse can be used to analyze a particular subject area
Ex: “Sales” can be particular subject
Integrated:
A Data warehouse integrates data from multiple data sources.
Time Variant:
Historical data is kept in a data warehouse.
Ex: one can retrieve data from 3 months, 6months, 12 months or even older data from a
data warehouse. This contrasts with a transactions system, where often only the most recent
data is kept.
Non-Volatile:
Once data is in the data warehouse, it will not change. So historical data in a data warehouse
should never be altered.

2
Data warehouse Architecture:

Fig: Data ware housing Architecture


Data warehouses often adopt a three-tier architecture
 The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed the data into the bottom
tier from operational database or other external sources. These tools and utilities
perform data extraction, cleaning and transformation(ex. To merge similar data from
different sources into a unified format), as well as load and refresh functions to update
the data warehouse. The data are extracted using application program interfaces known
as gateways.A gateway is supported by the underlying DBMS and allows client programs
to generate SQL code to be executed at a server.
Examples of gateways include ODBC(Open Database Connection) and OLEDB(Open
Linking and Embedding for Databases) by Microsoft and JDBC(Java Database
Connection).This tier also contains a metadata repository, which stores information
about the data warehouse and its contents.
The middle tier is an OLAP server that is typically implemented using either
(a) a relational OLAP(ROLAP) model, that is an extended relational DBMS that maps
operations on multidimensional data to standard relational operations, or

3
(b) a multidimensional OLAP(MOLAP) model that is a special-purpose server that directly
implements multidimensional data and operations.
The top tier is a front end client layer, which contains query and reporting tools, analysis tools
and data mining tools(ex: trend analysis, prediction….)

Multi-dimensional Data Model:


 A multidimensional model views data in the form of a data-cube.
 When data is grouped or combined in multidimensional matrices called Data Cubes.
 A data cube enables data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts.
 A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures.
 The fact table contains the names of the facts or measures of the related dimensional
tables.
FACT VS DIMENSION
Fact/Measure(What you want to analyse is your fact)
Ex: What is My sales, What is my profit, What is my custmes preferences.
Dimensions(By Which you want to Analyze is your Dimensions)
Sales By Location/Product/Period
Total Profit By Location/Product/Period
 These Dimensions allow the store to keep track of things like monthly sales of items
and branches and locations at which the items were sold.
 Each dimension may have a table associated with it called a dimension table, which
further describes the dimension.

4
Fig: Multidimensional Representation
 Consider the data of a shop for items sold per quarter in the city of Delhi. The data
is shown in the table.
 In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the types of an
item sold).
 The fact or measure displayed in rupee_sold (in thousands).

Now, if we want to view the sales data with a third dimension, For example, suppose the data
according to time and item, as well as the location is considered for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table are
represented as a series of 2D tables.

5
Conceptually, it may also be represented by the same data in the form of a 3D data cube, as
shown in fig:

What is Schema?
 Schema is a logical description of the entire database.
 It includes the name and description of records of all record types including all
associated data-Items and aggregates.
 Much like a database, a data warehouse also requires to maintain a schema.
 A database uses relational model, while a data warehouse uses Star, Snowflake, and
Fact Constellation schema.
 Modeling data warehouses: dimensions & measures

6
 Star schema: A fact table in the middle connected to a set of dimension tables
 Snowflake schema: A refinement of star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape similar to snowflake
 Fact constellations: Multiple fact tables share dimension tables, viewed as a collection
of stars, therefore called galaxy schema or fact constellation

Star Schema:
 A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions.
 A fact is an event that is counted or measured, such as a sale or log in. A dimension
includes reference data about the fact, such as date, item, or customer.
 A star schema is a relational schema where a relational schema whose design represents
a multidimensional data model.
 The star schema is the explicit data warehouse schema. It is known as star
schema because the entity-relationship diagram of this schemas simulates a star, with
points, diverge from a central table.
 The center of the schema consists of a large fact table, and the points of the star are the
dimension tables.

Fig: Star Schema Representation

7
Star Schema:
 Each dimension in a star schema is represented with only one-dimension table.
 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.

 There is a fact table at the center. It contains the keys to each of four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.
 Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key,
street, city, province_or_state,country}. This constraint may cause data redundancy. For
example, "Vancouver" and "Victoria" both the cities are in the Canadian province of
British Columbia. The entries for such cities may cause data redundancy along the
attributes province_or_state and country.

Characteristics of Star Schema:


 Every dimension in a star schema is represented with the only one-dimension table.
 The dimension table should contain the set of attributes.
 The dimension table is joined to the fact table using a foreign key
 The dimension table are not joined to each other
 Fact table would contain key and measure
 The Star schema is easy to understand and provides optimal disk usage.

8
 The dimension tables are not normalized. For instance, in the above figure, Country_ID
does not have Country lookup table as an OLTP design would have.
 The schema is widely supported by BI Tools.
 Advantages:
 (i) Simplest and Easiest
 (ii) It optimizes navigation through database
 (iii) Most suitable for Query Processing

Snowflake Schema:
 Some dimension tables in the Snowflake schema are normalized.
 The normalization splits up the data into additional tables.
 Unlike Star schema, the dimensions table in a snowflake schema are normalized.
 For example, the item dimension table in star schema is normalized and split into two
dimension tables, namely item and supplier table.

 Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier-key.
 The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.

9
Note : Due to normalization in the Snowflake schema, the redundancy is reduced and
therefore, it becomes easy to maintain and the save storage space.

Fig: Snowflake image


A snowflake schemas can have any number of dimension, and each dimension can have any
number of levels.
The following diagram shows a snowflake schema with two dimensions, each having three
levels.

Advantages:
(i) Less redundancies due to normalization Dimension Tables.

10
(ii) Dimension Tables are easier to update.
Disadvantages:
It is complex schema when compared to star schema.

Fact Constellation Schema:


 A Fact constellation means two or more fact tables sharing one or more dimensions. It is
also called Galaxy schema.
 Fact Constellation Schema describes a logical structure of data warehouse or data mart.
Fact Constellation Schema can design with a collection of de-normalized FACT, Shared,
and Conformed Dimension tables.

11
A fact constellation schema is shown in the figure below.

 This schema defines two fact tables, sales, and shipping. Sales are treated along four
dimensions, namely, time, item, branch, and location.
 The schema contains a fact table for sales that includes keys to each of the four
dimensions, along with two measures: Rupee_sold and units_sold.
 The shipping table has five dimensions, or keys: item_key, time_key, shipper_key,
from_location, and to_location, and two measures: Rupee_cost and units_shipped.
 It is also possible to share dimension tables between fact tables. For example, time,
item, and location dimension tables are shared between the sales and shipping fact
table.

Disadvantages:
(i) Complex due to multiple fact tables
(ii) It is difficult to manage
(iii) Dimension Tables are very large.

OLAP OPERATIONS:
 In the multidimensional model, the records are organized into various dimensions, and
each dimension includes multiple levels of abstraction described by concept hierarchies.

12
 This organization support users with the flexibility to view data from various
perspectives.
 A number of OLAP data cube operation exist to demonstrate these different views,
allowing interactive queries and search of the record at hand. Hence, OLAP supports a
user-friendly environment for interactive data analysis.
 Consider the OLAP operations which are to be performed on multidimensional data.
 The data cubes for sales of a shop. The cube contains the dimensions, location, and time
and item, where the location is aggregated with regard to city values, time is
aggregated with respect to quarters, and an item is aggregated with respect to item
types.
OLAP having 5 different operations

(i) Roll-up
(ii) Drill-down
(iii) Slice
(iv) Dice
(v) Pivot

Roll-up:
 The roll-up operation performs aggregation on a data cube, by climbing down concept
hierarchies, i.e., dimension reduction. Roll-up is like zooming-out on the data cubes.
 It is also known as drill-up or aggregation operation
 Figure shows the result of roll-up operations performed on the dimension location. The
hierarchy for the location is defined as the Order Street, city, province, or state, country.
 The roll-up operation aggregates the data by ascending the location hierarchy from the
level of the city to the level of the country.
 When a roll-up is performed by dimensions reduction, one or more dimensions are
removed from the cube.
 For example, consider a sales data cube having two dimensions, location and time. Roll-
up may be performed by removing, the time dimensions, appearing in an aggregation of
the total sales by location, relatively than by location and by time.

13
Fig: Roll-up operation on Data Cube

Drill-Down
 The drill-down operation is the reverse operation of roll-up.
 It is also called roll-down operation.
 Drill-down is like zooming-in on the data cube.
 It navigates from less detailed record to more detailed data. Drill-down can be
performed by either stepping down a concept hierarchy for a dimension or adding
additional dimensions.
 Figure shows a drill-down operation performed on the dimension time by stepping
down a concept hierarchy which is defined as day, month, quarter, and year.
 Drill-down appears by descending the time hierarchy from the level of the quarter to a
more detailed level of the month.
 Because a drill-down adds more details to the given data, it can also be performed by
adding a new dimension to a cube.

14
Fig: Drill-down operation

Slice:
 A slice is a subset of the cubes corresponding to a single value for one or more members
of the dimension.
 The slice operation provides a new sub cube from one particular dimension in a given
cube.
 For example, a slice operation is executed when the customer wants a selection on one
dimension of a three-dimensional cube resulting in a two-dimensional site. So, the Slice
operations perform a selection on one dimension of the given cube, thus resulting in a
sub cube.
 Here Slice is functioning for the dimensions "time" using the criterion time = "Q1".
 It will form a new sub-cubes by selecting one or more dimensions.

15
Fig: Slice operation

Dice:
 The dice operation describes a sub cube by operating a selection on two or more
dimension.

16
Fig: Dice operation

 The dice operation on the cubes based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")

Pivot:
 The pivot operation is also called a rotation.
 Pivot is a visualization operations which rotates the data axes in view to provide an
alternative presentation of the data.
 It may contain swapping the rows and columns or moving one of the row-dimensions
into the column dimensions.

Fig: Pivot Operation

17
Parallel DBMS Vendors:

What is a DBMS vendor?

(Data base Management System)Software that controls the organization, storage, retrieval, security and
integrity of data in a database.

The major DBMS vendors are Oracle, IBM, Microsoft and Sybase (see Oracle Database, DB2, SQL Server
and ASE).

D V Type Primary Market


B en
M do
S r
Access (Jet, MSDE) Microsoft R Desktop
Adabas D Software AG R Enterprise
Adaptive Server Anywhere Sybase R Mobile/Embedded
Adaptive Server Enterprise Sybase R Enterprise
Advantage Database Server Extended Systems R Mobile/Enterprise
Datacom Computer Associates R Enterprise
DB2 Everyplace IBM R Mobile
Filemaker FileMaker Inc. R Desktop
IDMS Computer Associates R Enterprise
Ingres ii Computer Associates R Enterprise

18
Interbase Inprise (Borland) R Open Source
MySQL Freeware R Open Source
NonStop SQL Tandem R Enterprise
Pervasive.SQL 2000 Pervasive Software R Embedded
(Btrieve)
Pervasive.SQL Workgroup Pervasive Software R Enterprise (Windows 32)
Progress Progress Software R Mobile/Embedded
Quadbase SQL Server Quadbase Systems, Inc. Relation Enterprise
al
R:Base R:Base Technologies Relation Enterprise
al
Rdb Oracle R Enterprise
Enterprise
Red Brick Informix (Red Brick) R
(Data
Warehousi
ng)
Enterprise
SQL Server Microsoft R
Mobile/Em
SQLBase Centura Software R bedded
Enterprise
SUPRA Cincom R
VLDB
Teradata NCR R
(Data
Warehousi
ng)
Enterprise
YARD-SQL YARD Software Ltd. R
In-
TimesTen TimesTen Performance R
Memory
Software
Enterprise
Adabas Software AG XR
VLDB
Model 204 Computer Corporation of XR
America
Enterprise
UniData Informix (Ardent) XR
Enterprise
UniVerse Informix (Ardent) XR
Enterprise
Cache' InterSystems OR
Mobile/Em
Cloudscape Informix OR bedded

19
Enterprise/
DB2 IBM OR VLDB
Enterprise
Informix Dynamic Server Informix OR
2000
VLDB
Informix Extended Parallel Informix OR
(Data
Server Warehousi
ng)
Mobile
Oracle Lite Oracle OR
Enterprise
Oracle 8I Oracle OR
Embedded
PointBase Embedded PointBase OR
Mobile
PointBase Mobile PointBase OR
Enterprise
PointBase Network Server PointBase OR
Open
PostgreSQL Freeware OR Source
Enterprise
UniSQL Cincom OR
Enterprise
Jasmine ii Computer Associates OO
Enterprise
Object Store Exceleron OO
VLDB
Objectivity DB Objectivity OO (Scientific)
Enterprise
POET Object Server Suite Poet Software OO
Enterprise
Versant Versant Corporation OO
Mobile/Em
Raima Database Manager Centura Software RN bedded
Enterprise/
Velocis Centura Software RN Embedded
Open
Db.linux Centura Software RNH
Source/Mo
bile/Embe
dded
Open
Db.star Centura Software RNH
Source/Mo
bile/Embe
dded

20
Types of Data Warehouse:
There are three main types of DWH. Each has its specific role in data management operations.

1. Enterprise Data Warehouse


Enterprise data warehouse (EDW) serves as a central or main database to facilitate decision-
making throughout the enterprise. Key benefits of having an EDW include access to cross-
organizational information, the ability to run complex queries, and the enablement of enriched,
far-sighted insights for data-driven decisions and early risk assessment.

2. ODS (Operational Data Store)


In ODS, the DWH refreshes in real-time. Therefore, organizations often used it for routine
enterprise activities, such as storing records of the employees. Business processes also use ODS
as a source for providing data to the EDW.

3. Data Mart
It is a subset of a DWH that supports a particular department, region, or business unit. Consider
this: You have multiple departments, including sales, marketing, product development, etc.
Each department will have a central repository where it stores data. This repository is called a
data mart. The EDW stores the data from the data mart in the ODS on a daily/weekly (or as
configured) basis. The ODS acts as a staging area for data integration. It then sends the data to
the EDW to store it and use it for BI purposes.

DATAWAREHOUSE COMPONENTS:
The data warehouse is based on an RDBMS server which is a central information repository that
is surrounded by some key components to make the entire environment functional,
manageable and accessible
There are mainly five components of Data Warehouse:

DATA WAREHOUSE DATABASE:


The central database is the foundation of the data warehousing environment. This database is
implemented on the RDBMS technology. Although, this kind of implementation is constrained

21
by the fact that traditional RDBMS system is optimized for transactional database processing
and not for data warehousing. For instance, ad-hoc query, multi-table joins, aggregates are
resource intensive and slow down performance.
Hence, alternative approaches to Database are used as listed below-
In a data warehouse, relational databases are deployed in parallel to allow for scalability.
Parallel relational databases also allow shared memory or shared nothing model on various
multiprocessor configurations or massively parallel processors.
New index structures are used to bypass relational table scan and improve speed.
Use of multidimensional database (MDDBs) to overcome any limitations which are placed
because of the relational data model. Example: Essbase from Oracle.

SOURCING, ACQUISITION, CLEAN-UP AND TRANSFORMATION TOOLS


(ETL):
The data sourcing, transformation, and migration tools are used for performing all the
conversions, summarizations, and all the changes needed to transform data into a unified
format in the datawarehouse. They are also called Extract, Transform and Load (ETL) Tools.
These Extract, Transform, and Load tools may generate cron jobs, background jobs, Cobol
programs, shell scripts, etc. that regularly update data in datawarehouse. These tools are also
helpful to maintain the Metadata.
These ETL Tools have to deal with challenges of Database & Data heterogeneity.

METADATA:
The name Meta Data suggests some high- level technological concept. However, it is quite
simple. Metadata is data about data which defines the data warehouse. It is used for building,
maintaining and managing the data warehouse.
In the Data Warehouse Architecture, meta-data plays an important role as it specifies the
source, usage, values, and features of data warehouse data. It also defines how data can be
changed and processed. It is closely connected to the data warehouse.

QUERY TOOLS:
One of the primary objects of data warehousing is to provide information to businesses to
make strategic decisions. Query tools allow users to interact with the data warehouse system.

22
These tools fall into four different categories:
Query and reporting tools
Application Development tools
Data mining tools
OLAP tools

Characteristics of OLAP:
The main characteristics of OLAP are as follows:
Multidimensional conceptual view: OLAP systems let business users have a dimensional and
logical view of the data in the data warehouse. It helps in carrying slice and dice operations.
Multi-User Support: Since the OLAP techniques are shared, the OLAP operation should provide
normal database operations, containing retrieval, update, adequacy control, integrity, and
security.
Accessibility: OLAP acts as a mediator between data warehouses and front-end. The OLAP
operations should be sitting between data sources (e.g., data warehouses) and an OLAP front-
end.
Storing OLAP results: OLAP results are kept separate from data sources.
Uniform documenting performance: Increasing the number of dimensions or database size
should not significantly degrade the reporting performance of the OLAP system.
OLAP provides for distinguishing between zero values and missing values so that aggregates are
computed correctly.
OLAP system should ignore all missing values and compute correct aggregate values.
OLAP facilitate interactive query and complex analysis for the users.
OLAP allows users to drill down for greater details or roll up for aggregations of metrics along a
single business dimension or across multiple dimension.
OLAP provides the ability to perform intricate calculations and comparisons.
OLAP presents results in a number of meaningful ways, including charts and graphs.

23
OLAP Types:
Three types of OLAP servers are:-
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)

1. Relational OLAP (ROLAP):


Relational On-Line Analytical Processing (ROLAP) work mainly for the data that resides in a
relational database, where the base data and dimension tables are stored as relational tables.
ROLAP servers are placed between the relational back-end server and client front-end tools.
ROLAP servers use RDBMS to store and manage warehouse data, and OLAP middleware to
support missing pieces.

Advantages of ROLAP:
1. ROLAP can handle large amounts of data.
2. Can be used with data warehouse and OLTP systems.

Disadvantages of ROLAP:
1. Limited by SQL functionalities.
2. Hard to maintain aggregate tables.

2. Multidimensional OLAP (MOLAP):


Multidimensional On-Line Analytical Processing (MOLAP) support multidimensional views of
data through array-based multidimensional storage engines. With multidimensional data
stores, the storage utilization may be low if the data set is sparse.

Advantages of MOLAP
1. Optimal for slice and dice operations.
2. Performs better than ROLAP when data is dense.
3. Can perform complex calculations.

24
Disadvantages of MOLAP
1. Difficult to change dimension without re-aggregation.
2. MOLAP can handle limited amount of data.

3. Hybrid OLAP (HOLAP):


Hybrid On-Line Analytical Processing (HOLAP) is a combination of ROLAP and MOLAP. HOLAP provide greater
scalability of ROLAP and the faster computation of MOLAP.

Advantages of HOLAP
1. HOLAP provide advantages of both MOLAP and ROLAP.
2. Provide fast access at all levels of aggregation.

Disadvantages of HOLAP
1. HOLAP architecture is very complex because it support both MOLAP and ROLAP servers.

What motivated data mining? Why is it important?

The major reason that data mining has attracted a great deal of attention in information
industry in recent years is due to the wide availability of huge amounts of data and the
imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and science
exploration.

25
The evolution of database technology
Data collection and Database Creation
(1960s and earlier)
Primitive file processing

Database Management Systems


(1970s-early 1980s)
Hierarchical and network database system
Relational database system
Data modeling tools: entity-relational models, etc
Indexing and accessing methods: B-trees, hashing etc.
Query languages: SQL, etc. User Interfaces, forms and
reports
Query Processing and Query Optimization
Transactions, concurrency control and recovery
Online transaction Processing (OLTP)

Advanced Data Analysis: Web based databases


Advanced Database Systems
Data warehousing and Data mining (late 1980s- (1990s-present)
(mid 1980s-present)
present) XML- based database systems
Advanced Data models: 1)Data warehouse and OLAP 2)Data mining and
Integration with information
Extended relational, object- knowledge
discovery:generalization,classification,associ retrieval 3)Data and
relational ,etc.
ation,clustering,frequent pattern, outlier information Integration
Advanced applications; Spatial,
temporal, multimedia, active analysis, etc
3)Advanced data mining applications: Stream
stream and sensor, knowledge
data mining,bio-data mining, text mining, web
based mining etc

New Generation of Integrated Data and Information Systems(present future)

26
What is data mining?

Data mining refers to extracting or mining" knowledge from large amounts of data. There
are many other terms related to data mining, such as knowledge mining, knowledge
extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat
data mining as a synonym for another popularly used
term, KnowledgeDiscovery in
Databases", or KDD

Essential step in the process of knowledge discovery in databases

Knowledge discovery as a process is depicted in following


figure and consists of an iterative sequence of the following
steps:

data cleaning: to remove noise or irrelevant data


data integration: where multiple data sources may be combined
data selection: where data relevant to the analysis task are retrieved from
the database
27
data transformation: where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations
data mining :an essential process where intelligent methods are applied in order
to extract data patterns
pattern evaluation to identify the truly interesting patterns representing knowledge
based on some interestingness measures
knowledge presentation: where visualization and knowledge representation techniques are used to present
the mined knowledge to the user.

Architecture of a typical data mining system/Major Components

Data mining is the process of discovering interesting knowledge from large amounts of data
stored either in databases, data warehouses, or other information repositories. Based on this
view, the architecture of a typical data mining system may have the following major
components:

1. A database, data warehouse, or other information repository, which consists of the


set of databases, data warehouses, spreadsheets, or other kinds of
information repositories containing the student and course information.
2. A database or data warehouse server which fetches the relevant data based on
users’ data mining requests.
3. A knowledge base that contains the domain knowledge used to guide the search
or to evaluate the interestingness of resulting patterns. For example, the
knowledge base may contain metadata which describes data from multiple
heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for tasks such as
classification, association, classification, cluster analysis, and evolution and
deviation analysis.
5. A pattern evaluation module that works in tandem with the data mining
modules by employing interestingness measures to help focus the search
towards interestingness patterns.

28
6. A graphical user interface that allows the user an
interactive approach to the data mining system.

Data mining: on what kind of data?

In principle, data mining should be applicable to any kind of


information repository. This includes relational databases, data
warehouses, transactional databases, advanced database systems,
flat files, and the World-Wide Web. Advanced database systems
include object-oriented and object-relational databases, and special
c application-oriented databases, such as

spatial databases, time-series databases, text databases, and multimedia


databases.

Flat files: Flat files are actually the most common data source for
data mining algorithms, especially at the research level. Flat files are
simple data files in text or binary format with a structure known by
the data mining algorithm to be applied. The data in these files can
be transactions, time-series data, scientific measurements, etc.

Relational Databases: a relational database consists of a set of


tables containing either values of entity attributes, or values of
attributes from entity relationships. Tables have columns and rows,
where columns represent attributes and rows represent tuples. A
tuple in a relational table corresponds to either an object or a
relationship between objects and is identified by a set of attribute
values representing a unique key. In following figure it presents
some relations Customer, Items, and Borrow representing business
activity in a video store. These relations are just a subset of what
could be a database for the video store and is given as an example.

The most commonly used query language for relational database is


SQL, which allows retrieval and manipulation of the data stored in
the tables, as well as the calculation of aggregate functions such as
average, sum, min, max and count. For instance, an SQL query to
select the videos grouped by category would be:

SELECT count(*) FROM Items WHERE type=video GROUP BY category.

Data mining algorithms using relational databases can be more


versatile than data mining algorithms specifically written for flat
files, since they can take advantage of the structure inherent to
relational databases. While data mining can benefit from SQL for
data selection, transformation and consolidation, it goes beyond
what SQL could provide, such as
predicting, comparing, detecting deviations, etc.

Data warehouses

A data warehouse is a repository of information collected from


multiple sources, stored under a unified schema, and which usually
resides at a single site. Data warehouses are constructed via a
process of data cleansing, data transformation, data integration, data
loading, and periodic data refreshing. The figure shows the basic
architecture of a data warehouse.

In order to facilitate decision making, the data in a data warehouse


are organized around major subjects, such as customer, item,
supplier, and activity. The data are stored to provide information
from a historical perspective and are typically summarized.

A data warehouse is usually modeled by a multidimensional


database structure, where each dimension corresponds to an
attribute or a set of attributes in the schema, and each cell stores the
value of some aggregate measure, such as count or sales amount. The
actual physical structure of a data warehouse may be a relational
data store or a multidimensional data cube. It provides a
multidimensional view of data and allows the precomputation and
fast accessing of summarized data.

Transactional databases

In general, a transactional database consists of a flat file where each


record represents a transaction. A transaction typically includes a
unique transaction identity number (trans ID), and a list of the items
making up the transaction (such as items purchased in a store) as
shown below:

SALES
Trans-ID List of item_ID’s
T100 I1,I3,I8
…….. ………

Advanced database systems and advanced database applications

• An objected-oriented database is designed based on the object-


oriented programming paradigm where data are a large number of
objects organized into classes and class hierarchies. Each entity in
the database is considered as an object. The object contains a set of
variables that describe the object, a set of messages that the object
can use to communicate with other objects or with the rest of the
database system and a set of methods where each method holds the
code to implement a message.

• A spatial database contains spatial-related data, which may be


represented in the form of raster or vector data. Raster data consists
of n-dimensional bit maps or pixel maps, and vector data are
represented by lines, points, polygons or other kinds of processed
primitives, Some examples of spatial databases include geographical
(map) databases, VLSI chip designs, and medical and satellite images
databases.

• Time-Series Databases: Time-series databases contain time


related data such stock market data or logged activities. These
databases usually have a continuous flow of new data coming in,
which sometimes causes the need for a challenging real time
analysis. Data mining in such databases commonly includes the
study of trends and correlations between evolutions of different
variables, as well as the prediction of trends and movements of the
variables in time.

• A text database is a database that contains text documents or


other word descriptions in the form of long sentences or paragraphs,
such as product specifications, error or bug reports, warning
messages, summary reports, notes, or other documents.

• A multimedia database stores images, audio, and video data, and


is used in applications such as picture content-based retrieval, voice-
mail systems, video-on-demand systems, the World Wide Web, and
speech-based user interfaces.

• The World-Wide Web provides rich, world-wide, on-line


information services, where data objects are linked together to
facilitate interactive access. Some examples of distributed
information services associated with the World-Wide Web include
America Online, Yahoo!, AltaVista, and Prodigy.

Data mining functionalities/Data mining tasks: what kinds of patterns can be mined?

Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories:

• Descriptive
• predictive

Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.

Describe data mining functionalities, and the kinds of patterns they can discover (or)
Define each of the following data mining functionalities: characterization, discrimination,
association and correlation analysis, classification, prediction, clustering, and evolution analysis.
Give examples of each data mining functionality, using a real-life database that you are familiar
with.

1 .4.1 Concept/class description: characterization and discrimination

Data can be associated with classes or concepts. It describes a given set of data in a concise and
summarative manner, presenting interesting general properties of the data. These descriptions can
be derived via
1. data characterization, by summarizing the data of the class under study (often
called the target class)
2. data discrimination, by comparison of the target class with one or a set of
comparative classes
3. both data characterization and discrimination

Data characterization

It is a summarization of the general characteristics or features of a target class of data.

Example:

A data mining system should be able to produce a description summarizing the characteristics of
a student who has obtained more than 75% in every semester; the result could be a general
profile of the student.

Data Discrimination is a comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes.

Example
The general features of students with high GPA’s may be compared with the general features of
students with low GPA’s. The resulting description could be a general comparative profile of the
students such as 75% of the students with high GPA’s are fourth-year computing science
students while 65% of the students with low GPA’s are not.

The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including
crosstabs. The resulting descriptions can also be presented as generalized relations, or in rule
form called characteristic rules.
Discrimination descriptions expressed in rule form are referred to as discriminant rules.

1.4.2 Mining Frequent Patterns, Association and Correlations

It is the discovery of association rules showing attribute-value conditions that occur frequently
together in a given set of data. For example, a data mining system may find association rules like
major(X, “computing science””) ⇒ owns(X, “personal computer”)

[support = 12%, confidence = 98%]


where X is a variable representing a student. The rule indicates that of the students under study,
12% (support) major in computing science and own a personal computer. There is a 98%
probability (confidence, or certainty) that a student in this group owns a personal computer.
Example:

A grocery store retailer to decide whether to put bread on sale. To help determine the impact of
this decision, the retailer generates association rules that show what other products are frequently
purchased with bread. He finds 60% of the times that bread is sold so are pretzels and that 70%
of the time jelly is also sold. Based on these facts, he tries to capitalize on the association
between bread, pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where
the bread is placed. In addition, he decides not to place either of these items on sale at the same
time.

1.4.3 Classification and prediction

Classification:

Classification:

 It predicts categorical class labels


 It classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
Typical Applications:

 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis

Classification can be defined as the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on the analysis of
a set of training data (i.e., data objects whose class label is known).

Example:

An airport security screening station is used to deter mine if passengers are potential terrorist or
criminals. To do this, the face of each passenger is scanned and its basic pattern(distance between
eyes, size, and shape of mouth, head etc) is identified. This pattern is compared to entries in a
database to see if it matches any patterns that are associated with known offenders

A classification model can be represented in various forms, such as

1) IF-THEN rules,

student ( class , "undergraduate") AND concentration ( level, "high") ==> class A


student (class ,"undergraduate") AND concentrtion (level,"low") ==> class B

student (class , "post graduate") ==> class C

2) Decision tree

3) Neural Network
Prediction:

Find some missing or unavailable data values rather than class labels referred to as prediction.
Although prediction may refer to both data value prediction and class label prediction, it is
usually confined to data value prediction and thus is distinct from classification. Prediction also
encompasses the identification of distribution trends based on the available data.

Example:

Predicting flooding is difficult problem. One approach is uses monitors placed at various points
in the river. These monitors collect data relevant to flood prediction: water level, rain amount,
time, humidity etc. These water levels at a potential flooding point in the river can be predicted
based on the data collected by the sensors upriver from this point. The prediction must be made
with respect to the time the data were collected.

Classification vs. Prediction

Classification differs from prediction in that the former is to construct a set of models (or
functions) that describe and distinguish data class or concepts, whereas the latter is to predict
some missing or unavailable, and often numerical, data values. Their similarity is that they are
both tools for prediction: Classification is used for predicting the class label of data objects and
prediction is typically used for predicting missing numerical data values.

1.4.4 Clustering analysis

Clustering analyzes data objects without consulting a known class label.

The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity.
Each cluster that is formed can be viewed as a class of objects.
Clustering can also facilitate taxonomy formation, that is, the organization of observations into a
hierarchy of classes that group similar events together as shown below:
Example:

A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location and physical characteristics of
potential customers (age, height, weight, etc). To determine the target mailings of the various
catalogs and to assist in the creation of new, more specific catalogs, the company performs a
clustering of potential customers based on the determined attribute values. The results of the
clustering exercise are the used by management to create special catalogs and distribute them to
the correct target population based on the cluster for that catalog.

Classification vs. Clustering

 In general, in classification you have a set of predefined classes and want to know
which class a new object belongs to.
 Clustering tries to group a set of objects and find whether there is some
relationship between the objects.
 In the context of machine learning, classification is supervised learning and
clustering is unsupervised learning.

1.4.5 Outlier analysis: A database may contain data objects that do not comply with general
model of data. These data objects are outliers. In other words, the data objects which do not fall
within the cluster will be called as outlier data objects. Noisy data or exceptional data are also
called as outlier data. The analysis of outlier data is referred to as outlier mining.

Example
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges incurred
by the same account. Outlier values may also be detected with respect to the location and type of
purchase, or the purchase frequency.

1.4.6 Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time.

Example:
The data of result the last several years of a college would give an idea if quality of graduated
produced by it

Correlation analysis

Correlation analysis is a technique use to measure the association between two variables. A
correlation coefficient (r) is a statistic used for measuring the strength of a supposed linear
association between two variables. Correlations range from -1.0 to +1.0 in value.

A correlation coefficient of 1.0 indicates a perfect positive relationship in which high values of
one variable are related perfectly to high values in the other variable, and conversely, low values
on one variable are perfectly related to low values on the other variable.

A correlation coefficient of 0.0 indicates no relationship between the two variables. That is, one
cannot use the scores on one variable to tell anything about the scores on the second variable.

A correlation coefficient of -1.0 indicates a perfect negative relationship in which high values of
one variable are related perfectly to low values in the other variables, and conversely, low values
in one variable are perfectly related to high values on the other variable.

What is the difference between discrimination and classification? Between characterization


and clustering? Between classification and prediction? For each of these pairs of tasks, how
are they similar?
Answer:
• Discrimination differs from classification in that the former refers to a comparison of the
general features of target class data objects with the general features of objects from one or a set
of contrasting classes, while the latter is the process of finding a set of models (or functions) that
describe and distinguish data classes or concepts for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. Discrimination and classification are
similar in that they both deal with the analysis of class data objects.
• Characterization differs from clustering in that the former refers to a summarization of the
general characteristics or features of a target class of data while the latter deals with the analysis
of data objects without consulting a known class label. This pair of tasks is similar in that they
both deal with grouping together objects or data that are related or have high similarity in
comparison to one another.
• Classification differs from prediction in that the former is the process of finding a set of models
(or functions) that describe and distinguish data class or concepts while the latter predicts
missing or unavailable, and often numerical, data values. This pair of tasks is similar in that they
both are tools for
Prediction: Classification is used for predicting the class label of data objects and prediction is
typically used for predicting missing numerical data values.
1.5 Are all of the patterns interesting? / What makes a pattern interesting? A

pattern is interesting if,

(1) It is easily understood by humans,

(2) Valid on new or test data with some degree of certainty,

(3) Potentially useful, and

(4) Novel.

A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An
interesting pattern represents knowledge.

1.6 Which Technologies Are Used?


(1) Statistics: Statistics studies the collection, analysis, interpretation or explanation, and
presentation of data. A statistical model is a set of mathematical functions that describe the
behaviour of the objects in a target class in terms of random variables and their associated
probability distributions. Statistical models are widely used to model data and data classes.
For example, in datamining tasks like data characterization and classification, statistical models
of target classes can be built. For example, we can use statistics to model noise and missing data
values. Then, when mining patterns in a large data set, the data mining process can use the model
to help identify and handle noisy or missing values in the data. Statistics research develops tools
for prediction and forecasting using data and statistical models. Statistical methods can be used to
summarize or describe a collection of data.
Inferential statistics (or predictive statistics) models data in a way that accounts for randomness
and uncertainty in the observations and is used to draw inferences about the process or
population under investigation. Statistical methods can also be used to verify data mining results.
For example, after a classification or prediction model is mined, the model should be verified by
statistical hypothesis testing.
A statistical hypothesis test (sometimes called confirmatory data analysis) makes statistical
decisions using experimental data. A result is called statistically significant if it is unlikely to
have occurred by chance. If the classification or prediction model holds true, then the descriptive
statistics of the model increases the soundness of the model.
(2) Machine Learning
Machine learning investigates how computers can learn (or improve their performance) based on
data.
Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set.
For example, in the postal code recognition problem, a set of handwritten postal code images
and their corresponding machine-readable translations are used as the training examples, which
supervise the learning of the classification model.
Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labelled. Typically, we may use clustering to
discover classes within the data.
For example, an unsupervised learning method can take, as input , a set of images of
handwritten digits. Suppose that it finds 10 clusters of data. These clusters may correspond to the
10 distinct digits of 0 to 9, respectively. However, since the training data are not labelled, the
learned model cannot tell us the semantic meaning of the clusters found.
Semi-supervised learning is a class of machine learning techniques that make use of both
labelled and unlabelled examples when learning a model. In one approach, labelled examples are
used to learn class models and unlabelled examples are used to refine the boundaries between
classes.
For a two-class problem, we can think of the set of examples belonging to one class as the
positive examples and those belonging to the other class as the negative examples.
In Figure, if we do not consider the unlabelled examples, the dashed line is the decision boundary
that best partitions the positive examples from the negative examples. Using the unlabelled
examples, we can refine the decision boundary to the solid line. Moreover, we can detect that the
two positive examples at the top right corner, though labelled, are likely noise or outliers.

Active learning is a machine learning approach that lets users play an active role in the learning
process. An active learning approach can ask a user (e.g., a domain expert) to label an example,
which may be from a set of unlabelled examples or synthesized by the learning program. The
goal is to optimize the model quality by actively acquiring knowledge from human users, given a
constraint on how many examples they can be asked to label.

(3) Database Systems and Data Warehouses Database systems research focuses on the
creation, maintenance, and use of databases for organizations and end-users. Particularly,
database systems researchers have established highly recognized principles in data models, query
languages, query processing and optimization methods, data storage, and indexing and accessing
methods. Database systems are often well known for their high scalability in processing very
large, relatively structured data sets.
Many data mining tasks need to handle large data sets or even real-time, fast streaming data.
Therefore, data mining can make good use of scalable database technologies to achieve high
efficiency and scalability on large datasets. More over,datamining tasks can be used to extend the
capability of existing database systems to satisfy advanced users’ sophisticated data analysis
requirements. Recent database systems have built systematic data analysis capabilities on
database data using data warehousing and data mining facilities. A data warehouse integrates
data originating from multiple sources and various timeframes. It consolidates data in
multidimensional space to form partially materialized data cubes. The data cube model not only
facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.
(4) Information Retrieval: Information retrieval (IR) is the science of searching for documents
or information in documents. Documents can be text or multimedia, and may reside on the Web.
The differences between traditional information retrieval and database systems are twofold:
Information retrieval assumes that (1) the data under search are unstructured; and (2) the queries
are formed mainly by keywords, which do not have complex structures (unlike SQL queries in
database systems). The typical approaches in information retrieval adopt probabilistic models.
For example, a text document can be regarded as a bag of words, that is, a multiset of words
appearing in the document. The document’s language model is the probability density function
that generates the bag of words in the document. The similarity between two documents can be
measured by the similarity between their corresponding language models. Furthermore, a topic in
a set of text documents can be modelled as a probability distribution over the vocabulary, which
is called a topic model. A text document, which may involve one or multiple topics, can be
regarded as a mixture of multiple topic models.

1.7 Which Kinds of Applications Are Targeted?


Business Intelligence: It is critical for businesses to acquire a better understanding of the
commercial context of their organization, such as their customers, the market, supply and
resources, and competitors. Business intelligence (BI) technologies provide historical, current,
and predictive views of business operations.
Examples include reporting, online analytical processing, business performance management,
competitive intelligence, benchmarking, and predictive analytics.
“How important is business intelligence?” Without data mining, many businesses may not be
able to perform effective market analysis, compare customer feedback on similar products,
discover the strengths and weaknesses of their competitors, retain highly valuable customers, and
make smart business decisions. Clearly, data mining is the core of business intelligence.
Online analytical processing tools in business intelligence rely on data warehousing and
multidimensional data mining. Classification and prediction techniques are the core of predictive
analytics in business intelligence, for which there are many applications in analysing markets,
supplies, and sales. Moreover, clustering plays a central role in customer relationship
management, which groups customers based on their similarities. Using characterization mining
techniques, we can better understand features of each customer group and develop customized
customer reward programs.
Web Search Engines: A Web search engine is a specialized computer server that searches for
information on the Web. The search results of a user query are often returned as a list (sometimes
called hits). The hits may consist of web pages, images, and other types of files.
Some search engines also search and return data available in public databases or open directories.
Search engines differ from web directories in that web directories are maintained by human
editors whereas search engines operate algorithmically or by a mixture of algorithmic and human
input.
Web search engines are essentially very large data mining applications. Various data mining
techniques are used in all aspects of search engines, ranging from crawling5 (e.g., deciding
which pages should be crawled and the crawling frequencies), indexing (e.g., selecting pages to
be indexed and deciding to which extent the index should be constructed), and searching (e.g.,
deciding how pages should be ranked, which advertisements should be added, and how the
search results can be personalized or made “context aware”).
Search engines pose grand challenges to data mining.
(1)They have to handle a huge and ever-growing amount of data. Typically, such data cannot be
processed using one or a few machines. Instead, search engines often need to use computer
clouds, which consist of thousands or even hundreds of thousands of computers that
collaboratively mine the huge amount of data. A search engine may be able to afford
constructing a model offline on huge data sets. To do this, it may construct a query classifier that
assigns a search query to predefined categories based on the query topic (i.e., whether the search
query “apple” is meant to retrieve information about a fruit or a brand of computers). Whether a
model is constructed offline, the application of the model online must be fast enough to answer
user queries in real time.
(2)Maintaining and incrementally updating a model on fast growing data streams.
For example, a query classifier may need to be incrementally maintained continuously since new
queries keep emerging and predefined categories and the data distribution may change. Most of
the existing model training methods are offline and static and thus cannot be used in such a
scenario.
(3) Web search engines often have to deal with queries that are asked only a very small number
of times. Suppose a search engine wants to provide context-aware query recommendations. That
is, when a user poses a query, the search engine tries to infer the context of the query using the
user’s profile and his query history in order to return more customized answers within a small
fraction of a second. However, although the total number of queries asked can be huge, most of
the queries may be asked only once or a few times. Such severely skewed data are challenging
for many data mining and machine learning methods.

You might also like