ASSIGNMENT
ON
DATA WAREHOUSING
Submitted To: Submitted By:
Ms Sakshi Rushil Nagwan
MBA (Final) Sec- B
Roll No. - 112
Data Warehouse
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is
constructed by integrating data from multiple heterogeneous sources that support analytical
reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves
data cleaning, data integration, and data consolidations.
Large companies have presence in many places each of which may generate a large volume of
data for instance, large retail chains have hundreds or thousands of stores, and where as
insurance companies may have data from thousands of local branches. Large organizations have
a complex internal organization structure, and therefore different data may be present in different
locations or on different operational systems or under different schemes. Setting up queries on
individual sources is both cumbersome and inefficient moreover the sources of data may store
only current data whereas decision makers may need access to part data as well for instance
information about how purchase patterns have changed in the past year could be of great
importance.
Data warehouse provide a solution to these problems. A data warehouse is a repository of
information gathered from multiple sources stored under a unified schema at a single site. Once
gathered, the data are stored for a long time permitting access to historical data. Thus data
warehouses provide the user to take decision support queries easier to write.
Organizational Data Warehouse
Data warehousing overcomes various problems that result from the need to connect large
number of decision support systems to large number of operational systems by providing
a hub for subject based historical, consistent and non volatile information.
By connecting decision sup poet stems and operational systems to a centralized hub the
number of interfaces can be reduced dramatically and information quality can be
guaranteed more effectively.
Several studies point out that organization related issues area among the most critical
success factor for data warehouse project.
Most projects (enterprise data warehousing) project fail for political and organizational
reasons, rather than for technical ones.
As a foundation for developing the organization of data warehousing the concept of data
ownership has to be derived from traditional process-oriented ownership concepts.
Characteristics of Data Warehouse Design
1. Theme-Focused
A data warehouse design uses a particular theme. It provides information concerning a
subject rather than a business’s operations. These themes can be related to sales,
advertising, marketing, and more. Instead of focusing on the business operations or
transactions, data warehousing emphasizes on business intelligence (BI) that is,
displaying and analyzing data for decision-making. It also offers a straightforward and
succinct interpretation of the particular theme by eliminating data that may not be useful
for decision-makers.
2. Unified
A data warehouse design unifies and integrates all analogous data from different
databases in a collectively acceptable way using data modeling. It incorporates data from
diverse sources such as relational and non-relational databases, flat files, mainframe,
cloud-based systems, etc. Besides, a data warehouse must maintain consistent
nomenclature, layout, and coding to facilitate effective data analysis.
3. Time Variance
Unlike other operational systems, a data warehouse stores data collected over an
extensive time horizon. The data gathered is identified with specific time duration and
provides insights from the past perspective. Moreover, when data is entered in the
warehouse, it cannot be restructured or altered.
4. Non-volatility
Another important characteristic is non-volatility which means that the preceding data is
not removed when new data is loaded to the data warehouse. Moreover, data is only
readable and can be intermittently refreshed to deliver a complete and updated picture to
the user.
Data warehouse Architecture and its seven components
1. Data sourcing, cleanup, transformation, and migration tools
2. Metadata repository
3. Warehouse/database technology
4. Data marts
5. Data query, reporting, analysis, and mining tools
6. Data warehouse administration and management
7. Information delivery system
Data warehouse is an environment, not a product which is based on relational database
management system that functions as the central repository for informational data. The central
repository information is surrounded by number of key components designed to make the
environment is functional, manageable and accessible.
The data source for data warehouse is coming from operational applications. The data entered
into the data warehouse transformed into an integrated structure and format. The transformation
process involves conversion, summarization, filtering and condensation. The data warehouse
must be capable of holding and managing large volumes of data as well as different structure of
data structures over the time.
1. Data warehouse database
This is the central part of the data warehousing environment. This is the item number 2 in
the above arch. diagram. This is implemented based on RDBMS technology.
2. Sourcing, Acquisition, Clean up, and Transformation Tools
This is item number 1 in the above arch diagram. They perform conversions,
summarization, key changes, structural changes and condensation. The data
transformation is required so that the information can be used by decision support tools.
The transformation produces programs, control statements, JCL code, COBOL code,
UNIX scripts, and SQL DDL code etc., to move the data into data warehouse from
multiple operational systems.
The functionalities of these tools are listed below:
To remove unwanted data from operational db
Converting to common data names and attributes
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition change.
Issues to be considered while data sourcing, cleanup, extract and transformation:
Data heterogeneity: It refers to DBMS different nature such as it may be in different data
modules, it may have different access languages, it may have data navigation methods,
operations, concurrency, integrity and recovery processes etc.,
Data heterogeneity: It refers to the different way the data is defined and used in different
modules.
Some experts involved in the development of such tools:
Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton
3. Meta data
It is data about data. It is used for maintaining, managing and using the data warehouse. It
is classified into two:
1. Technical Meta data: It contains information about data warehouse data used by
warehouse designer, administrator to carry out development and management tasks.
It includes,
Info about data stores
Transformation descriptions. That is mapping methods from operational db to
warehoused
Warehouse Object and data structure definitions for target data
The rules used to perform clean up, and data enhancement
Data mapping operations
Access authorization, backup history, archive history, info delivery history, data
acquisition history, data access etc.,
2. Business Meta data: It contains info that gives info stored in data warehouse to users.
It includes,
Subject areas, and info object type including queries, reports, images, video, audio clips etc.
Internet home pages
Info related to info delivery system
Data warehouse operational info such as ownerships, audit trails etc.,
Meta data helps the users to understand content and find the data. Meta data are stored in a
separate data stores which is known as informational directory or Meta data repository which
helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
It is the gateway to the data warehouse environment
It supports easy distribution and replication of content for high performance and
availability
It should be searchable by business oriented key words
It should act as a launch platform for end user to access data and analysis tools
It should support the sharing of info
It should support scheduling options for request
IT should support and provide interface to other applications
It should support end user monitoring of the status of the data warehouse environment
4. Access tools
Its purpose is to provide info to business users for decision making. There are five
categories:
Data query and reporting tools
Application development tools
Executive info system tools (EIS)
OLAP tools
Data mining tools
Query and reporting tools are used to generate query and report. There are two types
of reporting tools. They are:
Production reporting tool used to generate regular operational reports
Desktop report writer are inexpensive desktop tools designed for end users.
Managed Query tools: used to generate SQL query. It uses Meta layer software in between
users and databases which offers a point-and-click creation of SQL statement. This tool is a
preferred choice of users to perform segment identification, demographic analysis, territory
management and preparation of customer mailing lists etc.
Application development tools: This is a graphical data access environment which integrates
OLAP tools with data warehouse and can be used to access all db systems
OLAP Tools: are used to analyze the data in multi dimensional and complex views. To enable
multidimensional properties it uses MDDB and MRDB where MDDB refers multi dimensional
data base and MRDB refers multi relational data bases.
Data mining tools: are used to discover knowledge from the data warehouse data also can be
used for data visualization and data correction purposes.
5. Data marts
Departmental subsets that focus on selected subjects. They are independent used by
dedicated user group. They are used for rapid delivery of enhanced decision support
functionality to end users. Data mart is used in the following situation:
Extremely urgent user requirement
The absence of a budget for a full scale data warehouse strategy
The decentralization of business needs
The attraction of easy to use tools and mind sized project
Data mart presents two problems:
1. Scalability: A small data mart can grow quickly in multi dimensions. So that while designing
it, the organization has to pay more attention on system scalability, consistency and
manageability issues
2. Data integration
6. Data warehouse admin and management
The management of data warehouse includes,
Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating meta data
Auditing and reporting data warehouse usage and status
Purging data
Replicating, sub setting and distributing data
Backup and recovery
Data warehouse storage management which includes capacity planning, hierarchical
storage management and purging of aged data etc.,
7. Information delivery system
It is used to enable the process of subscribing for data warehouse info.
Delivery to one or more destinations according to specified scheduling algorithm