
Data extraction in a data warehouse is the process of retrieving data from various source systems so that it can be loaded into the warehouse. This step is crucial because it ensures that the data warehouse contains accurate and timely data for analysis. Here's a simplified explanation of the process and the important concepts involved:

Simplified Explanation of Data Extraction

1. Identifying Data Sources: Determine where the data is coming from. This could be
databases, cloud services, spreadsheets, or other data storage systems.
2. Connecting to Data Sources: Establish connections to these sources using
software tools or custom scripts. These connections allow the data extraction
process to access and retrieve data.
3. Extracting Data: Pull the necessary data from the sources. This can involve copying
entire databases, specific tables, or just certain rows and columns that meet
specific criteria.
4. Transforming Data (Optional): Sometimes data needs to be cleaned or formatted
before loading it into the data warehouse. When transformation sits between
extraction and loading, the overall workflow is called ETL (Extract, Transform,
Load), with extraction as its first phase.
5. Loading Data into Data Warehouse: Finally, the extracted (and possibly
transformed) data is loaded into the data warehouse where it can be stored and
later analyzed.
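To make these five steps concrete, here is a minimal, self-contained sketch in Python, using the standard-library sqlite3 module as a stand-in for real source and warehouse databases. All table and column names (orders, fact_orders, and so on) are invented for illustration, not taken from any particular tool:

```python
import sqlite3  # stand-in driver; a real pipeline might use MySQL or Postgres clients

# Hypothetical source and warehouse databases (step 1: source identified,
# step 2: connections established).
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect("warehouse.db")

# Seed a tiny source table so the sketch runs on its own.
source.execute("CREATE TABLE orders (order_id INTEGER, amount TEXT, order_date TEXT)")
source.execute("INSERT INTO orders VALUES (1, '19.99', '2024-01-05')")

# Step 3: extract only the rows and columns we need.
rows = source.execute("SELECT order_id, amount, order_date FROM orders").fetchall()

# Step 4 (optional transform): cast amounts to numbers, keep dates as ISO strings.
cleaned = [(oid, float(amount), order_date) for oid, amount, order_date in rows]

# Step 5: load into the warehouse.
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS fact_orders "
    "(order_id INTEGER, amount REAL, order_date TEXT)"
)
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
warehouse.commit()
```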

Important Concepts in Data Extraction

1. ETL Process:
• Extract: Retrieving data from source systems.
• Transform: Cleaning, formatting, and preparing the data.
• Load: Storing the prepared data in the data warehouse.
2. Data Sources: The systems or files where the data originally resides. Common sources include:
• Relational databases (e.g., MySQL, Oracle)
• Cloud storage (e.g., Amazon S3, Google Cloud Storage)
• Flat files (e.g., CSV, Excel)
• Web services and APIs
3. Extraction Methods:
• Full Extraction: Extracting all data from the source system, typically used when setting up the data warehouse for the first time.
• Incremental Extraction: Extracting only the data that has changed since the last extraction, used for ongoing updates (see the sketch after this list).
4. Data Cleaning: The process of correcting or removing inaccurate records from the dataset. This step ensures that the data loaded into the warehouse is reliable.
5. Data Transformation: Converting the data into a format suitable for analysis, which might include normalization, aggregation, and formatting changes.
6. Data Loading: Moving the data into the data warehouse, which can be done using bulk load operations for large datasets or incremental loads for updates.
7. Data Integration: Combining data from different sources to provide a unified view, which is a key objective of the data extraction process in a data warehouse.
8. Scheduling and Automation: Setting up regular schedules for data extraction to ensure the data warehouse is always up-to-date. Automation tools can help manage these processes efficiently.
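Incremental extraction is usually implemented with a high-water mark, such as the timestamp of the last successful run. Below is a minimal sketch, assuming the source table carries an updated_at column (a common but not universal convention); all names are invented:

```python
import sqlite3

# Hypothetical source table with an updated_at column.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, updated_at TEXT)")
source.execute("INSERT INTO orders VALUES (1, 19.99, '2024-02-01 10:00:00')")

warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("CREATE TABLE IF NOT EXISTS etl_state (last_extracted TEXT)")

# Read the high-water mark left by the previous run (epoch start if none).
row = warehouse.execute("SELECT last_extracted FROM etl_state").fetchone()
last_extracted = row[0] if row else "1970-01-01 00:00:00"

# Pull only the rows changed since the last successful extraction.
changed = source.execute(
    "SELECT order_id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_extracted,),
).fetchall()

# ...load `changed` into the warehouse here, then advance the high-water mark...
if changed:
    warehouse.execute("DELETE FROM etl_state")
    warehouse.execute("INSERT INTO etl_state VALUES (?)",
                      (max(r[2] for r in changed),))
    warehouse.commit()
```

The high-water mark is just the simplest option; change-data-capture logs, database triggers, or snapshot comparison are other common ways to detect changed rows.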

Example Scenario

Imagine a company that sells products online and in physical stores. They have multiple
data sources:

• Online sales data stored in a MySQL database
• In-store sales data stored in a spreadsheet
• Customer feedback from a cloud-based service

To analyze their overall sales performance, the company needs to extract data from these
sources and load it into a data warehouse. They would:

1. Identify the data sources: MySQL database, spreadsheet, and cloud service.
2. Use a tool to connect to these sources and extract the necessary data.
3. Clean the data (e.g., removing duplicates, correcting errors).
4. Transform the data (e.g., converting all dates to the same format).
5. Load the data into the data warehouse.

Once in the data warehouse, the company can analyze the data to understand sales
trends, customer preferences, and other important insights.
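A rough sketch of what integrating two of these sources might look like with pandas; the cloud-based feedback source would typically be fetched over an HTTP API and is omitted here for brevity. File, table, and column names are invented:

```python
import sqlite3
import pandas as pd

# Stand-in for the MySQL online-sales database.
online_db = sqlite3.connect(":memory:")
online_db.execute("CREATE TABLE online_sales (order_date TEXT, product TEXT, amount REAL)")
online_db.execute("INSERT INTO online_sales VALUES ('2024-01-05', 'A', 19.99)")

online = pd.read_sql_query("SELECT * FROM online_sales", online_db)

# Stand-in for the in-store spreadsheet export (normally pd.read_csv / read_excel).
in_store = pd.DataFrame({"order_date": ["05/01/2024"], "product": ["B"], "amount": [9.5]})

# Transform: bring both date formats to one standard before integrating.
online["order_date"] = pd.to_datetime(online["order_date"])
in_store["order_date"] = pd.to_datetime(in_store["order_date"], dayfirst=True)

# Integrate and clean, then load the unified view into the warehouse.
all_sales = pd.concat([online, in_store], ignore_index=True).drop_duplicates()
all_sales.to_sql("fact_sales", sqlite3.connect("warehouse.db"),
                 if_exists="append", index=False)
```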

By following these steps and understanding the key concepts, the company ensures that
their data warehouse contains accurate, timely, and integrated data for effective decision-
making.
Data Warehouse Schemas

Schema Type | Structure | Characteristics | Use Case
Star Schema | Central fact table with denormalized dimension tables | Simplified, fewer joins, easier querying | Simple queries and straightforward reporting
Snowflake Schema | Central fact table with normalized dimension tables | More complex, more joins, better data integrity | Complex queries and handling of data redundancy
Galaxy Schema | Multiple fact tables sharing dimension tables | Combines multiple star schemas, comprehensive | Complex databases with multiple fact tables
• Fact Table: Central table that contains quantitative data (metrics) for analysis,
such as sales amount, units sold, etc.
• Dimension Tables: Surround the fact table and contain descriptive attributes
related to the fact data, such as date, product, customer, etc.
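As an illustration, a minimal star schema for sales could be defined like this (all table and column names are invented; sqlite3 stands in for a real warehouse database):

```python
import sqlite3

db = sqlite3.connect("warehouse.db")

# Dimension tables: descriptive attributes surrounding the facts.
db.execute("""CREATE TABLE IF NOT EXISTS dim_date (
    date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER)""")
db.execute("""CREATE TABLE IF NOT EXISTS dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT, category TEXT)""")

# Fact table: quantitative measures, with foreign keys into the dimensions.
db.execute("""CREATE TABLE IF NOT EXISTS fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units_sold INTEGER,
    sales_amount REAL)""")
db.commit()
```

In a snowflake schema, dim_product would itself be normalized further, e.g., into a separate category table referenced by a foreign key.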

Online Transaction Processing (OLTP) in Simple Words

OLTP (Online Transaction Processing) systems are designed to manage and facilitate
transaction-oriented applications. These systems are optimized for handling a large
number of short, online transaction requests.

Key Points to Understand OLTP

1. Purpose:
• OLTP systems are used to manage day-to-day transactional data. They are designed to support real-time, high-volume transaction processing.
2. Transactions:
• A transaction is a single unit of work that typically involves reading and writing to a database. Examples include ATM withdrawals, online bookings, order entries, and inventory updates.
3. Key Characteristics:
• High Volume of Transactions: OLTP systems process a large number of transactions per second.
• Short Transactions: Each transaction is usually short in duration and involves small amounts of data.
• Real-Time Processing: Transactions are processed immediately, ensuring up-to-date data and quick responses.
• Data Integrity: Ensures accuracy and consistency of data, using mechanisms like ACID properties (Atomicity, Consistency, Isolation, Durability).
4. ACID Properties (a small sketch follows this list):
• Atomicity: Ensures that all parts of a transaction are completed successfully. If any part fails, the entire transaction is rolled back.
• Consistency: Ensures that a transaction takes the database from one valid state to another, maintaining database rules.
• Isolation: Ensures that transactions are processed independently, without interference from other concurrent transactions.
• Durability: Ensures that once a transaction is committed, it is permanently stored in the database, even in the event of a system failure.
5. Examples of OLTP Systems:
• Banking Systems: Processing deposits, withdrawals, transfers.
• E-commerce: Managing online orders, inventory, customer data.
• Retail: Point-of-sale systems, stock management.
• Airline Reservation Systems: Booking flights, updating seat availability.
6. Database Design:
• Normalized Tables: OLTP databases are highly normalized to reduce redundancy and ensure data integrity.
• Indexes: Used extensively to speed up query performance.
• Small and Simple Queries: Designed to handle a high number of simple queries that typically involve specific rows of data.
7. User Interaction:
• OLTP systems support multiple users performing transactions simultaneously. They ensure that each user gets a consistent view of the data.
8. Performance Metrics:
• Response Time: The time taken to complete a transaction.
• Throughput: The number of transactions processed per second.
• Availability: Ensuring the system is available for users most of the time.
9. Technology:
• OLTP systems often use relational databases like MySQL, PostgreSQL, Oracle, and SQL Server. These databases are optimized for transaction processing.
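A small sketch of atomicity and durability in practice, using Python's built-in sqlite3 module (the accounts table and balances are invented for the example): a bank transfer either applies both updates or neither.

```python
import sqlite3

db = sqlite3.connect("bank.db")
db.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")
db.execute("INSERT OR IGNORE INTO accounts VALUES (1, 100.0), (2, 50.0)")
db.commit()

try:
    # Both updates belong to one transaction: debit one account, credit the other.
    db.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    db.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    db.commit()      # durability: once committed, the transfer is permanent
except sqlite3.Error:
    db.rollback()    # atomicity: on any failure, neither update survives
```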
Summary

OLTP systems are essential for managing and processing large volumes of transactions in
real time, ensuring data integrity and consistency. They are widely used in various
industries like banking, retail, and e-commerce to handle day-to-day operations efficiently.
These systems are characterized by their ability to process short, simple transactions
quickly, support multiple concurrent users, and maintain high availability and
performance.

Online Analytical Processing (OLAP) in Simple Words

OLAP (Online Analytical Processing) systems are designed to handle complex queries
and support decision-making through data analysis. These systems are optimized for
querying and reporting, rather than transaction processing.

Key Points to Understand OLAP

1. Purpose:
• OLAP systems are used for analyzing large volumes of data. They help in generating insights from historical data to support business decisions.
2. Data Organization:
• Data in OLAP systems is typically organized into multidimensional cubes, which allow for efficient querying and analysis.
3. Key Characteristics:
• Complex Queries: OLAP systems are designed to handle complex queries that can involve aggregations and calculations.
• Historical Data: Often work with large sets of historical data to identify trends and patterns.
• Data Aggregation: Summarizes data across various dimensions for analysis.
• User-Friendly: Tools often include graphical interfaces for easy interaction by business users.
4. Types of OLAP:
• MOLAP (Multidimensional OLAP): Uses multidimensional cubes for data storage and offers fast query performance.
• ROLAP (Relational OLAP): Uses relational databases to store data and dynamically create multidimensional views.
• HOLAP (Hybrid OLAP): Combines features of both MOLAP and ROLAP.
5. OLAP Operations (see the code sketch after this list):
• Roll-Up (Drill-Up): Aggregates data along a dimension. For example, rolling up daily sales data to monthly sales data.
• Drill-Down: Opposite of roll-up; breaks down data into finer details. For example, drilling down from yearly sales data to quarterly sales data.
• Slice: Extracts a single layer of data from a cube. For example, viewing sales data for a specific region.
• Dice: Extracts a sub-cube by specifying a range of values for multiple dimensions. For example, viewing sales data for a specific product line and time period.
• Pivot (Rotate): Rotates the data cube to view it from different perspectives. For example, switching rows and columns to see different dimensions of data.
• Drill-Through: Accesses detailed data from the OLAP cubes by navigating from summary data to detailed data in the underlying databases.
6. Example Scenario:
• A retail company uses an OLAP system to analyze sales data. They might:
▪ Roll-Up: Aggregate daily sales data into monthly or yearly sales data.
▪ Drill-Down: Break down yearly sales data into quarterly, monthly, or daily sales data.
▪ Slice: View sales data for a specific product category.
▪ Dice: Analyze sales data for a particular product category over a specific time period.
▪ Pivot: Change the view to compare sales data across different regions.
7. Benefits:
• Improved Decision Making: Provides insights through data analysis, helping businesses make informed decisions.
• Fast Query Performance: Optimized for complex queries and aggregations.
• Multidimensional Analysis: Allows users to analyze data from multiple perspectives and dimensions.
8. Technology:
• OLAP systems often use specialized tools and software like Microsoft SQL Server Analysis Services (SSAS), Oracle OLAP, and IBM Cognos. These tools are designed to handle the complexity and volume of data required for analytical processing.
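These operations map loosely onto familiar DataFrame manipulations. The sketch below mimics them with pandas on an invented sales table; it is an analogy for the concepts, not how an OLAP engine is actually implemented:

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024],
    "month":   ["Jan", "Jan", "Feb", "Jan"],
    "region":  ["East", "West", "East", "East"],
    "product": ["A", "A", "B", "B"],
    "amount":  [100.0, 150.0, 80.0, 120.0],
})

# Roll-up: aggregate from (year, month) up to year.
by_year = sales.groupby("year")["amount"].sum()

# Drill-down: break the yearly view back into months.
by_month = sales.groupby(["year", "month"])["amount"].sum()

# Slice: fix a single dimension (region == "East").
east = sales[sales["region"] == "East"]

# Dice: restrict several dimensions at once (a sub-cube).
sub_cube = sales[(sales["region"] == "East") & (sales["product"] == "B")]

# Pivot: rotate the view -- regions as rows, products as columns.
view = sales.pivot_table(index="region", columns="product",
                         values="amount", aggfunc="sum")
```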
Summary

OLAP systems are essential for data analysis and decision support. They allow users to
perform complex queries and analyze large volumes of historical data, providing valuable
insights into business trends and patterns. Key OLAP operations like roll-up, drill-down,
slice, dice, pivot, and drill-through enable users to interact with data in meaningful ways,
making it easier to identify and understand business performance and opportunities.

What is a Data Mart?

A Data Mart is a subset of a data warehouse that is focused on a specific business area or
function. It is designed to meet the needs of a particular group of users, such as sales,
finance, or marketing, by providing them with tailored data relevant to their specific needs.

Key Points to Understand Data Mart

1. Purpose:
• Data marts are created to provide a focused view of data, making it easier and faster for users to retrieve and analyze information pertinent to their business area.
2. Scope:
• Focused: Unlike a data warehouse, which is broad and covers the entire organization, a data mart is limited to a specific subject area or department.
3. Types of Data Marts:
• Dependent Data Mart: Created directly from a data warehouse. It inherits the data from the central data warehouse.
• Independent Data Mart: Built independently from various source systems. It is not directly connected to a data warehouse and may contain data extracted from different sources.
4. Characteristics:
• Subject-Oriented: Organized around a particular subject area (e.g., sales, finance).
• Consistent and Integrated: Data is cleaned, transformed, and integrated, ensuring consistency and accuracy.
• Optimized for Query Performance: Designed to improve query performance and speed up data retrieval for specific business needs.
5. Components:
• Data Source: The origin of the data, which could be transactional systems, databases, or other data marts.
• ETL Process: Extract, Transform, Load process to pull data from sources, clean it, and load it into the data mart.
• Data Storage: The physical storage where the data mart's data resides, often a relational database or a data warehouse.
• User Interface: Tools and applications that allow users to access and analyze the data, such as business intelligence (BI) tools.
6. Benefits:
• Improved Performance: Focused data structure improves query performance, making data retrieval faster.
• Simplified Access: Users have direct access to the data they need, reducing complexity and improving user experience.
• Cost-Effective: Easier and less expensive to implement compared to a full-scale data warehouse.
• Faster Implementation: Data marts can be developed and deployed quickly to address specific business needs.
7. Example Scenarios (a sketch of a dependent sales data mart follows this list):
• Sales Data Mart: Contains data related to sales transactions, customer information, and sales performance metrics.
• Finance Data Mart: Includes financial data such as budgets, expenditures, and financial statements.
• Marketing Data Mart: Focuses on marketing data like campaign performance, customer demographics, and market trends.
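A dependent sales data mart can be as simple as a filtered, subject-specific copy of a warehouse table, indexed for the sales team's typical queries. A minimal sketch (all names invented):

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stand-in for the central warehouse
warehouse.execute(
    "CREATE TABLE fact_sales (order_date TEXT, product TEXT, region TEXT, amount REAL)"
)
warehouse.execute("INSERT INTO fact_sales VALUES ('2024-01-05', 'A', 'East', 19.99)")

sales_mart = sqlite3.connect("sales_mart.db")

# Pull only the sales-relevant subset from the central warehouse (dependent mart).
rows = warehouse.execute(
    "SELECT order_date, product, region, amount FROM fact_sales "
    "WHERE order_date >= '2024-01-01'"
).fetchall()

# Materialize it in the mart, indexed for the sales team's queries.
sales_mart.execute(
    "CREATE TABLE IF NOT EXISTS sales "
    "(order_date TEXT, product TEXT, region TEXT, amount REAL)"
)
sales_mart.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales(region)")
sales_mart.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
sales_mart.commit()
```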

Summary

A Data Mart is a valuable component in a data architecture, tailored to specific business areas or functions. It simplifies data access, enhances performance, and speeds up
decision-making by providing relevant, organized data to users. Data marts can be built
directly from a central data warehouse or independently from various data sources,
making them flexible and efficient tools for data analysis and reporting.

Practice Questions

• What is the primary purpose of a data warehouse?
• A) Transaction processing
• B) Real-time processing
• C) Data analysis and reporting
• D) Data entry
• Answer: C
• Which component is used to extract data from source systems in a data
warehouse?
• A) OLAP
• B) ETL
• C) Data Mart
• D) Data Cube
• Answer: B
• What does OLAP stand for?
• A) Online Log Analysis Processing
• B) Online Analytical Processing
• C) Offline Analytical Processing
• D) Online Data Processing
• Answer: B
• What is a fact table in a data warehouse?
• A) Table containing metadata
• B) Table with descriptive data
• C) Table with numerical data for analysis
• D) Table with historical data
• Answer: C
• What is a dimension table?
• A) Table with transaction data
• B) Table with primary key
• C) Table with descriptive attributes
• D) Table with analytical functions
• Answer: C
• Which schema uses normalized tables?
• A) Star Schema
• B) Snowflake Schema
• C) Galaxy Schema
• D) Fact Constellation Schema
• Answer: B
• What is the purpose of the "roll-up" operation in OLAP?
• A) Detail view
• B) Aggregating data
• C) Filtering data
• D) Splitting data
• Answer: B
• What does the "slice" operation do in OLAP?
• A) Aggregates data
• B) Filters data from a multidimensional cube
• C) Rotates the data cube
• D) Extracts detailed data
• Answer: B
• Which tool is commonly used for OLAP analysis?
• A) RDBMS
• B) Data Mart
• C) Data Warehouse
• D) BI Tool
• Answer: D
• What is a star schema?
• A) A multidimensional structure with normalized dimensions
• B) A data organization with a central fact table and surrounding dimension tables
• C) A fact table surrounded by multiple fact tables
• D) A type of data mart
• Answer: B
• What is the main function of a data mart?
• A) Store data for long-term analysis
• B) Provide focused data for specific business areas
• C) Manage transactional data
• D) Centralize all company data
• Answer: B
• Which process involves cleaning and transforming data before loading it into
the data warehouse?
• A) Data Extraction
• B) Data Integration
• C) Data Transformation
• D) Data Loading
• Answer: C
• What is the main characteristic of MOLAP?
• A) Uses relational databases
• B) Stores data in multidimensional cubes
• C) Combines features of ROLAP and MOLAP
• D) Processes online transactions
• Answer: B
• Which term refers to detailed data analysis in OLAP?
• A) Drill-Up
• B) Drill-Down
• C) Slice
• D) Pivot
• Answer: B
• What is a "data cube"?
• A) A table storing relational data
• B) A multidimensional array of data
• C) A single-dimensional table
• D) A data mart storage system
• Answer: B
• Which of the following is not a typical characteristic of OLTP systems?
• A) High transaction volume
• B) Complex queries
• C) Real-time processing
• D) Short, simple transactions
• Answer: B
• In which type of schema are dimension tables normalized?
• A) Star Schema
• B) Snowflake Schema
• C) Galaxy Schema
• D) Fact Constellation Schema
• Answer: B
• What is the main advantage of using a data warehouse over operational
databases?
• A) Faster transaction processing
• B) Improved data consistency
• C) Enhanced data analysis capabilities
• D) Real-time updates
• Answer: C
• What does "drill-through" in OLAP allow you to do?
• A) Aggregate data
• B) Access detailed data from summary data
• C) Filter data
• D) Rotate data views
• Answer: B
• Which process is involved in loading data into the data warehouse?
• A) Data Cleaning
• B) Data Extraction
• C) Data Transformation
• D) Data Loading
• Answer: D
• What is the primary focus of an independent data mart?
• A) Integration with a central data warehouse
• B) Building data from various source systems
• C) Normalization of data
• D) Data aggregation
• Answer: B
• Which of the following is a feature of a snowflake schema?
• A) Denormalized dimension tables
• B) Single fact table surrounded by dimension tables
• C) Normalized dimension tables
• D) Multiple fact tables sharing dimensions
• Answer: C
• What does the "pivot" operation do in OLAP?
• A) Aggregates data
• B) Filters data
• C) Rotates data to view from different perspectives
• D) Drills down into data
• Answer: C
• Which is not a benefit of a data mart?
• A) Improved query performance
• B) Simplified data access
• C) Comprehensive data for the entire organization
• D) Faster implementation
• Answer: C
• What is the purpose of "dicing" in OLAP?
• A) Aggregates data
• B) Extracts a sub-cube by specifying ranges
• C) Rotates data views
• D) Filters data
• Answer: B
• What type of data does a fact table typically contain?
• A) Descriptive data
• B) Transactional or quantitative data
• C) Metadata
• D) Organizational data
• Answer: B
• Which OLAP type combines features of MOLAP and ROLAP?
• A) MOLAP
• B) ROLAP
• C) HOLAP
• D) DOLAP
• Answer: C
• What does the "drill-up" operation in OLAP do?
• A) Aggregates data to a higher level
• B) Filters data to a lower level
• C) Rotates the data cube
• D) Accesses detailed data
• Answer: A
• Which of the following is typically part of the ETL process?
• A) Data Retrieval
• B) Data Compression
• C) Data Transformation
• D) Data Backup
• Answer: C
• What type of schema is best for handling complex, large-scale data analysis?
• A) Star Schema
• B) Snowflake Schema
• C) Galaxy Schema
• D) Flat Schema
• Answer: C
