KEMBAR78
Databases Unit Test 3 | PDF | Data Warehouse | Databases
0% found this document useful (0 votes)
23 views8 pages

Databases Unit Test 3

Uploaded by

ayushvanani01
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views8 pages

Databases Unit Test 3

Uploaded by

ayushvanani01
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 8

Question 1: Define a data warehouse.

Compare and contrast the design


of a data warehouse with the design of a third normal form database.

A data warehouse is a centralized repository designed to store, manage, and


analyze large volumes of structured and sometimes semi-structured data. It is
optimized for querying and reporting, enabling organizations to derive insights
from historical and current data. Data in a warehouse is often subject-oriented,
integrated from multiple sources, time-variant, and non-volatile.

Comparison: Data Warehouse vs. Third Normal Form (3NF) Database


Aspect Data Warehouse 3NF Database
Optimized for
Optimized for querying,
Purpose transactional processing
reporting, and data analysis.
(OLTP).
Uses normalized
Data Uses denormalized structures
structures to reduce
Structure like star or snowflake schemas.
redundancy.
Historical and aggregated data Real-time data for day-to-
Focus
for decision-making. day operations.
Query performance is prioritized,
Performan Write and update
often through denormalization
ce performance prioritized.
and indexing.
Typically, single-source,
Data Integrates data from multiple
handling specific
Sources heterogeneous sources.
operational data.
Frequent updates to
Data Rarely updated; data is
support transactional
Updates appended for analysis purposes.
needs.
Highly normalized
Schema Star Schema or Snowflake
schema (up to 3NF or
Design Schema.
higher).
Smaller datasets are
Handles large datasets; uses
Storage designed for compact
partitioning and compression.
storage.

Key Differences
1. Design Goals:
o A data warehouse focuses on efficient querying and analytics, often
employing denormalization for performance.
o A 3NF database ensures data integrity and minimizes redundancy,
aligning with operational needs.
2. Schema Design:
o Data warehouses use star or snowflake schemas with fewer joins for
quicker query performance.
o 3NF databases emphasize many joins due to normalization.
3. Data Updates:
o Data in a warehouse is primarily read-only and updated through
periodic ETL (Extract, Transform, Load) processes.
o 3NF databases undergo frequent updates to handle transactional
workflows.
4. Usage Scenarios:
o Data warehouses support strategic decision-making through historical
data analysis.
o 3NF databases support day-to-day business operations.
Question 2: Define and explain the difference between a fact and a
dimension.

 Fact: A fact represents the numerical or measurable data in a dataset.


These are typically quantitative metrics or values that an organization wants
to analyze, such as sales revenue, quantities, or profit. Facts are usually
stored in tables in a data warehouse.
 Dimension: A dimension provides the descriptive context for facts. These
are qualitative attributes or categories that help interpret the facts, such as
product names, customer details, geographic locations, or time. Dimensions
are stored in dimension tables and are used to slice, filter, or aggregate
the fact data during analysis.

Key Differences Between Fact and Dimension


Aspect Fac Dimension
t

Nature of Data Quantitative Qualitative or


measurable data. descriptive data.

Purpose Represents "what is Provides context or


being measured." "how to describe the
facts."
Example Sales revenue, units Product name,
sold, profit, cost. customer
demographics, region.
Storage Stored in fact Stored in dimension
tables tables.

Table Characteristics Contains foreign keys Contains primary keys


linking to dimension and descriptive
tables and measurable attributes.
columns.
Granularity Highly granular, Less granular,
reflecting individual reflecting categories or
transactions or events. hierarchies.
Analysis role Provides the "values" Provides the context to
for calculations like group, filter, or drill
sum, average, or count. down into the facts.
Example in a Data Warehouse
Scenario: A retailer analyzing sales data.
 Fact Table:
o Columns: Sale_ID (key), Product_ID (foreign key), Customer_ID
(foreign key), Sale_Amount, Quantity.
o Facts: Sale_Amount and Quantity are the measurable data points.
 Dimension Table:
o Product Dimension:
 Columns: Product_ID (key), Product_Name, Category, Brand.
 Dimensions: Product_Name and Category provide descriptive
details for the facts.

Question 3:

1. Provide a high-level description (without modeling) of the fact and dimension


tables required

Fact Table

The fact tables store the key metrics (e.g., total users, usage hours). It will also
capture the activity in the computer labs. It will include metrics like the number of
students, session durations, and applications used.

Dimension Tables
The dimension tables provide descriptive attributes such as time, student
demographics, majors, classes, and software details.

Time Dimension: Stores data about date and time for aggregating metrics by
different time periods.
Student Dimension: Stores demographic and academic details about students,
such as gender, age, major, semester, and class.
Lab Dimension: Contains information about the computer labs, such as location
and available resources.
Application Dimension: Tracks details about the applications used, including the
name, version, and operating system.
Class Dimension: Represents the academic classes or courses students are
associated with.
Major Dimension: Tracks details about the academic majors offered.

2. Describe the granularity of each of the fact tables

Lab Usage Fact Table: One row per student session in a lab, per application
used, with a timestamp. For example, A single session where a student used two
different applications will generate two rows.
3. Present the full table attributes of both the fact and dimension tables

Lab Usage Fact


Column Name Key Data Type Description
usage_id PK INT Unique identifier for the fact row
time_id FK INT Foreign key to the Time Dimension
student_id FK INT Foreign key to the Student
Dimension
lab_id FK INT Foreign key to the Lab Dimension
application_id FK INT Foreign key to the Application
Dimension
class_id FK INT Foreign key to the Class Dimension
major_id FK INT Foreign key to the Major Dimension
num_students INT Number of students in this session
session_duratio FLOAT Duration of the session in hours
n

Time Dimension
Column Ke Data Description
Name y Type
time_id PK INT Unique identifier
date DATE Full date
day_of_wee VARCHAR Day of the week
k
month VARCHAR Month name
year INT Year
quarter VARCHAR Quarter (e.g., Q1, Q2)

Student Dimension
Column Ke Data Type Description
Name y
student_id PK INT Unique identifier
gender CHAR(1) Gender (M/F/Other)
age INT Age of the student
major_id FK INT Foreign key to Major
Dimension
semester VARCHAR Semester (e.g., Fall 2024)

Lab Dimension
Column Ke Data Description
Name y Type
lab_id PK INT Unique identifier
location VARCHAR Lab location
capacity INT Number of workstations
Application Dimension
Column Ke Data Description
Name y Type
application PK INT Unique identifier
_id
app_name VARCHA Name of the application
R
version VARCHA Version of the application
R
os VARCHA Operating system (e.g.,
R Windows)

Class Dimension
Column Ke Data Description
Name y Type
class_id PK INT Unique identifier
class_name VARCHAR Name of the class/course

Major Dimension
Column Ke Data Description
Name y Type
major_id PK INT Unique identifier
major_nam VARCHAR Name of the major
e

4. Provide a diagram representing the data warehouse design

Student Dimension
student_id PK
gender
age
Time Dimension major_id FK
time_id PK semester
date
day_of_week
month Lab Dimension
year lab_id PK
quarter location
Lab Usage
usage_id
Fact PK
time_id FK
student_id FK
lab_id FK
application_id FK
Application Dimension class_id FK
major_id FK
application_id PK num_students
Class
session_durati Major Dimension
app_name major_id PK
Dimension
version class_id PK major_name
os class_nam
e
5. Using sample data (which you will create), provide a sample of three
reports that may be generated from this warehouse data.

Report: Total Number of Users by Month and Major

 Columns: Month, Major, Total Users


 Example Data:

Month Major Total


Users
Januar Computer 450
y Sci
Januar Business 300
y

Report: Application Usage by Gender

 Columns: Application Name, Gender, Total Sessions


 Example Data:

App Gend Total


Name er Sessions
Microsoft M 1200
Word
MATLAB F 850

Report: Average Session Duration by Lab Location

 Columns: Lab Location, Average Duration


 Example Data:

Lab Average
Location Duration (hrs)
Lab A 1.5
Lab B 2.0

6. Provide a solution handling the following cases and explain why your
solution works:

I. When a Student Changes Majors

Solution: Add a new record for the student with the updated major while
preserving the old record for historical tracking. Include fields such as
Effective_Start_Date and Effective_End_Date to manage validity
periods.
Why This Works: It preserves the historical association of the student
with their previous major while linking new activities to the updated
major. This ensures accurate reporting and trend analysis over time.

II. When a Lab Moves Locations

Solution: Add a new record with the updated location details while
keeping the old record for historical reference. Use a surrogate key to
differentiate between the old and new locations.

Why This Works: It maintains a historical record of the lab's prior


location, allowing reports to reflect the correct location for activities
conducted before and after the move.

III. When a New Version of a Software Package Is Installed in the


Lab
Solution: Create a new record for the software with the updated version
and operating system details. Link any new usage data to this record.

Why This Works: This allows the data warehouse to track software
usage at a granular level, distinguishing between different versions of the
same application. It ensures that reporting can provide insights into
which version was used during specific periods.

7. Does your solution include any snow flaking? If so, why?

Yes, Snowflaking Is Used for Lab Location Dimension. The Lab Location
Dimension might reference a separate table for Campus Details (e.g.,
campus name, city, state) to normalize the design. This creates a
snowflaked schema for labs.

Why? Because if multiple labs share common attributes (e.g., campus),


snowflaking reduces redundancy by storing shared attributes in a
separate table. This can improve data consistency and reduce storage
overhead for repetitive data.

You might also like