Data Modeling Strategies
Databricks Academy
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Agenda
Data Warehouse Data Modeling (Lecture, Demo, Lab)
Agenda
Modern Data Architecture Use Cases (Lecture, Demo, Lab)
Course Objectives
● Design and implement data models tailored to specific business needs
within the Databricks Lakehouse environment.
● Differentiate between different types of modeling techniques and
understand their respective use cases.
● Analyze business needs to determine data modeling decisions.
● Design logical and physical data models for specific use cases.
Course Objectives
● Explore the stages of the data product lifecycle.
● Understand the definition and use cases of data products.
● Understand the data product lifecycle.
● Organize Data in Domains and in Unity Catalog.
● Utilize Delta Lake and Unity Catalog to define data architectures.
● Explore Data Integration and secure data sharing techniques.
Data Warehouse Data Modeling
Content Map
● Inmon’s Corporate Information Factory
● Kimball’s Dimensional Modeling
● Data Vault 2.0
Data Warehouse Data Modeling
LECTURE
Lakehouse Architecture Recap
Lakehouse Architecture Recap
Laying the Foundation for Data Modeling Strategies
Core Principles & Medallion Architecture
How the Lakehouse Organizes Data for Scalable Processing
Medallion Architecture
● Bronze (Ingestion): raw data, no data processing
● Silver (Curated): cleansed and conformed data, e.g., time series resampled and interpolated, feature enhanced
● Gold (Final): curated business-level tables
Supporting platform capabilities: Orchestration, Collaboration, AI Engine, Cloud Storage.
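As a hedged illustration only (the schema and table names below are hypothetical and not part of the course labs), the three layers can be expressed in Databricks SQL roughly as follows:

-- Bronze (Ingestion): land raw files as-is in a Delta table
CREATE TABLE IF NOT EXISTS bronze.orders_raw AS
SELECT *, current_timestamp() AS ingest_ts
FROM read_files('/Volumes/demo/landing/orders/', format => 'json');

-- Silver (Curated): cleanse and conform the raw data
CREATE OR REPLACE TABLE silver.orders AS
SELECT CAST(order_id AS BIGINT)       AS order_id,
       CAST(order_ts AS TIMESTAMP)    AS order_ts,
       upper(trim(country_code))      AS country_code,
       CAST(amount AS DECIMAL(12, 2)) AS amount
FROM bronze.orders_raw
WHERE order_id IS NOT NULL;

-- Gold (Final): curated business-level aggregate
CREATE OR REPLACE TABLE gold.daily_revenue AS
SELECT country_code, DATE(order_ts) AS order_date, SUM(amount) AS revenue
FROM silver.orders
GROUP BY country_code, DATE(order_ts);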
Unity Catalog for Governance & Modeling
Ensuring Schema Consistency, Security, and Interoperability
Unity Catalog Overview
Before and After Unity Catalog
(Diagram: before Unity Catalog, each workspace has its own metastore; after, a single Unity Catalog metastore is shared across workspaces.)
Unity Catalog Overview
(Diagram: a Databricks account contains Databricks workspaces; a Unity Catalog metastore is assigned to workspaces and contains catalogs; each catalog contains schemas; each schema contains objects such as tables, views, volumes, functions, and models.)
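A hedged sketch of how the three-level namespace and centralized grants look in SQL (catalog, schema, object, and group names are hypothetical):

-- Containers in the metastore
CREATE CATALOG IF NOT EXISTS sales;
CREATE SCHEMA IF NOT EXISTS sales.core;

-- Objects are addressed as catalog.schema.object
CREATE TABLE IF NOT EXISTS sales.core.customers (
  customer_id   BIGINT,
  customer_name STRING
);

-- Centralized access control across all workspaces sharing the metastore
GRANT USE CATALOG ON CATALOG sales      TO `analysts`;
GRANT USE SCHEMA  ON SCHEMA  sales.core TO `analysts`;
GRANT SELECT      ON TABLE   sales.core.customers TO `analysts`;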
Unity Catalog for Governance & Modeling
Ensuring Schema Consistency, Security, and Interoperability
● Cross-Domain Interoperability:
○ Ensures consistent definitions across teams, avoiding schema drift.
Prepare Data: discover and transform structured data into features; chunk and create embeddings from unstructured data.
Develop & Evaluate AI: train and test algorithms; fine-tune and prompt-engineer models; create GenAI agents and tools; evaluate experiments.
Serve Data & AI: low-latency model serving; log model requests and responses; low-latency feature serving; query embeddings in a vector DB.
AI Models & Tools: commercial AI models; community AI models; community tools.
(Cloud storage underpins all of the above.)
Mosaic AI
… fully integrated into the Data Intelligence Platform
(Diagram: Lakehouse common capabilities, Mosaic AI specific capabilities, and external services: Asset Bundles for CI/CD support, MLOps + LLMOps with MLflow, notebooks and SQL, AutoML, AI Playground, AI Gateway, Model Serving, Delta tables (structured) and files/volumes (unstructured) on cloud storage, online tables, and calling models from SQL.)
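For example, a served model can be invoked directly from SQL with the ai_query function (the endpoint name, table, and column below are hypothetical):

-- Score free-text reviews with a model serving endpoint, straight from SQL
SELECT review_id,
       ai_query('sentiment-endpoint', review_text) AS sentiment
FROM silver.product_reviews
LIMIT 10;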
Data Intelligence & Feature Engineering
Bridging Structured Analytics & AI Workloads
LECTURE
Data Warehousing Modeling Overview
Why Model?
Data Warehouse (DWH) Data Modeling
Why?
DWH Data Modeling
How?
Data Modeling Methods
Using what methodology?
Historically, there have been three dominant schools of thought for data
warehousing practitioners:
● The top-down approach, as defined by Bill Inmon
○ Building the Data Warehouse, 1992
● The bottom-up approach, as defined by Ralph Kimball
○ The Data Warehouse Toolkit, 1996
● Data Vault 2.0, as defined by Dan Linstedt
○ Building a Scalable Data Warehouse with Data Vault 2.0, 2015
Data Warehousing - Purpose of Modeling
(Diagram: business requirements (the information need) and structured business apps, possibly on other clouds, feed a staging physical model and the DWH physical data model (PDM), which in turn feed data marts consumed by BI tools.)
Logical Data Modeling
Logical Data Modeling
Key Terms
Entity: a person, place, thing, or concept about which you wish to record facts
Attribute: a non-decomposable, atomic piece of information describing an entity
Non-decomposable: the smallest unit of information you will want to reference
Business rules: specifications that preserve the integrity of the LDM by governing which values attributes may assume
Business rules (two categories):
● Key business rules - The identification of unique records
● Domain business rules - Validation of attribute values
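A minimal sketch of how the two categories might be declared on a Delta table in Unity Catalog (names are hypothetical; primary key constraints are informational on Databricks, while CHECK constraints are enforced):

CREATE TABLE IF NOT EXISTS silver.customer (
  customer_id BIGINT NOT NULL,
  email       STRING,
  birth_date  DATE,
  -- Key business rule: identification of unique records
  CONSTRAINT pk_customer PRIMARY KEY (customer_id)
);

-- Domain business rule: validation of attribute values
ALTER TABLE silver.customer
  ADD CONSTRAINT valid_birth_date CHECK (birth_date IS NULL OR birth_date > '1900-01-01');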
Logical Data Modeling
Optimal Approach
Building a Data Warehouse
Building a DWH - Process
Simplified DWH Process
Models are front-and-center when building a data warehouse.
● Business Information Model (BIM)
(Diagram: simplified DWH process with Analyze, Design, and Build phases: analyze business requirements, design the data staging, and build source data in staging.)
Building a DWH - Process
Simplified DWH Process
Models describe the business world and its relationships, i.e., they depict the business processes within the organization.
(Diagram: Analyze covers business requirements, the Business Information Model, source data analysis, and source mapping; Design covers data staging design and ETL design; Build covers source data in staging and ETL development.)
Building a DWH - Process
Simplified DWH Process
For this process:
● Analyze is technology agnostic
● Design is impacted by technology
(Diagram: the same simplified DWH process of Analyze, Design, and Build as above.)
Data Warehousing in the Lakehouse
LECTURE
Inmon’s Corporate Information Factory
Inmon in a Nutshell
Bill Inmon’s Corporate Information Factory
Understanding the Foundation of Data Warehousing
Top-Down Data Warehousing Approach
Building the Foundation Before Data Marts
Subject-Oriented Data Modeling
Organizing Data Around Business Subjects
With Inmon, data is categorized into subjects (e.g., sales, finance, inventory)
rather than applications or processes.
Benefits:
● Enhances clarity and relevance for business users.
● Facilitates easier data analysis and reporting.
Implementation:
● Utilizes dimensional models like star schemas within each subject area,
ensuring data is organized logically.
Integrated and Consistent Data
Ensuring Data Uniformity Across the Warehouse
Data warehouses store historical data, allowing analysis over different time
periods.
Importance:
● Enables businesses to track changes, identify trends, and make informed
predictions.
Implementation:
● Snapshot Schemas: Capture data at specific intervals.
● Slowly Changing Dimensions (SCD): Manage changes in dimension
attributes over time without losing historical accuracy.
Non-Volatile Data Storage
Stability and Consistency of Warehouse Data
Corporate Information Factory Architecture
Core Components and Their Roles
Extract: Retrieves data from various source systems, which can include
databases, applications, and external files.
Transform: Cleanses, standardizes, and enriches data to ensure
consistency and quality. This step may involve:
● Data cleansing (removing duplicates, correcting errors)
● Data integration (combining data from different sources)
● Data transformation (converting data types, aggregating data)
Load: Inserts the transformed data into the data warehouse, ensuring it is
organized for efficient querying and analysis.
Tools and Technologies: Examples include Informatica, Talend, and
Microsoft SSIS, which automate and manage ETL processes.
Data Marts in Inmon Data Warehouses
Specialized Subsets for Targeted Analysis
Normalization: From UNF to 3NF
Enhancing Data Integrity Through Progressive Constraints
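As a hedged, hypothetical retail example of the end state: an unnormalized order record (customer details and repeating product lines all in one row) decomposes into 3NF tables in which every non-key attribute depends on the key, the whole key, and nothing but the key:

-- UNF (conceptually): orders_unf(order_id, customer_name, customer_city, product_1, price_1, product_2, ...)

-- 3NF decomposition (primary key constraints are informational on Databricks)
CREATE TABLE IF NOT EXISTS retail.customer (
  customer_id   BIGINT NOT NULL,
  customer_name STRING,
  city_id       BIGINT,
  CONSTRAINT pk_customer PRIMARY KEY (customer_id)
);

CREATE TABLE IF NOT EXISTS retail.orders (
  order_id    BIGINT NOT NULL,
  customer_id BIGINT,
  order_ts    TIMESTAMP,
  CONSTRAINT pk_orders PRIMARY KEY (order_id)
);

CREATE TABLE IF NOT EXISTS retail.order_line (
  order_id   BIGINT NOT NULL,
  line_no    INT    NOT NULL,
  product_id BIGINT,
  quantity   INT,
  CONSTRAINT pk_order_line PRIMARY KEY (order_id, line_no)
);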
Normalization Pros and Cons – Inmon
Central to the Inmon EDW Strategy
Normalization and the Databricks Platform
Joins mostly lead to an exchange of data between workers (a shuffle), which requires serialization and deserialization. Highly normalized models therefore tend to incur this cost on more of their queries.
Inmon Visualized
Inmon’s Corporate Information Factory
Process and Logical View
(Diagram: sources are loaded into staging and then into the DWH physical data model (3NF), which feeds data mart physical data models (or denormalized models) and cubes; a logical view accompanies the physical layers.)
Data Modeling Work-process (Inmon)
Logical to Physical
1. Business Information Model (conceptual view): wide business perspective.
● A high-level model of the actors and interactions of interest for the business.
● The focus is to capture the major processes of interest.
2. User Views (Domain): data requirement perspective per function / user (User = Business Function). Each business process is worked on individually. Tasks:
● Identify major entities
● Determine relationships between entities
● Determine primary and alternate keys
● Determine foreign keys
● Determine key business rules
● Add remaining attributes
● Validate normalization rules
● Determine data types
3. Composite Logical Data Model: data integration and conflict resolution. Tasks:
● Combine User Views
● Integrate with existing data models
● Analyze for stability and growth
4. Physical Data Model (PDM): efficiency and usability. Tasks:
● Translate the logical data structure
○ Identify tables and columns
○ Adapt structure to technology
○ Design how to enforce business rules around entities (PK, FK)
○ Design how to enforce integrity (relationships)
○ Tune storage-related mechanisms
⇒ This is an iterative, ongoing process across the warehouse lifecycle where information captured in later steps may inform prior steps.
Data Warehouse Data Modeling
DEMONSTRATION
Entity Relationship Modeling
Data Warehouse Data Modeling
LECTURE
Kimball’s Dimensional Modeling
Kimball in a Nutshell
Ralph Kimball’s Dimensional Modeling
A Practical Approach to Data Warehousing
Kimball vs. Inmon
Comparing Methodologies for Data Warehousing
Key Differences:
● Implementation Speed: Kimball’s approach typically delivers results more quickly.
● Scalability: Inmon’s method may better support large-scale,
enterprise-wide initiatives.
● Flexibility: Kimball’s approach allows for more iterative and adaptable
development.
Dimensional Modeling
Core Concepts of Dimensional Modeling
Building Blocks of Kimball’s Approach - Tables
Fact Tables: Central tables that store measurable, quantitative data related
to business processes.
● Contain foreign keys referencing dimension tables.
● Include numeric metrics (e.g., sales amount, quantity).
● Often contain additive, semi-additive, or non-additive measures.
Dimension Tables: Surrounding tables that provide descriptive attributes
related to fact data.
● Contain textual or categorical information (e.g., product names).
● Often denormalized to optimize query performance.
● Support hierarchical relationships (e.g., dates with year, quarter, month).
Core Concepts of Dimensional Modeling
Building Blocks of Kimball’s Approach - Schemas
Star Schema:
● Structure: Fact table at the center connected to multiple dimension
tables.
● Advantages: Simplifies queries, enhances performance, and improves
readability.
Snowflake Schema:
● Structure: Extension of star schema; dimension tables are normalized
into multiple related tables.
● Advantages: Reduces data redundancy and can save storage space, but
may complicate queries.
Designing Fact Tables
Capturing Business Metrics Effectively
Designing Fact Tables
Capturing Business Metrics Effectively
Measures:
● Additive Measures: Can be summed across any dimension.
● Semi-Additive Measures: Can be summed across some dimensions but
not all.
● Non-Additive Measures: Cannot be meaningfully summed across dimensions (e.g., ratios or unit prices).
Foreign Keys:
● Role: Link fact tables to corresponding dimension tables.
● Implementation: Ensure referential integrity and support efficient joins
during queries.
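A minimal sketch of a sales fact table (hypothetical names; it assumes the referenced dimension tables already exist, and key constraints are informational on Databricks):

CREATE TABLE IF NOT EXISTS gold.fact_sales (
  date_key     INT    NOT NULL,      -- foreign key to dim_date
  customer_key BIGINT NOT NULL,      -- foreign key to dim_customer
  product_key  BIGINT NOT NULL,      -- foreign key to dim_product
  quantity     INT,                  -- additive measure
  sales_amount DECIMAL(12, 2),       -- additive measure
  unit_price   DECIMAL(12, 2),       -- non-additive measure (a price/ratio)
  CONSTRAINT fk_sales_date     FOREIGN KEY (date_key)     REFERENCES gold.dim_date,
  CONSTRAINT fk_sales_customer FOREIGN KEY (customer_key) REFERENCES gold.dim_customer,
  CONSTRAINT fk_sales_product  FOREIGN KEY (product_key)  REFERENCES gold.dim_product
);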
Designing Dimension Tables
Structuring Descriptive Context for Facts
Designing Dimension Tables
Structuring Descriptive Context for Facts
Types of Dimensions:
● Conformed Dimensions: Shared across multiple fact tables and data
marts, ensuring consistency.
● Role-Playing Dimensions: Used multiple times within the same schema
(e.g., date dimension used for order date and ship date).
● Junk Dimensions: Combine unrelated low-cardinality attributes into a
single dimension to reduce clutter in fact tables.
Handling Slowly Changing Dimensions (SCD):
● SCD Type 1: Overwrites old data with new data, not preserving history.
● SCD Type 2: Creates a new record to preserve historical data.
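A hedged sketch of an SCD Type 2 refresh (hypothetical names; updates is assumed to be a staging view of incoming customer rows, and only one tracked attribute is shown):

-- 1) Close out current dimension rows whose tracked attribute changed
MERGE INTO gold.dim_customer AS d
USING updates AS u
ON d.customer_id = u.customer_id AND d.is_current = true
WHEN MATCHED AND d.customer_segment <> u.customer_segment THEN
  UPDATE SET is_current = false, end_date = current_date();

-- 2) Insert new versions: after step 1, both changed and brand-new customers
--    have no current row, so both are picked up here
INSERT INTO gold.dim_customer
  (customer_id, customer_segment, start_date, end_date, is_current)
SELECT u.customer_id, u.customer_segment, current_date(), NULL, true
FROM updates AS u
LEFT JOIN gold.dim_customer AS d
  ON d.customer_id = u.customer_id AND d.is_current = true
WHERE d.customer_id IS NULL;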
Star Schema Design
Simplifying Data Access and Querying
Structure:
● Central Fact Table: Contains measures and foreign keys to dimension
tables.
● Surrounding Dimension Tables: Provide descriptive context for facts.
Advantages:
● Simplicity: Easy to understand and navigate for end-users and analysts.
● Performance: Optimized for read-heavy operations, enhancing query
speed.
● Flexibility: Facilitates ad-hoc querying and reporting without complex
joins.
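For instance, a typical star-schema query joins the fact table to a few dimensions and aggregates (hypothetical names, matching the earlier fact-table sketch):

SELECT d.year, d.month, p.category, SUM(f.sales_amount) AS revenue
FROM gold.fact_sales AS f
JOIN gold.dim_date    AS d ON f.date_key    = d.date_key
JOIN gold.dim_product AS p ON f.product_key = p.product_key
WHERE d.year = 2024
GROUP BY d.year, d.month, p.category
ORDER BY d.month, revenue DESC;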
Snowflake Schema Design
Normalizing Dimensions for Efficiency
Structure:
● Central Fact Table: Similar to the star schema, contains measures and
foreign keys.
● Normalized Dimension Tables: Break down dimension tables into multiple
related tables.
Advantages:
● Storage Efficiency: Reduces data redundancy, saving storage space.
● Data Integrity: Maintains consistency through normalized tables.
Snowflake Schema Design
Normalizing Dimensions for Efficiency
Disadvantages:
● Complexity: Increases the number of joins required for queries,
potentially impacting performance.
● Maintenance: More complex to manage and understand compared to
star schemas.
When to Use:
● Large, Complex Dimensions: Where normalization can significantly
reduce redundancy.
● Strict Data Integrity Requirements: Ensuring consistency across
normalized tables.
Kimball and Denormalization
Denormalization Pros and Cons – Kimball
Key to Dimensional Modeling & Performance
Kimball Visualized
Star Schema vs. Snowflake Schema
Star schema:
● Fact table contains business "facts" (like transaction amounts and quantities).
● Dimension tables contain descriptive attributes and are typically denormalized.
● Simple data model with fast retrieval.
● Star schemas enable users to slice and dice the data, typically by joining two or more fact tables and dimension tables together.
Snowflake schema:
● Fact table as with the star schema.
● Dimension tables are broken down into sub-dimensions; dimensions are normalized.
● Enforces data quality through normalization.
● Higher setup and maintenance effort.
(Diagram: a star schema with product, customer, and store dimensions around the fact table; a snowflake schema where those dimensions are normalized into sub-dimensions such as product category, customer country and city, and store region and type.)
Kimball’s Dimensional Modeling
Process and Logical View
(Diagram: sources feed dimensional data marts; the BI app schema drives BI design and development.)
Dimensional modeling according to Kimball
Fundamental Concepts
See https://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
Example Logical Design (Retail business)
Design model:
1. Select a business process: for each business process, define 1..N facts.
2. Determine granularity: for each fact, define the lowest granularity.
3. Choose dimensions: for each fact, define its dimensions; for each dimension, decide its granularity.
4. Identify measures: for each fact, define all measurements.
Example business processes (retail): Assortment Plans, Purchase Orders, Inventory, Customer Orders, Customer Shipments, Credit Returns, Trended Surveys, General Ledger.
Data Modeling: Dimensional Modeling
Landing (bronze)
• Raw data in its original format (temporarily)
Ingestion (bronze)
• Raw data converted to Delta (from Avro, CSV, Parquet, XML, or JSON format in Landing)
• Verified data contract: schema (typically derived from the source), timeframe, …
• Sometimes called Staging
Integration - Physical data model (silver)
• Detailed information covering multiple business domains (including glossary and taxonomy)
• Integrates all data sources
• Does not necessarily use a dimensional model, but feeds dimensional models
Data Mart (gold)
• Subset of the Integrated layer, sometimes filtered or aggregated data
• Focus on dimensional modeling with star schema
• Typically oriented to a specific line of business or team
* 3NF = "third normal form" in data modeling
(Diagram: Landing (raw data, temporary) → Ingestion (verified data) → Integration (Business Information Model, Logical Data Model in 3NF*, Physical Data Model) → Presentation data marts with dimensional models, queried via SQL; ETL/ELT moves data across bronze, silver, and gold. Example star schema: an Order fact surrounded by Customer, Product, and Time dimensions.)
Data Warehouse Data Modeling
DEMONSTRATION
Dimensional Modeling
Data Warehouse Data Modeling
LECTURE
Data Vault 2.0
Data Vault 2.0 in a Nutshell
Introduction to Data Vault 2.0
Modernizing Data Warehousing for Agility and Scalability
Data Vault 2.0 is an advanced evolution of the original Data Vault modeling
methodology, designed to address the complexities of modern data
warehousing.
Introduction to Data Vault 2.0
Modernizing Data Warehousing for Agility and Scalability
Key Objectives:
● Enhance scalability and flexibility to handle large and rapidly changing
data environments.
● Improve data integration from diverse sources with minimal latency.
● Support agile and iterative development methodologies for faster
deployment and adaptability.
Importance:
● Meets the demands of contemporary businesses for timely, accurate,
and comprehensive data insights.
● Facilitates the integration of structured and unstructured data,
accommodating various data types and sources.
Core Components of Data Vault 2.0
Building Blocks for Robust Data Integration
Hubs: Central entities representing unique business keys (e.g., Customer ID,
Product SKU).
● Contain a unique list of keys with minimal attributes (Business Key, Load
Date, Record Source).
● Serve as the primary point of integration for related data.
Links: Associations or relationships between Hubs (e.g., Customer
purchases Product).
● Capture many-to-many relationships without redundancy.
● Include foreign keys referencing related Hubs, Load Date, and Record
Source.
Data Vault 2.0 Architecture
Structuring for Scalability and Flexibility
Layered Architecture
● Raw Data Vault: Ingests and stores data as-is, ensuring data integrity and
traceability.
○ Components: Hubs, Links, Satellites.
● Business Data Vault: Enhances the Raw Data Vault with business logic,
derived data, and additional context.
○ Components: Derived Satellites, Calculated Metrics.
● Information Delivery Layer: Provides data through data marts, reporting
and analytics platforms.
○ Components: Data Marts (Star/Snowflake Schemas), APIs, BI Tools.
Data Vault 2.0 Methodology
Agile and Scalable Development Practices
Data Vault 2.0 Methodology
Agile and Scalable Development Practices
ETL Development
● Develop Extract, Load, Transform (ELT) processes to populate the Raw
and Business Data Vaults.
● Implement data quality checks and transformation logic.
Testing and Validation
● Ensure data accuracy, integrity, and performance through rigorous
testing.
● Validate against business requirements and use cases.
Hubs, Links, and Satellites
Implementing Hubs in Data Vault 2.0
Capturing Core Business Entities
Purpose of Hubs:
● Represent business keys; central points of integration for related data.
● Ensure consistency and traceability of core business entities.
Design Considerations:
● Business Keys: Stable and unique business identifiers (e.g., Customer ID).
● Minimal Attributes: Maintain simplicity and reduce redundancy.
● Ingestion Date and Record Source: For auditing and lineage purposes.
Best Practices:
● Consistent Naming Conventions: Use clear, standardized names for Hubs.
● Avoid Redundancy: Each Hub represents a single business key.
● Referential Integrity: Between Hubs and Links/Satellites
Implementing Links in Data Vault 2.0
Modeling Relationships Between Business Entities
Purpose of Links:
● Capture relationships between Hubs (e.g., Customer purchases Product).
● Enable modeling of many-to-many relationships without redundancy.
Design Considerations:
● Identify Relationships: Determine how business keys interact and relate.
● Include Foreign Keys: Reference primary keys from related Hubs.
● Load Date and Record Source: Track Link ingestion time and source.
Best Practices:
● Atomic Relationships: A Link should represent a single relationship between Hubs.
● Avoid Overcomplicating: Links are for meaningful business relationships.
● Scalability: Design to accommodate future expansions and relationships.
Implementing Satellites in Data Vault 2.0
Storing Descriptive and Historical Data
Purpose of Satellites:
● Store descriptive, contextual, time-variant data related to Hubs or Links.
● Enable historical tracking and auditing of changes over time.
Design Considerations:
● Segmentation: Separate Satellites by subject areas or update frequency.
● Include Load Metadata: e.g. Load_Date, Record_Source, and End_Date.
● Handle SCDs: Manage changes in dimension attributes.
Best Practices:
● Granular Separation: Separate Satellites for different types of data.
● Update Mechanisms: Uniform processes updating Satellite data.
● Documentation: Satellite purpose and contents to aid in usage.
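A compact, hedged sketch of one Hub, Link, and Satellite as Delta tables (hash-key columns and naming conventions here are illustrative assumptions, not a prescribed standard):

-- Hub: unique business keys
CREATE TABLE IF NOT EXISTS silver.hub_customer (
  hub_customer_hk STRING NOT NULL,        -- e.g. sha2(customer_id, 256)
  customer_id     STRING NOT NULL,        -- business key
  load_date       TIMESTAMP,
  record_source   STRING,
  CONSTRAINT pk_hub_customer PRIMARY KEY (hub_customer_hk)
);

-- Link: relationship between Hubs (customer places order)
CREATE TABLE IF NOT EXISTS silver.lnk_customer_order (
  lnk_customer_order_hk STRING NOT NULL,
  hub_customer_hk       STRING NOT NULL,
  hub_order_hk          STRING NOT NULL,
  load_date             TIMESTAMP,
  record_source         STRING,
  CONSTRAINT pk_lnk_customer_order PRIMARY KEY (lnk_customer_order_hk)
);

-- Satellite: descriptive, time-variant attributes for the customer Hub
CREATE TABLE IF NOT EXISTS silver.sat_customer_details (
  hub_customer_hk  STRING    NOT NULL,
  load_date        TIMESTAMP NOT NULL,
  record_source    STRING,
  customer_name    STRING,
  customer_segment STRING,
  hash_diff        STRING,               -- detects attribute changes between loads
  CONSTRAINT pk_sat_customer_details PRIMARY KEY (hub_customer_hk, load_date)
);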
Data Vault 2.0 Visualized
DWH Modeling Approaches - Data Vault
Process and Logical View
(Diagram: Hubs, Links, and Satellites, with Point-in-Time and Bridge tables feeding Information Marts.)
Data Vault Work-process
From logical to physical data model: Define Ontology → Define Taxonomies → Model Raw Vault → Model Business Vault → Model Information Mart.
Start with what the business needs: the Enterprise Business Ontology.
Ontologies:
• Define how the business sees their data
• Model real-life entities
• Start with business concepts
• Connect business concepts with business keys
• Drill down into the hierarchies (Taxonomies)
Taxonomies:
• Follow a hierarchical format and provide names for each object in relation to other objects
• Capture the membership properties of each object in relation to other objects
• Have specific rules to classify and categorize any object in a domain; the rules must be complete, consistent, and unambiguous
• Each class inherits all properties of the class above it and may have additional properties
Ontologies provide context to the developers, designers, and business users on how the data fits the business.
"Data Vault Modeling was, is, and always will be about the business" - Dan Linstedt (creator of Data Vault)
Data Modeling: Data Vault 2.0
Landing (bronze)
• Raw data in its original format (temporarily)
Ingestion (bronze, sometimes called Staging)
• Raw data converted to Delta (from Avro, CSV, Parquet, XML, or JSON format in Landing)
• Verified data contract: schema (typically derived from the source), timeframe, …
Integration - Raw Vault (silver)
Data is modeled as:
• Hubs (unique business keys)
• Links (relationships and associations)
• Satellites (descriptive data)
(Diagram: Landing (raw data, temporary) → Ingestion (verified data) → Integration with a Raw Vault (Hubs, Links, Satellites) and a Business Vault (PIT and Bridge tables, business views) → Presentation Information Marts, queried via SQL; bronze, silver, gold.)
DEMONSTRATION
Modern Data Architecture Use Cases
Modern Data Architecture Use Cases
LECTURE
Feature Stores
Modern Data Modeling for ML and AI
Foundations for Advanced Analytics and Intelligence
Understanding Feature Stores
Centralizing Feature Management for ML and AI
Architecture of a Feature Store
Components of a Feature Store
Core Components:
● Repository: Stores and manages feature definitions and metadata.
● Storage Layer: Physical storage systems (e.g., databases, data lakes)
where feature data resides.
● Serving Layer: APIs and services that provide features to ML models in
real-time or batch modes.
● Registry: Catalogs available features, including their definitions, sources,
and usage statistics.
● Transformation Layer: Tools and processes for feature engineering and
transformations.
Architecture of a Feature Store
Workflow of a Feature Store
Workflow:
● Feature Engineering: Data scientists create and transform raw data into
features.
● Feature Registration: Features are registered in the feature store with
metadata.
● Feature Storage: Transformed features are stored in the feature
repository.
● Feature Serving: Features are served to ML models during training and
inference.
● Monitoring and Management: Ongoing monitoring of feature quality,
usage, and performance.
Feature Tables
Feature Tables: Definition and Purpose
Organizing Features for Efficient ML Workflows
Feature Tables are structured tables within a feature store that organize
related features for specific ML use cases or business domains.
Each row represents a unique instance of a feature for a specific entity.
Purpose:
● Logical Grouping: Groups features by subject area for easier
management and access.
● Performance Optimization: Organizes features in a way that aligns with
ML workflows.
● Version Control: Manages feature table versions to track changes and ensure reproducibility.
● Access Control: Implements granular access permissions.
Feature Tables: Definition and Purpose
Organizing Features for Efficient ML Workflows
Feature Tables are structured tables within a feature store that organize
related features for specific ML use cases or business domains.
Each row represents a unique instance of a feature for a specific entity.
Structure (Columns):
● Feature Name: Identifier for each feature.
● Data Type: Specifies the type of data (e.g., integer, float, string).
● Description: Detailed explanation of the feature’s purpose and usage.
● Source: Origin of the feature (e.g., raw data, derived).
● Creation Timestamp: When the feature was created or last updated.
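On Databricks, a feature table can be sketched as a Unity Catalog Delta table with a primary key (catalog, schema, and column names are hypothetical; staged_customer_features is an assumed staging view of freshly computed values):

-- One row per entity (customer), keyed by the entity id
CREATE TABLE IF NOT EXISTS ml.features.customer_features (
  customer_id           BIGINT NOT NULL,
  days_since_last_order INT,
  orders_last_90d       INT,
  avg_basket_value      DECIMAL(12, 2),
  feature_ts            TIMESTAMP,       -- when the feature values were computed
  CONSTRAINT pk_customer_features PRIMARY KEY (customer_id)
);

-- Features are typically refreshed with batch upserts
MERGE INTO ml.features.customer_features AS t
USING staged_customer_features AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;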
Types of Feature Tables
Categorizing Feature Tables Based on Use Cases
Modern Use Cases Visualized
Data Modeling: Modern use cases (ML and AI)
Landing (bronze)
• Raw data in its original format (could be temporary)
• A landing zone allows bronze in Delta format, independent of the original input format
Ingestion (bronze)
• Delta data converted from raw (from Avro, CSV, Parquet, XML, or JSON format in Landing)
• Verification typically lightweight compared to DWH ingestion
• No other transformation or business logic is applied
• Often a "schema on read" approach
Curation (silver)
• Cleansed data, filtered data, and augmented data
Final (gold)
• Business-level aggregates
• Masked, reduced, anonymized for project purposes
• Denormalized for performance if needed
(Diagram: Landing (raw data, temporary) → Ingestion (verified data) → Curation (cleansed, filtered, augmented data) → Final (business-level aggregates, project data) consumed from Python, R, SQL, and Scala; ETL/ELT across bronze, silver, and gold.)
Modern Data Architecture Use Cases
DEMONSTRATION
Modern Case Study: Feature Store
Modern Data Architecture Use Cases
LECTURE
Combining Approaches
Assessing DWH Models
Evaluating DWH modeling paradigms
Key examples (non-exhaustive)
Ability to change
● Big impact on Inmon models when business process changes (higher
effort and duration).
● Business changes, especially significant ones, can break the basis of a
Kimball model (higher effort and duration).
● Data Vault 2.0 structure facilitates reacting to business changes (lower
effort).
Evaluating DWH modeling paradigms
Key examples (non-exhaustive)
Complexity
● Inmon leads to very complex ETL and load dependencies that need to
be handled through load flow optimizations or additional ETL jobs to
ensure model consistency.
● Kimball dimensional models can be very hard to populate, since you have to ensure consistency with the dimensions. Dimension logic can be hard; in particular, slowly changing dimensions of Type 2 and above can be challenging.
● Data Vault 2.0 has 3-6 times more objects than a pure 3NF DW; this
impacts ETL, but the ETL is simplified, easily automated, and can for the
most part be run in parallel.
Evaluating DWH modeling paradigms
Key examples (non-exhaustive)
Robustness
● Inmon models can easily break due to changes in business processes
and business rules.
● Kimball is the simplest model to understand, but a critical mass of
changes entails remodeling large portions.
● For Data Vault 2.0, most changes can be compartmentalized to a
specific layer.
DWH Model Advantages
Summary
DWH Model Challenges
Summary
Common DWH modeling challenges
The attraction of “no modeling”
Many organizations find it challenging to handle the life cycle around data
models. Challenges come in the form of people, process, and technology.
● To maintain a database, you need DBAs or data engineers
● To model databases, you need data modelers
● To do a “correct” data model, you need access to the business
Common DWH modeling challenges
The attraction of “no modeling”
For most organizations, the mere overhead of having data modelers talk to the business, as well as the time it takes to introduce changes, negatively affects the organization's ability to adapt to new conditions.
Still, cutting corners in this process has side effects: data correctness, and potentially data quality, degrade in exchange for speed and agility in the data process.
The Enhanced Medallion
Medallion, the best practice pipeline
(Diagram: two tracks over cloud storage.
● Bronze Ingestion (raw data, no data processing) → Silver Curated (cleansed and conformed data, e.g., time series resampled and interpolated, feature enhanced) → Gold Final (curated business-level tables).
● BI use cases (strictly modeled and verified data): streaming into Ingestion (verified data) → Integration (business information model) → Presentation (data marts).)
Three layers from data to information
Explorative & Flexible: the Curation (cleansed, augmented, …) and Final (business-specific) layers serve data for modern use cases like ML through self-service.
Modern use cases (exploratory data analysis, data science, …):
• High flexibility
• All sorts of workload types supported (e.g., ML, experimental workloads)
• All data types supported
• No compliance with the business information model needed
Three layers from data to information
(Diagram: the Curation (cleansed, augmented, …; silver) and Final (business-specific; gold) layers expose a semantically consistent, enhanced business perspective: certified data published as Data Products for self-service consumption.)
Data Products
Data Products
LECTURE
Defining Data Products
Why Data Products?
Traditional Data Management Falls Short
Disconnected teams, inconsistent data, and slow time to value
From Data Assets to Data Products
Shifting from fragmented datasets to managed, reusable assets
Semantic Consistency and Interoperability
Building Trust Through Governance and Standardization
Scalable, Governed, and AI-Ready
Aligning structured data management with modern use cases
Data and “product thinking”
A data product facilitates an end goal through the use of data
Data Product
Usability Characteristics
Data Product
Imperatives
Data Product concept attributes
A data product is:
● Discoverable
● Addressable
● Natively accessible
● Valuable on its own
● Understandable
● Trustworthy and truthful
● Interoperable and composable
● Secure
Supporting concerns: Ownership; Discoverability (published data product); Quality & Observability (trusted data asset); Security (organization-wide data governance); Semantic Consistency (compliant with governance rules); Privacy (potentially anonymized).
Data Product
Categories & Hierarchies
Example data product categories
Much more than tables, with varying producers (P) and consumers (C)
(Table: data product categories include Datasets (tabular data such as SQL tables and dataframes: facts, dimensions, metrics, timeseries, KPIs, metadata, …; ML & AI features; streams) and Data Resources; producer (P) and consumer (C) roles vary across data engineers, data scientists, ML engineers, business analysts, and business users.)
Data Product hierarchy
Source Systems → Source-aligned Data Products → Derived Data Products → Consumer-aligned Data Products
Example: the PLM and CRM source systems feed the source-aligned products plm, crm_tickets, and call_centers, which in turn feed derived and consumer-aligned products such as customer_products, product_popularity, and customer_loyalty.
Source-aligned data products
● Represent the relevant data as it is in the operational system, with minimal transformation
● Cleansed and transformed to ensure quality
● First step to creating more valuable data products
Derived data products
● Created by processing and transforming source-aligned data products or other derived data products
● Satisfy user needs, e.g., for decision-making and automated decision-making
● Can be reused in other derived data products
Consumer-aligned data products
● Specifically built for end users, e.g., dashboards and reports
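As a rough illustration of this hierarchy, the sketch below derives a consumer-aligned customer_loyalty product from the source-aligned crm_tickets and call_centers products named on the slide. The catalog and schema locations, column names, and aggregation logic are illustrative assumptions, not definitions from the course.
```python
# Minimal sketch, assuming the source-aligned products are Unity Catalog tables
# and that the column names below exist (they are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

tickets = spark.table("main.crm_source.crm_tickets")   # source-aligned product
calls = spark.table("main.crm_source.call_centers")    # source-aligned product

# Consumer-aligned: combine and aggregate per customer.
loyalty = (
    tickets.join(calls, "customer_id", "left")
    .groupBy("customer_id")
    .agg(
        F.count("ticket_id").alias("ticket_count"),
        F.countDistinct("call_id").alias("call_count"),
    )
    # Illustrative scoring rule only.
    .withColumn("loyalty_score", 100 - F.col("ticket_count") * 5 - F.col("call_count") * 2)
)

# Publish the consumer-aligned product as a governed table.
loyalty.write.mode("overwrite").saveAsTable("main.crm.customer_loyalty")
```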
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Data Product hierarchy (with Ownership)
Source Systems → Source-aligned Data Products → Derived Data Products → Consumer-aligned Data Products
(Diagram: the same PLM/CRM example — plm, crm_tickets, and call_centers feeding customer_products, product_popularity, and customer_loyalty — annotated with the owner of each data product.)
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Data Products in the Lakehouse
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Data Products to combine different worlds
(Diagram: domains that have standardised on the data products paradigm — sources feeding source-aligned data products, which are combined into derived data products — coexist with domains following the Inmon DWH paradigm — sources feeding staging, a 3NF DWH, and data marts. Semantically consistent data products exposed as facades, implemented as views or materialized views, bridge the two worlds: DWH-style domains are reached either through Lakehouse Federation (LF) or, for systems storing data in object storage, through external tables (ET) with direct access or materialized views. The data products are integrated through a Data Integration Domain, registered in an enterprise catalog, and governed by Unity Catalog.)
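A data product facade over an Inmon-style warehouse could look roughly like the sketch below: a view in Unity Catalog selecting from a data mart reached through Lakehouse Federation. The federated catalog name, schema, columns, and view name are hypothetical assumptions.
```python
# Minimal sketch (assumptions: a foreign catalog `fed_dwh`, created via
# Lakehouse Federation, exposes the domain's data mart; names and columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Facade as a plain view: consumers see a semantically consistent product,
# while the data stays in the federated warehouse.
spark.sql("""
    CREATE OR REPLACE VIEW main.sales.product_popularity AS
    SELECT product_id, region, SUM(units_sold) AS units_sold
    FROM fed_dwh.marts.sales_fact
    GROUP BY product_id, region
""")

# A materialized view could cache the result in the lakehouse instead
# (syntax sketch only; materialized views have their own compute requirements):
# CREATE MATERIALIZED VIEW main.sales.product_popularity_mv AS
# SELECT product_id, region, SUM(units_sold) AS units_sold
# FROM fed_dwh.marts.sales_fact
# GROUP BY product_id, region
```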
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Five core processes
When defining a data product
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Data Products
Typical lifecycle
Inception → Design → Creation → Publishing → Operation + Governance → Retirement
Consumption and feedback return value-creation information to the team, and new versions iterate through the lifecycle again.
Inception
● Start with desired business outcomes
● Assign owner
● Assign resources
● Define business metrics
Design
● Create a data contract
● Create a data product design specification
● Ensure semantic consistency with other data products
Creation
● Build modular pipelines, features, models, dashboards, alerts, …
● Test against the data contract
Publishing (see the sketch below)
● Deploy using DataOps or MLOps (for models)
● Publish to the catalog
● Manage access permissions according to the data contract
Operation + Governance
● Monitor metrics, quality, usage, and permissions
● Handle compliance requests
● Audit data product access
Retirement
● Deprecate the product
● Inform consumers
● Shut down production
● Archive assets
● Clean up resources
Roles involved across the lifecycle: business/consumer, product owner, data engineer / data scientist / business analyst, data steward, DataOps/MLOps.
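To ground the publishing step, the sketch below grants access according to a data contract. It is a minimal sketch: the table name, consumer groups, privileges, and tag convention are hypothetical and not prescribed by the course.
```python
# Minimal sketch of contract-driven publishing (all names hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table = "main.crm.customer_loyalty"
allowed_consumers = ["analysts", "marketing-team"]   # taken from the data contract

# Manage access permissions according to the data contract.
for group in allowed_consumers:
    spark.sql(f"GRANT SELECT ON TABLE {table} TO `{group}`")

# Record that the product has been published (hypothetical tag convention).
spark.sql(f"ALTER TABLE {table} SET TAGS ('lifecycle_stage' = 'published')")
```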
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Data Products
(Diagram: the lifecycle stages — Inception, Design, Creation, Publishing, Operation + Governance, Retirement, with consumption and feedback returning value-creation information — annotated with supporting capabilities: data and AI governance, discovery, lineage, and access control through Unity Catalog; the owner maintaining a docs repo; and orchestration through Databricks Workflows, Repos, CI/CD, and MLOps/LLMOps.)
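As one way to support discovery through Unity Catalog, the sketch below lists tables that carry a data_product tag by querying the information schema. The tag convention is the same hypothetical one used in the earlier sketches, not a built-in mechanism.
```python
# Minimal discovery sketch (assumption: published products were tagged with a
# hypothetical 'data_product' tag, as in the earlier examples).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT catalog_name, schema_name, table_name, tag_value AS product_name
    FROM system.information_schema.table_tags
    WHERE tag_name = 'data_product'
""").show(truncate=False)
```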
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Data Contracts
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Data Contract
A formal way to align domains and implement federated governance
A data contract typically covers:
● Data description – name, owner, description, source systems, …
● Data SLAs – last update, expiration dates, retention time, usage restrictions, code of conduct, re-sharing conditions, …
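One lightweight way to make such a contract executable is to express it as data and test the published table against it. The sketch below assumes a hypothetical contract (catalog, table, columns, and freshness SLA are all illustrative) and checks schema and staleness of a Delta table.
```python
# Minimal data contract check, not a full framework (all names hypothetical).
from datetime import datetime, timedelta, timezone
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

contract = {
    "name": "customer_loyalty",                     # data description
    "owner": "crm-domain-team",
    "table": "main.crm.customer_loyalty",           # hypothetical location
    "columns": ["customer_id", "loyalty_score", "updated_at"],
    "max_staleness_hours": 24,                      # data SLA
}

df = spark.table(contract["table"])

# Schema check: every column promised by the contract must exist.
missing = [c for c in contract["columns"] if c not in df.columns]
assert not missing, f"Contract violation: missing columns {missing}"

# Freshness check against the Delta transaction log (timestamp assumed UTC).
last_commit = (
    spark.sql(f"DESCRIBE HISTORY {contract['table']} LIMIT 1")
    .select("timestamp")
    .first()[0]
)
age = datetime.now(timezone.utc) - last_commit.replace(tzinfo=timezone.utc)
assert age <= timedelta(hours=contract["max_staleness_hours"]), (
    f"Contract violation: data is {age} old"
)
```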
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Data contract-based governance
Potential process to achieve consistent “certified data products”
1. Domain 1 proposes a data contract to the Governance Team.
2. The Governance Team assesses the proposal, gives feedback, and approves it.
3. The contract approval is communicated back to Domain 1.
4. Domain 1 publishes the data product to the catalog / marketplace (backed by cloud storage).
5. Domain 2 discovers the data product.
6. Domain 2 understands its usage via the data contract.
7. Domain 2 uses the data product.
Certified data products carry the “stamp” of the Governance Team (golden data products), but other data products can be published without involving the Governance Team.
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Independent and certified data products
“Equilibrium between centralization and decentralization”
(Diagram: several autonomous data domains, each publishing independent and certified data products.)
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Data Product Topologies
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
The basic topologies
Harmonized vs. Hub-and-Spoke
When structuring a distributed architecture where the different domains are autonomous,
but need to share data, two basic approaches exist:
Harmonized
● Fully autonomous data domains
● Domains publish metadata to, and discover data through, a global catalog (C)
● Data is consumed directly between domains
● External sharing happens from the individual domains
Hub-and-Spoke
● Data domains publish and discover data products through a global hub (H)
● Data is consumed via the hub
● External sharing happens through the hub
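In a hub-and-spoke setup, the hub can serve data products to other domains or external parties with Delta Sharing. The sketch below is a rough outline from the hub side; the share, recipient, and table names are hypothetical.
```python
# Minimal Delta Sharing sketch from the hub side (all names hypothetical;
# requires privileges to create shares and recipients on the metastore).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Bundle data products into a share.
spark.sql("CREATE SHARE IF NOT EXISTS crm_products")
spark.sql("ALTER SHARE crm_products ADD TABLE main.crm.customer_loyalty")

# Register the consuming domain / external party and grant access.
spark.sql("CREATE RECIPIENT IF NOT EXISTS partner_domain")
spark.sql("GRANT SELECT ON SHARE crm_products TO RECIPIENT partner_domain")
```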
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Harmonized topology
No central data team, all domains are autonomous
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Hub-and-Spoke
Global data hub for publishing, discovery, serving, and external sharing of data products
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Topologies in the real world
The target topology will be a mix of both
(Diagram: a mixed topology combining harmonized and hub-and-spoke patterns. Legend: ○ = context, DP = data product.)
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
LAB EXERCISE
Data Warehouse Modeling with ERM and Dimensional Modeling in Databricks
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Summary and Next Steps
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Course Learning Objective Recap
● Design and implement data models tailored to specific business needs
within the Databricks Lakehouse environment.
● Differentiate between different types of modeling techniques and
understand their respective use cases.
● Analyze business needs to determine data modeling decisions.
● Design logical and physical data models for specific use cases.
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Course Learning Objective Recap
● Understand the definition and use cases of data products.
● Understand the data product lifecycle.
● Explore the stages of the data product lifecycle.
● Organize Data in Domains and in Unity Catalog.
● Utilize Delta Lake and Unity Catalog to define data architectures.
● Explore Data Integration and secure data sharing techniques.
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Next Steps
Additional resources for continuing the learning journey.
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.