
Data Modeling

Strategies

Databricks Academy

© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Agenda

Data Warehouse Data Modeling            Time     Lecture  Demo  Lab

Lakehouse Architecture Recap            8 mins   ✓
Data Warehousing Modeling Overview      10 mins  ✓
Inmon’s Corporate Information Factory   25 mins  ✓        ✓
Kimball’s Dimensional Modeling          25 mins  ✓        ✓
Data Vault 2.0                          22 mins  ✓        ✓
Agenda

Modern Data Architecture Use Cases      Time     Lecture  Demo  Lab

Feature Store                           19 mins  ✓        ✓
Combining Approaches                    16 mins  ✓

Data Products                           Time     Lecture  Demo  Lab

Defining Data Products                  36 mins  ✓
Summary and Next Steps                  2 mins   ✓
Data Warehousing Modeling with ERM and
Dimensional Modeling in Databricks      60 mins                 ✓
Course Objectives
● Design and implement data models tailored to specific business needs
within the Databricks Lakehouse environment.
● Differentiate between different types of modeling techniques and
understand their respective use cases.
● Analyze business needs to determine data modeling decisions.
● Design logical and physical data models for specific use cases.

Course Objectives
● Understand the definition and use cases of Data Products.
● Explore the stages of the data product lifecycle.
● Organize data in domains and in Unity Catalog.
● Utilize Delta Lake and Unity Catalog to define data architectures.
● Explore data integration and secure data sharing techniques.
Data Warehouse Data Modeling

Data Modeling Strategies
Objectives
● Understand Bill Inmon’s top-down (3NF) approach
● Map Inmon’s EDW concepts to Databricks medallion layers
● Summarize Kimball’s bottom-up, star schema-driven approach
(facts/dimensions)
● Illustrate how star schemas integrate with Lakehouse
● Understand Data Vault 2.0’s Hubs, Links, and Satellites for agile schema
evolution
● Compare Data Vault to Inmon/Kimball

Content Map

Inmon’s Corporate Information Factory: Overview of Inmon methodology; strengths (governance, single source of truth) vs. limitations

Kimball’s Dimensional Modeling: Fact vs. Dimension; conformed dimensions, SCD types; Kimball vs. medallion alignment; surrogate keys & dimension creation; fact table referencing dimension keys

Data Vault 2.0: Hubs (business keys), Links (relationships), Satellites (attributes); TPC-H mapping example; strengths (historization, incremental loads) vs. complexity
Data Warehouse Data Modeling

LECTURE

Lakehouse
Architecture
Recap

Lakehouse Architecture Recap
Laying the Foundation for Data Modeling Strategies

● This Data Modeling Strategies course builds on your understanding of Lakehouse principles.
○ Understanding the Lakehouse framework helps us structure scalable, efficient data models.
● Modeling decisions depend on data governance, processing, and storage
layers.
○ The Medallion Architecture (Bronze, Silver, Gold) dictates where and how data is transformed.
● Unity Catalog enforces governance and interoperability.
○ Schema consistency, lineage tracking, and access control impact how we model data across domains.
● Bridging AI and BI—Data models serve both workloads.
○ Feature engineering and structured analytics rely on properly designed data models.

Core Principles & Medallion Architecture
How the Lakehouse Organizes Data for Scalable Processing

● Combining Data Lakes & Warehouses:


○ The Lakehouse eliminates silos by supporting both structured and unstructured data.

● The Medallion Architecture (Bronze, Silver, Gold):


○ Bronze Layer: Raw ingestion, schema validation, historical record-keeping.
○ Silver Layer: Cleaned, conformed, and enriched data, ready for modeling.
○ Gold Layer: Optimized for BI, ML, and domain-specific analytics.

● Schema Enforcement & Governance:


○ Supports open formats like Delta Lake while enforcing schema consistency.

● Built for Performance & Scale:


○ Combines ACID transactions, indexing, and caching for high-performance querying.

Medallion Architecture

Bronze (Ingestion) → Silver (Curated) → Gold (Final), with transformations such as time series resampling & interpolation, Spark stream feature reduction, and feature enhancement along the way.

● Bronze - Raw data: no data processing; data kept around to fix mistakes
● Silver - Cleansed and conformed data: directly queryable; PII masking/redaction
● Gold - Curated business-level tables: project/use case specific; denormalized and read-optimized data models
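A hedged sketch of how these layers might be realized as Delta tables in Databricks SQL; the schema and table names, columns, and landing path below are illustrative assumptions, not a prescribed layout:

-- Bronze: raw data kept as ingested, so upstream mistakes can be replayed.
CREATE TABLE bronze.orders_raw AS
SELECT *, current_timestamp() AS ingested_at
FROM read_files('/Volumes/demo/landing/orders/', format => 'json');

-- Silver: cleansed and conformed data, directly queryable.
CREATE TABLE silver.orders AS
SELECT CAST(order_id AS BIGINT)       AS order_id,
       CAST(order_ts AS TIMESTAMP)    AS order_ts,
       upper(trim(country_code))      AS country_code,
       CAST(amount AS DECIMAL(18, 2)) AS amount
FROM bronze.orders_raw
WHERE order_id IS NOT NULL;

-- Gold: denormalized, read-optimized, business-level table.
CREATE TABLE gold.daily_revenue AS
SELECT country_code,
       DATE(order_ts) AS order_date,
       SUM(amount)    AS revenue
FROM silver.orders
GROUP BY country_code, DATE(order_ts);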
Modern Data and AI Platform

● Personas: Data Engineer, ML Engineer, Data Scientist, Business Analyst, Business Partners
● Tooling: ETL & DS tools, BI Tools, Orchestration, Collaboration
● Platform layers: Ingest & Transform; Advanced Analytics, ML & AI; Data Warehouse; AI Engine; Data & AI Governance; Cloud Storage
Unity Catalog for Governance & Modeling
Ensuring Schema Consistency, Security, and Interoperability

● Centralized Governance for Data & AI:
○ Manages schemas, tables, and permissions across all workspaces and clouds.
● Schema Enforcement & Data Lineage:
○ Tracks data movement, transformations, and dependencies for model reproducibility.
● Fine-Grained Access Control:
○ Column- and row-level permissions ensure data is secure yet accessible.
● Cross-Domain Interoperability:
○ Ensures consistent definitions across teams, avoiding schema drift.
● Supports Multi-Cloud & Open Data Formats:
○ Enables governed access to Delta Lake, Parquet, and other formats.
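A hedged sketch of the fine-grained controls described above, using Unity Catalog row filters and column masks; all function, schema, and table names are illustrative assumptions:

-- Row-level security: non-admins only see their own region.
CREATE FUNCTION demo.gov.region_filter(region STRING)
RETURN IF(is_account_group_member('admins'), TRUE, region = 'EMEA');

ALTER TABLE demo.sales.orders
SET ROW FILTER demo.gov.region_filter ON (region);

-- Column-level masking: redact PII for everyone outside HR.
CREATE FUNCTION demo.gov.email_mask(email STRING)
RETURN CASE WHEN is_account_group_member('hr') THEN email ELSE '***' END;

ALTER TABLE demo.sales.customers
ALTER COLUMN email SET MASK demo.gov.email_mask;

-- Plus standard grants at catalog, schema, or table level.
GRANT SELECT ON TABLE demo.sales.orders TO `analysts`;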
Unity Catalog Overview
Before and After Unity Catalog

● Before Unity Catalog: each workspace maintained its own user/group management, metastore, and access controls on top of its compute resources.
● With Unity Catalog: user/group management, the metastore, and access controls are centralized in Unity Catalog and shared across workspaces; each workspace keeps its own compute resources.
Unity Catalog Overview

A (Unity) Metastore is assigned to a Databricks Account and shared by its Databricks Workspaces. Within the metastore, objects are organized as:

Metastore → Catalog → Schema → Table / View / Volume / Function / Model

Every object is addressed through a three-level namespace:

SELECT * FROM catalog1.schema1.table1;
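A minimal sketch of creating that hierarchy (object names are illustrative):

CREATE CATALOG IF NOT EXISTS catalog1;
CREATE SCHEMA  IF NOT EXISTS catalog1.schema1;
CREATE TABLE   IF NOT EXISTS catalog1.schema1.table1 (id BIGINT, name STRING);

-- Each object is then addressed as catalog.schema.object:
SELECT * FROM catalog1.schema1.table1;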
Data Intelligence & Feature Engineering
Bridging Structured Analytics & AI Workloads

● The Well-Architected Lakehouse Supports Both BI & AI:


○ Structured analytics (SQL, BI dashboards) and AI-driven feature engineering coexist in a unified
architecture.

● Feature Engineering Requires Scalable Data Pipelines:


○ AI workloads need real-time and batch processing for feature extraction and transformation.

● Feature Stores Ensure Consistency Across ML Pipelines:


○ Prevents “training-serving skew” by storing reusable, versioned features.

● Data Intelligence Optimizes Business & AI Use Cases:


○ Combines predictive modeling with historical analytics for deeper insights.

● Supports Real-Time & Batch Inference:


○ ML models leverage streaming & historical data for accurate, real-time decisioning.
Mosaic AI
End-to-end AI capabilities

● MLOps + LLMOps: move code, data, and models between development and production; manage models, features, and experiments
● Prepare Data: discover & transform structured data into features; chunk & create embeddings from unstructured data
● Develop & Evaluate AI: train and test algorithms; fine-tune & prompt-engineer models; create GenAI agents & tools; evaluate experiments
● Serve Data & AI: low-latency model serving; log model requests/responses; low-latency feature serving; query embeddings in a vector DB
● AI Models & Tools (external services): commercial AI models, community AI models, community tools
● AI Engine: AI-driven discovery and search; AI Assistant; AI-driven performance optimization and scaling
● Data & AI Governance: manage security & permissions; manage models, features, and functions; track model lineage; data monitoring; AI monitoring (metrics, model quality, drift of data and predictions)
● All built on Cloud Storage
Mosaic AI
… fully integrated into the Data Intelligence Platform

Mosaic AI combines Lakehouse common capabilities (Notebooks, SQL, DLT, Workflows, Lakeflow Connect, Asset Bundles for CI/CD support, Unity Catalog, Delta Sharing, and Delta tables plus Files/Volumes on cloud storage) with Mosaic AI-specific capabilities (MLflow, AutoML, AI Playground, AI Gateway, Model Serving, Model Training, AI Functions for calling models from SQL, Agent Framework, Agent Evaluation, Databricks Apps, Function Serving, Feature Serving via online tables, Vector Search, Model Registry and Feature Store in Unity Catalog, Lakehouse Monitoring, and Tools/Models catalogued in Unity Catalog and the Marketplace) alongside external services (Hugging Face, OpenAI, LangChain, …).
Key Takeaways
How Lakehouse Architecture Shapes Data Modeling Strategies

● The Lakehouse integrates structured & unstructured data.


○ Supports BI, ML, and real-time analytics in a single framework.

● Medallion Architecture provides a structured data flow.


○ Bronze (raw), Silver (cleansed), and Gold (optimized) layers define where and how data models should be
applied.

● Unity Catalog enforces governance & consistency.


○ Standardized schemas, access control, and data lineage tracking enable trustworthy data modeling.

● Feature Stores bridge AI & business analytics.


○ Ensures consistent, versioned feature definitions across training & inference workflows.

● A strong data modeling strategy builds on these principles.


○ Ensuring that data remains scalable, governed, and optimized for AI & analytics.
Data Warehouse Data Modeling

LECTURE

Data
Warehousing
Modeling
Overview
Why Model?

Data Warehouse (DWH) Data Modeling
Why?

A data warehouse is used by business users to evaluate and make business decisions.
Data warehouse data needs to be modeled to:
● Correctly represent the business
● Ensure that insights and decisions based on the data warehouse are
impactful.

DWH Data Modeling
How?

● Understand the business – its actors, relationships, processes, requirements
● Create a logical data model of the organization's business processes and needs
● Ensure data is of high quality: accurate, consistent, and well-organized
● Enable effective support for business intelligence, analytics, reporting, etc.

Business Model (processes, actors, relationships, requirements, …) → Logical Model (formal business model, technology-agnostic) → DWH Implementation (technology-specific) → Business Intelligence, Analytics and Reporting
Data Modeling Methods
Using what methodology?

Historically, there have been three dominant schools of thought for data
warehousing practitioners:
● The top-down approach, as defined by Bill Inmon
○ Building the Data Warehouse, 1992
● The bottom-up approach, as defined by Ralph Kimball
○ The Data Warehouse Toolkit, 1996
● Data Vault 2.0, as defined by Dan Linstedt
○ Building a Scalable Data Warehouse with Data Vault 2.0, 2015

Data Warehousing - Purpose of Modeling

Business Requirements (the information need) drive the data warehouse environment, which flows through five stages: Source → Ingest → Integration → Delivery & Access → Serve.

● Source: RDBMS (structured), Apps, Files / Logs (semi-structured), Business Apps (structured), other clouds
● Ingest: Physical Staging Model
● Integration: Business Information Model → DWH Logical Data Model (LDM) → DWH Physical Data Model (PDM)
● Delivery & Access: Data Marts Logical Data Model (LDM) → Data Mart
● Serve: BI Tools
Context for Concepts

We can easily explore a methodology-agnostic conceptual approach to data modeling based on the terminology of the Inmon approach.
Key concepts that originated with Inmon, such as the logical model, translate effectively into the terminologies of competing methodologies, e.g. ontologies and taxonomies, as used in Kimball and Data Vault.

Logical Data Modeling

Logical Data Modeling
Key Terms

Entity: Person, place, thing, or concept about which you wish to record facts
Attribute: a non-decomposable, atomic piece of information describing an entity
Non-Decomposable: the smallest unit of information you will want to reference
Business rules: specifications that preserve the integrity of the LDM by governing which values attributes may assume
Business rules fall into two categories:
● Key business rules - the identification of unique records
● Domain business rules - validation of attribute values
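A hedged sketch of how the two rule categories might be expressed on a Delta table; entity and attribute names are illustrative, and note that in Databricks primary keys are informational while CHECK constraints are enforced:

CREATE TABLE erm.customer (
  customer_id BIGINT NOT NULL,  -- key business rule: unique identification
  email       STRING,
  birth_date  DATE,
  CONSTRAINT pk_customer PRIMARY KEY (customer_id)
);

-- Domain business rule: validate the values an attribute may assume.
ALTER TABLE erm.customer
ADD CONSTRAINT plausible_birth_date CHECK (birth_date > DATE'1900-01-01');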
Logical Data Modeling
Optimal Approach

1. Structural validity - Consistency with how the business defines and organizes information
2. Simplicity - Ease of understanding
3. No redundancy - No extraneous information
4. Shareability - Not specific to one solution, usable by many
5. Extensibility - Ability to evolve with minimal effect on existing base
6. Integrity - Consistency with the way the business uses and manages
information values

Building a Data Warehouse

Building a DWH - Process
Simplified DWH Process

Models are front-and-center when building a data warehouse:
● Business Information Model (BIM): models actors, their relations, and how they interact ("how the business works")
● Logical Data Model (LDM): model of the data that is associated with the BIM
● Physical Data Model (PDM): implemented data model derived from the LDM

The simplified process runs through three phases:
● Analyze: business requirements → Business Information Model; source data analysis
● Design: Logical Data Model → Physical Data Modeling; Data Mart Logical Data Model → Data Mart Physical Data Modeling; data staging design; source ETL mapping → ETL design
● Build: source data in staging; DWH implementation; ETL development
Building a DWH - Process
Simplified DWH Process

Models describe the business world and its relationships, i.e. they depict the business processes within the organization.
Models generate the business context required to create business information from the data and store it accordingly.
Building a DWH - Process
Simplified DWH Process

For this process:
● Analyze is technology-agnostic
● Design is impacted by technology and understandability for consumers
● Build uses the actual technology to implement the physical model and the ETL processes
Data Warehousing in the Lakehouse

When it comes to data warehousing migrations, implementations, and associated use cases, the data architect is typically not in a position to dictate the methodologies that govern a legacy data warehouse.
The beauty of the Databricks Lakehouse is that it can easily support the harmonious coexistence of as many legacy DWH methodologies as the business requires.
Furthermore, and crucially important for maximizing the value of your data in Databricks, a well-architected Lakehouse opens up new opportunities to apply data warehouse data to modern use cases.
Data Warehouse Data Modeling

LECTURE

Inmon’s
Corporate
Information
Factory
Inmon in a Nutshell

Bill Inmon’s Corporate Information Factory
Understanding the Foundation of Data Warehousing

Bill Inmon is often referred to as the "father of data warehousing." His Corporate Information Factory (CIF) provides a comprehensive framework for building enterprise-wide data warehouses (EDWs).

Key Principles: Emphasizes a top-down approach, integrated data, subject orientation, time-variance, and non-volatility.

Importance: Establishes a robust architecture that supports strategic decision-making and business intelligence initiatives.

Top-Down Data Warehousing Approach
Building the Foundation Before Data Marts

Inmon advocates for creating a centralized data warehouse before developing specialized data marts.
Process Flow:
● Enterprise Data Warehouse (EDW): Serves as the single source of truth
● Data Marts: Derived from the EDW to serve specific business functions
Advantages:
● Ensures consistency
● Reduces data redundancy
● Provides a unified view across the organization

Subject-Oriented Data Modeling
Organizing Data Around Business Subjects

With Inmon, data is categorized into subjects (e.g., sales, finance, inventory)
rather than applications or processes.
Benefits:
● Enhances clarity and relevance for business users.
● Facilitates easier data analysis and reporting.
Implementation:
● Utilizes dimensional models like star schemas within each subject area,
ensuring data is organized logically.

Integrated and Consistent Data
Ensuring Data Uniformity Across the Warehouse

Integration: Combines data from disparate sources, ensuring consistency in formats, naming conventions, and definitions.
Challenges Addressed:
● Resolves data silos
● Eliminates discrepancies
● Harmonizes differing data standards.
Techniques:
● ETL Processes: Extract, Transform, Load operations are crucial for data
integration.
● Metadata Management: Maintains information about data sources,
transformations, and structures to support integration.
Time-Variant Data
Capturing Historical Data for Trend Analysis

Data warehouses store historical data, allowing analysis over different time
periods.
Importance:
● Enables businesses to track changes, identify trends, and make informed
predictions.
Implementation:
● Snapshot Schemas: Capture data at specific intervals.
● Slowly Changing Dimensions (SCD): Manage changes in dimension
attributes over time without losing historical accuracy.

Non-Volatile Data Storage
Stability and Consistency of Warehouse Data

Once data enters the data warehouse, it is not updated or deleted; it remains stable to ensure reliability.
Advantages:
● Provides a consistent historical record.
● Enhances trustworthiness for decision-making.
Operational Implications:
● Focuses on append-only data loading, preventing unintended alterations
and preserving data integrity.

Corporate Information Factory Architecture
Core Components and Their Roles

Enterprise Data Warehouse (EDW): Central repository integrating data from all sources.
Data Marts: Subsets of the EDW, tailored for specific business areas.
Operational Data Store (ODS): Handles current, transactional data for
operational reporting.
ETL Layer: Manages data extraction, transformation, and loading into the
warehouse.
Metadata Repository: Stores information about data sources, structures,
and transformations.
Access Tools: Facilitate data retrieval, reporting, and analysis for
end-users.
Extract, Transform, Load (ETL) Processes
The Backbone of Data Integration

Extract: Retrieves data from various source systems, which can include
databases, applications, and external files.
Transform: Cleanses, standardizes, and enriches data to ensure
consistency and quality. This step may involve:
● Data cleansing (removing duplicates, correcting errors)
● Data integration (combining data from different sources)
● Data transformation (converting data types, aggregating data)
Load: Inserts the transformed data into the data warehouse, ensuring it is
organized for efficient querying and analysis.
Tools and Technologies: Examples include Informatica, Talend, and
Microsoft SSIS, which automate and manage ETL processes.
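A hedged ELT-style sketch of the three steps in Databricks SQL; the path, table names, and columns are illustrative assumptions:

-- Extract: pull raw source files into a staging table.
CREATE OR REPLACE TABLE staging.customers_raw AS
SELECT * FROM read_files('/Volumes/demo/landing/customers/', format => 'csv', header => true);

-- Transform: cleanse (deduplicate, standardize) and integrate.
CREATE OR REPLACE TEMP VIEW customers_clean AS
SELECT DISTINCT
       CAST(customer_id AS BIGINT)  AS customer_id,
       initcap(trim(customer_name)) AS customer_name,
       upper(trim(country_code))    AS country_code
FROM staging.customers_raw
WHERE customer_id IS NOT NULL;

-- Load: upsert the transformed data into the warehouse table.
MERGE INTO edw.customer AS t
USING customers_clean AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;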
Data Marts in Inmon Data Warehouses
Specialized Subsets for Targeted Analysis

Data marts are focused segments of the data warehouse, designed to serve specific business lines or departments.
Types:
● Dependent: Sourced directly from the EDW, ensuring consistency.
● Independent: Created from separate data sources, typically used in
bottom-up approaches but can complement the top-down strategy.
Benefits:
● Enhanced performance for specific queries.
● Tailored data models meeting the unique needs of different user groups.
Integration with EDW: Ensures that all data marts maintain alignment with
the centralized data warehouse for unified reporting.
Corporate Information Factory Benefits
Why Choose the Top-Down Approach?

Scalability: Supports growth by providing a flexible and expandable architecture.
Consistency: Maintains uniform data definitions and standards across the
organization.
Comprehensive View: Offers an enterprise-wide perspective, facilitating
holistic decision-making.
Data Quality: Emphasizes rigorous data integration and cleansing
processes.
Long-Term Investment: Focuses on building a sustainable and
maintainable data infrastructure that adapts to evolving business needs.
Inmon and Normalization

Normalization: From UNF to 3NF
Enhancing Data Integrity Through Progressive Constraints

Unnormalized Form (UNF):


● Data may contain repeating groups and multi-valued attributes.
● No enforced rules on data organization.
First Normal Form (1NF):
● Eliminate Repeating Groups: Each field contains only atomic values.
● Unique Rows: Each record must be unique.

Normalization: From UNF to 3NF
Enhancing Data Integrity Through Progressive Constraints

Second Normal Form (2NF):


● Already in 1NF
● Eliminate Partial Dependencies: Non-key attributes must depend entirely
on the primary key, not just part of it.
Third Normal Form (3NF):
● Already in 2NF
● Eliminate Transitive Dependencies: Non-key attributes must depend only
on the primary key and not on other non-key attributes.

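A worked sketch of the decomposition, assuming an unnormalized order record along the lines of orders(order_id, customer_name, customer_city, product1, product2, quantity1, …); all names are illustrative:

-- 1NF: eliminate the repeating product columns; one atomic row per order line.
-- 2NF: quantity depends on the full key (order_id, product_id), so it lives
--      in order_lines, not on orders.
-- 3NF: customer_city depends on the customer, not the order (a transitive
--      dependency), so it moves to the customers table.
CREATE TABLE edw.customers   (customer_id BIGINT, customer_name STRING, city STRING);
CREATE TABLE edw.products    (product_id  BIGINT, product_name  STRING);
CREATE TABLE edw.orders      (order_id    BIGINT, customer_id   BIGINT, order_date DATE);
CREATE TABLE edw.order_lines (order_id    BIGINT, product_id    BIGINT, quantity INT);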
Normalization Pros and Cons – Inmon
Central to the Inmon EDW Strategy

Pros of Normalization:
● Minimizes Data Redundancy: ensures a single source of truth by avoiding duplicate data storage.
● Enhances Data Integrity & Consistency: updates occur only in one place, preventing synchronization issues.
● Optimized for Transactional Updates: reduces storage costs and improves efficiency in operational environments.
● Provides Flexibility for Data Integration: allows cross-enterprise data modeling with strict entity relationships.

Cons of Normalization:
● Query Performance Trade-offs: highly normalized structures require multiple joins, increasing query complexity.
● Slower Analytical Processing: complex joins can impact BI and reporting performance.
● Requires ETL Effort for Denormalization: data marts often need further transformation for efficient end-user querying.
● Not Always Ideal for AI & ML Workloads: ML pipelines often require denormalized feature stores, requiring additional processing steps.
Normalization and the Databricks Platform

Databricks' parallel engine (Apache Spark) is tremendously good at scanning and processing large volumes of data, since these steps can be done in parallel over N workers.

Joins, for the most part, lead to an exchange of data between workers through serialization and deserialization.

By utilizing modern features such as liquid clustering, predictive optimization, deletion vectors, and the gathering of statistics, one can dramatically reduce the impact of normalization.
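A hedged sketch of applying some of these features to a normalized table; the table and column names are illustrative:

-- Liquid clustering on the common join keys keeps related rows co-located.
CREATE TABLE edw.order_lines (
  order_id   BIGINT,
  product_id BIGINT,
  quantity   INT
)
CLUSTER BY (order_id, product_id);

-- Deletion vectors avoid rewriting files on deletes/updates.
ALTER TABLE edw.order_lines SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

-- Column statistics help the optimizer prune data and plan joins.
ANALYZE TABLE edw.order_lines COMPUTE STATISTICS FOR ALL COLUMNS;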
Inmon Visualized

Inmon’s Corporate Information Factory
Process and Logical View

● Relational modeling with normalized data as the core of the data warehouse
● Data marts (often dimensional or denormalized models)

Process view: Business Requirements → Business Information Model → DWH Logical Data Model → DWH Physical Data Model, plus Data Marts Logical Data Model (LDM) → Data Mart Physical Model (dimensional)
Logical view: Sources → Staging (relational) → DWH (3NF) → Data Marts and Cubes
Data Modeling Work-process (Inmon)

1. Business Information Model (conceptual view) – wide business perspective
   High-level model of the actors and interactions of interest for the business. The focus is to capture the major processes of interest.
2. User Views (Domains) – data requirement perspective per function / user (User = Business Function)
   Each business process is worked on individually.
   Tasks: identify major entities; determine relationships between entities; determine primary and alternate keys; determine foreign keys; determine key business rules; add remaining attributes; validate normalization rules; determine data types.
3. Composite Logical Data Model – data integration and conflict resolution
   Tasks: combine User Views; integrate with existing data models; analyze for stability and growth.
4. Physical Data Model (PDM) – efficiency and usability
   Tasks: translate the logical data structure (identify tables and columns; adapt the structure to the technology; design how to enforce business rules around entities (PK, FK); design how to enforce integrity (relationships); tune storage-related mechanisms).

⇒ This is an iterative, ongoing process across the warehouse lifecycle where information captured in later steps may inform prior steps.
Data Warehouse Data Modeling

DEMONSTRATION

Entity
Relationship
Modeling

Data Warehouse Data Modeling

LECTURE

Kimball’s
Dimensional
Modeling

Kimball in a Nutshell

Ralph Kimball’s Dimensional Modeling
A Practical Approach to Data Warehousing

Ralph Kimball advocates for a bottom-up approach. His Dimensional Modeling technique focuses on user accessibility and performance.
Key Principles:
● Dimensional Design: Organizes data into fact and dimension tables.
● Bus Architecture: Ensures scalability and consistency across the data
warehouse.
● Incremental Development: Builds the data warehouse iteratively through
data marts.
Importance:
● Emphasizes ease of use for business users.
● Optimizes query performance for reporting and analysis.
Kimball vs. Inmon
Comparing Methodologies for Data Warehousing

Kimball’s Bottom-Up Approach:


● Focus: Starts with creating data marts for specific business processes.
● Architecture: Data marts are integrated into a cohesive data warehouse
using conformed dimensions.
● Advantages: Faster implementation, immediate business value, flexibility.
Inmon’s Top-Down Approach:
● Focus: Begins with a comprehensive Enterprise Data Warehouse (EDW).
● Architecture: EDW is the repository from which data marts are derived.
● Advantages: Ensures data consistency and integration across the
enterprise.

Kimball vs. Inmon
Comparing Methodologies for Data Warehousing

Key Differences:
● Implementation Speed: Kimball’s approach typically delivers results
quicker.
● Scalability: Inmon’s method may better support large-scale,
enterprise-wide initiatives.
● Flexibility: Kimball’s approach allows for more iterative and adaptable
development.

Dimensional Modeling

Core Concepts of Dimensional Modeling
Building Blocks of Kimball’s Approach - Tables

Fact Tables: Central tables that store measurable, quantitative data related
to business processes.
● Contain foreign keys referencing dimension tables.
● Include numeric metrics (e.g., sales amount, quantity).
● Often contain additive, semi-additive, or non-additive measures.
Dimension Tables: Surrounding tables that provide descriptive attributes
related to fact data.
● Contain textual or categorical information (e.g., product names).
● Often denormalized to optimize query performance.
● Support hierarchical relationships (e.g., dates with year, quarter, month).

Core Concepts of Dimensional Modeling
Building Blocks of Kimball’s Approach - Schemas

Star Schema:
● Structure: Fact table at the center connected to multiple dimension
tables.
● Advantages: Simplifies queries, enhances performance, and improves
readability.
Snowflake Schema:
● Structure: Extension of star schema; dimension tables are normalized
into multiple related tables.
● Advantages: Reduces data redundancy and can save storage space, but
may complicate queries.

Designing Fact Tables
Capturing Business Metrics Effectively

Types of Fact Tables:


● Transactional Facts: Record individual business transactions.
● Periodic Snapshot Facts: Capture data at regular intervals.
● Accumulating Snapshot Facts: Track the progression of a process.
Grain Definition:
● Importance: Defines the level of detail stored in the fact table.

Designing Fact Tables
Capturing Business Metrics Effectively

Measures:
● Additive Measures: Can be summed across any dimension.
● Semi-Additive Measures: Can be summed across some dimensions but
not all.
● Non-Additive Measures: Cannot be summed across any dimension (e.g., ratios, unit prices).
Foreign Keys:
● Role: Link fact tables to corresponding dimension tables.
● Implementation: Ensure referential integrity and support efficient joins
during queries.

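A hedged sketch of a transactional fact table combining these ideas; all names are illustrative, and primary/foreign key constraints are informational in Databricks:

CREATE TABLE gold.fact_sales (
  date_key     INT    NOT NULL,  -- references dim_date
  product_key  BIGINT NOT NULL,  -- surrogate key into dim_product
  customer_key BIGINT NOT NULL,  -- surrogate key into dim_customer
  quantity     INT,              -- additive measure
  sales_amount DECIMAL(18, 2),   -- additive measure
  unit_price   DECIMAL(18, 2),   -- non-additive measure (do not sum)
  CONSTRAINT fk_product  FOREIGN KEY (product_key)  REFERENCES gold.dim_product,
  CONSTRAINT fk_customer FOREIGN KEY (customer_key) REFERENCES gold.dim_customer
);
-- Grain: one row per order line, the lowest level of detail captured.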
Designing Dimension Tables
Structuring Descriptive Context for Facts

Characteristics of Dimension Tables:


● Descriptive Attributes: Provide context to fact data (e.g., customer name,
product category).
● Surrogate Keys: Unique identifiers used instead of natural keys to handle
changes over time.
● Hierarchies: Enable drill-down capabilities in reports (e.g., geographic
hierarchies from country to city).

Designing Dimension Tables
Structuring Descriptive Context for Facts

Types of Dimensions:
● Conformed Dimensions: Shared across multiple fact tables and data
marts, ensuring consistency.
● Role-Playing Dimensions: Used multiple times within the same schema
(e.g., date dimension used for order date and ship date).
● Junk Dimensions: Combine unrelated low-cardinality attributes into a
single dimension to reduce clutter in fact tables.
Handling Slowly Changing Dimensions (SCD):
● SCD Type 1: Overwrites old data with new data, not preserving history.
● SCD Type 2: Creates a new record to preserve historical data.

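A simplified, hedged sketch of SCD Type 2 handling with MERGE; gold.dim_customer and the updates staging view are illustrative assumptions, and a production pipeline would also guard against reprocessing the same batch:

-- Step 1: close out current rows whose tracked attribute changed, and
-- insert brand-new customers as current rows.
MERGE INTO gold.dim_customer AS d
USING updates AS u
  ON d.customer_id = u.customer_id AND d.is_current = TRUE
WHEN MATCHED AND d.address <> u.address THEN
  UPDATE SET is_current = FALSE, valid_to = current_date()
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, valid_from, valid_to, is_current)
  VALUES (u.customer_id, u.address, current_date(), NULL, TRUE);

-- Step 2: insert the new versions for the rows just closed, preserving history.
INSERT INTO gold.dim_customer (customer_id, address, valid_from, valid_to, is_current)
SELECT u.customer_id, u.address, current_date(), NULL, TRUE
FROM updates u
JOIN gold.dim_customer d
  ON d.customer_id = u.customer_id
 AND d.is_current = FALSE
 AND d.valid_to = current_date()
 AND d.address <> u.address;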
Star Schema Design
Simplifying Data Access and Querying

Structure:
● Central Fact Table: Contains measures and foreign keys to dimension
tables.
● Surrounding Dimension Tables: Provide descriptive context for facts.
Advantages:
● Simplicity: Easy to understand and navigate for end-users and analysts.
● Performance: Optimized for read-heavy operations, enhancing query
speed.
● Flexibility: Facilitates ad-hoc querying and reporting without complex
joins.

Star Schema Design
Simplifying Data Access and Querying

Design Best Practices:


● Denormalize Dimensions: Reduce the number of joins required for
queries.
● Use Surrogate Keys: Maintain consistency and handle changes
effectively.
● Ensure Conformed Dimensions: Promote reuse and consistency across
different fact tables and data marts.

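A hedged sketch tying these practices together (all table and column names are illustrative): a dimension with an identity-generated surrogate key, and a slice-and-dice query joining the fact table to its conformed dimensions:

-- Surrogate keys via identity columns keep dimensions stable over time.
CREATE TABLE gold.dim_product (
  product_key  BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key
  product_id   STRING,                               -- natural/business key
  product_name STRING,
  category     STRING
);

-- Typical star-schema query: fact joined to each dimension on its key.
SELECT d.year, p.category, c.country, SUM(f.sales_amount) AS revenue
FROM gold.fact_sales f
JOIN gold.dim_date     d ON f.date_key     = d.date_key
JOIN gold.dim_product  p ON f.product_key  = p.product_key
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.year, p.category, c.country;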
Snowflake Schema Design
Normalizing Dimensions for Efficiency

Structure:
● Central Fact Table: Similar to the star schema, contains measures and
foreign keys.
● Normalized Dimension Tables: Break down dimension tables into multiple
related tables.
Advantages:
● Storage Efficiency: Reduces data redundancy, saving storage space.
● Data Integrity: Maintains consistency through normalized tables.

Snowflake Schema Design
Normalizing Dimensions for Efficiency

Disadvantages:
● Complexity: Increases the number of joins required for queries,
potentially impacting performance.
● Maintenance: More complex to manage and understand compared to
star schemas.
When to Use:
● Large, Complex Dimensions: Where normalization can significantly
reduce redundancy.
● Strict Data Integrity Requirements: Ensuring consistency across
normalized tables.

Kimball and Denormalization

Denormalization Pros and Cons – Kimball
Key to Dimensional Modeling & Performance

Pros of Denormalization:
● Optimized for Query Performance: pre-joined tables eliminate expensive multi-table joins, making queries run faster.
● Intuitive for Business Users: the star schema structure aligns with how analysts think and report on data.
● Simplifies BI & Aggregation: measures and dimensions are pre-aggregated, reducing computation time.
● Ideal for AI Feature Stores: machine learning models often require flat, wide tables, a direct outcome of denormalization.

Cons of Denormalization:
● Increased Data Redundancy: fact tables store repeated dimension values, leading to larger storage requirements.
● Risk of Data Inconsistency: updates must be carefully managed to avoid misaligned data across multiple tables.
● Not Ideal for Transactional Updates: Kimball's approach is read-optimized, making transactional updates complex.
● More Storage Overhead: large, flattened tables may result in higher storage costs compared to normalized schemas.
Denormalization and the Databricks Platform

The advent of columnar storage such as Delta Lake has reduced the need for strict normalization; having many columns in the same table no longer incurs the cost of scanning a complete row.

Denormalization almost always means duplication of data at some level, but thanks to Databricks' storage compression mechanisms and filtering capabilities, the impact of denormalization is limited.

Taken together with the ability to store row formats in columns, the data architect can have tables pre-joined but isolated in separate structs, able to be treated as individual tables or as a pre-joined result.
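A hedged sketch of that struct approach (names illustrative): dimension attributes are pre-joined into a wide table but kept isolated in their own structs, and columnar storage reads only the fields a query touches:

CREATE TABLE gold.sales_wide AS
SELECT f.order_id,
       f.sales_amount,
       named_struct('product_id',   p.product_id,
                    'product_name', p.product_name,
                    'category',     p.category)      AS product,
       named_struct('customer_id',   c.customer_id,
                    'customer_name', c.customer_name,
                    'country',       c.country)      AS customer
FROM gold.fact_sales f
JOIN gold.dim_product  p ON f.product_key  = p.product_key
JOIN gold.dim_customer c ON f.customer_key = c.customer_key;

-- Only the struct fields the query touches are scanned:
SELECT product.category, SUM(sales_amount)
FROM gold.sales_wide
GROUP BY product.category;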
Kimball Visualized

Star Schema vs. Snowflake Schema

Star schema (e.g., a sales fact with product, customer, and store dimensions):
● Fact table contains business "facts" (like transaction amounts and quantities)
● Dimension tables contain information about descriptive attributes and are typically denormalized
● Star schemas enable users to slice and dice the data, typically by joining two or more fact tables and dimension tables together

Snowflake schema (dimensions split further, e.g., product → product category and details; customer → country, city, role; store → region and type):
● Fact table as with the star schema
● Dimension tables are broken down into sub-dimensions; dimensions are normalized
● Simple data model enforcing data quality, with fast retrieval
● Higher setup and maintenance efforts
Kimball’s Dimensional Modeling
Process and Logical View

● Denormalized data model
● Built as a star or snowflake schema
● Central fact tables surrounded by dimension tables

Process view: Business Requirements → Dimensional Modeling → Physical Design → ETL Design & Development, alongside Tech Arch Design and BI App Design → BI App Development
Logical view: Sources → Staging → DWH (dimensional) with consistent Data Marts and Cubes
Dimensional modeling according to Kimball
Fundamental Concepts

1. Gather Business Requirements and Data Realities


2. Collaborative Dimensional Modeling Workshops
3. Four-Step Dimensional Design Process
4. Business Processes
5. Grain
6. Dimensions for Descriptive Context
7. Facts for Measurements
8. Star Schemas and OLAP cubes
9. Graceful Extensions to Dimensional Models

See https://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf

Example Logical Design (Retail business)

Design steps:
1. Select a business process
2. Determine granularity
3. Choose dimensions
4. Identify measures

Business processes (examples): Assortment Plans, Purchase Orders, Inventory, Customer Orders, Customer Shipments, Credit, Returns, Trended Surveys, General Ledger

For each business process, define 1..N facts. For each fact, define the lowest granularity, define its dimensions (deciding the granularity of each dimension), and define all measurements.
Data Modeling: Dimensional Modeling

Landing (bronze)
● Raw data in its original format (temporarily)

Ingestion (bronze)
● Raw data converted to Delta (from Avro, CSV, Parquet, XML, or JSON format in Landing)
● Verified data contract: schema (typically derived from the source), timeframe, …
● Sometimes called Staging

Integration - Physical data model (silver)
● Detailed information covering multiple business domains (including glossary and taxonomy)
● Integrates all data sources
● Does not necessarily use a dimensional model, but feeds dimensional models
● Derived from the Business Information Model → Logical Data Model (3NF*) → Physical Data Model

Data Mart (gold)
● Subset of the Integrated layer, sometimes filtered or aggregated data
● Focus on dimensional modeling with star schema (e.g., an Order fact surrounded by Customer, Product, and Time dimensions)
● Typically oriented to a specific line of business or team

* 3NF = "Third normal form" in data modelling
Data Warehouse Data Modeling

DEMONSTRATION

Dimensional
Modeling

Data Warehouse Data Modeling

LECTURE

Data Vault 2.0

Data Vault 2.0 in a Nutshell

Introduction to Data Vault 2.0
Modernizing Data Warehousing for Agility and Scalability

Data Vault 2.0 is an advanced evolution of the original Data Vault modeling
methodology, designed to address the complexities of modern data
warehousing.

It combines the strengths of Data Vault 1.0 with additional features to support big data, real-time analytics, and agile development practices.

Introduction to Data Vault 2.0
Modernizing Data Warehousing for Agility and Scalability

Key Objectives:
● Enhance scalability and flexibility to handle large and rapidly changing
data environments.
● Improve data integration from diverse sources with minimal latency.
● Support agile and iterative development methodologies for faster
deployment and adaptability.
Importance:
● Meets the demands of contemporary businesses for timely, accurate,
and comprehensive data insights.
● Facilitates the integration of structured and unstructured data,
accommodating various data types and sources.
Core Components of Data Vault 2.0
Building Blocks for Robust Data Integration

Hubs: Central entities representing unique business keys (e.g., Customer ID,
Product SKU).
● Contain a unique list of keys with minimal attributes (Business Key, Load
Date, Record Source).
● Serve as the primary point of integration for related data.
Links: Associations or relationships between Hubs (e.g., Customer
purchases Product).
● Capture many-to-many relationships without redundancy.
● Include foreign keys referencing related Hubs, Load Date, and Record
Source.

Core Components of Data Vault 2.0
Building Blocks for Robust Data Integration

Satellites: Descriptive or contextual data related to Hubs or Links (e.g.,
Customer Name, Address).
● Store historical and time-variant data.
● Include attributes such as Data Fields, Load Date, and Record Source.
PIT and Bridge Tables (Data Vault 2.0 Enhancements)
● PIT (Point-in-Time) Tables: Facilitate point-in-time reporting by consolidating data from
multiple Satellites.
● Bridge Tables: Handle complex many-to-many relationships and
hierarchies within the data model.
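As a hedged illustration of these building blocks (all names below are hypothetical, and hash-key conventions vary by implementation), a minimal Hub, Link, and Satellite could be declared like this in Databricks SQL:

```sql
-- Hypothetical Raw Vault sketch: one Hub, one Link, one Satellite.
CREATE TABLE IF NOT EXISTS silver.raw_vault.hub_customer (
  hub_customer_hk STRING    NOT NULL,  -- hash of the business key
  customer_id     STRING    NOT NULL,  -- business key
  load_date       TIMESTAMP NOT NULL,
  record_source   STRING    NOT NULL,
  CONSTRAINT pk_hub_customer PRIMARY KEY (hub_customer_hk)
);

CREATE TABLE IF NOT EXISTS silver.raw_vault.link_customer_order (
  link_customer_order_hk STRING    NOT NULL,  -- hash over both Hub keys
  hub_customer_hk        STRING    NOT NULL,  -- references hub_customer
  hub_order_hk           STRING    NOT NULL,  -- references hub_order (not shown)
  load_date              TIMESTAMP NOT NULL,
  record_source          STRING    NOT NULL
);

CREATE TABLE IF NOT EXISTS silver.raw_vault.sat_customer_details (
  hub_customer_hk STRING    NOT NULL,  -- parent Hub key
  load_date       TIMESTAMP NOT NULL,  -- part of the key: one row per change
  record_source   STRING    NOT NULL,
  customer_name   STRING,
  address         STRING,
  hash_diff       STRING               -- hash over attributes for change detection
);
```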

Data Vault 2.0 Architecture
Structuring for Scalability and Flexibility

Layered Architecture
● Raw Data Vault: Ingests and stores data as-is, ensuring data integrity and
traceability.
○ Components: Hubs, Links, Satellites.
● Business Data Vault: Enhances the Raw Data Vault with business logic,
derived data, and additional context.
○ Components: Derived Satellites, Calculated Metrics.
● Information Delivery Layer: Provides data through data marts, reporting
and analytics platforms.
○ Components: Data Marts (Star/Snowflake Schemas), APIs, BI Tools.

Data Vault 2.0 Architecture
Structuring for Scalability and Flexibility

Integration with Modern Technologies


● Big Data Platforms: Seamlessly integrates with Hadoop, Spark, and
cloud-based data warehouses.
● Real-Time Processing: Supports real-time data ingestion and streaming
analytics.
Agile and DevOps Alignment
● CI/CD: Facilitates automated testing, deployment, and version control.
● Modular Development: Enables incremental and parallel development of
different components.

Data Vault 2.0 Methodology
Agile and Scalable Development Practices

Planning and Requirements Gathering


● Define business objectives, key metrics, and data sources.
● Establish governance and data quality standards.
Modeling
● Design Hubs, Links, and Satellites based on business keys and
relationships.
● Incorporate Pit and Bridge Tables as needed.

Data Vault 2.0 Methodology
Agile and Scalable Development Practices

ELT Development
● Develop Extract, Load, Transform (ELT) processes to populate the Raw
and Business Data Vaults.
● Implement data quality checks and transformation logic.
Testing and Validation
● Ensure data accuracy, integrity, and performance through rigorous
testing.
● Validate against business requirements and use cases.

Data Vault 2.0 Methodology
Agile and Scalable Development Practices

Deployment and Maintenance


● Deploy the Data Vault to production environments.
● Continuously monitor, maintain, and enhance the data warehouse.
Agile Practices
● Iterative Development: Build the data warehouse in manageable
increments, allowing for flexibility and adjustments.
● Cross-Functional Teams: Collaborate across technical and business
teams to ensure alignment and address evolving needs.
● Continuous Feedback: Incorporate user feedback to refine and optimize
the data model and ETL processes.

Hubs, Links, and Satellites

Implementing Hubs in Data Vault 2.0
Capturing Core Business Entities

Purpose of Hubs:
● Represent business keys; central points of integration for related data.
● Ensure consistency and traceability of core business entities.
Design Considerations:
● Business Keys: Stable and unique business identifiers (e.g., Customer ID).
● Minimal Attributes: Maintain simplicity and reduce redundancy.
● Load Date and Record Source: For auditing and lineage purposes.
Best Practices:
● Consistent Naming Conventions: Use clear, standardized names for Hubs.
● Avoid Redundancy: Each Hub represents a single business key.
● Referential Integrity: Between Hubs and Links/Satellites
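One way to load such a Hub, as a sketch that reuses the hypothetical hub_customer table above and assumes a bronze CRM source table, is to hash the normalized business key and insert only keys not yet present:

```sql
-- Hypothetical Hub load: hash the cleaned business key, keep only new keys.
INSERT INTO silver.raw_vault.hub_customer
SELECT
  sha2(upper(trim(s.customer_id)), 256) AS hub_customer_hk,
  s.customer_id,
  current_timestamp()                   AS load_date,
  'crm'                                 AS record_source
FROM bronze.crm.customers s
LEFT ANTI JOIN silver.raw_vault.hub_customer h
  ON sha2(upper(trim(s.customer_id)), 256) = h.hub_customer_hk;
```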
Implementing Links in Data Vault 2.0
Modeling Relationships Between Business Entities

Purpose of Links:
● Capture relationships between Hubs (e.g., Customer purchases Product).
● Enable modeling of many-to-many relationships without redundancy.
Design Considerations:
● Identify Relationships: Determine how business keys interact and relate.
● Include Foreign Keys: Reference primary keys from related Hubs.
● Load Date and Record Source: Track Link ingestion time and source.
Best Practices:
● Atomic Relationships: A Link should represent a single relationship between Hubs.
● Avoid Overcomplicating: Links are for meaningful business relationships.
● Scalability: Design to accommodate future expansions and relationships.
Implementing Satellites in Data Vault 2.0
Storing Descriptive and Historical Data

Purpose of Satellites:
● Store descriptive, contextual, time-variant data related to Hubs or Links.
● Enable historical tracking and auditing of changes over time.
Design Considerations:
● Segmentation: Separate Satellites by subject areas or update frequency.
● Include Load Metadata: e.g. Load_Date, Record_Source, and End_Date.
● Handle SCDs: Manage changes in dimension attributes.
Best Practices:
● Granular Separation: Separate Satellites for different types of data.
● Update Mechanisms: Uniform processes updating Satellite data.
● Documentation: Satellite purpose and contents to aid in usage.
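A hedged sketch of such an update mechanism (the names continue the hypothetical example above, and staging.customer_changes is an assumed pre-hashed staging table): insert a new Satellite row only when the attribute hash differs from the most recent row for that Hub key, yielding insert-only history:

```sql
-- Hypothetical incremental Satellite load: append only changed attribute sets.
INSERT INTO silver.raw_vault.sat_customer_details
SELECT
  s.hub_customer_hk,
  current_timestamp() AS load_date,
  s.record_source,
  s.customer_name,
  s.address,
  s.hash_diff
FROM staging.customer_changes s
LEFT JOIN (
  SELECT hub_customer_hk,
         max_by(hash_diff, load_date) AS current_hash_diff  -- latest row per key
  FROM silver.raw_vault.sat_customer_details
  GROUP BY hub_customer_hk
) cur
  ON s.hub_customer_hk = cur.hub_customer_hk
WHERE cur.current_hash_diff IS NULL           -- brand-new key
   OR cur.current_hash_diff <> s.hash_diff;   -- attributes changed
```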
Data Vault 2.0 Visualized

DWH Modeling Approaches - Data Vault
Process and Logical View

The Data Vault model is based on three basic entity types:
• Hubs separate core business concepts
• Links store relationships between business concepts
• Satellites store the attributes of a business concept or relationship

The Data Vault model is split into:
• Raw Vault / Raw Data Vault:
  • Stores unaltered, granular source data.
  • Immutable, historical record of all data in the Data Vault.
• Business Vault / Business Data Vault:
  • Sparsely modeled DWH based on Data Vault design principles
  • Data is modified according to business rules or requirements
• Information Marts:
  • Like Data Marts
  • Dimensional model based

[Diagram: Architecture: Sources → Staging → DWH (Data Vault: Raw Vault → Business Vault) → Information Marts / Data Cubes. Data model: the process (taxonomies, ontologies) drives the logical data model, which drives the physical data models (Raw Vault, Business Vault, Information Marts). Logical view: Hubs, Links, and Satellites, plus Bridge and Point-in-Time tables]
Data Vault Work-process
[Diagram: work process. Logical Data Model: Define Ontology → Define Taxonomies → Model Information Mart; Physical Data Model: Model Business Vault → Model Raw Vault. Start with what the business needs; an Enterprise Business Ontology drills down into Domain Ontologies]

Ontologies
• Define how the business sees its data
• Model real-life entities
• Start with business concepts
• Connect business concepts with business keys
• Drill down into the hierarchies (Taxonomies)

Taxonomies
• Follow a hierarchical format and provide names for each object in relation to other objects
• Capture the membership properties of each object in relation to other objects
• Have specific rules to classify or categorize any object in a domain; the rules must be complete, consistent, and unambiguous
• Each object inherits all properties of the class above it and may have additional properties

Ontologies provide context to developers, designers, and business users on how the data fits the business.

"Data Vault Modeling was, is, and always will be about the business" - Dan Linstedt (creator of Data Vault)
Data Modeling: Data Vault 2.0
Landing (bronze)
• Raw data in its original format (temporarily)

Ingestion (bronze, sometimes called Staging)
• Raw data converted to Delta (from Avro, CSV, Parquet, XML, or JSON format in Landing)
• Verified data contract: schema (typically derived from the source), timeframe, …

Integration - Raw Vault (silver)
Data is modeled as:
• Hubs (unique business keys)
• Links (relationships and associations)
• Satellites (descriptive data)

Integration - Business Vault (silver)
Tables with applied business rules, data quality rules, cleansing and conforming rules:
• Business views
• Point-in-Time (PIT) tables (opt.)
• Bridge tables created on top of the Business Vault (opt.)

Presentation - Information Marts (gold)
• Similar to a classical Data Mart, with data that has been cleansed and harmonized
• Consumer-oriented models (typically views)

[Diagram: Landing (raw data, temp.) → Ingestion (verified data, bronze) → Integration (Raw Vault: Hubs, Links, Satellites; Business Vault: business views, PIT and Bridge tables; silver) → Presentation (Information Marts, gold), loaded via ETL/ELT; example Data Vault 2.0 model: Customer, Product, and Order Hubs with their Satellites, connected by a Link]
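As a hedged example of a Business Vault object over the hypothetical Raw Vault tables sketched earlier, a "current customer" view can select the latest Satellite row per Hub key:

```sql
-- Hypothetical Business Vault view: the current state of each customer.
CREATE OR REPLACE VIEW silver.business_vault.customer_current AS
SELECT
  h.customer_id,
  s.customer_name,
  s.address,
  s.load_date
FROM silver.raw_vault.hub_customer h
JOIN silver.raw_vault.sat_customer_details s
  ON h.hub_customer_hk = s.hub_customer_hk
QUALIFY row_number() OVER (
  PARTITION BY s.hub_customer_hk ORDER BY s.load_date DESC) = 1;
```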
Data Warehouse Data Modeling

DEMONSTRATION

Data Vault 2.0

Modern Data
Architecture Use
Cases

Data Modeling Strategies


Objectives
● Introduce modern AI-driven use cases: featurization, real-time inference
● Illustrate modern use case study (as distinguished from DWH use case)
● Explore medallion approach for featurization
● Highlight Feature Store integration
● Create a feature table
● Register the table in Feature Store
● Recap differences among Inmon, Kimball, DV, and Modern approaches
● Examine the benefits of the enhanced medallion architecture

Modern Data Architecture Use Cases

LECTURE

Feature Stores

Modern Data Modeling for ML and AI
Foundations for Advanced Analytics and Intelligence

Modern data modeling for ML and AI encompasses specialized structures
and practices to support the development, deployment, and maintenance
of intelligent systems.
Key Components:
● Feature Stores and Feature Tables
● Data Pipelines
● Model Management
Importance:
● Enhances the efficiency and effectiveness of ML/AI workflows.
● Ensures consistency, scalability, and reusability of features across
different models and teams.
Feature Stores in a Nutshell

Understanding Feature Stores
Centralizing Feature Management for ML and AI

A Feature Store is a centralized repository that manages, stores, and
serves features used in machine learning models.
Purpose:
● Consistency: Features used during training and serving are identical.
● Reusability: Allows reuse of existing features across multiple models.
● Scalability: Supports large-scale feature computation and storage.
Core Functions:
● Feature Storage: Persistent storage of computed features.
● Feature Serving: Real-time or batch access for model inference.
● Feature Governance: Metadata, lineage, and access controls.
Understanding Feature Stores
Centralizing Feature Management for ML and AI

A Feature Store is a centralized repository that manages, stores, and
serves features used in machine learning models.
Benefits:
● Reduces duplication of feature engineering efforts.
● Enhances collaboration between data engineering and data science
teams.
● Improves model deployment speed and reliability.

Architecture of a Feature Store
Components of a Feature Store

Core Components:
● Repository: Stores and manages feature definitions and metadata.
● Storage Layer: Physical storage systems (e.g., databases, data lakes)
where feature data resides.
● Serving Layer: APIs and services that provide features to ML models in
real-time or batch modes.
● Registry: Catalogs available features, including their definitions, sources,
and usage statistics.
● Transformation Layer: Tools and processes for feature engineering and
transformations.

Architecture of a Feature Store
Workflow of a Feature Store

Workflow:
● Feature Engineering: Data scientists create and transform raw data into
features.
● Feature Registration: Features are registered in the feature store with
metadata.
● Feature Storage: Transformed features are stored in the feature
repository.
● Feature Serving: Features are served to ML models during training and
inference.
● Monitoring and Management: Ongoing monitoring of feature quality,
usage, and performance.
Feature Tables

Feature Tables: Definition and Purpose
Organizing Features for Efficient ML Workflows

Feature Tables are structured tables within a feature store that organize
related features for specific ML use cases or business domains.
Each row represents a unique entity instance together with its feature values.
Purpose:
● Logical Grouping: Groups features by subject area for easier
management and access.
● Performance Optimization: Organizes features in a way that aligns with
ML workflows.
● Version Control: Manages feature table versions to track changes and
ensure reproducibility.
● Access Control: Implements granular access permissions.
Feature Tables: Definition and Purpose
Organizing Features for Efficient ML Workflows

Feature Tables are structured tables within a feature store that organize
related features for specific ML use cases or business domains.
Each row represents a unique entity instance together with its feature values.
Structure (Columns):
● Feature Name: Identifier for each feature.
● Data Type: Specifies the type of data (e.g., integer, float, string).
● Description: Detailed explanation of the feature’s purpose and usage.
● Source: Origin of the feature (e.g., raw data, derived).
● Creation Timestamp: When the feature was created or last updated.
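As a minimal sketch (catalog, schema, and column names are hypothetical): in Unity Catalog, a Delta table with a declared primary key can serve as a feature table for Databricks Feature Engineering:

```sql
-- Hypothetical feature table: the primary key identifies the entity;
-- the remaining columns are feature values.
CREATE TABLE IF NOT EXISTS ml.features.customer_features (
  customer_id      STRING NOT NULL,
  orders_last_30d  INT,
  avg_basket_value DOUBLE,
  CONSTRAINT pk_customer_features PRIMARY KEY (customer_id)
);
```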

Types of Feature Tables
Categorizing Feature Tables Based on Use Cases

Static Feature Tables: Features that remain constant over time.


● Low update frequency.
● Typically sourced from master data systems.
Dynamic Feature Tables: Features that are updated regularly.
● High update frequency.
● Often derived from transactional or streaming data sources.
Aggregated Feature Tables: Features that represent aggregated data.
● Computed using aggregation functions (e.g., sum, average).
● Support trend analysis and forecasting.

Types of Feature Tables
Categorizing Feature Tables Based on Use Cases

Temporal Feature Tables: Features that capture time-based changes and trends.
● Incorporate historical data points.
● Enable time-series analysis and predictive modeling.
Composite Feature Tables: Combine multiple types of features from
different sources or categories.
● Merge static, dynamic, and aggregated features.
● Support complex ML models that require diverse feature sets.
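For temporal feature tables specifically, a hedged sketch (names are hypothetical): Databricks lets a timestamp column be marked as the time series key, so point-in-time lookups can retrieve the value that was current at training time:

```sql
-- Hypothetical time series feature table: TIMESERIES marks the timestamp key.
CREATE TABLE IF NOT EXISTS ml.features.customer_features_ts (
  customer_id STRING    NOT NULL,
  ts          TIMESTAMP NOT NULL,
  clicks_1h   INT,
  CONSTRAINT pk_customer_features_ts PRIMARY KEY (customer_id, ts TIMESERIES)
);
```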

Modern Use Cases Visualized

Data Modeling: Modern use cases (ML and AI)

Landing (bronze)
• Raw data in its original format (could be temporary)
• A landing zone allows bronze in Delta format, independent of the original input format

Ingestion (bronze)
• Delta data converted from Raw (from Avro, CSV, Parquet, XML, or JSON format in Landing)
• Verification typically lightweight compared to DWH
• No other transformation or business logic is applied
• Often a "schema on read" approach

Curation (silver)
• Cleansed data, filtered data, and augmented data

Final (gold)
• Business-level aggregates
• Masked, reduced, anonymized for project purposes
• Denormalized for performance if needed

[Diagram: Landing (raw data, temp.) → Ingestion (verified data, bronze) → Curation (cleansed, filtered, augmented data; silver) → Final (business-level aggregates, project data; gold), loaded via ETL/ELT and consumed from Python, R, SQL, and Scala]
Modern Data Architecture Use Cases

DEMONSTRATION

Modern Case
Study:
Feature Store

Modern Data Architecture Use Cases

LECTURE

Combining
Approaches

Assessing DWH Models

Evaluating DWH modeling paradigms
Key examples (non-exhaustive)

Ability to change
● Big impact on Inmon models when business process changes (higher
effort and duration).
● Business changes, especially significant ones, can break the basis of a
Kimball model (higher effort and duration).
● Data Vault 2.0 structure facilitates reacting to business changes (lower
effort).

Evaluating DWH modeling paradigms
Key examples (non-exhaustive)

Complexity
● Inmon leads to very complex ETL and load dependencies that need to
be handled through load flow optimizations or additional ETL jobs to
ensure model consistency.
● Kimball dimensional models can be very hard to populate, since you
have to ensure consistency with the dimensions. Dimension logic can be
hard, particularly slowly changing dimensions of Type 2 and above can
be challenging.
● Data Vault 2.0 has 3-6 times more objects than a pure 3NF DW; this
impacts ETL, but the ETL is simplified, easily automated, and can for the
most part be run in parallel.
Evaluating DWH modeling paradigms
Key examples (non-exhaustive)

Robustness
● Inmon models can easily break due to changes in business processes
and business rules.
● Kimball is the simplest model to understand, but a critical mass of
changes entails remodeling large portions.
● For Data Vault 2.0, most changes can be compartmentalized to a
specific layer.

DWH Model Advantages
Summary

Inmon
• Normalized Structure: Inmon models follow a normalized approach, reducing data redundancy.
• Single Source of Truth: The data warehouse is designed as a single integrated repository.
• Well-Suited for Large EDW: Recommended for large-scale data integration.

Kimball
• Optimized for Reporting and Analytics: Dimensional models are designed specifically for efficient querying and reporting, and provide a clear structure for business users to understand.
• Easy to Understand: Business users find dimensional models intuitive due to their star schema or snowflake schema representation.
• Well-Suited for Smaller Projects: Quicker to implement for smaller-scale data marts.

Data Vault
• Operational Flexibility: Data Vault allows you to stay close to the source data, making it auditable and scalable.
• Easier to Add New Sources: Data Vault is flexible and accommodates new data sources seamlessly.
• Historical Data Tracking: Inherent support for historical data.

DWH Model Challenges
Summary

Inmon
• Complexity: Inmon models can be complex to implement and maintain.
• Slower Query Performance: Normalized structures may require more joins, impacting query performance.
• Less Intuitive for Business Users: Business users may find the normalized structure less intuitive.

Kimball
• Data Redundancy: Dimensional models may have some data redundancy due to denormalization, which can lead to maintenance challenges.
• No Single Source of Truth: Data marts are organized around business areas, which can result in multiple versions of the same data.

Data Vault
• Not Ideal for Analysis & Reporting: Data Vault may not be the best choice for direct analysis and reporting; you might still need dimensional modeling for virtual data marts.
• Complexity: Data Vault models can become complex, especially when directly populating data marts from them.
• Joins and Recursions: Populating data marts can lead to complex joins and recursions.

Common DWH modeling challenges
The attraction of “no modeling”

Many organizations find it challenging to handle the life cycle around data
models. Challenges come in the form of people, process, and technology.
● To maintain a database, you need DBAs or data engineers
● To model databases, you need data modelers
● To do a “correct” data model, you need access to the business

Common DWH modeling challenges
The attraction of “no modeling”

For most organizations, the mere overhead of having data modelers talk to
the business, as well as the time it takes to introduce changes, negatively
affects the organization's ability to adapt to new conditions.
Still, cutting corners in this process has side effects: Data correctness and
potentially data quality degrade, traded for speed and agility in the data
process.

The Enhanced Medallion

Medallion, the best practice pipeline
[Diagram: Spark streams land in Bronze; Silver applies time-series resampling & interpolation, feature reduction, and feature enhancement; Gold holds the final tables]

Bronze - Ingestion
• Raw data
• No data processing
• Data kept around to fix mistakes

Silver - Curated
• Cleansed and conformed data
• Directly queryable
• PII masking/redaction

Gold - Final
• Curated business-level tables
• Project/use case specific
• Denormalized and read-optimized data models
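As a hedged sketch of the Bronze ingestion step (the path, table names, and file format are placeholders), a streaming table using Auto Loader's read_files can continuously convert landed files to Delta:

```sql
-- Hypothetical bronze ingestion: incrementally load landed JSON files as Delta.
CREATE OR REFRESH STREAMING TABLE bronze.sales.orders_raw AS
SELECT
  *,
  _metadata.file_path AS source_file,   -- lineage back to the landed file
  current_timestamp() AS ingested_at
FROM STREAM read_files(
  '/Volumes/landing/sales/orders/',
  format => 'json'
);
```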
Combining worlds - the Enhanced Medallion
[Diagram: batch and streaming sources land on cloud storage; bronze (Landing: raw data, temp.; Ingestion: verified data) → silver (Curation: cleansed, augmented data; Integration: business information model) → gold (Final: business-specific data; Presentation: data marts). The Curation → Final path serves modern use cases, while the Integration → Presentation path serves BI use cases on strictly modeled and verified data]

Data Lake for modern use cases
• Staging: Raw data in its original format
• Ingestion: Raw data verified and converted to Delta
• Curation: Cleansed, homogenized data with fundamental business logic applied
• Final: Business/project-ready datasets

DWH for BI use cases
• Integration: Enterprise DWH (one or more)
• Presentation: Business-ready DWH information (data marts)

Combining both worlds in the Lakehouse allows:
1. Access to corporate KPIs for modern use cases
2. Accelerated delivery for DWH use cases
Three layers from data to information
Explorative & Flexible (Curation and Final layers): data for modern use cases (self-service)
• Modern use cases (exploratory data analysis, data science, …)
• High flexibility: all sorts of workload types (e.g., ML, experimental workloads) supported
• All data types supported
• No compliance with the business information model needed

Semantically consistent data: enhanced business perspective (self-service)
• Not integrated into the business information model, but consistently transformed to allow joining with integration data
• Allows enhancements of business perspectives while keeping the DWH model stable

Conformed & Stable (Integration and Presentation layers, data marts): business information model
• BI and advanced analytics use cases on business information
• Stable business information model according to business objectives and OKRs
• Used for financial reporting, KPIs, company dashboards, etc.
• Follows a strict change process

Three layers from data to information
[Diagram: the same three layers mapped to data product types. The Explorative & Flexible layer (Curation/Final, self-service) holds arbitrary data and independent data products; the semantically consistent layer (enhanced business perspective, self-service) holds certified data products; the Conformed & Stable layer (Integration/Presentation, the DWH model) holds the business information model and data marts, spanning silver and gold]

Data Products

Data Modeling Strategies


Objectives
● Introduce the “data product” concept in a domain-oriented approach
● Adopt a working definition for any data product
● Understand data product categories and hierarchies
● Explore data product processes and a typical lifecycle
● Map these ideas to Lakehouse capabilities (Unity Catalog, medallion
layers)
● Introduce the concept of data contracts for data products
● Consider potential topologies for managing a data product portfolio
across domains.

Data Products

LECTURE

Defining Data
Products

Why Data Products?

Traditional Data Management Falls Short
Disconnected teams, inconsistent data, and slow time to value

● Centralized data platforms struggle to scale governance across teams.


○ Traditional DWH approaches require central IT ownership, creating bottlenecks.

● Teams operate in silos, creating fragmented datasets.


○ Redundant datasets emerge as different teams manage their own versions.

● Data trust is low due to inconsistency and unclear ownership.


○ Business units often rely on unofficial data sources, leading to misaligned reporting.

● AI and ML demand new levels of data availability and reusability.


○ Traditional governance models can’t keep up with real-time, feature-driven AI needs.

From Data Assets to Data Products
Shifting from fragmented datasets to managed, reusable assets

● Data products apply product thinking to data management.


○ Data is owned, documented, and versioned like software products.

● Each data product has clear ownership, governance, and SLAs.


○ Domains are responsible for quality, compliance, and evolution.

● Data consumers (AI, BI, analytics) access standardized, reusable data.


○ Instead of duplicating datasets, teams reuse certified, governed products.

● Accelerates AI, ML, and cross-functional collaboration.


○ Feature Stores, data marts, and real-time APIs function as governed data products.

Semantic Consistency and Interoperability
Building Trust Through Governance and Standardization

● Data contracts set clear expectations for producers and consumers.


○ Define schema, update frequency, access policies, and SLAs.

● Publishing and certification enable trust and discoverability.


○ Data products are registered in a centralized catalog (Unity Catalog).

● Versioning and lineage tracking ensure long-term usability.


○ Consumers always know which version of a data product they are using.

● AI and analytics teams can integrate trusted, reproducible data products.
○ Removes manual wrangling, reprocessing, and inconsistency in ML pipelines.
How Data Products Power AI & BI
Accelerating value delivery across functional domains

● Marketing teams leverage certified customer segmentation models.


○ Eliminates ad-hoc data wrangling for campaign targeting.

● Data Science teams access feature-engineered datasets.


○ Ensures real-time ML models remain consistent with training data.

● Finance teams operate on version-controlled financial data.


○ Guarantees auditability, accuracy, and reporting consistency.

● Cross-domain data sharing is simplified via governance.


○ Instead of manually reconciling datasets, teams use trusted, certified data products.

Scalable, Governed, and AI-Ready
Aligning structured data management with modern use cases

● Data products offer a structured, scalable alternative to ad-hoc datasets.
○ Eliminates redundancy while enforcing governance and usability.

● Governance is embedded, ensuring compliance without bottlenecks.


○ Data owners maintain contracts, SLAs, and lineage tracking.

● AI & BI teams can leverage certified data instead of manual extracts.


○ Reduces time spent cleaning and transforming data.

● A well-architected Lakehouse supports data product thinking at scale.


○ Unity Catalog + Medallion Architecture ensure structured, governed data flows.
Data Products in a Nutshell

Data and “product thinking”
A data product facilitates an end goal through the use of data

To publish data as data products, “product thinking” needs to be applied:


Every data product:
● Has an owner and is built for specific audiences;
● Follows a defined product life cycle;
● Is defined and described by a data contract; and
● Is published following an agreed-upon governance process.

Data Product
Usability Characteristics

A Data Product adheres to a set of usability characteristics:

● Discoverable: Users need to be able to explore the availability of data


● Addressable: Permanent and unique address for programmatic access
● Understandable: What are its semantics? How is it serialized?
● Trustworthy and truthful: Correctly represents the business
● Natively accessible: Readily available in the user’s environment/tool
● Interoperable and composable: Cross-domain semantic consistency
● Valuable on its own: Provides insights without related data
● Secure: Access control and privacy

Data Product
Imperatives

A Data Product needs to be:

● Consumption-ready: Trusted by consumers


● Kept up to date: By engineering teams for agreed-upon SLAs
● Approved for use: Governed using data contracts/agreements

Data Product concept attributes
[Diagram: data product concept attributes grouped by concern. Discoverability (published data product): discoverable, addressable, natively accessible. Quality & Observability (trusted data asset): valuable on its own, understandable, trustworthy and truthful. Semantic Consistency (compliant with governance rules): interoperable and composable. Security (organization-wide data governance) and Privacy (potentially anonymized): secure. All of these are the responsibility of the owner]
Data Product
Categories & Hierarchies

Example data product categories
Much more than tables, with varying producers (P) and consumers (C)
Datasets
• Tabular data (SQL tables, dataframes; e.g., facts, dimensions, metrics, time series, KPIs, metadata, …)
• ML & AI features
• Streams

Models
• Classical ML & AI
• LLM

Consumption channels
• Queries & Notebooks (consumption ready)
• Dashboards
• Reports
• Alerts

Data services
• API (e.g., served models)

[Table in the original maps each data product type to the roles that produce (P) and consume (C) it: data engineer, data scientist, ML engineer, business analyst, and business user. For example, data engineers and data scientists both produce and consume tabular data and features, while business users primarily consume dashboards, reports, and alerts]
Data Product hierarchy
[Diagram: source systems (PLM, Manufacturing, SCM, CRM, Social Media Aggregator, Order Data) feed source-aligned data products (plm, manufacturing, scm_suppliers, crm_tickets, call_centers, crm_customers, social_media, orders), which feed derived data products (products, partners, customer_products, customer_loyalty, customers, clickstream), which feed consumer-aligned data products (recommendations, product_popularity, customer_segments)]

Source-aligned data products
• Represent the relevant data as it is in the operational system, with minimal transformation
• Cleansed and transformed to ensure quality
• First step to creating more valuable data products

Derived data products
• Created by processing and transforming source-aligned data products or other derived data products
• Satisfy user needs, e.g., for decision-making or automated decision-making
• Can be reused in other derived data products

Consumer-aligned data products
• Specifically built for end users, e.g., dashboards and reports
Data Product hierarchy (with Ownership)
[Diagram: the same hierarchy annotated with medallion layers (each data product internally spans bronze, silver, and gold stages as appropriate) and with ownership: source-aligned and derived data products belong to owners and domains, while consumer-aligned data products are owned with support from the Hub domain]

Data Products in the Lakehouse

Data Products to combine different worlds
[Diagram: a domain standardized on the data product paradigm (sources → source-aligned data products → derived data products) coexists with a domain following the Inmon DWH paradigm (sources → staging → 3NF DWH → data marts) and a domain following the Kimball DWH paradigm (sources → data marts that form the DWH)]

Data products as facades
[Diagram: domains keep their internal implementations (the data product domain with source-aligned and derived data products, an Inmon domain with staging → 3NF DWH → data marts, and a Kimball domain with data marts), but each exposes semantically consistent data products as facades, implemented as views or materialized views; derived data products are then built on top of these facades]
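A hedged sketch of one such facade (all names are hypothetical): the domain keeps its internal mart, and the data product is a materialized view exposing a stable, consumer-oriented shape:

```sql
-- Hypothetical facade over an internal Kimball-style mart.
CREATE OR REPLACE MATERIALIZED VIEW gold.products.customers AS
SELECT
  c.customer_id,
  c.customer_name,
  c.segment
FROM internal_dwh.marts.dim_customer c
WHERE c.is_current = true;  -- expose only the current row per customer
```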


Data Product integration into Unity Catalog
Example architecture

[Diagram: source systems are connected via Lakehouse Federation (LF), external tables (ET) over object storage, or Databricks-to-Databricks Delta Sharing (DS); each domain, whether it follows the data product, Inmon, or Kimball paradigm, exposes its data products as facades (direct access or materialized views) into an enterprise catalog, with the whole integration governed by Unity Catalog]
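As a hedged sketch of the Delta Sharing path (the share, recipient, and table names and the sharing identifier are placeholders), a domain can publish a data product to another metastore:

```sql
-- Hypothetical Databricks-to-Databricks share of a consumer-aligned data product.
CREATE SHARE IF NOT EXISTS customer_data_products
  COMMENT 'Consumer-aligned data products from the sales domain';

ALTER SHARE customer_data_products
  ADD TABLE gold.sales.customer_segments;

CREATE RECIPIENT IF NOT EXISTS analytics_hub
  USING ID 'aws:us-west-2:<metastore-uuid>';  -- recipient's sharing identifier

GRANT SELECT ON SHARE customer_data_products TO RECIPIENT analytics_hub;
```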


Data Product
Processes and Lifecycle

Five core processes
When defining a data product

Domain-specific processes:
• Data Production: Creation of data products via ingestion & ETL or by business-oriented teams.
• Data Publishing: A deliberate step to provide other consumers with access to a data product.
• Data Consumption: Use of own data and published data products for analysis, reporting, ML, …

Domain-agnostic processes:
• Federated Computational Governance: Ensuring a data ecosystem that adheres to organizational rules and industry regulations through standardization. The goal is an "equilibrium between centralization and decentralization": data products conform to a shared set of rules, leaving space for autonomous decision-making by the data domains.
• Platform Operations: A central platform team is responsible for defining a common, organization-wide infrastructure and ensuring company-wide rules and policies are applied. To avoid this team becoming a bottleneck, the necessary capabilities need to be provided to the data domains in a self-service way.

Data Products
Typical lifecycle
[Diagram: Inception → Design → Creation → Publishing → Operation + Governance → Retirement; consumption and feedback drive value creation, and feedback information triggers iteration of new versions]

• Inception: Start with desired business outcomes; assign an owner; assign resources; define business metrics
• Design: Create a data contract and a data product design specification; ensure semantic consistency with other data products
• Creation: Build modular pipelines, features, models, dashboards, alerts, …; test against the data contract
• Publishing: Deploy using DataOps or MLOps (for models); publish to the catalog; manage access permissions according to the data contract
• Operation + Governance: Monitor metrics, quality, usage, and permissions; handle compliance requests; audit data product access
• Retirement: Deprecate the product; inform consumers; shut down production; archive assets; clean up resources

Roles involved across the lifecycle: business/consumer, product owner, data engineer / data scientist / business analyst, data steward, DataOps/MLOps
Mapping the Databricks Lakehouse to the data product lifecycle

[Diagram: the lifecycle stages (Inception → Design → Creation → Publishing → Operation + Governance → Retirement) mapped to Databricks capabilities. Inception and design: docs repos and the data contract / data product specifications owned by the team. Orchestration: Databricks Workflows, Repos, CI/CD, MLOps/LLMOps. ETL and processing engine: DLT, Auto Loader, Structured Streaming. Value creation: Data Warehousing with serverless Databricks SQL and dashboards, Lakehouse AI with notebooks, features, and AutoML, plus Marketplace. Data and AI governance: Unity Catalog with access control, Data Explorer, auditing, lineage, system tables, and Lakehouse Monitoring]
Data Contracts

Data Contract
A formal way to align domains and implement federated governance

A data contract should be provided by the data producer, but designed
with the consumer in mind. Important aspects for a consumer:
● Data description (name, description, source systems, attribute selection)
● Data schema (tables, columns, anonymization & encryption, filters, masks)
● Usage policies (tags, PII, guidelines, data residency)
● Data quality (applied quality checks and constraints, quality metrics)
● Security (who is allowed to use the data product)
● Data SLAs (last update, expiration dates, retention time)
● Responsibilities (owner, maintainer, escalation contact, change process)
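Parts of such a contract can be made machine-visible in Unity Catalog; a hedged sketch (the table, tag, and group names are hypothetical) using comments, tags, and grants:

```sql
-- Hypothetical sketch: surfacing data-contract facts as Unity Catalog metadata.
COMMENT ON TABLE gold.sales.customers IS
  'Certified data product: customer master. Owner: sales domain. SLA: daily by 06:00 UTC.';

ALTER TABLE gold.sales.customers
  SET TAGS ('data_product' = 'certified', 'contains_pii' = 'true');

-- Security: who is allowed to use the data product.
GRANT SELECT ON TABLE gold.sales.customers TO `data_consumers`;
```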
Data Contract

• Data description: Name, owner, description, source systems, …
• Data schema: Tables, columns, anonymization and encryption info, …
• Data quality: Applied quality checks, quality metrics, …
• Data SLAs: Last update, expiration dates, retention time, usage restrictions, code of conduct, re-sharing conditions, …
• Security: Who is allowed to use the data product
• Explanatory add-ons (optional): Notebook, dashboard, sample code, …

Data contract-based governance
Potential process to achieve consistent “certified data products”
[Diagram of the process: (1) a domain proposes a data contract; (2) the Governance Team assesses, gives feedback, and approves; (3) the contract is approved; (4) the domain publishes the data product; (5) it appears in the catalog/marketplace; (6) consumers discover it and understand its usage via the data contract; (7) consuming domains use the data product, with the data residing on cloud storage]

Certified data products carry the "stamp" of the Governance Team (golden data products), but other data products can be published without involving the Governance Team.
Independent and certified data products
“Equilibrium between centralization and decentralization”

Certified Data Products
• High-quality data products that are semantically consistent and can be combined easily
• A Governance Team of domain representatives agrees on rules and policies for certified data products and approves and governs their data contracts

Independent Data Products
• High-quality data products, but no guarantee that they can be combined easily
• Autonomous domains can publish data products as they believe their data is represented best
Data Product Topologies

The basic topologies
Harmonized vs. Hub-and-Spoke

When structuring a distributed architecture where the different domains are autonomous but need to share data, two basic approaches exist:

• Harmonized: fully autonomous data domains publish metadata to and discover data through a global catalog (C); data consumption and external sharing happen domain-to-domain.
• Hub-and-Spoke: data domains publish and discover data products through a global hub (H); data consumption and external sharing go through the hub.
Harmonized topology
No central data team, all domains are autonomous

• Each domain hosts and serves its own data products
• Global catalog for central discovery
• Each domain has the skills to manage the end-to-end data lifecycle
• May be inefficient if there is a high level of repeatability/similarity between data products
• Central IT defines the technology blueprint and provides best practices and setup automations

[Diagram: fully autonomous data domains publish metadata to and discover data through a global catalog; consumption and external sharing remain domain-to-domain]
Hub-and-Spoke
Global data hub for publishing, discovery, and serving of data products

• More central data governance
• A central team maintains infrastructure services like a global catalog
• The central team can itself be a data domain building data products
• Requires spokes (domains) to publish shareable data products to the global data hub
• Can reduce data sharing and management overheads when there are many domains

[Diagram: data domains publish and discover data products through a global hub; data consumption and external sharing go through the hub]

Topologies in the real world
The target topology will be a mix of both

There is a tendency to go for Hub-and-Spoke. However, to avoid bottlenecking, the architecture and processes need to:
• Support autonomous domains
• Allow non-autonomous domains to mature over time and become autonomous

Even for autonomous data domains, data should be published centrally to enable consistent governance and a single point of discovery (global data catalog).

[Diagram: a global hub serves data domains, some of which are candidates to become autonomous over time; fully autonomous domains still publish and discover data products through the hub while consuming data and sharing externally]
How to ensure semantic consistency?
The published data products need to be aligned

To increase data quality and enable users to work with and combine data sets, published data products need to be semantically consistent:
• Align data products on:
  ○ Context
  ○ Granularity
  ○ Terminology (naming consistency)
• Ensure correctness (viz. business logic)

[Diagram: semantically consistent data products (DP) published to the global hub by both autonomous and non-autonomous data domains]

Data Warehouse Data Modeling

LAB EXERCISE

Data Warehousing
Modeling with ERM
and Dimensional
Modeling in
Databricks
Summary and Next
Steps

Data Modeling Strategies

Course Learning Objective Recap
● Design and implement data models tailored to specific business needs
within the Databricks Lakehouse environment.
● Differentiate between different types of modeling techniques and
understand their respective use cases.
● Analyze business needs to determine data modeling decisions.
● Design logical and physical data models for specific use cases.

Course Learning Objective Recap
● Understand Data Products definition and use cases.
● Understand the data product lifecycle.
● Explore the stages of data product lifecycle.
● Organize Data in Domains and in Unity Catalog.
● Utilize Delta Lake and Unity Catalog to define data architectures.
● Explore Data Integration and secure data sharing techniques.

Next Steps
Additional resources for continuing the learning journey.

Data Architect Associate Learning Pathway

● Continue your learning through self-paced or instructor-led offerings


● Further courses offer hands-on instruction in:
○ Governance and Security for Data + AI (coming soon)
○ Optimization and Best Practices (coming soon)
