
Data Modeling

Strategies

Databricks Academy

© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache
Iceberg logo are trademarks of the Apache Software Foundation.
Agenda

Data Warehouse Data Modeling            Time     Lecture  Demo  Lab

Lakehouse Architecture Recap            8 mins   ✓
Data Warehousing Modeling Overview      10 mins  ✓
Inmon’s Corporate Information Factory   25 mins  ✓        ✓
Kimball’s Dimensional Modeling          25 mins  ✓        ✓
Data Vault 2.0                          22 mins  ✓        ✓
Agenda

Modern Data Architecture Use Cases      Time     Lecture  Demo  Lab

Feature Store                           19 mins  ✓        ✓
Combining Approaches                    16 mins  ✓

Data Products                           Time     Lecture  Demo  Lab

Defining Data Products                  36 mins  ✓
Summary and Next Steps                  2 mins   ✓
Data Warehousing Modeling with ERM and
Dimensional Modeling in Databricks      60 mins                 ✓
Course Objectives
● Design and implement data models tailored to specific business needs
within the Databricks Lakehouse environment.
● Differentiate between different types of modeling techniques and
understand their respective use cases.
● Analyze business needs to determine data modeling decisions.
● Design logical and physical data models for specific use cases.

Course Objectives
● Understand the definition and use cases of Data Products.
● Explore the stages of the data product lifecycle.
● Organize data in domains and in Unity Catalog.
● Utilize Delta Lake and Unity Catalog to define data architectures.
● Explore data integration and secure data sharing techniques.
Data Warehouse Data Modeling

Data Modeling Strategies
Objectives
● Understand Bill Inmon’s top-down (3NF) approach
● Map Inmon’s EDW concepts to Databricks medallion layers
● Summarize Kimball’s bottom-up, star schema-driven approach
(facts/dimensions)
● Illustrate how star schemas integrate with Lakehouse
● Understand Data Vault 2.0’s Hubs, Links, and Satellites for agile schema
evolution
● Compare Data Vault to Inmon/Kimball

Content Map

Inmon’s Corporate Information Factory: Overview of Inmon methodology; strengths (governance, single source of truth) vs. limitations

Kimball’s Dimensional Modeling: Fact vs. Dimension; conformed dimensions, SCD types; Kimball vs. medallion alignment; surrogate keys & dimension creation; fact table referencing dimension keys

Data Vault 2.0: Hubs (business keys), Links (relationships), Satellites (attributes); TPC-H mapping example; strengths (historization, incremental loads) vs. complexity
Data Warehouse Data Modeling

LECTURE

Lakehouse
Architecture
Recap

Lakehouse Architecture Recap
Laying the Foundation for Data Modeling Strategies

● This Data Modeling Strategies course builds on your understanding of Lakehouse principles.
○ Understanding the Lakehouse framework helps us structure scalable, efficient data models.
● Modeling decisions depend on data governance, processing, and storage
layers.
○ The Medallion Architecture (Bronze, Silver, Gold) dictates where and how data is transformed.
● Unity Catalog enforces governance and interoperability.
○ Schema consistency, lineage tracking, and access control impact how we model data across domains.
● Bridging AI and BI—Data models serve both workloads.
○ Feature engineering and structured analytics rely on properly designed data models.

Core Principles & Medallion Architecture
How the Lakehouse Organizes Data for Scalable Processing

● Combining Data Lakes & Warehouses:


○ The Lakehouse eliminates silos by supporting both structured and unstructured data.

● The Medallion Architecture (Bronze, Silver, Gold):


○ Bronze Layer: Raw ingestion, schema validation, historical record-keeping.
○ Silver Layer: Cleaned, conformed, and enriched data, ready for modeling.
○ Gold Layer: Optimized for BI, ML, and domain-specific analytics.

● Schema Enforcement & Governance:


○ Supports open formats like Delta Lake while enforcing schema consistency.

● Built for Performance & Scale:


○ Combines ACID transactions, indexing, and caching for high-performance querying.

Medallion Architecture

Bronze (Ingestion) → Silver (Curated) → Gold (Final), with transformations such as time series resampling & interpolation, Spark stream feature reduction, and feature enhancement along the way.

● Bronze - Raw data: no data processing; data kept around to fix mistakes
● Silver - Cleansed and conformed data: directly queryable; PII masking/redaction
● Gold - Curated business-level tables: project/use case specific; denormalized and read-optimized data models
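A hedged sketch of how these layers might be realized as Delta tables in Databricks SQL; the schema and table names, columns, and landing path below are illustrative assumptions, not a prescribed layout:

-- Bronze: raw data kept as ingested, so upstream mistakes can be replayed.
CREATE TABLE bronze.orders_raw AS
SELECT *, current_timestamp() AS ingested_at
FROM read_files('/Volumes/demo/landing/orders/', format => 'json');

-- Silver: cleansed and conformed data, directly queryable.
CREATE TABLE silver.orders AS
SELECT CAST(order_id AS BIGINT)       AS order_id,
       CAST(order_ts AS TIMESTAMP)    AS order_ts,
       upper(trim(country_code))      AS country_code,
       CAST(amount AS DECIMAL(18, 2)) AS amount
FROM bronze.orders_raw
WHERE order_id IS NOT NULL;

-- Gold: denormalized, read-optimized, business-level table.
CREATE TABLE gold.daily_revenue AS
SELECT country_code,
       DATE(order_ts) AS order_date,
       SUM(amount)    AS revenue
FROM silver.orders
GROUP BY country_code, DATE(order_ts);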
Modern Data and AI Platform

● Personas: Data Engineer, ML Engineer, Data Scientist, Business Analyst, Business Partners
● Tooling: ETL & DS tools, BI Tools, Orchestration, Collaboration
● Platform layers: Ingest & Transform; Advanced Analytics, ML & AI; Data Warehouse; AI Engine; Data & AI Governance; Cloud Storage
Unity Catalog for Governance & Modeling
Ensuring Schema Consistency, Security, and Interoperability

● Centralized Governance for Data & AI:
○ Manages schemas, tables, and permissions across all workspaces and clouds.
● Schema Enforcement & Data Lineage:
○ Tracks data movement, transformations, and dependencies for model reproducibility.
● Fine-Grained Access Control:
○ Column- and row-level permissions ensure data is secure yet accessible.
● Cross-Domain Interoperability:
○ Ensures consistent definitions across teams, avoiding schema drift.
● Supports Multi-Cloud & Open Data Formats:
○ Enables governed access to Delta Lake, Parquet, and other formats.
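A hedged sketch of the fine-grained controls described above, using Unity Catalog row filters and column masks; all function, schema, and table names are illustrative assumptions:

-- Row-level security: non-admins only see their own region.
CREATE FUNCTION demo.gov.region_filter(region STRING)
RETURN IF(is_account_group_member('admins'), TRUE, region = 'EMEA');

ALTER TABLE demo.sales.orders
SET ROW FILTER demo.gov.region_filter ON (region);

-- Column-level masking: redact PII for everyone outside HR.
CREATE FUNCTION demo.gov.email_mask(email STRING)
RETURN CASE WHEN is_account_group_member('hr') THEN email ELSE '***' END;

ALTER TABLE demo.sales.customers
ALTER COLUMN email SET MASK demo.gov.email_mask;

-- Plus standard grants at catalog, schema, or table level.
GRANT SELECT ON TABLE demo.sales.orders TO `analysts`;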
Unity Catalog Overview
Before and After Unity Catalog

● Before Unity Catalog: each workspace maintained its own user/group management, metastore, and access controls on top of its compute resources.
● With Unity Catalog: user/group management, the metastore, and access controls are centralized in Unity Catalog and shared across workspaces; each workspace keeps its own compute resources.
Unity Catalog Overview

A (Unity) Metastore is assigned to a Databricks Account and shared by its Databricks Workspaces. Within the metastore, objects are organized as:

Metastore → Catalog → Schema → Table / View / Volume / Function / Model

Every object is addressed through a three-level namespace:

SELECT * FROM catalog1.schema1.table1;
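A minimal sketch of creating that hierarchy (object names are illustrative):

CREATE CATALOG IF NOT EXISTS catalog1;
CREATE SCHEMA  IF NOT EXISTS catalog1.schema1;
CREATE TABLE   IF NOT EXISTS catalog1.schema1.table1 (id BIGINT, name STRING);

-- Each object is then addressed as catalog.schema.object:
SELECT * FROM catalog1.schema1.table1;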
Data Intelligence & Feature Engineering
Bridging Structured Analytics & AI Workloads

● The Well-Architected Lakehouse Supports Both BI & AI:


○ Structured analytics (SQL, BI dashboards) and AI-driven feature engineering coexist in a unified
architecture.

● Feature Engineering Requires Scalable Data Pipelines:


○ AI workloads need real-time and batch processing for feature extraction and transformation.

● Feature Stores Ensure Consistency Across ML Pipelines:


○ Prevents “training-serving skew” by storing reusable, versioned features.

● Data Intelligence Optimizes Business & AI Use Cases:


○ Combines predictive modeling with historical analytics for deeper insights.

● Supports Real-Time & Batch Inference:


○ ML models leverage streaming & historical data for accurate, real-time decisioning.
Mosaic AI
End-to-end AI capabilities

● MLOps + LLMOps: move code, data, and models between development and production; manage models, features, and experiments
● Prepare Data: discover & transform structured data into features; chunk & create embeddings from unstructured data
● Develop & Evaluate AI: train and test algorithms; fine-tune & prompt-engineer models; create GenAI agents & tools; evaluate experiments
● Serve Data & AI: low-latency model serving; log model requests/responses; low-latency feature serving; query embeddings in a vector DB
● AI Models & Tools (external services): commercial AI models, community AI models, community tools
● AI Engine: AI-driven discovery and search; AI Assistant; AI-driven performance optimization and scaling
● Data & AI Governance: manage security & permissions; manage models, features, and functions; track model lineage; data monitoring; AI monitoring (metrics, model quality, drift of data and predictions)
● All built on Cloud Storage
Mosaic AI
… fully integrated into the Data Intelligence Platform

Mosaic AI combines Lakehouse common capabilities (Notebooks, SQL, DLT, Workflows, Lakeflow Connect, Asset Bundles for CI/CD support, Unity Catalog, Delta Sharing, and Delta tables plus Files/Volumes on cloud storage) with Mosaic AI-specific capabilities (MLflow, AutoML, AI Playground, AI Gateway, Model Serving, Model Training, AI Functions for calling models from SQL, Agent Framework, Agent Evaluation, Databricks Apps, Function Serving, Feature Serving via online tables, Vector Search, Model Registry and Feature Store in Unity Catalog, Lakehouse Monitoring, and Tools/Models catalogued in Unity Catalog and the Marketplace) alongside external services (Hugging Face, OpenAI, LangChain, …).
Key Takeaways
How Lakehouse Architecture Shapes Data Modeling Strategies

● The Lakehouse integrates structured & unstructured data.


○ Supports BI, ML, and real-time analytics in a single framework.

● Medallion Architecture provides a structured data flow.


○ Bronze (raw), Silver (cleansed), and Gold (optimized) layers define where and how data models should be
applied.

● Unity Catalog enforces governance & consistency.


○ Standardized schemas, access control, and data lineage tracking enable trustworthy data modeling.

● Feature Stores bridge AI & business analytics.


○ Ensures consistent, versioned feature definitions across training & inference workflows.

● A strong data modeling strategy builds on these principles.


○ Ensuring that data remains scalable, governed, and optimized for AI & analytics.
Data Warehouse Data Modeling

LECTURE

Data
Warehousing
Modeling
Overview
Why Model?

Data Warehouse (DWH) Data Modeling
Why?

A data warehouse is used by business users to evaluate and make business decisions.
Data warehouse data needs to be modeled to:
● Correctly represent the business
● Ensure that insights and decisions based on the data warehouse are
impactful.

DWH Data Modeling
How?

● Understand the business – its actors, relationships, processes, requirements
● Create a logical data model of the organization's business processes and needs
● Ensure data is of high quality: accurate, consistent, and well-organized
● Enable effective support for business intelligence, analytics, reporting, etc.

Business Model (processes, actors, relationships, requirements, …) → Logical Model (formal business model, technology-agnostic) → DWH Implementation (technology-specific) → Business Intelligence, Analytics and Reporting
Data Modeling Methods
Using what methodology?

Historically, there have been three dominant schools of thought for data
warehousing practitioners:
● The top-down approach, as defined by Bill Inmon
○ Building the Data Warehouse, 1992
● The bottom-up approach, as defined by Ralph Kimball
○ The Data Warehouse Toolkit, 1996
● Data Vault 2.0, as defined by Dan Linstedt
○ Building a Scalable Data Warehouse with Data Vault 2.0, 2015

Data Warehousing - Purpose of Modeling

Business Requirements (the information need) drive the data warehouse environment, which flows through five stages: Source → Ingest → Integration → Delivery & Access → Serve.

● Source: RDBMS (structured), Apps, Files / Logs (semi-structured), Business Apps (structured), other clouds
● Ingest: Physical Staging Model
● Integration: Business Information Model → DWH Logical Data Model (LDM) → DWH Physical Data Model (PDM)
● Delivery & Access: Data Marts Logical Data Model (LDM) → Data Mart
● Serve: BI Tools
Context for Concepts

We can easily explore a methodology-agnostic conceptual approach to data modeling based on the terminology of the Inmon approach.
Key concepts that originated with Inmon, such as the logical model, translate effectively into the terminologies of competing methodologies, e.g. ontologies and taxonomies, as used in Kimball and Data Vault.

Logical Data Modeling

Logical Data Modeling
Key Terms

Entity: Person, place, thing, or concept about which you wish to record facts
Attribute: a non-decomposable, atomic piece of information describing an entity
Non-Decomposable: the smallest unit of information you will want to reference
Business rules: specifications that preserve the integrity of the LDM by governing which values attributes may assume
Business rules fall into two categories:
● Key business rules - the identification of unique records
● Domain business rules - validation of attribute values
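A hedged sketch of how the two rule categories might be expressed on a Delta table; entity and attribute names are illustrative, and note that in Databricks primary keys are informational while CHECK constraints are enforced:

CREATE TABLE erm.customer (
  customer_id BIGINT NOT NULL,  -- key business rule: unique identification
  email       STRING,
  birth_date  DATE,
  CONSTRAINT pk_customer PRIMARY KEY (customer_id)
);

-- Domain business rule: validate the values an attribute may assume.
ALTER TABLE erm.customer
ADD CONSTRAINT plausible_birth_date CHECK (birth_date > DATE'1900-01-01');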
Logical Data Modeling
Optimal Approach

1. Structural validity - Consistency with how the business defines and organizes information
2. Simplicity - Ease of understanding
3. No redundancy - No extraneous information
4. Shareability - Not specific to one solution, usable by many
5. Extensibility - Ability to evolve with minimal effect on existing base
6. Integrity - Consistency with the way the business uses and manages
information values

Building a Data Warehouse

Building a DWH - Process
Simplified DWH Process

Models are front-and-center when building a data warehouse:
● Business Information Model (BIM): models actors, their relations, and how they interact ("how the business works")
● Logical Data Model (LDM): model of the data that is associated with the BIM
● Physical Data Model (PDM): implemented data model derived from the LDM

The simplified process runs through three phases:
● Analyze: business requirements → Business Information Model; source data analysis
● Design: Logical Data Model → Physical Data Modeling; Data Mart Logical Data Model → Data Mart Physical Data Modeling; data staging design; source ETL mapping → ETL design
● Build: source data in staging; DWH implementation; ETL development
Building a DWH - Process
Simplified DWH Process

Models describe the business world and its relationships, i.e. they depict the business processes within the organization.
Models generate the business context required to create business information from the data and store it accordingly.
Building a DWH - Process
Simplified DWH Process

For this process:
● Analyze is technology-agnostic
● Design is impacted by technology and understandability for consumers
● Build uses the actual technology to implement the physical model and the ETL processes
Data Warehousing in the Lakehouse

When it comes to data warehousing migrations, implementations, and associated use cases, the data architect is typically not in a position to dictate the methodologies that govern a legacy data warehouse.
The beauty of the Databricks Lakehouse is that it can easily support the harmonious coexistence of as many legacy DWH methodologies as the business requires.
Furthermore, and crucially important for maximizing the value of your data in Databricks, a well-architected Lakehouse opens up new opportunities to apply data warehouse data to modern use cases.
Data Warehouse Data Modeling

LECTURE

Inmon’s
Corporate
Information
Factory
Inmon in a Nutshell

Bill Inmon’s Corporate Information Factory
Understanding the Foundation of Data Warehousing

Bill Inmon is often referred to as the "father of data warehousing." His Corporate Information Factory (CIF) provides a comprehensive framework for building enterprise-wide data warehouses (EDWs).

Key Principles: Emphasizes a top-down approach, integrated data, subject orientation, time-variance, and non-volatility.

Importance: Establishes a robust architecture that supports strategic decision-making and business intelligence initiatives.

Top-Down Data Warehousing Approach
Building the Foundation Before Data Marts

Inmon advocates for creating a centralized data warehouse before developing specialized data marts.
Process Flow:
● Enterprise Data Warehouse (EDW): Serves as the single source of truth
● Data Marts: Derived from the EDW to serve specific business functions
Advantages:
● Ensures consistency
● Reduces data redundancy
● Provides a unified view across the organization

Subject-Oriented Data Modeling
Organizing Data Around Business Subjects

With Inmon, data is categorized into subjects (e.g., sales, finance, inventory)
rather than applications or processes.
Benefits:
● Enhances clarity and relevance for business users.
● Facilitates easier data analysis and reporting.
Implementation:
● Utilizes dimensional models like star schemas within each subject area,
ensuring data is organized logically.

Integrated and Consistent Data
Ensuring Data Uniformity Across the Warehouse

Integration: Combines data from disparate sources, ensuring consistency in formats, naming conventions, and definitions.
Challenges Addressed:
● Resolves data silos
● Eliminates discrepancies
● Harmonizes differing data standards.
Techniques:
● ETL Processes: Extract, Transform, Load operations are crucial for data
integration.
● Metadata Management: Maintains information about data sources,
transformations, and structures to support integration.
Time-Variant Data
Capturing Historical Data for Trend Analysis

Data warehouses store historical data, allowing analysis over different time
periods.
Importance:
● Enables businesses to track changes, identify trends, and make informed
predictions.
Implementation:
● Snapshot Schemas: Capture data at specific intervals.
● Slowly Changing Dimensions (SCD): Manage changes in dimension
attributes over time without losing historical accuracy.

Non-Volatile Data Storage
Stability and Consistency of Warehouse Data

Once data enters the data warehouse, it is not updated or deleted; it remains stable to ensure reliability.
Advantages:
● Provides a consistent historical record.
● Enhances trustworthiness for decision-making.
Operational Implications:
● Focuses on append-only data loading, preventing unintended alterations
and preserving data integrity.

Corporate Information Factory Architecture
Core Components and Their Roles

Enterprise Data Warehouse (EDW): Central repository integrating data from all sources.
Data Marts: Subsets of the EDW, tailored for specific business areas.
Operational Data Store (ODS): Handles current, transactional data for
operational reporting.
ETL Layer: Manages data extraction, transformation, and loading into the
warehouse.
Metadata Repository: Stores information about data sources, structures,
and transformations.
Access Tools: Facilitate data retrieval, reporting, and analysis for
end-users.
Extract, Transform, Load (ETL) Processes
The Backbone of Data Integration

Extract: Retrieves data from various source systems, which can include
databases, applications, and external files.
Transform: Cleanses, standardizes, and enriches data to ensure
consistency and quality. This step may involve:
● Data cleansing (removing duplicates, correcting errors)
● Data integration (combining data from different sources)
● Data transformation (converting data types, aggregating data)
Load: Inserts the transformed data into the data warehouse, ensuring it is
organized for efficient querying and analysis.
Tools and Technologies: Examples include Informatica, Talend, and
Microsoft SSIS, which automate and manage ETL processes.
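A hedged ELT-style sketch of the three steps in Databricks SQL; the path, table names, and columns are illustrative assumptions:

-- Extract: pull raw source files into a staging table.
CREATE OR REPLACE TABLE staging.customers_raw AS
SELECT * FROM read_files('/Volumes/demo/landing/customers/', format => 'csv', header => true);

-- Transform: cleanse (deduplicate, standardize) and integrate.
CREATE OR REPLACE TEMP VIEW customers_clean AS
SELECT DISTINCT
       CAST(customer_id AS BIGINT)  AS customer_id,
       initcap(trim(customer_name)) AS customer_name,
       upper(trim(country_code))    AS country_code
FROM staging.customers_raw
WHERE customer_id IS NOT NULL;

-- Load: upsert the transformed data into the warehouse table.
MERGE INTO edw.customer AS t
USING customers_clean AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;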
Data Marts in Inmon Data Warehouses
Specialized Subsets for Targeted Analysis

Data marts are focused segments of the data warehouse, designed to serve specific business lines or departments.
Types:
● Dependent: Sourced directly from the EDW, ensuring consistency.
● Independent: Created from separate data sources, typically used in
bottom-up approaches but can complement the top-down strategy.
Benefits:
● Enhanced performance for specific queries.
● Tailored data models meeting the unique needs of different user groups.
Integration with EDW: Ensures that all data marts maintain alignment with
the centralized data warehouse for unified reporting.
Corporate Information Factory Benefits
Why Choose the Top-Down Approach?

Scalability: Supports growth by providing a flexible and expandable architecture.
Consistency: Maintains uniform data definitions and standards across the
organization.
Comprehensive View: Offers an enterprise-wide perspective, facilitating
holistic decision-making.
Data Quality: Emphasizes rigorous data integration and cleansing
processes.
Long-Term Investment: Focuses on building a sustainable and
maintainable data infrastructure that adapts to evolving business needs.
Inmon and Normalization

Normalization: From UNF to 3NF
Enhancing Data Integrity Through Progressive Constraints

Unnormalized Form (UNF):


● Data may contain repeating groups and multi-valued attributes.
● No enforced rules on data organization.
First Normal Form (1NF):
● Eliminate Repeating Groups: Each field contains only atomic values.
● Unique Rows: Each record must be unique.

Normalization: From UNF to 3NF
Enhancing Data Integrity Through Progressive Constraints

Second Normal Form (2NF):


● Already in 1NF
● Eliminate Partial Dependencies: Non-key attributes must depend entirely
on the primary key, not just part of it.
Third Normal Form (3NF):
● Already in 2NF
● Eliminate Transitive Dependencies: Non-key attributes must depend only
on the primary key and not on other non-key attributes.

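A worked sketch of the decomposition, assuming an unnormalized order record along the lines of orders(order_id, customer_name, customer_city, product1, product2, quantity1, …); all names are illustrative:

-- 1NF: eliminate the repeating product columns; one atomic row per order line.
-- 2NF: quantity depends on the full key (order_id, product_id), so it lives
--      in order_lines, not on orders.
-- 3NF: customer_city depends on the customer, not the order (a transitive
--      dependency), so it moves to the customers table.
CREATE TABLE edw.customers   (customer_id BIGINT, customer_name STRING, city STRING);
CREATE TABLE edw.products    (product_id  BIGINT, product_name  STRING);
CREATE TABLE edw.orders      (order_id    BIGINT, customer_id   BIGINT, order_date DATE);
CREATE TABLE edw.order_lines (order_id    BIGINT, product_id    BIGINT, quantity INT);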
Normalization Pros and Cons – Inmon
Central to the Inmon EDW Strategy

Pros of Normalization:
● Minimizes Data Redundancy: ensures a single source of truth by avoiding duplicate data storage.
● Enhances Data Integrity & Consistency: updates occur only in one place, preventing synchronization issues.
● Optimized for Transactional Updates: reduces storage costs and improves efficiency in operational environments.
● Provides Flexibility for Data Integration: allows cross-enterprise data modeling with strict entity relationships.

Cons of Normalization:
● Query Performance Trade-offs: highly normalized structures require multiple joins, increasing query complexity.
● Slower Analytical Processing: complex joins can impact BI and reporting performance.
● Requires ETL Effort for Denormalization: data marts often need further transformation for efficient end-user querying.
● Not Always Ideal for AI & ML Workloads: ML pipelines often require denormalized feature stores, requiring additional processing steps.
Normalization and the Databricks Platform

Databricks' parallel engine (Apache Spark) is tremendously good at scanning and processing large volumes of data, since these steps can be done in parallel over N workers.

Joins, for the most part, lead to an exchange of data between workers through serialization and deserialization.

By utilizing modern features such as liquid clustering, predictive optimization, deletion vectors, and the gathering of statistics, one can dramatically reduce the impact of normalization.
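A hedged sketch of applying some of these features to a normalized table; the table and column names are illustrative:

-- Liquid clustering on the common join keys keeps related rows co-located.
CREATE TABLE edw.order_lines (
  order_id   BIGINT,
  product_id BIGINT,
  quantity   INT
)
CLUSTER BY (order_id, product_id);

-- Deletion vectors avoid rewriting files on deletes/updates.
ALTER TABLE edw.order_lines SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

-- Column statistics help the optimizer prune data and plan joins.
ANALYZE TABLE edw.order_lines COMPUTE STATISTICS FOR ALL COLUMNS;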
Inmon Visualized

Inmon’s Corporate Information Factory
Process and Logical View

● Relational modeling with normalized data as the core of the data warehouse
● Data marts (often dimensional or denormalized models)

Process view: Business Requirements → Business Information Model → DWH Logical Data Model → DWH Physical Data Model, plus Data Marts Logical Data Model (LDM) → Data Mart Physical Model (dimensional)
Logical view: Sources → Staging (relational) → DWH (3NF) → Data Marts and Cubes
Data Modeling Work-process (Inmon)

1. Business Information Model (conceptual view) – wide business perspective
   High-level model of the actors and interactions of interest for the business. The focus is to capture the major processes of interest.
2. User Views (Domains) – data requirement perspective per function / user (User = Business Function)
   Each business process is worked on individually.
   Tasks: identify major entities; determine relationships between entities; determine primary and alternate keys; determine foreign keys; determine key business rules; add remaining attributes; validate normalization rules; determine data types.
3. Composite Logical Data Model – data integration and conflict resolution
   Tasks: combine User Views; integrate with existing data models; analyze for stability and growth.
4. Physical Data Model (PDM) – efficiency and usability
   Tasks: translate the logical data structure (identify tables and columns; adapt the structure to the technology; design how to enforce business rules around entities (PK, FK); design how to enforce integrity (relationships); tune storage-related mechanisms).

⇒ This is an iterative, ongoing process across the warehouse lifecycle where information captured in later steps may inform prior steps.
Data Warehouse Data Modeling

DEMONSTRATION

Entity
Relationship
Modeling

Data Warehouse Data Modeling

LECTURE

Kimball’s
Dimensional
Modeling

Kimball in a Nutshell

Ralph Kimball’s Dimensional Modeling
A Practical Approach to Data Warehousing

Ralph Kimball advocates for a bottom-up approach. His Dimensional Modeling technique focuses on user accessibility and performance.
Key Principles:
● Dimensional Design: Organizes data into fact and dimension tables.
● Bus Architecture: Ensures scalability and consistency across the data
warehouse.
● Incremental Development: Builds the data warehouse iteratively through
data marts.
Importance:
● Emphasizes ease of use for business users.
● Optimizes query performance for reporting and analysis.
Kimball vs. Inmon
Comparing Methodologies for Data Warehousing

Kimball’s Bottom-Up Approach:


● Focus: Starts with creating data marts for specific business processes.
● Architecture: Data marts are integrated into a cohesive data warehouse
using conformed dimensions.
● Advantages: Faster implementation, immediate business value, flexibility.
Inmon’s Top-Down Approach:
● Focus: Begins with a comprehensive Enterprise Data Warehouse (EDW).
● Architecture: EDW is the repository from which data marts are derived.
● Advantages: Ensures data consistency and integration across the
enterprise.

Kimball vs. Inmon
Comparing Methodologies for Data Warehousing

Key Differences:
● Implementation Speed: Kimball’s approach typically delivers results
quicker.
● Scalability: Inmon’s method may better support large-scale,
enterprise-wide initiatives.
● Flexibility: Kimball’s approach allows for more iterative and adaptable
development.

Dimensional Modeling

Core Concepts of Dimensional Modeling
Building Blocks of Kimball’s Approach - Tables

Fact Tables: Central tables that store measurable, quantitative data related
to business processes.
● Contain foreign keys referencing dimension tables.
● Include numeric metrics (e.g., sales amount, quantity).
● Often contain additive, semi-additive, or non-additive measures.
Dimension Tables: Surrounding tables that provide descriptive attributes
related to fact data.
● Contain textual or categorical information (e.g., product names).
● Often denormalized to optimize query performance.
● Support hierarchical relationships (e.g., dates with year, quarter, month).

Core Concepts of Dimensional Modeling
Building Blocks of Kimball’s Approach - Schemas

Star Schema:
● Structure: Fact table at the center connected to multiple dimension
tables.
● Advantages: Simplifies queries, enhances performance, and improves
readability.
Snowflake Schema:
● Structure: Extension of star schema; dimension tables are normalized
into multiple related tables.
● Advantages: Reduces data redundancy and can save storage space, but
may complicate queries.

Designing Fact Tables
Capturing Business Metrics Effectively

Types of Fact Tables:


● Transactional Facts: Record individual business transactions.
● Periodic Snapshot Facts: Capture data at regular intervals.
● Accumulating Snapshot Facts: Track the progression of a process.
Grain Definition:
● Importance: Defines the level of detail stored in the fact table.

Designing Fact Tables
Capturing Business Metrics Effectively

Measures:
● Additive Measures: Can be summed across any dimension.
● Semi-Additive Measures: Can be summed across some dimensions but
not all.
● Non-Additive Measures: Cannot be summed across any dimension (e.g., ratios, unit prices).
Foreign Keys:
● Role: Link fact tables to corresponding dimension tables.
● Implementation: Ensure referential integrity and support efficient joins
during queries.

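A hedged sketch of a transactional fact table combining these ideas; all names are illustrative, and primary/foreign key constraints are informational in Databricks:

CREATE TABLE gold.fact_sales (
  date_key     INT    NOT NULL,  -- references dim_date
  product_key  BIGINT NOT NULL,  -- surrogate key into dim_product
  customer_key BIGINT NOT NULL,  -- surrogate key into dim_customer
  quantity     INT,              -- additive measure
  sales_amount DECIMAL(18, 2),   -- additive measure
  unit_price   DECIMAL(18, 2),   -- non-additive measure (do not sum)
  CONSTRAINT fk_product  FOREIGN KEY (product_key)  REFERENCES gold.dim_product,
  CONSTRAINT fk_customer FOREIGN KEY (customer_key) REFERENCES gold.dim_customer
);
-- Grain: one row per order line, the lowest level of detail captured.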
Designing Dimension Tables
Structuring Descriptive Context for Facts

Characteristics of Dimension Tables:


● Descriptive Attributes: Provide context to fact data (e.g., customer name,
product category).
● Surrogate Keys: Unique identifiers used instead of natural keys to handle
changes over time.
● Hierarchies: Enable drill-down capabilities in reports (e.g., geographic
hierarchies from country to city).

Designing Dimension Tables
Structuring Descriptive Context for Facts

Types of Dimensions:
● Conformed Dimensions: Shared across multiple fact tables and data
marts, ensuring consistency.
● Role-Playing Dimensions: Used multiple times within the same schema
(e.g., date dimension used for order date and ship date).
● Junk Dimensions: Combine unrelated low-cardinality attributes into a
single dimension to reduce clutter in fact tables.
Handling Slowly Changing Dimensions (SCD):
● SCD Type 1: Overwrites old data with new data, not preserving history.
● SCD Type 2: Creates a new record to preserve historical data.

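A simplified, hedged sketch of SCD Type 2 handling with MERGE; gold.dim_customer and the updates staging view are illustrative assumptions, and a production pipeline would also guard against reprocessing the same batch:

-- Step 1: close out current rows whose tracked attribute changed, and
-- insert brand-new customers as current rows.
MERGE INTO gold.dim_customer AS d
USING updates AS u
  ON d.customer_id = u.customer_id AND d.is_current = TRUE
WHEN MATCHED AND d.address <> u.address THEN
  UPDATE SET is_current = FALSE, valid_to = current_date()
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, valid_from, valid_to, is_current)
  VALUES (u.customer_id, u.address, current_date(), NULL, TRUE);

-- Step 2: insert the new versions for the rows just closed, preserving history.
INSERT INTO gold.dim_customer (customer_id, address, valid_from, valid_to, is_current)
SELECT u.customer_id, u.address, current_date(), NULL, TRUE
FROM updates u
JOIN gold.dim_customer d
  ON d.customer_id = u.customer_id
 AND d.is_current = FALSE
 AND d.valid_to = current_date()
 AND d.address <> u.address;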
Star Schema Design
Simplifying Data Access and Querying

Structure:
● Central Fact Table: Contains measures and foreign keys to dimension
tables.
● Surrounding Dimension Tables: Provide descriptive context for facts.
Advantages:
● Simplicity: Easy to understand and navigate for end-users and analysts.
● Performance: Optimized for read-heavy operations, enhancing query
speed.
● Flexibility: Facilitates ad-hoc querying and reporting without complex
joins.

Star Schema Design
Simplifying Data Access and Querying

Design Best Practices:


● Denormalize Dimensions: Reduce the number of joins required for
queries.
● Use Surrogate Keys: Maintain consistency and handle changes
effectively.
● Ensure Conformed Dimensions: Promote reuse and consistency across
different fact tables and data marts.

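A hedged sketch tying these practices together (all table and column names are illustrative): a dimension with an identity-generated surrogate key, and a slice-and-dice query joining the fact table to its conformed dimensions:

-- Surrogate keys via identity columns keep dimensions stable over time.
CREATE TABLE gold.dim_product (
  product_key  BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key
  product_id   STRING,                               -- natural/business key
  product_name STRING,
  category     STRING
);

-- Typical star-schema query: fact joined to each dimension on its key.
SELECT d.year, p.category, c.country, SUM(f.sales_amount) AS revenue
FROM gold.fact_sales f
JOIN gold.dim_date     d ON f.date_key     = d.date_key
JOIN gold.dim_product  p ON f.product_key  = p.product_key
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.year, p.category, c.country;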
Snowflake Schema Design
Normalizing Dimensions for Efficiency

Structure:
● Central Fact Table: Similar to the star schema, contains measures and
foreign keys.
● Normalized Dimension Tables: Break down dimension tables into multiple
related tables.
Advantages:
● Storage Efficiency: Reduces data redundancy, saving storage space.
● Data Integrity: Maintains consistency through normalized tables.

Snowflake Schema Design
Normalizing Dimensions for Efficiency

Disadvantages:
● Complexity: Increases the number of joins required for queries,
potentially impacting performance.
● Maintenance: More complex to manage and understand compared to
star schemas.
When to Use:
● Large, Complex Dimensions: Where normalization can significantly
reduce redundancy.
● Strict Data Integrity Requirements: Ensuring consistency across
normalized tables.

Kimball and Denormalization

Denormalization Pros and Cons – Kimball
Key to Dimensional Modeling & Performance

Pros of Denormalization:
● Optimized for Query Performance: pre-joined tables eliminate expensive multi-table joins, making queries run faster.
● Intuitive for Business Users: the star schema structure aligns with how analysts think and report on data.
● Simplifies BI & Aggregation: measures and dimensions are pre-aggregated, reducing computation time.
● Ideal for AI Feature Stores: machine learning models often require flat, wide tables, a direct outcome of denormalization.

Cons of Denormalization:
● Increased Data Redundancy: fact tables store repeated dimension values, leading to larger storage requirements.
● Risk of Data Inconsistency: updates must be carefully managed to avoid misaligned data across multiple tables.
● Not Ideal for Transactional Updates: Kimball's approach is read-optimized, making transactional updates complex.
● More Storage Overhead: large, flattened tables may result in higher storage costs compared to normalized schemas.
Denormalization and the Databricks Platform

The advent of columnar storage such as Delta Lake has reduced the need for strict normalization; having many columns in the same table no longer incurs the cost of scanning a complete row.

Denormalization almost always means duplication of data at some level, but thanks to Databricks' storage compression mechanisms and filtering capabilities, the impact of denormalization is limited.

Taken together with the ability to store row formats in columns, the data architect can have tables pre-joined but isolated in separate structs, able to be treated as individual tables or as a pre-joined result.
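A hedged sketch of that struct approach (names illustrative): dimension attributes are pre-joined into a wide table but kept isolated in their own structs, and columnar storage reads only the fields a query touches:

CREATE TABLE gold.sales_wide AS
SELECT f.order_id,
       f.sales_amount,
       named_struct('product_id',   p.product_id,
                    'product_name', p.product_name,
                    'category',     p.category)      AS product,
       named_struct('customer_id',   c.customer_id,
                    'customer_name', c.customer_name,
                    'country',       c.country)      AS customer
FROM gold.fact_sales f
JOIN gold.dim_product  p ON f.product_key  = p.product_key
JOIN gold.dim_customer c ON f.customer_key = c.customer_key;

-- Only the struct fields the query touches are scanned:
SELECT product.category, SUM(sales_amount)
FROM gold.sales_wide
GROUP BY product.category;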
Kimball Visualized

Star Schema vs. Snowflake Schema

Star schema (e.g., a sales fact with product, customer, and store dimensions):
● Fact table contains business "facts" (like transaction amounts and quantities)
● Dimension tables contain information about descriptive attributes and are typically denormalized
● Star schemas enable users to slice and dice the data, typically by joining two or more fact tables and dimension tables together

Snowflake schema (dimensions split further, e.g., product → product category and details; customer → country, city, role; store → region and type):
● Fact table as with the star schema
● Dimension tables are broken down into sub-dimensions; dimensions are normalized
● Simple data model enforcing data quality, with fast retrieval
● Higher setup and maintenance efforts
Kimball’s Dimensional Modeling
Process and Logical View

● Denormalized data model
● Built as a star or snowflake schema
● Central fact tables surrounded by dimension tables

Process view: Business Requirements → Dimensional Modeling → Physical Design → ETL Design & Development, alongside Tech Arch Design and BI App Design → BI App Development
Logical view: Sources → Staging → DWH (dimensional) with consistent Data Marts and Cubes
Dimensional modeling according to Kimball
Fundamental Concepts

1. Gather Business Requirements and Data Realities


2. Collaborative Dimensional Modeling Workshops
3. Four-Step Dimensional Design Process
4. Business Processes
5. Grain
6. Dimensions for Descriptive Context
7. Facts for Measurements
8. Star Schemas and OLAP cubes
9. Graceful Extensions to Dimensional Models

See https://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf

Example Logical Design (Retail business)

Design steps:
1. Select a business process
2. Determine granularity
3. Choose dimensions
4. Identify measures

Business processes (examples): Assortment Plans, Purchase Orders, Inventory, Customer Orders, Customer Shipments, Credit, Returns, Trended Surveys, General Ledger

For each business process, define 1..N facts. For each fact, define the lowest granularity, define its dimensions (deciding the granularity of each dimension), and define all measurements.
Data Modeling: Dimensional Modeling

Landing (bronze)
● Raw data in its original format (temporarily)

Ingestion (bronze)
● Raw data converted to Delta (from Avro, CSV, Parquet, XML, or JSON format in Landing)
● Verified data contract: schema (typically derived from the source), timeframe, …
● Sometimes called Staging

Integration - Physical data model (silver)
● Detailed information covering multiple business domains (including glossary and taxonomy)
● Integrates all data sources
● Does not necessarily use a dimensional model, but feeds dimensional models
● Derived from the Business Information Model → Logical Data Model (3NF*) → Physical Data Model

Data Mart (gold)
● Subset of the Integrated layer, sometimes filtered or aggregated data
● Focus on dimensional modeling with star schema (e.g., an Order fact surrounded by Customer, Product, and Time dimensions)
● Typically oriented to a specific line of business or team

* 3NF = "Third normal form" in data modelling
Data Warehouse Data Modeling

DEMONSTRATION

Dimensional
Modeling

Data Warehouse Data Modeling

LECTURE

Data Vault 2.0

Data Vault 2.0 in a Nutshell

Introduction to Data Vault 2.0
Modernizing Data Warehousing for Agility and Scalability

Data Vault 2.0 is an advanced evolution of the original Data Vault modeling
methodology, designed to address the complexities of modern data
warehousing.

It combines the strengths of Data Vault 1.0 with additional features to support big data, real-time analytics, and agile development practices.

Introduction to Data Vault 2.0
Modernizing Data Warehousing for Agility and Scalability

Key Objectives:
● Enhance scalability and flexibility to handle large and rapidly changing
data environments.
● Improve data integration from diverse sources with minimal latency.
● Support agile and iterative development methodologies for faster
deployment and adaptability.
Importance:
● Meets the demands of contemporary businesses for timely, accurate,
and comprehensive data insights.
● Facilitates the integration of structured and unstructured data,
accommodating various data types and sources.
Core Components of Data Vault 2.0
Building Blocks for Robust Data Integration

Hubs: Central entities representing unique business keys (e.g., Customer ID,
Product SKU).
● Contain a unique list of keys with minimal attributes (Business Key, Load
Date, Record Source).
● Serve as the primary point of integration for related data.
Links: Associations or relationships between Hubs (e.g., Customer
purchases Product).
● Capture many-to-many relationships without redundancy.
● Include foreign keys referencing related Hubs, Load Date, and Record
Source.

Core Components of Data Vault 2.0
Building Blocks for Robust Data Integration

Satellites: Descriptive or contextual data related to Hubs or Links (e.g.,
Customer Name, Address).
● Store historical and time-variant data.
● Include attributes such as Data Fields, Load Date, and Record Source.
PIT and Bridge Tables (Data Vault 2.0 Enhancements)
● PIT (Point-in-Time) Tables: Facilitate point-in-time reporting by consolidating data from
multiple Satellites.
● Bridge Tables: Handle complex many-to-many relationships and
hierarchies within the data model.
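As a hedged illustration of these building blocks (all names below are hypothetical, and hash-key conventions vary by implementation), a minimal Hub, Link, and Satellite could be declared like this in Databricks SQL:

```sql
-- Hypothetical Raw Vault sketch: one Hub, one Link, one Satellite.
CREATE TABLE IF NOT EXISTS silver.raw_vault.hub_customer (
  hub_customer_hk STRING    NOT NULL,  -- hash of the business key
  customer_id     STRING    NOT NULL,  -- business key
  load_date       TIMESTAMP NOT NULL,
  record_source   STRING    NOT NULL,
  CONSTRAINT pk_hub_customer PRIMARY KEY (hub_customer_hk)
);

CREATE TABLE IF NOT EXISTS silver.raw_vault.link_customer_order (
  link_customer_order_hk STRING    NOT NULL,  -- hash over both Hub keys
  hub_customer_hk        STRING    NOT NULL,  -- references hub_customer
  hub_order_hk           STRING    NOT NULL,  -- references hub_order (not shown)
  load_date              TIMESTAMP NOT NULL,
  record_source          STRING    NOT NULL
);

CREATE TABLE IF NOT EXISTS silver.raw_vault.sat_customer_details (
  hub_customer_hk STRING    NOT NULL,  -- parent Hub key
  load_date       TIMESTAMP NOT NULL,  -- part of the key: one row per change
  record_source   STRING    NOT NULL,
  customer_name   STRING,
  address         STRING,
  hash_diff       STRING               -- hash over attributes for change detection
);
```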

Data Vault 2.0 Architecture
Structuring for Scalability and Flexibility

Layered Architecture
● Raw Data Vault: Ingests and stores data as-is, ensuring data integrity and
traceability.
○ Components: Hubs, Links, Satellites.
● Business Data Vault: Enhances the Raw Data Vault with business logic,
derived data, and additional context.
○ Components: Derived Satellites, Calculated Metrics.
● Information Delivery Layer: Provides data through data marts, reporting
and analytics platforms.
○ Components: Data Marts (Star/Snowflake Schemas), APIs, BI Tools.

Data Vault 2.0 Architecture
Structuring for Scalability and Flexibility

Integration with Modern Technologies


● Big Data Platforms: Seamlessly integrates with Hadoop, Spark, and
cloud-based data warehouses.
● Real-Time Processing: Supports real-time data ingestion and streaming
analytics.
Agile and DevOps Alignment
● CI/CD: Facilitates automated testing, deployment, and version control.
● Modular Development: Enables incremental and parallel development of
different components.

Data Vault 2.0 Methodology
Agile and Scalable Development Practices

Planning and Requirements Gathering


● Define business objectives, key metrics, and data sources.
● Establish governance and data quality standards.
Modeling
● Design Hubs, Links, and Satellites based on business keys and
relationships.
● Incorporate Pit and Bridge Tables as needed.

Data Vault 2.0 Methodology
Agile and Scalable Development Practices

ELT Development
● Develop Extract, Load, Transform (ELT) processes to populate the Raw
and Business Data Vaults.
● Implement data quality checks and transformation logic.
Testing and Validation
● Ensure data accuracy, integrity, and performance through rigorous
testing.
● Validate against business requirements and use cases.

Data Vault 2.0 Methodology
Agile and Scalable Development Practices

Deployment and Maintenance


● Deploy the Data Vault to production environments.
● Continuously monitor, maintain, and enhance the data warehouse.
Agile Practices
● Iterative Development: Build the data warehouse in manageable
increments, allowing for flexibility and adjustments.
● Cross-Functional Teams: Collaborate across technical and business
teams to ensure alignment and address evolving needs.
● Continuous Feedback: Incorporate user feedback to refine and optimize
the data model and ETL processes.

Hubs, Links, and Satellites

Implementing Hubs in Data Vault 2.0
Capturing Core Business Entities

Purpose of Hubs:
● Represent business keys; central points of integration for related data.
● Ensure consistency and traceability of core business entities.
Design Considerations:
● Business Keys: Stable and unique business identifiers (e.g., Customer ID).
● Minimal Attributes: Maintain simplicity and reduce redundancy.
● Load Date and Record Source: For auditing and lineage purposes.
Best Practices:
● Consistent Naming Conventions: Use clear, standardized names for Hubs.
● Avoid Redundancy: Each Hub represents a single business key.
● Referential Integrity: Between Hubs and Links/Satellites
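One way to load such a Hub, as a sketch that reuses the hypothetical hub_customer table above and assumes a bronze CRM source table, is to hash the normalized business key and insert only keys not yet present:

```sql
-- Hypothetical Hub load: hash the cleaned business key, keep only new keys.
INSERT INTO silver.raw_vault.hub_customer
SELECT
  sha2(upper(trim(s.customer_id)), 256) AS hub_customer_hk,
  s.customer_id,
  current_timestamp()                   AS load_date,
  'crm'                                 AS record_source
FROM bronze.crm.customers s
LEFT ANTI JOIN silver.raw_vault.hub_customer h
  ON sha2(upper(trim(s.customer_id)), 256) = h.hub_customer_hk;
```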
Implementing Links in Data Vault 2.0
Modeling Relationships Between Business Entities

Purpose of Links:
● Capture relationships between Hubs (e.g., Customer purchases Product).
● Enable modeling of many-to-many relationships without redundancy.
Design Considerations:
● Identify Relationships: Determine how business keys interact and relate.
● Include Foreign Keys: Reference primary keys from related Hubs.
● Load Date and Record Source: Track Link ingestion time and source.
Best Practices:
● Atomic Relationships: A Link should represent a single relationship between Hubs.
● Avoid Overcomplicating: Links are for meaningful business relationships.
● Scalability: Design to accommodate future expansions and relationships.
Implementing Satellites in Data Vault 2.0
Storing Descriptive and Historical Data

Purpose of Satellites:
● Store descriptive, contextual, time-variant data related to Hubs or Links.
● Enable historical tracking and auditing of changes over time.
Design Considerations:
● Segmentation: Separate Satellites by subject areas or update frequency.
● Include Load Metadata: e.g. Load_Date, Record_Source, and End_Date.
● Handle SCDs: Manage changes in dimension attributes.
Best Practices:
● Granular Separation: Separate Satellites for different types of data.
● Update Mechanisms: Uniform processes updating Satellite data.
● Documentation: Satellite purpose and contents to aid in usage.
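A hedged sketch of such an update mechanism (the names continue the hypothetical example above, and staging.customer_changes is an assumed pre-hashed staging table): insert a new Satellite row only when the attribute hash differs from the most recent row for that Hub key, yielding insert-only history:

```sql
-- Hypothetical incremental Satellite load: append only changed attribute sets.
INSERT INTO silver.raw_vault.sat_customer_details
SELECT
  s.hub_customer_hk,
  current_timestamp() AS load_date,
  s.record_source,
  s.customer_name,
  s.address,
  s.hash_diff
FROM staging.customer_changes s
LEFT JOIN (
  SELECT hub_customer_hk,
         max_by(hash_diff, load_date) AS current_hash_diff  -- latest row per key
  FROM silver.raw_vault.sat_customer_details
  GROUP BY hub_customer_hk
) cur
  ON s.hub_customer_hk = cur.hub_customer_hk
WHERE cur.current_hash_diff IS NULL           -- brand-new key
   OR cur.current_hash_diff <> s.hash_diff;   -- attributes changed
```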
Data Vault 2.0 Visualized

DWH Modeling Approaches - Data Vault
Process and Logical View

The Data Vault model is based on three basic entity types:
• Hubs separate core business concepts
• Links store relationships between business concepts
• Satellites store the attributes of a business concept or relationship

The Data Vault model is split into:
• Raw Vault / Raw Data Vault:
  • Stores unaltered, granular source data.
  • Immutable, historical record of all data in the Data Vault.
• Business Vault / Business Data Vault:
  • Sparsely modeled DWH based on Data Vault design principles
  • Data is modified according to business rules or requirements
• Information Marts:
  • Like Data Marts
  • Dimensional model based

[Diagram: Architecture: Sources → Staging → DWH (Data Vault: Raw Vault → Business Vault) → Information Marts / Data Cubes. Data model: the process (taxonomies, ontologies) drives the logical data model, which drives the physical data models (Raw Vault, Business Vault, Information Marts). Logical view: Hubs, Links, and Satellites, plus Bridge and Point-in-Time tables]
Data Vault Work-process
[Diagram: work process. Logical Data Model: Define Ontology → Define Taxonomies → Model Information Mart; Physical Data Model: Model Business Vault → Model Raw Vault. Start with what the business needs; an Enterprise Business Ontology drills down into Domain Ontologies]

Ontologies
• Define how the business sees its data
• Model real-life entities
• Start with business concepts
• Connect business concepts with business keys
• Drill down into the hierarchies (Taxonomies)

Taxonomies
• Follow a hierarchical format and provide names for each object in relation to other objects
• Capture the membership properties of each object in relation to other objects
• Have specific rules to classify or categorize any object in a domain; the rules must be complete, consistent, and unambiguous
• Each object inherits all properties of the class above it and may have additional properties

Ontologies provide context to developers, designers, and business users on how the data fits the business.

"Data Vault Modeling was, is, and always will be about the business" - Dan Linstedt (creator of Data Vault)
Data Modeling: Data Vault 2.0
Landing (bronze)
• Raw data in its original format (temporarily)

Ingestion (bronze, sometimes called Staging)
• Raw data converted to Delta (from Avro, CSV, Parquet, XML, or JSON format in Landing)
• Verified data contract: schema (typically derived from the source), timeframe, …

Integration - Raw Vault (silver)
Data is modeled as:
• Hubs (unique business keys)
• Links (relationships and associations)
• Satellites (descriptive data)

Integration - Business Vault (silver)
Tables with applied business rules, data quality rules, cleansing and conforming rules:
• Business views
• Point-in-Time (PIT) tables (opt.)
• Bridge tables created on top of the Business Vault (opt.)

Presentation - Information Marts (gold)
• Similar to a classical Data Mart, with data that has been cleansed and harmonized
• Consumer-oriented models (typically views)

[Diagram: Landing (raw data, temp.) → Ingestion (verified data, bronze) → Integration (Raw Vault: Hubs, Links, Satellites; Business Vault: business views, PIT and Bridge tables; silver) → Presentation (Information Marts, gold), loaded via ETL/ELT; example Data Vault 2.0 model: Customer, Product, and Order Hubs with their Satellites, connected by a Link]
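As a hedged example of a Business Vault object over the hypothetical Raw Vault tables sketched earlier, a "current customer" view can select the latest Satellite row per Hub key:

```sql
-- Hypothetical Business Vault view: the current state of each customer.
CREATE OR REPLACE VIEW silver.business_vault.customer_current AS
SELECT
  h.customer_id,
  s.customer_name,
  s.address,
  s.load_date
FROM silver.raw_vault.hub_customer h
JOIN silver.raw_vault.sat_customer_details s
  ON h.hub_customer_hk = s.hub_customer_hk
QUALIFY row_number() OVER (
  PARTITION BY s.hub_customer_hk ORDER BY s.load_date DESC) = 1;
```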
Data Warehouse Data Modeling

DEMONSTRATION

Data Vault 2.0

Modern Data
Architecture Use
Cases

Data Modeling Strategies


Objectives
● Introduce modern AI-driven use cases: featurization, real-time inference
● Illustrate modern use case study (as distinguished from DWH use case)
● Explore medallion approach for featurization
● Highlight Feature Store integration
● Create a feature table
● Register the table in Feature Store
● Recap differences among Inmon, Kimball, DV, and Modern approaches
● Examine the benefits of the enhanced medallion architecture

Modern Data Architecture Use Cases

LECTURE

Feature Stores

Modern Data Modeling for ML and AI
Foundations for Advanced Analytics and Intelligence

Modern data modeling for ML and AI encompasses specialized structures
and practices to support the development, deployment, and maintenance
of intelligent systems.
Key Components:
● Feature Stores and Feature Tables
● Data Pipelines
● Model Management
Importance:
● Enhances the efficiency and effectiveness of ML/AI workflows.
● Ensures consistency, scalability, and reusability of features across
different models and teams.
Feature Stores in a Nutshell

Understanding Feature Stores
Centralizing Feature Management for ML and AI

A Feature Store is a centralized repository that manages, stores, and
serves features used in machine learning models.
Purpose:
● Consistency: Features used during training and serving are identical.
● Reusability: Allows reuse of existing features across multiple models.
● Scalability: Supports large-scale feature computation and storage.
Core Functions:
● Feature Storage: Persistent storage of computed features.
● Feature Serving: Real-time or batch access for model inference.
● Feature Governance: Metadata, lineage, and access controls.
Understanding Feature Stores
Centralizing Feature Management for ML and AI

A Feature Store is a centralized repository that manages, stores, and
serves features used in machine learning models.
Benefits:
● Reduces duplication of feature engineering efforts.
● Enhances collaboration between data engineering and data science
teams.
● Improves model deployment speed and reliability.

Architecture of a Feature Store
Components of a Feature Store

Core Components:
● Repository: Stores and manages feature definitions and metadata.
● Storage Layer: Physical storage systems (e.g., databases, data lakes)
where feature data resides.
● Serving Layer: APIs and services that provide features to ML models in
real-time or batch modes.
● Registry: Catalogs available features, including their definitions, sources,
and usage statistics.
● Transformation Layer: Tools and processes for feature engineering and
transformations.

Architecture of a Feature Store
Workflow of a Feature Store

Workflow:
● Feature Engineering: Data scientists create and transform raw data into
features.
● Feature Registration: Features are registered in the feature store with
metadata.
● Feature Storage: Transformed features are stored in the feature
repository.
● Feature Serving: Features are served to ML models during training and
inference.
● Monitoring and Management: Ongoing monitoring of feature quality,
usage, and performance.
Feature Tables

Feature Tables: Definition and Purpose
Organizing Features for Efficient ML Workflows

Feature Tables are structured tables within a feature store that organize
related features for specific ML use cases or business domains.
Each row represents a unique entity instance together with its feature values.
Purpose:
● Logical Grouping: Groups features by subject area for easier
management and access.
● Performance Optimization: Organizes features in a way that aligns with
ML workflows.
● Version Control: Manages feature table versions to track changes and
ensure reproducibility.
● Access Control: Implements granular access permissions.
Feature Tables: Definition and Purpose
Organizing Features for Efficient ML Workflows

Feature Tables are structured tables within a feature store that organize
related features for specific ML use cases or business domains.
Each row represents a unique entity instance together with its feature values.
Structure (Columns):
● Feature Name: Identifier for each feature.
● Data Type: Specifies the type of data (e.g., integer, float, string).
● Description: Detailed explanation of the feature’s purpose and usage.
● Source: Origin of the feature (e.g., raw data, derived).
● Creation Timestamp: When the feature was created or last updated.
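As a minimal sketch (catalog, schema, and column names are hypothetical): in Unity Catalog, a Delta table with a declared primary key can serve as a feature table for Databricks Feature Engineering:

```sql
-- Hypothetical feature table: the primary key identifies the entity;
-- the remaining columns are feature values.
CREATE TABLE IF NOT EXISTS ml.features.customer_features (
  customer_id      STRING NOT NULL,
  orders_last_30d  INT,
  avg_basket_value DOUBLE,
  CONSTRAINT pk_customer_features PRIMARY KEY (customer_id)
);
```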

Types of Feature Tables
Categorizing Feature Tables Based on Use Cases

Static Feature Tables: Features that remain constant over time.


● Low update frequency.
● Typically sourced from master data systems.
Dynamic Feature Tables: Features that are updated regularly.
● High update frequency.
● Often derived from transactional or streaming data sources.
Aggregated Feature Tables: Features that represent aggregated data.
● Computed using aggregation functions (e.g., sum, average).
● Support trend analysis and forecasting.

Types of Feature Tables
Categorizing Feature Tables Based on Use Cases

Temporal Feature Tables: Features that capture time-based changes and trends.
● Incorporate historical data points.
● Enable time-series analysis and predictive modeling.
Composite Feature Tables: Combine multiple types of features from
different sources or categories.
● Merge static, dynamic, and aggregated features.
● Support complex ML models that require diverse feature sets.
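For temporal feature tables specifically, a hedged sketch (names are hypothetical): Databricks lets a timestamp column be marked as the time series key, so point-in-time lookups can retrieve the value that was current at training time:

```sql
-- Hypothetical time series feature table: TIMESERIES marks the timestamp key.
CREATE TABLE IF NOT EXISTS ml.features.customer_features_ts (
  customer_id STRING    NOT NULL,
  ts          TIMESTAMP NOT NULL,
  clicks_1h   INT,
  CONSTRAINT pk_customer_features_ts PRIMARY KEY (customer_id, ts TIMESERIES)
);
```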

Modern Use Cases Visualized

Data Modeling: Modern use cases (ML and AI)

Landing (bronze)
• Raw data in its original format (could be temporary)
• A landing zone allows bronze in Delta format, independent of the original input format

Ingestion (bronze)
• Delta data converted from Raw (from Avro, CSV, Parquet, XML, or JSON format in Landing)
• Verification typically lightweight compared to DWH
• No other transformation or business logic is applied
• Often a "schema on read" approach

Curation (silver)
• Cleansed data, filtered data, and augmented data

Final (gold)
• Business-level aggregates
• Masked, reduced, anonymized for project purposes
• Denormalized for performance if needed

[Diagram: Landing (raw data, temp.) → Ingestion (verified data, bronze) → Curation (cleansed, filtered, augmented data; silver) → Final (business-level aggregates, project data; gold), loaded via ETL/ELT and consumed from Python, R, SQL, and Scala]
Modern Data Architecture Use Cases

DEMONSTRATION

Modern Case
Study:
Feature Store

Modern Data Architecture Use Cases

LECTURE

Combining
Approaches

Assessing DWH Models

Evaluating DWH modeling paradigms
Key examples (non-exhaustive)

Ability to change
● Big impact on Inmon models when business process changes (higher
effort and duration).
● Business changes, especially significant ones, can break the basis of a
Kimball model (higher effort and duration).
● Data Vault 2.0 structure facilitates reacting to business changes (lower
effort).

Evaluating DWH modeling paradigms
Key examples (non-exhaustive)

Complexity
● Inmon leads to very complex ETL and load dependencies that need to
be handled through load flow optimizations or additional ETL jobs to
ensure model consistency.
● Kimball dimensional models can be very hard to populate, since you
have to ensure consistency with the dimensions. Dimension logic can be
hard, particularly slowly changing dimensions of Type 2 and above can
be challenging.
● Data Vault 2.0 has 3-6 times more objects than a pure 3NF DW; this
impacts ETL, but the ETL is simplified, easily automated, and can for the
most part be run in parallel.
Evaluating DWH modeling paradigms
Key examples (non-exhaustive)

Robustness
● Inmon models can easily break due to changes in business processes
and business rules.
● Kimball is the simplest model to understand, but a critical mass of
changes entails remodeling large portions.
● For Data Vault 2.0, most changes can be compartmentalized to a
specific layer.

DWH Model Advantages
Summary

Inmon
• Normalized Structure: Inmon models follow a normalized approach, reducing data redundancy.
• Single Source of Truth: The data warehouse is designed as a single integrated repository.
• Well-Suited for Large EDW: Recommended for large-scale data integration.

Kimball
• Optimized for Reporting and Analytics: Dimensional models are designed specifically for efficient querying and reporting, and provide a clear structure for business users to understand.
• Easy to Understand: Business users find dimensional models intuitive due to their star schema or snowflake schema representation.
• Well-Suited for Smaller Projects: Quicker to implement for smaller-scale data marts.

Data Vault
• Operational Flexibility: Data Vault allows you to stay close to the source data, making it auditable and scalable.
• Easier to Add New Sources: Data Vault is flexible and accommodates new data sources seamlessly.
• Historical Data Tracking: Inherent support for historical data.

DWH Model Challenges
Summary

Inmon
• Complexity: Inmon models can be complex to implement and maintain.
• Slower Query Performance: Normalized structures may require more joins, impacting query performance.
• Less Intuitive for Business Users: Business users may find the normalized structure less intuitive.

Kimball
• Data Redundancy: Dimensional models may have some data redundancy due to denormalization, which can lead to maintenance challenges.
• No Single Source of Truth: Data marts are organized around business areas, which can result in multiple versions of the same data.

Data Vault
• Not Ideal for Analysis & Reporting: Data Vault may not be the best choice for direct analysis and reporting; you might still need dimensional modeling for virtual data marts.
• Complexity: Data Vault models can become complex, especially when directly populating data marts from them.
• Joins and Recursions: Populating data marts can lead to complex joins and recursions.

Common DWH modeling challenges
The attraction of “no modeling”

Many organizations find it challenging to handle the life cycle around data
models. Challenges come in the form of people, process, and technology.
● To maintain a database, you need DBAs or data engineers
● To model databases, you need data modelers
● To do a “correct” data model, you need access to the business

Common DWH modeling challenges
The attraction of “no modeling”

For most organizations, the mere overhead of having data modelers talk to
the business, as well as the time it takes to introduce changes, negatively
affects the organization's ability to adapt to new conditions.
Still, cutting corners in this process has side effects: Data correctness and
potentially data quality degrade, traded for speed and agility in the data
process.

The Enhanced Medallion

Medallion, the best practice pipeline
[Diagram: Spark streams land in Bronze; Silver applies time-series resampling & interpolation, feature reduction, and feature enhancement; Gold holds the final tables]

Bronze - Ingestion
• Raw data
• No data processing
• Data kept around to fix mistakes

Silver - Curated
• Cleansed and conformed data
• Directly queryable
• PII masking/redaction

Gold - Final
• Curated business-level tables
• Project/use case specific
• Denormalized and read-optimized data models
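As a hedged sketch of the Bronze ingestion step (the path, table names, and file format are placeholders), a streaming table using Auto Loader's read_files can continuously convert landed files to Delta:

```sql
-- Hypothetical bronze ingestion: incrementally load landed JSON files as Delta.
CREATE OR REFRESH STREAMING TABLE bronze.sales.orders_raw AS
SELECT
  *,
  _metadata.file_path AS source_file,   -- lineage back to the landed file
  current_timestamp() AS ingested_at
FROM STREAM read_files(
  '/Volumes/landing/sales/orders/',
  format => 'json'
);
```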
Combining worlds - the Enhanced Medallion
[Diagram: batch and streaming sources land on cloud storage; bronze (Landing: raw data, temp.; Ingestion: verified data) → silver (Curation: cleansed, augmented data; Integration: business information model) → gold (Final: business-specific data; Presentation: data marts). The Curation → Final path serves modern use cases, while the Integration → Presentation path serves BI use cases on strictly modeled and verified data]

Data Lake for modern use cases
• Staging: Raw data in its original format
• Ingestion: Raw data verified and converted to Delta
• Curation: Cleansed, homogenized data with fundamental business logic applied
• Final: Business/project-ready datasets

DWH for BI use cases
• Integration: Enterprise DWH (one or more)
• Presentation: Business-ready DWH information (data marts)

Combining both worlds in the Lakehouse allows:
1. Access to corporate KPIs for modern use cases
2. Accelerated delivery for DWH use cases
Three layers from data to information
Explorative & Flexible (Curation and Final layers): data for modern use cases (self-service)
• Modern use cases (exploratory data analysis, data science, …)
• High flexibility: all sorts of workload types (e.g., ML, experimental workloads) supported
• All data types supported
• No compliance with the business information model needed

Semantically consistent data: enhanced business perspective (self-service)
• Not integrated into the business information model, but consistently transformed to allow joining with integration data
• Allows enhancements of business perspectives while keeping the DWH model stable

Conformed & Stable (Integration and Presentation layers, data marts): business information model
• BI and advanced analytics use cases on business information
• Stable business information model according to business objectives and OKRs
• Used for financial reporting, KPIs, company dashboards, etc.
• Follows a strict change process

Three layers from data to information
[Diagram: the same three layers mapped to data product types. The Explorative & Flexible layer (Curation/Final, self-service) holds arbitrary data and independent data products; the semantically consistent layer (enhanced business perspective, self-service) holds certified data products; the Conformed & Stable layer (Integration/Presentation, the DWH model) holds the business information model and data marts, spanning silver and gold]

Data Products

Data Modeling Strategies


Objectives
● Introduce the “data product” concept in a domain-oriented approach
● Adopt a working definition for any data product
● Understand data product categories and hierarchies
● Explore data product processes and a typical lifecycle
● Map these ideas to Lakehouse capabilities (Unity Catalog, medallion
layers)
● Introduce the concept of data contracts for data products
● Consider potential topologies for managing a data product portfolio
across domains.

Data Products

LECTURE

Defining Data
Products

Why Data Products?

Traditional Data Management Falls Short
Disconnected teams, inconsistent data, and slow time to value

● Centralized data platforms struggle to scale governance across teams.


○ Traditional DWH approaches require central IT ownership, creating bottlenecks.

● Teams operate in silos, creating fragmented datasets.


○ Redundant datasets emerge as different teams manage their own versions.

● Data trust is low due to inconsistency and unclear ownership.


○ Business units often rely on unofficial data sources, leading to misaligned reporting.

● AI and ML demand new levels of data availability and reusability.


○ Traditional governance models can’t keep up with real-time, feature-driven AI needs.

From Data Assets to Data Products
Shifting from fragmented datasets to managed, reusable assets

● Data products apply product thinking to data management.


○ Data is owned, documented, and versioned like software products.

● Each data product has clear ownership, governance, and SLAs.


○ Domains are responsible for quality, compliance, and evolution.

● Data consumers (AI, BI, analytics) access standardized, reusable data.


○ Instead of duplicating datasets, teams reuse certified, governed products.

● Accelerates AI, ML, and cross-functional collaboration.


○ Feature Stores, data marts, and real-time APIs function as governed data products.

Semantic Consistency and Interoperability
Building Trust Through Governance and Standardization

● Data contracts set clear expectations for producers and consumers.


○ Define schema, update frequency, access policies, and SLAs.

● Publishing and certification enable trust and discoverability.


○ Data products are registered in a centralized catalog (Unity Catalog).

● Versioning and lineage tracking ensure long-term usability.


○ Consumers always know which version of a data product they are using.

● AI and analytics teams can integrate trusted, reproducible data products.
○ Removes manual wrangling, reprocessing, and inconsistency in ML pipelines.
How Data Products Power AI & BI
Accelerating value delivery across functional domains

● Marketing teams leverage certified customer segmentation models.


○ Eliminates ad-hoc data wrangling for campaign targeting.

● Data Science teams access feature-engineered datasets.


○ Ensures real-time ML models remain consistent with training data.

● Finance teams operate on version-controlled financial data.


○ Guarantees auditability, accuracy, and reporting consistency.

● Cross-domain data sharing is simplified via governance.


○ Instead of manually reconciling datasets, teams use trusted, certified data products.

Scalable, Governed, and AI-Ready
Aligning structured data management with modern use cases

● Data products offer a structured, scalable alternative to ad-hoc datasets.
○ Eliminates redundancy while enforcing governance and usability.

● Governance is embedded, ensuring compliance without bottlenecks.


○ Data owners maintain contracts, SLAs, and lineage tracking.

● AI & BI teams can leverage certified data instead of manual extracts.


○ Reduces time spent cleaning and transforming data.

● A well-architected Lakehouse supports data product thinking at scale.


○ Unity Catalog + Medallion Architecture ensure structured, governed data flows.
Data Products in a Nutshell

Data and “product thinking”
A data product facilitates an end goal through the use of data

To publish data as data products, “product thinking” needs to be applied:


Every data product:
● Has an owner and is built for specific audiences;
● Follows a defined product life cycle;
● Is defined and described by a data contract; and
● Is published following an agreed-upon governance process.

Data Product
Usability Characteristics

A Data Product adheres to a set of usability characteristics:

● Discoverable: Users need to be able to explore the availability of data


● Addressable: Permanent and unique address for programmatic access
● Understandable: What are its semantics? How is it serialized?
● Trustworthy and truthful: Correctly represents the business
● Natively accessible: Readily available in the user’s environment/tool
● Interoperable and composable: Cross-domain semantic consistency
● Valuable on its own: Provides insights without related data
● Secure: Access control and privacy

Data Product
Imperatives

A Data Product needs to be:

● Consumption-ready: Trusted by consumers


● Kept up to date: By engineering teams for agreed-upon SLAs
● Approved for use: Governed using data contracts/agreements

Data Product concept attributes
[Diagram: data product concept attributes grouped by concern. Discoverability (published data product): discoverable, addressable, natively accessible. Quality & Observability (trusted data asset): valuable on its own, understandable, trustworthy and truthful. Semantic Consistency (compliant with governance rules): interoperable and composable. Security (organization-wide data governance) and Privacy (potentially anonymized): secure. All of these are the responsibility of the owner]
Data Product
Categories & Hierarchies

Example data product categories
Much more than tables, with varying producers (P) and consumers (C)
Datasets
• Tabular data (SQL tables, dataframes; e.g., facts, dimensions, metrics, time series, KPIs, metadata, …)
• ML & AI features
• Streams

Models
• Classical ML & AI
• LLM

Consumption channels
• Queries & Notebooks (consumption ready)
• Dashboards
• Reports
• Alerts

Data services
• API (e.g., served models)

[Table in the original maps each data product type to the roles that produce (P) and consume (C) it: data engineer, data scientist, ML engineer, business analyst, and business user. For example, data engineers and data scientists both produce and consume tabular data and features, while business users primarily consume dashboards, reports, and alerts]
Data Product hierarchy
[Diagram: source systems (PLM, Manufacturing, SCM, CRM, Social Media Aggregator, Order Data) feed source-aligned data products (plm, manufacturing, scm_suppliers, crm_tickets, call_centers, crm_customers, social_media, orders), which feed derived data products (products, partners, customer_products, customer_loyalty, customers, clickstream), which feed consumer-aligned data products (recommendations, product_popularity, customer_segments)]

Source-aligned data products
• Represent the relevant data as it is in the operational system, with minimal transformation
• Cleansed and transformed to ensure quality
• First step to creating more valuable data products

Derived data products
• Created by processing and transforming source-aligned data products or other derived data products
• Satisfy user needs, e.g., for decision-making or automated decision-making
• Can be reused in other derived data products

Consumer-aligned data products
• Specifically built for end users, e.g., dashboards and reports
Data Product hierarchy (with Ownership)
[Diagram: the same hierarchy annotated with medallion layers (each data product internally spans bronze, silver, and gold stages as appropriate) and with ownership: source-aligned and derived data products belong to owners and domains, while consumer-aligned data products are owned with support from the Hub domain]

Data Products in the Lakehouse

Data Products to combine different worlds
[Diagram: a domain standardized on the data product paradigm (sources → source-aligned data products → derived data products) coexists with a domain following the Inmon DWH paradigm (sources → staging → 3NF DWH → data marts) and a domain following the Kimball DWH paradigm (sources → data marts that form the DWH)]

Data products as facades
[Diagram: domains keep their internal implementations (the data product domain with source-aligned and derived data products, an Inmon domain with staging → 3NF DWH → data marts, and a Kimball domain with data marts), but each exposes semantically consistent data products as facades, implemented as views or materialized views; derived data products are then built on top of these facades]
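A hedged sketch of one such facade (all names are hypothetical): the domain keeps its internal mart, and the data product is a materialized view exposing a stable, consumer-oriented shape:

```sql
-- Hypothetical facade over an internal Kimball-style mart.
CREATE OR REPLACE MATERIALIZED VIEW gold.products.customers AS
SELECT
  c.customer_id,
  c.customer_name,
  c.segment
FROM internal_dwh.marts.dim_customer c
WHERE c.is_current = true;  -- expose only the current row per customer
```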


Data Product integration into Unity Catalog
Example architecture

[Diagram: source systems are connected via Lakehouse Federation (LF), external tables (ET) over object storage, or Databricks-to-Databricks Delta Sharing (DS); each domain, whether it follows the data product, Inmon, or Kimball paradigm, exposes its data products as facades (direct access or materialized views) into an enterprise catalog, with the whole integration governed by Unity Catalog]
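As a hedged sketch of the Delta Sharing path (the share, recipient, and table names and the sharing identifier are placeholders), a domain can publish a data product to another metastore:

```sql
-- Hypothetical Databricks-to-Databricks share of a consumer-aligned data product.
CREATE SHARE IF NOT EXISTS customer_data_products
  COMMENT 'Consumer-aligned data products from the sales domain';

ALTER SHARE customer_data_products
  ADD TABLE gold.sales.customer_segments;

CREATE RECIPIENT IF NOT EXISTS analytics_hub
  USING ID 'aws:us-west-2:<metastore-uuid>';  -- recipient's sharing identifier

GRANT SELECT ON SHARE customer_data_products TO RECIPIENT analytics_hub;
```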


Data Product
Processes and Lifecycle

Five core processes
When defining a data product

Domain-specific processes:
• Data Production: Creation of data products via ingestion & ETL or by business-oriented teams.
• Data Publishing: A deliberate step to provide other consumers with access to a data product.
• Data Consumption: Use of own data and published data products for analysis, reporting, ML, …

Domain-agnostic processes:
• Federated Computational Governance: Ensuring a data ecosystem that adheres to organizational rules and industry regulations through standardization. The goal is an "equilibrium between centralization and decentralization": data products conform to a shared set of rules, leaving space for autonomous decision-making by the data domains.
• Platform Operations: A central platform team is responsible for defining a common, organization-wide infrastructure and ensuring company-wide rules and policies are applied. To avoid this team becoming a bottleneck, the necessary capabilities need to be provided to the data domains in a self-service way.

Data Products
Typical lifecycle
[Diagram: Inception → Design → Creation → Publishing → Operation + Governance → Retirement; consumption and feedback drive value creation, and feedback information triggers iteration of new versions]

• Inception: Start with desired business outcomes; assign an owner; assign resources; define business metrics
• Design: Create a data contract and a data product design specification; ensure semantic consistency with other data products
• Creation: Build modular pipelines, features, models, dashboards, alerts, …; test against the data contract
• Publishing: Deploy using DataOps or MLOps (for models); publish to the catalog; manage access permissions according to the data contract
• Operation + Governance: Monitor metrics, quality, usage, and permissions; handle compliance requests; audit data product access
• Retirement: Deprecate the product; inform consumers; shut down production; archive assets; clean up resources

Roles involved across the lifecycle: business/consumer, product owner, data engineer / data scientist / business analyst, data steward, DataOps/MLOps
Mapping the Databricks Lakehouse to the data product lifecycle

[Diagram: the lifecycle stages (Inception → Design → Creation → Publishing → Operation + Governance → Retirement) mapped to Databricks capabilities. Inception and design: docs repos and the data contract / data product specifications owned by the team. Orchestration: Databricks Workflows, Repos, CI/CD, MLOps/LLMOps. ETL and processing engine: DLT, Auto Loader, Structured Streaming. Value creation: Data Warehousing with serverless Databricks SQL and dashboards, Lakehouse AI with notebooks, features, and AutoML, plus Marketplace. Data and AI governance: Unity Catalog with access control, Data Explorer, auditing, lineage, system tables, and Lakehouse Monitoring]
Data Contracts

Data Contract
A formal way to align domains and implement federated governance

A data contract should be provided by the data producer, but designed
with the consumer in mind. Important aspects for a consumer:
● Data description (name, description, source systems, attribute selection)
● Data schema (tables, columns, anonymization & encryption, filters, masks)
● Usage policies (tags, PII, guidelines, data residency)
● Data quality (applied quality checks and constraints, quality metrics)
● Security (who is allowed to use the data product)
● Data SLAs (last update, expiration dates, retention time)
● Responsibilities (owner, maintainer, escalation contact, change process)
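Parts of such a contract can be made machine-visible in Unity Catalog; a hedged sketch (the table, tag, and group names are hypothetical) using comments, tags, and grants:

```sql
-- Hypothetical sketch: surfacing data-contract facts as Unity Catalog metadata.
COMMENT ON TABLE gold.sales.customers IS
  'Certified data product: customer master. Owner: sales domain. SLA: daily by 06:00 UTC.';

ALTER TABLE gold.sales.customers
  SET TAGS ('data_product' = 'certified', 'contains_pii' = 'true');

-- Security: who is allowed to use the data product.
GRANT SELECT ON TABLE gold.sales.customers TO `data_consumers`;
```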
Data Contract

• Data description: Name, owner, description, source systems, …
• Data schema: Tables, columns, anonymization and encryption info, …
• Data quality: Applied quality checks, quality metrics, …
• Data SLAs: Last update, expiration dates, retention time, usage restrictions, code of conduct, re-sharing conditions, …
• Security: Who is allowed to use the data product
• Explanatory add-ons (optional): Notebook, dashboard, sample code, …

Data contract-based governance
Potential process to achieve consistent “certified data products”
[Diagram of the process: (1) a domain proposes a data contract; (2) the Governance Team assesses, gives feedback, and approves; (3) the contract is approved; (4) the domain publishes the data product; (5) it appears in the catalog/marketplace; (6) consumers discover it and understand its usage via the data contract; (7) consuming domains use the data product, with the data residing on cloud storage]

Certified data products carry the "stamp" of the Governance Team (golden data products), but other data products can be published without involving the Governance Team.
Independent and certified data products
“Equilibrium between centralization and decentralization”

Certified Data Products
• High-quality data products that are semantically consistent and can be combined easily
• A Governance Team of domain representatives agrees on rules and policies for certified data products and approves and governs their data contracts

Independent Data Products
• High-quality data products, but no guarantee that they can be combined easily
• Autonomous domains can publish data products as they believe their data is represented best
Data Product Topologies

The basic topologies
Harmonized vs. Hub-and-Spoke

When structuring a distributed architecture where the different domains are autonomous but need to share data, two basic approaches exist:

• Harmonized: fully autonomous data domains publish metadata to and discover data through a global catalog (C); data consumption and external sharing happen domain-to-domain.
• Hub-and-Spoke: data domains publish and discover data products through a global hub (H); data consumption and external sharing go through the hub.
Harmonized topology
No central data team, all domains are autonomous

• Each domain hosts and serves its own data products
• Global catalog for central discovery
• Each domain has the skills to manage the end-to-end data lifecycle
• May be inefficient if there is a high level of repeatability/similarity between data products
• Central IT defines the technology blueprint and provides best practices and setup automations

[Diagram: fully autonomous data domains publish metadata to and discover data through a global catalog; consumption and external sharing remain domain-to-domain]
Hub-and-Spoke
Global data hub for publishing, discovery, and serving of data products

• More central data governance
• A central team maintains infrastructure services like a global catalog
• The central team can itself be a data domain building data products
• Requires spokes (domains) to publish shareable data products to the global data hub
• Can reduce data sharing and management overheads when there are many domains

[Diagram: data domains publish and discover data products through a global hub; data consumption and external sharing go through the hub]

Topologies in the real world
The target topology will be a mix of both

There is a tendency to go for Hub-and-Spoke. However, to avoid bottlenecking, the architecture and processes need to:
• Support autonomous domains
• Allow non-autonomous domains to mature over time and become autonomous

Even for autonomous data domains, data should be published centrally to enable consistent governance and a single point of discovery (global data catalog).

[Diagram: a global hub serves data domains, some of which are candidates to become autonomous over time; fully autonomous domains still publish and discover data products through the hub while consuming data and sharing externally]
How to ensure semantic consistency?
The published data products need to be aligned

To increase data quality and enable users to work with and combine data sets, published data products need to be semantically consistent:
• Align data products on:
  ○ Context
  ○ Granularity
  ○ Terminology (naming consistency)
• Ensure correctness (viz. business logic)

[Diagram: semantically consistent data products (DP) published to the global hub by both autonomous and non-autonomous data domains]

Data Warehouse Data Modeling

LAB EXERCISE

Data Warehousing
Modeling with ERM
and Dimensional
Modeling in
Databricks
Summary and Next
Steps

Data Modeling Strategies

Course Learning Objective Recap
● Design and implement data models tailored to specific business needs
within the Databricks Lakehouse environment.
● Differentiate between different types of modeling techniques and
understand their respective use cases.
● Analyze business needs to determine data modeling decisions.
● Design logical and physical data models for specific use cases.

Course Learning Objective Recap
● Understand Data Products definition and use cases.
● Understand the data product lifecycle.
● Explore the stages of data product lifecycle.
● Organize Data in Domains and in Unity Catalog.
● Utilize Delta Lake and Unity Catalog to define data architectures.
● Explore Data Integration and secure data sharing techniques.

Next Steps
Additional resources for continuing the learning journey.

Data Architect Associate Learning Pathway

● Continue your learning through self-paced or instructor-led offerings


● Further courses offer hands-on instruction in:
○ Governance and Security for Data + AI (coming soon)
○ Optimization and Best Practices (coming soon)
