White Paper
Why Semantics Matter in the Modern Data Stack
The term “Modern Data Stack” describes architectures that address the challenge of
connecting cloud-managed data to the business users who derive value from it. This article
explores the role of the semantic layer in the modern data stack.
When applied correctly, a semantic layer forms a new center of knowledge gravity that
maintains the business context and semantic meaning necessary for users to work with and
create value from enterprise data assets. Further, it becomes a hub for leveraging active and
passive metadata to optimize analytics experiences, improve productivity, and manage cloud
costs.
What is the Semantic Layer?
A semantic layer is “a business representation of data” that offers “a unified and
consolidated view of data across an organization.”
The term was originally coined in the age of on-premise data stores — a time when business
analytics infrastructure was costly and highly limited in functionality. While the semantic layer’s
origins lie in the days of OLAP, the concept is even more relevant today.
By formalizing a semantic layer within a modern, cloud-oriented data stack, organizations can
provide business users with more meaningful analytics experiences. This kicks off a virtuous
cycle of data democratization, domain-oriented analytics innovation, and data-driven value
creation.
I like the discussion from the team at Hevo that tracks the evolution of the data stack from the
pre-cloud era dominated by on-premise infrastructure and OLAP-style data management; to
the age of proto-cloud architectures with the launch of Amazon Redshift; and finally into the
modern era dominated by cloud data platforms from the likes of Snowflake, Databricks, and
Google BigQuery.
Shift in Data Gravity
In my opinion, Matt Bornstein, Jennifer Li, and Martin Casado from Andreessen Horowitz offer
the cleanest view of modern data stacks in “Emerging Architectures for Modern Data
Infrastructure.” This representation carries the bias of the A16Z investment thesis, but it is a
good model to work from. I will refer to this simplified diagram (with example companies
removed) based on their work below:
[Diagram: a simplified modern data stack based on the A16Z model. Data flows from sources (OLTP databases, operational apps) through data replication via CDC into data warehouse, data lake, and Spark platforms, then through data modeling out to dashboards and augmented analytics.]
This representation tracks the flow of data from left to right. Raw data from various sources
moves through ingestion and transport services into core data platforms that manage storage,
query and processing, and transformation prior to being consumed by users in a variety of
analysis and output modalities.
We see the differentiation between unstructured storage (data lakes), structured storage (data
warehouses), and real-time data stores. In addition to storage, the data platforms offer SQL
query engines and access to AI/ML utilities. A set of shared services cuts across the entire
data processing flow at the bottom of the diagram.
I’ll try to refrain from incorporating my own bias in analyzing this representation and will
instead focus on placing the semantic layer within this view of the modern data stack.
Historically, semantic layers were implemented within analysis tools (i.e. BI platforms) or within
the data warehouse. With the rise of the modern data stack and the importance of data
engineering, we are now seeing semantic layers form within ELT pipelines. All three of these
approaches have limitations that are exacerbated by modern cloud-scale data.
Where is the Semantic Layer?
Data Pipelines: Hard-coded into ELT transformations, which can be difficult to govern and keep consistent across disparate use cases.
Analytics Tools: Results in siloed semantic layers with inconsistencies across different use cases or work groups using different analytics consumption tools.
This is a challenge as most modern organizations will use more than one analytics tool.
Dashboards are best delivered in a BI tool like Power BI or Tableau. Financial analysis is best
done in a spreadsheet like Excel. Business process support is best done with analytics
embedded in applications. Data science is best done from Jupyter notebooks.
Semantic sprawl happens when different teams manage specialized semantic layers in each
tool. Just as human languages diverge when speakers are geographically isolated, the
definitions and meanings of key business data concepts diverge when teams are isolated.
Challenges with Data Warehouse-Based
Semantic Layers
Data warehouses are designed for architectural integrity with normalized tables that are
difficult for business users to analyze directly. Data needs to be made “business-ready” before
it can be directly analyzed by a business user. Data marts are an attempt to create centralized,
business-oriented views of data.
Centrally controlled definitions that are “hard coded” into table structures become static. It’s
difficult for centralized architecture teams to keep up with domain-specific needs of different
workgroups. Furthermore, user queries against massive cloud-scale tables become slow even
for the most powerful cloud query engines.
This almost always results in users extracting data into analytics platforms for easier
manipulation and faster query performance. That in turn leads to the semantic sprawl of
localized semantic layer formation, as described above.
ELT pipelines, meanwhile, include transforms that create “analytics-ready” forms of data. If there is
no formal semantic layer strategy or proper governance, data engineers will encode semantic
meaning within their pipelines in order to support their analytics customers. This can result in
semantic sprawl and create extreme inefficiency, as data engineers recreate common
business concepts (e.g. month to fiscal quarter mapping) every time they design a new
pipeline.
The Universal Semantic Layer
I use the term “universal semantic layer” to describe a thin, logical layer sitting between the
data platform and analysis and output services. It abstracts the complexity of raw data assets
so that users can work with business-oriented metrics and analysis frameworks within their
preferred analytics tools.
The challenge here is how to assemble the minimum viable set of capabilities that gives data
teams sufficient control and governance while delivering end-users more benefit than they
could get by extracting data into localized tools.
Implementing a Universal Semantic Layer using Transformation Services
The modern semantic layer needs to be implemented by leveraging the services positioned
within the Transformation category of the A16Z data stack — within the Metrics Layer, Data
Modeling, Workflow Management, and Entitlements & Security services. When implemented,
coordinated, and orchestrated properly, these services form a universal semantic layer that
delivers important benefits, including:
- Creating a single source of truth for enterprise metrics and hierarchical dimensions, accessible from any analytics tool
- Providing the agility to easily update or define new metrics, design domain-specific views of data, and incorporate new raw data assets
- Optimizing analytics performance while monitoring and controlling cloud resource consumption
- Enforcing governance policies related to access control, definitions, performance, and resource consumption
[Diagram: the universal semantic layer (metrics layer, data modeling, workflow manager, and entitlements & security services) sits between cloud data platforms and analytics consumers.]
The key to success is providing all of these benefits so individual users and work groups can
be free to innovate and deliver value while using a centrally-governed semantic layer. If
attempts to centrally manage a semantic layer inhibit business users, semantic sprawl will
happen with absolute certainty.
Let’s step through each transformation service with an eye toward how they must interact to
form an effective semantic layer.
Data Modeling
Data modeling is the creation of business-oriented, logical data concepts that are directly
mapped to the physical data structures in the warehouse or lakehouse. Data modeling
services can be based on no-code or low-code visual frameworks oriented toward business
users, or on code-based markup languages oriented toward developers or analytics engineers.
Regardless of their preferred modeling paradigm, data modelers focus on three key activities:
1. Making Data Analytics-Ready: Preparing data for analytics use cases requires de-
normalizing and blending of raw data assets to create views of data appropriate for
analytics interaction. Whether the modeling will result in a physical materialization of a new
data view (i.e. through an ELT process) or a virtualized view (i.e. through a SQL-based
query) may influence the best modeling approach.
2. Dimension Definition: Defining the hierarchical dimensions (e.g. time, geography, product)
that give metrics their analytical context and enable consistent drill-down and roll-up
across business views.
3. Metrics Design: The most tangible output of data modeling is the set of metrics that are
published to metrics layers for discovery and use by data consumers. Metrics may be
simple quantitative measures like revenue or cost. They may be calculations like gross
margin (e.g. (revenue - cost) / revenue). They may be ordinal (e.g. lowest, highest, median).
They may be time-relative calculations (e.g. period-to-period change).
Metrics design is typically the most frequently iterated activity within data modeling and
benefits from an agile, end-user oriented option for designing and maintaining metrics
definitions. That said, it is also critical to ensure governance and prevent “metrics sprawl,”
where multiple definitions of an important metric like revenue cause confusion and erode
trust in analytics. Metrics design may sometimes be thought of as a metrics layer activity
vs. a data modeling activity.
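To make this concrete, here is a minimal sketch of how metrics like these might be declared against physical warehouse columns. All names here (SemanticMetric, fact_sales, dim_date) are hypothetical illustrations, not any particular product's modeling language:

```python
# A minimal sketch of declaring semantic model metrics in code. Names are
# hypothetical; real semantic layers use their own modeling languages.
from dataclasses import dataclass

@dataclass
class SemanticMetric:
    name: str         # business-facing name published to the metrics layer
    expression: str   # logical expression over physical columns
    description: str  # passive metadata that supports discoverability

# Simple quantitative measures map directly to physical columns.
revenue = SemanticMetric("revenue", "SUM(fact_sales.revenue)", "Total booked revenue")
cost = SemanticMetric("cost", "SUM(fact_sales.cost)", "Total cost of goods sold")

# Calculated metrics are defined once, centrally, to avoid metrics sprawl.
gross_margin = SemanticMetric(
    "gross_margin",
    "(SUM(fact_sales.revenue) - SUM(fact_sales.cost)) / SUM(fact_sales.revenue)",
    "Gross margin as a fraction of revenue",
)

# Time-relative metrics reference a dimension rather than a raw column.
revenue_pop = SemanticMetric(
    "revenue_period_over_period",
    "revenue - LAG(revenue) OVER (ORDER BY dim_date.fiscal_quarter)",
    "Change in revenue versus the prior fiscal quarter",
)

for m in (revenue, cost, gross_margin, revenue_pop):
    print(f"{m.name}: {m.expression}")
```

Defining the gross margin calculation once in the model, rather than in every dashboard, is what prevents the metrics sprawl described above.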
Metrics design and dimension definition are where business semantics are embedded into the
naming and descriptions of a data model. But beyond naming, data modeling services in the
modern data stack must actually implement the model elements for use within a metrics layer
— not just define and communicate relationships in an entity relationship diagram.
Within this discussion of data modeling services, it is worth considering the advantages of
taking a composable analytics approach. Ideally, elements of data models can be created and
managed in a way that enables shareability and reuse. For instance, a single, curated product
dimension could be shared across different data models supporting different workgroups. This
approach simplifies new model creation and change management.
Incorporating composability into a semantic layer strategy can be an enabler for supporting
data mesh or hub and spoke analytics management. My AtScale colleague, Elif Tutuk, wrote an
excellent blog series on how a semantic layer can support data mesh. In these analytics
management paradigms, key elements of data models and definitions are centrally managed in
a way that allows decentralized creation of data products by domain-specific teams. This can
be a powerful approach for fostering data product innovation while ensuring consistency and
governance.
I refer to the output of semantic layer data modeling as a semantic model. In this context, a
semantic model is a logical representation of enterprise data with business context embedded
in data views exposed to data consumers. The term semantic model is also sometimes used to
describe knowledge graph representations of enterprise data that draw from research related
to the Semantic Web. While sometimes confusing, these two definitions of semantic model are
related, but distinct.
It could be argued that metrics design and change management are metrics layer services
instead of (or in addition to) data modeling services, as I have treated them above. But since I
am positioning all of these services within a superset of semantic layer services, the distinction
doesn't really matter.
Metrics Layer
Metrics stores are essentially identical to the feature stores, like Feast, used by data science
teams. In simplistic implementations, they are pre-calculated, shared repositories for key
business metrics (e.g. revenue or ship quantity). In richer implementations, metrics are
dynamically calculated and served on demand.
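As an illustration of the two implementation styles just described, here is a minimal sketch of a metrics store that serves pre-calculated values where they exist and falls back to on-demand calculation. The MetricsStore class and its methods are hypothetical:

```python
# A minimal sketch contrasting pre-calculated lookup with on-demand
# calculation in a metrics store. All names are hypothetical.
from typing import Callable, Dict, Tuple

class MetricsStore:
    def __init__(self):
        self._precalculated: Dict[Tuple[str, str], float] = {}
        self._calculators: Dict[str, Callable[[str], float]] = {}

    def publish(self, metric: str, period: str, value: float) -> None:
        """Simplistic implementation: store a pre-calculated value."""
        self._precalculated[(metric, period)] = value

    def register(self, metric: str, calculator: Callable[[str], float]) -> None:
        """Richer implementation: register a function that computes on demand."""
        self._calculators[metric] = calculator

    def get(self, metric: str, period: str) -> float:
        if (metric, period) in self._precalculated:
            return self._precalculated[(metric, period)]
        return self._calculators[metric](period)  # dynamic, served on demand

store = MetricsStore()
store.publish("revenue", "2023-Q1", 1_250_000.0)
store.register("ship_quantity", lambda period: 42_000.0)  # stand-in for a live query
print(store.get("revenue", "2023-Q1"))
print(store.get("ship_quantity", "2023-Q2"))
```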
The term “Headless BI” is sometimes used to describe a metrics layer service that supports
user queries from a variety of BI tools. This is a fundamental capability: as mentioned
earlier, if users are unable to interact with a semantic layer using their analytics tools, they will
end up extracting data directly using SQL and recreating a localized semantic layer.
Key metrics layer activities include:
1. Curation: Metrics stewards will move between data modeling and the metrics layer to
curate the set of metrics provided to data product creators and business users.
2. Change Management: The metrics layer serves as an abstraction layer that shields data
consumers from the complexity of raw data. As a metric's definition changes, existing
reports or dashboards are automatically updated. Metrics lineage may be directly managed
in the metrics layer or integrated with a data catalog service.
3. Discoverability: Data product creators need to easily find and implement the proper
metrics for their purpose. This becomes more important as the list of curated metrics
grows to include a broader set of calculated or time relative metrics.
Metrics layer stewards need to invest time in creating definitional metadata used to
support discoverability. An interesting area of research is in AI-assisted discoverability
using both passive metadata (e.g. descriptions of metrics) and active metadata (e.g. how
often a given metric is used).
4. Serving: Metrics layers are queried directly from analytics and output tools. As end users
request a metrics cut from a dashboard, the metrics layer needs to serve the request fast
enough to support a positive analytics user experience.
As described earlier, poor performance will result in data extracts and semantic sprawl.
Performance management strategies that support metrics serving are discussed in the
next section, but may directly relate to metrics layer implementation as well. Again, this is
an argument for semantic layer thinking rather than a set of independent transformation
services.
Workflow Management
To Materialize or Virtualize: That is the question. As noted in the data modeling discussion,
transformation of raw data into an analytics-ready state can be based on physically
materialized transforms, virtual views based on SQL, or some combination. Workflow
management is the orchestration and automation of the physical and virtual transforms that
support semantic layer function. Decisions on what to materialize and what to virtualize should
be based on a cost-performance optimization.
Performance: Analytics consumers have a very low tolerance for query latency. A universal
semantic layer cannot introduce a query performance penalty; otherwise, clever end users or
work groups will again go down the data-extract-and-semantic-sprawl route. The raw size of
modern cloud scale data means even the most powerful query engines are not able to
consistently deliver “speed of thought” analytics without some level of physical materialization
of aggregates.
Legacy OLAP approaches take this to the extreme by materializing specialized cube data
structures that deliver high performance but do not scale beyond a few terabytes of raw data.
Effective performance management workflows automate and orchestrate materialization as
well as decide what and when to materialize. This functionality needs to be dynamic and
adaptive based on user query behavior, query runtimes, and other active metadata.
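A minimal sketch of this kind of adaptive decision, assuming hypothetical active-metadata inputs (query frequency, observed runtime, aggregate rebuild cost) and an illustrative latency target:

```python
# A minimal sketch of a materialize-vs-virtualize decision driven by active
# metadata. Thresholds and structures are hypothetical, not a production policy.
from dataclasses import dataclass

@dataclass
class QueryStats:
    view_name: str
    queries_per_day: float    # how often users hit this view
    avg_runtime_secs: float   # observed latency against raw tables
    rebuild_cost_secs: float  # compute cost to refresh an aggregate

def should_materialize(stats: QueryStats, latency_slo_secs: float = 2.0) -> bool:
    # Virtualize when raw-table queries already meet the latency target.
    if stats.avg_runtime_secs <= latency_slo_secs:
        return False
    # Materialize when the daily compute saved by serving an aggregate
    # exceeds the daily compute spent rebuilding it.
    compute_saved = stats.queries_per_day * stats.avg_runtime_secs
    return compute_saved > stats.rebuild_cost_secs

hot_view = QueryStats("sales_by_region_quarter", 500, 12.0, 600.0)
cold_view = QueryStats("returns_by_sku_daily", 3, 12.0, 600.0)
print(should_materialize(hot_view))   # True: heavy use justifies an aggregate
print(should_materialize(cold_view))  # False: cheaper to query raw tables
```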
Cost: While there are labor costs related to analytics pipeline management to take into
account, the primary cost tradeoff for performance is related to cloud resource
consumption. Physical transformations executed in the data platform (i.e. ELT transforms)
consume compute cycles and cost money. Query volume (i.e. queried terabytes per month)
consumes compute cycles and costs money. The emerging enterprise discipline of FinOps is
focused on managing cloud costs. Implementing a FinOps program for data and analytics
requires data that is best collected from the semantic layer.
Workflow management within the semantic layer supports FinOps goals while ensuring proper
performance to support user experience. This becomes an interesting optimization problem
that needs to be managed for each data product and use case. The emerging discussion of
data contracts focuses on scaling the interaction between data products and infrastructure
services by automating the definition of this optimization problem.
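As a sketch of the idea, a data contract might declare the latency and cost targets that parameterize this optimization for each data product. The field names below are hypothetical, not a standardized contract format:

```python
# A minimal sketch of a data contract: a data product declares performance
# and cost targets that workflow management uses to tune materialization.
from dataclasses import dataclass

@dataclass
class DataContract:
    data_product: str
    max_query_latency_secs: float   # user-experience target
    max_monthly_compute_usd: float  # FinOps budget for this product

exec_dashboard = DataContract("executive_revenue_dashboard", 2.0, 5_000.0)
batch_report = DataContract("monthly_ops_report", 600.0, 250.0)

for c in (exec_dashboard, batch_report):
    print(f"{c.data_product}: <= {c.max_query_latency_secs}s latency, "
          f"<= ${c.max_monthly_compute_usd}/month compute")
```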
Entitlements and Security
Entitlements and security relate to the active application of data governance policies to
analytics. Beyond cataloging data governance policies, the modern data stack must enforce
policies at query time as metrics are accessed by different users and as models are created
and modified.
Management and enforcement need to be tightly integrated within the semantic layer service,
so that policies can be actively applied at query time as unique users make requests from
different data products. Many different types of entitlements may be managed and enforced
alongside (or embedded in) a semantic layer.
Access Control: Proper access control services ensure users can access all of the data they
are entitled to see, and only that data. Lack of an effective integrated service will result in access
loopholes or overly conservative governance policies (e.g. no updates to certain reports
during revenue reporting periods).
Model and Metrics Consistency: Maintaining semantic layer integrity requires some level of
centralized governance of how metrics are defined, shared, and used. Conflicts arise when
users or work groups share different definitions of a key metric (e.g. revenue) without proper
documentation. Inconsistent results rapidly erode trust in centralized analytics resources,
which leads to semantic layer breakdown.
Performance and Resource Consumption: As discussed above, there are constant tradeoffs
being made on performance and resource consumption. User entitlements and use case
priority may also factor into the optimization that needs to happen when allocating resources
to a given user query. For instance, a real-time interactive dashboard supporting executives
may be entitled to more resources than a monthly batch report update.
While the full set of entitlements and security services, and their integration with enterprise
access control utilities, reaches beyond the scope of an analytics semantic layer, real-time
enforcement of governance policies is critical for maintaining semantic layer integrity.
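To illustrate query-time enforcement, here is a minimal sketch in which the semantic layer injects row-level entitlement filters into the SQL it generates. The entitlement table and policy logic are hypothetical simplifications:

```python
# A minimal sketch of query-time entitlement enforcement: row-level filters
# are injected into generated SQL based on the requesting user. Hypothetical.
USER_REGION_ENTITLEMENTS = {
    "analyst_emea": ["EMEA"],
    "cfo": ["EMEA", "AMER", "APAC"],  # entitled to see all regions
}

def apply_row_policy(sql: str, user: str) -> str:
    regions = USER_REGION_ENTITLEMENTS.get(user, [])
    if not regions:
        raise PermissionError(f"{user} has no entitlements for this model")
    region_list = ", ".join(f"'{r}'" for r in regions)
    # Wrap the model-generated query so only entitled rows are returned.
    return f"SELECT * FROM ({sql}) AS q WHERE q.region IN ({region_list})"

base_sql = "SELECT region, SUM(revenue) AS revenue FROM fact_sales GROUP BY region"
print(apply_row_policy(base_sql, "analyst_emea"))
```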
Integrating the Semantic Layer within the Modern Data Stack
Layers in the modern data stack must seamlessly integrate with other surrounding layers. The
semantic layer requires deep integration with its data fabric neighbors, including the data
platform, analysis and output, and the metadata and services layers.
A functioning semantic layer relies on workflow management services tightly integrated with
the cloud data platforms it orchestrates. It's important to consider how
a general service can support the wide variety of data platform architectures, including data
warehouses (cloud and on-premise), data lakehouse platforms, and data virtualization
platforms. A coordinated set of semantic layer services needs to integrate with the data
platform in a few important ways:
Query Engine Orchestration: The semantic layer dynamically translates incoming queries
from query consumers (which refer to metrics layer logical constructs) to platform-specific
SQL (auto-generated to reflect the logical to physical mapping defined in the semantic model).
The query needs to be optimized for each data platform's idiosyncrasies, including
understanding nested data structures and partitioning schemes (a sketch of this translation
follows these integration points).
Write-back Orchestration: There may be use cases where user or AI/ML interaction with the
semantic layer creates new data (or metadata), in the form of features or predicted metrics,
that is best managed within the data platform. This requires the capability for semantic layers
to orchestrate data write-backs to data platforms.
User Defined Functions (UDF): Modern cloud data platforms offer libraries of functions that
can be leveraged by analysis and output utilities. The semantic layer may also leverage these
functions.
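To make query engine orchestration concrete, here is the sketch referenced above: a logical metrics request is translated into platform-specific SQL using the logical-to-physical mapping in the semantic model. The model structure and dialect rule are hypothetical simplifications:

```python
# A minimal sketch of compiling a logical metrics request into platform-
# specific SQL from a semantic model's logical-to-physical mapping.
SEMANTIC_MODEL = {
    "metrics": {"revenue": "SUM(f.revenue_usd)"},
    "dimensions": {"region": "d.region_name"},
    "from_clause": "fact_sales f JOIN dim_geography d ON f.geo_id = d.geo_id",
}

def compile_query(metric: str, dimension: str, dialect: str) -> str:
    m = SEMANTIC_MODEL["metrics"][metric]
    dim = SEMANTIC_MODEL["dimensions"][dimension]
    sql = (f"SELECT {dim} AS {dimension}, {m} AS {metric} "
           f"FROM {SEMANTIC_MODEL['from_clause']} GROUP BY {dim}")
    # Each target platform gets dialect-specific adjustments.
    if dialect == "bigquery":
        sql = sql.replace("fact_sales", "`analytics.fact_sales`")
    return sql

print(compile_query("revenue", "region", "bigquery"))
```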
On the consumption side, a semantic layer must be capable of the following query consumer integrations:
Inbound Query Protocol Support: A semantic layer must support multiple inbound query
protocols, including (but not exclusively) SQL, MDX, DAX, Python, and RESTful interfaces using
standard protocols such as ODBC, JDBC, HTTP(S), and XMLA (two such inbound paths are
sketched after this list).
Live Query Connection: A semantic layer must support live connections to data and avoid
data extracts or external caching layers while providing “speed of thought” query
performance.
Persona Support: A semantic layer must address the needs of multiple end-user personas
(like business analysts, data scientists, or application developers). Failure to support the
needs of a persona risks creation of localized semantic layers. For instance, if it is difficult to
create time-relative metrics in the metrics layer to support time series analysis, data scientists
will extract data and create a localized semantic layer.
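As a sketch of the protocol flexibility described in the first point above, the same logical question can arrive as a REST call or as SQL over ODBC/JDBC. The endpoint path, payload shape, and virtual table name below are hypothetical:

```python
# A minimal sketch of two inbound query paths to the same logical model.
# Endpoint path, payload shape, and virtual table name are hypothetical.
import json
import urllib.request

def query_via_rest(base_url: str) -> bytes:
    # RESTful access over HTTP(S): the metric is referenced by logical name.
    payload = json.dumps({"metric": "revenue", "group_by": ["region"]}).encode()
    req = urllib.request.Request(
        f"{base_url}/metrics/query", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a live endpoint
        return resp.read()

# SQL access over ODBC/JDBC: the same logical names appear as a virtual
# schema, so any SQL-speaking BI tool can query the model without extracts.
SQL_FORM = "SELECT region, SUM(revenue) AS revenue FROM sales_model GROUP BY region"

print(SQL_FORM)  # the REST path above needs a running semantic layer service
```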
Metadata and Support Services
While a semantic layer is indeed metadata rich, it’s not the exclusive repository of business
metadata. Many tools in a data fabric ecosystem consume and generate metadata, so it’s
critical that a semantic layer supports the following integrations with the metadata and
support services layer(s):
- A semantic layer must be capable of sharing its metadata and lineage with enterprise data cataloging tools to support the search and discovery of metrics and data models by data catalog consumers (a sketch of such an export follows this list).
- A semantic layer must be capable of importing metadata from other tools to automate the creation of semantic data models while driving consistency and conformance with enterprise standards.
- A semantic layer must expose monitoring endpoints to help manage user access, uptime, and system performance.
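As a sketch of the catalog integration in the first point, semantic model metadata and lineage might be serialized into a neutral JSON document for a catalog to ingest. The schema shown is hypothetical:

```python
# A minimal sketch of exporting semantic model metadata and lineage for a
# data catalog. The schema is hypothetical, not any catalog's real format.
import json

catalog_export = {
    "model": "sales_model",
    "metrics": [
        {
            "name": "gross_margin",
            "description": "Gross margin as a fraction of revenue",
            # Lineage back to physical columns supports search and trust.
            "lineage": ["fact_sales.revenue", "fact_sales.cost"],
        }
    ],
}

print(json.dumps(catalog_export, indent=2))
```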
Beyond Descriptive Analytics
Leading data organizations emphasize the role of augmented analytics that go beyond classic
descriptive analytics to include diagnostic, predictive, and prescriptive analytics powered by
artificial intelligence.
A key driver for embracing the modern data stack is access to Data Science/Machine Learning
(DS/ML) platforms included in the Query and Processing service category. It is far more
efficient to leverage powerful AI algorithms if you do not need to move or re-model large
volumes of data.
There are a few important considerations for how a semantic layer strategy can support
accessibility and value creation from the full spectrum of augmented analytics:
- The Metrics Store is a Feature Store: This point was noted in our discussion of metrics layer services. Data scientists should leverage the business-vetted features defined and actively curated in metrics layers.
- Natural Language Query: “Alexa, what was our sales revenue in Massachusetts last quarter?” will only return the right results if Alexa has a clear understanding of the data’s semantic constructs: the right revenue metric, the right geographical dimension, and the right time dimension (see the sketch after this list).
- Publishing Model-Generated Insights: Production AI/ML models generate new data points (e.g. predictions, features) that need to be exposed to users in order to create value. A semantic layer can leverage existing analytics and output infrastructure to more easily disseminate augmented analytics.
- Explainable AI / Trusted AI: The semantic layer can be leveraged to organize and disseminate information related to why an AI model is providing a particular answer. For instance, business users can gain value from knowing not only the prediction for sales next quarter, but also the key drivers for this prediction. Delivering better insight on the reasoning behind AI/ML model suggestions directly supports explainability and enhances the level of trust in model-generated insights.
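The natural language query sketch referenced above: the question only resolves correctly when each phrase maps to the right semantic construct. The phrase-to-construct mapping below is a hypothetical stand-in for real natural language understanding:

```python
# A minimal sketch of why NL query depends on the semantic layer: each phrase
# must resolve to the right metric and dimensions. Mapping is hypothetical.
PHRASE_MAP = {
    "sales revenue": ("metric", "revenue"),
    "massachusetts": ("geography", "state = 'MA'"),
    "last quarter": ("time", "fiscal_quarter = PRIOR"),
}

def resolve(question: str) -> dict:
    constructs = {}
    for phrase, (kind, construct) in PHRASE_MAP.items():
        if phrase in question.lower():
            constructs[kind] = construct
    return constructs

print(resolve("Alexa, what was our sales revenue in Massachusetts last quarter?"))
# -> {'metric': 'revenue', 'geography': "state = 'MA'", 'time': 'fiscal_quarter = PRIOR'}
```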
The Semantic Layer as the Center of Data Knowledge Gravity
While the center of data gravity has clearly shifted to cloud data platforms, business
knowledge about the significance of data is still sprinkled across data fabrics in the form of
metadata. There is a vibrant market for metadata management and support services for
cataloging data assets, monitoring data quality, and tracking data usage.
The semantic layer has an advantageous position of seeing a large portion of active and
passive metadata created for analytics use cases. This creates an opportunity for forward-
thinking organizations to better manage knowledge gravity while using this rich set of
metadata to improve analytics experiences and drive incremental value.
We have already discussed a few examples of how the semantic layer becomes the center of
mass for knowledge gravity. Business context, definitions, and documentation for appropriate
use of metrics and analysis dimensions get encoded within the views of analytics exposed to
business users. FinOps efforts can use data on query patterns and efficiency to better
manage cloud resources and set policy.
One exciting area of research is in active metadata on query patterns, which can suggest the
types of questions the business asks and identify analytics best practices (e.g. what KPIs are
accessed most often by the most successful sales leaders).
[Sidebar panels: Managing Business Context with Passive Metadata; Augmenting Analytics Experience]
A growing set of vendors (including AtScale) is offering semantic layer platforms that
package two or more of these services with supported integrations. Building your own services
or managing your own integrations carries labor and overhead costs. It’s also more prone to
catastrophic disconnects.
The big three cloud service providers (Google, AWS, and Azure) offer a huge array of services
but do not currently offer an integrated set of semantic layer services. While this will likely
change, cloud lock-in concerns will continue to argue for some level of vendor abstraction
from analytics and output tools.
It will be interesting to see how the modern data stack evolves. Will transformation services
(comprising the semantic layer) be provided by the major cloud service providers, or will open,
vendor-neutral platforms fill this role? Regardless of the answer, it's clear that the semantic
layer matters in the modern data stack.

Dave is founder and Chief Technology Officer of AtScale.