RESOURCE
COMPETITIVE POSITIONING
The Data Catalog Primer
More than just a data catalog
Learn how to make data discovery and access a breeze for your data team
End-to-end data workspace for modern data teams
Compiled with 💙 by AtlanHQ
Table of Contents
Chapter 1: The evolution of the data management ecosystem
We explore the different data management technologies and how they changed over the years, from data warehouses to the cloud and Hadoop.
Chapter 2: The problem with traditional data catalogs
We discover the challenges presented by traditional data catalogs, featuring some of the biggest pain points from Gartner’s Peer Review.
Chapter 3: The ideal data catalog
We understand what a modern data catalog should look like by exploring the foundations of an ideal data catalog.
Meet Atlan: An end-to-end data workspace for modern data teams
We take a sneak peek at Atlan’s futuristic data catalog platform for modern data teams.
Please note:
This report was first compiled in October 2019 and was updated in April 2020 by Atlan.
The text, images, or a combination of both, as described in this material, cannot be copied, modified, published or distributed without prior written permission
from Atlan (Peeply Technologies Pvt Ltd) and its respective authors.
The names, logos and brand marks of all data software, platform and tools other than Atlan’s which are mentioned in this report are the properties of their
respective owners. No copyright infringement is intended. Should there be any question or concern, you can write to hello@atlan.com
Chapter 01
The evolution of the data management ecosystem
The 1990s: The era of data warehousing, data integration and metadata management
2000-2015: The era of big data, cloud computing, data lakes and (traditional) data catalogs
We’ve come a long way from just wondering whether we have the technology to store large
amounts of data. The challenges we face today are slightly more complex and revolve around
effectively using the data we have.
The questions that keep CDOs (Chief Data Officers) and CDAOs (Chief Data and Analytics
Officers) up all night have more to do with:
1. Access to the right data
2. Understanding the context and meaning of data
3. Using this data for further analysis
4. Keeping all organizational data safe while complying with local, regional and federal
regulations around data privacy and management
However, before we start scouring the (virtual) world for solutions, let’s understand why we face these challenges now, starting with a quick (and nostalgic) stroll through the history of data management. 🚶
So, let’s get started, shall we?

Having trouble visualizing how much a petabyte is? This should help:

“Kilobytes were stored on floppy disks. Megabytes were stored on hard disks, while terabytes were stored in disk arrays. Petabytes are stored on the cloud.”
Chris Anderson, WIRED
The 1990s
The era of data warehousing
The birth of the internet and the rise of data warehouses
The 90s = the birth of the internet… and the search engine Google.
Guess what that meant? More data! Computer scientist Michael Lesk estimated that the
amount of data available on the internet was around 12,000 petabytes (1 PB = 1000 TB), with
the size expected to grow tenfold each year. 😱
With the increase in the amount and sources of data, the demand for data management also spiked, triggering the need for one solution to store, discover, analyze and use data for decision making, i.e., a single source of truth.
And that led to the rise of on-premise data warehouses.
P.S. Bill Inmon coined the term “data warehouse” in the 90s[1], and wrote a book about it in
1992—considered to be a fundamental source on warehousing even today.
What is a data warehouse? A data warehouse is a key component of BI (business intelligence). For the uninitiated, a data warehouse is a central repository for all structured data from multiple sources and/or transactional systems (aka CRMs like Salesforce or ERPs like SAP).

Through the 90s, companies worldwide had started adopting data warehouses. Many in the data universe considered warehouses the one-stop solution to data chaos.

“With data warehousing, data that had previously been spread across numerous sources could now be held together in one place. Moreover, these warehouses were specifically designed to support the analytical functions required for business intelligence.”
Furhaad Shah, DATACONOMY
[1] TDAN (The Data Administration Newsletter) on building data warehouses, published on May 29, 2007
Enter data integration and metadata management tools
Data warehouses ruled the early 90s as they supported complex queries for data analysis.
However, with changing data formats and sources, transforming data (i.e. ETL) became
complicated and time-consuming.
Suddenly, warehouses from the 90s weren’t fast enough to keep up with the changes. This led
to the rise of data integration tools that simplified and fast-tracked the ETL workflows.
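The ETL (extract, transform, load) workflow mentioned above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the data and functions are made up), not how any particular integration tool works:

```python
# Minimal ETL sketch: pull rows from a source, clean them up,
# then load them into a target store (here, just a list).
def extract():
    # Pretend these rows came from a CRM export
    return [{"name": " Ada Lovelace ", "revenue": "1200"},
            {"name": "Grace Hopper", "revenue": "3400"}]

def transform(rows):
    # Normalize formats and types so the warehouse stays consistent
    return [{"name": r["name"].strip(), "revenue": int(r["revenue"])}
            for r in rows]

def load(rows, warehouse):
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'name': 'Ada Lovelace', 'revenue': 1200}
```

Data integration tools essentially industrialize this loop: they handle the connectors, scheduling, and error handling so the transform step doesn’t have to be rebuilt for every new source.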
But that wasn’t the only problem. On-premise data warehouses lacked context. With more
data coming in every day, enterprises had tons of data, but no clue how to interpret or use it.
Enter metadata management tools that provided much-needed context and meaning to data.
Lastly, on-premise data warehouses were expensive to build and maintain. Since they were built for peak usage capacity, it was extremely difficult to predict and estimate future capacity needs as the amount of information kept growing exponentially.
And that brings us to the next step of evolution in the data ecosystem.
What is metadata? Metadata is information that describes your data
and provides useful context. Think about your favorite song. That’s
data. The name of the song, genre, and singer is the metadata. In
other words, metadata acts like an explainer for your data.
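The song analogy translates directly into code. A toy Python illustration (the song and fields are made up):

```python
# The audio bytes are the data; the descriptive fields around them
# are the metadata that gives the data context.
song_data = b"ID3..."  # raw audio bytes, truncated for illustration

song_metadata = {
    "title": "Imagine",
    "artist": "John Lennon",
    "genre": "Rock",
    "duration_seconds": 183,
}

def describe(metadata):
    # Metadata alone is enough to explain what the data is
    return f'{metadata["title"]} by {metadata["artist"]} ({metadata["genre"]})'

print(describe(song_metadata))  # Imagine by John Lennon (Rock)
```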
2000-2015
The era of big data and the cloud
Hadoop—the Swiss army knife of the 21st century
Two technological disruptions were happening in the 2000s.
Firstly, with the rise of web 2.0, the volume of data available started increasing exponentially,
ushering in the era of big data.
BTW, Web 2.0 (popularized by O’Reilly Media in 2005[2]) is nothing but user-generated content—the internet as we know it today (in 2020).
Think videos, audio, images, location data, social media interactions. Most of this data is unstructured.
Guess what else is mostly unstructured? Big data.
Processing all that big data to extract meaningful insights was proving to be a major headache,
and that fueled the need for big data technologies like Hadoop.
First released in 2005[3], Hadoop quickly gained popularity and by 2013[4], almost half of Fortune
50 companies had adopted Hadoop for processing big data.
Now let’s look at the other technological disruption.
What is Hadoop? A set of open source programs and processes that act as the backbone of your data operations. It has four major components—HDFS (a distributed file system), MapReduce, Hadoop Common and YARN (Yet Another Resource Negotiator). The Apache Software Foundation is responsible for maintaining Hadoop.

“Originating with technologies developed by Yahoo, Google, and other Web 2.0 pioneers in the mid-2000s, Hadoop is now central to the big data strategies of enterprises, service providers, and other organizations.”
James Kobielus, FORRESTER RESEARCH
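To make the MapReduce component concrete, here is the classic word-count pattern simulated in plain Python. This is a conceptual sketch only; real Hadoop jobs are written against the MapReduce API and run distributed across a cluster:

```python
from collections import defaultdict

# Map phase: each document is split into (word, 1) pairs.
def map_phase(documents):
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

# Shuffle + reduce phase: counts for the same word are summed.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big insights", "big data everywhere"]
print(reduce_phase(map_phase(docs)))
# {'big': 3, 'data': 2, 'insights': 1, 'everywhere': 1}
```

The point of the pattern is that map and reduce are independent per key, so Hadoop can spread the same logic across thousands of machines and petabytes of files.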
[2] O’Reilly Media on Web 2.0, published on September 30, 2005
[3] Bernard Marr & Co. on Hadoop
[4] PR Newswire on Altior’s AltraSTAR, published on December 18, 2012
Cloud computing and warehousing
The 2000s was also the era of the cloud, popularized by the launch of Amazon S3 and EC2 in 2006[5] and Windows Azure in 2010[6].
In 2011, Google[7] and IBM[8] threw their hats into the ring and with that, the cloud had officially arrived.
Remember the on-premise data warehouses being hard to maintain and expensive to scale?
In 2012[9], AWS introduced Redshift—a low-cost cloud data warehouse that’s easy to deploy
and scale.
While this solved some of the problems, it wasn’t enough.
Warehouses like Redshift were limited to their service providers (Redshift was limited to AWS
whereas BigQuery was limited to Google Cloud), making it challenging to find an alternative.
Also, compute and storage were interdependent, making it impossible to shut down compute
without affecting storage.
Lastly, even cloud data warehouses didn’t support unstructured data. So, still no single source of truth. 🤷

And that brings us to data lakes.

“We did the math and found that it costs between $19,000 and $25,000 per TB per year, at list prices, to build and run a good-sized data warehouse on your own. Amazon Redshift will cost you less than $1,000 per TB per year.”
Jeff Barr, AWS EVANGELIST
[5] AWS on offering cloud computing services to businesses
[6] Microsoft on Windows Azure Availability, published on February 1, 2010
[7] Google Code on Google Cloud SQL, published on October 6, 2011
[8] Cloudpro on IBM Cloud, published on July 29, 2011
[9] Information Week on Amazon Redshift, published on November 28, 2012
Making sense of the unstructured with data lakes
Coined by James Dixon in 2010[10], data lakes seemed to be the one-stop solution for all big
data management problems. At least on paper.
In reality? Not so much. What enterprises ended up with was less of a lake and more of a
swamp (a data dump).
Beyond the complexities of big data architecture, these were the top three challenges:
1. Finding actionable data is frustrating
2. Tracing data lineage can be elusive
3. Implementing data governance is painful
Bonus issue: Data lakes still don’t solve the single source of truth conundrum.
What is a data lake? A data lake stores a collection of various raw data sets from multiple internal and external data sources. The data in a data lake can be unstructured, semi-structured or structured. We’re talking messy data from audio files, emails, photos or satellite imagery to more neat and clean data like phone numbers, customer names, addresses and zip codes.

“Originally, most companies I talked to thought that they would have one huge, on-premises data lake that would contain all their data. As their understanding evolved, most enterprises realized that a single go-to point was not ideal. Between data sovereignty regulations and organizational pressures, multiple data lakes typically proved to be a better solution.”
Alex Gorelik, AUTHOR, THE ENTERPRISE BIG DATA LAKE
[10] James Dixon’s blog on data lakes, published on October 14, 2010
Enter the data catalog
Remember the reason why data warehouses came into the picture (way back in the 90s)?
Because enterprises needed one solution to store, discover, analyze and use data for decision
making, i.e., a single source of truth.
With cloud data warehouses, data lakes and big data technologies, data infrastructure had gotten extremely complex. To make things easier, several companies came up with numerous tools and technologies, which only added to the complexity.
And despite all these advances in infrastructure, enterprises still found it difficult to find the
right data. Still no single source of truth containing all enterprise data along with metadata
information and context.
That’s where data catalogs come to the rescue by making data discovery easy across
the data ecosystem.
And with that, we end our history lesson. So far so good, yeah?
What is a data catalog? A data catalog is a library or inventory of all your data assets—a place where all your data is neatly indexed, organized and kept ready for use.

“Fifty-four million data workers worldwide spend 44% of their workday on unsuccessful data activities. Searching for and preparing data are the most common activities of the data worker role at 15% and 33% respectively. On average, they use four to seven different tools to perform data activities, adding to the complexity of the data and analytics process.”
The State of Data Science and Analytics Report[11]

[11] The State of Data Science and Analytics by IDC
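To make the idea concrete, here is a toy, in-memory sketch of a catalog index in Python (the assets and fields are hypothetical, not any vendor’s actual schema):

```python
# A toy "catalog": each data asset is indexed with its metadata so
# anyone can search for data without knowing where it physically lives.
catalog = [
    {"name": "orders", "source": "Redshift", "owner": "analytics",
     "description": "One row per customer order", "tags": ["sales", "core"]},
    {"name": "clickstream_raw", "source": "S3 data lake", "owner": "engineering",
     "description": "Raw web events, semi-structured JSON", "tags": ["events"]},
]

def search(catalog, keyword):
    """Return assets whose name, description, or tags mention the keyword."""
    kw = keyword.lower()
    return [a for a in catalog
            if kw in a["name"].lower()
            or kw in a["description"].lower()
            or any(kw in t for t in a["tags"])]

print([a["name"] for a in search(catalog, "sales")])  # ['orders']
```

A real catalog does the same thing at enterprise scale: it crawls warehouses, lakes, and BI tools, keeps the index fresh automatically, and layers on context like lineage and ownership.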
Chapter 02
The problem with
traditional data catalogs
4 critical shortcomings of traditional data catalogs
Biggest pain points from Gartner’s Peer Review on existing
data catalogs
4 reasons why data teams need a modern data catalog
How do traditional data catalogs fall short?
While data infrastructure has evolved, data management hasn’t.
On paper, traditional data catalogs are supposed to help enterprises make sense of their data—where it came from, what purpose it serves and how it is being used.
In reality, traditional data catalogs fall short as they aren’t built for the new world of data.
Here are the top 4 shortcomings.
1. Not built for the cloud
2. Built for IT, not business
3. Need extensive support and maintenance
4. Opaque pricing, not pay-as-you-go
The biggest pain points from Gartner’s Peer Review on existing catalogs
Don't just take our word for it. We sourced some of the biggest pain points that humans of data experienced while using
traditional data catalogs. It’s important to remember that the world we live in today is vastly different from that of the 90s.
Cloud Native: “Limited support for cloud based storage/warehouse.” “Monolithic architecture is difficult to work with, and does not scale well for enterprises or cloud deployments.”

Ease of Setup: “Difficult to set up. (For data ingestion, need to create Airflow DAGs or run Python scripts. Templates available.) A data engineering team required to set up and maintain.” “Implementation can be a challenge unless you have a good partner.”

Ease of Use: “The catalog platform is built with a technical user in mind. Experience for non-technical folks can be challenging. Need better UI elements for non-technical personas.”

Scalability & Big Data: “Data catalog is built to scale (being cloud native). For AWS, it has documentation for cluster deployment.” “Data catalog works well for static data, but the current data movement and architecture around data in motion (streaming) is not supported, nor are Hive structures. (Coming soon, but behind.)”

Maintenance & Support: “Make sure you have an engineer team (2-3) prepared to support and upgrade the catalog platform.” “I need to hire Java developers to customize anything.”
Why do we need a modern data catalog?
We live in a new world of data with:
1. Cloud proliferation
2. A thriving open-source ecosystem (e.g., deequ)
3. Diverse data consumers: analysts, scientists, engineers, business users, ML researchers
4. A rapidly innovating ecosystem
The need of the hour is a modern data catalog for this new, increasingly cloud-first world.
Let's take a look.
Enter the era of the modern data catalog
What is a modern data catalog?
What makes a data catalog modern?
One fundamental characteristic—empowering non-technical or business users to understand,
interpret and use data for data and analytics initiatives.
Sounds utopian, doesn’t it?
So long as we’re indulging ourselves, let’s take things slightly further and think of the ideal
data catalog.
What would that look like?
A data catalog creates and maintains an inventory of data assets through the
discovery, description and organization of distributed datasets. The data catalog
provides context to enable data stewards, data/business analysts, data engineers,
data scientists and other line of business (LOB) data consumers to find and
understand relevant datasets for the purpose of extracting business value.
Chapter 03
The ideal data catalog
The foundations of the concept of an ideal data catalog
6 factors that make modern data catalogs the way forward
for data teams of 2020 and beyond
Meet Atlan: The first data catalog built for the future
The foundation for an ideal data catalog
As a concept, the ideal data catalog for our modern world should be built on four foundational pillars.

1. Agility (DISCOVER DATA): Everyone should be able to discover the data they need in seconds
2. Knowledge (UNDERSTAND DATA): Everyone should be able to understand data with all its context
3. Trust (TRUST DATA): Everyone should be able to trust that the data is the right data for their use case
4. Collaboration (DRIVE VALUE FROM DATA): Everyone should be able to use the data they need in the environment they are most comfortable in
Ideal modern data catalogs: The way forward
To support our current data ecosystem and empower non-technical data consumers, an ideal data catalog should have six key attributes.

01. Cloud-first: Deploy on your cloud VPC. Completely self-service, no engineering support needed.
02. Built on open-source, open by default
03. Plug-and-play with your favorite data tools: Integrates directly into your existing AD/permissions and data infrastructure
04. Built for business, not just IT
05. No training or support overhead: “...designing the interface and user experience of a data tool should not be an afterthought!”
06. Pay-as-you-go pricing
Meet Atlan: The first data catalog built for the future
1. Cloud-native data catalog
2. 24 hours to get up and running
3. Democratization for business
4. Governance for IT
THANK YOU FOR READING THE
Data Catalog Primer
Check out our other resources for data teams
WEBINAR SERIES: How are top data teams making the move to remote? Sign Up →
EBOOK: The ultimate guide on implementing agile for data teams. Download →
Compiled with ❤ by Atlan
Trusted by data teams around the world
We are proud to be supported by
Rajan Anandan, Former MD, Google India
Manoj Menon, Partner & MD, Frost & Sullivan (APAC)
Ratan Tata, Chairman Emeritus, Tata Sons