CDMP Study Group
SESSION 15
September 2, 2020
Nupur Gandhi, DAMA New England, VP Online Services
Email: nupurgandhi@gmail.com
AGENDA
• Facilitator
• Introductory Note
• Chapter 14: Big Data and Data Science
• Overview
• Q&A
• Next Session
New England Data Management Community
Facilitator
Nupur Gandhi
The Hartford, Senior Consultant - Reference Data Management
DAMA NE Chapter, VP, Online Services
CONTACT INFO:
EMAIL: nupurgandhi@gmail.com
PHONE: 860-712-6097
LINKEDIN: /IN/nupur gandhi
INTRODUCTORY NOTE
This study group is offered as a service of DAMA New England for DAMA New England
members. It is not an official, DAMA International authorized training course because DAMA-I has
not yet created an authorized trainer program.
The purpose of this group is to help prepare members to take the CDMP. We will do so by
reviewing the content of chapters of the DMBOK2.
The chapter makes no claims for the effectiveness of the sessions or the ability of participants to
pass the CDMP exam after having attended. In fact, you should plan on doing a lot of individual
study to pass the exam.
On a Fun Note….
Big Data & Data Science Can Be Likened To Teenage Relationships
Everyone Talks About It,
Few Really Know How To Do It,
It’s Been The Source Of Many Rumors,
Everyone Thinks That Everyone Else Is Doing It,
So Everyone Claims They Are Doing It !
HOMEWORK – Big Data & Data Science
Is there a difference between Big Data and Data Science?
Big Data – Collection of Data
Data Science – Analysis of the big data / Applied Statistics
Big Data and Data Science have helped to generate, store and analyze larger amounts of data
Chapter 14: Big Data Introduction
Big Data and Data Science Vs BI/DW
Business Intelligence (BI): Rear view mirror reporting
Think analysis to describe past trends
Data Science: Forward looking (windshield) view of the organization
Think analysis to describe future trends
Data Processing - Traditional DW Vs Big Data
                         Traditional DW                      Big Data
How is data organized    Relational model                    Non-relational model
Concept                  ETL (Extract, Transform and Load)   ELT (Extract, Load and Transform)
Big Data Overview
Definition: The collection (Big Data) and analysis (Data Science, Analytics and Visualization) of many
different types of data to find answers and insights for questions that are not known at the start of the
analysis.
Goals:
1. Discover relationships between data and the business
2. Support the iterative integration of data source(s) from the enterprise
3. Discover and analyze new factors that might affect the business
4. Publish data using visualization techniques in an appropriate, trusted, and efficient manner
Big Data Overview
Biggest Business Driver for using Big Data:
Find and act on business opportunities that may be discovered through data sets generated through a diversified range of products
Real life examples using Data Science
Big Data Features
What are the 6 V’s in Big Data?
• Volume (Amount of Data)
• Velocity (Speed at which data is produced)
• Variety/Variability (Various Forms, formats, data structures)
• Viscosity (How difficult the data is to use or integrate)
• Volatility (How data changes occur and therefore how long the data is useful)
• Veracity (How trustworthy the data is)
Visualization of Big Data
Big Data Storage Challenges
Every 2 days we create as much data as we did from the beginning of time until 2003
Sources of Big Data:
• Social Sites Audio/Video
• Sensors
• Blogs
• Advertising Data
• Web Logs
• Phone
• POS devices
• Online Orders
• Online Video Games
Conceptual DW/BI and Big Data Architecture
Big Data vs Data Warehouse
1. Big Data: Big data is data of enormous volume on which technologies can be applied.
   Data Warehouse: Data warehouse is the collection of historical data from different operations in an enterprise.
2. Big Data: Big data is a technology to store and manage large amounts of data.
   Data Warehouse: Data warehouse is an architecture used to organize the data.
3. Big Data: It takes structured, non-structured or semi-structured data as an input.
   Data Warehouse: It only takes structured data as an input.
4. Big Data: Big data does processing by using a distributed file system.
   Data Warehouse: Data warehouse doesn’t use a distributed file system for processing.
5. Big Data: Big data doesn’t rely on SQL queries to fetch data from a database.
   Data Warehouse: In a data warehouse we use SQL queries to fetch data from relational databases.
6. Big Data: Apache Hadoop can be used to handle enormous amounts of data.
   Data Warehouse: Data warehouse cannot be used to handle enormous amounts of data.
7. Big Data: When new data is added, the changes in data are stored in the form of a file which is represented by a table.
   Data Warehouse: When new data is added, the changes in data do not directly impact the data warehouse.
8. Big Data: Big data doesn’t require efficient management techniques as compared to data warehouse.
   Data Warehouse: Data warehouse requires more efficient management techniques as the data is collected from different departments of the enterprise.
Analytics Progression
Conceptual DW/BI and Big Data Architecture
- Selection, installation and configuration of Big Data environment requires specialized expertise
- Develop and rationalize end to end architecture using:
• Data exploratory tools
• New acquisitions
In a Big Data environment, data is ingested and loaded before it is integrated (extract, LOAD, transform) Vs
In a data warehouse, data is integrated as it is brought into the warehouse (extract, TRANSFORM, load)
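The difference in ordering can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an example from the DMBOK2: the `transform` function, the in-memory `warehouse` and `lake` lists, and the sample records are all hypothetical stand-ins for real systems.

```python
# Hypothetical raw records from a source system (illustrative only).
raw_records = [
    {"customer": " alice ", "amount": "10.50"},
    {"customer": "BOB", "amount": "3"},
]


def transform(record):
    """Standardize one record: trim and title-case the name, parse the amount."""
    return {
        "customer": record["customer"].strip().title(),
        "amount": float(record["amount"]),
    }


def etl(source):
    """Data warehouse style: transform first, then load the integrated rows."""
    warehouse = []
    for record in source:                    # extract
        warehouse.append(transform(record))  # transform, then load
    return warehouse


def elt(source):
    """Big Data style: load raw data as-is, transform later when it is read."""
    lake = list(source)                   # extract and load untouched
    view = [transform(r) for r in lake]   # transform at analysis time
    return lake, view


warehouse = etl(raw_records)
lake, view = elt(raw_records)
```

Both paths end with the same cleaned rows. The difference is that the ELT path also keeps the raw records, so a different transformation can be applied to them later.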
Game: Match the following?
1. What is a Data Lake?
2. What is Machine Learning?
3. What is Sentiment Analysis?
4. What is Data Mining?
5. What is Text Mining?
6. What is Predictive Analytics?
7. What is Prescriptive Analytics?
a. Anticipates what will happen, when it will happen and implies why it will happen
b. Analysis that reveals patterns in data using various algorithms
c. Analyzes documents with text analysis and data mining technologies to classify content automatically
d. Development of probability models based on variables using historical data
e. Uses NLP (Natural Language Processing) to detect sentiment and reveal changes in sentiment to predict possible scenarios
f. Environment where a vast amount of data of various types and structures can be ingested, stored, assessed and analyzed
g. Explores construction and study of learning algorithms
Answers: 1f, 2g , 3e , 4b , 5c, 6d, 7a
Game: What do I stand for?
Services-based architecture (SBA) is a way to provide immediate data as well as update a historical data set
Speed layer is referred to as the ODS; all transactions are updates only if required
Speed Layer?
Batch Layer?
Serving Layer?
Every transaction is an insert
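A minimal sketch of how the three layers could fit together, assuming the usual description of this architecture: the batch layer holds the full history, the speed layer holds only data that has arrived since the last batch load, and the serving layer joins the two. The account events and all function names below are hypothetical.

```python
from collections import defaultdict

# Batch layer: complete history, appended to in periodic bulk loads.
batch_layer = [
    {"account": "A", "amount": 100},
    {"account": "B", "amount": 40},
]

# Speed layer: only events that arrived since the last batch load.
speed_layer = [
    {"account": "A", "amount": -30},
]


def serving_layer(batch, speed):
    """Join both layers into one up-to-date balance per account."""
    balances = defaultdict(int)
    for event in batch + speed:
        balances[event["account"]] += event["amount"]
    return dict(balances)


def run_batch_load(batch, speed):
    """Move the speed layer's events into the history and clear it."""
    batch.extend(speed)
    speed.clear()


before = serving_layer(batch_layer, speed_layer)
run_batch_load(batch_layer, speed_layer)
after = serving_layer(batch_layer, speed_layer)
```

The serving layer gives the same answer before and after the batch load, which is the point of the design: readers get immediate data without waiting for the historical set to be updated.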
Game: What do I stand for?
Machine learning explores the construction and study of learning algorithms.
These algorithms fall into 3 types…What are those?
Supervised Learning, Unsupervised Learning, Reinforcement Learning
Supervised Learning: Based on generalized rules, e.g. separating SPAM from non-SPAM email
Unsupervised Learning: Based on identifying hidden patterns (Data Mining)
Reinforcement Learning: Based on achieving a goal (Beating an opponent at chess)
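The spam example can be sketched as a toy supervised learner. This is a deliberately simple word-count scorer, assumed here only to show the idea of learning rules from labeled examples. It is not how a production spam filter works, and the training messages are made up.

```python
from collections import Counter

# Labeled training examples (hypothetical data).
training = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(examples):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts


def classify(counts, text):
    """Label a message by which class has seen its words more often.

    Ties (including messages with only unseen words) fall back to "ham".
    """
    words = text.split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"


model = train(training)
```

Calling `classify(model, "claim your free prize")` labels the message from what the model learned; the labels in `training` are what make this supervised.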
Big Data Process
1. Define Big Data Strategy & Business Need(s):
Define requirements that identify desired outcomes with
measurable tangible benefits
2. Choose Data Source(s):
Identify gaps in the current data asset and find data sources to fill
those gaps
3. Acquire & Ingest Data Source(s):
Obtain data sources and onboard them
4. Develop Data Science Hypothesis(es) & Methods:
Explore data sources, refine requirements, define model inputs,
types or model hypotheses
5. Integrate/Align Data For Analysis:
Model feasibility depends on the quality of the data source,
leverage trusted and credible sources
6. Explore Data Using Models:
Apply statistical analysis and Machine Learning algorithms against the
integrated data. Validate, train and over time, evolve the model
7. Deploy and Monitor:
Deploy models to production for ongoing monitoring of value and
effectiveness
Big Data Science Activities
1. Define Big Data Strategy & Business Need(s)
• “Define business requirements that identify desired outcomes with measurable tangible benefits”
• Start with a “Problem Statement” and/or a hypothesis: “Why are diaper sales down and in what markets or areas?”
2. Choose Data Source(s)
• “Identify gaps in the current data asset base and find data sources or sets to fill the gaps.”
• Determine What Data Would Be Needed To Understand Problem
External: US Census (Demographics), US Natality (birth rates), IRS, …
Internal: Sales (POS - RSI), Distribution (SAP), Trade (Promax), …
3. Acquire & Ingest Data Source(s)
• “Secure (purchase or obtain) data sets and onboard them into your environment”
4. Develop Data Science Hypothesis(es) & Methods
• “Explore data sources using profiling, visualization, mining, or other methods to understand this data and then refine
Theory(ies) and define model algorithm inputs, types, or test methods of analysis”
• Review the data to understand its value, composition and key relationships
“What data sets have what time intervals and how do they sync up together”
Big Data Science Activities
5. Integrate/Align Data For Analysis
• Model feasibility depends on the quality and matching of the source data sets: the higher quality/match of the data, the
more likely the model will succeed. Leveraging trusted and credible sources and applying appropriate data integration and
cleansing techniques can increase usefulness of data sets.
6. Explore Data Using Models
• Exploration involves applying statistical analysis (models) and machine learning (AI) algorithms against the integrated data.
The model is constructed, evaluated against training sets, and validated. Training the model is also critical to its effectiveness.
Training entails repeated runs of the model against actual data to verify assumptions and make adjustments, such as identifying
outliers and selecting different variables.
• Through this process, hypotheses will be refined. Initial feasibility / viability metrics can guide evolution of the model.
New hypotheses may be introduced that require additional data sets and results of this exploration will shape the future
modeling and outputs.
7. Deploy and Monitor
• Those models that produce useful information can be deployed to production for ongoing monitoring of value and
effectiveness. Oftentimes data science projects turn into data warehousing projects where more vigorous development
processes are put in place (ETL, DQ, Master Data, etc.).
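The train, validate and adjust loop described in step 6 can be sketched as follows. This is a minimal illustration that assumes a one-variable least-squares line as the model; the data points, the residual `limit`, and the function names are hypothetical.

```python
def fit_line(points):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_abs_error(model, points):
    """Average absolute gap between predictions and actual values."""
    intercept, slope = model
    return sum(abs(y - (intercept + slope * x)) for x, y in points) / len(points)


def drop_outliers(model, points, limit):
    """Keep only points whose residual against the model is within the limit."""
    intercept, slope = model
    return [(x, y) for x, y in points if abs(y - (intercept + slope * x)) <= limit]


# Hypothetical data: y is roughly 2x, with one bad reading at x = 3.
train_set = [(1, 2.0), (2, 4.0), (3, 30.0), (4, 8.0), (5, 10.0)]
validation_set = [(6, 12.0), (7, 14.0)]

# First run: train, then validate against held-out data.
first_model = fit_line(train_set)
first_error = mean_abs_error(first_model, validation_set)

# Adjustment: identify the outlier, retrain, and validate again.
cleaned = drop_outliers(first_model, train_set, limit=10.0)
second_model = fit_line(cleaned)
second_error = mean_abs_error(second_model, validation_set)
```

Comparing `first_error` with `second_error` is the kind of feasibility metric that can guide whether an adjustment to the model is worth keeping.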
Tools and Techniques
Tools
• MPP (Massively Parallel Processing): Data is partitioned across multiple processing servers
• Distributed File Based Databases
Eg: Open Source Hadoop
• In-database algorithms
Eg: K-means Clustering, Linear regression, Conjugate Gradient, Cohort Analysis
• Big data Cloud Solutions
• Statistical Computing and Graphical Languages
Eg: R
• Data Visualization Tools
Eg: Radar Charts, Coordinate plots, Tag Charts, Heat Map
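The MPP idea from the first bullet, partitioning data so each server works only on its own share, can be sketched in Python. This is a single-machine illustration under loud assumptions: threads stand in for separate processing servers, the modulo rule on a hypothetical `store_id` stands in for a real distribution key, and the sales rows are made up.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 3  # hypothetical number of processing servers


def partition(rows, num_nodes):
    """Assign each row to a node by its store id (simple modulo partitioning)."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row["store_id"] % num_nodes].append(row)
    return nodes


def local_total(rows):
    """Work done independently on one node: sum its own rows."""
    return sum(row["sales"] for row in rows)


# Hypothetical sales rows for stores 1 through 6.
sales = [{"store_id": i, "sales": 10 * i} for i in range(1, 7)]

nodes = partition(sales, NUM_NODES)

# Each "node" computes its partial result in parallel.
with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partial_totals = list(pool.map(local_total, nodes))

# A coordinator combines the partial results into the final answer.
grand_total = sum(partial_totals)
```

No node needs to see another node's rows to compute its partial total, which is what lets this style of processing scale out by adding servers.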
Techniques
Analytic Modelling
• Learn by example through training the model
• Types of analysis associated with Analytic models:
Descriptive Modelling
Explanatory Modelling
Big Data Modelling
• Technical challenge but critical if organization wants to describe and govern the data
• Data Modelling while accounting for the variety of sources
Techniques
Descriptive Modelling: Represents data structures in a compact manner
Does not validate a causal hypothesis or predict outcomes; uses algorithms to define or refine
relationships across variables
Explanatory Modelling: Application of statistical models to data for testing causal hypotheses about
theoretical constructs
Does not predict outcomes; matches model results only with existing data
Implementation Best Practices & Guidelines
Aligning the business to an implementation plan is the key to success
a. Strategy Alignment
• Strategically aligned with organizational objectives
b. Readiness Assessment / Risk Assessment
• Business Relevance (Align initiatives with company’s business)
• Business Readiness (Commitment to knowledge centre, Skillset Gap)
• Economic Viability (Ownership Costs, Benefits: Tangible and Intangible)
• Prototype (Prototype for a subset of the end user community for a finite timeframe)
c. Organization & Cultural Change
• New roles and responsibilities are required to implement
• These roles are in addition to the existing data management/BI roles
Big Data Platform Architect
Ingestion Architect
Metadata Specialist
Analytic Design Lead
Data Scientist
Big Data Governance
Big Data, like other data, requires governance.
Need to consider business and technical controls that address the questions below:
• Sourcing: What to source, when to source, what is the best source of data for a particular study.
• Sharing: What data sharing agreements and contracts to enter into, terms and conditions both inside and
outside the organization
• Metadata: What the data means on the source side, how to interpret the results on the output side
• Enrichment: Whether to enrich the data, how to enrich the data, and what the benefits will be to do so
• Access: What to publish and to whom, how and when
Think Visualization Management, Visualization Standards, Data Security, Metadata, Data Quality, Metrics
STUDY GROUP MATERIALS
Study group presentations will be posted on the CDMP Study Group page of the DAMA New England website, in the Schedule &
Agenda section.
Q&A