ToolKit 1 | Unit 1 | Introduction to Data Analytics
What is Data?
Data is a formalized representation of facts, concepts, or instructions that is
suitable for transmission, interpretation, or processing by a human or an
electronic system. Data is represented by characters such as alphabets
(A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
A collection of information obtained through observation, measurement,
study, or analysis is also referred to as data.
Types of Data Classification
Data can essentially be classified into four types, namely:
1 Geographical Data 2 Chronological Data
3 Quantitative Data 4 Qualitative Data
Phases of Data Processing Cycle
1 Collection 2 Preparation 3 Input
4 Processing 5 Output 6 Storage
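To make the cycle concrete, here is a minimal Python sketch that walks a toy set of readings through all six phases; the values and the output file name are invented for illustration.

```python
import csv
import statistics

# Collection: gather raw observations (a hard-coded stand-in here).
raw = ["12", "15", " 9", "n/a", "21"]

# Preparation: clean and validate the collected values.
prepared = [v.strip() for v in raw if v.strip().isdigit()]

# Input: convert the cleaned values into a machine-usable form.
values = [int(v) for v in prepared]

# Processing: derive something useful from the data.
mean_value = statistics.mean(values)

# Output: present the result.
print(f"mean of {len(values)} readings: {mean_value:.2f}")

# Storage: persist the result for later use (file name is hypothetical).
with open("readings_summary.csv", "w", newline="") as f:
    csv.writer(f).writerows([["n", "mean"], [len(values), mean_value]])
```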
Data Analytics
Data analytics is the practice of studying raw data in order to draw
conclusions from it. Data analytics is critical because it allows firms to improve
their performance. Companies that incorporate it into their business models can
cut costs by developing more efficient methods of doing business and by
storing massive volumes of data more effectively.
Data Analysis Steps
Step 1 Establish your aim
Step 2 Gather the data
Step 3 Organize the data for analysis
Step 4 Analyze the data
Step 5 Create a model or representation
Step 6 Validate the results
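A minimal sketch of the six steps on synthetic data, assuming only NumPy is available; the ad-spend/sales scenario is invented.

```python
import numpy as np

# Step 1: establish the aim -- quantify the ad-spend/sales relationship.
rng = np.random.default_rng(42)

# Step 2: gather data (synthetic stand-in for real observations).
ad_spend = rng.uniform(1, 10, 120)
sales = 3.0 * ad_spend + rng.normal(0, 1, 120)

# Step 3: arrange the data -- sort it and split off a holdout portion.
order = np.argsort(ad_spend)
ad_spend, sales = ad_spend[order], sales[order]
fit_idx = np.arange(len(sales)) % 4 != 0   # three quarters for fitting
hold_idx = ~fit_idx                        # one quarter held out

# Step 4: analyze the data -- basic summary.
print("correlation:", np.corrcoef(ad_spend, sales)[0, 1].round(3))

# Step 5: create a model -- fit a straight line.
slope, intercept = np.polyfit(ad_spend[fit_idx], sales[fit_idx], 1)

# Step 6: validation -- check the error on the held-out portion.
pred = slope * ad_spend[hold_idx] + intercept
print("holdout RMSE:", np.sqrt(np.mean((sales[hold_idx] - pred) ** 2)).round(3))
```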
Components of Data Analytics
1 Roadmap and operating model 2 Data Acquisition
3 Data Security 4 Data Governance and Standards
5 Insights and analysis 6 Data Storage
7 Data Visualization 8 Data Optimisation
Data Analytics Life Cycle
It involves six phases, namely:
1 Discovery
2 Data Prep
3 Plan Model
4 Build Model
5 Communicate Results/Publish Insights
6 Measure Effectiveness
4 types of Data Analytics
Descriptive Analytics
Descriptive analytics answers the question of what is happening in the
business: it transforms raw information from numerous data sources into
meaningful insight into the past.
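As a small illustration, descriptive analytics often begins with summary statistics over historical records; a minimal pandas sketch, with an invented sales table:

```python
import pandas as pd

# Hypothetical historical sales records pulled from several sources.
sales = pd.DataFrame({
    "region":  ["North", "South", "North", "South", "East"],
    "revenue": [120.0, 95.5, 140.2, 88.0, 102.3],
})

# "What happened?" -- summarize the past.
print(sales["revenue"].describe())               # count, mean, std, quartiles
print(sales.groupby("region")["revenue"].sum())  # revenue by region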
Diagnostic Analytics
At this stage, historical data can be compared against other data to answer
the question of why something happened. Diagnostic analytics provides
in-depth insight into a specific issue.
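One simple diagnostic move is to classify historical figures against each other and inspect the differences; a minimal pandas sketch with invented monthly figures:

```python
import pandas as pd

# Hypothetical monthly revenue, used to ask "why did February dip?"
df = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "region":  ["North", "South", "North", "South"],
    "revenue": [140.0, 120.0, 135.0, 60.0],
})

# Classify historical figures against each other: pivot by month and region.
pivot = df.pivot_table(index="month", columns="region",
                       values="revenue", aggfunc="sum")
pivot = pivot.reindex(["Jan", "Feb"])  # keep chronological order

# The per-region change isolates where the drop came from (South, here).
print(pivot.diff())
```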
Predictive Analytics
As the name suggests, predictive analytics is concerned with the future: it
tells you what is likely to happen. It uses the findings of descriptive and
diagnostic analytics to identify clusters and exceptions and to predict future
trends, which makes it a valuable tool for forecasting.
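A minimal forecasting sketch, assuming scikit-learn is installed; the demand history is invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly demand history (units sold in months 1..8).
months = np.arange(1, 9).reshape(-1, 1)
demand = np.array([100, 104, 110, 113, 120, 125, 129, 136])

# Learn the trend from the past...
model = LinearRegression().fit(months, demand)

# ...and predict what is going to happen in months 9 and 10.
print(model.predict(np.array([[9], [10]])).round(1))
```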
Prescriptive Analytics
The purpose of prescriptive analytics is to prescribe what action to take to
eliminate a future problem or to take full advantage of a promising trend.
Prescriptive analytics uses advanced tools and technologies, such as machine
learning, business rules, and algorithms, which makes it complex to
implement and manage.
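Prescriptive logic can be as simple as business rules layered on top of a forecast; a toy sketch in which the thresholds and actions are invented:

```python
# A toy prescriptive rule layered on top of a demand forecast: the output
# is a recommended action, not just a number. Thresholds are invented.
def recommend(forecast_demand: float, stock_on_hand: float) -> str:
    gap = forecast_demand - stock_on_hand
    if gap > 50:
        return "place a rush order"            # exploit the trend early
    if gap > 0:
        return "schedule a standard reorder"
    return "hold: current stock covers forecast demand"

print(recommend(forecast_demand=140.0, stock_on_hand=70.0))
```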
PwC’s Global Data and
Analytics Survey 2016
Over 250 executives in the UK were surveyed on what they expected to be making
major decisions about before 2020. The most likely proactive decisions concern
developing or launching new products or services (25% envisage having to do this);
investment in IT (20%); and entering new markets with existing products (18%).
Executives in the UK are motivated by market leadership and the need to survive.
Data Collection
Data collection is the procedure of gathering, measuring, and analyzing accurate
insights for research using standard validated techniques.
The most important goal of data collection is to gather information-rich,
accurate data for statistical analysis, so that data-driven research decisions
can be made.
Data Collection Methods
1 Primary
This is original, first-hand data collected by the researchers themselves.
Primary data is highly accurate, provided the researcher collects the
information personally.
2 Secondary
Secondary data is second-hand data collected by other parties
and already having undergone statistical analysis. This data is
either information that the researcher has tasked other people to
collect or information the researcher has looked up.
Methods of Primary Data Collection
1 Direct personal interviews
2 Indirect Oral Interviews
3 Information from correspondents
4 Mailed questionnaire method
5 Schedules sent through Enumerators
Sources of Secondary Data
1 Published Sources 2 Unpublished Sources
Data Collection Tools
1 Interviews 2 Questionnaires 3 Case Studies
4 Checklists 5 Surveys 6 Observations
7 Documents and records 8 Focus groups 9 Oral histories
Factors to be considered before
choosing a Data Collection tool
Variable type: Consider the type of information you want to collect, your
research specialty, and the overall goals of the study.
Study design: Choose the method you'll use to gather the data.
Data collection technique: Determine which strategies and technologies you
prefer for data collection.
Sample data: Decide where you want to collect data and how to sample it. This
refers to the sampled population: determine which segments of the population
will be included in your inquiry.
Sample size: Consider the number of subjects you wish to include in your study.
Sample design: Also think about how you will choose the sample.
Time factor: When selecting a data-gathering technique, the availability
of time must also be considered.
Availability of funds: The availability of funds for the research topic dictates,
to a considerable extent, the approach to be employed for data collection.
Nature, scope and object of enquiry: This is the most essential aspect
influencing technique selection. The approach used should be appropriate for
the sort of investigation the researcher intends to carry out.
Precision required: Another key issue to consider when deciding on a data
gathering strategy is the precision required.
How to deliver value with analytics?
• Enable self-service analytics
• Provide specific goals and their related KPIs to help teams
measure success
• Democratize advanced analysis with intuitive AI
• Support development of data literacy or confidence when working
with data
• Identify subject matter experts in each department
The Data and Analytics Framework
A framework matrix is a table of rows and columns that summarizes and
analyzes qualitative data. It supports both cross-case and theme-based data
sorting. Individual instances are typically organized by row, while themes to
which the data has been coded constitute the matrix's columns. The source
material relating to the intersecting case and theme is described in each
intersecting cell.
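A framework matrix can be mocked up as a pivot over coded excerpts; in this minimal pandas sketch, the cases, themes, and quotes are all invented:

```python
import pandas as pd

# Coded qualitative excerpts: one row per (case, theme) observation.
coded = pd.DataFrame({
    "case":  ["Interview 1", "Interview 1", "Interview 2", "Interview 2"],
    "theme": ["Cost", "Trust", "Cost", "Usability"],
    "note":  ["worried about fees", "trusts the brand",
              "prefers a cheaper plan", "found the app confusing"],
})

# Cases as rows, themes as columns, summarized source material per cell.
matrix = coded.pivot_table(
    index="case", columns="theme", values="note",
    aggfunc=lambda notes: "; ".join(notes),
)
print(matrix)
```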
Aspects of Framework
1 Discovery 2 Insights
3 Actions 4 Outcomes
6 layers in Data and Analytics Framework
1 Use Cases 2 Datasets 3 Data Collection
4 Data Preparation 5 Intelligent Learning 6 Actions
Techniques of Framework
The big data analytics framework is primarily based on two fundamental
frameworks, namely:
1 SQL frameworks 2 NoSQL frameworks
Data analytics frameworks and tools widely used around the world include:
• Apache Cassandra
• Knime
• Datawrapper
• Lumify
• Apache Storm
• Rapidminer
• Flink
Big Data
Big data is, as the term implies, a "large" quantity of data. It refers to a data
collection that is both huge in volume and complex. Traditional data
processing software cannot manage big data due to its vast volume and
increased complexity. Big data simply refers to datasets that contain a
significant quantity of varied data, both structured and unstructured.
5 Vs of Big Data
Volume: Volume refers to the huge amount of data.
Velocity: Velocity refers to the high speed of accumulation of data. In Big
Data, velocity data flows in from sources like machines, networks, social
media, mobile phones, etc.
Variety: It refers to the nature of the data: structured, semi-structured,
and unstructured. It also refers to data arriving from heterogeneous sources.
Value: Bulk data with no value is of no good to the company unless it is
turned into something useful.
Veracity: It refers to inconsistencies and uncertainty in data; available
data can sometimes get messy, and quality and accuracy are difficult to
control.
Application of Big Data in Real World
1 Customer Experience 2 Machine Learning 3 Demand Forecasting
Big Data Storage
Big data storage is a storage system that is especially built to store,
handle, and retrieve huge volumes of data, often known as big data. Big
data storage allows for the storing and sorting of large amounts of data
so that it may be quickly accessible, consumed, and processed by big
data applications and services.
Big data storage is a compute-and-storage architecture that allows you
to collect and manage massive datasets as well as execute real-time data
analytics. The results of these analyses can then be utilized to produce
intelligence from metadata.
Types of Big Data
1 Structured 2 Unstructured 3 Semi-Structured
Big Data Life-cycle
There are 9 phases involved in the Big Data Life Cycle. They are as
follows:
• Business Case/Problem Definition
• Data Identification
• Data Acquisition and filtration
• Data Extraction
• Data Munging (Validation and Cleaning)
• Data Aggregation & Representation (Storage)
• Exploratory Data Analysis
• Data Visualization (Preparation for Modeling and Assessment)
• Utilization of analysis results.
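The middle phases (munging, aggregation, and a first exploratory look) in miniature, using pandas on an invented event log:

```python
import pandas as pd

# A hypothetical raw event log entering the pipeline.
raw = pd.DataFrame({
    "user":  ["a", "b", None, "a", "b"],
    "bytes": ["512", "1024", "128", "bad", "2048"],
})

# Data munging: validate and clean (drop incomplete rows, coerce types).
clean = raw.dropna(subset=["user"]).copy()
clean["bytes"] = pd.to_numeric(clean["bytes"], errors="coerce")
clean = clean.dropna(subset=["bytes"])

# Aggregation & representation: summarize per user for storage.
per_user = clean.groupby("user")["bytes"].agg(["count", "sum"])

# Exploratory analysis: a quick look at the aggregate.
print(per_user)
```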
Big Data Tools
Big Data requires a set of tools and techniques for analysis to gain
insights from it.
There are a number of big data tools available in the market: Hadoop helps in
storing and processing large data, Storm helps in faster processing of
unbounded data, Apache Cassandra provides high availability and scalability
of a database, and so on; every big data tool has its own function.
1 Hadoop 2 Atlas.ti 3 HPCC
4 Storm 5 Cassandra 6 Stats iQ
7 CouchDB 8 RapidMiner
Data Warehouse
A data warehouse is an analytics-focused type of data management system
designed to support and facilitate business intelligence (BI) operations.
Data warehouses are used solely to conduct queries and analyses on vast
amounts of historical data. Data for a data warehouse is frequently drawn
from a variety of sources, such as transactional applications and
application log files.
Advantages of Data Warehouse
• Provides quick access to crucial data from numerous sources
• Gives consistent information on a variety of cross-functional
operations; ad hoc reporting and querying are also possible
• Helps to integrate a number of data sources to decrease workload on
production system
• Reduces the amount of time it takes for analysis and reporting to
be completed
• Enables access to crucial data from several sources in one place, so the
user saves time while gathering data from various sources.
Drawbacks of Data Warehouse
• Ineffective at handling unstructured data
• Building and implementing a data warehouse takes time.
• Possibility of getting outdated quickly
• Challenging to make changes to data types, ranges, data source
structure, indexes, and searches.
Data Warehouse Components
1 ETL- Extract/Transform/Load:
A variety of tasks are performed by ETL such as:
• Logical data conversion
• Domain verification
• Converting from one DBMS to another
• Generating default values when required
• Summarizing the data
• Adding time values to the data key
• Restructuring the data key
• Integrating records
• Getting rid of extraneous or duplicate data.
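A few of these tasks in a minimal pandas sketch; the table, the allowed status domain, and the rules are invented:

```python
import pandas as pd

# Hypothetical extracted records on their way into the warehouse.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "status":   ["shipped", "SHIPPED", "SHIPPED", "unknown!"],
    "amount":   [250.0, None, None, 99.0],
})

# Logical data conversion: normalize codes to one representation.
orders["status"] = orders["status"].str.lower()

# Domain verification: keep only values from the allowed domain.
orders = orders[orders["status"].isin(["shipped", "pending", "cancelled"])].copy()

# Default value generation, where required.
orders["amount"] = orders["amount"].fillna(0.0)

# Getting rid of duplicate records.
orders = orders.drop_duplicates()

# Adding a time value to the data key (a load timestamp).
orders["load_ts"] = pd.Timestamp.now(tz="UTC")

# Summarizing the data for the warehouse layer.
print(orders.groupby("status")["amount"].sum())
```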
2 ODS- Operational Data Store
In the ODS, online updates of integrated data are carried out with OLTP
(Online Transaction Processing) response times. The ODS is a hybrid
environment in which application data is placed into an integrated format
(often via ETL). Once data is placed in the ODS, it can be used for
high-performance processing, including update processing.
3 Data Mart
A data mart is designed around a single department-wide set of expectations
for how data should appear and is typically arranged by department; finance,
for example, has its own data mart. Compared to the data warehouse, each data
mart typically contains much less data. Additionally, data marts frequently
include a sizable amount of summarized and aggregated data.
4 Exploration Warehouse
End users who wish to undertake discovery processing go to the exploration
warehouse, where much of the statistical analysis is performed.
Approaches to building a Warehouse
Inmon’s Approach
Bill Inmon developed this technique for building a data warehouse. The
starting point for this strategy is a corporate data model that takes into
account the key subject areas, such as customers, products, and vendors.
This model is used to produce a thorough logical model for each significant
process, and a physical model is then created from these detailed models.
The normalized nature of this approach reduces data redundancy.
Kimball’s Approach
This approach to designing a data warehouse was introduced by Ralph
Kimball. The first step in this strategy is recognizing the business
processes and the questions that the data warehouse must answer. These data
sets are then carefully evaluated and documented.
Steps to build a warehouse
1 To extract the (transactional) data from different data sources
2 To transform the transactional data
3 To load the (transformed) data into the dimensional database
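The three steps in miniature, with Python's built-in sqlite3 standing in for both the source and the dimensional target; the schema and data are invented:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# A stand-in transactional source table.
db.execute("CREATE TABLE src_orders (id INTEGER, customer TEXT, amount REAL)")
db.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
               [(1, "acme", 250.0), (2, "acme", 99.0), (3, "globex", 10.0)])

# The dimensional target: one dimension table plus one fact table.
db.execute("CREATE TABLE dim_customer "
           "(customer_key INTEGER PRIMARY KEY, name TEXT UNIQUE)")
db.execute("CREATE TABLE fact_sales (customer_key INTEGER, amount REAL)")

# Step 1: extract the transactional rows.
rows = db.execute("SELECT customer, amount FROM src_orders").fetchall()

for name, amount in rows:
    # Step 2: transform -- resolve each customer to a surrogate key.
    db.execute("INSERT OR IGNORE INTO dim_customer (name) VALUES (?)", (name,))
    key = db.execute("SELECT customer_key FROM dim_customer WHERE name = ?",
                     (name,)).fetchone()[0]
    # Step 3: load into the dimensional fact table.
    db.execute("INSERT INTO fact_sales VALUES (?, ?)", (key, amount))

print(db.execute("""SELECT name, SUM(amount) FROM fact_sales
                    JOIN dim_customer USING (customer_key)
                    GROUP BY name""").fetchall())
```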
Data warehouse can be mapped into
different types of architecture as follows:
Shared memory architecture: The standard method for putting an RDBMS on
SMP hardware is to implement it in shared-memory or shared-everything form.
The main benefit of this method is that a single RDBMS server can likely
access all memory, all CPUs, and the whole database, giving the client a
consistent single system image.
Shared disk architecture: The idea of shared ownership of the complete
database between RDBMS servers, each of which is executing on a node of a
distributed memory system, is implemented via shared-disk architecture. Each
RDBMS server can access the same shared database to read, write, update,
and delete data, necessitating the implementation of a distributed lock
manager (DLM).
Shared nothing architecture: Shared-nothing systems are often loosely
coupled. In shared-nothing systems, only one CPU is connected to a given
disk. Access to any tables or databases stored on that disk depends
entirely on the CPU that owns it.