Introduction to Data Analytics: Types of Data Sources, Sampling, Types of Data Elements, Types of Data Analysis: descriptive, predictive, diagnostic, exploratory, survival and social network. The Phases of Data Analysis, Data Analytics Methodologies and Workflows. Data Quality, Software and Privacy.
Module 1: Introduction to Data Analytics
• Understanding Data Analytics
• Importance and Applications of Data Analytics
• Overview of Types of Data Sources
• Basics of Sampling Techniques

Types of Data Elements
• Categorical Data
• Numerical Data
• Time-Series Data
• Spatial Data

Types of Data Analysis
• Descriptive Analysis
• Predictive Analysis
• Diagnostic Analysis
• Exploratory Analysis
• Survival Analysis
• Social Network Analysis

Phases of Data Analysis
• Data Collection
• Data Preprocessing
• Data Analysis
• Interpretation and Visualization
• Decision Making

Data Analytics Methodologies and Workflows
• Traditional vs. Agile Approaches
• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• KDD (Knowledge Discovery in Databases)
• Other Methodologies and Frameworks

Module 6: Data Quality and Software
• Understanding Data Quality
• Data Cleaning and Preprocessing Techniques
• Introduction to Data Quality Tools
• Popular Data Analytics Software

Module 7: Privacy in Data Analytics
• Importance of Privacy
• Legal and Ethical Considerations
• Privacy-Preserving Techniques
• Privacy Regulations and Compliance
Introduction to Data Analytics
Understanding Data Analytics
Importance and Applications of Data
Analytics
Overview of Types of Data Sources
Basics of Sampling Techniques
Understanding Data Analytics
• Definition: Data analytics refers to the process of examining data sets to draw
conclusions about the information they contain.
• Purpose: It aims to uncover valuable insights, trends, and patterns that can
inform decision-making and drive business strategies.
• Components: Data analytics involves various techniques such as data mining,
statistical analysis, machine learning, and predictive modeling.
• Role: It plays a crucial role in various fields including business, healthcare, finance, marketing, and more, by enabling organizations to make data-driven decisions.
Importance and Applications of Data Analytics:
• Importance:
• Enables organizations to gain competitive advantages by leveraging data for strategic decision-making.
• Facilitates proactive problem-solving and innovation.
• Helps in understanding customer behavior and preferences, leading to better products and services.
• Applications:
• Business intelligence: Analyzing sales data, market trends, and customer feedback to optimize business operations.
• Healthcare: Predictive analytics for disease diagnosis, patient care optimization, and drug development.
• Finance: Risk assessment, fraud detection, and investment analysis.
• Marketing: Customer segmentation, campaign optimization, and personalized marketing strategies.
Overview of Types of Data Sources:
• Internal Data Sources: Data generated within an organization, including transactional data, customer databases, and operational records.
• External Data Sources: Data acquired from external sources such as government databases, third-party vendors, social media platforms, and public datasets.
• Structured Data: Data organized in a predefined format, typically stored in databases or spreadsheets.
• Unstructured Data: Data that lacks a predefined structure, including text documents, images, videos, and social media posts.
• Streaming Data: Continuous data generated in real time from sources like sensors, IoT devices, and web applications.
Basics of Sampling Techniques:
• Definition: Sampling involves selecting a subset of data from a larger population for analysis,
aiming to draw inferences about the population.
Types:
• Simple Random Sampling: Each member of the population has an equal chance of being selected.
• Stratified Sampling: The population is divided into homogeneous subgroups (strata), and samples are drawn from each subgroup.
• Cluster Sampling: The population is divided into clusters, and random clusters are selected for sampling.
• Systematic Sampling: Selecting every nth element from the population after a random start.
• Sampling Bias: Potential biases in sampling methods that can lead to inaccurate conclusions.
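As a minimal illustration of these techniques, the sketch below draws simple random, systematic, and stratified samples from a hypothetical pandas DataFrame; the column names (customer_id, region) are assumptions made for the example, not part of the original material.

```python
import pandas as pd

# Hypothetical population: 1,000 customers with a 'region' attribute (assumed columns).
population = pd.DataFrame({
    "customer_id": range(1000),
    "region": ["North", "South", "East", "West"] * 250,
})

# Simple random sampling: every customer has an equal chance of selection.
simple_random = population.sample(n=100, random_state=42)

# Systematic sampling: every nth record after a random start.
n = len(population) // 100
start = 3  # random start between 0 and n - 1
systematic = population.iloc[start::n]

# Stratified sampling: draw proportionally from each region (stratum).
stratified = population.groupby("region").sample(frac=0.1, random_state=42)

print(len(simple_random), len(systematic), len(stratified))
```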
Types of Data Elements
i. Categorical Data
ii. Numerical Data
iii. Time-Series Data
iv. Spatial Data
Categorical Data:
Definition:
• Categorical data represents characteristics or attributes and is typically divided into categories.
Examples:
• Gender (male/female), marital status (single/married/divorced), product categories (electronics/clothing/books), etc.
Measurement:
• Categorical data is qualitative and cannot be measured on a numerical scale.
Analysis:
• Analyzing categorical data involves frequency counts, proportions, and visualizations such as bar charts and pie charts.
Challenges:
• Limited in statistical analysis; may require encoding for machine learning algorithms.
Numerical Data:
Definition:
• Numerical data represents measurable quantities and is expressed on a numerical scale.
Types:
• Discrete Numerical Data: Countable and finite values, typically whole numbers (e.g., number of employees, number of cars).
• Continuous Numerical Data: Infinite number of possible values within a range (e.g., height, weight, temperature).
Measurement:
• Numerical data can be measured and subjected to mathematical operations such as addition, subtraction, multiplication, and division.
Analysis:
• Statistical techniques such as mean, median, mode, standard deviation, correlation, and regression are used for analysis.
Visualizations:
• Histograms, box plots, scatter plots, and line graphs are common visualization methods for numerical data.
Time-Series Data:
Definition:
• Time-series data consists of observations recorded at successive time intervals.
• Examples: Stock prices, weather data, sales figures, sensor readings, etc.
Components:
• Time Stamps: Each observation is associated with a specific time or date.
• Data Points: Numeric values recorded at each time point.
Analysis:
• Time-series analysis involves identifying patterns, trends, seasonality, and forecasting future
values.
Methods:
• Moving averages, exponential smoothing, ARIMA (AutoRegressive Integrated Moving Average), and machine learning models like LSTM (Long Short-Term Memory).
Visualizations:
• Line charts, area charts, and heatmaps are commonly used to visualize time-series data.
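A minimal sketch of one of the methods listed above, a simple moving average over hypothetical monthly sales figures (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical monthly sales figures (invented values) indexed by time stamps.
sales = pd.Series(
    [120, 135, 128, 150, 162, 158, 170, 168, 175, 190, 185, 200],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

# A 3-month moving average smooths short-term fluctuations to expose the trend.
moving_avg = sales.rolling(window=3).mean()

print(moving_avg.tail())  # smoothed values for the most recent months
```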
Spatial Data:
Definition:
• Spatial data represents the geographical features and their attributes on the Earth's surface.
Types:
• Vector Data: Represents points, lines, and polygons and includes attributes (e.g., GIS data, maps, GPS coordinates).
• Raster Data: Grid-based representation with cells containing values (e.g., satellite images, elevation data).
Analysis:
• Spatial analysis involves analyzing relationships, patterns, and distributions within geographic space.
Methods:
• Spatial queries, interpolation, buffering, clustering, and overlay operations.
Applications:
• Urban planning, environmental monitoring, transportation logistics, and location-based services.
Types of Data Analysis
i. Descriptive Analysis
ii. Predictive Analysis
iii. Diagnostic Analysis
iv. Exploratory Analysis
v. Survival Analysis
vi. Social Network Analysis
Descriptive Analysis:
Definition:
• Descriptive analysis involves summarizing and describing the main features of a
dataset.
Purpose:
• To gain insights into the basic characteristics of the data, such as central tendency,
variability, distribution, and relationships between variables.
Techniques:
• Measures of central tendency (mean, median, mode), measures of dispersion
(range, standard deviation), frequency distributions, and graphical representations
(histograms, box plots, scatter plots).
Example:
• Calculating the average salary of employees in a company and visualizing it using a
histogram.
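For instance, the salary example above can be computed with pandas as in this small sketch (the salary values are invented for illustration):

```python
import pandas as pd

# Hypothetical employee salaries (invented values).
salaries = pd.Series([42000, 48000, 51000, 55000, 61000, 75000, 120000])

print("Mean:", salaries.mean())
print("Median:", salaries.median())
print("Std dev:", salaries.std())
print(salaries.describe())            # count, mean, std, min, quartiles, max

# A histogram visualizes the salary distribution (requires matplotlib):
# salaries.plot(kind="hist", bins=5)
```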
Descriptive Analytics
• In descriptive analytics, the aim is to describe patterns
of customer behavior.
• There is no real target variable (e.g., churn or fraud
indicator) available.
• Descriptive analytics is often referred to as
unsupervised learning because there is no target
variable to steer the learning process.
Association Rule
• An association rule is then an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅ (with I denoting the set of all items).
• X is referred to as the rule antecedent, whereas Y is referred to as the rule consequent.
Examples
• If a customer has a car loan and car insurance, then
the customer has a checking account in 80% of the
cases.
• If a customer buys bread, then the customer buys milk
in 70 percent of the cases.
• If a customer visits web page A, then the customer
will visit web page B in 90% of the cases.
Support and Confidence
• Support and confidence are two key measures to
quantify the strength of an association rule.
• The support of an item set is defined as the percentage of total transactions in the database that contain the item set:
• Support(X ∪ Y) = (number of transactions containing both X and Y) / (total number of transactions)
• A frequent item set is one for which the support is higher than a threshold (minsup).
• A lower (higher) minsup will obviously generate more (fewer) frequent item sets.
Confidence
• The confidence measures the strength of the
association and is defined as the conditional
probability of the rule consequent, given the rule
antecedent.
• Confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
Association Rule Mining
• Mining association rules from data is essentially a
two‐step process as follows:
• Identification of all item sets having support above
minsup (i.e., “frequent” item sets)
• Discovery of all derived association rules having
confidence above minconf
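The two-step idea can be illustrated with a small, self-contained sketch that computes support and confidence for candidate rules over a toy transaction database (the transactions, minsup, and minconf values are invented for illustration; in practice a library such as mlxtend's apriori would typically be used):

```python
from itertools import combinations

# Toy transaction database (invented): each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "eggs"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

minsup, minconf = 0.4, 0.7
items = set().union(*transactions)

# Step 1: identify frequent item sets (support above minsup).
frequent = [frozenset(c)
            for size in (1, 2)
            for c in combinations(items, size)
            if support(set(c)) >= minsup]

# Step 2: derive rules X => Y with confidence = support(X ∪ Y) / support(X) above minconf.
for itemset in (f for f in frequent if len(f) == 2):
    for x in itemset:
        antecedent, consequent = {x}, itemset - {x}
        conf = support(itemset) / support(antecedent)
        if conf >= minconf:
            print(f"{antecedent} => {consequent}: "
                  f"support={support(itemset):.2f}, confidence={conf:.2f}")
```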
Sequence Rules
• Given a database D of customer transactions, the problem of mining sequential rules is to find the maximal sequences among all sequences that have certain user-specified minimum support and confidence.
• An example could be a sequence of web page visits in a web
analytics setting, as follows:
• Home page ⇒ Electronics ⇒ Cameras and Camcorders ⇒ Digital
Cameras ⇒ Shopping cart ⇒ Order confirmation ⇒ Return to
shopping
• Association rules are concerned with which items appear together at the same time (intratransaction patterns), whereas sequence rules are concerned with which items appear at different times (intertransaction patterns).
• A sequential version can then be obtained as
follows:
• Session 1: A, B, C
• Session 2: B, C
• Session 3: A, C, D
• Session 4: A, B, D
• Session 5: D, C, A
Segmentation
• The aim of segmentation is to split up a set of
customer observations into segments such that the
homogeneity within a segment is maximized and
the heterogeneity between segments is maximized.
• Popular applications include:
➢ Understanding a customer population (e.g., targeted
marketing or advertising [mass customization])
➢ Efficiently allocating marketing resources
➢ Differentiating between brands in a portfolio
➢ Identifying the most profitable customers
➢ Identifying shopping patterns
➢ Identifying the need for new products
• Divisive hierarchical clustering starts from the whole data set in one cluster, and then breaks it up into progressively smaller clusters until each observation forms its own cluster.
• Agglomerative clustering works the other way around, starting with each observation in its own cluster and continuing to merge the ones that are most similar until all observations make up one big cluster.
K‐Means Clustering
• K ‐means clustering is a nonhierarchical procedure
that works along the following steps:
➢ Select k observations as initial cluster centroids (seeds).
➢ Assign each observation to the cluster that has the closest centroid (for example, in the Euclidean sense).
➢ When all observations have been assigned, recalculate the positions of the k centroids.
➢ Repeat the assignment and recalculation steps until the cluster assignments no longer change.
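A minimal scikit-learn sketch of these steps (the two-dimensional points are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D observations forming two loose groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 9.0]])

# k = 2 seeds are chosen, observations are assigned to the closest centroid,
# and centroids are recomputed until assignments stabilize.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster labels:", kmeans.labels_)
print("Centroids:\n", kmeans.cluster_centers_)
```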
Self‐Organizing Maps
• A self‐organizing map (SOM) is an unsupervised learning algorithm that allows you to visualize and cluster high‐dimensional data on a low‐dimensional grid of neurons.
Predictive Analysis:
Definition:
• Predictive analysis involves using historical data to make predictions about future
outcomes or trends.
Purpose:
• To forecast future events or behaviors based on past patterns and relationships in the
data.
Techniques:
• Regression analysis, time series forecasting, machine learning algorithms (e.g., linear
regression, decision trees, neural networks), and statistical modeling.
Example:
• Predicting stock prices based on historical trading data or forecasting sales revenue for the next quarter.
Predictive Analytics
• In predictive analytics, the aim is to build an analytical
model predicting a target measure of interest.
• Two types of predictive analytics can be distinguished:
regression and classification.
• In regression, the target variable is continuous.
• Popular examples are predicting stock prices, loss given
default (LGD), and customer lifetime value (CLV).
• In classification, the target is
categorical.
• It can be binary (e.g., fraud, churn,
credit risk) or multiclass (e.g.,
predicting credit ratings).
TARGET DEFINITION
• In a customer attrition setting, churn can be defined in
various ways.
• Active churn implies that the customer stops the relationship with the firm.
• Passive churn occurs when a customer decreases the
intensity of the relationship with the firm, for example,
by decreasing product or service usage.
Classification
• K-Nearest Neighbour
• Decision tree
• Random forest model
• Support vector machines
• Logistic Regression
K-Nearest Neighbour
• The k-nearest neighbors (KNN) algorithm is a supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point.
• For example, if a group of users who have similar browsing
habits and demographics have responded positively to a
particular ad, a KNN model can predict that a new user with
geometrically similar attributes would also respond positively to
that ad.
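A small scikit-learn sketch of this idea; the features and labels are invented (in practice the browsing-habit and demographic attributes would be encoded as numeric features):

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented training data: [age, pages_visited] and whether the user clicked the ad.
X_train = [[25, 12], [31, 8], [22, 15], [45, 3], [52, 2], [40, 4]]
y_train = [1, 1, 1, 0, 0, 0]          # 1 = responded positively, 0 = did not

# Classify a new user by majority vote among the k nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[28, 10]]))        # likely class 1: similar to the responders
```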
Decision tree
• Decision Tree is a Supervised learning technique that
can be used for both classification and Regression
problems, but mostly it is preferred for solving
Classification problems.
• It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches
represent the decision rules and each leaf node
represents the outcome.
Random forest model
• Random forest is a commonly-used machine
learning algorithm that combines the output of
multiple decision trees to reach a single result.
Support vector machines
• Support Vector Machine is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems.
• However, primarily, it is used for Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future.
• This best decision boundary is called a hyperplane.
Logistic Regression
• Logistic regression is a supervised machine learning algorithm used
for classification tasks where the goal is to predict the probability that an
instance belongs to a given class or not.
• Binomial: In binomial Logistic regression, there can be only two possible
types of the dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as “low”, “medium”, or “high”.
Example
• A credit card company wants to know whether transaction amount and credit score impact the probability of a given transaction being fraudulent.
• The response variable in the model will be “fraudulent”
and it has two potential outcomes:
• The transaction is fraudulent.
• The transaction is not fraudulent.
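A hedged sketch of this fraud example with scikit-learn; the transaction amounts, credit scores, and labels are invented for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Invented training data: [transaction_amount, credit_score] and fraud label (1 = fraudulent).
X = [[2500, 580], [90, 720], [4000, 560], [50, 800], [3200, 600], [75, 760]]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Predicted probability that a new transaction is fraudulent.
print(model.predict_proba([[1800, 610]])[0][1])
```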
Regression
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
Simple linear Regression
• Simple linear regression is a model that describes
the relationship between one dependent and
one independent variable using a straight line.
Multiple linear Regression
• Multiple linear regression is a model for
predicting the value of one dependent variable
based on two or more independent variables.
• Example: house prices can be predicted based on location, square footage, number of floors constructed, etc.
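A minimal sketch of this house-price example with scikit-learn (the feature values and prices are invented):

```python
from sklearn.linear_model import LinearRegression

# Invented training data: [square_feet, num_floors] and sale price.
X = [[1200, 1], [1500, 2], [1800, 2], [2100, 3], [2500, 3]]
y = [200_000, 260_000, 300_000, 360_000, 420_000]

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)     # fitted coefficients b1, b2 and intercept b0
print(model.predict([[1600, 2]]))        # predicted price for a new house
```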
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:
• y = b₀ + b₁x + b₂x² + b₃x³ + … + bₙxⁿ
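A sketch of fitting such an equation with scikit-learn's PolynomialFeatures; the data points are invented and degree 3 is assumed purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Invented data following a roughly cubic relationship between x and y.
x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 7.8, 28.1, 62.5, 124.0, 215.4])

# Expand x into [x, x^2, x^3] and fit ordinary least squares on those terms.
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                      LinearRegression())
model.fit(x, y)

print(model.predict([[7]]))    # extrapolated prediction at x = 7 (roughly 7^3)
```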
Diagnostic Analysis:
Definition:
• Diagnostic analysis focuses on understanding the causes of observed phenomena or
outcomes.
Purpose:
• To identify the factors or variables that influence a particular outcome and assess
their impact.
Techniques:
• Hypothesis testing, causal inference methods (e.g., regression analysis, ANOVA), root
cause analysis, and sensitivity analysis.
Example:
• Investigating the factors contributing to customer churn in a subscription-based service and determining the most significant predictors.
Diagnostic analytics
• Diagnostic analytics examines data to understand the
root causes of events, behaviors, and outcomes.
• Data analysts use diverse techniques and tools to
identify patterns, trends, and connections to explain
why certain events occurred.
• Its main goal is to offer insights into the factors
contributing to a particular outcome or problem.
Diagnostic analytics aims to answer
questions like:
➢Why did an event or outcome happen?
➢What were the key factors that influenced it?
➢Were there any anomalies or deviations, and
what caused them?
➢What correlations or relationships exist?
➢How did actions or changes impact the
outcomes?
• Diagnostic analytics fills the space
between knowing what happened –
descriptive analytics – and foreseeing
potential outcomes—predictive analytics.
• These insights add context and detail to
the data, helping us to make more precise
choices by fully understanding influencing
factors.
• Diagnostic analytics uses a variety of
techniques to provide insights into the
causes of trends. These include:
➢Data drilling
➢Data mining
➢Correlation analysis
Data drilling
• Drilling down into a dataset can reveal more detailed
information about which aspects of the data are driving
the observed trends.
• For example, analysts may drill down into national
sales data to determine whether specific regions,
customers or retail channels are responsible for
increased sales growth.
Data mining
• Data mining hunts through large volumes of data to
find patterns and associations within the data.
• For example, data mining might reveal the most
common factors associated with a rise in insurance
claims.
• Data mining can be conducted manually or
automatically with machine learning technology.
Correlation analysis
• Correlation analysis examines how strongly different variables are linked to each other. For example, sales of ice cream and refrigerated soda may both rise on hot days.
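A one-liner sketch of correlation analysis in pandas; the daily temperature and sales figures are invented for illustration:

```python
import pandas as pd

# Invented daily observations: temperature and sales of ice cream and refrigerated soda.
df = pd.DataFrame({
    "temperature": [18, 22, 25, 29, 33, 35],
    "ice_cream_sales": [120, 150, 180, 240, 310, 330],
    "soda_sales": [200, 230, 260, 320, 400, 420],
})

# Pearson correlation matrix: values near +1 indicate strongly linked variables.
print(df.corr())
```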
Diagnostic Analytics Categories
• Identify anomalies
• Discovery
• Causal relationships
Identify anomalies
• Trends or anomalies highlighted by descriptive analysis may
require diagnostic analytics if the cause isn’t immediately
obvious.
• In addition, it can sometimes be difficult to determine whether
the results of descriptive analysis really show a new trend,
especially if there’s a lot of natural variability in the data.
• In those cases, statistical analysis can help to determine
whether the results actually represent a departure from the
norm.
Discovery
• The next step is to look for data that explains the
anomalies: data discovery.
• That may involve gathering external data as well as drilling
into internal data.
• For example, searching external data might reveal changes
in supply chains, new regulatory requirements, a shifting
competitive landscape or weather patterns that are
associated with the anomalous data.
Causal relationships
• Further investigation can provide insights into whether the
associations in the data point to the true cause of the
anomaly.
• The fact that two events correlate doesn’t necessarily mean
one causes the other.
• Deeper examination of the data associated with the sales
increase can indicate which factor or factors were the most
likely cause.
Exploratory Analysis:
Definition:
• Exploratory analysis aims to explore and understand the structure and patterns
within a dataset without preconceived hypotheses.
Purpose:
• To uncover hidden insights, trends, and relationships that may not be immediately
apparent.
Techniques:
• Data visualization, clustering algorithms, dimensionality reduction techniques (e.g., PCA, Principal Component Analysis), and unsupervised learning methods.
Example:
• Using clustering algorithms to group customers based on their purchasing behavior
to identify market segments.
Exploratory Analytics
• Exploratory Data Analysis (EDA) refers to the process of studying and exploring data sets to understand their main characteristics, discover patterns, detect outliers, and identify relationships between variables.
• EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
The Foremost Goals of EDA
• Data Cleaning
• Descriptive Statistics
• Data Visualization
• Feature Engineering
• Correlation and Relationships
• Data Segmentation
• Hypothesis Generation
• Data Quality Assessment
Data Cleaning
EDA involves examining the data for errors, missing values, and inconsistencies. It includes techniques such as imputation, handling missing data, and identifying and removing outliers.
Descriptive Statistics
EDA uses descriptive statistics to understand the central tendency, variability, and distribution of variables. Measures such as the mean, median, mode, standard deviation, range, and percentiles are commonly used.
Data Visualization
EDA employs visual techniques to represent the data graphically.
Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help identify patterns, trends, and relationships within the data.
Feature Engineering
EDA allows for the exploration of different variables and their transformations to create new features or derive meaningful insights. Feature engineering can involve scaling, normalization, binning, encoding categorical variables, and creating derived variables.
Correlation and Relationships
EDA helps discover relationships and dependencies between variables.
Techniques such as correlation analysis and scatter plots offer insights into the strength and direction of relationships between variables.
Data Segmentation
EDA can involve dividing the data into meaningful segments based on certain criteria or characteristics. This segmentation provides insights into specific subgroups within the data and can lead to more focused analysis.
Hypothesis Generation
EDA aids in generating hypotheses or research questions based on the preliminary exploration of the data. It forms the foundation for further analysis and model building.
Data Quality Assessment
EDA allows the quality and reliability of the data to be assessed. It involves checking for data integrity, consistency, and accuracy to ensure the data is suitable for analysis.
Types of EDA
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Time Series Analysis
• Missing Data Analysis
• Outlier Analysis
• Data Visualization
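Several of these EDA steps (descriptive statistics, missing-value checks, correlation, univariate and bivariate plots) can be covered in a few lines of pandas; the sketch below assumes a hypothetical CSV file and column names purely for illustration:

```python
import pandas as pd

# Hypothetical dataset; the file name and columns are assumptions for this sketch.
df = pd.read_csv("customers.csv")

print(df.describe())                  # descriptive statistics for numeric columns
print(df.isna().sum())                # missing values per column (data cleaning)
print(df.corr(numeric_only=True))     # pairwise correlations between numeric variables

# Univariate and bivariate visual checks (require matplotlib):
# df["age"].plot(kind="hist", bins=20)
# df.plot(kind="scatter", x="income", y="spending")
```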
Survival Analysis:
Definition:
• Survival analysis deals with time -to-event data and is used to analyze the duration
until the occurrence of an event of interest.
Purpose:
• To study the time until an event (e.g., death, failure, recovery) happens, considering
censoring and time-dependent variables.
Techniques:
• Kaplan-Meier estimator, Cox proportional hazards model, and parametric survival
models.
Example:
• Analyzing the survival time of patients after receiving a particular medical treatment.
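A minimal sketch of the Kaplan-Meier estimator using the lifelines library (assumed to be installed; the follow-up times and censoring flags are invented):

```python
from lifelines import KaplanMeierFitter

# Invented follow-up times (months) and event flags (1 = event observed, 0 = censored).
durations = [5, 8, 12, 12, 16, 20, 24, 30]
event_observed = [1, 1, 0, 1, 1, 0, 1, 0]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)

# Estimated probability of "surviving" (no event yet) beyond each time point.
print(kmf.survival_function_)
print("Median survival time:", kmf.median_survival_time_)
```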
Survival Analytics
• Survival analysis is a set of statistical techniques focusing
on the occurrence and timing of events.
• As the name suggests, it originates from a medical context
where it was used to study survival times of patients that
had received certain treatments.
• In fact, many classification analytics problems we have
discussed before also have a time aspect included, which
can be analyzed using survival analysis techniques.
Some examples are:
■ Predict when customers churn
■ Predict when customers make their next purchase
■ Predict when customers default
■ Predict when customers pay off their loan early
■ Predict when a customer will visit a website next
Social Network Analysis:
Definition:
• Social network analysis examines the structure, interactions, and patterns within social
networks.
Purpose:
• To understand the relationships between individuals or entities in a network and
analyze information flow, influence, and community structure.
Techniques:
• Network visualization, centrality measures (e.g., degree centrality, betweenness
centrality), clustering algorithms, and community detection methods.
Example:
• Studying communication patterns in a social media network or analyzing collaboration
networks in academic research.
Social Network Analytics
• A social network can be any set of nodes (also referred to as vertices) connected by edges in a particular business setting.
• Examples of social networks could be:
➢ Web pages connected by hyperlinks
➢ Email traffic between people
➢ Research papers connected by citations
➢ Telephone calls between customers of a telco provider
➢ Banks connected by liquidity dependencies
➢ Spread of illness between patients
• A social network consists of both nodes (vertices) and edges.
• Both need to be clearly defined at the outset of the analysis.
• A node (vertex) could be defined as a customer (private/professional), household/family, patient, doctor, paper, author, terrorist, web page, and so forth.
• An edge can be defined as a friend relationship, a call,
transmission of a disease, reference, and so on.
• Edges can also be weighted based on interaction frequency,
importance of information exchange, intimacy, and
emotional intensity.
• For example, in a churn prediction setting, the edge can be weighted according to the time two customers called each other during a specific period.
• Social networks can be represented as a sociogram.
• Sociograms are good for small‐scale networks.
• For larger‐scale networks, the network will typically be represented as a matrix.
• The matrix can also contain the weights in the case of weighted connections.
• A popular technique here is the Girvan‐Newman
algorithm, which works as follows:
• 1. The betweenness of all existing edges in the
network is calculated first.
• 2. The edge with the highest betweenness is
removed.
• 3. The betweenness of all edges affected by the
removal is recalculated.
• 4. Steps 2 and 3 are repeated until no edges
remain.
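These steps are implemented in networkx; a minimal sketch on an invented toy graph (node names and edges are assumptions for the example):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Invented toy network: two tightly knit groups joined by one bridge edge.
G = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"),     # group 1
              ("D", "E"), ("E", "F"), ("D", "F"),     # group 2
              ("C", "D")])                            # bridge edge with high betweenness

# Each iteration removes the highest-betweenness edge and recomputes betweenness.
communities = girvan_newman(G)
first_split = next(communities)

print([sorted(c) for c in first_split])   # e.g., [['A', 'B', 'C'], ['D', 'E', 'F']]
```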
• A social network learner will usually consist of the following components:
• A local model: This is a model using only
node‐specific characteristics, typically estimated
using a classical predictive analytics model (e.g.,
logistic regression, decision tree).
• A network model: This is a model that will make use
of the connections in the network to do the
inferencing.
• A collective inferencing procedure: This is a
procedure to determine how the unknown nodes are
estimated together, hereby influencing each other.
• The relational neighbor classifier makes
use of the homophily assumption, which
states that connected nodes have a
propensity to belong to the same class.
• The probabilistic relational neighbor classifier is an extension of the relational neighbor classifier, whereby the posterior class probability for node n to belong to class c is estimated from the class probabilities of its neighbors.
• RELATIONAL LOGISTIC REGRESSION
starts off from a data set with local
node‐specific characteristics and adds
network characteristics to it, as follows:
• Most frequently occurring class of neighbor
(mode‐link)
• Frequency of the classes of the neighbors
(count‐link)
• Binary indicators indicating class presence
(binary‐link)
EGONETS
BIGRAPHS
Phases of Data Analysis
• Data Collection
• Data Preprocessing
• Data Analysis
• Interpretation and
Visualization
• Decision Making
Data Collection:
Definition:
• Data collection involves gathering raw data from various sources, which could be
internal or external to the organization.
Methods:
• Data can be collected through surveys, interviews, observations, sensors, web
scraping, transaction records, and more.
Considerations:
• It's essential to ensure data quality, relevance, and legality during the collection
process.
Documentation:
• Proper documentation of data sources, collection methods, and any associated
metadata is crucial for future analysis and reproducibility.
Data Preprocessing:
Definition:
• Data preprocessing encompasses cleaning, transforming, and preparing the raw data
for analysis.
Steps:
• Data Cleaning: Handling missing values, outliers, and inconsistencies in the data.
• Data Transformation: Standardizing, normalizing, or scaling the data to make it
suitable for analysis.
• Feature Engineering: Creating new features or modifying existing ones to
improve predictive power.
• Dimensionality Reduction: Reducing the number of variables while preserving
essential information.
Importance: Proper data preprocessing enhances the quality and reliability of
subsequent analysis results.
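A compact sketch of several of these preprocessing steps with pandas and scikit-learn; the DataFrame and column names are assumptions made for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a missing value and unscaled numeric features.
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29],
    "income": [30_000, 52_000, 45_000, 80_000, 38_000],
})

# Data cleaning: impute the missing age with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Data transformation: standardize both columns (zero mean, unit variance).
scaled = StandardScaler().fit_transform(df[["age", "income"]])

# Feature engineering: a simple derived feature.
df["income_per_year_of_age"] = df["income"] / df["age"]

print(scaled)
print(df.head())
```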
Data Analysis:
Definition:
• Data analysis involves applying various statistical, mathematical, or machine learning
techniques to derive insights from the processed data.
Methods:
• Descriptive statistics, inferential statistics, regression analysis, clustering,
classification, and association rule mining are some common analysis methods.
Tools:
• Utilizing software tools and programming languages such as Python, R, SAS, and SQL
for conducting analysis.
Validation:
• It's essential to validate analysis results and ensure their accuracy and reliability
through appropriate testing and validation techniques.
Interpretation and Visualization:
Definition:
• Interpretation involves making sense of the analysis results and deriving meaningful
insights to address the underlying questions or objectives.
Visualization:
• Visual representations of data, such as charts, graphs, and dashboards, aid in
understanding patterns, trends, and relationships within the data.
Storytelling:
• Effectively communicating the findings to stakeholders through storytelling techniques,
emphasizing key insights and actionable recommendations.
Iterative Process:
• Interpretation often involves iteratively refining analysis approaches and exploring
additional questions or hypotheses based on initial findings.
Decision Making:
Definition:
• Decision making involves using the insights derived from data analysis to inform
and support organizational or strategic decisions.
Types:
• Data-driven decision making relies on evidence and analysis results, while also
considering contextual factors, expertise, and judgment.
Implementation:
• Translating insights into actionable strategies, initiatives, or interventions, and
monitoring their effectiveness over time.
Feedback Loop:
• Establishing a feedback loop to continuously evaluate the impact of decisions,
refine strategies, and incorporate new data and insights.
Data Analytics Methodologies
and Workflows
• Traditional vs. Agile Approaches
• CRISP-DM (Cross-Industry Standard
Process for Data Mining)
• KDD (Knowledge Discovery in
Databases)
• Other Methodologies and
Frameworks
Traditional vs. Agile Approaches:
• Traditional Approach:
• Sequential and linear process with distinct phases (e.g., requirement analysis, design, implementation, testing, deployment).
• Emphasizes detailed planning and documentation upfront.
• Well-suited for projects with clear requirements and stable objectives.
• Examples include the Waterfall model and V-model.
• Agile Approach:
• Iterative and flexible approach that focuses on adaptive planning and incremental delivery.
• Emphasizes collaboration, customer feedback, and continuous improvement.
• Divides the project into small, manageable iterations (sprints) with frequent releases.
• Well-suited for projects with evolving requirements and fast-changing environments.
• Examples include Scrum, Kanban, and Extreme Programming (XP).
CRISP-DM (Cross-Industry Standard Process for Data Mining):
Overview:
• CRISP-DM is a widely used methodology for conducting data mining and analytics projects.
Phases:
• Business Understanding : Understanding project objectives, requirements, and success criteria.
• Data Understanding: Exploring and understanding the available data sources and their
characteristics.
• Data Preparation : Cleaning, transforming, and integrating data for analysis.
• Modeling: Selecting appropriate modeling techniques and building predictive or descriptive
models.
• Evaluation: Assessing model performance and validating results against business objectives.
• Deployment: Deploying models into production and integrating them into business processes.
Iterative Process:
• CRISP-DM is an iterative process where outputs from one phase often inform activities in
subsequent phases.
KDD (Knowledge Discovery in Databases):
Overview:
• KDD is a process of discovering useful knowledge from large volumes of data.
Phases:
• Selection: Selecting and acquiring relevant data from multiple sources.
• Preprocessing: Cleaning, transforming, and integrating data to prepare it for analysis.
• Transformation: Transforming the preprocessed data into forms suitable for mining (e.g., feature selection, dimensionality reduction).
• Data Mining: Applying data mining algorithms to the transformed data to extract patterns, relationships, and actionable knowledge.
• Interpretation/Evaluation : Interpreting the discovered patterns and evaluating their
usefulness and relevance.
• Utilization: Incorporating the discovered knowledge into decision-making processes and taking appropriate actions.
Iterative and Interactive : KDD is an iterative and interactive process, with feedback loops between
phases .
Other Methodologies and Frameworks:
Agile Data Science:
• Combines principles from agile software development with data science practices to
enable iterative and collaborative data analytics projects.
TDSP (Team Data Science Process) :
• Microsoft's framework for collaborative data science projects, focusing on iterative
development, reproducibility, and scalability.
SEMMA (Sample, Explore, Modify, Model, Assess) :
• A data mining methodology developed by SAS Institute, emphasizing a sequential
approach to data analysis.
Lean Analytics:
• Adapts lean startup principles to analytics projects, focusing on identifying key metrics, rapid experimentation, and data-driven decision-making.
Data Quality and Software
• Understanding Data Quality
• Data Cleaning and Preprocessing
Techniques
• Introduction to Data Quality
Tools
• Popular Data Analytics Software
Understanding Data Quality:
• Definition: Data quality refers to the accuracy, completeness, consistency, reliability, and relevance of data for its intended use.
• Importance: High-quality data is essential for making informed decisions, ensuring the effectiveness of analytics, and maintaining trust in data-driven processes.
• Dimensions of Data Quality:
• Accuracy: The degree to which data accurately represents the real-world phenomenon it describes.
• Completeness: The extent to which data contains all necessary information without missing values or gaps.
• Consistency: The absence of contradictions or discrepancies within the data.
• Reliability: The trustworthiness and consistency of data over time and across different sources.
• Relevance: The extent to which data is applicable and useful for the intended purpose.
Data Cleaning and Preprocessing Techniques:
• Data Cleaning:
• Handling Missing Values: Techniques include imputation (e.g., mean imputation, predictive imputation) or deletion of missing data.
• Outlier Detection and Treatment: Identifying and handling outliers that may skew analysis results.
• Error Correction: Correcting errors or inconsistencies in data through manual or automated processes.
• Data Preprocessing:
• Normalization: Scaling numerical data to a standard range to ensure comparability.
• Transformation: Converting data into a more suitable format for analysis (e.g., log transformation).
• Feature Engineering: Creating new features or modifying existing ones to improve model performance.
• Dimensionality Reduction: Reducing the number of variables while preserving essential information (e.g., PCA, feature selection).
Introduction to Data Quality Tools:
• Data Profiling Tools: Analyze the structure and characteristics of datasets to identify data quality issues (e.g., IBM InfoSphere Information Analyzer, Talend Data Quality).
• Data Cleansing Tools: Automate the process of cleaning and standardizing data by detecting and correcting errors, duplicates, and inconsistencies (e.g., Trifacta, OpenRefine).
• Data Governance Tools: Manage and enforce data quality standards, policies, and workflows across an organization (e.g., Collibra, Informatica Axon).
• Data Quality Monitoring Tools: Continuously monitor data quality metrics and alerts to detect and address issues in real time (e.g., Ataccama, Talend Data Stewardship).
Popular Data Analytics Software:
• Python: A versatile programming language with libraries like Pandas, NumPy, and Scikit-learn for data manipulation, analysis, and machine learning.
• R: A statistical programming language with extensive packages for data analysis, visualization, and statistical modeling.
• SQL (Structured Query Language): Used for querying and manipulating relational databases to extract and analyze data.
• Tableau: A powerful data visualization tool that enables users to create interactive dashboards and reports from various data sources.
• Microsoft Excel: Widely used for data analysis, visualization, and basic statistical operations due to its user-friendly interface and familiarity.
Privacy in Data Analytics
• Importance of Privacy
• Legal and Ethical
Considerations
• Privacy-Preserving Techniques
• Privacy Regulations and
Compliance
Importance of Privacy:
• Protection of Individuals: Privacy ensures that individuals have control over their personal information and safeguards them from unauthorized access, misuse, or exploitation.
• Trust and Reputation: Maintaining privacy fosters trust between organizations and their customers, clients, or users, enhancing reputation and credibility.
• Legal and Ethical Obligations: Organizations have legal and ethical responsibilities to protect the privacy rights of individuals and comply with applicable regulations.
Legal and Ethical Considerations:
• Data Protection Laws: Regulations such as the General Data Protection Regulation (GDPR) in the EU and the California Consumer Privacy Act (CCPA) in the US impose requirements on organizations regarding the collection, processing, and storage of personal data.
• Ethical Guidelines: Adhering to ethical principles such as transparency, fairness, accountability, and respect for individual autonomy is essential in data analytics to ensure ethical use of data.
• Informed Consent: Obtaining informed consent from individuals before collecting or processing their personal data is a fundamental ethical principle to respect their privacy rights.
Privacy-Preserving Techniques:
• Anonymization: Removing or encrypting personally identifiable information (PII) from datasets to prevent individuals from being identified.
• Pseudonymization: Replacing direct identifiers with pseudonyms or tokens to protect individual identities while maintaining data usability.
• Differential Privacy: Adding noise or randomness to query responses to protect the privacy of individuals in statistical databases.
• Homomorphic Encryption: Performing computations on encrypted data without decrypting it, preserving privacy while allowing analysis.
• Secure Multiparty Computation (SMC): Enabling multiple parties to jointly compute a function over their inputs while keeping their inputs private.
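As a minimal illustration of one of these techniques, the sketch below adds Laplace noise to a count query in the spirit of differential privacy; the dataset, epsilon, and sensitivity values are invented, and real deployments would rely on audited privacy libraries:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented sensitive dataset: ages of individuals in a statistical database.
ages = np.array([23, 35, 41, 29, 52, 47, 31, 60])

def noisy_count(condition_mask, epsilon=0.5, sensitivity=1.0):
    """Return a count perturbed with Laplace noise scaled to sensitivity / epsilon."""
    true_count = condition_mask.sum()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Query: how many individuals are older than 40? (answer is perturbed to protect privacy)
print(noisy_count(ages > 40))
```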
Privacy Regulations and Compliance:
• General Data Protection Regulation (GDPR): Imposes strict requirements on organizations regarding the processing and protection of personal data of EU residents, including data subject rights, breach notification, and privacy by design principles.
• California Consumer Privacy Act (CCPA): Grants California residents certain rights regarding their personal information and imposes obligations on businesses regarding data transparency and consumer privacy rights.
• Health Insurance Portability and Accountability Act (HIPAA): Regulates the use and disclosure of protected health information (PHI) and requires safeguards to protect patient privacy and confidentiality.
• Other Regulatory Frameworks: Various other regulations and standards exist globally, such as the Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada and the Privacy Act in Australia.