SPECIALIZATION ELECTIVE
DATA SCIENCE (22DS102006)
Module 1 INTRODUCTION
1. Definition of data science
Data Science (DS): Study and practice of collecting, storing, processing, and analyzing data
to extract meaningful insights and support decision-making. It includes scientific methods,
computational thinking, and systematic analysis of data. DS combines computer science,
statistics, mathematics, and domain expertise.
The 3Vs of data that justify its rise:
Volume: Massive amounts of data (e.g., zettabytes)
Velocity: Speed at which data is generated
Variety: Structured and unstructured data formats (text, image, video, etc.)
Data science is used to derive meaningful insights to solve real-world problems through
systematic, repeatable, and verifiable processes.
Figure 1.1 shows that, by 2020, global data volume was projected to reach 40 zettabytes, a roughly 50-fold
increase since 2010. This underscores the growing importance of DS in managing and analyzing such massive
amounts of information.
Figure 1.1: Increase of Data Volume in 15 Years
(Source: IDC's Digital Universe Study, December 2012)
Where Do We See Data Science?
Data science finds widespread application across a multitude of domains owing to its
capability to extract meaningful insights from vast and complex datasets. Below are some key
areas where data science plays a transformative role:
Finance: In the financial sector, data science facilitates fraud detection, credit risk
assessment, customer segmentation, and the formulation of investment strategies. By
analyzing transactional data and historical customer records, predictive models can be
developed to support decisions such as loan approvals and portfolio management.
Public Policy: Governments leverage data science to examine citizen behavior, traffic
flow, usage of public services, and social trends, thereby enhancing governance and
infrastructure development. Open data platforms, such as data.gov, provide access to
datasets that support evidence-based policymaking across diverse areas including
sanitation, transportation, and urban planning.
Politics: Political campaigns increasingly rely on data science for voter profiling,
targeted campaigning, and sentiment analysis via social media. By identifying
undecided voters and customizing campaign messages, political strategists can
optimize outreach and engagement efforts.
Healthcare: In the healthcare industry, data science revolutionizes disease diagnosis,
patient data management, and personalized treatment plans. Wearable health devices,
such as fitness trackers, continuously collect physiological data, which can be
analyzed to alert users and physicians about potential health risks and support
preventive care.
Urban Planning: Smart city projects benefit significantly from data science through
the integration of sensor networks and analytical tools to optimize traffic control,
energy consumption, and public safety. Examples include systems like Chicago’s
Plow Tracker and Boston’s SnowCOP, which utilize real-time data to enhance
municipal service delivery.
Education: Educational institutions use data science to support personalized learning
by monitoring student performance and engagement. These insights enable educators
to tailor teaching strategies and deliver timely academic interventions.
Libraries and Information Science: Libraries apply data science to automate
systematic literature reviews, enhance metadata management, and streamline access
to academic resources. These capabilities assist researchers in identifying relevant
materials more efficiently.
Business Sector: In business, data science strengthens marketing, sales, and customer
support by analyzing consumer behavior, forecasting product demand, and delivering
personalized recommendations. Tech giants like Amazon and Netflix utilize
sophisticated data models to optimize user experience and drive customer retention.
Industrial Sectors: Manufacturing and industrial operations use data science for
predictive maintenance, supply chain optimization, and quality assurance. With the
rise of IoT devices, real-time data collection from machinery enables smarter
production and process control.
Across these diverse fields, data science stands out as a powerful enabler for uncovering
patterns, making accurate predictions, and facilitating data-driven decisions.
2. Skills for data science
The success of DS involves adopting a mindset that merges curiosity, analytical thinking, and
technical expertise. Three key skills used in DS are willingness to experiment, mathematical
reasoning, and data literacy. These skills reflect not only what employers seek but also what
enables professionals to succeed in this evolving, interdisciplinary domain. The hybrid nature
of the data scientist's role calls for the following essential skills:
a) Willingness to Experiment
Curious, creative problem-solvers who define hypotheses and explore data.
Employers often assess curiosity through open-ended questions.
b) Proficiency in Mathematical Reasoning
Solid understanding of statistics and logical reasoning.
Interpretation of numeric data is critical for analysis.
c) Data Literacy
The ability to read, interpret, and work with data.
Involves understanding data relevance, visualizing insights, and making data-driven
decisions.
Considered a fundamental 21st-century skill, like reading and writing.
d) Roles and Job Types in Data Science
Data Analyst: Entry-level role focused on visualizations, reporting, Excel, Google
Analytics, etc.
Data Engineer: Designs infrastructure for handling large volumes of data.
Product-Focused Data Scientist: Develops data-driven services/products (e.g.,
recommendation engines).
Generalist in Data-Driven Companies: Wears multiple hats including ML,
visualization, reporting.
In another view, Dave Holtz blogs about the specific skill sets desired by the various positions to which a
data scientist may apply:
A) Data Analyst as Entry-Level Data Scientist: In many companies, entry-level data scientists perform
tasks similar to data analysts—using tools like Excel, MySQL, Google Analytics, and Tableau for data
retrieval, reporting, and visualization. These roles focus on descriptive analysis and serve as a
practical starting point for beginners.
B) Data Wrangler / Engineer Roles: Some organizations hire data scientists to manage large,
unstructured datasets. These roles, often labeled as data engineers, involve building infrastructure
and data pipelines. They require strong programming skills and are ideal for self-motivated
individuals, though they often lack formal mentorship.
C) Data-Product Focused Roles: In data-centric companies, data scientists develop products powered
by data—such as recommendation engines or predictive models. These roles emphasize machine
learning and advanced analytics, typically attracting candidates with academic backgrounds in math,
statistics, or physics.
D) Generalist Roles in Data-Driven Companies: Many businesses use data to support decision-
making. Data scientists here work on established teams and require broad skills—analysis,
visualization, coding—along with specialization in areas like ML or dashboard design. Familiarity with
big data tools and messy datasets is often essential.
The above-discussed skills are summarized in Figure 1.2.
Figure 1.2: Types of Data Science Roles
3. Tools for data science
Becoming a successful data scientist requires hands-on experience with tools that are widely used in
industry for solving data-driven problems. These tools offer powerful capabilities for implementing,
visualizing, and testing data solutions quickly. Python and R, in particular, provide user-friendly
environments with extensive libraries that facilitate easy setup and enable beginners to produce
meaningful results.
A) Python
High-level scripting language
Interpreted line by line (no need for compiling like Java/C++)
Simple syntax; easy to learn and write
Widely supported with open-source libraries and packages
Suitable for both beginners and advanced users
Common Use Cases:
o Data manipulation: Pandas
o Numerical computing: NumPy, SciPy
o Visualization: Matplotlib, Seaborn
o Machine learning: scikit-learn, TensorFlow, PyTorch
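As a brief illustration of the use cases listed above, the following minimal sketch (assuming pandas, NumPy, and Matplotlib are installed; the data are invented) loads a small numeric series, summarizes it, and plots a histogram.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: ages of 100 customers (randomly generated for illustration).
ages = pd.Series(np.random.default_rng(0).integers(18, 70, size=100), name="age")

print(ages.describe())        # count, mean, std, quartiles (pandas summary)

ages.plot(kind="hist", bins=10, title="Age distribution")   # Matplotlib via pandas
plt.xlabel("age")
plt.show()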
B) R
Statistical programming language
Designed specifically for statistical analysis and data visualization
Simple syntax similar to natural mathematical expressions
Strong community support for packages in CRAN (Comprehensive R Archive Network)
Common Use Cases:
o Statistical modeling
o Hypothesis testing
o Data visualization (with ggplot2)
o Report generation using RMarkdown
C) SQL (Structured Query Language)
Domain-specific language for managing relational databases
Used for querying and manipulating structured data
Efficient for handling large datasets stored in relational databases
Can be used within Python or R for integration with large-scale databases
Common Use Cases:
o Extracting specific data subsets
o Filtering, joining, and aggregating data tables
o Working with real-time or remote datasets
SQL is particularly useful when data is too large to load into memory as a flat file (like CSV).
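To illustrate how SQL can be used from within Python, here is a small sketch using the standard-library sqlite3 module; the table name and values are made up for the example.

import sqlite3

# In-memory SQLite database; in practice this could be a large, remote relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("South", 80.0), ("North", 95.5)])

# Filtering and aggregating directly in SQL instead of loading a flat file into memory.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)

conn.close()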
D) UNIX (Shell/Command Line)
Operating system environment and set of command-line tools
Supports automation and batch data processing
Powerful for file handling, text processing, and scripting
Reduces the need for coding in certain routine data tasks
Common Use Cases:
o Filtering and processing logs
o Running scripts and pipelines
o Managing large datasets without GUI
UNIX is often used in combination with Python or R in professional environments, especially on
servers or cloud-based systems.
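One common pattern is driving UNIX command-line tools from Python. The sketch below (the log file name is hypothetical) counts error lines in a log using a shell pipeline.

import subprocess

# Run a typical UNIX pipeline (grep piped into wc) from Python.
# "server.log" is a hypothetical log file used only for illustration.
result = subprocess.run(
    "grep -i 'error' server.log | wc -l",
    shell=True, capture_output=True, text=True)

print("Error lines:", result.stdout.strip())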
Need of These Tools
These tools are popular not because they were built specifically for data science, but because:
They support rapid development and iteration
They are open-source and backed by strong community ecosystems
They interoperate well—e.g., using SQL with Python, or R with UNIX
They handle a wide variety of data formats and volumes
Other Mentions
While this module mainly emphasizes Python, R, SQL, and UNIX, in practice, other tools that data
scientists often encounter include:
Jupyter Notebooks: For interactive Python code + narrative
Tableau / Power BI: For dashboarding and business intelligence (BI)
Hadoop / Spark: For distributed data processing (Big Data)
4. Data Types
One of the most basic concepts in data science is the classification of data as either structured or
unstructured. This distinction is important because methods of storage, querying, processing, and analysis
frequently differ according to this division. Structured data is highly organized, is mostly available in
tabular form with labeled rows and columns, and can be handled comfortably with databases. Unstructured
data, by contrast, has no predefined structure and may take the form of free text, audio, video, or images.
Most of the world's data is unstructured, so handling it tends to be more complex and requires specialized
processing procedures.
Structured Data
Structured data is neatly organized data, most often presented in a tabular format with labeled fields.
Because each data point is linked to a label or attribute, it is easy to search, analyze, and visualize.
Consider, as an example, the heights and weights of a group of people; such a dataset is structured because
every numeric value has a label that explains its purpose. Structured data does not have to be numerical; it
can include other types such as text (e.g., "housing type"), Boolean (e.g., "is employed"), or categorical
variables (e.g., "gender" or "marital status"). What makes data structured is not the format of the values
but the definite scheme of naming and labelling.
To illustrate this, a table (1.1) might contain records with a mix of data types—such as an individual’s
age, employment status, and place of residence.
Table 1.1: Customer Data Sample
Unstructured Data
By contrast, unstructured data is disorganized and unlabeled. It is not tied to a particular model or
format, which makes it more difficult to manage and interpret. A narrative paragraph in a medical or social
study is a good example, such as the sentence: "It was discovered that a woman measured between 65 and 67
inches and scored between 125 and 130 IQ points." Although this contains valuable information, it is not
written in a way that software can easily parse. It has no column names or explicit labels that tie each
piece of data to a particular entity or property. Consequently, interpreting and processing this type of
data may require human review or specialized natural language processing software.
The major challenge of unstructured data is its ambiguity. Its meaning often depends on context and semantic
interpretation, which makes automation harder. With structured data, the meaning of a value such as 22 is
clear from its labelled column; with unstructured data, more work is needed to arrive at the same meaning.
Nevertheless, unstructured data is an untapped pool of knowledge, especially in social media, customer
notes, and multimedia entries, and with the right methodologies to process and organize it, it can still
provide excellent insight.
5. Data collections
When looking for datasets similar to those discussed in earlier sections or chapters, there are several
reliable online sources available. These resources provide access to various data formats for analysis,
research, and application development.
5.1.Open Data
Open data refers to datasets that are made freely accessible to the public without legal or
licensing restrictions. The goal is to allow unrestricted usage, modification, and distribution.
Governments (both local and national), NGOs, and academic institutions are major
contributors to open data initiatives. For instance, datasets are published by the U.S.
Government and cities like Chicago for public use.
In 2013, the White House launched Project Open Data to promote the open data policy by
offering resources such as tools, case studies, and guidelines. A policy memo (M-13-13) was
also introduced, emphasizing the need to treat data as a valuable asset and release it in
accessible formats whenever possible.
Key Principles of Open Data:
Public: Agencies are encouraged to share data openly, unless restricted by privacy, security,
or legal concerns.
Accessible: Data must be provided in user-friendly, modifiable formats that support search
and download. These formats should ideally be machine-readable and non-proprietary.
Described: Open data must include sufficient metadata and documentation to help users
understand the data’s structure, limitations, and collection methods.
Reusable: Data should be shared under an open license, with no limitations on reuse.
Complete: Wherever possible, data should be released in its raw, original form, along with
any derived versions.
Timely: Data should be made available promptly to retain its relevance and usefulness.
Managed After Release: A designated contact should be available to support data users and
respond to compliance issues.
5.2.Social Media Data
Social media platforms are valuable sources for collecting user-generated content for research and
marketing. These platforms offer APIs (Application Programming Interfaces) that enable developers
and researchers to programmatically retrieve information such as user profiles, posts, and
engagement metrics.
An example is the Facebook Graph API, which allows data extraction in structured formats like XML.
These APIs enable developers to create innovative applications, conduct behavioral studies, or
monitor responses to events like natural disasters. Some platforms, like Yelp, actively release
datasets to encourage exploration and academic research. These datasets have supported projects
in areas like photo classification, natural language processing (NLP), sentiment analysis, and network
analysis. Researchers can participate in initiatives like the Yelp Dataset Challenge to solve real-world
data problems.
5.3. Multimodal Data
In today’s increasingly connected world, many devices—from smart bulbs to vehicles—generate
diverse types of data as part of the Internet of Things (IoT) ecosystem. This data often goes beyond
traditional formats (like numbers or text) and includes rich content such as images, audio, gestures,
and spatial data.
Data from such sources is known as multimodal or multimedia data, and it often requires specialized
techniques to analyze. For example, in medical research, brain imaging data can come from multiple
sensors like EEG, MEG, and fMRI. These measurements, captured in sequences or time-series, are
analyzed using techniques like Statistical Parametric Mapping (SPM)—a method developed by Karl
Friston—to identify patterns of brain activity.
For those interested in working with such complex datasets, Appendix E (not included here) offers
guidance on additional sources and ongoing challenges related to data analysis and real-world
problem solving.
5.4.Data Storage and Presentation
Data can be stored in various formats depending on its structure and intended use. The following are
common formats for storing and presenting data:
A) CSV (Comma-Separated Values)
Widely adopted for sharing spreadsheets and database exports.
Stores data in plain text with fields separated by commas.
Advantages: Universally readable by spreadsheet tools like Excel and Google Sheets; easy to share
and understand.
Challenges: If the data itself contains commas, parsing becomes difficult unless proper escape
characters are used.
Example: A file named Depression.csv from UF Biostatistics contains records of treatments and
outcomes for individuals with depression.
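The comma-escaping challenge mentioned above is usually handled by quoting. A minimal sketch with Python's built-in csv module (the records are invented):

import csv, io

# A field that itself contains a comma must be quoted.
rows = [["name", "address"],
        ["Asha", "12 Park Street, Kolkata"]]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
print(buf.getvalue())          # the address field is wrapped in quotes

# Reading it back preserves the embedded comma.
for record in csv.reader(io.StringIO(buf.getvalue())):
    print(record)

# The same code handles TSV if delimiter="\t" is passed to writer/reader.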
B) TSV (Tab-Separated Values)
Often used to store raw data between systems or tools.
Format: Similar to CSV, but fields are separated by tabs instead of commas.
Advantages: Reduces ambiguity since tabs are less likely to occur in the data itself.
Challenges: Less commonly used than CSV; requires attention when tabs exist within field content.
Example: Employee registration data stored with fields like Name, Age, and Address separated by
tab spaces.
C) XML (eXtensible Markup Language)
Designed to be both human-readable and machine-readable.
Format: Structured text using custom tags (e.g., <book>, <price>).
Advantages: Platform-independent and suitable for data exchange between incompatible systems.
Real-world Relevance: Many IT systems use XML to standardize communication between databases
and applications. XML is also familiar to those who have worked with HTML.
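As a small sketch, the kind of <book>/<price> markup mentioned above can be parsed with Python's built-in xml.etree module; the XML snippet below is invented for illustration.

import xml.etree.ElementTree as ET

# Invented example document using the <book> and <price> tags mentioned above.
doc = """
<catalog>
  <book><title>Data Science Basics</title><price>499</price></book>
  <book><title>Statistics Primer</title><price>350</price></book>
</catalog>
"""

root = ET.fromstring(doc)
for book in root.findall("book"):
    print(book.find("title").text, "-", book.find("price").text)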
D) RSS (Really Simple Syndication)
RSS is an XML-based, web-friendly format used to distribute content between platforms. It
allows websites to deliver updated
information, like news or blog posts, in a standardized XML file known as an RSS
feed. While most modern web browsers can read RSS feeds directly, users often rely
on RSS readers or aggregators, which are tools designed to collect and organize feeds
from multiple sources.
RSS uses standard XML syntax but extends it by specifying particular tags (some
mandatory, others optional) and the type of content each tag should contain. It is
tailored for delivering selective, structured content efficiently.
Example Use Case:
Imagine running a website that publishes frequent updates such as news headlines, stock
prices, or weather alerts. Users interested in staying informed would need to check the site
repeatedly, which can be inefficient—either they miss updates or waste time checking
unnecessarily. Using an RSS feed, users can subscribe through an aggregator, which
automatically checks for updates and delivers them as soon as they’re available—often
through notifications.
Why RSS is Useful:
It is lightweight and quick to load.
Easily integrates with devices like smartphones, smartwatches, and PDAs.
Ideal for frequently updated platforms such as:
o News websites (providing headlines, dates, and brief summaries)
o Business pages (announcements and new product updates)
o Calendars (event reminders and important dates)
o Websites with frequent content changes (tracking new or modified pages)
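Since an RSS feed is just XML with specific tags (channel, item, title, link, and so on), it can be read with the same xml.etree approach shown for XML; the minimal feed below is invented for illustration.

import xml.etree.ElementTree as ET

# A minimal, invented RSS 2.0 feed.
feed = """
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Headline one</title><link>http://example.com/1</link></item>
    <item><title>Headline two</title><link>http://example.com/2</link></item>
  </channel>
</rss>
"""

channel = ET.fromstring(feed).find("channel")
for item in channel.findall("item"):
    print(item.find("title").text, "->", item.find("link").text)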
6. Data Preprocessing
In the real world, data is rarely perfect. It often needs to be cleaned, reformatted, or adjusted before
it can be analyzed effectively. This entire process is referred to as data preprocessing, and it is a
critical first step in any data analysis or machine learning workflow.
Common Problems in Raw Data
Data is considered "dirty" when it contains any of the following issues:
Incomplete Data: Missing attribute values or records that only provide partial
information.
Noisy Data: Data with errors or outliers that distort the actual trend.
Inconsistent Data: Mismatched formats, naming discrepancies, or improperly
formatted entries.
Major Preprocessing Tasks
A) Data Cleaning
Data cleaning focuses on identifying and correcting issues in raw data. It may involve:
Fixing incorrect or inconsistent values
Removing duplicates
Handling missing values
Standardizing text entries (e.g., names or categories)
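A small pandas sketch of the cleaning steps listed above; the column names and values are hypothetical.

import pandas as pd

# Hypothetical raw records with duplicates and inconsistent text.
df = pd.DataFrame({
    "name": ["Asha", "asha ", "Ravi", "Ravi"],
    "city": ["Delhi", "delhi", "Mumbai", "Mumbai"],
})

df["name"] = df["name"].str.strip().str.title()   # standardize text entries
df["city"] = df["city"].str.title()
df = df.drop_duplicates()                         # remove exact duplicates
print(df)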
B) Data Munging (Wrangling)
Data munging involves reshaping or converting data into a usable format. This may be
necessary when data is stored in a form that’s difficult to analyze—like a paragraph or an
unstructured text. There is no single method for wrangling; it often requires creative or
manual solutions to make data "analysis-friendly."
C) Handling Missing Values
Missing data can occur due to user input errors, device malfunctions, or insufficient data
collection at the time. Strategies to handle missing values include:
Dropping records
Filling with a default value or average
Using statistical methods like inference or imputation
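For example, pandas provides dropna and fillna for the strategies listed above; the values below are made up.

import numpy as np
import pandas as pd

scores = pd.Series([72.0, np.nan, 65.0, np.nan, 80.0])

print(scores.dropna())                 # strategy 1: drop records with missing values
print(scores.fillna(0))                # strategy 2: fill with a default value
print(scores.fillna(scores.mean()))    # strategy 3: impute with the average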
D) Smoothing Noisy Data
Noisy data—caused by errors in data entry, sensor faults, or loss of precision—can be
reduced by:
Detecting and removing outliers
Standardizing data formats
Applying smoothing techniques (e.g., averaging)
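A simple moving-average smoother in pandas, as one example of the averaging technique above; the noisy series is synthetic.

import numpy as np
import pandas as pd

# Synthetic noisy signal: a linear trend plus random noise.
rng = np.random.default_rng(1)
signal = pd.Series(np.linspace(0, 10, 50) + rng.normal(0, 1.5, 50))

smoothed = signal.rolling(window=5, center=True).mean()   # 5-point moving average
outliers = signal[(signal - signal.mean()).abs() > 2 * signal.std()]

print(smoothed.tail())
print("Possible outliers:\n", outliers)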
E) Data Integration
When data comes from multiple sources, integration is necessary. This involves:
Merging databases or files into a unified dataset
Resolving schema and attribute conflicts
Identifying and eliminating redundant data
Example: Combining revenue data from different systems using different units (e.g.,
INR vs. USD)
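A sketch of merging two sources that report revenue in different currencies; the figures and the exchange rate are invented.

import pandas as pd

india = pd.DataFrame({"month": ["Jan", "Feb"], "revenue_inr": [830000, 910000]})
us    = pd.DataFrame({"month": ["Jan", "Feb"], "revenue_usd": [12000, 13500]})

USD_TO_INR = 83.0                       # invented exchange rate for illustration
us["revenue_inr"] = us["revenue_usd"] * USD_TO_INR

# Merge on the shared key and harmonize the units before combining.
combined = india.merge(us[["month", "revenue_inr"]], on="month",
                       suffixes=("_india", "_us"))
combined["total_inr"] = combined["revenue_inr_india"] + combined["revenue_inr_us"]
print(combined)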
F) Data Reduction
Data reduction helps simplify datasets without losing essential information. Two main
techniques:
Aggregation: Condensing detailed records into summarized forms (e.g., monthly
totals from daily sales)
Dimensionality Reduction: Removing irrelevant or redundant features using methods
like PCA, clustering, or feature merging
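Two short sketches of the techniques above: aggregation with pandas and dimensionality reduction with scikit-learn's PCA. The data are synthetic, and scikit-learn is assumed to be installed.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Aggregation: daily sales condensed into monthly totals.
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": np.random.default_rng(2).integers(100, 500, size=90),
})
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)

# Dimensionality reduction: project 10 features onto 2 principal components.
X = np.random.default_rng(3).normal(size=(200, 10))
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)          # (200, 2)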
G) Data Discretization
Discretization involves converting continuous data (like temperature or stock prices) into
categories or ranges. This simplifies analysis, especially for numerical attributes.
Types of attributes:
o Nominal: No inherent order (e.g., color)
o Ordinal: Ordered categories (e.g., education level)
o Continuous: Numeric values (e.g., age, temperature)
Example: Grouping temperature into "cold", "moderate", and "hot"
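For instance, pandas' cut function bins continuous temperatures into the labels mentioned above; the thresholds here are invented.

import pandas as pd

temps = pd.Series([2, 11, 18, 24, 31, 38])          # degrees Celsius

labels = pd.cut(temps,
                bins=[-float("inf"), 10, 25, float("inf")],
                labels=["cold", "moderate", "hot"])
print(pd.DataFrame({"temp_c": temps, "category": labels}))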
7. Data analysis and data analytics
Although often used interchangeably, the terms data analysis and data analytics are not exactly the
same. This distinction, while subtle, is significant—especially for professionals working with data.
Misinterpreting the terms may hinder effective use of data for insights and decision-making.
Dave Kasik, a Senior Technical Fellow at Boeing, offers a helpful perspective. In his view, data
analysis refers to direct, hands-on work involving the exploration and assessment of data. On the
other hand, data analytics encompasses a broader scope, with data analysis being just one of its
components. Kasik describes analytics as the science behind the analytical process, focusing on how
analysts cognitively approach and solve data problems.
A practical way to differentiate them is through a time-based lens:
Data analysis generally deals with understanding what has already happened. It provides a
retrospective view, commonly used in fields like marketing to evaluate past performance.
Data analytics looks forward—it leverages models and statistical tools to forecast outcomes,
offering insights that guide future actions and strategies.
Analytics heavily relies on mathematical and statistical techniques. It includes not just interpreting
data, but also using descriptive and predictive models to extract valuable insights. These insights are
then used to support business decisions or recommend courses of action. Rather than focusing
solely on individual analysis tasks, analytics considers the entire framework and methodology behind
data-driven problem-solving.
Although there's no universally accepted classification of data science techniques, a practical
approach is to group them based on their purpose and stage in the data lifecycle. Accordingly, the
following six categories can be used to classify common analytical techniques:
Descriptive Analysis
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Exploratory Analysis
Mechanistic Analysis
Each of these plays a distinct role in how data is interpreted and applied in real-world scenarios.
8. Descriptive analysis
Descriptive analysis focuses on summarizing current data to understand “what is happening now.” It
provides a quantitative overview of the main characteristics of a dataset and is typically the initial
step in data examination.
Key Features
First Step: Often the first analysis done on large datasets like census information.
Volume-Ready: Well-suited for large-scale data.
Two-Phase Process: Data description precedes interpretation.
Descriptive statistics are essential for organizing raw data into a structured form that reveals
patterns. These summaries help identify insights, but accurate interpretation demands appropriate
statistical methods.
Why Use Descriptive Analysis?
Helps group customers by preferences or behavior.
Enables population-wide summaries (e.g., census).
Transforms raw data into meaningful insights.
8.1. Variables
Variables are labels assigned to the data being studied. For example, a column labeled "Age"
in a spreadsheet represents a numeric variable. Variables are categorized as:
Nominal: Represent categories without order (e.g., colors).
Ordinal: Categories with inherent order (e.g., rankings).
Interval: Numeric values with meaningful differences, but no true zero (e.g.,
temperature in Celsius).
Ratio: Numeric values with a true zero, allowing all arithmetic operations (e.g.,
weight, height).
Variables are also classified based on their role:
Independent/Predictor: The cause or influencing factor.
Dependent/Outcome: The effect or the result we are predicting.
8.2. Frequency Distribution
A frequency distribution shows how often each value appears. It can be visualized using:
Histogram: Bars show frequency of numerical values.
Pie Chart: Useful for visualizing categorical data proportions.
Example:
A histogram of productivity scores can quickly show the most frequent ranges of
performance in a workforce.
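A minimal Matplotlib sketch of such a histogram; the productivity scores are synthetic.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic productivity scores for a workforce of 200 people.
scores = np.random.default_rng(4).normal(loc=70, scale=10, size=200)

plt.hist(scores, bins=15, edgecolor="black")
plt.title("Frequency distribution of productivity scores")
plt.xlabel("score")
plt.ylabel("frequency")
plt.show()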
8.3. Normal Distribution
An ideal dataset follows a bell-shaped curve:
Symmetrical Center: Most values near the mean.
Skewness: Indicates asymmetry (positive or negative).
Kurtosis: Measures the "peakedness" or flatness of the distribution.
Visual tools like histograms help assess normality.
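Skewness and kurtosis can also be computed numerically, for example with scipy.stats on synthetic data (SciPy assumed installed).

import numpy as np
from scipy import stats

data = np.random.default_rng(5).normal(loc=0, scale=1, size=1000)

print("skewness:", stats.skew(data))        # near 0 for a symmetric distribution
print("kurtosis:", stats.kurtosis(data))    # excess kurtosis, near 0 for a normal curve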
8.4. Measures of Centrality
These metrics summarize the center of a dataset:
Mean: Arithmetic average. Sensitive to outliers.
Median: Middle value when data is sorted. Better for skewed data.
Mode: Most frequently occurring value. Best for categorical data.
Each measure provides a different lens on the dataset and choosing the right one depends on
the data type and distribution.
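A quick sketch with Python's built-in statistics module shows how the three measures differ on the same (invented) data.

import statistics

ages = [23, 25, 25, 29, 31, 35, 62]        # 62 is an outlier

print("mean:  ", statistics.mean(ages))     # pulled upward by the outlier
print("median:", statistics.median(ages))   # robust middle value
print("mode:  ", statistics.mode(ages))     # most frequent value (25)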
8.5. Dispersion of Distribution
Dispersion tells us how spread out the data is:
Range: Difference between the largest and smallest values.
Interquartile Range (IQR): Range of the middle 50%, useful for reducing outlier influence.
Variance: It is a statistical measure that indicates how much the values in a dataset differ
from the average (mean). It reflects the degree to which data points are spread out. The
more the values deviate from the mean, the higher the variance; conversely, closely grouped
values result in a lower variance. There are two types of variance:
A) Population Variance (σ²)
It measures the dispersion of the entire population. For N population values xᵢ with population mean μ, the formula is:
σ² = Σ (xᵢ − μ)² / N
B) Sample Variance (s²)
It measures the spread within a sample subset of the population and adjusts the
denominator to avoid bias. For n sample values xᵢ with sample mean x̄, the formula is:
s² = Σ (xᵢ − x̄)² / (n − 1)
8.6. Standard Deviation
Although variance measures spread, it presents the result in squared units (e.g., years²), which
is not intuitive. To convert it back into the original unit of measurement, we take the square
root of the variance, resulting in the standard deviation.
The formula for the sample standard deviation is:
s = √[ Σ (xᵢ − x̄)² / (n − 1) ]
Standard deviation gives a clearer idea of the average distance of each data point from the
mean, expressed in the same unit as the original data (e.g., years, dollars, etc.).
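The formulas above map directly onto Python's built-in statistics module, as this small sketch on invented data shows.

import statistics

ages = [23, 25, 29, 31, 35]

print("population variance:", statistics.pvariance(ages))   # divides by N
print("sample variance:    ", statistics.variance(ages))    # divides by n - 1
print("sample std dev:     ", statistics.stdev(ages))       # square root of s²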
9. Diagnostic analytics
Definition:
Diagnostic analytics focuses on understanding why an event occurred. It investigates the
causes behind trends or anomalies identified in descriptive analytics.
Key Characteristics:
Also known as causal analysis.
Explores relationships between causes and outcomes.
Useful for reviewing past performance and identifying contributing factors.
Example Use Case:
In a social media campaign, after using descriptive analytics to track metrics (likes,
comments, shares), diagnostic analytics helps understand why certain posts performed better
than others.
Common Technique – Correlation:
Measures how strongly two variables are related.
Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.
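As a small illustration, a correlation coefficient can be computed with NumPy; the campaign figures below are invented.

import numpy as np

# Hypothetical campaign data: posting hour vs. number of likes.
hours = np.array([8, 10, 12, 14, 16, 18, 20])
likes = np.array([120, 150, 180, 210, 260, 300, 310])

r = np.corrcoef(hours, likes)[0, 1]
print(f"correlation coefficient: {r:.2f}")   # close to +1: strong positive relationship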
10. Predictive analytics
Definition:
Predictive analytics estimates what might happen in the future by analyzing patterns in
current and historical data.
Key Concepts:
Based on probabilities, not certainties.
Generates foresight from hindsight and insights.
Often uses regression to confirm data relationships.
Four-Step Process:
Data Collection – Gather historical and current data.
Data Cleaning – Ensure data quality for meaningful analysis.
Pattern Recognition – Use visualizations and statistical models.
Prediction – Apply models to forecast outcomes.
Applications:
Marketing (forecasting sales)
Healthcare (predicting patient risks)
Finance (credit scoring, fraud detection)
Tools Used: SAS, IBM SPSS, RapidMiner, etc.
Example: Estimating future sales based on historical advertising spend across platforms
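A minimal regression sketch of the advertising-spend example above, using numpy.polyfit on invented figures; dedicated tools such as those listed earlier would be used for anything beyond illustration.

import numpy as np

# Invented historical data: advertising spend vs. sales (arbitrary units).
spend = np.array([10, 15, 20, 25, 30, 35])
sales = np.array([95, 130, 158, 192, 226, 255])

slope, intercept = np.polyfit(spend, sales, deg=1)   # fit a straight line
forecast = slope * 40 + intercept                    # predict sales at spend = 40
print(f"sales = {slope:.1f} * spend + {intercept:.1f}")
print(f"forecast at spend 40: {forecast:.1f}")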
11. Prescriptive analytics
Definition:
Prescriptive analytics identifies the best course of action from available options. It goes
beyond predictions to recommend decisions.
How It Works:
Starts with descriptive/predictive insights.
Evaluates possible decisions and their impact.
Suggests optimal strategies using simulations, game theory, and optimization models.
Use Case in Healthcare:
Analyzing obese patient data along with diabetes and cholesterol levels to determine priority
treatment zones.
Challenges:
Computationally intensive.
Still underutilized in industries (used by only ~3% of organizations).
Benefits:
Provides clear decision paths.
Improves proactive planning.
12. Exploratory analysis
Definition:
Exploratory Data Analysis (EDA) is used when the problem is not clearly defined. It helps
uncover hidden patterns and relationships in data.
Approach:
Involves visualizations (scatterplots, bar charts).
Encourages open-ended exploration without hypotheses.
Acts as a starting point for more formal analysis.
Philosophy:
EDA is more about mindset than tools—it guides how to examine data rather than what
techniques to use.
Application Example:
Using US Census data to search for unexpected trends across dozens of variables before
deciding on specific analytical methods.
Limitations:
Should not be used for drawing conclusions or predictions on its own.
Serves to inform further analysis rather than end it.
13. Mechanistic analysis
Definition:
Mechanistic analysis investigates how specific changes in one or more variables bring about
changes in others, often in physical or controlled systems.
Use Case Examples:
Studying the effect of CO₂ levels on global temperature changes.
Assessing how increasing employee perks affects productivity.
Core Technique – Regression Analysis:
Establishes mathematical relationships between variables.
Helps predict outcomes (e.g., score = f(attitude)) using a linear model of the form:
y = β₀ + β₁x + ε
Where: y is the outcome (dependent variable), x is the predictor (independent variable), β₀ is the intercept, β₁ is the slope, and ε is the error term.
When to Use:
To model cause-effect relationships at an individual object level.
Especially suitable for scientific and industrial applications.
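As a short sketch, the score = f(attitude) relationship above can be fitted with the same linear model; the data points are invented.

import numpy as np

# Invented measurements: attitude rating vs. exam score for individual students.
attitude = np.array([3, 4, 5, 6, 7, 8, 9])
score    = np.array([52, 58, 61, 67, 72, 78, 83])

b1, b0 = np.polyfit(attitude, score, deg=1)
print(f"score ≈ {b0:.1f} + {b1:.1f} * attitude")   # estimated β₀ and β₁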