CLASS: XII
ARTIFICIAL INTELLIGENCE
UNIT 2:
Data Science Methodology: An Analytic Approach to Capstone Project
DATA SCIENCE METHODOLOGY
• A Methodology gives the Data Scientist a framework for designing an AI Project. The framework helps the team decide on the methods, processes, and strategies that will be employed to obtain the correct output required from the AI Project.
• Data Science Methodology is a process with a prescribed sequence of iterative steps that data scientists
follow to approach a problem and find a solution.
• The foundation methodology of Data Science provides deep insight into how every AI project can be solved from beginning to end.
• It was put forward by John Rollins, a Data Scientist at IBM Analytics.
• It consists of 10 steps.
• The ten stages are grouped into five modules, each covering two stages of the methodology and explaining the rationale for why each stage is required.
1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modelling to Evaluation
5. From Deployment to Feedback
1. Business understanding –
“What is the problem that you are trying to solve?”
• In this stage, we first understand the customer’s problem by asking questions and trying to comprehend exactly what they require. With this understanding, we can figure out the objectives that support the customer’s goal. This is also known as Problem Scoping and Definition.
• The team can use the 5W1H Problem Canvas to deeply understand the issue. This stage also involves using the Design Thinking (DT) framework.
• To solve a problem, it's crucial to understand the customer's needs. This can be achieved by asking
relevant questions and engaging in discussions with all stakeholders.
2. Analytic approach –
“How can you use the data to answer the question?”
• When the business problem has been established clearly, the data scientist will be able to define the
analytical approach to solve the problem.
• This stage involves asking the stakeholders further questions so that the AI Project team can decide on the correct approach to solve the problem.
• The different questions that can be asked now are:
1. Do I need to find how much or how many? (Regression)
2. Which category does the data belong to? (Classification)
3. Can the data be grouped? (Clustering)
4. Is there any unusual pattern in the data? (Anomaly detection)
5. Which option should be given to the customer? (Recommendation)
• There are four main types of data analytics:
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
Descriptive Analytics: This summarizes past data to understand what has happened. It is the first step undertaken in data analytics, describing trends and patterns using tools like graphs and charts, and statistical measures like mean, median, and mode to understand the central tendency. This method also examines the spread of the data using range, variance, and standard deviation.
For example: calculating the average marks of students in an exam or analyzing sales data from the previous year.
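A minimal Python sketch of descriptive analytics, assuming a small hypothetical list of exam marks (all numbers are illustrative):

```python
import statistics

# Hypothetical exam marks for a class (illustrative data only)
marks = [72, 85, 60, 91, 78, 85, 66, 74]

# Central tendency
print("Mean:", statistics.mean(marks))
print("Median:", statistics.median(marks))
print("Mode:", statistics.mode(marks))

# Spread of the data
print("Range:", max(marks) - min(marks))
print("Variance:", statistics.variance(marks))
print("Std. deviation:", statistics.stdev(marks))
```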
Diagnostic Analytics: This helps to understand why something has happened. It is normally done by analyzing past data using techniques like root cause analysis, hypothesis testing, and correlation analysis. The main purpose is to identify the causes or factors that led to a certain outcome.
For example: If a company’s sales dropped, diagnostic analytics helps to find the cause by examining questions like “Is it due to poor customer service?” or “Is it due to low product quality?”
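A toy sketch of diagnostic analytics using correlation analysis in pandas; the monthly figures and column names are hypothetical:

```python
import pandas as pd

# Hypothetical monthly data: sales alongside two candidate causes
df = pd.DataFrame({
    "sales":      [120, 110, 95, 80, 70, 65],
    "complaints": [  5,   8, 12, 18, 25, 30],
    "ad_spend":   [ 40,  42, 41, 40, 39, 41],
})

# Correlation analysis: which factor moves together with sales?
print(df.corr()["sales"])
# A strong negative correlation with 'complaints' would point to
# poor customer service as a cause worth investigating further.
```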
Predictive Analytics: This uses past data to make predictions about future events or trends, using techniques like regression, classification, and clustering. Its main purpose is to foresee future outcomes and support informed decision-making.
For example: A company can use predictive analytics to forecast its sales, demand, inventory, customer purchase patterns, etc., based on previous sales data.
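A minimal predictive analytics sketch using linear regression from scikit-learn; the sales figures are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical past sales by month (illustrative data only)
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([100, 108, 115, 123, 130, 138])

# Fit a simple regression model on the historical data
model = LinearRegression().fit(months, sales)

# Forecast sales for the next month
print("Forecast for month 7:", model.predict([[7]])[0])
```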
Prescriptive Analytics: This recommends the action to be taken to achieve a desired outcome, using techniques such as optimization, simulation, and decision analysis. Its purpose is to guide decisions by suggesting the best course of action based on data analysis.
For example: designing the right strategy to increase sales during the festival season by analyzing past data, and thereby optimizing pricing, marketing, production, etc.
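A toy prescriptive sketch: pick the price that maximizes expected revenue under an assumed (hypothetical) demand relationship; in practice the demand model would be learned from past data:

```python
# Assumed relationship: demand falls as price rises (illustrative only)
def demand(price):
    return max(0, 500 - 4 * price)

# Simulate each candidate price and recommend the best one
candidate_prices = range(50, 121, 5)
best = max(candidate_prices, key=lambda p: p * demand(p))
print("Recommended price:", best)
print("Expected revenue:", best * demand(best))
```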
3. Data requirements
“What are the data requirements?”
• This step identifies the data contents, formats, and the sources for data collection.
• The 5W1H questioning method can be employed in this stage also to determine the data requirements.
The data selected should be able to answer all the ‘what’, ‘who’, ‘when’, ‘where’, ‘why’ and ‘how’ questions
about the problem.
• This stage involves defining our data requirements, including the type, format, source, and necessary
preprocessing steps to ensure the data is usable and accurate for our needs.
• identifying the types of data required, such as numbers, words, or images;
• considering the structure in which the data should be organized, whether in a table, text file, or database;
• identifying the sources from which we can collect the data.
• Data for a project can be categorized into three types: structured data (organized in tables, e.g., customer
databases), unstructured data (without a predefined structure, e.g., social media posts, images), and semi-
structured data (having some organization, e.g., emails, XML files).
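A short sketch of how the three data types are typically loaded in Python; the file names are placeholders, not real datasets:

```python
import json
import pandas as pd

# Structured data: rows and columns with a fixed schema
customers = pd.read_csv("customers.csv")      # e.g., a customer database

# Semi-structured data: some organization, but keys may vary per record
with open("orders.json") as f:
    orders = json.load(f)                     # e.g., exported order records

# Unstructured data: free text with no predefined structure
with open("reviews.txt") as f:
    reviews = f.read()                        # e.g., social media posts
```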
4. Data collection
“What occurs during data collection?”
• In this stage, the data scientist identifies all the data resources and collects data in all forms such as
structured, unstructured, and semi-structured data that is relevant to the problem. There are mainly
two sources of data collection:
• Primary Data Source - A primary data source refers to the original source of data, where the data is collected firsthand through direct observation, experimentation, surveys, interviews, or other methods. This data is raw and unprocessed, providing accurate and reliable first-hand information for research, analysis, or decision-making purposes. Examples include marketing campaigns, feedback forms, IoT sensor data, etc.
• Secondary Data Source - A secondary data source refers to data that is already stored and ready for use. Data given in books, journals, websites, internal transactional databases, etc. can be reused for data analysis. Some methods of collecting secondary data are social media data tracking, web scraping, and satellite data tracking. Some sources of online data are data.gov, World Bank Open Data, UNICEF, the Open Data Network, Kaggle, the World Health Organization, Google, etc.
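A minimal sketch of collecting secondary data in Python; the URL is a placeholder for any open-data CSV (e.g., from data.gov or Kaggle):

```python
import pandas as pd

# Placeholder URL - replace with a real open-data CSV link
url = "https://example.com/open-data/sales.csv"
df = pd.read_csv(url)

print(df.shape)   # how many rows and columns were collected
print(df.head())  # first few records, to confirm the data looks right
```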
• Once the data is collected, the data scientist will have a good understanding of what they will be
working with. The Data Collection stage may be revisited after the Data Understanding stage,
where gaps in the data are identified, and strategies are developed to either collect additional
data or make substitutions to ensure data completeness.
5. Data understanding
“Is the data collected representative of the problem to be solved?”
• Data Understanding encompasses all activities related to constructing the dataset. In this stage, we
check whether the data collected represents the problem to be solved or not.
• In this stage, the data scientist tries to understand the data collected. Techniques such as descriptive statistics and visualization (e.g., histograms) can be applied to the dataset to assess its content and quality and to gain initial insights about the data.
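A small sketch of data understanding with descriptive statistics and a histogram; the marks are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset of student marks (illustrative data only)
df = pd.DataFrame({"marks": [72, 85, 60, 91, 78, 85, 66, 74, 58, 88]})

# Descriptive statistics: count, mean, std, min, quartiles, max
print(df["marks"].describe())

# Histogram: a quick visual check of how the values are distributed
df["marks"].plot(kind="hist", bins=5, title="Distribution of marks")
plt.show()
```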
6. Data preparation
“What additional work is required to manipulate and work with the data?”
• This stage comprises all the activities needed to construct the dataset and make it suitable for use in the modeling stage.
• Data preparation includes-
➢ cleaning, i.e., managing missing data, deleting duplicates, changing the data into a uniform format, etc.
➢ combining data from multiple sources (archives, tables, and platforms)
➢ transforming data into meaningful input variables
• Feature Engineering is a part of Data Preparation. The preparation of data is the most time-consuming
step among the Data Science stages.
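A minimal data preparation sketch in pandas covering the cleaning tasks listed above; the records are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with typical problems (illustrative only)
df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Ravi", "Meena"],
    "marks": [85, np.nan, np.nan, 72],
    "city":  ["delhi", "MUMBAI", "MUMBAI", "Chennai"],
})

df = df.drop_duplicates()                             # delete duplicates
df["marks"] = df["marks"].fillna(df["marks"].mean())  # manage missing data
df["city"] = df["city"].str.title()                   # uniform format
print(df)
```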
7. Modeling
“In what way can the data be visualized to get to the required
answer?”
• The modelling stage uses the initial version of the prepared dataset and focuses on developing models according to the analytical approach previously defined. The modelling process is usually iterative, often leading to adjustments in the preparation of the data.
• Data Modelling focuses on developing models that are either descriptive or predictive.
1. Descriptive Modeling: It is a concept in data science and statistics that focuses on summarizing and
understanding the characteristics of a dataset without making predictions or decisions. The goal of
descriptive modeling is to describe the data rather than predict or make decisions based on it. This
includes summarizing the main characteristics, patterns, and trends that are present in the data.
Descriptive modeling is useful when you want to understand what is happening within your data and how
it behaves, but not necessarily why it happens.
Common Descriptive Techniques:
➢ Summary Statistics: measures such as:
o Mean (average), median, mode
o Standard deviation, variance
o Range (difference between the highest and lowest values)
o Percentiles (e.g., quartiles)
➢ Visualizations: graphs and charts to represent the data, such as:
o Bar charts
o Histograms
o Pie charts
o Box plots
o Scatter plots
2. Predictive Modeling: It involves using data and statistical algorithms to identify patterns and trends in
order to predict future outcomes or values. It relies on historical data and uses it to create a model that
can predict future behavior or trends or forecast what might happen next. It involves techniques like
regression, classification, and time-series forecasting, and can be applied in a variety of fields, from
predicting exam scores to forecasting weather or stock prices.
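A toy predictive modeling sketch: training a decision tree classifier on made-up data to predict pass/fail from hours studied:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data (illustrative only)
hours = [[1], [2], [3], [4], [5], [6], [7], [8]]
result = [0, 0, 0, 1, 1, 1, 1, 1]   # 0 = fail, 1 = pass

# Fit the model, then predict outcomes for unseen inputs
model = DecisionTreeClassifier().fit(hours, result)
print(model.predict([[2.5], [6.5]]))
```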
8. Evaluation
“Does the model used really answer the initial question or does it
need to be adjusted?”
• Evaluation measures how well the model predicts correct outcomes that match the labeled test data.
• It involves using test data to measure metrics like accuracy, precision, recall, or F1 score. This helps
determine if the model is reliable and effective before deploying it in real-world situations.
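A minimal sketch of computing the metrics named above with scikit-learn; the label lists are invented for illustration:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: what the test data says vs. what the model said
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```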
• Model evaluation can have two main phases.
➢ First phase – Diagnostic measures
The first phase is the diagnostic measures phase, which ensures that the model works as intended. If the model is predictive, a decision tree can be used to assess whether the response provided by the model matches the original design or requires any adjustments. If the model is a descriptive model that evaluates relationships, a set of tests with known results can be applied and the model refined as needed.
➢ Second phase – Statistical significance test
The second evaluation phase that can be used is the statistical significance test. This type of evaluation can be applied to the model to ensure that the data is processed and interpreted correctly by the model. It is designed to avoid unnecessary second-guessing when the answer is revealed.
9. Deployment
“How does the solution reach the hands of the user?”
• Deployment refers to the stage where the trained AI model is made available to the users in real-world
applications.
• Data scientists must familiarize the stakeholders with the tool produced and how it behaves in different scenarios.
• Once the model is evaluated and the data scientist is confident it will work, it is deployed and put to the
ultimate test.
• Depending on the purpose of the model, it may be rolled out to a limited group of users or in a test
environment, to build up confidence in applying the outcome for use across the board.
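One common way to deploy a model is behind a small web service. A minimal sketch, assuming a scikit-learn model saved earlier with joblib.dump(model, "model.joblib"); the route and file names are illustrative:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")   # the trained, evaluated model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[2.5]]}
    features = request.json["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run()   # a limited test rollout; production would use a real server
```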
10. Feedback
“Is the problem solved?
Has the question been satisfactorily answered?”
• The last stage in the methodology is feedback. This includes results collected from the deployment of the
model, feedback on the model’s performance from the users and clients, and observations from how the
model works in the deployed environment. This process continues till the model provides satisfactory and
acceptable results.
• Feedback from the users will help to refine the model and assess its performance and impact. The process from modelling to feedback is highly iterative. Data scientists may automate some or all of the feedback collection so that the model refresh process speeds up and improved results are obtained more quickly.
• Feedback from users can be received in many ways, such as surveys, website feedback, local media, call centers, support tickets, social media likes, etc.