Introduction to Data Science
Prepared by
Dr.G.DEENA
AP/CSE
SRMIST
Definition
Data Science is a combination of multiple disciplines that uses
statistics, data analysis, and machine learning to analyze data and
to extract knowledge and insights from it.
Key points
• Data gathering, analysis, and decision-making.
• Finding patterns in data through analysis, and making future
predictions.
Example:
Companies make better decisions about particular products.
Predictive analysis: what next?
Hidden Pattern
• Let's consider an example: finding hidden patterns in an online retail purchase dataset to understand
customer behavior.
Scenario:
A dataset from an e-commerce website with the following information about customer transactions:
• Customer ID: Unique identifier for the customer
• Product ID: Unique identifier for the product purchased
• Product Category: The category of the product (e.g., electronics, clothing, groceries)
• Purchase Date: The date when the purchase was made
• Price: The price of the product purchased
• Quantity: The number of items purchased
Example Objective: We want to find hidden patterns, such as:
• What products are commonly bought together?
• At what times are customers most likely to make a purchase?
• Which product categories are popular in different seasons?
• Which products sell the most on weekends?
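Questions like these can be answered with simple grouping and aggregation. The sketch below uses pandas on a small invented table whose columns mirror the fields described above; the data, values, and day-of-week result are purely illustrative.

```python
# Sketch: mining simple patterns from a hypothetical transactions table.
# Column names mirror the fields described above; the data is invented.
import pandas as pd

transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3],
    "ProductCategory": ["electronics", "clothing", "electronics",
                        "clothing", "groceries", "electronics"],
    "PurchaseDate": pd.to_datetime([
        "2024-01-06", "2024-01-06", "2024-01-07",
        "2024-01-07", "2024-01-08", "2024-01-08"]),
    "Price": [120.0, 35.0, 99.0, 40.0, 15.0, 210.0],
    "Quantity": [1, 2, 1, 1, 3, 1],
})

# Which categories are bought together? Collect each customer's basket.
baskets = transactions.groupby("CustomerID")["ProductCategory"].apply(set)

# Which day of the week sees the most revenue?
transactions["Revenue"] = transactions["Price"] * transactions["Quantity"]
by_day = transactions.groupby(
    transactions["PurchaseDate"].dt.day_name())["Revenue"].sum()

print(baskets)
print(by_day.sort_values(ascending=False))
```

On a real dataset the same groupby pattern scales up; only the source of `transactions` changes.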
Application
• Data Science is used in many industries today –
e.g., banking, consultancy, healthcare, and manufacturing.
• Route planning: To discover the best routes to ship the goods.
• To foresee delays for flight/ship/train etc. (through predictive
analysis)
• To find the best-suited time to deliver goods
• To forecast a company's revenue for the next year
• To analyze the health benefits of training
• To predict who will win elections
Facets of Data
Data science works with very large amounts of data, which come in
different types:
• Structured
• Unstructured
• Natural Language
• Graph based
• Machine Generated
• Audio, video and images
Structured Data
• Structured data is arranged in a rows-and-columns format.
• Structured data refers to data that is identifiable because it is
organized in a structure. The most common form of structured
data is a database, where specific information is stored
in columns and rows.
• A database management system (DBMS) is used to store structured data
and to retrieve and process it easily.
• Structured data is also searchable by data type within content.
• An Excel table is an example of structured data.
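The points above can be seen in miniature with Python's built-in sqlite3 module: data is stored as rows under named columns, and the known structure makes specific fields directly queryable. The table name and values here are invented for illustration.

```python
# Sketch: structured data as rows and columns in a database,
# using the standard-library sqlite3 module (table and data invented).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Asha", "Chennai"), (2, "Ravi", "Delhi")])

# Because the structure is known, specific fields can be queried directly.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = 'Chennai'").fetchall()
print(rows)  # a list of (name,) tuples
```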
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows
and columns are not used for unstructured data, so it is
difficult to retrieve information from it. It has no identifiable structure.
• It does not follow any template or rules, and is therefore unpredictable
in nature.
• Most companies hold large amounts of data in unstructured formats.
• E.g., Word documents, email messages, customer feedback, audio,
video, and images.
Natural Language
• It is a special type of unstructured data.
• Natural language processing (NLP) enables machines to recognize
characters, words, and sentences, then apply meaning and
understanding to that information.
• NLP is used for entity recognition, topic
recognition, summarization, text classification, and sentiment analysis.
Data Science Process
• The data science process is a powerful toolkit that helps us
unlock hidden knowledge from the available data.
• The data science process is a systematic approach to extracting
knowledge and insights from data.
• It’s a structured framework that guides data scientists through
a series of steps, from defining a problem to communicating
actionable results.
Data Science Process Life Cycle
Framing the Problem
• The process begins with a clear understanding of the problem or
question.
• This step defines the project's objectives and goals.
• A well-defined problem statement acts as a compass, guiding the
entire data science process and ensuring the desired outcomes.
Data Collection
• Once the problem is clearly defined, the next step is to collect
the relevant data.
• This involves identifying relevant data sources, whether
internal databases, external APIs, or publicly available
datasets.
• Data scientists must carefully consider the types of data
needed.
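Collected data often arrives as CSV files. A minimal sketch with the standard-library csv module, with the file contents simulated in memory (in practice this would be a downloaded file or an API response):

```python
# Sketch: reading collected CSV data with the standard library.
# io.StringIO stands in for a real file or API response.
import csv
import io

raw = io.StringIO("CustomerID,Price,Quantity\n1,120.0,1\n2,35.0,2\n")
reader = csv.DictReader(raw)
records = [row for row in reader]

print(records[0]["Price"])  # values arrive as strings: "120.0"
```

Note that CSV values arrive as strings; converting them to numeric types is part of the cleaning stage.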
Data Cleaning
• Raw data is often messy, with errors, missing values, and
inconsistencies.
• Cleaning involves removing duplicates, filling in missing values, and
transforming data into a format suitable for further exploration.
• The data cleaning phase is about removing unwanted values and filling
in missing ones, ensuring the data is accurate, complete, and
ready for analysis.
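The two most common cleaning steps, deduplication and missing-value imputation, are one-liners in pandas. The data below is invented; filling with the column mean is just one of several possible strategies.

```python
# Sketch of typical cleaning steps with pandas (data invented).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price": [120.0, np.nan, 99.0, 99.0],
    "Quantity": [1, 2, 1, 1],
})

df = df.drop_duplicates()                             # remove duplicate rows
df["Price"] = df["Price"].fillna(df["Price"].mean())  # fill missing values
print(df)
```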
Exploratory Data Analysis (EDA)
• EDA is the detective work of data science.
• It’s about uncovering hidden patterns, trends, and anomalies.
• Data scientists use a variety of techniques, including summary
statistics, visualizations, and interactive tools, to gain a deeper
understanding of the data’s characteristics and their
relationships.
• This stage is crucial for identifying potential avenues for further
investigation.
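Summary statistics and per-group aggregates are the usual starting point for EDA. A minimal pandas sketch on invented data:

```python
# Sketch: quick exploratory summaries with pandas (data invented).
import pandas as pd

df = pd.DataFrame({
    "Category": ["electronics", "clothing", "electronics", "groceries"],
    "Revenue": [120.0, 70.0, 99.0, 45.0],
})

print(df["Revenue"].describe())   # count, mean, std, min, quartiles, max
per_category = df.groupby("Category")["Revenue"].sum()
print(per_category)               # total revenue per category
```

In practice these summaries are paired with visualizations (histograms, scatter plots) to spot trends and anomalies.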
Model Building
• In this phase, data scientists build models that can predict
future outcomes or classify data into different categories.
• These models are often based on machine learning
algorithms or statistical techniques.
• The choice of model depends on the problem at hand and the
nature of the data.
• Once the model is chosen, it’s trained on the prepared data
to learn patterns and relationships.
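As a minimal stand-in for a full machine-learning pipeline, the sketch below fits a straight line with NumPy least squares and uses it to predict the next value. The x/y series are invented; a real project would use a proper training/validation split and a model suited to the problem.

```python
# Minimal "model building" sketch: fit y = a*x + b with NumPy least
# squares, standing in for a full ML training pipeline (data invented).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # e.g. month number
y = np.array([10.0, 20.0, 30.0, 40.0])  # e.g. revenue

a, b = np.polyfit(x, y, deg=1)  # learn slope and intercept from the data
predicted = a * 5.0 + b         # predict the next point
print(round(predicted, 2))
```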
Model Deployment
• Once a model is trained and validated, it’s time to put it to work.
• Model deployment involves integrating the model into a
production environment, where it can be used to make
predictions or inform decision-making.
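At its simplest, deployment means persisting the trained model so another process can load it and serve predictions. The sketch below uses pickle on a dict of fitted coefficients; real deployments would serve a proper model object behind an API.

```python
# Sketch: persisting a trained "model" so another service can load it.
# The model here is just a dict of fitted coefficients (invented values).
import pickle

model = {"slope": 10.0, "intercept": 0.0}
blob = pickle.dumps(model)      # what would be written to disk

loaded = pickle.loads(blob)     # what the production service loads
prediction = loaded["slope"] * 5.0 + loaded["intercept"]
print(prediction)  # 50.0
```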
Communicating Results
• The final stage of the data science process involves
communicating the findings and insights to stakeholders.
• This includes creating clear and concise reports, presentations,
and visualizations that effectively convey the results and their
implications.
• The goal is to ensure that stakeholders understand the analysis,
trust the conclusions, and can use the insights to make decisions.
Introduction to NumPy
• NumPy is a Python library used for working with arrays.
• NumPy stands for Numerical Python.
• It is the fundamental package for mathematical and logical
operations on arrays.
• An array is a homogeneous collection of data.
• The values can be numbers, characters, or Booleans.
• It can be installed from a Jupyter Notebook with: pip install numpy
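The points above in code: a NumPy array holds one shared element type and supports element-wise mathematical and logical operations.

```python
# Basic NumPy usage: a homogeneous array with element-wise operations.
import numpy as np

a = np.array([1, 2, 3, 4])
print(a.dtype)    # one shared type for all elements, e.g. int64
print(a * 2)      # element-wise arithmetic: [2 4 6 8]
print(a > 2)      # element-wise logic: [False False  True  True]
print(a.sum())    # 10
```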
