B19ADT301 – FUNDAMENTALS
OF DATA SCIENCE
Dr. C.Deepa
AI&DS
KIT-Kalaignarkarunanidhi Institute of Technology
Course Objectives
1. To introduce the basic concepts of Data Science.
2. To understand the mathematical skills in statistics.
3. To acquire the skills in data pre-processing steps.
4. To learn the concepts of feature selection
algorithms in machine learning.
5. To learn the concept of clustering approaches and
to visualize the processed data using visualization
techniques.
Unit I
INTRODUCTION
• Need for Data Science – Benefits and uses –
Facets of data – Types of data- Organization of
data- Data Science process- Data Science life
cycle- Role of Data Science- Big Data – sources
and characteristics of Big Data
Introduction
• Data science is the field of study that combines domain expertise,
programming skills, and knowledge of mathematics and statistics to
extract meaningful insights from data.
• Data Science is a blend of various tools, algorithms, and machine
learning principles with the goal to discover hidden patterns from
the raw data.
• Data science is the application of computational and statistical
techniques to address or gain insight into some problem in the real
world
• The data science process is a systematic approach to solving a data
problem. It provides a structured framework for articulating your
problem as a question, deciding how to solve it, and then presenting
the solution to stakeholders.
Introduction
Data science = statistics + data processing + machine learning +
scientific inquiry + visualization + business
analytics + big data + …
Why is data science important?
• Data science plays an important role in virtually all aspects of
business operations and strategies.
• For example, it provides information about customers that
helps companies create stronger marketing campaigns and
targeted advertising to increase product sales.
• It aids in managing financial risks, detecting fraudulent
transactions and preventing equipment breakdowns in
Introduction
• Data science uses the most powerful hardware, programming
systems, and most efficient algorithms to solve the data related
problems. It is the future of artificial intelligence.
• In short, we can say that data science is all about:
• Asking the correct questions and analyzing the raw data.
• Modeling the data using various complex and efficient
algorithms.
• Visualizing the data to get a better perspective.
• Understanding the data to make better decisions and finding the
final result.
Example
• Suppose we want to travel from station A to station B by
car.
• We need to make some decisions: which route will get us to
the destination fastest, which route will have no traffic jam,
and which route will be the most cost-effective.
• These decision factors act as the input data, and analyzing
them gives us an appropriate answer. This analysis of data is
called data analysis, which is a part of data science.
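The route decision above can be sketched in a few lines of Python. This is only an illustration: the route names, travel times, and costs are hypothetical values, and the rule "fastest route without a traffic jam, cheaper one on a tie" is one assumed way of combining the three decision factors.

```python
# Hypothetical candidate routes from station A to station B.
routes = [
    {"name": "highway", "minutes": 45, "traffic_jam": False, "cost": 12.0},
    {"name": "city centre", "minutes": 35, "traffic_jam": True, "cost": 6.0},
    {"name": "ring road", "minutes": 50, "traffic_jam": False, "cost": 8.0},
]


def best_route(candidates):
    """Pick the fastest jam-free route, breaking ties by lower cost."""
    # Fall back to all routes if every route has a traffic jam.
    jam_free = [r for r in candidates if not r["traffic_jam"]] or candidates
    return min(jam_free, key=lambda r: (r["minutes"], r["cost"]))


choice = best_route(routes)
print(choice["name"])
```

Changing the key function (for example, putting cost first) changes the answer, which is the point: the decision depends on how the input data is weighed.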
Need for Data Science
Following are some main reasons for
using data science technology
• With the help of data science technology, we can convert the
massive amount of raw and unstructured data into meaningful
insights.
• Data science technology is being adopted by many companies,
whether big brands or startups. Google, Amazon, Netflix, etc.,
which handle huge amounts of data, use data science
algorithms for a better customer experience.
• Data science is helping to automate transportation, for example
by creating self-driving cars, which are the future of
transportation.
• Data science can help with different kinds of predictions, such
as surveys, elections, flight ticket confirmation, etc.
Data science Jobs
• Data scientists are the experts who can use various
statistical tools and machine learning algorithms to
understand and analyze the data.
• The average salary for a data scientist ranges from
approximately $95,000 to $165,000 per annum, and according to
various studies, about 11.5 million jobs will be created
by the year 2026.
Types of Data Science Job
• Learning data science opens up a variety of exciting job
roles in this domain. The main job roles are given below:
• Data Scientist
• Data Analyst
• Machine learning expert
• Data engineer
• Data Architect
• Data Administrator
• Business Analyst
• Business Intelligence Manager
Data Science Components
Tools for Data Science
• Following are some tools required for data science:
• Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R
Studio, MATLAB, Excel, RapidMiner.
• Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend,
AWS Redshift
• Data Visualization tools: R, Jupyter, Tableau, Cognos.
• Machine learning tools: Spark, Mahout, Azure ML studio.
Data Science Lifecycle
1. Discovery:
• The first phase is discovery, which involves asking the right
questions.
• When you start any data science project, you need to
determine the basic requirements, priorities, and project
budget.
• In this phase, we determine all the requirements of the
project, such as the number of people, technology, time, data,
and the end goal, and then we can frame the business problem
as a first hypothesis.
2. Data preparation:
Data preparation is also known as Data Munging. In this phase, we need to
perform the following tasks:
• Data cleaning
• Data Reduction
• Data integration
• Data transformation
After performing all the above tasks, we can easily use this data in our
further processes.
3. Model Planning:
In this phase, we need to determine the methods and techniques for
establishing the relationships between input variables. We apply exploratory
data analysis (EDA), using statistical formulas and visualization tools, to
understand the relationships between variables and to see what the data can tell us.
Common tools used for model planning are:
• SQL Analysis Services, R, SAS, Python
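As a small illustration of the statistical side of EDA, the sketch below summarizes two variables and measures how strongly they move together using the Pearson correlation coefficient. The variable names and values are made up for the example, and the `pearson` helper is written by hand here rather than taken from any particular tool listed above.

```python
from statistics import mean

# Hypothetical observations for two input variables.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


r = pearson(hours_studied, exam_score)
print(f"mean score = {mean(exam_score)}, correlation = {r:.3f}")
```

A value of r near +1 or -1 suggests a strong linear relationship worth modelling; a value near 0 suggests the two variables are not linearly related.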
4. Model-building:
In this phase, the process of model building starts. We create datasets for training
and testing purposes, and we apply techniques such as association,
classification, and clustering to build the model.
• Following are some common Model building tools:
• SAS Enterprise Miner
• WEKA
• SPSS Modeler
• MATLAB
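Creating the training and testing datasets mentioned in this phase can be sketched as follows. This is a minimal version using only the Python standard library; the function name `train_test_split`, the 25% test fraction, and the fixed seed are assumptions made for the example, not requirements of any of the tools above.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)      # fixed seed keeps the split repeatable
    shuffled = rows[:]             # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(20))          # stand-in for 20 data records
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is fitted on `train` only, and `test` is held back to check how well the model works on data it has not seen.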
5. Operationalize:
In this phase, we will deliver the final reports of the project, along with briefings,
code, and technical documents. This phase provides you a clear overview of complete
project performance and other components on a small scale before the full
deployment.
6. Communicate results:
In this phase, we check whether we have reached the goal that we set in the initial
phase. We communicate the findings and the final result to the business team.
Applications of Data Science:
• Image recognition and speech recognition
– Ok Google, Siri, Cortana
• Gaming world
– EA Sports, Sony, Nintendo
• Internet search
– Google, Yahoo, Bing
• Transport
– self-driving cars.
• Healthcare
– tumor detection, drug discovery, medical image analysis,
virtual medical bots
• Recommendation systems
– suggestions for similar products
• Risk detection
– issue of fraud and risk of losses
Benefits and uses of Data Science
• Improves Business Predictions
• Business Intelligence
• Helps in Sales & Marketing
• Complex Data Interpretation
• Helps in Making Decisions
• Automating Recruitment Processes
Facets of Data
The main categories of data are:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured Data
Structured data is data that depends on a data model and
resides in a fixed field within a record.
As such, it’s often easy to store structured data in tables within
databases or Excel files (figure 1.1)
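To make "fixed fields within a record" concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `employee` table, its columns, and the two rows are hypothetical examples.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The data model: every record has the same fixed fields.
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO employee (name, age) VALUES (?, ?)",
    [("Asha", 31), ("Ravi", 45)],
)

# Because the structure is known, querying is straightforward.
rows = conn.execute("SELECT name FROM employee WHERE age > 40").fetchall()
print(rows)
conn.close()
```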
Unstructured data
❖ Unstructured data is data that isn’t easy to fit into a data
model because the content is context-specific or varying.
❖ One example of unstructured data is your regular email.
❖ Although email contains structured elements such as the
sender, title, and body text, it’s a challenge to find the
number of people who have written an email complaint
about a specific employee because so many ways exist to
refer to a person, for example.
❖ The thousands of different languages and dialects out there
further complicate this.
Natural Language
❖ Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of specific
data science techniques and linguistics
Machine-generated data
➢ Machine-generated data is information that’s automatically
created by a computer, process, application, or other
machine without human intervention. Machine-generated
data is becoming a major data resource and will continue to
do so
➢ The analysis of machine data relies on highly scalable tools,
due to its high volume and speed. Examples of machine
data are web server logs, call detail records, network event
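Web server logs are a typical case of machine data. The sketch below parses a few log lines and counts HTTP status codes. The log lines are invented samples written in the widely used Common Log Format, and the regular expression is an assumption that only covers lines of that shape.

```python
import re
from collections import Counter

# Assumed layout: ip - - [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

log_lines = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153',
    '203.0.113.5 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 200 512',
]

status_counts = Counter()
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match:                      # skip lines that do not fit the layout
        status_counts[match.group("status")] += 1

print(dict(status_counts))
```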
Graph-based or network data
❖ “Graph data” can be a confusing term because any data can
be shown in a graph.
❖ “Graph” in this case points to mathematical graph theory.
❖ In graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects.
❖ Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
❖ The graph structures use nodes, edges, and properties to
represent and store graphical data.
❖ Graph-based data is a natural way to represent social
networks, and its structure allows you to calculate specific
metrics such as the influence of a person and the shortest
path between two people.
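The two metrics mentioned above can be sketched on a tiny social network. The people and friendships are hypothetical; the shortest path is found with breadth-first search, and "influence" is approximated here simply by the number of direct connections, which is only one of several possible measures.

```python
from collections import deque

# Nodes are people; each edge is a friendship (stored in both directions).
friends = {
    "Ana": ["Ben", "Cara"],
    "Ben": ["Ana", "Dev"],
    "Cara": ["Ana", "Dev"],
    "Dev": ["Ben", "Cara", "Eli"],
    "Eli": ["Dev"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: returns one shortest path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = shortest_path(friends, "Ana", "Eli")
most_connected = max(friends, key=lambda person: len(friends[person]))
print(path, most_connected)
```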
Audio, image, and video
❖ Audio, image, and video are data types that pose specific
challenges to a data scientist.
❖ Tasks that are trivial for humans, such as recognizing objects
in pictures, turn out to be challenging for computers. MLBAM
(Major League Baseball Advanced Media) announced in 2014
that they’ll increase video capture to approximately 7 TB per
game for the purpose of live, in-game analytics.
❖ High-speed cameras at stadiums will capture ball and athlete
movements to calculate in real time, for example, the path
taken by a defender relative to two baselines.
Streaming data
❖ While streaming data can take almost any of the previous
forms, it has an extra property.
❖ The data flows into the system when an event happens
instead of being loaded into a data store in a batch.
❖ Although this isn’t really a different type of data, we treat it
here as such because you need to adapt your process to deal
with this type of information.
❖ Examples are the “What’s trending” on Twitter, live sporting
or music events, and the stock market.
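The difference between batch and streaming processing can be sketched with Python generators. Here `price_stream` is a hypothetical stand-in for a live feed such as stock prices, and the running mean is updated as each event arrives instead of after all the data has been loaded.

```python
def price_stream():
    """Stand-in for a live feed: yields one hypothetical price at a time."""
    for price in [101.0, 102.5, 99.5, 103.0]:
        yield price


def running_mean(stream):
    """Update the mean on every event without storing the whole stream."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


means = list(running_mean(price_stream()))
print(means)
```

Only the count and the running total are kept in memory, so the same approach works when the stream never ends.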
The data science process
1. Setting the research goal
❖ Data science is mostly applied in the context of an
organization.
❖ When the business asks you to perform a data science
project, you’ll first prepare a project charter.
❖ This charter contains information such as what you’re going
to research, how the company benefits from that, what data
and resources you need, a timetable, and deliverables
❖ Define the research goal
❖ Create the research charter
1. Setting the research goal
1. Spend time understanding the goals and context of your research
2. Create a project charter
A project charter requires teamwork, and your input covers at least the
following:
❖ A clear research goal
❖ The project mission and context
❖ How you’re going to perform your analysis
❖ What resources you expect to use
❖ Proof that it’s an achievable project, or proof of concepts
❖ Deliverables and a measure of success
❖ A timeline
2. Retrieving data
❖ Start with data stored within the company
❖ Get the external data
➢ Open data sites and their descriptions:
➢ Data.gov: the home of the US Government’s open data
➢ https://open-data.europa.eu/: the home of the European
Commission’s open data
➢ Freebase.org: an open database that retrieves its information
from sites like Wikipedia, MusicBrainz, and the SEC archive
➢ Data.worldbank.org: open data initiative from the World Bank
➢ Aiddata.org: open data for international development
➢ Open.fda.gov: open data from the US Food and Drug
Administration
❖ Do data quality checks now to prevent problems later
3. Data Preparation
Step 3: Cleansing, integrating, and transforming data
1. Cleansing data
❖ Data cleansing is a subprocess of the data science process that focuses
on removing errors in your data so your data becomes a true and
consistent representation of the processes it originates from.
❖ DATA ENTRY ERRORS
❖ REDUNDANT WHITESPACE
❖ FIXING CAPITAL LETTER MISMATCHES
❖ OUTLIERS
❖ DEALING WITH MISSING VALUES
❖ DEVIATIONS FROM A CODE BOOK
❖ DIFFERENT UNITS OF MEASUREMENT
❖ DIFFERENT LEVELS OF AGGREGATION
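A few of the cleansing tasks listed above (redundant whitespace, capital letter mismatches, outliers, and missing values) can be sketched as follows. The records, the plausible temperature range of -50 to 60 °C, and the choice to fill gaps with the median are all assumptions made for this example; other projects may need different rules.

```python
from statistics import median

# Hypothetical raw records with typical data quality problems.
raw = [
    {"city": " Chennai ", "temp_c": "31"},   # redundant whitespace
    {"city": "chennai", "temp_c": ""},       # capital mismatch, missing value
    {"city": "CHENNAI", "temp_c": "300"},    # outlier (impossible reading)
    {"city": "Madurai", "temp_c": "34"},
]


def clean(records, low=-50.0, high=60.0):
    cleaned = []
    for rec in records:
        city = rec["city"].strip().title()       # fix whitespace and case
        text = rec["temp_c"].strip()
        temp = float(text) if text else None     # empty string means missing
        if temp is not None and not (low <= temp <= high):
            temp = None                          # treat outliers as missing
        cleaned.append({"city": city, "temp_c": temp})
    # Fill missing values with the median of the valid readings.
    fill = median(r["temp_c"] for r in cleaned if r["temp_c"] is not None)
    for rec in cleaned:
        if rec["temp_c"] is None:
            rec["temp_c"] = fill
    return cleaned


cleaned = clean(raw)
print(cleaned)
```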
2. Correct errors as early as possible
3. Combining data from different data sources
THE DIFFERENT WAYS OF COMBINING DATA
❖ JOINING TABLES
❖ APPENDING TABLES
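The two ways of combining data can be sketched with plain Python lists of records. The customer and order tables are hypothetical; appending stacks tables that share the same columns, while joining enriches each row with columns from another table through a shared key (here `customer_id`).

```python
# Hypothetical tables.
customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]
orders_jan = [
    {"customer_id": 1, "amount": 250},
    {"customer_id": 2, "amount": 90},
]
orders_feb = [{"customer_id": 1, "amount": 40}]

# Appending: same columns, more rows.
all_orders = orders_jan + orders_feb

# Joining: look up each order's customer by the shared key.
by_id = {c["customer_id"]: c for c in customers}
joined = [
    {**order, "name": by_id[order["customer_id"]]["name"]}
    for order in all_orders
    if order["customer_id"] in by_id     # keep only orders with a match
]
print(joined)
```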
4. Transforming data
❖ REDUCING THE NUMBER OF VARIABLES
➢ Having too many variables in your model makes the
model difficult to handle, and certain techniques don’t
perform well when you overload them with too many
input variables
❖ TURNING VARIABLES INTO DUMMIES
➢ Variables can be turned into dummy variables, which take
only two values (True/False) and indicate whether a
given category applies to an observation
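Turning a categorical variable into dummies can be sketched as below. The helper name `to_dummies`, the `is_` column prefix, and the weekday values are assumptions for illustration; each category becomes its own True/False column.

```python
def to_dummies(values):
    """Return one dict per observation with a True/False flag per category."""
    categories = sorted(set(values))
    return [
        {f"is_{category}": value == category for category in categories}
        for value in values
    ]


weekdays = ["Mon", "Tue", "Mon"]
dummies = to_dummies(weekdays)
print(dummies)
```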
4. Exploratory data analysis
❖ Brushing and linking: with brushing and linking you
combine and link different graphs and tables (or views) so
changes in one graph are automatically transferred to the
other graphs.
5. Build the model
1. Selection of a modeling technique
You’ll need to consider model performance and whether your project meets all the
requirements to use your model, as well as other factors:
■ Must the model be moved to a production environment and, if so, would it be
easy to implement?
■ How difficult is the maintenance on the model: how long will it remain relevant
if left untouched?
■ Does the model need to be easy to explain?
When the thinking is done, it’s time for action.
2. Execution of the model
3. Model diagnostics and model comparison
6. Presentation and Automation
❖ This is an exciting part
❖ Presenting your results to the stakeholders and
industrializing your analysis process for repetitive reuse and
integration with other tools.
Data Science Process - Summary
❖ Setting the research goal—Defining the what, the why, and the how of your
project in a project charter.
❖ Retrieving data—Finding and getting access to data needed in your project.
This data is either found within the company or retrieved from a third party.
❖ Data preparation—Checking and remediating data errors, enriching the data
with data from other data sources, and transforming it into a suitable format
for your models.
❖ Data exploration—Diving deeper into your data using descriptive statistics
and visual techniques.
❖ Data modeling—Using machine learning and statistical techniques to achieve
your project goal.
❖ Presentation and automation—Presenting your results to the stakeholders
and industrializing your analysis process for repetitive reuse and integration
with other tools.
Big data
❖ Big data refers to larger, more complex data sets, especially from new
data sources.
❖ Data that is very large in size is called Big Data.
❖ Big data is data that contains greater variety, arriving in increasing
volumes and with more velocity. These are known as the three Vs:
➢ Volume: high volumes of low-density, unstructured data must be processed
➢ Velocity: the fast rate at which data is received and (perhaps) acted on
➢ Variety: the many types of data that are available
❖ Big data benefits:
➢ Big data makes it possible for you to gain more complete answers
because you have more information.
➢ More complete answers mean more confidence in the data—which
means a completely different approach to tackling problems.
Sources of Big Data
❖ Social networking sites: Facebook, Google, and LinkedIn all
generate huge amounts of data on a day-to-day basis, as they have
billions of users worldwide.
❖ E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate
huge amounts of logs from which users’ buying trends can be traced.
❖ Weather stations: Weather stations and satellites produce very
large volumes of data, which are stored and processed to forecast weather.
❖ Telecom companies: Telecom giants like Airtel and Vodafone study
user trends and publish their plans accordingly, and for this they
store the data of their millions of users.
❖ Share market: Stock exchanges across the world generate huge
amounts of data through their daily transactions.
Characteristics of Big Data
Five V's explain the characteristics of Big Data.
Characteristic of Big data
1. Volume
❖ The name Big Data itself is related to an enormous size.
❖ Big Data is a vast volume of data generated daily from many
sources, such as business processes, machines, social
media platforms, networks, human interactions, and many
more.
❖ Facebook, for example, generates approximately a billion
messages, records about 4.5 billion clicks of the "Like" button,
and receives more than 350 million new posts each day.
❖ Big data technologies can handle large amounts of data.
Characteristic of Big data
2. Variety
❖ Big Data can be structured, unstructured, or semi-structured,
and it is collected from different sources.
❖ In the past, data was collected only from databases and
spreadsheets, but these days data comes in an array of forms,
such as PDFs, emails, audio, social media posts, photos,
videos, etc.
Characteristic of Big data
The data is categorized as below:
❖ Structured data: Structured data follows a defined schema with all the
required columns and is in tabular form. It is stored in a relational
database management system. OLTP (Online Transaction Processing)
systems are built to work with structured data stored in relations,
i.e., tables.
❖ Semi-structured data: In semi-structured data, the schema is not
strictly defined, e.g., JSON, XML, CSV, TSV, and email.
❖ Unstructured data: Unstructured files such as log files, audio files, and
image files are included in unstructured data. Some organizations
have a lot of data available but do not know how to derive value
from it because the data is raw.
❖ Quasi-structured data: Textual data with inconsistent formats that
can be brought into a usable format only with effort, time, and
suitable tools.
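To illustrate the step from semi-structured to structured data, the sketch below reads a small JSON document in which records do not all carry the same fields and flattens it into rows with fixed columns. The payload, the field names, and the `tag_count` column are hypothetical.

```python
import json

# Semi-structured input: the second record has no "tags" field.
payload = '[{"user": "asha", "tags": ["sale", "new"]}, {"user": "ravi"}]'
records = json.loads(payload)

# Structured output: every row has exactly the same two columns.
rows = [
    {"user": record["user"], "tag_count": len(record.get("tags", []))}
    for record in records
]
print(rows)
```

Using `record.get("tags", [])` supplies a default for the missing field, which is the kind of decision a fixed schema forces you to make explicitly.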
Characteristic of Big data
3. Veracity
❖ Veracity refers to how reliable the data is. There are many
ways to filter or translate the data.
❖ Veracity also covers the ability to handle and manage
data efficiently.
❖ Reliable Big Data is also essential in business development.
❖ An example is Facebook posts with hashtags.
4. Value
❖ Value is an essential characteristic of big data.
❖ Value does not refer to just any data that we process or store.
❖ It refers to the valuable and reliable data that we store,
process, and also analyze.
5. Velocity
❖ Velocity plays an important role compared to the other
characteristics.
❖ Velocity is the speed at which data is created, often in
real time.
❖ It covers the speed of incoming data sets, the rate of
change, and activity bursts.
❖ A primary aspect of Big Data is to provide the demanded data
rapidly.
❖ Big data velocity deals with the speed at which data flows from
sources such as application logs, business processes, networks,
social media sites, sensors, and mobile devices.
Applications of big data
1. Travel and Tourism - forecasts, multiple locations
2. Financial and banking sector - Investment pattern, shopping
trends
3. Healthcare- Predictive analytics
4. Telecommunication and media
5. Government and Military - hacking, online fraud
6. E-Commerce - Amazon
7. Social media - Facebook(Audio, video, msg exchange)