INTRODUCTION TO DATA SCIENCE
MODULE 1
TOPICS TO BE COVERED
◼ What is Data?
◼ Different kinds of data
◼ Data Science Process or lifecycle
◼ Data Pre-processing:
◼ Data quality assessment
◼ Data Cleaning
◼ Data Integration and Transformation
◼ Data Reduction
◼ Data Discretization and Concept Hierarchy Generation
WHAT IS DATA?
◼ Data is defined as raw facts and figures collected together and stored in a database.
◼ Data can be in structured, semi-structured, or unstructured format.
◼ Data are records collected in various ways. A large number of sources generate data, and this data
arrives in different formats.
◼ For example, if the number of males and females in a specific location is counted, the resulting
counts are data, because they are facts.
WHAT IS DATA? - CONTD..
◼ Data can be processed to get information.
◼ Processing can include combining data from various sources, collecting data, converting or
transforming data into a specific format, summarizing, modelling etc.
◼ Data can be measured, collected, presented, and analyzed by using various tools.
◼ Once data has been collected from various sources it is still in its raw format, and hence it is known as
raw data.
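◼ A minimal sketch of processing raw data into information, assuming the raw data is a list of observations like the male/female count mentioned earlier. The record values and the `summarize` helper are hypothetical and only illustrate the summarizing step.

```python
from collections import Counter

# Hypothetical raw records: one entry per person counted at a location.
raw_records = ["male", "female", "female", "male", "female"]


def summarize(records):
    """Turn raw records into information: a count per category."""
    return dict(Counter(records))


summary = summarize(raw_records)
print(summary)
```

◼ The raw list is data; the per-category counts are the information obtained by processing it.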
WHAT IS DATA? - CONTD..
◼ Raw data then has to undergo the important phase of cleaning.
◼ Data cleaning is used to remove garbage and unwanted values.
◼ Raw data collected from various sources will not always contain proper values, so a data cleaning
phase followed by a data evaluation phase provides assurance about the genuineness of the data.
◼ Data is usually evaluated by comparing it with standard values or by having experts validate it.
WHAT IS DATA? - CONTD..
◼ Data - Raw Facts and Figures.
◼ Information - Processed Data provides Information.
◼ Knowledge - Mastering the use of information in a particular
fashion provides knowledge.
◼ Wisdom - Application of the knowledge is known as wisdom.
DIFFERENT KINDS OF DATA
◼ Qualitative Data: Binomial Data, Ordinal Data, Nominal Data
◼ Quantitative Data: Discrete Data, Continuous Data
QUANTITATIVE DATA
◼ It is represented using numbers and covers anything that can be measured or counted, such as
height, weight, width, length, etc.
◼ Discrete Data: Data that can be counted in whole units is discrete data. It mostly takes integer values. For
example, the number of children in a family and the number of players in a cricket team are discrete values.
◼ Continuous Data: Continuous data can be divided into finer and finer levels and is usually stored as floating-point values.
Examples of continuous data are height, weight, and length.
QUALITATIVE DATA
◼ It provides the characteristics and descriptors which cannot be easily measured.
◼ Qualitative data can be observed subjectively. For example: smell, taste, texture, color, etc.
◼ Binomial data: Binomial means two values, similar to binary data. The two values can be true
or false, yes or no, accept or reject, right or wrong, etc.
◼ Nominal data: Also known as unordered data. The individual elements have no ranking, but they do
fall into categories. For example, 10 items of different colors can be categorized by color, and when
a new item arrives it is easily placed into one of those categories.
◼ Ordinal data: Also known as ordered data. The elements follow some kind of order. For example,
short, medium, and tall can be three categories for height, and their names follow a natural order.
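◼ A small sketch of the practical difference between ordinal and nominal data, assuming plain Python lists. The names `HEIGHT_ORDER`, `ordinal_rank`, and `group_by_category`, and the sample items, are made up for illustration.

```python
# Ordinal data: the categories carry an order, so values can be ranked.
HEIGHT_ORDER = ["short", "medium", "tall"]  # illustrative ordering


def ordinal_rank(value, order=HEIGHT_ORDER):
    """Return the position of an ordinal value within its ordering."""
    return order.index(value)


def group_by_category(items):
    """Nominal data has no ranking, so items can only be grouped by category."""
    groups = {}
    for name, category in items:
        groups.setdefault(category, []).append(name)
    return groups


heights = ["tall", "short", "medium"]
sorted_heights = sorted(heights, key=ordinal_rank)

items = [("ball", "red"), ("cup", "blue"), ("pen", "red")]
by_color = group_by_category(items)

print(sorted_heights)
print(by_color)
```

◼ Sorting is meaningful for the height labels, while the colors can only be grouped, not ranked.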
DATA PRE-PROCESSING
◼ Data preprocessing is a step in the data mining and data analysis process that takes
raw data and transforms it into a format that can be understood and analyzed by
computers and machine learning models.
◼ Raw, real-world data in the form of text, images, video, etc., is messy. Not only
may it contain errors and inconsistencies, but it is often incomplete, and doesn’t
have a regular, uniform design.
DATA PRE-PROCESSING PROCESS
◼ Data quality assessment
◼ Data cleaning
◼ Data transformation
◼ Data reduction
DATA QUALITY ASSESSMENT
◼ Data quality and consistency are very important.
◼ There are a number of data anomalies and inherent problems to look out for in almost any data set, for example:
◼ Mismatched data types: When you collect data from many different sources, it may come to you in different formats.
While the ultimate goal of this entire process is to reformat your data for machines, you still need to begin with similarly
formatted data. For example, if part of your analysis involves family income from multiple countries, you’ll have to convert
each income amount into a single currency.
◼ Mixed data values: Perhaps different sources use different descriptors for features – for example, man or male. These
value descriptors should all be made uniform.
◼ Example: effect of one extreme value on the mean
mean(1, 2, 3, 4, 5, 6, 7, 8, 15, 500) = 55.1
mean(1, 2, 3, 4, 5, 6, 7, 8, 15, 20) = 7.1
mean(1, 2, 3, 4, 5, 6, 7, 8, 15) ≈ 5.67
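◼ A quick check of how one extreme value moves the mean, using Python's standard `statistics` module. The numbers are the ones from this slide; the median is an addition for comparison and is not part of the original example.

```python
from statistics import mean, median

base = [1, 2, 3, 4, 5, 6, 7, 8, 15]

with_outlier = base + [500]   # one extreme value
with_typical = base + [20]    # a value closer to the rest

print(mean(with_outlier))     # 55.1
print(mean(with_typical))     # 7.1
print(round(mean(base), 2))   # 5.67

# The median barely reacts to the extreme value.
print(median(with_outlier))   # 5.5
```

◼ A single value of 500 raises the mean from about 5.67 to 55.1, while the median stays at 5.5.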
CONTD..
◼ Data outliers: Outliers can have a huge impact on data analysis results. For example, if you're averaging test
scores for a class, and one student didn’t respond to any of the questions, their 0% could greatly skew the
results.
◼ Missing data: Look for missing data fields, blank spaces in text, and unanswered survey questions. These
gaps could be due to human error or incomplete collection. To take care of missing data, you’ll have to perform data
cleaning.
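◼ A rough sketch of two quality-assessment checks: making mixed descriptors uniform (man/male) and counting missing fields. The records, the `GENDER_LABELS` mapping, and the helper names are assumptions for illustration only.

```python
# Hypothetical survey records; field names and values are made up.
records = [
    {"gender": "man", "income": 42000},
    {"gender": "Male", "income": None},
    {"gender": "female", "income": 51000},
    {"gender": " ", "income": 38000},
]

# Assumed mapping from descriptors seen in the sources to one uniform label.
GENDER_LABELS = {"man": "male", "male": "male", "woman": "female", "female": "female"}


def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def standardize_gender(value):
    """Map a descriptor to its uniform label; missing values become None."""
    if is_missing(value):
        return None
    return GENDER_LABELS.get(value.strip().lower(), value)


def missing_report(rows):
    """Count missing values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            if is_missing(value):
                counts[field] = counts.get(field, 0) + 1
    return counts


report = missing_report(records)
uniform = [standardize_gender(r["gender"]) for r in records]

print(report)
print(uniform)
```

◼ The report only finds the problems; fixing them belongs to the data cleaning step.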
DATA CLEANING
◼ Data cleaning is the process of adding missing data and correcting, repairing, or removing incorrect or irrelevant
data from a data set.
◼ Data cleaning is the most important step of preprocessing because it will ensure that data is ready to go for
downstream needs.
◼ Data cleaning will correct all of the inconsistent data you uncovered in your data quality assessment.
◼ Depending on the kind of data you’re working with, there are a number of possible cleaners you’ll need to run
your data through.
CONTD..
◼ Missing data
◼ There are a number of ways to correct for missing data, but the two most common are:
◼ Ignore the tuples: A tuple here is a single record (row) of the data set, that is, an ordered list of
attribute values. If multiple values are missing within a tuple, you may simply discard it. This is only
recommended for large data sets, where a few ignored tuples won’t harm further analysis.
◼ Manually fill in missing data: This can be tedious, but is definitely necessary when working with smaller
data sets.
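◼ The two approaches above can be sketched as follows, assuming each tuple is a Python dict and a missing value is `None`. The rows, the `max_missing` threshold, and the `corrections` table are hypothetical.

```python
rows = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "Delhi"},
    {"id": 3, "age": 29, "city": None},
    {"id": 4, "age": None, "city": None},
]


def drop_incomplete(rows, max_missing=1):
    """Ignore tuples: keep only rows with at most max_missing missing values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]


def fill_manually(rows, corrections):
    """Fill missing fields from hand-entered corrections keyed by row id."""
    filled = []
    for r in rows:
        fixes = corrections.get(r["id"], {})
        # Only missing fields are replaced; existing values are kept as-is.
        filled.append({k: (fixes.get(k, v) if v is None else v) for k, v in r.items()})
    return filled


kept = drop_incomplete(rows)  # row 4 has two gaps and is discarded
corrections = {2: {"age": 41}, 3: {"city": "Mumbai"}}
cleaned = fill_manually(kept, corrections)

print(cleaned)
```

◼ Both helpers return new lists, so the original rows stay available for comparison.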
CONTD..
◼ Noisy data
◼ Data cleaning also includes fixing “noisy” data. This is data that includes unnecessary data points,
irrelevant data, and data that’s more difficult to group together.
◼ Binning: Binning sorts data of a wide data set into smaller groups of more similar data. It’s often
used when analyzing demographics. Income, for example, could be grouped: $35,000-$50,000,
$50,000-$75,000, etc.
◼ Regression: Regression analysis smooths data by fitting it to a function, so that noisy values can be
replaced with fitted ones. It can also help you decide which variables actually apply to your analysis,
so you’re not overburdened with unnecessary data.
◼ Clustering: Clustering algorithms are used to properly group data, so that it can be analyzed with
like data. They’re generally used in unsupervised learning, when not a lot is known about the
relationships within your data.
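◼ One common way to use binning against noise is smoothing by bin means: sort the values, split them into equal-size bins, and replace each value with the mean of its bin. This is a sketch of that variant, not necessarily the one intended on the slide; the `prices` list and the bin size are illustrative.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort values, split into bins of bin_size, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = mean(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)

print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

◼ Each group of three neighbouring values collapses to one representative value, which removes small fluctuations.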
CONTD..
◼ After data cleaning, you may realize you have insufficient data for the task at hand.
◼ At this point you can also perform data wrangling or data enrichment to add new data sets and run them through
quality assessment and cleaning again before adding them to your original data.
DATA TRANSFORMATION
◼ With data cleaning, we’ve already begun to modify our data, but data transformation will begin the
process of turning the data into the proper format(s) you’ll need for analysis and other downstream
processes.
◼ This generally happens through one or more of the following:
◼ Aggregation
◼ Normalization
◼ Feature selection
◼ Discretization
◼ Concept hierarchy generation
CONTD..
◼ Aggregation: Data aggregation combines all of your data together
in a uniform format.
◼ Normalization: Normalization scales your data into a regularized
range so that you can compare it more accurately. For example, if
you’re comparing employee loss or gain within a number of
companies (some with just a dozen employees and some with
200+), you’ll have to scale them within a specified range, like -1.0
to 1.0 or 0.0 to 1.0.
◼ Feature selection: Feature selection is the process of deciding
which variables (features, characteristics, categories, etc.) are most
important to your analysis. These features will be used to train ML
models.
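◼ A minimal sketch of normalization using min-max scaling, one common way to map values into a range such as 0.0 to 1.0 or -1.0 to 1.0. The function name and the headcount figures are assumptions for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the minimum maps to new_min and the maximum to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical net change in headcount at five companies.
headcount_change = [-3, 0, 2, 10, 50]

unit_range = min_max_scale(headcount_change)               # 0.0 to 1.0
signed_range = min_max_scale(headcount_change, -1.0, 1.0)  # -1.0 to 1.0

print(unit_range)
print(signed_range)
```

◼ After scaling, the smallest value sits at the lower bound and the largest at the upper bound, so differently sized inputs become comparable.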
CONTD..
◼ Discretization: Discretization pools data into smaller intervals. It’s somewhat similar to binning,
but usually happens after data has been cleaned. For example, when calculating average daily exercise,
rather than using the exact minutes and seconds, you could group the data into intervals of 0-15
minutes, 15-30 minutes, etc.
◼ Concept hierarchy generation: Concept hierarchy generation can add a hierarchy within and
between your features that wasn’t present in the original data.
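◼ The exercise example can be sketched as below. The slide does not say which interval a boundary value such as exactly 15 minutes belongs to; this sketch assumes half-open intervals, so 15 falls into 15-30. The function name and sample durations are hypothetical.

```python
def discretize_minutes(minutes, width=15):
    """Return an interval label such as '15-30' for a duration in minutes."""
    if minutes < 0:
        raise ValueError("duration cannot be negative")
    # Floor the duration to the start of its interval.
    lower = int(minutes // width) * width
    return f"{lower}-{lower + width}"


exercise = [4.5, 15.0, 29.9, 42.25]
labels = [discretize_minutes(m) for m in exercise]

print(labels)  # ['0-15', '15-30', '15-30', '30-45']
```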
DATA REDUCTION
◼ The more data you’re working with, the harder it will be to analyze, even after cleaning and
transforming it.
◼ Depending on your task at hand, you may actually have more data than you need.
◼ Especially when working with text analysis, much of regular human speech is superfluous or irrelevant
to the needs of the researcher.
◼ Data reduction not only makes the analysis easier and more accurate, but cuts down on data storage.
CONTD..
◼ Attribute selection: Similar to discretization, attribute selection can fit your data into smaller pools. It
essentially combines tags or features, so that tags like male/female and professor could be combined into male
professor/female professor.
◼ Numerosity reduction: This will help with data storage and transmission. You can use a regression model, for
example, to use only the data and variables that are relevant to your analysis.
◼ Dimensionality reduction: This, again, reduces the amount of data used to help facilitate analysis and
downstream processes. Techniques such as principal component analysis (PCA) combine correlated attributes
into a smaller set of derived attributes, making the data more manageable.
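◼ A sketch of numerosity reduction with a regression model: instead of storing every data point, store only the two parameters of a fitted line and estimate values from them. This assumes the data is roughly linear; the readings and the helper names are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical readings that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)


def estimate(x):
    """Approximate a reading from the two stored parameters."""
    return slope * x + intercept


print(round(slope, 2), round(intercept, 2))  # 1.97 0.09
print(round(estimate(3), 2))                 # 6.0
```

◼ Five pairs of numbers are replaced by two parameters. The saving grows with the size of the data set, at the cost of some detail.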
DATA DISCRETIZATION
◼ Discretization techniques can be used to reduce the number of values for a given continuous attribute, by
dividing the attribute into a range of intervals.
◼ Interval value labels can be used to replace actual data values.
◼ These methods are typically recursive, where a large amount of time is spent on sorting the data at each step.
◼ The smaller the number of distinct values to sort, the faster these methods should be.
◼ Many discretization techniques can be applied recursively in order to provide a hierarchical or multiresolution
partitioning of the attribute values known as concept hierarchy.
CONCEPT HIERARCHY GENERATION
◼ A concept hierarchy for a given numeric attribute defines a discretization of the attribute.
◼ Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as
numeric value for the attribute age) by higher level concepts (such as young, middle-aged, or senior).
◼ Although detail is lost by such generalization, the generalized data is more meaningful and easier to interpret.
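◼ A minimal sketch of the age example: numeric ages are replaced by the higher-level concepts young, middle-aged, and senior. The cut-off ages in `AGE_LEVELS` are assumptions, since the slide names the concepts but not their boundaries.

```python
from collections import Counter

# Assumed cut-offs (lower bound inclusive, upper bound exclusive).
AGE_LEVELS = [(0, 30, "young"), (30, 60, "middle-aged"), (60, 200, "senior")]


def generalize_age(age):
    """Replace a numeric age with its higher-level concept."""
    for low, high, label in AGE_LEVELS:
        if low <= age < high:
            return label
    raise ValueError(f"age out of range: {age}")


ages = [22, 35, 47, 61, 29, 70]
concepts = [generalize_age(a) for a in ages]
counts = Counter(concepts)

print(concepts)
print(counts)
```

◼ Six distinct numeric values reduce to three concepts, which are easier to summarize and compare.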