Unit - I - Introduction
Krishnasamy
College of Engineering & Technology
Department of Computer Science & Engineering
Notes of Lesson
B.E - CSE
Year & Sem: II / III
Regulations -2021
Material Reference:
David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”,
Manning Publications, 2016. (first two chapters for Unit I) Page Number: 1 to 56
and Internet.
CHAPTER – I
Data science is the domain of study that deals with vast volumes of data using
modern tools and techniques to find unseen patterns, derive meaningful information, and
make business decisions. Data science uses complex machine learning algorithms to build
predictive models. The data used for analysis can come from many different sources and
be presented in various formats.
Data Science:
Big data is a huge collection of data sets of wide variety, held in many different formats. It is hard for conventional data-management techniques to extract and process data in so many different formats.
Data science involves using methods to analyse massive amounts of data and extract the knowledge it contains.
The relationship between big data and data science is like that between crude oil and an oil refinery: data science is the refinery that turns raw data into something useful.
Facets of Data:
The facets of data describe the various forms in which data can appear inside big data. The following are the main forms:
Structured
Unstructured
Natural Language
Machine Generated
Graph Based
Audio, Video & Image
Streaming Data
Structured
Structured data is data that depends on a data model and resides in a fixed field
within a record. As such, it’s often easy to store structured data in tables within databases
or Excel files. SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases.
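To illustrate (a minimal sketch, not from the textbook), the following Python snippet stores a few structured records in an SQLite table and queries them with SQL; the students table and its columns are hypothetical names chosen only for the example.

import sqlite3

# Structured data: fixed fields within records, stored in a table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (roll_no INTEGER, name TEXT, cgpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Asha", 8.9), (2, "Ravi", 7.4), (3, "Meena", 9.1)],
)

# SQL is the preferred way to query data that resides in such tables.
for row in conn.execute("SELECT name, cgpa FROM students WHERE cgpa > 8 ORDER BY cgpa DESC"):
    print(row)   # ('Meena', 9.1) then ('Asha', 8.9)
conn.close()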
Unstructured
Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying.
Natural Language
The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but models
trained in one domain don’t generalise well to other domains.
Machine Generated
Machine-generated data is information that is automatically created by a computer, process, application, or other machine without human intervention, for example web server logs, call detail records, and sensor data.
Graph Based
“Graph data” can be a confusing term because any data can be shown in a graph.
“Graph” in this case points to mathematical graph theory. In graph theory, a graph is a
mathematical structure to model pair-wise relationships between objects. Graph or network
data is, in short, data that focuses on the relationship or adjacency of objects. The graph
structures use nodes, edges, and properties to represent and store graphical data. Graph-
based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between
two people.
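As a small illustration (a sketch assuming the networkx library is available; the people and friendships are made up), graph metrics such as the shortest path and a simple influence measure can be computed directly on network data:

import networkx as nx

# A tiny hypothetical social network: nodes are people, edges are friendships.
G = nx.Graph()
G.add_edges_from([
    ("Anu", "Bala"), ("Bala", "Chitra"), ("Chitra", "Deepak"),
    ("Anu", "Elango"), ("Elango", "Deepak"),
])

# Shortest path between two people (fewest friendship hops).
print(nx.shortest_path(G, "Anu", "Deepak"))

# Degree centrality as a simple proxy for a person's influence.
print(nx.degree_centrality(G))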
Audio, Image & Video
Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
Examples: YouTube videos, podcasts, music, and so on.
Streaming Data:
While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being loaded
into a data store in a batch. Although this isn’t really a different type of data, we treat it
here as such because you need to adapt your process to deal with this type of information.
Examples: Video conferences and live telecasts work on this basis.
2. Describe the overview of the data science process. (APR / MAY 2023, NOV / DEC 2022)
1. Setting the Research Goal
The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. In every serious project this will result in a project charter.
Defining research goal
An essential outcome is the research goal that states the purpose of your assignment
in a clear and focused manner. Understanding the business goals and context is critical for
project success.
2. Retrieving Data
The second phase is data retrieval. You want to have data available for analysis, so
this step includes finding suitable data and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
Data can be stored in many forms, ranging from simple text files to tables in a
database. The objective now is acquiring all the data you need. This may be difficult, and
even if you succeed, data is often like a diamond in the rough: it needs polishing to be of
any use to you.
Start with data stored within the company.
The data stored within the company might already be cleaned and maintained in repositories such as databases, data marts, data warehouses, and data lakes.
3. Briefly describe the steps involved in Data Preparation. (NOV / DEC 2023)
3. Data Preparation
Redundant white space - White spaces tend to be hard to detect but cause errors like other redundant characters would. White space at the beginning or at the end of a string is particularly hard to identify and rectify.
Impossible values and sanity checks - Here the data are checked for physically
and theoretically impossible values.
Outliers - An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations.
Dealing with the Missing values - Missing values aren’t necessarily wrong, but
you still need to handle them separately; certain modelling techniques can’t handle missing
values.
Data Transformation - Transforming the data into a form that is suitable for modelling, for example by aggregating values, creating dummy variables, or applying a mathematical transformation such as a logarithm.
Reducing the number of variables - Having too many variables in your model makes the
model difficult to handle, and certain techniques don’t perform well when you overload
them with too many input variables.
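A minimal sketch of these cleaning steps using pandas is shown below; the column names and error values are hypothetical and only illustrate the kinds of fixes described above.

import numpy as np
import pandas as pd

# Hypothetical raw data containing the error types discussed above.
df = pd.DataFrame({
    "name": ["  Kumar", "Priya  ", "Arun"],
    "age": [34, -5, 290],              # -5 and 290 fail a sanity check
    "income": [42000, np.nan, 51000],  # one missing value
})

df["name"] = df["name"].str.strip()                          # redundant white space
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = np.nan  # impossible values
df["income"] = df["income"].fillna(df["income"].median())    # missing values

# Simple outlier flag: values more than three standard deviations from the mean.
z = (df["income"] - df["income"].mean()) / df["income"].std()
df["income_outlier"] = z.abs() > 3
print(df)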
4. Data Exploration
Information becomes much easier to grasp when shown in a picture; therefore, you
mainly use graphical techniques to gain an understanding of your data and the interactions
between variables.
Examples
Pareto diagram: a combination of the values and a cumulative distribution.
Histogram: a variable is cut into discrete categories and the number of occurrences in each category is summed up and shown in the graph.
Box plot: It doesn’t show how many observations are present but does offer an impression
of the distribution within categories. It can show the maximum, minimum, median, and
other characterising measures at the same time.
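The sketch below (using matplotlib on randomly generated, hypothetical values) shows how a histogram and a box plot of the same variable can be drawn for exploration:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical variable to explore, e.g. purchase amounts.
values = np.random.default_rng(0).lognormal(mean=3, sigma=0.5, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=20)   # histogram: counts per bin/category
ax1.set_title("Histogram")
ax2.boxplot(values)         # box plot: median, quartiles and outliers
ax2.set_title("Box plot")
plt.show()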
5. Data Modelling
With clean data in place and a good understanding of the content, you're ready to build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system that you're modelling.
Building a model is an iterative process. Most models consist of the following main steps:
Selection of a modelling technique and variables to enter in the model
Execution of the model
Diagnosis and model comparison
Model Execution - Once you’ve chosen a model you’ll need to implement it in code.
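A minimal sketch of model execution and comparison with scikit-learn is given below; the Iris dataset and the two techniques are stand-ins chosen for illustration, not part of the original notes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Selection of a technique, execution of the model, then diagnosis and comparison.
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    model.fit(X_tr, y_tr)                              # model execution
    acc = accuracy_score(y_te, model.predict(X_te))    # diagnosis and comparison
    print(type(model).__name__, acc)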
6. Presentation and Automation
After you've successfully analysed the data and built a well-performing model, you're ready to present your findings to the world. This is an exciting part: all your hours of hard work have paid off, and you can explain what you found to the stakeholders.
1. The first step of this process is setting a research goal. The main purpose here is to make
sure all the stakeholders understand the what, how, and why of the project. In every serious
project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner.
3. The result is data in its raw form, which probably needs some polishing and
transformation before it becomes usable.
4. Now that you have the raw data, it is time to cleanse it. This includes transforming the
data from a raw form into data that is directly usable in your models. To achieve this, you
will detect and correct different kinds of errors in the data, combine data from different
data sources, and transform it. If you have successfully completed this step, you can
progress to data visualization and modelling.
5. The fourth step is data exploration. The goal of this step is to gain a deep understanding
of the data. You will look for patterns, correlations, and deviations based on visual and
descriptive techniques. The insights you gain from this phase will enable you to start
modelling.
6. Finally we get to data modelling. It is now that you attempt to gain the insights or make the
predictions that were stated in your project charter. Now is the time to bring out the heavy
guns, but remember research has taught us that often (but not always) a combination of
simple models tends to outperform one complicated model. If you have done this phase
right, you are almost done.
7. The last step of the data science process is presenting your results and automating the
analysis if needed. One goal of a project is to change the process and/or make better
decisions. You might still need to convince the business that your findings will indeed
change the business process as expected.
This is where you can shine in your influencer role. The importance of this step is
more apparent in projects on a strategic and tactical level. Some projects require you to
perform the business process over and over again, so automating the project will save you
lots of time. In reality you will not progress in a linear way from step 1 to step 6; often you
will regress and iterate between the different phases. Following these six steps pays off in
terms of a higher project success ratio and increased impact of research results.
This process ensures you have a well-defined research plan, a good understanding of
the business question, and clear deliverables before you even start looking at data.
The first steps of your process focus on getting high-quality data as input for your models.
This way your models will perform better later on.
In data science there is a well-known paradigm: garbage in equals garbage out.
Another benefit of following a structured approach is that you work more in prototype
mode while you search for the best model.
When building a prototype you will probably try multiple models and won’t focus
heavily on things like program speed or writing code against standards. This allows you to
focus on bringing business value instead.
CHAPTER - 2
DATA MINING
We are in an age often referred to as the information age. In this information age,
because we believe that information leads to power and success, and thanks to
sophisticated technologies such as computers, satellites, etc., we have been collecting
tremendous amounts of information. Initially, with the advent of computers and means for
mass digital storage, we started collecting and storing all sorts of data, counting on the
power of computers to help sort through this amalgam of information.
Today, we have far more information than we can handle: from business
transactions and scientific data, to satellite pictures, text reports and military intelligence.
Information retrieval is simply not enough anymore for decision-making. Confronted with
huge collections of data, we have now created new needs to help us make better
managerial choices. These needs are automatic summarization of data, extraction of the
“essence” of information stored, and the discovery of patterns in raw data.
Data mining refers to extracting or mining knowledge from large amounts of data.
The term is actually a misnomer; it would have been more appropriate to call it knowledge mining, which emphasizes mining knowledge from large amounts of data. It is
the computational process of discovering patterns in large data sets involving methods at
the intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
We have been collecting a myriad of data, from simple numerical measurements and text
documents, to more complex information such as spatial data, multimedia channels, and
hypertext documents. Here is a non-exclusive list of a variety of information collected in
digital form in databases and in flat files.
• Business transactions: Every transaction in business, such as purchases, banking operations, and stock trades, is typically recorded, producing very large and continuously growing stores of transaction data.
• Scientific data: Scientific experiments and observations, from laboratory measurements to large-scale simulations and field studies, continuously generate enormous amounts of data.
• Medical and personal data:
From government census to personnel and customer files, very large collections of
information are continuously gathered about individuals and groups. Governments,
companies and organizations such as hospitals, are stockpiling very important quantities of
personal data to help them manage human resources, better understand a market, or simply
assist clientele. Regardless of the privacy issues this type of data often reveals, this
information is collected, used and even shared. When correlated with other data this
information can shed light on customer behaviour and the like.
• Surveillance video and pictures: With the amazing collapse of video camera prices, video cameras are becoming
ubiquitous. Video tapes from surveillance cameras are usually recycled and thus the
content is lost. However, there is a tendency today to store the tapes and even digitize them
for future use and analysis.
• Satellite sensing:
There is a countless number of satellites around the globe: some are geo-stationary above a
region, and some are orbiting around the Earth, but all are sending a non-stop stream of
data to the surface. NASA, which controls a large number of satellites, receives more data
every second than what all NASA researchers and engineers can cope with. Many satellite
pictures and data are made public as soon as they are received in the hopes that other
researchers can analyze them.
• Games:
Our society is collecting a tremendous amount of data and statistics about games, players
and athletes. From hockey scores, basketball passes and car-racing laps, to swimming times, boxers' punches and chess positions, all the data are stored. Commentators and
journalists are using this information for reporting, but trainers and athletes would want to
exploit this data to improve performance and better understand opponents.
• Digital media:
The proliferation of cheap scanners, desktop video cameras and digital cameras is
one of the causes of the explosion in digital media repositories. In addition, many radio
stations, television channels and film studios are digitizing their audio and video
collections to improve the management of their multimedia assets. Associations such as the
NHL and the NBA have already started converting their huge game collection into digital
forms.
• CAD and software engineering data: There are a multitude of Computer Assisted Design (CAD) systems for architects to
design buildings or engineers to conceive system components or circuits. These systems
are generating a tremendous amount of data. Moreover, software engineering is a source of
considerable similar data with code, function libraries, objects, etc., which need powerful
tools for management and maintenance.
• Virtual Worlds:
There are many applications making use of three-dimensional virtual spaces. These
spaces and the objects they contain are described with special languages such as VRML.
Ideally, these virtual spaces are described in such a way that they can share objects and
places. There is a remarkable amount of virtual reality object and space repositories
available. Management of these repositories as well as content-based search and retrieval
from these repositories are still research issues, while the size of the collections continues
to grow.
• The World Wide Web repositories: Since the inception of the World Wide Web in 1993, documents of all sorts of
formats, content and description have been collected and inter-connected with hyperlinks
making it the largest repository of data ever built. Despite its dynamic and unstructured
nature, its heterogeneous characteristic, and its frequent redundancy and inconsistency,
the World Wide Web is the most important data collection regularly used for reference
because of the broad variety of topics covered and the infinite contributions of resources
and publishers. Many believe that the World Wide Web will become the compilation of
human knowledge.
With the enormous amount of data stored in files, databases, and other repositories,
it is increasingly important, if not necessary, to develop powerful means for analysis and
perhaps interpretation of such data and for the extraction of interesting knowledge that
could help in decision-making. Data Mining, also popularly known as Knowledge
Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously
unknown and potentially useful information from data in databases.
While data mining and knowledge discovery in databases (or KDD) are frequently
treated as synonyms, data mining is actually part of the knowledge discovery process. The
following figure (Figure 1.1) shows data mining as a step in an iterative knowledge
discovery process.
Figure 1.1: Data Mining is the core of the Knowledge Discovery process
• Data cleaning: also known as data cleansing, it is a phase in which noise data and
irrelevant data are removed from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous, may
be combined in a common source.
• Data selection: at this step, the data relevant to the analysis is decided on and
retrieved from the data collection.
• Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
• Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge representation: the final phase in which the discovered knowledge is visually presented to the user, using visualization techniques to help users understand and interpret the results.
For instance, data cleaning and data integration can be performed together as a pre-
processing phase to generate a data warehouse. Data selection and data transformation can
also be combined where the consolidation of the data is the result of the selection, or, as for
the case of data warehouses, the selection is done on transformed data.
It is, however, a misnomer, since mining for gold in rocks is usually called “gold
mining” and not “rock mining”, thus by analogy, data mining should have been called
“knowledge mining” instead. Nevertheless, data mining became the accepted customary
term, and very rapidly a trend that even overshadowed more general terms such as
knowledge discovery in databases (KDD) that describe a more complete process. Other
similar terms referring to data mining are: data dredging, knowledge extraction and pattern
discovery.
In principle, data mining is not specific to one type of media or data. Data mining
should be applicable to any kind of information repository. However, algorithms and
approaches may differ when applied to different types of data. Indeed, the challenges
presented by different types of data vary significantly.
Data mining is being put into use and studied for databases, including relational
databases, object-relational databases and object oriented databases, data warehouses,
transactional databases, unstructured and semi structured repositories such as the World
Wide Web, advanced databases such as spatial databases, multimedia databases, time-
series databases and textual databases, and even flat files.
• Flat files: Flat files are actually the most common data source for data mining
algorithms, especially at the research level. Flat files are simple data files in text or binary
format with a structure known by the data mining algorithm to be applied. The data in
these files can be transactions, time-series data, scientific measurements, etc.
• Relational Databases: A relational database stores data in a set of tables (relations) whose columns are attributes and whose rows are tuples, as in the Customer, Items, and Borrow relations of the OurVideoStore example:

Customer(customerID, name, address, password, birthdate, family_income, group, ...)
e.g. (C1234, John Smith, 120 main street, Marty, 1965/10/10, $45000, A, ...)

Items(itemID, type, title, media, category, value, #, ...)
e.g. (..., Video, Titanic, DVD, Drama, $15.00, 2, ...)

Borrow(...)

Figure 1.2: Fragments of some relations from a relational database for OurVideoStore.
The most commonly used query language for relational database is SQL, which
allows retrieval and manipulation of the data stored in the tables, as well as the calculation
of aggregate functions such as average, sum, min, max and count.
For instance, an SQL query to select the videos grouped by category would be:
SELECT count(*) FROM Items WHERE type = 'Video' GROUP BY category
Data mining algorithms using relational databases can be more versatile than data
mining algorithms specifically written for flat files, since they can take advantage of the
structure inherent to relational databases. While data mining can benefit from SQL for data
selection, transformation and consolidation, it goes beyond what SQL could provide, such
as predicting, comparing, detecting deviations, etc.
• Data Warehouses: A data warehouse is a repository of data collected from multiple (often heterogeneous) data sources and intended to be used as a whole under the same unified schema. In other words, data from the different stores would be loaded, cleaned, transformed and integrated together. To facilitate decision making and multi-dimensional views, data warehouses are usually modelled by a multi-dimensional data structure.
Figure 1.3: A multi-dimensional data cube structure commonly used for data warehousing.
Figure 1.3 shows an example of a three dimensional subset of a data cube structure
used for the OurVideoStore data warehouse. The figure shows summarized rentals grouped by
film categories, then a cross table of summarized rentals by film categories and time (in
quarters). The data cube gives the summarized rentals along three dimensions: category,
time, and city. A cube contains cells that store values of some aggregate measures (in this
case rental counts), and special cells that store summations along dimensions.
Each dimension of the data cube contains a hierarchy of values for one attribute.
Because of their structure, the pre-computed summarized data they contain and the
hierarchical attribute values of their dimensions, data cubes are well suited for fast
interactive querying and analysis of data at different conceptual levels, known as On-Line
Analytical Processing (OLAP). OLAP operations allow the navigation of data at different
levels of abstraction, such as drill-down, roll-up, slice, dice, etc.
Figure 1.4: Summarized data from OurVideoStore before and after drill-down and roll-up operations.
Figure 1.4 illustrates the drill-down (on the time dimension) and roll-up (on the
location dimension) operations.
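Roll-up and drill-down can be imitated on a small scale with pandas group-by operations, as in the sketch below; the rental figures are made up and this is only an analogy to what an OLAP server does, not an actual OLAP tool.

import pandas as pd

# Hypothetical rental counts by category, city and quarter (a toy data cube).
rentals = pd.DataFrame({
    "category": ["Drama", "Drama", "Comedy", "Comedy", "Horror", "Horror"],
    "city": ["Calgary", "Edmonton", "Calgary", "Edmonton", "Calgary", "Edmonton"],
    "quarter": ["Q1", "Q3", "Q1", "Q3", "Q1", "Q3"],
    "count": [120, 95, 80, 60, 40, 55],
})

# Roll-up: aggregate away the city dimension (coarser, more summarized view).
print(rentals.groupby(["category", "quarter"])["count"].sum())

# Drill-down / slice: keep only Q3 and look at the finer city-level detail.
print(rentals[rentals["quarter"] == "Q3"].groupby(["category", "city"])["count"].sum())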
• Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items.

Rentals(transactionID, date, time, customerID, itemList)
e.g. (T12345, 99/09/06, 19:38, C1234, {I2, I6, I10, I45, …})

Figure 1.5: Fragment of a transaction database for the rentals at OurVideoStore.
For example, in the case of the video store, the rentals table such as shown in
Figure 1.5, represents the transaction database. Each record is a rental contract with a
customer identifier, a date, and the list of items rented (i.e. video tapes, games, VCR, etc.).
Since relational databases do not allow nested tables (i.e. a set as attribute value),
transactions are usually stored in flat files or stored in two normalized transaction tables,
one for the transactions and one for the transaction items.
One typical data mining analysis on such data is the so-called market basket analysis or association rules, in which associations between items occurring together or in sequence are studied.
• Multimedia Databases: Multimedia databases include video, images, audio and text
media. They can be stored on extended object-relational or object-oriented databases, or
simply on a file system. Multimedia is characterized by its high dimensionality, which
makes data mining even more challenging. Data mining from multimedia repositories may
require computer vision, computer graphics, image interpretation, and natural language
processing methodologies.
• Spatial Databases: Spatial databases are databases that, in addition to usual data, store
geographical information like maps, and global or regional positioning. Such spatial
databases present new challenges to data mining algorithms.
• Time-Series Databases: Time-series databases contain time-related data such as stock
market data or logged activities. These databases usually have a continuous flow of new
data coming in, which sometimes causes the need for a challenging real time analysis. Data
mining in such databases commonly includes the study of trends and correlations between
evolutions of different variables, as well as the prediction of trends and movements of the
variables in time.
Figure 1.6: Visualization of spatial OLAP (from the GeoMiner system)

Figure 1.7 shows some examples of time-series data.
• World Wide Web: The World Wide Web is the most heterogeneous and dynamic
repository available. A very large number of authors and publishers are continuously
contributing to its growth and metamorphosis, and a massive number of users are accessing
its resources daily. Data in the World Wide Web is organized in inter-connected
documents. These documents can be text, audio, video, raw data, and even applications.
Conceptually, the World Wide Web has three major components: the content of the Web (the documents available), the structure of the Web (the hyperlinks and relationships between documents), and the usage of the Web (how and when the resources are accessed). A fourth dimension can be added relating to the dynamic nature or evolution of the documents. Data mining in the World Wide Web, or web mining, tries to address all these issues and is often divided into web content mining, web structure mining and web usage mining.
The kinds of patterns that can be discovered depend upon the data mining tasks
employed. By and large, there are two types of data mining tasks:
Descriptive data mining tasks that describe the general properties of the existing data, and
Predictive data mining tasks that attempt to do predictions based on inference on available
data.
Figure 1.7: Examples of Time-Series Data (Source: Thompson Investors Group)

The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:
• Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules.
• Discrimination: Data discrimination produces what are called discriminant rules and is
basically the comparison of the general features of objects between two classes referred to
as the target class and the contrasting class. For example, one may want to compare the
general characteristics of the customers who rented more than 30 movies in the last year
with those whose rental account is lower than 5. The techniques used for data
discrimination are very similar to the techniques used for data characterization with the
exception that data discrimination results include comparative measures.
• Association analysis: Association analysis is the discovery of what are commonly called
association rules. It studies the frequency of items occurring together in transactional
databases, and based on a threshold called support, identifies the frequent item sets.
Another threshold, confidence, which is the conditional probability that an item appears in
a transaction when another item appears, is used to pinpoint association rules. Association
analysis is commonly used for market basket analysis.
For example, it could be useful for the Our Video Store manager to know what
movies are often rented together or if there is a relationship between renting a certain type
of movies and buying popcorn or pop. The discovered association rules are of the form:
P→Q [s,c], where P and Q are conjunctions of attribute value-pairs, and s (for support) is
the probability that P and Q appear together in a transaction and c (for confidence) is the
conditional probability that Q appears in a transaction when P is present. For example, the
hypothetic association rule: RentType(X, “game”) ∧ Age(X, “13-19”) → Buys(X, “pop”)
[s=2% ,c=55%] would indicate that 2% of the transactions considered are of customers
aged between 13 and 19 who are renting a game and buying a pop, and that there is a
certainty of 55% that teenage customers who rent a game also buy pop.
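The sketch below computes support and confidence for one such rule from a handful of made-up transactions; it only illustrates the two measures, not a full association-rule mining algorithm such as Apriori.

# Toy rental baskets from a hypothetical video store; each set is one transaction.
transactions = [
    {"game", "pop"}, {"game", "pop", "chips"}, {"game"},
    {"movie", "pop"}, {"movie"}, {"game", "pop"},
]

P, Q = {"game"}, {"pop"}          # rule P -> Q
n = len(transactions)
both = sum(1 for t in transactions if P <= t and Q <= t)
only_p = sum(1 for t in transactions if P <= t)

support = both / n                # fraction of transactions containing P and Q together
confidence = both / only_p        # conditional probability of Q given P
print(f"support={support:.2f}, confidence={confidence:.2f}")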
• Classification: Classification analysis is the organization of data into given classes, also known as supervised classification. It uses class labels known in advance and a training set to build a model that is then used to classify new objects. For example, after starting a credit policy, the Our Video Store managers could analyze the
customers’ behaviours vis-à-vis their credit, and label accordingly the customers who
received credits with three possible labels “safe”, “risky” and “very risky”. The
classification analysis would generate a model that could be used to either accept or reject
credit requests in the future.
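As a hedged sketch of such a classification model (the features, labels, and library choice are illustrative assumptions, not the store's actual system), a decision tree could be trained on labelled customers and used to score new credit requests:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical customer features: [age, yearly_income, late_returns]
X = [[21, 18000, 4], [35, 52000, 0], [48, 75000, 1], [19, 12000, 6], [40, 61000, 0]]
y = ["risky", "safe", "safe", "very risky", "safe"]   # labels assigned by the managers

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[30, 45000, 1]]))   # classify a new credit request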
• Outlier analysis: Outliers are data elements that cannot be grouped in a given class or
cluster. Also known as exceptions or surprises, they are often very important to identify.
While outliers can be considered noise and discarded in some applications, they can reveal
important knowledge in other domains, and thus can be very significant and their analysis
valuable.
• Evolution and deviation analysis: Evolution and deviation analysis pertain to the study
of time related data that changes in time. Evolution analysis models evolutionary trends in
data, which makes it possible to characterize, compare, classify or cluster time-related
data. Deviation analysis, on the other hand, considers differences between measured values
and expected values, and attempts to find the cause of the deviations from the anticipated
values.
It is common that users do not have a clear idea of the kind of patterns they can
discover or need to discover from the data at hand. It is therefore important to have a
versatile and inclusive data mining system that allows the discovery of different kinds of
knowledge and at different levels of abstraction. This also makes interactivity an important
attribute of a data mining system.
Data mining allows the discovery of knowledge potentially useful and unknown.
Whether the knowledge discovered is new, useful or interesting, is very subjective and
depends upon the application and the user. It is certain that data mining can generate, or
discover, a very large number of patterns or rules. In some cases the number of rules can
reach the millions. One can even think of a meta-mining phase to mine the oversized data
mining results. To reduce the number of patterns or rules discovered that have a high
probability to be non-interesting, one has to put a measurement on the patterns. However,
this raises the problem of completeness. The user would want to discover all rules or
patterns, but only those that are interesting.
Typically, measurements for interestingness are based on thresholds set by the user.
These thresholds define the completeness of the patterns discovered. Identifying and
measuring the interestingness of patterns and rules discovered, or to be discovered is
essential for the evaluation of the mined knowledge and the KDD process as a whole.
While some concrete measurements exist, assessing the interestingness of discovered
knowledge is still an important research issue.
How do we categorize data mining systems?
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; among other classifications are those based on the type of data source mined, the data model used, the kind of knowledge discovered, and the mining techniques employed.
Data mining algorithms embody techniques that have sometimes existed for many
years, but have only lately been applied as reliable and scalable tools that time and again
outperform older classical statistical methods.
While data mining is still in its infancy, it is becoming a trend and ubiquitous.
Before data mining develops into a conventional, mature and trusted discipline, many still
pending issues have to be addressed. Some of these issues are addressed below. Note that
these issues are not exclusive and are not ordered in any way.
Security and social issues: Security is an important issue with any data collection that is shared and/or is
intended to be used for strategic decision-making. In addition, when data is collected for
customer profiling, user behaviour understanding, correlating personal data with other
information, etc., large amounts of sensitive and private information about individuals or
companies is gathered and stored. This becomes controversial given the confidential nature
of some of this data and the potential illegal access to the information.
Moreover, data mining could disclose new implicit knowledge about individuals or
groups that could be against privacy policies, especially if there is potential dissemination
of discovered information. Another issue that arises from this concern is the appropriate
use of data mining. Due to the value of data, databases of all sorts of content are regularly
sold, and because of the competitive advantage that can be attained from implicit
knowledge discovered, some important information could be withheld, while other
information could be widely distributed and used without control.
User interface issues: The knowledge discovered by data mining tools is useful only if it is interesting and, above all, understandable to the user. Good data visualization eases the interpretation of data mining results, and many exploratory analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for
effective data graphical presentation. However, there is still much research to accomplish
in order to obtain good visualization tools for large datasets that could be used to display
and manipulate mined knowledge.
The major issues related to user interfaces and visualization are “screen real-estate”, information rendering, and interaction. Interactivity with the data and data mining results is
crucial since it provides means for the user to focus and refine the mining tasks, as well as
to picture the discovered knowledge from different angles and at different conceptual
levels.
Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations.
Topics such as versatility of the mining approaches, the diversity of data available, the
dimensionality of the domain, the broad analysis needs (when known), the assessment of
the knowledge discovered, the exploitation of background knowledge and metadata, the
control and handling of noise in data, etc. are all examples that can dictate mining
methodology choices. For instance, it is often desirable to have different data mining
methods available since different approaches may perform differently depending upon the
data at hand. Moreover, different approaches may suit and solve user’s needs differently.
Most algorithms assume the data to be noise-free. This is of course a strong assumption.
Most datasets contain exceptions, invalid or incomplete information, etc., which may
complicate, if not obscure, the analysis process and in many cases compromise the
accuracy of the results.
Performance issues:
Many artificial intelligence and statistical methods exist for data analysis and
interpretation. However, these methods were often not designed for the very large data sets
data mining is dealing with today. Terabyte sizes are common. This raises the issues of
scalability and efficiency of the data mining methods when processing considerably large
data. Algorithms with exponential and even medium-order polynomial complexity cannot
be of practical use for data mining. Linear algorithms are usually the norm. In the same vein,
sampling can be used for mining instead of the whole dataset.
However, concerns such as completeness and choice of samples may arise. Other
topics in the issue of performance are incremental updating, and parallel programming.
There is no doubt that parallelism can help solve the size problem if the dataset can be
subdivided and the results can be merged later. Incremental updating is important for
merging results from parallel mining, or updating data mining results when new data
becomes available without having to re-analyze the complete dataset.
Data source issues: There are many issues related to the data sources; some are practical, such as the
diversity of data types, while others are philosophical like the data glut problem. We
certainly have an excess of data since we already have more data than we can handle and
we are still collecting data at an even higher rate. If the spread of database management
systems has helped increase the gathering of information, the advent of data mining is
certainly encouraging more data harvesting. The current practice is to collect as much data
as possible now and process it, or try to process it, later. The concern is whether we are
collecting the right data at the appropriate amount, whether we know what we want to do
with it, and whether we distinguish between what data is important and what data is
insignificant. Regarding the practical issues related to data sources, there is the subject of
heterogeneous databases and the focus on diverse complex data types.
Thus it is important to have a data mining system that can mine multiple kinds of
patterns to accommodate different user expectations or applications. Furthermore, data
mining systems should be able to discover patterns at various granularities (i.e., different
levels of abstraction). Data mining systems should also allow users to specify hints to
guide or focus the search for interesting patterns. Because some patterns may not hold for
all of the data in the database, a measure of certainty or “trustworthiness” is usually
associated with each discovered pattern.
Data mining functionalities, and the kinds of patterns they can discover, are described below.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including item sets, subsequences, and substructures.
A frequent item set typically refers to a set of items that frequently appear together
in a transactional data set, such as milk and bread. A frequently occurring subsequence,
such as the pattern that customers tend to purchase first a PC, followed by a digital camera,
and then a memory card, is a (frequent) sequential pattern. A substructure can refer to
different structural forms, such as graphs, trees, or lattices, which may be combined with
item sets or subsequences. If a substructure occurs frequently, it is called a (frequent)
structured pattern. Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
3. Briefly explain the architecture of data mining.
A typical data mining system may have the following major components.
1. Knowledge Base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction. Knowledge such
as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness,
may also be included. Other examples of domain knowledge are additional interestingness
constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
2. Data Mining Engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
4. User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based on the
intermediate data mining results. In addition, this component allows the user to browse database
and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns
in different forms.
CHAPTER – 3
DATA WAREHOUSING
A data warehouse commonly adopts a three-tier architecture:
Tier-1:
The bottom tier is the warehouse database server, which is almost always a relational database system. Data from operational databases and external sources is extracted, cleaned, transformed, and loaded into this tier.
Tier-2:
The middle tier is an OLAP server, typically implemented either as a relational OLAP (ROLAP) server or a multidimensional OLAP (MOLAP) server.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
From the architecture point of view, there are three data warehouse models:
1. Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration and usually contains both detailed and summarized data.
2. Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to selected subjects such as a single department.
3. Virtual warehouse: A virtual warehouse is a set of views over operational databases; for efficient query processing, only some of the possible summary views may be materialized.
Metadata are data about data. When used in a data warehouse, metadata are the data
that define warehouse objects. Metadata are created for the data names and definitions of
the given warehouse. Additional metadata are created and captured for time stamping any
extracted data, the source of the extracted data, and missing fields that have been added by
data cleaning or integration processes.
Let’s look at each of these schema types. Star schema: The most common modeling
paradigm is the star schema, in which the data warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension. The
schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table.
Star schema:
A star schema for All Electronics sales is shown in Figure. Sales are considered
along four dimensions, namely, time, item, branch, and location. The schema contains a
central fact table for sales that contains keys to each of the four dimensions, along with two
measures: dollars sold and units sold. To minimize the size of the fact table, dimension
identifiers (such as time key and item key) are system-generated identifiers. Notice that in
the star schema, each dimension is represented by only one table, and each table contains a
set of attributes. For example, the location dimension table contains the attribute set
{location key, street, city, province or state, country}. This constraint may introduce some
redundancy.
For example, “Vancouver” and “Victoria” are both cities in the Canadian province of
British Columbia.Entries for such cities in the location dimension table will create
redundancy among the attributes province or state and country, that is, (..., Vancouver,
British Columbia, Canada) and (..., Victoria, British Columbia, Canada). Moreover, the
attributes within a dimension table may form either a hierarchy (total order) or a lattice
(partial order).
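The sketch below builds a tiny version of such a star schema in SQLite and runs an aggregate query that joins the fact table to two dimension tables; the table and column names are simplified assumptions based on the description above, not the textbook's exact schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales    (item_key INTEGER, location_key INTEGER,
                       dollars_sold REAL, units_sold INTEGER);
INSERT INTO item     VALUES (1, 'TV', 'Acme'), (2, 'Laptop', 'Zen');
INSERT INTO location VALUES (1, 'Vancouver', 'Canada'), (2, 'Victoria', 'Canada');
INSERT INTO sales    VALUES (1, 1, 800, 2), (2, 1, 1200, 1), (1, 2, 400, 1);
""")

# Join the central fact table to its dimension tables and aggregate the measures.
query = """
SELECT l.city, i.item_name, SUM(s.dollars_sold), SUM(s.units_sold)
FROM sales s
JOIN item i     ON s.item_key = i.item_key
JOIN location l ON s.location_key = l.location_key
GROUP BY l.city, i.item_name
"""
for row in conn.execute(query):
    print(row)
conn.close()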
Snowflake schema:
A snowflake schema for All Electronics sales is given in the figure. Here, the sales fact table is identical to that of the star schema. The main difference between the two schemas is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies.
Fact constellation.
A fact constellation schema is shown in Figure. This schema specifies two fact
tables, sales and shipping. The sales table definition is identical to that of the star schema.
The shipping table has five dimensions, or keys: item key, time key, shipper key, from
location, and to location, and two measures: dollars cost and units shipped.
A fact constellation schema allows dimension tables to be shared between fact
tables. For example, the dimensions tables for time, item, and location are shared between
both the sales and shipping fact tables. In data warehousing, there is a distinction between
a data warehouse and a data mart.
A data warehouse collects information about subjects that span the entire
organization, such as customers, items, sales, assets, and personnel, and thus its scope is
enterprise-wide. For a data warehouse, the fact constellation schema is commonly used,
since it can model multiple, interrelated subjects. A data mart, on the other hand, is a
department subset of the data warehouse that focuses on selected subjects, and thus its
scope is department wide. For data marts, the star or snowflake schema is commonly
used, since both are geared toward modelling single subjects, although the star schema is
more popular and efficient.
OLAP Operations:
➢ Consolidation (Roll-Up): aggregation of data along a dimension, moving from detailed data to higher-level summaries.
➢ Drill-Down: the reverse of roll-up; it navigates from summarized data to the more detailed data from which it was derived.
➢ Slicing and Dicing: a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints.
Types of OLAP:
ROLAP (Relational OLAP) works directly with relational databases. The base data and the
dimension tables are stored as relational tables and new tables are created to
hold the aggregated information. It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational
database to give the appearance of traditional OLAP's slicing and dicing
functionality. In essence, each action of slicing and dicing is equivalent to
adding a "WHERE" clause in the SQL statement. ROLAP tools do not use
pre-calculated data cubes but instead pose the query to the standard
relational database and its tables in order to bring back the data required to
answer the question.
ROLAP tools feature the ability to ask any question because the methodology is not limited to the contents of a cube. ROLAP also has the ability to drill down to the lowest level of detail in the database.
MOLAP (Multidimensional OLAP) stores the base data and the aggregations in an optimized multidimensional array (a pre-computed data cube) rather than in a relational database, which gives fast query response at the cost of pre-computation and extra storage.
There is no clear agreement across the industry as to what constitutes Hybrid OLAP,
except that a database will divide data between relational and specialized storage.
For example, for some vendors, a HOLAP database will use relational tables to hold the
larger quantities of detailed data, and use specialized storage for at least some aspects of
the smaller quantities of more-aggregate or less-detailed data.
CHAPTER – 4
What is Statistics?
A statistic is a number that describes the data from a sample. The Wikipedia
definition of Statistics states that “it is a discipline that concerns the collection,
organization, analysis, interpretation, and presentation of data.”
It means, as part of statistical analysis, we collect, organize, and draw meaningful
insights from the data either through visualizations or mathematical explanations.
Statistics is broadly categorized into two types:
1. Descriptive Statistics
2. Inferential Statistics
Descriptive Statistics:
As the name suggests, in Descriptive Statistics we describe the data using the mean, standard deviation, charts, or probability distributions.
Inferential Statistics:
In Inferential Statistics we use a sample of data to draw inferences, make estimates, and test hypotheses about a larger population.
Types of Data:
Qualitative data is descriptive and non-numerical, whereas Quantitative data is numerical; quantitative data is further divided into Continuous and Discrete data.
Continuous data: It can be represented in decimal format. Examples are height, weight,
time, distance, etc.
Discrete data: It cannot be represented in decimal format. Examples are the number of
laptops, number of students in a class. Discrete data is again divided into Categorical and
Count Data.
Categorical data: represent the type of data that can be divided into groups. Examples are
age, sex, etc.
Count data: This data contains non-negative integers. Example: number of children a couple
has.
Measures of Central Tendency:
Mean: The arithmetic average of all the values in the dataset.
Median: The middle value of the dataset when the values are arranged in order.
Mode: The most frequently occurring value in the dataset. If the data have multiple values that occur most frequently, we have a multimodal distribution.
Variability:
Range: The difference between the highest and lowest values in the dataset.
Variance: The average of the squared deviations of the values from the mean.
Standard Deviation: The square root of the variance; it measures how spread out the values are around the mean.
Percentiles: A measure that indicates the value below which a given percentage of observations in a group of observations falls.
Quartiles: Values that divide the data points into four more or less equal parts, or quarters.
Causality: Relationship between two events where one event is affected by the other.
Covariance: A quantitative measure of the joint variability between two or more variables.
Correlation: Measure the relationship between two variables and ranges from -1 to 1, the
normalized version of covariance.
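These descriptive measures can be computed directly with NumPy, as in the short sketch below (the paired observations are made-up values used only for illustration):

import numpy as np

# Hypothetical paired observations, e.g. hours studied vs. marks obtained.
hours = np.array([2, 4, 5, 7, 9])
marks = np.array([50, 58, 62, 71, 80])

print("range of marks:", marks.max() - marks.min())
print("90th percentile:", np.percentile(marks, 90))
print("variance, std:", marks.var(ddof=1), marks.std(ddof=1))
print("covariance:", np.cov(hours, marks)[0, 1])
print("correlation:", np.corrcoef(hours, marks)[0, 1])   # lies between -1 and 1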
Probability Distribution
Probability Density Function (PDF): A function for continuous data where the value
at any given sample can be interpreted as providing a relative likelihood that the value of the
random variable would equal that sample.
Probability Mass Function (PMF): A function that gives the probability that a discrete
random variable is exactly equal to some value.
Cumulative Distribution Function (CDF): A function that gives the probability that a random variable is less than or equal to a certain value.
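A small sketch using scipy.stats (assuming SciPy is available; the distributions and parameters are chosen only as examples) shows the three functions in action:

from scipy import stats

# PDF and CDF of a continuous (normal) random variable.
print(stats.norm.pdf(0.5, loc=0, scale=1))   # relative likelihood at x = 0.5
print(stats.norm.cdf(0.5, loc=0, scale=1))   # P(X <= 0.5)

# PMF of a discrete (binomial) variable: exactly 3 successes in 10 trials with p = 0.4.
print(stats.binom.pmf(3, n=10, p=0.4))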
Hypothesis Testing
Null Hypothesis: A general statement that there is no relationship between two measured phenomena or no association among groups.
Alternative Hypothesis: A statement that contradicts the null hypothesis.
In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis, while a type II error is the failure to reject a false null hypothesis.
Interpretation
P-value: The probability of observing a test statistic at least as extreme as the one observed, given that the null hypothesis is true. When p-value > α, we fail to reject the null hypothesis; when p-value ≤ α, we reject the null hypothesis and conclude that we have a significant result.
Critical Value: A point on the scale of the test statistic beyond which we reject the null
hypothesis, and, is derived from the level of significance α of the test. It depends upon a test
statistic, which is specific to the type of test, and the significance level, α, which defines the
sensitivity of the test.
Significance Level and Rejection Region: The rejection region depends on the significance level. The significance level is denoted by α and is the probability of rejecting the null hypothesis when it is in fact true.
Z-Test
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution and tests the mean of a distribution
in which we already know the population variance. Therefore, many statistical tests can be
conveniently performed as approximate Z-tests if the sample size is large or the population
variance is known.
T-Test
A T-test is used when the population variance is unknown and the sample size is small (n < 30).
Paired sample means that we collect data twice from the same group, person, item or
thing. Independent sample implies that the two samples must have come from two
completely different populations.
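The sketch below runs both kinds of t-test with scipy.stats on two small made-up samples; the data and the 0.05 significance level are assumptions for illustration only.

from scipy import stats

# Hypothetical small samples (n < 30, population variance unknown).
group_a = [23, 25, 28, 22, 26, 27, 24]
group_b = [30, 29, 31, 27, 33, 28, 32]

# Independent two-sample t-test: the samples come from two different groups.
res = stats.ttest_ind(group_a, group_b)
print(res.statistic, res.pvalue)   # reject H0 at alpha = 0.05 if pvalue <= 0.05

# Paired t-test: the same group measured twice (before/after); equal lengths required.
print(stats.ttest_rel(group_a, group_b))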
ANOVA (Analysis of Variance)
ANOVA is a way to find out whether experiment results are significant. One-way ANOVA compares the means of two or more independent groups using one independent variable. Two-way ANOVA is the extension of one-way ANOVA that uses two independent variables to estimate the main effects and the interaction effect.
ANOVA Table
Source of variation | Sum of squares | Degrees of freedom | Mean square       | F
Between groups      | SSB            | k - 1              | MSB = SSB/(k - 1) | F = MSB/MSW
Within groups       | SSW            | N - k              | MSW = SSW/(N - k) |
Total               | SST            | N - 1              |                   |
(k = number of groups, N = total number of observations)
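A one-way ANOVA can be carried out with scipy.stats.f_oneway, as in the sketch below (the three groups of scores are made-up values for illustration):

from scipy import stats

# Hypothetical scores from three independent groups (one-way ANOVA).
g1 = [85, 86, 88, 75, 78]
g2 = [91, 92, 93, 85, 87]
g3 = [79, 78, 88, 94, 92]

res = stats.f_oneway(g1, g2, g3)
print(res.statistic, res.pvalue)   # a small p-value suggests at least one group mean differs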
Chi-Square Test
A Chi-Square test checks whether the observed frequencies of a discrete (categorical) variable differ from the frequencies we would expect under some model. The Goodness-of-Fit test determines whether a sample of one categorical variable matches a hypothesized population distribution. The Chi-Square Test for Independence compares two categorical variables to see whether there is a relationship between them.
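Both chi-square variants can be run with scipy.stats, as in the hedged sketch below (the observed counts are invented for illustration):

from scipy import stats

# Test for independence: rows = gender, columns = preferred movie genre (hypothetical counts).
observed = [[30, 10, 20],
            [20, 25, 15]]
chi2, p_val, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_val, dof)   # a small p-value suggests the two variables are related

# Goodness of fit: do observed counts match the expected (here uniform) distribution?
print(stats.chisquare(f_obs=[18, 22, 20], f_exp=[20, 20, 20]))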
Applications of Data Science
1. In Search Engines
The most useful application of data science is in search engines. When we want to search for something on the internet, we mostly use search engines such as Google, Yahoo, and so on, and data science is used to return relevant results faster.
For example, when we search for something like “Data Structure and algorithm courses”, the first link we get is often to GeeksforGeeks courses. This happens because the GeeksforGeeks website is visited most often for information on data structure courses and computer-related subjects. This analysis is done using data science, which ranks the most-visited web links at the top.
2. In Transport
Data Science also entered into the Transport field like Driverless Cars. With the help of
Driverless Cars, it is easy to reduce the number of Accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help of data science techniques the data is analysed: what the speed limit is on highways, busy streets, and narrow roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in Financial Industries. Financial Industries always have an
issue of fraud and risk of losses. Thus, Financial Industries needs to automate risk of loss
analysis in order to carry out strategic decisions for the company. Also, Financial Industries
uses Data Science Analytics tools in order to predict the future. It allows the companies to
predict customer lifetime value and their stock market moves.
For example, data science is a main part of the stock market. In the stock market, data science is used to examine past behaviour from historical data, with the goal of predicting future outcomes. Data is analysed in such a way that it becomes possible to predict future stock prices over a set timeframe.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use data science to provide a better user experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to our choices based on our past data, and we also get recommendations based on the most-bought, most-rated, and most-searched products. This is all done with the help of data science.
5. In Health Care
In the Healthcare Industry data science act as a boon. Data Science is used for:
Detecting Tumor.
Drug discoveries.
Medical Image Analysis.
Virtual Medical Bots.
Genetics and Genomics.
Predictive Modeling for Diagnosis etc.
6. Image Recognition
Currently, data science is also used in image recognition. For example, when we upload a photo with a friend on Facebook, Facebook suggests tagging the people who are in the picture. This is done with the help of machine learning and data science. When an image is recognized, data analysis is done on one's Facebook friends, and if a face in the picture matches someone's profile, Facebook suggests auto-tagging.
7. Targeting Recommendation
Targeted recommendation is an important application of data science. Whatever a user searches for on the internet, he or she will then see related advertisements everywhere. For example, suppose I want a mobile phone, so I search for it on Google and then change my mind and decide to buy it offline. Data science helps the companies that pay for advertisements for that mobile, so everywhere on the internet, in social media, on websites, and in apps, I will see recommendations for the mobile phone I searched for, which nudges me to buy it online after all.
8. Airline Routing Planning
With the help of data science, the airline sector is also growing: it becomes easier to predict flight delays. Data science also helps decide whether to fly directly to the destination or to take a halt in between; for example, a flight can take a direct route from Delhi to the U.S.A. or halt in between before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used together with machine learning, so that with the help of past data the computer improves its performance. Many games, such as chess and EA Sports titles, use data science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has to be done with full discipline, because it is a matter of someone's life. Without data science it takes a lot of time, resources, and money to develop a new medicine or drug, but with the help of data science it becomes easier, because the prediction of the success rate can be determined based on biological data and factors. The algorithms based on data science will forecast how the drug will react in the human body without lab experiments.
11. In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data
Science helps these companies to find the best route for the Shipment of their Products, the
best time suited for delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of data science: the user types a few letters or words, and the system suggests a completion of the whole line. In Google Mail, when we are writing a formal mail to someone, the data science concept of autocomplete is used to suggest an efficient way to complete the whole sentence. The autocomplete feature is also widely used in search engines, in social media, and in various apps.
Additional Content:
Example:
Let us suppose we want to travel from station A to station B by car. Now, we need to take some decisions, such as which route will be the best to reach the destination faster, on which route there will be no traffic jam, and which route will be cost-effective. All these decision factors act as input data, and we get an appropriate answer from these decisions; this analysis of data is called data analysis, and it is a part of data science.
Some years ago, data was scarce and mostly available in a structured form, which could be easily stored in Excel sheets and processed using BI tools.
But in today's world data has become so vast, with approximately 2.5 quintillion bytes of data generated every day, that it has led to a data explosion. Researchers estimated that by 2020 around 1.7 MB of data would be created every second for every person on earth.
Every company requires data to work, grow, and improve its business.
Handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyse it, we need complex, powerful, and efficient algorithms and technology, and that technology is data science.
Following are some main reasons for using data science technology:
o With the help of data science technology, we can convert massive amounts of raw and unstructured data into meaningful insights.
o Data science technology is being adopted by various companies, from big brands to startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science algorithms for a better customer experience.
o Data science is used to automate transportation, for example in building self-driving cars, which are the future of transportation.
o Data science can help with different kinds of predictions, such as surveys, elections, flight ticket confirmation, etc.
If you learn data science, you get the opportunity to take up various exciting job roles in this domain. The main job roles are given below:
1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager
Below is an explanation of some of the key job roles in data science.
1. Data Analyst:
A data analyst is an individual who mines huge amounts of data, models the data, and looks for patterns, relationships, trends, and so on. At the end of the day, he or she produces visualizations and reports that support decision-making and problem-solving.
Skill required: To become a data analyst, you need a good background in mathematics, business intelligence, data mining, and basic statistics. You should also be familiar with computer languages and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.
2. Machine Learning Expert:
A machine learning expert works with the various machine learning algorithms used in data science, such as regression, clustering, classification, decision trees, random forests, etc.
Skill required: Computer programming languages such as Python, C++, R, Java, and Hadoop. You should also have an understanding of various algorithms, problem-solving and analytical skills, probability, and statistics.
3. Data Engineer:
A data engineer works with massive amounts of data and is responsible for building and maintaining the data architecture of a data science project. The data engineer also creates the data set processes used in modeling, mining, acquisition, and verification.
Skill required: A data engineer must have in-depth knowledge of SQL, MongoDB, Cassandra, HBase, Apache Spark, Hive, and MapReduce, together with programming knowledge of Python, C/C++, Java, Perl, etc.
4. Data Scientist:
A data scientist is a professional who works with enormous amounts of data to come up with compelling business insights through the deployment of various tools, techniques, methodologies, algorithms, etc.
Skill required: To become a data scientist, one should have technical language skills such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data scientists must also have an understanding of statistics, mathematics, visualization, and communication skills.
Non-Technical Prerequisites:
o Curiosity: To learn data science, one must have curiosity. When you are curious and ask various questions, you can understand the business problem easily.
o Critical Thinking: Critical thinking is also required so that a data scientist can find multiple new ways to solve a problem efficiently.
o Communication skills: Communication skills are very important for a data scientist because, after solving a business problem, you need to communicate it to the team.
Technical Prerequisite:
o Machine learning: To understand data science, one needs to understand the concept
of machine learning. Data science uses machine learning algorithms to solve various
problems.
o Mathematical modeling: Mathematical modeling is required to make fast
mathematical calculations and predictions from the available data.
o Statistics: Basic understanding of statistics is required, such as mean, median, or
standard deviation. It is needed to extract knowledge and obtain better results from the
data.
o Computer programming: For data science, knowledge of at least one programming language is required. R, Python, and Spark are some of the programming languages commonly used for data science.
o Databases: An in-depth understanding of databases such as SQL is essential for data science in order to get the data and work with it.
Parameter | Business Intelligence | Data Science
Data Source | Business intelligence deals with structured data, e.g., data warehouse. | Data science deals with structured and unstructured data, e.g., weblogs, feedback, etc.
Skills | Statistics and Visualization are the two skills required for business intelligence. | Statistics, Visualization, and Machine Learning are the required skills for data science.
Focus | Business intelligence focuses on both past and present data. | Data science focuses on past data, present data, and also future predictions.
6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of quantity, structure, space, and change. For a data scientist, good knowledge of mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is all about training a machine so that it can act like a human brain. In data science, we use various machine learning algorithms to solve problems.
o Apriori
Below is a brief introduction to a few of the important algorithms:
1. Linear Regression Algorithm: Linear regression is the most popular machine learning algorithm and is based on supervised learning. It performs regression, which is a method of modeling a target value based on independent variables. It takes the form of a linear equation relating a set of inputs to the predicted output. This algorithm is mostly used in forecasting and prediction. Since it models a linear relationship between the input and output variables, it is called linear regression.
The relationship between the x and y variables can be described by the equation:

y = mx + c

Where,
y = dependent variable
x = independent variable
m = slope
c = intercept
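The following short Python sketch (not part of the source notes; the numbers are invented purely for illustration) shows how such a line can be fitted with scikit-learn:

# Minimal linear regression sketch on made-up data (hypothetical example).
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])      # independent variable (e.g., years of experience)
y = np.array([2.5, 3.6, 4.4, 5.6, 6.5])      # dependent variable (e.g., salary in lakhs)

model = LinearRegression().fit(x, y)
print("slope m:", model.coef_[0])            # estimated slope
print("intercept c:", model.intercept_)      # estimated intercept
print("prediction for x = 6:", model.predict([[6]])[0])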
2. Decision Tree: The decision tree algorithm is another machine learning algorithm that belongs to supervised learning. It is one of the most popular machine learning algorithms and can be used for both classification and regression problems.
In the decision tree algorithm, we solve the problem using a tree representation in which each internal node represents a feature, each branch represents a decision, and each leaf represents the outcome.
A typical example is deciding whether to accept a job offer.
In a decision tree, we start from the root of the tree and compare the value of the root attribute with the corresponding attribute of the record. On the basis of this comparison, we follow the branch for that value and move to the next node. We continue comparing values in this way until we reach a leaf node with the predicted class value.
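A small Python sketch of this idea is given below; it is not the job-offer example from the original figure, and the features (salary, commute time) and labels are invented for illustration only:

# Decision tree sketch on a hypothetical "job offer" style dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [salary_in_lakhs, commute_time_minutes]; label: 1 = accept the offer, 0 = decline
X = [[10, 30], [6, 90], [12, 60], [5, 20], [9, 45], [4, 75]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["salary", "commute"]))  # the learned decision rules
print(tree.predict([[8, 40]]))                                 # walk the tree for a new offer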
3. K-Means Clustering: The k-means algorithm groups the data points into c clusters by minimizing the sum of squared Euclidean distances between each point and its cluster centre:

J(V) = Σ (i = 1 to c) Σ (j = 1 to ci) ( ||xi - vj|| )²

Where,
J(V) => Objective function
||xi - vj|| => Euclidean distance between xi and vj
ci => Number of data points in the ith cluster
c => Number of clusters
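A brief Python sketch (the points below are made up for illustration) shows how this objective can be minimized with scikit-learn, which reports the same quantity as inertia_:

# K-means sketch: fit two clusters and inspect the objective J(V).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])   # invented 2-D points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", km.labels_)
print("cluster centres v:", km.cluster_centers_)
print("objective J(V):", km.inertia_)       # sum of squared distances to the nearest centre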
Now, let's understand the most common types of problems that occur in data science and the approach to solving them. In data science, problems are solved using algorithms, and the questions below indicate which family of algorithms applies:
Is this A or B? :
This refers to problems that have only two fixed outcomes, such as yes or no, 1 or 0, may or may not. Such problems can be solved using classification algorithms.
Is this different? :
This refers to questions where the data follows certain patterns and we need to find the odd ones out. Such problems can be solved using anomaly detection algorithms.
How much or how many?
Problems that ask for numerical values or figures, such as what the time is or what the temperature will be today, can be solved using regression algorithms.
How is this organized?
If the problem deals with how the data is organized, it can be solved using clustering algorithms.
A clustering algorithm organizes and groups the data based on features, colours, or other common characteristics.
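As a loose illustration of this mapping (the specific scikit-learn estimators below are common choices, not ones prescribed by the notes), each question type can be paired with a representative algorithm:

# Hypothetical mapping from question type to a representative scikit-learn estimator.
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans

question_to_algorithm = {
    "Is this A or B?":        LogisticRegression(),   # classification
    "Is this different?":     IsolationForest(),      # anomaly detection
    "How much or how many?":  LinearRegression(),     # regression
    "How is this organized?": KMeans(n_clusters=3),   # clustering
}

for question, estimator in question_to_algorithm.items():
    print(f"{question:26s} -> {estimator.__class__.__name__}")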
The main phases of the data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine the basic requirements, priorities, and project budget. In this phase, we need to determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at a first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we
need to perform the following tasks:
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use the data in our further processes.
3. Model Planning: In this phase, we need to determine the various methods and techniques for establishing the relationships between the input variables. We apply Exploratory Data Analysis (EDA), using various statistical formulas and visualization tools, to understand the relationships between variables and to see what the data can tell us. Common tools used for model planning are:
o SQL Analysis Services
o R
o SAS
o Python
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It is the method of studying and exploring data sets through analysis and visualization to understand their key characteristics, uncover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Outlier Detection: Identifying unusual values that deviate from other data
points. Outliers can influence statistical analyses and might indicate data entry errors or
unique cases.
Summary Statistics: Calculating key statistics that provide insight into data
trends and nuances.
Testing Assumptions: Many statistical tests and models assume the data
meet certain conditions (like normality or homoscedasticity). EDA helps verify these
assumptions.
Exploratory Data Analysis (EDA) is important for several reasons, especially in the
context of data science and statistical modeling. Here are some of the key reasons
why EDA is a critical step in the data analysis process:
1. Understanding Data Structures: EDA helps in getting familiar with the dataset,
understanding the number of features, the type of data in each feature, and the distribution of data
points. This understanding is crucial for selecting appropriate analysis or prediction techniques.
2. Identifying Patterns and Relationships: Through visualizations and statistical
summaries, EDA can reveal hidden patterns and intrinsic relationships between variables. These
insights can guide further analysis and enable more effective feature engineering and model
building.
3. Detecting Anomalies and Outliers: EDA is essential for identifying errors or unusual
data points that may adversely affect the results of your analysis. Detecting these early can
prevent costly mistakes in predictive modeling and analysis.
4. Testing Assumptions: Many statistical models assume that data follow a certain
distribution or that variables are independent. EDA involves checking these assumptions. If the
assumptions do not hold, the conclusions drawn from the model could be invalid.
5. Informing Feature Selection and Engineering: Insights gained from EDA can
inform which features are most relevant to include in a model and how to transform them
(scaling, encoding) to improve model performance.
6. Optimizing Model Design: By understanding the data’s characteristics, analysts can
choose appropriate modeling techniques, decide on the complexity of the model, and better tune
model parameters.
7. Facilitating Data Cleaning: EDA helps in spotting missing values and errors in the
data, which are critical to address before further analysis to improve data quality and integrity.
8. Enhancing Communication: Visual and statistical summaries from EDA can make it
easier to communicate findings and convince others of the validity of your conclusions,
particularly when explaining data-driven insights to stakeholders without technical backgrounds.
Types of Exploratory Data Analysis
EDA refers to the process of analyzing and examining data sets to uncover patterns, identify relationships, and gain insights. Various EDA techniques can be employed depending on the nature of the data and the goals of the analysis. Depending on the number of columns being analyzed, EDA can be divided into three types: univariate, bivariate, and multivariate.
1. Univariate Analysis
2. Bivariate Analysis
3. Multivariate Analysis
Multivariate analysis examines the relationships between two or more variables in the
dataset. It aims to understand how variables interact with one another, which is crucial for most
statistical modeling techniques. Techniques include:
Pair plots: Visualize relationships across several variables simultaneously to capture a
comprehensive view of potential interactions.
Principal Component Analysis (PCA): A dimensionality reduction technique used to
reduce the dimensionality of large datasets, while preserving as much variance as possible.
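A short sketch of both techniques is given below, using the iris example dataset that seaborn can load (assuming seaborn and scikit-learn are installed); it is illustrative only:

# Multivariate EDA sketch: pair plot and PCA on seaborn's iris example dataset.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

iris = sns.load_dataset("iris")            # example dataset fetched by seaborn (needs internet on first use)
sns.pairplot(iris, hue="species")          # pairwise relationships across all variables
plt.show()

features = iris.drop(columns="species")
pca = PCA(n_components=2).fit(features)    # reduce the four features to two components
print("explained variance ratio:", pca.explained_variance_ratio_)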
In addition to univariate, bivariate, and multivariate analysis, there are specialized EDA techniques tailored for specific types of data or analysis needs:
Spatial Analysis: For geographical data, using maps and spatial plotting to understand
the geographical distribution of variables.
Text Analysis: Involves techniques like word clouds, frequency distributions, and
sentiment analysis to explore text data.
Time Series Analysis: This type of analysis is mainly applied to data sets that have a temporal component. Time series analysis involves examining and modeling patterns, trends, and seasonality in the data over time. Techniques like line plots, autocorrelation analysis, moving averages, and ARIMA (AutoRegressive Integrated Moving Average) models are commonly used in time series analysis.
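For instance, a moving average can be computed with pandas as in the brief sketch below; the daily series here is synthetic, created purely for illustration:

# Time series sketch: a synthetic daily series smoothed with a 7-day moving average.
import numpy as np
import pandas as pd

dates = pd.date_range("2023-01-01", periods=60, freq="D")
values = 10 + np.sin(np.arange(60) / 7) + np.random.default_rng(0).normal(0, 0.3, 60)
series = pd.Series(values, index=dates)

rolling_mean = series.rolling(window=7).mean()   # smooths out short-term noise
print(rolling_mean.tail())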
Exploratory Data Analysis Tools
Exploratory Data Analysis (EDA) can be effectively performed using a variety of tools and software,
each offering unique features suitable for handling different types of data and analysis requirements.
1. Python Libraries
Pandas: Provides extensive functions for data manipulation and analysis, including data
structure handling and time series functionality.
Matplotlib: A plotting library for creating static, interactive, and animated visualizations in
Python.
Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive
and informative statistical graphics.
Plotly: An interactive graphing library for making interactive plots that offers more sophisticated visualization capabilities.
2. R Packages
ggplot2: Part of the tidyverse, it’s a powerful tool for making complex plots from data in a
data frame.
dplyr: A grammar of data manipulation, providing a consistent set of verbs that
help you solve the most common data manipulation challenges.
tidyr: Helps to tidy your data. Tidying your data means storing it in a consistent
form that matches the semantics of the dataset with the way it is stored.
Performing Exploratory Data Analysis (EDA) involves a series of steps designed to help you
understand the data you’re working with, uncover underlying patterns, identify anomalies, test
hypotheses, and ensure the data is clean and suitable for further analysis.
The first step in any data analysis project is to clearly understand the problem you are trying to solve and the data you have at your disposal. This involves asking questions such as:
What is the business goal or research question you are trying to address?
What are the variables in the data, and what do they mean?
What are the data types (numerical, categorical, text, etc.)?
Are there any known data quality issues or limitations?
Are there any relevant domain-specific considerations or constraints?
By thoroughly understanding the problem and the data, you can better formulate your analysis approach and avoid making incorrect assumptions or drawing misleading conclusions. It is also valuable to consult domain experts or stakeholders at this stage to ensure you have a complete understanding of the context and requirements.
Once you have a clear understanding of the problem and the data, the next step is to import the data into your analysis environment (e.g., Python, R, or a spreadsheet program). During this step, inspecting the data is critical to gain an initial understanding of its structure, variable types, and potential issues.
Here are a few tasks you can carry out at this stage:
Load the data into your analysis environment, ensuring that it is imported correctly and without errors or truncation.
Examine the size of the data (number of rows and columns) to get a sense of its scale and complexity.
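A minimal pandas sketch of this step is shown below; the file name data.csv is only a placeholder, not a file referenced by the notes:

# Loading and first inspection of a dataset with pandas ("data.csv" is a placeholder).
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)     # number of rows and columns
print(df.dtypes)    # variable types (numerical, categorical/object, etc.)
print(df.head())    # first few records for a quick sanity check
df.info()           # column types and non-null counts in one view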
Missing data is a common challenge in many datasets, and it can significantly impact the quality and reliability of your analysis. During the EDA process, it is critical to identify and handle missing data appropriately, as ignoring or mishandling it can result in biased or misleading outcomes.
Here are some techniques you can use to handle missing data:
Understand the patterns and potential causes of the missing data: Is the data missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)? Understanding the underlying mechanism informs the appropriate method for handling missing data.
Decide whether to remove observations with missing values (listwise deletion) or to impute (fill in) missing values: Removing observations with missing values can lead to a loss of information and potentially biased outcomes, especially if the missing data are not MCAR. Imputing missing values can help preserve valuable information; however, the imputation approach must be chosen carefully.
Use suitable imputation strategies, such as mean/median imputation, regression imputation, multiple imputation, or machine-learning-based imputation methods like k-nearest neighbours (KNN) or decision trees. The choice of imputation technique should be based on the characteristics of the data and the assumptions underlying each method.
Consider the impact of missing data: Even after imputation, missing data can introduce uncertainty and bias. It is important to acknowledge these limitations and interpret your results with caution.
Handling missing data properly can improve the accuracy and reliability of your analysis and prevent biased or misleading conclusions. It is also important to document the techniques used to handle missing data and the rationale behind your decisions.
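A brief sketch of these options with pandas and scikit-learn is given below; the column names and values are hypothetical:

# Missing-data sketch: inspect, drop, or impute (hypothetical columns and values).
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25, None, 31, 40, None],
                   "income": [30000, 42000, None, 52000, 48000]})

print(df.isnull().sum())                               # missing values per column

dropped = df.dropna()                                  # listwise deletion
mean_filled = df.fillna(df.mean(numeric_only=True))    # simple mean imputation

knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                          columns=df.columns)          # KNN-based imputation
print(knn_filled)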
After addressing the missing data, the next step in the EDA process is to explore the characteristics of your data. This involves examining the distribution, central tendency, and variability of your variables and identifying any potential outliers or anomalies. Understanding the characteristics of your data is critical for choosing appropriate analytical techniques, identifying potential data quality issues, and gaining insights that can inform subsequent analysis and modeling decisions.
Calculate summary statistics (mean, median, mode, standard deviation, skewness, kurtosis, and so on) for numerical variables: These statistics provide a concise overview of the distribution and central tendency of each variable, helping to identify potential issues or deviations from expected patterns.
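These statistics can be obtained with a few pandas calls, as in the sketch below (the DataFrame and its columns are invented):

# Summary-statistics sketch for the numerical columns of an invented DataFrame.
import pandas as pd

df = pd.DataFrame({"age": [23, 31, 29, 45, 52, 38],
                   "income": [28000, 52000, 41000, 90000, 75000, 61000]})

print(df.describe())    # count, mean, std, min, quartiles, max
print(df.median())      # central tendency that is robust to outliers
print(df.skew())        # asymmetry of each distribution
print(df.kurtosis())    # heaviness of the tails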
Data transformation is a critical step in the EDA process because it prepares your data for further analysis and modeling. Depending on the characteristics of your data and the requirements of your analysis, you may need to perform various transformations to ensure that the data is in the most appropriate format.
Here are a few common data transformation techniques:
Scaling or normalizing numerical variables to a standard range (e.g., min-max scaling, standardization)
Encoding categorical variables for use in machine learning models (e.g., one-hot encoding, label encoding)
Applying mathematical transformations to numerical variables (e.g., logarithmic, square root) to correct for skewness or non-linearity
Creating derived variables or features based on existing variables (e.g., calculating ratios, combining variables)
Aggregating or grouping data based on specific variables or conditions
By transforming your data appropriately, you can ensure that your analysis and modeling techniques are applied correctly and that your results are reliable and meaningful.
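The sketch below illustrates a few of these transformations with pandas and scikit-learn; the columns (price, area, city) are invented for illustration:

# Transformation sketch: scaling, log transform, derived feature, one-hot encoding.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"price": [100, 250, 900, 40],
                   "area": [20, 35, 80, 10],
                   "city": ["A", "B", "A", "C"]})

df["price_minmax"] = MinMaxScaler().fit_transform(df[["price"]]).ravel()   # scale to [0, 1]
df["area_std"] = StandardScaler().fit_transform(df[["area"]]).ravel()      # standardize
df["log_price"] = np.log(df["price"])                                      # correct skewness
df["price_per_area"] = df["price"] / df["area"]                            # derived ratio feature

encoded = pd.get_dummies(df, columns=["city"])                             # one-hot encoding
print(encoded)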
Visualization is a powerful tool in the EDA process, as it helps to discover relationships between variables and to identify patterns or trends that may not be immediately apparent from summary statistics or numerical outputs. To visualize data relationships, explore univariate, bivariate, and multivariate analysis.
Create frequency tables, bar plots, and pie charts for categorical variables: These visualizations can help you understand the distribution of categories and discover any imbalances or unusual patterns.
Generate histograms, box plots, violin plots, and density plots to visualize the distribution of numerical variables: These visualizations can reveal important information about the shape, spread, and potential outliers in the data.
Examine the correlation or association between variables using scatter plots, correlation matrices, or statistical tests like Pearson's correlation coefficient or Spearman's rank correlation: Understanding the relationships between variables can inform feature selection, dimensionality reduction, and modeling decisions.
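The following sketch shows a few of these plots with matplotlib and seaborn on randomly generated, purely illustrative data:

# Visualization sketch: histogram, box plot, and correlation heatmap on synthetic data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({"height": rng.normal(170, 10, 200),
                   "weight": rng.normal(70, 12, 200)})
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

df["height"].hist(bins=20)            # distribution of a single numerical variable
plt.show()

sns.boxplot(data=df)                  # spread and potential outliers per column
plt.show()

sns.heatmap(df.corr(), annot=True)    # pairwise correlations between variables
plt.show()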
An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and removing them from a dataframe is the same as removing any other data item from a pandas dataframe.
Identify and inspect potential outliers using techniques like the interquartile range (IQR), Z-scores, or domain-specific rules: Outliers can considerably impact the results of statistical analyses and machine learning models, so it is essential to identify and handle them appropriately.
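A compact sketch of IQR- and Z-score-based detection with pandas is given below; the numbers are made up, with 95 planted as the suspicious value:

# Outlier-detection sketch using the IQR rule and Z-scores on made-up values.
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 95, 12, 14, 13])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]   # classic 1.5 * IQR rule

z_scores = (s - s.mean()) / s.std()
z_outliers = s[z_scores.abs() > 2.5]                            # common cutoffs are 2-3

print("IQR outliers:\n", iqr_outliers)
print("Z-score outliers:\n", z_outliers)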
The final step in the EDA process is effectively communicating your findings and insights. This includes summarizing your analysis, highlighting key discoveries, and presenting your results clearly and compellingly.
Here are a few tips for effective communication:
Clearly state the objectives and scope of your analysis
Provide context and background information to help others understand your approach
Use visualizations and graphics to support your findings and make them more accessible
Highlight important insights, patterns, or anomalies discovered during the EDA process
Discuss any limitations or caveats related to your analysis
Suggest potential next steps or areas for further investigation
Effective communication is critical for ensuring that your EDA efforts have a meaningful impact and that your insights are understood and acted upon by stakeholders.
Conclusion
Exploratory Data Analysis forms the bedrock of data science endeavors, offering invaluable insights
into dataset nuances and paving the path for informed decision-making. By delving into data
distributions, relationships, and anomalies, EDA empowers data scientists to unravel hidden truths
and steer projects toward success.