Data Mining - Introduction
Pramod Kumar Singh
Professor (Computer Science and Engineering)
ABV – Indian Institute of Information Technology Management Gwalior
Gwalior – 474015, MP, India
Introduction
◼ Why Data Mining?
◼ What Is Data Mining?
◼ What Kind of Data Can Be Mined?
◼ What Kinds of Patterns Can Be Mined?
◼ What Technologies Are Used?
◼ What Kind of Applications Are Targeted?
◼ Major Issues in Data Mining
Data Mining – Why?
We say we live in the information age, but in fact we live in the data age. Digitalization has increased the
volume of data manyfold compared with the previous era, and this growth is explosive. The World Wide Web (WWW),
social networks, supermarkets, business houses, industries, etc. generate data on the scale of petabytes and more.
◼ Major sources of explosive growth of abundant data
◼ Business: Web, e-commerce, transactions, stocks, …
◼ Science: Remote sensing, bioinformatics, scientific simulation, …
◼ Society and everyone: news, digital cameras, YouTube, …
It is not possible to uncover the knowledge and information hidden in this heap of data without automated
tools.
◼ We are drowning in data but starving for knowledge!
◼ “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.
This explosively growing, widely available, and gigantic
body of data makes our time truly the data age.
Powerful and versatile tools are badly needed to
automatically uncover valuable information from the
tremendous amounts of data and to transform such
data into organized knowledge.
This necessity has led to the birth of data mining.
Figure: The world is data rich but information poor.
Data mining can be viewed as a result of the natural
evolution of Information Technology (IT).
Since the 1960s, database and information technology
has evolved systematically from primitive file
processing systems to sophisticated and powerful
database systems.
After the establishment of database management
systems, database technology moved toward the
development of advanced database systems, data
warehousing, and data mining for advanced data
analysis and web-based databases.
Advanced data analysis emerged from the late 1980s
onward because of steady progress in computer hardware
technology, which made powerful and affordable
computers, data collection equipment, and storage
media available. This boosted information retrieval
and data analysis.
One emerging data repository architecture is the data warehouse. This is a repository of multiple
heterogeneous data sources organized under a unified schema at a single site to facilitate management decision
making.
Data warehouse technology includes data cleaning, data integration, and online analytical processing (OLAP) —
that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as
the ability to view information from different angles.
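As a rough illustration of the summarization and aggregation functionality described above, the following Python sketch rolls up a small, made-up sales table along different dimensions. The records, the column meanings (region, quarter, amount), and the helper `roll_up` are hypothetical and do not come from any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact records: (region, quarter, sales amount).
SALES = [
    ("North", "Q1", 100),
    ("North", "Q2", 150),
    ("South", "Q1", 80),
    ("South", "Q2", 120),
]


def roll_up(records, dims):
    """Sum the amount over all records, grouped by the chosen dimensions.

    dims is a tuple of column indexes: 0 = region, 1 = quarter.
    """
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[i] for i in dims)
        totals[key] += record[2]
    return dict(totals)


by_region = roll_up(SALES, (0,))    # view the data by region
by_quarter = roll_up(SALES, (1,))   # view the same data by quarter
grand_total = roll_up(SALES, ())    # fully aggregated view
```

Grouping the same records by different column indexes is a simple stand-in for viewing information from different angles. A real data warehouse would precompute and index such aggregates rather than scan the records each time.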
Though OLAP tools support multidimensional analysis and decision making, additional data analysis tools are
required for in-depth analysis.
Examples are data mining tools that provide data classification, clustering, outlier/anomaly detection, and
the characterization of changes in data over time.
Since the 1990s, huge volumes of data have accumulated beyond databases and data warehouses, e.g., in the World
Wide Web, web-based databases, and Internet-based, globally interconnected, heterogeneous databases and
information bases. They play a vital role in the information industry.
The effective and efficient analysis of data in such different forms, through the integration of information
retrieval, data mining, and information network analysis technologies, is a challenging task.
The fast-growing, tremendous amount of data,
collected and stored in large and numerous data
repositories, has far exceeded our human ability for
comprehension without powerful tools. This situation
(the abundance of data and the need for powerful
data analysis tools) is described as a data rich but
information poor situation.
Consequently, important decisions are often made
based NOT on the information-rich data stored in data
repositories but rather on a decision maker’s intuition,
simply because the decision maker does not have the
tools to extract the valuable knowledge embedded in
the vast amounts of data.
The widening gap between data and information calls
for the systematic development of data mining tools
that can turn data tombs into golden nuggets of
knowledge.
What Is Data Mining?
Data mining, a truly interdisciplinary subject, is
essentially knowledge mining from data, as illustrated in
the adjacent figure.
Other popular names for data mining are knowledge
mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging.
Data mining is also popularly known as knowledge
discovery from data, or KDD.
Figure: Data mining – searching the knowledge
(interesting patterns) in data.
The KDD process is as follows.
1. Data cleaning: removing noise and inconsistent data.
2. Data integration: data from multiple sources are combined.
3. Data selection: data relevant to the analysis task are
retrieved from the database.
4. Data transformation: data are transformed and
consolidated into forms appropriate for mining. It is
usually done by performing summary or aggregation
operations.
5. Data mining: an essential process where intelligent
methods are applied to extract data patterns.
6. Pattern evaluation: identify the truly interesting
patterns representing knowledge based on
interestingness measures.
7. Knowledge presentation: visualization and knowledge
representation techniques are used to present mined
knowledge to users.
Figure: Data mining as a step in the process of knowledge discovery.
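The seven KDD steps listed above can be sketched as a chain of small Python functions. This is only a minimal illustration under stated assumptions: the two data sources, the record layout (customer, item), the function names, and the choice of item support as the interestingness measure are all hypothetical.

```python
from collections import Counter

# Two hypothetical data sources; each record is (customer, item).
STORE_A = [("c1", "milk"), ("c1", "bread"), ("c2", "milk"), ("c2", None)]
STORE_B = [("c3", "milk"), ("c3", "bread"), ("c4", "tea"), ("c3", "bread")]


def clean(records):
    # Step 1: drop records with missing values and exact duplicates.
    seen, result = set(), []
    for rec in records:
        if None in rec or rec in seen:
            continue
        seen.add(rec)
        result.append(rec)
    return result


def integrate(*sources):
    # Step 2: combine the sources into one collection.
    return [rec for source in sources for rec in source]


def select(records, items_of_interest):
    # Step 3: keep only the data relevant to the analysis task.
    return [rec for rec in records if rec[1] in items_of_interest]


def transform(records):
    # Step 4: aggregate the records into one item set per customer.
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets


def mine(baskets):
    # Step 5: count how many baskets contain each item (a very simple pattern).
    return Counter(item for items in baskets.values() for item in items)


def evaluate(counts, n_baskets, min_support):
    # Step 6: keep only patterns whose support meets the threshold.
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}


def present(patterns):
    # Step 7: render the surviving patterns for the user.
    return [f"{item}: support {s:.2f}" for item, s in sorted(patterns.items())]


records = integrate(clean(STORE_A), clean(STORE_B))
records = select(records, {"milk", "bread", "tea"})
baskets = transform(records)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = present(patterns)
```

Here `report` lists only the items that appear in at least half of the customer baskets. In practice each step is far more involved, and the steps are often revisited iteratively rather than run once in sequence.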
The first four steps (Steps 1 – 4) are different forms of data preprocessing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to
the user and may be stored as new knowledge in the knowledge base.
Although data mining is shown as one step in the knowledge discovery process, in industry, in the media, and in
the research milieu, the term data mining is often used to refer to the entire knowledge discovery process.
Therefore, we adopt a broad view of data mining functionality: Data mining is the process of discovering
interesting patterns and knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.
What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to any kind of data if the data are meaningful for a target
application.
However, the most basic forms of data for mining applications are database data, data warehouse data, and
transactional data.
Data mining can also be applied to other forms of data, e.g., data streams, ordered/sequence data, graph or
networked data, spatial data, text data, multimedia data, and the WWW.
Data mining continues to embrace new data types as they emerge.
What Kinds of Patterns Can Be Mined?
There are several data mining functionalities. Primarily, they are:
Characterization and discrimination: Data characterization is a summarization of the general characteristics or features
of a target class of data. Data discrimination is a comparison of the general features of the target class data objects against
the general features of objects from one or multiple contrasting classes.
Mining of frequent patterns, associations, and correlations: Frequent patterns are patterns that occur
frequently in data. Associations are strong relationships among the items of frequent patterns. Correlations are
interesting statistical relationships between associated attribute–value pairs.
Classification and regression: Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts. Whereas classification predicts categorical (discrete, unordered) labels,
regression models continuous-valued functions.
Clustering analysis: Unlike classification and regression, which analyze class-labeled (training) data sets,
clustering analyzes data objects without consulting class labels.
Outlier analysis: A data set may contain objects that do not comply with the general behavior or model of the
data. These data objects are outliers.
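To make frequent patterns and associations more concrete, the following Python sketch computes two commonly used measures, support and confidence, over a handful of made-up transactions. The text above does not name these measures, so their use here is an assumption, as are the transaction data and the function names.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "tea"},
    {"bread", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_pairs(transactions, min_support):
    """Return all item pairs whose support is at least min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        s = support(set(pair), transactions)
        if s >= min_support:
            result[pair] = s
    return result


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


pairs = frequent_pairs(TRANSACTIONS, min_support=0.5)
conf = confidence({"butter"}, {"bread"}, TRANSACTIONS)
```

With these transactions, every basket that contains butter also contains bread, so the rule "butter implies bread" has the highest possible confidence. The brute-force enumeration of pairs is only practical for tiny examples. Real frequent-pattern miners prune the search space.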
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general,
such tasks can be classified into two categories: descriptive and predictive.
✓ Descriptive mining tasks characterize properties of the data in a target data set.
✓ Predictive mining tasks perform induction on the current data to make predictions.
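The difference between the two categories can be shown on a toy data set. In the Python sketch below, `describe` summarizes properties of the existing data (a descriptive task), while `fit_line` induces a simple least-squares line from the same data and uses it to predict an unseen value (a predictive task). The data, the function names, and the choice of a straight-line model are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical observations: (advertising spend, sales).
DATA = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def describe(values):
    """Descriptive task: summarize properties of the existing data."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def fit_line(points):
    """Predictive task: induce a least-squares line y = a * x + b."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    a = (sum((x - x_bar) * (y - y_bar) for x, y in points)
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - a * x_bar
    return a, b


summary = describe([y for _, y in DATA])
a, b = fit_line(DATA)
predicted = a * 5.0 + b   # prediction for an x value not in the data
```

The summary says something only about the four recorded sales values, whereas the fitted line generalizes beyond them.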
What Technologies Are Used?
As a highly application-driven domain, data mining has incorporated many techniques from other domains such
as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval,
visualization, algorithms, high performance computing, and many application domains.
Figure: Data mining adopts techniques from many domains.
The interdisciplinary nature of data mining research and development contributes significantly to the success of
data mining and its extensive applications.
What Kinds of Applications Are Targeted?
As a highly application-driven discipline, data mining has seen great successes in many applications. Its
applications are limited only by the human imagination.
Two highly successful and popular application examples of data mining are business intelligence and
search engines.
Business Intelligence (BI): BI technologies provide historical, current, and predictive views of business
operations. Examples include reporting, online analytical processing, business performance management,
competitive intelligence, benchmarking, and predictive analytics. Data mining is at the core of business
intelligence.
Web Search Engines: A web search engine is a specialized computer server that searches for information on the
Web. The search results of a user query are often returned as a list (sometimes called hits). The hits may consist
of web pages, images, and other types of files. Web search engines are essentially very large data mining
applications. Various data mining techniques are used in all aspects of search engines, ranging from crawling
and indexing to searching.
Major Issues in Data Mining
Data mining is a dynamic and fast-expanding field with great strengths. The major issues in data mining research
can be partitioned into the following five groups.
Mining methodology
✓ Mining various and new kinds of knowledge: Due to the diversity of applications, new mining tasks
continue to emerge, making data mining a dynamic and fast-growing field.
✓ Mining knowledge in multidimensional space: When searching for knowledge in large data sets, we can
explore the data in multidimensional space.
✓ Data mining – an interdisciplinary effort: The power of data mining can be substantially enhanced by
integrating new methods from multiple disciplines.
✓ Boosting the power of discovery in a networked environment: Most data objects reside in a linked or
interconnected environment, whether it be the Web, database relations, files, or documents.
✓ Handling uncertainty, noise, or incompleteness of data: Data often contain noise, errors, exceptions, or
uncertainty, or are incomplete. Errors and noise may confuse the data mining process, leading to the
derivation of erroneous patterns.
✓ Pattern evaluation and pattern- or constraint-guided mining: Not all the patterns generated by data
mining processes are interesting. Pattern interestingness is user dependent. Therefore, techniques are
needed to assess the interestingness of discovered patterns based on subjective measures.
User interaction
✓ Interactive mining: Interactive mining should allow users to dynamically change the focus of a search and to
refine mining requests based on returned results.
✓ Incorporation of background knowledge: Background knowledge, constraints, rules, and other
information regarding the domain under study should be incorporated into the knowledge discovery
process.
✓ Ad hoc data mining and data mining query languages: There should be high-level data mining query
languages or other high-level flexible user interfaces that give users the freedom to define ad hoc data
mining tasks.
✓ Presentation and visualization of data mining results: How can a data mining system present data mining
results, vividly and flexibly, so that the discovered knowledge can be easily understood and directly
usable by humans?
Efficiency and scalability
✓ Efficiency and scalability of data mining algorithms: The algorithms must be efficient and scalable to
effectively extract information from huge amounts of data in many data repositories or in dynamic data
streams.
✓ Parallel, distributed, and incremental mining algorithms: The humongous size of many data sets, the wide
distribution of data, and the computational complexity of some data mining methods are the prime factors
motivating the development of such algorithms.
Diversity of data types
✓ Handling complex types of data: Diverse applications generate a wide spectrum of new and complex data
types.
✓ Mining dynamic, networked, and global data repositories: Mining gigantic, interconnected information
networks may help disclose many more patterns and knowledge in heterogeneous data sets than can be
discovered from a small set of isolated data repositories.
Data mining and society
✓ Social impacts of data mining: With data mining penetrating our everyday lives, it is important to study
the impact of data mining on society.
✓ Privacy-preserving data mining: Data mining poses the risk of disclosing an individual’s personal
information.
✓ Invisible data mining: People should be able to perform data mining or use data mining results simply by
mouse clicking, without any knowledge of data mining algorithms.