DATA MINING
What Is Data Mining?
Data mining is the process of searching and analyzing a large batch of raw data in
order to identify patterns and extract useful information.
Companies use data mining software to learn more about their customers. It can help
them to develop more effective marketing strategies, increase sales, and decrease
costs. Data mining relies on effective data collection, warehousing, and computer
processing.
KEY TAKEAWAYS
Data mining is the process of analyzing a large batch of information to discern
trends and patterns.
Data mining can be used by corporations for everything from learning about what
customers are interested in or want to buy to fraud detection and spam filtering.
Data mining programs break down patterns and connections in data based on what
information users request or provide.
Social media companies use data mining techniques to commodify their users in order
to generate profit.
This use of data mining has come under criticism lately as users are often unaware
of the data mining happening with their personal information, especially when it is
used to influence preferences.
How Data Mining Works
Data mining involves exploring and analyzing large blocks of information to glean
meaningful patterns and trends. It is used in credit risk management, fraud
detection, and spam filtering. It also is a market research tool that helps reveal
the sentiment or opinions of a given group of people. The data mining process
breaks down into four steps:
1. Data is collected and loaded into data warehouses on-site or on a cloud service.
2. Business analysts, management teams, and information technology professionals
   access the data and determine how they want to organize it.
3. Custom application software sorts and organizes the data.
4. The end user presents the data in an easy-to-share format, such as a graph or
   table.
Data Warehousing and Mining Software
Data mining programs analyze relationships and patterns in data based on user
requests. They organize information into classes.
For example, a restaurant may want to use data mining to determine which specials
it should offer and on what days. The data can be organized into classes based on
when customers visit and what they order.
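As a minimal sketch of the restaurant example, the snippet below sorts a handful of orders into classes by weekday and picks the most-ordered item for each day. The order records and the function name `top_item_by_day` are hypothetical and invented purely for illustration; a real system would read from the warehouse rather than a hard-coded list.

```python
from collections import Counter, defaultdict

# Hypothetical order log: (weekday, item ordered). Invented sample data.
orders = [
    ("Mon", "soup"), ("Mon", "soup"), ("Mon", "burger"),
    ("Fri", "fish"), ("Fri", "fish"), ("Fri", "burger"),
    ("Sat", "steak"), ("Sat", "burger"), ("Sat", "steak"),
]

def top_item_by_day(records):
    """Group orders into classes by weekday and return the top item per day."""
    by_day = defaultdict(Counter)
    for day, item in records:
        by_day[day][item] += 1
    # most_common(1) returns [(item, count)] for the most frequent item.
    return {day: counts.most_common(1)[0][0] for day, counts in by_day.items()}

specials = top_item_by_day(orders)
print(specials)  # {'Mon': 'soup', 'Fri': 'fish', 'Sat': 'steak'}
```

The restaurant could use a result like this as a starting point for deciding which special to run on which day.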
In other cases, data miners find clusters of information based on logical
relationships or look at associations and sequential patterns to draw conclusions
about trends in consumer behavior.
Warehousing is an important aspect of data mining. Warehousing is the
centralization of an organization's data into one database or program. It allows
the organization to spin off segments of data for specific users to analyze and use
depending on their needs.
Cloud data warehouse solutions use the space and power of a cloud provider to store
data. This allows smaller companies to leverage digital solutions for storage,
security, and analytics.
Data Mining Techniques
Data mining uses algorithms and various other techniques to convert large
collections of data into useful output. The most popular types of data mining
techniques include:
Association rules, also referred to as market basket analysis, search for
relationships between variables. This relationship in itself creates additional
value within the data set as it strives to link pieces of data. For example,
association rules would search a company's sales history to see which products are
most commonly purchased together; with this information, stores can plan, promote,
and forecast.
Classification assigns objects to predefined classes. These classes describe
the characteristics of items or represent what the data points have in common with
each other. This data mining technique allows the underlying data to be more neatly
categorized and summarized across similar features or product lines.
Clustering is similar to classification. However, clustering identifies
similarities between objects, then groups those items based on what makes them
different from other items. While classification may result in groups such as
"shampoo," "conditioner," "soap," and "toothpaste," clustering may identify groups
such as "hair care" and "dental health."
Decision trees are used to classify or predict an outcome based on a set list of
criteria or decisions. A decision tree asks a series of cascading questions that
sort the data set based on the responses given. Sometimes depicted as a tree-like
visual, a decision tree allows for specific direction and user input when drilling
deeper into the data.
K-nearest neighbor (KNN) is an algorithm that classifies data based on its
proximity to other data. KNN rests on the assumption that data points that are
close to each other are more similar to each other than to more distant points.
This non-parametric, supervised technique predicts the class or value of a data
point from the labeled points nearest to it.
Neural networks process data through the use of nodes. These nodes are composed of
inputs, weights, and an output. Data is mapped through supervised learning, in a
structure loosely modeled on the way the human brain is interconnected. This model
can be programmed with threshold values to help determine the model's accuracy.
Predictive analysis uses historical information to build graphical or mathematical
models that forecast future outcomes. Overlapping with regression analysis, this
technique aims to estimate an unknown future figure based on the data currently on
hand.
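To make the first technique above more concrete, here is a minimal sketch of market basket analysis. It counts how often each pair of products appears in the same basket (support) and estimates how often one product is bought given another (confidence). The baskets and the helper names `pair_support` and `confidence` are assumptions made up for this illustration; production tools typically use more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales history: each set is one customer's basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

def pair_support(transactions):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in transactions:
        # Sort so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(transactions) for pair, n in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Estimate how often consequent appears in baskets containing antecedent."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
print(confidence(baskets, "butter", "bread"))
```

In this toy data, bread and butter are the pair bought together most often, which is the kind of finding a store might use to plan promotions.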
The Data Mining Process
To be most effective, data analysts generally follow a certain flow of tasks along
the data mining process. Without this structure, an analyst may encounter an issue
in the middle of their analysis that could have easily been prevented had they
prepared for it earlier. The data mining process is usually broken into the
following steps.
Step 1: Understand the Business
Before any data is touched, extracted, cleaned, or analyzed, it is important to
understand the underlying entity and the project at hand. What are the goals the
company is trying to achieve by mining data? What is their current business
situation? What are the findings of a SWOT analysis? Before looking at any data,
the mining process starts by understanding what will define success at the end of
the process.
Step 2: Understand the Data
Once the business problem has been clearly defined, it's time to start thinking
about data. This includes what sources are available, how they will be secured and
stored, how the information will be gathered, and what the final outcome or
analysis may look like. This step also includes determining the limits of the data,
storage, security, and collection and assesses how these constraints will affect
the data mining process.
Step 3: Prepare the Data
Data is gathered, uploaded, extracted, or calculated. It is then cleaned,
standardized, scrubbed for outliers, assessed for mistakes, and checked for
reasonableness. During this stage of data mining, the data may also be checked for
size as an oversized collection of information may unnecessarily slow computations
and analysis.
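The cleaning described in this step might look something like the sketch below, which parses raw values, drops blanks, and scrubs extreme outliers. The sample amounts, the function name `clean_amounts`, and the cutoff of five median absolute deviations are all assumptions chosen for illustration; the right rule depends on the data.

```python
from statistics import median

# Hypothetical raw sale amounts as they might arrive from an export file.
raw_amounts = ["12.50", " 11.00", "13.25", "", "12.00", "1200.00", None, "11.75"]

def clean_amounts(values, max_deviations=5.0):
    """Parse amounts, drop missing entries, and scrub values far from the median."""
    parsed = []
    for value in values:
        if value is None or not value.strip():
            continue  # skip missing or blank entries
        parsed.append(float(value.strip()))
    mid = median(parsed)
    # Median absolute deviation: a measure of spread that outliers barely affect.
    mad = median(abs(x - mid) for x in parsed)
    if mad == 0:
        return parsed  # all values (nearly) identical; nothing to scrub
    return [x for x in parsed if abs(x - mid) / mad <= max_deviations]

cleaned = clean_amounts(raw_amounts)
print(cleaned)  # [12.5, 11.0, 13.25, 12.0, 11.75]
```

The 1200.00 entry is dropped as an outlier here; in practice an analyst would also check whether such a value is a mistake or a genuine large sale before discarding it.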
Step 4: Build the Model
With our clean data set in hand, it's time to crunch the numbers. Data scientists
use the types of data mining above to search for relationships, trends,
associations, or sequential patterns. The data may also be fed into predictive
models to assess how previous bits of information may translate into future
outcomes.
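As one simple illustration of feeding data into a predictive model, the sketch below fits a straight line to past sales with ordinary least squares and projects the next period. The sales history and the function name `fit_line` are hypothetical; real projects usually compare several model types rather than assuming a linear trend.

```python
# Hypothetical history: (month number, units sold). Invented for illustration.
history = [(1, 100.0), (2, 110.0), (3, 120.0), (4, 130.0)]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(history)
forecast_month_5 = slope * 5 + intercept
print(slope, intercept, forecast_month_5)
```

With this made-up history the fitted trend is ten additional units per month, so the projection for month 5 is 140 units.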
Step 5: Evaluate the Results
The data-centered aspect of data mining concludes by assessing the findings of the
data model or models. The outcomes from the analysis may be aggregated,
interpreted, and presented to decision-makers who have largely been excluded from
the data mining process to this point. In this step, organizations can choose to
make decisions based on the findings.
Step 6: Implement Change and Monitor
The data mining process concludes with management taking steps in response to the
findings of the analysis. The company may decide the information was not strong
enough or the findings were not relevant, or the company may strategically pivot
based on findings. In either case, management reviews the ultimate impact on the
business and restarts the data mining loop by identifying new business problems or
opportunities.
Different data mining process models have different steps, though the general
process is usually similar. For example, the Knowledge Discovery in Databases (KDD)
model has nine steps, the CRISP-DM model has six steps, and the SEMMA process model
has five steps.
Applications of Data Mining
In today's age of information, almost any department, industry, sector, or company
can make use of data mining.
Sales
Data mining encourages smarter, more efficient use of capital to drive revenue
growth. Consider the point-of-sale register at your favorite local coffee shop. For
every sale, that coffeehouse collects the time a purchase was made and what
products were sold. Using this information, the shop can strategically craft its
product line.
Marketing
Once the coffeehouse above knows its ideal line-up, it's time to implement the
changes. However, to make its marketing efforts more effective, the store can use
data mining to understand where its clients see ads, what demographics to target,
where to place digital ads, and what marketing strategies most resonate with
customers. This includes aligning marketing campaigns, promotional offers, cross-
sell offers, and programs to the findings of data mining.
Manufacturing
For companies that produce their own goods, data mining plays an integral part in
analyzing how much each raw material costs, what materials are being used most
efficiently, how time is spent along the manufacturing process, and what
bottlenecks negatively impact the process. Data mining helps ensure the flow of
goods is uninterrupted.
Fraud Detection
The heart of data mining is finding patterns, trends, and correlations that link
data points together. Therefore, a company can use data mining to identify outliers
or correlations that should not exist. For example, a company may analyze its cash
flow and find a recurring transaction to an unknown account. If this is
unexpected, the company may wish to investigate whether funds are being mismanaged.
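A check like the one in this example could be sketched as follows: count repeated payments and flag those that go to payees outside an approved list. The ledger entries, the account names, the approved-vendor set, and the function name `recurring_unknown_payments` are all hypothetical, and the threshold of three repeats is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical ledger entries: (payee account, amount). Invented sample data.
ledger = [
    ("ACME-SUPPLY", 500.00), ("ACCT-9931", 250.00), ("CITY-POWER", 120.00),
    ("ACCT-9931", 250.00), ("ACME-SUPPLY", 480.00), ("ACCT-9931", 250.00),
]
known_payees = {"ACME-SUPPLY", "CITY-POWER"}  # assumed approved-vendor list

def recurring_unknown_payments(entries, approved, min_repeats=3):
    """Return (payee, amount) pairs that repeat and go to unapproved payees."""
    counts = Counter(entry for entry in entries if entry[0] not in approved)
    return [entry for entry, n in counts.items() if n >= min_repeats]

flagged = recurring_unknown_payments(ledger, known_payees)
print(flagged)  # [('ACCT-9931', 250.0)]
```

A flagged entry is only a prompt for investigation, not proof that funds are being mismanaged.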
Human Resources
Human resources departments often have a wide range of data available for
processing including data on retention, promotions, salary ranges, company
benefits, use of those benefits, and employee satisfaction surveys. Data mining can
correlate this data to get a better understanding of why employees leave and what
entices new hires.
Customer Service
Customer satisfaction may be created (or destroyed) for a variety of reasons.
Imagine a company that ships goods. A customer may be dissatisfied with shipping
times, shipping quality, or communications. The same customer may be frustrated
with long telephone wait times or slow e-mail responses. Data mining gathers
operational information about customer interactions and summarizes the findings to
pinpoint weak points and highlight what the company is doing right.
Benefits of Data Mining
Data mining helps ensure a company is collecting and analyzing reliable data. It is
often a more rigid, structured process that formally identifies a problem, gathers
data related to the problem, and strives to formulate a solution. As a result, data
mining can help a business become more profitable, more efficient, or operationally
stronger.
Data mining can look very different across applications, but the overall process
can be used with almost any new or legacy application. Essentially any type of data
can be gathered and analyzed, and almost every business problem that relies on
quantifiable evidence can be tackled using data mining.
The end goal of data mining is to take raw bits of information and determine if
there is cohesion or correlation among the data. This benefit of data mining allows
a company to create value from information it already has on hand, value that would
otherwise not be apparent. Though data models can be complex, they can also
yield fascinating results, unearth hidden trends, and suggest unique strategies.
Limitations of Data Mining
This complexity of data mining is one of its greatest disadvantages. Data analytics
often requires technical skill sets and certain software tools. Smaller companies
may find this to be a barrier to entry too difficult to overcome.
Data mining doesn't always guarantee results. A company may perform statistical
analysis, make conclusions based on strong data, implement changes, and not reap
any benefits. Because of inaccurate findings, market changes, model errors, or
inappropriate data populations, data mining can only guide decisions and not ensure
outcomes.
There is also a cost component to data mining. Data tools may require costly
subscriptions, and some bits of data may be expensive to obtain. Security and
privacy concerns can be addressed, though the additional IT infrastructure may be
costly as well. Data mining may also be most effective when using huge data sets;
however, these data sets must be stored and require heavy computational power to
analyze.
Even large companies or government agencies have challenges with data mining.
Consider the FDA's white paper on data mining that outlines the challenges of bad
information, duplicate data, underreporting, or overreporting.
Data Mining and Social Media
One of the most lucrative applications of data mining has been undertaken by social
media companies. Platforms like Facebook, TikTok, Instagram, and Twitter gather
reams of data about their users, based on their online activities.
That data can be used to make inferences about their preferences. Advertisers can
target their messages to the people who appear to be most likely to respond
positively.
Data mining on social media has become a big point of contention, with several
investigative reports and exposés showing just how intrusive mining users' data can
be. At the heart of the issue, users may agree to the terms and conditions of the
sites without realizing how their personal information is being collected or to
whom their information is being sold.
Examples of Data Mining
Data mining can be used for good, or it can be used illicitly. Here is an example
of each.
eBay and e-Commerce
eBay collects countless bits of information every day from sellers and buyers. The
company uses data mining to attribute relationships between products, assess
desired price ranges, analyze prior purchase patterns, and form product categories.
eBay outlines the recommendation process as:
Raw item metadata and user historical data are aggregated.
Scripts are run on a trained model to generate and predict the item and user.
A KNN search is performed.
The results are written to a database.
The real-time recommendation takes the user ID, calls the database results, and
displays them to the user.
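To illustrate only the KNN search step in general terms, the sketch below ranks items by their distance to a user vector and returns the closest few. This is not eBay's implementation: the item vectors, the user vector, and the function name `nearest_items` are hypothetical, and a brute-force search is assumed for simplicity where a large system would likely use an approximate index.

```python
import math

# Hypothetical item vectors; in practice a trained model would produce these.
item_vectors = {
    "running shoes": (0.9, 0.1),
    "trail shoes": (0.8, 0.2),
    "coffee maker": (0.1, 0.9),
    "espresso cups": (0.2, 0.8),
}

def nearest_items(user_vector, items, k=2):
    """Return the k item names closest to user_vector by Euclidean distance."""
    ranked = sorted(items, key=lambda name: math.dist(user_vector, items[name]))
    return ranked[:k]

# Hypothetical user vector that sits close to the footwear items.
recommendations = nearest_items((0.9, 0.12), item_vectors)
print(recommendations)  # ['running shoes', 'trail shoes']
```

Storing results like these in a database ahead of time is what lets the real-time step look them up by user ID instead of recomputing them.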
Facebook-Cambridge Analytica Scandal
A cautionary example of data mining is the Facebook-Cambridge Analytica data
scandal. During the 2010s, the British consulting firm Cambridge Analytica Ltd.
collected personal data from millions of Facebook users. This information was later
analyzed for use in the 2016 presidential campaigns of Ted Cruz and Donald Trump.
It is suspected that Cambridge Analytica interfered with other notable events such
as the Brexit referendum.
In light of this inappropriate data mining and misuse of user data, Facebook agreed
to pay $100 million for misleading investors about its uses of consumer data. The
Securities and Exchange Commission claimed Facebook discovered the misuse in 2015
but did not correct its disclosures for more than two years.
Frequently Asked Questions
What Are the Types of Data Mining?
There are two main types of data mining: predictive data mining and descriptive
data mining. Predictive data mining extracts data that may be helpful in
determining an outcome. Descriptive data mining informs users of a given outcome.
How Is Data Mining Done?
Data mining relies on big data and advanced computing processes including machine
learning and other forms of artificial intelligence (AI). The goal is to find
patterns that can lead to inferences or predictions from large and unstructured
data sets.
What Is Another Term for Data Mining?
Data mining also goes by the less-used term "knowledge discovery in data," or KDD.
Where Is Data Mining Used?
Data mining applications have been designed to take on just about any endeavor that
relies on big data. Companies in the financial sector look for patterns in the
markets. Governments try to identify potential security threats. Corporations,
especially online and social media companies, use data mining to create profitable
advertising and marketing campaigns that target specific sets of users.
The Bottom Line
Modern businesses have the ability to gather information on their customers,
products, manufacturing lines, employees, and storefronts. On their own, these
scattered pieces of information may not tell a story, but data mining techniques,
applications, and tools help piece them together.
The ultimate goal of the data mining process is to compile data, analyze it, and
execute operational strategies based on the results.