
Data Warehousing

Data Mining Primer


for the Data Warehouse Professional

By: Arlene Zaima, Data Mining Marketing Manager, Teradata
Contributor: James Kashner, CTO, Teradata Data Mining Lab

Table of Contents

Executive Summary
What Exactly is Data Mining?
Data Mining Makes Its Way to the Business World
What Can Data Mining Do for Your Business?
The Difference Between OLAP and Data Mining
How Does Data Mining Work?
The Data Mining Process
The Relationship Between Data Mining and Data Warehousing
Data Mining Terms and Techniques
Data Mining Challenges
Data Mining with Teradata
Teradata Warehouse Miner
Teradata Data Mining Labs
The Data Mining Lab Engagement
How to Get Started with Data Mining
Driving Higher ROI
Summary

Executive Summary

By now, you've probably heard or read about the rewards that data mining can bring to your business. But very little has been written to explain the challenges facing many Information Technology (IT) organizations as they try to make data mining part of their business intelligence operations. This paper explores data mining from the IT perspective, giving a quick overview of data mining technology, its technical challenges, and solutions for implementing successful data mining projects.

This white paper explains data mining in terms that can be understood by data warehouse professionals. These explanations include:
> How data mining is used for business advantage today
> The integral relationship between data mining and data warehousing
> The challenges that may be encountered with data mining
> The details about how to get started with data mining

EB-3078 > 0104 > PAGE 2 OF 15

What Exactly is Data Mining?

Data mining is a powerful technology that converts detail data into competitive intelligence that businesses can use to proactively predict future trends and behaviors. Some vendors define data mining as a tool or as the application of an algorithm to data. But the truth is, data mining is not just a tool or an algorithm. Data mining is a process of discovering and interpreting previously unknown patterns in data to solve business problems. It is an iterative process in which each cycle further refines the result set. This can be a complex process, but there are tools available today to help you navigate through its steps.

From an IT perspective, the data mining process requires exploring the data, creating the analytic data set, building and testing the model, and integrating the results into business applications. Therefore, the IT organization must provide an environment capable of addressing the following challenges:
> Exploring and preprocessing large data volumes
> Providing sufficient processing power to analyze many variables (columns) and rows in a timely manner
> Integrating data mining results into the business process
> Creating an extensible and manageable data mining environment
Figure 1. Business Value of Analytic Applications.

Analytic Application | Business Question | Business Value
Channel Optimization | What is the best channel to reach my customer in each segment? | Interact with customers based on their preference and your need to manage cost.
Customer Attrition | Which customer is at risk of leaving? | Prevent loss of high-value customers and let go of lower-value customers.
Fraud Detection | How can I tell which transactions are likely to be fraudulent? | Quickly determine fraud and take immediate action to minimize cost.
Customer Profitability | What is the lifetime profitability of my customer? | Make individual business interaction decisions based on the overall profitability of customers.
Propensity to Buy | Which customers are most likely to respond to my promotion? | Target customers based on their need to increase their loyalty to your product line. Also, increase campaign profitability by focusing on those most likely to buy.
Customer Segmentation | What market segments do my customers fall into and what are their characteristics? | Personalize customer relationships for higher customer satisfaction and retention.

Data Mining Makes Its Way to the Business World

Since the mid-1980s, data mining has been very effective in select and focused situations such as medical diagnosis, scientific research, and behavioral profiling. In the past ten years, data mining technology has journeyed from the scientific and academic worlds into the business world where it adds a new dimension of predictive
analysis. To be effective in the business world, the data mining process had to be adapted to deliver models in a more time-sensitive manner. Today, with the advent of in-database data mining techniques, businesses have finally found it possible to benefit from the complex, predictive characteristics of a very powerful technology.

What Can Data Mining Do for Your Business?

For years, businesses have relied on reports and ad hoc query tools to glean useful information from their data. However, as data volumes continue to increase, finding valuable information becomes a daunting task. Data mining technology was designed to sift through detailed historical data to identify hidden patterns that are not obvious to humans or query tools. Many of these previously hidden patterns reveal intelligence that can be integrated into business processes to provide predictive capabilities that lead to strategic business decision-making. Data mining makes analytical business applications, such as CRM, smarter by providing insight that goes beyond just the obvious knowledge. By making your applications smarter, data mining translates into a higher return on your warehouse investment. (See Figure 1.)

The Difference Between OLAP and Data Mining

A commonly asked question is "What is the difference between data mining and on-line analytical processing (OLAP)?" OLAP is a business intelligence tool that allows you to analyze and understand particular business drivers. Typically, a specific descriptive or factual question is formulated and either validated or refuted through ad hoc queries. OLAP results are factual results. For example, you may ask, "How many size 7 shoes did I sell in the past three months?" The results are factual answers that enable you to validate your hypothesis or order decision. But what happens if you have hundreds of variables to analyze? It becomes difficult to formulate a good hypothesis or relationship among your data. In addition, OLAP tools don't produce predictive or estimated values with associated accuracy expectations.

Data mining, on the other hand, is a form of discovery-driven analysis where statistical and machine learning techniques are used to make predictions or estimates about outcomes or traits before knowing their true values. Data mining techniques are used to find meaningful, often complex, and previously unknown patterns in data. For example, you may ask, "How many size 7 shoes should I order for the next season?" Data mining techniques can be used to build models based on detail data to predict the number of size 7 shoes sold within a given time period. Typically, OLAP analyses use predefined, summarized, or aggregated data, such as multi-dimensional cubes, whereas data mining requires detail data that is aggregated to optimal levels and analyzed at the individual record level.

Figure 2. Differences between OLAP and Data Mining.

OLAP | Data Mining
Ad hoc queries and reports | Statistical and machine learning techniques
Verification driven/Factual results | Discovery driven
Commonly uses predefined aggregate data | Requires detail data
Typically focuses on current facts | Typically focuses on future outcomes or trends

Although these technologies are used for different purposes, OLAP and data mining are complementary. During the data mining exploration phase, you may
use OLAP technology to help you understand your data. Data mining results can also be used in OLAP applications by incorporating new predictive variables or scores as dimensions or attributes in your OLAP tool. For example, if you calculate a new predictive variable called "Customer Value" that characterizes the value of a customer to your business in terms of profitability, you can include this new variable as an attribute in your OLAP tool. When retailers analyze which products to stock, they can consider products that attract high-value or profitable customers. (See Figure 2.)

How Does Data Mining Work?

Data mining leverages artificial intelligence and statistical techniques to build models. Data mining models are built from situations where you know the outcome. These models are then applied to other situations where you don't know the outcome. For example, if your data warehouse identifies customers who have responded to past marketing campaigns, you can create a model that identifies the characteristics of those customers. This model can be applied to a wider customer database, identifying customers who demonstrate the same characteristics, allowing you to target those likely to respond, thereby improving response rates and reducing marketing costs.

Business problems that lend themselves to data mining are predictive and descriptive in nature. Predictive models are used to predict an outcome, referred to as the dependent or target variable, based on the value of other variables in the data set. For example, a predictive model could determine the likelihood that a customer will purchase a product based on her income, number of children, current product ownership, or debt. Predictive techniques build models based on a training set of data with a known outcome, such as prior buying patterns. The algorithm analyzes the values of all input variables and identifies which variables are significant as predictors for a desired outcome.

Unlike predictive models, descriptive models do not predict variables based on known outcomes, but rather describe a particular pattern that has no known outcome. Common techniques include data visualization, where large volumes of data are reduced to a picture that can be easily understood. Another common descriptive technique is clustering, where data are grouped into subsets based on common attributes. For example, you may use descriptive techniques to determine customer segments and their attributes.

Figure 3. The Data Mining Process.
[Diagram: four steps with their tasks. Define the Business Problem (define the business objectives, examine the data, define initial approach, scope project); Explore and Preprocess Data (data acquisition, data exploration, data selection, data transformation); Develop Model (design, train, test and validate, interpret and evaluate); Deploy Knowledge (reports, applications, model integration).]

In many cases, both descriptive and predictive models are used to solve business problems. A descriptive technique may identify customer segments based on value in terms of profitability to
your business, and a predictive technique may identify the likelihood that a particular segment will defect to your competitor. By combining the segments from the descriptive technique with the defection predictions, you can act to prevent attrition of your high-value customers.

The Data Mining Process

You cannot buy a data mining product, apply it to data, and expect to generate a meaningful model. Data mining models are built as part of a data mining process: an ongoing process requiring maintenance throughout the life of the model. The data mining process is not linear but iterative, so you often loop back to a previous phase. For example, the initial model you create may lead to insight requiring you to return to the data preprocessing phase to create new analytical variables.

The data mining process contains four high-level steps: Define the Business Problem, Explore and Preprocess the Data, Develop the Data Model, and Deploy Knowledge (see Figure 3). Tasks for each step are listed in the diagram to provide a brief overview of the data mining process. We will discuss the data mining process in depth when we define the Teradata data mining methodology. Although each step is important, most of your time will be spent in the data exploration and preprocessing phase. A well-structured data warehouse can significantly reduce the pain felt in this phase.

The Relationship Between Data Mining and Data Warehousing

Data mining is all about data. You can mine inconsistent or dirty data and find patterns. However, the patterns will be meaningless if your data do not accurately reflect the business you are modeling. The key to data mining is ensuring that you have a foundation of good, quality data that is cleansed, consistent, and accurate. A data warehouse provides the right foundation for data mining.

Although data mining can be done without having a warehouse in place, the process of gathering, cleansing, and transforming the data from multiple data sources can be arduous. Once the process has been completed for one model, you must repeat it for subsequent data mining projects. Approximately 70% of the data mining process involves accessing, exploring, and preparing the data. The data warehouse makes data mining more viable by removing many of the data redundancy and system management issues. This allows people to focus on analysis.

Data Mining Terms and Techniques

This section briefly describes a few data mining terms and techniques commonly used to solve predictive and descriptive analytical problems.

Analytic Model
A model is a set of logical rules or a mathematical formula that represents patterns found in data that are useful for a business purpose. Once a model has been built based on one set of data, it can be reused to search for the discovered patterns in other similar data. Models are sometimes called predictive models since they can be used to predict behaviors that relate to the discovered patterns.

Association
This modeling technique is commonly referred to as affinity analysis and is used to identify items that occur together during a particular event. For example, affinity analysis is commonly used to study market baskets by identifying which combinations of products are most likely to be purchased together. A variation on affinity analysis is sequence analysis. Using sequence analysis, you could begin to understand the order in which customers tend to purchase specific products. These results may be helpful in the early phases of establishing a potential cross-selling strategy.

Clustering
Clustering is a type of modeling technique that can be used to place items into groups based on like characteristics. The goal of clustering is to create groups of items that are similar based on their attributes within a given group, but which are very different
from items in other groups. Clustering is frequently used to create customer segments based on a customer's behavior or other characteristics. Customers in the same segment share similar characteristics and tend to behave consistently. Knowledge of the typical behavior of a particular segment can be powerful information if you want to predict the behavior of an unknown member of that segment.

Data Visualization
This process takes large amounts of data and reduces them into more easily interpreted graphs, charts, or tables. Instead of large sets of numbers, colored pictures tell the story with clarity.

Decision Tree
This technique produces a tree-shaped structure that represents a set of decisions to predict a value of the target variable. This algorithm leverages a variety of techniques to separate or classify data based upon rules. Decision trees are commonly used to model good/bad risk or loan approval/rejection because the models are represented by rules that humans easily understand. Although each rule might be easily understood, some decision trees contain thousands of rules, requiring data mining tools with good visualization techniques to interpret many rules appropriately.

Linear Regression
A statistical technique used to find the best-fitting linear relationship between a numeric target variable and its set of predictor variables. Linear regression can be used to predict the amount of overdraft protection to offer a customer based on their account balances, years of service, and other characteristics.

Logistic Regression
A statistical technique used to find the best-fitting linear relationship between the log-odds of a categorical target variable and a set of predictors. It is commonly used to predict "Yes" or "No" questions, such as whether or not a particular transaction is likely to be fraudulent.

Neural Networks
This is a non-linear predictive modeling technique, loosely based on the structure of the human brain, that learns through training. This technique is commonly used to predict a future outcome based on historic data. However, it frequently requires substantial expertise to understand the rationale for the decisions and predictions it makes. The neural network is sometimes referred to as a "black box" because it produces a model that is less understandable, but often more accurate.

Score
A score is an outcome of a model that represents a predicted or inferred value on some trait or characteristic of interest. You can think of a score as the result of the model. If your model calculates the customer value, the score for each customer may be a number that indicates the value of that particular customer.

Data Mining Challenges

Although it may appear that data mining is the next logical step for companies that have already implemented their data warehouse, the reality is that many businesses struggle with getting their data mining projects to deliver meaningful results. To be successful, data mining requires the right team, the right methodology, the right architecture, and the right technology.

The Right Team

A big challenge to bringing data mining into the company as an internal corporate service is developing the skill sets required by the data mining team. Data mining projects must be a collaborative effort driven by business experts, developed by analytic modelers, and supported by IT. Your internal skill sets may be developed over time, which may mean initially hiring data mining consultants to develop your data mining capability with the ultimate objective of transferring knowledge to your team. To ensure a successful data mining outcome, you will need the following experts on the team:

Business Domain Experts
It's imperative to have the business analysts involved in the data mining project. They should be the champions and drivers of every data mining project. They are the ones who need the answers that result from the project, and therefore, they are the people who must clarify the business issues to be solved by the project.

The business experts should ultimately be held accountable for the results of the data mining project. The skill sets needed by the business domain experts include:
> Ability to ask and answer strategic questions
> Intimacy with enterprise data (accessing and manipulating it for analysis and forecasting)
> Ability to clarify outcomes and expectations for thorough evaluation and validation of analytic models
> Expertise with certain data analysis tools (Excel, OLAP)
> Background in statistical techniques for forecasting and strategic planning

Analytic Modelers/Data Miners
Analytic Modelers/Data Miners are responsible for preparing the data, designing the model, building the model, and deploying it against the data. The analytic modeler works with the IT organization to integrate the model into the decision support infrastructure and business processes. The skills needed by a Data Miner/Analytic Modeler include:
> Expertise in statistics and/or artificial intelligence
> Successful application of advanced algorithms in a real-world setting
> An understanding of the business domain (otherwise business domain experts can provide this support)

Information Technology Support
Information technology support is critical to the success of the data mining project. The IT organization responsible for the data warehouse provides the bulk of the IT support; however, other groups may be called upon to assist with data cleansing and model integration. The skills needed by an information technologist include:
> Data expertise combined with business understanding
> The ability to find, access, and manipulate data
> Detailed understanding of data structure and transformations
> Technical expertise for evaluating, installing, and maintaining the tool environment
> Application expertise for effectively deploying analytic models into the business environment, the data warehouse, and the operational and application environments

The Right Methodology

Data mining, like data warehousing, is an ongoing process that must be maintained and changed as business drivers change. The key to a successful data mining project is to base it on a proven methodology. Teradata's data mining methodology has delivered successful models that have uncovered millions of dollars in revenue and cost savings for customers. This section defines the Teradata data mining methodology. Although all tasks are equally important, for the purpose of this paper, we will focus primarily on the activities that affect the data warehouse. (See Figure 4.)

Figure 4. Teradata Data Mining Methodology.
[Diagram: Project Management; Business Problem Definition; Architecture and Technology Preparation; Data Preparation; Model Development, Test, and Validation; Knowledge Discovery and Deployment; Knowledge Transfer.]

Project Management
Every successful project requires clearly defined objectives, requirements, deliverables, and resources. Data mining projects are no exception. Project management activities are required throughout the project's life. The project manager ensures the project will produce satisfactory deliverables from both a technical and business perspective. Basic project management tasks include:
> Align the scope and expectations of the project
> Ensure communication among team members
> Develop a project plan
> Coordinate documentation and interim deliverables
> Coordinate application development activities and tasks
> Assess project effectiveness
> Close out the project

Business Problem Definition
Successful data mining begins with a clearly defined business objective. Without clearly defined business objectives, the data mining project will likely lead nowhere. For example, increasing your customer base is a very different objective than increasing the number of your most valuable customers. Everything from data preprocessing to model selection is driven by your business objective. The business problem is described in operational terms so you can determine initial data availability and the analytic approach.

Architecture and Technology Preparation
Before tackling a data mining problem, you must understand the development and implementation requirements for the analytic models. These requirements determine how the models are built, what software is required, and whether or not new hardware is required. In most cases, your development and production environments will be different. However, you may leverage the same environment with appropriate resources. There are several techniques for building models. Based on your environment and requirements, the right balance of client/server and/or in-database mining must be chosen.

Data Preparation
This is the most time-consuming step, but also the most critical. You must first collect all the data necessary for your project. If you have an enterprise warehouse, you're in luck. However, you may still need to pull data from different sources. First, examine your data sources to see what is available to address the business problem. Second, ensure that your data is computationally valid and consistent. For example, if you are pulling from different data sources, you must resolve conflicts among the data, which can be a daunting task. To avoid these issues, we highly recommend starting with a data warehouse where these conflicts are resolved.

Once you have gathered data from the different sources, your next step is to explore your data. This is often called exploratory data analysis. Data visualization and descriptive statistical techniques are used to uncover data quality issues and to better understand the characteristics of your data. You may uncover data quality issues or missing data, which can jeopardize the integrity of any analytic model, so you must compensate for, if not correct, the data issue. For example, if you are missing values, you must determine the best method for filling in missing data. You could consider using a data mining technique to predict the value of a missing variable based on other data points.

Next, you must isolate and prepare your data for the particular model. You may exclude outliers for some models, whereas for others you may build the model on the outliers. For example, if you were predicting baseball attendance and revenue, you would need to exclude abnormal attendance data, such as attendance data from 1994, the year of the baseball players' strike. In other cases, such as fraud detection, you should include outliers since they may represent fraudulent transactions.

Once you have selected your data, some level of transformation may be required. Detail data, as they exist in the data warehouse, are not necessarily ready for data mining. You may want to derive optimal aggregations or new analytic variables to build a better model. For example, debt-income ratio may be a better predictor than just debt or income. Some statistical techniques and algorithms
also require numeric data or data within a certain range. For those variables, you need to recode or transform them into the appropriate input variable for the data mining technique.

Model Development, Test, and Validation
The next step is to build an analytical model: an iterative process of applying analytical techniques to the analytical data set and interpreting mathematical equations. The resulting equations are refined as iterations are performed. Each iteration provides higher statistical and conceptual confidence in the results.

Earlier in the process, you identified a preliminary analytical approach required to solve the business problem. Now you must select the specific analytical algorithms or statistical techniques that are most appropriate for building your model. Your selection of specific analytical techniques often requires you to revisit some aspects of data preprocessing that you performed in the previous step. Once you've selected the algorithms, it's time to build the model.

Building an analytical model requires at least three broad steps: (a) training or fitting, (b) testing, and (c) validation. This requires you to segment your data into at least the following three different data sets: (a) training, (b) test, and (c) validation. Your model is built using the training data, and then tested using the test data to assess the model accuracy. The data mining tool you use should have sufficient model, parameter, and row-level diagnostics that allow you to identify and understand specific strengths and weaknesses in your model during these first two steps. After you've refined your model based upon the diagnostics, it's time to validate your model.

Model validation is a process by which an analytical modeler attempts to establish and maximize how well a model generalizes beyond the data set with which it was created. The validation data are used as an independent source of information to assess the degree to which your model's accuracy might be overstated. Overstated accuracy is frequently referred to as overfitting, a case where a model is built to closely fit the training and test data, but not the data that you intend to score. Overfitting has a direct and adverse effect on the usefulness, or validity, of your model. For example, if you build a granular model where the rules or formulas are so specific to a single instance (e.g., income=$50,000, gender=F, marital status=divorced, age=28, first name=June, hair color=red, number of children=3, cats=0, dogs=2, etc.), your accuracy for the training and test sets can be 100%. However, when this model is applied to another data set, accuracy of results is almost guaranteed to be horrendous. If the rules or formulas in a model are so tightly bound to any particular data set, then you won't be able to use the model for the purpose for which you built it: to produce scores for data with unknown outcomes that you want to predict with high confidence. The amount of effort put into maximizing the validity of a model is directly proportional to its business value.

The analytical models are tested using statistical techniques, models developed from different analytical techniques are compared, and the results for these models are further validated against the business criteria for the project. Once you develop the model, you must also establish a process to validate and to refresh the model as the data changes. It's also necessary to monitor the continuing business validity of the analytical models.

Knowledge Delivery and Deployment
Knowledge derived through analytical models unlocks the ROI from your warehouse. There are several methods for deploying the models. Your IT organization may run the model and deliver the results to your business users for business decisions. The model or intelligence generated from the model can also be integrated into your customer relationship management (CRM) or analytical applications to facilitate business user access to the results. Regardless of your implementation, data mining adds intelligence to your business in the form of scores, predictions, descriptions, and profiles.
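
To make the training, test, and validation split and the overfitting problem described under Model Development, Test, and Validation more concrete, here is a minimal Python sketch. It is an illustration only, not part of the Teradata methodology: the synthetic customer data, the 60/20/20 split, the income cutoff of 60,000, the 10% noise rate, and every function name are assumptions made for this example. It compares a deliberately overfit model that memorizes each training record with a single general rule.

```python
import random


def make_customers(n, seed=42):
    """Build n synthetic (features, responded) rows with unique features.

    Assumed rule for this example: customers with income above 60,000
    respond, and 10% of outcomes are flipped to imitate noisy data.
    """
    rng = random.Random(seed)
    rows = {}
    while len(rows) < n:
        income = rng.randrange(20_000, 100_001)
        age = rng.randrange(18, 81)
        responded = income > 60_000
        if rng.random() < 0.10:
            responded = not responded
        rows[(income, age)] = responded  # dict keys keep features unique
    return list(rows.items())


def split_three_ways(rows, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle the rows and cut them into training, test, and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])


def accuracy(predict, rows):
    """Fraction of rows whose known outcome matches the prediction."""
    return sum(predict(features) == outcome for features, outcome in rows) / len(rows)


def fit_memorizer(train):
    """Overfit on purpose: one rule per exact training record."""
    table = dict(train)
    # Records never seen in training fall back to "did not respond".
    return lambda features: table.get(features, False)


def fit_income_threshold(train):
    """Fit one general rule: respond if income exceeds a learned cutoff."""
    def rule_for(cutoff):
        return lambda features: features[0] > cutoff

    candidates = range(20_000, 100_001, 5_000)
    best = max(candidates, key=lambda c: accuracy(rule_for(c), train))
    return rule_for(best)


rows = make_customers(1000)
train, test, validation = split_three_ways(rows)
memorizer = fit_memorizer(train)
threshold_rule = fit_income_threshold(train)

for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name,
          "train=%.2f" % accuracy(model, train),
          "test=%.2f" % accuracy(model, test),
          "validation=%.2f" % accuracy(model, validation))
```

The memorizing model is perfect on the records it was built from but should do far worse on the validation set, while the single income rule should score similarly on all three sets. That gap between training and validation accuracy is the overstated accuracy that the validation set is meant to expose.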

Knowledge Transfer
One of the unique components of the Teradata data mining methodology is knowledge transfer. Knowledge transfer spans the entire data mining project, beginning with initial interviews with each data mining team member to determine their professional knowledge transfer objectives for the project. Mentoring and education throughout the project arm the data mining team with the modeling and process knowledge needed to interpret results, maintain the modeling environment, and monitor the analytical model.

The Right Architecture
There are several data mining architectures commonly used today. They include distributed, independent data marts; the data warehouse with dependent data marts; and the centralized data warehouse and mining architecture (see Figure 5). Each architecture is described below.

Distributed, Independent Data Marts
The Distributed Sources with Analytic Data Marts method requires data to be extracted from multiple sources to analytical servers. Data gathered from various sources must be converted into a common and consistent format, then merged into an analytic data mart. Data mining is an iterative process. It's true that you don't need a data warehouse to mine data; however, the data movement and data management can add months to your data mining project. Data mining tool and database vendors highly recommend beginning with a data warehouse if you're planning to integrate data mining into your business intelligence strategy.

Another reason an analyst may opt for a distributed data mart model is data autonomy. Once you extract data from your sources, you have full control over your analytical environment. The second scenario, Data Warehouse with Analytic Data Marts, allows you to achieve autonomy with a data warehouse.

Data Warehouse with Dependent (Analytic) Data Marts
Using a data warehouse simplifies the data management issues since the data have already been gathered, cleansed, and transformed to meet your warehouse criteria. Although you're pulling from a single source, you must still contend with the data movement from your warehouse to your analytical server, the potential human error that can occur with sampling, and analytic server management issues. In addition to data movement, you must ensure that the data you select are a sample that accurately reflects the business environment. Building models against samples that don't represent your data will produce poor models. Remember that it's all about your data. There are other, more efficient alternatives.
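The sampling risk described above can be made concrete with a small check. The following is a minimal sketch in Python, not part of any Teradata product: the customer rows, the `segment` column, and the helper names `segment_shares`, `max_share_gap`, and `stratified_sample` are all hypothetical and chosen only for illustration. It compares how a category is distributed in the full table and in a sample, and shows one common way (stratified sampling) to keep the two aligned.

```python
import random
from collections import Counter


def segment_shares(rows, key):
    """Return the share of rows that fall into each value of `key`."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def max_share_gap(population, sample, key):
    """Largest absolute difference in category share between population and sample."""
    pop = segment_shares(population, key)
    smp = segment_shares(sample, key)
    return max(abs(pop.get(v, 0.0) - smp.get(v, 0.0)) for v in set(pop) | set(smp))


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every category so that shares are preserved."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical customer table: 900 consumer rows followed by 100 business rows.
customers = [
    {"id": i, "segment": "consumer" if i < 900 else "business"}
    for i in range(1000)
]

convenience = customers[:100]  # "first 100 rows" sample: contains no business customers
stratified = stratified_sample(customers, "segment", 0.10)

print("convenience sample gap:", max_share_gap(customers, convenience, "segment"))
print("stratified sample gap:", max_share_gap(customers, stratified, "segment"))
```

A large gap is a warning that a model built on the sample may not reflect the business environment. Checking a single category is only a starting point; in practice several attributes would need the same comparison.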
Figure 5. Data Mining Architecture. [Diagram comparing three architectures: Distributed Sources and Analytic Data Marts; Data Warehouse with Analytic Data Marts; Centralized Data Mining. Labeled components: Sources, Data Warehouse, Analytic Data Marts, Desktop Client.]



Centralized Data Warehouse and Mining
As data mining projects are implemented across the enterprise, the number of users leveraging the data mining models continues to grow, as does the need to access large data infrastructures. Data warehouse solution providers recognize this situation and are incorporating data mining extensions within the database to offer a centralized data mining architecture. Analytic processing performed within the database minimizes data movement in and out of the database and leverages the parallelism of the database. A massively parallel database provides a massively parallel analytical engine that you can use to build, test, and deploy analytical models.

The data warehouse becomes a centralized repository for your analytical data, data mining models, and data mining results, providing an ideal foundation for data mining projects. Data are available for multiple mining projects across your entire enterprise, and your analytical models can be run against your entire customer table within your warehouse. Data mining models and results combined with your detailed customer records give you insight about customer value, buying patterns, and preferences.

The Data Warehouse with Analytical Data Marts architecture is the most commonly used architecture today because of the limitations of databases and data mining tools. Most data mining tool vendors require data to be converted into their proprietary format for efficient processing. Technology limitations are discussed in the next section.

The Right Technology
The right technology begins with the right foundation: the right data warehouse. Effective data mining depends on a comprehensive and robust data warehouse, not a summarized data mart, because it's difficult to predict the attributes that will contribute to the data mining model. In addition, you must select a warehouse that is built on the right foundation. Some companies are trying to do data warehousing with a database that was designed for OLTP, the operational processing of high-speed transactions. The functions performed in OLTP (adding, deleting, and modifying records, or row-level functions) are entirely different from analyzing large volumes of historical data and require very different database capabilities.

Scalability and Performance
To get a higher return on their data warehousing investments, data warehouse users are asking more complex questions that require access to large amounts of data. As data volumes and the complexity of the business problems grow, analyses will inevitably take longer to process, requiring acceleration of the data mining process. Users who analyze data warehouses that scale to the multi-terabyte range struggle with desktop and client/server data mining tools that don't scale to meet their requirements. This has required data mining to move from desktop and general-purpose toolboxes on client/server configurations to enterprise applications on massively parallel processing (MPP) configurations. Unfortunately, most tool vendors fail to leverage parallel technology for efficient data processing. However, some database vendors are in a unique position to provide an in-database approach to data mining to answer this need. Mining directly in the database streamlines the data mining process by eliminating data movement and leveraging the parallelism of the database engine for the performance and data scalability required to analyze large volumes of detail data.

Data I/O
As large volumes of data are processed and models are deployed across the enterprise, the I/O required by most tools creates a network bandwidth problem. As gigabytes and even terabytes are moved from database to analytic server to business server, the I/O puts a strain on the entire enterprise network. In-database mining eliminates the I/O issues by moving the functions to the data rather than moving the data to the functions.

Tools
The right technology includes tools that provide a comprehensive set of statistical and machine learning functions along with visualization and data preprocessing techniques. Many tools provide a sophisticated set of analytical algorithms and graphical interfaces. However, they fail to provide a robust set of data visualization and data preprocessing functions. Since the bulk of the data mining process is spent exploring and conditioning data, you need tools that will facilitate data exploration, visualization, transformation, and data management. Tools must also process large data volumes and provide an interface that enables integration of analytical models into business applications.

Data Mining with Teradata
Data warehouse solution providers, such as Teradata, a division of NCR, fully understand the data mining challenges and issues facing companies today. Teradata's in-database data mining approach sets us apart from other data mining solution providers in the industry. Our centralized solution permits users to do data exploration, data preprocessing, analytic modeling, scoring, and deployment all within the database using SQL, taking advantage of Teradata Database's unlimited scalability and exceptional performance. Performing data mining in the database streamlines the process by eliminating data movement and the overhead associated with managing the data and the systems involved in a distributed environment. In-database mining also reduces data redundancy and improves data reliability.

Traditionally, data mining technologies require that you move data out of the centralized data warehouse and into proprietary or flat file structures. With this technique, many copies of the data will reside in various analytical servers or data marts. Imagine how much time it could take to create 20 samples of a terabyte-sized database, extract them into different locations, convert them into different formats, and finally import them into applications. Can you afford the time and inefficiencies of this method? Teradata Warehouse Miner's analytic operations can be performed on the data within the Teradata Database. Results from the analysis are stored within your enterprise data warehouse, providing access to all users as necessary. (See Figure 6.)

Figure 6. Benefits of In-DBS Mining. [Diagram contrasting the Traditional Approach with Teradata's In-Database Mining, in which Analytical Data and Results reside in Teradata Database.]
Benefits of In-DBS:
> Eliminates data movement
> Minimizes data redundancy
> Reduces cost of system and data management
> Eliminates potential sampling errors
> Leverages parallel database engines

Teradata Warehouse Miner
Teradata Warehouse Miner dynamically generates Teradata SQL statements and executes them from a Windows client. The SQL is constructed from options, tables, and columns selected by the user in the graphical Windows interface. In some cases, Teradata Warehouse Miner breaks the algorithms into steps so that the steps that require data access are performed via SQL, while other steps requiring numerical processing are handled by the Teradata Warehouse Miner client. Teradata Warehouse Miner processes functions in the most efficient manner, leveraging the parallelism of Teradata Database whenever possible.
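The idea of building SQL from user-selected tables and columns and running it where the data lives can be illustrated with a short sketch. This is not Teradata Warehouse Miner's actual SQL or interface: the example assumes Python's built-in `sqlite3` module as a stand-in database, and the `customer` table, its columns, and the helpers `quote_identifier` and `build_summary_sql` are hypothetical names introduced only for illustration.

```python
import sqlite3


def quote_identifier(name):
    """Quote a table or column name for SQL, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_summary_sql(table, columns):
    """Build one SELECT that computes min, max, and mean for each chosen column."""
    parts = ["COUNT(*) AS row_count"]
    for col in columns:
        quoted = quote_identifier(col)
        parts.append(f"MIN({quoted}) AS {quote_identifier(col + '_min')}")
        parts.append(f"MAX({quoted}) AS {quote_identifier(col + '_max')}")
        parts.append(f"AVG({quoted}) AS {quote_identifier(col + '_avg')}")
    return f"SELECT {', '.join(parts)} FROM {quote_identifier(table)}"


# Stand-in database with a tiny hypothetical customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, balance REAL, visits INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, 100.0, 3), (2, 250.0, 7), (3, 400.0, 5)],
)

# The "user selections" (table and columns) drive the generated statement.
sql = build_summary_sql("customer", ["balance", "visits"])

# The database does the aggregation; only one summary row leaves it.
cursor = conn.execute(sql)
names = [description[0] for description in cursor.description]
summary = dict(zip(names, cursor.fetchone()))

print(sql)
print(summary)
```

The point of the design is that the detail rows never leave the database: the client sends a statement and receives a single row of results, which is the "move the functions to the data" pattern described in this section.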



Teradata Data Mining Labs: Where Advanced Analytics Come to Life
Teradata's Data Mining Services help many customers leverage data mining to grow their business, reduce costs, and better serve their customers, giving them the competitive edge. Our worldwide Data Mining Labs and the San Diego-based Data Mining Center of Expertise are uniquely qualified to offer clients a secure environment where they can investigate how data mining will help them solve their most complex business problems. Teradata's Data Mining Lab consultants are experts in analytical modeling with a strong background in statistics and artificial intelligence. Technological expertise combined with business knowledge is their forte: they know how to help customers leverage a sophisticated technology to solve their business challenges with analytical solutions. A Data Mining Lab engagement offers consulting services, educational workshops, and analytic model development to help you integrate predictive models into your business process.

The Data Mining Lab Engagement: Low Risk, High Value
Data Mining Lab engagements have been used by Teradata customers to help them assess the potential of data mining in their environment. This controlled, highly secure Proof of Concept (POC) is a low-risk, high-value engagement that shows how data mining can be applied to answer your complex business questions. The project length for a data mining POC varies, but it typically takes two weeks to perform business problem qualification and clarification and data discovery for the business questions. Preparing and analyzing the data and developing the analytic models takes from four to eight weeks. The length of the engagement depends on many variables, but the three main factors are data cleanliness, data availability, and the clarity of the business problem to be solved. The data mining engagement can be run in the lab as a Proof of Concept or in a client's live production environment.

How to Get Started with Data Mining: the Accelerator Packages from Teradata
Many organizations are interested in data mining, but don't know what the next steps are to successfully integrate data mining with their business intelligence strategy. Teradata puts data mining into your hands with a full complement of data mining services built around a mentoring program that ensures that you learn how to mine your own data. Teradata Data Mining Accelerator Packages help you get started with data mining through special educational offerings and pricing incentives. Here's a brief overview of the Data Mining Accelerator Packages:
> Exploration Package: This package is designed for Teradata customers who want to understand what data mining can contribute to their business. They're already analyzing data in the warehouse using query, reporting, and OLAP tools but want to explore the possibility of including predictive analysis.
> Expansion Package: This package is designed for Teradata customers who are ready to integrate data mining into their business processes. They're already using their data warehouse for business intelligence and are ready to expand their analytic capabilities with data mining.
> Expert Package: This package enables customers who have a staff of analytic modelers experienced in the use of client/server-based data mining tools to integrate in-database mining to leverage the best of both worlds.

Driving Higher ROI: Data Mining Customer Experiences
Ever-increasing global economic challenges are prompting companies to explore new ways to get more from their data warehousing investment.

Technologies that offer valuable insight and predictive capabilities to drive business growth and improve ROI are a great next step after the data warehouse is in place. Data mining is the right technology for supercharging CRM and analytic applications by inserting intelligence in the form of predictions, scores, descriptions, and profiles (where data mining excels). Volumes of historical data containing facts about what occurred in business operations can be analyzed and used to predict what will happen in the future. Data mining is one of the fastest growing business intelligence technologies because it pays off in quantitative value. For example, here are a few facts from some of the first Teradata data mining implementers:
> A European financial institution saved $8.2 million by gaining a better understanding of their customers' ATM behavior. They were able to strategically place ATMs to reduce fees and increase loyalty.
> A South American telecommunications provider retained 98% of their high-value customers during deregulation. They identified who their high-value customers were, understood their profile and customer satisfaction level, and marketed to their customer segments.
> A U.S. telecommunications provider improved their targeted marketing response rate tenfold by targeting customers identified through data mining.

Not Just Better, but the Best: Our Benchmarks Prove It!
A packaged goods manufacturer had a partner Customer Loyalty program in which they collected data from their retail partners, analyzed the data comparing their product sales with other products, and sent this information back to the partners. This program was based on five analytical programs including market basket analysis and promotion monitoring. The analysis was performed on an IBM AIX server and a data mining analytic server. The entire process took 312 hours, not including data extract, coding, or data copy, making the application too costly to operate. The Teradata benchmarking team created programs that used Teradata SQL and ran everything directly in Teradata Database. With Teradata, the process ran in only 12 hours, saving the Customer Loyalty program.

Summary
To develop analytic solutions that can be applied throughout your enterprise, you need a powerful infrastructure that is built for analytic processing. The volume of data being created and captured and the amount of transaction data can cause massive bottlenecks in your decision flow: thousands of variables, millions of transactions per day, and millions of customers. You require timely, accurate, and sophisticated analysis of your data to maintain your competitive advantage. Reports and OLAP techniques provide the capabilities for navigating massive data warehouses but not the insight required to stay ahead of your competitors. Data mining with Teradata Database offers the analytic foundation to unlock the intelligence from your enterprise data warehouse.

This paper was written by Arlene Zaima, Data Mining Marketing Manager, with contributions from James Kashner, CTO of the Teradata Data Mining Lab. For more information, visit our web site at Teradata.com.

Teradata and NCR are registered trademarks of NCR Corporation. Windows is a registered trademark of Microsoft Corporation. NCR continually enhances products as new technologies and components become available. NCR, therefore, reserves the right to change specifications without prior notice. All features, functions and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for the latest information. © 2004 NCR Corporation, Dayton, OH, U.S.A. All Rights Reserved.
