MLBA Assignment-Anusree Balakrishnan - BD20011 Assignment 1: Data Understanding

The document discusses analyzing customer data from an online retailer using clustering techniques. It performs the following steps: 1. Reads and cleans the customer data, which includes purchase details, invoices, and customer information. 2. Normalizes the data and determines the optimal number of clusters is 4 using the elbow method on K-means clustering. 3. Clusters the customers into 4 segments based on purchase frequency and revenue. 4. Provides recommendations to the retailer on which customer segments to target in order to generate more revenue.

Uploaded by

anu balakrishnan

MLBA Assignment-Anusree Balakrishnan_BD20011

Assignment 1
The Consumer Complaint Database contains complaints that CFPB (Consumer Financial Protection
Bureau) has received about consumer financial products and services. The data is related to customer
complaints about financial products and services of a leading North American Bank. The goal is to
predict if the bank disputes the allegations contained in the complaint.

The source of the data is https://catalog.data.gov/dataset/consumer-complaint-database

Data Understanding

The goal is to predict whether the bank disputes the allegation contained in the complaint. To do
this, we will use the random forest algorithm. The fields used for prediction are:

· complaint_what_happened
· company_public_response
· company_response

Data Preparation

1) The data is read from the file. The fields included in the data are:

· date_received: The date the complaint was received by the CFPB.
· product: The type of product the consumer identified in the complaint.
· sub_product: The type of sub-product the consumer identified in the complaint.
· issue: The problem the consumer raised in the complaint.
· sub_issue: The sub-issue the consumer identified in the complaint.
· complaint_what_happened: The consumer complaint narrative, that is, the consumer's own statement
of "what happened". Consumers must opt in to share their narrative. The CFPB does not publish the
narrative unless the consumer gives permission, and consumers can opt out at any time. The CFPB
takes reasonable steps to remove personal information from each complaint that could be used to
identify the complainant.
· company_public_response: An optional public-facing response to a consumer complaint.
Companies can choose from a predetermined list of responses that will be published in the
public database.
· company: The company the complaint is about.
· state: The state in which the consumer's mailing address is located.
· zip_code: The ZIP code of the consumer's mailing address.
· consumer_consent_provided: Whether the consumer agreed to have their complaint narrative
published. The narrative is not shared unless the consumer agrees, and consumers can opt
out at any time.
· submitted_via: How the complaint was submitted to the CFPB.
· date_sent_to_company: The date the CFPB sent the complaint to the company.
· company_response: How the company responded to and handled the complaint.
· timely: Whether the company responded in a timely manner.
· consumer_disputed: Whether the consumer disputed the company's response.
· complaint_id: The unique identification number for a complaint.

2) Before analysis, all null values have to be removed. We first remove the rows with null values in
complaint_what_happened, and then check whether any other null values remain in the data.

3) Since the analysis uses only company_response, complaint_what_happened, and
company_public_response, we keep only those fields.

4) After this, we map each company_public_response value to either Dispute or Agrees. We have used
the following mapping.
5) After that, we examined the distribution of these company public responses.

6) complaint_what_happened is renamed to complaint, to make the data easier to work with.

7) We perform a clean-up to remove unwanted characters and email addresses, to make the data
more meaningful.
8) In order to evaluate the accuracy of the model, we split the complaints data into training and
testing sets in the ratio 70:30.

9) We use a random forest classifier here.
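Steps 2 to 8 above can be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the code used for the assignment: the sample rows are invented, the two entries in RESPONSE_TO_LABEL are placeholders because the actual mapping is not reproduced in this text, and the clean-up rules (lower-casing, dropping email-like tokens and non-letter characters) are one plausible reading of "unwanted characters".

```python
import random
import re

KEEP = ("complaint_what_happened", "company_public_response", "company_response")

# Placeholder mapping for illustration only; the real mapping is not shown in the report.
RESPONSE_TO_LABEL = {
    "Company disputes the facts presented in the complaint": "Dispute",
    "Company believes complaint is the result of an isolated error": "Agrees",
}

EMAIL_RE = re.compile(r"\S+@\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_text(text):
    """Lower-case, drop email-like tokens, then drop non-letter characters."""
    text = EMAIL_RE.sub(" ", text.lower())
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())


def prepare(rows):
    """Steps 2-7: drop nulls, keep three fields, map labels, rename, clean."""
    prepared = []
    for row in rows:
        if any(not row.get(col) for col in KEEP):
            continue  # a missing or empty value counts as null here
        label = RESPONSE_TO_LABEL.get(row["company_public_response"])
        if label is None:
            continue  # response not covered by the placeholder mapping
        prepared.append({
            "complaint": clean_text(row["complaint_what_happened"]),
            "company_response": row["company_response"],
            "label": label,
        })
    return prepared


def split_70_30(items, seed=0):
    """Step 8: shuffle, then split 70% for training and 30% for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Invented sample rows, purely for demonstration.
sample_rows = [
    {"complaint_what_happened": "I was charged TWICE!! Contact me at jane.doe@example.com",
     "company_public_response": "Company disputes the facts presented in the complaint",
     "company_response": "Closed with explanation"},
    {"complaint_what_happened": None,
     "company_public_response": "Company disputes the facts presented in the complaint",
     "company_response": "Closed with explanation"},
    {"complaint_what_happened": "Nobody answered my calls.",
     "company_public_response": "Some response outside the placeholder mapping",
     "company_response": "Closed"},
    {"complaint_what_happened": "Refund #42 arrived late.",
     "company_public_response": "Company believes complaint is the result of an isolated error",
     "company_response": "Closed with monetary relief"},
]

prepared = prepare(sample_rows)
train, test = split_70_30(prepared)
print(len(prepared), len(train), len(test))
```

Fixing the seed keeps the split reproducible between runs, which makes the reported accuracy easier to compare across experiments.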


Output

After using the random forest classifier, we got the following output, and the accuracy of the model
turned out to be 97%.
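A minimal sketch of how such an accuracy figure could be obtained with scikit-learn is shown below. It is not the assignment's code: the toy complaints and labels are made up, and turning the text into TF-IDF features is an assumption, since the report does not say how the complaint text was vectorized. On toy data like this the printed accuracy is meaningless; only the workflow is of interest.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up complaints and labels, for illustration only.
complaints = [
    "bank says the fee was disclosed in the contract",
    "company states the charge was authorized by me",
    "bank claims the late fee follows the agreement",
    "company insists the interest rate was disclosed",
    "bank says the overdraft fee is in the contract",
    "bank refunded the duplicate charge after review",
    "company corrected the billing error and apologized",
    "bank reversed the mistaken withdrawal quickly",
    "company fixed the reporting error on my account",
    "bank refunded the wrong fee and apologized",
]
labels = ["Dispute"] * 5 + ["Agrees"] * 5

# 70:30 split, stratified so both classes appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    complaints, labels, test_size=0.30, random_state=42, stratify=labels
)

# Assumption: TF-IDF features; fit the vocabulary on the training text only.
vectorizer = TfidfVectorizer()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(vectorizer.fit_transform(X_train), y_train)

predictions = model.predict(vectorizer.transform(X_test))
accuracy = accuracy_score(y_test, predictions)  # share of correct predictions
print(f"accuracy: {accuracy:.2f}")
```

Accuracy here is simply the fraction of test complaints whose predicted label matches the true label.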

Code
Assignment 2
The following dataset is from an online retailer that wants to apply data mining techniques for
customer-centric business intelligence. The online retailer considered here is a typical one: a small
business and a relatively new entrant to the online retail sector that knows the growing importance of
being analytical in today's online businesses and of data mining techniques, but lacks technical
awareness and resources. This analysis aims to help the retailer better understand its customers and
therefore conduct customer-centric marketing more effectively. Your job is to cluster the customers
by their purchase behaviour using a suitable data mining technique and understand the properties of
each cluster. You should also provide a set of recommendations that will help the online retailer.

Source of data: https://archive.ics.uci.edu/ml/machine-learning-databases/00502/

Data Understanding

The following dataset is from an online retailer that wants to apply data mining techniques for
customer-centric business intelligence. The data has variables such as invoice date, purchase quantity,
customer details, country of purchase, and so on.

Data Preparation

1) First, we read the Excel file and merged both sheets into a single data set.

2) Next, we removed the duplicate entries from the invoice data and converted the invoice date to a
proper datetime format.

3) We dropped the null entries from the data frame.

4) We removed the rows with a negative price or a quantity less than 0. We also removed invalid
stock codes.

5) We plotted various graphs to analyze the data, which can be seen in the accompanying code.

6) We normalized the data before performing K-means.

7) We ran K-means to find the optimum value of K. The optimum value is K=4.
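Steps 6 and 7 can be illustrated with a small from-scratch sketch that uses only the Python standard library. Several things here are assumptions: the report does not say which normalization was used (min-max scaling is shown), the K-means below is plain Lloyd's algorithm rather than whatever library the assignment relied on, and the (frequency, revenue) pairs are invented. The document summary mentions the elbow method, so the sketch prints the within-cluster sum of squares (inertia) for several values of K; the "elbow" is the K after which the inertia stops dropping sharply.

```python
import random


def min_max_scale(points):
    """Rescale every column to the [0, 1] range."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, inertia)."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centers[j])  # keep the center of an empty cluster
        if updated == centers:
            break  # converged: centers no longer move
        centers = updated
    labels = assign(points, centers)
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia


# Invented (purchase frequency, revenue) pairs, for illustration only.
customers = [
    (1, 20.0), (2, 35.0), (1, 15.0),
    (10, 400.0), (12, 380.0), (11, 420.0),
    (30, 2500.0), (28, 2300.0), (32, 2600.0),
    (5, 900.0), (6, 950.0), (4, 870.0),
]

scaled = min_max_scale(customers)
inertias = {k: kmeans(scaled, k)[1] for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 4))
```

Scaling first matters because revenue values are much larger than frequency counts and would otherwise dominate the distance calculation.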

Output

We classified different customers based on their frequency of purchase and the revenue generated.

· class 0: least revenue and less frequent purchase
· class 1: low revenue and less frequent purchase
· class 2: high frequency of purchase and high revenue generated
· class 3: moderate frequency of purchase and moderate revenue generated
· class 4: high revenue generating
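The two features behind these classes, purchase frequency and revenue, could be derived per customer along the following lines. This is a sketch under assumptions, not the assignment's code: the field names and sample lines are invented, frequency is taken to mean the number of distinct invoices, and revenue is taken to mean the sum of quantity times price. The report does not state either definition.

```python
from collections import defaultdict

# Invented transaction lines; the field names are assumptions, not from the report.
lines = [
    {"customer_id": "C1", "invoice": "A100", "quantity": 2, "price": 5.0},
    {"customer_id": "C1", "invoice": "A100", "quantity": 1, "price": 3.5},
    {"customer_id": "C1", "invoice": "A101", "quantity": 4, "price": 2.0},
    {"customer_id": "C2", "invoice": "B200", "quantity": 10, "price": 1.5},
    {"customer_id": "C2", "invoice": "B201", "quantity": -3, "price": 1.5},
]


def customer_features(lines):
    """Return {customer_id: (frequency, revenue)} from transaction lines."""
    invoices = defaultdict(set)
    revenue = defaultdict(float)
    for line in lines:
        if line["quantity"] < 0 or line["price"] < 0:
            continue  # mirrors the clean-up step: drop negative quantities and prices
        invoices[line["customer_id"]].add(line["invoice"])
        revenue[line["customer_id"]] += line["quantity"] * line["price"]
    return {cid: (len(invoices[cid]), revenue[cid]) for cid in invoices}


features = customer_features(lines)
for cid, (frequency, total) in sorted(features.items()):
    print(cid, frequency, total)
```

Each customer then becomes one (frequency, revenue) point, and the cluster descriptions above summarize where each group of points sits on those two axes.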

From this analysis, we identified which segments the retailer should target to generate more revenue.

Code

I will add the code separately and share it as a document.
