MLBA Assignment-Anusree Balakrishnan_BD20011
Assignment 1
The Consumer Complaint Database contains complaints that the CFPB (Consumer Financial Protection
Bureau) has received about consumer financial products and services. The data is related to customer
complaints about financial products and services of a leading North American Bank. The goal is to
predict if the bank disputes the allegations contained in the complaint.
The source of the data is https://catalog.data.gov/dataset/consumer-complaint-database
Data Understanding
Our goal is to predict whether the bank disputes the allegation contained in the complaint. To do
that, we use the random forest algorithm. The fields that we have used for
prediction are:
complaint_what_happened
company_public_response
company_response
Data Preparation
1) The data is read from the file, and the fields included in the data are:
date_received: The date the complaint was received by the CFPB.
product: The sort of product mentioned in the complaint by the customer.
sub_product: The type of sub-product mentioned in the complaint by the customer.
issue: The problem that the customer brought up in their complaint.
sub_issue: The complaint's sub-issue as identified by the consumer.
complaint_what_happened: The consumer complaint narrative is a statement of "what
happened" in the complaint submitted by the consumer. To share their story, customers must
first opt-in. We will not publish the story unless the customer gives his or her permission, and
customers can opt out at any time. The Consumer Financial Protection Bureau (CFPB) takes
reasonable steps to remove personal information from each complaint that could be used to
identify the complainant.
company_public_response: An optional public-facing reaction to a customer complaint.
Companies can choose from a pre-determined list of responses that will be published on the
public database.
company: The complaint is about this company.
state: The state in which the consumer's mailing address is located.
zip_code: The consumer's ZIP code for mailing purposes.
consumer_consent_provided: Indicates whether the customer agreed to have their complaint
story published. We don't share the story unless the customer agrees, and customers can opt
out at any moment.
submitted_via: How the complaint was submitted to the CFPB.
date_sent_to_company: The date the CFPB sent the complaint to the company.
company_response: This is how the company responded and handled the situation.
timely: Whether or not the company responded in a timely manner.
consumer_disputed: Whether or not the customer had a problem with the company's response.
complaint_id: The unique identification number for a complaint.
2) Before analysis, all null values have to be removed. We first remove null values from
complaint_what_happened, and then check whether there are any other null values in the data.
3) Since the analysis uses company_response, complaint_what_happened and company_public_response,
we keep only those fields.
4) After this, we map each company_public_response value to Dispute or Agrees. We have done the
following mapping
5) After that, we looked at the distribution of these company public responses.
6) complaint_what_happened is renamed to complaint, to make the data easier to work with.
7) We perform a clean-up to remove unwanted characters and email addresses, to make the data
more meaningful.
8) In order to estimate the accuracy of the model, we split the complaints data for training and
testing in the ratio of 70:30.
9) We use a random forest classifier here.
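Steps 2, 3, 6 and 7 above can be sketched roughly as follows. This is only a minimal illustration on a tiny made-up data frame, not the code used for the assignment: the sample rows, the placeholder response values and the exact clean-up rules in clean_text are assumptions. The Dispute/Agrees mapping of step 4 is left out because the mapping itself is not listed in this report.

```python
import re

import pandas as pd

# Hypothetical stand-in for the complaints file; the real data has more columns.
raw = pd.DataFrame(
    {
        "complaint_what_happened": [
            "I was charged twice!!! Contact me at jane.doe@example.com",
            None,
            "Card declined   at the store #42",
        ],
        "company_public_response": ["Response A", "Response B", "Response C"],
        "company_response": ["Closed", "Closed", "Closed with explanation"],
        "state": ["NY", "CA", "TX"],
    }
)

# Assumed clean-up rules: the report does not spell out which characters count as unwanted.
EMAIL_RE = re.compile(r"\S+@\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_text(text: str) -> str:
    """Lower-case, drop email addresses and punctuation, collapse whitespace."""
    text = text.lower()
    text = EMAIL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(text.split())


# Step 2: drop rows that have no complaint narrative.
df = raw.dropna(subset=["complaint_what_happened"])
# Step 3: keep only the three fields used in the analysis.
df = df[["complaint_what_happened", "company_public_response", "company_response"]]
# Step 6: rename the narrative column to something shorter.
df = df.rename(columns={"complaint_what_happened": "complaint"})
# Step 7: clean the narrative text.
df["complaint"] = df["complaint"].map(clean_text)
# Step 2 (second half): count any null values that remain, per column.
remaining_nulls = df.isna().sum()

print(df)
print(remaining_nulls)
```

Compiling the regular expressions once outside the function keeps the clean-up cheap when it is mapped over many complaints.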
Output
After using the random forest classifier, we got the following output, and the accuracy of the
model turned out to be 97%.
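The 70:30 split, the random forest classifier and the accuracy calculation can be sketched as below. This is a minimal illustration with scikit-learn on a made-up toy corpus, so it will not reproduce the 97% figure. The report does not say how the complaint text was turned into features; TF-IDF is used here purely as an assumption, and the texts, labels and hyperparameters are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the cleaned complaint column.
dispute_texts = [f"the bank denies the charge error on account {i}" for i in range(10)]
agree_texts = [f"the bank refunded the fee and apologised case {i}" for i in range(10)]
complaints = dispute_texts + agree_texts
labels = ["Dispute"] * 10 + ["Agrees"] * 10

# 70:30 split for training and testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    complaints, labels, test_size=0.30, random_state=42, stratify=labels
)

# Assumed feature extraction (TF-IDF) followed by a random forest.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)

# Accuracy is the share of test complaints whose label is predicted correctly.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Fitting the vectorizer inside the pipeline means it only sees the training complaints, so the test accuracy is not inflated by vocabulary learned from the test set.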
Code
Assignment 2
The following dataset is from an online retailer that wants to apply data mining techniques for
customer-centric business intelligence. The online retailer considered here is a typical one: a small
business and a relatively new entrant to the online retail sector that knows the growing importance of
being analytical in today’s online businesses and of data mining techniques, but lacks technical
awareness and resources. This analysis aims to help the retailer better understand its customers and
therefore conduct customer-centric marketing more effectively. Your job is to cluster the customers
by their purchase behaviour using a suitable data mining technique and understand the properties of
each cluster. You should also provide a set of recommendations that will help the online retailer.
Source of data: https://archive.ics.uci.edu/ml/machine-learning-databases/00502/
Data Understanding
The dataset is from an online retailer that wants to apply data mining techniques for
customer-centric business intelligence. The data has variables such as invoice date, purchase quantity,
customer details and country of purchase.
Data Preparation
1) First, we read the Excel file and merged both sheets into a single data frame.
2) We then removed the duplicate entries from the invoice data and converted the invoice
date to a proper date-time format.
3) We dropped the null entries from the data frame.
4) We removed rows with a negative price or a quantity less than 0. We also removed rows with
invalid stock codes.
5) We plotted various graphs to analyze the data; these can be seen in the code.
6) We normalized the data before performing K-means.
7) We ran K-means to find the optimum value of K. The value chosen was K=4.
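The preparation steps and the search for K can be sketched as follows. This is a rough illustration on synthetic rows, not the assignment code. The column names (Invoice, InvoiceDate, Quantity, Price, Customer ID) are assumed, as is the choice of frequency and revenue as per-customer features and the use of inertia (the elbow method) to compare values of K. The stock-code filter is left out because the report does not say which codes count as invalid.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged sheets; column names are assumptions.
rng = np.random.default_rng(0)
n_rows = 600
customer = rng.integers(1, 41, n_rows)
sales = pd.DataFrame(
    {
        "Invoice": customer * 100 + rng.integers(0, 20, n_rows),
        "InvoiceDate": pd.date_range("2010-01-01", periods=n_rows, freq="D").astype(str),
        "Quantity": rng.integers(-2, 20, n_rows),
        "Price": rng.uniform(-1.0, 30.0, n_rows).round(2),
        "Customer ID": customer.astype(float),
    }
)
sales.loc[::50, "Customer ID"] = np.nan  # simulate a few missing customer IDs

# Step 2: drop duplicate rows and parse the invoice date.
sales = sales.drop_duplicates().assign(
    InvoiceDate=lambda d: pd.to_datetime(d["InvoiceDate"])
)
# Step 3: drop rows with null entries.
sales = sales.dropna()
# Step 4: remove negative prices and negative quantities.
sales = sales[(sales["Price"] >= 0) & (sales["Quantity"] >= 0)]
sales = sales.assign(Revenue=sales["Quantity"] * sales["Price"])

# One row per customer: number of distinct invoices and total revenue.
customers = sales.groupby("Customer ID").agg(
    Frequency=("Invoice", "nunique"),
    Revenue=("Revenue", "sum"),
)

# Step 6: normalize so both features contribute on the same scale.
scaled = StandardScaler().fit_transform(customers)

# Step 7: fit K-means for several K and record the inertia of each fit.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

Plotting inertia against K and picking the point where the curve stops dropping sharply is one common way to settle on a value of K.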
Output
We grouped the customers based on their frequency of purchase and the revenue generated.
· class 0: least revenue and less frequent purchase
· class 1: low revenue and less frequent purchase
· class 2: high frequency of purchase and high revenue generated
· class 3: moderate frequency of purchase and moderate revenue generated
· class 4: high revenue generating
From the analysis, we learned which segment the retailer should focus on to generate more revenue.
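A cluster description of this kind can be produced by averaging the features within each cluster. The sketch below shows one way to do it on randomly generated customers. The data, the column names and the use of four clusters are illustrative assumptions, so the resulting profile does not correspond to the classes listed above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer table with purchase frequency and total revenue.
rng = np.random.default_rng(1)
customers = pd.DataFrame(
    {
        "Frequency": rng.integers(1, 50, 200),
        "Revenue": rng.gamma(shape=2.0, scale=500.0, size=200),
    }
)

# Normalize the features, then assign each customer to a cluster.
scaled = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
customers["Cluster"] = kmeans.fit_predict(scaled)

# Profile each cluster: size, mean frequency and mean revenue (original units).
profile = (
    customers.groupby("Cluster")
    .agg(
        Customers=("Frequency", "size"),
        MeanFrequency=("Frequency", "mean"),
        MeanRevenue=("Revenue", "mean"),
    )
    .sort_values("MeanRevenue")
)
print(profile.round(1))
```

Reading the profile in the original units, rather than the normalized ones, makes it easier to label a cluster as low, moderate or high in frequency and revenue.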
Code
I will be adding the code externally and sharing it as a document.