R Data Analysis Projects PDF
R Data Analysis Projects PDF
Gopi Subramanian
BIRMINGHAM - MUMBAI
R Data Analysis Projects
Copyright © 2017 Packt Publishing
www.packtpub.com
                                Contents
Preface                                                        1
Index                                                          345
                                    Preface
This book is for readers who want to leverage the R platform for their data analysis
projects/problems. We introduce readers to different R packages for various data analysis
purposes and help them use the right package for the right task. A complete project built
from scratch in each chapter will help readers better understand building end-to-end
predictive analytics solutions. This book covers a variety of topics, including building deep
learning networks in R, graph mining, streaming data analysis, sentiment classification, and
recommender systems.
Chapter 2, Fuzzy Logic Induced Content-Based Recommendation, addresses the cold start
problem in the recommender system. We handle the ranking problem with multi-similarity
metrics using a fuzzy sets approach.
Chapter 4, Taming Time Series Data Using Deep Neural Networks, introduces MXNet R, a
package for deep learning in R. We leverage MXNet to build a deep connected network to
predict stock closing prices.
Chapter 5, Twitter Text Sentiment Classification Using Kernel Density Estimates, shows ability
to process Twitter data in R. We introduce delta-tfidf, a new metric for sentiment
classification. We leverage the kernel density estimate based Naïve Bayes algorithm to
classify sentiments.
Chapter 6, Record Linkage - Stochastic and Machine Learning Approaches, covers the problem of
master data management and how to solve it in R using the recordLinkage package.
Chapter 7, Streaming Data Clustering Analysis in R, introduces a package stream for
handling streaming data in R, and the clustering of streaming data, as well as the
online/offline clustering model.
Chapter 8, Analyzing and Understanding Networks Using R, covers the igraph package for
performing graph analysis in R. We solve product network analysis problems with graph
algorithms.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can
include other contexts through the use of the include directive."
                                             [2]
New terms and important words are shown in bold.
Words that you see on the screen, for example, in menus or dialog boxes, appear in the text
like this: "Clicking the Next button moves you to the next screen."
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book-what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
                                              [3]
Downloading the example code
You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
       1.   Log in or register to our website using your email address and password.
       2.   Hover the mouse pointer on the SUPPORT tab at the top.
       3.   Click on Code Downloads & Errata.
       4.   Enter the name of the book in the Search box.
       5.   Select the book for which you're looking to download the code files.
       6.   Choose from the drop-down menu where you purchased this book from.
       7.   Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title. To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.
                                             [4]
                        Association Rule Mining
                                                                                      1
This chapter is ordered in ascending complexity. We will start with our use case, designing
a cross-sell campaign for an imaginative retailer. We will define the goals for this campaign
and the success criteria for these goals. Having defined our problem, we will proceed to our
first recommendation algorithm, association rule mining. Association rule mining, also
called market basket analysis, is a method used to analyze transaction data to extract
product associations.
The subsequent sections are devoted to unveiling the plain vanilla version of association
rule mining and introducing some of the interest measures. We will then proceed to find
ways to establish the minimum support and confidence thresholds, the major two interest
measures, and also the parameters of the association rule mining algorithm. We will explore
more interest measures toward the end, such as lift and conviction, and look at how we can
use them to generate recommendations for our retailer's cross-selling campaign.
We will introduce a variation of the association rule mining algorithm, called the weighted
association rule mining algorithm, which can incorporate some of the retailer input in the
form of weighted transactions. The profitability of a transaction is treated as a weight. In
addition to the products in a transaction, the profitability of the transaction is also recorded.
Now we have a smarter algorithm that can produce the most profitable product
associations.
Association Rule Mining                                                             Chapter 1
 We will then introduce the (HITS) algorithm. In places where a retailer's weight input is
not available, namely, when there is no explicit information about the importance of the
transactions, HITS provides us with a way to generate weights (importance) for the
transactions.
Next, we will introduce a variation of association rule mining called negative association
rule mining, which is an efficient algorithm used to find anti-patterns in the transaction
database. In cases where we need to exclude certain items from our analysis (owing to low
stock or other constraints), negative association rule mining is the best method. Finally, we
will wrap up this chapter by introducing package arulesViz: an R package with some cool
charts and graphics to visualize the association rules, and a small web application designed
to report our analysis using the RShiny R package.
                                             [7]
Association Rule Mining                                                                  Chapter 1
In the last decade, recommender systems have achieved great success with both online
retailers and brick and mortar stores. They have allowed retailers to move away from group
campaigns, where a group of people receive a single offer. Recommender systems
technology has revolutionized marketing campaigns. Today, retailers offer a customized
recommendation to each of their customers. Such recommendations can dramatically
increase customer stickiness.
Retailers design and run sales campaigns to promote up-selling and cross-selling. Up-
selling is a technique by which retailers try to push high-value products to their customers.
Cross-selling is the practice of selling additional products to customers. Recommender
systems provide an empirical method to generate recommendations for retailers up-selling
and cross-selling campaigns.
Retailers can now make quantitative decisions based on solid statistics and math to improve
their businesses. There are a growing number of conferences and journals dedicated to
recommender systems technology, which plays a vital role today at top successful
companies such as Amazon.com, YouTube, Netflix, LinkedIn, Facebook, TripAdvisor, and
IMDb.
Based on the type and volume of available data, recommender systems of varying
complexity and increased accuracy can be built. In the previous paragraph, we defined
historical data as a user and his product interactions. Let's use this definition to illustrate the
different types of data in the context of recommender systems.
Transactions
Transactions are purchases made by a customer on a single visit to a retail store. Typically,
transaction data can include the products purchased, quantity purchased, the price,
discount if applied, and a timestamp. A single transaction can include multiple products. It
may register information about the user who made the transaction in some cases, where the
customer allows the retailer to store his information by joining a rewards program.
                                               [8]
Association Rule Mining                                                                Chapter 1
A simplified view of the transaction data is a binary matrix. The rows of this matrix
correspond to a unique identifier for a transaction; let's call it transaction ID. The columns
correspond to the unique identifier for a product; let's call it product ID. The cell values are
zero or one, if that product is either excluded or included in the transaction.
Txn/Product    P1   P2   P3
                              ....   Pm
T1
               0 1 1 ... 0
T2
               1 1 1 .... 1
...            ... ... ... ... ...
Tn
               o 1 1 ... 1
Weighted transactions
This is additional information added to the transaction to denote its importance, such as the
profitability of the transaction as a whole or the profitability of the individual products in
the transaction. In the case of the preceding binary matrix, a column called weight is added
to store the importance of the transaction.
In this chapter, we will show you how to use transaction data to support cross-selling
campaigns. We will see how the derived user product preferences, or recommendations
from the user's product interactions (transactions/weighted transactions), can fuel
successful cross-selling campaigns. We will implement and understand the algorithms that
can leverage this data in R. We will work on a superficial use case in which we need to
generate recommendations to support a cross-selling campaign for an imaginative retailer.
                                              [9]
Association Rule Mining                                                               Chapter 1
Our goal, by the the end of this chapter, is to understand the concepts of association rule
mining and related topics, and solve the given cross-selling campaign problem using
association rule mining. We will understand how different aspects of the cross-selling
campaign problem can be solved using the family of association rule mining algorithms,
how to implement them in R, and finally build the web application to display our analysis
and results.
We will be following a code-first approach in this book. The style followed throughout this
book is to introduce a real-world problem, following which we will briefly introduce the
algorithm/technique that can be used to solve this problem. We will keep the algorithm
description brief. We will proceed to introduce the R package that implements the
algorithm and subsequently start writing the R code to prepare the data in a way that the
algorithm expects. As we invoke the actual algorithm in R and explore the results, we will
get into the nitty-gritty of the algorithm. Finally, we will provide further references for
curious readers.
He has provided us with his historical transaction data. The data includes his past
transactions, where each transaction is uniquely identified by an order_id integer, and the
list of products present in the transaction called product_id. Remember the binary matrix
representation of transactions we described in the introduction section? The dataset
provided by our retailer is in exactly the same format.
                                             [ 10 ]
Association Rule Mining                                                             Chapter 1
Let's start by reading the data provided by the retailer. The code for this chapter was
written in RStudio Version 0.99.491. It uses R version 3.3.1. As we work through our
example, we will introduce the arules R package that we will be using. In our description,
we will be using the terms order/transaction, user/customer, and item/product
interchangeably. We will not be describing the installation of R packages used through the
code. It's assumed that the reader is well aware of the method for installing an R package
and has installed those packages before using it.
The given data is in a tabular format. Every row is a tuple of order_id, representing the
transaction, product_id, the item included in that transaction, and the department_id,
that is the department to which the item belongs to. This is our binary data representation,
which is absolutely tenable for the classical association rule mining algorithm. This
algorithm is also called market basket analysis, as we are analyzing the customer's
basket—the transactions. Given a large database of customer transactions where each
transaction consists of items purchased by a customer during a visit, the association rule
mining algorithm generates all significant association rules between the items in the
database.
                                            [ 11 ]
Association Rule Mining                                                          Chapter 1
Let's quickly explore our data. We can count the number of unique transactions and the
number of unique products:
   library(dplyr)
   data %>%
    group_by('order_id') %>%
    summarize(order.count = n_distinct(order_id))
    data %>%
    group_by('product_id') %>%
    summarize(product.count = n_distinct(product_id))
   # A tibble: 1 <U+00D7> 2
    `"order_id"` order.count
    <chr> <int>
    1 order_id 6988
   # A tibble: 1 <U+00D7> 2
    `"product_id"` product.count
    <chr> <int>
    1 product_id 16793
We have 6988 transactions and 16793 individual products. There is no information about
the quantity of products purchased in a transaction. We have used the dplyr library to
perform these aggregate calculations, which is a library used to perform efficient data
wrangling on data frames.
                                           [ 12 ]
Association Rule Mining                                                             Chapter 1
In the next section, we will introduce the association rule mining algorithm. We will explain
how this method can be leveraged to generate the top N product recommendations
requested by the retailer for his cross-selling campaign.
We will explain the association rule mining algorithm and the effect of the interest measures
on the algorithm as we write our R code. We will halt our code writing in the required
places to get a deeper understanding of how the algorithm works, the algorithm
terminology such as itemsets, and how to leverage the interest measures to our benefit to
support the cross-selling campaign.
We will use the arules package, version 1.5-0, to help us perform association mining on
this dataset.
   library(arules)
   transactions.obj <- read.transactions(file = data.path, format = "single",
    sep = ",",
    cols = c("order_id", "product_id"),
    rm.duplicates = FALSE,
    quote = "", skip = 0,
    encoding = "unknown")
                                           [ 13 ]
Association Rule Mining                                                               Chapter 1
We begin with reading our transactions stored in the text file and create an arules data
structure called transactions. Let's look at the parameters of read.transactions, the
function used to create the transactions object. For the first parameter, file, we pass our
text file where we have the transactions from the retailer. The second parameter, format,
can take any of two values, single or basket, depending on how the input data is
organized. In our case, we have a tabular format with two columns--one column
representing the unique identifier for our transaction and the other column for a unique
identifier representing the product present in our transaction. This format is named single
by arules. Refer to the arules documentation for a detailed description of all the
parameters.
We can see that there are 6988 transactions and 16793 products. They match the previous
count values from the dplyr output.
Let's explore this transaction object a little bit more. Can we see the most frequent items,
that is, the items that are present in most of the transactions and vice versa—the least
frequent items and the items present in many fewer transactions?
The itemFrequency function in the arules package comes to our rescue. This function
takes a transaction object as input and produces the frequency count (the number of
transactions containing this product) of the individual products:
    data.frame(head(sort(itemFrequency(transactions.obj, type = "absolute")
     , decreasing = TRUE), 10)) # Most frequent
    Banana 1456
     Bag of Organic Bananas 1135
     Organic Strawberries 908
     Organic Baby Spinach 808
     Organic Hass Avocado 729
     Organic Avocado 599
     Large Lemon 534
     Limes 524
     Organic Raspberries 475
     Organic Garlic 432
    head(sort(itemFrequency(transactions.obj, type = "absolute")
     , decreasing = FALSE), 10) # Least frequent
    0% Fat Black Cherry Greek Yogurt y 1
     0% Fat Blueberry Greek Yogurt 1
                                             [ 14 ]
Association Rule Mining                                                             Chapter 1
In the preceding code, we print the most and the least frequent items in our database using
the itemFrequency function. The itemFrequency function produces all the items with
their corresponding frequency, and the number of transactions in which they appear. We
wrap the sort function over itemFrequency to sort this output; the sorting order is
decided by the decreasing parameter. When set to TRUE, it sorts the items in descending
order based on their transaction frequency. We finally wrap the sort function using
the head function to get the top 10 most/least frequent items.
The Banana product is the most frequently occurring across 1,456 transactions. The
itemFrequency method can also return the percentage of transactions rather than an
absolute number if we set the type parameter to relative instead of absolute.
Another convenient way to inspect the frequency distribution of the items is to plot them
visually as a histogram. The arules package provides the itemFrequencyPlot function
to visualize the item frequency:
itemFrequencyPlot(transactions.obj,topN = 25)
In the preceding code, we plot the top 25 items by their relative frequency, as shown in the
following diagram:
                                           [ 15 ]
Association Rule Mining                                                                Chapter 1
As per the figure, Banana is the most frequent item, present in 20 percent of the
transactions. This chart can give us a clue about defining the support value for our
algorithm, a concept we will quickly delve into in the subsequent paragraphs.
Now that we have successfully created the transaction object, let's proceed to apply the
apriori algorithm to this transaction object.
The apriori algorithm works in two phases. Finding the frequent itemsets is the first
phase of the association rule mining algorithm. A group of product IDs is called an itemset.
The algorithm makes multiple passes into the database; in the first pass, it finds out the
transaction frequency of all the individual items. These are itemsets of order 1. Let's
introduce the first interest measure, Support, here:
                                            [ 16 ]
Association Rule Mining                                                             Chapter 1
Now, in the first pass, the algorithm calculates the transaction frequency for each product.
At this stage, we have order 1 itemsets. We will discard all those itemsets that fall below
our support threshold. The assumption here is that items with a high transaction frequency
are more interesting than the ones with a very low frequency. Items with very low support
are not going to make for interesting rules further down the pipeline. Using the most
frequent items, we can construct the itemsets as having two products and find their
transaction frequency, that is, the number of transactions in which both the items are
present. Once again, we discard all the two product itemsets (itemsets of order 2) that are
below the given support threshold. We continue this way until we have exhausted them.
Pass 1 :
    Support = 0.1
    Product, transaction frequency
    {item5}, 0.4
    {item6}, 0.3
    {item9}, 0.2
    {item11}, 0.1
item11 will be discarded in this phase, as its transaction frequency is below the support
threshold.
Pass 2:
    {item5, item6}
    {item5, item9}
    {item6, item9}
As you can see, we have constructed itemsets of order 2 using the filtered items from pass
1. We proceed to find their transaction frequency, discard the itemsets falling below our
minimum support threshold, and step on to pass 3, where once again we create itemsets of
order 3, calculate the transaction frequency, and perform filtering and move on to pass 4. In
one of the subsequent passes, we reach a stage where we cannot create higher order
itemsets. That is when we stop:
    # Interest Measures
     support <- 0.01
    # Frequent item sets
     parameters = list(
     support = support,
     minlen = 2, # Minimal number of items per item set
     maxlen = 10, # Maximal number of items per item set
     target = "frequent itemsets")
     freq.items <- apriori(transactions.obj, parameter = parameters)
                                           [ 17 ]
Association Rule Mining                                                               Chapter 1
The apriori method is used in arules to get the most frequent items. This method takes
two parameters, the transaction.obj and the second parameter, which is a named list.
We create a named list called parameters. Inside the named list, we have an entry for our
support threshold. We have set our support threshold to 0.01, namely, one percent of the
transaction. We settled at this value by looking at the histogram we plotted earlier. By
setting the value of the target parameter to frequent itemsets, we specify that we
expect the method to return the final frequent itemsets. Minlen and maxlen set lower and
upper cut off on how many items we expect in our itemsets. By setting our minlen to 2, we
say we don't want itemsets of order 1. While explaining the apriori in phase 1, we said
that the algorithm can do many passes into the database, and each subsequent pass creates
itemsets that are of order 1, greater than the previous pass. We also said apriori ends
when no higher order itemsets can be found. We don't want our method to run till the end,
hence by using maxlen, we say that if we reach itemsets of order 10, we stop. The
apriori function returns an object of type itemsets.
It's good practice to examine the created object, itemset in this case. A closer look at the
itemset object should shed light on how we ended up using its properties to create our
data frame of itemsets:
    str(freq.items)
    Formal class 'itemsets' [package "arules"] with 4 slots
     ..@ items :Formal class 'itemMatrix' [package "arules"] with 3 slots
     .. .. ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
     .. .. .. .. ..@ i : int [1:141] 1018 4107 4767 11508 4767 6543 4767 11187
    4767 10322 ...
     .. .. .. .. ..@ p : int [1:71] 0 2 4 6 8 10 12 14 16 18 ...
     .. .. .. .. ..@ Dim : int [1:2] 14286 70
     .. .. .. .. ..@ Dimnames:List of 2
     .. .. .. .. .. ..$ : NULL
     .. .. .. .. .. ..$ : NULL
     .. .. .. .. ..@ factors : list()
     .. .. ..@ itemInfo :'data.frame': 14286 obs. of 1 variable:
     .. .. .. ..$ labels: chr [1:14286] "10" "1000" "10006" "10008" ...
     .. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
     ..@ tidLists: NULL
     ..@ quality :'data.frame': 70 obs. of 1 variable:
     .. ..$ support: num [1:70] 0.0108 0.0124 0.0124 0.0154 0.0122 ...
     ..@ info :List of 4
     .. ..$ data : symbol transactions.obj
     .. ..$ ntransactions: int 4997
     .. ..$ support : num 0.01
     .. ..$ confidence : num 1
                                            [ 18 ]
Association Rule Mining                                                           Chapter 1
   item_set support
   1 {Banana,Red Vine Tomato} 0.01030338
   2 {Bag of Organic Bananas,Organic D'Anjou Pears} 0.01001717
   3 {Bag of Organic Bananas,Organic Kiwi} 0.01016027
   4 {Banana,Organic Gala Apples} 0.01073268
   5 {Banana,Yellow Onions} 0.01302232
tail(freq.items.df)
   item_set support
   79 {Organic Baby Spinach,Organic Strawberries} 0.02575844
   80 {Bag of Organic Bananas,Organic Baby Spinach} 0.02690326
   81 {Banana,Organic Baby Spinach} 0.03048082
   82 {Bag of Organic Bananas,Organic Strawberries} 0.03577562
   83 {Banana,Organic Strawberries} 0.03305667
We create our data frame using these two lists of values. Finally, we use the head and tail
functions to quickly look at the itemsets present at the top and bottom of our data frame.
Before we move on to phase two of the association mining algorithm, let's stop for a
moment to investigate a quick feature present in the apriori method. If you had noticed,
our itemsets consist mostly of the high frequency itemsets. However, we can ask the
apriori method to ignore some of the items:
                                           [ 19 ]
Association Rule Mining                                                             Chapter 1
As you can see, we have excluded Banana and Bag of Organic Bananas from our
itemsets.
Congratulations, you have successfully finished implementing the first phase of the
apriori algorithm in R!
Let's move on to phase two, where we will induce rules from these itemsets. It's time to
introduce our second interest measure, confidence. Let's take an itemset from the list given
to us from phase one of the algorithm, {Banana, Red Vine Tomato}.
                                           [ 20 ]
Association Rule Mining                                                                Chapter 1
How often are these two rules found to be true in our database? The confidence score, our
next interest measure, will help us measure this:
          Confidence: For the rule Banana => Red Vine Tomato, the confidence is
          measured as the ratio of support for the itemset {Banana,Red Vine Tomato}
          and the support for the itemset {Banana}. Let's decipher this formula. Let's say
          that item Banana has occurred in 20 percent of the transactions, and it has
          occurred in 10 percent of transactions along with Red Vine Tomato, so the
          support for the rule is 50 percent, 0.1/0.2.
Similar to support, the confidence threshold is also provided by the user. Only those rules
whose confidence is greater than or equal to the user-provided confidence will be included
by the algorithm.
               Let's say we have a rule, A => B, induced from itemset <A, B>. The support
               for the itemset <A,B> is the number of transactions where A and B occur
               divided by the total number of the transactions. Alternatively, it's the
               probability that a transaction contains <A, B>.
               Now, confidence for A => B is P (B | A); which is the conditional
               probability that a transaction containing B also has A? P(B | A) = P ( A
               U B) / P (A) = Support ( A U B) / Support (A).
               Remember from probability theory that when two events are independent of
               each other, the joint probability is calculated by multiplying the individual
               probability. We will use this shortly in our next interest measure lift.
With this knowledge, let's go ahead and implement phase two of our apriori algorithm in
R:
   > confidence <- 0.4 # Interest Measure
    parameters = list(
    support = support,
    confidence = confidence,
    minlen = 2, # Minimal number of items per item set
    maxlen = 10, # Maximal number of items per item set
    target = "rules"
    )
   rules <- apriori(transactions.obj, parameter = parameters)
   rules.df <- data.frame(rules = labels(rules)
   ,rules@quality)
                                             [ 21 ]
Association Rule Mining                                                                 Chapter 1
Once again, we use the apriori method; however, we set the target parameter in
our parameters named list to rules. Additionally, we also provide the confidence
threshold. After calling the method apriori using the returned object rules, we finally
build our data frame, rules.df, to explore/view our rules conveniently. Let's look at our
output data frame, rules.df. For the given confidence threshold, we can see the set of
rules thrown out by the algorithm:
    head(rules.df)
    rules support confidence lift
     1 {Red Vine Tomato} => {Banana} 0.01030338 0.4067797 1.952319
     2 {Honeycrisp Apple} => {Banana} 0.01617058 0.4248120 2.038864
     3 {Organic Fuji Apple} => {Banana} 0.01817401 0.4110032 1.972590
          Lift: Often, we may have rules with high support and confidence but that are, as
          yet, of no use. This occurs when the item at the right-hand side of the rule has
          more probability of being selected alone than with the associated item. Let's go
          back to a grocery example. Say there are 1,000 transactions. 900 of those
          transactions contain milk. 700 of them contain strawberries. 300 of them have
          both strawberries and milk. A typical rule that can come out of such a scenario is
          strawberry => milk, with a confidence of 66 percent. The support
          (strawberry, milk) / support ( strawberry). This is not accurate and
          it's not going to help the retailer at all, as the probability of people buying milk is
          90 percent, much higher than the 66 percent given by this rule.
This is where lift comes to our rescue. For two products, A and B, lift measures how many
times A and B occur together, more often than expected if they were statistically
independent. Let's calculate the lift value for this hypothetical example:
    lift ( strawberry => milk ) = support ( strawberry, milk) / support(
    strawberry) * support (milk)
    = 0.3 / (0.9)(0.7)
    = 0.47
                                             [ 22 ]
Association Rule Mining                                                            Chapter 1
There are tons of other interest measures available in arules. They can be obtained by
calling the interestMeasure function. We show how we can quickly populate these
measures into our rules data frame:
     interestMeasure(rules, transactions = transactions.obj)
    rules.df <- cbind(rules.df, data.frame(interestMeasure(rules,
     transactions = transactions.obj)))
     We will not go through all of them here. There is a ton of literature
    available to discuss these measures and their use in filtering most useful
    rules.
Alright, we have successfully implemented our association rule mining algorithm; we went
under the hood to understand how the algorithm works in two phases to generate rules. We
have examined three interest measures: support, confidence, and lift. Finally, we
know that lift can be leveraged to make cross-selling recommendations to our retail
customers. Before we proceed further, let's look at some more functions in the arules
package that will allow us to do some sanity checks on the rules. Let us start with
redundant rules:
    is.redundant(rules) # Any redundant rules ?
Rules 1 and Rule 2 share the same right-hand side, Banana. The left-hand side of Rule
2 has a subset of items from Rule 1. The confidence score of Rule 2 is higher than Rule
1. In this case, Rule 2 is considered a more general rule than Rule 1 and hence Rule 2
will be considered as a redundant rule and removed. Another sanity check is to find
duplicated rules:
To remove rules with the same left- and right-hand side, the duplicated function comes in
handy. Let's say we create two rules objects, using two sets of support and confidence
thresholds. If we combine these two rules objects, we may have a lot of duplicate rules. The
duplicated function can be used to filter these duplicate rules. Let us now look at the
significance of a rule:
    is.significant(rules, transactions.obj, method = "fisher")
                                            [ 23 ]
Association Rule Mining                                                             Chapter 1
Given the rule A => B, we explained that lift calculates how many times A and B occur
together more often than expected. There are other ways of testing this independence, like a
chi-square test or a Fisher's test. The arules package provides the is.significant
method to do a Fisher or a chi-square test of independence. The parameter method can
either take the value of fisher or chisq depending on the test we wish to perform. Refer
to the arules documentation for more details. Finally let us see the list of transactions
where these rules are supported:
   >as(supportingTransactions(rules, transactions.obj), "list")
Here is the complete R script we have used up until now to demonstrate how to leverage
arules to extract frequent itemsets, induce rules from those itemsets, and filter the rules
based on interest measures.
                                            [ 24 ]
Association Rule Mining                                                Chapter 1
    transactions.obj
   # Item frequency
    head(sort(itemFrequency(transactions.obj, type = "absolute")
    , decreasing = TRUE), 10) # Most frequent
    head(sort(itemFrequency(transactions.obj, type = "absolute")
    , decreasing = FALSE), 10) # Least frequent
   itemFrequencyPlot(transactions.obj,topN = 25)
   # Interest Measures
    support <- 0.01
   # Frequent item sets
    parameters = list(
    support = support,
    minlen = 2, # Minimal number of items per item set
    maxlen = 10, # Maximal number of items per item set
    target = "frequent itemsets"
    )
   freq.items <- apriori(transactions.obj, parameter = parameters)
    # Let us examine our freq item sites
    freq.items.df <- data.frame(item_set = labels(freq.items)
    , support = freq.items@quality)
   head(freq.items.df, 5)
    tail(freq.items.df, 5)
   # Let us now examine the rules
    confidence <- 0.4 # Interest Measure
    parameters = list(
    support = support,
    confidence = confidence,
    minlen = 2, # Minimal number of items per item set
    maxlen = 10, # Maximal number of items per item set
    target = "rules"
    )
   rules <- apriori(transactions.obj, parameter = parameters)
    rules.df <- data.frame(rules = labels(rules)
    ,rules@quality)
   interestMeasure(rules, transactions = transactions.obj)
                                    [ 25 ]
Association Rule Mining                                                            Chapter 1
   library(ggplot2)
   library(arules)
                                           [ 26 ]
Association Rule Mining                                                 Chapter 1
                                    [ 27 ]
Association Rule Mining                                                             Chapter 1
head(no.rules.df) ##
In the preceding code, we show how we can wrap what we have discussed so far into some
functions to explore the support and confidence values. The get.rules function
conveniently wraps the rule generation functionality. For a given support, confidence, and
threshold, this function will return the rules induced by the association rule mining
algorithm. The explore.parameters function creates a grid of support and confidence
values. You can see that support values are in the range of 0.001 to 0.1, that is, we are
exploring the space where items are present in 0.1 percent of the transactions up to 10
percent of the transactions. Finally, the function evaluates the number of rules generated for
each support confidence pair in the grid. The function nicely wraps the results in a data
frame. For very low support values, we see an extremely large number of rules being
generated. Most of them would be spurious rules:
   support confidence no.rules
    1 0.001 0.05 23024
    2 0.002 0.05 4788
    3 0.003 0.05 2040
                                            [ 28 ]
Association Rule Mining                                                             Chapter 1
Analysts can take a look at this data frame, or alternatively plot it to decide the right
minimum support/confidence threshold. Our get.plots function plots a number of graphs
for different values of confidence. Here is the line plot of the number of rules generated for
various support values, keeping the confidence fixed at 0.1:
The preceding plot can be a good guideline for selecting the support value. We generated
the preceding plot by fixing the confidence at 10 percent. You can experiment with different
values of confidence. Alternatively, use the get.plots function.
                                            [ 29 ]
Association Rule Mining                                                            Chapter 1
              The paper, Mining the Most Interesting Rules, by Roberto J. Bayardo Jr. and
              Rakesh Agrawal demonstrates that the most interesting rules are found on
              the support/confidence border when we plot the rules with support and
              confidence on the x and y axes respectively. They call these rules SC-
              Optimal rules.
    library(arules)
    library(igraph)
                                             [ 30 ]
Association Rule Mining                                                   Chapter 1
    rm.duplicates = FALSE,
    quote = "", skip = 0,
    encoding = "unknown")
    return(transactions.obj)
    }
                                    [ 31 ]
Association Rule Mining                                                           Chapter 1
     return(best.rules.df)
     }
plot.graph(cross.sell.rules)
After exploring the dataset for support and confidence values, we set the support and
confidence values as 0.001 and 0.2 respectively.
We have written a function called find.rules. It internally calls get.rules. This function
returns the list of top N rules given the transaction and support/confidence thresholds. We
are interested in the top 10 rules. As discussed, we are going to use lift values for our
recommendation. The following are our top 10 rules:
     rules support confidence lift conviction leverage
    59 {Organic Hass Avocado} => {Bag of Organic Bananas} 0.03219805 0.3086420
   1.900256 1.211498 0.01525399
    63 {Organic Strawberries} => {Bag of Organic Bananas} 0.03577562 0.2753304
                                          [ 32 ]
Association Rule Mining                                                               Chapter 1
The first entry has a lift value of 1.9, indicating that the products are not independent. This
rule has a support of 3 percent and the system has 30 percent confidence for this rule. We
recommend that the retailer uses these two products in his cross-selling campaign as, given
the lift value, there is a high probability of the customer picking up a {Bag of Organic
Bananas} if he picks up an {Organic Hass Avocado}.
Curiously, we have also included two other interest measures—conviction and leverage.
Leverage
How many more units of A and B are expected to be sold together than expected from
individual sales? With lift, we said that there is a high association between the {Bag of
Organic Bananas} and {Organic Hass Avocado} products. With leverage, we are able
to quantify in terms of sales how profitable these two products would be if sold together.
The retailer can expect 1.5 more unit sales by selling the {Bag of Organic Bananas} and
the {Organic Hass Avocado} together rather than selling them individually. For a given
rule A => B:
    Leverage(A => B) = Support(A => B) - Support(A)*Support(B)
Leverage measures the difference between A and B appearing together in the dataset and
what would be expected if A and B were statistically dependent.
                                             [ 33 ]
Association Rule Mining                                                              Chapter 1
Conviction
Conviction is a measure to ascertain the direction of the rule. Unlike lift, conviction is
sensitive to the rule direction. Conviction (A => B) is not the same as conviction (B => A).
Conviction, with the sense of its direction, gives us a hint that targeting the customers of
Organic Hass Avocado to cross-sell will yield more sales of Bag of Organic Bananas rather
than the other way round.
Thus, using lift, leverage, and conviction, we have provided all the empirical details to our
retailer to design his cross-selling campaign. In our case, we have recommended the top 10
rules to the retailer based on leverage. To provide the results more intuitively and to
indicate what items could go together in a cross-selling campaign, a graph visualization of
the rules can be very appropriate.
The plot.graph function is used to visualize the rules that we have shortlisted based on
their leverage values. It internally uses a package called igraph to create a graph
representation of the rules:
                                            [ 34 ]
Association Rule Mining                                                                   Chapter 1
Our suggestion to the retailer can be the largest subgraph on the left. Items in that graph
can be leveraged for his cross-selling campaign. Depending on the profit margin and other
factors, the retailer can now design his cross-selling campaign using the preceding output.
    Our output up until now has been great. How can I add some additional lists of products
    to my campaign?
You are shocked; the data scientist in you wants to do everything empirically. Now the
retailer is asking for a hard list of products to be added to the campaign. How do you fit
them in?
    The analysis output does not include these products. None of our top rules recommend
    these products. They are not very frequently sold items. Hence, they are not bubbling up in
    accordance with the rules.
The retailer is insistent: "The products I want to add to the campaign are high margin items.
Having them in the campaign will boost my yields."
Voila! The retailer is interested in high-margin items. Let's pull another trick of the
trade—weighted association rule mining.
Jubilant, you reply, "Of course I can accommodate these items. Not just them, but also your
other high-valued items. I see that you are interested in your margins; can you give me the
margin of all your transactions? I can redo the analysis with this additional information and
provide you with the new results. Shouldn't take much time."
The retailer is happy. "Of course; here is my margin for the transactions."
Let's introduce weighted association rule mining with an example from the seminal
paper, Weighted Association Rules: Models and Algorithms by G.D.Ramkumar et al.
                                              [ 35 ]
Association Rule Mining                                                          Chapter 1
Caviar is an expensive and hence a low support item in any supermarket basket. Vodka,
on the other hand, is a high to medium support item. The association, caviar => vodka is
of very high confidence but will never be derived by the existing methods as the {caviar,
vodka} itemset is of low support and will not be included.
The preceding paragraph echoes our retailer's concern. With the additional information
about the margin for our transactions, we can now use weighted association rule mining to
arrive at our new set of recommendations:
   "transactionID","weight"
    "1001861",0.59502283788534
    "1003729",0.658379205926458
    "1003831",0.635451491097042
    "1003851",0.596453384749423
    "1004513",0.558612727312164
    "1004767",0.557096300448959
    "1004795",0.693775098285732
    "1004797",0.519395513963845
    "1004917",0.581376662057422
   library(arules)
    library(igraph)
                                          [ 36 ]
Association Rule Mining                                                   Chapter 1
    #
    # Returns:
    # transaction object
    transactions.obj <- read.transactions(file = data.path, format = "single",
    sep = ",",
    cols = columns,
    rm.duplicates = FALSE,
    quote = "", skip = 0,
    encoding = "unknown")
    return(transactions.obj)
    }
   parameters = list(
    support = support,
    minlen = 2, # Minimal number of items per item set
    maxlen = 10, # Maximal number of items per item set
    target = "frequent itemsets"
    )
    weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
                                    [ 37 ]
Association Rule Mining                                                           Chapter 1
   head(weclat.itemsets.df)
    tail(weclat.itemsets.df)
   # Rule induction
    weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj,
   confidence = 0.3)
    weclat.rules.df <-data.frame(rules = labels(weclat.rules)
    , weclat.rules@quality)
    head(weclat.rules.df)
In the arules package, the weclat method allows us to use weighted transactions to
generate frequent itemsets based on these weights. We introduce the weights through the
itemsetinfo data frame in the str(transactions.obj) transactions object:
The third slot in the transaction object is a data frame with one column, transactionID.
We create a new column called weight in that data frame and push our transaction
weights:
   weights <- read.csv("../../data/weights.csv")
    transactions.obj@itemsetInfo <- weights
    str(transactions.obj)
                                          [ 38 ]
Association Rule Mining                                                           Chapter 1
In the preceding case, we have replaced the whole data frame. You can either do that or
only add the weight column.
We have the transactionID and the weight in the itemsetInfo data frame now. Let's
run the weighted itemset generation using these transaction weights:
    support <- 0.01
    parameters = list(
     support = support,
     minlen = 2, # Minimal number of items per item set
     maxlen = 10, # Maximal number of items per item set
     target = "frequent itemsets"
     )
Once again, we invoke the weclat function with the parameter list and the transactions
object. As the itemInfo data frame has the weight column, the function calculates the
support using the weights provided. The new definition of support is as follows:
                                             [ 39 ]
Association Rule Mining                                                              Chapter 1
The weighted support of an itemset is the sum of the weights of the transactions that
contain the itemset. An itemset is frequent if its weighted support is equal to or greater than
the threshold specified by support (assuming that the weights, sum is equal to one).
With this new definition, you can see now that low support times established by the old
definition of support, if present in high value transactions, will be included. We have
automatically taken care of our retailer's request to include high margin items while
inducing the rules. Once again, for better reading, we create a data frame where each row is
the frequent itemset generated, and a column to indicate
the head(weclat.itemsets.df)support value:
        weclat.itemsets support
    1   {Bag of Organic Bananas,Organic Kiwi} 0.01041131
    2   {Bag of Organic Bananas,Organic D'Anjou Pears} 0.01042194
    3   {Bag of Organic Bananas,Organic Whole String Cheese} 0.01034432
    4   {Organic Baby Spinach,Organic Small Bunch Celery} 0.01039107
    5   {Bag of Organic Bananas,Organic Small Bunch Celery} 0.01109455
    6   {Banana,Seedless Red Grapes} 0.01274448
tail(weclat.itemsets.df)
        weclat.itemsets support
    77   {Banana,Organic Hass Avocado} 0.02008700
    78   {Organic Baby Spinach,Organic Strawberries} 0.02478094
    79   {Bag of Organic Bananas,Organic Baby Spinach} 0.02743582
    80   {Banana,Organic Baby Spinach} 0.02967578
    81   {Bag of Organic Bananas,Organic Strawberries} 0.03626149
    82   {Banana,Organic Strawberries} 0.03065132
In the case of apriori, we used the same function to generate/induce the rules. However,
in the case of weighted association rule mining, we need to call the ruleInduction
function to generate rules. We pass the frequent itemsets from the previous step, the
transactions object, and finally the confidence threshold. Once again, for our convenience,
we create a data frame with the list of all the rules that are induced and their interest
measures:
    weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence
    = 0.3)
    weclat.rules.df <-data.frame(weclat.rules = labels(weclat.rules)
    , weclat.rules@quality)
head(weclat.rules.df)
                                            [ 40 ]
Association Rule Mining                                                              Chapter 1
Finally, let's use the plot.graph function to view the new set of interesting item
associations:
                                           [ 41 ]
Association Rule Mining                                                                 Chapter 1
Our new recommendation now includes some of the rare items. It is also sensitive to the
profit margin of individual transactions. With these recommendations, the retailer is geared
toward increasing his profitability through the cross-selling campaign.
The arules package provides a method called HITS to help us do the exact same
thing—infer transaction weights. HITS stands for Hyperlink-induced Topic Search—a link
analysis algorithm used to rate web pages developed by John Kleinberg. According to
HITS, hubs are pages with large numbers of out degrees and authority are pages with large
numbers of in degrees.
The rationale is that if a lot of pages link to a particular page, then that page is considered
an authority. If a page has links to a lot of other pages, it is considered a hub.
                                              [ 42 ]
Association Rule Mining                                                              Chapter 1
The basic idea behind using the HITS algorithm for association rule mining is that frequent
items may not be as important as they seem. The paper presents an algorithm that applies
the HITS methodology to bipartite graphs.
According to the HITS modified for the transactions database, the transactions and
products are treated as a bipartite graph, with an arc going from the transaction to the
product if the product is present in the transaction. The following diagram is reproduced
from the paper, Mining Weighted Association Rules without Preassigned Weights by Ke Sun and
Fengshan Bai, to illustrate how transactions in a database are converted to a bipartite graph:
                                            [ 43 ]
Association Rule Mining                                                                 Chapter 1
In this representation, the transactions can be ranked using the HITS algorithm. In this kind
of representation, the support of an item is proportional to its degree. Consider item A. Its
absolute support is 4; in the graph, the in degree of A is four. As you can see, considering
only the support, we totally ignore transaction importance, unless the importance of the
transactions is explicitly provided. How do we get the transaction weights in this scenario?
Again, a way to get the weights intuitively is from a good transaction, which should be
highly weighted and should contain many good items. A good item should be contained by
many good transactions. By treating transactions as hubs and the products as authorities,
the algorithm invokes HITS on this bipartite graph.
The arules package provides the method (HITS), which implements the algorithm that we
described earlier:
    weights.vector <- hits( transactions.obj, type = "relative")
    weights.df <- data.frame(transactionID = labels(weights.vector), weight =
    weights.vector)
We invoke the algorithm using the HITS method. We have described the intuition behind
using the HITS algorithm in transaction databases to give weights to our transactions. We
will briefly describe how the HITS algorithm functions. Curious readers can refer
to Authoritative Sources in a Hyperlinked Environment, J.M. Kleinberg, J. ACM, vol. 46, no. 5, pp.
604-632, 1999, for a better understanding of the HITS algorithm.
The HITS algorithm, to begin with, initializes the weight of all nodes to one; in our case, the
items and the transactions are set to one. That is, we maintain two arrays, one for hub
weights and the other one for authority weights.
It proceeds to do the following three steps in an iterative manner, that is, until our hub and
authority arrays stabilize or don't change with subsequent updates:
       1. Authority node score update: Modify the authority score of each node to the sum
          of the hub scores of each node that points to it.
       2. Hub node score update: Change the hub score of each node to the sum of the
          authority scores of each node that it points to.
       3. Unit normalize the hub and authority scores: Continue with the authority node
          score update until the hub and authority value stabilizes.
                                              [ 44 ]
Association Rule Mining                                                                Chapter 1
At the end of the algorithm, every item has an authority score, which is the sum of the hub
scores of all the transactions that contain this item. Every transaction has a hub score that is
the sum of the authority score of all the items in that transaction. Using the weights created
using the HITS algorithm, we create a weights.df data frame:
    head(weights.df)
                    transactionID weight
    1000431 1000431 1.893931e-04
    100082 100082 1.409053e-04
    1000928 1000928 2.608214e-05
    1001517 1001517 1.735461e-04
    1001650 1001650 1.184581e-04
    1001934 1001934 2.310465e-04
    Pass weights.df our transactions object. We can now generate the weighted
    association rules:
     )
     weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
     weclat.itemsets.df <-data.frame(weclat.itemsets = labels(weclat.itemsets)
     , weclat.itemsets@quality)
    We can look into the output data frames created for frequent itemsets and
    rules:
head(weclat.itemsets.df)
        weclat.itemsets support
    1   {Banana,Russet Potato} 0.01074109
    2   {Banana,Total 0% Nonfat Greek Yogurt} 0.01198206
    3   {100% Raw Coconut Water,Bag of Organic Bananas} 0.01024201
    4   {Organic Roasted Turkey Breast,Organic Strawberries} 0.01124278
    5   {Banana,Roma Tomato} 0.01089124
    6   {Banana,Bartlett Pears} 0.01345293
                                             [ 45 ]
Association Rule Mining                                                           Chapter 1
tail(weclat.itemsets.df)
     weclat.itemsets support
   540 {Bag of Organic Bananas,Organic Baby Spinach,Organic Strawberries}
   0.02142840
   541 {Banana,Organic Baby Spinach,Organic Strawberries} 0.02446832
   542 {Bag of Organic Bananas,Organic Baby Spinach} 0.06536606
   543 {Banana,Organic Baby Spinach} 0.07685530
   544 {Bag of Organic Bananas,Organic Strawberries} 0.08640422
   545 {Banana,Organic Strawberries} 0.08226264
head(weclat.rules.df)
Based on the weights generated using the HITS algorithm, we can now order the items by
their authority score. This is an alternate way of ranking the items in addition to ranking
them by their frequency. We can leverage the itemFrequency function to generate the item
scores:
   freq.weights <- head(sort(itemFrequency(transactions.obj, weighted =
   TRUE),decreasing = TRUE),20)
   freq.nweights <- head(sort(itemFrequency(transactions.obj, weighted =
   FALSE),decreasing = TRUE),20)
                                          [ 46 ]
Association Rule Mining                                                          Chapter 1
The column score gives the relative transaction frequency. The score.new column is the
authority score from the hits algorithm. You can see that Limes and Large Lemon have
interchanged places. Strawberries has gone further up the order compared to the original
transaction frequency score.
                                          [ 47 ]
Association Rule Mining                                                Chapter 1
library(arules)
                                    [ 48 ]
Association Rule Mining                                                Chapter 1
weights.vector)
head(weights.df)
    )
    weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
    weclat.itemsets.df <-data.frame(weclat.itemsets = labels(weclat.itemsets)
    , weclat.itemsets@quality)
   head(weclat.itemsets.df)
    tail(weclat.itemsets.df)
   ## Rule induction
    weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj,
   confidence = 0.3)
    weclat.rules.df <-data.frame(weclat.rules = labels(weclat.rules)
    , weclat.rules@quality)
head(weclat.rules.df)
                                    [ 49 ]
Association Rule Mining                                                               Chapter 1
There are several algorithms proposed to induce negative association rules from a
transaction database. Apriori is not a well-suited algorithm for negative association rule
mining. In order to use apriori, every transaction needs to be updated with all the
items—those that are present in the transaction and those that are absent. This will heavily
inflate the database. In our case, every transaction will have 16,000 items. We can cheat; we
can leverage the use of apriori for negative association rule mining only for a selected list
of items. In transactions that do not contain the item, we can create an entry for that item to
indicate that the item is not present in the transaction. The arules package's
function, addComplement, allows us to do exactly that.
When we pass this transaction to addComplement and say that we want non-Banana entries
to be added to the transaction, the resulting transaction from addComplement will be as
follows:
    Banana, Strawberries
    Onions, ginger, garlic, !Banana
    Milk, Banana
An exclamation mark is the standard way to indicate the absence; however, you can choose
your own prefix:
    get.neg.rules <- function(transactions, itemList, support, confidence){
     # Generate negative association rules for given support confidence value
     #
     # Args:
                                             [ 50 ]
Association Rule Mining                                                        Chapter 1
In the preceding code, we have created a get.neg.rules function. Inside this method, we
have leveraged the addComplement function to introduce the absence entry of the given
items in itemList into the transactions. We generate the rules with the newly formed
transactions, neg.transactions:
   itemList <- c("Organic Whole Milk","Cucumber Kirby")
   neg.rules <- get.neg.rules(transactions.obj,itemList, .05,.6)
Once we have the negative rules, we pass those through is.redundant to remove any
redundant rules and finally print the rules:
                                         [ 51 ]
Association Rule Mining                                                Chapter 1
    #
    # Script:
    #
    # RScript to explain negative associative rule mining
    #
    # Gopi Subramanian
    #########################################################################
   library(arules)
    library(igraph)
                                    [ 52 ]
Association Rule Mining                                                         Chapter 1
    }
    get.neg.rules <- function(transactions, itemList, support, confidence){
    # Generate negative association rules for given support confidence value
    #
    # Args:
    # transactions: Transaction object, list of transactions
    # itemList : list of items to be negated in the transactions
    # support: Minimum support threshold
    # confidence: Minimum confidence threshold
    # Returns:
    # A data frame with the best set negative rules and their support and
   confidence values
    neg.transactions <- addComplement( transactions, labels = itemList)
    rules <- get.rules(support, confidence, neg.transactions)
    return(rules)
    }
labels(neg.rules.nr)
Rules visualization
In the previous sections, we leveraged plotting capability from the arules and igraph
packages to plot induced rules. In this section, we introduce arulesViz, a package
dedicated to plot association rules, generated by the arules package. The arulesViz
package integrates seamlessly with the arules packages in terms of sharing data
structures.
                                          [ 53 ]
Association Rule Mining                                                              Chapter 1
    library(arules)
     library(arulesViz)
                                            [ 54 ]
Association Rule Mining                                                Chapter 1
    #
    # Returns:
    # rules object
    parameters = list(
    support = support,
    confidence = confidence,
    minlen = 2, # Minimal number of items per item set
    maxlen = 10, # Maximal number of items per item set
    target = "rules"
   # Induce Rules
    all.rules <- get.rules(support, confidence, transactions.obj)
                                    [ 55 ]
Association Rule Mining                                       Chapter 1
                                            [ 56 ]
Association Rule Mining                                                               Chapter 1
Wrapping up
The final step in any data analysis project is documentation—either generating a report of
the findings or documenting the scripts and data used. In our case, we are going to wrap up
with a small application. We will use RShiny, an R web application framework. RShiny is a
powerful framework for developing interactive web applications using R. We will leverage
the code that we have written to generate a simple, yet powerful, user interface for our
retail customers.
To keep things simple, we have a set of three screens. The first screen, as shown in the
following screenshot, allows the user to vary support and confidence thresholds and view
the rules generated. It also has additional interest measures, lift, conviction, and leverage.
The user can sort the rule by any of these interest measures:
                                             [ 57 ]
Association Rule Mining                                         Chapter 1
                                             [ 58 ]
Association Rule Mining                                                             Chapter 1
Finally, a graph representation to view the product grouping for the easy selection of
products for the cross-selling campaign is as follows:
   library(shiny)
    library(plotly)
    library(arules)
    library(igraph)
    library(arulesViz)
                                           [ 59 ]
Association Rule Mining                                                Chapter 1
    #
    # Returns:
    # transaction object
    transactions.obj <- read.transactions(file = data.path, format = "single",
    sep = ",",
    cols = columns,
    rm.duplicates = FALSE,
    quote = "", skip = 0,
    encoding = "unknown")
    return(transactions.obj)
    }
                                    [ 60 ]
Association Rule Mining                                                   Chapter 1
    return(best.rules.df)
    }
})
})
                                    [ 61 ]
Association Rule Mining                                                       Chapter 1
   ui <- fluidPage(
    headerPanel(title = "X-Sell Recommendations"),
    sidebarLayout(
    sidebarPanel(
    sliderInput("Support", "Support threshold:", min = 0.01, max = 1.0, value
   = 0.01),
    sliderInput("Confidence", "Support threshold:", min = 0.05, max = 1.0,
   value = 0.05)
    ),
    mainPanel(
    tabsetPanel(
    id = 'xsell',
    tabPanel('Rules', DT::dataTableOutput('rulesTable')),
    tabPanel('Explore', plotOutput('explorePlot')),
    tabPanel('Item Groups', plotOutput('graphPlot'))
    )
    )
    )
    )
We have described the get.txn, get.rules, and find.rules functions in the previous
section. We will not go through them again here. The preceding code is a single page
RShiny app code; both the server and the UI component reside in the same file.
                                        [ 62 ]
Association Rule Mining                                                                Chapter 1
     ),
     mainPanel(
     tabsetPanel(
     id = 'xsell',
     tabPanel('Rules', DT::dataTableOutput('rulesTable')),
     tabPanel('Explore', plotOutput('explorePlot')),
     tabPanel('Item Groups', plotOutput('graphPlot'))
     )
     )
     )
     )
We define the screen layout in this section. This section can also be kept in a separate file
called UI.R. The page is defined by two sections, a panel in the left, defined by
sidebarPanel, and a main section defined under mainPanel. Inside the side bar, we have
defined two slider controls for the support and confidence thresholds respectively. The
main panel contains a tab-separated window, defined by tabPanel.
The main panel has three tabs; each tab has a slot defined for the final set of rules, with their
interest measures, a scatter plot for the rules, and finally the graph plot of the rules.
})
                                             [ 63 ]
Association Rule Mining                                                            Chapter 1
The cross.sell.rules data frame is defined as a reactive component. When the values of
the support and confidence thresholds change in the UI, cross.sell.rules data frame
will be recomputed. This frame will be served to the first page, where we have defined a
slot for this table, called rulesTable:
     gen.rules <- reactive({
     support <- input$Support
     confidence <- input$Confidence
     gen.rules <- get.rules( support, confidence ,transactions.obj)
     return(gen.rules)
     })
This reactive component retrieves the calculations and returns the rules object every time
the support or/and confidence threshold is changed by the user in the UI:
     output$rulesTable <- DT::renderDataTable({
     cross.sell.rules()
     })
The preceding code renders the data frame back to the UI:
     output$graphPlot <- renderPlot({
     g <-plot.graph(cross.sell.rules())
     plot(g)
     })
The preceding two pieces of code render the plot back to the UI.
                                           [ 64 ]
Association Rule Mining                                                           Chapter 1
Summary
We started the chapter with an overview of recommender systems. We introduced our
retail case and the association rule mining algorithm. Then we applied association rule
mining to design a cross-selling campaign. We went on to understand weighted association
rule mining and its applications. Following that, we introduced the HITS algorithm and its
use in transaction data. Next, we studied the negative association rules discovery process
and its use. We showed you different ways to visualize association rules. Finally, we
created a small web application using R Shiny to demonstrate some of the concepts we
learned.
In the next chapter, we will look at another recommendation system algorithm called
content based filtering. We will see how this method can help address the famous cold start
problem in recommendation systems. Furthermore, we will introduce the concept of fuzzy
ranking to order the final recommendations.
                                           [ 65 ]
      Fuzzy Logic Induced Content-
                                                                                2
          Based Recommendation
When a friend comes to you for a movie recommendation, you don't arbitrarily start
shooting movie names. You try to suggest movies while keeping in mind your friend's
tastes. Content-based recommendation systems try to mimic the exact same process.
Consider a scenario in which a user is browsing through a list of products. Given a set of
products and the associated product properties, when a person views a particular product,
content-based recommendation systems can generate a subset of products with similar
properties to the one currently being viewed by the user. Typical content-based
recommendation systems tend to also include the user profile. In this chapter, however, we
will not be including the user profiles. We will be working solely with the item/product
profiles. Content-based recommendation systems are also called content-based filtering
methods.
During winter, cars tend to have problems starting. But once the engine reaches the optimal
temperature it starts to run smoothly. Recommender systems tend to have the same cold
start problem. When deployed for the first time, the recommender system is unaware of the
user preferences. It's in the dark when it comes to recommendations. Other recommender
systems that use user preferences, such as collaborative filtering, which we covered in the
previous chapter, need user preferences to make recommendations. In the absence of user
preferences, these methods become ineffective.
Fuzzy Logic Induced Content-Based Recommendation                                    Chapter 2
Content-based recommendation is the best method to address the cold start problem. Since
content-based methods rely on the product properties to create recommendations, they can
ignore the user preferences, to begin with. While the content-based method dishes out the
needed recommendation, the user profile can be built in the background. With a sufficient
user profile, content-based methods can be further improved or can move on to using
collaborative filtering methods.
A news aggregator collects syndicated web content such as new articles, blogs, video, and
similar items at a centralized location for easy viewing. We will use an imaginary news
aggregator website as an example to explain and build a content-based recommendation
system. Our use case is as follows. When a person is reading a particular news article, we
want to recommend to him other news article which might interest him. This will help us
understand how the cold-start problem in recommender systems is addressed using
content-based methods. Furthermore, our fuzzy logic rule system will shed light on how a
fuzzy system can be leveraged for ranking.
The code for this chapter was written in RStudio version 0.99.491. It uses R version 3.3.1. As
we work through our examples, we will introduce the R packages that we will be using.
During our code description, we will be using some of the output printed in the console. We
have included what will be printed in the console immediately following the statement
which prints the information to the console, so as to not disturb the flow of the code.
                                               [ 67 ]
Fuzzy Logic Induced Content-Based Recommendation                                               Chapter 2
This dataset is the result of the chemical analysis of wine grown in the same region in Italy.
We have data from three different cultivars (From an assemblage of plants selected for desirable
characters, Wikipedia: https://en.wikipedia.org/wiki/Cultivar).
    > head(wine.data)
       V1    V2   V3  V4         V5    V6     V7     V8     V9    V10    V11    V12    V13    V14
    1: 1 14.23 1.71 2.43       15.6   127   2.80   3.06   0.28   2.29   5.64   1.04   3.92   1065
    2: 1 13.20 1.78 2.14       11.2   100   2.65   2.76   0.26   1.28   4.38   1.05   3.40   1050
    3: 1 13.16 2.36 2.67       18.6   101   2.80   3.24   0.30   2.81   5.68   1.03   3.17   1185
    4: 1 14.37 1.95 2.50       16.8   113   3.85   3.49   0.24   2.18   7.80   0.86   3.45   1480
    5: 1 13.24 2.59 2.87       21.0   118   2.80   2.69   0.39   1.82   4.32   1.04   2.93    735
    6: 1 14.20 1.76 2.45       15.2   112   3.27   3.39   0.34   1.97   6.75   1.05   2.85   1450
      1 2 3
    59 71 48
    >
                                              [ 68 ]
Fuzzy Logic Induced Content-Based Recommendation                                   Chapter 2
Let's remove the cultivar and only retain the chemical properties of the wine:
    > wine.type <- wine.data[,1]
    > wine.features <- wine.data[,-1]
wine.features has all the properties and without the cultivar column.
Let's add the row names and give an integer number for each wine:
    > rownames(wine.mat) <- seq(1:dim(wine.features.scaled)[1])
    > wine.mat[1:2,]
             V2          V3        V4        V5         V6        V7        V8
    V9        V10         V11
    1 1.5143408 -0.5606682 0.2313998 -1.166303 1.90852151 0.8067217 1.0319081
    -0.6577078 1.2214385 0.2510088
    2 0.2455968 -0.4980086 -0.8256672 -2.483841 0.01809398 0.5670481 0.7315653
    -0.8184106 -0.5431887 -0.2924962
            V12      V13       V14
    1 0.3611585 1.842721 1.0101594
    2 0.4049085 1.110317 0.9625263
    >
With our matrix ready, let's find the similarity between the wines.
We have numbered our rows representing the wine. The columns represent the properties
of the wine.
cov() is the covariance, and it's divided by the standard deviation of x and standard
deviation of y.
                                             [ 69 ]
Fuzzy Logic Induced Content-Based Recommendation                                    Chapter 2
In our case, we want to find the pearson coefficient between the rows. We want the
similarity between two wines. Hence we will transpose our matrix before invoking cor
function.
    > dim(cor.matrix)
    [1] 178 178
    > cor.matrix[1:5,1:5]
              1           2             3             4            5
    1 1.0000000 0.7494842       0.5066551 0.7244043066 0.1850897291
    2 0.7494842 1.0000000       0.4041662 0.6896539740 -0.1066822182
    3 0.5066551 0.4041662       1.0000000 0.5985843958 0.1520360593
    4 0.7244043 0.6896540       0.5985844 1.0000000000 -0.0003942683
    5 0.1850897 -0.1066822      0.1520361 -0.0003942683 1.0000000000
We transpose the wine.mat matrix and pass it to the cor function. In the transposed
matrix, our output will be the similarity between the different wines.
The cor.matrix matrix is the similarity matrix, which shows how closely related items are.
The values range from -1 for perfect negative correlation, when two items have attributes
that move in opposite directions, and +1 for perfect positive correlation, when attributes for
the two items move in the same direction. For example, in row 1, wine 1 is more similar to
wine 2 than wine 3. The diagonal values will be +1, as we are comparing a wine to itself.
    > user.view
             V2         V3       V4         V5         V6        V7       V8
    V9      V10       V11
    3 0.1963252 0.02117152 1.106214 -0.2679823 0.08810981 0.8067217 1.212114
    -0.497005 2.129959 0.2682629
            V12       V13      V14
    3 0.3174085 0.7863692 1.391224
                                            [ 70 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
Let's a say a particular user is either tasting or looking at the properties of wine 3. We want
to recommend him wines similar to wine 3.
    > sim.items
               1              2              3             4              5              6
    7              8              9
      0.50665507    0.40416617     1.00000000 0.59858440       0.15203606     0.54025182
    0.57579895     0.18210803     0.42398729
              10            11             12         13                 14            15
    16             17             18
      0.55472235    0.66895949     0.40555308 0.61365843       0.57899194     0.73254986
    0.36166695     0.44423273     0.28583467
              19            20             21         22                 23            24
    25             26             27
      0.49034236    0.44071794     0.37793495 0.45685238       0.48065399     0.52503055
    0.41103595     0.04497370     0.56095748
              28            29             30         31                 32            33
    34             35             36
      0.38265553    0.36399501     0.53896624 0.70081585       0.61082768     0.37118102
    -0.08388356     0.41537403     0.57819928
              37            38             39         40                 41            42
    43             44             45
      0.33457904    0.50516170     0.34839907 0.34398394       0.52878458     0.17497055
    0.63598084     0.10647749     0.54740222
              46            47             48         49                 50            51
    52             53             54
    -0.02744663     0.48876356     0.59627672 0.68698418       0.48261764     0.76062564
    0.77192733     0.50767052     0.41555689.....
We look at the third row in our similarity matrix. We know that the similarity matrix has
stored all the item similarities. So the third row gives us the similarity score between wine 3
and all the other wines. The preceding results are truncated.
    > sim.items.sorted[1:5]
            3        52        51        85        15
    1.0000000 0.7719273 0.7606256 0.7475886 0.7325499
    >
                                             [ 71 ]
Fuzzy Logic Induced Content-Based Recommendation                                        Chapter 2
First, we sort row 3 in decreasing order, so we have all the items close to wine 3 popping to
the front. Then we pull out the top five matches. Great--we want to recommend wines 52,
51, 85, and 15 to this user. We ignore the first recommendation as it will be the same item
we are searching for. In this case, the first element will be wine 3 with a similarity score of
1.0.
Let's look at the properties of wine 3 and the top five matches to confirm our
recommendation:
    > rbind(wine.data[3,]
    + ,wine.data[52,]
    + ,wine.data[51,]
    + ,wine.data[85,]
    + ,wine.data[15,]
    + )
        V1   V2   V3   V4       V5 V6    V7   V8   V9 V10         V11    V12    V13    V14
    1: 1 13.16 2.36 2.67      18.6 101 2.80 3.24 0.30 2.81       5.68   1.03   3.17   1185
    2: 1 13.83 1.65 2.60      17.2 94 2.45 2.99 0.22 2.29        5.60   1.24   3.37   1265
    3: 1 13.05 1.73 2.04      12.4 92 2.72 3.27 0.17 2.91        7.20   1.12   2.91   1150
    4: 2 11.84 0.89 2.58      18.0 94 2.20 2.21 0.22 2.35        3.05   0.79   3.08    520
    5: 1 14.38 1.87 2.38      12.0 102 3.30 3.64 0.29 2.96       7.50   1.20   3.00   1547
    >
Great—you can see that the wine properties in our recommendation are close to the
properties of wine 3.
                                             [ 72 ]
Fuzzy Logic Induced Content-Based Recommendation                        Chapter 2
                                                [ 73 ]
Fuzzy Logic Induced Content-Based Recommendation                                Chapter 2
We use the cnames vector to define these headings as we read the file using the read_tsv
function. Further inside read_tsv, while defining the column types, we also specify the
variable type for each of these columns.
   # A tibble: 2,991 x 1
                    PUBLISHER
                         <chr>
    1                 1011now
    2                  10News
    3                     10TV
    4             123Jump.com
    5           12NewsNow.Com
    6               13WHAM-TV
    7       13abc Action News
    8 14 News WFIE Evansville
    9       "24\\/7 Wall St."
   10     "2DayFM \\(blog\\)"
   # ... with 2,981 more rows
                                              [ 74 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
    > head(publisher.count)
          PUBLISHER ct
    1       1011now 1
    2        10News 4
    3          10TV 2
    4   123Jump.com 1
    5 12NewsNow.Com 3
    6     13WHAM-TV 3
    > dim(publisher.count)
    [1] 2991    2
We first find the number of articles under each publisher. Looks like a lot of publishers have
very few articles. Let's validate it to see the number of publishers with less than 10 articles.
We can see 2,820 publishers, out of 2,991, have less than ten articles.
Let's get the top 100 publishers by looking at the number of articles they have published:
    > publisher.top <- head(publisher.count[order(-publisher.count$ct),],100)
    > head(publisher.top)
                PUBLISHER       ct
    1937          Reuters       90
    309      Businessweek       58
    1548           NASDAQ       49
    495 Contactmusic.com        48
    540        Daily Mail       47
    882        GlobalPost       47
    >
We can see that Reuters tops the list. We have retained only the articles from the top 100
publishers list for our exercise. Data frame publisher.top has the top 100 publishers.
                                                [ 75 ]
Fuzzy Logic Induced Content-Based Recommendation                                  Chapter 2
For our top 100 publishers, let's now get their articles and other information:
    > data.subset <- inner_join(publisher.top, data)
    Joining, by = "PUBLISHER"
    > head(data.subset)
      PUBLISHER ct      ID
    TITLE
    1    Reuters 90 38081         PRECIOUS-Gold ticks lower, US dollar holds
    near peak
    2    Reuters 90 306465                UKs FTSE rallies as Rolls-Royce races
    higher
    3    Reuters 90 371436 US economic growth to continue at modest pace - Feds
    Lacker
    4    Reuters 90 410152             Traders pare bets on earlier 2015 Fed
    rate hike
    5    Reuters 90 180407        FOREX-Dollar slides broadly, bullish data
    helps euro
    6    Reuters 90 311113 Fitch Publishes Sector Credit Factors for Japanese
    Insurers
    URL
    1
    http://in.reuters.com/article/2014/03/24/markets-precious-idINL4N0ML03U2014
    0324
    2
    http://www.reuters.com/article/2014/06/19/markets-britain-stocks-idUSL6N0P0
    1DM20140619
    3
    http://in.reuters.com/article/2014/07/08/usa-fed-idINW1N0OF00M20140708
    4
    http://www.reuters.com/article/2014/08/01/us-usa-fed-futures-idUSKBN0G144U2
    0140801
    5
    http://in.reuters.com/article/2014/05/06/markets-forex-idINL6N0NS25P2014050
    6
    6
    http://in.reuters.com/article/2014/06/24/fitch-publishes-sector-credit-fact
    ors-fo-idINFit69752320140624
      CATEGORY                          STORY        HOSTNAME    TIMESTAMP
    1         b df099bV_5_nKjKMqxhiVh1yCmHe3M in.reuters.com 1.395753e+12
    2         b dShvKWlyRq_Z3pM1C1lhuwYEY5MvM www.reuters.com 1.403197e+12
    3         b dNJB5f4GzH0jTlMeEyWcKVpMod5UM in.reuters.com 1.404897e+12
    4         b dunL-T5pNDVbTpMZnZ-3oAUKlKybM www.reuters.com 1.406926e+12
    5         b d8DabtTlhPalvyMKxQ7tSGkTnN_9M in.reuters.com 1.399369e+12
    6         b d3tIMfB2mg-9MZM4G_jGTEiRVl3jM in.reuters.com 1.403634e+12
    > dim(data.subset)
    [1] 2638     9
    >
                                            [ 76 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
We join our top 100 publishers data frame publisher.top with data, get all the details for
our top 100 publishers. Our data.subset has a total of 2,638 articles.
Having looked at our data, let's now move on to design our recommendation engine.
Let's quickly recap how a content-based recommendation engine works. When a user is
browsing a product or item, we need to provide recommendations to the user in the form of
other products or items from our catalog. We can use the properties of the items to come up
with the recommendations. Let's translate this to our use case.
So when a user is browsing a particular news article, we need to give him other news
articles as recommendations, based on:
We are going to introduce another feature. It is a calculated feature from the text field:
          Polarity of the document. Subjective text tends to have an opinion about a topic.
          The opinion can be positive, negative, or neutral. A polarity score is a real
          number which captures these opinions. Polarity identification algorithms use text
          mining to get the document opinion. We are going to use one such algorithm to
          get the polarity of our texts.
                                             [ 77 ]
Fuzzy Logic Induced Content-Based Recommendation                                 Chapter 2
For the wine example in the last section, we used the pearson coefficient as a similarity
measure. Unlike the wine example, we need multiple similarity measures for this use case:
We will explain these distance measures as we program them in R. Hopefully this gave you
a good overview of our problem statement. Let's move on to the design of our content-
based recommendation engine.
We are going to design our content-based recommendation engine in three steps, as shown
in the following diagram:
                                           [ 78 ]
Fuzzy Logic Induced Content-Based Recommendation                                        Chapter 2
In step 1, we will create a similarity index. Think of this index as a matrix, with rows and
columns as the articles and the cell value storing the similarity between the articles. The
diagonal values of this matrix will be one. The similarity score is a value between zero and
one. A cell value of 1.0 indicates that the two articles are an exact replica of each other.
Look at the first row; article 1 is closer to article N when compared to article 2. In our use
case, the cell value will be the cosine similarity between two documents.
In step 2, we have a simple search engine. Given an article, this engine will first retrieve the
top N articles, in our case the top 30, which are close to the given article, based on the
similarity matrix developed in the previous step. Say we are searching for article number 2,
then row 2 of this matrix will be accessed. After sorting the content of the row, the top 30
will be given as the match. For those 30 articles, we further calculate more features the
polarity of the articles. After that, we calculate Manhattan distance between the polarity
value of the given article and all the other articles in our search results. We find the
Jaccard's distance between the article we are searching for and all the other articles in the
search list based on the publisher and category.
In step 3, we implement a fuzzy ranking engine. Using the similarity score from step 1,
Jaccard's score, and the polarity scores, we use a fuzzy engine to rank the top 30 matches.
The results are presented to the user in this ranked order.
                                                   [ 79 ]
Fuzzy Logic Induced Content-Based Recommendation                                     Chapter 2
Bag-of-words
In bag-of-words representation, every document is represented as a collection of words
present in the document, hence the name bag-of-words. The order in which the words occur
are not considered in bag of words approach. A good way to organize these bag-of-words is
using a matrix representation. Let's say we have 100 documents (also called a corpus). In
order to build such a matrix, we first make a list of all the unique words present in those 100
documents. This is called the vocabulary of our text corpus. Say we have 5,000 unique
words. Our matrix is now a 100 x 5,000 dimension, where the rows are the document and in
the column, we have the words from our vocabulary. Such a matrix is called a document
term matrix.
                                             [ 80 ]
Fuzzy Logic Induced Content-Based Recommendation                                    Chapter 2
We have represented each sentence as a vector here. We can now leverage the vector space
model to find the similarity between these two sentences. Now let's look at the cell values of
this matrix.
The example matrix is a binary matrix. A cell value of 1 indicates the presence of a word in
a document and 0 indicates absence. We can have other weighting schemes. Let's look at
some weighting schemes.
Term frequency
Term frequency, or tf(w, d), is the number of times a word w appears in the document d.
If we look at our example:
   tf(dog, sentence 1) = 1
Document frequency
Document frequency or df(w) is the number of documents in the corpus in which this
word occurs. In our example, df(dog) = 2, dog appears in both the sentences.
Rare words tend to have high IDFs. If we look at the equation above and ignore the log
function, IDF is the total number of documents divided by the number of documents that
the word is in. In effect, it is measuring how good the word is at identifying this document.
If the word is only used in a small number of documents, then the denominator will be low,
therefore the overall value will be high. The log is just used to normalize the result.
                                            [ 81 ]
Fuzzy Logic Induced Content-Based Recommendation                                    Chapter 2
TFIDF
This is the product of a term frequency and inverse document frequency:
title.df stores the title and the article ID. others.df stores the article ID, publisher, and
category.
We create a data frame reader using readTabular. Next, we use the Corpus function to
create our text corpus. To Corpus, we pass our title.df data frame, and
the title.reader data frame reader.
Calling getTransformation shows us the list of available functions that can be used to
transform the text:
   corpus <- tm_map(corpus, removePunctuation)
   corpus <- tm_map(corpus, removeNumbers)
   corpus <- tm_map(corpus, stripWhitespace)
                                            [ 82 ]
Fuzzy Logic Induced Content-Based Recommendation                                     Chapter 2
We remove punctuation, numbers, unnecessary white spaces, and stop words from our
articles. Finally, we convert our text to lowercase.
Punctuation, numbers, and whitespace may not be a good feature to distinguish one article
from another. Hence, we remove them.
These words will be present again in most of the English text, no matter what the content is.
They cannot act as a good feature to distinguish our articles. Hence, we remove such words.
We want our algorithms to treat the words "dog" and "Dog" the same way. Hence, we
bring them to lowercase.
Furthermore, we could apply stemming to our words. Stemming reduces the word to its
root form.
                                           [ 83 ]
Fuzzy Logic Induced Content-Based Recommendation                                     Chapter 2
    > inspect(dtm[1:5,10:15])
    <<DocumentTermMatrix (documents: 5, terms: 6)>>
    Non-/sparse entries: 0/30
    Sparsity            : 100%
    Maximal term length: 9
    Sample              :
             Terms
    Docs      abbey abbvie abc abcs abdul abenomics
      180407       0      0  0    0     0         0
      306465       0      0  0    0     0         0
      371436       0      0  0    0     0         0
      38081        0      0  0    0     0         0
      410152       0      0  0    0     0         0
    >
We use the DocumentTermMatrix function to create our matrix. We pass our text corpus
and also pass a list to the parameter control. Inside the list, we say that we are interested
only in words with length, so the number of characters between 3 and 10. For our cell
values in our matrix, we want them to be TFIDF.
We can inspect the created document term matrix by calling the inspect function.
Having created a document term matrix, let's create the cosine distance between the articles:
    sim.score <- tcrossprod_simple_triplet_matrix(dtm)/(sqrt( row_sums(dtm^2)
    %*% t(row_sums(dtm^2)) ))
In the preceding code, we take a document term matrix and return a document matrix. We
will be calling this a similarity matrix going forward. The cell values of this matrix
correspond to the cosine similarity between two documents. The previous line of code
implements this cosine similarity.
                                            [ 84 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
The cosine similarity between two vectors A and B of length n is given as:
Look at the numerator of this equation. It's the inner product of both the vectors. In a vector
space model, the inner product of two vectors gives the similarity between them. We divide
this inner product by the l2 norm of the those individual vector. This makes the score
bounded between -1 and 1. The similarity is hence a number between -1 and 1, where -1
indicates that two documents are completely different from each other. If both the vectors
have non-negative numbers, then the similarity score is bounded between 0 and 1.
Now, in our document term matrix, we have a row for each article. The columns are again
the article IDs.
                                            [ 85 ]
Fuzzy Logic Induced Content-Based Recommendation                                       Chapter 2
   0.0000000 0.0000000
     180407 0.0000000           0       0          0     1         0     0        0
   0.0000000 0.0000000
     311113 0.0000000           0       0          0     0         1     0        0
   0.0000000 0.0000000
     263442 0.0000000           0       0          0     0         0     1        0
   0.0000000 0.0000000
     171310 0.0000000           0       0          0     0         0     0        1
   0.0000000 0.0000000
     116144 0.1428571           0       0          0     0         0     0        0
   1.0000000 0.0000000
     70584 0.1543033            0       0          0     0         0     0        0
   0.0000000 1.0000000
Let's look at row 1, document number 38081. Looks like it's not close, with a lot of
documents displayed there, except for 116144.
Searching
Having created our similarity matrix, we can leverage that matrix to find a match for any
given document. Let's see how to leverage this matrix to perform the search function in this
section.
                                            [ 86 ]
Fuzzy Logic Induced Content-Based Recommendation                                  Chapter 2
We will be using the sim.score created in the previous step to perform the search.
We go to our match.doc similarity matrix and pick up row 38081. Now, this row has all
the other articles and their similarity scores.
Our match.df data frame now contains all the matching documents for 38081 and their
cosine scores. No wonder the first row is 38081; it has to match itself perfectly.
                                             [ 87 ]
Fuzzy Logic Induced Content-Based Recommendation                                  Chapter 2
But as we said before, we are going to recommend only the top 30 matches:
   > match.refined<-head(match.df[order(-match.df$cosine),],30)
   > head(match.refined)
              ID    cosine
   38081   38081 1.0000000
   38069   38069 0.3779645
   231136 231136 0.2672612
   334088 334088 0.2672612
   276011 276011 0.2519763
   394401 394401 0.2390457
   >
So let's order our match.df data frame in descending order of cosine similarity and extract
the top 30 matches using the head function.
Now that we have the matching documents, we need to present them in a ranked order. In
order to rank the results, we are going to calculate some additional measures and use fuzzy
logic to get the final ranking score.
Before we go ahead and calculate the additional measures, let's merge title.df and
other.df with match.refined:
> head(match.refined)
         ID    cosine
   TITLE
   1 38081 1.0000000                              PRECIOUS-Gold ticks lower,
   US dollar holds near peak
   2 38069 0.3779645 PRECIOUS-Bullion drops nearly 1 pct on dollar, palladium
   holds near 2-1/2-yr high
   3 231136 0.2672612                       Dollar steady near 3-1/2 month
   lows vs. yen, Aussie weaker
   4 334088 0.2672612                          Canadian dollar falls amid
   lower than expected GDP data
   5 276011 0.2519763                      Gold holds near four-month low as
   ECB move on rates awaited
   6 394401 0.2390457                   Dollar Tree Will Buy Competitor
   Family Dollar For $8.5 Billion
             PUBLISHER CATEGORY
   1           Reuters         b
   2           Reuters         b
                                           [ 88 ]
Fuzzy Logic Induced Content-Based Recommendation                                         Chapter 2
    3            NASDAQ              b
    4          CTV News              b
    5 Business Standard              b
    6     The Inquisitr              b
We have all the information and the cosine similarity in one data frame now.
Polarity scores
We are going to leverage the sentimentr R package to learn the sentiments of the articles
we have collected.
Let's look at how to score using the sentiment function :
    > sentiment.score <- sentiment(match.refined$TITLE)
    > head(sentiment.score)
       element_id sentence_id word_count   sentiment
    1:          1           1          8 0.00000000
    2:          2           1         11 0.00000000
    3:          3           1          9 -0.13333333
    4:          4           1          9 -0.08333333
    5:          5           1         11 0.07537784
    6:          6           1          9 0.00000000
    >
The sentiment function in sentimentr calculates a score between -1 and 1 for each of the
articles. In fact, if a text has multiple sentences, it will calculate the score for each sentence.
A score of -1 indicates that the sentence has a very negative polarity. A score of 1 means that
the sentence is very positive. A score of 0 refers to the neutral nature of the sentence.
However, we need the score at an article level and not at a sentence level, so we can take an
average value of the score across all the sentences in a text.
Calculate the average value of the sentiment scores for each article:
    > sentiment.score <- sentiment.score %>% group_by(element_id) %>%
    +   summarise(sentiment = mean(sentiment))
    > head(sentiment.score)
    # A tibble: 6 x 2
      element_id   sentiment
           <int>       <dbl>
    1          1 0.00000000
    2          2 0.00000000
    3          3 -0.13333333
    4          4 -0.08333333
    5          5 0.07537784
    6          6 0.00000000
                                              [ 89 ]
Fuzzy Logic Induced Content-Based Recommendation                                    Chapter 2
Here, the element_id refers to the individual article. By grouping element_id and
calculating the average, we can get the sentiment score at an article level. We now have the
scores for each article.
Let's update the match.refined data frame with the polarity scores:
   > match.refined$polarity <- sentiment.score$sentiment
   > head(match.refined)
         ID    cosine
   TITLE
   1 38081 1.0000000                                PRECIOUS-Gold ticks lower,
   US dollar holds near peak
   2 38069 0.3779645 PRECIOUS-Bullion drops nearly 1 pct on dollar, palladium
   holds near 2-1/2-yr high
   3 231136 0.2672612                         Dollar steady near 3-1/2 month
   lows vs. yen, Aussie weaker
   4 334088 0.2672612                            Canadian dollar falls amid
   lower than expected GDP data
   5 276011 0.2519763                        Gold holds near four-month low as
   ECB move on rates awaited
   6 394401 0.2390457                     Dollar Tree Will Buy Competitor
   Family Dollar For $8.5 Billion
             PUBLISHER CATEGORY     polarity
   1           Reuters         b 0.00000000
   2           Reuters         b 0.00000000
   3            NASDAQ         b -0.13333333
   4          CTV News         b -0.08333333
   5 Business Standard         b 0.07537784
   6     The Inquisitr         b 0.00000000
Before we move on, let's spend some time understanding the inner workings of our
dictionary-based sentiment method. The sentiment function utilizes a sentiment lexicon
(Jockers, 2017) from the lexicon package. It preprocesses the given text as follows:
                                           [ 90 ]
Fuzzy Logic Induced Content-Based Recommendation                                    Chapter 2
Each word is looked up in the lexicon; positive and negative words are tagged with +1 and
-1 respectively. Let's call the words which have received a score the polarized words. Not
all words receive a score. Only those found in the lexicons receive a score. We can pass a
customer lexicon through the polarity_dt parameter to the sentiment function. For each
of the polarized words, n words before them and n words after them are considered, and
together they are called polarized context clusters. The parameter n can be set by the user.
The words in the polarized context cluster can be tagged as either of the following:
          neutral
          negator
          amplifier
          de-amplifier
          adversative conjunctions
Jaccard's distance
While ranking the matched articles, we want to also include the category and publisher
columns.
We need the publisher, category, and the sentiment details of the document we are
searching for. Fortunately, the first row of our match.refined data frame stores all the
details related to 38081. We retrieve those values from there.
                                           [ 91 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
For the rest of the articles, we need to find out if they match the publisher and category of
document 38081:
    match.refined$is.publisher <- match.refined$PUBLISHER == target.publisher
    match.refined$is.publisher <- as.numeric(match.refined$is.publisher)
Now we can go into match.refined and create a new column called is.publisher, a
Boolean column to say if the article's publisher is same as the publisher for the one we are
searching for.
Repeat the same for the category. We have created a new column called is.category to
store the category match.
With the two new columns, we can calculate the Jaccard's distance between document
38081 and all the other documents in the match.refined data frame, as shown in the
following code block:
    match.refined$jaccard <- (match.refined$is.publisher +
    match.refined$is.category)/2
Jaccards distance/index
The Jaccard index measures the similarity between two sets, and is a ratio of the size of the
intersection and the size of the union of the participating sets. Here we have only have two
elements, one for publisher and one for category, so our union is 2. The numerator, by
adding the two Boolean variable, we get the intersection.
Finally, we also calculate the absolute difference (Manhattan distance) in the polarity values
between the articles in the search results and our search article. We do a min/max
normalization of the difference score as follows:
    match.refined$polaritydiff <- abs(target.polarity -
    match.refined$polarity$sentiment)
                                            [ 92 ]
Fuzzy Logic Induced Content-Based Recommendation                       Chapter 2
                                           [ 93 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
We remove some of the unwanted fields from the match.refined data frame. Finally, we
have the ID, cosine distance, title, publisher, category, Jaccard score, and the polarity
difference.
                                             [ 94 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
We need to perform our ranking based on the following metrics we have calculated:
          Cosine similarity
          Jaccard index
          Polarity difference
A standard way of doing this would be to come up with another score, which is a linear
combination of the preceding three scores. We use that final score to sort the results.
However, we will leverage fuzzy logic programming to do the search result ranking.
Fuzzy logic
A detailed discussion on fuzzy logic is beyond the scope of this book. A short tutorial on
fuzzy logic, http://cs.bilkent.edu.tr/~zeynep/files/short_fuzzy_logic_tutorial.
pdf
We will give a short introduction here. Let us use an example from Artificial Intelligence: A
Guide to Intelligent Systems book by Michael Negnevistky. Thanks to the article from
https://vpetro.io/
 Fuzzification
Fuzzification is the process where we convert our input and output to linguistic variables
using ranges and membership functions.In this example, we want to convert our input,
project funding and staffing, and output risk into linguistic variables:
          Project funding: Inadequate, marginal and adequate. These are the three
          linguistic variables we have for project funding. We will use a fuzzy cone
          membership function.
          Project staffing: Small and large, a fuzzy cone membership function
          Risk: low, normal and high.
                                            [ 95 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
          Linguistic Variable: Variables whose values are words in a natural language. Let
          us take the example of Project Funding, the actual values of project funding in
          this example are percentages. So if we say the project funding is 50% then, 50% of
          the funding is still available. This 50% is now transalated to the linguistic
          variable, adequate, marginal or inadequate. The fuzzy system will use only these
          linguistc variables and not the original value.
          Crisp Values: The real values an input or output can take is a crisp value. For
          example, 50% is the crisp value Project funding can take.
          Membership Function: Its a function which defines how the crisp values are
          mapped to a membership value (degree of membership). These functions can be
          either linear or curved. For example,let us again take project funding,
Membership function will translate this value into degree of membership for each of the
linguistic variables. We will not show here the detail of how the membership function
converts our crisp input values to linguistic variables, but will give a simple example to
follow.
Let us say we are trying to evaluate the risk for the following crisp values,
When we pass these values to the respective membership function we get the following
results, Inadequate = 0.5, adequate = 0.0 and marginal = 0.2; similarly for project staffing we
get small = 0.1 and large =0.7
The crisp input value for project funding 35%, cannot be represented by the linguistic
variable adequate, hence membership value for adequate is 0.0. It does not belong
completely to marginal linguistic variable, and hence we have 0.2 as the value.
Similarly 70% of project staffing has a high degree of membership with linguistic variable
large, hence the value 0.7.
As you saw, fuzzification converts the crisp input values to fuzzy membership values to
linguistic variables.
                                             [ 96 ]
Fuzzy Logic Induced Content-Based Recommendation                                       Chapter 2
Rules define the interaction between our input variables to produce the output. Here we
have three simple rules for our example:
This rule has an OR clause, its evaluated using UNION operator. Max operator in set is used
to evaluate this rule. There are several other set operators such as ASUM or BSUM which
can be used to evaluate the UNION (OR) operator.
Let us borrow the fuzzification results from the previous section. The fuzzification of project
funding = adequate for crisp value 35% is 0.0. The fuzzification of project staffing = small for
crisp value 60% is 0.1
Let us use the Max operator Maximum among them is taken as the results, hence the this
rule evaluates to, Project Risk = low = 0.1
The next rule, If project funding is marginal and project staffing is large then risk is normal,
has an AND and hence we use an intersection operator. Intersection operator can be
handled using min, prod or bdif set operator.
The fuzzification of project funding = marginal for crisp value 35% is 0.2. The fuzzification
of project staffing =large for crisp value 35% is 0.7. Let us use the min operator here. The
minimum is 0.2, the rule evaluates to Project risk = normal = 0.2
Finally, the last rule, If project funding is inadequate then risk his high. We can directly say
Project Risk = high = 0.5, as this is the fuzzification value for project funding = inadequate
for crisp value of 35%.
                                              [ 97 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
Now we need to combine these results. This process of evaluating the rules and combining
them is called as inference.
We have, for project risk, low = 0.1, normal = 0.2, and high = 0.5. We will use these values to
scale the membership function. The final decision plot is a summation of all these scaled
values.
Defuzzification
In fuzzification, we converted the crisp values of our input to fuzzy degree of membership
values for our linguistic variable. In defuzzification, we need to convert the degree of
membership values for our output variable to its crisp value.
 Let us superimpose the membership function on top of the above decision plot. We have
assumed a triangular membership function here. You can read more about triangular
membership function and their importance in the paper why triangular memebership
functions? http://www.sciencedirect.com/science/article/pii/0165011494900035
                                             [ 98 ]
Fuzzy Logic Induced Content-Based Recommendation                                                Chapter 2
The most popular defuzzification method is the centroid method. It calculates the center of
gravity for the area under the curve. The area under the green line represented in the graph.
It is calculated as follows:
( (0 + 10 + 20) * 0.1 + (30 + 40 + 50 + 60) * 0.2 + (70 + 80 + 90 + 100) * 0.5) / (0.1* 3+ 0.2 * 4 + 0.5
*4).
                                                  [ 99 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
Hence, we can conclude that the risk associated with the project is 67.4%.
In our example, we want to convert our inputs are cosine, Jaccard and polarity score and
our output is the ranking. We first convert them to linguistic variables. For example, let us
take our cosine score:
Then we define a membership function. The membership function is responsible for the
fuzzy membership. Let us say we define a cone as our membership function with a radius
of 0.2. There are no real mandates to choose the membership function. Refer to the
membership plot appearing a little later in this chapter for fuzzy memberships. This is
responsible for the fuzzification process. Let us say we pass our cosine score into this
membership function it changes it to one of the linguistic values.
We follow the same process for polarity score and Jaccard's distance. Finally, we also define
a final linguistic variable called ranking. This is our final ranking score or our output. Its
defined in the same way as cosine, polarity, and Jaccards. It has a range and a membership
function.
Having done the fuzzification process for our similarity scores, let us proceed to define the
rules.
We can define more complex rules involving multiple linguistic variables. Refer to the code
for different rules we have defined for this scenario. After all the rules are evaluated we can
then proceed to inference process.
The last step is the defuzzification of the above result to a crisp value. We use the centroid
method for defuzzification.
                                              [ 100 ]
Fuzzy Logic Induced Content-Based Recommendation                                     Chapter 2
We will be using the sets R package for our fuzzy logic programming:
    > library(sets, quietly = TRUE)
%>%
The first step is to set up our universe. We define the range of values and the granularity of
the values we will be dealing with in our universe. Our cosine, Jaccard, and polarity are all
normalized to have a value between zero and one. Hence, the range of our universe is set
between zero and one.
The first step in fuzzy logic programming is to define the linguistic variables we will be
dealing with:
    variables <-
      set(cosine =
            fuzzy_partition(varnames =
                              c(vlow = 0.2, low = 0.4,
                                 medium = 0.6, high = 0.8),
                            FUN = fuzzy_cone , radius = 0.2),
            jaccard =
            fuzzy_partition(varnames =
                              c(close = 1.0, halfway = 0.5,
                                far = 0.0),
                            FUN = fuzzy_cone , radius = 0.4),
          polarity =
            fuzzy_partition(varnames =
                              c(same = 0.0, similar = 0.3,close = 0.5,
                                away = 0.7),
                            FUN = fuzzy_cone , radius = 0.2),
          ranking =
            fuzzy_partition(varnames =
                              c(H = 1.0, MED = 0.7 , M = 0.5, L = 0.3),
                            FUN = fuzzy_cone , radius = 0.2
                            )
      )
                                            [ 101 ]
Fuzzy Logic Induced Content-Based Recommendation                                     Chapter 2
For each variable, we define the various linguistic values and the fuzzy membership
function. For example, for our linguistic variable cosine, the linguistic values include vlow,
low, medium, and high. We define the fuzzy_cone membership function.
For this application, we have used a cone membership function throughout. There are other
membership functions available.
Calling the R help for fuzzy_cone, the documentation will detail other available
membership functions:
> help("fuzzy_cone")
We define the jaccard and polarity linguistic variables in the same way as cosine.
The ranking lat linguistic variable defines the final score we need, which we can use to
sort our results.
Based on the interaction between the linguistic variables cosine, jaccard, and polarity,
the ranking linguistic variables are assigned different linguistic values. These interactions
are defined as rules. Having defined the linguistic variables, the linguistic values, and the
membership function, we proceed to write down our fuzzy rules:
    rules <-
      set(
        ######### Low Ranking Rules ###################
        fuzzy_rule(cosine %is% vlow,
                   ranking %is% L),
        fuzzy_rule(cosine %is% low || jaccard %is% far
                   || polarity %is% away,
                   ranking %is% L),
        fuzzy_rule(cosine %is% low || jaccard %is% halfway
                   || polarity %is% away,
                   ranking %is% L),
                                            [ 102 ]
Fuzzy Logic Induced Content-Based Recommendation                                       Chapter 2
The rules define how the linguistic variables should be combined to make a decision. We
have three groups of rules. Let's look at the first rule:
    fuzzy_rule(cosine %is% vlow, ranking %is% L),
If the cosine linguistic value is vlow, the ranking is given as L, the lowest ranking. Similarly,
each rule captures the interactions between the linguistic variables.
With the rules and linguistic variables defined, we can now put our complete fuzzy
system together:
    > ranking.system <- fuzzy_system(variables, rules)
    > print(ranking.system)
    A fuzzy system consisting of 4 variables and 14 rules.
variables:
rules:
                                            [ 103 ]
Fuzzy Logic Induced Content-Based Recommendation                       Chapter 2
   cosine %is% low && jaccard %is% close && polarity %is% similar => ranking
   %is% M
   cosine %is% medium && jaccard %is% close && polarity %is% same => ranking
   %is% MED
   cosine %is% medium && jaccard %is% close && polarity %is% similar =>
   ranking %is% MED
   cosine %is% medium && jaccard %is% halfway && polarity %is% same => ranking
   %is% MED
   cosine %is% medium && jaccard %is% halfway && polarity %is% similar =>
   ranking %is% MED
   cosine %is% low || jaccard %is% far || polarity %is% away => ranking %is% L
   cosine %is% low || jaccard %is% close || polarity %is% same => ranking %is%
   M
   cosine %is% low || jaccard %is% halfway || polarity %is% away => ranking
   %is% L
   cosine %is% low || jaccard %is% halfway || polarity %is% same => ranking
   %is% L
   cosine %is% low || jaccard %is% halfway || polarity %is% close => ranking
   %is% L
   cosine %is% low || jaccard %is% halfway || polarity %is% similar => ranking
   %is% L
   cosine %is% medium || jaccard %is% far || polarity %is% away => ranking
   %is% L
   cosine %is% high => ranking %is% H
   cosine %is% vlow => ranking %is% L
   > plot(ranking.system)
                                          [ 104 ]
Fuzzy Logic Induced Content-Based Recommendation                                      Chapter 2
The final plot reveals the fuzziness in the boundary for different linguistic variables.
Compare this with a hard-coded if-else logic system.
We can now proceed to use this system to do the ranking. Let's do the ranking on a single
example:
    fi <- fuzzy_inference(ranking.system, list(cosine = 0.5000000,                jaccard =
    0, polarity=0.00000000))
                                            [ 105 ]
Fuzzy Logic Induced Content-Based Recommendation                                   Chapter 2
For given values of cosine, polarity, and Jaccard, we get a ranking score of 0.4. Now we can
use this score to rank the results.
                                           [ 106 ]
Fuzzy Logic Induced Content-Based Recommendation                                Chapter 2
The get.ranks function is applied in each row of match.refined to get the fuzzy
ranking. Finally, we sort the results using this ranking.
This brings us to the end of our design and implementation of a simple fuzzy-induced
content-based recommendation system.
Complete R Code
The complete R code is shown as follows:
   library(data.table)
   wine.data <-
   fread('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.
   data')
   head(wine.data)
table(wine.data$V1)
                                           [ 107 ]
Fuzzy Logic Induced Content-Based Recommendation            Chapter 2
cor.matrix[1:5,1:5]
   rbind(wine.data[3,]
   ,wine.data[52,]
   ,wine.data[51,]
   ,wine.data[85,]
   ,wine.data[15,]
   )
   library(tidyverse)
   library(tidytext)
   library(tm)
   library(slam)
                                          [ 108 ]
Fuzzy Logic Induced Content-Based Recommendation                         Chapter 2
   head(data)
   data %>% group_by(PUBLISHER) %>% summarise()
   data %>% group_by(CATEGORY) %>% summarise()
getTransformations()
stopwords("english")
                                          [ 109 ]
Fuzzy Logic Induced Content-Based Recommendation                         Chapter 2
sim.score[1:10,1:10]
head(match.refined)
                                          [ 110 ]
Fuzzy Logic Induced Content-Based Recommendation                         Chapter 2
   # Calcuate Jaccards
   match.refined$jaccard <- (match.refined$is.publisher +
   match.refined$is.category)/2
   match.refined$polaritydiff <- abs(target.polarity - match.refined$polarity)
   head(match.refined)
   ## clean up
   match.refined$is.publisher = NULL
   match.refined$is.category = NULL
   match.refined$polarity = NULL
   match.refined$sentiment = NULL
head(match.refined)
   variables <-
     set(cosine =
           fuzzy_partition(varnames =
                             c(vlow = 0.2, low = 0.4,
                                medium = 0.6, high = 0.8),
                           FUN = fuzzy_cone , radius = 0.2),
           jaccard =
           fuzzy_partition(varnames =
                             c(close = 1.0, halfway = 0.5,
                               far = 0.0),
                           FUN = fuzzy_cone , radius = 0.4),
         polarity =
           fuzzy_partition(varnames =
                             c(same = 0.0, similar = 0.3,close = 0.5,
                               away = 0.7),
                           FUN = fuzzy_cone , radius = 0.2),
         ranking =
           fuzzy_partition(varnames =
                             c(H = 1.0, MED = 0.7 , M = 0.5, L = 0.3),
                           FUN = fuzzy_cone , radius = 0.2
                           )
                                          [ 111 ]
Fuzzy Logic Induced Content-Based Recommendation                   Chapter 2
   rules <-
     set(
       ######### Low Ranking Rules ###################
       fuzzy_rule(cosine %is% vlow,
                  ranking %is% L),
       fuzzy_rule(cosine %is% low || jaccard %is% far
                  || polarity %is% away,
                  ranking %is% L),
       fuzzy_rule(cosine %is% low || jaccard %is% halfway
                  || polarity %is% away,
                  ranking %is% L),
                                          [ 112 ]
Fuzzy Logic Induced Content-Based Recommendation                                   Chapter 2
plot(ranking.system)
Summary
We started the chapter by introducing content-based filtering. We discussed how content
based filtering methods can help with cold-start problems in recommendation systems. We
then explained the new aggregator use case. We explored the data provided by the
customer--various news articles from different publishers belonging to different categories.
Based on the data, we came up with a design for our content-based recommendation
system.
We implemented a similarity dictionary; given a news article, this dictionary would be able
to provide the top N matching articles. The similarity was calculated based on the words
present in the article. We leveraged the vector space model for text and ultimately used the
cosine distance to find the similarities between articles.
                                           [ 113 ]
Fuzzy Logic Induced Content-Based Recommendation                                  Chapter 2
We implemented a simple search based on the similarity dictionary to get a list of matching
news articles. We introduced additional features for the matching document, sentiment,
and polarity score.
In the next chapter, we will go into the details of various approaches to collaborative
filtering algorithm. Collaborative filtering is the most powerful way to build recommender
systems.
                                          [ 114 ]
                              Collaborative Filtering
                                                                                    3
Given a database of user ratings for products, where a set of users have rated a set of
products, collaborative filtering algorithms can give ratings for products yet to be rated by a
particular user. This leverages the neighborhood information of the user to provide such
recommendations. The input to collaborative filtering is a matrix, where the rows are users
and the columns are items. Cell values are the ratings provided by the user for a product.
Ratings of products are ubiquitous in today's internet world. IMDB, Yelp, Amazon, and
similar systems today have a rating system deployed to capture user preferences.
Preferences are typically captured by a rating system, where the ratings are defined as stars
or a points system.
          An online system, where the whole user product preference matrix is loaded into
          the memory to make recommendations for a user and his yet to be rated product
          combination
          A model-based system, where an approach similar to clustering is used to group
          users with similar preferences and later use those cluster centroids to make
          recommendations
Collaborative Filtering                                                                Chapter 3
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As
we work through our example, we will introduce the R recommenderlab packages that we
will be using. During our code description, we will be using some of the output printed in
the console. We have included what will be printed in the console immediately after the
statement that prints the information to the console, so as to not disturb the flow of the
code.
Collaborative filtering
Given a user rating matrix, where several users have rated several products, the goal of
collaborative filtering is as follows:
The underlying premise of the collaborative filtering algorithm is that if two users agree on
ratings for a large set of items, they may tend to agree for other items too. Let us use a small
R code snippet to explain this concept. Assume we have seven products (A, B, C, D, E, F, G)
and two users (user.a and user.b). We also have the ratings provided by both of the
users for some of the products. The ratings are range of numbers from 1 to 5, with 1
indicating a poor rating, 5 indicating a good rating, and 0 indicating no rating.
                                            [ 116 ]
Collaborative Filtering                                                             Chapter 3
       user.a user.b
   A        3      3
   B        0      5
   C        2      3
   D        5      5
   E        5      4
   F        0      2
Both of the users agree on ratings for products A, C, D, and E. User user.a, has not seen and
therefore not rated, products B and F. However, we see that user.b has rated those
products. Since both the users agree on most of the products, we can use user.b ratings for
product B, and F, and fill in these missing ratings for user.a
This is the fundamental working premise of collaborative filtering. The example is based on
a user-based collaborative filtering approach. There are other approaches to collaborative
filtering which are shown in the following diagram, but we hope this example serves to
introduce the concept.
Having explained the hypothesis behind the algorithm, let us look at the approaches to
collaborative filtering:
                                           [ 117 ]
Collaborative Filtering                                                                   Chapter 3
Memory-based approach
The memory-based approach to collaborative filtering loads the whole rating matrix into
memory to provide recommendations, hence the name memory-based model. User-based
filtering is the most prominent memory-based collaborative filtering model. The R snippet
explained in the preceding section is the underlying principle by which memory-based
methods work. As the user and product base grows, scalability is a big issue with memory-
based models.
In the R snippet example, we used the ratings of user.b to fill in the missing ratings for
user.a. In the real world, with thousands of users, the user-based filtering algorithm first
proceeds as follows.
Let us say we have a set of users {u1,u2.....un} and a set of products {p1,p2,...pm}, and we are
trying to predict the ratings for a single user, ua.
Using a similarity measure such as Pearson coefficient, or cosine distance, we try to find
K neighbors for ua. For more on similarity measures, refer to http://reference.wolfram.
com/language/guide/DistanceAndSimilarityMeasures.html:
               When the two vectors are centered (zero mean) their cosine similarity is
               same as the Pearson coefficient.
                                              [ 118 ]
Collaborative Filtering                                                                Chapter 3
Let us say ua does not have the ratings for p5, p7, and p9. We take the average of ratings for
these products from the K neighbors of ua and use that as the ratings for ua.
Model-based approach
The model based approach addresses the scalability problem seen in memory-based
approaches. Compared to the user-based approach, where the recommendations came from
leveraging a user's neighbors preference, the model-based approach leverages product
similarity to make recommendations. The premise is that users will prefer those products
similar to the ones they have already rated.
The first step in this approach is to calculate the product similarity matrix. Let us say there
are a set of products: {p1,p2,...pm}. An m x m matrix is constructed to begin with. Once again
Pearson coefficient or cosine similarity is used as a similarity measure. For efficiency
purposes, instead of building a whole m x m matrix, a smaller m x k matrix is built, where k
<< m, k most similar items.
We have three users, user.a, user.b, user.c, and three products: A, B, C. We also have
the ratings the users have provided for these products. The ratings are in a scale from 1 to 5.
You should see the ratings.matrix, the final rating matrix summarizing the user product
ratings.
                                            [ 119 ]
Collaborative Filtering                                                              Chapter 3
We have used the cor function from the base package to calculate the similarity matrix. Let
us say we want to find the ratings for product B for user.a.
The most similar product to product B is A, as it has a higher score compared to C. In a real-
world example, we will have multiple similar products. Now the score for B for user.a is
calculated as follows:
The predicted rating for product B for user.a is calculated using similarity measures
between (B, A) and (B, C) weighted by the respective ratings given by user.a for A and C
and scaling this weighted sum with the sum of similarity measures.
The similarity matrix of products are pre-computed and therefore it does not have the
overhead of loading the whole user ratings matrix in memory. Amazon.com is successfully
using item-based recommendations.
                                             [ 120 ]
Collaborative Filtering                                                               Chapter 3
Let us say our original matrix is m x n, where we have m users and n products. Applying
SVD on this matrix, we get:
Applying SVD matrices with missing values is not possible. Our ratings matrix will have a
lot of missing values. Simon Funk proposed a new method, now called funk SVD, to do
SVD on large matrices using numerical optimization methods and gradient descent. Funk
SVD is implemented in the recommenderlab package. The dimension f in m x f and n x f
represents the latent factors, hence the name of this method is the latent factors
method. These latent factors represents some of the hidden features. For example in a
movie database, where our items corresponds to movie, one or a couple of latent
factors may point to the box office success of the movie.
Each user has a vector now from the m x f matrix. Similarly each product has a vector now
from the n x f matrix. For a given user and item, multiplying their respective vectors gives
the rating for that item specific to that user. Can you guess why? Here is where the
features/factors comes to play. The features/factors are shared by both the matrices. The
ratings in the original ratings matrix is now decomposed into a set of features. Thus when
we multiply the user vector, 1 x f, with the product vector 1 x f, we get a scalar value, which
is the rating for that product by that user.
Using this we can now predict the recommendation for unknown products for users.
                                            [ 121 ]
Collaborative Filtering                                                            Chapter 3
We use the SVD function from the base R package to decompose the ratings.mat.
ratings.mat is the same ratings matrix we used to explain item-based similarity.
user.a.vector is the vector representing user.a, and product.B.vector is the vector
representing the product B. ratings.user.a.prod.B is the product of these two
vectors representing the ratings of product B by user.a.
Now that we have explained all the three different approaches to build a collaborative
filtering method, let us proceed to look at the recommenderlab package.
Recommenderlab package
It will be very useful for the subsequent sections to get an overview of
the recommenderlab package. Let us quickly look at the S4 objects inside the package and
see how we can use them to build collaborative filtering projects.
The ratingMatrix is an abstract object, used to store and manipulate the ratings matrix.
The term abstract is from object-oriented programming. ratingMatrix defines the
interfaces to develop a user ratingsMatrix, but does not implement them. There are two
concrete implementations of this object one for the real-valued matrix and the other one for
the binary matrix. The Recommender class is used to store the recommendation models. It
takes as an input a ratingsMatrix object and other parameters and builds the required
recommender model.
                                           [ 122 ]
Collaborative Filtering                                                                    Chapter 3
The predict function can produce recommendations for unseen/unknown data (where we
don't know the recommendation) using the Recommender model.
In the preceding code, we create a random binary matrix and coerce it to a binaryRatingMatrix object.
Similarly, we can also create a realRatingMatrix.
Having built our ratingsMatrix, let us go ahead and create a simple Recommender model
based on this matrix.
                                              [ 123 ]
Collaborative Filtering                                                          Chapter 3
     .. .. .. ..@ n         : int 5
     ..@ predict :function (model, newdata, n = 10, data = NULL, type =
   c("topNList"), ...)
   >
We pass our ratings matrix to the Recommender class and select an approach called
popular. This creates our recommendation model.
Popular approach
This is a very simple strategy where the recommendations are based on the item popularity.
Among the products not rated by a user, the most popular ones among all the users are
suggested as recommendations.
There are several other implementations of the recommendation model available in this
package.
   $AR_binaryRatingMatrix
   Recommender method: AR for binaryRatingMatrix
   Description: Recommender based on association rules.
   Reference: NA
   Parameters:
     support confidence maxlen sort_measure sort_decreasing apriori_control
   verbose
   1     0.1        0.8      3 "confidence"             TRUE         list()
   FALSE
   $IBCF_binaryRatingMatrix
   Recommender method: IBCF for binaryRatingMatrix
   Description: Recommender based on item-based collaborative filtering
   (binary rating data).
   Reference: NA
                                          [ 124 ]
Collaborative Filtering                                                         Chapter 3
   Parameters:
      k    method normalize_sim_matrix alpha
   1 30 "Jaccard"                FALSE   0.5
   $POPULAR_binaryRatingMatrix
   Recommender method: POPULAR for binaryRatingMatrix
   Description: Recommender based on item popularity.
   Reference: NA
   Parameters: None
   $RANDOM_binaryRatingMatrix
   Recommender method: RANDOM for binaryRatingMatrix
   Description: Produce random recommendations (binary ratings).
   Reference: NA
   Parameters: None
   $UBCF_binaryRatingMatrix
   Recommender method: UBCF for binaryRatingMatrix
   Description: Recommender based on user-based collaborative filtering.
   Reference: NA
   Parameters:
        method nn weighted sample
   1 "jaccard" 25     TRUE FALSE
Having created the Recommender model, let us see how to leverage it to produce the top N
recommendations.
   $`2`
   [1] 1
   $`3`
   [1] 2 5
   $`4`
   [1] 4 5
   > recomms@ratings
   NULL
                                         [ 125 ]
Collaborative Filtering                                                                 Chapter 3
We pass our Recommender model to the predict function. We are testing it on our input
data, and therefore we have the newdata parameter set to our training data. Finally the
parameter n decides how many recommendations we need. The recommendations are
printed in the next line by calling recomm@items. Since our input is a binary matrix, we
don't have any ratings.
Similarly, we can create a realMatrix either from an R matrix or a data frame and create a
Recommender model and proceed with the recommendations.
This concludes our introduction to the recommenderlab package. Let us use this
knowledge to solve the problem in hand for this chapter. For more information about
recommenderlab visit https://cran.r-project.org/web/packages/recommenderlab/
index.html
    > str(Jester5k)
    Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
      ..@ data      :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
      .. .. ..@ i        : int [1:362106] 0 1 2 3 4 5 6 7 8 9 ...
      .. .. ..@ p        : int [1:101] 0 3314 6962 10300 13442 18440 22513 27512
    32512 35685 ...
      .. .. ..@ Dim      : int [1:2] 5000 100
      .. .. ..@ Dimnames:List of 2
      .. .. .. ..$ : chr [1:5000] "u2841" "u15547" "u15221" "u15573" ...
      .. .. .. ..$ : chr [1:100] "j1" "j2" "j3" "j4" ...
      .. .. ..@ x        : num [1:362106] 7.91 -3.2 -1.7 -7.38 0.1 0.83 2.91
    -2.77 -3.35 -1.99 ...
      .. .. ..@ factors : list()
      ..@ normalize: NULL
                                             [ 126 ]
Collaborative Filtering                                                                  Chapter 3
   > head(Jester5k@data[1:5,1:5])
   5 x 5 sparse Matrix of class "dgCMatrix"
             j1    j2    j3    j4    j5
   u2841   7.91 9.17 5.34 8.16 -8.74
   u15547 -3.20 -3.50 -9.56 -8.74 -6.36
   u15221 -1.70 1.21 1.55 2.77 5.58
   u15573 -7.38 -8.93 -3.88 -7.23 -4.90
   u21505 0.10 4.17 4.90 1.55 5.53
The data is stored as realRatingMatrix. We can examine the slots of this matrix by
calling the str() function. Looking at the data slot of realRatingMatrix, we see that it's
a matrix with 5,000 rows and 100 columns. The rows are the number of users and the
columns represent the number of jokes. If all users have rated all jokes, we should have a
total of 500,000 ratings. However, we see that we have 362,106 ratings.
As the name of the dataset suggests, Jester5k has 5 k users and their ratings for 100 jokes.
Further, let's look at the first five rows and columns of realRatingMatrix to have a quick
look at the ratings.
Let us look at the ratings provided by two random users, 1 and 100:
   > Jester5k@data[1,]
       j1     j2     j3     j4     j5     j6        j7      j8      j9    j10    j11     j12
   j13    j14    j15    j16    j17    j18
    7.91 9.17 5.34 8.16 -8.74 7.14                8.88 -8.25     5.87    6.21   7.72    6.12
   -0.73 7.77 -5.83 -8.88 8.98 -9.32
     j19    j20    j21    j22    j23    j24        j25     j26    j27     j28    j29     j30
   j31    j32    j33    j34    j35    j36
   -9.08 -9.13 7.77 8.59 5.29 8.25                6.02    5.24   7.82    7.96 -8.88     8.25
   3.64 -0.73 8.25 5.34 -7.77 -9.76
     j37    j38    j39    j40    j41    j42        j43     j44    j45     j46    j47     j48
   j49    j50    j51    j52    j53    j54
    7.04 5.78 8.06 7.23 8.45 9.08                 6.75    5.87   8.45 -9.42     5.15    8.74
   6.41 8.64 8.45 9.13 -8.79 6.17
     j55    j56    j57    j58    j59    j60        j61     j62    j63     j64    j65     j66
   j67    j68    j69    j70    j71    j72
    8.25 6.89 5.73 5.73 8.20 6.46                 8.64    3.59   7.28    8.25   4.81 -8.20
   5.73 7.04 4.56 8.79 0.00 0.00
     j73    j74    j75    j76    j77    j78        j79     j80    j81     j82    j83     j84
   j85    j86    j87    j88    j89    j90
    0.00 0.00 0.00 0.00 -9.71 0.00                0.00    0.00   0.00    0.00   0.00    0.00
                                             [ 127 ]
Collaborative Filtering                                                                 Chapter 3
    > Jester5k@data[100,]
        j1     j2     j3     j4     j5     j6  j7  j8  j9  j10 j11 j12
    j13    j14    j15    j16    j17    j18
    -2.48 3.93 2.72 -2.67 1.75 3.35 0.73 -0.53 -0.58 3.88 3.16 1.17
    0.53 1.65 1.26 -4.08 -0.49 -3.79
       j19   j20    j21    j22    j23    j24  j25 j26 j27  j28 j29 j30
    j31    j32    j33    j34    j35    j36
    -3.06 -2.33 3.59 0.58 0.39 0.53 2.38 -0.05 2.43 -0.34 3.35 2.04
    2.33 3.54 -0.19 -0.24 2.62 3.83
       j37   j38    j39    j40    j41    j42  j43 j44 j45  j46 j47 j48
    j49    j50    j51    j52    j53    j54
    -2.52 5.19 1.75 0.00 0.39 1.75 -3.64 -2.28 2.33 3.16 -2.48 0.19
    2.82 4.22 -0.19 3.30 -0.53 3.45
       j55   j56    j57    j58    j59    j60  j61 j62 j63  j64 j65 j66
    j67    j68    j69    j70    j71    j72
    -0.53 0.97 -2.91 -8.25 -0.29 2.52 4.66 3.50 -0.24 3.64 -0.05 1.21
    -3.25 1.17 -2.57 -2.18 -5.44 2.67
       j73   j74    j75    j76    j77    j78  j79 j80 j81  j82 j83 j84
    j85    j86    j87    j88    j89    j90
      2.57 -4.03 2.96 3.40 1.12 1.36 -3.01 2.96 2.04 -3.25 1.94 -3.40
    -3.50 -3.45 -3.06 2.04 3.20 3.06
       j91   j92    j93    j94    j95    j96  j97 j98 j99 j100
      2.86 -5.15 3.01 0.83 -6.21 -6.60 -6.31 3.69 -4.22 0.97
    >
As we can see in the data, users don't have ratings for all the jokes. The two random users
we have just shown have zero as the rating for some of the jokes.
A user will not have rated all the jokes. If they have not rated it, there will be a zero value.
To get the number of jokes a user rated, we can run:
    length(Jester5k@data[100,][Jester5k@data[100,]>0]) # answer = 58
                                             [ 128 ]
Collaborative Filtering                                                            Chapter 3
We are looking to see per user the count of jokes he has not rated. We can achieve this by
doing the sum of rows of our ratings matrix. After summing it up, we create a dataframe,
zero.ratings.df, with two columns; the first column is the user and the second column
is the number of zero-entries they have in the ratings matrix, that is, the number of jokes
where their ratings were zero. Further, we can order our dataframe zero ratings in
descending order by the count. We can see that user u3228 has not rated 66 jokes.
Let us use this data to make a histogram to see the underlying distribution:
   > hist(zero.ratings.df$count, main ="Distribution of zero rated jokes")
                                           [ 129 ]
Collaborative Filtering                                                             Chapter 3
The histogram is showing the count of the zeros (that is, unrated jokes), so a low value on
the x-axis means that users have rated more jokes. The bin 0-5 is a testimony to it. Out of
100 jokes, anything between 0 to 5 is left unrated. However, the distribution looks to have
three modes.
                                           [ 130 ]
Collaborative Filtering                                                              Chapter 3
The three modes are evident from the density plot. We have three groups of users in our
database. Those who have a low number for unrated jokes, another group which has
around 25 jokes unrated and the final group which has around 65 jokes unrated.
We can fact check our user clusters by running a k-means algorithm on our data. We set the
parameter k to the number of clusters and finally collect our results in a dataframe named
model.df. The cluster centers reflect the number of jokes not rated. More information
about R k-means can be found at https://stat.ethz.ch/R-manual/R-devel/library/
stats/html/kmeans.html
                                            [ 131 ]
Collaborative Filtering                                                               Chapter 3
Now that we've looked at the user distribution, let us proceed to look at the joke's ratings:
    > Jester5k@data[,1]
      u2841 u15547 u15221 u15573 u21505 u15994   u238 u5809 u16636 u12843
    u17322 u13610 u7061 u23059 u7299
       7.91 -3.20 -1.70 -7.38      0.10   0.83   2.91 -2.77 -3.35 -1.99
    -0.68    9.17 -9.71 -3.16     5.58
    u20906 u7147 u6662 u4662 u5798 u7904 u7556 u3970           u999 u5462
    u20231 u13120 u22827 u20747 u1143
       9.08   0.00 -6.70    0.00   1.02   0.00 -3.01    5.87 -7.33   0.00
    7.48    0.00 -9.71    0.00   0.00
    u11381 u6617 u7602 u12658 u4519 u18953 u5021 u6457 u24750 u20139
    u13802 u16123 u7778 u15509 u8225
       0.00   6.55   0.00   0.00   0.78 -0.10    0.00   0.00   0.00 -6.65
    2.28    1.02   0.00 -8.35    5.53
    u12519 u16885 u12094 u6083 u19086 u1840 u7722 u17883 u12579 u3815
    u12563 u12313 u18725 u4354 u21146
       ....................
      [ reached getOption("max.print") -- omitted 4000 entries ]
    >
    > par(mfrow=c(2,2))
    > joke.density <- density(Jester5k@data[,1][Jester5k@data[,1]!=0])
    > plot(joke.density)
We look at the first joke, Jester5k@data[,1], for its scoring values; the output shown in
the preceding code is truncated.
Further, we plot the density plot for four randomly selected jokes (1, 25, 75, and 100) and
look at the distribution of scores.
                                            [ 132 ]
Collaborative Filtering                                                           Chapter 3
For all the four jokes, we see the rating is more than zero. The recommenderlab package
provides a function, getRatings, which can work on the s3 object to retrieve the ratings.
                                            [ 133 ]
Collaborative Filtering                                                Chapter 3
Let us now move on to see if we can find the most popular joke.
                                           [ 134 ]
Collaborative Filtering                                                              Chapter 3
         joke pratings
   j80    j80     1072
   j90    j90     1057
   j73    j73     1041
   j77    j77     1012
   j86    j86      994
   j79    j79      934
   j75    j75      895
   j71    j71      796
   j58    j58      695
   j74    j74      689
   >
We begin with binarizing our ratings matrix. In the new matrix, ratings.binary , the
ratings with 0 or more will be considered as positive ratings and all the others will be
considered as negative ratings. We create a dataframe, ratings.sum.df, with two
columns: the joke name and the sum of the ratings received by the joke. Since we have
binarized the matrix, this sum should be equal to the popularity of a joke. Displaying the
matrix in descending order of the sum, we see the most popular jokes and the least popular
jokes.
Finally, we are going to sample the datasets for 1,500 users and use that as our dataset for
the rest of the chapter. Sampling is often used during exploration, by taking a smaller
subset of the data, you can explore the data and produce the initial models quicker. Then
when the time comes, you can apply the best approach to the entire dataset
                                           [ 135 ]
Collaborative Filtering                                                            Chapter 3
The following is the image for the distribution of ratings for 1500 users:
Hopefully, the data exploration we performed in this section has given you a good
overview of the underlying dataset. Let us proceed now to build our joke recommendation
system.
                                            [ 136 ]
Collaborative Filtering                                                            Chapter 3
The steps in designing our recommendation project are shown in the following figure:
Ratings matrix
The first step is to obtain the ratings matrix. The recommenderlab expects the user rating
matrix to be stored as either binaryRatingsMatrix or realRatingsMatrix.
The realRatingsmatrix s3 class has a slot called data where the actual ratings matrix is
stored in a compressed format. The getRatingMatrix function is an easy way to extract
this matrix from the s3 class.
                                           [ 137 ]
Collaborative Filtering                                                            Chapter 3
      ..@ Dimnames:List of 2
      .. ..$ : chr [1:1500] "u24377" "u11509" "u6569" "u19311" ...
      .. ..$ : chr [1:100] "j1" "j2" "j3" "j4" ...
      ..@ x        : num [1:108531] 7.72 0.68 8.35 -1.94 -4.61 2.18 7.96 2.23
    8.06 -0.53 ...
      ..@ factors : list()
    >
First we take a sample of 1,500 users from our dataset. Using getRatingMatrix, we extract
the realRatingsMatrix, our ratings matrix from this dataset. We see that it's stored as
dgCMatrix, a compressed matrix form.
As you can see, using the slot data from our dataset, we can extract the ratings matrix.
Now that we have obtained the ratings matrix, we proceed to the next step of
normalization.
               When you have your own dataset, you can either create a matrix or a
               dataframe out of your dataset and coerce them to realRatingsMatrix or
               realBinaryMatrix, as shown in the previous section where we
               introduced the recommenderlab infrastructure.
                                             [ 138 ]
Collaborative Filtering                                                             Chapter 3
Normalization
Our next step is to normalize the ratings. Currently, our ratings range from -10 to 10. For
some of the underlying algorithms to work efficiently, it is generally a good practice to
normalize the data. One reason for normalizing the data is to account for the different
biases people use when rating items. Users do not rate items consistently, one user may
mainly only recommend items they such as, (that is, mostly 4/5, 5/5) whereas another user
may use rate across the range. Normalizing the data accounts for these biases.
       1. The first one is centering; it removes the row bias by subtracting the row mean
          value from all the row values.
       2. The second one is the z-score normalization; a z-score is obtained by subtracting
          the mean from individual scores and dividing it by the standard deviation.
Centering as the name indicates makes the mean zero. It does not change the scale of the
variable. If all the variables in the dataset were measured in the same scale, we can adopt
centering as our normalization technique.
On the contrary, z-score normalization changes the scale in addition to centering the data.
Now a unit change in the data is a one standard deviation change in the data. When the
variables are measured in different scale, it makes sense to use z-score normalization.
The normalize function is used to normalize the data. The method parameter dictates if
we need to use the centering or z-score normalizing approach. We further proceed to plot
the ratings distribution.
                                           [ 139 ]
Collaborative Filtering                                                             Chapter 3
The plots of the raw, normalized, and z-score normalized ratings are shown in the
following figure:
From the plots, we can see that we have brought the data close to a normal distribution
from its original distribution.
Let's proceed to the next step, where we want to arrange our data in such a way that we can
evaluate the performance of our Recommender model.
                                          [ 140 ]
Collaborative Filtering                                                                 Chapter 3
The recommenderlab package provides the necessary infrastructure to achieve this. The
evaluationScheme class is provided to help us create a train/test strategy. It takes a
ratingsMatrix as an input and provides several schemes including simple split, bootstrap
sampling, and k-fold cross-validation to create an evaluation scheme.
Using the evaluationScheme method, we allocate 90% of our data to training. The
given parameter achieves the following. From the test data, it takes 10 jokes for each user
and keeps them aside for evaluation.The model will predict the rating for these ten jokes of
the test records and these can be compared to the actual values. That way we can evaluate
the performance of our recommender systems. Finally, the parameter goodRating, defines
the threshold for good rating, here we say any rating greater than or equal to 1 is a positive
rating. This is critical for evaluating our classifier. Let us say if we want to find the accuracy
of our recommender system, then we need to find out the number of ratings where our
predictions matched with the actual. So if the ratings of both actual and predicted is greater
than or equal to 1, then they are considered as a match. We choose 2 here looking at the
density plot of z-score normalized data.
The method parameter is very important as this decides the evaluation scheme. In our
example, we have used split. In split it randomly assigns objects to train or test based on
the proportion given.
                                             [ 141 ]
Collaborative Filtering                                                               Chapter 3
Another important parameter we did not include is k. By default, the value of k is set to 1
for the split method. It decides the number of times to run the evaluation scheme. If we
had selected k-fold cross-validation as our method, k defaults to 10. A total of 10 different
models will be built for 10 splits of the data and the performance will be an average of all
the 10 models.
As you can see in a three-fold validation, we have three splits of our train and test data. The
model is evaluated against all the three splits.
                                            [ 142 ]
Collaborative Filtering                                                             Chapter 3
Now that we have built our scheme, we can use the scheme to extract our test and train
dataset.
The getData function is used to extract the train, and test dataset. The type parameter
returns different datasets. The train type returns the training dataset. The known type
returns the known ratings from the test dataset. The unknown type returns the ratings used
for evaluation of the dataset.
This is a standard technique followed in any machine learning algorithm development. The
train dataset is used to train the model and the test dataset is used to test the performance
of the model. If the model performs poorly, we go back to using the train dataset and tune
the model. Tuning may involve either changing the approach (that is, instead of using user-
based models, let us say we move to latent-based models and evaluate the performance) or
we start changing some of the parameters of our existing approach. Say in the case of user-
based filtering, we change the number of neighbors.
Our test dataset is further divided into test.known and test. The dataset test.known,
the ratings for 10 products. Using those ratings, our recommender system will predict the
recommendations. These recommendations can be now compared with the actual value
from test data set.
                                              [ 143 ]
Collaborative Filtering                                                                    Chapter 3
Out of a total of 1,500 users, 90% of the data (that is 1,350), is allocated to train. A set of 150
users are reserved for the purpose of testing.
Train model
We have our data split into the following:
           train: The user matrix we will use to build our recommendation model.
           test.known: We will feed this for our predict method along with our model. The
           output can now be compared to our test dataset.
           test: The test dataset is used for evaluating our model
Using these, let us go ahead and build our recommender system. For the first model, we are
going to build a random model.
    $labels
      [1] "j1"       "j2"     "j3"     "j4"    "j5"      "j6"    "j7"     "j8"     "j9"     "j10"
    "j11" "j12"       "j13"    "j14"
     [15] "j15"      "j16"    "j17"    "j18"   "j19"     "j20"   "j21"    "j22"    "j23"    "j24"
    "j25" "j26"       "j27"    "j28"
     [29] "j29"      "j30"    "j31"    "j32"   "j33"     "j34"   "j35"    "j36"    "j37"    "j38"
    "j39" "j40"       "j41"    "j42"
     [43] "j43"      "j44"    "j45"    "j46"   "j47"     "j48"   "j49"    "j50"    "j51"    "j52"
    "j53" "j54"       "j55"    "j56"
     [57] "j57"      "j58"    "j59"    "j60"   "j61"     "j62"   "j63"    "j64"    "j65"    "j66"
    "j67" "j68"       "j69"    "j70"
     [71] "j71"      "j72"    "j73"    "j74"   "j75"     "j76"   "j77"    "j78"    "j79"    "j80"
    "j81" "j82"       "j83"    "j84"
     [85] "j85"      "j86"    "j87"    "j88"   "j89"     "j90"   "j91"    "j92"    "j93"    "j94"
    "j95" "j96"       "j97"    "j98"
     [99] "j99"      "j100"
>
                                               [ 144 ]
Collaborative Filtering                                                              Chapter 3
   > random.predict@ratings[1]
   [[1]]
   [1] 2.516658 2.139329 1.921027 1.900945 1.845772
Using the Recommender function, we have built a random model. Inspecting the model, we
see that this model will produce a random number between -5.9 and 5.2 as defined by the
range. The model does not learn anything, except the range of the ratings in the training set
and produces the requested rating.
The predict function is used to predict the ratings. We need the top N predictions, that is
the top five jokes the user would have rated. Is it the top 5 jokes of the 10 known (i.e.
holdout) jokes for a specific user? This is specified by the n parameter and the
type parameter. Finally, we can look into our predictions, random.predict, to find our
recommendations and ratings for an individual user. In the preceding code snippet, we look
for user 1. We see that jokes 81, 96, 47, 28, and 65 are recommended for this user.
               Its a best practice to build a reference model to begin with before we start
               building our actual models. When we have our first model, we can
               compare it with our reference model to see some performance gain. Any
               good model should be better than a random model. Once we have a good
               model we can replace the random model with the good model and
               proceed to find better models. Random models, since they are very easy to
               create are typically used as the first reference model.
Let us build a more sensible model. The one we have seen in the previous sections, called
the popular model.
                                            [ 145 ]
Collaborative Filtering                                                              Chapter 3
We build a model using the popular type. We have trained it with 1,350 user data.
   > popular.predict@ratings[1]
   $u4027
   [1] 3.472511 3.218649 3.157963 3.011908 2.976325
>
Using the predict function and using the test.known dataset, we try to predict the top
five recommendations for users in the test dataset. Remember that in the known
parameter, we had reserved 10 recommendations for each user, stored in test.known.
Using that we are now predicting other jokes for the user.
Before we proceed to build a better model, let us introduce the evaluate function.
Till now we had explicitly invoked the Recommender and create a recommender object.
Further passed that model to predict function to get our predictions. This is the correct
way of implementing a model. However, when are evaluating different models, we need a
convenient way to run and test multiple models.
                                           [ 146 ]
Collaborative Filtering                                                           Chapter 3
The evaluate function can take an evaluation scheme and apply it to a given model and
produce the performance of the recommendation model. The performance is given using
multiple metrics.
>
As you can see, we have passed the plan, which is our evaluation scheme and the method.
We request five recommendation for each user using parameters k and type.
While comparing the ratings, the parameter goodRating is used. Let us say we have set
our goodRating parameter to 5. If the predicted rating is 5.1 and the actual rating is 7.1,
they are considered as a match. Our recommendation is considered as a match with the test
set. This way we can calculate the metrics we have detailed.
TP stands for True Positive, the number of instances where our recommendation matched
with the actual recommendation.
                                          [ 147 ]
Collaborative Filtering                                                                Chapter 3
FP stands for False Positive, the number of instances where we made a recommendation,
but the actual data shows otherwise.
FN stands for False Negative, where we did not recommend, but the actual data shows
otherwise.
TN stands for True Negative, where we did not recommend and the actual data was in
agreement with us.
Precision is the ratio of the number of correct recommendations out of the total
recommendations.
Recall is the ratio of the number of correct recommendations out of the total correct
recommendations.
TPR stands for True Positive Rate, the number of true positives to the total positives ( true
positive + false negative).
FPR stands for False Positive Rate, the ratio of false positives to the sum of false positives
and true negatives.
Until now we have used the split scheme; now let us use the n fold cross-validation
scheme. Further, we will be only using the evaluate function to find the performance of
our recommendation model.
                                             [ 148 ]
Collaborative Filtering                                                                  Chapter 3
Instead of the train test split scheme, here we are doing a cross-validation. By default, it
does a 10-fold cross-validation. Finally, our results is an average of 10 cross-validations for
three different top Ns. As you can see, we have passed n a vector with three values, 5, 10,
and 15, so we can generate 5, 10, or 15 recommendations.
User-based models
We discussed the the user-based model in the previous sections. Let us do a quick recap
here. The user-based model for collaborative filtering tries to mimic the word-of-mouth
approach in marketing. It is a memory-based model. The premise of this algorithm is that
similar users will have a similar taste in jokes. Therefore, they will rate jokes in a more or
less similar manner. It's a two-step process, in the first step for a given user, the algorithm
finds his neighbors. A similarity distance measure, such as Pearson coefficient, or cosine
distance, is used to find the neighbors for a given user. For an item not rated by the user,
we look at whether the user's neighbors have rated that item. If they have rated an average
of his the neighbors' ratings are considered as the ratings for this user.
With our input data prepared, let us proceed to build the model.
                                            [ 149 ]
Collaborative Filtering                                                           Chapter 3
         1 [0.017sec/0.268sec]
         2 [0.016sec/0.267sec]
         3 [0.01sec/0.284sec]
         4 [0.01sec/0.273sec]
         5 [0.009sec/0.273sec]
         6 [0.009sec/0.272sec]
         7 [0.009sec/0.508sec]
         8 [0.009sec/0.236sec]
         9 [0.009sec/0.268sec]
         10 [0.01sec/0.262sec]
    > avg(results)
             TP       FP       FN       TN precision    recall       TPR
    FPR
    5 2.024000 2.976000 14.40600 70.59400 0.4048000 0.1586955 0.1586955
    0.03877853
    10 3.838667 6.161333 12.59133 67.40867 0.3838667 0.2888018 0.2888018
    0.08048999
    15 5.448000 9.552000 10.98200 64.01800 0.3632000 0.3987303 0.3987303
    0.12502479
    >
Using our framework defined in the previous model, we can create a user-based
recommendation system, as shown in the previous code, and evaluate its performance. We
have used the cross-validation scheme to evaluate our model's performance.
We call the evaluate method with our cross-validation scheme and specify the user-based
model by the parameter method. UBCF stands for user-based recommendation. Once
again we are interested only in the top N recommendation and our N is now an array of
three values: 5, 10, and 15. We want to evaluate our model for all the three Ns. Therefore,
we have passed an array of values. Finally, when we see the model performance using the
results object, the performance is averaged across 10 models, for all the three Ns we have
supplied.
An alternate way to evaluate the model is to make the model do the recommendation for all
the unknown items in the test data. Now compare the difference between the predicted and
actual ratings and show a metric, such as root mean square error, or mean absolute error or
squared error.
                                          [ 150 ]
Collaborative Filtering                                                             Chapter 3
        5 [0.01sec/0.221sec]
        6 [0.009sec/0.213sec]
        7 [0.013sec/0.247sec]
        8 [0.009sec/0.401sec]
        9 [0.011sec/0.242sec]
        10 [0.009sec/0.243sec]
   > avg(results.1)
           RMSE      MSE      MAE
   res 4.559954 20.80655 3.573544
   >
Here we are saying for our user-based model, we need to normalize the data. Sine the scale
of the rating is same across the users ( from -10 to +10) we want to only bring the data to a
zero mean. Further, we want to use the Pearson coefficient as our distance measure. Finally,
we want to use 10 neighbors to get our recommendation.
Also since we have 100 jokes, we can have around 30 jokes in our test.known.
We can see that our RMSE has gone down. Similarly, we can make some more changes and
tune our model.
                                           [ 151 ]
Collaborative Filtering                                                             Chapter 3
Item-based models
This is a model-based approach. From a given ratings matrix, this method explores the
relationship between the items. Based on the ratings, different users provide different items,
and an item-to-item similarity matrix is derived. Once again, as in the user-based model,
Pearson coefficient or cosine distance is used as a similarity metric. For each item, we store
the top K similar items, rather than storing all the items for efficiency purposes. A weighted
sum idea is used to finally make a recommendation for a user. Refer to the paper from
Amazon for more about item-based filtering: https://dl.acm.org/citation.cfm?id=
642471
As you can see, the class structure of recommenderlab is very elegant; the only change we
made was to use IBCF as the value to the method parameter.
                                           [ 152 ]
Collaborative Filtering                                                           Chapter 3
The RMSE is higher than the user-based model. Which is not surprising. Item-based model
typically performs slightly poorer than the user-based model. The trade off is memory
usage. User-based model consumes more memory but gives better results.
Factor-based models
Factor-based models leverage matrix decomposition techniques to predict
recommendations for unrated items.
Once again the steps are the same, except the value for the method parameter.
                                          [ 153 ]
Collaborative Filtering                                                          Chapter 3
        3 [33.355sec/1.456sec]
        4 [34.085sec/1.505sec]
        5 [33.442sec/1.356sec]
        6 [34.764sec/1.328sec]
        7 [33.307sec/1.232sec]
        8 [34.139sec/1.484sec]
        9 [33.132sec/1.359sec]
        10 [35.274sec/1.342sec]
   > avg(results.1)
           RMSE      MSE      MAE
   res 5.086455 25.88602 3.853208
   >
Our RMSE error is much higher than both the item-based and the user-based models.
Let us now build a complete model, using our knowledge from the above three
experiments.
To being with let us normalize the data and then split it into test and train:
   data <- Jester5k
   data <- normalize(data, method = "center")
   plan <- evaluationScheme(data, method="split", train=0.9, given = 10,
   goodRating=5)
   train <- getData(plan, "train")
   test <- getData(plan, "unknown")
   test.known <- getData(plan, "known")
We saw that, Pearson with 10 nearest neighbors worked well, we will use those to build our
final model.
                                           [ 154 ]
Collaborative Filtering                                                          Chapter 3
[1] 66 36 35 54 42
   > final.predict@ratings[1]
   $u7147
   [1] 2.746771 2.742437 2.722453 2.297228 1.974920
We have shown in this section, how to build all three types of recommender systems. We
finally chose the user-based system as it out-performed all the other systems in RMSE.
The next section gives the complete list of code we have used in this chapter.
Complete R Code
The complete project code is as follows:
   set.seed(100)
   products <- c('A','B','C','D','E','F','G')
   user.a <-   c( 3, 0, 2, 5, 5, 0,1)
   user.b <-   c( 3, 5, 3, 5, 4, 2, 1)
                                           [ 155 ]
Collaborative Filtering                                                Chapter 3
install.packages("recommenderlab")
   # A quick recommender
   model <- Recommender(data = rating.mat, method = "POPULAR")
   model
   str(model)
recommenderRegistry$get_entries(dataType = "binaryRatingMatrix")
   # Quick Prediction
   recomms <- predict(model, newdata = rating.mat, n =2)
   recomms@items
   recomms@ratings
   data("Jester5k")
   str(Jester5k)
   head(Jester5k@data[1:5,1:5])
   # Users
   Jester5k@data[1,]
   Jester5k@data[100,]
                                       [ 156 ]
Collaborative Filtering                                                 Chapter 3
   # Ratings
   Jester5k@data[,1]
   par(mfrow=c(2,2))
   joke.density <- density(Jester5k@data[,1][Jester5k@data[,1]!=0])
   plot(joke.density)
   par(mfrow=c(1,1))
   nratings(Jester5k)
   hist(getRatings(Jester5k), main="Distribution of ratings")
   # Popular joke
   # Binarize the ratings
   ratings.binary <- binarize(Jester5k, minRating =0)
   ratings.binary
   ratings.sum <- colSums(ratings.binary)
   ratings.sum.df <- data.frame(joke = names(ratings.sum), pratings =
   ratings.sum)
   head( ratings.sum.df[order(-ratings.sum.df$pratings), ],10)
   tail( ratings.sum.df[order(-ratings.sum.df$pratings), ],10)
   # Sample
   data <- sample(Jester5k, 1500)
   hist(getRatings(data), main="Distribution of ratings for 1500 users")
   ##
   data <- sample(Jester5k, 1500)
   ratings.mat <- getRatingMatrix(data)
   str(ratings.mat)
                                    [ 157 ]
Collaborative Filtering                                                  Chapter 3
str(data@data)
   par(mfrow=c(3,1))
   plot(density(getRatings(data)), main = "Raw")
   plot(density(getRatings(data.norm)), main = "Normalized")
   plot(density(getRatings(data.norm.z)), main = "Z-score normalized")
   par(mfrow=c(1,1))
   # Popular Model
   popular.model <- Recommender(train, "POPULAR")
   popular.model
                                    [ 158 ]
Collaborative Filtering                                                  Chapter 3
test.known@data[1,c(50,36,27,32,35)]
   # Cross validation
   plan <- evaluationScheme(data, method="cross", train=0.9, given = 10,
   goodRating=5)
   results <- evaluate(plan, method = "POPULAR", type = "topNList", n =
   c(5,10,15) )
   avg(results)
   param=list(normalize="center",method="Pearson",nn=10)
   plan <- evaluationScheme(data, method="cross", train=0.9, given = 30,
   goodRating=5)
   results.1 <- evaluate(plan, method ="UBCF", param=param, type ="ratings")
   avg(results.1)
                                    [ 159 ]
Collaborative Filtering                                                Chapter 3
   goodRating=5)
   results <- evaluate(plan, method = "IBCF", type = "topNList", n =
   c(5,10,15) )
   avg(results)
   # final model
   data <- Jester5k
   param=list(method="Pearson",nn=10)
   final.model <- Recommender(train, method = "UBCF", param = param)
   final.model
test@data[1,]
test@data[1,c(45,80,10,25,77)]
                                    [ 160 ]
Collaborative Filtering                                                            Chapter 3
Summary
We looked at what is a collaborative filtering method and went into different collaborative
filtering strategies. We introduced the R package recommenderlab to perform
collaborative filtering. We leveraged the Jester5k dataset to demonstrate a collaborative
filtering algorithm. We looked at the random model, popular model, item-based similarity,
user-based similarity models, and factor models. We introduced the concept of evaluating
the performance of a recommender system before deploying it. We demonstrated the steps
to split the datasets in order to evaluate our model performance.
In the next chapter, we will be looking to build Deep Neural networks using the MXNet
package for time series data. We will introduce the package MXNet R and proceed to build a
deep neural network.
                                          [ 161 ]
Taming Time Series Data Using
                                                                                            4
        Deep Neural Networks
With the advent of the Internet of Things, more and more devices are being added to the
World Wide Web every day. Today's world is rapidly becoming inundated with
intelligence devices. These devices are producing data at a rapid pace which we have never
seen before. The storage cost has significantly come down, allowing companies to store this
massive data in a cheap manner. Smart City projects have been kick-started by various
governments. These Smart City projects rely heavily on this instrumentation. Technology
companies and governments now want to leverage this massive data, generated from these
smart devices, to perform analytics and produce data-enabled applications. The end use
cases of these analytics are limitless. The data is characterized by three factors:
Among all the different varieties of data, data from sensors is the most widespread and is
referred to as time series data.
              A time series is a series of data points indexed (or listed or graphed) in time order.
              Most commonly, a time series is a sequence taken at successive equally spaced
              points in time. Thus it is a sequence of discrete-time data.
Taming Time Series Data Using Deep Neural Networks                                   Chapter 4
Time series data is not a new type of data. For decades, financial industries have been using
this data for various market-related purposes. Algorithmic trading is the use of computer
programs to generate and follow a defined set of instructions to trade in the market. This
allows profit to be generated at a greater speed and frequency compared to a human trader.
Algorithmic trading leverages time series data. The traditional approach to time series data
involves analyzing it in frequency domains or temporal domains. Today, deep learning has
opened up a whole new door to analyzing time series data at a never before seen speed and
level of precision.
Time series data is not just limited to sensor data and financial data. In many other real-
world applications, such as speech recognition, machine translation, and sequence
generation, data is captured in a temporal fashion. Two points in a time series may look
identical, but they may have different temporal dependencies, thus making them belong to
two different categories.
Deep learning allows us to build sophisticated models, which can capture non-linear
relationships in the data in a much more efficient and faster manner. Deep learning can be
defined in two ways.
Any neural network with more than three hidden layers can be defined as a deep learning
network. Different layers in the network capture different properties of the data. The nature
of deep learning is very valuable to time series analysis.
The other definition is the series of hacks performed on these networks to incorporate
bagging and boosting functionalities, thus allowing us to build an ensemble of models
through a single network. Once again, each model in the ensemble is now capable of
understanding different regions of the data. This is very useful for analyzing time series
data.
The aim of this chapter is to introduce MXNet R to our readers and show them how fully
connected deep networks can be effective in predicting time series data. Our treatment of
traditional time series data analysis will be very limited. We will provide sufficient
references for traditional time series analysis.
                                             [ 163 ]
Taming Time Series Data Using Deep Neural Networks                                   Chapter 4
The code for this chapter was written in RStudio version 0.99.491. It uses R version 3.3.1. As
we work through our example, we will introduce the MXNet R package that we will be
using. During our code descriptions, we will be using some of the output printed in the
console. We have included what was printed in the console immediately following the
statement that prints the information to the console, to not disturb the flow of the code.
Using the scan function, we get the data from the URL. Following that, we create a time
series object using ts.
                                             [ 164 ]
Taming Time Series Data Using Deep Neural Networks                                     Chapter 4
The standard R plot function knows how to plot time series data. The time series plot is as
shown in the following diagram:
An additive model is used to estimate the trend of a non-seasonal time series. The time
series can be smoothed to remove the trend using methods such as moving averages. The
trend can be either increasing or decreasing.
    par(mfrow=c(2 ,2))
    plot(SMA(king.ts, n=2), main = "n=2")
    plot(SMA(king.ts, n=5), main = "n=5")
    plot(SMA(king.ts, n=10), main = "n=10")
    plot(SMA(king.ts, n=15), main = "n=15")
    par(mfrow=c(1,1)
                                            [ 165 ]
Taming Time Series Data Using Deep Neural Networks                                  Chapter 4
The TTR library provides the SMA function to perform the smoothing of the data. The n
parameter defines the order of the moving average, which means how many points we
want to consider to do our moving average.
We will use another small dataset to show seasonality—the number of births per month in
New York City from January 1946 to December 1959:
    > births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
    Read 168 items
    > births.ts <- ts(births, frequency = 12)
    > births.comps <- decompose(births.ts)
                                            [ 166 ]
Taming Time Series Data Using Deep Neural Networks                                  Chapter 4
Using the decompose function, we split the trend, seasonality, and error from the time
series. Let us now plot these components using plot(births.comps):
We see the individual components, trends, seasonality, and errors in the graph.
In the previous section, we looked at non-seasonal time series with trend and error
components, and seasonal time series with an additional seasonal component. These are
non-stationary aspects of the time series. The time series has to be made stationary before
we use any regression methods.
              Statistical stationarity
              A stationary time series is one whose statistical properties, such as mean,
              variance, autocorrelation, and so on, are all constant over time.
                                           [ 167 ]
Taming Time Series Data Using Deep Neural Networks                                  Chapter 4
Let us do a simple regression exercise with our time series data. We are going to predict the
age of death of a king based on the age of death of the previous king. It's a simple example
and hence may sound very strange. The idea is to show how regression is used to do time
series forecasting.
We are using the quantmod package to get the lag of the time series. Our predictor is the
age of death of the previous king. Hence, we take a lag of 1. Finally, our data
frame, new.data, holds our x and y.
                                           [ 168 ]
Taming Time Series Data Using Deep Neural Networks                     Chapter 4
    Call:
    lm(formula = y ~ Lag.1, data = new.data)
    Coefficients:
    (Intercept)           Lag.1
         2.8651          0.9548
>
                                            [ 169 ]
Taming Time Series Data Using Deep Neural Networks                                      Chapter 4
This plot shows whether or not the residuals are normally distributed. The residuals follow
a straight line.
This plot shows whether or not the residuals have non-linear patterns. We can see equally
spread residuals around the red horizontal line, without distinct patterns, which is a good
indication that we don't have non-linear relationships.
We will stop this section here. Hopefully, it gave you an idea about how to rearrange time
series data to suit a regression problem. Some of the techniques shown here, such as lag,
will be used later when we build our deep learning model.
For more about time series regression, refer to the book Forecasting: Principles and
Practices at https://www.otexts.org/fpp/4/8. It introduces some basic time series
definitions. More curious readers can refer to A Little Book of R for Time Series at https://a-
little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html.
                                             [ 170 ]
Taming Time Series Data Using Deep Neural Networks                                  Chapter 4
This network architecture has a single hidden layer with two nodes. The output layer is
activated by a softmax function. This network is built for a classification task. The hidden
layer can be activated by tanh, sigmoid, relu, or soft relu activation functions. The
activation function performs the key role of introducing non-linearity in the network.
The product of weight and the input from the previous layer summed up with the bias is
passed to these activation functions.
                                           [ 171 ]
Taming Time Series Data Using Deep Neural Networks                                 Chapter 4
Compared to the multi-layer perceptron, we can see that there are several hidden layers.
For the sake of clarity in the diagram, we have only two nodes in each hidden layer, but in
practice there may be several hundred nodes.
As you can see, the deep learning networks are different from their sibling single-hidden-
layer neural networks in terms of their depth, that is, the number of hidden layers. The
input data is now passed through multiple hidden layers, on the assumption that each layer
will learn some aspects of the input data.
So when do we call a network deep? Networks with more than three layers (including input
and output) can be called deep neural network.
                                           [ 172 ]
Taming Time Series Data Using Deep Neural Networks                                    Chapter 4
As previously mentioned, in deep learning networks the nodes of each hidden layer learn to
identify a distinct set of features based on the previous layer’s output. Subsequent hidden
layers can understand more complex features as we aggregate and recombine features from
the previous layer:
Let us see how intuitively a feedforward neural network learns from the data.
Say we are solving a classification task. Our input X is a 2 x 2 matrix and our output Y is a
vector of the binary response, either one or zero.
The two nodes in our input layer are fully connected to the two nodes in the hidden layer.
We have four weights going from our input node to our hidden node. We can represent our
weights using a 2 x 2 matrix. We have bias across the two nodes in our hidden layer. We
can represent our bias using a 1 x 2 matrix.
Again, we connect our hidden nodes fully to the two nodes of our output layer. The weight
matrix is again 2 x 2 and the bias matrix is again 1 x 2.
                                            [ 173 ]
Taming Time Series Data Using Deep Neural Networks                                   Chapter 4
{w1,w2,w3,w4} and {b1,b2} are the weights and bias for the input to hidden nodes.
{v1,v2,v3,v4} and {bb1,bb2} are the weights and bias for the hidden to output nodes.
It is very important to initialize the weights and bias randomly. Xavier initialization is a
popular initialization used widely today. See this blog for a very intuitive explanation on
Xavier initialization, at http://andyljones.tumblr.com/post/110998971763/an-
explanation-of-xavier-initialization.
Forward cycle
The input x1, x2 is multiplied by the weights w1..w4 in each node and the respective bias is
added. This is the linear transformation:
                                            [ 174 ]
Taming Time Series Data Using Deep Neural Networks                                  Chapter 4
Now these activations are multiplied by v1..v4 and the respective bias is added:
Backward cycle
The training of the network happens in the backward cycle, hence the name
backpropagation algorithm. The difference between the actual and predicted value error is
calculated. This error is walked back over the model and the weights in the model are
adjusted. The derivative of the errors with respect to these weights gives us an idea as to
how much these weights contributed to the overall error. That percentage of contribution is
adjusted in the weight. Refer to https://en.wikipedia.org/wiki/Backpropagation to
learn more.
Hopefully, this gave you an intuitive explanation on how neural networks learn from data.
Let us go to the next section to learn more about the MXNet R library we will be using to
build our deep learning network.
                                            [ 175 ]
Taming Time Series Data Using Deep Neural Networks                                    Chapter 4
In MXNet, NDArray is the basic operation unit. It's a vectorized operation unit for matrix
and tensor computations. All operations on this operation unit can be run on either the CPU
or GPUs. The most important point is that all these operations are parallel. It's the basic data
structure for manipulating and playing around with data in MXNet. It supports a wide
range of mathematical operations, allowing for the writing of machine learning programs in
an imperative fashion. Let us look at some basic NDArray operations.
We have created two matrices of size 3 x 3, one filled with zeros and the other filled with
ones.
The context defines where the code is executed, either in the CPU or GPU. Let us look at the
context:
    > mx.ctx.default()
    $device
    [1] "cpu"
    $device_id
    [1] 0
    $device_typeid
    [1] 1
    attr(,"class")
    [1] "MXContext"
As my machine does not include a GPU, the only context available is CPU.
                                            [ 176 ]
Taming Time Series Data Using Deep Neural Networks                                 Chapter 4
We pass the context while creating the arrays. Here, we have passed the GPU context,
specifying that we need this 3 x 3 zero filled array to be created in the GPU.
The code demonstrates matrix multiplication and element-wise addition in the matrix.
The code generates data from a normal distribution for a given mean and standard
deviation.
                                           [ 177 ]
Taming Time Series Data Using Deep Neural Networks                             Chapter 4
   $b
          [,1] [,2] [,3]
   [1,]      1    1    1
   [2,]      1    1    1
   [3,]      1    1    1
   > pexec$outputs
   $`_plus4_output`
        [,1] [,2] [,3]
                                           [ 178 ]
Taming Time Series Data Using Deep Neural Networks                                Chapter 4
   [1,]      2    2     2
   [2,]      2    2     2
   [3,]      2    2     2
We create an executor by calling mx.simple.bind for our symbol c. In order to create it,
we need to tell the executor the shape of a and b, which are arguments to c. After that,
using mx.exec.update.arg.arrays, we push the real data into the executor for it to
execute the symbol c. The output slot of the executor has the results stored.
Let us look at another example, where we create a symbol d, which is a dot product of two
matrices:
   > d <- mx.symbol.dot(a, b)
   > arg_lst <- list(symbol = d, ctx = mx.ctx.default(), a = dim(ones.matrix),
   +                  b= dim(ones.matrix), grad.req="null")
   > pexec <- do.call(mx.simple.bind, arg_lst)
   > pexec
   C++ object <0x1170550a0> of class 'MXExecutor' <0x101be9c30>
   > input_list <- list(a = ones.matrix,b = ones.matrix)
   > mx.exec.update.arg.arrays(pexec, input_list)
   > mx.exec.forward(pexec)
   > pexec$arg.arrays
   $a
        [,1] [,2] [,3]
   [1,]    1    1     1
   [2,]    1    1     1
   [3,]    1    1     1
   $b
          [,1] [,2] [,3]
   [1,]      1    1    1
   [2,]      1    1    1
   [3,]      1    1    1
   > pexec$outputs
   $dot3_output
        [,1] [,2] [,3]
   [1,]    3    3    3
   [2,]    3    3    3
   [3,]    3    3    3
                                           [ 179 ]
Taming Time Series Data Using Deep Neural Networks                                    Chapter 4
Refer to https://github.com/apache/incubator-mxnet/tree/master/R-package/
vignettes for R vignettes and other operations using MXNet.
With this background in imperative and symbolic use of MXNet R, let us go ahead and build
a simple multi-layer perceptron to solve the famous XOR gate problem. We want the
network to learn the XOR truth table:
X Y Z
0 0 0
0 1 1
1 0 1
1 1 0
Our network architecture is as follows:
                                           [ 180 ]
Taming Time Series Data Using Deep Neural Networks                                    Chapter 4
    X <- data.matrix(mx.nd.concat(list(x.1,x.2,x.3,x.4)) )
    Y <- c(y.1,y.2,y.3,y.4)
We have used four normal distributions to generate our data points. Finally, we combine
them into one array, X. The labels are stored in the Y variable.
Now let us go ahead and define our network architecture to solve our XOR problem:
    ############## Define the Network #########
    # Input layer
    data <- mx.symbol.Variable("data")
    # Hidden Layer
    hidden.layer <- mx.symbol.FullyConnected(data = data
                                             , num_hidden = 2)
    # Output Layer
    out.layer <- mx.symbol.FullyConnected(data = act
                                          , num.hidden = 2)
    # Softmax of output
    out <- mx.symbol.SoftmaxOutput(out.layer)
We use relu activation for this layer. mx.symbol.Activation is used to create this
activation. We pass the previous hidden layer to this to say this activation function is tied to
the previous hidden layer.
                                            [ 181 ]
Taming Time Series Data Using Deep Neural Networks                                    Chapter 4
Softmax activation
Softmax activation is most frequently used for classification tasks. Softmax rescales the
output from the previous layer. First, it calculates the exponential of the input to its neuron
and then it divides the total sum of input with all of the neurons in the layer, so that the
activation sums up to one and lies between zero and one.
Softmax activation is used as the last layer in a deep learning neural network for multi-class
classification. The layer has the same number of nodes as the number of classes and it
rescales the output so that it adds up to 1.0, therefore calculating the probability for each
class, that is P(y|x).
                                            [ 182 ]
Taming Time Series Data Using Deep Neural Networks                                 Chapter 4
   [14] Train-accuracy=0.912
   [15] Train-accuracy=1
   [16] Train-accuracy=1
   [17] Train-accuracy=1
   [18] Train-accuracy=1
   [19] Train-accuracy=1
   [20] Train-accuracy=1
   > X_test = data.matrix(rbind(c(0,0),c(1,1),c(1,0),c(0,1) ) )
   > preds = predict(model, X_test, array.layout = "rowmajor")
   > pred.label <- max.col(t(preds)) -1
   > pred.label
   [1] 0 0 1 1
Finally, we pass the model to the predict function to see the predictions.
Wow, we have finished building a neural network with 100% accuracy for our training
data. You can look at the prediction results from the model for a simple XOR truth table.
We can further investigate our model by looking at the graph generated by our symbolic
programming:
   > graph.viz(model$symbol)
   > model$arg.params
   $fullyconnected4_weight
             [,1]     [,2]
   [1,] -1.773083 1.751193
   [2,] -1.774175 1.751324
   $fullyconnected4_bias
   [1] 1.769027 -1.754247
   $fullyconnected5_weight
            [,1]      [,2]
   [1,] 2.171195 -2.166867
   [2,] 2.145441 -2.132973
   $fullyconnected5_bias
   [1] -1.662504 1.662504
>
                                           [ 183 ]
Taming Time Series Data Using Deep Neural Networks                                Chapter 4
Furthermore, we can look into the weights our model has generated for this network.
Hopefully, this section gave you a good overview of the Mxnet R package. Using this
knowledge, let us go ahead and solve our time series problem.
                                           [ 184 ]
Taming Time Series Data Using Deep Neural Networks                                   Chapter 4
We leverage the package quantmod. The getSymbols function in quantmod can fetch stock
information from sources such as Yahoo and Google. As you can see, we are fetching
Apple's stock price data using their AAPL ticker.
We are interested only in the close price. We take the closing price out and peek at the top
rows.
We pull the data into a data frame called plot.data, so as to pass this conveniently to
ggplot.
                                            [ 185 ]
Taming Time Series Data Using Deep Neural Networks                                     Chapter 4
The graph shows the time series of the stock prices. It's a non-stationery time series with a
definite trend and error component.
          The closing price from day 1 in our dataset, up to day 31, will be our X. The day
          32 closing price will be our Y.
                                            [ 186 ]
Taming Time Series Data Using Deep Neural Networks                                  Chapter 4
          The closing price from day 2 in our dataset, up to day 32, will be our X. The
          closing price on day 33 will be our Y.
We will follow the same pattern. Once again, we will leverage quantmod package's Lag
functionality to prepare our data:
   > data.ts <- as.ts(data)
   > data.zoo <- as.zoo(data.ts)
   > x.data <- list()
   > for (j in 1:31){
   +      var.name <- paste("x.lag.",j)
   +      x.data[[var.name]] <- Lag(data.zoo,j)
   +   }
   > final.data <- na.omit(data.frame(x.data, Y = data.zoo))
   > head(final.data)
          Lag.1      Lag.2     Lag.3  Lag.4   Lag.5   Lag.6    Lag.7                 Lag.8
   Lag.9    Lag.10     Lag.11
   32 4.250000 4.136161 3.883929 4.053571 4.022321 4.102679 4.073661             3.857143
   3.689732 3.529018 3.580357
   33 4.075893 4.250000 4.136161 3.883929 4.053571 4.022321 4.102679             4.073661
   3.857143 3.689732 3.529018
   34 4.102679 4.075893 4.250000 4.136161 3.883929 4.053571 4.022321             4.102679
   4.073661 3.857143 3.689732
   35 3.973214 4.102679 4.075893 4.250000 4.136161 3.883929 4.053571             4.022321
   4.102679 4.073661 3.857143
   36 4.064732 3.973214 4.102679 4.075893 4.250000 4.136161 3.883929             4.053571
   4.022321 4.102679 4.073661
   37 4.151786 4.064732 3.973214 4.102679 4.075893 4.250000 4.136161             3.883929
   4.053571 4.022321 4.102679
         Lag.12    Lag.13     Lag.14 Lag.15  Lag.16  Lag.17   Lag.18                Lag.19
   Lag.20    Lag.21     Lag.22
   32 3.705357 3.629464 3.928571 3.935268 4.008929 3.794643 3.975446             4.053571
   3.805804 3.712054 3.587054
   33 3.580357 3.705357 3.629464 3.928571 3.935268 4.008929 3.794643             3.975446
   4.053571 3.805804 3.712054
   34 3.529018 3.580357 3.705357 3.629464 3.928571 3.935268 4.008929             3.794643
   3.975446 4.053571 3.805804
   35 3.689732 3.529018 3.580357 3.705357 3.629464 3.928571 3.935268             4.008929
   3.794643 3.975446 4.053571
   36 3.857143 3.689732 3.529018 3.580357 3.705357 3.629464 3.928571             3.935268
   4.008929 3.794643 3.975446
   37 4.073661 3.857143 3.689732 3.529018 3.580357 3.705357 3.629464             3.928571
   3.935268 4.008929 3.794643
         Lag.23    Lag.24     Lag.25 Lag.26  Lag.27  Lag.28   Lag.29                Lag.30
   Lag.31          Y
                                           [ 187 ]
Taming Time Series Data Using Deep Neural Networks                                    Chapter 4
We start by creating an empty x.data list. Inside the for loop we invoke the Lag function
to get the last 31 closing prices.
We combine this list with our original data and omit all the rows with NA values. Now we
are ready with our dataset.
train.data contains 80% of our data and test.data now contains the final 20%.
Let us now split the data into the dependent variable, y, and independent variable, x:
    train.x.data <- data.matrix(train.data[,-1])
    train.y.data <- train.data[,1]
                                            [ 188 ]
Taming Time Series Data Using Deep Neural Networks                     Chapter 4
                                           [ 189 ]
Taming Time Series Data Using Deep Neural Networks   Chapter 4
   [25]   Train-mae=3.91980588843425
   [26]   Train-mae=3.86194357938237
   [27]   Train-mae=4.2168276017242
   [28]   Train-mae=3.97845429400603
   [29]   Train-mae=3.76002277152406
   [30]   Train-mae=3.83593577626679
   [31]   Train-mae=3.71273083243105
   [32]   Train-mae=3.6195217209061
   [33]   Train-mae=3.87637294858694
   [34]   Train-mae=3.79551469782988
   [35]   Train-mae=3.66272431297435
   [36]   Train-mae=3.81557290663322
   [37]   Train-mae=3.73366007523404
   [38]   Train-mae=3.57986994018157
   [39]   Train-mae=3.48014950540331
   [40]   Train-mae=3.70377947661612
   [41]   Train-mae=3.57618864927027
   [42]   Train-mae=3.52039705213573
   [43]   Train-mae=3.55409083919393
   [44]   Train-mae=3.64251783867677
   [45]   Train-mae=3.45834989514616
   [46]   Train-mae=4.0749755111999
   [47]   Train-mae=3.71578220281336
   [48]   Train-mae=3.50679727282789
   [49]   Train-mae=4.14914410233498
   [50]   Train-mae=3.7334080057674
   [51]   Train-mae=3.70052516341209
   [52]   Train-mae=3.91092829631435
   [53]   Train-mae=3.67670645170742
   [54]   Train-mae=3.63808277865251
   [55]   Train-mae=3.6711657425099
   [56]   Train-mae=3.45309603734149
   [57]   Train-mae=3.6100285096301
   [58]   Train-mae=3.47404639359977
   [59]   Train-mae=3.48650643053982
   [60]   Train-mae=3.44442329986228
   [61]   Train-mae=3.76036083085669
   [62]   Train-mae=3.61709513925844
   [63]   Train-mae=3.50891292694542
   [64]   Train-mae=3.74455126540528
   [65]   Train-mae=3.58543863587909
   [66]   Train-mae=3.46644685930676
   [67]   Train-mae=3.54323410050737
   [68]   Train-mae=3.64015973226892
   [69]   Train-mae=3.75287163681454
   [70]   Train-mae=3.48837021980021
   [71]   Train-mae=3.57296027570963
   [72]   Train-mae=3.77685429851214
                                           [ 190 ]
Taming Time Series Data Using Deep Neural Networks                                Chapter 4
   [73] Train-mae=3.63230897545815
   [74] Train-mae=3.46640953759352
   [75] Train-mae=3.42612222330438
   [76] Train-mae=3.59378631307019
   [77] Train-mae=3.71207889818483
   [78] Train-mae=3.67349873264631
   [79] Train-mae=3.91540269825194
   [80] Train-mae=3.75410354889101
   [81] Train-mae=3.74548216611147
   [82] Train-mae=3.48044820434517
   [83] Train-mae=3.45057798130645
   [84] Train-mae=3.56516027308173
   [85] Train-mae=3.82901276187764
   [86] Train-mae=3.57472546464867
   [87] Train-mae=3.5186486049162
   [88] Train-mae=3.48254795942042
   [89] Train-mae=3.5777530745003
   [90] Train-mae=3.46502910438511
   [91] Train-mae=3.50481863598029
   [92] Train-mae=3.94887298007806
   [93] Train-mae=3.85228754586644
   [94] Train-mae=3.39591218464904
   [95] Train-mae=3.32976617760129
   [96] Train-mae=3.46257649577326
   [97] Train-mae=3.74387675043609
   [98] Train-mae=3.49124042812321
   [99] Train-mae=3.49357031944725
   [100] Train-mae=3.66018298261695
For more details about the parameters, please refer to the MXNet R's documentation.
                                           [ 191 ]
Taming Time Series Data Using Deep Neural Networks                                   Chapter 4
Let us now write a small function to evaluate the performance of our model:
    > model.evaluate <- function(deep.model, new.data, actual){
    +   preds = predict(deep.model, new.data, array.layout = "rowmajor")
    +   error <- actual - preds
    +     return(mean(abs(error)))
    +
    + }
    >
The model.evaluate function takes the model, predicts the output, and compares the
output prediction with the actual values. It calculates and returns the mean absolute error.
We have a mean absolute error of 0.66 in our training data and 1.478 in the test data. This is
the average of the absolute difference between the actual stock close price and our
prediction. It's good to get such a low MAE. The plot of our predictions should show us the
best of this model.
Using the plot.data frame, let us plot the actual and predicted values:
    ggplot(plot.data,aes(x = seq_along(actual))) + geom_line(aes(y = actual,
    color = "Actual")) + geom_line(aes(y = predicted, color = "Predicted"))
                                             [ 192 ]
Taming Time Series Data Using Deep Neural Networks                      Chapter 4
Not bad, our model is trained well with the training data.
                                            [ 193 ]
Taming Time Series Data Using Deep Neural Networks                              Chapter 4
Once again, we pack the actual and predicted values in a plot.data frame for ease of
plotting. Let us look at the prediction plot:
Great, we can see great performance with our test data too.
                                           [ 194 ]
Taming Time Series Data Using Deep Neural Networks                          Chapter 4
                                           [ 195 ]
Taming Time Series Data Using Deep Neural Networks                                   Chapter 4
We can look into the weights learned in each of the layers. Let us look at the weights of the
first hidden layer:
   > deep.model$arg.params$fullyconnected18_weight
                  [,1]           [,2]           [,3]          [,4]           [,5]
   [,6]          [,7]
    [1,] -0.0083999438 8.485802e-04 0.0036450755 -0.05046703 8.365825e-03
   8.337282e-03 -0.0087173656
    [2,] 0.0013683429 -6.198279e-03 -0.0083191348 -0.05300714 -3.504494e-03
   -5.474154e-03 0.0020545013
                  [,8]         [,9]       [,10]            [,11]       [,12]
   [,13]       [,14]       [,15]
    [1,] -6.238837e-04 -0.02354868 -0.02122656 0.0090157799 -0.02873407
   -0.05015010 -0.06675677 -0.02245500
    [2,] -2.801131e-03 -0.01479008 -0.01589586 -0.0063846195 -0.02265893
   -0.05025507 -0.06229784 -0.02008835
               [,16]          [,17]       [,18]            [,19]       [,20]
   [,21]       [,22]          [,23]
    [1,] -0.09068959 0.0078839101 -0.05393362 -0.0087832464 -0.02230835
   -0.02040787 -0.08998983 -6.150414e-03
    [2,] -0.09574232 0.0016337502 -0.06588018 -0.0046719122 -0.02494662
   -0.03023234 -0.07455231 -1.322337e-03
              [,24]       [,25]        [,26]         [,27]          [,28]
   [,29]       [,30]       [,31]
    [1,] -0.1390225 -0.01965304 -0.02212325 -0.01884965 -0.0020784298
   -0.02215839 -0.06166138 -0.06153671
    [2,] -0.1502734 -0.01914297 -0.02417008 -0.02372178 -0.0086792232
   -0.02916201 -0.06242901 -0.06657734
                 [,32]        [,33]       [,34]         [,35]          [,36]
   [,37]         [,38]        [,39]
    [1,] -9.309854e-03 -0.06228027 -0.08344175 -0.05109306 -0.0040911064
   -0.09716180 -5.205772e-03 -0.04241674
    [2,] -8.990907e-03 -0.05343428 -0.08138016 -0.05199971 -0.0090623405
   -0.08794980 -1.023613e-03 -0.05040253
               [,40]       [,41]        [,42]         [,43]        [,44]
   [,45]       [,46]       [,47]
    [1,] -0.05714688 -0.02148283 -0.06252886 -0.06428687 -0.06470018
   -8.545434e-04 -0.05015615 -0.06178543
    [2,] -0.06344762 -0.01426465 -0.06747299 -0.06294472 -0.07089290
   -2.695541e-03 -0.05909692 -0.06171135
               [,48]          [,49]          [,50]           [,51]        [,52]
   [,53]      [,54]
    [1,] -0.02646469 -6.758149e-03 -0.0091832494 -0.0094310865 -0.03371038
   -0.05042901 -0.1533335
    [2,] -0.03239496 4.416389e-03 0.0053570746 -0.0086173108 -0.01760352
   -0.05314169 -0.1448236
                                           [ 196 ]
Taming Time Series Data Using Deep Neural Networks                                   Chapter 4
We have truncated the output. Similarly, other weights can be viewed. Note that your
weights may not be the same as the ones printed here.
Sometimes, looking at the learned weights of a neural network can provide insight into the
learning behavior. For example, if weights look unstructured, maybe some were not used at
all, or if very large coefficients exist, maybe regularization was too low or the learning rate
too high.
                                            [ 197 ]
Taming Time Series Data Using Deep Neural Networks                                Chapter 4
According to the diagram, the x axis corresponds to the row number and the y axis to the
column number, with column 1 at the bottom.
                                           [ 198 ]
Taming Time Series Data Using Deep Neural Networks     Chapter 4
                                             [ 199 ]
Taming Time Series Data Using Deep Neural Networks                                   Chapter 4
This one has some small sparks spread about. It's better than the previous ones.
Finally, let us talk about tuning the model. There are several knobs to be tuned in this
model, and some of them are:
It's typically suggested to use random search in parameter space to search for the
parameters. Grid search is not recommended for deep learning networks.
We provide a function here which can be used to perform a random search across different
parameters:
   random.search <- function(){
      # Sample layers
      count.layers <- sample(2:5, 1)
      no.layers <- c()
      activations <- c()
      for (i in 1:count.layers-1){
        # Sample node per layers
        no.nodes <- sample(10:50,1)
        no.layers[i] <- no.nodes
        activations[i] <- "relu"
      }
      no.layers <- append(no.layers, 1)
      activations <- append(activations, "relu")
      deep.model <- mx.mlp(data = train.x.data, label = train.y.data,
                           hidden_node = no.layers
                           , out_node = 1
                           , dropout = 0.50
                           , activation = activations
                           ,out_activation = "rmse"
                           , array.layout = "rowmajor"
                           , learning.rate = 0.01
                           , array.batch.size = 100
                           , num.round = 10
                           , verbose = TRUE
                           , optimizer = "adam"
                                            [ 200 ]
Taming Time Series Data Using Deep Neural Networks                                  Chapter 4
                            , eval.metric = mx.metric.mae
       )
       train.error <- model.evaluate(deep.model,train.x.data, train.y.data)
       test.error <- model.evaluate(deep.model,test.x.data, test.y.data)
       output <- list(layers = no.layers, activations <- activations
                      , train.error = train.error, test.error = test.error)
       return(output)
   }
   final.output = list()
   for (i in 1:2){
     out <- random.search()
     final.output[[i]] <- out
   }
You can go head and modify the code for other parameters too, such as dropout, learning
rate, and other parameters, to see how the models evolve.
Before we conclude, let us have some discussion around the suitability of deep learning
networks for time series analysis. Earlier, when we discussed time series, we referred to the
trend, seasonality, and error components in time series data. These lead to the non-
stationary aspect of time series data. We suggested that we should remove the seasonality
and the trend component from the time series data to use them for prediction in regression
models. When it comes to using a neural network for time series predictions, there are
several options available:
          Data Preprocessing:
                 1. Remove the trend and seasonality and use the non-stationary data for
                    prediction.
                 2. Arrange the data in a way that includes the seasonality correlation in
                    the input fed to the network.
                 3. Use a scheme of rolling windows as shown in this chapter.
          Network:
                1. Deep neural networks formed my multiple hidden layers.
                2. Recurrent neural networks (RNN).
                3. LSTM networks--a variation of the RNN network.
                                           [ 201 ]
Taming Time Series Data Using Deep Neural Networks                                  Chapter 4
In our example, we didn't do any data preprocessing to remove trend and seasonality.
There are several research papers available on this topic. Most of them are empirical studies
using well known time series data to analyze the effect of data preprocessing in prediction
output. One paper is an empirical study conducted by Sharda and Patil. They compared
prediction using ARIMA and Neural Networks on 75 time series data and found Neural
Network predictions to be more accurate. You can access the paper at https://link.
springer.com/article/10.1007/BF01577272.
We have successfully created a fully connected regression deep learning network to predict
the stock price. RNN and LTSM network APIs are currently not available with MXNet.
Hopefully, they should be available soon.
Complete R code
The following is the complete R code:
   # Time series data
   kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat",skip=3)
   king.ts <- ts(kings)
   king.ts
   install.packages("TTR")
   library(TTR)
   par(mfrow=c(2 ,2))
   plot(SMA(king.ts, n=2), main = "n=2")
   plot(SMA(king.ts, n=5), main = "n=5")
   plot(SMA(king.ts, n=10), main = "n=10")
   plot(SMA(king.ts, n=15), main = "n=15")
   par(mfrow=c(1,1))
   library(zoo)
   library(quantmod)
                                           [ 202 ]
Taming Time Series Data Using Deep Neural Networks                     Chapter 4
   x1 <- Lag(data,1)
   new.data <- na.omit(data.frame(Lag.1 = x1, y = data))
   head(new.data)
   plot(king.ts)
   # Introducing MXNet library
   library(mxnet)
   # Context
   mx.ctx.default()
   # Operations
   mx.nd.dot(ones.matrix, ones.matrix)
   mx.nd.elemwise.add(ones.matrix,ones.matrix)
   # Sampling
   mu <- mx.nd.array(c(0.0,2.5))
   sigma <- mx.nd.array(c(1,3))
   mx.nd.sample.normal(mu = mu, sigma = sigma)
   #   Symbolic Programming
   a   <- mx.symbol.Variable("a")
   a
   b   <- mx.symbol.Variable("b")
   b
   c   <- a + b
   c
                                           [ 203 ]
Taming Time Series Data Using Deep Neural Networks                     Chapter 4
mx.exec.update.arg.arrays(pexec, input_list)
mx.exec.forward(pexec)
pexec$arg.arrays
pexec$outputs
d <- mx.symbol.dot(a, b)
   mx.exec.update.arg.arrays(pexec, input_list)
   mx.exec.forward(pexec)
   pexec$arg.arrays
   pexec$outputs
                                           [ 204 ]
Taming Time Series Data Using Deep Neural Networks                       Chapter 4
   X <- data.matrix(mx.nd.concat(list(x.1,x.2,x.3,x.4)) )
   Y <- c(y.1,y.2,y.3,y.4)
   # Input layer
   data <- mx.symbol.Variable("data")
   # Hidden Layer
   hidden.layer <- mx.symbol.FullyConnected(data = data
                                            , num_hidden = 2)
   # Output Layer
   out.layer <- mx.symbol.FullyConnected(data = act
                                         , num.hidden = 2)
   # Softmax of output
   out <- mx.symbol.SoftmaxOutput(out.layer)
   # Perform Predictions
   X_test = data.matrix(rbind(c(0,0),c(1,1),c(1,0),c(0,1) ) )
   preds = predict(model, X_test, array.layout = "rowmajor")
   pred.label <- max.col(t(preds)) -1
                                           [ 205 ]
Taming Time Series Data Using Deep Neural Networks                     Chapter 4
pred.label
########################################
   # Feature generation
   data.ts <- as.ts(data)
   data.zoo <- as.zoo(data.ts)
set.seed(100)
                                           [ 206 ]
Taming Time Series Data Using Deep Neural Networks                          Chapter 4
   # Train/test split
   train.perc = 0.8
   train.indx = 1:as.integer(dim(final.data)[1] * train.perc)
   mx.set.seed(100)
   deep.model <- mx.mlp(data = train.x.data, label = train.y.data,
                        hidden_node = c(1000,500,250)
                       ,out_node = 1
                       ,dropout = 0.50
                       ,activation = c("relu", "relu","relu")
                       ,out_activation = "rmse"
                       , array.layout = "rowmajor"
                       , learning.rate = 0.01
                       , array.batch.size = 100
                       , num.round = 100
                       , verbose = TRUE
                       , optimizer = "adam"
                       , eval.metric = mx.metric.mae
   print("Train Error")
   model.evaluate(deep.model,train.x.data, train.y.data)
   print("Test Error")
   model.evaluate(deep.model,test.x.data, test.y.data)
                                           [ 207 ]
Taming Time Series Data Using Deep Neural Networks                     Chapter 4
graph.viz(deep.model$symbol)
   library(Matrix)
   # Look at the graph plot for the name of each layer
   # alternatively call deep.model$arg.params$ to see the name
   weights <- deep.model$arg.params$fullyconnected85_weight
   dim(weights)
   image(as.matrix(weights))
     # Sample layers
     count.layers <- sample(2:5, 1)
     no.layers <- c()
     activations <- c()
     for (i in 1:count.layers-1){
       # Sample node per layers
       no.nodes <- sample(10:50,1)
       no.layers[i] <- no.nodes
       activations[i] <- "relu"
     }
     no.layers <- append(no.layers, 1)
     activations <- append(activations, "relu")
     deep.model <- mx.mlp(data = train.x.data, label = train.y.data,
                          hidden_node = no.layers
                          , out_node = 1
                                           [ 208 ]
Taming Time Series Data Using Deep Neural Networks                                Chapter 4
                              , dropout = 0.50
                              , activation = activations
                              ,out_activation = "rmse"
                              , array.layout = "rowmajor"
                              , learning.rate = 0.01
                              , array.batch.size = 100
                              , num.round = 10
                              , verbose = TRUE
                              , optimizer = "adam"
                              , eval.metric = mx.metric.mae
       )
       train.error <- model.evaluate(deep.model,train.x.data, train.y.data)
       test.error <- model.evaluate(deep.model,test.x.data, test.y.data)
       output <- list(layers = no.layers, activations <- activations
                      , train.error = train.error, test.error = test.error)
       return(output)
   }
   final.output = list()
   for (i in 1:2){
     out <- random.search()
     final.output[[i]] <- out
   }
Summary
We started the chapter by introducing time series data and the traditional approaches to
solving them. We gave you an overview of deep learning networks and information on how
they learn. Furthermore, we introduced the MXNet R package. Then we prepared our stock
market data so that our deep learning network could consume it. Finally, we built two deep
learning networks, one for regression, where we predicted the actual closing price of the
stock, and one for classification, where we predicted whether the stock price would move
up or down.
In the next chapter, we will deal with sentiment mining. We will show how to extract
tweets in R, process them and use a dictionary based method to find the sentiments of the
tweets. Finally using those scored tweets as datasets we will build a Naive Bayes model
based on Kernel density estimate.
                                           [ 209 ]
                   Twitter Text Sentiment
                                                                                     5
               Classification Using Kernel
                        Density Estimates
Twitter and other social media applications are generating a lot of unstructured text today
at a great velocity. Today's product companies look at these social media applications as a
great source to give them an understanding of how consumers feel about their products or
services. Hence, the analysis of these unstructured texts has become very important for
these companies.
Text mining as field is gaining importance day by day. The invention of techniques such as
bag-of-words, word2vec embedding has made text data suitable for processing by most
machine learning algorithms. Analyzing the sentiment in a text is an active field of research.
              Sentiment analysis, also called opinion mining, is the field of study that
              analyzes people's opinions, sentiments, evaluations, appraisals, attitudes,
              and emotions towards entities such as products, services, organizations,
              individuals, issues, events, topics, and their attributes. Liu, Bing. Sentiment
              Analysis and Opinion Mining (Kindle Locations 199-201). Morgan and
              Claypool Publishers.
We will look at how to pre-process text so it can be efficiently used by machine learning
algorithms. We will introduce a recently published state of the art word weighting scheme
called Delta TFIDF. Delta TFIDF is a variation of the traditional tfidf weighting scheme,
specially designed for sentiment classification tasks. Sentiment classification can be done
using either a dictionary method or a machine learning approach. We will show you
examples of both.
Twitter Text Sentiment Classification Using Kernel Density Estimates                 Chapter 5
Given a text, the model should predict the sentiment of the text as either positive or
negative.
Any data we analyze is generated from a process. The underlying process emits the data
following a probability distribution. Kernel density estimate techniques help find the
underlying probability distribution. It helps find the probability density function for the
given sample of data. Once we have an idea about the distribution of the data, we can
leverage this understanding of the data to make decisions further down the line based on
the application.
In this chapter, our application is to classify a given text as either positive sentiment
oriented or negative sentiment oriented. In our training set, we will have both positively
oriented and negatively oriented sentences. We will be using kernel density
estimation (KDE) for this classification task. Using KDE, we can find the distribution for
positively oriented text and negatively oriented text. The assumption is that these two
distributions should be different.
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As
we work through our example we will introduce the R packages twitteR, sentimentr,
tidyr, tm, and naivebayes, which we will be using. During our code description, we will
be using some of the output printed in the console. We have included what will be printed
in the console immediately following the statement which prints the information to the
console, so as to not disturb the flow of the code.
                                              [ 211 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates               Chapter 5
We can generate histograms using the following code block for reference:
   > data <- rnorm(1000, mean=25, sd=5)
   > data.1 <- rnorm(1000, mean=10, sd=2)
   > data <- c(data, data.1)
   > hist(data)
   > hist(data, plot = FALSE)
   $breaks
    [1] 0 5 10 15 20 25 30 35 40 45
   $counts
   [1]   8 489 531 130 361 324 134           22     1
   $density
   [1] 0.0008 0.0489 0.0531 0.0130 0.0361 0.0324 0.0134 0.0022 0.0001
   $mids
   [1] 2.5      7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5
   $xname
   [1] "data"
   $equidist
   [1] TRUE
   attr(,"class")
   [1] "histogram"
This code creates two artificial data-sets and combines them. Both datasets are based on the
normal distribution; the first has a mean of 25 and standard deviation of 5, the second has a
mean of 10 and standard deviation of 2. If we recall from basic statistics, a normal
distribution is in a bell shape centered at the mean value, and 99% of the values are between
the mean-(3*sd) and mean+(3*sd). By combining these data from these two distributions,
the new dataset will have two peaks and a small area of overlap.
                                              [ 212 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates               Chapter 5
The output in R shows the bin size and the locations. Look at $breaks and $mids in the R
output.
We provide the hist function with the breaks parameter, which is a sequence from 0 to
50, with a step size of two. We want our bin size to be two rather than 10—the default value
R is selected.
                                              [ 213 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                 Chapter 5
As you can see, by varying the bin size, the representation of the data changes. You can test
the same effect by varying the bin locations or $mids. The histogram depends on having the
right bin size and bin location.
The alignment of the bins with the data points decide the height of a bar in the histogram.
Ideally, this height should be based on actual density of the nearby points. Can we stack the
bar in such way that the height is proportional to the points they represent. This is where
KDE one-ups the histogram.
A discrete random variable takes on a finite number of possible values. We can determine
the probability for all the possible value of that random variable using a probability mass
function. A continuous random variable takes on infinite number of possible values. We
cannot find the probability of a particular value, instead we find the probability that a value
falls in some interval. A probability density function (PDF) is used to find the probability
of a value falling in an interval for a continuous random variable.
[1]0.9544997
                                              [ 214 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                  Chapter 5
pnorm is the function to get the cumulative probability distribution function for normal
distributions, i.e. the probability that a given value is within a range of a normal
distribution. Here we calculate the probability that a value falls between 2 standard
deviations of the mean value, which we know from basic statistics is 95%. We can calculate
this probability by looking at the probability that the given value is greater than the
mean+(2*sd); we then multiply that by 2 (to account for the values less than mean-(2*sd)).
This gives us the probability the values are in the "tails" of the distribution, so we subtract
this value from 1 to get the probability that the values lie between mean +/- (2*sd) to give
the answer of 0.9544997. Note that even if you change the values of mean and sd, the result
is always 0.9544997.
Let us see a univariate example using the density function to approximate the PDF of the
given data:
    # Kernel Density estimation
    kde = density(data)
    plot(kde)
                                              [ 215 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                   Chapter 5
This smoothed out plot from KDE gives us a better idea about the distribution of the
underlying data. The density function in R uses a Gaussian kernel by default for each
point in the data. The parameters of KDE are the kernel, which is used at every point to
decide the shape of the distribution at each point, and the bandwidth, which defines the
smoothness of the output distribution.
Why use a kernel instead of calculating the density at each point and them summing it up?
Where h is the distance threshold. This may give a bumpy plot as each point xi in the
neighborhood defined by h receives the same weight.
Where K is the kernel and h is the smoothing parameter. The kernel gives weight to each
point xi based on its proximity to x0.
Look at the R help function to learn more about the density function:
     help(density)
Hopefully, this section gives an understanding of how to use kernel density estimate to
approximate the probability density function of the underlying dataset. We will be using
this PDF to build our sentiment classifier.
Refer to Chapter 6's, Kernel Smoothing methods section in The Elements of Statistical Learning,
by Trevor Hastie, Robert Tibshirani and Jerome Friedman for a rigorous discussion on
kernel density estimates.
                                              [ 216 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                 Chapter 5
Twitter text
We will leverage the twitteR package to extract tweets. Refer to https://cran.r-project.
org/web/packages/twitteR/index.html to get more information about this package.
In order to use this package, you need a Twitter account. With the account, sign in to
https://app.twitter.com and create an application. Use the consumer key, consumer
secret key, access token, and access token security keys from that page to authenticate into
Twitter.
We will retrieve only the English tweets using the searchTwitter function provided by
the twitteR package:
   tweet.results <- searchTwitter("@apple", n=1000,lang = "en")
   tweet.df <- twListToDF(tweet.results)
Using the function twListToDF, we convert our extracted tweets from Twitter to a
dataframe, tweet.df. Out of all the fields extracted, we are most interested in the name
text column.
We have removed all the retweets from our results and retain just the text column.
Our tweet.df data frame now only has a single text column.
                                              [ 217 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                    Chapter 5
We have removed the URLs, spaces, Unicode characters, and so on from our tweets.
Having extracted and cleaned up this Twitter text, let us proceed to build our text
classification.
Sentiment classification
Sentiment in text can be either positive or negative. We will stick to this definition for this
chapter. Sentiment mining is an active field of research. A starting point for someone to
learn the various aspects of sentiment mining is through the book Sentiment Analysis and
Opinion Mining, by Bing Liu.
Broadly speaking, sentiment mining problems are solved using the following techniques:
                                              [ 218 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                  Chapter 5
Dictionary methods
Sentiment lexicons, in which words are categorized as positive and negative, are used in
this technique. There are several lexicons available today:
We match these lexicons to the tokens (words) in the input data and we can then calculate
an overall sentiment, for example, based on the proportion of positive words to negative
words.
The tidyr package has three sentiment lexicons. In the book Text Mining with R: The Tidy
Approach, there is a full chapter on sentiment mining using lexicons: http://
tidytextmining.com/sentiment.html.
Our approach
In this chapter, we are going to leverage dictionary and machine learning methods.
                                              [ 219 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                   Chapter 5
After extracting our text, we are going to use the dictionary-based approach to first give
sentiment scores to the extracted tweets. Later, we will manually verify if the scores are
good. This is our gold set of data for building a classifier. We divide this data into two sets,
one we will use for training our classifier, and the other we will use to test our classifier.
Since we don't have any pre-labeled tweet data, we are following this approach.
Let us now proceed to build our gold dataset using a sentiment dictionary.
Let us see how to score using the sentiment function from the sentimentr package:
    > library(sentimentr, quietly = TRUE)
    > sentiment.score <- sentiment(tweet.df$text)
    > head(sentiment.score)
       element_id sentence_id word_count sentiment
                                              [ 220 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                    Chapter 5
    1:             1              1             8 0.0000000
    2:             2              1             8 0.3535534
    3:             3              1             3 0.0000000
    4:             3              2             4 0.0000000
    5:             3              3             7 0.0000000
    6:             4              1            14 -0.8418729
The sentiment function in sentimentr calculates a score between -1 and 1 for each of the
tweets. In fact, if a tweet has multiple sentences, it will calculate the score for each sentence.
A score of -1 indicates that the sentence has very negative polarity. A score of 1 means that
the sentence is very positive. A score of 0 refers to the neutral nature of the sentence.
               Refer to https://cran.r-project.org/web/packages/sentimentr/index.
               html for more details about the sentimentr package.
However, we need the score to beat the tweet level and not at the sentence level. We can
take the average value of the score across all the sentences in a tweet.
Here, the element_id refers to the individual tweet. By grouping by element_id and
calculating the average, we can get the sentiment score at the tweet level.
Let us now add the sentiment to our original tweet.df. data frame. Going forward, we
only need the text and its sentiment; let us subset those columns from tweet.df:
    > tweet.df$polarity <- sentiment.score$sentiment
    > tweet.final <- tweet.df[,c('text','polarity')]
                                              [ 221 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                  Chapter 5
We have our dataset prepared now, with the tweet text and the sentiment score.
We remove all records with a polarity value of 0. These are records with neutral sentiments.
If our polarity is less than zero, we mark the tweet as negative, otherwise, it is positive:
    > tweet.final <- tweet.final[tweet.final$polarity != 0, ]
    > tweet.final$sentiment <- ifelse(tweet.final$polarity < 0,
    "Negative","Positive")
    > tweet.final$sentiment <- as.factor(tweet.final$sentiment)
    > table(tweet.final$sentiment)
    Negative Positive
         200      168
Using ifelse, we have discretized the real-valued sentiment score into a binary variable.
We added an identifier column. With that, we have the training data ready. Finally, the
table command shows us the class distribution. Our class distribution is imbalanced.
Let us say we are building a discriminative model such as logistic regression or SVM, where
the model is trying to learn a boundary between the classes. It expects an equal number of
positive classes and negative classes. If the number of positive classes is greater than or less
than the number of negative classes in the dataset, we have a class imbalance problem.
There are several techniques to balance the dataset. Some of them are downsampling,
upsampling, SMOTE, and so on. We will leverage the function upSample in the caret
package to create more records in the minority class. This should produce better results in
the classification models.
    Negative Positive
         200      200
The upSample function looks at the class distribution and repeatedly samples from the
Positive class to balance. The final table command shows the new class distribution.
                                              [ 222 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                 Chapter 5
In this chapter, we are going to learn the distribution of positive and negative classes
independently and use them for our predictions. Hence, we don't need to do balance the
classes.
Before we move on to the next section, let us spend some time understanding the inner
workings of our dictionary-based sentiment function. The sentiment function utilizes a
sentiment lexicon (Jockers, 2017) from the lexicon package. It preprocesses the given text
as follows:
Each word is looked up in the lexicon, and positive and negative words are tagged with +1
and -1 respectively. Let us call the words which have received a score as the polarized
words. Not all words will receive a score. Only those found in the lexicons receive a score.
We can pass a customer lexicon through the polarity_dt parameter to
the sentiment function.
For each of the polarized words, n words before them and n words after them are
considered, and together they are called polarized context clusters. The parameter n can be
set by the user. The words in the polarized context cluster can be tagged as either of the
following:
          Neutral
          Negator
          Amplifier
          De-amplifier
          Adversative conjunctions
                                              [ 223 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                Chapter 5
Each polarized word is weighted now based on polarity_dt, and also weighted based on
the number of valence shifters/words surrounding it, which are tagged either as amplifiers
or adversative conjunctions. Neutrally tagged weights have no weights.
For more details about the weight and scoring, refer to R help for the sentiment function.
Text pre-processing
Before we build our model, we need to prepare our data so it can be provided to our model.
We want a feature vector and a class label. In our case, the class label can take two values,
positive or negative depending on if the sentence has a positive or a negative sentiment.
Words are our features. We will use the bag-of-words model to represent our text as
features. In a bag-words-model, the following steps are performed to transform a text into a
feature vector:
       1. Extract all unique individual words from the text dataset. We call a text
          dataset a corpus.
       2. Process the words. Processing typically involves removing numbers and other
          characters, placing the words in lowercase, stemming the words, and removing
          unnecessary white spaces.
       3. Each word is assigned a unique number and together they form the vocabulary.
          A word uknown is added to the vocabulary. This is for the unknown words we
          will be seeing in future datasets.
       4. Finally, a document term matrix is created. The rows of this matrix are the
          document IDs, and the columns are formed by the words from the vocabulary.
                                              [ 224 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                   Chapter 5
We need to pre-process our tweets in a similar manner. We will use the tm R package to
pre-process our twitter text.
The get.dtm function creates a document term matrix from the given data frame. It does
text pre-processing/normalization before creating the actual document term matrix.
Preprocessing involves removing any punctuation, removing numbers, stripping unwanted
white space, removing English stop words, and finally converts the text into lowercase.
                                              [ 225 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates             Chapter 5
The output is truncated, as we have not shown all the stop words. In text mining, stop
words are considered those which do not contribute much to the context of the text. Let's
say we have two documents--if we want to differentiate them, we cannot use stop words to
find what those two documents are uniquely representing. Hence, we typically remove
these words from our text as a pre-processing exercise.
We have around 62 documents and 246 words. Document term matrices are typically
sparse, as all words don't appear in all the documents. This brings us to the weighting
scheme. When we explained the document term matrix, we used a binary weightage
scheme. A one was added the cell if a word was present in a document; if the word did not
appear in the document, a zero was added. We can use a different weighting scheme called
term-frequency inverse document frequency.
The document frequency or df for a given word w is the number of documents in which the
word has occurred. Document frequency is considered the global weight.
                                              [ 226 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                     Chapter 5
idf( w, d) = log (number of documents in the corpus / 1 + document frequency of the word)
Term-frequency inverse document frequency or TFIDF is the product of tf and idf. They
define the importance of a word in our corpus. According to this scheme, the most
frequently occurring words are not given high importance since they don't carry enough
information to differentiate one tweet from another.
You can see the TFIDF value for the word fire in document 2 is 0.27.
TFIDF is a great weighting scheme. It has been successfully used in many text mining
projects and information retrieval projects. We are going to use a modified version of
TFIDF called Delta TFIDF for our feature generation.
Delta TFIDF
The problem with TFIDF is it fails to differentiate between words from the perspective of
implicit sentiments. During the calculation of TFIDF, no knowledge of the document
sentiment is added. Hence, it may not serve as a good differentiating feature for sentiment
classification.
Delta TFIDF was proposed by Justin Martineau and Tim Finin in their paper Delta TFIDF:
An Improved Feature Space for Sentiment
Analysis: http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf.
                                               [ 227 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                Chapter 5
Delta TFIDF is calculated for each word and document combination as follows:
          Finds Ctd, the number of times the word occurs in the document, term frequency
          of the word.
          Finds nt, the number of negative documents in which the word has occurred.
          Finds pt, the number of positive documents in which the word has occurred.
We get the document term matrix for our whole corpus. The rows are our document and
the columns are our vocabulary. We use term frequency for our weighting scheme. Our dtm
is very sparse. There are a lot of cells with zero values. Let us throw away some terms and
try to reduce the sparsity of our document term matrix.
   > dtm <- removeSparseTerms(dtm, 0.98)
   > dtm
   <<DocumentTermMatrix (documents: 368, terms: 58)>>
   Non-/sparse entries: 934/20410
   Sparsity           : 96%
   Maximal term length: 11
After reducing the sparseness, we convert our document term matrix to a matrix object.
Now let us split our data into a positive tweets dataset and a negative tweets dataset and
get their respective document term matrices:
   dtm.pos <- get.dtm('text','id', tweet.final[tweet.final$sentiment ==
   'Positive',],"weightBin")
   dtm.neg <- get.dtm('text','id', tweet.final[tweet.final$sentiment ==
   'Negative',],"weightBin")
                                              [ 228 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates           Chapter 5
dtm.post.mat and dtm.neg.mat are the matrix representations of the positive and the
negative tweet corpuses.
Similarly, neg.words.df contains the words and their document frequencies in the
negative corpus.
Let us get all the unique words and the document IDs:
   > tot.features <- colnames(dtm.mat)
   > doc.ids <- rownames(dtm.mat)
   for( i in 1:length(tot.features)){
     for ( j in 1:length(doc.ids)){
       # Number of times the term has occured in the document
       ctd <- dtm.mat[doc.ids[j], tot.features[i]]
       # Number for documents in pos data with the term
       pt <- pos.words.df[tot.features[i]]
       # Number for documents in pos data with the term
       nt <- neg.words.df[tot.features[i]]
       score <- ctd * log( nt / pt)
       if(is.na(score)){
         score <- 0
       }
       c.dtm.mat[doc.ids[j], tot.features[i]] <- score
     }
   }
Our dtm.mat has the term frequency for each word and document. We use pos.words.df
and neg.words.df to find the document frequency of a word in both positive and negative
corpuses. Finally, we update c.dtm.mat with the new score.
                                              [ 229 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                   Chapter 5
We have calculated the Delta TFIDF. This brings us to the end of this section. We have
prepared our data to be consumed by the model. Let us proceed to build our sentiment
classification model.
We have a set of tweets positively labeled. Another set of tweets negatively labeled. The
idea is to learn the PDF of these two data sets independently using kernel density
estimation.
Here, P(x | label) is the likelihood, P(label) is prior, and P(x) is the evidence. Here the label
can be positive sentiment or negative sentiment.
Using the PDF learned from kernel density estimation, we can easily calculate the
likelihood, P(x | label)
For any new tweet, we can now calculate using the Bayes Rule,
    P(Label = Positive | words and their delta tfidf weights)
Viola, we have our sentiment classifier assembled based on kernel density estimation.
Before we jump into our use case, let us see quickly how KDE can be leveraged to classify
the Iris data set. More information about this data set is available in https://archive.ics.
uci.edu/ml/datasets/iris.
                                               [ 230 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                 Chapter 5
We pass the first 4 columns as our features, x and the last column species as our y variable.
There are three classes in this dataset. There is a total of 150 records, 50 records per each
class.
We have a plot for every column in iris dataset. Let us look at the petal.width column.
There are three density plots each representing, P(Petal.width, setosa),
P(Petal.width, versicolor), and finally P(Petal.width, virginica) one for each
class variable. Each of them represents the underlying distribution of Petal.width for
each one of the classes.
                                              [ 231 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                Chapter 5
Other plots can be interpreted similarly. Using this PDF we can now classify the data into
one of the three classes. Hopefully, this gives an idea of how the underlying distribution
discovered by KDE can be used to separate the records into their respective class instance.
We use the naivebayes R package. The function naive_bayes is used to build the model.
You can see that we have set the useKernel parameter to TRUE. This informs the function
to use KDE for calculating the likelihood.
This will help you view the various properties of this model.
Having built the model, we can use the standard predict function to predict the label for
unknown tweets.
    library(caret)
    confusionMatrix(preds, tweet.final$sentiment)
               Reference
    Prediction Negative Positive
      Negative       128       58
      Positive        72      110
                    Accuracy : 0.6467
                      95% CI : (0.5955, 0.6956)
        No Information Rate : 0.5435
        P-Value [Acc > NIR] : 3.744e-05
                       Kappa : 0.2928
     Mcnemar's Test P-Value : 0.2542
                 Sensitivity : 0.6400
                 Specificity : 0.6548
                                              [ 232 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                   Chapter 5
We have an accuracy of 64%. Can we understand why we reached this number? A good
way to investigate this is to look at the features fed to this model. In this case, the words we
have fed as features. Looking at the PDF of words for both positive and negative sentiment,
we should be able to throw some light on our model performance.
Let us look at some of the variables and their PDF for positive and negative sentiments:
The graph represents a subset of words used for classification. It is evident that underlying
PDF estimated by kernel density estimate is not much different for the positive and the
negative classes. PDF for positive and negative are represented by the red and green line. In
most of the cases both the PDF are overlapping each other. Compare these plots to the plot
we generated for the Iris dataset, where the PDFs were distinctly separate. Hence the
model does not have enough classification power.
                                              [ 233 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                  Chapter 5
That brings us to the end of the section. We have shown you a simple Naive Bayes classifier
using KDE estimates for calculating the likelihood. You can go ahead and use other
classification methods and try to compare the results. Another good exercise would be to
compare the normal TFIDF features to the Delta TFIDF features.
Let us first load all the necessary libraries and authenticate into our Twitter account:
    library(shiny)
    library(twitteR, quietly = TRUE)
    library(stringr)
    library(sentimentr, quietly = TRUE)
    library(dplyr, quietly = TRUE)
                                              [ 234 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates              Chapter 5
We have a navigation panel at the top. The first navigation panel has a textbox in which we
can enter our search string. A default value is provided, so when you open the application,
you will have the search results available for that string. We have a data table output
following the textbox. This is where we display the tweet results.
In the second navigation panel, we have the data table output, where we want to show the
tweets and their sentiments.
                                              [ 235 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates                 Chapter 5
tweet.df is declared as a reactive expression. It will change if the text in our search box
changes. Here, we search Twitter for the search string provided. Capture the returned
tweets, clean them up, calculate their polarity scores, and keep them ready.
It's simple from now on. output$results return a data frame. It takes the data frame from
our tweet.df reactive expression. This one only returns the text column.
output$sentiment behaves the same way as output$results, except that it returns the
whole data frame.
              Reactive expressions are expressions that can read reactive values and call
              other reactive expressions. Whenever reactive value changes, any reactive
              expressions that depended on it are marked as invalidated and will
              automatically re-execute if necessary. - RShiny web page.
                                              [ 236 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates    Chapter 5
We can change the search string and have the new results shown:
You can see that we have changed the query to R data science.
In the preceding screenshot, we can see the sentiment for the tweets.
                                              [ 237 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates   Chapter 5
Complete R code
The following is the complete R code used for this project:
   # Generate data
   data <- rnorm(1000, mean=25, sd=5)
   data.1 <- rnorm(1000, mean=10, sd=2)
   data <- c(data, data.1)
   # Plot histogram
   hist(data)
   # View the bins
   hist(data, plot = FALSE)
   library(pdfCluster)
   kde.1 <- kepdf(data)
   plot(kde.1)
   kde.1@kernel
   kde.1@par
   # Twitter Text
   library(twitteR, quietly = TRUE)
                                              [ 238 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates   Chapter 5
   # KDE Classifier
   library(tm)
   get.dtm <- function(text.col, id.col, input.df, weighting){
                                              [ 239 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates      Chapter 5
stopwords("english")
   # delta tf-idf
   dtm.pos <- get.dtm('text','id', tweet.final[tweet.final$sentiment ==
   'Positive',],"weightBin")
   dtm.neg <- get.dtm('text','id', tweet.final[tweet.final$sentiment ==
   'Negative',],"weightBin")
   for( i in 1:length(tot.features)){
     for ( j in 1:length(doc.ids)){
       # Number of times the term has occured in the document
       ctd <- dtm.mat[doc.ids[j], tot.features[i]]
       # Number for documents in pos data with the term
       pt <- pos.words.df[tot.features[i]]
       # Number for documents in pos data with the term
       nt <- neg.words.df[tot.features[i]]
       score <- ctd * log( nt / pt)
       if(is.na(score)){
         score <- 0
       }
       dtm.mat[doc.ids[j], tot.features[i]] <- score
     }
   }
library(naivebayes)
                                              [ 240 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates   Chapter 5
   consumer.key <-
   consumer.secret <-
   access.token <-
   token.secret <-
                                              [ 241 ]
Twitter Text Sentiment Classification Using Kernel Density Estimates               Chapter 5
   ui <- fluidPage(
     navbarPage("TweetSenti",
                tabPanel("Search Tweets"
                         , textInput("search","search",value="#bladerunner")
                         , dataTableOutput("results")
                         ),
                tabPanel("Tag Sentiment"
                         ,dataTableOutput("sentiment"))
     )
   )
Summary
We started this chapter with a discussion about the KDE and its usefulness in
understanding the underlying distribution of data. We proceeded by explaining how to
extract tweets from Twitter for a given search string in R. Then, we proceeded to explain the
sentiment ming, dictionary, and machine learning approaches. Using a dictionary
approach, we calculated the sentiment scores for the tweets. We further explained text pre-
processing routines required to prepare the text data. We covered weighting schemes for
creating document term matrixes. We discussed the classic tfidf and the new Delta
TFIDF schemes. We created our training set using the Delta TFIDF scheme. Using this
training set, we finally built a Naive Bayes KDE classifier to classify tweets based on the
sentiment the text carried.
In the next chapter, we will be working on Record Linakage. A master data management
technique to do data dedpulication.
                                              [ 242 ]
             Record Linkage - Stochastic
                                                                                    6
                  and Machine Learning
                            Approaches
In a large database of records, synonymous records pose a great problem. Two records
referring to the same entity are considered to be synonymous. In the absence of a common
identifier, such as a primary key or foreign key, joining such records based on the entities is
a tough task. Let's illustrate this with a quick example. Consider the following two records:
Sno First name Middle name Last name Address                        City        State Zip
1     John        NULL           NULL        312 Delray Ave         Deer Field FL     33433
2     John        NULL           Sanders     312 Delray Beach Ave Deer Field FL       33433
Both the records refer to the same entity, one Mr. John. Record linkage refers to an umbrella
of algorithms that are designed to solve the exact same problem. Record linkage plays a key
role today in various applications such as CRM, Loyalty to name a few. They are an integral
part of today's sophisticated business intelligence systems and master data management
systems.
Record Linkage - Stochastic and Machine Learning Approaches                               Chapter 6
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As
we work through our example we will introduce the R packages RecordLinkage we will
be using. During our code description, we will be using some of the output printed in the
console. We have included what will be printed in the console immediately following the
statement which prints the information to the console, so as to not disturb the flow of the
code.
                                              [ 244 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
The preceding image is a snapshot of the data from our customer and Solit. Assume that we
have merged the database records from both the parties. The record number indicates the
source of the record, that is whether they came from our customer or Solit. If we can find
duplicates in this dataset, that should serve as the first step in identifying the customers
channeled by Solit.
                                             [ 245 ]
Record Linkage - Stochastic and Machine Learning Approaches                         Chapter 6
Our data, RLdata500, has 500 records and 7 variables, which includes first name, last
name, and date of birth details. The first and last names are separated into two components
denoted by the suffixes, _c1 and _c2. The date of birth is split into year, day, and month.
Let's look at the steps that we are going to follow to implement record linkage:
Feature generation is the first step in record linkage. Once we have the desired features, we
can solve the record linkage problem either using a stochastic approach or by fitting a
machine learning model to the generated features.
                                             [ 246 ]
Record Linkage - Stochastic and Machine Learning Approaches                          Chapter 6
Feature generation
We will use compare.dedup to generate the features:
   > rec.pairs <- compare.dedup(RLdata500
   +                        ,blockfld = list(1, 5:7)
   +                        ,strcmp =   c(2,3,4)
   +                        ,strcmpfun = levenshteinSim)
   > summary(rec.pairs)
   500 records
   1221 record pairs
   0 matches
   0 non-matches
   1221 pairs with unknown status
Our constraints are represented in a list. We say we need either to match the first column or
columns 5 up to 7 for two records to qualify to become a pair. You can see in the results that
we are finally left with only 1,221 pairs:
   Deduplication Data Set
   500 records
   1221 record pairs
   0 matches
   0 non-matches
   1221 pairs with unknown status
                                             [ 247 ]
Record Linkage - Stochastic and Machine Learning Approaches                           Chapter 6
The entry pairs in the list form a data frame that has all the generated features. We capture
this data frame under the name matches:
   matches <- rec.pairs$pairs
Let's look at the first two rows of this data frame. Each row compares two records:
         id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match
   1       1 174        1       NA 0.1428571      NA 0 0 0          NA
   2       1 204        1       NA 0.0000000      NA 0 0 0          NA
                                             [ 248 ]
Record Linkage - Stochastic and Machine Learning Approaches                           Chapter 6
The first instance compares records 1 and 174. There is a perfect match in the first
component of the first name. Both the entities do not have a second component for the first
name. We see a float number in the first component of the last name. This number is the
output of a string comparison. There is no match in the date of birth fields. The final column
is a is_match indicating if we have a match. We will get to the last column later. Let's start
with the string comparison.
String features
Let's look at the dedup function invocation once again:
    > rec.pairs <- compare.dedup(RLdata500
    +                        ,blockfld = list(1, 5:7)
    +                        ,strcmp =   c(2,3,4)
    +                        ,strcmpfun = levenshteinSim).
The strcmp and strcmpfun parameters dictate on which fields we need to do string
comparison and what kind of string comparison we need to apply. We pass a vector
indicating the column IDs to strcmp. We need to do string comparisons in columns 2, 3,
and 4. We want to use the Levenshtein distance to find the similarity between two strings.
Let's look at the records 1 and 174; we see the first name matching, but no match with the
rest of the fields. The Levenshtein distance of 0.142857 also states how far the last names are
from each other:
    > RLdata500[1,]
      fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
    1 CARSTEN       <NA>   MEIER     <NA> 1949 7 22
    > RLdata500[174,]
        fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
    174 CARSTEN       <NA> SCHMITT     <NA> 2001 6 27
                                             [ 249 ]
Record Linkage - Stochastic and Machine Learning Approaches                          Chapter 6
We need to state the exact columns where string comparisons should be applied. Not
specifying this may lead to unexpected results:
   > rec.pairs.matches <- compare.dedup(RLdata500
   +                                     ,blockfld = list(1, 5:7)
   +                                     ,strcmp =   TRUE
   +                                     ,strcmpfun = levenshteinSim)
   > head(rec.pairs.matches$pairs)
     id1 id2 fname_c1 fname_c2 lname_c1 lname_c2     by bm bd is_match
   1   1 174        1       NA 0.1428571        NA 0.00 0.5 0.5       NA
   2   1 204        1       NA 0.0000000        NA 0.50 0.5 0.0       NA
   3   2   7        1       NA 0.3750000        NA 0.75 0.5 0.0       NA
   4   2 43         1       NA 0.8333333        NA 1.00 1.0 1.0       NA
   5   2 169        1       NA 0.0000000        NA 0.50 0.0 0.5       NA
   6   4 19         1       NA 0.1428571        NA 0.50 0.5 0.0       NA
You can see that the function has also calculated the string comparisons for the date of birth
fields!
              We had excluded the first column from string comparison. It was used as
              a blocking field. However, we can use string comparison for the first
              column. In that case, the string comparison is performed before the
              blocking.
Phonetic features
The RecordLinkage package includes Soundex and Pho_h algorithms to compare string
columns. In our example, we want to use columns 2, 3, and 4 for string comparison,
specified by the list we pass to the phonetic parameter, and use the Pho_h function by
passing it to the phonfun parameter:
                                             [ 250 ]
Record Linkage - Stochastic and Machine Learning Approaches                       Chapter 6
If we compare the results with the string matching output, we see that we find no match
between record IDs 1 and 174. Let's look at instance 4, where records id1, 2 and id2, 43
are compared:
   > RLdata500[2,]
      fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
   2      GERD     <NA>    BAUER     <NA> 1968 7 27
   > RLdata500[43,]
       fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
   43      GERD     <NA>   BAUERH     <NA> 1968 7 27
The last name in those cases sound similar, hence the algorithm has captured them as
similar records.
                                             [ 251 ]
Record Linkage - Stochastic and Machine Learning Approaches                             Chapter 6
We will show how to leverage two methods, emWeights and epiWeights, implemented in
the RecordLinkage package.
P (features | match = 0) and P (features | match = 1) are estimated using the expectation
maximization algorithm. The weights are calculated as the ratio of these two probabilities.
This approach is called the Fellegi-Sunter model.
    >   library(RecordLinkage)
    >   data("RLdata500")
    >   rec.pairs <- compare.dedup(RLdata500
    +                              ,blockfld = list(1, 5:7)
    +                              ,strcmp =   c(2,3,4)
    +                              ,strcmpfun = levenshteinSim)
    >   pairs.weights <- emWeights(rec.pairs)
    > hist(pairs.weights$Wdata)
    >
                                             [ 252 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
As seen in the feature generation section, we use the dedup function to generate string
comparison features. With the features, we invoke the emWeights function to get the
Fellegi-Sunter weights. The output of emWeights is a list:
   > str(pairs.weights)
   List of 8
     $ data       :'data.frame':    500 obs. of 7 variables:
      ..$ fname_c1: Factor w/ 146 levels "ALEXANDER","ANDRE",..: 19 42 114 128
   112 77 42 139 26 99 ...
      ..$ fname_c2: Factor w/ 23 levels "ALEXANDER","ANDREAS",..: NA NA NA NA
   NA NA NA NA NA NA ...
      ..$ lname_c1: Factor w/ 108 levels "ALBRECHT","BAUER",..: 61 2 31 106 50
   23 76 61 77 30 ...
      ..$ lname_c2: Factor w/ 8 levels "ENGEL","FISCHER",..: NA NA NA NA NA NA
   NA NA NA NA ...
      ..$ by      : int [1:500] 1949 1968 1930 1957 1966 1929 1967 1942 1978
   1971 ...
      ..$ bm      : int [1:500] 7 7 4 9 1 7 8 9 3 2 ...
      ..$ bd      : int [1:500] 22 27 30 2 13 4 1 20 4 27 ...
     $ pairs      :'data.frame':    1221 obs. of 10 variables:
      ..$ id1     : num [1:1221] 1 1 2 2 2 4 4 4 4 4 ...
      ..$ id2     : num [1:1221] 174 204 7 43 169 19 50 78 83 133 ...
      ..$ fname_c1: num [1:1221] 1 1 1 1 1 1 1 1 1 1 ...
      ..$ fname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
      ..$ lname_c1: num [1:1221] 0.143 0 0.375 0.833 0 ...
      ..$ lname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
      ..$ by      : num [1:1221] 0 0 0 1 0 0 1 0 0 0 ...
      ..$ bm      : num [1:1221] 0 0 0 1 0 0 0 0 1 0 ...
      ..$ bd      : num [1:1221] 0 0 0 1 0 0 0 0 0 0 ...
      ..$ is_match: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
     $ frequencies: Named num [1:7] 0.00685 0.04167 0.00926 0.11111 0.01163 ...
      ..- attr(*, "names")= chr [1:7] "fname_c1" "fname_c2" "lname_c1"
   "lname_c2" ...
     $ type       : chr "deduplication"
     $ M          : num [1:128] 0.000355 0.001427 0.004512 0.01815 0.001504 ...
     $ U          : num [1:128] 2.84e-04 8.01e-06 2.52e-05 7.10e-07 2.83e-06
   ...
     $ W          : num [1:128] 0.322 7.477 7.486 14.641 9.053 ...
     $ Wdata      : num [1:1221] -10.3 -10.3 -10.3 12.8 -10.3 ...
     - attr(*, "class")= chr "RecLinkData"
   >
                                             [ 253 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
The Wdata vector stores the weights for Record Linkage based on an EM algorithm, higher
values indicate better matches. Let's plot this data as a histogram to look at the weights
distribution:
The histogram is skewed with a lot of negative weights and very few positive weights. This
gives us a hint that we have very few matches in our dataset. Alternatively, we can view the
weight distribution as follows:
   > summary(pairs.weights)
   500 records
   1221 record pairs
   0 matches
   0 non-matches
   1221 pairs with unknown status
Weight distribution:
                                             [ 254 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
The getPairs function conveniently gives the weights for the pair:
   > weights.df<-getPairs(pairs.weights)
   > head(weights.df)
      id fname_c1 fname_c2 lname_c1 lname_c2             by bm bd       Weight
   1 48    WERNER     <NA> KOERTIG       <NA>          1965 11 28
   2 238 WERNIER      <NA> KOERTIG       <NA>          1965 11 28    29.628078
   3
   4 68    PETEVR     <NA>    FUCHS      <NA>          1972   9 12
   5 190    PETER     <NA>    FUCHS      <NA>          1972   9 12   29.628078
   6
For record IDs 48 and 238, the weight is 29.62. The higher the weight is, the more
probability there is of a match. With the weights, now we can use a threshold-based
classification model. We can derive the thresholds from either the histogram or the weight
distribution. We are going to choose the upper threshold, that is, for a match, we need a
weight of 10 or more. For a no match, we set the lower threshold as 5, and any entity pairs
with less than 5 will be tagged as a no match. The emClassify function is used to classify
the entities as match and no match:
   > pairs.classify <- emClassify(pairs.weights, threshold.upper = 10,
   threshold.lower = 5)
   > summary(pairs.classify)
   500 records
   1221 record pairs
   0 matches
   0 non-matches
   1221 pairs with unknown status
Weight distribution:
   51 links detected
   2 possible links detected
   1168 non-links detected
Classification table:
                                             [ 255 ]
Record Linkage - Stochastic and Machine Learning Approaches                      Chapter 6
               classification
   true status     N    P     L
          <NA> 1168     2   51
The label N stands for no match or no links found. Label P stands for possible matches and
label L for matches aka links founds. We see that with our given threshold, 51 matches were
found. Let's make a single data frame to collate all our results:
   > final.results <- pairs.classify$pairs
   > final.results$weight <- pairs.classify$Wdata
   > final.results$links <- pairs.classify$prediction
   > head(final.results)
     id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match    weight
   links
   1   1 174        1       NA 0.1428571       NA 0 0 0         NA -10.28161
   N
   2   1 204        1       NA 0.0000000       NA 0 0 0         NA -10.28161
   N
   3   2   7        1       NA 0.3750000       NA 0 0 0         NA -10.28161
   N
   4   2 43         1       NA 0.8333333       NA 1 1 1         NA 12.76895
   L
   5   2 169        1       NA 0.0000000       NA 0 0 0         NA -10.28161
   N
   6   4 19         1       NA 0.1428571       NA 0 0 0         NA -10.28161
   N
   >
                                             [ 256 ]
Record Linkage - Stochastic and Machine Learning Approaches                          Chapter 6
                                             [ 257 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
Weights-based method
The epiWeights function implements the weights-based method. R documentation has a
nice introduction to the weights-based method:
   > help("epiWeights")
              For more details about the weights method, refer to P. Contiero et al. The
              Epilink record linkage software.
              Methods Inf Med., 44(1):66–71, 2005.
The mechanism of invoking and finally generating the results is very similar to how we did
it using emWeights:
   library(RecordLinkage)
   data("RLdata500")
   # weight calculation
   rec.pairs <- compare.dedup(RLdata500
                              ,blockfld = list(1, 5:7)
                              ,strcmp =   c(2,3,4)
                              ,strcmpfun = levenshteinSim)
                                             [ 258 ]
Record Linkage - Stochastic and Machine Learning Approaches                      Chapter 6
One again, the distribution is similar to our histogram from the emWeights method. Let's
see an alternate view of the weights distribution:
   > summary(pairs.weights)
   500 records
   1221 record pairs
   0 matches
   0 non-matches
   1221 pairs with unknown status
Weight distribution:
                                             [ 259 ]
Record Linkage - Stochastic and Machine Learning Approaches              Chapter 6
    # Classification
    pairs.classify <- emClassify(pairs.weights, threshold.upper = 0.5,
    threshold.lower = 0.3)
                                             [ 260 ]
Record Linkage - Stochastic and Machine Learning Approaches                          Chapter 6
   > head(final)
     id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match links
   fname_c1.1 fname_c2.1 lname_c1.1
   1 106 175         1        NA 1.0000000    NA 1 0 1          NA     L
   ANDRE       <NA>      MUELLER
   2 108 203         1        NA 1.0000000    NA 0 1 1          NA     L
   GERHARD       <NA> FRIEDRICH
   3 112 116         1        NA 0.8000000    NA 1 1 1          NA     L
   GERHARD       <NA>        ERNSR
   4 120 165         1        NA 0.8750000    NA 1 1 1          NA     L
   FRANK       <NA>    BERGMANN
   5 125 193         1        NA 0.8750000    NA 1 1 1          NA     L
   CHRISTIAN        <NA>    MUELLEPR
   6 127 142         1        NA 0.8333333    NA 1 1 0          NA     L
   KARL       <NA>        KLEIN
     lname_c2.1 by.1 bm.1 bd.1       Weight
   1       <NA> 1976      2   25 0.6910486
   2       <NA> 1987      2   10 0.6133400
   3       <NA> 1980     12   16 0.7518301
   4       <NA> 1998     11     8 0.7656562
   5       <NA> 1974      8     9 0.7656562
   6       <NA> 2002      6   20 0.6228760
   >
Unsupervised learning
Let's start with an unsupervised machine learning technique, K-means clustering. K-means
is a well-known and popular clustering algorithm and works based on the principles of
expectation maximization. It belongs to the class of iterative descent clustering methods.
Internally, it assumes the variables are of quantitative type and uses Euclidean distance as a
similarity measure to arrive at the clusters.
                                             [ 261 ]
Record Linkage - Stochastic and Machine Learning Approaches                           Chapter 6
The K is a parameter to the algorithm. K stands for the number of clusters we need. Users
need to provide this parameter.
Unsupervised techniques are appealing as they don't require us to build a training set. For
us to use a supervised technique, we need to create a training set, a tuple consisting of our
features, and a label to say if that record pair is a match or no-match. Till now in our case,
we haven't had a training dataset. Hence, let's start with an unsupervised technique:
    >   library(RecordLinkage, quietly = TRUE)
    >   data("RLdata500")
    >   rec.pairs <- compare.dedup(RLdata500
    +                              ,blockfld = list(1, 5:7)
    +                              ,strcmp =   c(2,3,4)
    +                              ,strcmpfun = levenshteinSim)
    >   kmeans.model <- classifyUnsup(rec.pairs, method = "kmeans")
    >   summary(kmeans.model)
    500 records
    1221 record pairs
    0 matches
    0 non-matches
    1221 pairs with unknown status
Classification table:
                classification
    true status     N    P     L
           <NA> 1071     0 150
                                             [ 262 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
We generate our features using the compare.dedup method. This time we are sticking to
string-based features. The classifyUnSup method invokes K-means from the package,
e1071. A K value of 2 is supplied to the algorithm. As we see the output of the algorithm,
we see that K-means has created two groups, one for no link represented by N and has
around 1,071 records assigned to that cluster. Around 150 are assigned to the L cluster:
   > final.results <- kmeans.model$pairs
   > final.results$prediction <- kmeans.model$prediction
   > head(final.results)
     id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match prediction
   1   1 174        1       NA 0.1428571       NA 0 0 0         NA          N
   2   1 204        1       NA 0.0000000       NA 0 0 0         NA          N
   3   2   7        1       NA 0.3750000       NA 0 0 0         NA          N
   4   2 43         1       NA 0.8333333       NA 1 1 1         NA          L
   5   2 169        1       NA 0.0000000       NA 0 0 0         NA          N
   6   4 19         1       NA 0.1428571       NA 0 0 0         NA          N
   >
We extract the input pairs and the prediction, where L stands for link available aka match
and N stands for no link available aka no match.
In cases where we have the ground truth, that is features for the tuples and also the ground
truth saying if the tuples are representing the same entity, then we can leverage
classification modes in supervised settings. Let us look at the supervised learning
Supervised learning
In a supervised learning scenario, we need to provide the algorithm with a set of training
tuples. Each tuple has our features from record pairs and a label classifying the tuple as
either a match or no match. In our case, we don't have any labeled data.
                                             [ 263 ]
Record Linkage - Stochastic and Machine Learning Approaches                           Chapter 6
If you see the output of rec.pairs$pairs, you will notice that now the is_match column
says if the record pair is a match or no match. Previously, when we did not provide the
identity parameter, it was initialized to NA. We are going to leverage this output to train our
classification model:
    >   train <- getMinimalTrain(rec.pairs)
    >   model <- trainSupv(train, method ="bagging")
    >   train.pred <- classifySupv(model, newdata = train)
    >   test.pred <- classifySupv(model, newdata = rec.pairs)
    >
    >   summary(train.pred)
    500 records
    17 record pairs
    9 matches
    8 non-matches
    0 pairs with unknown status
    9 links detected
    0 possible links detected
    8 non-links detected
Classification table:
               classification
    true status N P L
          FALSE 8 0 0
          TRUE 0 0 9
    > summary(test.pred)
                                             [ 264 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
   500 records
   1221 record pairs
   49 matches
   1172 non-matches
   0 pairs with unknown status
   52 links detected
   0 possible links detected
   1169 non-links detected
Classification table:
               classification
   true status     N    P     L
         FALSE 1168     0     4
         TRUE      1    0   48
   >
Using the getMinimalTrain function, we get a small set of records from our rec.pairs
as our training data. We initialize and train a bagging model using this data with the
trainSupv function. Finally, using the classifySupv function, we run our model on our
training data and test data, which is the whole rec.pairs in this case to get the
predictions. Finally, using the summary function, we can look at the accuracy of our model.
We have a 100% accurate model in our training set. RecordLinkage supports a lot of
classification models, including neural networks, svm, bagging, and trees. As we have
a very small training set, it is advisable to use bagging or svm. Finally, we have around
99% accuracy on our whole dataset.
Alternatively, we can leverage our unsupervised clustering output. Use that as an initial
training set to build our first supervised learning model:
   > rec.pairs <- compare.dedup(RLdata500
   +                            ,blockfld = list(1, 5:7)
   +                            ,strcmp =   c(2,3,4)
   +                            ,strcmpfun = levenshteinSim)
   >
   > # Run K-Means Model
   > kmeans.model <- classifyUnsup(rec.pairs, method = "kmeans")
   >
                                             [ 265 ]
Record Linkage - Stochastic and Machine Learning Approaches                    Chapter 6
We pass our rec.pairs to the clustering method and extract the pairs with their
predictions. We want to replace the is_match column with our predictions. However, the
values should be 0 or 1 in the is_match column instead of N or L:
   >   pairs$is_match <- NULL
   >   pairs$is_match <- ifelse(pairs$prediction == 'N', 0,1)
   >   pairs$prediction <- NULL
   >   pairs[is.na(pairs)] <- 0
   >   head(pairs)
       id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match
   1     1 174        1         0 0.1428571       0 0 0 0          0
   2     1 204        1         0 0.0000000       0 0 0 0          0
   3     2   7        1         0 0.3750000       0 0 0 0          0
   4     2 43         1         0 0.8333333       0 1 1 1          1
   5     2 169        1         0 0.0000000       0 0 0 0          0
   6     4 19         1         0 0.1428571       0 0 0 0          0
   >
   > rec.pairs$pairs <- pairs
   > head(rec.pairs$pairs)
     id1 id2 fname_c1 fname_c2         lname_c1 lname_c2 by bm bd is_match
   1   1 174        1         0       0.1428571        0 0 0 0           0
   2   1 204        1         0       0.0000000        0 0 0 0           0
   3   2   7        1         0       0.3750000        0 0 0 0           0
   4   2 43         1         0       0.8333333        0 1 1 1           1
   5   2 169        1         0       0.0000000        0 0 0 0           0
   6   4 19         1         0       0.1428571        0 0 0 0           0
   >
   >
                                             [ 266 ]
Record Linkage - Stochastic and Machine Learning Approaches                         Chapter 6
Now, having changed our is_match column to binary, we replace the original pairs data
frame in rec.pairs with our modified data frame pairs. With this achieved, we can follow
the process of generating a small train test, building a model, and verifying the accuracy of
the model as we did in the previous section:
   train <- getMinimalTrain(rec.pairs)
   Warning message:
   In getMinimalTrain(rec.pairs) :
     Comparison patterns in rpairs contain string comparison values!
   > model <- trainSupv(train, method ="bagging")
   > train.pred <- classifySupv(model, newdata = train)
   > test.pred <- classifySupv(model, newdata = rec.pairs)
   >
   > summary(train.pred)
   500 records
   82 record pairs
   38 matches
   44 non-matches
   0 pairs with unknown status
   38 links detected
   0 possible links detected
   44 non-links detected </strong>
Classification table:
              classification
   true status N P L
         FALSE 44 0 0
         TRUE   0 0 38
   > summary(test.pred)
   500 records
   1221 record pairs
                                             [ 267 ]
Record Linkage - Stochastic and Machine Learning Approaches                   Chapter 6
    150 matches
    1071 non-matches
    0 pairs with unknown status
Classification table:
                classification
    true status     N    P     L
          FALSE 1071     0     0
          TRUE      0    0 150
We have achieved a 100% accuracy in both our training and whole datasets.
          Load the RLdata500 from the RecordLinkage package and display to the user
          Implement the weights algorithm and display the weight range as a histogram
          Allow the user to select the lower and upper thresholds of weights for
          classification
          Based on the user-selected threshold, do a record matching and display the
          duplicate entities discovered.
                                              [ 268 ]
Record Linkage - Stochastic and Machine Learning Approaches                     Chapter 6
There are two panels, one to show the RLdata500 and an other one to show the results of
the weights algorithm.
                                             [ 269 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
outptu$weights runs the emweights algorithm and displays the histogram. The user
interface provides a slider for threshold selection. The user can choose the necessary
threshold based on the histogram displayed. Using these thresholds, a classification is made
and the results are shown to the user.
                                             [ 270 ]
Record Linkage - Stochastic and Machine Learning Approaches                         Chapter 6
The results of the emWeight algorithm is shown are a histogram. The histogram shows the
range of weights thrown by the emWeight algorithm. In order to find duplicates, users can
select a lower and upper threshold for weights and pass them to emClassify method.
Complete R code
The complete R code for this chapter is given as follows. We have split the different
algorithms used for record linkage into different sections, for ease of reading.
                                             [ 271 ]
Record Linkage - Stochastic and Machine Learning Approaches          Chapter 6
Feature generation
Let us begin with the feature generation R code:
   library(RecordLinkage, quietly = TRUE)
summary(rec.pairs)
   RLdata500[1,]
   RLdata500[174,]
   # String features
   rec.pairs.matches <- compare.dedup(RLdata500
                                      ,blockfld = list(1, 5:7)
                                      ,strcmp =   c(2,3,4)
                                     ,strcmpfun = levenshteinSim)
head(rec.pairs.matches$pairs)
                                             [ 272 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
   # Phoenotic features
   rec.pairs.matches <- compare.dedup(RLdata500
                                      ,blockfld = list(1, 5:7)
                                      ,phonetic =   c(2,3,4)
                                      ,phonfun = pho_h)
head(rec.pairs.matches$pairs)
   RLdata500[2,]
   RLdata500[43,]
After feature generation, let us look at the code for the expectation maximization method.
   # Em weight calculation
   rec.pairs <- compare.dedup(RLdata500
                              ,blockfld = list(1, 5:7)
                              ,strcmp =   c(2,3,4)
                              ,strcmpfun = levenshteinSim)
   pairs.weights <- emWeights(rec.pairs)
   hist(pairs.weights$Wdata)
summary(pairs.weights)
   weights.df<-getPairs(pairs.weights)
   head(weights.df)
   # Classification
   pairs.classify <- emClassify(pairs.weights, threshold.upper = 10,
   threshold.lower = 5)
                                             [ 273 ]
Record Linkage - Stochastic and Machine Learning Approaches                     Chapter 6
xlab="Link Types")
This is the complete source code for the expectation maximization method. Let us now look
at the weights method.
Weights-based method
The weights-based algorithm for record linkage:
   library(RecordLinkage)
    data("RLdata500")
    # weight calculation
    rec.pairs <- compare.dedup(RLdata500
    ,blockfld = list(1, 5:7)
    ,strcmp = c(2,3,4)
    ,strcmpfun = levenshteinSim)
summary(pairs.weights)
    weights.df<-getPairs(pairs.weights)
    head(weights.df)
    # Classification
    pairs.classify <- emClassify(pairs.weights, threshold.upper = 0.5,
   threshold.lower = 0.3)
                                             [ 274 ]
Record Linkage - Stochastic and Machine Learning Approaches            Chapter 6
   # weight calculation
   rec.pairs <- compare.dedup(RLdata500
                              ,blockfld = list(1, 5:7)
                              ,strcmp =   c(2,3,4)
                              ,strcmpfun = levenshteinSim)
   # Unsupervised classification
   kmeans.model <- classifyUnsup(rec.pairs, method = "kmeans")
   summary(kmeans.model)
   # Supervised Learning 1
   str(identity.RLdata500)
   rec.pairs <- compare.dedup(RLdata500
                              ,identity = identity.RLdata500
                              ,blockfld = list(1, 5:7)
   )
   head(rec.pairs$pairs)
                                             [ 275 ]
Record Linkage - Stochastic and Machine Learning Approaches                        Chapter 6
   summary(train.pred)
   summary(test.pred)
   # Supervised learning 2
   rec.pairs <- compare.dedup(RLdata500
                              ,blockfld = list(1, 5:7)
                              ,strcmp =   c(2,3,4)
                              ,strcmpfun = levenshteinSim)
   summary(train.pred)
   summary(test.pred)
Having seen the source code for all the different methods, let us now proceed to look at the
complete source code for our RShiny application.
                                             [ 276 ]
Record Linkage - Stochastic and Machine Learning Approaches             Chapter 6
RShiny application
The RShiny application source code:
   library(shiny)
   library(RecordLinkage)
   data("RLdata500")
   ui <- fluidPage(
     navbarPage("Record Linkage",
                tabPanel("Load"
                         , dataTableOutput("records")
                ),
                tabPanel("Weights Method"
                         ,plotOutput("weightplot")
                         ,sliderInput("lowerthreshold", "Weight Lower
   threshold:",
                                            min = 0.0, max = 1.0,
                                             [ 277 ]
Record Linkage - Stochastic and Machine Learning Approaches                       Chapter 6
                                                 value =0.2)
                              ,sliderInput("upperthreshold", "Weight Upper
   threshold:",
                                           min = 0.0, max = 1.0,
                                           value =0.5)
                              ,dataTableOutput("weights")
                              )
       )
   )
Summary
We introduced the problem of record linkage and emphasized its importance. We
introduced the package, RecordLinkage, in R to solve record linkage problems. We started
with generating features, string- and phonetic-based, for record pairs so that they can be
processed further down the pipeline to dedup records. We covered expectation
maximization and weights-based methods to perform a dedup task on our record pairs.
Finally, we wrapped up the chapter by introducing machine learning methods for dedup
tasks. Under unsupervised methods, K-means clustering was discussed. We further
leveraged the output of the K-means algorithm to train a supervised model.
In the next chapter we go through streaming data and its challenges. We will build a stream
clustering algorithm for a given streaming data.
                                             [ 278 ]
                 Streaming Data Clustering
                                                                                    7
                             Analysis in R
In all those instances where data is collected from various sources, brought to a centralized
location, and stored for analysis, that data is called as data at rest. There is a huge time
delay between the time the data was recorded and when the analysis was performed.
Analyzing the last 6 months' inventory data is an example of data at rest. Today, majority of
data analysis is performed using data at rest.
With the number of Internet of Things projects on the rise, there is a great demand today to
perform analysis on data in motion, also called streaming data. Streaming data is becoming
ubiquitous with the number of addressable sensors and devices being added to the internet.
As an example from computer network monitoring: an intrusion detection system analysis,
the network packets received in real time to quickly determine if the system is
compromised and takes an appropriate action. Latency is key when analyzing data in
motion.
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As
we work through our example, we will introduce the R stream packages we will be using.
During our code description, we will be using some of the output printed in the console. We
have included what will be printed in the console, immediately following the statement
which prints the information to the console, so as to not disturb the flow of the code.
The processing challenges in stream data are shown in the following figure:
                                            [ 280 ]
Streaming Data Clustering Analysis in R                                                Chapter 7
Bounded problems
The first challenge is deciding on a window, and what size window we need to
accommodate to make sense of the incoming data. By window, we mean storing the last n
data points. For streaming data, it is rare to actually process records one at a time. One of
the most common ways to process streaming data is to process them in a window. This can
either bundle data points into a group and process them as a unit or it can be a changing
group of n data points (or x time) where older records are discarded and newer records
added.
Drift
The second challenge is the non-stationary aspect of the streaming data. Data is considered
stationary if its statistical attributes such as mean, standard deviation, and others do not
vary over time. However, we cannot make this assumption for streaming data. This non-
stationary behavior is also called drift. Our algorithms should be able to spot and handle
drifts efficiently.
The paper, Open Challenges for Data Stream Mining Research, by Georg Kremp et. al (http://
www.kdd.org/exploration_files/16-1-2014.pdf), clearly classifies the various issues with
processing the stream data.
Single pass
Further, the incoming stream has to be processed in a single pass. Any delay may lead to
the loss of data or loss of a window of opportunity.
                                             [ 281 ]
Streaming Data Clustering Analysis in R                                               Chapter 7
Real time
Finally, the algorithm has to be fast enough; it has to process the data at the same speed at
which the data arrives.
Now that we have elicited the processing issues associated with stream processing, let us
move on to the next section.
The online/offline two-stage processing is the most common framework adopted by many
of the stream clustering algorithms.
Micro-clusters are created by a single pass to the data. As each data point arrives, it is
assigned to the closest micro-cluster. You can think of a micro-cluster as a summary
of similar data points. The summary is typically stored in the form of a cluster center, the
local density of the point, and may include more statistics, such as variance. In the stream, if
we are not able to allocate a new incoming data point to any of the existing micro-
clusters, a new micro-cluster is formed with that data point.
The online step deals with micro-cluster formation. As the data arrives, either new
micro-clusters are created or points are assigned to existing micro-clusters.
                                            [ 282 ]
Streaming Data Clustering Analysis in R                                                   Chapter 7
Macro-cluster
When a more robust clustering is either demanded by the user or by the application, the
offline stage kicks off. Here the micro-clusters generated in the online steps are used as
initial centers. A more rigorous clustering is performed in the background to generate the
final list of clusters, called a macro-cluster. Since this is not time-critical, it can be executed
the same way we work on data at rest.
For a more rigorous treatment of the various stream data clustering algorithm, please refer
to State-of-the-art on clustering data streams Mohammed Ghesmoune, by Mustapha Lebbah and
Hanene Azzag: https://bdataanalytics.biomedcentral.com/track/pdf/10.1186/
s41044-016-0011-3?site=bdataanalytics.biomedcentral.com
                                              [ 283 ]
Streaming Data Clustering Analysis in R                                          Chapter 7
                                           [ 284 ]
Streaming Data Clustering Analysis in R                                            Chapter 7
                                          [ 285 ]
Streaming Data Clustering Analysis in R                                             Chapter 7
Calling the get_points function with this stream and specifying the number of tuples to
be generated, using parameter n, we get our dataframe of 100 points in a three-dimensional
space. With the class parameter, we also request the cluster group the point it belongs to.
                                           [ 286 ]
Streaming Data Clustering Analysis in R                                                Chapter 7
If we want to have a simulation function mimicking a live IoT system, where data is
generated every n seconds/minutes, we can place the DSD function inside a while loop with
a timer. Calling this while loop would continuously generate data every n seconds/minutes
as per the timer configuration.
There are a lot of data generators available within the stream package. Refer to the stream
documentation to learn more about other data generators: https://cran.r-project.org/
web/packages/stream/vignettes/stream.pdf
>
DSD_MG stands for moving generator. It's a drift generator consisting of several MGCs. In
this example, we have used a linear MGC. The add_key function frame allows us to specify
the drift happening over time. Inside the function, we define the centers of the cluster for a
particular time period denoted by the parameter time. We have defined three frames where
the center is changed.
                                            [ 287 ]
Streaming Data Clustering Analysis in R                                            Chapter 7
Look at the time, it says 1. Let us get the data at this time and plot the same.
                                             [ 288 ]
Streaming Data Clustering Analysis in R                                            Chapter 7
Now we see that time is around 20. Let u get another 1,000 points and plot them.
The plot of points gathered around time 20 is shown in the following figure:
Clearly, the center has changed compared to the last time frame.
We see that time is now around 40. Let us take another 1000 data points.
                                          [ 289 ]
Streaming Data Clustering Analysis in R                                              Chapter 7
The three graphs convey the drift clearly. We can see the centers of the clusters shifting over
time.
There are several other useful functions, such as writing and reading streams to disk,
playing back the stream, and many others. We strongly urge you to go through the package
documentation.
A curious reader may be wondering, in both examples we showed, we sampled the data at
a particular time stamp. We never used the stream of data in real time. Please bear with us
until we discuss our use case. We will show how the stream can be fed into clustering
algorithm, so the algorithms can now work on the incoming stream of data and not on
sampled data at different time stamps. Though stream package does not provide a direct
mechanism to do this, we achieve this with an external time and matching the number of
records sampled to the window size we need to process.
                                            [ 290 ]
Streaming Data Clustering Analysis in R                                        Chapter 7
We used DSD_Gaussian to create some random four-dimensional records, and have stored
them in a dataframe called data.un.
We have created a memory stream interface to our dataframe: data.un. Now, we can play
this dataframe as we like:
   > get_points(replayer, n=5)
           X1       X2       X3       X4
   1 199.9456 77.09095 19.98670 750.1129
   2 195.9696 80.10380 16.01115 789.9727
   3 196.0109 79.95394 16.08678 790.1042
   4 199.9882 77.03385 20.00825 750.0069
   5 200.0485 76.84687 19.94311 750.0130
   > replayer
   Memory Stream Interface
   Class: DSD_Memory, DSD_R, DSD_data.frame, DSD
   With NA clusters in 4 dimensions
                                          [ 291 ]
Streaming Data Clustering Analysis in R                                              Chapter 7
Using get_points, we have extracted five points. We can now see that our interface is in
position 6.
DSD_ReadCSV can help to read a very large CSV file line by line, if it cannot be loaded into
memory completely.
Inflight operation
DSD_ScaleStream can be used to standardize (centering and scaling) data in a data stream
in-flight.
We saw various types of DSD in this section. DSD acting as a simulator is primarily used for
experimentation. It comes in handy when we need to try new data analysis algorithms. The
DSD to read a CSV file or read a database can come in very handy in real-world
applications.
        1. We can write our own DSD derived from the abstract DSD class.
        2. An actual stream can be made to be written to a CSV file or a database. In this
           case, we can leverage DSD_Memory to read the stream.
                                           [ 292 ]
Streaming Data Clustering Analysis in R                                              Chapter 7
Data stream clustering is implemented as a part of the data stream task (DST) in the stream
package. It is called a DSC class. DSC_R is a native R implementation of the clustering
algorithms. DSC_MOA provides an interface to algorithms implemented in the MOA Java
framework. In this chapter, we will be only looking at DSC_R, the native R algorithms.
The DSC classes provide two functions to perform the online and the offline steps.
The update() function accepts a DSD and a DSC object, and creates the micro-clusters. At
any point, we can query this class for the number of micro-clusters formed.
The recluster function performs the offline step. The macro-clusters are formed from the
micro-clusters inside this function.
Let us look at how to code up a data stream clustering method using the stream package.
Let us do the online part to begin with, where we want to form micro-clusters:
   > stream <- DSD_Gaussians(k = 3, d = 2, noise = .05)
   > stream
   Mixture of Gaussians
   Class: DSD_Gaussians, DSD_R, DSD_data.frame, DSD
   With 3 clusters in 2 dimensions
We have used a Gaussian data generator, generating a two-dimensional data formed into
three clusters.
Now that we have created our data generator, let us move on to our clustering algorithm:
   > clus.alg <- DSC_DBSTREAM(r = 0.03)
   > clus.alg
   DBSTREAM
   Class: DSC_DBSTREAM, DSC_Micro, DSC_R, DSC
   Number of micro-clusters: 0
   Number of macro-clusters: 0
   >
                                           [ 293 ]
Streaming Data Clustering Analysis in R                                                   Chapter 7
Internally, the DBStream algorithm stores, for each micro-cluster, a data point that defines
its center. This data point is called as the leader. It also stores the density of the cluster in an
area defined by the radius threshold. This threshold is provided by the user. While
initializing DBStream, we provided as the radius an r parameter.
If the data point falls into the assignment area of several micro-clusters, then all of them are
updated.
Refer to the paper, Clustering Data Streams Based on Shared Density Between Micro-Clusters, by
Michael Hahsler and Matthew Bolan ̃os, for more information about DBStream.
With this understanding, let's call the update function to create micro-clusters:
    > update(clus.alg, stream, n =100)
    > clus.alg
    DBSTREAM
    Class: DSC_DBSTREAM, DSC_Micro, DSC_R, DSC
    Number of micro-clusters: 5
    Number of macro-clusters: 2
As you can see, we call the update function every time with both our DSD object stream
and DSC object clus.alg. During each call, we pull around 100 records, n =100
                                              [ 294 ]
Streaming Data Clustering Analysis in R                                            Chapter 7
   Number of micro-clusters: 5
   Number of macro-clusters: 2
   > update(clus.alg, stream, n =100)
   > clus.alg
   DBSTREAM
   Class: DSC_DBSTREAM, DSC_Micro, DSC_R, DSC
   Number of micro-clusters: 26
   Number of macro-clusters: 9
   >
The update function creates the micro-clusters. We can see the number of micro-
clusters from the shell output.
Finally, we plot our data points and the micro-clusters using plot(clus.alg,
stream)
We can see the data points and the micro-clusters in the following graph:
              The DSC_DBSTREAM has inbuilt reclustering, hence you can see the macro-
              clusters in the output. There is no need to run a separate offline process.
                                          [ 295 ]
Streaming Data Clustering Analysis in R                                           Chapter 7
Now that we have our clusters, both micro and macro, we need a good way to evaluate the
quality of these clusters. A good clustering algorithm should produce the output in such a
way that the inter-cluster distance is large and the intra-cluster distance is small. The
stream package provides a function called evaluate to find the quality of the clustering
algorithm's output:
   > evaluate(clus.alg, stream)
   Evaluation results for micro-clusters.
   Points were assigned to micro-clusters.
                                              [ 296 ]
Streaming Data Clustering Analysis in R                                           Chapter 7
It gives a lot of different evaluation metrics. We can choose one among them based on our
criteria.
                                           [ 297 ]
Streaming Data Clustering Analysis in R                                                Chapter 7
There are many more clustering algorithms implemented in the stream. Refer to the
documentation for other algorithms.
Hopefully, by now we are equipped with enough information to solve our use case using
stream package.
This example is inspired by the sensor network example given in pubnub at: https://www.
pubnub.com/developers/realtime-data-streams/sensor-network/.
                                            [ 298 ]
Streaming Data Clustering Analysis in R                                               Chapter 7
Our digital control system is generating data at a particular rate. It has the following sensors
installed:
These data from the sensors are pumped to a message queue. We have two systems
consuming the data from our message queue.
Speed layer
This is where the real-time analysis of the incoming data is performed. Our R code will be
running here. If this layer was realized by say apache storm and bolt (http://storm.
apache.org/), the bolts would be running our cluster creation code.
The speed layer talks to the real-time dashboard and the mobility layer. The micro-
clusters discovered as pushed to the dashboard and to the mobility layer.
                                            [ 299 ]
Streaming Data Clustering Analysis in R                                             Chapter 7
Batch layer
We are not concerned about the batch layer for this exercise. The batch layer is responsible
for keeping the master data intact. Since it has a very low latency, we have the speed layer
to process the continuous stream of data.
Now that we have a good idea of the overall architecture of our system, let us proceed to
develop the stream clustering algorithm. For the cluster algorithm development purpose,
we are going to leverage the DSD simulators in the stream package to act as our sensors in
the digital control system.
As you can see, we want DSD_Gaussian to generate a four-dimensional dataset, one for
each sensor. We expect two clusters to be formed out of this dataset. The centers for the
clusters are provided as a matrix to the mu parameter.
                                            [ 300 ]
Streaming Data Clustering Analysis in R                                             Chapter 7
It's a good practice to scale the incoming data before we proceed to do any analysis. Luckily
we have DSD_ScaleStream to apply scaling on the incoming stream data.
                                            [ 301 ]
Streaming Data Clustering Analysis in R                                             Chapter 7
The DSD_ScaleStream class is used to scale the incoming data from our Gaussian stream.
Let us plot this data:
We have normalized our data and are ready to apply the clustering.
Inside the stream package, we have a class, DSC_TwoStage, which will help us pipeline
our online and offline algorithms. Let us see how we can use this class:
   > micro.alg <- DSC_Sample(k = 10)
   > macro.alg <- DSC_Kmeans(k = 2)
We begin with defining the online clustering algorithm for creating micro-clusters. We
use the DSC_Sample class. It's one of the simplest online clustering algorithms. It extracts
samples from a data stream using reservoir sampling. The extracted sample is stored as a
set of micro-clusters. Before we proceed further, let us quickly look at the reservoir
sampling.
                                           [ 302 ]
Streaming Data Clustering Analysis in R                                                       Chapter 7
Reservoir sampling
Given n items, where n is very large or not known in advance, the reservoir sampling
algorithm helps us select k samples from n items. The algorithm to begin with initializes an
array of size k. It then copies the first k items from the stream into the array. Now, it
proceeds to evaluate each item in the array. A random number, j, between 0 and i, is
generated, where i is the index of the item we are currently evaluating. If, j is in range of 0
to k-1, we replace the j element in the array with i th element in the stream.
The parameter k, to DSC_Sample, is used to set the k value for the reservoir sampling, and
eventually sets the number of micro-clusters we expect from the algorithm. In our case, we
have set k to 10. Further down the code, we see that 10 micro-clusters are generated.
Next, we define the clustering algorithm for the offline step. In this case, we use a simple k-
means algorithm:
We will not indulge in the details of the k-means algorithm here. More information about R
k means can be found at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/
kmeans.html
                                              [ 303 ]
Streaming Data Clustering Analysis in R                                                     Chapter 7
    numMicroClusters purity
     10 1
Using DSC_TwoStage, we pipe our online and offline algorithms together. By calling the
update function, we are invoking our offline and online algorithm. We run our algorithm
on our normalized data stream, stream.norm.
Purity is one of the simplest measures to assess the quality of clusters. Purity is derived by
summing up the number of majority elements in each cluster and dividing it by the total
number of clusters. Look at the following example:
In cluster 1, the triangle is the majority class and the count is 4. In cluster 2, the circle is the
majority class and the count is again 4. The purity is then (4 + 4)/10, so 0.8 is the purity of
this cluster.
In our data case, we have achieved a purity of 1. Our clusters are therefore a very good fit to
the incoming streaming data.
                                               [ 304 ]
Streaming Data Clustering Analysis in R                                            Chapter 7
Now, we need to move this set up to our streaming infrastructure. It is beyond the scope of
this book to show how to setup the streaming infrastructure shown above. However, we
will give some pointers here to deploy our stream clustering algorithm:
       1. We need to run R and install the necessary packages including stream package
          into our speed layer.
       2. Assuming that the speed layer is some kind of set up similar to Apache storm
          and bolts, we can have the following setup:
                 1. A storm bolt combination, reads the data from the message queue and
                     populates a in-memory database. This can be a small sqldb in memory
                     or other key value databases.
                 2. Another storm bolt combination acts as follows:
                            1. Use DSD_ReadDB to connect to the database using R DBI and
                                extract the data.
                            2. Micro/Macro Cluster twin, set up using DSC_TwoStage can
                                now query the DSD_ReadDB at fixed intervals as defined by
                                the overall application. When using function update, the
                                window size can be passed as parameter n. The results of the
                                clustering can now be pushed to the dashboard (or pull
                                mechanism, from the dashboard can be setup). Function
                                get_points and get_centers can provide the cluster
                                details.
Hopefully, this gave an overview of how stream package can be used in a real-time data
processing framework.
Complete R code
   install.packages("stream")
   library(plotly)
   p <- plot_ly(data, x = ~X1, y = ~X2, z = ~X3, color = ~class) %>%
     add_markers() %>%
                                          [ 305 ]
Streaming Data Clustering Analysis in R                                Chapter 7
   # Drift generators
   drift.stream <- DSD_MG(dim = 2)
   cluster.1 <- MGC_Linear(dim = 2)
   drift.stream
   data <- get_points(drift.stream, n = 1000)
   plot_ly(data, x = ~X1, y= ~X2) %>% add_markers() %>% layout(title = 'time
   around 1')
   drift.stream
   data <- get_points(drift.stream, n = 1000)
   plot_ly(data, x = ~X1, y= ~X2) %>% add_markers() %>% layout(title = 'time
   around 20')
   drift.stream
   data <- get_points(drift.stream, n = 1000)
   plot_ly(data, x = ~X1, y= ~X2) %>% add_markers() %>% layout(title = 'time
   around 40')
evaluate(clus.alg, stream)
                                          [ 306 ]
Streaming Data Clustering Analysis in R                                Chapter 7
   par(mfrow = c(2,2))
   data.un <- get_points(sensor.stream, n =100)
   head(data.un)
par(mfrow = c(1,1))
   # Perform Clustering
   micro.alg <- DSC_Sample(k = 10)
   macro.alg <- DSC_Kmeans(k = 2)
   pipe <- DSC_TwoStage(micro = micro.alg, macro = macro.alg)
                                              [ 307 ]
Streaming Data Clustering Analysis in R                                           Chapter 7
Summary
The chapter started with an overview of data at motion and data at rest, also called as the
streaming data. We further dwelled into the properties of streaming data and the challenges
it poses while processing it. We introduced the stream clustering algorithm. The famous
offline/online approach to stream clustering was discussed. Later on, we introduced various
classes in stream package and how to use them. During that process, we discussed ideas
about several data generators, DBSTREAM algorithms to find micro and macro clusters and
several metrics to assess the quality of clusters. We then introduced our use case. We went
ahead to design a clustering algorithm, with the online part based on reservoir sampling
and the offline part was handled by k-means algorithm. Finally, we described the steps
needed to take this whole setup in a real streaming environment.
In the next chapter, we will explore graph mining algorithms. We will show you how to use
the package igraph to create and manipulate graphs. We will discuss Product Network
Analysis and show how graph algorithms can assist in generating micro categories.
                                          [ 308 ]
                     Analyze and Understand
                                                                                 8
                           Networks Using R
Network analysis is the study of graphs. Graphs are defined by a set of nodes or vertices
connected by edges. Both the nodes and vertices can have attributes describing them. Most
importantly, the edges can carry weight, indicating the importance of the connection. When
the directions of the edges are preserved, the graph is called a directed graph; when not
preserved, it's called an undirected graph. Network analysis, or network theory, or graph
theory provides a rich set of algorithms to analyze and understand graphs. The famous
Koenigsberg problem (http://mathworld.wolfram.com/KoenigsbergBridgeProblem.html)
introduced by Euler is one of the first graph theory problems to be studied. Koenigsberg is
an old city in Prussia (modern day Russia). The river Pregal separates the city. There are
two other islands. There are seven bridges connecting the islands and the cities. The
Koenigsberg problem was to devise a walk through the city that would cross each of those
bridges once and only once.
One graph structure that we all know today and is easy to relate to is the social network
structure formed by various social media applications such as Facebook and LinkedIn. In
these networks, people form the vertices and, when two of them are connected to each
other, an edge is drawn between those vertices. The whole internet is a graph of connected
machines. Other examples from biology include protein-protein interaction networks,
genetic interaction networks, and so on.
Analyze and Understand Networks Using R                                             Chapter 8
When we represent a problem as a graph, it may give us a different point of view to solve
that problem. Sometimes it can make the problem simpler to solve. One such problem that
we are going to see in this chapter is assigning categories to items. More importantly, we
will understand the micro-categorization of items in a retail setting. Though our example is
from a retail setting, this technique is not limited to the retail domain. We will show how
we can leverage graphs to assign categories to items. This technique is called the Product
Network Analysis.
              This chapter is loosely based on the following two papers: Product Network
              Analysis – The Next Big Thing in Retail Data Mining--a white paper by Forte
              Consultancy and Extending Market Basket Analysis with Graph Mining
              Techniques: A Real Case, by Ivan F. Videla-Cavieres , Sebastián A. Ríos,
              University of Chile, Department of Industrial Engineering, Business
              Intelligence Research Center (CEINE), Santiago, Chile.
Category management is very important for retailers. Having products grouped into the
right category is the first step for retailers to manage their products. Downstream
applications such as up-selling, cross-selling, and loyalty systems can benefit tremendously
with the right category assignment. In the absence of a sophisticated product network
analysis system, product categorization is done manually and is heavily dependent on the
product features entered either by the suppliers or the procurement teams. This
categorization may not be accurate and will be heavily biased by human judgment. It's
impossible to expect concordance between two people in a task such as this one.
The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1.
As we work through our example, we will introduce the R packages, igraph, and arules
that we will be using. During our code description, we will be using some of the output
printed in the console. We have included what will be printed in the console immediately
following the statement that prints the information to the console, so as not to disturb the
flow of the code.
                                            [ 310 ]
Analyze and Understand Networks Using R                                           Chapter 8
Graphs in R
We will use the R package, igraph, for our graph analysis needs. We will leverage the
arules package to manipulate our data. If you don't have them installed, proceed to install
them as follows:
    >   install.packages("arules")
    >   install.packages("igraph")
You can use the sessionInfo function from the utils package to look at the packages
available for you in the current session.
                                             [ 311 ]
Analyze and Understand Networks Using R                                              Chapter 8
After including the igraph library, we used the graph_from_literal function to create a
simple undirected graph with six nodes. The igraph package provides the plot.igraph
function to visualize the graphs. There are several ways in which we can create a graph. For
a complete list of the different methods available to create graphs, refer to http://igraph.
org/r/#docs.
By adding the + sign while defining the edges, we define the direction of the graph.
                                             [ 312 ]
Analyze and Understand Networks Using R                                             Chapter 8
You can see the arrows now showing the direction of the edges.
Let's look at the following code snippet to show the vertices and edges of the graph:
   > E(simple.graph) # Edges
The E() and V() functions give the edges and vertices of our graph.
By calling the name property on all the vertices, we change the names of our vertices.
Alternatively, we can also use the set_vertex_attr function to add/modify an attribute;
in this example, we added a new attribute called age.
                                             [ 313 ]
Analyze and Understand Networks Using R                                            Chapter 8
Further, based on the age property, we change the color of the vertex:
   >V(simple.graph)$color <- ifelse(V(simple.graph)$age == '11',
   "blue","green")
   >plot.igraph(simple.graph)
Using the color slot, we have changed the color of the vertex based on the age property.
                                          [ 314 ]
Analyze and Understand Networks Using R                                               Chapter 8
The preceding plot vertices are green where the age attribute is 11, otherwise they are blue.
Structural properties such as degree and strength help us understand the underlying
structure of the graph. Many graph-based algorithms use these properties.
Degree of a vertex
The number of edges adjacent, that is, either entering or exiting a vertex, is the degree of a
vertex. The degree function gives the degree of all the vertices in our graph.
We further use the E() function to refer to our graph's edges and add weights to the edges.
                                            [ 315 ]
Analyze and Understand Networks Using R                                             Chapter 8
Strength of a vertex
The strength of a vertex is the sum of the weights of the edges adjacent on that node. The
strength function gives us the strength of vertices in our graph. Let's look at a small code
snippet to find the degree and strength of a graph:
   > degree(simple.graph)
     alice     bob charlie   david     eli
         1       1        3      1       2
   > E(simple.graph)$weight <- c(10,20,35,15,25,35)
   > strength(simple.graph)
     alice     bob charlie   david     eli
        10      20      70      35      25
The functions degree and strength in igraph package can be invoked to get degree and
strength.
Adjacency Matrix
Any graph can be represented as a matrix, called the adjacency matrix, where the rows and
columns are the vertices of the graph. The presence of a value in a particular cell indicates
an edge between the vertices. The cells can be populated with the edge weights too.
The get.adjacency function returns the incident or the adjacency matrix of our graph.
Let's look at a small code snippet to find the adjacency matrix for a graph:
   > get.adjacency(simple.graph)
   5 x 5 sparse Matrix of class "dgCMatrix"
           alice bob charlie david eli
   alice       .   .       .     .   1
   bob         .   .       1     .   .
   charlie     .   1       .     1   1
   david       .   .       1     .   .
   eli         1   .       1     .   .
   >
                                           [ 316 ]
Analyze and Understand Networks Using R                                                Chapter 8
More networks in R
In this section, let us see some more structural properties of a graph. Till now we created
some simple graphs. It will be better to have some larger graphs to study the structural
properties.
The R code for a lot of different networks built from a diverse source of input is available
at https://github.com/igraph/igraph/tree/master/nexus/download.
Let's use the US airport network from the repository to show some structural properties of a
graph:
    url <-
    "http://sites.google.com/site/cxnets/US_largest500_airportnetwork.txt"
    tmp <- tempdir()
    dest <- paste(sep="", tmp, "/", "usairport.txt")
    download.file(url, dest)
We first download the file from an internet URL and use read_graph function to
successfully create a graph. Let us now start looking at some structural properties.
Centrality of a vertex
The centrality of a vertex in a graph defines the importance of that vertex in the graph.
Centrality can be calculated in a multitude of ways. Degree centrality is calculated based on
the in-degree--the number of incoming edges--and the out-degree--the number of outgoing
edges of the graph.
In the preceding case, we use the degree function to find the degree centrality score. We
have sorted it in descending order. Airport 0 has the highest centrality score. In this case,
we have used both the in-degree and the out-degree. According to this analysis, airport 0
must be some important node in this network.
                                             [ 317 ]
Analyze and Understand Networks Using R                                               Chapter 8
Closeness is the inverse of farness. Another way of looking at it is that closeness is how long
will it take to pass a message from a node to all other nodes.
The following snippet shows the closeness calculation using the igraph package:
    > head(sort(closeness(usairport), decreasing = TRUE),10)
We have nested multiple functions, let us untangle them. First, we use the function
closeness to get the closeness property for each vertex. We pass that output to sort
function, and sort the results by decreasing order of closeness. Finally, we use the head
function to pick the top 10 results from the sort output.
Having explored some of the properties of the graph, let us see an overview of some of the
graph algorithms available in the igraph package.
The shortest.paths method gives the shortest paths between our node alice and all the
other nodes. The node david is far way from alice.
                                            [ 318 ]
Analyze and Understand Networks Using R                                                Chapter 8
We see that the algorithm moved to eli and came back to alice.
We hope this gives you an overview of how to use the igraph package. Let's now move to
our use case and see how we can leverage the igraph package to solve our problem.
The Nielsen definition of a category is based on product features. Products that exhibit the
following features are put under the same category:
When analyzing purchasing behavior, several patterns emerge; some products are sold
together at the same time, some products are sold in a time sequential pattern, the sale of
one product affects the sale of another and several others. These types of product
interactions and patterns can either occur in the same category or across different
categories. This leads to formation of micro-categories. Unfortunately, there are no simple
ways to identify these micro-categories. Based on the products, market conditions, price
points, consumer preference, and many other factors, several such micro-categories may
emerge as more retail transactions aggregate.
                                            [ 319 ]
Analyze and Understand Networks Using R                                               Chapter 8
This data and source can be downloaded from the Packt website.
    > data <- read.csv('data.csv')
    > head(data)
      order_id                        product_id
    1   837080           Unsweetened Almondmilk
    2   837080                     Fat Free Milk
    3   837080                            Turkey
    4   837080          Caramel Corn Rice Cakes
    5   837080                Guacamole Singles
    6   837080 HUMMUS 10OZ WHITE BEAN EAT WELL
The given data is in a tabular format. Every row is a tuple of order_id, representing the
transaction and product_id, the item included in that transaction. We need to transform
this data so that we have all the pairs of products and the number of transactions in which
they have occurred together. We will leverage the arules package to achieve this. This
package provides the necessary infrastructure to store, manipulate, and analyze the retail
transaction data. We have already covered the arules package in Chapter 1, Association
Rule Mining. For more information about arules package, refer to https://cran.r-
project.org/web/packages/arules/index.html.
                                            [ 320 ]
Analyze and Understand Networks Using R                                              Chapter 8
Data preparation
Our data preparation task involves taking the transactions data and converting it to a form
where we have product pairs and their transaction frequency. Transaction frequency is the
number of transactions in which both the products have appeared. We will use these
product pairs to build our graph. The vertices of our graph are the products. For every
product pair, an edge is drawn in the graph between the corresponding product vertices.
The weight of the edge is the transaction frequency.
We will use the arules package version 1.5-0 to help us perform this data preparation task:
   > library(arules)
   > transactions.obj <- read.transactions(file = 'data.csv', format =
   "single",
   +                                       sep = ",",
   +                                       cols = c("order_id", "product_id"),
   +                                       rm.duplicates = FALSE,
   +                                       quote = "", skip = 0,
   +                                       encoding = "unknown")
   Warning message:
   In asMethod(object) : removing duplicated items in transactions
   > transactions.obj
   transactions in sparse format with
     6988 transactions (rows) and
     16793 items (columns)
We begin with reading our transactions stored in the text file and create an arules data
structure called transactions. Let's look at the parameters of read.transactions, the
function used to create the transactions object. The first parameter, file, we pass to our text
file where we have the transactions from the retailer. The second parameter, format, can
take any of two values, single or basket, depending on how the input data is organized. In
our case, we have a tabular format with two columns--one column representing the unique
identifier for our transaction and the other column for a unique identifier representing the
products present in our transaction. This format is named as single by arules. Refer to the
arules documentation for a detailed description of all the parameters. On inspecting the
newly created transactions object transaction.obj, we see that there are 6,988
transactions and 16,793 products.
                                           [ 321 ]
Analyze and Understand Networks Using R                                                     Chapter 8
Now that we have loaded the transaction data, let's proceed to find the product pairs:
   > support    <- 0.015
   >
   > # Frequent item sets
   > parameters = list(
   +   support = support,
   +   minlen = 2, # Minimal number of items per item set
   +   maxlen = 2, # Maximal number of items per item set
   +   target = "frequent itemsets"
   + )
   >
   > freq.items <- apriori(transactions.obj, parameter = parameters)
   Apriori
   Parameter specification:
    confidence minval smax arem aval originalSupport maxtime support minlen
   maxlen             target   ext
            NA    0.1     1 none FALSE          TRUE       5   0.015      2
   2 frequent itemsets FALSE
   Algorithmic control:
    filter tree heap memopt load sort verbose
       0.1 TRUE TRUE FALSE TRUE     2    TRUE
    "Apriori is an algorithm for frequent itemset mining and association over transaction
    databases. It proceeds by identifying the frequent individual items in the database and
    extending them to larger and larger item sets as long as those itemsets appear sufficiently
    often in the database." -- Wikipedia
                                              [ 322 ]
Analyze and Understand Networks Using R                                             Chapter 8
Generating frequent itemsets is the first phase of the apriori algorithm. We conveniently
leverage this phase of the algorithm to generate our product pairs and the number of
transactions in which they are present.
The apriori algorithm works in two phases. Finding the frequent item sets is the first
phase of the association rule mining algorithm. A group of product IDs is called an item set.
The algorithm makes multiple passes to the database; in the first pass, it finds out the
transaction frequency of all the individual items. These are item sets of order 1. Let's
introduce the first interest measure Support here:
Now in the first pass, the algorithm calculates the transaction frequency for each product.
At this stage, we have order 1 item sets. We will discard all those item sets that fall below
our support threshold. The assumption here is that items with high transaction frequency
are more interesting than the ones with very low frequency. Items with very low support
are not going to make interesting rules further down in the pipeline. Using the most
frequent items, we construct item sets with two products and find their transaction
frequency, that is, the number of transactions in which both the items are present. Once
again, we discard all the two-product item sets, also known also item sets of order 2 that
are below the given support threshold. We continue this way, until we finish. Let's look at a
quick illustration:
Pass 1:
    Support = 0.1
    Product, transaction frequency
    {item5}, 0.4
    {item6}, 0.3
    {item9}, 0.2
    {item11}, 0.1
                                           [ 323 ]
Analyze and Understand Networks Using R                                            Chapter 8
item11 will be discarded in this phase as its transaction frequency is below the support
threshold.
Pass 2:
    {item5, item6}
    {item5, item9}
    {item6, item9}
As you can see, we have constructed item sets of order 2 using the filtered items from pass
1. We proceed to find their transaction frequency, discard item sets falling below our
minimum support threshold, and step in to pass 3, where once again, we create item sets of
order 3, calculate the transaction frequency, and perform filtering and move to pass 4. In
one of the subsequent passes, we will reach a stage where we cannot create higher order
item sets. That is when we stop.
The apriori method is used in arules packages to get the frequent items. This method
takes two parameters, one transaction object, and the second parameter is a named list. We
create a named list called parameters. Inside the named list, we have an entry for our
support threshold. We have set our support threshold to 0.015. The minlen and maxlen
parameters set a lower and upper cutoff on how many items we expect in our item sets. By
setting our minlen and maxlen to 2, we say that we need only product pairs.
The apriori method returns an itemset object. Let's now extract the product pairs and
their transaction frequency from the itemset object, freq.items:
    > freq.items.df <- data.frame(item_set = labels(freq.items)
    +                             , support = freq.items@quality)
    > freq.items.df$item_set <- as.character(freq.items.df$item_set)
    > head(freq.items.df)
                                       item_set support.support support.count
    1                 {Banana,Honeycrisp Apple}      0.01617058           113
    2               {Banana,Organic Fuji Apple}      0.01817401           127
    3                   {Banana,Cucumber Kirby}      0.01788781           125
    4                     {Banana,Strawberries}      0.01931883           135
    5 {Bag of Organic Bananas,Organic Zucchini}      0.01659989           116
    6 {Organic Strawberries,Organic Whole Milk}      0.01617058           113
From the itemset object, freq.items, returned by apriori, we extract our product pairs
and their transaction frequency count. The item_set column in our
dataframe, freq.items.df refers to the product pair, the support.count column is the
actual number of transactions in which both the products were present, and
the support.support column gives the support.count value as a percentage. Notice
that we have our product pairs enclosed weirdly in a curly brace. We need them in two
different columns.
                                          [ 324 ]
Analyze and Understand Networks Using R                                           Chapter 8
Let's write some cleaning code to remove those braces and also separate our product pairs
into two different columns:
   > library(tidyr)
   > freq.items.df <- separate(data = freq.items.df, col = item_set, into =
   c("item.1", "item.2"), sep = ",")
   > freq.items.df[] <- lapply(freq.items.df, gsub, pattern='\\{',
   replacement='')
   > freq.items.df[] <- lapply(freq.items.df, gsub, pattern='\\}',
   replacement='')
   > head(freq.items.df)
                     item.1             item.2    support.support
   support.count
   1                 Banana   Honeycrisp Apple 0.0161705781339439
   113
   2                 Banana Organic Fuji Apple 0.0181740125930166
   127
   3                 Banana     Cucumber Kirby 0.0178878076702919
   125
   4                 Banana       Strawberries 0.0193188322839153
   135
   5 Bag of Organic Bananas   Organic Zucchini 0.0165998855180309
   116
   6   Organic Strawberries Organic Whole Milk 0.0161705781339439
   113
We leverage the separate function from the tidyr package to split the item_set column
into two columns. As the products are separated by a comma, we specify the comma as the
separator to a separate function. Once separated, we run a regular expression on those
columns to remove the curly braces.
Let us now to create a new data frame with product pairs and weights:
   > network.data <- freq.items.df[,c('item.1','item.2','support.count')]
   > names(network.data) <- c("from","to","weight")
   > head(network.data)
                        from                to weight
   1                 Banana   Honeycrisp Apple    113
   2                 Banana Organic Fuji Apple    127
   3                 Banana     Cucumber Kirby    125
   4                 Banana       Strawberries    135
   5 Bag of Organic Bananas   Organic Zucchini    116
   6   Organic Strawberries Organic Whole Milk    113
                                          [ 325 ]
Analyze and Understand Networks Using R                                              Chapter 8
We retain only the item.1, item.2, and support.count columns. Next, we rename these
columns to from, to, and weight. The igraph package expects this naming convention to
create a graph object seamlessly. Finally, you can see that we have modified the data suit
igraph package's graph manipulation functions.
We leveraged the apriori function in arules to prepare our dataset. Equipped with our
dataset, let's proceed to perform product network analysis to discover micro-categories.
          The core product: In a subgraph or a cluster group, the product that is most
          commonly purchased in the group is termed as the core product of that group.
          The connectors: These are products that connect two subgraphs or clusters
          together. They are the ones that are typically bought first, if a customer starts
          shopping for products in that subgraph. Promoting this product as a part of
          cross-selling can influence customers who have never bought any products from
          this subgraph of products to start purchasing products present in this subgraph.
With our network.data data frame prepared in the exact manner we need, let's proceed to
build a graph of products:
   > set.seed(29)
   > my.graph <- graph_from_data_frame(network.data)
   > plot.igraph(my.graph,
   +              layout=layout.fruchterman.reingold,
   +              vertex.label.cex=.5,
   +              edge.arrow.size=.1)
   >
                                           [ 326 ]
Analyze and Understand Networks Using R              Chapter 8
                                           [ 327 ]
Analyze and Understand Networks Using R                                               Chapter 8
Having built the graph, let's proceed to perform the clustering exercise, the second step in
product network analysis. The clustering of graphs is performed by community detection
algorithms. Communities discovered by the graphs are considered as clusters. Though there
are some small nuances in equating communities to clusters, in this chapter, we will stick to
the simple principle that the communities we discover in our graphs are going to be the
clusters we want to consider for our product network analysis.
Community detection algorithms: These are a group of algorithms that, when applied on a
graph, produce the most likely communities present in the graph. Traditional algorithms,
such as hierarchical clustering, use similarity-based approaches to arrive at the
communities. The more recent algorithms leverage several structural properties of the
graph.
Walktrap algorithm: This algorithm is based on the random walk principle. The
fundamental assumption with the random walk algorithm is as follows. A community
typically has very few edges leading outside of the community. So if we perform a random
walk from a vertex, we are more likely to stay within the community. We will use this
algorithm to detect communities for our use case.
    > random.cluster
    IGRAPH clustering walktrap, groups: 2, mod: 0.26
    + groups:
      $`1`
      [1] "Banana"             "Large Lemon"         "Organic Avocado"
    "Honeycrisp Apple"
      [5] "Organic Fuji Apple" "Cucumber Kirby"      "Strawberries"
    "Organic Whole Milk"
      [9] "Limes"
      $`2`
      [1] "Bag of Organic Bananas" "Organic Strawberries"    "Organic Hass
    Avocado"   "Organic Raspberries"
                                            [ 328 ]
Analyze and Understand Networks Using R                                    Chapter 8
We pass our graph to the walktrap.community function. We then move the cluster
membership and cluster members to a data frame, groupings.df. Our random walk
algorithm produced two subgraphs or communities.
                                          [ 329 ]
Analyze and Understand Networks Using R                 Chapter 8
                                          [ 330 ]
Analyze and Understand Networks Using R                                                   Chapter 8
Let's start with the large cluster in the left side of the picture. It resembles a star schema,
with Banana occupying the center. Let's confirm this by looking at some properties of the
Banana vertex.
If we quickly look at the adjacency matrix, it appears that Banana is the Core Product for this
cluster. Banana is connected to most of the other products.
Let's look at the strength and degree for the Banana vertex:
    > strength(my.graph)
                     Banana Bag of Organic Bananas             Organic Strawberries
    Organic Hass Avocado
                       1697                    941                                 1191
    789
        Organic Raspberries            Large Lemon                      Organic Avocado
    Organic Baby Spinach
                        414                    288                                  422
    845
           Honeycrisp Apple     Organic Fuji Apple                       Cucumber Kirby
    Strawberries
                        113                    127                                  125
    135
           Organic Zucchini     Organic Whole Milk                                Limes
                        116                    233                                  270
    > degree(my.graph)
                                             [ 331 ]
Analyze and Understand Networks Using R                                                Chapter 8
Banana has very high degree and strength, and thus there is more evidence to assign
Banana as the Core Product for this cluster. Though it was evident that Banana is the most
important product in the cluster from the graph plot, it gives us more confidence to look at
some of the properties of the Banana vertex.
Having decided upon the Core Product as banana for the left cluster, let's do some random
walking from the Banana vertex. We select an edge uniformly and randomly from all the
out-degrees of Banana and do a specified number of steps. To find out more about random
walk click http://igraph.org/r/doc/random_walk.html.
   > for(var in seq(1,5)){
   +   print (random_walk(my.graph, start = "Banana", steps = 2))
   + }
   + 2/15 vertices, named, from 16e82b7:
   [1] Banana               Organic Strawberries
   + 2/15 vertices, named, from 16e82b7:
   [1] Banana          Organic Avocado
   + 2/15 vertices, named, from 16e82b7:
   [1] Banana         Cucumber Kirby
   + 2/15 vertices, named, from 16e82b7:
   [1] Banana Limes
   + 2/15 vertices, named, from 16e82b7:
   [1] Banana       Strawberries
   > random_walk(my.graph, start = "Banana", steps = 3)
   + 3/15 vertices, named, from 16e82b7:
   [1] Banana               Organic Baby Spinach Organic Hass Avocado
   > random_walk(my.graph, start = "Banana", steps = 4)
   + 4/15 vertices, named, from 16e82b7:
   [1] Banana               Organic Avocado      Organic Baby Spinach Organic
   Strawberries
   > random_walk(my.graph, start = "Banana", steps = 5)
                                           [ 332 ]
Analyze and Understand Networks Using R                                               Chapter 8
This random walk can expose some of the product affinities quickly without a lot of
computation. Imagine working on large graphs with thousands of vertices; these random
walks performed multiple times can quickly unearth product affinities.
              Another data source that can be used to experiment with the techniques
              described in this chapter is Amazon's online retail page which is available
              for download."
              #https://snap.stanford.edu/data/amazon0302.html
The Organic Avacado and Banana products in the left cluster can be termed as The
Connector. They have a path into the right-side cluster. Promoting these products can
induce the customers of the left cluster to start exploring or buying the products in the right
cluster.
This brings us to the end of this section. Here, we were able to use the random walk
principle-based community detection algorithm to find communities in our graph. These
communities can be the new micro-categories. Further, we also identified our core and
connector vertices.
                                            [ 333 ]
Analyze and Understand Networks Using R                                               Chapter 8
                                               ".csv"
                                               )
                              )
                              , dataTableOutput("transactions")
                   ),
                   tabPanel("Product Pairs"
                            ,dataTableOutput("ppairs")),
                   tabPanel("Community"
                            ,plotOutput("community"))
        )
    )
We have three panels. In the first panel, we select a product transaction file and display it.
In our second panel, we show the product pairs and their transaction counts. In the final
panel, we display the communities we have discovered.
                                            [ 334 ]
Analyze and Understand Networks Using R                                         Chapter 8
As soon as a file is uploaded, this file is read into a data frame, trans.df.
                                             [ 335 ]
Analyze and Understand Networks Using R                                Chapter 8
                                          [ 336 ]
Analyze and Understand Networks Using R                                                Chapter 8
Using the transaction object, the apriori function is invoked to get the product pairs. The
output of the apriori is carefully formatted and a final data frame, network.data, is
created.
The rest of the functions in the server renders these outputs to the respective slots in the UI.
Using the file selector, we can select the transaction file. The transaction file should be a
.csv file with two columns. One for the order_id and the other one for the product_id.
In this version, we need to maintain the column names as order_id and product_id.
                                            [ 337 ]
Analyze and Understand Networks Using R             Chapter 8
                                          [ 338 ]
Analyze and Understand Networks Using R                      Chapter 8
                                             [ 339 ]
Analyze and Understand Networks Using R                                          Chapter 8
   ## Graph properties
   E(simple.graph) # Edges
   V(simple.graph) # Vertices
   ## Graph attributes
   V(simple.graph)$name <- c('alice', 'bob','charlie','david',
   'eli','francis')
   simple.graph <- set_vertex_attr(simple.graph ,"age", value = c('11',
   '11','15','9', '8','11'))
   plot.igraph(simple.graph)
   V(simple.graph)$color <- ifelse(V(simple.graph)$age == '11',
   "blue","green")
   plot.igraph(simple.graph)
   # Structural properties
   degree(simple.graph) # degree of nodes
   E(simple.graph)$weight <- c(10,20,35,15,25,35)
   strength(simple.graph) # strength of nodes
   get.adjacency(simple.graph) # adjacency matrix
                                          [ 340 ]
Analyze and Understand Networks Using R                                   Chapter 8
                                               encoding = "unknown")
   transactions.obj
   # Interest Measures
   support    <- 0.015
## Clustering
                                          [ 341 ]
Analyze and Understand Networks Using R                                  Chapter 8
   plot(random.cluster,my.graph,
               layout=layout.fruchterman.reingold,
               vertex.label.cex=.5,
               edge.arrow.size=.1)
   get.adjacency(my.graph)
   strength(my.graph)
   degree(my.graph)
   ## Random walks
   for(var in seq(1,5)){
     print (random_walk(my.graph, start = "Banana", steps = 2))
   }
   library(shiny)
   library(arules)
   library(igraph, quietly = TRUE)
                                          [ 342 ]
Analyze and Understand Networks Using R                                Chapter 8
                                          [ 343 ]
Analyze and Understand Networks Using R                                             Chapter 8
   ui <- fluidPage(
     navbarPage("Product Pairs",
                tabPanel("Transactions"
                         , fileInput("datafile", "select transactions csv
   file",
                                     accept = c(
                                        "text/csv",
                                        "text/comma-separated-
   values,text/plain",
                                        ".csv"
                                        )
                         )
                         , dataTableOutput("transactions")
                ),
                tabPanel("Product Pairs"
                         ,dataTableOutput("ppairs")),
                tabPanel("Community"
                         ,plotOutput("community"))
     )
   )
Summary
We started this chapter by explaining our problem and dataset and conducted product
categorization using network/graph theory. We introduced the R package, igraph, to
create and manipulate graphs. We set out discovering some of the properties of the graph
and also a little bit about the graphs and random walks. We used the arules package to
find frequent pairs of items and finally applied the graph clustering algorithm to our data to
discover product categories. We were also able to identify important vertices performing
the role of the core and connectors.
                                           [ 344 ]
                                         Index
                                           community detection algorithms 328
A                                          community, in graph 328
adjacency matrix 316                       confusion matrix
apriori algorithm 16                         reference 148
association rule 11                        content-based recommendation 68, 69, 71
association rule mining                    content-based recommendation engine
  about 13, 14, 15, 17, 18, 19, 21, 23       designing 77, 79
  support/confidence threshold 26            searching 86
                                             similarity index, building 80
B                                          cosine distance 78
bag-of-words 80                            cosine similarity
Banana vertex 331, 332                       about 118, 119
Brain Harris model 319                       need for 85
                                             reference 86
C                                          cross-selling campaign
case                                         about 30, 32
  introducing 244, 245                       conviction 34
category management                          leverage 33
  about 310                                cross-validation
  reference 319                              reference 142
challenges, stream data                    cultivars
  about 280                                  reference 68
  bounded problems 281
  drift 281                                D
  real time 282                            data generators
  single pass 281                            reference 287
collaborative filtering                    data preparation 321, 323
  about 116                                data stream data (DSD)
  designing 136, 137                         about 284
  implementing 136, 137                      as simulator, with drift 287
  latent factor approach 120, 121            as static simulator 285
  memory-based approach 118, 119             connecting, to actual data stream 292
  model-based approach 119, 120              connecting, to database 292
  normalization techniques 139, 140          connecting, to file 292
  ratings matrix 137                         connecting, to memory 291
  train model 144, 146, 147                  inflight operation 292
  train test split 141, 142, 143           data stream task (DST) 293, 295
data stream                                          centrality of vertex 317
  about 279                                          degree of vertex 315
  challenges 281, 282                                node, closeness 318
deep networks                                        node, farness 318
  for time series prediction 186                     random walk 319
deep neural networks                                 shortest path, finding between nodes 318
  about 171, 172                                     strength of vertex 316
  backward cycle 175
  feature hierarchy 173                          H
  forward cycle 174                              Hyperlink-induced topic search (HITS) 42, 44, 46
  latent features 173
defuzzification 98                               I
Delta TFIDF 227
                                                 inverse document frequency (IDF) 81
dictionary based scoring 220, 221
                                                 item-based models 152
digital control system
                                                 itemset 16
  batch layer 300
  reservoir sampling 303
  speed layer 299
                                                 J
directed graph 309                               Jaccard index 92
document frequency 81                            Jaccard's distance 78, 91
documentation 57, 62                             Jester5k dataset
dplyr                                              reference 126
  reference 12
drift                                            K
  about 287                                      k-means
  reference 287                                    reference 131
                                                 kernel density 211
F                                                kernel density estimation (KDE) 211, 212, 213,
factor-based models 153                              214, 215, 216
feature generation, RecordLinkage package        Koenigsberg problem
  about 247                                        reference 309
  phonetic features 250, 251
  string features 249, 250                       L
feature hierarchy 173                            lambda architecture
Fellegi-Sunter model 252                           reference 299
fuzzification 95, 96                             Levenshtein distance (LD)
fuzzy logic                                        reference 249
  about 95                                       lexicons
  reference 95                                     about 219
                                                   SocialSent 219
G                                                  Wordnet 219
graphs 309
graphs, in R                                     M
  about 311, 312, 313                            machine learning-based record linkage
  adjacency matrix 316                            about 261
                                            [ 346 ]
 supervised learning 263                      machine learning method 275
 unsupervised learning 261                    RShiny application source code 277
macro-cluster 283                             weights-based method 274
Manhattan distance 78                       R project code 155
market basket analysis 11                   R vignettes
matrix decomposition                          reference 180
 reference 120                              ratings matrix 137
matrix factorization 120                    recommender systems
multi-layer perceptron 171                    about 7
MXNet R package                               transactions 8
 about 175                                    web application 10
 reference 175                              recommenderlab package
MXNet symbolic API                            about 122, 123
 reference 177                                popular approach 124, 126
MXNet                                       RecordLinkage package
 symbolic programming 178                     feature generation 247
                                              usage, demonstrating 245
N                                           reservoir sampling
n fold cross-validation scheme 148            about 303
negative association rules 50, 51             reference 303
network analysis 309                        retailer use case 10
networks, in R 317                          RShiny application
news aggregator dataset                       assembling 234, 235, 236
  reference 72                                building 268, 270, 333, 336, 338
news aggregator                             rules visualization 53, 56
  use case 72
non-seasonal time series 165                S
normalization 139                           search results
                                              ranking 94
P                                           searching
Pearson coefficient 118, 119                  about 86
polarity 77                                   defuzzification 98
polarity scores 89                            fuzzification 95, 96
polarized context clusters 91, 223            fuzzy logic 95
probability density function                  Jaccard index 92
  reference 215                               Jaccard's distance 91
Product Network Analysis 310                  polarity scores 89
product network analysis                      rules, defining 97
  about 326, 328                              rules, evaluating 97
  demonstrating 340                         seasonal time series 166
                                            sensor network example, in pubnub
R                                             reference 298
                                            sentiment annotated datasets
R code
                                              reference 219
 expectation maximization method 273
                                            sentiment classification
 feature generation R code 272
                                       [ 347 ]
  about 218
  approach 219                                    T
  dictionary methods 219                          term frequency (tf) 81, 226
  machine learning methods 219                    term-frequency inverse document frequency (tf-idf)
sentiment classifier                                  226
  building 230, 231, 232, 233                     test split
sentiment mining                                    training 188, 191, 193, 196, 197, 200
  reference 219                                   text pre-processing
similarity index                                    about 224
  bag-of-words 80                                   Delta TFIDF 227
  cosine similarity 85                              term-frequency inverse document frequency (tf-
  document frequency 81                               idf) 226
  inverse document frequency (IDF) 81             TFIDF 82, 84
  term frequency 81                               time series data 164
  TFIDF 82, 84                                    time series
similarity measure 118                              about 163
Singular Value Decomposition (SVD) 121              as regression problem 167, 169, 170
SocialSent                                          non-seasonal time series 165
  reference 219                                     seasonal time series 166
softmax activation                                train model
  about 182                                         about 145, 146, 147
  use case 184                                      factor-based models 153
Soundex                                             item-based models 152
  reference 250                                     user-based models 149
stationary time series 167                        transactions 8, 321
stochastic record linkage                         twitteR package
  about 252                                         reference 217
  expectation maximization method 252, 254, 255   Twitter text 217, 218
  weights-based method 258, 259
stream clustering 282                             U
stream data                                       undirected graph 309
  about 280                                       unsupervised learning 261
  challenges 280                                  use case 126, 129, 130, 131, 298
stream MOA algorithms, in Java                    user rating matrix 116
  reference 293                                   user-based filtering 118
stream package                                    user-based models 149
  about 283                                       user-based recommendation (UBCF) 150
  components 284
  data stream data (DSD) 284                      W
  data stream task (DST) 293, 295                 walktrap algorithm 328, 329
supervised learning 263                           web application 10
symbolic programming, in MXNet                    weighted association rule mining 35, 36, 38, 40
  about 178                                       weighted transactions 9
  softmax activation 182                          Wordnet
                                                   reference 219
                                             [ 348 ]
X                       Z
Xavier initialization   z-score normalization
 reference 174            reference 140