Exploring Yelp Dataset in Map Reduce,
HIVE and HPCC
Preface
The era we belong to is ruled by data, where advertently or
inadvertently our contribution also counts in. Parallel processing
framework concepts were introduced in 90s, but the real test for these
frameworks happened when the data production rates grew
exponentially. Big Data tools and software helps in processing these
data. Data analysis decisions helps an organization to determine the
data relevancy, underlying patterns and other useful information in the
data which can be used to drive better business decisions in the future.
This white paper discuss about Map Reduce, Hive, Pig which belongs to
Hadoop Ecosystem and High Performance Computing Cluster (HPCC)
on the context of Yelp Dataset from Yelp Dataset Challenge using a
subset of data.
The Data
Understand the data
A good practice is to understand the data before you gets your hands
dirty by start processing on the data. What the data represents, data
sources, data formats and so on.
Yelp
Yelp is a multi-platform local business review and social networking
application that publishes crowd sourced reviews about local business
owned by Yelp, California. A five star rating system is used to rate
business by reviewers.
Yelp dataset
Every year Yelp publishes its dataset about local businesses, users and
reviews across the country. The Yelp Dataset challenge for $35K price
is to mine the Academic Data and analyze the trends like what cuisines
are Yelpers raving about in these different countries, how much of a
business' success is really due to just location .
The dataset we are going to explore is the Yelp business and review
dataset which is provided in JSON format.
Yelp business dataset represents every business with a unique
business id. It also provides the city, location, state and similar
attributes.
The review dataset has the individual review done by each user
identified user id against business.
Business
Review
Hadoop Ecosystem and HPCC
HPCC
High-Performance Computing Cluster (HPCC) systems has a history of
14 years where it was developed for data and information services
business, data analytics and scoring, full text storage and retrieval,
entity extraction and resolution and data linking developed by Lexis
Nexis Risk Solutions.
Enterprise Control Language (ECL) is the primary programming
language for the HPCC environment. ECL is compiled into optimized C+
+ which is then compiled into DLLs for execution on the Thor and Roxie
platforms
Hadoop
Hadoop was created by Doug Cutting and Mike Cafarella in 2005 at
Yahoo! MapReduce and HDFS components were inspired by Google
papers on MapReduce and Google File System (GFS).Hadoop
ecosystem is have a collection of additional software packages that
can run alongside Hadoop - Apache Pig, Apache Hive, Apache HBase,
Apache Spark, and others.
Hadoop Map Reduce jobs are usually written in Java. Other languages
are supported through a streaming or pipe interface
Hive
Apache Hive is a data warehousing infrastructure based on Hadoop. It
provides a simple query language called Hive QL, which is similar to
SQL. .A Hive query gets converted into a sequence of stages. These
stages may be map/reduce stages or they may even be stages that do
Metastore or file system operations like move and rename.
Data Processing
Environments
The operation performed on the Yelp dataset is on the following
environment settings.
Framework
Host OS
HPCC
Systems
HIVE
Windows 8.1
Map Reduce
Windows 8.1
Windows 8.1
Virtualization
Software
VMware Player
5.0.3
Oracle VM Virtual
Box
VM Image
HPCC Systems VM-amd64-4.2.43
Cloudera Quick Start VM 4.7.0
Virtual Box (Red Hat 64 bit)
Oracle VM Virtual
Box
Ubuntu-12.04.5-desktop-i386
ISO 64 bit
Data Loading
HPCC
Data loading is controlled through the Distributed File Utility (DFU)
server. Data typically arrives on the landing zone (for example, by FTP).
File movement (across components) is initiated by DFU. Data is copied
from the landing zone and is distributed (sprayed) to the Data Refinery
(Thor) by the ECL code.
A single physical file is distributed into multiple physical files across
the nodes of a cluster. The aggregate of the physical files creates one
logical file that is addressed by the ECL code.
Here the JSON data is converted into csv and uploaded to the cluster.
HADOOP
We can use command line to manage files on HDFS
hadoop fs -put:
Copy single src file, or multiple src files from local file system to the Hadoop data file
system
hadoop fs -put /home/user/review_cleaned.csv /user/hadoop/dir3/
HUE
Using Hue Metastore manager in Cloudera we can create a table
manually from a JSON file and by specifying the input column names.
A python script was used to clean the Json and the review json cleaned
has the following headers.
["funny", "useful", "cool", "user_id", "review_id", "text", "business_id", "stars", "date", "type"]
A screen cast of MetaStore manager in Hue after Yelp_Biz table has
been created from Yelp business dataset json.
Queries
We mainly focus on three operations COUNT, GROUP and JOIN.
A COUNT query of the number of records based on a filter (city)
A GROUP query on one the columns on the dataset (city, review
count)
A JOIN of two datasets (business )
Query Execution in HPCC
In HPCC we read logical files by specifying the layout of the file. Once
we read the file we can specify the filter by mentioning the column
name and filter value.
Q1: How many businesses in Scottsdale city got reviewed?
The code snippet follows
EXPORT YBiz := DATASET(~query_result::biz,YBiz_Lay,CSV(HEADING(1)));
OUTPUT(COUNT(YBIZ(city='"Scottsdale"')),NAMED('Scott_Review));
Result
Q2: Find the total number of review counts for businesses in each
city.
This is going to be a group by based on city.
Code Snippet
CityCountRec:=RECORD
Biz_Clean.city;
statecount:=SUM(GROUP,Biz_Clean.review_c);
END;
personSummary:= TABLE(Biz_Clean,CityCountRec,city);
OUTPUT(personSummary,ALL);
In ECL we can ask the code to group the records by creating a layout
that refers a data set.
Query Result
Execution Graph
Q3: List 10 cool business
We have the cool attribute in review and business name in Yelp
business dataset. Obliviously we need a join of the two datasets.
Code snippet
Join_Lay := RECORD
STRING27 biz_id ;
STRING20 name;
STRING7 cool;
END;
Join_Lay JoinTran(YBiz B, YReview R) := TRANSFORM
SELF.biz_id := B.biz_id;
SELF.name := REGEXREPLACE('"',B.name,'');
SELF.cool := REGEXREPLACE('"',R.cool,'');
END;
YJoin := JOIN( YBiz, YReview,LEFT.biz_id = RIGHT.business_id ,JoinTran(LEFT, RIGHT));
YLay:=RECORD
STRING20 name := YJoin.name;
INTEGER4 cool:=SUM(GROUP,(INTEGER4)YJoin.cool);
END;
z := TABLE(YJoin,YLay,biz_id);
y := SORT(z,-cool);
y[1..10];
Query Execution in Hive
Q1: How many businesses in Scottsdale city got reviewed?
We can implement a simple COUNT query in HIVE.
SELECT COUNT(*) FROM YELP_BIZ WHERE city = Scottsdale
After executing the query in HIVE query editor we will get the result.
There was one map task, one reduce task.
Total Map Reduce CPU Time Spent: 2 seconds 480 milliseconds
Q2: Find the total number of review counts for businesses in each
city?
A Hive query with SUM will fetch as the result.
SELECT b.city, SUM(review_count) AS review FROM yelp_biz b GROUP BY b.city
The result
There were one map task and one reduce task.
Total Map Reduce CPU Time Spent: 2 seconds 810 milliseconds
Q2: Top 10 coolest restaurants.
SELECT r.business_id, name, SUM(cool) AS coolness FROM review r JOIN business b
ON (r.business_id = b.business_id) WHERE categories LIKE '%Restaurants%' GROUP
BY r.business_id, name ORDER BY coolness DESC LIMIT 10
There were 3 map reduce jobs.
Total Map Reduce CPU Time Spent: 17 seconds 610 milliseconds
Result
Query Execution using Map Reduce
JSON is harder to partition into distinct segments than a format such as
XML because JSON doesnt have a token (like an end tag in XML) to
denote the start or end of a record.
The problem in using JSON in Map Reduce is if you are working with large
JSON files, you need to be able to split them.
Elephant Bird, an open source project that contains some useful
utilities for working with LZOP compression, has an LzoJsonInputFormat
that can read JSON, though it requires that the input file be LZOP
compressed.
So we have uploaded the csv file for the dataset into HDFS and will do
the processing.
Mapper and Reducer code snippet
Mapper
Reducer
Map Reduce Logs
We wrote the data to yelpdb/01 folder
There is a slight mismatch if one record compared to Hive and HPCC
executions.
Q2: Find the total number of review counts for businesses in each
city?
Mapper code snippet
Reducer code snippet
Map Reduce Logs
Output file written to yelpdb/o2 folder
Summary
The easiest way of processing the dataset was using Hive as loading
the dataset and defining its layout via HUE. The HiveQL increases the
usability with its SQL like syntax. Writing map reduce programs for big
data problems becomes easy if you are familiar with Java and it gives
you more control over implementation and execution.
HPCC is a lesser known platform but powerful enough to compete with
Hadoop. The language ECL used is powerful but with an unfamiliar set
of syntax and semantics takes more development time for novices. If
you consider a data movement of files from HPCC to Hadoop Comma
Separated Files (CSV) is the best option even though HPCC can
produce XML files.
As most of the queries were executed on single node local or virtual
machine environments a performance benchmarking comparing the
Big data platforms or technologies will be effete. At this point the
comparable material will be the line of codes and coding time.
References
http://hortonworks.com/hadoop-tutorial/how-to-process-datawith-apache-hive/
https://cwiki.apache.org/confluence/display/Hive/LanguageManua
l
http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
http://hpccsystems.com/
https://gigaom.com/2008/11/09/mapreduce-leads-the-way-forparallel-programming/
http://blog.cloudera.com/
https://amplab.cs.berkeley.edu/benchmark/