INTRODUCTION TO DATA SCIENCE

(R23 – II Year I Sem)


LECTURE NOTES

UNIT III

NoSQL movement for handling Big Data: Distributing data storage and processing with Hadoop framework,
Case study on Risk Assessment for Loan Sanctioning, ACID principle of Relational Databases, CAP Theorem,
BASE principle of NoSQL databases, Types of NoSQL databases, Case study on disease diagnosis and profiling.

DISTRIBUTING DATA STORAGE AND PROCESSING WITH HADOOP FRAMEWORK:

New big data technologies such as Hadoop and Spark make it much easier to work with and control a
cluster of computers.
Hadoop can scale up to thousands of computers, creating a cluster with petabytes of storage.
This enables businesses to grasp the value of the massive amount of data available.

Hadoop: a framework for storing and processing large data sets

Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims to be all
of the following things and more:
1. Reliable—By automatically creating multiple copies of the data and redeploying
processing logic in case of failure.
2. Fault tolerant—It detects faults and applies automatic recovery.
3. Scalable—Data and its processing are distributed over clusters of computers
(horizontal scaling).
4. Portable—Installable on all kinds of hardware and operating systems.

The core framework is composed of a distributed file system, a resource manager, and a system to
run distributed programs. In practice it allows you to work with the distributed file system almost as
easily as with the local file system of your home computer. But in the background, the data can be
scattered among thousands of servers.

The Different Components Of Hadoop

At the heart of Hadoop we find


1. A distributed file system (HDFS)
2. A method to execute programs on a massive scale (MapReduce)
3. A system to manage the cluster resources (YARN)

On top of that, an ecosystem of applications arose, such as the databases Hive and HBase and
frameworks for machine learning such as Mahout.



Mapreduce: How Hadoop Achieves Parallelism

• Hadoop uses a programming method called MapReduce to achieve parallelism. A
MapReduce algorithm splits up the data, processes it in parallel, and then sorts,
combines, and aggregates the results back together.
• However, the MapReduce algorithm isn't well suited for interactive analysis or
iterative programs because it writes the data to a disk in between each computational
step. This is expensive when working with large data sets.
Let's see how MapReduce would work on a small fictitious example. You're the director of a toy
company. Every toy has two colors, and when a client orders a toy from the web page, the web page
puts an order file on Hadoop with the colors of the toy. Your task is to find out how many color units
you need to prepare. You'll use a MapReduce-style algorithm to count the colors.
First, let's look at a simplified version (figure 5.3).
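As a rough illustration, here is a minimal Python sketch (plain Python, not actual Hadoop code) of the map, shuffle, and reduce phases for the color count; the order files are simulated as simple strings.

from collections import defaultdict

# Simulated order files: each order lists the two colors of a toy.
orders = ["green blue", "red green", "blue blue"]

# Map phase: emit a (color, 1) pair for every color in every order.
mapped = [(color, 1) for order in orders for color in order.split()]

# Shuffle/sort phase: group the pairs by key (the color).
grouped = defaultdict(list)
for color, count in mapped:
    grouped[color].append(count)

# Reduce phase: aggregate the counts per color.
reduced = {color: sum(counts) for color, counts in grouped.items()}
print(reduced)  # e.g. {'green': 2, 'blue': 3, 'red': 1}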

NOTE: While Hadoop makes working with big data easy, setting up a good working cluster still
isn't trivial, but cluster managers such as Apache Mesos do ease the burden. In reality, many (mid-
sized) companies lack the competence to maintain a healthy Hadoop installation.
• This is why we’ll work with the Hortonworks Sandbox, a pre-installed and configured Hadoop
ecosystem.

Spark: replacing MapReduce for better performance


• Data scientists often do interactive analysis and rely on algorithms that are inherently
iterative; it can take a while until an algorithm converges to a solution.
• As this is a weak point of the MapReduce framework, we’ll introduce the Spark Framework
to overcome it.
Spark improves the performance on such tasks by an order of magnitude.

What Is Spark?
Spark is a cluster computing framework similar to MapReduce. Spark, however, doesn’t handle the
storage of files on the (distributed) file system itself, nor does it handle the resource management. For
this it relies on systems such as the Hadoop File System, YARN, or Apache Mesos.

How Does Spark Solve The Problems Of Mapreduce?


• Spark creates a kind of shared RAM memory between the computers of your cluster. This
allows the different workers to share variables (and their state) and thus eliminates the need to
write the intermediate results to disk.
• Spark uses Resilient Distributed Datasets (RDD), which are a distributed memory abstraction
that lets programmers perform in-memory computations on large clusters in a fault-tolerant
way. Because it’s an in-memory system, it avoids costly disk operations.
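As an illustration, the following hypothetical PySpark snippet (it assumes a running SparkContext available as sc, as in the case study later in this unit) caches an RDD in memory so that repeated passes don't reread or rewrite data on disk between steps.

# Minimal sketch: cache an RDD in memory and reuse it across several passes.
numbers = sc.parallelize([float(x) for x in range(1, 1001)]).cache()

for threshold in [100, 500, 900]:
    # Every pass reuses the same in-memory RDD; no intermediate disk writes.
    print(numbers.filter(lambda x: x > threshold).count())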



The Different Components Of The Spark Ecosystem
• Spark core provides a NoSQL environment well suited for interactive, exploratory analysis.
Spark can be run in batch and interactive mode and supports Python.
• Spark has four other large components, as listed below:
1. Spark Streaming is a tool for real-time analysis.
2. Spark SQL provides a SQL interface to work with Spark.
3. MLLib is a tool for machine learning inside the Spark framework.
4. GraphX is a graph database for Spark.

Join the NoSQL Movement:


• What are NoSQL databases, and what makes them so different from the relational or SQL
databases you're used to?
• NoSQL is short for Not Only Structured Query Language: although many NoSQL databases
can be queried with SQL or a SQL-like language, they aren't restricted to the relational model.
• Traditional relational database management systems (RDBMS) reside on a single computer or
server.
• With the growth of the internet, companies such as Google and Amazon felt they were held
back by these single-node databases and looked for alternatives.



Introduction to NoSQL:
The goal of NoSQL databases isn't only to offer a way to partition databases successfully over
multiple nodes, but also to present fundamentally different ways to model the data at hand, fitting
its structure to its use case rather than to how a relational database requires it to be modeled.

ACID: THE CORE PRINCIPLE OF RELATIONAL DATABASES:


The main aspects of a traditional relational database can be summarized by the concept ACID:
1. Atomicity—The “all or nothing” principle. If a record is put into a database, it’s put in
completely or not at all.
2. Consistency—This important principle maintains the integrity of the data. No entry
that makes it into the database will ever be in conflict with predefined rules, such as
lacking a required field or a field being numeric instead of text.
3. Isolation—When something is changed in the database, nothing can happen on this
exact same data at exactly the same moment. Instead, the actions happen in serial with
other changes. Isolation is a scale going from low isolation to high isolation. On this
scale, traditional databases are on the “high isolation” end.
4. Durability—If data has entered the database, it should survive permanently. Physical
damage to the hard discs will destroy records, but power outages and software crashes
should not.
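To make the all-or-nothing idea concrete, here is a small, hypothetical Python sketch using the standard library's sqlite3 module (not part of the original notes): either both updates of a transfer are committed, or on any error the whole transaction is rolled back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    # Atomicity: both updates succeed together or not at all.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()    # the "all" of all-or-nothing
except sqlite3.Error:
    conn.rollback()  # the "nothing" of all-or-nothing

print(conn.execute("SELECT name, balance FROM accounts").fetchall())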
ACID applies to all relational databases and certain NoSQL databases, such as the graph database
Neo4j.
For most other NoSQL databases another principle applies: BASE.
To understand BASE and why it applies to most NoSQL databases, we need to look at the CAP
Theorem.

CAP THEOREM: THE PROBLEM WITH DBS ON MANY NODES


• Once a database gets spread out over different servers, it’s difficult to follow the ACID
principle because of the consistency ACID promises;
• the CAP Theorem points out why this becomes problematic.
• The CAP Theorem states that a database can be any two of the following things but never all
three:
• Partition tolerant—The database can handle a network partition or network failure.
• Available—As long as the node you’re connecting to is up and running and you can connect
to it,the node will respond, even if the connection between the different database nodes is lost.
• Consistent—No matter which node you connect to, you’ll always see the exact same data.

For a single-node database it’s easy to see how it’s always available and consistent:
• Available—As long as the node is up, it’s available. That’s all the CAP availability promises.
• Consistent—There’s no second node, so nothing can be inconsistent.
Things get interesting once the database gets partitioned. Then you need to make a choice between
availability and consistency, as shown in figure 6.2.



Let’s take the example of an online shop with a server in Europe and a server in the United States,
with a single distribution center.
A German named Fritz and an American named Freddy are shopping at the same time on that same
online shop. They see an item and only one is still in stock: a bronze, octopus-shaped coffee table.
Disaster strikes, and communication between the two local servers is temporarily down.
If you were the owner of the shop, you’d have two options:
• Availability—You allow the servers to keep on serving customers, and you sort out
everything afterward.
• Consistency—You put all sales on hold until communication is reestablished.



THE BASE PRINCIPLES OF NOSQL DATABASES:
• RDBMS follows the ACID principles; NoSQL databases that don’t follow ACID, such as the
document stores and key-value stores, follow BASE.
• BASE is a set of much softer database promises:
1. Basically Available—Availability is guaranteed in the CAP sense. Taking the
web shop example, if a node is up and running, you can keep on shopping. Depending
on how things are set up, nodes can take over from other nodes. Elasticsearch, for
example, is a NoSQL document-type search engine that divides and replicates its data
in such a way that node failure doesn't necessarily mean service failure, via the
process of sharding. Each shard can be seen as an individual database server instance,
but it is also capable of communicating with the other shards to divide the workload as
efficiently as possible.

2. Soft state—The state of a system might change over time. This corresponds to
the eventual consistency principle: the system might have to change to make
the data consistent again. In one node the data might say "A" and in the other it
might say "B"; later, when the nodes communicate again, it's possible the "A" in the
first node is replaced by "B." Even though no one did anything to explicitly
change "A" into "B," it will take on this value as it becomes consistent with the other node.

3. Eventual consistency—The database will become consistent over time. In the
web shop example, the table is sold twice, which results in data inconsistency.
Once the connection between the individual nodes is reestablished, they'll
communicate and decide how to resolve it. This conflict can be resolved, for
example, on a first-come, first-served basis or by preferring the customer who
would incur the lowest transport cost.

NOSQL DATABASE TYPES:

• There are four big NoSQL types: key-value store, document store, column-oriented database,
and graph database.
• Each type solves a problem that can’t be solved with relational databases. Actual
implementations are often combinations of these.
• OrientDB, for example, is a multi-model database, combining NoSQL types.
• OrientDB is a graph database where each node is a document.
• Relational databases generally strive toward normalization: making sure every piece of data is
stored only once.
• Normalization marks their structural setup. If, for instance, you want to store data about a
person and their hobbies, you can do so with two tables: one about the person and one about
their hobbies.
• An additional table is necessary to link hobbies to persons because of their many-to-many
relationship: a person can have multiple hobbies and a hobby can have many persons
practicing it.
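As a hypothetical illustration (names and hobbies are made up), the normalized layout corresponds to three separate structures, while a document store would keep each person and their hobbies together in a single document:

# Normalized, relational-style layout: every fact is stored only once.
persons = {1: "Fritz", 2: "Freddy"}              # Person table
hobbies = {10: "archery", 11: "chess"}           # Hobby table
person_hobby = [(1, 10), (1, 11), (2, 10)]       # link table for the many-to-many relation

# Document-style alternative: one self-contained document per person.
person_docs = [
    {"name": "Fritz", "hobbies": ["archery", "chess"]},
    {"name": "Freddy", "hobbies": ["archery"]},
]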



• Now that you have something to compare NoSQL to, let’s look at the different types of
NoSQL data types:

Column-Oriented Database

• Traditional relational databases are row-oriented, with each row having a row id and each
field within the row stored together in a table.
• Let’s say, for example’s sake, that no extra data about hobbies is stored and you have only a
single table to describe people, as shown in figure 6.8. Notice how in this scenario you have
slight denormalization because hobbies could be repeated



Row-oriented database:
• Adding an entire record is easy.
• Preferable for online transaction processing (OLTP).
• Not well suited for analytics and reporting.
• Not suitable for overnight batch jobs.
• Examples: MySQL, PostgreSQL, Oracle.

Column-oriented database:
• Easy to add another column, because none of the existing columns are affected by it.
• Not suitable for OLTP.
• Shines when performing analytics and reporting: summing values and counting entries.
• More suitable for overnight batch jobs, supporting lightning-speed lookups and aggregations
using MapReduce algorithms for reports.
• Examples of column-family stores are Apache HBase, Facebook's Cassandra, Hypertable, and the
grandfather of wide-column stores, Google BigTable.

Key-Value Stores

• Key-value stores are the least complex of the NoSQL databases.


• Simplicity makes them the most scalable of the NoSQL database types
• capable of storing huge amounts of data
• The value in a key-value store can be anything: a string, a number, but also an entire new set
of key-value pairs encapsulated in an object.
• Figure 6.12 shows a slightly more complex key-value structure.
• Examples of key-value stores are Redis, Voldemort, Riak, and Amazon’s Dynamo.
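As a hypothetical sketch (not from the original notes), the redis-py client shows how little structure a key-value store imposes; it assumes a Redis server running locally on the default port.

import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis server

# The key is a plain string; the value can be a string or a number...
r.set("user:42:name", "Fritz")
r.set("user:42:visits", 1)
r.incr("user:42:visits")                       # atomic counter increment

# ...or a small set of nested key-value pairs stored as a hash.
r.hset("user:42:address", "city", "Berlin")
r.hset("user:42:address", "country", "Germany")

print(r.get("user:42:name"))                   # Fritz
print(r.hgetall("user:42:address"))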



Document Stores

• Document stores are one step up in complexity from key-value stores: a document store does
assume a certain document structure that can be specified with a schema.
• Document stores appear the most natural among the NoSQL database types because they’re
designed to store everyday documents as is, and they allow for complex querying and
calculations on this often already aggregated form of data.
• Document stores care little about normalization as long as the data is in a structure that makes
sense.
• Examples of document stores are MongoDB and CouchDB.
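A minimal, hypothetical MongoDB example using the pymongo client (database, collection, and field names are made up; it assumes a local MongoDB instance):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)       # assumes a local MongoDB instance
db = client["shop"]

# A document keeps related data together; little normalization is needed.
db.people.insert_one({
    "name": "Fritz",
    "hobbies": ["archery", "chess"],
    "address": {"city": "Berlin", "country": "Germany"},
})

# Complex queries work directly on the (already aggregated) documents.
print(db.people.find_one({"hobbies": "archery"}))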



Graph Databases:

• The last big NoSQL database type is the most complex one, geared toward storing relations
between entities in an efficient manner. When the data is highly interconnected, such as for
social networks, scientific paper citations, or capital asset clusters, graph databases are the
answer.
• Graph or network data has two main components:
1. Node—The entities themselves. In a social network this could be people.
2. Edge—The relationship between two entities. This relationship is represented by a line
and has its own properties.
An edge can have a direction, for example, if the arrow indicates who is whose boss. Graphs
can become incredibly complex given enough relation and entity types. Graph databases like
Neo4j also claim to uphold ACID, whereas document stores and key-value stores adhere to
BASE.
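As a hypothetical sketch (connection URI, credentials, and node names are placeholders), the official neo4j Python driver can create two person nodes and a directed IS_BOSS_OF edge and then query them with Cypher:

from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes are the entities; the edge carries the relationship and its direction.
    session.run(
        "MERGE (a:Person {name: $boss}) "
        "MERGE (b:Person {name: $employee}) "
        "MERGE (a)-[:IS_BOSS_OF]->(b)",
        boss="Alice", employee="Bob")
    result = session.run(
        "MATCH (a:Person)-[:IS_BOSS_OF]->(b:Person) RETURN a.name, b.name")
    for record in result:
        print(record["a.name"] + " is the boss of " + record["b.name"])

driver.close()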

A ranking of the most popular databases and how they're progressing can be found at
http://db-engines.com/en/ranking.



Case study: Assessing risk when loaning money
The goal of this case study is to gain first experience with these technologies and to see that for a large part you can (but
don't have to) work in much the same way as with other technologies. Note: The portion of the data used here isn't that big because
that would require serious bandwidth to collect it and multiple nodes to follow along with the example.
What we’ll use:
• Hortonworks Sandbox on a virtual machine. If you haven't downloaded and imported this into VM software such as
VirtualBox, please go back to section 1.5 where this is explained. Version 2.3.2 of the Hortonworks Sandbox was used.
• Python libraries: Pandas and pywebhdfs. They don't need to be installed in your local virtual
environment this time around; we need them directly on the Hortonworks Sandbox. Therefore we need to fire
up the Hortonworks Sandbox (on VirtualBox, for instance) and make a few preparations.
In the Sandbox command line there are several things you still need to do for this all to work, so connect to
the command line. You can do this using a program like PuTTY. If you're unfamiliar with PuTTY, it offers a
command line interface to servers and can be downloaded freely at
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
The PuTTY login configuration is shown in figure 5.6.



The default user and password are (at the time of writing) “root” and “hadoop”, respectively. You’ll need to
change this password at the first login, though. Once connected, issue the following commands:
■ yum -y install python-pip—This installs pip, a Python package manager.
■ pip install git+https://github.com/DavyCielen/pywebhdfs.git --upgrade—At the time of writing there was
a problem with the pywebhdfs library and we fixed that in this fork. Hopefully you won't require this anymore
when you read this; the problem has been signaled and should be resolved by the maintainers of this package.
■ pip install pandas—To install Pandas. This usually takes a while because of the dependencies.
we’ll go through several more of the data science process steps:
Step 1: The research goal
➢ The Lending Club is an organization that connects people in need of a loan with people who have
money to invest. Your boss also has money to invest and wants information before throwing a
substantial sum on the table.
➢ To achieve this, you’ll create a report for him that gives him insight into the average rating, risks, and
return for lending money to a certain person. By going through this process, you make the data
accessible in a dashboard tool, thus enabling other people to explore it as well. In a sense this is the
secondary goal of this case: opening up the data for self-service BI. Self-service Business Intelligence is
often applied in data-driven organizations that don’t have analysts to spare.
➢ We can do this case study because the Lending Club makes anonymous data available about the existing
loans.



Step 2: Data retrieval
It’s time to work with the Hadoop File System (or hdfs). First we’ll send commands through the command
line and then through the Python scripting language with the help of the pywebhdfs package.

USING THE COMMAND LINE TO INTERACT WITH THE HADOOP FILE SYSTEM

Let’s first retrieve the currently present list of directories and files in the Hadoop root folder using the
command line.
Type the command hadoop fs -ls / in PuTTY to achieve this.
Make sure you turn on your virtual machine with the Hortonworks Sandbox before attempting a connection.
In PuTTY you should then connect to 127.0.0.1:2222 as shown before in figure 5.6.
The output of the Hadoop command is shown in figure 5.8. You can also add arguments such as hadoop fs
-ls -R / to get a recursive list of all the files and subdirectories.



We’ll now create a new directory “chapter5” on hdfs.
sudo -u hdfs hadoop fs -mkdir /chapter5
sudo -u hdfs hadoop fs -chmod 777 /chapter5
You probably noticed a pattern here. The Hadoop commands are very similar to our local file system
commands (POSIX style) but start with hadoop fs and have a dash (-) before each command.
Table 5.1 gives an overview of popular file system commands on Hadoop and their local file system command
counterparts

There are two special commands you’ll use often. These are
■ Upload files from the local file system to the distributed file system (hadoop fs -put LOCALURI
REMOTEURI).
■ Download a file from the distributed file system to the local file system (hadoop fs -get REMOTEURI).
Suppose you have a .CSV file on the Linux virtual machine from which you connect to the Linux
Hadoop cluster. You want to copy the .CSV file from your Linux virtual machine to the cluster hdfs. Use the
command hadoop fs -put mycsv.csv /data.



Using PuTTY we can start a Python session on the Horton Sandbox to retrieve our data using a Python
script. Issue the “pyspark” command in the command line to start the session. If all is well you should see the
welcome screen shown in figure 5.9.

We download the file "LoanStats3d.csv.zip" from the Lending Club's website at
https://resources.lendingclub.com/LoanStats3d.csv.zip and unzip it. We use methods from the requests, zipfile, and
StringIO Python packages to respectively download the data, create a virtual file, and unzip it. This is only a
single file; if you want all their data you could create a loop, but for demonstration purposes this will do. Before
we can do so, however, we need to put it on the Hadoop file system. PyWebHdfs is a package that allows you to
interact with the Hadoop file system from Python. This is useful because you can use your favorite scripting
language to automate tasks, as shown in the following listing.
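A sketch of what such a listing could look like (Python 2, matching the Sandbox environment used here; the WebHDFS host, port, and user name are assumptions):

import requests
import zipfile
import StringIO                                  # Python 2; on Python 3 use io.BytesIO
from pywebhdfs.webhdfs import PyWebHdfsClient

# Assumed connection details for the Sandbox's WebHDFS service.
hdfs = PyWebHdfsClient(host="localhost", port="50070", user_name="hdfs")

source = "https://resources.lendingclub.com/LoanStats3d.csv.zip"
response = requests.get(source, verify=False)                    # download the zip archive
archive = zipfile.ZipFile(StringIO.StringIO(response.content))   # treat it as an in-memory file
csv_data = archive.read(archive.namelist()[0])                   # unzip the single .CSV it contains

# Store the raw .CSV on the Hadoop file system for later processing.
hdfs.create_file("chapter5/LoanStats3d.csv", csv_data)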



Before moving on, we can check our file using the following line of code:
print hdfs.get_file_dir_status('chapter5/LoanStats3d.csv')
The PySpark console should tell us our file is safe and well on the Hadoop system, as shown in figure 5.10.

With the file ready and waiting for us on Hadoop, we can move on to data preparation using Spark, because it’s
not clean enough to directly store in Hive.



Step 3: Data preparation
Now that we’ve downloaded the data for analysis, we’ll use Spark to clean the data before we
store it in Hive.
DATA PREPARATION IN SPARK
Spark is well suited for this type of interactive analysis because it doesn’t need to save the data
after each step and has a much better model than Hadoop for sharing data between servers (a
kind of distributed memory). The transformation consists of four parts:
1) Start up PySpark (should still be open from section 5.2.2) and load the Spark and Hive
context.
2) Read and parse the .CSV file.
3) Split the header line from the data.
4) Clean the data
Okay, onto business. The following listing shows the code implementation in the PySpark console.



Step 1: Starting up Spark in interactive mode and loading the context
The Spark context import isn’t required in the PySpark console because a context is readily available as
variable sc.
We then load a Hive context to enable us to work interactively with Hive. If you work interactively with
Spark, the Spark and Hive contexts are loaded automatically, but if you want to use it in batch mode you need to
load it manually. To submit the code in batch you would use the spark-submit filename.py command on the
Horton Sandbox command line.
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext()
sqlContext = HiveContext(sc)

With the environment set up, we’re ready to start parsing the .CSV file.

Step 2: Reading and parsing the .CSV file


Next we read the file from the Hadoop file system and split it at every comma we encounter. In our code the first
line reads the .CSV file from the Hadoop file system. The second line splits every line when it encounters a
comma. Our .CSV parser is naïve by design.
data = sc.textFile("/chapter5/LoanStats3d.csv")
parts = data.map(lambda r: r.split(','))



Notice how similar this is to a functional programming approach. For those who've never encountered it, you
can naïvely read lambda r: r.split(',') as "for every input r (a row in this case), split this input r when it encounters
a comma." As in this case, "for every input" means "for every row," but you can also read it as "split every row
by a comma." This functional-like syntax is one of my favorite characteristics of Spark.

Step 3: Split the header line from the data


To separate the header from the data, we read in the first line and retain every line that’s not similar to the
header line:
firstline = parts.first()
datalines = parts.filter(lambda x:x != firstline)
In reality, .CSV files do often contain a header line and you’ll need to perform a similar operation before you can
start cleaning the data.

Step 4: Clean the data


After the second step, our data consists of arrays. We’ll treat every input for a lambda function as an array now
and return an array. To ease this task, we build a helper function that cleans. Our cleaning consists of
reformatting an input such as “10,4%” to 0.104 and encoding every string as utf-8, as well as replacing
underscores with spaces and lowercasing all the strings. The second line of code calls our helper function for
every line of the array.
def cleans(row):
    row[7] = str(float(row[7][:-1]) / 100)
    return [s.encode('utf8').replace(r"_", " ").lower() for s in row]

datalines = datalines.map(lambda x: cleans(x))
Our data is now prepared for the report, so we need to make it available for our reporting tools. Hive is well
suited for this, because many reporting tools can connect to it. Let’s look at how to accomplish this.

SAVE THE DATA IN HIVE


To store data in Hive we need to complete two steps:
• Create and register metadata.
• Execute SQL statements to save data in Hive.
In this section, we’ll once again execute the next piece of code in our beloved PySpark shell, as shown in the
following listing.



Let’s drill deeper into each step for a bit more clarification.
Step 1: Create and register metadata
Many people prefer to use SQL when they work with data. This is also possible with Spark. You can even read
and store data in Hive directly as we’ll do. Before you can do that, however, you’ll need to create metadata that
contains a column name and column type for every column.
The StructType represents rows as an array of StructFields. Then you place it in a DataFrame that's registered as a
(temporary) table in Hive.
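A sketch of what this step could look like (assuming the firstline and datalines RDDs from the previous step, with every column treated as a string):

from pyspark.sql.types import StructType, StructField, StringType

# One StructField (column name + type) per column; all columns are kept as strings for now.
fields = [StructField(field_name, StringType(), True) for field_name in firstline]
schema = StructType(fields)

# Wrap the cleaned RDD in a DataFrame and register it as a temporary table.
schemaLoans = sqlContext.createDataFrame(datalines, schema)
schemaLoans.registerTempTable("loans")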



With the metadata ready, we’re now able to insert the data into Hive.

Step 2: Execute queries and store table in Hive


Now we’re ready to use a SQL-dialect on our data. First we’ll make a summary table that counts the number of
loans per purpose. Then we store a subset of the cleaned raw data in Hive for visualization in Qlik. Executing
SQL-like commands is as easy as passing a string that contains the SQL command to the sqlContext.sql function.
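A sketch of these two statements (the exact column selection is an assumption based on the fields used in the report later on):

# Summary table: number of loans per purpose (the title column).
sqlContext.sql("drop table if exists LoansByTitle")
sqlContext.sql("""create table LoansByTitle stored as parquet as
    select title, count(1) as number from loans
    group by title order by number desc""")

# Subset of the cleaned raw data, stored in Hive for the Qlik report.
sqlContext.sql("drop table if exists raw")
sqlContext.sql("""create table raw stored as parquet as
    select title, emp_title, grade, home_ownership, int_rate,
           recoveries, collection_recovery_fee, loan_amnt, term from loans""")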

With the data stored in Hive, we can connect our visualization tools to it.

Step 4: Data exploration & Step 6: Report building


We’ll build an interactive report with Qlik Sense to show to our manager. Qlik Sense can be downloaded
from http://www.qlik.com/try-or-buy/download-qlik-sense after subscribing to their website. When the
download begins you will be redirected to a page containing several informational videos on how to install and
work with Qlik Sense. It’s recommended to watch these first.
We use the Hive ODBC connector to read data from Hive and make it available for Qlik. A tutorial on
installing ODBC connectors in Qlik is available. For major operating systems, this can be found at
http://hortonworks.com/hdp/addons/.



Now open Qlik Sense. If installed in Windows you should have gotten the option to place a shortcut to the .exe
on your desktop. Qlik isn’t freeware; it’s a commercial product with a bait version for single customers, but it
will suffice for now.
Qlik can either take the data directly into memory or make a call every time to Hive. We’ve chosen the first
method because it works faster.
This part has three steps:
1 Load data inside Qlik with an ODBC connection.
2 Create the report.
3 Explore data.
Let's start with the first step, loading data into Qlik.



Step 1: Load data in Qlik
When you start Qlik Sense it will show you a welcome screen with the existing reports (called apps), as shown in
figure 5.12.

To start a new app, click on the Create new app button on the right of the screen, as shown in figure 5.13. This
opens up a new dialog box. Enter “chapter 5” as the new name of our app.

A confirmation box appears (figure 5.14) if the app is created successfully.



Click on the Open app button and a new screen will prompt you to add data to the application (figure 5.15)

After this step, it will take a few seconds to load the data in Qlik (figure 5.19).

Step 2: Create the report


Choose Edit the sheet to start building the report. This will add the report editor (figure 5.20)



Substep 1: Adding a selection filter to the report
The first thing we’ll add to the report is a selection box that shows us why each person wants a loan. To achieve
this, drop the title measure from the left asset panel on the report pane and give it a comfortable size and position
(figure 5.21). Click on the Fields table so you can drag and drop fields

Substep 2: Adding a KPI to the report


A KPI chart shows an aggregated number for the total population that’s selected. Numbers such as the average
interest rate and the total number of customers are shown in this chart (figure 5.22). Adding a KPI to a report
takes four steps, as listed below and shown in figure 5.23.
• Choose a chart—Choose KPI as the chart and place it on the report screen; resize and position to your
liking.
• Add a measure—Click the add measure button inside the chart and select int_rate.
• Choose an aggregation method—Avg(int_rate).
• Format the chart—On the right pane, fill in average interest rate as Label.



In total we’ll add four KPI charts to our report, so you’ll need to repeat these steps for the following KPI’s:
■ Average interest rate
■ Total loan amount
■ Average loan amount
■ Total recoveries

Substep 3: Adding bar charts to our report


Next we’ll add four bar charts to the report. These will show the different numbers for each risk grade. One bar
chart will explain the average interest rate per risk group, and another will show us the total loan amount per risk
group (figure 5.24).



Adding a bar chart to a report takes five steps, as listed below and shown in figure 5.25.
1 Choose a chart—Choose bar chart as the chart and place it on the report screen; resize and position to your
liking.
2 Add a measure—Click the Add measure button inside the chart and select int_rate.
3 Choose an aggregation method—Avg(int_rate).
4 Add a dimension—Click Add dimension, and choose grade as the dimension.
5 Format the chart—On the right pane, fill in average interest rate as Label.



Repeat this procedure for the following dimension and measure combinations:
■ Average interest rate per grade
■ Average loan amount per grade
■ Total loan amount per grade
■ Total recoveries per grade

Substep 4: Adding a cross table to the report


Suppose you want to know the average interest rate paid by directors of risk group C. In this case you want to get
a measure (interest rate) for a combination of two dimensions (job title and risk grade). This can be achieved
with a pivot table such as in figure 5.26



Adding a pivot table to a report takes six steps, as listed below and shown in figure 5.27.
1 Choose a chart—Choose pivot table as the chart and place it on the report screen; resize and position to your
liking.
2 Add a measure—Click the Add measure button inside the chart and select int_rate.
3 Choose an aggregation method—Avg(int_rate).
4 Add a row dimension—Click Add dimension, and choose emp_title as the dimension.
5 Add a column dimension—Click Add data, choose column, and select grade.
6 Format the chart—On the right pane, fill in average interest rate as Label.

After resizing and repositioning, you should achieve a result similar to figure 5.28. Click the Done button on the
left and you’re ready to explore the data.

Step 3: Explore the data


The result is an interactive graph that updates itself based on the selections you make.
We finally did it: We created the report our manager craves, and in the process we opened the door for other
people to create their own reports using this data. An interesting next step for you to ponder on would be to use
this setup to find those people likely to default on their debt.



Case study on disease diagnosis and profiling
In this case study, you'll learn how to build a disease search engine, albeit using only a fraction of the medical
data that is freely accessible. To tackle the problem, you’ll use a modern NoSQL database called Elasticsearch to
store the data, and the data science process to work with the data and turn it into a resource that’s fast and easy to
search.



To follow along with the code, you’ll need these items:
■ A Python session with the elasticsearch-py and Wikipedia libraries installed (pip install elasticsearch and pip
install wikipedia)
■ A locally set up Elasticsearch instance; see appendix A for installation instructions
■ The IPython library

Step 1: Setting the research goal


■ Your primary goal is to set up a disease search engine that would help general practitioners in diagnosing
diseases.
■ Your secondary goal is to profile a disease: What keywords distinguish it from other diseases?

This secondary goal is useful for educational purposes or as input to more advanced uses such as detecting
spreading epidemics by tapping into social media. With your research goal and a plan of action defined, let’s
move on to the data retrieval step.

Steps 2 and 3: Data retrieval and preparation


Data retrieval and data preparation are two distinct steps in the data science process, and even though this
remains true for the case study, we’ll explore both in the same section. This way you can avoid setting up local
intermediate storage and immediately do data preparation while the data is being retrieved. Let's look at where we
are in the data science process (see figure 6.18).



■ Internal data—You have no disease information lying around. If you currently work for a pharmaceutical
company or a hospital, you might be luckier.
■ External data—All you can use for this case is external data. You have several possibilities, but you’ll go with
Wikipedia.
As shown in figure 6.19 there are three distinct categories of data preparation to consider:
■ Data cleansing—The data you’ll pull from Wikipedia can be incomplete or erroneous. Data entry errors and
spelling mistakes are possible—even false information isn’t excluded.
Luckily, you don’t need the list of diseases to be exhaustive, and you can handle spelling mistakes at
search time; more on that later. Thanks to the Wikipedia Python library, the textual data you’ll receive is fairly
clean already. If you were to scrape it manually, you’d need to add HTML cleaning, removing all HTML tags.
The truth of the matter is full-text search tends to be fairly robust toward common errors such as incorrect values.
Even if you dumped in HTML tags on purpose, they’d be unlikely to influence the results; the HTML tags are
too different from normal language to interfere.



■ Data transformation—You don’t need to transform the data much at this point; you want to search it as is. But
you’ll make the distinction between page title, disease name, and page body. This distinction is almost
mandatory for search result interpretation.
■ Combining data—All the data is drawn from a single source in this case, so you have no real need to combine
data. A possible extension to this exercise would be to get disease data from another source and match the
diseases. This is no trivial task because no unique identifier is present and the names are often slightly different.
You can do data cleansing at only two stages: when using the Python program that connects Wikipedia to
Elasticsearch and when running the Elasticsearch internal indexing system:
■ Python—Here you define what data you’ll allow to be stored by your document store, but you won’t clean the
data or transform the data at this stage, because Elasticsearch is better at it for less effort.
■ Elasticsearch—Elasticsearch will handle the data manipulation (creating the index) under the hood.
First comes data retrieval: you need information on the different diseases. You have several ways to get that
kind of data; here you'll pull it from Wikipedia, which is a bit ironic because searches on the Wikipedia website itself
are handled by Elasticsearch. Wikipedia used to have its own system built on top of Apache Lucene, but it became
unmaintainable, and as of January 2014 Wikipedia began using Elasticsearch instead. Wikipedia has a Lists of
diseases page, as shown in figure 6.20. From here you can borrow the data from the alphabetical lists.

You know what data you want; now go grab it. You could download the entire Wikipedia data dump; if you
want to, you can find it at http://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki.
Another option is scraping the pages you require. Like Google, you can make a program crawl through the pages
and retrieve the entire rendered HTML. This would do the trick, but you’d end up with the actual HTML, so
you’d need to clean that up before indexing it. This creates an unnecessarily high amount of traffic, and if
enough people send crawlers, it can bring the HTTP server to its knees, spoiling the fun for everyone.
If you do need to scrape a website, script in a time gap between each page request. This way, your scraper more
closely mimics the behavior of a regular website visitor and you won’t blow up their servers. They’ve put an API
in place from which you can safely draw your information. You can read more about it at
http://www.mediawiki.org/wiki/API:Main_page.
Activate your Python virtual environment and install all the libraries you'll need for the rest of this case study:
pip install wikipedia
pip install elasticsearch



Open your favorite Python interpreter and import the necessary libraries:
from elasticsearch import Elasticsearch
import wikipedia
You’re going to draw data from the Wikipedia API and at the same time index on your local Elasticsearch
instance, so first you need to prepare it for data acceptance.

The first thing you need is a client. Elasticsearch() can be initialized with an address but the default is
localhost:9200. Elasticsearch() and Elasticsearch ('localhost:9200') are thus the same thing: your client is
connected to your local Elasticsearch node. Then you create an index named "medical". If all goes well, you
should see an "acknowledged:true" reply, as shown in figure 6.21.
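A minimal sketch of this step with the elasticsearch-py client (the string field type matches the pre-5.x Elasticsearch assumed throughout this case study):

from elasticsearch import Elasticsearch

client = Elasticsearch()                      # same as Elasticsearch('localhost:9200')
client.indices.create(index="medical")        # should answer with acknowledged: true

# Mapping: three string fields per disease document.
diseaseMapping = {
    "properties": {
        "name": {"type": "string"},
        "title": {"type": "string"},
        "fulltext": {"type": "string"}
    }
}
client.indices.put_mapping(index="medical", doc_type="diseases",
                           body={"diseases": diseaseMapping})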



This way you tell Elasticsearch that your index will have a document type called "diseases", and
you supply it with the field type for each of the fields. You have three fields in a disease
document: name, title, and fulltext, all of them of type string. If you hadn’t supplied the
mapping, Elasticsearch would have guessed their types by looking at the first entry it received.
Now let’s move on to Wikipedia. The first thing you want to do is fetch the List of diseases page, because this is
your entry point for further exploration:
dl = wikipedia.page("Lists_of_diseases")
You now have your first page, but you’re more interested in the listing pages because they
contain links to the diseases. Check out the links:
dl.links
The List of diseases page comes with more links than you’ll use. Figure 6.22 shows the
alphabetical lists starting at the sixteenth link.



This page has a considerable array of links, but only the alphabetic lists interest you, so keep only those:
diseaseListArray = []
for link in dl.links[15:42]:
    try:
        diseaseListArray.append(wikipedia.page(link))
    except Exception, e:
        print str(e)
A more robust alternative is to select the list pages with a regular expression; you can find more information
on regular expressions at https://docs.python.org/2/howto/regex.html.
One possibility for a regex version would be the following code snippet.

import re

diseaseListArray = []
check = re.compile("List of diseases*")
for link in dl.links:
    if check.match(link):
        try:
            diseaseListArray.append(wikipedia.page(link))
        except Exception, e:
            print str(e)

Each of these list pages in turn links to the individual disease pages; you can inspect the links of the first one with:
diseaseListArray[0].links
It's time to index the diseases. Once they're indexed, both data entry and data preparation are effectively over, as
shown in the following listing.
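A sketch of what the indexing loop could look like (using the disease name as the document id is an assumption for illustration):

for diseaseList in diseaseListArray:
    for disease in diseaseList.links:          # every link on a list page is (roughly) a disease
        try:
            currentPage = wikipedia.page(disease)
            client.index(index="medical", doc_type="diseases", id=disease,
                         body={"name": disease,
                               "title": currentPage.title,
                               "fulltext": currentPage.content})
        except Exception, e:                   # skip pages that error out or are ambiguous
            print str(e)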



For example, try this URL in your browser:
http://localhost:9200/medical/diseases/11%20beta%20hydroxylase%20deficiency

Specifying http://localhost:9200/medical/diseases/_search will return the first five results.


For a more structured view on the data you can ask for the mapping of this document type at
http://localhost:9200/medical/diseases/_mapping?pretty. The pretty GET argument shows the returned JSON in a
more readable format, as can be seen in figure 6.25. The mapping does appear to be the way you specified it: all
fields are type string.



Step 4: Data exploration
Data exploration is what marks this case study, because the primary goal of the project (disease diagnostics) is a
specific way of exploring the data by querying for disease symptoms.



Import the Elasticsearch library and define global search settings:
from elasticsearch import Elasticsearch
client = Elasticsearch()
indexName = "medical"
docType = "diseases"
searchFrom = 0
searchSize = 3
You’ll return only the first three results; the default is five. Elasticsearch has an elaborate JSON query language;
every search is a POST request to the server and will be answered with a JSON answer. Roughly, the language
consists of three big parts: queries, filters, and aggregations.
• A query takes in search keywords and puts them through one or more analyzers before the words are
looked up in the index. We’ll get deeper into analyzers a bit later in this chapter.
• A filter takes keywords like a query does but doesn’t try to analyze what you give it; it filters on the
conditions we provide.
• Aggregations can be compared to the SQL GROUP BY: buckets of words will be created, and for each bucket
relevant statistics can be calculated.
We'll use the "query string" query language, a way to query the data that closely resembles the Google search
query language. If, for instance, you want a search term to be mandatory, you add a plus (+) sign; if you want to
exclude the search term, you use a minus (-) sign.

PROJECT PRIMARY OBJECTIVE: DIAGNOSING A DISEASE BY ITS SYMPTOMS


Start off with three symptoms: fatigue, fever, and joint pain. Your imaginary patient has all three of them (and
more), so make them all mandatory by adding a plus sign before each one. You use the query string syntax to
search in all the indexed fields: fulltext, title, and name.
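A sketch of such a request (the field boosts on title and name are an assumption; indexName, docType, searchFrom, and searchSize were defined above):

searchBody = {
    "query": {
        "query_string": {
            "query": '+fatigue+fever+"joint pain"',
            "fields": ["fulltext", "title^5", "name^10"]   # boost matches in title and name
        }
    }
}
result = client.search(index=indexName, doc_type=docType, body=searchBody,
                       from_=searchFrom, size=searchSize)
for hit in result["hits"]["hits"]:
    print("%s  %s" % (hit["_score"], hit["_source"]["name"]))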



Figure 6.27 shows the top three results returned out of 34 matching diseases.
The results are sorted by their matching score, the variable _score. The matching score is no simple thing to
explain; it takes into consideration how well the disease matches your query and how many times a keyword was
found, the weights you gave, and so on. Currently, lupus doesn’t even show up in the top three results.
Luckily for you, lupus has another distinct symptom: a rash. The rash doesn’t always show up on the person’s
face, but it does happen and this is where lupus got its name: the face rash makes people vaguely resemble a
wolf.
Your patient has a rash but not the signature rash on the face, so add “rash” to the symptoms without
mentioning the face.
"query": '+fatigue+fever+"joint pain"+rash’



At this point, the search engine says Human Granulocytic Ehrlichiosis (HGE) is more likely. HGE is a disease
spread by ticks, like the infamous Lyme disease. By now a capable doctor would have already figured out which
disease plagues your patient, because in determining diseases many factors are at play, more than you can feed
into your humble search engine. For instance, the rash occurs only in 10% of HGE and in 50% of lupus patients.
Lupus emerges slowly, whereas HGE is set off by a tick bite. Advanced machine-learning databases fed with all
this information in a more structured way could make a diagnosis with far greater certainty. Given that you need
to make do with the Wikipedia pages, you need another symptom to confirm that it’s lupus. The patient experiences chest
pain, so add this to the list.
"query": '+fatigue+fever+"joint pain"+rash+"chest pain"',



HANDLING SPELLING MISTAKES: DAMERAU-LEVENSHTEIN
Say someone typed “lupsu” instead of “lupus.” Spelling mistakes happen all the time and in all types of
human-crafted documents. To deal with this data scientists often use Damerau-Levenshtein.
The Damerau-Levenshtein distance between two strings is the number of operations required to turn one
string into the other. Four operations are allowed to calculate the distance:
■ Deletion—Delete a character from the string.
■ Insertion—Add a character to the string.
■ Substitution—Substitute one character for another. Without the substitution counted as one operation, changing one
character into another would take two operations: one deletion and one insertion.
■ Transposition of two adjacent characters—Swap two adjacent characters.
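As an illustration (not part of the original notes), here is a small Python implementation of the restricted, optimal-string-alignment variant of the Damerau-Levenshtein distance:

def damerau_levenshtein(a, b):
    # d[i][j] = distance between the first i characters of a and the first j of b.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(b) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("lupus", "lupsu"))  # 1: a single transposition of adjacent characters

Elasticsearch's fuzzy matching builds on this kind of edit distance, which is why a query for "lupsu" can still find "lupus."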

PROJECT SECONDARY OBJECTIVE: DISEASE PROFILING


For this you’ll use the significant terms aggregation. Let’s do that on diabetes, a common disease that can take
many forms:
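A sketch of such a request against the pre-5.x Elasticsearch assumed here (a term filter on "diabetes" combined with a significant_terms aggregation on the fulltext field; the aggregation name is arbitrary):

searchBody = {
    "query": {
        "filtered": {                                   # filter only: no scoring needed
            "filter": {"term": {"fulltext": "diabetes"}}
        }
    },
    "aggregations": {
        "DiseaseKeywords": {
            "significant_terms": {"field": "fulltext", "size": 30}
        }
    }
}
result = client.search(index=indexName, doc_type=docType, body=searchBody)
for bucket in result["aggregations"]["DiseaseKeywords"]["buckets"]:
    print("%s  %s" % (bucket["key"], bucket["score"]))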



A filter is thus far less complex than an actual search: it’s either “yes” or “no” and this is evident in the output.
The score is 1 for everything; no distinction is made within the result set. The output consists of two parts now
because of the significant terms aggregation.



When should you use filters and when search queries? The rule is simple: use filters whenever possible and use
search queries for full-text search when a ranking between the results is required to get the most interesting
results at the top.

If you look at the first five keywords in figure 6.32 you’ll see that the top four are related to the origin of
diabetes. The following Wikipedia paragraph offers help:

The word diabetes (/ˌdaɪ.əˈbiːtiːz/ or /ˌdaɪ.əˈbiːtɨs/) comes from Latin diabētēs, which in
turn comes from Ancient Greek διαβήτης (diabētēs), which literally means "a passer through; a siphon"
[69]. Ancient Greek physician Aretaeus of Cappadocia (fl. 1st century CE) used that word, with the intended
meaning "excessive discharge of urine," as the name for the disease [70, 71, 72]. Ultimately, the word comes
from Greek διαβαίνειν (diabainein), meaning "to pass through," [69] which is composed of δια- (dia-),
meaning "through" and βαίνειν (bainein), meaning "to go" [70]. The word "diabetes" is first recorded in
English, in the form diabete, in a medical text written around 1425.
—Wikipedia page Diabetes_mellitus

This tells you where the word diabetes comes from: “a passer through; a siphon” in Greek. It also mentions
diabainein and bainein.



Step 3 revisited: Data preparation for disease profiling

Figure 6.33 Data science process step 3: data preparation. Data cleansing for text can be stop word
filtering; data transformation can be lowercasing of characters.

To index bigrams you need to create your own token filter and text analyzer. A token filter is capable of applying
transformations to tokens. Your specific token filter needs to combine tokens to create n-grams, also called
shingles.



You create two new elements: a token filter called "my_shingle_filter" and a new analyzer called
"my_shingle_analyzer."
Because n-grams are so common, Elasticsearch comes with a built-in shingle token filter type. All you need
to tell it is that you want the bigrams “min_shingle_size” : 2, “max_shingle_size” : 2, as shown in figure 6.34.
You could go for trigrams and higher, but for demonstration purposes this will suffice.
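A sketch of the corresponding settings update (output_unigrams is set to False so that the new field contains only the bigrams, and the index is closed and reopened around the change, as discussed below):

settings = {
    "analysis": {
        "filter": {
            "my_shingle_filter": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": False       # keep only the bigrams in this field
            }
        },
        "analyzer": {
            "my_shingle_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "my_shingle_filter"]
            }
        }
    }
}
client.indices.close(index="medical")                        # analysis settings require a closed index
client.indices.put_settings(index="medical", body=settings)
client.indices.open(index="medical")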

You can find an overview of which settings require the index to be closed at
http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html.
The index is now ready to use your new analyzer. For this you’ll create a new document type, diseases2, with a
new mapping, as shown in the following listing.
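A sketch of what this mapping could look like, reusing the analyzer defined above (the multi-field layout follows the description in the next paragraph):

diseaseMapping2 = {
    "properties": {
        "name": {"type": "string"},
        "title": {"type": "string"},
        "fulltext": {
            "type": "string",                  # analyzed with the standard analyzer, as before
            "fields": {
                "shingles": {                  # the extra "isotope": the same text, split into bigrams
                    "type": "string",
                    "analyzer": "my_shingle_analyzer"
                }
            }
        }
    }
}
client.indices.put_mapping(index="medical", doc_type="diseases2",
                           body={"diseases2": diseaseMapping2})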



Within fulltext you now have an extra parameter, fields. Here you can specify all the different isotopes of
fulltext. You have only one; it goes by the name shingles and will analyze the fulltext with your new
my_shingle_analyzer. You still have access to your original fulltext, and because you didn't specify an analyzer for it,
the standard one will be used as before. You can access the new one by giving the property name followed by
its field name: fulltext.shingles.



Step 4 revisited: Data exploration for disease profiling
You’ve once again arrived at data exploration. You can adapt the aggregations query and use your new field to
give you bigram key concepts related to diabetes:
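A sketch of the adapted request (it assumes the documents have been re-indexed under the diseases2 type with the new mapping):

searchBody = {
    "query": {
        "filtered": {
            "filter": {"term": {"fulltext": "diabetes"}}
        }
    },
    "aggregations": {
        "DiseaseBigrams": {
            "significant_terms": {"field": "fulltext.shingles", "size": 10}
        }
    }
}
result = client.search(index=indexName, doc_type="diseases2", body=searchBody)
for bucket in result["aggregations"]["DiseaseBigrams"]["buckets"]:
    print(bucket["key"])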

Your new aggregate, called DiseaseBigrams, uses the fulltext.shingles field to provide a few new insights into
diabetes. These new key terms show up:
■ Excessive discharge—A diabetes patient needs to urinate frequently.
■ Causes polyuria—This indicates the same thing: diabetes causes the patient to urinate frequently.
■ Deprivation test—This is actually a trigram, “water deprivation test”, but it recognized deprivation test
because you have only bigrams. It’s a test to determine whether a patient has diabetes.
■ Excessive thirst—You already found “thirst” with your unigram keyword search, but technically at that point
it could have meant “no thirst.”
Step 6: Presentation and automation
• Your primary objective, disease diagnostics, turned into a self-service diagnostics tool by allowing a
physician to query it via, for instance, a web application. You won’t build a website in this case, but if
you plan on doing so, please read the sidebar “Elasticsearch for web applications.”
• The secondary objective, disease profiling, can also be taken to the level of a user interface; it’s possible
to let the search results produce a word cloud that visually summarizes the search results.



You can find an example implementation at http://www.jasondavies.com/wordcloud/
#%2F%2Fwww.jasondavies.com%2Fwordcloud%2Fabout%2F. Adding your keywords on this D3.js-driven
website will produce a unigram word cloud like the one shown in figure 6.36 that can be incorporated into the
presentation of your project results.

