Big Data Hadoop and Spark Developer
NoSQL Databases: HBase
Learning Objectives
By the end of this lesson, you will be able to:
Understand the need for NoSQL databases
Analyze the HBase architecture and components
Distinguish HBase from RDBMS
NoSQL Introduction
NoSQL Database
NoSQL is a form of unstructured storage.
A traditional database (DB) stores structured data, whereas NoSQL stores unstructured data.
Why NoSQL?
With the explosion of social media sites, such as Facebook and Twitter, the need to manage
large volumes of data has grown tremendously.
Common NoSQL stores include key-value pair databases, document databases, and column-based data stores.
Types of NoSQL
Type            Examples
Key-Value       Oracle NoSQL, Redis Server, Scalaris
Document-Based  MongoDB, CouchDB, OrientDB, RavenDB
Column-Based    BigTable, Cassandra, HBase, Hypertable
Graph-Based     Neo4J, InfoGrid, Infinite Graph, FlockDB
Graph example: a graph records nodes and relationships. Relationships organize nodes, and both nodes and relationships have properties.
RDBMS vs. NoSQL
The differences between RDBMS and NoSQL databases are as follows:
Feature        RDBMS      NoSQL Databases
Data Storage   Tabular    Variable
Schema         Fixed      Dynamic
Performance    Low        High
Scalability    Vertical   Horizontal
Reliability    Good       Poor
Assisted Practice
YARN Tuning Duration: 15 mins
Problem Statement: In this demonstration, you will learn how to tune YARN so that HBase runs
smoothly without being starved of resources.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
HBase Overview
What Is HBase?
HBase is a database management system designed in 2007 by Powerset, a company that Microsoft later acquired.
HBase rests on top of HDFS and enables real-time analysis of data.
It can store huge amounts of data in tabular format for extremely fast reads and writes.
HBase is mostly used in scenarios that require regular and consistent inserting and overwriting of data.
Why HBase?
HDFS stores and manages large amounts of data efficiently.
However, it supports only batch processing, and data is accessed sequentially.
Therefore, a solution is required to read or write any piece of data at any time, regardless of
where it sits in the dataset.
Characteristics of HBase
HBase is a type of NoSQL database and is classified as a key-value store. In HBase:
• A value is identified with a key.
• Both the key and the value are ByteArrays.
• Values are stored in key order.
• Values can be accessed quickly by their keys.
HBase is a database in which tables have no schema. At the time of table creation, column families are
defined, not columns.
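The key-value characteristics above can be illustrated with a small sketch. This is a toy in-memory model, not HBase code: the class name SortedByteStore and its methods are made up for illustration. It only shows the idea of byte-array keys kept in sorted order, which makes both point lookups and range scans by key cheap.

```python
import bisect


class SortedByteStore:
    """Toy key-ordered store: keys and values are bytes, kept sorted by key."""

    def __init__(self):
        self._keys = []    # sorted list of row keys (bytes)
        self._values = {}  # key -> value (bytes)

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys in byte order
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedByteStore()
store.put(b"row2", b"b")
store.put(b"row1", b"a")
store.put(b"row3", b"c")
scanned = list(store.scan(b"row1", b"row3"))  # rows come back in key order
```

Rows are inserted out of order but scanned in key order, which is why row key design matters so much in HBase.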
HBase: Real-Life Connect
Facebook's messenger platform needs to store over 135 billion messages every month.
The data falls into two datasets: a rarely accessed dataset and a highly volatile dataset.
Where do they store such data?
HBase Architecture
HBase has two types of nodes: Master and RegionServer. Their characteristics are as follows:
Master:
• Single Master node running at a time
• Manages cluster operations
• Not a part of the read or write path
RegionServer:
• One or more RegionServers running at a time
• Hosts tables, performs reads, and buffers writes
• Clients communicate with a RegionServer in order to read and write
A region in HBase is a subset of a table's rows. The Master node detects the status of RegionServers and
assigns regions to them.
HBase Components
The HBase components include HBase Master and multiple RegionServers.
Figure: HBase cluster architecture.
• ZooKeeper quorum: a set of ZooKeeper peers used for coordination or monitoring.
• HBase Master (HMaster): assigns regions and performs load balancing.
• RegionServer: hosts regions and an HLog.
• Region: contains one or more Stores. Each Store has a MemStore and StoreFiles (HFiles).
• HDFS: stores the HLogs and HFiles.
Storage Model of HBase
The two major components of the storage model are as follows:
Partitioning:
• A table is horizontally partitioned into regions.
• Each region is managed by a RegionServer.
• A RegionServer may hold multiple regions.
Persistence and data availability:
• HBase stores its data in HDFS, does not replicate RegionServers,
and relies on HDFS replication for data availability.
• Updates and reads are served from the in-memory cache called
MemStore.
Row Distribution of Data between RegionServers
The distribution of rows of structured data using HBase is illustrated here:
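Logical view: all rows in a table, sorted by row key (A1, A2, A22, A3, ..., K4, ..., O90, ..., Z30, Z55).
The rows are split into regions by key range, and the regions are spread across three RegionServers:
• Region 1: null to A3
• Region 2: A3 to F34
• Region 3: F34 to K80
• Region 4: K80 to O95
• Region 5: O95 to null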
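The mapping from a row key to its region can be sketched as a binary search over the region start keys. This is only an illustration of the idea: the region boundaries come from the figure, while the server names and the assignment of regions to servers are hypothetical.

```python
import bisect

# Region start keys from the figure; "" stands for the open (null) lower bound.
REGION_START_KEYS = ["", "A3", "F34", "K80", "O95"]
# Hypothetical assignment of the five regions to three RegionServers.
REGION_TO_SERVER = ["rs1", "rs1", "rs2", "rs2", "rs3"]


def find_region(row_key: str) -> int:
    """Return the index of the region whose [start, end) range holds row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def find_server(row_key: str) -> str:
    """Return the (hypothetical) RegionServer that hosts row_key."""
    return REGION_TO_SERVER[find_region(row_key)]
```

For example, find_region("K4") falls in the F34 to K80 range, because keys are compared as strings and "K4" sorts before "K80".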
Data Storage in HBase
Data is stored in files called HFiles or StoreFiles that are usually saved in HDFS.
HFile is a key-value map.
When data is added, it is written to a log called the Write Ahead
Log and is also stored in memory, in the MemStore.
HFiles are immutable, since HDFS does not support updates to an
existing file.
HBase periodically performs data compactions to control the
number of HFiles and to keep the cluster well-balanced.
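The write path described above can be sketched as a toy model. This is not HBase code: the class ToyRegionStore, its flush threshold, and its method names are assumptions made for illustration. It shows the sequence of log append, in-memory buffering, flushing to read-only files, and compaction.

```python
class ToyRegionStore:
    """Toy model of the write path: WAL -> MemStore -> immutable HFiles."""

    def __init__(self, flush_threshold: int = 2):
        self.wal = []        # append-only log of (key, value) edits
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # flushed files, oldest first; never modified
        self.flush_threshold = flush_threshold  # assumed, tiny for the demo

    def put(self, key: str, value: str) -> None:
        self.wal.append((key, value))   # 1. record the edit in the log
        self.memstore[key] = value      # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Write the MemStore out as a new sorted, read-only file."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key: str):
        if key in self.memstore:             # newest data first
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # then newest file to oldest
            for k, v in hfile:
                if k == key:
                    return v
        return None

    def compact(self) -> None:
        """Merge all files into one, keeping the newest value per key."""
        merged = {}
        for hfile in self.hfiles:            # oldest first, so newer wins
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))] if merged else []


store = ToyRegionStore()
for k, v in [("r1", "a"), ("r2", "b"), ("r1", "c"), ("r3", "d")]:
    store.put(k, v)
files_before = len(store.hfiles)
store.compact()
```

An overwrite never edits an existing file. The newer value simply lands in a later file, and compaction discards the older one when the files are merged.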
Data Model
Following are the features of the data model in HBase:
• Cells are multi-versioned.
• One column family can have any number of columns.
• Cells within a column family are sorted physically.
• The table is very sparse, as most cells have NULL values.
• Everything except table names is stored as ByteArrays.
Figure: each row is identified by a rowkey and holds its own set of columns, for example CF1:C1, CF1:C2, CF1:C3, CF2:C1, and CF1:C8.
Data Model: Features
Each cell is addressed by row key, column family, column qualifier, and timestamp:
Row Key
  Column family 1: qualifier1, qualifier2
  Column family 2: qualifier1, qualifier2, qualifier3
Each qualifier holds timestamped values (for example, Timestamp 1 with value1).
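The data model can be pictured as a nested map from row key to column family to qualifier to timestamped values. The sketch below is a toy illustration of that shape only: ToyTable and its methods are hypothetical names, not part of any HBase API.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> column family -> qualifier -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # fixed when the table is created
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row: bytes, column: bytes, value: bytes, ts: int) -> None:
        family, qualifier = column.split(b":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        self.rows[row][family][qualifier][ts] = value  # qualifiers are ad hoc

    def get(self, row: bytes, column: bytes):
        """Return the value with the highest timestamp, or None if absent."""
        family, qualifier = column.split(b":", 1)
        versions = self.rows.get(row, {}).get(family, {}).get(qualifier, {})
        return versions[max(versions)] if versions else None


table = ToyTable([b"cf1", b"cf2"])
table.put(b"row1", b"cf1:q1", b"old", ts=1)
table.put(b"row1", b"cf1:q1", b"new", ts=2)      # second version of the cell
table.put(b"row1", b"cf2:anything", b"x", ts=1)  # new qualifier, no schema change
```

Column families are fixed up front, while qualifiers can be added per row at write time. A cell that was never written simply has no entry, which is how the table stays sparse.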
When to Use HBase?
Utilize HBase in the following scenarios:
• In a variable schema
• Enough data, in millions or billions of rows
• For random selects and range scans by key
• Sufficient commodity hardware with at least five nodes
HBase vs. RDBMS
The table shows a comparison between HBase and a Relational Database Management System (RDBMS):
HBase                                                RDBMS
Automatic partitioning                               Usually manual and admin-driven partitioning
Scales linearly and automatically with new nodes     Usually scales vertically by adding more hardware resources
Uses commodity hardware                              Relies on expensive servers
Has fault tolerance                                  Fault tolerance may or may not be present
Leverages batch processing with MapReduce            Relies on multiple threads or processes rather
distributed processing                               than MapReduce distributed processing
Connecting to HBase
HBase can be connected through the following interfaces:
• MapReduce
• Hive/Pig/HCatalog/Hue
• Java application
• REST/Thrift gateway
These clients reach HBase through the Java API. HBase works with ZooKeeper and stores its data in HDFS.
HBase Shell Commands
Common commands include, but are not limited to, the following:
• Create a table. Pass a table name, a dictionary of specifications per column family, and, optionally, a dictionary of table configuration:
    hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
  The above in shorthand would be the following:
    hbase> create 't1', 'f1', 'f2', 'f3'
• Describe the named table:
    hbase> describe 't1'
• Start disabling the named table:
    hbase> disable 't1'
• Drop the named table. The table must first be disabled:
    hbase> drop 't1'
• List all tables in HBase. An optional regular expression parameter can be used to filter the output:
    hbase> list
HBase Shell Commands
• Delete: deleting a cell value
• Put: putting a cell value
• Count: counting the number of rows in a table
• Get: getting the contents of a row or a cell
• Scan: scanning a table's values
Unassisted Practice
HBase Shell Duration: 15 mins
Problem Statement: Create a sample HBase table on the cluster, enter some data, query the table, then
clean up the data and exit.
Unassisted Practice
Steps to Perform
• HBase Shell
# Start the HBase shell
hbase shell
# Create a table called simplilearn with one column family named stats
create 'simplilearn', 'stats'
# Verify the table creation by listing everything
list
# Add a test value to the daily column in the stats column family for row 1
put 'simplilearn', 'row1', 'stats:daily', 'test-daily-value'
# Add a test value to the weekly column in the stats column family for row 1
put 'simplilearn', 'row1', 'stats:weekly', 'test-weekly-value'
# Add a test value to the weekly column in the stats column family for row 2
put 'simplilearn', 'row2', 'stats:weekly', 'test-weekly-value'
# Display the contents of the table
scan 'simplilearn'
# Display the contents of row 1
get 'simplilearn', 'row1'
# Disable the table
disable 'simplilearn'
# Drop the table and delete all data
drop 'simplilearn'
# Exit the HBase shell
exit
NoSQL Graph Database
A graph database is designed to treat the relationships between data as equally important
to the data itself.
It is intended to hold data without constricting it to a predefined model.
It focuses on the relationships between entities and can infer new knowledge
from existing information.
Why Graph Databases?
Accessing nodes and relationships in a native graph database is an efficient,
constant-time operation and allows you to quickly traverse millions of connections
per second per core.
Independent of the total size of your dataset, graph databases excel at managing
highly connected data and complex queries.
Property Graph Model
Nodes:
Nodes are the entities in the graph. Nodes can be tagged with labels, representing their
different roles in your domain.
Relationships:
Relationships provide directed, named, semantically relevant connections between two node
entities. A relationship always has a direction, a type, a start node, and an end node.
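The property graph model can be sketched with two small record types. This is a toy illustration under assumed names (Node, Relationship, ToyGraph, and the sample data are all made up), and it uses a linear scan for brevity, so it does not reproduce the traversal performance of a native graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: set = field(default_factory=set)        # roles in the domain
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node          # direction runs from start to end
    rel_type: str
    end: Node
    properties: dict = field(default_factory=dict)


class ToyGraph:
    def __init__(self):
        self.relationships = []

    def relate(self, start: Node, rel_type: str, end: Node, **props):
        rel = Relationship(start, rel_type, end, props)
        self.relationships.append(rel)
        return rel

    def neighbours(self, node: Node, rel_type: str):
        """Follow outgoing relationships of one type from a node."""
        return [r.end for r in self.relationships
                if r.start is node and r.rel_type == rel_type]


alice = Node("Alice", {"Person"})
bob = Node("Bob", {"Person"})
acme = Node("Acme", {"Company"})
graph = ToyGraph()
graph.relate(alice, "KNOWS", bob, since=2019)
graph.relate(alice, "WORKS_AT", acme)
```

Because every relationship is directed, asking for Bob's outgoing KNOWS relationships returns nothing even though Alice knows Bob.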
Assisted Practice
NoSQL Graph Database Duration: 15 mins
Problem Statement: In this demonstration, you will learn how to create a NoSQL graph database.
Key Takeaways
You are now able to:
Understand the need for NoSQL databases
Analyze the HBase architecture and components
Differentiate HBase from RDBMS
Knowledge Check
Knowledge Check 1: Which of the following are the nodes of HBase?
a. Spooldir and Master
b. Syslog and RegionServer
c. Master and RegionServer
d. None of the above
The correct answer is c.
Master and RegionServer are the nodes of HBase, whereas the other options are parts of Flume.
Knowledge Check 2: In which of the following scenarios can we use HBase?
a. For random selects and range scans by key
b. For sufficient commodity hardware with at least five nodes
c. In variable schema
d. All of the above
The correct answer is d.
HBase can be used for random selects and range scans by key, for sufficient commodity hardware with at least five
nodes, and in variable schema.
Lesson-End-Project
Problem Statement:
Global Transport Private Limited works in transport analytics and is keen to ensure the
safety of people. As the population increases, accidents are becoming
more frequent. Accidents occur mostly when the route is long, the driver is drunk,
or the roads are damaged. The company collects data on all accidents and provides
important insights that can reduce the number of accidents. The company wants to create a
public portal where anyone can see the aggregated accident data.
Your task is to suggest a suitable database and design a schema that covers most of the
use cases.
You are given a file that contains details about the various parameters of accidents.
The column details are as follows:
1. Year
2. TYPE
3. 0-3 hrs. (Night)
4. 3-6 hrs. (Night)
5. 6-9 hrs (Day)
6. 9-12 hrs (Day)
7. 12-15 hrs (Day)
8. 15-18 hrs (Day)
9. 18-21 hrs (Night)
10. 21-24 hrs (Night)
11. Total
You have to save the given data in HBase in such a way that you can answer the queries below.
Please state what you select as the row key and why.
1. Get the total number of accidents when you are given
a. Year
b. Type of Accident
c. Time Duration
2. Get the total number of accidents when you are given
a. Year
b. Type of Accident
3. Get the total number of accidents in a given year
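One possible starting point, offered as a sketch rather than the expected answer, is a composite row key of year and accident type, with one column per time slot. The separator, the column family name slot, and the sample counts in the usage note are all assumptions for illustration.

```python
# Time slots taken from the column list of the input file.
TIME_SLOTS = ["0-3", "3-6", "6-9", "9-12", "12-15", "15-18", "18-21", "21-24"]


def row_key(year: int, accident_type: str) -> bytes:
    """Composite key: year first, so one year's rows are contiguous."""
    return f"{year}|{accident_type}".encode()


def to_row(year: int, accident_type: str, counts):
    """Map one input record to (row key, {column: value}).

    The column family name 'slot' is an assumption made for this sketch.
    """
    columns = {f"slot:{s}".encode(): str(c).encode()
               for s, c in zip(TIME_SLOTS, counts)}
    columns[b"slot:total"] = str(sum(counts)).encode()
    return row_key(year, accident_type), columns


def year_prefix(year: int) -> bytes:
    """Prefix shared by every row of a given year, for a range scan."""
    return f"{year}|".encode()
```

With this layout, query 1 becomes a get on one row and one time-slot column, query 2 a get on the same row's total column, and query 3 a prefix scan over year_prefix(year) that sums the totals. For example, to_row(2014, "Road", [1, 2, 3, 4, 5, 6, 7, 8]) uses invented counts and yields the key b"2014|Road".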
Thank You