Lesson 6 NoSQL Databases HBase

The document discusses HBase, an open-source NoSQL database that provides big data storage and access across clusters of servers. It explains the architecture and components of HBase, including how it uses a master node and region servers to partition and store data across nodes in a Hadoop cluster. Key differences between HBase and relational databases are also outlined, such as HBase's use of dynamic schemas and horizontal scaling for high performance and scalability.

Uploaded by

Keerthi Uma Mahesh

Big Data Hadoop and Spark Developer

NoSQL Databases: HBase

Learning Objectives

By the end of this lesson, you will be able to:

Understand the need for NoSQL databases

Analyze the HBase architecture and components

Distinguish HBase from RDBMS

NoSQL Introduction
NoSQL Database

NoSQL is a form of unstructured storage.

Relational database (DB): structured storage
NoSQL: unstructured storage
Why NoSQL?

With the explosion of social media sites, such as Facebook and Twitter, the demand to manage
large data has grown tremendously.

[Figure: key-value pair databases, document databases, and column-based data stores]
Types of NoSQL

Key-Value        Example: Oracle NoSQL, Redis Server, Scalaris
Document-Based   Example: MongoDB, CouchDB, OrientDB, RavenDB
Column-Based     Example: BigTable, Cassandra, HBase, Hypertable
Graph-Based      Example: Neo4J, InfoGrid, Infinite Graph, FlockDB

[Figure: a graph records nodes and relationships; relationships organize nodes; nodes and relationships have properties.]
RDBMS vs. NoSQL

The differences between RDBMS and NoSQL databases are as follows:

Feature        RDBMS      NoSQL Databases
Data Storage   Tabular    Variable
Schema         Fixed      Dynamic
Performance    Low        High
Scalability    Vertical   Horizontal
Reliability    Good       Poor

Assisted Practice

YARN Tuning Duration: 15 mins

Problem Statement: In this demonstration, you will learn how to tune YARN so that HBase runs
smoothly without being resource starved.

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
HBase Overview
What Is HBase?

HBase is a database management system designed in 2007 by Powerset, a company that Microsoft later acquired.

HBase rests on top of HDFS and enables real-time analysis of data.

It can store huge amounts of data in a tabular format for extremely fast reads and writes.

HBase is mostly used in a scenario that requires regular and consistent inserting and overwriting of data.
Why HBase?

HDFS stores, processes, and manages large amounts of data efficiently.

However, it performs only batch processing and the data will be accessed in a sequential manner.

Therefore, a solution is required to access, read, or write data anytime regardless of its sequence in the
clusters of data.
Characteristics of HBase

HBase is a type of NoSQL database and is classified as a key-value store. In HBase:

• A value is identified with a key.
• Both the key and the value are ByteArrays.
• Values are stored in key order.
• Values are quickly accessed by their keys.

HBase is a database in which tables have no schema. At the time of table creation, column families are
defined, not columns.
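
The key-ordered, byte-array behaviour described above can be illustrated with a small in-memory sketch. This is not HBase code: the class name `SortedByteStore` and its methods are hypothetical, and the sketch only assumes that keys are compared byte by byte, which is why `row10` sorts before `row2`.

```python
import bisect


class SortedByteStore:
    """Toy key-value store whose keys and values are byte strings kept in key order."""

    def __init__(self):
        self._keys = []    # sorted list of bytes keys
        self._values = {}  # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedByteStore()
for row in (b"row10", b"row2", b"row1"):
    store.put(row, b"value-for-" + row)

# Byte-wise ordering places b"row10" between b"row1" and b"row2".
scanned_keys = [key for key, _ in store.scan(b"row1", b"row2")]
print(scanned_keys)
```

Because the keys stay sorted, a range scan only has to locate its two end points and read the keys between them, which is the property that range scans by key rely on.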
HBase: Real-Life Connect

Facebook's Messenger platform needs to store over 135 billion messages every month.

The data includes a rarely accessed dataset and a highly volatile dataset.

Where do they store such data?

HBase Architecture

HBase has two types of nodes: Master and RegionServer. Their characteristics are as follows:

Master
• Single Master node running at a time
• Manages cluster operations
• Not a part of the read or write path

RegionServer
• One or more RegionServers running at a time
• Hosts tables, performs reads, and buffers writes
• Clients communicate with RegionServers to read and write

A region in HBase is a subset of a table's rows. The Master node detects the status of RegionServers and
assigns regions to them.
HBase Components

The HBase components include HBase Master and multiple RegionServers.

ZooKeeper is used for coordination or monitoring. The HBase Master assigns regions and performs load balancing.

[Figure: HBase cluster architecture. A ZooKeeper quorum of ZooKeeper peers coordinates with the HMaster. Each RegionServer has an HLog and hosts regions; each region contains stores; each store has a MemStore and StoreFiles (HFiles). HLogs and HFiles are kept in HDFS.]
Storage Model of HBase

The two major components of the storage model are as follows:

Partitioning:
• A table is horizontally partitioned into regions.
• Each region is managed by a RegionServer.
• A RegionServer may hold multiple regions.

Persistence and data availability:

• HBase stores its data in HDFS, does not replicate RegionServers,
and relies on HDFS replication for data availability.
• Updates are buffered in, and recent data can be read from, the in-memory
store called MemStore.
Row Distribution of Data between RegionServers

The distribution of rows of structured data using HBase is illustrated here:

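Logical view (all rows in a table, sorted by row key): A1, A2, A22, A3, …, K4, …, O90, …, Z30, Z55

Regions (row-key ranges):
• Region 1: null → A3
• Region 2: A3 → F34
• Region 3: F34 → K80
• Region 4: K80 → O95
• Region 5: O95 → null

The five regions are spread across three RegionServers.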
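
How a row key maps to a region can be sketched with a binary search over the region start keys. The boundaries below are read from the figure, taking the last two as K80 and O95, which is an assumption. The server names `rs1` to `rs3` and the region-to-server assignment are hypothetical.

```python
import bisect

# Region start keys in sorted order; "" stands for the open lower bound (null).
REGION_START_KEYS = ["", "A3", "F34", "K80", "O95"]

# Hypothetical assignment of the five regions to three RegionServers.
REGION_SERVERS = ["rs1", "rs1", "rs2", "rs2", "rs3"]


def find_region(row_key: str) -> int:
    """Return the index of the region whose [start, next start) range holds row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


for key in ["A1", "A22", "A3", "K4", "O90", "Z55"]:
    index = find_region(key)
    print(key, "-> region", index, "on", REGION_SERVERS[index])
```

Each region covers a half-open range, so A3 belongs to the second region, while A22 sorts before A3 and stays in the first.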
Data Storage in HBase

Data is stored in files called HFiles or StoreFiles that are usually saved in HDFS.

HFile is a key-value map.

When data is added, it is written to a log called the Write Ahead Log and also stored in memory, in the MemStore.

HFiles are immutable, since HDFS does not support updates to an existing file.

HBase periodically performs data compactions to control the number of HFiles and to keep the cluster well-balanced.
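
The write path described above (log first, buffer in memory, flush to immutable files, compact later) can be sketched as a toy model. The class `ToyStore`, its method names, and the `flush_threshold` parameter are hypothetical. Real HBase flushes on memory size and also trims the log after a flush, which this sketch leaves out.

```python
class ToyStore:
    """Toy write path: log append, in-memory buffer, immutable flushed files."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log of every write
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable sorted tuples of (key, value), newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log the write first
        self.memstore[key] = value     # 2. then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffer out as a new immutable file and clear the buffer."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # the newest file wins
            for stored_key, value in hfile:
                if stored_key == key:
                    return value
        return None

    def compact(self):
        """Merge all files into one, keeping only the newest value per key."""
        merged = {}
        for hfile in self.hfiles:  # oldest first, so later files overwrite
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))] if merged else []


store = ToyStore(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")   # second write triggers the first flush
store.put("row1", "c")
store.put("row3", "d")   # triggers the second flush
files_before = len(store.hfiles)
store.compact()
print(files_before, len(store.hfiles), store.get("row1"))
```

An overwrite never edits an existing file: the new value for `row1` lands in a newer file, and compaction later drops the stale copy.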
Data Model

Following are the features of the data model in HBase:

• Multi-versioned: a cell can hold multiple versions of a value.
• One column family can have any number of columns.
• Cells within a column family are sorted physically.
• Very sparse, as most cells have NULL values.
• Everything except table names is stored as ByteArrays.

[Figure: rows identified by rowkey, each holding a different set of columns named family:qualifier, such as CF1:C1, CF1:C2, CF1:C3, CF1:C8, and CF2:C1.]

Data Model: Features

[Figure: a row key maps to column family 1 (qualifier1, qualifier2) and column family 2 (qualifier1, qualifier2, qualifier3); each qualifier holds values stored under a timestamp.]
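
The data model can be read as a nested map: row key, then column family, then qualifier, then timestamp, then value. The sketch below illustrates that reading only. The class `ToyTable` and its methods are hypothetical, and plain strings stand in for ByteArrays.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> column family -> qualifier -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed at table creation
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row, column, value, ts):
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are created on demand; only the family must already exist.
        self.rows[row][family][qualifier][ts] = value

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        family, qualifier = column.split(":", 1)
        cell = self.rows.get(row, {}).get(family, {}).get(qualifier, {})
        return sorted(cell.items(), reverse=True)[:versions]


table = ToyTable(["cf1", "cf2"])
table.put("row1", "cf1:qualifier1", "value1", ts=1)
table.put("row1", "cf1:qualifier1", "value2", ts=2)

latest = table.get("row1", "cf1:qualifier1")
history = table.get("row1", "cf1:qualifier1", versions=2)
empty = table.get("row1", "cf2:qualifier3")
print(latest, history, empty)
```

A cell that was never written simply has no entry, which is how a very sparse table avoids storing NULL values.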

When to Use HBase?

• Use HBase with a variable schema
• Enough data, in millions or billions of rows
• For random selects and range scans by key
• Sufficient commodity hardware with at least five nodes
HBase vs. RDBMS

The table shows a comparison between HBase and a Relational Database Management System (RDBMS):

HBase                                                RDBMS
Automatic partitioning                               Usually manual and admin-driven partitioning
Scales linearly and automatically with new nodes     Usually scales vertically by adding more hardware resources
Uses commodity hardware                              Relies on expensive servers
Has fault tolerance                                  Fault tolerance may or may not be present
Leverages batch processing with MapReduce            Relies on multiple threads or processes rather
distributed processing                               than MapReduce distributed processing
Connecting to HBase

HBase can be accessed through the following interfaces:

• MapReduce
• Hive/Pig/HCatalog/Hue
• REST/Thrift Gateway
• Java Application

[Figure: these clients go through the Java API, which uses ZooKeeper to reach HBase; HBase stores its data in HDFS.]
HBase Shell Commands

Common commands include, but are not limited to, the following:

Create a table. Pass a table name, a dictionary of specifications per column family, and optionally a dictionary of table configuration.
hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
hbase> # The above in shorthand would be the following:
hbase> create 't1', 'f1', 'f2', 'f3'

Describe the named table.
hbase> describe 't1'

Start disabling the named table.
hbase> disable 't1'

Drop the named table. The table must first be disabled.
hbase> drop 't1'

List all tables in HBase. An optional regular expression parameter can be used to filter the output.
hbase> list
HBase Shell Commands

• Delete: deleting a cell value
• Put: putting a cell value
• Count: counting the number of rows in a table
• Get: getting the contents of a row or a cell
• Scan: scanning a table's values
Unassisted Practice

HBase Shell Duration: 15 mins

Problem Statement: Create a sample HBase table on the cluster, enter some data, query the table, then
clean up the data and exit.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Unassisted Practice

Steps to Perform
• HBase Shell

# Start the HBase shell (run this from the operating system shell)
hbase shell

# Create a table called simplilearn with one column family named stats:
create 'simplilearn', 'stats'

# Verify the table creation by listing everything:
list

# Add a test value to the daily column in the stats column family for row 1:
put 'simplilearn', 'row1', 'stats:daily', 'test-daily-value'

# Add a test value to the weekly column in the stats column family for row 1:
put 'simplilearn', 'row1', 'stats:weekly', 'test-weekly-value'

# Add a test value to the weekly column in the stats column family for row 2:
put 'simplilearn', 'row2', 'stats:weekly', 'test-weekly-value'

# Display the contents of the table:
scan 'simplilearn'

# Display the contents of row 1:
get 'simplilearn', 'row1'

# Disable the table:
disable 'simplilearn'

# Drop the table and delete all data:
drop 'simplilearn'

# Exit the HBase shell:
exit
NoSQL Graph Database

A graph database is designed to treat the relationships between data as equally important to the data itself.

It is intended to hold data without constricting it to a predefined model.

It focuses on the relationships between entities and is able to infer new knowledge out of existing information.
Why Graph Databases?

Accessing nodes and relationships in a native graph database is an efficient, constant-time operation and allows you to quickly traverse millions of connections per second per core.

Independent of the total size of your dataset, graph databases excel at managing highly connected data and complex queries.
Property Graph Model

Nodes
Nodes are the entities in the graph. Nodes can be tagged with labels, representing their different roles in your domain.

Relationships
Relationships provide directed, named, semantically relevant connections between two node entities. A relationship always has a direction, a type, a start node, and an end node.
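
The property graph model can be sketched with two small record types. This is an illustration in plain Python, not the API of any graph database. The class names, the `KNOWS` and `WORKS_AT` relationship types, and the sample nodes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set = field(default_factory=set)        # roles such as "Person"
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    rel_type: str   # the relationship's type (its name)
    start: int      # id of the start node
    end: int        # id of the end node
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.outgoing = {}  # node id -> relationships that start at that node

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)
        self.outgoing.setdefault(node_id, [])

    def relate(self, start, rel_type, end, **properties):
        self.outgoing[start].append(Relationship(rel_type, start, end, properties))

    def neighbors(self, node_id, rel_type):
        """Follow relationships of one type in their direction only."""
        return [rel.end for rel in self.outgoing[node_id] if rel.rel_type == rel_type]


graph = PropertyGraph()
graph.add_node(1, ["Person"], name="Ann")
graph.add_node(2, ["Person"], name="Bob")
graph.add_node(3, ["Company"], name="Acme")
graph.relate(1, "KNOWS", 2, since=2020)
graph.relate(1, "WORKS_AT", 3)

print(graph.neighbors(1, "KNOWS"), graph.neighbors(2, "KNOWS"))
```

Storing each node's outgoing relationships next to the node lets a traversal step read only that node's own list, whatever the size of the rest of the graph.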
Assisted Practice

NoSQL Graph Database Duration: 15 mins

Problem Statement: In this demonstration, you will learn how to create a NoSQL graph database.

You are now able to:

Understand the need for NoSQL databases

Analyze the HBase architecture and components

Differentiate HBase from RDBMS

Knowledge Check

Knowledge Check 1: Which of the following are the nodes of HBase?

a. Spooldir and Master

b. Syslog and RegionServer

c. Master and RegionServer

d. None of the above

The correct answer is c.

Master and RegionServer are the nodes of HBase, whereas Spooldir and Syslog are parts of Flume.
Knowledge Check 2: In which of the following scenarios can we use HBase?

a. For random selects and range scans by key

b. For sufficient commodity hardware with at least five nodes

c. With a variable schema

d. All of the above

The correct answer is d.

HBase can be used for random selects and range scans by key, on sufficient commodity hardware with at least five
nodes, and with a variable schema.
Lesson-End-Project
Problem Statement:

Global Transport Private Limited is a transport analytics company that is keen to ensure the
safety of people. As the population increases, accidents are becoming
more and more frequent. Accidents occur mostly when the route is long, the driver is drunk,
or the roads are damaged. The company collects data on all accidents and provides
important insights that can reduce the number of accidents. The company wants to create a
public portal where anyone can see the aggregated accident data.

Your task is to suggest a suitable database and design a schema which can cover most of the
use cases.

You are given a file that contains details about the various parameters of accidents.
The column details are as follows:
1. Year
2. TYPE
3. 0-3 hrs. (Night)
4. 3-6 hrs. (Night)
5. 6-9 hrs (Day)
6. 9-12 hrs (Day)
7. 12-15 hrs (Day)
8. 15-18 hrs (Day)
9. 18-21 hrs (Night)
10. 21-24 hrs (Night)
11. Total
Lesson-End-Project

Problem Statement:

You have to save the given data in HBase in such a way that you can answer the queries below.
Please mention what you are selecting as the row key and why.

1. Get the total number of accidents when you are given

a. Year
b. Type of Accident
c. Time Duration

2. Get the total number of accidents when you are given

a. Year
b. Type of Accident

3. Get the total number of accidents in a given year

Thank You
