BIG DATA
CHAPTER-5
Introduction to HBase
Introduction
In the early days, data volumes were small and the data was structured.
Data could easily be stored in a relational database.
After the evolution of the internet, huge amounts of structured and
semi-structured data were generated.
Storing and processing this data using an RDBMS became a problem.
HBase
• Open-source software.
• Non-relational.
• Distributed, column-oriented database.
• It runs on top of HDFS.
Different from RDBMS
• Not a SQL database
• Not relational
• No joins
• No query language
• Not a drop-in replacement for an RDBMS.
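To make the contrast with an RDBMS concrete, here is a minimal in-memory sketch of key-based, column-oriented access. It is a toy model, not the HBase client API: the class name `TinyTable`, its methods, and the `family:qualifier` column strings are assumptions chosen for illustration only.

```python
from collections import defaultdict


class TinyTable:
    """Toy stand-in for a column-oriented table (hypothetical, not HBase)."""

    def __init__(self):
        # row key -> {"family:qualifier" -> value}
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell; rows may have different sets of columns."""
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Fetch a whole row by key; a missing row yields an empty dict."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row in sorted key order."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key, dict(self._rows[key])


table = TinyTable()
table.put("user#001", "info:name", "Asha")
table.put("user#001", "info:city", "Vadodara")
table.put("user#002", "info:name", "Ravi")
```

There is no join and no query language here: data is reached by row key (`get`) or by a range of row keys (`scan`), which is the access pattern the bullets above describe.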
Features
• Linear scalability.
• Automatic and configurable sharding of tables.
• Automatic failover support.
• Strictly consistent reads and writes.
• Provides real-time random read/write access to data stored in HDFS.
• Integrates well with Hadoop MapReduce.
• Easy-to-use Java API for client access.
• Bulk import of large amounts of data.
• Backup option.
• Bloom filters for real-time queries.
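The Bloom filter item can be illustrated with a short sketch. This is a generic Bloom filter, not HBase's implementation; the class name, the bit-array size, and the use of SHA-256 as the hash are assumptions made for the example. The idea is that a lookup can cheaply answer "definitely not present" and so skip reading a file that cannot contain the requested key.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch (hypothetical parameters)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


row_filter = BloomFilter()
row_filter.add("row-0001")
```

A key that was added always reports `True`. A key that was never added usually reports `False`, but false positives are possible, which is why a `True` answer still requires a real read.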
Limitations of HBase
• Recovery takes a very long time if the HMaster goes down, and it takes
a long time to activate another node if the first node goes down.
• Cross-data operations and joins are very difficult to perform in HBase.
Even when joins are implemented with MapReduce, they take a lot of time
to design and develop.
• Data must be converted to a new format when migrating from RDBMS
external sources to HBase servers.
• Querying is challenging in HBase. An SQL layer such as Apache Phoenix
is needed to write queries that retrieve information from the database.
• Developing security features to grant access to users takes an
enormous amount of time.
Pig
• Pig Latin: A not-so-foreign
language for data processing.
Introduction
• Apache Pig works on top of MapReduce.
• It is a tool/platform used to analyze large data sets.
• We can perform all data manipulation operations in Hadoop using
Apache Pig.
• Pig provides a high-level language known as Pig Latin.
• Pig Latin lets users develop their own functions for reading, writing
and processing data.
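Pig Architecture
[Figure: Pig architecture. A Pig Latin script enters Pig, which parses the statements, then compiles, optimizes and plans them into MapReduce jobs. The MapReduce jobs run on top of HDFS.]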
Features of PIG
• Rich in operators: It provides a large number of operators.
• Ease of programming: Pig Latin is similar to SQL.
• UDFs: Pig provides the facility to create user-defined functions.
• Handles all kinds of data: Apache Pig can analyze all kinds of data,
both structured and unstructured.
Work flow of PIG
• Programmers write scripts using the Pig Latin language.
• All scripts are internally converted to MapReduce tasks.
• Apache Pig contains a component called the Pig Engine that accepts
Pig Latin scripts as input and converts those scripts into MapReduce jobs.
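The conversion step can be pictured with a small sketch. This is not Pig's compiler: it is a hand-written Python model of the map, shuffle and reduce phases that a grouping followed by a count would roughly correspond to. All function names and the sample records are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(records, key_field):
    """Map: emit a (key, 1) pair for every input record."""
    for record in records:
        yield record[key_field], 1


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values of each key to get a count per group."""
    return {key: sum(values) for key, values in grouped.items()}


def group_and_count(records, key_field):
    """Chain the three phases, as one MapReduce job would."""
    return reduce_phase(shuffle(map_phase(records, key_field)))


students = [{"city": "Pune"}, {"city": "Delhi"}, {"city": "Pune"}]
counts = group_and_count(students, "city")
```

Here `counts` is `{"Pune": 2, "Delhi": 1}`. The point of Pig is that the programmer writes only the high-level grouping and counting statements and does not have to write these phases by hand.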
Install PIG.
Can an elephant fly?
Can Hadoop be used more efficiently?
Let's see…
Ideas
Let's have a Red Bull for wings…
Not a great idea.
Shrink: before and after
Genetic change: before and after
Behind the scenes…?
Facebook initially developed Hive.
What is Hive?
• Hive is a data warehouse infrastructure built on top of Hadoop.
• Supports analysis of large datasets stored in Hadoop.
• Provides an SQL-like query language called HiveQL.
• Provides indexing to speed up query response.
Architecture of hive
Working of hive
Install HIVE
Java SE - Downloads | Oracle Technology Network | Oracle
Hadoop 2.4.1 Subversion https://svn.apache.org/repos/asf/hadoop/common -r
1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
Install Hadoop
Install Hive
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Hive - Data Types
1. Column Types
2. Literals
3. Null Values
4. Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
1. Integral Types
2. String Types
3. Timestamp
4. Dates
5. Decimals
6. Union Types
Literals
1. Floating Point Types
Floating point types are numbers with decimal points. Generally, this
type of data is composed of the DOUBLE data type.
2. Decimal Type
Decimal type data is a floating point value with a higher range than the
DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Complex Types
Create Database
Drop Database
Create Table
Example
Syntax of example
Load Data Statement
Example
Alter Table Statement
Change Statement
Drop Table Statement
Example
Partition
Renaming Partition
Dropping a Partition
Operator
There are four types of operators in Hive:
1. Relational Operators
2. Arithmetic Operators
3. Logical Operators
4. Complex Operators
Built-in Functions
Hive provides a set of built-in functions. The functions look quite
similar to SQL functions, except for their usage.
Creating a View
Dropping a View
Creating an Index
Dropping an Index
Select Query
Order By
Group By
Join Table
Join
Left Outer Join
Right Outer Join
Full Outer Join
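The difference between a plain join and a left outer join can be tried out locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in, not Hive itself; the table names, columns and rows are made up for the example. The two `SELECT` statements use equality join conditions, in a form that HiveQL is expected to accept as well.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Hive tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi'), (3, 'Meena');
    INSERT INTO orders VALUES (101, 1, 250), (102, 1, 100), (103, 2, 75);
""")

# JOIN: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.id, c.name, o.amount
    FROM customers c JOIN orders o ON (c.id = o.customer_id)
    ORDER BY o.order_id
""").fetchall()

# LEFT OUTER JOIN: every customer, with NULL where no order matches.
left_rows = conn.execute("""
    SELECT c.id, c.name, o.amount
    FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.customer_id)
    ORDER BY c.id, o.order_id
""").fetchall()
```

`inner_rows` holds three rows (two for Asha, one for Ravi). `left_rows` holds the same three plus `(3, 'Meena', None)`, because Meena has no orders. A right outer join keeps the unmatched rows of the right-hand table instead, and a full outer join keeps the unmatched rows of both.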
Physical Layout of Hive
• Warehouse directory in HDFS:
• /user/hive/warehouse
• Tables: subdirectories of the warehouse directory.
• Partitions: subdirectories of the corresponding table directory.
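The layout can be sketched as a small path-building helper. The function names are hypothetical, and the sketch assumes tables in the default database and the usual `column=value` naming of partition directories.

```python
import posixpath

WAREHOUSE_DIR = "/user/hive/warehouse"


def table_dir(table):
    """A table is a subdirectory of the warehouse directory."""
    return posixpath.join(WAREHOUSE_DIR, table)


def partition_dir(table, partition_spec):
    """A partition is a subdirectory of its table directory.

    partition_spec is an ordered list of (column, value) pairs.
    """
    parts = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir(table), *parts)


employee_2013 = partition_dir("employee", [("year", "2013")])
```

With these assumptions, `employee_2013` is `/user/hive/warehouse/employee/year=2013`. The table name `employee` and the partition column `year` are examples only.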
Encapsulation
• HiveQL queries are converted to MapReduce code by the Hive engine.
• The Hive engine translates all queries into a directed acyclic graph of
MapReduce jobs.
• These MapReduce jobs are sent to Hadoop for execution.
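A directed acyclic graph of jobs simply means that each job may depend on the output of earlier jobs, with no cycles. The sketch below models this with Python's standard `graphlib` module (Python 3.9+). The job names and dependencies are invented for illustration and are not what Hive would actually generate.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it depends on.
job_dag = {
    "join_orders_customers": set(),
    "group_by_city": {"join_orders_customers"},
    "order_by_total": {"group_by_city"},
}


def execution_order(dag):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(job_dag)
```

For this chain, `order` is `["join_orders_customers", "group_by_city", "order_by_total"]`: a job is submitted only after the jobs it depends on have finished.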
Dependencies
• The /user/hive directory is created as soon as a Hive session is started
for the first time.
• The /user/hive/warehouse directory must be accessible by everyone.
• hadoop dfs -chmod -R 1777 /user/hive/warehouse
• Recommended to set the sticky bit if supported.
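The mode `1777` in the command above can be decoded with Python's standard `stat` module. This is a small sketch that only inspects the mode number; it does not touch HDFS, and the helper name `describe_mode` is made up for the example.

```python
import stat

WAREHOUSE_MODE = 0o1777  # leading 1 = sticky bit, 777 = rwx for everyone


def describe_mode(mode):
    """Break a directory mode into the parts that matter here."""
    return {
        "symbolic": stat.filemode(stat.S_IFDIR | mode),
        "sticky_bit": bool(mode & stat.S_ISVTX),
        "world_writable": bool(mode & stat.S_IWOTH),
    }


warehouse_info = describe_mode(WAREHOUSE_MODE)
```

`warehouse_info["symbolic"]` is `drwxrwxrwt`. Everyone can read and write in the directory, and the trailing `t` (the sticky bit) means users can delete or rename only their own entries.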
Hive Command Line Interface
• The Hive CLI can be invoked with the hive command.
• % hive
Hive SQL script
HiveQL
• HiveQL is similar to the SQL query language.
• DML (Data Manipulation Language)
• SELECT
• DDL (Data Definition Language)
• SHOW TABLES
• CREATE TABLE
• ALTER TABLE
• DROP TABLE
Play with hive
Loading delimited data
Normal Table VS External Table
Normal Table | External Table
Normal tables are created under the warehouse directory. | External tables read directly from HDFS files.
Normal tables are directly visible through HDFS directory browsing. | External tables are not visible in the warehouse directory.
On dropping a normal table, both the source data and the table metadata are deleted. | On dropping an external table, only the metadata is deleted.
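The difference in drop behaviour can be modelled in a few lines. This is a toy simulation, not Hive: `ToyWarehouse`, its two dictionaries, and the paths used are assumptions for illustration.

```python
class ToyWarehouse:
    """Toy model of a metastore plus a file system (hypothetical)."""

    def __init__(self):
        self.files = {}      # path -> list of rows, stands in for HDFS
        self.metastore = {}  # table name -> (location, is_external)

    def create_table(self, name, location, external=False):
        self.metastore[name] = (location, external)
        self.files.setdefault(location, [])

    def drop_table(self, name):
        location, external = self.metastore.pop(name)
        if not external:
            # Normal table: the data is deleted together with the metadata.
            self.files.pop(location, None)
        # External table: only the metadata entry is removed.


wh = ToyWarehouse()
wh.create_table("sales", "/user/hive/warehouse/sales")
wh.create_table("raw_logs", "/data/raw_logs", external=True)
wh.drop_table("sales")
wh.drop_table("raw_logs")
```

After both drops the metastore is empty, the `sales` data is gone, and the files under `/data/raw_logs` are still there.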
Joins
• HiveQL supports joins on equality expressions only. Complex Boolean
expressions and inequality conditions are not supported.
• More than two tables can be joined.
• The number of MapReduce jobs generated for a join depends on the
columns being used.
• If the same column is used for all the tables, then n = 1.
• Otherwise n > 1.
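The rule for the number of jobs can be approximated with a short sketch. This is a deliberately simplified model and not Hive's query planner: it assumes that consecutive join conditions sharing a common column are merged into one job, and that a condition with no shared column starts a new job. The function name and the column names are hypothetical.

```python
def estimate_join_jobs(conditions):
    """Estimate n, the number of MapReduce jobs for a chain of joins.

    conditions is a list of (left_column, right_column) equality pairs
    in query order, e.g. ("a.key", "b.key1") for ON a.key = b.key1.
    """
    if not conditions:
        return 0
    jobs = 1
    shared = set(conditions[0])
    for condition in conditions[1:]:
        common = shared & set(condition)
        if common:
            shared = common           # same column reused: stay in one job
        else:
            jobs += 1                 # different column: another job
            shared = set(condition)
    return jobs


same_column = estimate_join_jobs([("a.key", "b.key1"), ("c.key", "b.key1")])
different_columns = estimate_join_jobs([("a.key", "b.key1"), ("c.key", "b.key2")])
```

Under this model `same_column` is 1, because `b.key1` appears in both conditions, and `different_columns` is 2, because the second join uses `b.key2`.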
Data type and format
Primitive data types
• Numeric (integral): TINYINT, SMALLINT, INTEGER, BIGINT
• Numeric (floating): FLOAT, DOUBLE
• Date/Time: TIMESTAMP, DATE, INTERVAL
• String: STRING, VARCHAR, CHAR
• Miscellaneous: BOOLEAN, BINARY
Data format in Hive
Apache Hive data file formats:
• TEXTFILE
• SEQUENCEFILE
• RCFILE
• ORC
• PARQUET
• AVRO
PIG VS HIVE
PIG | HIVE
Scripting language | SQL-like language
Comparatively fewer lines of code than MapReduce. | Comparatively fewer lines of code than MapReduce and Pig.
No partitions | Supports partitions
Mainly used by programmers | Mainly used by data analysts
Pig supports Avro | Hive supports Avro as a file format
www.paruluniversity.ac.in