Big Data

Hive is a data warehouse infrastructure built on top of Hadoop. It provides SQL-like queries (HiveQL) to analyze large datasets stored in Hadoop. Hive converts the HiveQL queries into MapReduce jobs which are executed on Hadoop. Some key features of Hive include its ability to handle structured, semi-structured and unstructured data, rich set of operators, user defined functions, and support for analyzing large datasets.

Uploaded by

prithvikotian2002

BIG DATA

CHAPTER-5
Introduction to HBase
Introduction

In the early days, data volumes were small and the data was structured.

Data could easily be stored in a relational database.

After the evolution of the internet, huge amounts of structured and semi-structured data were generated.

Storing and processing this data with an RDBMS became a problem.
HBase
• Open-source software.
• Non-relational.
• Distributed, column-oriented database.
• It runs on top of HDFS.

Different from RDBMS
• Not a SQL database

• Not relational

• No joins

• No built-in SQL query language

• Not a drop-in replacement for an RDBMS.
Features
• Linear scalability.
• Automatic and configurable sharding of tables.
• Automatic failover support.
• Strictly consistent reads and writes.
• Provides real-time random read/write access to data stored in HDFS.
• Integrates well with Hadoop MapReduce.
• Easy Java API for client access.
• Import of large amounts of data.
• Backup option.
• Bloom filters for real-time queries.
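The Bloom filter feature can be illustrated with a small sketch. A Bloom filter answers "is this key possibly in this file?" without reading the file: a negative answer is always correct, while a positive answer may occasionally be wrong. The code below is a minimal illustration of the idea, not HBase's actual implementation; the class name, the bit-array size, the number of hashes, and the row keys are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: each key sets num_hashes positions in a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 digest with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical row keys stored in one file.
store_file_filter = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    store_file_filter.add(row_key)

print(store_file_filter.might_contain("row-002"))  # True
```

A read for a given row key can skip every file whose filter returns False, which is what makes random reads cheaper.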
Limitations of HBase
• Recovery takes a very long time if the HMaster goes down, and it takes a long time to activate another node when the first node fails.
• Cross-data operations and join operations are very difficult to perform in HBase. Joins can be implemented with MapReduce, but designing and developing them takes a lot of time.
• Migrating data from external RDBMS sources to HBase servers requires a new data format.
• Querying is very challenging in HBase. An SQL layer such as Apache Phoenix is needed to write queries that retrieve information from the database.
• Setting up security to grant access to users takes considerable time.
Pig
• Pig Latin: A not-so-foreign
language for data processing.
Introduction

• Apache Pig works on top of MapReduce.
• It is a tool/platform used to analyze large data sets.
• We can perform all data manipulation operations in Hadoop using Apache Pig.
• Pig provides a high-level language known as Pig Latin.
• Pig Latin lets users develop their own functions for reading, writing and processing data.
Pig Architecture

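[Figure: Pig architecture. A Pig Latin script is passed to Pig, which parses the statements, then compiles and optimizes them into a plan. The plan runs as MapReduce jobs on top of HDFS.]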
Features of Pig

• Rich in operators: it provides a large number of operators.
• Ease of programming: Pig Latin is similar to SQL.
• UDFs: Pig provides the facility to create user-defined functions.
• Handles all kinds of data: Apache Pig can analyze all kinds of data, structured and unstructured.
Workflow of Pig

• Programmers write scripts in the Pig Latin language.
• All scripts are internally converted to MapReduce tasks.
• Apache Pig contains a component called the Pig Engine that accepts Pig Latin scripts as input and converts them to MapReduce jobs.
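The conversion from a script to MapReduce work can be sketched in a few lines. The example below assumes a hypothetical Pig Latin script (shown in the comment) that groups log records by user and counts them, and it imitates in plain Python the map, shuffle and reduce steps that such a script would roughly correspond to. The data, the function names and the single-job plan are assumptions for illustration, not the plan Pig would actually generate.

```python
from collections import defaultdict

# Hypothetical Pig Latin script being imitated:
#   logs    = LOAD 'logs.txt' AS (user:chararray, url:chararray);
#   grouped = GROUP logs BY user;
#   counts  = FOREACH grouped GENERATE group, COUNT(logs);

records = [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart")]


def map_phase(rows):
    # Emit (key, value) pairs: the grouping column becomes the key.
    for user, _url in rows:
        yield user, 1


def shuffle(pairs):
    # Bring all values that share a key together, as the framework does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Aggregate each group; here COUNT is a sum of ones.
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'alice': 2, 'bob': 1}
```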
Install PIG.
Can an elephant fly?

Can Hadoop be used more efficiently?

Let's see…
Ideas

Let's have a Red Bull for wings…

Not a great idea.
Shrink

Before / After
Genetic change

Before / After
Behind the scenes…?

Facebook initially developed Hive.
What is Hive?

• Hive is a data warehouse infrastructure built on top of Hadoop.
• It supports analysis of large datasets stored in Hadoop.
• It provides an SQL-like query language called HiveQL.
• To provide quick query responses, it provides indexing.

Architecture of Hive
Working of Hive
Install HIVE

Java SE - Downloads | Oracle Technology Network | Oracle


Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
Install Hadoop
Install Hive
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Hive - Data Types

1. Column Types

2. Literals

3. Null Values

4. Complex Types
Column Types

Column types are used as the column data types of Hive. They are as follows:

1. Integral Types

2. String Types

3. Timestamp

4. Dates

5. Decimals

6. Union Types
Literals

1. Floating Point Types

Floating point types are numbers with decimal points. Generally, this type of data is represented by the DOUBLE data type, an approximate binary floating-point type with a range of roughly ±1.8 × 10^308.

2. Decimal Type

Decimal type data is an exact fixed-point value. Unlike DOUBLE, it does not offer a wider range; it offers exact decimal arithmetic with a declared precision and scale (up to 38 digits in Hive 0.13 and later).
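The practical difference between a binary floating-point type and an exact decimal type can be seen in a short sketch. Python's `float` is an IEEE 754 double, the same representation Hive uses for DOUBLE, and Python's `decimal.Decimal` is used here only as a stand-in for an exact decimal type such as Hive's DECIMAL. The values are arbitrary and chosen for illustration.

```python
import sys
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
double_sum = 0.1 + 0.2
print(double_sum == 0.3)  # False

# An exact decimal type keeps the digits as written.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True

# Largest finite double, about 1.8e308.
largest_double = sys.float_info.max
print(largest_double > 1e308)  # True
```

This is why monetary amounts are usually stored in a decimal column rather than a floating-point one.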
Complex Types
Create Database
Drop Database
Create Table
Example
Syntax of example
Load Data Statement
Example
Alter Table Statement
Change Statement
Drop Table Statement
Example
Partition
Renaming Partition
Dropping a Partition
Operator
There are four types of operators in Hive:

1. Relational Operators

2. Arithmetic Operators

3. Logical Operators

4. Complex Operators
Built-in Functions
Hive provides a set of built-in functions. They look quite similar to SQL functions, except for their usage.
Creating a View
Dropping a View
Creating an Index
Dropping an Index
Select Query
Order By
Group By
Join Table
Join
Left Outer Join
Right Outer Join
Full Outer Join
Physical Layout of Hive

• Warehouse directory in HDFS.
• /user/hive/warehouse
• Tables: subdirectories of the warehouse directory.
• Partitions: subdirectories of the corresponding table directory.
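The layout above can be sketched as a pair of path-building helpers. The sketch assumes the default warehouse root `/user/hive/warehouse` and Hive's `column=value` naming for partition directories. The function names, the table name `employee` and the partition columns are hypothetical.

```python
WAREHOUSE_ROOT = "/user/hive/warehouse"  # default warehouse directory


def table_dir(table):
    # Each table is a subdirectory of the warehouse directory.
    return f"{WAREHOUSE_ROOT}/{table}"


def partition_dir(table, partition_spec):
    # partition_spec is a list of (column, value) pairs in partition-column
    # order; each pair becomes one nested "column=value" directory.
    parts = "/".join(f"{column}={value}" for column, value in partition_spec)
    return f"{table_dir(table)}/{parts}"


print(table_dir("employee"))
# /user/hive/warehouse/employee
print(partition_dir("employee", [("year", "2013"), ("month", "10")]))
# /user/hive/warehouse/employee/year=2013/month=10
```

Because each partition is its own directory, a query that filters on a partition column only needs to read the matching directories.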
Encapsulation

• HiveQL queries are converted to MapReduce code by the Hive engine.
• The Hive engine translates all queries into a directed acyclic graph of MapReduce jobs.
• These MapReduce jobs are sent to Hadoop for execution.

Dependencies

• The /user/hive directory is created as soon as a Hive session is started for the first time.
• The /user/hive/warehouse directory must be accessible by everyone.
• hadoop fs -chmod -R 1777 /user/hive/warehouse
• Activating the sticky bit is recommended if supported.
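The mode 1777 used for the warehouse directory combines full read, write and execute access for everyone (777) with the sticky bit (the leading 1), which stops users from deleting or renaming entries they do not own. The sketch below decodes that mode with Python's standard `stat` module. It only interprets the number and does not touch HDFS.

```python
import stat

mode = 0o1777  # the octal mode passed to chmod for the warehouse directory

# Render the mode the way a directory listing would show it.
listing = stat.filemode(stat.S_IFDIR | mode)
print(listing)  # drwxrwxrwt

# The trailing "t" is the sticky bit.
sticky_bit_set = bool(mode & stat.S_ISVTX)
print(sticky_bit_set)  # True

# Everyone (user, group, others) has read, write and execute access.
world_writable = (mode & 0o777) == 0o777
print(world_writable)  # True
```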

Hive Command Line Interface

• The Hive CLI can be invoked with the hive command.

• % hive
Hive SQL script
HiveQL

• HiveQL is similar to the SQL query language.
• DML (Data Manipulation Language)
  • SELECT
• DDL (Data Definition Language)
  • SHOW TABLES
  • CREATE TABLE
  • ALTER TABLE
  • DROP TABLE
Play with hive
Loading delimited data
Normal Table vs External Table

Normal Table | External Table
Normal tables are created under the warehouse directory. | External tables read directly from a file in HDFS.
Normal tables are directly visible through HDFS directory browsing. | External tables are not visible in the warehouse directory.
On dropping a normal table, both the source data and the table metadata are deleted. | On dropping an external table, only the metadata is deleted.
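The difference in drop behaviour can be imitated with a toy model. The sketch below uses a plain dictionary as a stand-in for the metastore and local temporary directories as a stand-in for HDFS. All names (`create_table`, `drop_table`, `sales`, `clicks`) are hypothetical, and it only mirrors the rule in the comparison above.

```python
import shutil
import tempfile
from pathlib import Path


def create_table(metastore, name, location, external):
    # Register the table's metadata: where its data lives and its kind.
    metastore[name] = {"location": Path(location), "external": external}


def drop_table(metastore, name):
    entry = metastore.pop(name)  # metadata is removed in both cases
    if not entry["external"]:
        shutil.rmtree(entry["location"])  # normal tables also lose their data


root = Path(tempfile.mkdtemp())
managed_dir = root / "warehouse" / "sales"
external_dir = root / "landing" / "clicks"
for directory in (managed_dir, external_dir):
    directory.mkdir(parents=True)
    (directory / "part-00000").write_text("1,a\n")

metastore = {}
create_table(metastore, "sales", managed_dir, external=False)
create_table(metastore, "clicks", external_dir, external=True)

drop_table(metastore, "sales")
drop_table(metastore, "clicks")

print(managed_dir.exists(), external_dir.exists())  # False True
```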
Joins

• HiveQL supports joins only on equality expressions. Complex Boolean expressions and inequality conditions are not supported.
• More than two tables can be joined.
• The number of MapReduce jobs generated for a join depends on the columns being used.
• If the same column is used for all the tables, then n = 1.
• Otherwise n > 1.
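A small sketch of a reduce-side join helps explain both rules. In the map phase every row is emitted under its join key, the shuffle brings rows with equal keys to the same reducer, and the reducer pairs them up. Matching by key equality is what the shuffle provides, and when every table joins on the same column, one shuffle (one job) can bring all of them together. The tables, column names and functions below are assumptions for illustration, not Hive's actual execution plan.

```python
from collections import defaultdict
from itertools import product

# Hypothetical tables: customers(id, name) and orders(id, customer_id, amount).
customers = [(1, "Asha"), (2, "Ravi")]
orders = [(101, 1, 250), (102, 1, 75), (103, 3, 40)]


def map_phase():
    # Tag each row with its source table and emit it under the join key.
    for customer_id, name in customers:
        yield customer_id, ("customers", name)
    for order_id, customer_id, amount in orders:
        yield customer_id, ("orders", (order_id, amount))


def reduce_phase(pairs):
    # Shuffle: collect rows that share a key, kept apart by source table.
    groups = defaultdict(lambda: {"customers": [], "orders": []})
    for key, (tag, value) in pairs:
        groups[key][tag].append(value)
    # Reduce: pair every customer row with every order row for the same key.
    for key in sorted(groups):
        matches = product(groups[key]["customers"], groups[key]["orders"])
        for name, (order_id, amount) in matches:
            yield key, name, order_id, amount


joined = list(reduce_phase(map_phase()))
print(joined)  # [(1, 'Asha', 101, 250), (1, 'Asha', 102, 75)]
```

Keys that appear in only one table (customer 2 and order 103 here) produce no output, which corresponds to an inner join.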
Data types and formats
Primitive data types

• Numeric
  • Integral: TINYINT, SMALLINT, INT, BIGINT
  • Floating: FLOAT, DOUBLE
• Date/Time: TIMESTAMP, DATE, INTERVAL
• String: STRING, VARCHAR, CHAR
• Miscellaneous: BOOLEAN, BINARY
Data formats in Hive

Apache Hive data file formats:
• TEXTFILE
• SEQUENCEFILE
• RCFILE
• ORC
• PARQUET
• AVRO
Pig vs Hive

Pig | Hive
Scripting language (Pig Latin) | SQL-like language (HiveQL)
Comparatively fewer lines of code than MapReduce | Comparatively fewer lines of code than MapReduce and Pig
No partitions | Supports partitions
Mainly used by programmers | Mainly used by data analysts
Supports Avro | Supports Avro (through the Avro SerDe)

www.paruluniversity.ac.in
