3161607 – Big Data Analytics
WEB BASED DATA
MANAGEMENT OF APACHE HIVE
Submitted To: Prof. Pooja Bhatt
Presented By:
Krupa Patel (190633116001)
Maitri Patel (190633116003)
Riya Soni (190633116004)
Outline:
Origin
What is Hive?
How Hive Works?
Hive Architecture
Working of Hive
Execution of Hive
Limitations
Apache Hive Vs Pig
Hive Table
Summary
Origin:
Hive was initially developed by Facebook.
Data was stored in an Oracle database every night.
ETL (Extract, Transform, Load) was performed on the data.
The data growth was exponential:
– By 2006: 1 TB/day
– By 2010: 10 TB/day
– By 2013: about 5,000,000,000 per day, and so on.
There was a need to find a way to manage the data effectively.
What is Hive?
Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MapReduce jobs and runs the jobs in the cluster.
Suitable for structured and semi-structured data.
Capable of dealing with different storage and file formats.
Provides HQL (an SQL-like query language).
What Hive is not:
It does not use complex indexes, so it does not respond in seconds.
But it scales very well; it works with data on the order of petabytes.
It is not independent, and its performance is tied to Hadoop.
How Hive Works?
Hive is built on top of Hadoop: think HDFS and MapReduce.
Hive stores data in HDFS.
Hive compiles SQL queries into MapReduce jobs and runs the jobs in the Hadoop cluster.
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
We need reports to make operations better, not to conduct the operations.
We use ETL to populate data in the data warehouse (DW).
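To make the "SQL query compiled into a MapReduce job" idea concrete, here is a minimal Python sketch of how a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be broken into map, shuffle, and reduce phases. The employees table, its rows, and the function names are assumptions made for illustration; this is not the code Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept).
rows = [("asha", "sales"), ("ben", "hr"), ("chen", "sales"), ("dev", "it")]

def map_phase(records):
    """Map: emit one (dept, 1) pair per input row."""
    for _name, dept in records:
        yield dept, 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the values for each key, giving COUNT(*) per dept."""
    return {key: sum(values) for key, values in grouped.items()}

def count_by_dept(records):
    """Chain the three phases, as a single MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(records)))

result = count_by_dept(rows)
print(result)  # {'sales': 2, 'hr': 1, 'it': 1}
```

With Hive, the user writes only the one-line query; the three phases above are what the cluster has to carry out on their behalf.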
Hive Architecture
User Interface – Hive is data warehouse infrastructure software that enables interaction between the user and HDFS.
Metastore – Hive chooses a database server to store the schema, or metadata, of tables, databases, and the columns in a table, along with their data types and HDFS mapping.
HiveQL Process Engine – HiveQL is similar to SQL and queries against the schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program.
Execution Engine – The Hive Execution Engine is the link between the HiveQL Process Engine and MapReduce. It executes the query in the MapReduce style.
HDFS or HBase – The Hadoop Distributed File System or HBase is the storage layer that holds the data. Extreme scalability (up to 100 PB) and self-healing storage.
Working of Hive :
Execution of Hive :
Execute Query: A Hive interface such as the command line or Web UI sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Get Plan: The driver asks the query compiler to parse the query, checking its syntax and working out the query plan, that is, what the query requires.
Get Metadata: The compiler sends a metadata request to the Metastore (any database).
Send Metadata: The Metastore sends the metadata to the compiler as a response.
Send Plan: The compiler checks the requirements and sends the plan back to the driver. At this point, parsing and compiling of the query are complete.
Execute Plan: The driver sends the execution plan to the execution engine.
Execute Job: The execution engine sends the job to the JobTracker, which is on the NameNode, and it assigns the job to a TaskTracker, which is on a DataNode. Here, the query executes as a MapReduce job.
Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.
Fetch Result: The execution engine receives the results from the DataNodes.
Send Results: The execution engine sends the resulting values to the driver, and the driver sends the results to the Hive interfaces.
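The execution steps above can be sketched as a toy pipeline in Python. Every class and method name below is a hypothetical stand-in chosen for illustration, the "compiler" only understands SELECT <column> FROM <table>, and a plain dictionary stands in for HDFS; none of this mirrors Hive's real internal API.

```python
class Metastore:
    """Holds table schemas (any database, in the real system)."""
    def __init__(self, schemas):
        self.schemas = schemas  # table name -> list of column names

    def get_metadata(self, table):
        return self.schemas[table]


class Compiler:
    """Parses the query and builds a plan using metadata from the Metastore."""
    def __init__(self, metastore):
        self.metastore = metastore

    def get_plan(self, query):
        words = query.split()
        # Syntax check: only "SELECT <column> FROM <table>" is accepted here.
        if len(words) != 4 or words[0].upper() != "SELECT" or words[2].upper() != "FROM":
            raise ValueError("syntax error")
        column, table = words[1], words[3]
        columns = self.metastore.get_metadata(table)  # Get Metadata / Send Metadata
        return {"table": table, "index": columns.index(column)}  # Send Plan


class ExecutionEngine:
    """Runs the plan against storage and fetches the results."""
    def __init__(self, storage):
        self.storage = storage  # table name -> list of row tuples (stands in for HDFS)

    def execute(self, plan):
        # Execute Job / Fetch Result
        return [row[plan["index"]] for row in self.storage[plan["table"]]]


class Driver:
    """Receives the query from the interface and coordinates the other parts."""
    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def run(self, query):
        plan = self.compiler.get_plan(query)  # Get Plan
        return self.engine.execute(plan)      # Execute Plan, then Send Results


metastore = Metastore({"users": ["id", "name"]})
engine = ExecutionEngine({"users": [(1, "asha"), (2, "ben")]})
driver = Driver(Compiler(metastore), engine)
names = driver.run("SELECT name FROM users")
print(names)  # ['asha', 'ben']
```

The point of the sketch is the division of labour: the driver never touches data, the compiler never touches storage, and the schema is looked up rather than embedded in the data.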
Limitations:
The biggest limitation of Hadoop is that one has to use the M/R (MapReduce) model. Other limitations are stated below:
* Not reusable
* Error prone
* Multiple stages of Map/Reduce functions for complex jobs.
* It is just like asking a developer to write the physical execution plan in the DB.
Apache Hive vs. Apache Pig
Hive Table:
A Hive table consists of:
Data: a file or group of files in HDFS.
Schema: metadata stored in a relational database.
You have to define a schema if you have existing data in HDFS that you want to use in Hive.
Schema and data are separate.
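The separation of schema and data can be illustrated with a small Python sketch: the raw lines carry no type information, and a separately stored schema is applied only when the data is read. The sample lines, column names, and the read_table helper are all assumptions made for this example, not part of Hive.

```python
# Raw data as it might sit in a file: untyped, comma-delimited text.
raw_lines = ["1,asha,52000.5", "2,ben,48000"]

# Schema kept separately from the data (a plain list standing in for the
# metadata stored in a relational database): column name and type.
schema = [("id", int), ("name", str), ("salary", float)]

def read_table(lines, table_schema, delimiter=","):
    """Apply the schema to each line only at read time."""
    rows = []
    for line in lines:
        fields = line.split(delimiter)
        rows.append(
            {name: cast(value) for (name, cast), value in zip(table_schema, fields)}
        )
    return rows

table = read_table(raw_lines, schema)
print(table[0])  # {'id': 1, 'name': 'asha', 'salary': 52000.5}
```

Because the files are never rewritten, the same data could be read again under a different schema simply by changing the schema entry.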
Defining a Table
Managing Table
Loading Data
Use LOAD DATA to import data into a Hive table.
Use the keyword OVERWRITE to replace the existing contents of the table with the loaded files; without it, the files are added to the table.
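As a rough analogy, loading data can be pictured as moving a file into the directory that backs the table. The Python sketch below works on a temporary local directory rather than HDFS, assumes that OVERWRITE clears the table's existing files before the new file arrives, and uses a hypothetical load_data helper and directory layout.

```python
import shutil
import tempfile
from pathlib import Path

def load_data(source_file, table_dir, overwrite=False):
    """Move a data file into a table's directory (a stand-in for LOAD DATA)."""
    table_dir = Path(table_dir)
    table_dir.mkdir(parents=True, exist_ok=True)
    if overwrite:
        # OVERWRITE: remove the files the table already holds.
        for existing in table_dir.iterdir():
            existing.unlink()
    shutil.move(str(source_file), str(table_dir / Path(source_file).name))

root = Path(tempfile.mkdtemp())
staging = root / "staging"
staging.mkdir()
table_dir = root / "warehouse" / "users"  # hypothetical table location

# First load: the table directory gains day1.csv.
(staging / "day1.csv").write_text("1,asha\n")
load_data(staging / "day1.csv", table_dir)

# Second load with overwrite: day1.csv is removed, day2.csv takes its place.
(staging / "day2.csv").write_text("2,ben\n")
load_data(staging / "day2.csv", table_dir, overwrite=True)

files_in_table = sorted(p.name for p in table_dir.iterdir())
print(files_in_table)  # ['day2.csv']
```

Note that the file is moved, not parsed: no schema check happens at load time in this sketch, which matches the idea that schema and data are kept separate.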
Insert Data
Use the INSERT statement to populate a table with data from another Hive table.
OVERWRITE replaces the data in the table; otherwise the data is appended to the table.
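The difference between appending and overwriting can be shown with a minimal Python sketch in which lists of tuples stand in for Hive tables. The insert_from helper, the table names, and the rows are hypothetical and only model the append-versus-replace behaviour described above.

```python
def insert_from(source_rows, target_rows, overwrite=False, where=lambda row: True):
    """Return the target table's rows after inserting selected source rows."""
    selected = [row for row in source_rows if where(row)]
    if overwrite:
        return selected              # OVERWRITE: old rows are replaced
    return target_rows + selected    # otherwise: new rows are appended

staff = [("asha", "sales"), ("ben", "hr"), ("chen", "sales")]
sales_team = [("zoe", "sales")]

def is_sales(row):
    return row[1] == "sales"

appended = insert_from(staff, sales_team, where=is_sales)
replaced = insert_from(staff, sales_team, overwrite=True, where=is_sales)
print(len(appended), len(replaced))  # 3 2
```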
Performing Queries (HiveQL):
SELECT
Summary
Thank You…!