Introduction to Hive
Outline
Motivation
Overview
Data Model / Metadata
Architecture
Performance
Cons and Pros
Application
Related Work
12/12/2017 Introduction to Hive 2
Outline
Started at Facbook
Data was collected by nightly
by corn job into Oracle db
Etl via hadcode python
Grew form 10s of Gb to 1 Tb/day new data(2007)
now 100 x
12/12/2017 Introduction to Hive 3
Motivation
Realtime
Hadoop
Cluster
Web Servers Scribe MidTier
Scribe Writers
Oracle RAC Hadoop Hive Warehouse MySQL
http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
12/12/2017 Introduction to Hive 4
Limitation of MR
Have to use M/R model
Not Reusable
Error prone
For complex jobs:
Multiple stage of Map/Reduce functions
Just like ask dev to write specify physical
execution plan in the database
12/12/2017 Introduction to Hive 5
Limitation of MR
Have to use M/R model
Not Reusable
Error prone
For complex jobs:
Multiple stage of Map/Reduce functions
Just like ask dev to write specify physical
execution plan in the database
12/12/2017 Introduction to Hive 6
Scrib-haddop cluster @Facebook
Use to log data from web server
Clusters Collected with the web serer
Network is the biggest bottleneck
Typically clustor has about 50 node
Status:
25 tb/ day of raw data logged
99% of the time data is available with in 20 seconds
12/12/2017 Introduction to Hive 7
Motivation
Limitation of MR
Have to use M/R model
Not Reusable
Error prone
For complex jobs:
Multiple stage of Map/Reduce functions
Just like ask dev to write specify physical
execution plan in the database
12/12/2017 Introduction to Hive 8
Motivation
Limitation of MR
Have to use M/R model
Not Reusable
Error prone
For complex jobs:
Multiple stage of Map/Reduce functions
Just like ask dev to write specify physical
execution plan in the database
12/12/2017 Introduction to Hive 9
Overview
Intuitive
Make the unstructured data looks like tables
regardless how it really lay out
SQL based query can be directly against these tables
Generate specify execution plan for this query
Whats Hive
A data warehousing system to store structured data on
Hadoop file system
Provide an easy query these data by execution Hadoop
MapReduce plans
12/12/2017 Introduction to Hive 10
Hive applications
Log Processing
Text Mining
Document indexing
Customer facing business intelligence
Predictive modeling
Hypothesis testing
12/12/2017 Introduction to Hive 11
Hive architecture
12/12/2017 Introduction to Hive 12
Working of Hive
12/12/2017 Introduction to Hive 13
Working of Hive
12/12/2017 Introduction to Hive 14
Working of Hive
12/12/2017 Introduction to Hive 15
Data Model
Tables
Basic type columns (int, float, boolean)
Complex type: List / Map ( associate array)
Partitions
Buckets
CREATE TABLE sales( id INT, items
ARRAY<STRUCT<id:INT,name:STRING>
) PARITIONED BY (ds STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;
SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
12/12/2017 Introduction to Hive 16
Metadata
Database namespace
Table definitions
schema info, physical location In HDFS
Partition data
ORM Framework
All the metadata can be stored in Derby by default
Any database with JDBC can be configed
12/12/2017 Introduction to Hive 17
Performance
GROUP BY operation
Efficient execution plans based on:
Data skew:
how evenly distributed data across a number of
physical nodes
bottleneck VS load balance
Partial aggregation:
Group the data with the same group by value as
soon as possible
In memory hash-table for mapper
Earlier than combiner
12/12/2017 Introduction to Hive 18
Performance
JOIN operation
Traditional Map-Reduce Join
Early Map-side Join
very efficient for joining a small table with a large
table
Keep smaller table data in memory first
Join with a chunk of larger table data each time
Space complexity for time complexity
7/20/2010 Introduction to Hive 19
Performance
Ser/De
Describe how to load the data from the file into a
representation that make it looks like a table;
Lazy load
Create the field object when necessary
Reduce the overhead to create unnecessary objects in
Hive
Java is expensive to create objects
Increase performance
7/20/2010 Introduction to Hive 20
Pros
Pros
A easy way to process large scale data
Support SQL-based queries
Provide more user defined interfaces to
extend
Programmability
Efficient execution plans for performance
Interoperability with other database tools
12/12/2017 Introduction to Hive 21
Cons
Cons
No easy way to append data
Files in HDFS are immutable
Future work
Views / Variables
More operator
In/Exists semantic
More future work in the mail list
12/12/2017 Introduction to Hive 22
Primitive Data Types
TINYINT 1 byte signed integer. 20
SMALLINT 2 byte signed integer. 20
INT 4 byte signed integer. 20
BIGINT 8 byte signed integer. 20
BOOLEAN Boolean true or false. TRUE
FLOAT Single precision floating point. 3.14159
DOUBLE Double precision floating point. 3.14159
String
Time Stamp
Binary
12/12/2017 Introduction to Hive 23
Collection Data Types
STRUCT struct('John', 'Doe')
MAP map('first', 'John,'last', 'Doe')
ARRAY array('John', 'Doe')
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING,
state:STRING, zip:INT>);
12/12/2017 Introduction to Hive 24
Text File Encoding of Data Values
12/12/2017 Introduction to Hive 25
Json schema
{
"name": "John Doe",
"salary": 100000.0,
"subordinates": ["Mary Smith", "Todd Jones"],
"deductions": {
"Federal Taxes": .2,
"State Taxes": .05,
"Insurance": .1
},
"address": {
"street": "1 Michigan Ave.",
"city": "Chicago",
"state": "IL",
"zip": 60600
}
12/12/2017 Introduction to Hive 26
Databases in Hive
The Hive concept of a database is essentially just a
catalog or namespace of tables.
However, they are very useful for larger clusters with
multiple teams and users, as away of avoiding table
name collisions.
12/12/2017 Introduction to Hive 27
syntax for creating and listing database
hive> CREATE DATABASE financials;
hive> CREATE DATABASE IF NOT EXISTS financials;
hive> CREATE DATABASE financials
> LOCATION '/my/preferred/directory';
hive> SHOW DATABASES;
default
Financials
12/12/2017 Introduction to Hive 28
Create database
hive> CREATE DATABASE financials
> COMMENT 'Holds all financial tables';
DESCRIBE DATABASE financials;
hive> CREATE DATABASE financials
> WITH DBPROPERTIES ('creator' = 'Mark Moneybags',
'date' = '2012-01-02');
hive> DESCRIBE DATABASE EXTENDED financials;
financials hdfs://master-
server/user/hive/warehouse/financials.db
{date=2012-01-02, creator=Mark Moneybags);
12/12/2017 Introduction to Hive 29
Cons
hive> USE financials;
SHOW TABLES; will list the tables in this database.
hive> set hive.cli.print.current.db=true;
hive (financials)> USE default;
hive (default)> set hive.cli.print.current.db=false;
hive> ...
12/12/2017 Introduction to Hive 30
Drop database & Alter database
hive> DROP DATABASE IF EXISTS financials;
hive> DROP DATABASE IF EXISTS financials CASCADE;
hive> ALTER DATABASE financials SET DBPROPERTIES
('edited-by' = Babjee');
12/12/2017 Introduction to Hive 31
Creating Tables
CREATE TABLE IF NOT EXISTS sales_db (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00)
\LOCATION '/user/hive/warehouse/sales_db/employees';
12/12/2017 Introduction to Hive 32
Cons
CREATE TABLE IF NOT EXISTS mydb.employees2
LIKE mydb.employees;
hive> SHOW TABLES;
employees
table1
Table2
show tables in sales_db;
SHOW TABLES 'empl.*';
employees
12/12/2017 Introduction to Hive 33
Describle tables
hive> DESCRIBE EXTENDED employees;
name string Employee name
salary float Employee salary
subordinates array<string> Names of subordinates
deductions map<string,float> Keys are deductions names, values are
percentages
address struct<street:string,city:string,state:string,zip:int> Home address
Detailed Table Information Table(tableName:employees, dbName:mydb,
owner:me,
...
location:hdfs://master-server/user/hive/warehouse/mydb.db/employees,
parameters:{creator=me, created_at='2012-01-02 10:00:00',
last_modified_user=me, last_modified_time=1337544510,
comment:Description of the table, ...}, ...)
12/12/2017 Introduction to Hive 34
Managed tables/internal tables
Managed called internal tables, because Hive controls
the lifecycle of their data (more or less)
External Tables
create external table if not exists employees(employees
string,
ename string,
city string)
row format delimited fields terminated by ',' location
'/home/cloudera/employee.txt'
12/12/2017 Introduction to Hive 35
Partitioned managed table
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING,
state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);
12/12/2017 Introduction to Hive 36
Partitioned table
hdfs://master_server/user/hive/warehouse/mydb.db/emp
loyees
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AL
.../employees/country=US/state=AK
SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';
12/12/2017 Introduction to Hive 37
Partitions
hive> SHOW PARTITIONS employees;
...
Country=CA/state=AB
country=CA/state=BC
...
country=US/state=AL
country=US/state=AK
hive> SHOW PARTITIONS employees PARTITION(country='US');
country=US/state=AL
country=US/state=AK
12/12/2017 Introduction to Hive 38
External Partitioned Tables
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
hms INT,
severity STRING,
server STRING,
process_id INT,
message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
12/12/2017 Introduction to Hive 39
Describe extended
hive> DESCRIBE EXTENDED log_messages;
...
message string,
year int,
month int,
day int
Detailed Table Information...
partitionKeys:[FieldSchema(name:year, type:int, comment:null),
FieldSchema(name:month, type:int, comment:null),
FieldSchema(name:day, type:int, comment:null)],
12/12/2017 Introduction to Hive 40
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
12/12/2017 Introduction to Hive 41
Drop /alter table
DROP TABLE IF EXISTS employees;
ALTER TABLE log_messages RENAME TO logmsgs;
ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03
ALTER TABLE log_messages DROP IF EXISTS PARTITION(year = 2011, month = 12,
day = 2);
12/12/2017 Introduction to Hive 42
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the
timestamp'
AFTER severity;
ALTER TABLE log_messages SET TBLPROPERTIES (
'notes' = 'The process id is no longer captured; this column
is always NULL');
12/12/2017 Introduction to Hive 43
HiveQL: Data Manipulation
LOAD DATA LOCAL INPATH '${env:HOME}/california-
employees OVERWRITE INTO TABLE employees
LOAD DATA LOCAL INPATH '${env:HOME}/california-
employees OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';
12/12/2017 Introduction to Hive 44
Create table
CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
INSERT OVERWRITE LOCAL DIRECTORY
'/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
12/12/2017 Introduction to Hive 45
Inserting Data into Tables from Queries
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';
12/12/2017 Introduction to Hive 46
Inserting Data into Tables from Queries
FROM staged_employees se
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'CA')
SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'IL')
SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';
12/12/2017 Introduction to Hive 47
Dynamic Partition Inserts
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;
12/12/2017 Introduction to Hive 48
Exporting Data
INSERT OVERWRITE LOCAL DIRECTORY
'/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
FROM staged_employees se
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
12/12/2017 Introduction 49
SELECT * WHERE se.cty = 'US' andtose.st
Hive = 'IL';
Cons
12/12/2017 Introduction to Hive 50