Technical Principles of
Hive
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.
Foreword
Based on Hive provided by the Hive Open Source community,
Hive in FusionInsight HD has various enterprise-level
customization features, such as Colocation table creation,
column encryption, and syntax enhancement. With these
features, FusionInsight HD outperforms the community version
in terms of reliability, tolerance, scalability, and performance.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 2
Objectives
Upon completion of this course, you will be able to know:
Hive application scenarios and basic principles
Enhanced features of FusionInsight Hive
Common Hive SQL statements
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 3
Contents
1. Introduction to Hive
2. Hive Functions and Architecture
3. Basic Hive Operations
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 4
Hive Overview
Hive is a data warehouse tool running on Hadoop and supports
PB-level distributed data query and management.
Hive provides the following functions:
Flexible ETL (extract/transform/load)
Supporting computing engines, such as MapReduce, Tez, and
Spark
Direct access to HDFS files and HBase
Easy to use and program
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 5
Application Scenarios of Hive
User behavior analysis
Data mining Interest analysis
Partition demonstration
Non-real- Log analysis
time analysis Text analysis
Data Daily/Weekly click count
aggregation Traffic statistics
Data extraction
Data Data loading
warehouse Data transformation
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 6
Position of Hive in FusionInsight
Application service layer
OpenAPI/SDK REST/SNMP/Syslog
Data Information Knowledge Wisdom
DataFarm Porter Miner Farmer Manager
System
management
Hadoop API Plugin API
Service
governance
HIVE M/R Spark Storm Flink
Hadoop LibrA
YARN/ Zookeeper Security
management
HDFS/HBase
Hive is a data warehouse tool, which employs HiveQL (SQL-like) to query data.
All Hive data is stored in HDFS.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 7
Comparison Between Hive and
Traditional Data Warehouses (1)
Hive Traditional Warehouse
Cluster, which is of limited storage capacity. The
cluster calculation speed decreases dramatically
HDFS. Theoretically, it is infinitely when the storage capacity increases. It is
Storage
scalable. applicable only to commercial applications that
involve a small amount of data, and cannot
handle an extra-large amount of data.
An algorithm with higher efficiency can be used
Execution
MapReduce/Tez/Spark to query data. More optimization measures can
engine
be taken to improve the efficiency.
Usage HQL (similar to SQL) SQL
Metadata and data are stored
Flexibility Low. Data is used for limited purposes.
separately for decoupling.
The calculation speed depends
on cluster size. Hive is scalable. It It is fast when there is a small amount of data.
Analysis
is more efficient than traditional Nevertheless, the speed decreases dramatically
speed
data warehouses when there is a when the amount of data becomes large.
large amount of data.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 8
Comparison Between Hive and
Traditional Data Warehouses (2)
Hive Traditional Data Warehouses
Low efficiency. It has not met
Index expectations currently. Efficient.
An application model must be
It provides a set of well-developed report
Usability developed. This results in high
solutions to facilitate data analysis.
flexibility but low usability.
Data is stored in HDFS, which It has relatively low reliability. When a query
Reliability features high reliability and attempt fails, the query must be restarted.
high fault tolerance. Data fault tolerance relies on hardware RAID.
Environment It can be deployed using It requires high-performance commercial
dependence common computers. servers.
The data warehouses used for commercial
Price Open-source product.
purposes are expensive.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 9
Advantages of Hive
Advantages of Hive
High reliability Multiple
SQL-like query Scalability
and tolerance interfaces
HiveServer in SQL-like syntax User-defined Beeline
cluster mode Built-in functions storage format JDBC
Dual-MetaStore in large quantity User defined Thrift
Query retry functions Python
after timeout (UDF/UDAF/UDTF) ODBC
1 2 3 4
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 10
Disadvantages of Hive
Disadvantages of Hive
High latency Not support Inapplicable to Not support
materialized OLTP storage process
views
Does not support Does not Does not
MapReduce
materialized views. support support storage
execution Data updating,
column-level process, but
engine by insertion, and
deletion cannot be data adding, supports logic
default
performed on views. updating, and processing
High latency of
deletion. using UDF.
MapReduce
1 2 3 4
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 11
Contents
1. Introduction to Hive
2. Hive Functions and Architecture
3. Basic Hive Operations
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 12
Hive Architecture
Hive
JDBC ODBC
Web
Command Line Interface Thrift Server
Interface
Driver
Metastore
(Compiler,Optimizer,Executor)
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 13
Hive Architecture in FusionInsight HD
Hive contains HiveServer, MetaStore,
and WebHcat.
HiveServer: receives requests from Hiveserver (s) WebHcat (s)
clients, parses and executes HQL
commands, and returns query results. Metastore (s)
MetaStore: provides metadata
services.
DBService/HDFS/YARN
WebHcat: provides HTTP services,
such as metadata, Data Defined
Language (DDL) for external users.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 14
Architecture of WebHCat
WebHCat provides Rest interface for users to make the following operations
through safe HTTPS protocol:
Hive DDL operations
Running Hive HQL task
Runing MapReduce task
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 15
Data Storage Model of Hive
Database
Table Table
Partition
Bucket
Bucket
Partition Skewed data Normal data
Bucket
Bucket
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 16
Data Storage Model of Hive - Partition
and Bucket
Partition: A data table can be divided into partitions by using a field
value.
Each partition is a directory.
The number of partitions is configurable.
A partition can be partitioned or bucketed.
Bucket: Data can be stored in different buckets.
Each bucket is a file.
The bucket quantity is set when a table is created and data can be sorted
in the bucket.
Data is stored in a bucket by the hash value of a field.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 17
Data Storage Model of Hive - Managed
Table and External Table
Hive can create managed table and external table:
Managed tables are created by default and managed by Hive. In this case,
Hive migrates data to data warehouse directories.
When external tables are created, Hive access data from locations outside data
warehouse directories.
Use managed tables when Hive performs all operations.
Use external tables when Hive and other tools share the same data set for different
processing.
Managed Table External Table
Data is migrated to data warehouse The location of external data is
CREATE/LOAD directories. specified when a table is created.
DROP Metadata and data are deleted. Only metadata is deleted.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 18
Functions of Hive
Built-in functions in Hive:
Mathematical Function, such as round(), floor(), abs(), rand(), etc.
Date Function, such as to_date(), month(), day(), etc.
String Function, such as trim(), length(), substr(), etc.
UDF (User- Defined Funcation)
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 19
Enhanced Features of Hive - Colocation
Overview
Colocation: storing associated data or data on which associated operations
are performed on the same storage node.
File-level Colocation allows quick file access. This avoids network
consumption caused by data migration.
NN #1
A C D A B D B C B C A D
DN #1 DN #2 DN #3 DN #4 DN #5 DN #6
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 20
Enhanced Features of Hive - Using
Colocation
Step 1: Use an HDFS interface to create groupid and locatorid.
hdfs colocationadmin -createGroup -groupId groupid
-locatorIds locatorid1,locatorid2,locatorid3
Step 2: Use the Hive Colocation function.
CREATE TABLE tbl_1 (id INT, name STRING) stored as
RCFILE
TBLPROPERTIES("groupId"="group1","locatorId"="loca
tor1");
CREATE TABLE tbl_2 (id INT, name STRING) row
format delimited fields terminated by '\t' stored
as TEXTFILE
TBLPROPERTIES("groupId"="group1","locatorId"="loca
tor1");
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 21
Enhanced Features of Hive - Encrypting
Columns
Step 1: When creating a table, specify the columns to be encrypted
and the encryption algorithm.
create table encode_test (id INT, name STRING, phone
STRING, address STRING) row format serde
"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
WITH SERDEPROPERTIES(
"column.encode.columns"="phone,address","column.encode.
classname"="org.apache.hadoop.hive.serde2.AESRewriter"
);
Step 2: Use an insert syntax to import data to tables whose columns
are encrypted.
insert into table encode_test select id, name,
phone, address from test;
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 22
Enhanced Features of Hive - Deleting
HBase Records in Batches
Overview:
In FusionInsight HD, Hive allows deletion of a single record from an HBase
table. Hive can use specific syntax to delete one or more data records that
meet criteria from its HBase tables.
Usage:
To delete some data from an HBase table, run the following HQL
statement:
remove table HBase_table where expression;
here, expression indicates the criteria for selecting the
data to be deleted.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 23
Enhanced Features of Hive - Controlling
Traffic
By using the traffic control feature, you can control:
Total number of established connections
Number of established connections of each use
Number of connections established within a unit period
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 24
Enhanced Features of Hive - Specifying
Row Delimiters
Step 1: Set inputFormat and outputFormat when creating a table.
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
STORED AS
inputformat
"org.apache.hadoop.hive.contrib.fileformat.SpecifiedD
elimiterInputFormat"
outputformat
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutput
Format";
Step 2: Specify the delimiter before a query.
set hive.textinput.record.delimiter="!@!";
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 25
Contents
1. Introduction to Hive
2. Hive Functions and Architecture
3. Basic Hive Operations
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 26
Hive SQL Overview
DDL-Data definition language
Table creation, table modification and deletion, partitions, and
data types
DML-Data manipulation language
Data import, export
DQL-Data query language
General query
Complicated query, like Group by,Order by,Join, etc.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 27
Hive Basic Operations (1)
Data format example:1,huawei,1000.0
--Create managed table
CREATE TABLE IF NOT EXISTS example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money',)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
--Create external table
CREATE EXTERNAL TABLE IF NOT EXISTS
example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money',) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/localtest';
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 28
Hive Basic Operations (2)
--Modify a column
ALTER TABLE employee1 CHANGE money string COMMENT
'changed by alter' AFTER dateincompany;
--Add a column
ALTER TABLE employee1 ADD columns(column1 string);
--Modify the file format
ALTER TABLE employee3 SET fileformat TEXTFILE;
--Delete table data
DELETE column_1 from table_1 WHERE column_1=??;
DROP table_a;
--Describe table
DESC table_a;
--Show the statements for creating a table
SHOW CREATE table_a;
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 29
Hive Basic Operations (3)
--Load data from the local
LOAD DATA LOCAL INPATH 'employee.txt' OVERWRITE INTO TABLE
example.employee;
--Load data from another table
INSERT INTO TABLE company.person PARTITION(century= '21',year='2010')
SELECT id, name, age, birthday FROM company.person_tmp WHERE
century= '23' AND year='2010';
--Export data from a Hive table to HDFS
EXPORT TABLE company.person TO '/department';
--Import data from HDFS to a Hive table
IMPROT TABLE company.person FROM '/department';
--Insert data
INSERT INTO TABLE company.person
SELECT id, name, age, birthday FROM company.person_tmp
WHERE century= '23' AND year='2010';
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 30
Hive Basic Operations (4)
--WHERE
SELECT id, name FROM employee WHERE salary >= 10000;
--GROUP BY
SELECT department, avg(salary) FROM employee GROUP BY department;
--UNION ALL
SELECT id, salary, date FROM employee_a UNION ALL
SELECT id, salary, date FROM employee_b;
--JOIN
SELECT a.salary, b.address FROM employee a JOIN employee_info
b ON a.name=b.name;
--Subquery
SELECT a.salary, b.address FROM employee a JOIN (SELECT
address FROM employee_info where province='zhejiang') b ON
a.name=b.name;
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 31
Summary
This module describes the following information about Hive:
basic principles, application scenarios, enhanced features in
FusionInsight and common Hive SQL statements.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 32
Quiz
1. Which of the following scenarios does Hive apply to?
A. Real-time online data analysis
B. Data mining (user behavior analysis, interest analysis, and partition
demonstration)
C. Data aggregation (daily/weekly click count and click count rankings)
D. Non-real-time data analysis (log analysis and statistics analysis)
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 33
Quiz
1. Which of the following statements about Hive SQL operations are correct?
A. The keyword external is used to create an external table and the key
word internal is used to create a common table.
B. Specify the location information when creating an external table.
C. When data is uploaded to Hive, the data source must be one HDFS path.
D. When creating a table, column delimiters can be specified.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 34
More Information
Training materials:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term100002
5450&id=Node1000011796
Exam outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node10
00011797
Mock exam:
http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node10000
11798
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 35
Thank You
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 36