Module 06 Hive - Distributed Data Warehouse

The document outlines the technical principles of Hive, a data warehouse tool that operates on Hadoop and supports large-scale data management and querying. It highlights the enhanced features of FusionInsight HD's Hive, including customization options, improved reliability, and performance compared to traditional data warehouses. Additionally, it provides an overview of Hive's architecture, SQL operations, and various functionalities such as data partitioning, encryption, and traffic control.

Uploaded by

Lucas Oliveira

Technical Principles of Hive

www.huawei.com

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.

Foreword
• Hive in FusionInsight HD is based on the Hive provided by the open source Hive community and adds enterprise-level customization features, such as Colocation table creation, column encryption, and syntax enhancement. With these features, Hive in FusionInsight HD outperforms the community version in terms of reliability, fault tolerance, scalability, and performance.

Objectives
• Upon completion of this course, you will understand:
  • Hive application scenarios and basic principles
  • Enhanced features of FusionInsight Hive
  • Common Hive SQL statements

Contents
1. Introduction to Hive
2. Hive Functions and Architecture

3. Basic Hive Operations

Hive Overview
• Hive is a data warehouse tool that runs on Hadoop and supports PB-level distributed data query and management.
• Hive provides the following functions:
  • Flexible ETL (extract/transform/load)
  • Support for multiple computing engines, such as MapReduce, Tez, and Spark
  • Direct access to HDFS files and HBase
  • Ease of use and programming

Application Scenarios of Hive

Data mining
  • User behavior analysis
  • Interest analysis
  • Partition demonstration

Non-real-time analysis
  • Log analysis
  • Text analysis

Data aggregation
  • Daily/Weekly click count
  • Traffic statistics

Data warehouse
  • Data extraction
  • Data loading
  • Data transformation

Position of Hive in FusionInsight

[Figure: FusionInsight architecture. Labels: Application service layer; OpenAPI/SDK; REST/SNMP/Syslog; Data, Information, Knowledge, Wisdom; DataFarm (Porter, Miner, Farmer); Manager (system management, service governance, security management); Hadoop API; Plugin API; Hive, M/R, Spark, Storm, Flink; YARN/ZooKeeper; HDFS/HBase; Hadoop; LibrA.]

Hive is a data warehouse tool that uses HiveQL (an SQL-like language) to query data. All Hive data is stored in HDFS.

Comparison Between Hive and Traditional Data Warehouses (1)

Storage
  Hive: HDFS. Theoretically, it is infinitely scalable.
  Traditional data warehouse: A cluster with limited storage capacity. The calculation speed of the cluster decreases dramatically as the storage capacity increases. It is applicable only to commercial applications that involve a small amount of data and cannot handle extra-large amounts of data.

Execution engine
  Hive: MapReduce/Tez/Spark
  Traditional data warehouse: More efficient algorithms can be used to query data, and more optimization measures can be taken to improve efficiency.

Usage
  Hive: HQL (similar to SQL)
  Traditional data warehouse: SQL

Flexibility
  Hive: Metadata and data are stored separately for decoupling.
  Traditional data warehouse: Low. Data is used for limited purposes.

Analysis speed
  Hive: The calculation speed depends on the cluster size, and Hive is scalable. It is more efficient than traditional data warehouses when there is a large amount of data.
  Traditional data warehouse: Fast when there is a small amount of data, but the speed decreases dramatically when the amount of data becomes large.

Comparison Between Hive and Traditional Data Warehouses (2)

Index
  Hive: Low efficiency. It has not met expectations currently.
  Traditional data warehouse: Efficient.

Usability
  Hive: An application model must be developed. This results in high flexibility but low usability.
  Traditional data warehouse: Provides a set of well-developed report solutions to facilitate data analysis.

Reliability
  Hive: Data is stored in HDFS, which features high reliability and high fault tolerance.
  Traditional data warehouse: Relatively low reliability. When a query attempt fails, the query must be restarted. Data fault tolerance relies on hardware RAID.

Environment dependence
  Hive: Can be deployed on common computers.
  Traditional data warehouse: Requires high-performance commercial servers.

Price
  Hive: Open-source product.
  Traditional data warehouse: Commercial data warehouses are expensive.

Advantages of Hive

1. High reliability and tolerance
  • HiveServer in cluster mode
  • Dual-MetaStore
  • Query retry after timeout
2. SQL-like query
  • SQL-like syntax
  • A large number of built-in functions
3. Scalability
  • User-defined storage format
  • User-defined functions (UDF/UDAF/UDTF)
4. Multiple interfaces
  • Beeline
  • JDBC
  • Thrift
  • Python
  • ODBC

Disadvantages of Hive

1. High latency
  • MapReduce is the default execution engine.
  • MapReduce has high latency.
2. No support for materialized views
  • Hive does not support materialized views.
  • Data updating, insertion, and deletion cannot be performed on views.
3. Inapplicable to OLTP
  • Hive does not support column-level data adding, updating, and deletion.
4. No support for stored procedures
  • Hive does not support stored procedures, but logic processing can be implemented using UDFs.


[Figure: Hive architecture. Labels: JDBC, ODBC; Command Line Interface, Web Interface, Thrift Server; Driver (Compiler, Optimizer, Executor); Metastore.]

Hive Architecture in FusionInsight HD
• Hive contains HiveServer, MetaStore, and WebHCat.
  • HiveServer: receives requests from clients, parses and executes HQL commands, and returns query results.
  • MetaStore: provides metadata services.
  • WebHCat: provides HTTP services, such as metadata access and Data Definition Language (DDL) operations, for external users.

[Figure: HiveServer(s) and WebHCat(s), MetaStore(s), and DBService/HDFS/YARN.]

Architecture of WebHCat
• WebHCat provides a REST interface through which users can perform the following operations over the secure HTTPS protocol:
  • Hive DDL operations
  • Running Hive HQL tasks
  • Running MapReduce tasks

[Figure: Data storage model of Hive. Labels: Database, Table, Partition, Bucket, Skewed data, Normal data.]

Data Storage Model of Hive - Partition and Bucket
• Partition: A data table can be divided into partitions by using a field value.
  • Each partition is a directory.
  • The number of partitions is configurable.
  • A partition can be partitioned or bucketed.
• Bucket: Data can be stored in different buckets.
  • Each bucket is a file.
  • The bucket quantity is set when a table is created, and data can be sorted in the bucket.
  • Data is stored in a bucket by the hash value of a field.
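
The layout described above can be sketched in a few lines of Python. This is only an illustration, not Hive code: the table directory /warehouse/person, the helper names, and the hash rule (an integer key hashes to its own value) are assumptions made for the example, and the hash that Hive actually applies depends on the column type and the Hive version.

```python
from collections import defaultdict


def partition_dir(table_dir, partition_values):
    """Return the directory of one partition, e.g. <table>/century=21/year=2010."""
    parts = [f"{column}={value}" for column, value in partition_values]
    return "/".join([table_dir] + parts)


def bucket_index(key, num_buckets):
    """Pick a bucket (one file per bucket) from the hash of the bucketing field.

    Assumption for this sketch: the hash of an integer key is the key itself.
    """
    return (key & 0x7FFFFFFF) % num_buckets


# Hypothetical table directory and partition columns.
directory = partition_dir("/warehouse/person", [("century", "21"), ("year", "2010")])

# Distribute a few row ids over 4 buckets.
buckets = defaultdict(list)
for row_id in [1, 2, 5, 8]:
    buckets[bucket_index(row_id, 4)].append(row_id)
```

Rows whose keys share a hash remainder end up in the same bucket file, which is what makes bucketed tables useful for sampling and for joins on the bucketing field.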

Data Storage Model of Hive - Managed Table and External Table
• Hive can create managed tables and external tables:
  • Managed tables are created by default and are managed by Hive. In this case, Hive migrates data to data warehouse directories.
  • When external tables are created, Hive accesses data in locations outside data warehouse directories.
  • Use managed tables when Hive performs all operations.
  • Use external tables when Hive and other tools share the same data set for different processing.

CREATE/LOAD
  Managed table: Data is migrated to data warehouse directories.
  External table: The location of external data is specified when a table is created.

DROP
  Managed table: Metadata and data are deleted.
  External table: Only metadata is deleted.
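
The difference in DROP behavior can be simulated with a small Python sketch. It is not Hive code: the dictionary standing in for the metastore, the function names, and the temporary directories are all hypothetical and only mirror the rule stated in the table above.

```python
import os
import shutil
import tempfile

# Stand-in for the metastore: table name -> location and table type.
metastore = {}


def create_table(name, location, external):
    """Register a table and make sure its data directory exists."""
    os.makedirs(location, exist_ok=True)
    metastore[name] = {"location": location, "external": external}


def drop_table(name):
    """Always delete metadata; delete data only for managed tables."""
    info = metastore.pop(name)
    if not info["external"]:
        shutil.rmtree(info["location"])


root = tempfile.mkdtemp()
managed_dir = os.path.join(root, "warehouse", "managed_tbl")
external_dir = os.path.join(root, "shared", "external_tbl")

create_table("managed_tbl", managed_dir, external=False)
create_table("external_tbl", external_dir, external=True)
drop_table("managed_tbl")
drop_table("external_tbl")

managed_exists = os.path.exists(managed_dir)    # data removed with the table
external_exists = os.path.exists(external_dir)  # data left for other tools

shutil.rmtree(root)  # clean up the temporary directories
```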

Functions of Hive
• Built-in functions in Hive:
  • Mathematical functions, such as round(), floor(), abs(), and rand()
  • Date functions, such as to_date(), month(), and day()
  • String functions, such as trim(), length(), and substr()
• UDF (User-Defined Function)

Enhanced Features of Hive - Colocation Overview
• Colocation: storing associated data, or data on which associated operations are performed, on the same storage node.
• File-level Colocation allows quick file access. This avoids network consumption caused by data migration.

[Figure: One NameNode (NN #1) and six DataNodes (DN #1 to DN #6) that hold the data blocks A, B, C, and D.]

Enhanced Features of Hive - Using Colocation
• Step 1: Use an HDFS interface to create groupid and locatorid.

hdfs colocationadmin -createGroup -groupId groupid -locatorIds locatorid1,locatorid2,locatorid3

• Step 2: Use the Hive Colocation function.

CREATE TABLE tbl_1 (id INT, name STRING)
STORED AS RCFILE
TBLPROPERTIES("groupId"="group1","locatorId"="locator1");

CREATE TABLE tbl_2 (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("groupId"="group1","locatorId"="locator1");

Enhanced Features of Hive - Encrypting Columns
• Step 1: When creating a table, specify the columns to be encrypted and the encryption algorithm.

CREATE TABLE encode_test (id INT, name STRING, phone STRING, address STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
WITH SERDEPROPERTIES(
"column.encode.columns"="phone,address",
"column.encode.classname"="org.apache.hadoop.hive.serde2.AESRewriter"
);

• Step 2: Use an INSERT statement to import data into the table whose columns are encrypted.

INSERT INTO TABLE encode_test SELECT id, name, phone, address FROM test;

Enhanced Features of Hive - Deleting HBase Records in Batches
• Overview:
  • In FusionInsight HD, Hive allows deletion of a single record from an HBase table. Hive can also use a specific syntax to delete one or more records that meet the specified criteria from its HBase tables.
• Usage:
  • To delete some data from an HBase table, run the following HQL statement:

remove table HBase_table where expression;

Here, expression indicates the criteria for selecting the data to be deleted.

Enhanced Features of Hive - Controlling Traffic
By using the traffic control feature, you can control:
• Total number of established connections
• Number of established connections of each user
• Number of connections established within a unit period
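
To make the three limits concrete, here is a minimal Python sketch of a connection limiter. It does not show how FusionInsight HD implements traffic control: the class, its parameters, and the sliding-window rule are hypothetical and only illustrate how the three checks could be combined.

```python
from collections import Counter, deque


class ConnectionLimiter:
    """Hypothetical limiter that enforces the three limits listed above."""

    def __init__(self, max_total, max_per_user, max_new_per_window, window_seconds):
        self.max_total = max_total
        self.max_per_user = max_per_user
        self.max_new_per_window = max_new_per_window
        self.window_seconds = window_seconds
        self.active = Counter()  # user -> currently established connections
        self.recent = deque()    # timestamps of recently accepted connections

    def try_connect(self, user, now):
        # Forget accepted connections that fall outside the time window.
        while self.recent and now - self.recent[0] >= self.window_seconds:
            self.recent.popleft()
        if sum(self.active.values()) >= self.max_total:
            return False  # total connection limit reached
        if self.active[user] >= self.max_per_user:
            return False  # per-user connection limit reached
        if len(self.recent) >= self.max_new_per_window:
            return False  # too many new connections in this period
        self.active[user] += 1
        self.recent.append(now)
        return True

    def disconnect(self, user):
        if self.active[user] > 0:
            self.active[user] -= 1


limiter = ConnectionLimiter(max_total=3, max_per_user=2,
                            max_new_per_window=2, window_seconds=60)
results = [
    limiter.try_connect("alice", now=0),   # accepted
    limiter.try_connect("alice", now=1),   # accepted
    limiter.try_connect("bob", now=2),     # rejected: rate limit in the window
    limiter.try_connect("alice", now=61),  # rejected: per-user limit
    limiter.try_connect("bob", now=62),    # accepted
    limiter.try_connect("carol", now=63),  # rejected: total limit
]
```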

Enhanced Features of Hive - Specifying Row Delimiters
• Step 1: Set inputFormat and outputFormat when creating a table.

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
STORED AS
INPUTFORMAT "org.apache.hadoop.hive.contrib.fileformat.SpecifiedDelimiterInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

• Step 2: Specify the delimiter before a query.

set hive.textinput.record.delimiter="!@!";
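
The effect of a custom row delimiter can be illustrated with a short Python sketch. It only mimics the idea of splitting records on "!@!" instead of on newlines; the function name and the sample data are made up for the example and do not reflect how the input format is implemented.

```python
def split_records(text, record_delimiter):
    """Split raw file content into rows on a custom record delimiter."""
    return [record for record in text.split(record_delimiter) if record]


# The first record contains a newline inside a field, so a newline-based
# reader would wrongly break it into two rows.
raw = "1,Alice,line one\nline two!@!2,Bob,single line!@!"

records = split_records(raw, "!@!")
rows = [record.split(",") for record in records]
```

With "!@!" as the row delimiter, the embedded newline stays inside the third field of the first row.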


Hive SQL Overview
• DDL (data definition language)
  • Table creation, table modification and deletion, partitions, and data types
• DML (data manipulation language)
  • Data import and export
• DQL (data query language)
  • General queries
  • Complex queries, such as GROUP BY, ORDER BY, and JOIN

Hive Basic Operations (1)
--Create managed table
CREATE TABLE IF NOT EXISTS example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

--Create external table
CREATE EXTERNAL TABLE IF NOT EXISTS example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/localtest';

Hive Basic Operations (2)
--Modify a column (CHANGE takes the old name, the new name, and the type;
--keeping the name money is an assumption)
ALTER TABLE employee1 CHANGE money money STRING COMMENT 'changed by alter' AFTER dateincompany;

--Add a column
ALTER TABLE employee1 ADD COLUMNS (column1 STRING);

--Modify the file format
ALTER TABLE employee3 SET FILEFORMAT TEXTFILE;

--Delete table data
DELETE FROM table_1 WHERE column_1=??;

--Delete a table
DROP TABLE table_a;

--Describe table
DESC table_a;

--Show the statements for creating a table
SHOW CREATE TABLE table_a;

Hive Basic Operations (3)
--Load data from the local file system
LOAD DATA LOCAL INPATH 'employee.txt' OVERWRITE INTO TABLE example.employee;

--Load data from another table
INSERT INTO TABLE company.person PARTITION(century='21', year='2010')
SELECT id, name, age, birthday FROM company.person_tmp
WHERE century='23' AND year='2010';

--Export data from a Hive table to HDFS
EXPORT TABLE company.person TO '/department';

--Import data from HDFS to a Hive table
IMPORT TABLE company.person FROM '/department';

Hive Basic Operations (4)
--Insert data
INSERT INTO TABLE company.person
SELECT id, name, age, birthday FROM company.person_tmp
WHERE century='23' AND year='2010';

--GROUP BY
SELECT department, avg(salary) FROM employee GROUP BY department;

--UNION ALL
SELECT id, salary, date FROM employee_a UNION ALL
SELECT id, salary, date FROM employee_b;

--JOIN
SELECT a.salary, b.address FROM employee a
JOIN employee_info b ON a.name=b.name;

--Subquery (the subquery must also select name, which the join condition uses)
SELECT a.salary, b.address FROM employee a
JOIN (SELECT name, address FROM employee_info WHERE province='zhejiang') b
ON a.name=b.name;
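
The GROUP BY and subquery join above can be tried without a Hive cluster. The following Python sketch runs equivalent queries on an in-memory SQLite database as a stand-in for Hive; SQLite is not Hive, and the column types and sample rows are invented for the example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (name TEXT, department TEXT, salary REAL);
CREATE TABLE employee_info (name TEXT, address TEXT, province TEXT);
INSERT INTO employee VALUES
    ('ann', 'sales', 100.0), ('bob', 'sales', 200.0), ('cat', 'dev', 300.0);
INSERT INTO employee_info VALUES
    ('ann', 'hangzhou', 'zhejiang'), ('cat', 'shenzhen', 'guangdong');
""")

# GROUP BY: average salary per department.
avg_by_department = conn.execute(
    "SELECT department, avg(salary) FROM employee "
    "GROUP BY department ORDER BY department"
).fetchall()

# Subquery in a JOIN: only employees whose address is in one province.
joined = conn.execute("""
    SELECT a.salary, b.address FROM employee a
    JOIN (SELECT name, address FROM employee_info
          WHERE province = 'zhejiang') b
    ON a.name = b.name
""").fetchall()

conn.close()
```

Only ann matches the province filter, so the join returns a single row, while the aggregation returns one row per department.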

Summary
• This module describes the following information about Hive: basic principles, application scenarios, enhanced features in FusionInsight, and common Hive SQL statements.

A. Real-time online data analysis
B. Data mining (user behavior analysis, interest analysis, and partition demonstration)
C. Data aggregation (daily/weekly click count and click count rankings)
D. Non-real-time data analysis (log analysis and statistics analysis)

A. The keyword external is used to create an external table and the keyword internal is used to create a common table.
B. Specify the location information when creating an external table.
C. When data is uploaded to Hive, the data source must be one HDFS path.
D. When creating a table, column delimiters can be specified.

More Information
• Training materials:
  • http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
  • http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
  • http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
  • http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
