
INTRODUCTORY NOTES ON HIVE

What is Hive?
--It is an open-source data warehouse tool to process and query data on hdfs
-->Hive always reaches out to hdfs whenever it requires any data
-->hdfs is the storage unit and map-reduce is the processing unit for hive

Why Hive?
--We know that we need map-reduce to process data on hdfs and give us result
data
--But map-reduce code is quite difficult to write and very monotonous
--Due to this reason, Hive was introduced
--Hive queries are written in HQL (Hive Query Language), which is converted into
map-reduce code and returns the result data from hdfs in a structured format (tables).
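
-->For example, an aggregation that would need a full map-reduce program in java is a single HQL query (a sketch, assuming a hypothetical 'sales' table with a 'city' column):

select city, count(*) as total_orders
from sales
group by city; //hive compiles this into map and reduce tasks behind the scenes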

What data is stored in hive?


--Structured data (from hdfs)
--Metadata of tables (schema) from rdbms

Why is metadata not stored in hdfs?


--Data in hdfs can't be edited/changed, so we can't store metadata here for any
updations
--Data in hdfs is difficult to retrieve quickly (high latency)

Why hive when we have RDBMS?


--Hive runs on distributed systems and queries are converted into map-reduce
--RDBMS runs on a single system, so parallelism is not present

HIVE ARCHITECTURE:
--Hive gets metadata from rdbms and raw data from hdfs, which combine to
give a tabular format of the data
--Hive stores all its data in hdfs in raw format

Transactional Systems VS Analytics Systems:

Transactional systems:
1. We deal with present/current/latest data, analyzing individual entries
2. We can perform manipulations on data like insert, update etc.
3. It is used for real-time production
4. It incorporates monolithic architecture as a single system can hold the entire data
5. Ex: mysql, postgre, oracle

Analytical systems:
1. We deal with historical data, analyzing batches of data
2. We mostly read the data and don't perform any manipulations on it
3. It is used for analysis purposes
4. Distributed systems are good for storage of such data
5. Ex: teradata

Datawarehouse VS Datalake

Data warehouse:
1. Meant to store structured data
2. Stores data which is structured, filtered and meant to be used for a specific purpose

Data lake:
1. Stores all kinds of data
2. Stores the raw form of data whose purpose is not yet defined

Do we need to write map-reduce code to retrieve data from hdfs?


--NO, we write queries in HQL which are internally converted into map-reduce
tasks that actually process the data and give result sets to hive

How does Hive have tabular data when data in hdfs is in file form?
--Hive stores mainly two types of data
1. Data from hdfs in the format of text files
2. Metadata or schema present in rdbms (mysql) in the form of the Hive Metastore
i.e. in mysql we have a separate database called the Hive Metastore which has the
schema of all tables in hive
Hdfs file + metadata = Hive tables

Why is metadata stored in rdbms?


-->We can perform manipulations (update/insert) on it when it is stored in rdbms,
whereas on hive, which is part of the hadoop ecosystem, this is difficult.
-->Latency is lower in rdbms than in hive, so metadata can be fetched quickly

HIVE= DATA(hdfs) + metadata (rdbms)


HIVE VS RDBMS:

HIVE:
1. Deals with huge amounts of data
2. Parallel computation of data
3. High latency (time taken to fetch a record from the database)
4. Performs mostly read operations
5. Not ACID compliant
6. Schema on read

RDBMS:
1. Deals with small amounts of data
2. Serial computation of data
3. Low latency
4. Performs both read/write operations
5. ACID compliant
6. Schema on write

Schema on Read:
-->The data is already present in hdfs in the form of files.
-->While presenting it in hive in tabular form, the schema of the data is checked
-->Due to this, no constraints are placed on the data

Schema on Write:
-->At first, a table is created in rdbms
-->Data is then inserted into the table
-->While inserting, the schema is checked; if the data doesn't match, an error is thrown
-->Due to this, constraints are placed on the data
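
-->A quick way to see schema on read in action (a sketch, assuming a hypothetical table 't' with an int column): rows in the backing file that don't parse as int are not rejected at load time, they simply read back as NULL

select * from t where some_int_col is null; //file lines that didn't match the schema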

DATA STORAGE IN HIVE:

Where the data is stored in hive?


-->Data is stored in hive in the warehouse directory.
Default storage location: /user/hive/warehouse
DB storage: /user/hive/warehouse/test.db
Table location: /user/hive/warehouse/test.db/userDetails
Table location for default db: /user/hive/warehouse/userDetails (default db is
stored directly in warehouse directory)
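
-->To confirm where a table actually lives, check the Location field in its description (using the test.userDetails example above):

describe formatted test.userDetails; //the Location field shows the hdfs path
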
COMPLEX DATA TYPES IN HIVE:

What complex data types are supported in hive?

1. Array: It is a collection of elements of the same data type.


2. Map: Unordered collection of (key, value) pairs where keys must be
unique. Based on a key we get the corresponding value
3. Struct: Collection of elements having different data types grouped together
4. Union: Holds a value that can be exactly one of several declared data types (not completely supported in hive)

ARRAYS IN HIVE:

-->create table mobiles(
mobile_name string,
price bigint,
colors array<string>,
models array<string>
)
row format delimited
fields terminated by ','
collection items terminated by '#'; //how elements in an array are separated

-->select mobile_name, colors[0] from mobiles; //to get a particular element from
the array

-->if our file has 10 columns whereas our table has 6 columns (the schema of 6
columns is matching),
then the remaining 4 columns will be ignored by the table.

MAP IN HIVE:

-->create table mobiles(
mobile_name string,
price bigint,
colors array<string>,
models array<string>,
features map<string, boolean>
)
row format delimited
fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

select * from mobiles limit 1;

vivo 10000 ["black","white","silver"] ["10.3","10.3.2"] {"slow motion":true, "voice assistance":true}

select mobile_name, features['slow motion'] from mobiles; //to check whether this key has value or not
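
-->Other map lookups are built in as standard hive functions:

select mobile_name, size(features), map_keys(features) from mobiles; //number of entries and the list of keys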

STRUCT IN HIVE:

-->create table mobiles(
mobile_name string,
price bigint,
colors array<string>,
models array<string>,
features map<string, boolean>,
information struct<battery:string, country_code:bigint>
)
row format delimited
fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

select * from mobiles limit 1;

vivo 10000 ["black","white","silver"] ["10.3","10.3.2"] {"slow motion":true, "voice assistance":true} {"battery":"25kmph","country_code":9403}
BUILT-IN FUNCTIONS IN HIVE:

What are the built-in functions in hive?

1. UDF (user defined function): These operate on a single row and give a result in
a single row. Hive ships with many such functions built in.
Ex. trim(), length()
-->When the built-in functions don't fit our need, we have to write our own
user defined functions (UDF)
-->For that we have to write our own java code, since the logic is not something
the system already defines

2. UDAF (user defined aggregate functions): They operate on multiple
rows and give a result in a single row.
Ex. sum(), count(), avg()
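
-->For example, against the mobiles table created earlier:

select count(*), avg(price) from mobiles; //many rows in, one row out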

3. UDTF (user defined table-generating functions): They operate on a
single row and give a result in multiple rows.
They flatten complex data types into singular data types.
Ex. [100,200,300,400] ==> [100], [200], [300], [400]
Virtual tables are created in the backend which are joined with the main table to provide
the result set of multiple rows.

Lateral view: It expands the array into rows

-->select explode(order_items) from orders;

shampoo,
book,
purse,
dress,
whey protein

-->it breaks the array order_items into individual items

-->to have cust_name along with the orders placed:

-->select cust_name, cust_items from orders lateral view explode(order_items) cust_orders as cust_items;

table1: orders
join: lateral view
table2: explode(order_items)
alias name of table2 is cust_orders
cust_items is the column name in the virtual table (cust_orders)

Output:
yashe: shampoo
vaishu: book
vaishu: purse
vaishu: whey protein
amma: dress

VIEWS IN HIVE:
-->A view is a virtual table which holds the metadata of a query and loads the
data of the query at run time

Pros:
-->We can create views and restrict users' permissions on the main DB. Data
security is preserved.
-->Complex queries can be simplified into views
-->Frequently accessed data can be written into views
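
-->A sketch of the security use-case (hypothetical column names):

create view customers_public as
select cust_id, cust_name from customers; //users query the view, never the full customers table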

Normalization and De-Normalization:


-->Normalization: It is the process of dividing large tables into smaller tables to
remove data redundancy and inconsistency.
Pros:
1. Data is consistent
2. Free from redundancies
Cons:
1. We need to perform many joins to get the data of even a single
entity (user/employee)
2. 'join' queries are complex and time-consuming to execute.

De-Normalization:
-->Data stored in hive is in de-normalized form.
-->De-Normalization is the process in which huge tables are not divided into
smaller tables, accepting some data redundancy and inconsistency.

Why De-normalization?
-->Normalization is a costly process, as huge amounts of data have to be processed to
give final results, which is not recommended.
-->Data processing time increases due to the complex 'join' queries.

What will you do when updates have to be made?


-->Data in hive is mainly meant for READ-only operations, so there is no need to worry
about data manipulation
-->Storage is very cheap, so even if we have duplicates it is manageable

Hands-On with Hive:


cloudera: hive ==> enter the hive interface

ctrl+l ==> to clear the screen in hive and rdbms
";" ==> all queries end with a semicolon
describe formatted table_name; ==> to get a detailed description of a table

Beeline Editor:
-->beeline -u jdbc:hive2:// ==> connect to beeline from cloudera
-->beeline -u jdbc:hive2:// -f /user/cloudera/myqueries.sql ==> to run a script from
beeline
-->!q ==> to exit from beeline
-->The beeline editor is used to view hive data in a clean and clear format

Hive Table:
-->Hive table consists of two things namely
1. data (present in hdfs in the form of files)
2. schema (metadata present in rdbms)

-->If we are using a raw version of hadoop then we have the derby database by
default
-->As we are using the cloudera version of hadoop, we have a mysql database in place
-->Inside mysql we have a db named 'metastore' which is used to handle the
metadata of hive
-->We have created a table named 'users' inside the default db
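
-->From the mysql shell we can peek at the metastore directly (a sketch; the exact table layout varies by hive version):

use metastore;
select TBL_NAME, TBL_TYPE from TBLS; //lists hive tables and whether each is managed or external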

-->To view the contents of the table's backing file: the data is delimited by special characters

Types of tables in Hive:

-->Managed Table: Hive manages/owns both the data and the metadata of the
table. By default a managed table is created when we create a table in hive.
If we drop the table then we lose both the data and the metadata.
drop table table_name;
-->External Table: Table whose data is owned/managed by an external source like
hdfs/local system
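
-->The difference shows up on drop (an illustration, using the tables created later in these notes):

drop table products_managed; //managed: the files under /user/hive/warehouse are deleted too
drop table orders; //external: only the metadata goes, the files in /user/cloudera/new_folder remain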

Ways to insert data into hive:

1. Insert data directly into the table by a manual process (not recommended, as the
map-reduce jobs take a lot of time to insert)
2. Insert data from files (the industry way of doing it; load from hdfs/local system)
   - Load data from hdfs
   - Load data from local system
3. Copy data from one table to another

when to use managed table?


-->When Hive is the only user of the data
-->No other tools use the tables we have created in hive, so they don't get
impacted when a table is dropped

Load Data from local:

1. create a table with schema similar to data being loaded

create table if not exists products_managed(
id string,
title string,
cost float
)
row format delimited
fields terminated by ','
stored as textfile;

2. Load data from local to table directly

load data local inpath '/home/cloudera/Desktop/product.csv' into table
test.products_managed;
Load Data from HDFS:

1. Create a directory in hadoop and copy the file into it


-->hadoop fs -mkdir /user/cloudera/new_folder
-->hadoop fs -copyFromLocal /home/cloudera/Desktop/product.csv /user/cloudera/new_folder

2. Load data from hadoop to table

load data inpath '/user/cloudera/new_folder/product.csv' into table
test.products_managed;
(note: no 'local' keyword when the file is loaded from hdfs)

3. check the contents of the table
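
select * from test.products_managed limit 10; //verify the rows loaded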

Load data from one table to another

1. create a new table to match with schema being inserted

2. Perform insert
insert into table table_managed
select * from products_managed;
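
-->insert overwrite replaces the existing contents instead of appending:

insert overwrite table table_managed
select * from products_managed;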

EXTERNAL TABLE:

-->The table in which the data is not managed/owned by hive. It is external to hive
-->We get this data from hdfs, rdbms, hbase etc.
-->When this table is dropped, only the metadata gets deleted but not the actual data
-->We need to create the table again once the schema is dropped

How to create an external table:

1. Create a table definition

create external table orders(
order_id bigint,
order_name string,
order_status string)
row format delimited
fields terminated by ',' (each field is separated by a comma)
stored as textfile
location '/user/cloudera/new_folder' (location where the data to be loaded into the
table is kept)

when to use External Table:


-->when we have other tools/systems using the data like hdfs,pig,hbase
along with hive.

Hive Subqueries:
-->subqueries are queries written within queries
-->sql returns results if they are in tables or tabular format
-->subqueries are used in two situations:
1. with FROM clause
select * from
(select prod_id from products where city='bangalore') t
(hive requires an alias, here 't', for a subquery in the FROM clause)
2. with WHERE clause
   - IN / NOT IN
   - EXISTS / NOT EXISTS

select * from customers where city in
(select order_city from orders where order_status='completed')

select * from customers c where exists
(select 1 from orders o where o.order_city = c.city)
(hive supports EXISTS only as a correlated subquery, so a condition on the outer table is needed)

HIVE VIEWS:

-->create view customer_detail
as
select * from customers c
join orders o on c.cust_id = o.cust_id;
-->show tables;
-->describe formatted customer_detail;

Hive Indexes:

-->create index cust_index
on table customers (customer_id)
as 'COMPACT' with deferred rebuild;

where 'COMPACT' names the index handler being used
WITH DEFERRED REBUILD means the index is not built at creation time but in a later
stage of execution
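
-->The index starts empty; to actually build (or later refresh) it:

alter index cust_index on customers rebuild;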
