BDA Unit 4 Notes
Introduction to Hive
What is Hive: Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing
easy. Hive was initially developed by Facebook; later the Apache Software Foundation took it up
and developed it further as open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is Not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Advantages of Hive Architecture:
Scalability: Hive is a distributed system that can easily scale to handle large volumes of data
by adding more nodes to the cluster.
Data Accessibility: Hive allows users to access data stored in Hadoop without the need for
complex programming skills. Queries are written in HiveQL, an SQL-like language based on
SQL syntax.
Data Integration: Hive integrates easily with other tools and systems in the Hadoop
ecosystem such as Pig, HBase, and MapReduce.
Flexibility: Hive can handle both structured and unstructured data, and supports various data
formats including CSV, JSON, and Parquet.
Security: Hive provides security features such as authentication, authorization, and encryption
to ensure data privacy.
Components of Hive Architecture:
Hive Server – Also referred to as the Apache Thrift Server, it accepts requests from different
clients and forwards them to the Hive Driver.
Driver –
The driver receives the user's queries from the interface. It implements the concept of session
handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.
Compiler –
The compiler parses the query and performs semantic analysis on the different query blocks and
query expressions. It eventually generates an execution plan with the help of the table and
partition metadata obtained from the metastore.
Metastore –
The metastore stores all the structural information of the different tables and partitions in the
warehouse, including column and column-type details, the serializers and deserializers necessary
to read and write data, and the corresponding HDFS files where the data is stored. Hive uses a
relational database server to store this schema or metadata of databases, tables, attributes in a
table, their data types, and the HDFS mapping.
Execution Engine –
The execution engine carries out the execution plan created by the compiler. The plan is a DAG
of stages. The execution engine manages the dependencies between the various stages of the
plan and executes these stages on the appropriate system components.
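For example, clients typically connect to the Hive Server (HiveServer2) over JDBC, e.g. using the beeline command-line client; a minimal sketch, where the host, port, and user name are placeholders:
beeline -u "jdbc:hive2://localhost:10000/default" -n hiveuser
Once connected, HiveQL statements (for example, SHOW DATABASES;) can be issued at the beeline prompt and are handled by the driver, compiler, and execution engine described above.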
Hive Data Types
All the data types in Hive are classified into four categories, as follows:
Column Types
Literals
Null Values
Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range exceeds
the range of INT, you need to use BIGINT and if the data range is smaller than the INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:
Type Postfix Example
TINYINT Y 10Y
SMALLINT S 10S
INT - 10
BIGINT L 10L
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). It contains
two data types: VARCHAR and CHAR. Hive follows C-style escape characters.
The following table depicts various CHAR data types:
Data Type Length
VARCHAR 1 to 65535
CHAR 255
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision, using the
java.sql.Timestamp format "yyyy-mm-dd hh:mm:ss.fffffffff".
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for
representing immutable arbitrary-precision decimal numbers. The syntax and example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using the
create_union UDF. The syntax and example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are numbers with decimal points. Generally, this type of data is
represented by the DOUBLE data type.
Decimal Type
Decimal type data is a floating point value with a higher range than the DOUBLE data type.
The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
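For example, a single table can combine these complex types in its schema; a minimal sketch (table name, column names, and delimiters are illustrative):
CREATE TABLE employee_complex (
name STRING,
skills ARRAY<STRING>,
phone MAP<STRING, STRING>,
address STRUCT<street:STRING, city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
Individual elements are then accessed as skills[0], phone['home'], and address.city in queries.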
Hive Query Language (HQL)
HQL is a SQL-like query language used to interact with Apache Hive, a data warehouse
infrastructure built on top of Hadoop. Hive allows for querying and managing large datasets
stored in Hadoop's HDFS (Hadoop Distributed File System) using a language that is similar to
SQL. However, it is specifically designed to work efficiently with massive amounts of
unstructured or semi-structured data, typically in the form of logs, JSON, or Parquet files.
Here’s an overview of HQL:
Basic Features of Hive Query Language:
Data Definition Language (DDL):
CREATE DATABASE: To create a new database in Hive.
CREATE DATABASE my_database;
USE: Switch between databases.
USE my_database;
CREATE TABLE: Create a new table with a specified schema.
CREATE TABLE my_table (
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
DROP TABLE: Remove a table from the database.
DROP TABLE my_table;
Data Manipulation Language (DML):
INSERT INTO: Add data into a table.
INSERT INTO my_table VALUES (1, 'Alice', 25);
SELECT: Query data from a table.
SELECT * FROM my_table WHERE age > 20;
LOAD DATA: Load data from a file into a Hive table.
LOAD DATA INPATH '/path/to/data.csv' INTO TABLE my_table;
Data Types:
Hive supports common data types such as:
STRING, INT, BOOLEAN, DOUBLE, FLOAT, DATE, TIMESTAMP, ARRAY, MAP,
STRUCT, UNIONTYPE, etc.
Partitioning and Bucketing:
Partitioning: Splits large tables into smaller, manageable parts. For example:
CREATE TABLE my_table (
id INT,
name STRING
)
PARTITIONED BY (year INT, month INT);
Bucketing: Divides data into a fixed number of buckets to distribute the load evenly.
CREATE TABLE my_table (
id INT,
name STRING
)
CLUSTERED BY (id) INTO 10 BUCKETS;
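As a brief illustration of how the partitioned table above is typically used (a sketch assuming Hive 0.14+ INSERT ... VALUES syntax; the values are made up), rows are written into a specific partition, and queries that filter on the partition columns read only the matching partition directories:
INSERT INTO TABLE my_table PARTITION (year=2024, month=1) VALUES (1, 'Alice');
SELECT * FROM my_table WHERE year = 2024 AND month = 1;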
Joins:
Hive supports various types of joins like inner join, left join, etc.
SELECT a.id, a.name, b.salary
FROM employee a
INNER JOIN salary b
ON a.id = b.id;
Aggregation:
Hive supports common aggregation functions such as COUNT(), SUM(), AVG(), MAX(),
MIN(), etc.
SELECT department, AVG(salary)
FROM employee
GROUP BY department;
Group By and Having:
Hive supports both GROUP BY and HAVING clauses to aggregate and filter the results.
SELECT department, COUNT(*)
FROM employee
GROUP BY department
HAVING COUNT(*) > 5;
Views:
Hive allows you to create views, which are virtual tables based on the result of a query.
CREATE VIEW employee_view AS
SELECT name, salary
FROM employee
WHERE salary > 50000;
File Formats:
Hive supports multiple file formats like:
Text: Plain text files, like CSV or TSV.
ORC (Optimized Row Columnar): A highly efficient columnar storage format.
Parquet: A columnar storage format often used with complex nested data structures.
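As a quick sketch (table names and columns are illustrative), the file format for a table is chosen with the STORED AS clause:
CREATE TABLE employee_orc (
id INT,
name STRING,
salary DOUBLE
)
STORED AS ORC;
CREATE TABLE employee_parquet (
id INT,
name STRING,
salary DOUBLE
)
STORED AS PARQUET;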
Functions:
Built-in Functions: These include mathematical functions, string functions, date functions, etc.
Example of a string function:
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employee;
Example of a Simple Query:
Creating a Table:
CREATE TABLE employees (
id INT,
name STRING,
salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Loading Data into the Table:
LOAD DATA INPATH '/user/hive/warehouse/employees.csv' INTO TABLE employees;
Querying the Data:
SELECT name, salary
FROM employees
WHERE salary > 50000;
Key Differences Between HQL and SQL:
Execution: SQL queries run on traditional relational databases, while HQL queries are executed
on top of Hadoop clusters (e.g., Hive runs on HDFS).
Performance: Hive queries are slower than traditional databases due to the distributed nature of
Hadoop and the lack of optimizations that relational databases have.
Data Storage: Hive works with data stored in the Hadoop Distributed File System (HDFS) or
other similar distributed file systems, whereas traditional SQL databases store data in a
centralized system.
RCFile Format in Hive
The following example creates a Hive table stored in the RCFile format:
CREATE TABLE employee_rcfile (
id INT,
name STRING,
salary DOUBLE
)
STORED AS RCFILE;
Inserting rows into the RCFile table:
INSERT INTO TABLE employee_rcfile VALUES (1, 'Alice', 50000.0);
INSERT INTO TABLE employee_rcfile VALUES (2, 'Bob', 60000.0);
Loading data from HDFS into the RCFile table:
LOAD DATA INPATH '/user/hive/warehouse/employee.csv' INTO TABLE
employee_rcfile;
Querying the RCFile table:
SELECT * FROM employee_rcfile WHERE salary > 55000;
RCFile: Older columnar format; good for some types of queries but lacks some
optimizations available in more recent formats like ORC and Parquet.
ORC (Optimized Row Columnar): ORC files are an evolution of the RCFile format
and provide much better performance, compression, and advanced features such as
lightweight indexes for predicate pushdown and vectorized reads. ORC is the recommended
columnar format for Hive and is required for Hive ACID transactional tables (introduced in
Hive 0.14).
Parquet: Another columnar format that is often used in the Hadoop ecosystem. It is
language-agnostic and widely used for big data systems.
RCFile tables can also be stored with a compression codec such as Snappy, which is a popular
choice for data compression due to its balance of speed and compression ratio.
1. Column Pruning: Since RCFile stores data in columns, it is more efficient to access a
few columns from a large dataset than a row-based format, which has to read the entire
row.
2. Querying Performance: RCFile provides faster read performance for certain kinds of
queries (especially when you only need a few columns) compared to row-based formats
like TextFile.
A User Defined Function (UDF) in Hive is a way to extend the functionality of Hive by
allowing users to write custom functions to handle specific logic or operations that aren't
provided by the built-in functions. These functions can be written in Java and are particularly
useful when the built-in functions (such as COUNT(), AVG(), SUM(), etc.) do not meet the
needs of the user for complex computations or specialized processing.
Key Points About UDF in Hive:
Custom Logic: UDFs enable you to implement custom logic in your queries.
Extending Hive: Hive’s built-in functions are limited to standard operations. A UDF allows for
more flexible and specialized operations.
Written in Java: Hive UDFs are typically written in Java, but they can be written in any
language that compiles into a JVM-compatible bytecode.
Integration with Hive Queries: Once a UDF is written, it can be used directly in Hive queries
like any other built-in function.
Types of Hive Functions:
UDF (User Defined Function): Extends Hive’s functionality with custom operations.
UDAF (User Defined Aggregation Function): Extends the built-in aggregation functions (e.g.,
SUM(), AVG(), COUNT()) for more complex group-wise operations.
UDTF (User Defined Table-Generating Function): Returns a set of rows rather than a single
value. Useful for operations that need to split a single row into multiple rows (e.g., for parsing
data).
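For instance, explode() is one of Hive's built-in UDTFs; a minimal sketch, assuming a table such as employee_complex (from the complex-types example earlier) with an ARRAY<STRING> column named skills, that turns each array element into its own row:
SELECT name, skill
FROM employee_complex
LATERAL VIEW explode(skills) skills_exploded AS skill;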
Creating and Using a UDF in Hive
To use a UDF in Hive, you must first create the UDF in Java, compile it, and then register it with
Hive. Let’s walk through the process step by step:
1. Writing a UDF in Java
First, you need to write your custom UDF. Here is an example of a simple UDF that reverses a
string.
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
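The rest of the class might look like the following minimal sketch (the class name ReverseString is illustrative); it extends the older org.apache.hadoop.hive.ql.exec.UDF API imported above and implements an evaluate() method:
// Simple Hive UDF that reverses a string (class name is illustrative)
public class ReverseString extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null; // pass NULL values through unchanged
        }
        return new Text(new StringBuilder(input.toString()).reverse().toString());
    }
}
2. Registering and Using the UDF in Hive
Once the class is compiled and packaged into a JAR (the JAR path and function name below are placeholders), it can be registered and called like any built-in function:
ADD JAR /path/to/reverse-udf.jar;
CREATE TEMPORARY FUNCTION reverse_string AS 'ReverseString';
SELECT reverse_string(name) FROM employee;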
What is Pig?
A Platform for Big Data Processing:
Pig is a platform designed to handle large amounts of data, particularly within the Hadoop
environment.
Pig Latin:
Pig's core is its high-level scripting language, Pig Latin, which is used to define data
processing workflows.
Dataflow Language:
Pig Latin is a dataflow language, meaning it allows users to describe how data should be read,
processed, and written to output in a parallel manner.
Simplifies MapReduce:
Pig abstracts away the complexities of MapReduce, allowing programmers to focus on the data
processing logic rather than low-level implementation details.
Automatic Job Generation:
Pig Latin scripts are automatically translated into MapReduce jobs by the Pig Engine, a
component of the Pig platform.
Key Concepts:
Dataflow:
Pig Latin scripts define dataflows, which are essentially directed acyclic graphs (DAGs) where
data is transformed through a series of operations.
Operators:
Pig Latin provides a rich set of operators (e.g., FILTER, JOIN, GROUP, ORDER BY) to manipulate data
within these dataflows.
User-Defined Functions (UDFs):
Users can create their own functions (typically written in Java or Python) to extend Pig's functionality.
HDFS:
Pig is designed to work with the Hadoop Distributed File System (HDFS), storing and
retrieving data efficiently.
Pig Engine:
The Pig Engine takes Pig Latin scripts and translates them into MapReduce jobs, which are
then executed by Hadoop.
Anatomy of Pig:
In the context of big data, "Pig" refers to Apache Pig, a high-level data flow language and
platform for analyzing large datasets. It uses Pig Latin as its scripting language, which simplifies
writing MapReduce jobs. Pig works with data stored in the Hadoop Distributed File System (HDFS).
Here's a breakdown of Pig's key aspects:
Data Flow Language:
Pig Latin allows users to define data transformations and operations in a declarative way,
which is then compiled into MapReduce jobs.
Abstraction:
Pig provides a high-level abstraction for processing data, hiding the complexities of
MapReduce programming.
Data Types:
Pig supports various data types, including atoms (like integers and strings), tuples (ordered
collections of fields), and bags (unordered collections of tuples); see the schema example after
this list.
Components:
Pig's architecture includes a Parser, Optimizer, and Execution Engine.
ETL:
Pig is often used in Extract, Transform, and Load (ETL) processes to prepare and analyze
data.
Hadoop Integration:
Pig is designed to run on Hadoop clusters, leveraging its infrastructure for parallel data
processing.
Use Cases:
Pig is suitable for various applications, including clickstream analysis, search log analysis, and
web crawl analysis.
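As a small illustration of these data types (the file name and schema are assumptions; PigStorage's default tab delimiter is used), a LOAD statement can declare atoms, a tuple, and a bag in its schema:
-- Each input line: name<TAB>(street,city)<TAB>{(course1),(course2)}
students = LOAD 'students.txt'
    AS (name:chararray,
        address:tuple(street:chararray, city:chararray),
        courses:bag{c:(course:chararray)});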
Pig on Hadoop
Use case for Pig
Apache Pig is a data flow language and processing framework used for handling large datasets in
a distributed computing environment like Hadoop. It's often used for tasks like data cleaning,
transformation, and analysis, particularly in situations where ad-hoc queries and fast prototyping
are needed. Pig is also beneficial for ETL (Extract, Transform, Load) processes and handling
unstructured data like web logs and social media feeds.
Here's a more detailed look at some specific use cases:
1. Data Cleaning and Transformation:
ETL (Extract, Transform, Load):
Pig is well-suited for performing ETL operations, where data is extracted from various sources,
transformed into a desired format, and then loaded into a target system.
Data Cleaning:
Pig can handle inconsistent data schemas and clean data by removing errors, duplicates, and
other inconsistencies.
Data Transformation:
It allows for various transformations, including filtering, grouping, joining, and sorting
datasets, simplifying data preparation for analysis.
2. Web Log Processing:
Web Log Analysis:
Pig can process large web logs to gain insights into server usage, user behavior, and identify
potential security threats or performance bottlenecks.
Performance Monitoring:
By analyzing web logs, Pig can help monitor server performance, identify frequent errors, and
improve the overall user experience.
3. Data Analysis and Exploration:
Ad-hoc Queries:
Pig supports ad-hoc queries across large datasets, allowing analysts to quickly explore data and
find patterns.
Prototyping:
It can be used to quickly prototype algorithms for processing large datasets, making it suitable
for experimentation and development.
Sampling:
Pig can be used to sample large datasets, producing representative subsets from which insights can be drawn.
4. Unstructured Data Handling:
Log Files:
Pig can effectively process unstructured data, including log files, text files, and social media
data.
Social Media Data:
It can be used to extract and analyze data from social media feeds, providing insights into
customer sentiment, trends, and engagement.
5. Other Applications:
Matched Data:
Pig can be used for data matching tasks, such as matching people to places or jobs.
Search Engines:
Search engines and question-answer engines use Pig to match data and find relevant
information.
Social Media:
Social media platforms like Twitter and Yahoo use Pig for various tasks, including data mining
and matching.
De-identification:
Pig can be used to de-identify personal health information, ensuring patient privacy while still
allowing for data analysis.
Running Pig
Running Pig can be done in different modes depending on your setup, whether you're running it
locally on a single machine or in a distributed Hadoop environment (using MapReduce mode).
Here's a step-by-step guide on how to run Pig in both Local Mode and MapReduce Mode.
1. Running Pig in Local Mode
In Local Mode, Pig processes data on a single machine without using Hadoop’s distributed file
system (HDFS). This mode is useful for small datasets or testing.
Steps:
Install Pig:
Download and install Apache Pig from the official Pig website.
Set up your PIG_HOME environment variable to point to the Pig installation directory.
Set Up Hadoop:
Local mode does not require a running Hadoop cluster, but you’ll need to have Hadoop
installed for the Pig installation to work.
Make sure the HADOOP_HOME environment variable is set correctly.
Write Pig Script:
Create a Pig script (example.pig) with a set of Pig Latin statements.
For example:
-- Load the data from a text file
data = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
-- Filter out rows where age is greater than 30
filtered_data = FILTER data BY age > 30;
-- Store the result into an output directory
STORE filtered_data INTO 'output' USING PigStorage(',');
Run Pig Script in Local Mode:
To run the script in local mode, use the following command:
pig -x local example.pig
The -x local option tells Pig to run in local mode. The script will process the data and output the
results to the output directory.
2. Running Pig in MapReduce Mode (Cluster Mode)
In MapReduce Mode, Pig runs on a Hadoop cluster and processes large datasets in parallel.
This mode is suitable for processing large-scale data stored in HDFS.
Steps:
Install Hadoop:
First, set up and configure a Hadoop cluster. Make sure the HDFS is running, as Pig will use it to
store and process data.
Install Pig:
Download and install Apache Pig as you would for local mode.
Ensure the PIG_HOME and HADOOP_HOME environment variables are properly configured.
Write Pig Script:
Create a Pig Latin script for the data transformation and analysis.
Upload Data to HDFS:
Before running the Pig script, upload your data to HDFS using the HDFS put command:
hadoop fs -put local_data.txt /user/hadoop/data
Run Pig Script in MapReduce Mode:
To run the script in MapReduce mode, use the following command:
pig -x mapreduce example.pig
The -x mapreduce option tells Pig to run in MapReduce mode, meaning it will run on the
Hadoop cluster and use MapReduce jobs to process the data.
Access Results:
After the script completes, check the results stored in HDFS.
hadoop fs -ls /user/hadoop/output
hadoop fs -cat /user/hadoop/output/part-00000
3. Key Pig Commands for Running Pig:
pig -x local <script>: Runs the Pig script in local mode.
pig -x mapreduce <script>: Runs the Pig script in MapReduce mode.
pig -e "<command>": Runs a single Pig Latin command instead of a script.
pig: Starts the interactive Pig Grunt shell, where you can execute Pig Latin commands one by
one interactively.
4. Debugging and Logs
Pig generates log files during execution. If there are issues with running your Pig script, these
logs will contain details about the problem.
In local mode, logs can be found in the pig_*.log file in the current directory.
In MapReduce mode, logs will be available in the Hadoop job tracker’s web UI, and also in the
logs directory in HDFS.
Example of a Simple Pig Script:
-- Load the input data into a relation
data = LOAD 'hdfs:/user/hadoop/input_data.csv' USING PigStorage(',') AS (name:chararray,
age:int, city:chararray);
-- Filter the data to only include people older than 30
filtered_data = FILTER data BY age > 30;
-- Group the data by city
grouped_data = GROUP filtered_data BY city;
-- Dump the grouped result to the console
DUMP grouped_data;
Invoking the Grunt Shell
You can invoke the Grunt shell in the desired mode (local or MapReduce) using the -x option as shown below.
Local mode command: $ ./pig -x local
MapReduce mode command: $ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ctrl + d.
After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin
statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
Executing Apache Pig in Batch Mode
You can write an entire Pig Latin script in a file and execute it by passing the file name to the
pig command along with the -x option. Let us suppose we have a Pig script in a file named
sample_script.pig as shown below.
sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);
Dump student;
Now, you can execute the script in the above file as shown below.
Local mode: $ pig -x local sample_script.pig
MapReduce mode: $ pig -x mapreduce sample_script.pig
HDFS Commands
HDFS is the primary or major component of the Hadoop ecosystem which is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files. To use the HDFS commands, first you need to
start the Hadoop services using the following command:
sbin/start-all.sh
To check the Hadoop services are up and running use the following command:
jps
Commands:
ls: This command is used to list all the files. Use lsr for recursive approach. It is useful when we
want a hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables, so
bin/hdfs means we want the hdfs executable, and dfs selects the Distributed File System (shell)
commands.
mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first
create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is the
most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder Sample present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /Sample
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /Sample
copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /Sample ../Desktop/hero
(OR)
bin/hdfs dfs -get /Sample ../Desktop/hero
Note: Observe that we don't write bin/hdfs when checking files present on the local filesystem.
moveFromLocal: This command will move file from local to hdfs.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /Sample
cp: This command is used to copy files within hdfs. Let's copy the folder Sample to Sample_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /Sample /Sample_copied
mv: This command is used to move files within hdfs. Let's cut-paste a
file myfile.txt from the Sample folder to Sample_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /Sample/myfile.txt /Sample_copied
rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command
when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /Sample_copied -> It will delete all the content inside the directory and then
the directory itself.
stat: It gives the last modified time of a directory or path. In short, it gives the stats of the
directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /Sample
setrep: This command is used to change the replication factor of a file/directory in HDFS. By
default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for Sample.txt stored in HDFS.
bin/hdfs dfs -setrep -R -w 6 Sample.txt
Example 2: To change the replication factor to 4 for the directory /Sample stored in HDFS.
bin/hdfs dfs -setrep -R 4 /Sample
Relational Operators in Pig Latin
1. LOAD
Description: The LOAD operator is used to load data from the file system (such as HDFS or the
local file system) into a Pig relation.
Syntax:
relation_name = LOAD 'path' USING PigStorage(',') AS (field1:type, field2:type, ...);
Example:
data = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
Explanation: This loads data from the file data.txt where fields are separated by commas and
defines the schema with name, age, and city.
2. STORE
Description: The STORE operator is used to store the result of a relation into a specified location
(typically in HDFS).
Syntax:
STORE relation_name INTO 'path' USING PigStorage(',');
Example:
STORE data INTO 'output' USING PigStorage(',');
Explanation: This stores the relation data into the HDFS directory output, with fields separated by
commas.
3. FILTER
Description: The FILTER operator is used to filter out records from a relation based on a
condition.
Syntax:
filtered_data = FILTER relation_name BY condition;
Example:
adults = FILTER data BY age > 18;
Explanation: This filters the data relation to include only records where the age is greater than
18.
4. FOREACH
Description: The FOREACH operator is used to apply transformations or expressions on each
element of a relation, similar to a map operation.
Syntax:
result = FOREACH relation_name GENERATE expression1, expression2, ...;
Example:
names = FOREACH data GENERATE name;
Explanation: This extracts just the name field from the data relation.
5. GROUP
Description: The GROUP operator is used to group data by one or more fields. It is similar to
SQL's GROUP BY clause.
Syntax:
grouped_data = GROUP relation_name BY field_name;
Example:
grouped_by_city = GROUP data BY city;
Explanation: This groups the data relation by the city field.
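In practice, GROUP is usually followed by a FOREACH that aggregates over each group's bag; a minimal sketch continuing the example above:
-- Count how many records fall into each city
city_counts = FOREACH grouped_by_city GENERATE group AS city, COUNT(data) AS num_records;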
6. JOIN
Description: The JOIN operator is used to combine two or more relations based on a common
field (key), similar to SQL's JOIN.
Syntax:
joined_data = JOIN relation1 BY key, relation2 BY key;
Example:
joined_data = JOIN data1 BY name, data2 BY name;
Explanation: This joins the data1 and data2 relations on the name field.
7. ORDER BY
Description: The ORDER BY operator is used to sort data by one or more fields in ascending or
descending order.
Syntax:
ordered_data = ORDER relation_name BY field_name [ASC|DESC];
Example:
sorted_data = ORDER data BY age DESC;
Explanation: This orders the data relation by the age field in descending order.
8. LIMIT
Description: The LIMIT operator restricts the number of rows in the result to a specified count.
Syntax:
limited_data = LIMIT relation_name n;
Example:
top5 = LIMIT data 5;
Explanation: This returns the first 5 records from the data relation.
9. DISTINCT
Description: The DISTINCT operator is used to eliminate duplicate records from a relation.
Syntax:
distinct_data = DISTINCT relation_name;
Example:
unique_data = DISTINCT data;
Explanation: This returns a relation where duplicate records from data are removed.
10. CROSS
Description: The CROSS operator is used to compute the cross-product of two or more relations
(similar to SQL’s Cartesian Join).
Syntax:
crossed_data = CROSS relation1, relation2;
Example:
cross_data = CROSS data1, data2;
Explanation: This produces all combinations of records from data1 and data2.
11. CONCAT
Description: CONCAT is a built-in eval function (rather than a relational operator) used to
concatenate two or more expressions of the same type, such as chararrays. It is typically used
inside a FOREACH ... GENERATE statement.
Syntax:
result = FOREACH relation_name GENERATE CONCAT(string1, string2);
Example:
full_names = FOREACH data GENERATE CONCAT(first_name, last_name);
Explanation: This combines the first_name and last_name fields into a single full-name field.
12. UNION
Description: The UNION operator is used to combine two or more relations into a single
relation, keeping duplicates (like SQL's UNION ALL rather than SQL's UNION).
Syntax:
united_data = UNION relation1, relation2;
Example:
all_data = UNION data1, data2;
Eval Function
In Apache Pig, Eval Functions (short for Evaluation Functions) are used to perform
computations or transformations on data within a Pig Latin script. They allow you to manipulate
the data in various ways, such as applying mathematical operations, string manipulation, type
conversions, and other data processing tasks.
Types of Eval Functions
Built-in Eval Functions
User-Defined Eval Functions (UDFs)
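For example, built-in eval functions such as UPPER, SIZE, and COUNT can be applied directly inside FOREACH ... GENERATE; a small sketch reusing the data relation from the earlier examples:
-- Apply built-in eval functions to each record
name_info = FOREACH data GENERATE UPPER(name) AS upper_name, SIZE(name) AS name_length;
A user-defined eval function, by contrast, is written outside Pig and registered before use, as in the Python (Jython) example below.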
Step 1: Write the Python UDF (string_length.py)
@outputSchema("length:int")
def string_length(input):
    return len(input)
Step 2: Register the Python UDF in Pig
REGISTER 'string_length.py' USING jython AS string_length;
Step 3: Use the UDF in a Pig Script
result = FOREACH data GENERATE string_length.string_length(name);
Piggy Bank
In big data, "piggybank" refers to a collection of user-defined functions (UDFs) for the Apache
Pig data processing framework. These UDFs extend Pig's capabilities, allowing users to perform
custom data manipulation and analysis. Piggybank is not a built-in part of Pig but is a repository
of user-contributed functions, distributed as part of the Pig distribution.
What it is:
UDF Repository:
Piggybank is essentially a library of user-defined functions that extend the functionality of the
Apache Pig data processing framework.
Extending Pig:
These UDFs allow users to write custom code that can be executed within Pig scripts, enabling
them to perform specific data transformations and analysis that might not be possible with
Pig's built-in functions.
User-Contributed:
The functions within Piggybank are contributed by the Pig user community, meaning they can
be shared and reused by others.
Not Built-in:
Piggybank functions are not part of the core Pig distribution but are separate libraries that need
to be registered and used within Pig scripts.
How it's used:
1. Registering UDFs:
Before using a piggybank function, it needs to be registered within the Pig environment.
2. Invoking UDFs:
Once registered, UDFs can be called within Pig scripts like any other function.
3. Benefits:
Custom Functionality: Allows users to perform specific data manipulation and analysis not
covered by Pig's built-in functions.
Code Reuse: Users can contribute their own UDFs and also access UDFs written by others,
reducing development time and effort.
Performance Enhancement: Properly written UDFs can potentially improve the performance
of Pig scripts by optimizing specific operations.
Examples:
Functions for tasks such as sums, averages, counts, and custom ordering (ascending or descending)
can be created as UDFs and stored in piggybank libraries.
UDFs can be used for custom data cleaning, transformation, and analysis tasks.
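A minimal sketch of how a Piggybank function is registered and invoked in a Pig script (the JAR path is a placeholder; UPPER is one of the string eval functions commonly shipped in Piggybank):
REGISTER '/path/to/piggybank.jar';
-- Give the Piggybank function a short alias
DEFINE PB_UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
upper_names = FOREACH data GENERATE PB_UPPER(name);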
In essence, piggybank acts as a collaborative repository of user-defined functions that extends
the capabilities of Pig, enabling users to perform a wider range of data processing tasks within
the Pig framework.
Pig Vs Hive.