BDA Unit 4 Notes

The document provides a comprehensive overview of Hive and Pig, detailing their architectures, data types, file formats, and query languages. Hive is a data warehouse tool for processing structured data in Hadoop, while Pig is a high-level platform for creating programs that run on Hadoop. The document also compares Hive and Pig, highlighting their advantages, disadvantages, and use cases.

Introduction to Hive: What is Hive, Hive Architecture, Hive data types, Hive file formats, Hive Query Language (HQL), RCFile implementation, User Defined Function (UDF).


Introduction to Pig: What is Pig, Anatomy of Pig, Pig on Hadoop, Pig Philosophy, Use case for
Pig, Pig Latin Overview, Data types in Pig, Running Pig, Execution Modes of Pig, HDFS
Commands, Relational Operators, Eval Function, Complex Data Types, Piggy Bank, User
Defined Function, Pig Vs Hive.

Introduction to Hive
What is Hive: Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy. Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is Not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database (the metastore) and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Advantages of Hive Architecture:
Scalability: Hive is a distributed system that can easily scale to handle large volumes of data
by adding more nodes to the cluster.
Data Accessibility: Hive allows users to access data stored in Hadoop without the need for
complex programming skills. SQL-like language is used for queries and HiveQL is based on
SQL syntax.
Data Integration: Hive integrates easily with other tools and systems in the Hadoop
ecosystem such as Pig, HBase, and MapReduce.
Flexibility: Hive can handle both structured and unstructured data, and supports various data
formats including CSV, JSON, and Parquet.
Security: Hive provides security features such as authentication, authorization, and encryption
to ensure data privacy.

Disadvantages of Hive Architecture:


High Latency: Hive’s performance is slower compared to traditional databases because of the
overhead of running queries in a distributed system.
Limited Real-time Processing: Hive is not ideal for real-time data processing as it is designed
for batch processing.
Complexity: Hive is complex to set up and requires a high level of expertise in Hadoop, SQL,
and data warehousing concepts.
Lack of Full SQL Support: HiveQL does not support all SQL operations, such as transactions
and indexes, which may limit the usefulness of the tool for certain applications.
Debugging Difficulties: Debugging Hive queries can be difficult as the queries are executed
across a distributed system, and errors may occur in different nodes.
Hive Architecture
Hive's architecture follows a multi-layer structure consisting of several components:
The major components of Hive and their interaction with Hadoop are shown in the figure below, and each component is described further:
User Interface (UI) –
As the name suggests, the user interface provides an interface between the user and Hive. It enables the user to submit queries and other operations to the system. The Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server) are the supported user interfaces.

Hive Server – It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
Driver –
The driver receives the user's queries from the interface. It implements the concept of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.

Compiler –
The compiler parses the query and performs semantic analysis on the different query blocks and query expressions. It eventually generates the execution plan with the help of table and partition metadata looked up from the metastore.

Metastore –
The metastore stores all the structural information of the different tables and partitions in the warehouse, including columns and column-level information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored. Hive uses a relational database server to store this schema or metadata of databases, tables, columns in a table, their data types, and the HDFS mapping.

Execution Engine –
The execution engine carries out the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between the various stages of the plan and executes these stages on the appropriate system components.

Diagram – Architecture of Hive built on top of Hadoop


Along with the architecture, the diagram above shows the job execution flow in Hive with Hadoop step by step.
Step-1: Execute Query –
A Hive interface such as the command line or the web UI sends the query to the driver to execute. The UI calls the driver's execute interface, for example over JDBC or ODBC.

Step-2: Get Plan –
The driver creates a session handle for the query and passes the query to the compiler to generate an execution plan. In other words, the driver interacts with the compiler.

Step-3: Get Metadata –
The compiler sends a metadata request to the metastore to obtain the table and partition metadata it needs.
Step-4: Send Metadata –
The metastore sends the metadata back to the compiler as a response.
Step-5: Send Plan –
The compiler sends the generated execution plan back to the driver to execute the query.
Step-6: Execute Plan –
The driver sends the execution plan to the execution engine. Internally, the plan is run as a job on Hadoop (Execute Job / Job Done), and any required metadata operations are carried out as DFS operations.
Step-7: Fetch Results –
The user interface (UI) fetches the results from the driver.
Step-8: Send Results –
The driver requests the results from the execution engine. When the execution engine retrieves the result from the data nodes, it returns the result to the driver, which passes it on to the user interface (UI).
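As a concrete illustration of Step-1, a client can submit a query to HiveServer2 over JDBC using the beeline command-line tool; the host, port, and table name below are placeholders rather than values from these notes:
bash
beeline -u jdbc:hive2://localhost:10000/default -e "SELECT COUNT(*) FROM my_table;"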

Hive Data Types:

Column Types
Literals
Null Values
Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range exceeds
the range of INT, you need to use BIGINT and if the data range is smaller than the INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:
Type       Postfix   Example
TINYINT    Y         10Y
SMALLINT   S         10S
INT        -         10
BIGINT     L         10L
String Types
String type data types can be specified using single quotes (' ') or double quotes (" "). It contains
two data types: VARCHAR and CHAR. Hive follows C-types escape characters.
The following table depicts various CHAR data types:
Data Type   Length
VARCHAR     1 to 65535
CHAR        255
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision. It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and the format "yyyy-mm-dd hh:mm:ss.ffffffffff".
Dates
DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is the same as Java's BigDecimal format. It is used for representing immutable arbitrary-precision values. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using create union. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this type of data is
composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating-point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to C structs: a record of named fields, where each field can carry an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
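A short sketch showing all three complex types together in one table definition (the table and column names are illustrative, not taken from these notes):
sql
CREATE TABLE employee_complex (
name STRING,
skills ARRAY<STRING>,                     -- e.g. a list of skill names
phone MAP<STRING, STRING>,                -- e.g. "home" mapped to a phone number
address STRUCT<city:STRING, zip:STRING>   -- a record with named fields
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';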

Hive File Formats:


In Hive, file formats define how data is stored in HDFS and how Hive interacts with that data.
The file format you choose impacts the efficiency of storage, querying, and performance. Hive
supports several file formats, each suitable for different use cases. Here's a list of the most
common Hive file formats:
1. Text File (Default Format)
File Extension: .txt
Description: Plain text format where each line represents a record. Data within a line is
separated by delimiters (such as commas, tabs, or spaces).
Usage: Simple use cases, small datasets, or when you need to import or export data in plain text.
Example:
sql
CREATE TABLE example (name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
2. SequenceFile
File Extension: .seq
Description: A binary format that is optimized for storing serialized data. SequenceFile is
suitable for Hadoop MapReduce jobs as it supports splitting large files into blocks for parallel
processing.
Usage: When working with large datasets that require efficient serialization, such as MapReduce
jobs.
Example:
sql
CREATE TABLE example (name STRING, age INT)
STORED AS SEQUENCEFILE;
3. ORC (Optimized Row Columnar)
File Extension: .orc
Description: A columnar storage format that is optimized for both storage and query
performance. ORC provides lightweight compression, fast reads, and efficient disk space usage.
Usage: Ideal for large-scale data analytics, especially when doing complex queries that scan only
a subset of columns.
Example:
sql
CREATE TABLE example (name STRING, age INT)
STORED AS ORC;
4. Parquet
File Extension: .parquet
Description: A columnar storage file format designed for high performance and efficient
storage. Parquet is widely used with big data processing tools like Apache Spark and Apache
Drill.
Usage: Ideal for complex data types (such as nested data) and large analytical queries in a
columnar format.
Example:
sql
CREATE TABLE example (name STRING, age INT)
STORED AS PARQUET;
5. Avro
File Extension: .avro
Description: A row-based format that is used for serialization of data. It supports schema
evolution, meaning it can handle changes to data structure over time (e.g., adding/removing
fields).
Usage: Best suited for data streaming, and real-time ingestion, or for use cases where schema
evolution is needed.
Example:
sql
CREATE TABLE example (name STRING, age INT)
STORED AS AVRO;
6. RCFile (Record Columnar File)
File Extension: .rc
Description: A hybrid storage format that stores data in records (rows) but stores columns
together for efficient compression and query processing.
Usage: Suitable for large datasets and when columnar storage provides better performance than
row-based formats.
Example:
sql
CREATE TABLE example (name STRING, age INT)
STORED AS RCFILE;
7. Delta Lake (with Apache Spark)
File Extension: .delta
Description: Delta Lake is an open-source storage layer that provides ACID transactions on top
of data lakes. It allows for versioning and ensures data reliability.
Usage: Used in Spark-based workflows and provides features such as schema enforcement and
time travel (versioning).
Example:
sql
CREATE TABLE example (name STRING, age INT)
STORED AS DELTA;
8. Custom File Formats
Hive allows you to create custom file formats using the STORED AS clause and by defining
custom serialization/deserialization (SerDe) logic. You can use this for special use cases where
existing file formats don't fit your needs.
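As a hedged illustration, one way to plug in custom (de)serialization is the ROW FORMAT SERDE clause with a SerDe class such as the built-in OpenCSVSerde; the table name and property values here are examples only:
sql
CREATE TABLE example_csv (name STRING, age STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS TEXTFILE;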

Comparison of Hive File Formats

Format        Type                  Compression   Query Efficiency   Storage Efficiency   Use Case
TextFile      Row-based             None          Low                Low                  Simple, small datasets, debugging
SequenceFile  Binary                Yes           Medium             Medium               Large datasets, MapReduce jobs
ORC           Columnar              Yes           High               High                 Data analytics, read-heavy queries
Parquet       Columnar              Yes           High               High                 Complex data, large-scale analytics
Avro          Row-based             Yes           Medium             Medium               Data streaming, schema evolution
RCFile        Hybrid (Row/Column)   Yes           High               High                 Large datasets, hybrid column-row processing
Delta Lake    Transactional         Yes           Very High          Very High            ACID transactions, Spark-based workloads

Hive Query Language (HQL)

It is a SQL-like query language used to interact with Apache Hive, a data warehouse
infrastructure built on top of Hadoop. Hive allows for querying and managing large datasets
stored in Hadoop's HDFS (Hadoop Distributed File System) using a language that's similar to
SQL. However, it is specifically designed to work efficiently with massive amounts of
unstructured or semi-structured data, typically in the form of logs, JSON, or Parquet files.
Here’s an overview of HQL:
Basic Features of Hive Query Language:
Data Definition Language (DDL):
CREATE DATABASE: To create a new database in Hive.
sql
CREATE DATABASE my_database;
USE: Switch between databases.
sql
USE my_database;
CREATE TABLE: Create a new table with a specified schema.
sql
CREATE TABLE my_table (
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
DROP TABLE: Remove a table from the database.
sql
DROP TABLE my_table;
Data Manipulation Language (DML):
INSERT INTO: Add data into a table.
sql
INSERT INTO my_table VALUES (1, 'Alice', 25);
SELECT: Query data from a table.
sql
SELECT * FROM my_table WHERE age > 20;
LOAD DATA: Load data from a file into a Hive table.
sql
LOAD DATA INPATH '/path/to/data.csv' INTO TABLE my_table;
Data Types:
Hive supports common data types such as:
STRING, INT, BOOLEAN, DOUBLE, FLOAT, DATE, TIMESTAMP, ARRAY, MAP,
STRUCT, UNIONTYPE, etc.
Partitioning and Bucketing:
Partitioning: Splits large tables into smaller, manageable parts. For example:
sql
CREATE TABLE my_table (
id INT,
name STRING
)
PARTITIONED BY (year INT, month INT);
Bucketing: Divides data into fixed-size buckets to distribute the load evenly.
sql
CREATE TABLE my_table (
id INT,
name STRING
)
CLUSTERED BY (id) INTO 10 BUCKETS;
Joins:
Hive supports various types of joins like inner join, left join, etc.
sql
SELECT a.id, a.name, b.salary
FROM employee a
INNER JOIN salary b
ON a.id = b.id;
Aggregation:
Hive supports common aggregation functions such as COUNT(), SUM(), AVG(), MAX(),
MIN(), etc.
sql
SELECT department, AVG(salary)
FROM employee
GROUP BY department;
Group By and Having:
Hive supports both GROUP BY and HAVING clauses to aggregate and filter the results.
sql
SELECT department, COUNT(*)
FROM employee
GROUP BY department
HAVING COUNT(*) > 5;
Views:
Hive allows you to create views, which are virtual tables based on the result of a query.
sql
CREATE VIEW employee_view AS
SELECT name, salary
FROM employee
WHERE salary > 50000;
File Formats:
Hive supports multiple file formats like:
Text: Plain text files, like CSV or TSV.
ORC (Optimized Row Columnar): A highly efficient columnar storage format.
Parquet: A columnar storage format often used with complex nested data structures.
Functions:
Built-in Functions: These include mathematical functions, string functions, date functions, etc.
Example of a string function:
sql
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employee;
Example of a Simple Query:
Creating a Table:
sql
CREATE TABLE employees (
id INT,
name STRING,
salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Loading Data into the Table:
sql
LOAD DATA INPATH '/user/hive/warehouse/employees.csv' INTO TABLE employees;
Querying the Data:
sql
SELECT name, salary
FROM employees
WHERE salary > 50000;
Key Differences Between HQL and SQL:
Execution: SQL queries run on traditional relational databases, while HQL queries are executed
on top of Hadoop clusters (e.g., Hive runs on HDFS).
Performance: Hive queries are slower than traditional databases due to the distributed nature of
Hadoop and the lack of optimizations that relational databases have.
Data Storage: Hive works with data stored in the Hadoop Distributed File System (HDFS) or
other similar distributed file systems, whereas traditional SQL databases store data in a
centralized system.

RCFile Implementation in Hive

1. Creating a Table with RCFile Format:


To create a table using the RCFile format in Hive, you can use the following CREATE
TABLE statement. It specifies RCFILE as the storage format.

sql
CREATE TABLE employee_rcfile (
id INT,
name STRING,
salary DOUBLE
)
STORED AS RCFILE;

2. Inserting Data into RCFile Table:


Once the table is created, you can insert data into it either from another table or from an
external file. For example:

Sql
INSERT INTO TABLE employee_rcfile VALUES (1, 'Alice', 50000.0);
INSERT INTO TABLE employee_rcfile VALUES (2, 'Bob', 60000.0);

3. Loading Data into RCFile Table:


You can load data from a file into the RCFile format table. Here’s an example of how to
do this from an HDFS file:

sql
LOAD DATA INPATH '/user/hive/warehouse/employee.csv' INTO TABLE
employee_rcfile;

4. Querying Data from an RCFile Table:


Once the table is populated, you can query the data just like any other Hive table.

sql
SELECT * FROM employee_rcfile WHERE salary > 55000;

RCFile vs Other Formats (ORC, Parquet)

RCFile: An older columnar format; good for some types of queries but lacks some optimizations available in more recent formats like ORC and Parquet.
ORC (Optimized Row Columnar): ORC files are an evolution of the RCFile format and provide much better performance, compression, and advanced features such as lightweight predicate pushdown and vectorization. ORC has been the required format for Hive's ACID (transactional) tables since version 0.14.
Parquet: Another columnar format that is often used in the Hadoop ecosystem. It is language-agnostic and widely used in big data systems.

Example of RCFile Table Creation with Compression:

For RCFile tables, compression is typically enabled through session-level settings before data is written, rather than through ORC-specific table properties:

sql
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE employee_rcfile_compressed (
id INT,
name STRING,
salary DOUBLE
)
STORED AS RCFILE;

In this example, data written into the RCFile table is compressed with Snappy, which is a popular choice for data compression due to its balance of speed and compression ratio.

Working with RCFile in Hive:

1. Column Pruning: Since RCFile stores data in columns, it is more efficient to access a
few columns from a large dataset than a row-based format, which has to read the entire
row.
2. Querying Performance: RCFile provides faster read performance for certain kinds of
queries (especially when you only need a few columns) compared to row-based formats
like TextFile.

User Defined Function (UDF)

A User Defined Function (UDF) in Hive is a way to extend the functionality of Hive by
allowing users to write custom functions to handle specific logic or operations that aren't
provided by the built-in functions. These functions can be written in Java and are particularly
useful when the built-in functions (such as COUNT(), AVG(), SUM(), etc.) do not meet the
needs of the user for complex computations or specialized processing.
Key Points About UDF in Hive:
Custom Logic: UDFs enable you to implement custom logic in your queries.
Extending Hive: Hive’s built-in functions are limited to standard operations. A UDF allows for
more flexible and specialized operations.
Written in Java: Hive UDFs are typically written in Java, but they can be written in any
language that compiles into a JVM-compatible bytecode.
Integration with Hive Queries: Once a UDF is written, it can be used directly in Hive queries
like any other built-in function.
Types of Hive Functions:
UDF (User Defined Function): Extends Hive’s functionality with custom operations.
UDAF (User Defined Aggregation Function): Extends the built-in aggregation functions (e.g.,
SUM(), AVG(), COUNT()) for more complex group-wise operations.
UDTF (User Defined Table-Generating Function): Returns a set of rows rather than a single
value. Useful for operations that need to split a single row into multiple rows (e.g., for parsing
data).
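For instance, Hive's built-in explode() is a UDTF: given an array it emits one output row per element, which is exactly the row-generating behaviour a custom UDTF would implement:
sql
SELECT explode(array('a', 'b', 'c')) AS letter;
-- returns three rows: a, b, c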
Creating and Using a UDF in Hive
To use a UDF in Hive, you must first create the UDF in Java, compile it, and then register it with
Hive. Let’s walk through the process step by step:
1. Writing a UDF in Java
First, you need to write your custom UDF. Here is an example of a simple UDF that reverses a
string.
java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class ReverseStringUDF extends UDF {

    // Called by Hive once per row; returns the input string reversed.
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        String reversed = new StringBuilder(input.toString()).reverse().toString();
        return new Text(reversed);
    }
}
In this example:
ReverseStringUDF extends UDF, which is a Hive class used to create user-defined functions.
The method evaluate() is overridden to define the functionality of the UDF.
The function takes a Text object as input and returns a Text object.
2. Compiling the Java UDF
To compile this Java class, you need to package it into a JAR file. You can use javac or a build
tool like Maven or Gradle to compile it into a JAR file.
If you're using the command line (the hive-exec JAR provides the UDF base class; adjust the path to match your Hive installation):
bash
javac -cp $(hadoop classpath):$HIVE_HOME/lib/hive-exec-*.jar ReverseStringUDF.java
jar -cf reverse-string-udf.jar ReverseStringUDF.class
This creates a reverse-string-udf.jar file that can be used in Hive.
3. Registering the UDF with Hive
Once your UDF is compiled into a JAR file, you need to register it with Hive to make it available
for use in queries.
sql
ADD JAR /path/to/reverse-string-udf.jar;
This command adds the JAR file containing the UDF to the Hive session.
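In addition to adding the JAR, the class has to be mapped to a function name before it can be called in a query; the name reverse_string and the class ReverseStringUDF follow from the example above (a class inside a package would need its fully qualified name):
sql
CREATE TEMPORARY FUNCTION reverse_string AS 'ReverseStringUDF';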
4. Using the UDF in a Hive Query
After registering the JAR, you can use your UDF in a Hive query just like any built-in function.
sql
SELECT reverse_string(name) FROM employees;
In this example, reverse_string is the function name registered for the UDF, and it will reverse the name field from the employees table.
5. Dropping the UDF
If you want to remove the UDF from the current Hive session, you can use the following
command:
sql
DROP TEMPORARY FUNCTION reverse_string;
Example of a UDF in Action
Let’s walk through an example to see the UDF in action.
Step 1: Create the Table
sql
CREATE TABLE employees (
id INT,
name STRING
);
Step 2: Insert Data
sql
INSERT INTO employees VALUES (1, 'Alice');
INSERT INTO employees VALUES (2, 'Bob');
Step 3: Register the UDF
sql
ADD JAR /path/to/reverse-string-udf.jar;
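As in the walkthrough above, the function name must also be created for this session before Step 4 can run:
sql
CREATE TEMPORARY FUNCTION reverse_string AS 'ReverseStringUDF';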
Step 4: Use the UDF in a Query
sql
SELECT id, reverse_string(name) AS reversed_name FROM employees;
Expected Output:
1 ecilA
2 boB

Introduction to Pig: What is Pig


Pig is a high-level platform and language (Pig Latin) used for processing large datasets within
the Hadoop ecosystem. It provides a way to express data transformations and analyses using a
scripting language that resembles SQL, simplifying the creation of MapReduce jobs. Pig Latin
scripts are automatically converted into MapReduce jobs by the Pig Engine, allowing users to
focus on data processing logic rather than low-level implementation details.

What is Pig?
A Platform for Big Data Processing:
Pig is a platform designed to handle large amounts of data, particularly within the Hadoop
environment.
Pig Latin:
Pig's core is its high-level scripting language, Pig Latin, which is used to define data
processing workflows.
Dataflow Language:
Pig Latin is a dataflow language, meaning it allows users to describe how data should be read,
processed, and written to output in a parallel manner.
Simplifies MapReduce:
Pig abstracts away the complexities of MapReduce, allowing programmers to focus on the data
processing logic rather than low-level implementation details.
Automatic Job Generation:
Pig Latin scripts are automatically translated into MapReduce jobs by the Pig Engine, a
component of the Pig platform.
Key Concepts:
Dataflow:
Pig Latin scripts define dataflows, which are essentially directed acyclic graphs (DAGs) where
data is transformed through a series of operations.
Operators:
Pig Latin provides a rich set of operators (e.g., FILTER, JOIN, SORT) to manipulate data
within these dataflows.
User-Defined Functions (UDFs):
Users can create their own functions using Pig Latin, extending its functionality.
HDFS:
Pig is designed to work with the Hadoop Distributed File System (HDFS), storing and
retrieving data efficiently.
Pig Engine:
The Pig Engine takes Pig Latin scripts and translates them into MapReduce jobs, which are
then executed by Hadoop.

Anatomy of Pig:

In the context of big data, "Pig" refers to Apache Pig, a high-level data flow language and
platform for analyzing large datasets. It uses Pig Latin as its scripting language, which simplifies
writing MapReduce jobs. Pig works with data stored in Hadoop File System (HDFS).
Here's a breakdown of Pig's key aspects:
Data Flow Language:
Pig Latin allows users to define data transformations and operations in a declarative way,
which is then compiled into MapReduce jobs.
Abstraction:
Pig provides a high-level abstraction for processing data, hiding the complexities of
MapReduce programming.
Data Types:
Pig supports various data types, including atoms (like integers and strings), tuples (ordered
collections of fields), and bags (unordered collections of tuples).
Components:
Pig's architecture includes a Parser, Optimizer, and Execution Engine.
ETL:
Pig is often used in Extract, Transform, and Load (ETL) processes to prepare and analyze
data.
Hadoop Integration:
Pig is designed to run on Hadoop clusters, leveraging its infrastructure for parallel data
processing.
Use Cases:
Pig is suitable for various applications, including clickstream analysis, search log analysis, and
web crawl analysis.

Pig on Hadoop
Pig runs as a client-side layer on top of Hadoop: the Pig engine compiles Pig Latin scripts into one or more MapReduce jobs, which are executed on the Hadoop cluster and read and write their data in HDFS. Pig therefore needs no separate server of its own; it simply submits jobs to an existing Hadoop installation (or runs everything locally for testing, as described under Execution Modes of Pig).

Use case for Pig
Apache Pig is a data flow language and processing framework used for handling large datasets in
a distributed computing environment like Hadoop. It's often used for tasks like data cleaning,
transformation, and analysis, particularly in situations where ad-hoc queries and fast prototyping
are needed. Pig is also beneficial for ETL (Extract, Transform, Load) processes and handling
unstructured data like web logs and social media feeds.
Here's a more detailed look at some specific use cases:
1. Data Cleaning and Transformation:
ETL (Extract, Transform, Load):
Pig is well-suited for performing ETL operations, where data is extracted from various sources,
transformed into a desired format, and then loaded into a target system.
Data Cleaning:
Pig can handle inconsistent data schemas and clean data by removing errors, duplicates, and
other inconsistencies.
Data Transformation:
It allows for various transformations, including filtering, grouping, joining, and sorting
datasets, simplifying data preparation for analysis.
2. Web Log Processing:
Web Log Analysis:
Pig can process large web logs to gain insights into server usage, user behavior, and identify
potential security threats or performance bottlenecks.
Performance Monitoring:
By analyzing web logs, Pig can help monitor server performance, identify frequent errors, and
improve the overall user experience.
3. Data Analysis and Exploration:
Ad-hoc Queries:
Pig supports ad-hoc queries across large datasets, allowing analysts to quickly explore data and
find patterns.
Prototyping:
It can be used to quickly prototype algorithms for processing large datasets, making it suitable
for experimentation and development.
Sampling:
Pig can be used to sample large datasets in order to obtain quick insights from them.
4. Unstructured Data Handling:
Log Files:
Pig can effectively process unstructured data, including log files, text files, and social media
data.
Social Media Data:
It can be used to extract and analyze data from social media feeds, providing insights into
customer sentiment, trends, and engagement.
5. Other Applications:
Matched Data:
Pig can be used to match places and find jobs.
Search Engines:
Search engines and question-answer engines use Pig to match data and find relevant
information.
Social Media:
Social media platforms like Twitter and Yahoo use Pig for various tasks, including data mining
and matching.
De-identification:
Pig can be used to de-identify personal health information, ensuring patient privacy while still
allowing for data analysis.

Pig Latin Overview,


Pig Latin is the data flow language used to write Apache Pig programs. A Pig Latin script is a sequence of statements; each statement takes one or more relations (named collections of tuples) as input and produces a new relation as output, so a script describes a step-by-step flow of data rather than a single declarative query as in SQL.
Here's a more detailed look:
Basic Principle:
A script typically loads data with LOAD, transforms it with operators such as FILTER, FOREACH ... GENERATE, GROUP, JOIN, ORDER, and DISTINCT, and writes the result with STORE (or displays it with DUMP).
Characteristics:
Procedural, dataflow style: the programmer states the order of transformations, and Pig translates the flow into a DAG of MapReduce jobs.
Schemas are optional: relations may be loaded with or without a declared schema.
Extensible: user-defined functions (UDFs) can be called wherever an expression is allowed.
Nested data model: fields can be simple atoms or complex types (tuples, bags, and maps).
Purpose:
Pig Latin lets analysts and programmers express ETL-style transformations over large datasets concisely, without writing low-level MapReduce code.
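A minimal Pig Latin script illustrating this dataflow style (the file name and field names are placeholders):
pig
-- Load raw records, keep only adults, and count them per city
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
adults = FILTER users BY age >= 18;
by_city = GROUP adults BY city;
counts = FOREACH by_city GENERATE group AS city, COUNT(adults) AS num_adults;
STORE counts INTO 'adults_per_city' USING PigStorage(',');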

Data types in Pig,


Apache Hadoop provides the storage layer, but to process that data we need a higher-level language that can transform it and perform complex conversions to suit our requirements. Apache Pig provides this data manipulation through Pig Latin, an SQL-like scripting layer on top of Hadoop. Pig works with both structured and semi-structured data, and its scripts are translated into MapReduce jobs that run on the Hadoop cluster.
We must know about Pig data types before understanding the operators in Pig. Any data loaded into Pig has a structure and schema built from these data types, which together form Pig's data model. Pig can handle single atomic values as well as nested, hierarchical structures, and it comes with a small, finite set of data types.
Pig data types can be classified into two categories:
Primitive
Complex
Primitive Data type
It is also named the simple data type. The primitive data types are as follows:
int – Signed 32-bit integer, similar to Integer in Java.
long – Signed 64-bit integer, similar to Long in Java.
float – Signed 32-bit floating-point number, similar to Java's float.
double – Signed 64-bit floating-point number, similar to Double in Java.
chararray – A string of characters in Unicode UTF-8 format, comparable to Java's String.
bytearray – A blob of bytes. When the type of a field is not specified, it defaults to bytearray.
boolean – A value that is either true or false.
Complex Data type
Complex data types are built from the simple types and can be nested inside one another. They are as follows:

Data Type   Definition                                          Syntax                  Example
Tuple       An ordered set of fields, written in parentheses.   (field [, field ...])   (1,2)
Bag         A collection of tuples, written in curly braces.    {tuple [, tuple ...]}   {(1,2), (3,4)}
Map         A set of key-value pairs, written in square         [key#value]             ['keyname'#'valuename']
            brackets with # separating key and value.

Key – The key is used to look up a value; it must be unique and of type chararray.
Value – Any data can be stored as a value, and each key has its own associated value. A map is written with square brackets, with # separating each key from its value and commas separating multiple key-value pairs.
Null values – A null means the value is missing or unknown, and any type can be null. Pig handles nulls much as SQL does: it produces nulls when data is missing or an error occurs during processing, and null can also be used explicitly as a constant value.
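A short sketch of how these types can appear together in a schema (the file name and field names are illustrative; loading complex types from text also requires suitably formatted input):
pig
-- Each record has an atomic field, a tuple, a bag of tuples, and a map
students = LOAD 'students.txt'
AS (name:chararray,
address:tuple(city:chararray, zip:chararray),
grades:bag{t:tuple(subject:chararray, score:int)},
extra:map[]);
projected = FOREACH students GENERATE name, address.city, SIZE(grades);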

Running Pig

Running Pig can be done in different modes depending on your setup, whether you're running it
locally on a single machine or in a distributed Hadoop environment (using MapReduce mode).
Here's a step-by-step guide on how to run Pig in both Local Mode and MapReduce Mode.
1. Running Pig in Local Mode
In Local Mode, Pig processes data on a single machine without using Hadoop’s distributed file
system (HDFS). This mode is useful for small datasets or testing.
Steps:
Install Pig:
Download and install Apache Pig from the official Pig website.
Set up your PIG_HOME environment variable to point to the Pig installation directory.
Set Up Hadoop:
Local mode does not require a running Hadoop cluster, but you’ll need to have Hadoop
installed for the Pig installation to work.
Make sure the HADOOP_HOME environment variable is set correctly.
Write Pig Script:
Create a Pig script (example.pig) with a set of Pig Latin statements.
For example:
pig
-- Load the data from a text file
data = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
-- Filter out rows where age is greater than 30
filtered_data = FILTER data BY age > 30;
-- Store the result into an output directory
STORE filtered_data INTO 'output' USING PigStorage(',');
Run Pig Script in Local Mode:
To run the script in local mode, use the following command:
bash
pig -x local example.pig
The -x local option tells Pig to run in local mode. The script will process the data and output the
results to the output directory.
2. Running Pig in MapReduce Mode (Cluster Mode)
In MapReduce Mode, Pig runs on a Hadoop cluster and processes large datasets in parallel.
This mode is suitable for processing large-scale data stored in HDFS.
Steps:
Install Hadoop:
First, set up and configure a Hadoop cluster. Make sure the HDFS is running, as Pig will use it to
store and process data.
Install Pig:
Download and install Apache Pig as you would for local mode.
Ensure the PIG_HOME and HADOOP_HOME environment variables are properly configured.
Write Pig Script:
Create a Pig Latin script for the data transformation and analysis.
Upload Data to HDFS:
Before running the Pig script, upload your data to HDFS using the HDFS put command:
bash
hadoop fs -put local_data.txt /user/hadoop/data
Run Pig Script in MapReduce Mode:
To run the script in MapReduce mode, use the following command:
bash
pig -x mapreduce example.pig
The -x mapreduce option tells Pig to run in MapReduce mode, meaning it will run on the
Hadoop cluster and use MapReduce jobs to process the data.
Access Results:
After the script completes, check the results stored in HDFS.
bash
hadoop fs -ls /user/hadoop/output
hadoop fs -cat /user/hadoop/output/part-00000
3. Key Pig Commands for Running Pig:
pig -x local <script>: Runs the Pig script in local mode.
pig -x mapreduce <script>: Runs the Pig script in MapReduce mode.
pig -e "<command>": Runs a single Pig Latin command instead of a script.
pig: Starts the interactive Pig Grunt shell, where you can execute Pig Latin commands one by
one interactively.
4. Debugging and Logs
Pig generates log files during execution. If there are issues with running your Pig script, these
logs will contain details about the problem.
In local mode, logs can be found in the pig_*.log file in the current directory.
In MapReduce mode, logs will be available in the Hadoop job tracker’s web UI, and also in the
logs directory in HDFS.
Example of a Simple Pig Script:
pig
-- Load the input data into a relation
data = LOAD 'hdfs:/user/hadoop/input_data.csv' USING PigStorage(',') AS (name:chararray,
age:int, city:chararray);
-- Filter the data to only include people older than 30
filtered_data = FILTER data BY age > 30;
-- Group the data by city
grouped_data = GROUP filtered_data BY city;

-- Count the number of people in each city


count_data = FOREACH grouped_data GENERATE group AS city, COUNT(filtered_data) AS
num_people;
-- Store the results in HDFS
STORE count_data INTO 'hdfs:/user/hadoop/output_data' USING PigStorage(',');

Execution Modes of Pig


You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is
no need of Hadoop or HDFS. This mode is generally used for testing purpose.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular operation
on the data that exists in the HDFS.
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump
operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in
a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java, and using them in our script.
Invoking the Grunt Shell
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the −x option as
shown below.
Local mode command:
$ ./pig -x local

MapReduce mode command:
$ ./pig -x mapreduce

Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using Ctrl + D.
After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin
statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
Executing Apache Pig in Batch Mode
You can write an entire Pig Latin script in a file and execute it using the -x option. Let us suppose we have a Pig script in a file named Sample_script.pig as shown below.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);

Dump student;
Now, you can execute the script in the above file as shown below.
Local mode:
$ pig -x local Sample_script.pig

MapReduce mode:
$ pig -x mapreduce Sample_script.pig

HDFS Commands
HDFS is the primary or major component of the Hadoop ecosystem which is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files. To use the HDFS commands, first you need to
start the Hadoop services using the following command:
sbin/start-all.sh
To check the Hadoop services are up and running use the following command:
jps

Commands:
ls: This command is used to list all the files. Use lsr for recursive approach. It is useful when we
want a hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs means we want the hdfs executable, and dfs selects the Distributed File System commands.
mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first
create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>

Creating the home directory:

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username    (use the username of your computer)
Example:
bin/hdfs dfs -mkdir /Sample => '/' means absolute path
bin/hdfs dfs -mkdir Sample2 => Relative path -> the folder will be
created relative to the home directory.

touchz: It creates an empty file.


Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /Sample/myfile.txt

copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is the
most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder Sample present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /Sample

(OR)

bin/hdfs dfs -put ../Desktop/AI.txt /Sample


cat: To print file contents.
Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside Sample folder.
bin/hdfs dfs -cat /Sample/AI.txt

copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /Sample ../Desktop/hero

(OR)

bin/hdfs dfs -get /Sample/myfile.txt ../Desktop/hero


myfile.txt from Sample folder will be copied to folder hero present on Desktop.

Note: Observe that we don’t write bin/hdfs while checking the things present on local filesystem.
moveFromLocal: This command will move file from local to hdfs.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /Sample

cp: This command is used to copy files within hdfs. Let's copy folder Sample to Sample_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /Sample /Sample_copied

mv: This command is used to move files within hdfs. Let's cut-paste a file myfile.txt from the Sample folder to Sample_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <src(on hdfs)>
Example:
bin/hdfs dfs -mv /Sample/myfile.txt /Sample_copied

rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /Sample_copied -> It will delete all the content inside the directory then the
directory itself.

du: It will give the size of each file in the directory.


Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /Sample
dus: This command will give the total size of a directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /Sample

stat: It will give the last modified time of directory or path. In short it will give stats of the
directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /Sample

setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for Sample.txt stored in HDFS.
bin/hdfs dfs -setrep -R -w 6 Sample.txt
Example 2: To change the replication factor to 4 for the directory /Sample stored in HDFS.
bin/hdfs dfs -setrep -R 4 /Sample

Relational Operators
1. LOAD
Description: The LOAD operator is used to load data from the file system (such as HDFS or the
local file system) into a Pig relation.
Syntax:
relation_name = LOAD 'path' USING PigStorage(',') AS (field1:type, field2:type, ...);
Example:
data = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
Explanation: This loads data from the file data.txt where fields are separated by commas and
defines the schema with name, age, and city.
2. STORE
Description: The STORE operator is used to store the result of a relation into a specified location
(typically in HDFS).
Syntax:
STORE relation_name INTO 'path' USING PigStorage(',');
Example:
STORE data INTO 'output' USING PigStorage(',');
Explanation: This stores the data in data into the HDFS directory output, with fields separated by
commas.
3. FILTER
Description: The FILTER operator is used to filter out records from a relation based on a
condition.
Syntax:
filtered_data = FILTER relation_name BY condition;
Example:
adults = FILTER data BY age > 18;
Explanation: This filters the data relation to include only records where the age is greater than
18.
4. FOREACH
Description: The FOREACH operator is used to apply transformations or expressions on each
element of a relation, similar to a map operation.
Syntax:
result = FOREACH relation_name GENERATE expression1, expression2, ...;
Example:
names = FOREACH data GENERATE name;
Explanation: This extracts just the name field from the data relation.
5. GROUP
Description: The GROUP operator is used to group data by one or more fields. It is similar to
SQL's GROUP BY clause.
Syntax:
grouped_data = GROUP relation_name BY field_name;
Example:
grouped_by_city = GROUP data BY city;
Explanation: This groups the data relation by the city field.
6. JOIN
Description: The JOIN operator is used to combine two or more relations based on a common
field (key), similar to SQL's JOIN.
Syntax:
joined_data = JOIN relation1 BY key, relation2 BY key;
Example:
joined_data = JOIN data1 BY name, data2 BY name;
Explanation: This joins the data1 and data2 relations on the name field.
7. ORDER BY
Description: The ORDER BY operator is used to sort data by one or more fields in ascending or
descending order.
Syntax:
ordered_data = ORDER relation_name BY field_name [ASC|DESC];
Example:
sorted_data = ORDER data BY age DESC;
Explanation: This orders the data relation by the age field in descending order.
8. LIMIT
Description: The LIMIT operator restricts the number of rows in the result to a specified count.
Syntax:
limited_data = LIMIT relation_name n;
Example:
top5 = LIMIT data 5;
Explanation: This returns the first 5 records from the data relation.
9. DISTINCT
Description: The DISTINCT operator is used to eliminate duplicate records from a relation.
Syntax:
distinct_data = DISTINCT relation_name;
Example:
unique_data = DISTINCT data;
Explanation: This returns a relation where duplicate records from data are removed.
10. CROSS
Description: The CROSS operator is used to compute the cross-product of two or more relations
(similar to SQL’s Cartesian Join).
Syntax:
crossed_data = CROSS relation1, relation2;
Example:
cross_data = CROSS data1, data2;
Explanation: This produces all combinations of records from data1 and data2.
11. CONCAT
Description: CONCAT is strictly a built-in eval function rather than a relational operator, but it is commonly listed alongside them because it is used inside FOREACH ... GENERATE to concatenate two or more strings.
Syntax:
result = FOREACH relation_name GENERATE CONCAT(string1, string2);
Example:
full_name = FOREACH data GENERATE CONCAT(first_name, last_name) AS full_name;
Explanation: This combines the first_name and last_name fields into a single field full_name.
12. UNION
Description: The UNION operator is used to combine two or more relations into a single relation. Duplicates are kept, so it behaves like SQL's UNION ALL rather than SQL's UNION.
Syntax:
united_data = UNION relation1, relation2;
Example:
all_data = UNION data1, data2;
Eval Function

In Apache Pig, Eval Functions (short for Evaluation Functions) are used to perform
computations or transformations on data within a Pig Latin script. They allow you to manipulate
the data in various ways, such as applying mathematical operations, string manipulation, type
conversions, and other data processing tasks.
Types of Eval Functions
Built-in Eval Functions
User-Defined Eval Functions (UDFs)

1. Built-in Eval Functions


These are the functions that come pre-packaged with Pig. They cover common operations like
arithmetic, string manipulation, type conversion, and more.
Common Built-in Eval Functions
Arithmetic Functions:
+, -, *, /, %: Basic arithmetic operators.
ABS(x): Returns the absolute value of x.
ROUND(x): Rounds x to the nearest integer; ROUND_TO(x, n) rounds x to n decimal places.
Example:
result = FOREACH data GENERATE age + 5 AS age_in_5_years;
This adds 5 years to the age field.
String Functions:
CONCAT(str1, str2): Concatenates two strings.
SUBSTRING(str, start, length): Extracts a substring from the string str, starting at position start
and with a length of length.
LOWER(str): Converts a string to lowercase.
UPPER(str): Converts a string to uppercase.
TRIM(str): Removes leading and trailing spaces.
Example:
full_name = FOREACH data GENERATE CONCAT(first_name, ' ', last_name) AS full_name;
Date Functions:
CurrentTime(): Returns the current date and time as a datetime value.
ToDate(date_string): Converts a string into a datetime value.
GetDay(date), GetMonth(date), GetYear(date): Extract parts of a datetime value.
Example:
result = FOREACH data GENERATE CurrentTime() AS now;
Type Conversion (Cast) Operators:
In Pig, type conversion is done with cast operators rather than functions: (int)x, (long)x, (chararray)x, (float)x, (double)x.
Example:
age_in_int = FOREACH data GENERATE (int)age AS age_in_int;
Collection Functions:
SIZE(bag_or_tuple): Returns the number of elements in a bag or tuple.
FLATTEN(bag): Expands a bag (set of values) into individual rows.
Example:
size_of_bag = FOREACH data GENERATE SIZE(bag_field) AS bag_size;
Null Handling:
Pig handles nulls with the is null / is not null operators and the bincond operator (condition ? value1 : value2) rather than dedicated functions.
Example:
non_null_value = FOREACH data GENERATE (field1 is not null ? field1 : 'default') AS value;
2. User-Defined Eval Functions (UDFs)
While Pig comes with a set of built-in functions, sometimes you need to create custom functions
to meet your specific requirements. This is where User-Defined Functions (UDFs) come in.
You can write UDFs in Java, Python, or other supported languages and use them in your Pig
scripts.
Creating and Using UDFs
Steps to Create a UDF in Java:
Write a Java class that extends one of the built-in Pig UDF classes (e.g., EvalFunc).
Implement the exec() method to define the transformation or computation.
Package the Java class into a JAR file.
Register the JAR in the Pig script using REGISTER.
Use the UDF in a Pig Latin script.
Example:
Let's say you want to create a UDF that converts a string to "Title Case" (e.g., "hello world" →
"Hello World").
Step 1: Write the Java UDF
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TitleCase extends EvalFunc<String> {

    // Called once per input tuple; converts the first field to Title Case.
    public String exec(Tuple input) {
        try {
            String str = (String) input.get(0);
            if (str == null) return null;
            String[] words = str.split(" ");
            StringBuilder result = new StringBuilder();
            for (String word : words) {
                if (word.length() > 0) {
                    result.append(word.substring(0, 1).toUpperCase())
                          .append(word.substring(1).toLowerCase())
                          .append(" ");
                }
            }
            return result.toString().trim();
        } catch (Exception e) {
            return null;
        }
    }
}
Step 2: Compile and Package into a JAR File
Compile the Java class and package it into a JAR file (e.g., my_udfs.jar).
Step 3: Register the JAR in Pig
REGISTER 'my_udfs.jar';
Step 4: Use the UDF in a Pig Script
result = FOREACH data GENERATE TitleCase(name);
This will apply the TitleCase UDF to each record in the data relation. The UDF is invoked by its Java class name; if the class were in a package, you would use its fully qualified name or give it a shorter alias with DEFINE.
Using UDFs in Python (Python UDFs)
Pig also supports Python UDFs, which can be written using the Pig Python API. Here’s an
example of creating a simple UDF in Python that computes the length of a string:
Step 1: Write the Python UDF
@outputSchema("length:int")
def string_length(input):
    return len(input)
Step 2: Register the Python UDF in Pig
REGISTER 'string_length.py' USING jython AS myfuncs;
Step 3: Use the UDF in a Pig Script
result = FOREACH data GENERATE myfuncs.string_length(name);
The @outputSchema decorator tells Pig the name and type of the returned value, and myfuncs is the namespace under which the script's functions become available.

Complex Data Types


Complex data types are built from the simple types and can be nested inside one another. Pig has three complex types:

Data Type   Definition                                          Syntax                  Example
Tuple       An ordered set of fields, written in parentheses.   (field [, field ...])   (1,2)
Bag         A collection of tuples, written in curly braces.    {tuple [, tuple ...]}   {(1,2), (3,4)}
Map         A set of key-value pairs, written in square         [key#value]             ['keyname'#'valuename']
            brackets with # separating key and value.

Key – The key is used to look up a value; it must be unique and of type chararray.
Value – Any data can be stored as a value, and each key has its own associated value. A map is written with square brackets, with # separating each key from its value and commas separating multiple key-value pairs.
Null values – A null means the value is missing or unknown, and any type can be null. Pig handles nulls much as SQL does: it produces nulls when data is missing or an error occurs during processing, and null can also be used explicitly as a constant value.

Piggy Bank
In big data, "piggybank" refers to a collection of user-defined functions (UDFs) for the Apache
Pig data processing framework. These UDFs extend Pig's capabilities, allowing users to perform
custom data manipulation and analysis. Piggybank is not a built-in part of Pig but is a repository
of user-contributed functions, distributed as part of the Pig distribution.
What it is:
UDF Repository:
Piggybank is essentially a library of user-defined functions that extend the functionality of the
Apache Pig data processing framework.
Extending Pig:
These UDFs allow users to write custom code that can be executed within Pig scripts, enabling
them to perform specific data transformations and analysis that might not be possible with
Pig's built-in functions.
User-Contributed:
The functions within Piggybank are contributed by the Pig user community, meaning they can
be shared and reused by others.
Not Built-in:
Piggybank functions are not part of the core Pig distribution but are separate libraries that need
to be registered and used within Pig scripts.
How it's used:
1. Registering UDFs:
Before using a piggybank function, it needs to be registered within the Pig environment.
2. Invoking UDFs:
Once registered, UDFs can be called within Pig scripts like any other function.
3. Benefits:
Custom Functionality: Allows users to perform specific data manipulation and analysis not
covered by Pig's built-in functions.
Code Reuse: Users can contribute their own UDFs and also access UDFs written by others,
reducing development time and effort.
Performance Enhancement: Properly written UDFs can potentially improve the performance
of Pig scripts by optimizing specific operations.
Examples:
Mathematical functions like sum, average, count, ascending, and descending order can be created
as UDFs and stored in piggybank libraries.
UDFs can be used for custom data cleaning, transformation, and analysis tasks.
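A minimal usage sketch, assuming the piggybank.jar path and the Reverse string function's package (org.apache.pig.piggybank.evaluation.string.Reverse) match your Pig installation:
pig
REGISTER '/path/to/piggybank.jar';
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();
data = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int);
reversed_names = FOREACH data GENERATE Reverse(name) AS reversed_name;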
In essence, piggybank acts as a collaborative repository of user-defined functions that extends
the capabilities of Pig, enabling users to perform a wider range of data processing tasks within
the Pig framework.

Pig Vs Hive

Language: Pig uses Pig Latin, a procedural data flow language; Hive uses HiveQL (HQL), a declarative SQL-like query language.
Typical users: Pig is mainly used by programmers and researchers for ETL and data-pipeline development; Hive is mainly used by data analysts for reporting and ad-hoc queries.
Data: Pig handles structured, semi-structured, and unstructured data, and schemas are optional; Hive works best with structured data organised into tables whose schema is kept in the metastore.
Architecture: Pig runs entirely on the client side with no server or metadata store of its own; Hive maintains a metastore and can expose a server (HiveServer2) to JDBC/ODBC clients.
Common ground: both compile the user's statements into MapReduce jobs that run on Hadoop and read and write data in HDFS.