Sri Raghavendra Educational Institutions Society (R)
Sri Krishna Institute of Technology
(Approved by AICTE, Accredited by NAAC, Affiliated to VTU, Karnataka)
             Title: Big Data And Analytics
             Sub Code: 18CS72
             Presented by: KAVYA M
             Department: Computer Science & Engineering
                                                                        /skit.org.in
                    Module-4
        MapReduce, Hive and Pig
Dept. of ISE
                                                                               Big Data & Analytics/18CS72 /skit.org.in
                                             4.1.1 INTRODUCTION
                                               key terms
      The MapReduce programming model refers to a programming paradigm for processing Big Data sets in a
      parallel and distributed environment using map and reduce tasks.
      YARN refers to the provisioning for running and scheduling parallel programs for map and reduce tasks, and for
      allocating parallel processing resources to the sub-tasks of a user application that run in parallel on Hadoop.
      Script refers to a small program (up to a few thousand lines of code) used for purposes such as query
      processing or text processing, or to a short piece of code written in a dynamic high-level general-purpose
      language, such as Python or Perl.
        SQL-like scripting language:
        means a language for writing scripts that process queries in a way similar to SQL. SQL lets us:
        (i) write structured queries for processing in a DBMS,
        (ii) create and modify schemas, and control data access,
        (iii) create clients for sending query scripts, and create and manage server databases, and
        (iv) view, query and change (update, insert, append or delete) databases.
          4.2.1 MAPREDUCE MAP TASKS, REDUCE TASKS AND
                      MAPREDUCE EXECUTION
        • Big data processing employs the MapReduce programming model.
        • A Job means a MapReduce program.
        • Each job consists of several smaller units, called MapReduce tasks.
        • A software execution framework in MapReduce programming defines the parallel tasks.
                MapReduce process on client submitting a job
       • A user application specifies the locations of the input/output data and supplies the map and reduce
          functions.
       • The Hadoop job client then submits the job (jar/executable etc.) and its configuration to the
          JobTracker. The JobTracker takes responsibility for distributing the software/configuration to the slaves,
          scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.
       • The master (JobTracker) is responsible for scheduling the component tasks of a job onto the
          slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by
          the master.
                                                4.2.1 Map-Tasks
        • Map task means a task that implements a map(), which runs the user application
            code for each key-value pair (k1, v1).
        • The output of map() is zero (when no values are found) or more intermediate key-value
            pairs (k2, v2). The value v2 is the information that is later used at the reducer for the transformation
            operation using aggregation or other reducing functions.
        • Reduce task refers to a task that takes the output v2 from the map as an input and combines
            those data pieces into a smaller set of data, optionally with the help of a combiner.
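
The map and reduce tasks described above can be sketched in plain Python. This is only an illustration of the data flow, not Hadoop code: the names map_fn, reduce_fn and run_job, and the word-count example itself, are assumptions made for this sketch.

```python
from collections import defaultdict


def map_fn(k1, v1):
    """Emit one intermediate (k2, v2) pair per word; an empty line emits nothing."""
    for word in v1.split():
        yield (word.lower(), 1)


def reduce_fn(k2, values):
    """Combine all v2 values of one key into a single, smaller result."""
    return (k2, sum(values))


def run_job(records):
    # Map phase: apply map_fn to every input pair (k1, v1)
    # and collect the intermediate values per key.
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    # Reduce phase: one reduce_fn call per distinct intermediate key.
    return dict(reduce_fn(k2, vs) for k2, vs in sorted(grouped.items()))


# Input pairs (k1, v1): here k1 is a line number and v1 the line text.
records = [(0, "big data big analytics"), (1, "")]
counts = run_job(records)
```

The second record is an empty line, so map_fn emits zero pairs for it, which matches the "zero or more" output of map().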
               Logical View of map() Functioning
                                         Hadoop Mapper Class
       The Hadoop Java API includes the Mapper class.
       The Mapper class declares a map() function.
       Any specific Mapper implementation should be a subclass of this class and override map().
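
The subclass-and-override pattern can be mimicked in Python as a rough analogue. The classes below are hypothetical stand-ins, not the Hadoop Java API. As a choice made for this sketch, the base class passes each pair through unchanged.

```python
class Mapper:
    """Hypothetical Python stand-in for a Mapper base class."""

    def map(self, key, value):
        # Default behaviour chosen for this sketch: emit the input pair unchanged.
        yield (key, value)


class TokenCountMapper(Mapper):
    """A specific Mapper: overrides map() to emit the number of tokens per line."""

    def map(self, key, value):
        yield (key, len(value.split()))


def run_mapper(mapper, records):
    """Call mapper.map() once per input pair and collect everything it emits."""
    return [pair for key, value in records for pair in mapper.map(key, value)]
```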
                                4.2.2 Key-Value Pair
      • MapReduce uses only key-value pairs as input and output.
      • Hence, the available data must first be converted into key-value pairs before it is passed to the Mapper,
         as the Mapper only understands key-value pairs of data.
      • Key-value pairs in Hadoop MapReduce are generated as follows:
      • InputSplit - Defines a logical representation of the data and presents one split of the data for processing at an
         individual map().
      • As users we do not deal with InputSplit in Hadoop directly, because InputFormat creates it (InputFormat is
         responsible for creating the InputSplit and dividing it into records). By default, FileInputFormat breaks a
         file into splits of one HDFS block each (128 MB by default in Hadoop 2.x).
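
A minimal sketch of how a file might be cut into splits and then into key-value records. It assumes line-oriented text input where the key is the byte offset of a line and the value is the line itself (the convention of Hadoop's text input format). The function names split_ranges and read_records are hypothetical.

```python
def split_ranges(file_size, split_size):
    """Return (start, length) for each logical split of a file of file_size bytes."""
    return [
        (start, min(split_size, file_size - start))
        for start in range(0, file_size, split_size)
    ]


def read_records(data):
    """Turn raw bytes into (byte_offset, line_text) key-value pairs."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield (offset, line.rstrip(b"\r\n").decode("utf-8"))
        offset += len(line)


# A 300 MB file with 128 MB splits gives three splits, the last one shorter.
MB = 1024 * 1024
ranges = split_ranges(300 * MB, 128 * MB)
pairs = list(read_records(b"ab\ncde\n"))
```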
               Steps Involved in MapReduce key-value pairing
 Map and reduce functions use key-value pairs at four places:
 1. map() input,
 2. map() output,
 3. reduce() input and
 4. reduce() output.
                               4.2.3 Grouping by Key
      • The Mapper outputs are grouped by key, and the values v2 for each key are appended to a list of values.
      • A "Group By" operation on the intermediate keys creates this list of v2 values.
      • Shuffle and Sorting Phase
      • Shuffling in MapReduce
      • The process of transferring data from the mappers to the reducers is known as shuffling, i.e. the
          process by which the system sorts the map output and transfers it to the reducer as input. The
          MapReduce shuffle phase is therefore necessary for the reducers; without it, they would not have any
          input (or input from every mapper).
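
Grouping by key after the shuffle can be sketched as follows. The sketch assumes that the outputs of all mappers fit in memory, which a real cluster does not assume. The function name shuffle_and_sort is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(mapper_outputs):
    """Merge the (k2, v2) pairs of all mappers, sort by key, and group the values."""
    all_pairs = [pair for output in mapper_outputs for pair in output]
    all_pairs.sort(key=itemgetter(0))  # stable sort on the key only
    return [
        (key, [value for _, value in group])
        for key, group in groupby(all_pairs, key=itemgetter(0))
    ]


# Two mappers emitted pairs independently; the reducer input is (k2, list(v2)).
reducer_input = shuffle_and_sort([[("b", 1), ("a", 1)], [("a", 2)]])
```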
                                     4.2.4 Partitioning
     • A partitioner partitions the key-value pairs of the intermediate Map outputs.
     • It partitions the data using a user-defined condition, which works like a hash function.
     • The total number of partitions is the same as the number of Reducer tasks for the job. Let
        us take an example to understand how the partitioner works.
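
A small sketch of hash partitioning: a non-negative hash of the key, taken modulo the number of reducers, selects the partition. The hash used here is a Java-style string hash for ASCII keys, chosen only so that results are deterministic. Both function names are hypothetical, and a user-defined partitioner could use any other condition.

```python
def java_style_hash(text):
    """32-bit hash computed as h = 31 * h + code for each character (ASCII keys assumed)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def partition(key, num_reducers):
    """Map a key to one of num_reducers partitions; equal keys always share a partition."""
    return (java_style_hash(key) & 0x7FFFFFFF) % num_reducers


# With 4 reducers, every pair whose key is "a" goes to the same reducer.
assignments = {key: partition(key, 4) for key in ["a", "b", "ab"]}
```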
                                             4.2.5 Combiners
     • A Combiner, also known as a semi-reducer, is an optional class that operates by accepting
       the inputs from the Map class and thereafter passing the output key-value pairs to the
       Reducer class.
     • The main function of a Combiner is to summarize the map output records with the same key.
       The output (key-value collection) of the combiner will be sent over the network to the actual
       Reducer task as input.
     • The Combiner class is used in between the Map class and the Reduce class to reduce the
       volume of data transfer between Map and Reduce. Usually, the output of the map task is
       large, so the volume of data transferred to the reduce task is high.
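
The effect of a combiner on the volume of data can be illustrated with a word-count sketch. The functions map_words and combine are hypothetical names. The combiner here sums the counts of one mapper locally before anything is sent to the reducer.

```python
from collections import Counter


def map_words(line):
    """Map output of one mapper: a (word, 1) pair for every word."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Summarize the map output records that share a key on the mapper side."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


map_output = map_words("to be or not to be")
combined = combine(map_output)
# Six records leave the mapper without a combiner, four with it.
records_saved = len(map_output) - len(combined)
```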
                                                4.2.6 Reduce Tasks
        • The Java API of Hadoop includes the Reducer class. The Reducer class declares a reduce()
            function. Any specific Reducer implementation should be a subclass of this class and override
            reduce().
        • A Reduce task implements reduce(), which takes the Mapper output (after shuffling and sorting)
            grouped by key as (k2, list(v2)), and is applied in parallel to each group.
        • Intermediate pairs arrive at the input of each Reducer in order, after sorting by key.
        • The reduce function iterates over the list of values associated with a key and produces outputs
            such as aggregations and statistics.
        • The reduce function emits zero or more key-value pairs (k3, v3) to the final
            output file. Reduce: (k2, list(v2)) -> list(k3, v3)
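
A reduce function of the form (k2, list(v2)) -> list(k3, v3) that produces simple statistics could look like the sketch below. The function name reduce_stats and the choice of statistics are assumptions for illustration.

```python
def reduce_stats(k2, values):
    """Iterate over the values of one key and emit zero or one (k3, v3) pair."""
    values = list(values)
    if not values:
        return []  # zero output pairs when there is nothing to aggregate
    stats = {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }
    return [(k2, stats)]


# Reducer input after shuffle and sort: one key with its list of values.
output = reduce_stats("temperature", [2, 4, 6])
```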
            4.2.7 Details of MapReduce Processing Steps
                        4.2.8 Coping with Node Failures
    Hadoop achieves fault tolerance by restarting tasks.
    Each task node (TaskTracker) regularly communicates with the master node, the JobTracker.
    If a TaskTracker fails to communicate with the JobTracker for a pre-defined period (by default, 10 minutes),
    the JobTracker assumes node failure.
    The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.
     • If a TaskTracker has already completed nine out of ten reduce tasks assigned to it, only the tenth task must execute at a
         different node.
     • Map tasks are slightly more complicated. A node may have completed ten map tasks, but the Reducers may not have
         copied all their inputs from the output of those map tasks. If that node now fails, its Mapper outputs are inaccessible.
         Thus, any completed map tasks must also be re-executed to make their results available to the remaining reducing nodes.
         Hadoop handles all of this automatically.
        The following points summarize the coping mechanisms for distinct node failures:
        • (i) Map TaskTracker failure:
           Map tasks that are completed or in progress at the TaskTracker are reset to idle on failure.
           Reduce TaskTrackers get a notice when a task is rescheduled on another TaskTracker.
        • (ii) Reduce TaskTracker failure:
           Only in-progress tasks are reset to idle.
        • (iii) Master JobTracker failure:
           The MapReduce job aborts and the client is notified (in the case of one master node).
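
The TaskTracker failure rules can be written down as a small decision function. This is a simplified model for illustration, not Hadoop's scheduler code: the task dictionaries, the state names and both function names are assumptions.

```python
HEARTBEAT_TIMEOUT_S = 10 * 60  # default period stated above: 10 minutes


def tracker_failed(last_heartbeat_s, now_s, timeout_s=HEARTBEAT_TIMEOUT_S):
    """A TaskTracker is assumed failed once it has been silent longer than the timeout."""
    return now_s - last_heartbeat_s > timeout_s


def tasks_to_reset(tasks, failed_tracker):
    """Return the ids of tasks that must be reset to idle and rescheduled elsewhere."""
    reset = []
    for task in tasks:
        if task["tracker"] != failed_tracker:
            continue
        if task["kind"] == "map" and task["state"] in ("in-progress", "completed"):
            # Map output lives on the failed node, so even completed maps are redone.
            reset.append(task["id"])
        elif task["kind"] == "reduce" and task["state"] == "in-progress":
            # Completed reduce tasks are kept; only in-progress ones are redone.
            reset.append(task["id"])
    return reset


tasks = [
    {"id": "m1", "kind": "map", "state": "completed", "tracker": "tt1"},
    {"id": "m2", "kind": "map", "state": "in-progress", "tracker": "tt1"},
    {"id": "r1", "kind": "reduce", "state": "completed", "tracker": "tt1"},
    {"id": "r2", "kind": "reduce", "state": "in-progress", "tracker": "tt1"},
    {"id": "m3", "kind": "map", "state": "completed", "tracker": "tt2"},
]
rescheduled = tasks_to_reset(tasks, "tt1")
```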
                  4.3.2 Matrix-Vector Multiplication by MapReduce
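
One common way to express matrix-vector multiplication with map and reduce is sketched below. It assumes the matrix is stored as (i, j, m_ij) triples and that the whole vector v is available to every map call. The function names are hypothetical.

```python
from collections import defaultdict


def map_entry(entry, v):
    """For matrix element m_ij emit (i, m_ij * v_j): key is the row index."""
    i, j, m_ij = entry
    yield (i, m_ij * v[j])


def reduce_row(i, products):
    """Sum all partial products of row i to get component x_i of the result."""
    return (i, sum(products))


def matrix_vector_multiply(entries, v):
    grouped = defaultdict(list)
    for entry in entries:
        for i, product in map_entry(entry, v):
            grouped[i].append(product)
    return dict(reduce_row(i, ps) for i, ps in sorted(grouped.items()))


# M = [[1, 2], [3, 4]] as triples, v = [5, 6]
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
x = matrix_vector_multiply(entries, [5, 6])
```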
                     4.3.3 Relational-Algebra Operations
        Relational-algebra operations on large datasets using MapReduce:
        1 Selection
        2 Projection
        3 Union
        4 Intersection and Difference
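
The listed operations can each be written as a map function plus a reduce function. The sketch below assumes relations are small in-memory lists of tuples without duplicates. The helper run and all function names are hypothetical.

```python
from collections import defaultdict


def run(records, map_fn, reduce_fn):
    """Tiny in-memory driver: map, group by key, then reduce in key order."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reduce_fn(key, groups[key]))
    return out


def selection(rel, cond):
    # Map keeps only tuples that satisfy the condition; reduce is the identity.
    return run(rel, lambda t: [(t, t)] if cond(t) else [], lambda k, vs: [k])


def projection(rel, columns):
    # Map keeps the chosen columns; reduce removes duplicate projected tuples.
    return run(rel, lambda t: [(tuple(t[c] for c in columns), None)], lambda k, vs: [k])


def union(r, s):
    # Reduce emits each tuple once, whether it came from R, S or both.
    return run(r + s, lambda t: [(t, None)], lambda k, vs: [k])


def _tagged(r, s):
    return [("R", t) for t in r] + [("S", t) for t in s]


def intersection(r, s):
    # Map tags each tuple with its relation; reduce keeps tuples seen in both.
    return run(_tagged(r, s), lambda rec: [(rec[1], rec[0])],
               lambda k, vs: [k] if {"R", "S"} <= set(vs) else [])


def difference(r, s):
    # Reduce keeps tuples that appear in R only.
    return run(_tagged(r, s), lambda rec: [(rec[1], rec[0])],
               lambda k, vs: [k] if set(vs) == {"R"} else [])


R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "b"), (4, "d")]
```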
                                                               4.4.1 HIVE
     • Hive was created by Facebook.
     • Hive is a data warehousing tool and also a data store on top of Hadoop.
     • Enterprises use a data warehouse as a large data repository designed to enable searching,
        managing and analyzing the data.
     • Hive processes structured data and integrates data from heterogeneous sources.
     • Additionally, it manages large volumes of data.
                                        HIVE Features
                               Hive Characteristics
        1. Has the capability to translate queries into MapReduce jobs. This makes Hive scalable, able
               to handle data warehouse applications and, therefore, suitable for the analysis of static data of
               extremely large size.
        2. Supports web interfaces as well. Application APIs as well as web-browser clients can access
               the Hive DB server.
        3. Provides an SQL dialect (Hive Query Language, abbreviated HiveQL or HQL).
                                    HIVE Limitations
       1. Not a full database. The main disadvantage is that Hive does not provide update, alter and deletion of
          records in the database (row-level UPDATE and DELETE are available only on transactional tables, from Hive 0.14 onward).
       2. Not developed for unstructured data.
       3. Not designed for real-time queries.
       4. Always performs the partition from the last column.
                             4.4.1 Hive Architecture
   Hive architecture components are:
   • Hive Server - An optional service that allows a remote client to submit requests to Hive and retrieve
   results. Requests can use a variety of programming languages. Hive Server exposes a very simple client API
   to execute HiveQL statements.
   • Hive CLI (Command Line Interface) - A popular interface for interacting with Hive. When the CLI runs Hive in
   local mode, Hive uses local storage instead of the HDFS of a Hadoop cluster.
   • Web Interface - Hive can be accessed using a web browser as well. This requires an HWI Server running
   on some designated node. The URL http://hadoop:<port no.>/hwi can be used to access
   Hive through the web.
     • Metastore - It is the system catalog. All other components of Hive interact with the Metastore. It stores the
        schema or metadata of tables, databases, columns in a table, their data types and HDFS mapping.
     • Hive Driver - It manages the life cycle of a HiveQL statement during compilation, optimization and
     execution.
                         4.4.2 Hive Installation
        Hive can be installed on Windows 10 or Ubuntu 16.04, and can use MySQL for the metastore. It requires three software
        packages:
        • Java Development Kit, for the Java compiler (javac) and interpreter
        • Hadoop
        • A version of Hive compatible with the installed Java: Hive 1.2 onward supports Java 1.7 or newer.
  Steps for installation of Hive in a Linux-based OS are as follows:
   1. Install javac and Java from the Oracle Java download site. Download JDK 7 or a later version from
   http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html and
   extract the compressed file.
   To make Java available to all users, move it to the location "/usr/local/" using the required commands.
   2. Set the path with the following commands for jdk1.7.0_71:
    export JAVA_HOME=/usr/local/jdk1.7.0_71
    export PATH=$PATH:$JAVA_HOME/bin
   (Alternatively, use alternatives --install /usr/bin/java java /usr/local/java/bin/java 2)
   3. Install Hadoop from http://apache.claz.org/hadoop/common/hadoop-2.4.1/
   4. Make HADOOP, MAPRED, COMMON, HDFS and all related files shared, configure Hadoop and set properties
      such as the replication parameter.
   5. Set the property yarn.nodemanager.aux-services to the value
      mapreduce_shuffle. Set the namenode and datanode paths.
   6. Download Hive from http://apache.petsads.us/hive/hive-0.14.0/. Extract the archive and use the ls command
      to verify the files:
      $ tar zxvf apache-hive-0.14.0-bin.tar.gz
      $ ls
   7. Use an external database server. Configure the metastore for the server.
            4.4.3 Comparison with RDBMS (Traditional Database)
         4.4.4 Hive Data Types and File Formats
         Hive has three Collection data types
          HIVE file formats and their descriptions
                             4.4.5 Hive Data Model
               4.4.6 Hive Integration and Workflow Steps
     Hive integrates with MapReduce and HDFS. The figure below shows the dataflow sequences and workflow steps
     between Hive and Hadoop.
                     4.4.7 Hive Built-in Functions
                                                            4.5 HIVEQL
     Hive Query Language (abbreviated HiveQL) is for querying the large datasets that reside in the HDFS
     environment.
     HiveQL script commands enable data definition, data manipulation and query processing.
     HiveQL supports the large base of users who already use SQL to extract information from data warehouses.
               4.5.1 HiveQL Data Definition Language (DDL)
          • HiveQL database commands for data definition of DBs and tables are
          • CREATE DATABASE,
          • SHOW DATABASES (list of all DBs),
          • CREATE SCHEMA,
          • CREATE TABLE.
                                 Following are HiveQL commands which create a table:
                                                  Creating a Database
                                               Showing Database
                                 Dropping a Database
     4.5.2 HiveQL Data Manipulation Language (DML)
      HiveQL commands for data manipulation are
      USE <database name>,
      DROP DATABASE,
      DROP SCHEMA,
      ALTER TABLE,
      DROP TABLE, and
      LOAD DATA.
                                      Loading Data into HIVE DB
          4.5.3 HiveQL For Querying the Data
     • Any data analysis application needs the data to be partitioned and stored.
     • A data warehouse should have a large number of partitions in which the tables, files and databases are stored.
     • Querying then requires sorting, aggregating and joining functions.
                                                4.5.3.1 Partitioning
     • Table partitioning refers to dividing the table data into parts based on the values of a particular
        set of columns.
     • Hive organizes tables into partitions.
     • Partitioning makes querying easy and fast.
     • This is because a SELECT then reads only the matching partitions instead of the whole table.
     • The following example explains the concepts of partitioning, columnar formats and file record formats.
                                   Table Partitioning
                             Renaming the Partition
                  Add a partition to the Existing Table
                                      Drop a partition
          Partitioning: Query Fast Processing
   The following example shows how partitioning a table speeds up query processing.
   A query processes faster when it can use a partition.
   Selecting the products of a specific category from a table takes less time during query processing when
   the table is partitioned by category.
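
Why a partitioned table answers such a query faster can be seen by counting the rows each approach has to read. The sketch below models partitions as a Python dictionary keyed by category. The product rows and the function names are made up for illustration and do not reflect how Hive stores data internally.

```python
from collections import defaultdict


def partition_rows(rows, column):
    """Group rows by the value of the partition column (one group per partition)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def query_full_scan(rows, column, value):
    """Unpartitioned table: every row is read to find the matches."""
    matches = [row for row in rows if row[column] == value]
    return matches, len(rows)


def query_partitioned(partitions, value):
    """Partitioned table: only the rows of the matching partition are read."""
    scanned = partitions.get(value, [])
    return list(scanned), len(scanned)


products = [
    {"id": 1, "category": "toys"},
    {"id": 2, "category": "books"},
    {"id": 3, "category": "toys"},
    {"id": 4, "category": "games"},
]
by_category = partition_rows(products, "category")
full_result, full_scanned = query_full_scan(products, "category", "toys")
part_result, part_scanned = query_partitioned(by_category, "toys")
```

Both queries return the same rows, but the partitioned query reads two rows instead of four.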
  Advantages of Partition
  1. Distributes the execution load horizontally.
  2. Query response time becomes faster when processing a small part of the data instead of
     searching the entire dataset.
  Limitations of Partition
  1. Creating a large number of partitions in a table leads to a large number of files and directories in
     HDFS, which is an overhead for the NameNode, since it must keep all metadata for the file system in memory.
  2. Partitions may optimize some queries based on WHERE clauses, but they may be less responsive
     for other important queries based on grouping clauses.
                                   4.5.3.2 Bucketing
     • A partition itself may have a large number of columns when tables are very large.
     • Tables or partitions can be sub-divided into buckets.
     • Division is based on the hash of a column in the table.
     • CLUSTERED BY clause divides a table into buckets. A coding example on Buckets is given below:
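     A minimal Python sketch (not HiveQL) of the hash-based division described above. The bucket count, the
     sample rows and the function bucket_for are assumptions for illustration, and zlib.crc32 stands in for
     whatever hash function Hive applies internally.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count, chosen only for illustration


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a column value to a bucket: stable hash modulo the bucket count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


# Hypothetical rows: (customer_id, name)
rows = [(101, "Asha"), (102, "Ravi"), (103, "Meena"), (104, "John"), (105, "Lata")]

# Every row lands in exactly one bucket, decided by the hash of customer_id.
buckets = {i: [] for i in range(NUM_BUCKETS)}
for row in rows:
    buckets[bucket_for(row[0])].append(row)

for bucket_id, members in buckets.items():
    print(bucket_id, members)
```

     Because the same key always hashes to the same bucket, a lookup on the bucketing column needs to read
     only one bucket.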
                                             4.5.3.3 Views
 Views provide ease of programming.
 Reusable Views simplify complex queries.
 A View provides the following:
 •     Saves the query and reduces the query complexity
 •     Can be used like a table, but does not store data like a table
 •     Hides the complexity by dividing the query into smaller, more manageable pieces
 •     Hive executes the View, and the planner then combines the information in the View definition
 with the remaining actions of the query
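 The behaviour listed above can be sketched with SQLite through Python's sqlite3 module. This is a stand-in
 and not Hive itself; the table toys and the view outdoor_toys are hypothetical names. The view saves only the
 query, is queried like a table, and reflects rows added after it was created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toys (name TEXT, category TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO toys VALUES (?, ?, ?)",
    [("kite", "outdoor", 120), ("puzzle", "indoor", 300), ("ball", "outdoor", 80)],
)

# The view saves the query text only; no rows are copied.
conn.execute(
    "CREATE VIEW outdoor_toys AS "
    "SELECT name, price FROM toys WHERE category = 'outdoor'"
)

# Use the view like a table and add further conditions on top of it.
query = "SELECT name FROM outdoor_toys WHERE price < 100 ORDER BY name"
cheap_outdoor = conn.execute(query).fetchall()

# A row inserted later is visible through the view, because the view
# is evaluated again each time it is queried.
conn.execute("INSERT INTO toys VALUES ('frisbee', 'outdoor', 60)")
cheap_outdoor_after = conn.execute(query).fetchall()

print(cheap_outdoor, cheap_outdoor_after)
```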
     Hive supports the following built-in aggregation functions. The usage of these functions is the same as that
     of the SQL aggregate functions.
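     As a small illustration of the SQL-style usage, the sketch below runs COUNT, SUM, AVG, MIN and MAX in
     SQLite through Python's sqlite3 module. Hive is not used here, the table sales is hypothetical, and the
     five functions shown are assumed to be a representative subset rather than a complete list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("pen", 10), ("book", 50), ("bag", 30)],
)

# Each aggregate collapses the whole column into a single value.
row_count, total, average, smallest, largest = conn.execute(
    "SELECT COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount) FROM sales"
).fetchone()

print(row_count, total, average, smallest, largest)
```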
                                                                4.5.5 Join
• A JOIN clause combines columns of two or more tables, based on a relation between them.
• A HiveQL JOIN is largely similar to an SQL JOIN.
• Example:
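A minimal sketch of an inner join, run in SQLite through Python's sqlite3 module rather than in Hive. The
tables customers and orders and their columns are hypothetical. Rows are combined where customers.id equals
orders.customer_id, so a customer without orders does not appear in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi"), (3, "Meena")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 500), (11, 1, 250), (12, 3, 900)],
)

# Combine columns of both tables based on the relation id = customer_id.
joined = conn.execute(
    "SELECT c.name, o.order_id, o.amount "
    "FROM customers c JOIN orders o ON c.id = o.customer_id "
    "ORDER BY o.order_id"
).fetchall()

for name, order_id, amount in joined:
    print(name, order_id, amount)
```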
                              4.5.6 Group By Clause
                                              4.6.1 PIG
    Apache developed Pig, which:
    •         Is an abstraction over MapReduce
    •         Is an execution framework for parallel processing
    •         Reduces the complexities of writing a MapReduce program
    •         Is a high-level dataflow language. Dataflow language means that a Pig operation node takes the
    inputs and generates the output for the next node
    •         Is mostly used in an HDFS environment
    •         Performs data manipulation operations on files at data nodes in Hadoop.
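    The dataflow idea, where each operation node consumes its input and produces the output for the next node,
    can be sketched in plain Python. This is not Pig Latin; the functions load, filter_rows, group_by and
    count_per_group are hypothetical stand-ins that loosely mirror Pig's LOAD, FILTER, GROUP and FOREACH steps,
    and the sample log lines are invented for illustration.

```python
from collections import defaultdict


def load(lines):
    """Node 1: parse raw 'user,url' lines into records."""
    for line in lines:
        user, url = line.split(",")
        yield {"user": user, "url": url}


def filter_rows(records, predicate):
    """Node 2: pass on only the records that satisfy the predicate."""
    return (record for record in records if predicate(record))


def group_by(records, key):
    """Node 3: collect records that share the same key value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return groups


def count_per_group(groups):
    """Node 4: reduce each group to its size."""
    return {key: len(members) for key, members in groups.items()}


raw = ["asha,/home", "guest,/home", "ravi,/cart", "asha,/cart"]

# Each node takes the output of the previous node as its input.
records = load(raw)
known_users = filter_rows(records, lambda record: record["user"] != "guest")
grouped = group_by(known_users, "user")
visits = count_per_group(grouped)

print(visits)
```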
                              Applications of Apache Pig
 Applications of Pig are:
 •      Analyzing large datasets
 •      Executing tasks involving ad hoc processing
 •      Processing large data sources such as web logs and streaming online data
 •      Data processing for search platforms. Pig processes different types of data
 •      Processing time-sensitive data loads, where data is extracted and analyzed quickly. For example, analysis of
 data from Twitter to find patterns for user behavior and recommendations.
                      Differences between Pig and MapReduce
               Differences between Pig and SQL
               Differences between Pig and Hive
                                      Pig Architecture
  The three ways to execute scripts are:
  1.      Grunt Shell: An interactive shell of Pig that executes the scripts.
  2.      Script File: Pig commands written in a script file that execute at the Pig Server.
  3.      Embedded Script: Create UDFs for functions unavailable as Pig built-in operators. UDFs can be written in
  other programming languages and embedded in a Pig Latin script file.
                                           Installing Pig
      Discussion