0% found this document useful (0 votes)

459 views60 pages

TalendOpenStudio BigData GettingStarted 5.4.1 en

This document provides an introduction to Talend Open Studio for Big Data and its capabilities for working with big data solutions. It discusses how Talend Studio leverages the Apache Hadoop platform to enable users to access, transform, move, and synchronize big data through a graphical interface without needing to write code. The document also provides an overview of the functional architecture of Talend Big Data solutions, identifying the key functions, interactions, and IT needs for working with big data in Talend Studio.

Uploaded by

Roshin P.S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

459 views60 pages

TalendOpenStudio BigData GettingStarted 5.4.1 en

Uploaded by

Roshin P.S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 60

Talend Open Studio for Big Data

Getting Started Guide

5.4.1

Talend Open Studio for Big Data

Adapted for v5.4.1. Supersedes previous Getting Started Guide releases. Publication date: December 12, 2013

Copyleft
This documentation is provided under the terms of the Creative Commons Public License (CCPL). For more information about what you can and cannot do with this documentation in accordance with the CCPL, please read: http://creativecommons.org/licenses/by-nc-sa/2.0/

Notices
All brands, product names, company names, trademarks and service marks are the properties of their respective owners.

Table of Contents
Preface ................................................. v
1. General information . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2. Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3. Typographical conventions . . . . . . . . . . . 2. Feedback and Support . . . . . . . . . . . . . . . . . . . . . . . . v v v v v

Chapter 1. Introduction to Talend Big Data solutions ....................................... 1

1.1. Hadoop and Talend studio . . . . . . . . . . . . . . . . . 2 1.2. Functional architecture of Talend Big Data solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Chapter 2. Getting started with Talend Big Data using the demo project .............. 5
2.1. Introduction to the Big Data demo project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1. Hortonworks_Sandbox_Samples . . . . . . . . . . 2.1.2. NoSQL_Examples . . . . . . . . . . . . . . . . . . 2.2. Setting up the environment for the demo Jobs to work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1. Installing Hortonworks Sandbox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2. Understanding context variables used in the demo project . . . . . . . . . 6 6 8 8 8 9

Chapter 3. Handling Jobs in Talend Studio ................................................. 13

3.1. How to run a Job via Oozie . . . . . . . . . . . . . . . 3.1.1. How to set HDFS connection details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.2. How to run a Job on the HDFS server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.3. How to schedule the executions of a Job . . . . . . . . . . . . . . . . . . . . . . 3.1.4. How to monitor Job execution status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1. tPigMap interface . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2. tPigMap operation . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1. Configuring join operations . . . . . . . 4.2.2. Catching rejected records . . . . . . . . . . 4.2.3. Editing expressions . . . . . . . . . . . . . . . . 4.2.4. Setting up a Pig User-Defined Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1. Gathering Web traffic information using Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1.1. Prerequisites . . . . . . . . . . . . . . . . . . . . . . A.1.2. Discovering the scenario . . . . . . . . . . A.1.3. Translating the scenario into Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 14 19 20 21 24 25 25 26 27 30

Chapter 4. Mapping Big Data flows ........ 23

Appendix A. Big Data Job examples....... 33

34 34 34 34

Talend Open Studio for Big Data Getting Started Guide

Preface
1. General information
1.1. Purpose
Unless otherwise stated, "Talend Studio" or "Studio" throughout this guide refers to any Talend Studio product with Big Data capabilities.

This Getting Started Guide explains how to manage Big Data specific functions of Talend Studio in a normal operational context. Information presented in this document applies to Talend Studio releases beginning with 5.4.1.

1.2. Audience
This guide is for users and administrators of Talend Studio.
The layout of GUI screens provided in this document may vary slightly from your actual GUI.

1.3. Typographical conventions

This guide uses the following typographical conventions: text in bold: window and dialog box buttons and fields, keyboard keys, menus, and menu and options, text in [bold]: window, wizard, and dialog box titles, text in courier: system parameters typed in by the user, text in italics: file, schema, column, row, and variable names, The icon indicates an item that provides additional information about an important point. It is also used to add comments related to a table or a figure, The icon indicates a message that gives information about the execution requirements or recommendation type. It is also used to refer to situations or information the end-user needs to be aware of or pay special attention to.

2. Feedback and Support

Your feedback is valuable. Do not hesitate to give your input, make suggestions or requests regarding this documentation or product and find support from the Talend team, on Talends Forum website at:

Talend Open Studio for Big Data Getting Started Guide

Feedback and Support

http://talendforge.org/forum

Talend Open Studio for Big Data Getting Started Guide

Chapter 1. Introduction to Talend Big Data solutions

It is nothing new that organizations data collections tend to grow increasingly large and complex, especially in the Internet era, and it has become more and more difficult to process such large and complex data sets using conventional, on-hand information management tools. To face up this difficulty, a new platform of big data tools has emerged specifically designed to handle large quantities of data - the Apache Hadoop Big Data Platform. Built on top of Talend's data integration solutions, Talend's Big Data solutions provide a powerful tool set that enables users to access, transform, move and synchronize big data by leveraging the Apache Hadoop Big Data Platform and makes the Hadoop platform ever so easy to use. This guide addresses only the big data related features and capabilities of your Talend studio. Therefore, before working with big data Jobs in the studio, users need to read the user guide to get acquainted with your studio.

Talend Open Studio for Big Data Getting Started Guide

Hadoop and Talend studio

1.1. Hadoop and Talend studio

When IT specialists talk about big data, they are usually referring to data sets that are so large and complex that they can no longer be processed with conventional data management tools. These huge volumes of data are produced for a variety of reasons. Streams of data can be generated automatically (reports, logs, camera footage, and so on). Or they could be the result of detailed analyses of customer behavior (consumption data), scientific investigations (the Large Hadron Collider is an apt example) or the consolidation of various data sources. These data repositories, which typically run into petabytes and exabytes, are hard to analyze because conventional database systems simply lack the muscle. Big data has to be analyzed in massively parallel environments where computing power is distributed over thousands of computers and the results are transferred to a central location. The Hadoop open source platform has emerged as the preferred framework for analyzing big data. This distributed file system splits the information into several data blocks and distributes these blocks across multiple systems in the network (the Hadoop cluster). By distributing computing power, Hadoop also ensures a high degree of availability and redundancy. A master nodehandles file storage as well as requests. Hadoop is a very powerful computing platform for working with big data. It can accept external requests, distribute them to individual computers in the cluster and execute them in parallel on the individual nodes. The results are fed back to a central location where they can then be analyzed. However, to reap the benefits of Hadoop, data analysts need a way to load data into Hadoop and subsequently extract it from this open source system. This is where Talend studio comes in. Built on top of Talend's data integration solutions, Talend studio enable users to handle big data easily by leveraging Hadoop and its databases or technologies such as HBase, HCatalog, HDFS, Hive, Oozie and Pig. Talend studio is an easy-to-use graphical development environment that allows for interaction with big data sources and targets without the need to learn and write complicated code. Once a big data connection is configured, the underlying code is automatically generated and can be deployed as a service, executable or stand-alone Job that runs natively on your big data cluster - HDFS, Pig, HCatalog, HBase, Sqoop or Hive. Talends big data solutions provide comprehensive support for all the major big data platforms. Talends big data components work with leading big data Hadoop distributions, including Cloudera, Greenplum, Hortonworks and MapR. Talend provides out-of-the-box support for a range of big data platforms from the leading appliance vendors including Greenplum, Netezza, Teradata, and Vertica.

1.2. Functional architecture of Talend Big Data solutions

The functional architecture of Talend Big Data solutions is an architectural model that identifies the functions, interactions and corresponding IT needs of Talend Big Data solutions. The overall architecture has been described by isolating specific functionalities in functional blocks. The following chart illustrates the main architectural functional blocks relevant to big data handling in the studio.

Talend Open Studio for Big Data Getting Started Guide

Functional architecture of Talend Big Data solutions

Three different types of functional blocks are defined: at least one Studio where you can design big data Jobs that leverage the Apache Hadoop platform to handle large data sets. These Jobs can be either executed locally or deployed, scheduled and executed on a Hadoop grid via the Oozie workflow scheduler system integrated within the studio. a workflow scheduler system integrated within the studio through which you can deploy, schedule, and execute big data Jobs on a Hadoop grid and monitor the execution status and results of the Jobs. a Hadoop grid independent of the Talend system to handle large data sets.

Talend Open Studio for Big Data Getting Started Guide

Chapter 2. Getting started with Talend Big Data using the demo project
This chapter provides short descriptions about the sample Jobs included in the demo project and introduces the necessary preparations required to run the sample Jobs on a Hadoop platform. For how to import a demo project, see the section on importing a demo project of Talend Studio User Guide. Before you start working in the studio, you need to be familiar with its Graphical User Interface (GUI). For more information, see the appendix describing GUI elements of Talend Studio User Guide.

Talend Open Studio for Big Data Getting Started Guide

Introduction to the Big Data demo project

2.1. Introduction to the Big Data demo project

Talend provides a Big Data demo project that includes a number of easy-to-use sample Jobs. You can import this demo project into your Talend studio to help familiarize yourself with Talend studio and understand the various features and functions of Talend components.
Due to license incompatibility, some third-party Java libraries and database drivers (.jar files) required by Talend components used in some Jobs of the demo project could not be shipped with your Talend Studio. You need to download and install those .jar files, known as external modules, before you can run Jobs that involve such components. Fortunately, your Talend Studio provides a wizard that let you install external modules quickly and easily. This wizard automatically appears when your run a Job and the studio detects that one or more required external modules are missing. It also appears when you click Install on the top of the Basic settings or Advanced settings view of a component for which one or more required external modules are missing. For more information on installing third-party modules, see the sections on identifying and installing external modules of the Installation and Upgrade Guide.

With the Big Data demo project imported and opened in your Talend studio, all the sample Jobs included in it are available in different folders under the Job Designs node of the Repository tree view.

The following sections briefly describe the Jobs contained in each sub-folder of the main folders.

2.1.1. Hortonworks_Sandbox_Samples
The main folder Hortonworks_Sandbox_Samples gathers standard Talend Jobs that are intended to demonstrate how handle data on a Hadoop platform.

Talend Open Studio for Big Data Getting Started Guide

Hortonworks_Sandbox_Samples

Folder Advanced_Examples

Sub-folder

Description The Advanced_Examples folder has some use cases, including an example of processing Apache Weblogs using the Talend's Apache Weblog, HCatalog and Pig components, an example of computing US Government Spending data using a Hive query, and an example of extracting data from any MySQL database and loading all the data from the tables dynamically. If there are multiple steps to achieve a use case they are named Step_1, Step_2 and so on.

ApacheWebLog

This folder has a classic Weblog file process that shows loading an Apache web log into HCatalog and HDFS and filtering out specific codes. There are two examples computing counts of unique IP addresses or web codes. These examples use Pig Scripts and HCatalog load. There are 6 steps in this example, run each step in the order listed in the Job names. For more details of this example, see section Gathering Web traffic information using Hadoop of appendix Big Data Job examples, which guides you step by step through the creation and configuration of the example Jobs.

Gov_Spending_Analysis

This is an example of a two-step process that loads some sample US Government spending data into HCatalog and then in step two it uses a Hive query to get the total spending amount per Government agency. There is an extra Data Integration Job that takes a file from the http://usaspending.gov/data web site and prepares it for the input to the Job that loads the data to HCatalog. You will need to replace the tFixedFlowInput component with the input file. There are 2 steps in this example, run each step in the order listed in the Job names.

RDBMS_Migration_SQOOP

This is a two step process that will read data from any MySQL schema and load it to HDFS. The database can be any MySQL5.5 or newer. The schema needs to have as many tables as you desire. Set the database and the schema in the context variables labeled SQOOP_SCENARIO_CONTEXT and the first Job will dynamically read the schema and create two files with list of tables. One file consists of tables with Primary Keys to be partitioned in HCatalog or Hive if used and the second one consists of the same tables without Primary Keys. The second step uses the two files to then load all the data from MySQL tables in the schema to HDFS. There will be a file per table. Keep in mind when running this process not to select a schema with a large amount of volume if you are using the Sandbox single node VM, as it has not a lot of power. For more information on using the proposed Sandbox single node VM, see section Installing Hortonworks Sandbox.

E2E_hCat_2_Hive

This folder contains a very simple process that loads some sample data to HCatalog in the first step and then in the second step just shows how you can use the Hive components to access and process the data. This folder contains simple examples of how to load data to HBase and read data from it. There are two examples for HCatalog: the first one puts a file directly on the HDFS and then loads the meta store with the information into HCatalog. The second example loads data streaming directly into HCatalog in the defined partitions. The examples in this folder show the basic HDFS operations like Get, Put, and Streaming loads. This folder contains three examples: the first Job shows how to use the Hive components to complete basic operations on Hive like creating a database, creating a table and loading data to the table. The next two

HBASE HCATALOG

HDFS HIVE

Talend Open Studio for Big Data Getting Started Guide

NoSQL_Examples

Folder

Sub-folder

Description Jobs shows first how to load two tables to Hive, which are then used in the second step, an example of how you can do ELT with Hive. This folder contains many different examples of how Pig components can be used to perform many different key functions such as aggregations and filtering and an example of how the Pig Code works.

PIG

2.1.2. NoSQL_Examples
The main folder NoSQL_Examples gathers Jobs that are intended to demonstrate how to handle data with NoSQL databases.
Folder Cassandra MongoDB Description This is another example of how to do the basic write and read to the Cassandra Database to then start using the Cassandra NoSQL database right away. This folder has an example of how to use MongoDB to easily and quickly search open text unstructured data for blog entries with key words.

2.2. Setting up the environment for the demo Jobs to work

The Big Data demo project is intended to give you some easy and practical examples of many of the basic functionality of Talend's Big Data solution. To run the demo Jobs included in the demo project, you need to have your Hadoop platform up and running, and configure the context variables defined in the demo project or configure the relevant components directly if you do not want to use the proposed Hortonworks Sandbox virtual appliance.

2.2.1. Installing Hortonworks Sandbox

For ease of use, one of the methods to get a Hadoop platform up quickly is to use a Virtual Appliance from one of the top Hadoop Distribution Vendors. Hortonworks provides a Virtual Appliance/Machine or a VM called the Sandbox that is fast and easy to set up. Using context variables, the samples Jobs within the Hortonworks_Sandbox_Samples folder of the demo project have been configured to work on Hortonworks Sandbox VM. Below is a brief procedure of setting up the single-node VM with Hortonworks Sandbox on Oracle VirtualBox, which is recommended by Hortonworks. For details, see the documentation of the relevant vendors. 1. Download the recommended version of Oracle VirtualBox from https://www.virtualbox.org/ and the Sandbox image for VirtualBox from http://hortonworks.com/products/hortonworks-sandbox/. Install and set up Oracle VirtualBox by following Oracle VirtualBox documentation. Install the Hortonworks Sandbox virtual appliance on Oracle VirtualBox by following Hortonworks Sandbox instructions. In the Oracle VM VirtualBox Manager window, click Network, select the Adapter 1 tab, select Bridged Adapter from the Attached to list box, and then select your working physical network adapter from the Name list box.

2. 3.

Talend Open Studio for Big Data Getting Started Guide

Understanding context variables used in the demo project

Start the Hortonworks Sandbox virtual appliance to get the Hadoop platform up and running, and check that the IP address assigned to the Sandbox virtual machine is pingable.

Then, before launching the demo Jobs, add an IP-domain mapping entry in your hosts file to resolve the host name sandbox, which is defined as the value of a couple of context variables in this demo project, rather than using an IP address of the Sandbox virtual machine, as this will minimize the changes you will need to make to the configured context variables. For more information about context variables used in the demo project, see section Understanding context variables used in the demo project.

2.2.2. Understanding context variables used in the demo project

In Talend Studio, you can define project-level context variables once in a project and use them in many Jobs, typically to help define connections and other things that are common across many different Jobs and processes. The advantage for this is obvious. For example, if you define the namenode IP address in a context variable and you have 50 Jobs that use that variable, to change the IP address of the namenode you simply need to update the context variable. Then, the studio will inform you of all the Jobs impacted by this update and change them for you. Project-level context variables are grouped under the Contexts node in the Repository tree view. In the Big Data demo project, two groups of project-level context variables have been defined in the Repository: HDP and SQOOP_SCENARIO_CONTEXT.

Talend Open Studio for Big Data Getting Started Guide

Understanding context variables used in the demo project

To view or edit the settings of the context variables of a group, double-click the group name in the Repository tree view to open the [Create / Edit a context group] wizard, and then select the Values as tree or Values as table tab.

The context variables in the HDP group are used in all the demo examples in the Hortonworks_Sandbox_Samples folder. If you want, you can change the values of these variables. For example, if you want to use the IP address for the Sandbox Platform VM rather than the host name sandbox, you can change the value of the host name variables to the IP address. If you change any of the default configurations on the Sandbox VM, you need to change the context settings accordingly; otherwise the demo examples may not work as intended.
Variable name namenode_host namenode_port user templeton_host templeton_port hive_host hive_port jobtracker_host jobtracker_port mysql_host mysql_port mysql_user mysql_passed mysql_testes Description Namenode host name Namenode port User name to connect to the Hadoop system HCatalog server host name HCatalog server port Hive metastore host name Hive metastore port Jobtracker host name Jobtracker port Host of the Sandbox for the Hive metastore Port of the Hive metastore User name to connect to the Hive metastore Password to connect to the Hive metastore Name of the test database for the Hive metastore Default value sandbox 8020 sandbox sandbox 50111 sandbox 9083 sandbox 50300 sandbox 3306 hep hep testes

Talend Open Studio for Big Data Getting Started Guide

Understanding context variables used in the demo project

Variable name hbase_host hbase_port

Description HBase host name HBase port

Default value sandbox 2181

The context variables in the SQOOP_SCENARIO_CONTEXT group are used for the RDBMS_Migration_SQOOP demo examples only. You will need to go through the following context variables and update your information for the Sandbox VM on your local MySQL connections if you want to use the RDBMS_Migration_SQOOP demo.

Variable name KEY_LOGS_DIRECTORY MYSQL_DBNAME_TO_MIGRATE MYSQL_HOST_or_IP MYSQL_PORT MYSQL_USERNAME MYSQL_PWD HDFS_LOCATION_TARGET

Description

Default value

A directory holding table files on your local machine that C:/Talend/BigData/ the studio has full access to Name of your own MySQL database to migrate to HDFS dstar_crm Host name or IP address of the MySQL database Port of the MySQL database User name to connect to the MySQL database Password to connect to the MySQL database Target location on the Sandbox HDFS where you want /user/hdp/sqoop/ to load the data 192.168.56.1 3306 tisadmin

To use project-level context variables in a Job, you need to import them into the Job first by clicking the button in the Context view. You can also define context variables in the Contexts view of a Job. These variables are built-in variables that work only for that Job. The Contexts view shows the built-in context variables defined in the Job and the project-level context variables imported into the Job.

Once defined, variables are referenced in the configurations of components. The following example shows how context variables are used in the configurations of the tHDFSConnection component in a Pig Job of the demo project.

Talend Open Studio for Big Data Getting Started Guide

Understanding context variables used in the demo project

Once these are setup to reflect how you have configured the HortonWorks Sandbox, the examples will run with little intervention. You can see how many of the core functions work allowing you to have good samples to implement your Big Data projects. For more information on defining and using context variables, see the section on how to centralize contexts and variables of the Talend Studio User Guide. For how to run a Job from the Run console, see the section on how to run a Job of the Talend Studio User Guide. For how to run a Job from the Oozie scheduler view, see section How to run a Job via Oozie.

Talend Open Studio for Big Data Getting Started Guide

Chapter 3. Handling Jobs in Talend Studio

This chapter introduces procedures of handling Jobs in your Talend studio that take the advantage of the Hadoop big data platform to work with large data sets. For general procedures of designing, executing, and managing Talend data integration Jobs, see the User Guide that comes with your Talend studio. Before starting working on a Job in the studio, you need to be familiar with its Graphical User Interface (GUI). For more information, see the appendix describing GUI elements of the User Guide.

Talend Open Studio for Big Data Getting Started Guide

How to run a Job via Oozie

3.1. How to run a Job via Oozie

Your Talend studio provides an Oozie scheduler, a feature that enables you to schedule executions of a Job you have created or run it immediately on a remote Hadoop Distributed File System (HDFS) server, and to monitor the execution status of your Job. For more information on Apache Oozie and Hadoop, check http://oozie.apache.org/ and http://hadoop.apache.org/.
If the Oozie scheduler view is not shown, click Window > Show view and select Talend Oozie from the [Show View] dialog box to show it in the configuration tab area.

3.1.1. How to set HDFS connection details

Before you can run or schedule executions of a Job on an HDFS server, you need first to define the HDFS connection details either in the Oozie scheduler view or in the studio preference settings, and specify the path where your Job will be deployed.

3.1.1.1. Defining HDFS connection details in Oozie scheduler view

To define HDFS connection details in the Oozie scheduler view, do the following: 1. Click the Oozie schedule view beneath the design workspace.

Click Setting to open the connection setup dialog box.

Talend Open Studio for Big Data Getting Started Guide

How to set HDFS connection details

The connection settings shown above are for an example only.

Fill in the required information in the corresponding fields, and click OK close the dialog box.

Field/Option Hadoop distribution

Description Hadoop distribution to be connected to. This distribution hosts the HDFS file system to be used. If you select Custom to connect to a custom Hadoop distribution, then you need to click the [...] button to open the [Import custom definition] dialog box and from this dialog box, to import the jar files required by that custom distribution. For further information, see section Connecting to a custom Hadoop distribution.

Hadoop version Enable kerberos security

Version of the Hadoop distribution to be connected to. This list disappears if you select Custom from the Hadoop distribution list. If you are accessing the Hadoop cluster running with Kerberos security, select this check box, then, enter the Kerberos principal name for the NameNode in the field displayed. This enables you to use your user name to authenticate against the credentials stored in Kerberos. This check box is available depending on the Hadoop distribution you are connecting to.

User Name Name node end point Job tracker end point Oozie end point

Login user name. URI of the name node, the centerpiece of the HDFS file system. URI of the Job Tracker node, which farms out MapReduce tasks to specific nodes in the cluster. URI of the Oozie web console, for Job execution monitoring.

Talend Open Studio for Big Data Getting Started Guide

How to set HDFS connection details

Field/Option Hadoop Properties

Description If you need to use custom configuration for the Hadoop of interest, complete this table with the property or properties to be customized. Then at runtime, these changes will override the corresponding default properties used by the Studio for its Hadoop engine. For further information about the properties required by Hadoop, see Apache's Hadoop documentation on http://hadoop.apache.org, or the documentation of the Hadoop distribution you need to use. Settings defined in this table are effective on a per-Job basis.

Once the Hadoop distribution/version and connection details are defined in the Oozie scheduler view, the settings in the [Preferences] window are automatically updated, and vice versa. For information on Oozie preference settings, see section Defining HDFS connection details in preference settings. Upon defining the deployment path in the Oozie scheduler view, you are ready to schedule executions of your Job, or run it immediately, on the HDFS server.

3.1.1.2. Defining HDFS connection details in preference settings

To define HDFS connection details in the studio preference settings, do the following: 1. 2. From the menu bar, click Window > Preferences to open the [Preferences] window. Expand the Talend node and click Oozie to display the Oozie preference view.

The Oozie settings shown above are for an example only.

Fill in the required information in the corresponding fields:

Talend Open Studio for Big Data Getting Started Guide

How to set HDFS connection details

Field/Option Hadoop distribution

Description Hadoop distribution to be connected to. This distribution hosts the HDFS file system to be used. If you select Custom to connect to a custom Hadoop distribution, then you need to click the button to open the [Import custom definition] dialog box and from this dialog box, to import the jar files required by that custom distribution. For further information, see section Connecting to a custom Hadoop distribution.

Hadoop version Enable kerberos security

User Name Name node end point Job tracker end point Oozie end point

Once the settings are defined in the[Preferences] window, the Hadoop distribution/version and connection details in the Oozie scheduler view are automatically updated, and vice versa. For information on the Oozie scheduler view, see section How to run a Job via Oozie.

Connecting to a custom Hadoop distribution

From the [Import custom definition] dialog box, proceed as follows to import the required jar files: 1. Select Import from existing version or Import from zip to import the required jar files from the appropriate source.

Talend Open Studio for Big Data Getting Started Guide

How to set HDFS connection details

Verify that the Oozie check box is selected. This allows you to import the jar files pertinent to the Oozie and HDFS to be used. Click OK and then in the pop-up warning, click Yes to accept overwriting any custom setup of jar files previously implemented for this connection. Once done, the [Custom Hadoop version definition] dialog box becomes active.

4. If you still need to add more jar files, click the button to open the [Select libraries] dialog box.

If required, in the filter field above the Internal libraries list, enter the name of the jar file you need to use, in order to verify whether that file is provided with the Studio. Select the External libraries option to open its view.

Talend Open Studio for Big Data Getting Started Guide

How to run a Job on the HDFS server

7. 8.

Browse to and select any jar file you need to import. Click OK to validate the changes and to close the [Select libraries] dialog box. Once done, the selected jar file appears in the list in the Oozie tab view.

Then, you can repeat this procedure to import more jar files. If you need to share the jar files with another Studio, you can export this custom connection from the [Custom Hadoop version definition] dialog box using the button.

3.1.2. How to run a Job on the HDFS server

To run a Job on the HDFS server, 1. 2. In the Path field on the Oozie scheduler tab, enter the path where your Job will be deployed on the HDFS server. Click the Run button to start Job deployment and execution on the HDFS server.

Your Job data is zipped, sent to, and deployed on the HDFS server based on the server connection settings and automatically executed. Depending on your connectivity condition, this may take some time. The console displays the Job deployment and execution status. To stop the Job execution before it is completed, click the Kill button.

Talend Open Studio for Big Data Getting Started Guide

How to schedule the executions of a Job

3.1.3. How to schedule the executions of a Job

The Oozie scheduler feature integrated in your Talend studio enables you to schedule executions of your Job on the HDFS server. Thus, your Job will be executed based on the defined frequency within the set time duration. To configure Job scheduling, do the following: 1. In the Path field on the Oozie scheduler tab, enter the path where your Job will be deployed on the HDFS server if the deployment path is not yet defined. Click the Schedule button on the Oozie scheduler tab to open the scheduling setup dialog box.

Fill in the Frequency field with an integer and select a time unit from the Time Unit list to define the Job execution frequency. Click the [...] button next to the Start Time field to open the [Select Date & Time] dialog box, select the date, hour, minute, and second values, and click OK to set the Job execution start time. Then, set the Job execution end time in the same way.

Click OK to close the dialog box and start scheduled executions of your Job. The Job automatically runs based on the defined scheduling parameters. To stop the Job, click Kill.

Talend Open Studio for Big Data Getting Started Guide

How to monitor Job execution status

3.1.4. How to monitor Job execution status

To monitor Job execution status and results, click the Monitor button on the Oozie scheduler tab. The Oozie end point URI opens in your Web browser, displaying the execution information of the Jobs on the HDFS server.

To display the detailed information of a particular Job, click any field of that Job to open a separate page showing the details of the Job.

Talend Open Studio for Big Data Getting Started Guide

How to monitor Job execution status

Talend Open Studio for Big Data Getting Started Guide

Chapter 4. Mapping Big Data flows

When developing an ETL process for Big Data, it is common to map data from either single or multiple sources to data stored in the target system. Although Hadoop provides a scripting language, Pig Latin, and a programming model, Map/Reduce, to ease the development of a transformation and routing process for Big Data, learning and understanding them still requires a huge coding effort. Talend provides mapping components that are optimized for the Hadoop environment in order to visually map the input and the output data flows. Taking tPigMap as example, this chapter gives information about the theory behind how these mapping components can be used. For more practical examples of how to use the components, see Talend Open Studio Components Reference Guide. Before starting any data integration processes, you need to be familiar with the Graphical User Interface (GUI) of the Studio. For more information, see the appendix describing GUI elements of Talend Studio User Guide.

Talend Open Studio for Big Data Getting Started Guide

tPigMap interface

4.1. tPigMap interface

Pig is a platform using a scripting language to express data flows. It programs step-by-step operations to transform data using Pig Latin, name of the language used by Pig. tPigMap is an advanced component that maps input flows and output flows being handled in a Pig process (an array of Pig components). Therefore, it requires tPigLoad to read data from the source system and tPigStoreResult to write data in a given target. Starting from this basic design composed of tPigLoad, tPigMap and tPigStoreResult, you can visually develop a Pig process with a wide range of complexity by using the other Pig components around tPigMap. As these components generate Pig code, the Job developed is thus optimized for the Hadoop environment. You need to use a map editor to configure tPigMap. This Map Editor is an all-in-one tool allowing you to define all parameters needed to map, transform and route your data flows via a convenient graphical interface. You can minimize and restore the Map Editor and all tables in the Map Editor using the window icons.

The Map Editor is made of several panels: The Input panel is the top left panel on the editor. It offers a graphical representation of all (main and lookup) incoming data flows. The data are gathered in various columns of input tables. Note that the table name reflects the main or lookup row from the Job design on the design workspace. The Output panel is the top right panel on the editor. It allows mapping data and fields from input tables to the appropriate output rows. The Search panel is the central panel. It allow you to search in the editor for columns or expressions that contain the text you enter in the Find field. Both bottom panels are the input and output schemas description. The Schema editor tab offers a schema view of all columns of input and output tables in selection in their respective panel.

Talend Open Studio for Big Data Getting Started Guide

tPigMap operation

Expression editor is the editing tool for all expression keys of input/output data or filtering conditions. The name of input/output tables in the Map Editor reflects the name of the incoming and outgoing flows (row connections). This Map Editor stays the way a typical Talend mapping component's map editor, such as tMap's, is designed and used. Therefore, in order for you to understand fully how a classic mapping component works, we recommend reading as reference the chapter describing how Talend Studio maps data flows of Talend Studio User Guide.

4.2. tPigMap operation

You can map data flows by simply dragging and dropping columns from the Input panel to the Output panel of tPigMap, while more than often, you may need to perform operations of higher complexities, such as editing a filter, setting a join or using a user-defined function for Pig. In that situation, tPigMap provides a rich set of options you can set and generates the corresponding Pig code to meet your needs. The following sections present those options.

4.2.1. Configuring join operations

On the input side, you can display the panel used for settings the join options by clicking the appropriate table. button on the

Lookup properties Join Model

Value Inner Join; Left Outer Join; Right Outer Join; Full Outer Join. The default join option is Left Outer Join when you do not activate this option settings panel by displaying it. These options perform the join of two or more flows based on common field values. When more than one lookup tables need join, the main input flow starts the join from the first lookup flow, then uses the result to join the second and so on in the same manner until the last lookup flow is joined.

Talend Open Studio for Big Data Getting Started Guide

Catching rejected records

Lookup properties Join Optimization

Value None; Replicated; Skewed; Merge. The default join option is None when you do not activate this option settings panel by displaying it. These options are used to perform more efficient join operations. For example, if you are using the parallelism of multiple reduce tasks, the Skewed join can be used to counteract the load imbalance problem if the data to be processed is sufficiently skewed. Each of these options is subject to the constraints explained in Apache's documentation about Pig Latin.

Custom Partitioner

Enter the Hadoop partitioner you need to use to control the partitioning of the keys of the intermediate map-outputs. For example, enter, in double quotation marks,
org.apache.pig.test.utils.SimpleCustomPartitioner

to use the partitioner SimpleCustomPartitioner. The jar file of this partitioner must have been registered in the Register jar table in the Advanced settings view of the tPigLoad component linked with the tPigMap component to be used. For further information about the code of this SimpleCustomPartitioner, see Apache's documentation about Pig Latin. Increase Parallelism Enter the number of reduce tasks for the Hadoop Map/Reduce tasks generated by Pig. For further information about the parallelism of reduce tasks, see Apache's documentation about Pig Latin.

4.2.2. Catching rejected records

On the output side, the following options become available once you display the panel used for setting output options by clicking the button on the appropriate table.

Output properties Catch Output Reject

Value True;

Talend Open Studio for Big Data Getting Started Guide

Editing expressions

Output properties

Value False. This option, once activated, allows you to catch the records rejected by a filter you have defined in the appropriate area.

Catch Lookup Inner Join Reject

True; False. This option, once activated, allows you to catch the records rejected by the inner join operation performed on the input flows.

4.2.3. Editing expressions

On both sides, you can edit all expression keys of input/output data or filtering conditions by using Pig Latin. For details about Pig Latin, see Apache's documentation about Pig such as Pig Latin Basics and Pig Latin Reference Manual. You can write the expressions necessary for the data transformation directly in the Expression editor view located in the lower half of the expression editor, or you can open the [Expression Builder] dialog box where you can write the data transformation expressions. To open the [Expression Builder] dialog box, click the button next to the expression you want to open in the tabular panels representing the lookup flow(s) or the output flow(s) of the Map Editor.

The [Expression Builder] dialog box opens on the selected expression.

Talend Open Studio for Big Data Getting Started Guide

Editing expressions

If you have created any Pig user-defined function (Pig UDF) in the Studio, a Pig UDF Functions option appears automatically in the Categories list and you can select it to edit the mapping expression you need to use.

You need to use the Pig UDF item under the Code node of the Repository tree to create a Pig UDF function. Although you need to know how to write a Pig function using Pig Latin, a Pig UDF function is created the same way as a Talend routine.

Talend Open Studio for Big Data Getting Started Guide

Editing expressions

Note that your Repository may look different from this image depending on the license you are using. For further information about a routine, see the chapter describing how to manage a routine of the User Guide. To open the Expression editor view, 1. 2. In the lower half of the editor, click the Expression editor tab to open the corresponding view. Click the column you need to set expressions for and edit the expressions in the Expression editor view.

If you need to set filtering conditions for an input or output flow, you have to click the button and then edit the expressions in the displayed area or by using the Expression editor view or the [Expression Builder] dialog box.

Talend Open Studio for Big Data Getting Started Guide

Setting up a Pig User-Defined Function

4.2.4. Setting up a Pig User-Defined Function

As explained in the section earlier, you can create a Pig User-Defined Function (Pig UDF) and it is automatically added to the Category list in the Expression Builder view. 1. Right-click the Pig UDF sub-node under the Code node of the Repository tree and from the contextual menu, select Create Pig UDF. The [Create New Pig UDF] wizard is displayed.

Talend Open Studio for Big Data Getting Started Guide

Setting up a Pig User-Defined Function

From the Template list, select the type of the Pig UDF function to be created. Based on your choice, the Studio will provide the corresponding template to help the development of the Pig UDF you need. Complete the other fields in the wizard. Click Finish to validate the changes and the Pig UDF template is opened in the workspace. Write your code in the template.

3. 4. 5.

Once done, this Pig UDF will automatically appear in the Categories list in the Expression Builder view of tPigMap and is ready for use.

Talend Open Studio for Big Data Getting Started Guide

Appendix A. Big Data Job examples

This chapter aims at users of Talend Big Data solutions who seek real-life use cases to help them take full control over the products. This chapter comes as a complement of Talend Open Studio Components Reference Guide.

Talend Open Studio for Big Data Getting Started Guide

Gathering Web traffic information using Hadoop

A.1. Gathering Web traffic information using Hadoop

To drive a focused marketing campaign based on habits or profiles of your customers or users, you need to be able to fetch data based on their habits or behavior on your website to be able to create user profiles and send them the right advertisements, for example. The ApacheWebLog folder of the Big Data demo project that comes with your Talend Studio provides an example of finding out users having visited a website most often by sorting out their IP addresses from a huge number of records in an access log file for an Apache HTTP server to enable further analysis on user behavior on the website. This section describes the procedures for creating and configuring Jobs that will implement this example. For more information about the Big Data demo project, see chapter Getting started with Talend Big Data using the demo project.

A.1.1. Prerequisites
Before discovering this example and creating the Jobs, you should have: Imported the demo project, and obtained the input access log file used in this example by executing the Job GenerateWebLogFile provided with the demo project. Installed and started Hortonworks Sandbox virtual appliance that the demo project is designed to work on, as described in section Installing Hortonworks Sandbox. An IP to host name mapping entry has been added in the hosts file to resolve the host name sandbox.

A.1.2. Discovering the scenario

In this example, certain Talend Big Data components are used to leverage the advantage of the Hadoop open source platform for handling big data. In this scenario we use six Jobs: The first Job sets up an HCatalog database, table and partition in HDFS The second Job uploads the access log file to be analyzed to the HDFS file system. The third Job connects to the HCatalog database and displays the content of the uploaded file on the console. The fourth Job parses the uploaded access log file, including removing any records with a "404" error, counting the code occurrences in successful service calls to the website, sorting the result data and saving it in the HDFS file system. The fifth Jobs parse the uploaded access log file, including removing any records with a "404" error, counting the IP address occurrences in successful service calls to the website, sorting the result data and saving it in the HDFS file system. The last Job reads the result data from HDFS and displays the IP addresses with successful service calls and their number of visits to the website on the standard system console.

A.1.3. Translating the scenario into Jobs

This section describes how to create, configure, and execute the Jobs to get the expected result of this scenario.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

A.1.3.1. Creating the example Jobs

In this section, we will create six Jobs that will implement the ApacheWebLog example of the demo Job.

Create the first Job

Follow these steps to create the first Job, which will set up an HCatalog database to manage the access log file to be analyzed: 1. In the Repository tree view, right-click Job Designs, and select Create folder to create a new folder to group the Jobs that you will create. Right-click the folder you just created, and select Create job to create your first Job. Name it A_HCatalog_Create to identify its role and execution order among the example Jobs. You can also provide a short description for your Job, which will appear as a tooltip when you move your mouse over the Job. 2. Drop a tHDFSDelete and two tHCatalogOperation components from the Palette onto the design workspace. Connect the three components using Trigger > On Subjob Ok connections. The HDFS subjob will be used to remove any previous results of this demo example, if any, to prevent possible errors in Job execution, and the two HCatalog subjobs will be used to create an HCatalog database and set up an HCatalog table and partition in the created HCatalog table, respectively. Label these components to better identify their functionality.

Create the second Job

Follow these steps to create the second Job, which will upload the access log file to the HCatalog: 1. Create a new Job and name it B_HCatalog_Load to identify its role and execution order among the example Jobs. From the Palette, drop a tApacheLogInput, a tFilterRow, a tHCatalogOutput, and a tLogRow component onto the design workspace. Connect the tApacheLogInput component to the tFilterRow component using a Row > Main connection, and then connect the tFilterRow component to the tHCatalogOutput component using a Row > Filter connection. This data flow will load the log file to be analyzed to the HCatalog database, with any records having the error code of "301" removed. Connect the tFilterRow component to the tLogRow component using a Row > Reject connection. This flow will print the records with the error code of "301" on the console. Label these components to better identify their functionality.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

Create the third Job

Follow these steps to create the third Job, which will display the content of the uploaded file: 1. Create a new Job and name it C_HCatalog_Read to identify its role and execution order among the example Jobs. Drop a tHCatalogInput component and a tLogRow component from the Palette onto the design workspace, and link them using a Row > Main connection. Label the components to better identify their functionality.

Create the fourth Job

Follow these steps to create the fourth Job, which will analyze the uploaded log file to get the code occurrences in successful calls to the website: 1. Create a new Job and name it D_Pig_Count_Codes to identify its role and execution order among the example Jobs. Drop the following components from the Palette to the design workspace: a tPigLoad, to load the data to be analyzed, a tPigFilterRow, to remove records with the '404' error from the input flow, a tPigFilterColumns, to select the columns you want to include in the result data, a tPigAggregate, to count the number of visits to the website, a tPigSort, to sort the result data, and a tPigStoreResult, to save the result to HDFS. 3. Connect these components using Row > Pig Combine connections to form a Pig chain, and label them to better identify their functionality.
Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

Create the fifth Job

Follow these steps to create the fift Job, which will analyze the uploaded log file to get the IP occurrences of successful service calls to the website: 1. 2. Right-click the previous Job in the Repository tree view and select Duplicate. In the dialog box that appears, name the Job E_Pig_Count_IPs to identify its role and execution order among the example Jobs. Change the label of the tPigFilterColumns component to identify its role in the Job.

Create the sixth Job

Follow these steps to create the last Job, which will display the results of access log analysis; 1. 2. Create a new Job and name it F_Read_Results to identify its role and execution order among the example Jobs. From the Palette, drop two tHDFSInput components and two tLogRow components onto the design workspace. Link the first tHDFSInput to the first tLogRow, and the second tHDFSInput to the second tLogRow using Row > Main connections. Link the first tHDFSInput to the second tHDFSInput using a Trigger > OnSubjobOk connection. Label the components to better identify their functionality.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

A.1.3.2. Configuring the Jobs

In this section we will configure each the example Jobs we have created.
The Repository Metadata feature is available only with licensed-based Big Data solutions of Talend. If you are using Talend Open Studio for Big Data, you have to configure each component manually as the Property type and Schema type are always Built-in.

Configuring the first Job

In this step, we will configure the first Job, A_HCatalog_Create, to set up the HCatalog system for processing the access log file.

Set up an HCatalog database

1. Double-click the tHDFSDelete component, which is labelled HDFS_ClearResults in this example, to open its Basic settings view on the Component tab.

To use a centralized HDFS connection, click the Property Type list box and select Repository, and then click the [...] button to open the [Repository Content] dialog box. Select the HDFS connection defined for connecting to the HDFS system and click OK. All the connection details are automatically filled in the respective fields.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

If you are using Talend Open Studio for Big Data, you have to define the following connection details manually: Hadoop distribution: HortonWorks Hadoop version: Hortonworks Data Platform V1 NameNode URI: hdfs://sandbox:8020 User name: sandbox 3. In the File or Directory Path field, specify the directory where the access log file will be stored on the HDFS, /user/hdp/weblog in this example. Double-click the first tHCatalogOperation component, which is labelled HCatalog_Create_DB in this example, to open its Basic settings view on the Component tab.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

To use a centralized HCatalog connection, click the Property Type list box and select Repository, and then click the [...] button to open the [Repository Content] dialog box. Select the HCatalog connection defined for connecting to the HCatalog database and click OK. All the connection details are automatically filled in the respective fields.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

Templeton port: 50111 Database: talend User name: sandbox 6. 7. 8. From the Operation on list, select Database; from the Operation list, select Drop if exist and create. In the Option list of the Drop configuration area, select Cascade. In the Database location field, enter the location for the database file is to be created in HDFS, /user/hdp/ weblog/weblogdb in this example.

Set up an HCatalog table and partition

1. Double-click the second tHCatalogOperation component, labelled HCatalog_CreateTable in this example, to open its Basic settings view on the Component tab.

2. 3.

Define the same HCatalog connection details using the same procedure as for the first tHCatalogOperation component. Click the Schema list box and select Repository, then click the [...] button next to the field that appears to open the [Repository Content] dialog box, expand Metadata > Generic schemas > access_log and select schema. Click OK to confirm your choice and close the dialog box. The generic schema of access_log is automatically applied to the component. Alternatively, you can directly select the generic schema of access_log from the Repository tree view and then drag and drop it onto this component to apply the schema. If you are using Talend Open Studio for Big Data, you need to define the schema manually by clicking the [...] button next to Edit schema and pasting the read-only schema of the tApacheLogInput component into the [Schema] dialog box.

From the Operation on list, select Table; from the Operation list, select Drop if exist and create.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

5. 6.

In the Table field, enter a name for the table to be created, weblog in this example. Select the Set partitions check box and click the [...] button next to Edit schema to set a partition and partition schema. Note that the partition schema must not contain any column name defined in the table schema. In this example, the partition schema column is named ipaddresses.

Upon completion of the component settings, press Ctrl+S to save your Job configurations.

Configuring the second Job

In this step, we will configure the second Job, B_HCatalog_Load, to upload the access log file to the Hadoop system.

Upload the access log file to HCatalog

1. Double-click the tApacheLogInput component to open its Basic settings view, and specify the path to the access log file to be uploaded in the File Name field. In this example, we store the log file access_log in the directory C:/Talend/BigData.

Double-click the tFilterRow component to open its Basic settings view.

3. 4.

From the Logical operator used to combine conditions list box, select AND. Click the [+] button to add a line in the Filter configuration table, and set filter parameters to send records that contain the code of "301" to the Reject flow and pass the rest records on to the Filter flow: In the InputColumn field, select the code column of the schema. In the Operator field, select Not equal to. In the Value field, enter 301.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

Double-click the tHCatalogOutput component to open its Basic settings view.

If you are using Talend Open Studio for Big Data, you have to define the following connection details manually: Hadoop distribution: HortonWorks Hadoop version: Hortonworks Data Platform V1

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

NameNode URI: hdfs://sandbox:8020 Templeton host name: sandbox Templeton port: 50111 Database: talend User name: sandbox 7. Click the [...] button to verify that the schema has been properly propagated from the preceding component. If needed, click Sync columns to retrieve the schema. From the Action list, select Create to create the file or Overwrite if the file already exists. In the Partition field, enter the partition name-value pair between double quotation marks, ipaddresses='192.168.1.15' in this example.

8. 9.

10. In the File location field, enter the path where the data will be save, /user/hdp/weblog/access_log in this example. 11. Double-click the tLogRow component to open its Basic settings view, and select the Vertical option to display each row of the output content in a list for better readability. Upon completion of the component settings, press Ctrl+S to save your Job configurations.

Configuring the third Job

In this step, we will configure the third Job, C_HCatalog_Read, to check the content of the log uploaded to the HCatalog. 1. Double-click the tHCatalogInput component to open its Basic settings view in the Component tab.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

If you are using Talend Open Studio for Big Data, you have to define manually the following connection details, which are the same as in the tHCatalogOutput component in the previous Job: Hadoop distribution: HortonWorks Hadoop version: Hortonworks Data Platform V1 NameNode URI: hdfs://sandbox:8020 Templeton host name: sandbox Templeton port: 50111 Database: talend User name: sandbox 3. Click the Schema list box and select Repository, then click the [...] button next to the field that appears to open the [Repository Content] dialog box, expand Metadata > Generic schemas > access_log and select schema. Click OK to confirm your select and close the dialog box. The generic schema of access_log is automatically applied to the component. Alternatively, you can directly select the generic schema of access_log from the Repository tree view and then drag and drop it onto this component to apply the schema. If you are using Talend Open Studio for Big Data, you need to define the schema manually by clicking the [...] button next to Edit schema and pasting the read-only schema of the tApacheLogInput component used in the previous Job into the [Schema] dialog box. 4. In the Basic settings view of the tLogRow component, select the Vertical mode to display the each row in a key-value manner when the Job is executed.

Upon completion of the component settings, press Ctrl+S to save your Job configurations.

Configuring the fourth Job

In this step, we will configure the fourth Job, D_Pig_Count_Codes, to analyze the uploaded access log file using a Pig chain to get the codes of successful service calls and their number of visits to the website.
Talend Open Studio for Big Data Getting Started Guide 45

Translating the scenario into Jobs

Read the log file to be analyzed through the Pig chain

1. Double-click the tPigLoad component to open its Basic settings view.

If you are using Talend Open Studio for Big Data, you have to select the Map/Reduce mode and define the following connection details manually: Hadoop distribution: HortonWorks Hadoop version: Hortonworks Data Platform V1 NameNode URI: hdfs://sandbox:8020 JobTracker host: sandbox:50300 in this example

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

Select the generic schema of access_log from the Repository tree view and then drag and drop it onto this component to apply the schema. If you are using Talend Open Studio for Big Data, you need to define the schema manually by clicking the [...] button next to Edit schema and pasting the read-only schema of the tApacheLogInput component used in the second Job into the [Schema] dialog box.

From the Load function list, select PigStorage, and fill the Input file URI field with the file path defined in the previous Job, /user/hdp/weblog/access_log/out.log in this example.

Analyze the log file and save the result

1. In the Basic settings view of the tPigFilterRow component, click the [+] button to add a line in the Filter configuration table, and set filter parameters to remove records that contain the code of 404 and pass the rest records on to the output flow: In the Logical field, select AND. In the Column field, select the code column of the schema. Select the NOT check box. In the Operator field, select equal. In the Value field, enter 404.

In the Basic settings view of the tPigFilterColumns component, click the [...] button to open the [Schema] dialog box. Select the column code in the Input panel and click the single-arrow button to copy the column to the Output panel to pass the information of the code column to the output flow. Click OK to confirm the output schema settings and close the dialog box.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

In the Basic settings view of the tPigAggregate component, click Sync columns to retrieve the schema from the preceding component, and permit the schema to be propagated to the next component. Click the [...] button next to Edit schema to open the [Schema] dialog box, and add a new column: count. This column will store the number of occurrences of each code of successful service calls. Configure the following parameters to count the number of occurrences of each code: In the Group by area, click the [+] button to add a line in the table, and select the column count in the Column field. In the Operations area, click the [+] button to add a line in the table, and select the column count in the Additional Output Column field, select count in the Function field, and select the column code in the Input Column field.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

In the Basic settings view of the tPigSort component, configure the sorting parameters to sort the data to be passed on: Click the [+] button to add a line in the Sort key table. In the Column field, select count to set the column count as the key. In the Order field, select DESC to sort data in the descendent order.

In the Basic settings view of the tPigStoreResult component, configure the component properties to upload the result data to the specified location on the Hadoop system: Click Sync columns to retrieve the schema from the preceding component. In the Result file URI field, enter the path to the result file, /user/hdp/weblog/apache_code_cnt in this example. From the Store function list, select PigStorage. If needed, select the Remove result directory if exists check box.

Upon completion of the component settings, press Ctrl+S to save your Job configurations.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

Configuring the fifth Job

In this step, we will configure the fifth Job, E_Pig_Count_IPs, to analyze the uploaded access log file using a similar Pig chain as in the previous Job to get the IP addresses of successful service calls and their number of visits to the website. We can use the component settings in the previous Job, with the following differences: In the [Schema] dialog box of the tPigFilterColumns component, copy the column host, instead of code, from the Input panel to the Output panel.

In the tPigAggregate component, select the column host in the Column field of the Group by table and in the Input Column field of the Operations table.

In the tPigStoreResult component, fill the Result file URI field with /user/hdp/weblog/apache_ip_cnt. Upon completion of the component settings, press Ctrl+S to save your Job configurations.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

Configuring the last Job

In this step, we will configure the last Job, F_Read_Results, to read the results data from Hadoop and display them on the standard system console. 1. Double-click the first tHDFSInput component to open its Basic settings view.

If you are using Talend Open Studio for Big Data, you have to select the Map/Reduce mode and define manually the following connection details manually: Hadoop distribution: HortonWorks Hadoop version: Hortonworks Data Platform V1
Talend Open Studio for Big Data Getting Started Guide 51

Translating the scenario into Jobs

NameNode URI: hdfs://sandbox:8020 User name: sandbox in this example 3. Apply the generic schema of ip_count to this component. The schema should contain two columns, host (string, 50 characters) and count (integer, 5 characters), If you are using Talend Open Studio for Big Data, you need to set the schema manually, or copy the schema of the tPigStoreResult component in the previous Job and paste it into the [Schema] dialog box of this component. 4. In the File Name field, enter the path to the result file in HDFS, /user/hdp/weblog/apache_ip_cnt/partr-00000 in this example. From the Type list, select the type of the file to read, Text File in this example. In the Basic settings view of the tLogRow component, select the Table option for better readability. Configure the other subjob in the same way, but in the second tHDFSInput component: Apply the generic schema of code_count, or configure the schema of this component manually so that it contains two columns: code (integer, 5 characters) and count (integer, 5 characters). Fill the File Name field with /user/hdp/weblog/apache_code_cnt/part-r-00000. Upon completion of the component settings, press Ctrl+S to save your Job configurations.

5. 6. 7.

A.1.3.3. Running the Jobs

After the six Jobs are properly set up and configured, click the Run button on the Run tab or press F6 to run them one by one in the alphabetic order of the Job names, and view the execution results on the console of each Job. Upon successful execution of the last Job, the system console displays IP addresses and codes of successful service calls and their number of occurrences.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

It is possible to run all the Jobs in the required order at one click. To do so: 1. Drop a tRunJob component onto the design workspace of the first Job, A_HCatalog_Create in this example. This component appears as a subjob. Link the preceding subjob to the tRunJob component using a Trigger > On Subjob Ok connection.

3. 4.

Double-click the tRunJob component to open its Basic settings view. Click the [...] button next to the Job field to open the [Repository Content] dialog box. Select the Job that should be triggered after successful execution of the current Job, and click OK to close the dialog box. The next Job to run appears in the Job field.

Talend Open Studio for Big Data Getting Started Guide

Translating the scenario into Jobs

5. 6.

Double-click the tRunJob component again to open the next Job. Repeat the steps above until a tRunJob is configured in the Job E_Pig_Count_IPs to trigger the last Job, F_Read_Results. Run the first Job. The successful execution of each Job triggers the next Job, until all the Jobs are executed, and the execution results are displayed in the console of the first Job.

Talend Open Studio for Big Data Getting Started Guide

Data Quality: A Continuous Business Case
100% (1)
Data Quality: A Continuous Business Case
11 pages
Hadoop Echosystem and Ibm Big Insights: Rafie Tarabay Eng - Rafie@Mans - Edu.Eg
No ratings yet
Hadoop Echosystem and Ibm Big Insights: Rafie Tarabay Eng - Rafie@Mans - Edu.Eg
112 pages
Big Data and Apache Spark Overview
No ratings yet
Big Data and Apache Spark Overview
211 pages
Hive Queries
No ratings yet
Hive Queries
5 pages
Spring JDBC
No ratings yet
Spring JDBC
17 pages
Hive Using Hiveql
No ratings yet
Hive Using Hiveql
38 pages
Spark
No ratings yet
Spark
160 pages
What Is Apache Hadoop?: Ambari™
No ratings yet
What Is Apache Hadoop?: Ambari™
1 page
The Hadoop Distributed File System
No ratings yet
The Hadoop Distributed File System
44 pages
Mapr Snapshots
No ratings yet
Mapr Snapshots
31 pages
HDFS Intro
No ratings yet
HDFS Intro
9 pages
Tutorial-HDP-Administration V III
100% (1)
Tutorial-HDP-Administration V III
274 pages
Hadoop Lab
100% (2)
Hadoop Lab
6 pages
Spring Data Access: By, Srinivas Reddy.S
No ratings yet
Spring Data Access: By, Srinivas Reddy.S
21 pages
NoSQL Intro
No ratings yet
NoSQL Intro
26 pages
Hadoop Overview
100% (1)
Hadoop Overview
16 pages
Learning Datawarehouse, Informatica, Odi, Obiee.....
No ratings yet
Learning Datawarehouse, Informatica, Odi, Obiee.....
4 pages
HBase Guide for Developers
No ratings yet
HBase Guide for Developers
33 pages
Scala PDF
No ratings yet
Scala PDF
29 pages
2016 Big Data Analytics Market Study - Wisdom of Crowdsr Series - Licensed To Pentaho - Copyright 2016 Dresner Advisory Services
No ratings yet
2016 Big Data Analytics Market Study - Wisdom of Crowdsr Series - Licensed To Pentaho - Copyright 2016 Dresner Advisory Services
93 pages
Class: CS 237 Distributed Systems Middleware Instructor: Nalini Venkatasubramanian
No ratings yet
Class: CS 237 Distributed Systems Middleware Instructor: Nalini Venkatasubramanian
55 pages
Hive Join
No ratings yet
Hive Join
6 pages
CCA175 Demo Examenes
No ratings yet
CCA175 Demo Examenes
19 pages
Stream Processing at Lyft
No ratings yet
Stream Processing at Lyft
20 pages
Business Intelligence DW
No ratings yet
Business Intelligence DW
17 pages
High-Performance Web Apps With FastAPI: The Asynchronous Web Framework Based On Modern Python 1st Edition Malhar Lathkar Newest Edition 2025
0% (1)
High-Performance Web Apps With FastAPI: The Asynchronous Web Framework Based On Modern Python 1st Edition Malhar Lathkar Newest Edition 2025
127 pages
Bajaj Allianz Life Insurance Company.
100% (1)
Bajaj Allianz Life Insurance Company.
19 pages
Spark
No ratings yet
Spark
13 pages
HBase for Big Data Professionals
No ratings yet
HBase for Big Data Professionals
100 pages
Hadoop Data Transfer with Sqoop
No ratings yet
Hadoop Data Transfer with Sqoop
21 pages
Material For Student RWVCPC V012021A EN
No ratings yet
Material For Student RWVCPC V012021A EN
70 pages
PracticeExam DCADAS3 Scala 1
No ratings yet
PracticeExam DCADAS3 Scala 1
27 pages
Toad UserGuide
No ratings yet
Toad UserGuide
348 pages
Big Data Hadoop Certification Training Course
No ratings yet
Big Data Hadoop Certification Training Course
6 pages
Big Data Landscape Overview 2017
No ratings yet
Big Data Landscape Overview 2017
1 page
Hadoop & Kognitio Commands Guide
No ratings yet
Hadoop & Kognitio Commands Guide
1 page
Azure Cloud Intro
No ratings yet
Azure Cloud Intro
34 pages
Spark Training in Bangalore
No ratings yet
Spark Training in Bangalore
36 pages
Cognos Dynamic Cubes
No ratings yet
Cognos Dynamic Cubes
51 pages
Mongo DB
No ratings yet
Mongo DB
31 pages
MapR Sandbox For Hadoop DocUpdateFor3.1.1
No ratings yet
MapR Sandbox For Hadoop DocUpdateFor3.1.1
7 pages
Spark Sample Resume 2
100% (1)
Spark Sample Resume 2
7 pages
YARN Essentials - Sample Chapter
No ratings yet
YARN Essentials - Sample Chapter
12 pages
How Sqoop Works?: Sqoop "SQL To Hadoop and Hadoop To SQL"
No ratings yet
How Sqoop Works?: Sqoop "SQL To Hadoop and Hadoop To SQL"
27 pages
Hive Main Installation
No ratings yet
Hive Main Installation
2 pages
Hands-On Hadoop Tutorial
100% (1)
Hands-On Hadoop Tutorial
13 pages
DEV3600 LabGuide
No ratings yet
DEV3600 LabGuide
26 pages
Hortonworks Cluster Config Guide.1.0
No ratings yet
Hortonworks Cluster Config Guide.1.0
15 pages
Informatica Power Center Best Practices
No ratings yet
Informatica Power Center Best Practices
8 pages
Scala Basic Interview Questions
No ratings yet
Scala Basic Interview Questions
16 pages
Simply Rethink DB
No ratings yet
Simply Rethink DB
193 pages
Scaladayslambda Architecture Spark Cassandra Akka Kafka 150609194508 Lva1 App6891 PDF
No ratings yet
Scaladayslambda Architecture Spark Cassandra Akka Kafka 150609194508 Lva1 App6891 PDF
100 pages
Introduction To Information and Big Data Security
No ratings yet
Introduction To Information and Big Data Security
39 pages
Untitled
No ratings yet
Untitled
13 pages
Data Stream Processing Insights
No ratings yet
Data Stream Processing Insights
67 pages
TalendOpenStudio BigData ReleaseNotes 5.4.0 en
No ratings yet
TalendOpenStudio BigData ReleaseNotes 5.4.0 en
9 pages
Talend Etl
No ratings yet
Talend Etl
78 pages
Talend Data Integration
No ratings yet
Talend Data Integration
5 pages
CCS 101 Software and Network Midterm Lecture
No ratings yet
CCS 101 Software and Network Midterm Lecture
8 pages
Iot Based Anti-Piracy System
No ratings yet
Iot Based Anti-Piracy System
5 pages
Half A Million Secrets British English Student
No ratings yet
Half A Million Secrets British English Student
10 pages
FOURTH QUARTER EXAM in CSS-G10-2022-2023
100% (1)
FOURTH QUARTER EXAM in CSS-G10-2022-2023
2 pages
Technical Support Log Summary
No ratings yet
Technical Support Log Summary
20 pages
View Private Instagram Accounts
0% (2)
View Private Instagram Accounts
13 pages
Krishna Prasad SV Updated
No ratings yet
Krishna Prasad SV Updated
5 pages
ASTRO - More Lyrics English Translation Benefit Boys
No ratings yet
ASTRO - More Lyrics English Translation Benefit Boys
1 page
How To Find The Name of A Porn Star by Image - Quora
0% (1)
How To Find The Name of A Porn Star by Image - Quora
1 page
Wired Remote For Pioneer HU
No ratings yet
Wired Remote For Pioneer HU
22 pages
Unit-2 Working With Links, Images, Forms and Multimedia PDF
No ratings yet
Unit-2 Working With Links, Images, Forms and Multimedia PDF
66 pages
Akamai State of The Internet q4 2008
No ratings yet
Akamai State of The Internet q4 2008
19 pages
ZTE UMTS Power Control New
100% (3)
ZTE UMTS Power Control New
70 pages
Port Requirements For Microsoft Windows Server System
100% (2)
Port Requirements For Microsoft Windows Server System
8 pages
MPLS22SG Vol.1-Final PDF
No ratings yet
MPLS22SG Vol.1-Final PDF
310 pages
50 Greatest Short Stories by Terry Ox00027brien B015ri7rn2
33% (3)
50 Greatest Short Stories by Terry Ox00027brien B015ri7rn2
5 pages
Company Profile of Flipkart
No ratings yet
Company Profile of Flipkart
54 pages
Computer First Term Q1 Fill in The Blanks by Choosing The Correct Options (10x1 10)
100% (2)
Computer First Term Q1 Fill in The Blanks by Choosing The Correct Options (10x1 10)
5 pages
Global B2C E-Dispute Solutions
0% (1)
Global B2C E-Dispute Solutions
28 pages
Candidate User Guide
No ratings yet
Candidate User Guide
17 pages
Healthcare SEM PDF
No ratings yet
Healthcare SEM PDF
7 pages
Websense Web Security Gateway DLP ICAP Integration
No ratings yet
Websense Web Security Gateway DLP ICAP Integration
7 pages
Siteminder Concepts Guide
100% (2)
Siteminder Concepts Guide
78 pages
Advanced IP Addressing
No ratings yet
Advanced IP Addressing
14 pages
Trans YouTubers' Videos Misused
No ratings yet
Trans YouTubers' Videos Misused
3 pages
Productos Conectividad
No ratings yet
Productos Conectividad
2 pages
BDC For Vd01 Customer
No ratings yet
BDC For Vd01 Customer
6 pages
Web Site Design & Development Agreement
No ratings yet
Web Site Design & Development Agreement
4 pages
Report On Freelancing
No ratings yet
Report On Freelancing
8 pages
June 2012 Ugc Net Computer Science - Solved
No ratings yet
June 2012 Ugc Net Computer Science - Solved
20 pages