Talend Data Integration
Course Number:
Duration: 4 days
Overview
The program focuses on improving data handling and integration skills. Participants learn to
create ETL jobs that connect to almost any data source; filter, modify, and combine data;
build standalone jobs that run on a schedule or in response to an event; and make jobs more
user-friendly for non-technical users. The course also covers Talend Big Data integration
on the Hortonworks distribution.
Prerequisites
Participants should preferably have basic knowledge of a programming language such as Java.
Participants must be familiar with RDBMS concepts and SQL.
Materials
● Exercise Manual
● Slides
Software Needed on Each Student PC
● Laptop or desktop with i3 quad-core processor or faster
● 8GB RAM or higher
● Internet connection for all attendees and the instructor
● Note: If you do not have classroom computers available with this spec, we can
recommend a rental laptop vendor whose machines meet these specifications.
Objectives
All students will:
● Integrate many data sources in Talend
● Learn basic concepts of Big Data (Hadoop)
● Learn about Talend Studio
● Read and write data to/from HDFS (HDFS, HBase)
● Read and write tables to/from HDFS (Hive, Sqoop)
● Process tables stored on HDFS with Hive
● Process data stored on HDFS with Pig
● Use Talend Open Studio for Big Data for real work as quickly as possible
● Work on the Hortonworks Hadoop distribution
Outline
● Overview
o Introduction To Talend
o Why Talend?
o Talend Vs Other Tools
o Logical Architecture
o More On Data Integration Aspects
o Talend Big Data Integration
o Talend Open Studio Walkthrough
o Key Components In Palette
o Conclusion
● Introduction And General Principles
o Before You Begin
o Installing The Software
o Enabling tHashInput And tHashOutput
● Metadata And Schemas
o Introduction
o Hand-Cranking A Built-In Schema
o Propagating Schema Changes
o Creating A Generic Schema From The Existing Metadata
o Cutting And Pasting Schema Information
o Dropping Schemas To Empty Components
o Creating Schemas From Lists
● Validating Data
o Introduction
o Enabling And Disabling Reject Flows
o Gathering All Rejects Prior To Killing A Job
o Validating Against The Schema
o Rejecting Rows Using tMap
o Checking A Column Against A List Of Allowed Values
o Checking A Column Against A Lookup
o Creating Validation Rules For More Complex Requirements
o Creating Binary Error Codes To Store Multiple Test Results
● Mapping Data
o Introduction
o Simple Mapping And tMap Time Savers
o Creating tMap Expressions
o Using The Ternary Operator For Conditional Logic
o Using Intermediate Variables In tMap
o Filtering Input Rows
o Splitting An Input Row Into Multiple Outputs Based On Input Conditions
o Joining Data Using tMap
o Hierarchical Joins Using tMap
o Using Reload At Each Row To Process Real-Time / Near Real-Time Data
● Using Java in Talend
o Introduction
o Performing One-Off Pieces Of Logic Using tJava
o Setting The Context And globalMap Variables Using tJava
o Adding Complex Logic Into A Flow Using tJavaRow
o Creating Pseudo Components Using tJavaFlex
o Creating Custom Functions Using Code Routines
o Importing JAR Files To Allow Use Of External Java Classes
● Managing Context Variables
o Introduction
o Creating A Context Group
o Adding A Context Group To Your Job
o Adding Contexts To A Context Group
o Using tContextLoad To Load Contexts
o Using Implicit Context Loading To Load Contexts
o Turning Implicit Context Loading On And Off In A Job
o Setting The Context File Location In The Operating System
● Working With Databases
o Introduction
o Setting Up A Database Connection
o Importing The Table Schemas
o Reading From Database Tables
o Using Context And globalMap Variables In SQL Queries
o Printing Your Input Query
o Writing To A Database Table
o Printing Your Output Query
o Managing Database Sessions
o Passing A Session To A Child Job
o Selecting Different Fields And Keys For Insert, Update, And Delete
o Capturing Individual Rejects And Errors
o Database And Table Management
o Managing Surrogate Keys For Parent And Child Tables
o Rewritable Lookups Using An In-Process Database
● Managing Files
o Introduction
o Appending Records To A File
o Reading Rows Using A Regular Expression
o Using Temporary Files
o Storing Intermediate Data In Memory Using tHashMap
o Reading Headers And Trailers Using tMap
o Reading Headers And Trailers With No Identifiers
o Using The Information In The Header And Trailer
o Adding A Header And Trailer To A File
o Moving, Copying, Renaming, And Deleting Files And Folders
o Capturing File Information
o Processing Multiple Files At Once
o Processing Control/Validation Files
o Creating And Writing Files Depending On The Input Data
● Working With XML, Queues, And Web Services
o Introduction
o Using tXMLMap To Read XML
o Using tXMLMap To Create An XML Document
o Reading Complex Hierarchical XML
o Writing Complex XML
o Calling A SOAP Web Service
o Calling A RESTful Web Service
o Reading And Writing To A Queue
o Ensuring Lossless Queues Using Sessions
● Debugging, Logging, And Testing
o Introduction
o Finding The Location Of Compilation Errors Using The Problems Tab
o Locating Execution Errors From The Console Output
o Using The Talend Debug Mode – Row-By-Row Execution
o Using The Java Debugger To Debug Talend Jobs
o Using tLogRow To Show Data In A Row
o Using tJavaRow To Display Row Information
o Using tJava To Display Status Messages And Variables
o Printing Out The Context
o Dumping The Console Output To A File From Within A Job
o Creating Simple Test Data Using tRowGenerator
o Creating Complex Test Data Using tRowGenerator, tFlowToIterate, tMap, And
Sequences
o Creating Random Test Data Using Lookups
o Creating Test Data Using Excel
o Testing Logic – The Most-Used Pattern
o Killing A Job From Within tJavaRow
● Deploying And Scheduling Talend Code
o Introduction
o Creating Compiled Executables
o Using A Different Context
o Adding Command-Line Context Parameters
o Managing Job Dependencies
o Capturing And Acting On Different Return Codes
o Returning Codes From A Child Job Without tDie
o Passing Parameters To A Child Job
o Executing Non-Talend Objects And Operating System Commands
● Common Mistakes And Other Useful Hints And Tips
o Introduction
o My Tab Is Missing
o Finding The Code Routine
o Finding A New Context Variable
o Global Variables Going Missing With Reload At Each Row
o Dragging Component globalMap Variables
o Some Complex Date Formats
o Capturing tMap Rejects
o Adding Job Name, Project Name, And Other Job-Specific Information
o Printing tMap Variables
o Stopping Memory Errors In Talend
● Software Development Lifecycle
o Working With Git And Talend
o How To Perform CI/CD With Jenkins And Talend
o Job Monitoring Using Resource Manager UI
o Unit Testing
o Best Practices
o Joblets
o Parallelization
o Reusing Jobs (Child Jobs)
o Context Variables
o Repository
● Getting Started With A Basic Big Data Job
o Creating A Job
o Adding Components To The Job
o Connecting The Components Together
o Configuring The Components
o Executing The Job
o Various Types Of Big Data Jobs
o Pig Workflow
o Reading And Writing To Hive On Hadoop
o Working With HDFS
o Performing Sqoop
o Using Spark In Talend
o Kafka
● Carparts Project
o Creating A Spark Batch Job
o Use Cases
o Scenario: Carparts_Demoprep
o Scenario: Carparts_Etl
o Scenario: Carparts01_Spark
o Scenario: Loadcarpartsinhdfs