DataWarehousing & DataStage - Session 2
Data Stage : Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
What is Datastage?
Datastage is an ETL tool used to design jobs for Extraction, Transformation and Load Ideal Tool for data integration projects-such as data warehouses and data marts.
9/26/2013 4:25 AM
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
What is a DATA WAREHOUSE?
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.
9/26/2013 4:25 AM 3
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Data Warehouse
A data warehouse is a
subject-oriented integrated time-varying non-volatile
collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
9/26/2013 4:25 AM
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Data Warehouse Architecture
Relational Databases
Optimized Loader
ERP Systems
Extraction Cleansing Data Warehouse Engine Analyze Query
Purchased Data
Legacy Data Metadata Repository 9/26/2013 4:25 AM 5
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Why Data Warehouse?
Who are our lowest/highest margin customers ?
What is the most effective distribution channel?
Who are my customers and what products are they buying?
What product prom-otions have the biggest impact on revenue?
to the
competition
Which customers are most likely to go
9/26/2013 4:25 AM
What impact will new products/services have on revenue and margins?
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
The Motivation
DATA
INFORMATION
UNDERSTANDING
DECISION
9/26/2013 4:25 AM
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
INTRODUCTION TO DATASTAGE
9/26/2013 4:25 AM
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
What is DataStage ?
DataStage is a client server application. Server can be installed in either Windows or Unix Operating Systems. Client can be installed in Windows Communication between the client tools and DataStage server Design jobs for Extraction, Transformation, and Loading (ETL) Ideal Tool for data integration projects-such as data warehouses, data marts and system migration.
9/26/2013 4:25 AM
Data Stage: Basics
9/26/2013
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DATASTAGE -- ETL TOOL
Extract
Load
Transform
The Foundation for Business Intelligence
9/26/2013 4:25 AM
10
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Architecture
SERVER
Multiple processes dynamically stated to support Parallelization and development
Extraction/ Transformation rules and Metadata
NT/ UNIX
Intel Alpha Unix Solaris ENGINE
WIN 95/NT
MANAGER
CLIENT
DESIGNER
DIRECTOR
ADMIN
Graphical workflow style tools for point-and-click specifications of sources, targets and transformation requirements
9/26/2013 4:25 AM 11
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Server and Clients
9/26/2013 4:25 AM
12
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Terminology
Project: A Project is a collection of related Jobs Job : A job is an executable Program which is built using different stages in GUI Stages: They represent the processing steps required. Links: They represent the flow of data between different stages. Shared Containers: Defines reusable logic Sequences: Allows to run a sequence of related Jobs.
9/26/2013 4:25 AM
13
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DATASTAGE SERVER COMPONENTS:
REPOSITORY Central Store for all info required to build a DW
DATASTAGE SERVER Runs jobs
DATASTAGE CLIENT COMPONENTS:
ADMINISTRATOR To create and move projects, set up DS users DESIGNER Design Interface to create DS jobs DIRECTOR GUI to schedule, run and monitor jobs. MANAGER GUI to view and edit Repository contents
9/26/2013 4:25 AM 14
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Administrator
9/26/2013 4:25 AM
15
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Client Logon
9/26/2013 4:25 AM
16
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Manager
9/26/2013 4:25 AM
17
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Designer
9/26/2013 4:25 AM
18
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Director
9/26/2013 4:25 AM
19
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Developing in DataStage
Define global and project properties in Administrator Import meta data into Manager Build job in Designer Compile Designer Validate, run, and monitor in Director
9/26/2013 4:25 AM
20
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Projects
9/26/2013 4:25 AM
21
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Project Properties
Projects can be created and deleted in Administrator Project properties and defaults are set in Administrator
9/26/2013 4:25 AM
22
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Setting Project Properties To set project properties, log onto Administrator, select your project, and then click Properties
9/26/2013 4:25 AM
23
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Projects General Tab
9/26/2013 4:25 AM
24
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Environment Variables
9/26/2013 4:25 AM
25
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Permissions Tab
9/26/2013 4:25 AM
26
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Tunables Tab
9/26/2013 4:25 AM
27
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Parallel Tab
9/26/2013 4:25 AM
28
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Managing Meta Data
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
What Is Metadata?
Data
Source
Meta Data
Transform
Target
Meta Data
Meta Data Repository
9/26/2013 4:25 AM 30
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Manager
9/26/2013 4:25 AM
31
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Manager Contents
Metadata describing sources and targets: Table definitions DataStage objects: jobs, routines, table definitions, etc.
9/26/2013 4:25 AM
32
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Import and Export
Any object in Manager can be exported to a file Can export whole projects Use for backup Sometimes used for version control Can be used to move DataStage objects from one project to another Use to share DataStage jobs and projects with other developers
9/26/2013 4:25 AM
33
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Export Procedure
In Manager, click Export>DataStage Components Select DataStage objects for export Specified type of export: DSX, XML Specify file path on client machine
9/26/2013 4:25 AM
34
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Exporting DataStage Objects
9/26/2013 4:25 AM
35
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Exporting DataStage Objects
9/26/2013 4:25 AM
36
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Import Procedure
In Manager, click Import>DataStage Components Select DataStage objects for import
9/26/2013 4:25 AM
37
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Importing DataStage Objects
9/26/2013 4:25 AM
38
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Metadata Import
Import format and column destinations from sequential files Import relational table column destinations Imported as Table Definitions Table definitions can be loaded into job stages
9/26/2013 4:25 AM
39
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Sequential File Import Procedure
In Manager, click Import>Table Definitions>Sequential File Definitions Select directory containing sequential file and then the file Select Manager category Examined format and column definitions and edit is necessary
9/26/2013 4:25 AM
40
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Designing and Documenting Jobs
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
What Is a Job?
Executable DataStage program Created in DataStage Designer, but can use components from Manager Built using a graphical user interface Compiles into Orchestrate shell language (OSH)
9/26/2013 4:25 AM
42
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Job Development Overview
In Manager, import metadata defining sources and targets In Designer, add stages defining data extractions and loads And Transformers and other stages to defined data transformations Add links defining the flow of data from sources to targets Compiled the job In Director, validate, run, and monitor your job
9/26/2013 4:25 AM 43
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Designer Work Area
9/26/2013 4:25 AM
44
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Designer Toolbar
Show/hide metadata markers
Job properties
Compile
9/26/2013 4:25 AM
45
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Tools Palette
9/26/2013 4:25 AM
46
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Adding Stages and Links
Stages can be dragged from the tools palette or from the stage type branch of the repository view Links can be drawn from the tools palette or by right clicking and dragging from one stage to another
9/26/2013 4:25 AM
47
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Sequential File Stage
Used to extract data from, or load data to, a sequential file Specify full path to the file Specify a file format: fixed width or delimited Specified column definitions Specify write action
9/26/2013 4:25 AM
48
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Designer - Create New Job
9/26/2013 4:25 AM
49
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Drag Stages and Links Using Palette
9/26/2013 4:25 AM
50
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Assign Meta Data
9/26/2013 4:25 AM
51
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Editing a Sequential Source Stage
9/26/2013 4:25 AM
52
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Transformer Stage
Used to define constraints, derivations, and column mappings A column mapping maps an input column to an output column In this module will just defined column mappings (no derivations)
9/26/2013 4:25 AM
53
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Transformer Stage Elements
9/26/2013 4:25 AM
54
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Create Column Mappings
9/26/2013 4:25 AM
55
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Creating Stage Variables
9/26/2013 4:25 AM
56
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Result
9/26/2013 4:25 AM
57
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Adding Job Parameters
Makes the job more flexible Parameters can be:
Used in constraints and derivations Used in directory and file names
Parameter values are determined at run time
9/26/2013 4:25 AM
58
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Adding Job Documentation
Job Properties
Short and long descriptions
Shows in Manager
Annotation stage
Is a stage on the tool palette Shows on the job GUI (work area)
9/26/2013 4:25 AM
59
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Job Properties Documentation
9/26/2013 4:25 AM
60
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Annotation Stage on the Palette
9/26/2013 4:25 AM
61
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Annotation Stage Properties
9/26/2013 4:25 AM
62
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Final Job Work Area with Documentation
9/26/2013 4:25 AM
63
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Compiling a Job
9/26/2013 4:25 AM
64
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Errors or Successful Message
9/26/2013 4:25 AM
65
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Running Jobs
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Prerequisite to Job Execution
Result from Designer compile
9/26/2013 4:25 AM
67
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
DataStage Director
Can schedule, validating, and run jobs Can be invoked from DataStage Manager or Designer
Tools > Run Director
9/26/2013 4:25 AM
68
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Running Your Job
9/26/2013 4:25 AM
69
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Run Options Parameters and Limits
9/26/2013 4:25 AM
70
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Director Log View
9/26/2013 4:25 AM
71
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Message Details are Available
9/26/2013 4:25 AM
72
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Other Director Functions
Schedule job to run on a particular date/time Clear job log Set Director options
Row limits Abort after x warnings
9/26/2013 4:25 AM
73
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Standards and Techniques
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Job Presentation
Document using the annotation stage
9/26/2013 4:25 AM
75
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Naming conventions
Stages named after the
Data they access
Function they perform
DO NOT leave defaulted stage names like Sequential_File_0
Links named for the data they carry
DO NOT leave defaulted link names like DSLink3
9/26/2013 4:25 AM
76
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Stage and Link Names
Stages and links renamed to data they handle
9/26/2013 4:25 AM
77
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Create Reusable Job Components
Container
9/26/2013 4:25 AM
78
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Partitioning and Collecting Icons
Partitioner Collector
9/26/2013 4:25 AM
79
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
More Stages
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Transformer Stage Functions
Control data flow Create derivations
9/26/2013 4:25 AM
81
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Rejecting Data Example
Property Reject Mode = Output
If Not Found property
Constraint Other/log option
9/26/2013 4:25 AM
82
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Transformer Stage Properties
9/26/2013 4:25 AM
83
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Stage Variables
Show/Hide button
9/26/2013 4:25 AM
84
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
1. The Join Stage
Four types:
2 sorted input links, 1 output link "left outer" on primary input, "right outer" on secondary input Pre-sort make joins "lightweight": few rows need to be in RAM
Inner Left Outer Right Outer Full Outer
9/26/2013 4:25 AM
85
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Job Control Using the Job Sequencer
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Job Control Options
Manually write job control Code generated in Basic Use the job control tab on the job properties page Generates basic code which you can modify Job Sequencer Build a controlling job much the same way you build other jobs Comprised of stages and links
No basic coding
9/26/2013 4:25 AM 87
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Job Sequencer
Build like a regular job Type Job Sequence Has stages and links Job Activity stage represents a DataStage job Links represent passing control
9/26/2013 4:25 AM 88
Stages
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
Example
Job Activity stage contains conditional triggers
9/26/2013 4:25 AM
89
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com
DataWarehousing & DataStage - Session 2
QUESTIONS ??????????
9/26/2013 4:25 AM 90
Data Stage: Basics
Atul Singh
atulsingh@in.ibm.com