InfoSphere DataStage
Workshop
© Copyright IBM Corporation 2009
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Course Contents
● Mod 01: Introduction
● Mod 02: Deployment
● Mod 03: Administering DataStage
● Mod 04: DataStage Designer
● Mod 05: Creating Parallel Jobs
● Mod 06: Accessing Sequential Data
● Mod 07: Platform Architecture
● Mod 08: Combining Data
● Mod 09: Sorting and Aggregating Data
● Mod 10: Transforming Data
● Mod 11: Working With Relational Data
● Mod 12: Metadata in the Parallel Framework
● Mod 13: Job Control
IBM InfoSphere DataStage
Mod 01: Introduction
Unit objectives
After completing this unit, you should be able to:
● List and describe the uses of DataStage
● List and describe the DataStage clients
● Describe the DataStage workflow
● List and compare the different types of DataStage jobs
● Describe the two types of parallelism exhibited by DataStage
parallel jobs
What is IBM InfoSphere DataStage?
● Design jobs for Extraction, Transformation, and Loading
(ETL)
● Ideal tool for data integration projects – such as data
warehouses, data marts, and system migrations
● Import, export, create, and manage metadata for use
within jobs
● Schedule, run, and monitor jobs, all within DataStage
● Administer your DataStage development and execution
environments
● Create batch (controlling) jobs
IBM Information Server
● Suite of applications, including DataStage, that:
– Share a common repository
• DB2, by default
– Share a common set of application services and functionality
• Provided by Metadata Server components hosted by an application
server
– IBM InfoSphere Application Server
• Provided services include:
– Security
– Repository
– Logging and reporting
– Metadata management
● Managed using web console clients
– Administration console
– Reporting console
Information Server Backbone
Diagram: the Information Server backbone. The suite products (Information Services Director, Business Glossary, Information Analyzer, DataStage, QualityStage, MetaBrokers, Federation Server) sit on top of Metadata Access Services and Metadata Analysis Services, which are hosted by the Metadata Server. The suite is administered through the Information Server console.
Information Server Administration Console
DataStage Architecture
Diagram: DataStage clients connect to the engines (the parallel engine and the server engine), which share a common repository.
DataStage Administrator
DataStage Designer
Screenshot: DataStage Designer displaying a DataStage parallel job.
DataStage Director
Screenshot: DataStage Director displaying log message events.
Developing in DataStage
● Define global and project properties in Administrator
● Import metadata into the Repository
● Build job in Designer
● Compile job in Designer
● Run and monitor job log messages in Director
– Jobs can also be run in Designer, but job log messages cannot be
viewed in Designer
DataStage Project Repository
Screenshot: the project repository tree, showing a user-added folder, the standard Jobs folder, and the standard Table Definitions folder.
Types of DataStage Jobs
● Server jobs
– Executed by the DataStage server engine
– Compiled into Basic
– Runtime monitoring in DataStage Director
● Parallel jobs
– Executed by the DataStage parallel engine
– Built-in functionality for pipeline and partition parallelism
– Compiled into OSH (Orchestrate Scripting Language)
• OSH executes Operators
– Executable C++ class instances
– Runtime monitoring in DataStage Director
● Job sequences (batch jobs, controlling jobs)
– Master Server jobs that kick off jobs and other activities
– Can kick off Server or Parallel jobs
– Runtime monitoring in DataStage Director
– Executed by the Server engine
Design Elements of Parallel Jobs
● Stages
– Implemented as OSH operators (pre-built components)
– Passive stages (E and L of ETL)
• Read data
• Write data
• E.g., Sequential File, DB2, Oracle, Peek stages
– Processor (active) stages (T of ETL)
• Transform data
• Filter data
• Aggregate data
• Generate data
• Split / Merge data
• E.g., Transformer, Aggregator, Join, Sort stages
● Links
– “Pipes” through which the data moves from stage to stage
Pipeline Parallelism
● Transform, clean, load processes execute simultaneously
● Like a conveyor belt moving rows from process to process
– Start downstream process while upstream process is running
● Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
● Still has limits on scalability
Partition Parallelism
● Divide the incoming stream of data into subsets to be
separately processed by an operator
– Subsets are called partitions
● Each partition of data is processed by the same operator
– E.g., if operation is Filter, each partition will be filtered in exactly
the same way
● Facilitates near-linear scalability
– 8 times faster on 8 processors
– 24 times faster on 24 processors
– This assumes the data is evenly distributed
Three-Node Partitioning
Diagram: the incoming data is split into subset1, subset2, and subset3; each subset is processed by its own copy of the operation on Node 1, Node 2, and Node 3.
● Here the data is partitioned into three partitions
● The operation is performed on each partition of data
separately and in parallel
● If the data is evenly distributed, the data will be processed
three times faster
Job Design v. Execution
User designs the flow in DataStage Designer
… at runtime, this job runs in parallel for any configuration
of partitions (called nodes)
Checkpoint
1. True or false: DataStage Director is used to build and
compile your ETL jobs
2. True or false: Use Designer to monitor your job during
execution
3. True or false: Administrator is used to set global and
project properties
Unit summary
Having completed this unit, you should be able to:
● List and describe the uses of DataStage
● List and describe the DataStage clients
● Describe the DataStage workflow
● List and compare the different types of DataStage jobs
● Describe the two types of parallelism exhibited by DataStage
parallel jobs
IBM InfoSphere DataStage
Mod 02: Deployment
Unit objectives
After completing this unit, you should be able to:
● Identify the components of Information Server that need to
be installed
● Describe what a deployment domain consists of
● Describe different domain deployment options
● Describe the installation process
● Start the Information Server
What Gets Deployed
An Information Server domain, consisting of the following:
● MetaData Server, hosted by an IBM InfoSphere Application Server
instance
● One or more DataStage servers
– DataStage server includes both the parallel and server engines
● One DB2 UDB instance containing the Repository database
● Information Server clients
– Administration console
– Reporting console
– DataStage clients
• Administrator
• Designer
• Director
● Additional Information Server applications
– Information Analyzer
– Business Glossary
– Rational Data Architect
– Information Services Director
– Federation Server
Deployment: Everything on One Machine
● Here we have a single domain with the hosted applications all
on one machine
● Additional client workstations can connect to this machine
using TCP/IP

Diagram: one machine hosting the MetaData Server backbone, the DataStage server, and the DB2 instance with the Repository; client workstations connect remotely.
Deployment: DataStage on Separate Machine
● Here the domain is split between two machines:
– DataStage server
– MetaData Server and DB2 Repository

Diagram: clients connect to a DataStage server machine and to a second machine hosting the MetaData Server backbone and the DB2 instance with the Repository.
MetaData Server and DB2 On Separate Machines
MetaData Server backbone
● Here the domain is split
between three
machines
– DataStage server
– MetaData server
– DB2 Repository
Clients
DataStage server DB2 instance with
Repository
© Copyright IBM Corporation 2009 28
Information Server Installation
Installation Configuration Layers
● Configuration layers include:
– Client
• DataStage and Information Server clients
– Engine
• DataStage and other application engines
– Domain
• MetaData Server and hosted metadata server components
• Installed products domain components
– Repository
• Repository database server and database
– Documentation
● Selected layers are installed on the machine local to the
installation
● Already existing components can be configured and used
– E.g., DB2, InfoSphere Application Server
Information Server Start-Up
– Start the MetaData Server
• From the Windows Start menu, click “Start the Server” under the
profile to be used (e.g., “default”)
• From the command line, open the profile bin directory
– Enter “startup server1”
> server1 is the default name of the application server hosting the MetaData
Server
– Start the ASB agent
• From Windows Start menu, click “Start the agent” after selecting
the Information Server folder
• Only required if DataStage and the MetaData Server are on
different machines
– To begin work in DataStage, double-click on a DataStage client
icon
– To begin work in the Administration and Reporting consoles,
double-click on the Web Console for Information Server icon
Starting the MetaData Server Backbone
Screenshot: in the Start menu, the Application Server > Profiles folder > profile > Start the Server item; on disk, the profile\bin directory containing the startup command.
Starting the ASB Agent
Start the agent
Checkpoint
1. What application components make up a domain?
2. Can a domain contain multiple DataStage servers?
3. Do the DB2 instance and the Repository database need
to be on the same machine as the Application Server?
4. Suppose DataStage is on a separate machine from the
Application Server. What two components need to be
running before you can log onto DataStage?
Unit summary
Having completed this unit, you should be able to:
● Identify the components of Information Server that need to
be installed
● Describe what a deployment domain consists of
● Describe different domain deployment options
● Describe the installation process
● Start the Information Server
IBM InfoSphere DataStage
Mod 03: Administering
DataStage
Unit objectives
After completing this unit, you should be able to:
● Open the Administrative console
● Create new users and groups
● Assign Suite roles and Product roles to users and groups
● Give users DataStage credentials
● Log onto DataStage Administrator
● Add a DataStage user on the Permissions tab and specify
the user’s role
● Specify DataStage global and project defaults
● List and describe important environment variables
Information Server Administration Web Console
● Web application for administering Information Server
● Use for:
– Domain management
– Session management
– Management of users and groups
– Logging management
– Scheduling management
Opening the Administration Web Console
Screenshot: the web console logon page, specifying the Information Server console address and the Information Server administrator ID.
Users and Group Management
User and Group Management
● Suite authorizations can be provided to users or groups
– Users that are members of a group acquire the authorizations of the group
● Authorizations are provided in the form of roles
– Two types of roles
• Suite roles: Apply to the Suite
• Suite Component roles: Apply to a specific product or component of
Information Server, e.g., DataStage
● Suite roles
– Administrator
• Perform user and group management tasks
• Includes all the privileges of the Suite User role
– User
• Create views of scheduled tasks and logged messages
• Create and run reports
● Suite Component roles
– DataStage
• DataStage user
– Permissions are assigned within DataStage
> Developer, Operator, Super Operator, Production Manager
• DataStage administrator
– Full permissions to work in DataStage Administrator, Designer, and Director
– And so on, for all products in the Suite
Creating a DataStage User ID
Screenshot: the Users pane of the Administration console, with a button to create a new user.
Assigning DataStage Roles
Screenshot: assigning roles to a user ID in the Users pane: the Suite Administrator role, the Suite User role, the DataStage User role, and the DataStage Administrator role.
DataStage Credential Mapping
● DataStage credentials
– Required by DataStage in order to log onto a DataStage client
– Managed by the DataStage Server operating system or LDAP
● Users given DataStage Suite roles in the Suite
Administration console do not automatically receive
DataStage credentials
– Users need to be mapped to a user who has DataStage credentials
on the DataStage Server machine
• This DataStage user must have file access permission to the
DataStage engine/project files or Administrator rights on the
operating system
● Three ways of mapping users (IS ID → OS ID)
– Many to 1
– 1 to 1
– Combination of the above
DataStage Credential Mapping
Screenshot: mapping a user to a DataStage credential.
DataStage Administrator
Logging onto DataStage Administrator
Screenshot: the Attach window, specifying the host name and port number of the application server, the DataStage administrator ID and password, and the name or IP address of the DataStage server machine.
DataStage Administrator Projects Tab
Screenshot: the Projects tab, listing server projects, with a button to specify project properties and a link to the Information Server Administration console.
DataStage Administrator General Tab
Screenshot: the General tab, with options to enable runtime column propagation (RCP) by default, specify auto purge of the job log, and open the environment variable settings.
Environment Variables
Parallel job variables
User-defined variables
Environment Reporting Variables
Display Score
Display record
counts
Display OSH
DataStage Administrator Permissions Tab
Assigned product
role
Add DataStage
users
Adding Users and Groups
Screenshot: adding DataStage users from the list of available users and groups; each must have a DataStage product User role.
Specify DataStage Role
Added DataStage
user
Select DataStage
role
DataStage Administrator Tracing Tab
DataStage Administrator Parallel Tab
DataStage Administrator Sequence Tab
Checkpoint
1. Authorizations can be assigned to what two items?
2. What two types of authorization roles can be assigned to a
user or group?
3. In addition to Suite authorization to log onto DataStage,
what else does a DataStage developer require to work in
DataStage?
Unit summary
Having completed this unit, you should be able to:
● Open the Administrative console
● Create new users and groups
● Assign Suite roles and Product roles to users and groups
● Give users DataStage credentials
● Log onto DataStage Administrator
● Add a DataStage user on the Permissions tab and specify
the user’s role
● Specify DataStage global and project defaults
● List and describe important environment variables
IBM InfoSphere DataStage
Mod 04: DataStage Designer
Unit objectives
After completing this unit, you should be able to:
● Log onto DataStage
● Navigate around DataStage Designer
● Import and export DataStage objects to a file
● Import a Table Definition for a sequential file
Logging onto DataStage Designer
Screenshot: the Attach window, specifying the host name and port number of the application server and the DataStage server machine/project.
Designer Work Area
Screenshot: the Designer work area, showing the repository window, menus, toolbar, parallel canvas, and palette.
Importing and Exporting
DataStage Objects
Repository Window
Screenshot: the repository window, with a search box for objects in the project, the project node, the default Jobs folder, and the default Table Definitions folder.
Import and Export
● Any object in the repository window can be exported to a
file
● Can export whole projects
● Use for backup
● Sometimes used for version control
● Can be used to move DataStage objects from one
project to another
● Use to share DataStage jobs and projects with other
developers
Export Procedure
● Click “Export>DataStage Components”
● Select DataStage objects for export
● Specify type of export:
– DSX: Default format
– XML: Enables processing of export file by XML applications,
e.g., for generating reports
● Specify file path on client machine
Export Window
Screenshot: the Export window. Click to select objects from the repository; the window shows the selected objects, the export type, and the export location on the client system, with a button to begin the export.
Import Procedure
● Click “Import>DataStage Components”
– Or “Import>DataStage Components (XML)” if you are importing
an XML-format export file
● Select DataStage objects for import
Import Options
Import all objects in
the file
Display list to
select from
Import/Export Command Line Interface
● Command: istool
● Located in \IBM\InformationServer\Clients\istools\clients
– Documentation in same location
● Invoke through Programs > IBM Information Server >
Information Server Command Line Interface
● Examples:
– istool -hostname domain:9080 -username dsuser -password
inf0server -deployment-package importfile.dsx import
– istool -hostname domain:9080 -username dsuser -password
inf0server -job GenDataJob -filename exportfile.dsx export
Information Server Manager
Information Server Manager
● Create and deploy packages of DataStage objects
● Perform DataStage object exports / imports
– Similar to DataStage Designer exports / imports
• Export files are stored as archives (extension .isx)
● Provides a domain-wide view of the DataStage Repository
– Displays all DataStage servers
• E.g., a Development server and a Production server
– Displays all DataStage projects on each server
● Before you can use it you must add a domain
– Right-click over Repository, click Add Domain
• Select domain
• Specify user ID / password
● Invoke through the Start menu: IBM Information Server >
Information Server Manager
Adding a Domain
Screenshot: adding an Information Server domain; a domain may contain multiple DataStage systems.
Information Server Manager Window
Screenshot: the Information Server Manager window, showing a created export archive and the DataStage Repository view of all DataStage servers and projects.
Information Server Manager – Package View
● Multiple packages per domain
● Multiple builds within each package
● Build history

Screenshot: the package view hierarchy: domain, packages, builds, objects.
Importing Table Definitions
Importing Table Definitions
● Table Definitions describe the format and columns
of files and tables
● You can import Table Definitions for:
– Sequential files
– Relational tables
– COBOL files
– Many other things
● Table Definitions can be loaded into job stages
Sequential File Import Procedure
● Click Import>Table Definitions>Sequential File Definitions
● Select directory containing sequential file and then the file
● Select a repository folder to store the Table Definition in
● Examine the format and column definitions and edit as
necessary
Importing Sequential Metadata
Import Table
Definition for
a sequential
file
Sequential Import Window
Screenshot: the import window. Select the directory containing the files, select the file, select the repository folder, then start the import.
Specify Format
Screenshot: the format window. Callouts: edit columns, select whether the first row has column names, and set the delimiter.
Edit Column Names and Types
Double-click to define
extended properties
Extended Properties window
Parallel properties
Property
categories
Available
properties
Table Definition General Tab
Screenshot: the General tab of a stored Table Definition, showing its type and source.
Checkpoint
● True or False? The directory to which you export is on the
DataStage client machine, not on the DataStage server
machine.
● Can you import Table Definitions for sequential files with
fixed-length record formats?
Unit summary
Having completed this unit, you should be able to:
● Log onto DataStage
● Navigate around DataStage Designer
● Import and export DataStage objects to a file
● Import a Table Definition for a sequential file
IBM InfoSphere DataStage
Mod 05: Creating Parallel Jobs
Unit objectives
After completing this unit, you should be able to:
● Design a simple Parallel job in Designer
● Define a job parameter
● Use the Row Generator, Peek, and Annotation stages in a
job
● Compile your job
● Run your job in Director
● View the job log
What Is a Parallel Job?
● Executable DataStage program
● Created in DataStage Designer
– Can use components from Repository
● Built using a graphical user interface
● Compiles into Orchestrate script language (OSH) and object
code (from generated C++)
Job Development Overview
● Import metadata defining sources and targets
– Done within Designer using import process
● In Designer, add stages defining data extractions and loads
● Add processing stages to define data transformations
● Add links defining the flow of data from sources to targets
● Compile the job
● In Director, validate, run, and monitor your job
– Can also run the job in Designer
– Can only view the job log in Director
Tools Palette
Stage
categories
Stages
Adding Stages and Links
● Drag stages from the Tools Palette to the diagram
– Can also be dragged from Stage Type branch to the diagram
● Draw links from source to target stage
– Press the right mouse button over the source stage
– Release the mouse button over the target stage
Job Creation Example Sequence
● Brief walkthrough of procedure
● Assumes Table Definition of source already exists in the
repository
Create New Parallel Job
Open
New
window
Parallel job
Drag Stages and Links From Palette
Screenshot: the Designer canvas with a Row Generator stage linked to a Peek stage; callouts mark the Compile and Job properties toolbar buttons.
Renaming Links and Stages
● Click on a stage or link to
rename it
● Meaningful names have
many benefits
– Documentation
– Clarity
– Fewer development errors
Row Generator Stage
● Produces mock data for specified columns
● No input link; single output link
● On Properties tab, specify number of rows
● On Columns tab, load or specify column definitions
– Open Extended Properties window to specify the values to be
generated for that column
– A number of algorithms for generating values are available
depending on the data type (see the sketch below)
● Algorithms for Integer type
– Random: seed, limit
– Cycle: initial value, increment
● Algorithms for string type: Cycle, alphabet
● Algorithms for date type: Random, Cycle
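To make the Cycle and Random algorithms concrete, here is a minimal Python sketch (not DataStage code; the function names are illustrative only) of how such mock-data generators behave:

import random
from itertools import count, islice

def cycle_integers(initial=0, increment=1):
    # Cycle algorithm for integers: start at an initial value, add a fixed increment.
    return count(initial, increment)

def random_integers(seed=42, limit=100):
    # Random algorithm for integers: seeded values in the range [0, limit).
    rng = random.Random(seed)
    while True:
        yield rng.randrange(limit)

# Ten mock rows with one cycled column and one random column.
rows = list(islice(zip(cycle_integers(1, 1), random_integers()), 10))
print(rows)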
Inside the Row Generator Stage
Properties
tab
Set property
value
Property
Columns Tab
Screenshot: the Columns tab. Double-click a column to specify extended properties; buttons let you view data, select a Table Definition, and load a Table Definition.
Extended Properties
Specified
properties and
their values
Additional
properties to add
Peek Stage
● Displays field values
– Displayed in job log or sent to a file
– Skip records option
– Can control number of records to be displayed
– Shows data in each partition, labeled 0, 1, 2, …
● Useful stub stage for iterative job development
– Develop job to a stopping point and check the data
Peek Stage Properties
Output to
job log
Job Parameters
● Defined in Job Properties window
● Makes the job more flexible
● Parameters can be:
– Used in directory and file names
– Used to specify property values
– Used in constraints and derivations
● Parameter values are determined at run time
● When used for directory and file names and property
values, they are surrounded with pound signs (#)
– E.g., #NumRows#
● Job parameters can reference DataStage environment
variables
– Prefaced by $, e.g., $APT_CONFIG_FILE
Defining a Job Parameter
Parameters tab
Parameter
Add environment
variable
Using a Job Parameter in a Stage
Screenshot: a stage property whose value references a job parameter; click the arrow button to insert a job parameter.
Adding Job Documentation
● Job Properties
– Short and long descriptions
● Annotation stage
– Added from the Tools Palette
– Displays formatted text descriptions on diagram
Job Properties Documentation
Documentation
Annotation Stage Properties
Compiling a Job
Run
Compile
Errors or Successful Message
Screenshot: the compile message window; the stage with the error is highlighted, and a button gives more info.
Running Jobs and Viewing the Job
Log in Designer
© Copyright IBM Corporation 2009 112
DataStage Director
● Use to run and schedule jobs
● View runtime messages
● Can invoke directly from Designer
– Tools > Run Director
Run Options
Assign values to
parameter
Run Options
Stop after number
of warnings
Stop after number
of rows
Director Status View
Screenshot: the Director Status view, with toolbar buttons for the Status, Schedule, and Log views. Select a job to view its messages in the Log view.
Director Log View
Click the notepad
icon to view log
messages
Peek messages
Message Details
Other Director Functions
● Schedule job to run on a particular date/time
● Clear job log of messages
● Set job log purging conditions
● Set Director options
– Row limits (server job only)
– Abort after x warnings
Running Jobs from Command Line
● dsjob -run -param numrows=10 dx444 GenDataJob
– Runs a job
– Use -run to run the job
– Use -param to specify parameters
– In this example, dx444 is the name of the project
– In this example, GenDataJob is the name of the job
● dsjob -logsum dx444 GenDataJob
– Displays a job’s messages in the log
● Documented in the “Parallel Job Advanced Developer’s
Guide”
Checkpoint
1. Which stage can be used to display output data in the job
log?
2. Which stage is used for documenting your job on the job
canvas?
3. What command is used to run jobs from the operating
system command line?
Unit summary
Having completed this unit, you should be able to:
● Design a simple Parallel job in Designer
● Define a job parameter
● Use the Row Generator, Peek, and Annotation stages in a
job
● Compile your job
● Run your job in Director
● View the job log
IBM InfoSphere DataStage
Mod 06: Accessing Sequential
Data
Unit objectives
After completing this unit, you should be able to:
● Understand the stages for accessing different kinds of file data
– Sequential File stage
– Data Set stage
● Create jobs that read from and write to sequential files
● Create Reject links
● Work with NULLs in sequential files
● Read from multiple files using file patterns
● Use multiple readers
Types of File Data
● Sequential
– Fixed or variable length
● Data Set
● Complex flat file
How Sequential Data is Handled
● Import and export operators are generated
– Stages get translated into operators during the compile
● Import operators convert data from the external format,
as described by the Table Definition, to the framework
internal format
– Internally, the format of data is described by schemas
● Export operators reverse the process
● Messages in the job log use the “import” / “export”
terminology
– E.g., “100 records imported successfully; 2 rejected”
– E.g., “100 records exported successfully; 0 rejected”
– Records get rejected when they cannot be converted correctly
during the import or export
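As an illustration of the import idea, the following minimal Python sketch (not the actual import operator) converts comma-delimited text records into typed internal records against a two-column table definition (ID integer, Name string) and counts rejects, mirroring the log terminology above:

def import_records(lines):
    imported, rejected = [], []
    for line in lines:
        fields = line.rstrip("\n").split(",")
        try:
            if len(fields) != 2:
                raise ValueError("wrong number of columns")
            imported.append({"ID": int(fields[0]), "Name": fields[1]})
        except ValueError:
            rejected.append(line)  # row could not be converted correctly
    return imported, rejected

rows, rejects = import_records(["1,Ford", "2,Dodge", "x,y,z"])
print(f"{len(rows)} records imported successfully; {len(rejects)} rejected")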
Using the Sequential File Stage
Both import and export of sequential files (text, binary) can
be performed by the Sequential File stage.
Importing/exporting data:
– Data import: external format → internal format
– Data export: internal format → external format
Features of Sequential File Stage
● Normally executes in sequential mode
● Executes in parallel when reading multiple files
● Can use multiple readers within a node
– Reads chunks of a single file in parallel
● The stage needs to be told:
– How file is divided into rows (record format)
– How row is divided into columns (column format)
Sequential File Format Example
Each record ends with the record delimiter (nl); fields are separated by the field delimiter (comma):

Field11 , Field12 , Field13 , Last field nl      (Final Delimiter = end)
Field11 , Field12 , Field13 , Last field , nl    (Final Delimiter = comma)
Sequential File Stage Rules
● One input link
● One stream output link
● Optionally, one reject link
– Will reject any records not matching metadata in the column
definitions
• Example: You specify three columns separated by commas, but
the row that’s read had no commas in it
• Example: The second column is designated a decimal in the
Table Definition, but contains alphabetic characters
Job Design Using Sequential Stages
Stream link
Reject link
(broken line)
Input Sequential Stage Properties
Output tab
File to
access
Column names
in first row
Click to add more
files to read
Format Tab
Record format
Load from Table
Definition
Column format
Sequential Source Columns Tab
View data
Load from Table
Definition
Save as a new
Table Definition
Reading Using a File Pattern
Use wild
cards
Select File
Pattern
Properties - Multiple Readers
Multiple readers option allows
you to set number of readers
per node
Sequential Stage As a Target
Input Tab
Append /
Overwrite
Reject Link
● Reject mode =
– Continue: Continue reading
records
– Fail: Abort job
– Output: Send down output
link
● In a source stage
– All records not matching the
metadata (column definitions)
are rejected
● In a target stage
– All records that fail to be
written for any reason
● Rejected records consist of
one column, datatype = raw
Reject mode property
Inside the Copy Stage
Column
mappings
Reading and Writing NULL Values to a
Sequential File
Working with NULLs
● Internally, NULL is represented by a special value outside
the range of any existing, legitimate values
● If NULL is written to a non-nullable column, the job will abort
● Columns can be specified as nullable
– NULLs can be written to nullable columns
● You must “handle” NULLs written to nullable columns in a
Sequential File stage
– You need to tell DataStage what value to write to the file
– Unhandled rows are rejected
● In a Sequential File source stage, you can specify values you
want DataStage to convert to NULLs
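A minimal sketch of the idea in Python, assuming a null field value of "NULL" has been specified in the stage (the specific value is an assumption):

NULL_FIELD_VALUE = "NULL"  # assumption: the value specified in the stage

def export_field(value, nullable=True):
    # Writing: substitute the chosen string for NULL in nullable columns.
    if value is None:
        if not nullable:
            raise ValueError("NULL written to a non-nullable column")
        return NULL_FIELD_VALUE
    return str(value)

def import_field(text):
    # Reading: convert the chosen string back to NULL.
    return None if text == NULL_FIELD_VALUE else text

line = ",".join(export_field(v) for v in ["Ford", None])
print(line)                                         # Ford,NULL
print([import_field(f) for f in line.split(",")])   # ['Ford', None]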
Specifying a Value for NULL
Nullable
column
Added
property
DataSet Stage
Data Set
● Binary data file
● Preserves partitioning
– Component dataset files are written to each partition
● Suffixed by .ds
● Referred to by a header file
● Managed by Data Set Management utility from GUI (Manager,
Designer, Director)
● Represents persistent data
● Key to good performance in set of linked jobs
– No import / export conversions are needed
– No repartitioning needed
● Accessed using DataSet stage
● Implemented with two types of components:
– Descriptor file:
• contains metadata, data location, but NOT the data itself
– Data file(s)
• contain the data
• multiple files, one per partition (node)
Job With DataSet Stage
DataSet stage
DataSet stage
properties
Data Set Management Utility
Display schema
Display data
Display record
counts for each
data file
Data and Schema Displayed
Data viewer
Schema describing the
format of the data
File Set Stage
● Use to read and write to filesets
● Files suffixed by .fs
● Files are similar to a dataset
– Partitioned
– Implemented with header file and data files
● How filesets differ from datasets
– Data files are text files
• Hence readable by external applications
– Datasets have a proprietary data format which may
change in future DataStage versions
Checkpoint
1. List three types of file data.
2. What makes datasets perform better than other types of
files in parallel jobs?
3. What is the difference between a data set and a file set?
Unit summary
Having completed this unit, you should be able to:
● Understand the stages for accessing different kinds of file data
– Sequential File stage
– Data Set stage
● Create jobs that read from and write to sequential files
● Create Reject links
● Work with NULLs in sequential files
● Read from multiple files using file patterns
● Use multiple readers
IBM InfoSphere DataStage
Mod 07: Platform Architecture
Unit objectives
After completing this unit, you should be able to:
● Describe parallel processing architecture
● Describe pipeline parallelism
● Describe partition parallelism
● List and describe partitioning and collecting algorithms
● Describe configuration files
● Describe the parallel job compilation process
● Explain OSH
● Explain the Score
Key Parallel Job Concepts
● Parallel processing:
– Executing the job on multiple CPUs
● Scalable processing:
– Add more resources (CPUs and disks) to increase system
performance
• Example system: six CPUs (processing nodes) and disks
• Scale up by adding more CPUs
• Add CPUs as individual nodes or to an SMP system
Scalable Hardware Environments
● Single CPU
– Dedicated memory & disk
● SMP
– Multi-CPU (2-64+)
– Shared memory & disk
● GRID / Clusters
– Multiple, multi-CPU systems
– Dedicated memory per node
– Typically SAN-based shared storage
● MPP
– Multiple nodes with dedicated memory, storage
– 2 – 1000’s of CPUs
Pipeline Parallelism
● Transform, clean, load processes execute simultaneously
● Like a conveyor belt moving rows from process to process
– Start downstream process while upstream process is running
● Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
● Still has limits on scalability
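As a rough analogy (a sketch only; real pipeline parallelism runs stages as separate processes), Python generators show the conveyor-belt effect: each stage consumes rows while the upstream stage is still producing them, with no staging area on disk:

def extract():
    for i in range(10):
        yield {"id": i, "name": f" name{i} "}

def transform(rows):
    for row in rows:
        yield {**row, "name": row["name"].strip().upper()}

def load(rows):
    for row in rows:
        print(row)  # stand-in for writing to a target

load(transform(extract()))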
Partition Parallelism
● Divide the incoming stream of data into subsets to be
separately processed by an operation
– Subsets are called partitions (nodes)
● Each partition of data is processed by the same
operation
– E.g., if operation is Filter, each partition will be filtered in exactly
the same way
● Facilitates near-linear scalability
– 8 times faster on 8 processors
– 24 times faster on 24 processors
– This assumes the data is evenly distributed
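The following minimal Python sketch (not DataStage code) mimics partition parallelism: the same filter logic runs against each partition in its own worker process:

from multiprocessing import Pool

def filter_partition(partition):
    # The identical operation is applied to every partition.
    return [row for row in partition if row % 2 == 0]

if __name__ == "__main__":
    data = list(range(24))
    partitions = [data[i::3] for i in range(3)]  # three partitions
    with Pool(3) as pool:
        results = pool.map(filter_partition, partitions)
    print(results)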
Three-Node Partitioning
Diagram: the incoming data is split into subset1, subset2, and subset3; each subset is processed by its own copy of the operation on Node 1, Node 2, and Node 3.
● Here the data is partitioned into three partitions
● The operation is performed on each partition of data
separately and in parallel
● If the data is evenly distributed, the data will be processed
three times faster
Parallel Jobs Combine Partitioning and Pipelining
– Within parallel jobs pipelining, partitioning, and repartitioning are automatic
– Job developer only identifies:
• Sequential vs. parallel mode (by stage)
– Default for most stages is parallel mode
• Method of data partitioning
– The method to use to distribute the data into the available nodes
• Method of data collecting
– The method to use to collect the data from the nodes into a single
node
• Configuration file
– Specifies the nodes and resources
Job Design v. Execution
User assembles the flow using DataStage Designer
… at runtime, this job runs in parallel for any configuration
(1 node, 4 nodes, N nodes)
No need to modify or recompile the job design!
Configuration File
● Configuration file separates configuration (hardware /
software) from job design
– Specified per job at runtime by $APT_CONFIG_FILE
– Change hardware and resources without changing job design
● Defines nodes with their resources (need not match
physical CPUs)
– Dataset, Scratch, Buffer disk (file systems)
– Optional resources (Database, SAS, etc.)
– Advanced resource optimizations
• “Pools” (named subsets of nodes)
● Multiple configuration files can be used on different occasions
of job execution
– Optimizes overall throughput and matches job characteristics to
overall hardware resources
– Allows runtime constraints on resource usage on a per job basis
Example Configuration File
{
    node "n1" {
        fastname "s1"
        pool "" "n1" "s1" "app2" "sort"
        resource disk "/orch/n1/d1" {}
        resource disk "/orch/n1/d2" {"bigdata"}
        resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
        fastname "s2"
        pool "" "n2" "s2" "app1"
        resource disk "/orch/n2/d1" {}
        resource disk "/orch/n2/d2" {"bigdata"}
        resource scratchdisk "/temp" {}
    }
    node "n3" {
        fastname "s3"
        pool "" "n3" "s3" "app1"
        resource disk "/orch/n3/d1" {}
        resource scratchdisk "/temp" {}
    }
    node "n4" {
        fastname "s4"
        pool "" "n4" "s4" "app1"
        resource disk "/orch/n4/d1" {}
        resource scratchdisk "/temp" {}
    }
}

Key points:
1. Number of nodes defined
2. Resources assigned to each node. Their order is significant.
3. Advanced resource optimizations and configuration (named pools, database, SAS)
Partitioning and Collecting
Partitioning and Collecting
● Partitioning breaks incoming rows into multiple streams of
rows (one for each node)
● Each partition of rows is processed separately by the
stage/operator
● Collecting returns partitioned data back to a single stream
● Partitioning / Collecting is specified on stage input links
Partitioning / Collecting Algorithms
● Partitioning algorithms include:
– Round robin
– Random
– Hash: Determine partition based on key value
• Requires key specification
– Modulus
– Entire: Send all rows down all partitions
– Same: Preserve the same partitioning
– Auto: Let DataStage choose the algorithm
● Collecting algorithms include:
– Round robin
– Auto
• Collect first available record
– Sort Merge
• Read in by key
• Presumes data is sorted by the key in each partition
• Builds a single sorted stream based on the key
– Ordered
• Read all records from first partition, then second, …
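For illustration only, here are minimal Python sketches of round robin, hash, and entire partitioning plus a sort-merge collector; DataStage implements these natively, so the functions below are conceptual stand-ins:

import heapq
from itertools import cycle

def round_robin(rows, n):
    parts = [[] for _ in range(n)]
    for part, row in zip(cycle(parts), rows):  # deal rows like cards
        part.append(row)
    return parts

def hash_partition(rows, n, key):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)  # same key -> same partition
    return parts

def entire(rows, n):
    return [list(rows) for _ in range(n)]  # every partition gets all rows

def sort_merge_collect(parts, key):
    # Presumes each partition is already sorted on the key.
    return list(heapq.merge(*parts, key=lambda r: r[key]))

rows = [{"state": s, "amt": i} for i, s in enumerate("CA MA CA NY MA".split())]
print(hash_partition(rows, 2, "state"))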
Keyless V. Keyed Partitioning Algorithms
● Keyless: Rows are distributed independently of data values
– Round Robin
– Random
– Entire
– Same
● Keyed: Rows are distributed based on values in the
specified key
– Hash: Partition based on key
• Example: Key is State. All “CA” rows go into the same partition; all
“MA” rows go in the same partition. Two rows of the same state
never go into different partitions
– Modulus: Partition based on modulus of key divided by the number
of partitions. Key is a numeric type.
• Example: Key is OrderNumber (numeric type). Rows with the same
order number will all go into the same partition.
– DB2: Matches DB2 EEE partitioning
Round Robin and Random Partitioning
Keyless
● Keyless partitioning methods; rows are evenly distributed
across partitions
– Good for initial import of data if no other partitioning is needed
– Useful for redistributing data
● Fairly low overhead
● Round Robin assigns rows to partitions like dealing cards
– Row/partition assignment will be the same for a given
$APT_CONFIG_FILE
● Random has slightly higher overhead, but assigns rows in a
non-deterministic fashion between job runs

Diagram: rows ...8 7 6 5 4 3 2 1 0 are dealt round robin into three partitions: {0, 3, 6}, {1, 4, 7}, {2, 5, 8}.
ENTIRE Partitioning
Keyless
● Each partition gets a complete copy of the data
– Useful for distributing lookup and reference data
• May have performance impact in MPP / clustered environments
– On SMP platforms, the Lookup stage (only) uses shared memory
instead of duplicating ENTIRE reference data
• On MPP platforms, each server uses shared memory for a
single local copy
● ENTIRE is the default partitioning for Lookup reference links
with “Auto” partitioning
– On SMP platforms, it is a good practice to set this explicitly on the
Normal Lookup reference link(s)

Diagram: rows ...8 7 6 5 4 3 2 1 0 are copied in full (0, 1, 2, 3, ...) into every partition.
HASH Partitioning
Keyed
● Keyed partitioning method
● Rows are distributed according to the values in key columns
– Guarantees that rows with the same key values go into the same
partition
– Needed to prevent matching rows from “hiding” in other partitions
• E.g., Join, Merge, Remove Duplicates, …
– Partition distribution is relatively equal if the data across the source
key column(s) is evenly distributed

Diagram: key values ...0 3 2 1 0 2 3 2 1 1 are hashed into three partitions: {0, 3, 0, 3}, {1, 1, 1}, {2, 2, 2}.
Modulus Partitioning
Keyed
● Keyed partitioning method
● Rows are distributed according to the values in one integer key
column
– Uses modulus: partition = MOD(key_value, #partitions)
● Faster than HASH
● Guarantees that rows with identical key values go in the same
partition
● Partition size is relatively equal if the data within the key
column is evenly distributed

Diagram: key values ...0 3 2 1 0 2 3 2 1 1 are assigned by modulus into three partitions: {0, 3, 0, 3}, {1, 1, 1}, {2, 2, 2}.
Auto Partitioning
● DataStage inserts partition components as necessary
to ensure correct results
– Before any stage with “Auto” partitioning
– Generally chooses ROUND-ROBIN or SAME
– Inserts HASH on stages that require matched key values
(e.g. Join, Merge, Remove Duplicates)
– Inserts ENTIRE on Normal (not Sparse) Lookup reference
links
• NOT always appropriate for MPP/clusters
● Since DataStage has limited awareness of your data
and business rules, explicitly specify HASH partitioning
when needed
– DataStage has no visibility into Transformer logic
– Hash is required before Sort and Aggregator stages
– DataStage sometimes inserts un-needed partitioning
• Check the log
Partitioning Requirements for Related Records
● Misplaced records
– Using Aggregator stage to sum customer sales by
customer number
– If there are 25 customers, 25 records should be output
– But suppose records with the same customer numbers
are spread across partitions
• This will produce more than 25 groups (records)
– Solution: Use hash partitioning algorithm
● Partition imbalances
– If all the records are going down only one of the nodes,
then the job is in effect running sequentially
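A small Python sketch of the misplaced-records problem (the customer values are made up): aggregating round-robin partitions yields four output groups for two customers, while hash partitioning on the grouping key yields the correct two:

from collections import Counter

rows = [("cust1", 10), ("cust1", 30), ("cust2", 20), ("cust2", 40)]

def aggregate(partition):
    totals = Counter()
    for cust, amt in partition:
        totals[cust] += amt
    return dict(totals)

round_robin = [rows[0::2], rows[1::2]]  # each customer lands in both partitions
hashed = [[r for r in rows if hash(r[0]) % 2 == p] for p in (0, 1)]

print([aggregate(p) for p in round_robin])  # four groups in total: wrong
print([aggregate(p) for p in hashed])       # two groups in total: right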
Unequal Distribution Example
● Same key values are assigned to the same partition
● Hash on LName, with a 2-node config file:

Source Data:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

Partition 0:
ID  LName  FName    Address
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard

Partition 1:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore
Partitioning / Collecting Link Icons
Partitioning icon
Collecting icon
More Partitioning Icons
● “Fan-out” icon: sequential-to-parallel
● SAME partitioner
● AUTO partitioner
● Re-partition icon (watch for this!)
Partitioning Tab
Input tab
Key
specification
Algorithms
Collecting Specification
Key
specification
Algorithms
Runtime Architecture
Parallel Job Compilation
Diagram: the Designer client compiles the job; the DataStage server produces the executable job, including Transformer components.

What gets generated:
● OSH: a kind of script
● OSH represents the design data flow and stages
– Stages become OSH operators
● A Transform operator is generated for each Transformer
– A custom operator built during the compile
– Compiled into C++ and then to corresponding native operators
• Thus a C++ compiler is needed to compile jobs with a
Transformer stage
Generated OSH
Enable viewing of generated OSH in Administrator.
OSH is visible in:
– Job properties
– Job run log
– View Data
– Table Defs

Screenshot: generated OSH, with callouts for comments, an operator, and a schema.
Stage to Operator Mapping Examples
● Sequential File
– Source: import
– Target: export
● DataSet: copy
● Sort: tsort
● Aggregator: group
● Row Generator, Column Generator, Surrogate Key
Generator: generator
● Oracle
– Source: oraread
– Sparse Lookup: oralookup
– Target Load: orawrite
– Target Upsert: oraupsert
Checkpoint
1. What file defines the degree of parallelism a job runs
under?
2. Name two partitioning algorithms that partition based on
key values?
3. What partitioning algorithm produces the most even
distribution of data in the various partitions?
Unit summary
Having completed this unit, you should be able to:
● Describe parallel processing architecture
● Describe pipeline parallelism
● Describe partition parallelism
● List and describe partitioning and collecting algorithms
● Describe configuration files
● Describe the parallel job compilation process
● Explain OSH
● Explain the Score
IBM InfoSphere DataStage
Mod 08: Combining Data
Unit objectives
After completing this unit, you should be able to:
● Combine data using the Lookup stage
● Define range lookups
● Combine data using Merge stage
● Combine data using the Join stage
● Combine data using the Funnel stage
Combining Data
Ways to combine data:
● Horizontally:
– Multiple input links
– One output link made of columns from different input links.
– Joins
– Lookup
– Merge
● Vertically:
– One input link, one output link combining groups of related records
into a single record
– Aggregator
– Remove Duplicates
● Funneling: Multiple input streams funneled into a single
output stream
– Funnel stage
Lookup, Merge, Join Stages
● These stages combine two or more input links
– Data is combined by designated "key" column(s)
● These stages differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)
Lookup Stage
Lookup Features
● One stream input link (source link)
● Multiple reference links
● One output link
● Lookup failure options
– Continue, Drop, Fail, Reject
● Reject link
– Only available when lookup failure option is set to Reject
● Can return multiple matching rows
● Hash file is built in memory from the lookup files
– Indexed by key
– Should be small enough to fit into physical memory
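To ground the options above, here is a minimal Python sketch of an in-memory lookup with the four failure actions, using the citizen/exchange example from the slides that follow (an illustration, not the Lookup stage’s implementation):

reference = {"M_B_Dextrous": "Nasdaq", "Righty": "NYSE"}  # keyed, in memory

def lookup(source_rows, on_failure="continue"):
    output, rejects = [], []
    for row in source_rows:
        match = reference.get(row["Citizen"])
        if match is not None:
            output.append({**row, "Exchange": match})
        elif on_failure == "continue":
            output.append({**row, "Exchange": None})  # fill with null
        elif on_failure == "reject":
            rejects.append(row)
        elif on_failure == "fail":
            raise RuntimeError(f"lookup failed for {row}")
        # "drop": the unmatched row is silently discarded
    return output, rejects

src = [{"Revolution": 1789, "Citizen": "Lefty"},
       {"Revolution": 1776, "Citizen": "M_B_Dextrous"}]
print(lookup(src, "continue"))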
Lookup Types
● Equality match
– Match exactly values in the lookup key column of the reference link to
selected values in the source row
– Return row or rows (if multiple matches are to be returned) that match
● Caseless match
– Like an equality match except that it’s caseless
• E.g., “abc” matches “AbC”
● Range on the reference link
– Two columns on the reference link define the range
– A match occurs when a selected value in the source row is within the
range
● Range on the source link
– Two columns on the source link define the range
– A match occurs when a selected value in the reference link is within
the range
Lookup Example
Reference link
Source (stream)
link
Lookup Stage With an Equality Match
Screenshot: the Lookup stage editor, showing the source link columns, the lookup constraints, the reference key column and its source mapping, the mappings to output columns, and the column definitions.
Lookup Failure Actions
If the lookup fails to find a matching key column, one of
several actions can be taken:
– Fail (Default)
• Stage reports an error and the job fails immediately
– Drop
• Input row is dropped
– Continue
• Input row is transferred to the output. Reference link columns are
filled with null or default values
– Reject
• Input row sent to a reject link
• This requires that a reject link has been created for the stage
Specifying Lookup Failure Actions
Select to return
multiple rows
Select lookup failure
action
Lookup Stage Behavior
Source link:              Reference link:
Revolution  Citizen       Citizen       Exchange
1789        Lefty         M_B_Dextrous  Nasdaq
1776        M_B_Dextrous  Righty        NYSE

The lookup key column is Citizen.
Lookup Stage
Output of Lookup with the Continue option:
Revolution  Citizen       Exchange
1789        Lefty         (empty string or NULL)
1776        M_B_Dextrous  Nasdaq

Output of Lookup with the Drop option:
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Designing a Range Lookup Job
Range Lookup Job
Reference link
Lookup stage
Range on Reference Link
Reference
range values
Retrieve
description
Source values
Selecting the Stream Column
Source link
Double-click
to specify
range
Reference link
Range Expression Editor
Select range
columns
Select
operators
Range on Stream Link
Source range
Retrieve other
column values
Reference link key
Specifying the Range Lookup
Reference link
key column
Range type
Range Expression Editor
Select range
columns
Join Stage
Join Stage
● Four types of joins
• Inner
• Left outer
• Right outer
• Full outer
● Input links must be sorted
– Left link and a right link
– Supports additional “intermediate” links
● Light-weight
– Little memory required, because of the sort requirement
● Join key column or columns
– Column names for each input link must match
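As a sketch of why sorted input keeps the stage light-weight, this Python fragment performs a left outer join over inputs sorted on the key, holding only the current rows in memory. For brevity it assumes unique keys on the right link; the real stage handles duplicates and all four join types:

def left_outer_join(left, right, key):
    right_iter = iter(right)
    r = next(right_iter, None)
    for l in left:
        while r is not None and r[key] < l[key]:
            r = next(right_iter, None)  # advance the right link
        if r is not None and r[key] == l[key]:
            yield {**l, **r}
        else:
            yield {**l, "Exchange": None}  # no match: null/default value

left = [{"Citizen": "Lefty", "Revolution": 1789},
        {"Citizen": "M_B_Dextrous", "Revolution": 1776}]
right = [{"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
         {"Citizen": "Righty", "Exchange": "NYSE"}]
key = lambda r: r["Citizen"]
print(list(left_outer_join(sorted(left, key=key),
                           sorted(right, key=key), "Citizen")))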
Job With Join Stage
Right input link
Left input link
Join stage
Join Stage Editor
Screenshot: the Join stage editor, where you select the left input link, the join type, the column(s) to match, and, if multiple columns make up the join key, additional key columns.
Join Stage Behavior
Left link (primary input):    Right link (secondary input):
Revolution  Citizen           Citizen       Exchange
1789        Lefty             M_B_Dextrous  Nasdaq
1776        M_B_Dextrous      Righty        NYSE

The join key column is Citizen.
Inner Join
● Transfers rows from both data sets whose key
columns have matching values
Output of inner join on key Citizen:
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Left Outer Join
● Transfers all values from the left link and transfers
values from the right link only where key columns
match.
Revolution  Citizen       Exchange
1789        Lefty         (null or default value)
1776        M_B_Dextrous  Nasdaq
Right Outer Join
● Transfers all values from the right link and transfers values from
the left link only where key columns match.
Revolution               Citizen       Exchange
1776                     M_B_Dextrous  Nasdaq
(null or default value)  Righty        NYSE
Full Outer Join
● Transfers rows from both data sets, whose key columns
contain equal values, to the output link
● It also transfers rows, whose key columns contain
unequal values, from both input links to the output link
● Treats both input symmetrically
● Creates new columns, with new column names!
Revolution  leftRec_Citizen  rightRec_Citizen  Exchange
1789        Lefty
1776        M_B_Dextrous     M_B_Dextrous      Nasdaq
0                            Righty            NYSE
Merge Stage
Merge Stage
• Similar to Join stage
• Input links must be sorted
– Master link and one or more secondary links
– Master must be duplicate-free
● Light-weight
– Little memory required, because of the sort requirement
● Unmatched master rows can be kept or dropped
● Unmatched ("Bad") update rows in input link n can be captured
in a "reject" link in corresponding output link n.
● Matched update rows are consumed
● Follows the Master-Update model
● Master row and one or more updates row are merged iff they
have the same value in user-specified key column(s).
● What if a non-key column name occurs in several inputs?
– The lowest input port number prevails
– E.g., master over update; update values are ignored
Merge Stage Job
Merge stage
Capture update
link non-matches
Merge Stage Properties
Match key
Keep or drop
unmatched
masters
The Merge Stage
● Allows composite keys
● Multiple update links
● Matched update rows are consumed
● Unmatched updates in input port n can be captured in output
port n
● Lightweight

Diagram: a master link and one or more update links (input ports 0, 1, 2) enter the Merge stage; the output link and per-update reject links (output ports 0, 1, 2) leave it.
Comparison: Joins, Lookup, Merge
                                  Joins                             Lookup                              Merge
Model                             RDBMS-style relational            Source, in-RAM LU table             Master, update(s)
Memory usage                      light                             heavy                               light
Number and names of inputs        2 or more: left, right            1 source, N LU tables               1 master, N update(s)
Mandatory input sort              all inputs                        no                                  all inputs
Duplicates in primary input       OK                                OK                                  Warning!
Duplicates in secondary input(s)  OK                                Warning!                            OK only when N = 1
Options on unmatched primary      keep (left outer), drop (inner)   [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary    keep (right outer), drop (inner)  NONE                                capture in reject set(s)
On match, secondary entries are   captured                          captured                            consumed
Number of outputs                 1                                 1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)         nothing (N/A)                     unmatched primary entries           unmatched secondary entries
Funnel Stage
© Copyright IBM Corporation 2009 219
What is a Funnel Stage?
● Combines data from multiple input links to a single
output link
● All sources must have identical metadata
● Three modes
– Continuous
• Records are combined in no particular order
• The first available record is taken
– Sort Funnel
• Combines the input records in the order defined by a
key
• Produces sorted output if the input links are all sorted
by the same key
– Sequence:
• Outputs all records from the first input link, then all
from the second input link, and so on
© Copyright IBM Corporation 2009 220
Funnel Stage Example
© Copyright IBM Corporation 2009 221
Checkpoint
1. Name three stages that horizontally join data.
2. Which stage uses the least amount of memory? Join or
Lookup?
3. Which stage requires that the input data is sorted? Join or
Lookup?
© Copyright IBM Corporation 2009 222
Checkpoint solutions
1. Lookup, Merge, Join
2. Join
3. Join
© Copyright IBM Corporation 2009 223
Unit summary
Having completed this unit, you should be able to:
● Combine data using the Lookup stage
● Define range lookups
● Combine data using Merge stage
● Combine data using the Join stage
● Combine data using the Funnel stage
© Copyright IBM Corporation 2009 224
IBM InfoSphere DataStage
Mod 09: Sorting and
Aggregating Data
© Copyright IBM Corporation 2009
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Sort data using in-stage sorts and Sort stage
● Combine data using Aggregator stage
● Remove duplicates using the Remove Duplicates stage
© Copyright IBM Corporation 2009 226
Sort Stage
© Copyright IBM Corporation 2009 227
Sorting Data
● Uses
– Some stages require sorted input
• Join, merge stages require sorted input
– Some stages use less memory with sorted input
• e.g., Aggregator
● Sorts can be done:
– Within stages
• On input link Partitioning tab, set partitioning to anything other
than Auto
– In a separate Sort stage
• Makes sort more visible on diagram
• Has more options
© Copyright IBM Corporation 2009 228
Sorting Alternatives
Sort stage
Sort within stage
© Copyright IBM Corporation 2009 229
In-Stage Sorting
Partitioning tab
Do sort
Preserve non-key ordering
Remove dups
Sort key
Can’t be Auto when sorting
© Copyright IBM Corporation 2009 230
Sort Stage
Sort key
Sort options
© Copyright IBM Corporation 2009 231
Sort keys
● Add one or more keys
● Specify sort mode for each key
– Sort: Sort by this key
– Don’t sort (previously sorted):
• Assumes the data has already been sorted on this key
• Continue sorting by any secondary keys
● Specify sort order: ascending / descending
● Specify case sensitive or not
© Copyright IBM Corporation 2009 232
Sort Options
● Sort Utility
– DataStage – the default
– Unix: Don’t use. Slower than DataStage sort utility
● Stable
● Allow duplicates
● Memory usage
– Sorting takes advantage of the available memory for increased
performance
• Uses disk if necessary
– Increasing amount of memory can improve performance
© Copyright IBM Corporation 2009 233
Stable Sort
Input (arrival order)      Output (stable sort on Key)
Key  Col                   Key  Col
 4    X                     1    K
 3    Y                     1    A
 1    K                     2    P
 3    C                     2    L
 2    P                     3    Y
 3    D                     3    C
 1    A                     3    D
 2    L                     4    X
Rows with equal keys keep their original relative order (e.g., 1 K before 1 A)
© Copyright IBM Corporation 2009 234
Partitioning vs. Sorting Keys
● Partitioning keys are often different from sorting keys
– Keyed partitioning (e.g., Hash) is used to group related records into
the same partition
– Sort keys are used to establish order within each partition
● Example:
– Partition on HouseHoldID, sort on HouseHoldID, EntryDate
• Partitioning on HouseHoldID ensures that the same ID will not be
spread across multiple partitions
• Sorting orders the records with the same ID by entry date
– Useful for deciding which of a group of duplicate records with the same
ID should be retained
© Copyright IBM Corporation 2009 235
Aggregator Stage
© Copyright IBM Corporation 2009 236
Aggregator Stage
Purpose: Perform data aggregations
Specify:
● One or more key columns that define the aggregation
units (or groups)
● Columns to be aggregated
● Aggregation functions include, among many others:
– Count (nulls/non-nulls)
– Sum
– Max / Min / Range
● The grouping method (hash table or pre-sort) is a
performance issue
© Copyright IBM Corporation 2009 237
Job with Aggregator Stage
Aggregator
stage
© Copyright IBM Corporation 2009 238
Aggregation Types
● Count rows
– Count rows in each group
– Put result in a specified output column
● Calculation
– Select column
– Put result of calculation in a specified output column
– Calculations include:
• Sum
• Count
• Min, max
• Mean
• Missing value count
• Non-missing value count
• Percent coefficient of variation
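For readers who think in SQL, the two aggregation types map roughly onto a
GROUP BY query. A sketch with hypothetical table and column names:

  SELECT Region,
         COUNT(*)    AS RowCount,     -- Count Rows aggregation type
         SUM(Amount) AS TotalAmount,  -- Calculation: Sum
         MAX(Amount) AS MaxAmount     -- Calculation: Max
  FROM   Orders
  GROUP BY Region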
© Copyright IBM Corporation 2009 239
Count Rows Aggregator Properties
Grouping key
columns
Count Rows aggregation type
Output column for the result
© Copyright IBM Corporation 2009 240
Calculation Type Aggregator Properties
Grouping key
columns
Calculation
aggregation type
Calculations with
output columns
Available
calculations
© Copyright IBM Corporation 2009 241
Grouping Methods
● Hash (default)
– Calculations are made for all groups and stored in memory
• Hash table structure (hence the name)
– Results are written out after all input has been processed
– Input does not need to be sorted
– Useful when the number of unique groups is small
• Running tally for each group’s aggregations needs to fit into
memory
● Sort
– Requires the input data to be sorted by grouping keys
• Does not perform the sort! Expects the sort
– Only a single aggregation group is kept in memory
• When a new group is seen, the current group is written out
– Can handle unlimited numbers of groups
© Copyright IBM Corporation 2009 242
Grouping Method - Hash
Input rows (Key, Col): 4 X, 3 Y, 1 K, 3 C, 2 P, 3 D, 1 A, 2 L
In-memory hash table of running groups:
  4: 4X
  3: 3Y 3C 3D
  1: 1K 1A
  2: 2P 2L
All groups are held in memory; results are written after all input is read
© Copyright IBM Corporation 2009 243
Grouping Method - Sort
Sorted input rows (Key, Col): 1 K, 1 A, 2 P, 2 L, 3 Y, 3 C, 3 D, 4 X
Only the current group is held in memory; each group is written out as soon
as a new key value arrives:
  1: 1K 1A
  2: 2P 2L
  3: 3Y 3C 3D
  4: 4X
© Copyright IBM Corporation 2009 244
Remove Duplicates Stage
© Copyright IBM Corporation 2009 245
Removing Duplicates
● Can be done by Sort stage
– Use unique option
• No choice on which duplicate to keep
• Stable sort always retains the first row in the group
• Non-stable sort is indeterminate
OR
● Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
• Can choose to retain first or last
© Copyright IBM Corporation 2009 246
Remove Duplicates Stage Job
Remove Duplicates
stage
© Copyright IBM Corporation 2009 247
Remove Duplicates Stage Properties
Key that defines
duplicates
Retain first or last
duplicate
© Copyright IBM Corporation 2009 248
Checkpoint
1. What stage is used to perform calculations of column
values grouped in specified ways?
2. In what two ways can sorts be performed?
3. What is a stable sort?
© Copyright IBM Corporation 2009 249
Unit summary
Having completed this unit, you should be able to:
● Sort data using in-stage sorts and Sort stage
● Combine data using Aggregator stage
● Remove duplicates using the Remove Duplicates stage
© Copyright IBM Corporation 2009 250
IBM InfoSphere DataStage
Mod 10: Transforming Data
© Copyright IBM Corporation 2009
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Use the Transformer stage in parallel jobs
● Define constraints
● Define derivations
● Use stage variables
● Create a parameter set and use its parameters in constraints
and derivations
© Copyright IBM Corporation 2009 252
Transformer Stage
© Copyright IBM Corporation 2009 253
Transformer Stage
● Column mappings
● Derivations
– Written in DataStage BASIC
– Compiled into C++ generated object code
● Constraints
– Filter data
– Direct data down different output links
• For different processing or storage
● Expressions for constraints and derivations can reference
– Input columns
– Job parameters
– Functions
– System variables and constants
– Stage variables
– External routines
© Copyright IBM Corporation 2009 254
Job With a Transformer Stage
Transformer
Single input
Multiple outputs
Reject link
© Copyright IBM Corporation 2009 255
Inside the Transformer Stage
Stage variables
Show/Hide
Input columns Output columns
Constraint
Derivations / Mappings
Column defs
© Copyright IBM Corporation 2009 256
Defining a Constraint
Input column
Function
Job parameter
© Copyright IBM Corporation 2009 257
Defining a Derivation
Input column
String in quotes
Concatenation operator (:)
© Copyright IBM Corporation 2009 258
IF THEN ELSE Derivation
● Use IF THEN ELSE to conditionally derive a value
● Format:
– IF <condition> THEN <expression1> ELSE <expression2>
– If the condition evaluates to true then the result of expression1 will
be copied to the target column or stage variable
– If the condition evaluates to false then the result of expression2 will
be copied to the target column or stage variable
● Example:
– Suppose the source column is named In.OrderID and the target
column is named Out.OrderID
– Replace In.OrderID values of 3000 by 4000
– IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
© Copyright IBM Corporation 2009 259
String Functions and Operators
● Substring operator
– Format: “String” [loc, length]
– Example:
• Suppose In.Description contains the string “Orange Juice”
• In.Description[8,5] returns “Juice”
● UpCase(<string>) / DownCase(<string>)
– Example: UpCase(In.Description) returns “ORANGE JUICE”
● Len(<string>)
– Example: Len(In.Description) returns 12
© Copyright IBM Corporation 2009 260
Checking for NULLs
● Nulls can be introduced into the data
flow from lookups
– Mismatches (lookup failures) can produce
nulls
● Can be handled in constraints,
derivations, stage variables, or a
combination of these
● NULL functions
– Testing for NULL
• IsNull(<column>)
• IsNotNull(<column>)
– Replace NULL with a value
• NullToValue(<column>, <value>)
– Set to NULL: SetNull()
• Example: IF In.Col = 5 THEN SetNull()
ELSE In.Col
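As a minimal sketch (hypothetical column names), a target-column derivation
that guards a nullable input could combine these functions:

  IF IsNull(In.Region) THEN "UNKNOWN" ELSE UpCase(In.Region)

or, for a numeric column, NullToValue(In.Qty, 0)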
© Copyright IBM Corporation 2009 261
Transformer Functions
● Date & Time
● Logical
● Null Handling
● Number
● String
● Type Conversion
© Copyright IBM Corporation 2009 262
Transformer Execution Order
● Derivations in stage variables are executed first
● Constraints are executed before derivations
● Column derivations in earlier links are executed before later
links
● Derivations in higher columns are executed before lower
columns
© Copyright IBM Corporation 2009 263
Transformer Stage Variables
● Derivations execute in order from top to bottom
– Later stage variables can reference earlier stage variables
– Earlier stage variables can reference later stage variables
• These variables will contain a value derived from the previous row
that came into the Transformer
● Multi-purpose
– Counters
– Store values from previously read rows to make comparisons with
the currently read row
– Store derived values to be used in multiple target field derivations
– Can be used to control execution of constraints
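A minimal sketch (hypothetical names): two stage variables that detect a key
change between consecutive rows. Because derivations execute top to bottom,
svPrevID still holds the previous row’s value when svIsNewID is evaluated;
svPrevID’s initial value can be set in its definition:

  svIsNewID : IF In.CustID <> svPrevID THEN 1 ELSE 0
  svPrevID  : In.CustID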
© Copyright IBM Corporation 2009 264
Transformer Reject Links
Reject link
Convert link to a
Reject link
© Copyright IBM Corporation 2009 265
Transformer and Null Expressions
● Within a Transformer, any expression that includes a NULL
value produces a NULL result
– 1 + NULL = NULL
– “John” : NULL : “Doe” = NULL
● When the result of a link constraint or output derivation is
NULL, the Transformer will output that row to the reject link.
If no reject link, job will abort!
● Standard practices
– Always create a reject link
– Always test for NULL values in expressions with NULLABLE columns
• IF isNull(link.col) THEN … ELSE …
– Transformer issues warnings for rejects
© Copyright IBM Corporation 2009 266
Otherwise Link
Otherwise link
© Copyright IBM Corporation 2009 267
Defining an Otherwise Link
Check to create otherwise condition
Can specify abort condition
© Copyright IBM Corporation 2009 268
Specifying Link Ordering
Link ordering toolbar icon
Last in
order
© Copyright IBM Corporation 2009 269
Parameter Sets
© Copyright IBM Corporation 2009 270
Parameter Sets
● Store a collection of parameters in a named object
● One or more values files can be named and specified
– A values file stores values for specified parameters
– Values are picked up at runtime
● Parameter Sets can be added to the job parameters
specified on the Parameters tab in the job properties
© Copyright IBM Corporation 2009 271
Creating a New Parameter Set
New Parameter
Set
© Copyright IBM Corporation 2009 272
Parameters Tab
Specify
parameters
Parameter set
name is specified
on General tab
© Copyright IBM Corporation 2009 273
Values Tab
Values file name
Values for parameters
© Copyright IBM Corporation 2009 274
Adding a Parameter Set to Job Properties
View Parameter Set
Parameter Set Reference
Add Parameter Set
© Copyright IBM Corporation 2009 275
Using Parameter Set Parameters
Select Parameter
Set parameters
© Copyright IBM Corporation 2009 276
Checkpoint
1. What occurs first? Derivations or constraints?
2. Can stage variables be referenced in constraints?
3. Where should you test for NULLS within a Transformer?
Stage Variable derivations or output column derivations?
© Copyright IBM Corporation 2009 277
Unit summary
Having completed this unit, you should be able to:
● Use the Transformer stage in parallel jobs
● Define constraints
● Define derivations
● Use stage variables
● Create a parameter set and use its parameters in constraints
and derivations
© Copyright IBM Corporation 2009 278
IBM InfoSphere DataStage
Mod 11: Working With
Relational Data
© Copyright IBM Corporation 2009
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Import Table Definitions for relational tables
● Create Data Connections
● Use Connector stages in a job
● Use SQL Builder to define SQL Select statements
● Use SQL Builder to define SQL Insert and Update
statements
● Use the DB2 Connector stage
© Copyright IBM Corporation 2009 280
Working with Relational Data
● Importing relational data
– Import using ODBC or orchdbutil
• orchdbutil is preferred, in order to get correct type conversions
● Data Connection objects
– Store database connection information in a named object
● Stages available to access relational data
– Connector stages
• Parallel support
• Most functionality
• Consistent GUI and functionality across all relational types
– Enterprise stages
• Parallel support
– Plug-in stages
• Functionality ported from DataStage Server Jobs
● Selecting data
– Build SELECT statements using SQL Builder
● Writing data
– Build INSERT, UPDATE, DELETE statements using SQL Builder
© Copyright IBM Corporation 2009 281
Importing Table Definitions
© Copyright IBM Corporation 2009 282
Importing Table Definitions
● Can import using ODBC or using Orchestrate schema
definitions
– Orchestrate schema imports are better because the data types
are more accurate
● Import>Table Definitions>Orchestrate Schema
Definitions
● Import>Table Definitions>ODBC Table Definitions
© Copyright IBM Corporation 2009 283
Orchestrate Schema Import
© Copyright IBM Corporation 2009 284
ODBC Import
Select ODBC data
source name
© Copyright IBM Corporation 2009 285
Connector Stages
© Copyright IBM Corporation 2009 286
Connector Stage Types
● ODBC
– Conforms to ODBC 3.5 standard and is Level 3 compliant
– Certified with Oracle, DB2 UDB, SQL Server, and others
– DataDirect drivers
● DB2 UDB
● Teradata
● WebSphere MQ
– WSMQ 5.3 and 6.0 for Client / Server
– WSMB 5.0
© Copyright IBM Corporation 2009 287
Connector Stage Features
● Common stage editor
● Convenient drop-down lists to choose properties
● Job parameters can be inserted into any property
● Required properties are identified with a visual indicator
● Warning indicator for properties requiring attention
● Metadata retrieval
● Integrated SQL Builder
● Parallel support
– Read: parallel connections to the server and modified SQL queries for
each connection
– Write: parallel connections to the server
● Transaction isolation level support
– Read Uncommitted
– Read Committed
– Repeatable Read
– Serializable
● Before / After commands
– Executed once before or after the job runs
© Copyright IBM Corporation 2009 288
Metadata Retrieval
● List data sources
– Can list both user and system DSNs
● List users defined for the given database
● Individual table or view metadata
– Column data type, size, precision, etc.
● Database information
– Name, vendor name, version number, etc.
● List of database schemas
● List tables that match the filter criteria
● List related tables
– Those related by foreign key
– Those whose foreign keys point to the primary key in the given
table
© Copyright IBM Corporation 2009 289
Connector Stages
ODBC connector
ODBC Connector
© Copyright IBM Corporation 2009 290
Stage Editor
Click to display stage properties
Navigator panel
Properties
Click to display link properties and columns
© Copyright IBM Corporation 2009 291
Property Warning Indicators
Warning indicator
© Copyright IBM Corporation 2009 292
Auto-Generated SQL
Stage generates SQL
SELECT based on
column definitions
© Copyright IBM Corporation 2009 293
Invoking SQL Builder
SQL statement
property
SQL Builder
© Copyright IBM Corporation 2009 294
Connector Stage Properties
Connection
information
SQL
Transaction
and session
management
Before/After
the SQL
© Copyright IBM Corporation 2009 295
DB2 Connector Stage
© Copyright IBM Corporation 2009 296
DB2 Connector Source Stage
© Copyright IBM Corporation 2009 297
DB2 Connector Source Stage Properties
Connection
information
© Copyright IBM Corporation 2009 298
DB2 Connector Target Stage
© Copyright IBM Corporation 2009 299
DB2 Connector Target Stage Properties
Connection
information
Write mode
© Copyright IBM Corporation 2009 300
Building a Query Using SQL
Builder
© Copyright IBM Corporation 2009 301
SQL Builder
● Uses the Table Definition
– Be sure the Locator tab information is specified fully and correctly
• Schema and table names are based on Locator tab information
● Drag Table Definitions to SQL Builder canvas
● Drag columns from Table Definition to Select columns table
© Copyright IBM Corporation 2009 302
Table Definition Locator Tab
Locator tab
Table schema name
Table name
© Copyright IBM Corporation 2009 303
SQL Builder Read Method
SQL Builder
for Select
© Copyright IBM Corporation 2009 304
SQL Builder Window
Drag Table
Definition
Drag
Columns
Build column
expressions
Selection tab
© Copyright IBM Corporation 2009 305
Creating a Calculated Column
Columns
dragged
from table
Select function
Select Expression Editor
Open a second
Expression
Editor
Functions
Select function
Function parameters
© Copyright IBM Corporation 2009 306
Constructing a WHERE Clause
Select predicate
Job parameter
Add condition to clause
Add second job parameter
© Copyright IBM Corporation 2009 307
Sorting the Data
Second column
to sort by
Sort Ascending
/ Descending
First column to
sort by
© Copyright IBM Corporation 2009 308
Viewing the Generated SQL
Read-only
SQL tab
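A query assembled through the steps above might come out roughly as follows;
the table, columns, and job parameters are hypothetical, and #...# is
DataStage’s job parameter notation:

  SELECT CUSTID, CUSTNAME, REGION, ORDERDATE
  FROM   SALES.CUSTOMERS
  WHERE  ORDERDATE >= '#StartDate#' AND REGION = '#Region#'
  ORDER BY REGION ASC, ORDERDATE DESC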
© Copyright IBM Corporation 2009 309
Building SQL to Write to a
Table
© Copyright IBM Corporation 2009 310
Select Type of Write
Write mode
property
© Copyright IBM Corporation 2009 311
Invoking SQL Builder
Specify INSERT
Specify UPDATE
SQL Builder
© Copyright IBM Corporation 2009 312
SQL Builder INSERT
Drag Table Definition
Drag columns to load
Specify values
© Copyright IBM Corporation 2009 313
SQL Builder Update
Drag Table Definition
Drag columns to load
Specify values
Add condition to WHERE clause
© Copyright IBM Corporation 2009 314
Viewing the SQL
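For reference, the generated write statements take roughly this shape
(hypothetical table and columns); ORCHESTRATE.<column> is the placeholder
DataStage substitutes with the incoming link column at runtime:

  INSERT INTO SALES.CUSTOMERS (CUSTID, CUSTNAME, REGION)
  VALUES (ORCHESTRATE.CUSTID, ORCHESTRATE.CUSTNAME, ORCHESTRATE.REGION)

  UPDATE SALES.CUSTOMERS
  SET    CUSTNAME = ORCHESTRATE.CUSTNAME, REGION = ORCHESTRATE.REGION
  WHERE  CUSTID = ORCHESTRATE.CUSTID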
© Copyright IBM Corporation 2009 315
Data Connections
© Copyright IBM Corporation 2009 316
Data Connection
● Stores database parameters and values in a named object in
the Repository
– Optionally can create a parameter set with the parameters and
values from the Data Connection object
● Associated with a stage type
● Property values can be specified in a job stage of the given
type by loading the Data Connection into the stage
© Copyright IBM Corporation 2009 317
Creating a New Data Connection Object
New Data
Connection
© Copyright IBM Corporation 2009 318
Select the Stage Type
Select the stage type
Specify parameter values
Optionally create a parameter set
© Copyright IBM Corporation 2009 319
Loading the Data Connection
Load Data
Connection
Save Data
Connection
© Copyright IBM Corporation 2009 320
Checkpoint
1. Which of the following types of stages provides the most
functionality in DataStage parallel jobs? ODBC Plug-in,
ODBC Connector, ODBC Enterprise?
2. What are three ways of building SQL statements in
Connector stages?
3. Which of the following statements can be specified in
Connector stages? Select, Insert, Update, Upsert, Create
Table.
4. What are two ways of loading Data Connection metadata
into a database stage?
© Copyright IBM Corporation 2009 321
Unit summary
Having completed this unit, you should be able to:
● Import Table Definitions for relational tables
● Create Data Connections
● Use Connector stages in a job
● Use SQL Builder to define SQL Select statements
● Use SQL Builder to define SQL Insert and Update
statements
● Use the DB2 Connector stage
© Copyright IBM Corporation 2009 322
IBM InfoSphere DataStage
Mod 12: Metadata in the
Parallel Framework
© Copyright IBM Corporation 2009
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Explain schemas
● Create schemas
● Explain Runtime Column Propagation (RCP)
● Turn RCP on and off
● Build a job that reads data from a sequential file using a
schema
● Build a shared container
© Copyright IBM Corporation 2009 324
Schema
● Alternative way of specifying column definitions and record
formats
– Similar to a Table Definition
● Written in a plain text file
● Can be imported as a Table Definition
● Can be created from a Table Definition
● Can be used in place of a Table Definition in a Sequential
file stage
– Requires RCP
– Schema file path can be parameterized
• Enables a single job to process files with different column
definitions
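A minimal schema file might look like the following sketch (hypothetical
field names); the record-level properties describe the file format and each
field carries an Orchestrate type:

  record
    {final_delim=end, delim=',', quote=double}
  (
    CustID:    int32;
    CustName:  string[max=30];
    OrderDate: date;
  )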
© Copyright IBM Corporation 2009 325
Creating a Schema
● Using a text editor
– Follow correct syntax for definitions
– Not recommended
● Import from an existing data set or file set
– In Designer: Import > Table Definitions > Orchestrate Schema
Definitions
– Select the checkbox for a file with a .fs or .ds extension
● Import from a database table
● Create from a Table Definition
– Click Parallel on Layout tab
© Copyright IBM Corporation 2009 326
Importing a Schema
Schema location can be server or workstation
Import from database table
Check if schema is from a data set or file set
© Copyright IBM Corporation 2009 327
Creating a Schema From a Table Definition
Layout tab
Parallel layout
Schema corresponding to the Table Definition
Save schema
© Copyright IBM Corporation 2009 328
Reading a Sequential File Using a Schema
No columns
defined here
Path to
schema file
© Copyright IBM Corporation 2009 329
Runtime Column Propagation (RCP)
● When RCP is turned on:
– Columns of data can flow through a stage without being explicitly
defined in the stage
– Target columns in a stage need not have any columns explicitly
mapped to them
• No column mapping enforcement at design time
– Input columns are mapped to unmapped columns by name
● How implicit columns get into a job
– Read a file using a schema in a Sequential File stage
– Read a database table using “Select *”
– Explicitly define as an output column in a stage earlier in the flow
● Benefits of RCP
– Job flexibility
• Job can process input with different layouts
– Ability to create reusable components in shared containers
• Component logic can apply to a single named column
• All other columns flow through untouched
© Copyright IBM Corporation 2009 330
Enabling Runtime Column Propagation (RCP)
● Project level
– DataStage Administrator Parallel tab
● Job level
– Job properties General tab
● Stage level
– Output link Columns tab
● Settings at a lower level override settings at a higher level
– E.g., disable at the project level, but enable for a given job
– E.g., enable at the job level, but disable a given stage
© Copyright IBM Corporation 2009 331
Enabling RCP at Project Level
Check to enable RCP
to be used
Check to make RCP
the default for new
jobs
© Copyright IBM Corporation 2009 332
Enabling RCP at Job Level
Check to make RCP
the job default
© Copyright IBM Corporation 2009 333
Enabling RCP at Stage Level
● Sequential File stage
– Output Columns tab
● Transformer
– Open Stage Properties
– Stage Properties Output tab
Enable RCP in
Sequential File
stage
Check to enable
RCP in
Transformer
© Copyright IBM Corporation 2009 334
When RCP is Disabled
● DataStage Designer enforces Stage Input to Output column mappings.
Colored red; job
won’t compile
© Copyright IBM Corporation 2009 335
When RCP is Enabled
● DataStage does not enforce mapping rules
● Runtime error if no incoming columns match unmapped target
column names
Job will compile
© Copyright IBM Corporation 2009 336
Shared Containers
© Copyright IBM Corporation 2009 337
Shared Containers
● Encapsulate job design components into a stored container
● Provide reusable job design components
● Example
– Apply stored Transformer business logic
© Copyright IBM Corporation 2009 338
Creating a Shared Container
● Select stages from an existing job
● Click Edit>Construct Container>Shared
Selected
components
Click
© Copyright IBM Corporation 2009 339
Using a Shared Container in a Job
Shared
Container
© Copyright IBM Corporation 2009 340
Mapping Input / Output Links to the Container
Select container link
to map input link to
© Copyright IBM Corporation 2009 341
Mapping Input/Output Links
© Copyright IBM Corporation 2009 342
Checkpoint
1. What are two benefits of RCP?
2. What can you use to encapsulate stages and links in a job
to make them reusable?
© Copyright IBM Corporation 2009 343
Unit summary
Having completed this unit, you should be able to:
● Explain schemas
● Create schemas
● Explain Runtime Column Propagation (RCP)
● Turn RCP on and off
● Build a job that reads data from a sequential file using a
schema
● Build a shared container
© Copyright IBM Corporation 2009 344
IBM InfoSphere DataStage
Mod 13: Job Control
© Copyright IBM Corporation 2009
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Unit objectives
After completing this unit, you should be able to:
● Use the DataStage Job Sequencer to build a job that
controls a sequence of jobs
● Use Sequencer links and stages to control the sequence in which
a set of jobs runs
● Use Sequencer triggers and stages to control the conditions
under which jobs run
● Pass information in job parameters from the master
controlling job to the controlled jobs
● Define user variables
● Enable restart
● Handle errors and exceptions
© Copyright IBM Corporation 2009 346
What is a Job Sequence?
● A master controlling job that controls the execution of a
set of subordinate jobs
● Passes values to the subordinate job parameters
● Controls the order of execution (links)
● Specifies conditions under which the subordinate jobs get
executed (triggers)
● Specifies complex flow of control
– Loops
– All / Some
– Wait for file
● Perform system activities
– Email
– Execute system commands and executables
● Can include Restart checkpoints
© Copyright IBM Corporation 2009 347
Basics for Creating a New Job Sequence
● Open a new job sequence
– Specify whether it’s restartable
● Add stages
– Stages to execute jobs
– Stages to execute system commands and executables
– Special purpose stages
● Add links
– Specify the order in which jobs are to be executed
● Specify triggers
– Triggers specify the condition under which control passes across
a link
● Specify error handling
● Enable / Disable restart checkpoints
© Copyright IBM Corporation 2009 348
Job Sequencer Stages
● Run stages
– Job Activity: Run a job
– Execute Command: Run a system command
– Notification Activity: Send an email
● Flow control stages
– Sequencer: Go if All / Some
– Wait for File: Go when file exists / doesn’t
exist
– StartLoop / EndLoop
– Nested Condition: Go if condition satisfied
● Error handling
– Exception Handler
– Terminator
● Variables
– User Variables
© Copyright IBM Corporation 2009 349
Example
Wait for file
Execute a command
Send email
Run job
Handle exceptions
© Copyright IBM Corporation 2009 350
Sequence Properties
Restart
Exception stage to
handle aborts
© Copyright IBM Corporation 2009 351
Job Activity Stage Properties
Job to be executed
Execution mode
Job parameters
to be passed
© Copyright IBM Corporation 2009 352
Job Activity Trigger
Output link names
List of trigger types
Build custom trigger
expressions
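As a sketch (hypothetical activity name), a custom trigger expression can
test the upstream activity’s status variable:

  LoadWarehouse.$JobStatus = DSJS.RUNOK OR
  LoadWarehouse.$JobStatus = DSJS.RUNWARN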
© Copyright IBM Corporation 2009 353
Execute Command Stage
Executable
Parameters to pass
● Execute system commands, shell scripts, and other
executables
● Use e.g. to drop or rename database tables
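For example (hypothetical schema and table names), a DB2 CLP command that
drops a staging table before a load:

  db2 "DROP TABLE STAGING.DAILY_LOAD"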
© Copyright IBM Corporation 2009 354
Notification Activity Stage
Include job
status info in
email body
© Copyright IBM Corporation 2009 355
User Variables Stage
Variable
User Variables stage
Expression defining the
value for the variable
© Copyright IBM Corporation 2009 356
Referencing the User Variable
Variable
© Copyright IBM Corporation 2009 357
Flow of Control Stages
© Copyright IBM Corporation 2009 358
Wait for File Stage
File
Options
© Copyright IBM Corporation 2009 359
Sequencer Stage
● Sequence multiple jobs using the Sequencer stage
Can be set to all
or any
© Copyright IBM Corporation 2009 360
Nested Condition Stage
Fork based on trigger
conditions
Trigger conditions
© Copyright IBM Corporation 2009 361
Loop Stages
Reference link to start
Pass counter value
Counter values
© Copyright IBM Corporation 2009 362
Error Handling
© Copyright IBM Corporation 2009 363
Handling Activities that Fail
Pass control to
Exception stage when
an activity fails
© Copyright IBM Corporation 2009 364
Exception Handler Stage
Control
goes here
if an
activity fails
© Copyright IBM Corporation 2009 365
Restart
© Copyright IBM Corporation 2009 366
Enable Restart
Enable
checkpoints to be
added
© Copyright IBM Corporation 2009 367
Disable Checkpoint at a Stage
Don’t
checkpoint
this activity
© Copyright IBM Corporation 2009 368
Checkpoint
1. Which stage is used to run jobs in a job sequence?
2. Does the Exception Handler stage support an input link?
© Copyright IBM Corporation 2009 369
Unit summary
Having completed this unit, you should be able to:
● Use the DataStage Job Sequencer to build a job that
controls a sequence of jobs
● Use Sequencer links and stages to control the sequence in which
a set of jobs runs
● Use Sequencer triggers and stages to control the conditions
under which jobs run
● Pass information in job parameters from the master
controlling job to the controlled jobs
● Define user variables
● Enable restart
● Handle errors and exceptions
© Copyright IBM Corporation 2009 370
InfoSphere QualityStage
v8.1
Workshop
© Copyright IBM Corporation 2009
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Course Contents
● Mod 13: Course Scenario
● Mod 14A: Structure of a Rule Set
● Mod 14B: Creation of a Custom Rule Set
● Mod 15: Initial Investigation of Data to be
Standardized
● Mod 16: Classification Schema/Classification
Table
● Mod 17: Pattern Action File
© Copyright IBM Corporation 2009 372
Module 13: Course Scenario
Information Management
© 2010 IBM Corporation
Information Management
Course Scenario
Acme Foods is implementing a data warehouse to:
– Provide the ability to analyze sales data and buying trends
– Enable improved target marketing campaigns for new customers
– Increase the profitability of existing customers
Historical sales data to be loaded into the warehouse has no
consistent structure or data formats as it was stored as free-form
text in various stand-alone systems used by the sales department
Prior to the warehouse load, historical data must also be enriched
with product ID data from the warehouse’s SCM system
10 IBM Confidential © 2010 IBM Corporation
Information Management
Review Questions
In what format is the current standardized data?
11 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 1: Setup Data Files
Copy data files to disk if not using the virtual machine supplied
with the course materials
– Create a working directory C:\DX746
– Copy the entire content from C:\Class_Files\DX746\Lab to C:\DX746
12 IBM Confidential © 2010 IBM Corporation
Module 14A: Structure of a Rule Set
Information Management
© 2010 IBM Corporation
Information Management
Rule Sets
Rule sets contain logic for:
– Parsing
– Classifying
– Processing data elements
Three required files
– Classification Table (.cls)
– Dictionary File (.dct)
– Pattern Action File (.pat)
Optional lookup tables (.tbl)
Override tables
50 IBM Confidential © 2010 IBM Corporation
Information Management
Rule Set Files
Classification Table (.CLS) – Contains token, standard value, class, and
threshold weights (optional)
Pattern Action File (.PAT) – Contains SEPLIST/STRIPLIST, NYSIIS and Soundex
calculations, and pattern action sets for processing data
Dictionary File (.DCT) – Contains field identifier, type, length, missing
value identifier, and field description
Lookup Tables (.TBL) – Optional conversion and lookup tables for converting
and returning standardized values
Override Tables (.TBL) – Tables for storing user overrides entered through
the QualityStage Designer GUI
51 IBM Confidential © 2010 IBM Corporation
Information Management
Classification Table (.cls)
The format for the Classification Table file is:
– token /standard value/class /[threshold-weights]/ [; comments]
The first line of a Classification Table file must be:
– \FORMAT\ SORT=Y
Each column must be separated by one or more spaces
Standard form of a token can be more than one word
– Must be in double quotes
– Example: POB “PO BOX” B
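Putting the format together, a small classification table might read as
follows (hypothetical entries; the 800 threshold weight on AVENUE tolerates
minor misspellings):

  \FORMAT\ SORT=Y
  AVENUE AVE T 800      ; street type
  POB "PO BOX" B        ; multi-word standard value in double quotes
  ROAD RD T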
52 IBM Confidential © 2010 IBM Corporation
Information Management
Classes
One-character tag indicating the class of the word
User-defined classes can be any letter, A to Z
Zero (0) class indicates a null class
– Use of a class of zero indicates that a word is a null (or noise) and is skipped
in the pattern matching process
– Standard abbreviation can be any value desired since it is not used
Any token not classed by the classification table receives the
appropriate default class
53 IBM Confidential © 2010 IBM Corporation
Information Management
Default Classes
^ Numeric, all digits, such as 1234
+ Single, unknown word such as HIPPO
? Unknown, one or more words, such as CHERRY HILL
> Leading numeric, numbers followed by one or more letters,
such as 123A
< Leading alpha, letters followed by one or more numbers, such
as A3
@ Complex mix, a mixture of alpha and numeric characters such
as 123A45
~ Special, special characters including !, \, @, ~, %, etc.
0 Null
- Hyphen
/ Slash
\& Ampersand
\# Number sign
\( Left parenthesis
54 IBM Confidential © 2010 IBM Corporation
Information Management
Threshold Weights
Threshold weights specify the degree of tolerated spelling
uncertainty
An information-theoretic string comparator is used that takes into
account phonetic errors, random insertion, deletion and
replacement of characters, and transpositions of characters
Score is weighted by the length of the word, since small errors in
long words are less serious than errors in short words
– 900 Exact match
– 800 Strings are almost certainly the same
– 750 Strings are probably the same
– 700 Strings are probably different
55 IBM Confidential © 2010 IBM Corporation
Information Management
Comparison Threshold Examples
Token .CLS Entry 850 800 750
NRTHWEST NORTHWEST yes yes yes
NRTHWST NORTHWEST no yes yes
NRTHW NORTHWEST no no yes
56 IBM Confidential © 2010 IBM Corporation
Information Management
Dictionary File (.dct)
Defines the fields for the output file for the rule set
The format for the Dictionary File is:
field-identifier/ field-type/ field-length/missing value/description
[;comments]
– Field identifier: Any length, no spaces in the name,
• Example: HouseNumber
– Field type: C (character), N (numeric)
• This is legacy information but must be entered into the dictionary file
– Field length: Length (in bytes)
– Missing value identifier: S (spaces), Z (zero), X (none)
• This is legacy information but a value must be entered into the dictionary
file
– Description: No spaces allowed in the description
The first line of a Dictionary File must be:
– \FORMAT\ SORT=N
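A minimal dictionary file following that format might be (hypothetical
fields; note there are no spaces in the descriptions):

  \FORMAT\ SORT=N
  HouseNumber C 10 S House_number ; comments follow a semicolon
  StreetName C 25 S Street_name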
57 IBM Confidential © 2010 IBM Corporation
Information Management
Pattern Action File (.pat)
Contains the rules for standardization
After the input record is tokenized (parsed) and classified,
patterns are executed in the order they appear in the Pattern
Action file
ASCII file that can be created or updated using any standard text
editor and has the following general format:
\ PRAGMA_START
SEPLIST " ~`!@#$%^&*()_-+={}[]|\\:;\"<>,.?/"
STRIPLIST " ~`!@$^*()_+={}[]|\\:;\"<>,?"
\ PRAGMA_END
\ POST_START
NYSIIS {SN} {NS}
RSOUNDEX {SN} {SS}
\ POST_END
pattern
58 IBM Confidential © 2010 IBM Corporation
Information Management
Optional Tables (.tbl)
Lookup tables
– Contain information that is specific to the rule set; for example the gender
lookup table for USNAME contains the following:
MRS F
MS F
MR M
– ASCII file that can be manually edited
User override tables
– User classification, input text, input pattern, unhandled text, unhandled
pattern
– Entries added through Designer Standardization Overrides GUI
59 IBM Confidential © 2010 IBM Corporation
Information Management
Module 14A Review Questions
Which rule set file is not sorted? Why?
If the standard form of a token is two words, how is it formatted in
the classification table?
Where are the SEPLIST and STRIPLIST located?
60 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 5: Copy and provision an existing rule set
USNAME CopyOfUSNAME
61 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 6: Modify a rule set
CopyOfUSNAME
– CLS – Apaloosa Apalusa F
– PAT – F | +
• Change COPY TO COPY_A
Use Rules Analyzer to test
62 IBM Confidential © 2010 IBM Corporation
Module 14B: Creation of a Custom Rule Set
Information Management
© 2010 IBM Corporation
Information Management
Custom Rule Set Development Cycle
Investigate source data file – word investigation
Develop classification schema/table
Re-run investigation with new classification table to reveal
patterns
Develop dictionary file
Develop pattern action sets
Run standardization and investigate results
Refine pattern action processing
64 IBM Confidential © 2010 IBM Corporation
Information Management
Investigate Data File
Which comes first – the investigation or the rule set?
In order to investigate a free-form field, you must specify a rule
No rule exists for your data
65 IBM Confidential © 2010 IBM Corporation
Information Management
The Answer
Use a generic rule set first
– No entries in the classification table
• In practice the classification table must contain at least one entry; one
is created by default when you create a new rule set
– Every token will be unclassified, therefore a default class (tag) will be
assigned based upon the data value
Create a folder and new Rule Set
Default values will be created for the dictionary, classification
table, and pattern action file
Modify the pattern action file to reference the default field in the
dictionary file
Use the Rule Analyzer to test
66 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 7: Create new rule set
Create a folder called Standardization Rules under the
ACMEFOOD Folder
Create a New Rule Set named FOOD
– File, New
– Select Data Quality folder
– Choose Rule Set
Modify (see slide notes)
– Dictionary file
– Pattern Action file
Run the Standardization Rules Analyzer using your new rule on
the following input
– “THIS IS A TEST”
– If you correctly copied and updated your new rule, it will return an unhandled
pattern of +++T
67 IBM Confidential © 2010 IBM Corporation
Information Management
Parsing
Parsing is controlled by the SEPLIST/STRIPLIST
SEPLIST/STRIPLIST entries are stored in the pattern action file
– SEPLIST/STRIPLIST can also be accessed under Advanced Options in the
Investigate stage Wizard
– The SEPLIST/STRIPLIST located in the investigation window is read from the
pattern action file and displayed on the Advanced Options of the word
investigation screen
• Any changes made to the lists in the investigation screen are not reflected
in the pattern action file; to change the list there, you must edit the
pattern action file manually
68 IBM Confidential © 2010 IBM Corporation
Information Management
Determining Parsing Requirements
Start by having only the space in the SEPLIST and only the space
in the STRIPLIST
Examine the pattern report to see how the current SEPLIST/
STRIPLIST combination parsed the data
Remember, it’s important to thoroughly parse data before
classifying words
– New words can result from additional parsing
69 IBM Confidential © 2010 IBM Corporation
Information Management
Determining Parsing Requirements
For example, if the “/” is not in the SEPLIST, the following pattern
results
VALHALLA/WRAPPER @
In this example, a brand name and a container type are treated as
the same data token
70 IBM Confidential © 2010 IBM Corporation
Information Management
Determining Parsing Requirements
When the “/” is added to the SEPLIST, the following pattern
results
VALHALLA / WRAPPER ?/?
The pattern now shows that there are two (currently) unclassified
words separated by a “/”
– There are now two more words we can classify
71 IBM Confidential © 2010 IBM Corporation
Information Management
Determining Parsing Requirements
Continue to re-run the Word Investigate stage while modifying the
SEPLIST/STRIPLIST combination until the free-form data is
parsed into manageable words, “data chunks”, also called tokens
– TIP: Except when representing a complex mix of alphas and numbers
(C3PO), the “@” in pattern reports is an indicator that more can be added to
the SEPLIST
Before adding a character to the STRIPLIST, be sure the character
doesn’t add meaning or context to the data
– Rule of thumb: Whenever in doubt don’t strip it out
It’s important to analyze the pattern reports for ways a would-be
stripped character can affect the data
72 IBM Confidential © 2010 IBM Corporation
Information Management
Determining Parsing Requirements
How would removing the highlighted special characters impact the meaning
of the below data items?
#2 pencil
5# flour
Ave.
$100.00
£100.00
1/2
73 IBM Confidential © 2010 IBM Corporation
Information Management
Update the SEPLIST/STRIPLIST
After you’ve determined the parsing requirements, update the
SEPLIST/STRIPLIST in the .PAT file
Using CTRL-C / CTRL-V (copy/paste) can be helpful in copying the
characters used in the Word Investigate stage and pasting them to
the rule set’s .PAT file
74 IBM Confidential © 2010 IBM Corporation
Information Management
Module 14B Review Questions
True or false: It’s best to start with all special characters in the
STRIPLIST.
75 IBM Confidential © 2010 IBM Corporation
Module 15: Initial Investigation of Data to be Standardized
Information Management
© 2010 IBM Corporation
Information Management
Exercise 5-1
Create a new Investigate stage
– Name: InvWordProductData
– Option: Word
– Input file: SourceProductData
– Rule: FOOD
– Field: ProdDescr
– Adv. Opt.: Include Unclassified Alphas
Include Classified Original Spelling
Create a job, add the Investigate stage and run
77 IBM Confidential © 2010 IBM Corporation
Information Management
Word Investigation Implementation
78 IBM Confidential © 2010 IBM Corporation
Information Management
Word Investigation Implementation
79 IBM Confidential © 2010 IBM Corporation
Information Management
Word Investigation Implementation
80 IBM Confidential © 2010 IBM Corporation
Information Management
Word Investigation Implementation
81 IBM Confidential © 2010 IBM Corporation
Information Management
Result File InvWordProductData – Pattern Report
82 IBM Confidential © 2010 IBM Corporation
Information Management
Result File InvWordProductData – Token Report
83 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 8: Create word investigation job
Refine SEPLIST/STRIPLIST
Modify .PAT to reflect new SEPLIST/STRIPLIST
Re-run Word Investigate stage
Iterate as often as needed
84 IBM Confidential © 2010 IBM Corporation
Module 16: Classification Schema/Classification Table
Information Management
© 2010 IBM Corporation
Information Management
Create the Classification Table
When creating or modifying a custom rule set, entries to the
classification table are often made by copying lines from the
Token Report
The user develops a classification schema for key tokens (words)
that create context in patterns
The classification schema is a set of one letter codes used to
classify categories of words
Available are the 26 letters of the alphabet and the numeral 0 (for
words you want set to null or nothing)
86 IBM Confidential © 2010 IBM Corporation
Information Management
Classification Schema
When choosing letters to represent categories, consider choosing
letters that are not easily confused with other categories
– For example, if one category of words is “Drugs” and the other is
“Diagnoses”, consider not categorizing either with the obvious choice of “D”
– This can help minimize misclassifications when creating the classification
table
87 IBM Confidential © 2010 IBM Corporation
Information Management
What to Classify
The general rule is “Don’t classify every word.”
Classify categories of words that occur many times in the data
and help to lend meaning and context to other items in the free-
form text
– Check the frequency of words in the investigation reports
Examples of words to classify are those whose category contains
a finite set of values, or whose expected values are finite
– For example: street types, name titles, name suffixes, unit types, units of
measure, packaging or container types, brand names
88 IBM Confidential © 2010 IBM Corporation
Information Management
What to Classify
Think twice before creating a classification for a category of
words for which the set of values are both infinite and do not
conform to a set of standards
Non-conforming or infinite category value words are best left to
be classified with default tags
– Examples include individual names, street names, and miscellaneous
description words
– Their context will allow them to be processed correctly in the pattern action
file
89 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 9: Create classification schema
Review the Token Report created in the previous exercise
Develop a classification schema
Add your classification schema key to the classification table
Copy and paste entries from the Token Report file into the
classification table and replace the “?” with the appropriate one-
character class
90 IBM Confidential © 2010 IBM Corporation
Information Management
Result File
91 IBM Confidential © 2010 IBM Corporation
Information Management
Review Questions
True or false: You should classify as many tokens as you have
time to.
Which file do you look in for candidates for inclusion in the
classification table?
True or false: When you start developing a new rule set, the
classification table is completely blank.
How many classes did you use? What are they?
92 IBM Confidential © 2010 IBM Corporation
Pattern Review
Refining the Classification Table
Information Management
© 2010 IBM Corporation
Information Management
Generating Patterns
Once the classification table has been created and the Word
Investigate stage is rerun, patterns will emerge
Review of the Pattern Report will indicate whether additional
classes or new members of existing classes must be added
Are the patterns explicit enough for you to glean the meaning of
the data based on its context?
– If you can answer yes to this question, you are ready to begin developing
pattern action sets
– If you answer no to this question, you must refine/modify your classification
schema/table until you can answer this question in the affirmative
94 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 10: Review Pattern Report
Add entries to/update the classification table
Re-run the job containing the Word Investigate stage
Review the Pattern Report
Refine/modify the classification table as needed
95 IBM Confidential © 2010 IBM Corporation
Information Management
Result File
96 IBM Confidential © 2010 IBM Corporation
Information Management
Module Review Question
Do any default classed tokens remain after you have developed
the classification table for a new rule set?
97 IBM Confidential © 2010 IBM Corporation
Module 17: Pattern Action File
Information Management
© 2010 IBM Corporation
Pattern Action Language
Information Management
© 2010 IBM Corporation
Information Management
Pattern Format
A pattern consists of one or more operands
Operands are separated by vertical lines
– The following pattern has four operands:
^|D|?|T
– These are referred to in the actions as
[1], [2], [3], and [4]
You can use spaces to separate operands
– ^|D|?|T is equivalent to ^ | D | ? | T
Comments follow a semicolon ; like this comment
100 IBM Confidential © 2010 IBM Corporation
Information Management
Pattern Action Set
A Pattern Action Set is comprised of a pattern and a set of action
statements
Pattern:
^|D|?|T
Action statements:
COPY [1] {HouseNumber}    ; Copy operand 1 to HOUSENUMBER
COPY_A [2] {PreDirection} ; Copy operand 2 w/standardization to PREDIR
COPY_S [3] {StreetName}   ; Copy operand 3 w/spaces to STREETNAME
COPY_A [4] {StreetType}   ; Copy operand 4 w/standardization to SUFFIXTYPE
101 IBM Confidential © 2010 IBM Corporation
Information Management
Common Action Statements
COPY - Copy the data in its original form
COPY_A - Copy the standardized abbreviation of the word from
the classification table
– HINT: Only use COPY_A if the token you’re dealing with has a letter
classification
COPY_S - Copy with spaces preserved
– HINT: Most often used with ?
CONCAT - Concatenates a data item to a user variable. Also
available is CONCAT_A.
RETYPE - Change the classification of a data item, and optionally
its actual and/or standard value
102 IBM Confidential © 2010 IBM Corporation
Information Management
Pattern Action File (.PAT)
Which of the following patterns will handle the following example?
Example: 1 North Ronald McDonald Lane
^|?|T
COPY [1] {HN} ; Copy operand 1 to HOUSENUMBER
COPY_S [2] {SN} ; Copy operand 2 w/spaces to STREETNAME
COPY_A [3] {ST} ; Copy operand 3 w/standardization to SUFFIXTYPE
^|D|?|T
COPY [1] {HN} ; Copy operand 1 to HOUSENUMBER
COPY_A [2] {PD} ; Copy operand 2 w/standardization to PREDIRECTION
COPY_S [3] {SN} ; Copy operand 3 w/spaces to STREETNAME
COPY_A [4] {ST} ; Copy operand 4 w/standardization to SUFFIXTYPE
103 IBM Confidential © 2010 IBM Corporation
Information Management
User Variables
This pattern action set looks for a date such as 10/15/98 and
copies it as a single token to a date field
*^|/|^|/|^
COPY [1] temp
CONCAT [2] temp
CONCAT [3] temp
CONCAT [4] temp
CONCAT [5] temp ; temp is a temporary (user) variable
COPY temp {DA}
104 IBM Confidential © 2010 IBM Corporation
Information Management
Position-related Commands
End of Field Specifier - $
– ^ | ? | T | $ would only match a field that had a simple address and nothing
else
• 123 Main Road would match
• 123 Main Road Apt 3 would not match
• Apt 3 123 Main Road would not match
Floating Position Specifier – *
– *^ | ? | T would look starting from the left to find a simple address that appears
anywhere within the field
• Home Address 123 Main Road would match
• Apt 3 123 Main Road would match
• 123 Main Road Apt 3 would match
105 IBM Confidential © 2010 IBM Corporation
Information Management
Position-related Commands
Reverse Floating Position Specifier - #
– #S | ^ would look starting from the right to find the first instance of state
followed by a numeric
– CA 45 Products, Phoenix, AZ 12345 Dept 45
– Notice it correctly finds AZ 12345. If we used the floating position (*), it
would have incorrectly identified CA 45 as the state and zip code
Fixed Position Specifier - %
– Specifies a particular operand based on its position in a string
– Input string: 123 Happy Valley Joy Road
• %1 would locate 123
• %3 would locate Valley
• %-1 would locate Road
106 IBM Confidential © 2010 IBM Corporation
Information Management
Subfield Commands
Subfield Classes - 1 to 9, -1 to -9
– Parses individual words of a ? string
– Johnson Controls Corp whose pattern is ? | C
– 1 | C – Johnson is operand 1 and Corp is operand 2
– -1 | C – Controls is operand 1 and Corp is operand 2
Subfield Ranges - beg:end
– Specified a range of unknown words
– 123 – A B Main Street whose pattern is ^ | - | ? | T
– ^ | - | (1:2) | T
• “A B Main” is the string specified by ?
• 1:2 is the portion A B
– Operand [3] is A B
• You can also look from the end of the field. For example, (-3:-1) specifies
a range of the third word from the last to the last word
107 IBM Confidential © 2010 IBM Corporation
Information Management
Additional Classes
Universal Class - **
– Matches all tokens
– ** | T would match any pattern that ended with a T class
Negation Class Qualifier - !
– Is used to indicate NOT
– ** | !T would match any pattern that did not end with a T class
108 IBM Confidential © 2010 IBM Corporation
Information Management
Conditionals
Simple Conditional Values
– ^ | ?=“MAPLE” | T
– Would only match simple addresses with the single street name of
“MAPLE”
• 120 MAPLE TERRACE would match
• 240 MAPLE DRIVE would match
• 140 MAPLE LEAF ROAD would not match
Series of Conditional Values
– ^ | ? | T=“AVE”, “DR”
– Would only match simple addresses with a standard form suffix of AVE or
DR
• 120 MAPLE AVE would match
• 240 MAPLE DR would match
• 140 MAPLE ROAD would not match
109 IBM Confidential © 2010 IBM Corporation
Information Management
Tables of Conditional Values
If it’s necessary to specify a large number of conditional values,
tables of values can be used
– Postcode.tbl contains valid postal codes
– S | ^ = @Postcode.tbl
– Would only match a state followed by a numeric if the numeric represented a
value in the Postcode.tbl file
110 IBM Confidential © 2010 IBM Corporation
Information Management
Referencing Operands
Referencing Data Represented by an Operand
– ^ [{} = 10000] is read as “a numeric whose actual value is 10000”
Referencing the length of an operand
– ^ [{} LEN = 5] is read as “a numeric whose length is 5 bytes”
Referencing the template of an operand
– @ [{} PICT = “cncc”] is read as “a complex mix that looks like a character
followed by a numeric followed by two characters” such as C3PO
Referencing a substring of an operand
– + [{} (-7:-1) = “STRASSE”] is read as “a single unknown alpha whose actual
value’s last 7 bytes are STRASSE” such as LEUVENSTRASSE
111 IBM Confidential © 2010 IBM Corporation
Information Management
Calling Subroutines
Subroutines facilitate faster processing
Pattern action sets are executed sequentially
Faster to test for a generic condition and then call a subroutine to
process
*M ; look for existence of apartment data
Call APTS ; send record to APTS subroutine
…
\SUB APTS
Process apartment data here
\END_SUB
112 IBM Confidential © 2010 IBM Corporation
Information Management
Review Questions
COPY_A copies what form of the token?
COPY_S is most commonly used with what default class?
Would ^ | ? | T find an address anywhere in a field?
True or false: ^ | ? | T | $ would only find an address that was the
only value in the field.
True or false: ** is the negation class qualifier.
113 IBM Confidential © 2010 IBM Corporation
Development of Pattern Action Sets
Information Management
© 2010 IBM Corporation
Information Management
Defining Standardize Output File Format
The output of the Standardize stage is dictated by the dictionary
file
Your business rules and requirements will dictate the output file
format
Because we need to match the product data against the
warehouse file, our output should match the format of the
warehouse file
115 IBM Confidential © 2010 IBM Corporation
Information Management
Warehouse File Format
DWProductData is the warehouse reference file and has the
following format:
116 IBM Confidential © 2010 IBM Corporation
Information Management
Result - Dictionary File
117 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 11: Develop Pattern Set
Add entries to your dictionary file based on the format of the
warehouse data file that you are going to match your standardized
data to
Select patterns and begin developing action statements (pattern
action sets) to bucket/standardize the data
– Create a Standardize stage
• Input file: SourceProductData
• Output file: StanProdData
• Option: Append All
• Rule: FOOD
• Standardize Field: ProdDescr
– Build and run Standardize job
118 IBM Confidential © 2010 IBM Corporation
Information Management
Result File
119 IBM Confidential © 2010 IBM Corporation
Refining Pattern Action Sets: Investigating Standardize Results
Information Management
© 2010 IBM Corporation
Information Management
Investigate Your Standardize Results
Once you’ve run the Standardize stage, you should investigate the
results to see how accurately your rule set is working
Your rule set contains two fields that make this process much
faster and easier
– Unhandled Pattern – which shows the pattern of any data it was unable to
process
– Unhandled Data – which shows the actual unhandled/unprocessed data
Running a Character Concatenate Investigate stage on these two
fields will provide you with a consolidated view of patterns/data
that still need processing
You should also run a Character Discrete Investigate stage on all
your single domain fields to ensure they have domain integrity
and your rule set is working as desired
121 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 12: Audit Standardization Results (Task 1)
Create a Character Concatenate Investigate stage on the
Unhandled_Pattern and Unhandled_Data fields
– Use a C mask on UnhandledPattern and an X mask on UnhandledData
– Be sure to increase your sample size
Add/refine your pattern action sets
Rerun your Standardize stage and your Character Concatenate
Investigate stage until you have processed all the data correctly
122 IBM Confidential © 2010 IBM Corporation
Information Management
Lab 12: Audit Standardization Results (Task 2)
Create a Character Discrete Investigate stage to investigate the
contents of all your single domain fields in your Standardize
stage result file
Review the report to ensure that you have domain integrity in all
the fields
If you identify any errors, go back and modify your rule’s pattern
action sets to correct the mishandling of the data
123 IBM Confidential © 2010 IBM Corporation
Information Management
Review Questions
What pattern/actual data value did you find most difficult to
process?
Were you able to attain 100% processing of the data?
Why did we use a C mask on the unhandled pattern and an X
mask on the unhandled data in our quality control Character
Concatenate Investigate stage?
Did you need to make any changes to your classification table?
124 IBM Confidential © 2010 IBM Corporation
DB2® v9.1
Workshop
© Copyright IBM Corporation 2009
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
Course Contents
● Mod 18: Overview of DB2
● Mod 19: Command Line Processor (CLP) and GUI
usage
● Mod 20: DB2 Environment
● Mod 21: Create an instance and explore the
environment
● Mod 22: Creating databases and data placement
● Mod 23: Creating database objects
● Mod 24: Moving data
● Mod 25: Backup and recovery
© Copyright IBM Corporation 2009 374
Overview of DB2 9 on Linux, UNIX and
Windows
© Copyright IBM Corporation 2010
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Contrast the DB2 Family of products
• Identify the DB2 Products
• Describe the functions of DB2 components
• Explore installation and parameters
© Copyright IBM Corporation 2010
DB2 product platforms
DB2 is available on the Windows, Linux/UNIX, zSeries, and iSeries platforms.
© Copyright IBM Corporation 2010
DB2: The scalable database
• IBM InfoSphere Warehouse (DPF database partitioning): supports
clusters of physical or virtual servers
• DB2 Enterprise Server: meets the data server needs of mid-size to
large-size businesses
• DB2 Workgroup Server: data server of choice for deployment in a
departmental, workgroup, or medium-sized business environment
• DB2 Express: entry-level DB2 data server, with entry-level pricing,
for small and medium business
• DB2 Express-C: free, entry-level edition of the DB2 data server
for the developer and partner community
© Copyright IBM Corporation 2010
Accessing DB2
• Command Line Processor: Commands, Interactive SQL, XQuery
• Administration Tools: Control Center, Command Editor, Health Center,
Task Center, Replication Center, Journal, Tools Settings
• Applications and APIs: Embedded SQL/XQuery, Call Level Interface,
JDBC/SQLJ, DAO/RDO/ADO
• All reach the Database Engine through the SQL/XML/API layer
© Copyright IBM Corporation 2010
DB2 product components
• Client-side components: IBM Data Server Driver Package,
IBM Data Server Client, IBM Data Server Runtime Client
• Server-side components: DB2 server and DB2 Connect
(DRDA Application Requester / DRDA Application Server)
• Communication support links the clients to the servers
© Copyright IBM Corporation 2010
Installation prerequisites
• Do you have:
– Enough disk space
– Enough memory
– Available Internet browser
– Security (Authorities) concept for
SYSADM / Admin. Server
• Specific requirements depend on operating system
http://www-01.ibm.com/software/data/db2/9/sysreqs.html
• Product documentation:
http://www.ibm.com/support/docview.wss?rs=71&uid=swg27015148
© Copyright IBM Corporation 2010
Installation steps
• GUI or non-GUI
• Customizing?
• Default instance?
• Admin-Server?
• Notification?
• Installation verification
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Contrast the DB2 Family of products
• Identify the DB2 Products
• Describe the functions of DB2 components
• Explore installation and parameters
© Copyright IBM Corporation 2010
Student exercise
© Copyright IBM Corporation 2010
Command Line Processor (CLP) and
GUI usage
© Copyright IBM Corporation 2010
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Use the Command Line Processor
• Explore the GUI environment
• Describe the DAS role with GUI tools
© Copyright IBM Corporation 2010
CLP Command Line Processor
© Copyright IBM Corporation 2010
CLP syntax
db2 [option-flag] { db2-command
                  | sql-statement
                  | ? [phrase | message | sql-state | class-code] }
© Copyright IBM Corporation 2010
Online reference
• Online Command Reference
db2 ?
db2 ? command string
db2 ? SQLnnnn (nnnn = 4 or 5 digit SQLCODE)
db2 ? nnnnn (nnnnn = 5 digit SQLSTATE)
• Online Reference Manuals
© Copyright IBM Corporation 2010
Using the CLP
Non-interactive mode
db2 connect to musicdb
db2 "select * from syscat.tables" more
(double quotes may be required)
Interactive mode
db2
db2=> connect to musicdb
db2=> select * from syscat.tables
© Copyright IBM Corporation 2010
CLP command options
© Copyright IBM Corporation 2010
Modify CLP options
1 Temporary for Command
db2 -r options.rep list command options
db2 -svtf create.tab3
db2 +c "update tab3 set salary=salary + 100"
2 Temporary for Interactive CLP Session
db2=>update command options using c off a on
3 Temporary for non-interactive CLP Session
export DB2OPTIONS="-svt" (UNIX)
set DB2OPTIONS="-svt" (Windows)
db2 -f create.tab3
4 Every session
put point 3 in UNIX db2profile
or System Program Group in Windows
© Copyright IBM Corporation 2010
Input file
Edit create.tab
-- comment: db2 -svtf create.tab
connect to sample;
create table tab3
(name varchar(20) not null,
phone char(40),
salary dec(7,2));
select * from tab3;
commit work;
connect reset;
db2 -svtf create.tab
© Copyright IBM Corporation 2010
Input file: Operating system commands
UNIX (vi seltab):
echo "Table Name Is" $1 > out.sel
db2 "select * from $1" >> out.sel

Windows (edit seltab.cmd):
echo 'Table Name Is' %1 > out.sel
db2 select * from %1 >> out.sel

Running "seltab org" produces out.sel with contents:
Table Name Is org
DEPTNUMB DEPTNAME        MANAGER DIVISION  LOCATION
10       Head Office     160     Corporate New York
15       New England     50      Eastern   Boston
20       Mid Atlantic    10      Eastern   Washington
38       South Atlantic  30      Eastern   Atlanta
42       Great Lakes     100     Midwest   Chicago
51       Plains          140     Midwest   Dallas
66       Pacific         270     Western   San Francisco
84       Mountain        290     Western   Denver
© Copyright IBM Corporation 2010
QUIT/TERMINATE/CONNECT RESET differences
COMMAND        Terminate CLP Back-end Process   Disconnect Database Connection
quit           No                               No
terminate      Yes                              Yes
connect reset  No                               Yes if CONNECT=1 (RUOW)
© Copyright IBM Corporation 2010
CLPPlus
• CLPPlus is a command line processor introduced in DB2 9.7:
– SQL*Plus compatible commands
– Variable substitution
– Column formatting
– Simple reporting
– Control variables
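A minimal invocation sketch (the user db2admin, host localhost, port 50000, and database sample are illustrative assumptions):
clpplus db2admin@localhost:50000/sample
SQL> select tabname from syscat.tables;
SQL> exit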
© Copyright IBM Corporation 2010
CLPPlus features
• Support for establishing connections to databases when a
database user ID and password are provided.
• A buffer that can be used to store scripts, script fragments,
SQL statements, SQL PL statements, or PL/SQL statements
for editing and then execution. Text in the buffer can be listed,
printed, edited, or run as a batch script.
• A comprehensive set of processor commands can be used to
define variables and strings that can be stored in the buffer.
• A set of commands that retrieve information about the
database and database objects.
• Ability to store buffers or buffer output to a file.
• Multiple options for formatting the output of scripts and
queries.
• Support for executing system-defined routines.
• Support for executing operating system commands.
• Option for recording the output of executed commands,
statements, or scripts.
© Copyright IBM Corporation 2010
Roadmap to the GUI tools
• Information Tools: Information Center (db2ic)
• Command Line Tools: Command Editor (db2ce), Command Window (db2cw),
Command Line Processor (db2), Command Line Processor Plus (clpplus)
• General Administration Tools: Control Center (db2cc), Journal
(db2journal), Replication Center (db2rc), Task Center (db2tc),
License Center
• Monitoring Tools: Event Analyzer (db2eva), Health Center (db2hc),
InDoubt Transaction Manager (db2indbt), Activity Monitor (db2am),
Memory Visualizer (db2memvis)
• Setup Tools: Configuration Assistant (db2ca), First Steps (db2fs),
Register Visual Studio Add-ins (db2vsregister), Satellite Synchronizer
© Copyright IBM Corporation 2010
Control Center
Panes: Object Tree pane, Contents pane, Object Detail pane
© Copyright IBM Corporation 2010
Command Editor
© Copyright IBM Corporation 2010
Administration tools
© Copyright IBM Corporation 2010
Using the wizards
© Copyright IBM Corporation 2010
Health Center
© Copyright IBM Corporation 2010
Journal
© Copyright IBM Corporation 2010
Configuration Assistant (1 of 2)
• Used to configure client connections
– Clients can be configured:
• Manually
• Automatically using DB2 Discovery
• Using imported profiles
• By searching the network
• DB2 CA can also:
– Connect to databases, both LAN and DRDA
– Perform CLI/ODBC administration
– Bind applications to database
(Diagram: the Configuration Assistant on a remote client uses DB2
Discovery to locate DB2 servers and the DB2 Administration Server)
© Copyright IBM Corporation 2010
Configuration Assistant (2 of 2)
© Copyright IBM Corporation 2010
DB2 Administration Server (DAS)
• DB2 control point to assist with server tasks
• Must have running DAS on server to use Configuration
Assistant, Control Center, or Task Center
• Might be created and configured at installation time
• Assists with:
– Enabling remote administration of DB2 servers
– Providing the facility for job management, including scheduling
– Providing a means for discovering information about the
configuration of DB2 instances, databases, and other DB2
Administration Servers in conjunction with DB2 Discovery
© Copyright IBM Corporation 2010
Using the Administration Server
• Creating the Administration Server:
– dascrt ASName (Linux/UNIX)
– db2admin create (Windows)
• Starting and stopping the Administration Server:
– db2admin start
– db2admin stop
• Listing the Administration Server:
– db2admin
• Configuring the Administration Server:
– db2 get admin cfg
– db2 update admin cfg using ...
• Removing the Administration Server:
– dasdrop ASName (Linux/UNIX)
– db2admin drop (Windows)
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Use the Command Line Processor
• Explore the GUI environment
• Describe the DAS role with GUI tools
© Copyright IBM Corporation 2010
Student exercise
© Copyright IBM Corporation 2010
The DB2 environment
© Copyright IBM Corporation 2010
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Specify the key features of an Instance
• Create and drop an Instance
• Use db2start and db2stop
• Distinguish between types of configuration
• Describe and modify the Database Manager Configuration
© Copyright IBM Corporation 2010
What is an instance?
One installed DB2 product can run multiple instances (Instance_1,
Instance_2). Each instance has its own DBM CFG and can manage several
databases (DB_1, DB_2, DB_3), each with its own DB CFG, catalog, and log.
© Copyright IBM Corporation 2010
The Database Manager instance
A computer system can run multiple Database Manager instances (inst1,
inst2). Each instance manages its own databases, and each database
contains table spaces (Tablespace A, Tablespace B) holding tables
(Table 1 through Table 4). A local user application connects to the
instance named by the DB2INSTANCE environment variable (for example,
PATH=... DB2INSTANCE=inst1).
© Copyright IBM Corporation 2010
Create and drop an instance
• CREATE (different on Linux/UNIX and Windows):
– Prerequisites met?
– Creates the Instance
– Creates the SQLLIB subdirectory
– Creates needed directories in the SQLLIB subdirectory
– Creates the DBM CFG with defaults
db2icrt –u <fencedID> <instance_name> (UNIX/Linux)
db2icrt <instance_name> (Windows)
• DROP:
– Instance must be stopped and no applications are allowed to be
connected to the databases in this instance
– Does not remove databases (catalog entries)
– Removes the Instance
db2idrop <instance_name>
© Copyright IBM Corporation 2010
Starting and stopping an instance
db2start db2stop
© Copyright IBM Corporation 2010
Setting DB2 variable values
• Environment Variables: platform specific; must reinitialize the
environment after changing
• Instance-Level Profile Registry: use the db2set command to change;
must stop/restart the instance
• Global-Level Profile Registry: use the db2set command to change;
no restart of the system after changing
© Copyright IBM Corporation 2010
The db2set command
• Command line tool for DB2 Profile Registry administration
• Displays, sets, resets, or removes profile variables
db2set variable=value
-g
-i instance [db-partition-number]
-n DAS node [[-u user id] [-p password]]
-u
-ul
-ur
-p
-r
-l
-lr
-v
-? (or -h)
-all
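Typical uses, as a minimal sketch (the variable DB2COMM and instance name inst20 are illustrative):
db2set -all                       (list all defined registry variables)
db2set DB2COMM=tcpip -i inst20    (set a variable for instance inst20)
db2set DB2COMM=                   (remove the variable setting)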
© Copyright IBM Corporation 2010
Database Manager configuration
C:\IBM\SQLLIB\BIN>db2 get dbm cfg
Database Manager Configuration
Node type = Enterprise Server Edition with local and remote clients
Database manager configuration release level = 0x0d00
Maximum total of files open (MAXTOTFILOP) = 16000
CPU speed (millisec/instruction) (CPUSPEED) = 4.605357e-007
Communications bandwidth (MB/sec) (COMM_BANDWIDTH) = 1.000000e+002
Max number of concurrently active databases (NUMDB) = 8
Federated Database System Support (FEDERATED) = NO
Transaction processor monitor name (TP_MON_NAME) =
Default charge-back account (DFT_ACCOUNT_STR) =
Java Development Kit installation path (JDK_PATH) = C:\IBM\SQLLIB\java\jdk
Diagnostic error capture level (DIAGLEVEL) = 3
Notify Level (NOTIFYLEVEL) = 3
Diagnostic data directory path (DIAGPATH) =
Size of rotating db2diag & notify logs (MB) (DIAGSIZE) = 0
Default database monitor switches
Buffer pool (DFT_MON_BUFPOOL) = OFF
Lock (DFT_MON_LOCK) = OFF
Sort (DFT_MON_SORT) = OFF
Statement (DFT_MON_STMT) = OFF
Table (DFT_MON_TABLE) = OFF
Timestamp (DFT_MON_TIMESTAMP) = ON
Unit of work (DFT_MON_UOW) = OFF
Monitor health of instance and databases (HEALTH_MON) = ON
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Specify the key features of an Instance
• Create and drop an Instance
• Use db2start and db2stop
• Distinguish between types of configuration
• Describe and modify the Database Manager Configuration
© Copyright IBM Corporation 2010
Student exercise
© Copyright IBM Corporation 2010
Creating databases and data placement
© Copyright IBM Corporation 2010
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Review specifics of creating a database
• Explore the System Catalog tables and views
• Compare DMS versus SMS table spaces
• Describe how to set up and manage a DB2 database with
Automatic Storage enabled
• Differentiate between table spaces, containers, extents, and
pages
• Define table spaces
• Use the get snapshot for tablespaces command to display
table space statistics
• Explore Database configuration parameters
© Copyright IBM Corporation 2010
Create database overview
Database Manager instance
– database1: Tablespace A and Tablespace B, containing Table 1 through Table 4
– database2: Tablespace A, containing Table 1 and Table 2
• Databases are created within a Database Manager instance
• Table spaces are a logical layer created within a database
• Tables are created within table spaces
© Copyright IBM Corporation 2010
Database storage requirements
• Database Path:
– Database Control Files for each database
• Includes Database Configuration file, Recovery History, Log Control files,
Table space Control file, Bufferpool Control file and others
– Initial location for database log files
– Default location is dftdbpath in DBM CFG
• Needs to be a local file system
• Automatic Storage paths:
– Allows table space storage to be managed at the database level
– If Automatic storage is enabled there will be at least one path defined
– Initial Storage Paths defined when database is created
• Default System table spaces:
– Use Automatic Storage management by default, if enabled, but can be
defined to use any supported type of table space
– SYSCATSPACE: DB2 catalog tables
– TEMPSPACE1: System Temporary tables, provide work space for
sorting and utility processing
– USERSPACE1: Initial table space for defining user tables and indexes
© Copyright IBM Corporation 2010
DB2 Storage Management basics
• DB2 supports three types of storage management for table spaces
• All three types can be used in a single database
• Storage Management type set when a table space is created
• DMS – Database Managed Storage:
– Table space containers defined using the specified files or raw devices
– Disk space allocated is reserved for objects in that table space
• SMS – System Managed Storage:
– Table space containers defined using the specified directories
– No defined initial size or limit
– DB2 creates files for each database object
– Disk space is freed when objects are dropped
• Automatic Storage Management:
– Disk Storage Paths are assigned to the database
– Can be enabled when a database is created or added to an existing database
– Available space can be monitored at the database partition level
– DB2 defines the number and names for containers
– Uses SMS for temporary storage and DMS for other storage types
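To contrast the three types, a minimal sketch of one table space of each kind (the names and paths are illustrative):
create tablespace sms_ts managed by system using ('/db2/smsdir')
create tablespace dms_ts managed by database using (file '/db2/dms1' 10000)
create tablespace auto_ts managed by automatic storage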
© Copyright IBM Corporation 2010
DMS table space: Minimum requirements
Table space creation:
– container tag: 1 extent per container
– table space header: 1 extent
– table space space map: 1 extent
– object table data: 1 extent
(table space metadata: 4 extents)
Table (object) creation:
– extent map: 1 extent
– extent for data: 1 extent
Minimum space required in containers = 6 extents (for first object)
If EXTENTSIZE = 32 pages, then 6*32 = 192 pages needed to be
provided in containers
© Copyright IBM Corporation 2010
CREATE DATABASE syntax
CREATE DATABASE database-name [AT DBPARTITIONNUM]
  [ Create Database Options ]

Create Database options:
  AUTOMATIC STORAGE {YES | NO}
  ON path|drive [, path|drive ...] [DBPATH ON path|drive]
  ALIAS db-alias
  USING CODESET codeset TERRITORY territory
  COLLATE USING collating-sequence
  PAGESIZE {4096 | integer [K]}
  DFT_EXTENT_SZ dft-extentsize
  RESTRICTIVE
  CATALOG TABLESPACE |tblspace-defn|
  USER TABLESPACE |tblspace-defn|
  TEMPORARY TABLESPACE |tblspace-defn|
  WITH "comment-string"
  |autoconfigure-settings|

tblspace-defn:
  MANAGED BY SYSTEM USING ('container-string', ...)
   | MANAGED BY DATABASE USING ({FILE | DEVICE} 'container-string' num-pages, ...)
  [EXTENTSIZE num-pages] [PREFETCHSIZE num-pages]
  [OVERHEAD number-of-milliseconds] [TRANSFERRATE number-of-milliseconds]
© Copyright IBM Corporation 2010
Create Database Wizard
© Copyright IBM Corporation 2010
CREATE DATABASE examples
create database sales1 on /dbsales1
– Database Path: /dbsales1
– Automatic Storage Path: /dbsales1
create database sales2 automatic storage no on
/dbsales2
– Database Path: /dbsales2
– Automatic Storage not enabled
create database sales3 on /dbauto3 dbpath on /dbsales3
– Database Path: /dbsales3
– Automatic Storage Path: /dbauto3
create database sales4 automatic storage yes
on /dbauto41,/dbauto42,/dbauto43
dbpath on /dbsales4
– Database Path: /dbsales4
– Automatic Storage Paths: /dbauto41, /dbauto42 and /dbauto43
© Copyright IBM Corporation 2010
Completed by DB2 during database creation (1 of 2)
1. Creates database in the specified subdirectory
2. Creates IBMDEFAULTGROUP, IBMCATGROUP, and IBMTEMPGROUP database
partition groups
3. Creates SYSCATSPACE, TEMPSPACE1, and USERSPACE1 table spaces
4. Creates the system catalog tables and recovery logs
5. Catalogs database in local database directory and system database
directory
6. Stores the specified code set, territory, and collating sequence
7. Creates the schemas SYSCAT, SYSFUN, SYSIBM, and SYSSTAT
8. Binds database manager bind files to the database (db2ubind.lst)
– DB2 CLI packages are automatically bound to databases when the databases
are created or migrated. If a user has intentionally dropped a package, then
you must rebind db2cli.lst.
© Copyright IBM Corporation 2010
Completed by DB2 during database creation (2 of 2)
9. Grants the following privileges:
– ACCESSCTRL , DATAACCESS , DBADM and SECADM
privileges to database creator
– SELECT privilege on system catalog tables and views to PUBLIC
– UPDATE access to the SYSSTAT catalog views
– BIND and EXECUTE privilege to PUBLIC for each successfully
bound utility
– CREATETAB, BINDADD, IMPLICIT_SCHEMA, and CONNECT
authorities to PUBLIC
– USE privilege on USERSPACE1 table space to PUBLIC
– Usage of the WLM workload for user class
SYSDEFAULTUSERCLASS to PUBLIC
• When the RESTRICTIVE option is used, no privileges are
automatically granted to PUBLIC
© Copyright IBM Corporation 2010
Database Path files
CREATE DATABASE DSS ON /dbauto DBPATH ON /database
Database path contents (/database/inst20/NODEnnnn/..., with DB2INSTANCE=inst20):
– sqldbdir (SQLDBDIR directory)
– db2nodes.cfg
– SQLxxxxx database directory containing:
  SQLDBCONF, SQLBP.1, SQLBP.2
  db2rhist.asc, db2rhist.bak
  SQLOGCTL.LFH.1, SQLOGCTL.LFH.2, SQLOGMIR.LFH
  SQLSGF.1, SQLSGF.2
  SQLSPCS.1, SQLSPCS.2
  SQLOGDIR: S0000000.LOG, S0000001.LOG, S0000002.LOG
© Copyright IBM Corporation 2010
Default table space containers with Automatic Storage
CREATE DATABASE DSS ON /dbauto DBPATH ON /database  (DB2INSTANCE=inst20)
Storage path contents (/dbauto/inst20/NODE0000/DSS/):
– T0000000 (SYSCATSPACE): C0000000.CAT
– T0000001 (TEMPSPACE1): C0000000.TMP
– T0000002 (USERSPACE1): C0000000.LRG
© Copyright IBM Corporation 2010
System Catalog tables and views
The SYSCAT and SYSSTAT views are defined over the base catalog tables
(for example, SYSIBM.SYSTABLES and SYSIBM.SYSCOLUMNS).
See notes for actual table and view names.
© Copyright IBM Corporation 2010
Create BUFFERPOOL syntax
• Requires SYSCTRL or SYSADM authority
CREATE BUFFERPOOL bufferpool-name [DEFERRED]
  [ALL DBPARTITIONNUMS | DATABASE PARTITION GROUP db-partition-group-name, ...]
  [SIZE {number-of-pages | [number-of-pages] AUTOMATIC}]        (default: SIZE 1000)
  [NUMBLOCKPAGES number-of-pages [BLOCKSIZE number-of-pages]]   (default: NUMBLOCKPAGES 0)
  [PAGESIZE integer [K]]
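For example, a minimal sketch creating an 8K buffer pool and a table space that uses it (the names bp8k and ts8k are illustrative):
create bufferpool bp8k size automatic pagesize 8 k
create tablespace ts8k pagesize 8 k bufferpool bp8k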
© Copyright IBM Corporation 2010
Table space, container, extent, page
Hierarchy: a table space consists of containers; space in a container
is allocated in extents; an extent is a set of contiguous data pages.
© Copyright IBM Corporation 2010
Containers and table spaces
• A container is an allocation of physical space
• What does a container look like?
– Directory (SMS)
– File (DMS)
– Device (DMS)
© Copyright IBM Corporation 2010
Writing to containers
• DFT_EXTENT_SZ defined at database level
• EXTENTSIZE defined at table space level
• Data written in round-robin manner
Extents are written round-robin across the containers of table space
TBS: extents 0 and 2 go to Container 0, extents 1 and 3 to Container 1.
© Copyright IBM Corporation 2010
Table space design limits: Row Identifiers
• Standard RIDs – Row ID (RID) is 4 bytes
– For tables in REGULAR table spaces
– Maximum 16M pages per object, 255 rows per page, about 4x10^9 rows
Page size   Maximum table space size
4 KB        64 GB
8 KB        128 GB
16 KB       256 GB
32 KB       512 GB

• Large RIDs – Large Row ID (RID) is 6 bytes
– For tables in LARGE table spaces (DMS or Automatic Storage)
– Also all SYSTEM and USER temporary table spaces
– Maximum 512M pages per object, ~2K rows per page, about 1.1x10^12 rows
Page size   Maximum table space size
4 KB        8 TB
8 KB        16 TB
16 KB       32 TB
32 KB       64 TB
© Copyright IBM Corporation 2010
CREATE TABLESPACE syntax (1 of 2)
CREATE [LARGE | REGULAR | {SYSTEM | USER} TEMPORARY] TABLESPACE tablespace-name
  [IN DATABASE PARTITION GROUP db-partition-group-name]
  [PAGESIZE {4096 | integer [K]}]
  [MANAGED BY AUTOMATIC STORAGE |size-attributes|
   | MANAGED BY SYSTEM |system-containers|
   | MANAGED BY DATABASE |database-containers|]
  [EXTENTSIZE {num-pages | integer {K|M|G}}]
  [PREFETCHSIZE {AUTOMATIC | num-pages | integer {K|M|G}}]
  [BUFFERPOOL bufferpool-name]
  [OVERHEAD number-of-milliseconds]       (default: 7.5)
  [TRANSFERRATE number-of-milliseconds]   (default: 0.06)
  [FILE SYSTEM CACHING | NO FILE SYSTEM CACHING]
  [DROPPED TABLE RECOVERY {ON | OFF}]     (default: ON)
© Copyright IBM Corporation 2010
CREATE TABLESPACE syntax (2 of 2)
size-attributes:
  [AUTORESIZE {YES | NO}] [INITIALSIZE integer {K|M|G}]
  [INCREASESIZE integer {PERCENT|K|M|G}] [MAXSIZE {integer {K|M|G} | NONE}]

system-containers:
  USING ('container-string', ...) |on-db-partitions-clause|

database-containers:
  USING |container-clause| |on-db-partitions-clause|

container-clause:
  ({FILE | DEVICE} 'container-string' {num-pages | integer {K|M|G}}, ...)

on-db-partitions-clause:
  ON {DBPARTITIONNUM | DBPARTITIONNUMS} (db-partition-number1 [TO db-partition-number2], ...)
© Copyright IBM Corporation 2010
Create Table Space Wizard
© Copyright IBM Corporation 2010
Storage Management alternatives: Automatic
• Automatic Storage Managed:
– Administration is very easy, no need to define the number or names of
the containers
– Disk space assigned and maintained at the database level
– Monitoring of available space at the database partition level instead of
each table space
– Multiple containers will be created using all available storage paths for
performance
– Automatic Storage can be enabled when the database is created or
added to an existing database
• Default is ON for CREATE DATABASE with DB2 9
– Storage paths can be added or removed using ALTER DATABASE
– Uses standard DMS and SMS under the covers:
• DMS used for REGULAR and LARGE table spaces
• SMS used for SYSTEM and USER TEMPORARY table spaces
– Table space allocation controlled by CREATE/ALTER options:
• INITIALSIZE: Defaults to 32 MB
• AUTORESIZE: Can be set to YES or NO
• INCREASESIZE: Can be set to amount or percent increase
• MAXSIZE: Can define growth limits
© Copyright IBM Corporation 2010
Automatic storage: Table space examples
• Syntax for CREATE and ALTER TABLESPACE:
CREATE TABLESPACE <tsName> [MANAGED BY AUTOMATIC STORAGE]
[INITIALSIZE integer {K|M|G}]
[AUTORESIZE {NO|YES}] [INCREASESIZE integer {PERCENT|K|M|G}]
[MAXSIZE {NONE | integer {K|M|G}}]
• Default initial size is 32 MB
• Default max size is none
• Default increase size is determined by DB2, which might
change over the life of the table space
• Examples:
CREATE TABLESPACE USER1
CREATE TEMPORARY TABLESPACE TEMPTS
CREATE TABLESPACE MYTS INITIALSIZE 100 M MAXSIZE 1 G
CREATE LARGE TABLESPACE LRGTS INITIALSIZE 5 G AUTORESIZE NO
CREATE REGULAR TABLESPACE USER2 INITIALSIZE 500 M
© Copyright IBM Corporation 2010
ALTER TABLESPACE
• ALTER TABLESPACE can be used to change table space
characteristics:
– For all types of table space management, you can adjust:
• Bufferpool Assigned
• Prefetch size
• Overhead and Transfer rate I/O Costs
• File System Caching option
– For DMS-managed table spaces, you can:
• Use the ADD, DROP, RESIZE, EXTEND, REDUCE and BEGIN NEW
STRIPE SET to directly adjust container names and sizes.
• Use MANAGED BY AUTOMATIC STORAGE to convert to automatic
storage management
– For Automatic Storage-managed table spaces, you can:
• Use the REDUCE option to release unused disk space
• Use REBALANCE to reallocate containers when automatic storage
paths are added or dropped
– For DMS and Automatic Storage, you can:
• Change the MAXSIZE, INCREASESIZE and AUTORESIZE settings
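A few of these adjustments as a minimal sketch (the table space names are illustrative):
alter tablespace userspace1 prefetchsize automatic
alter tablespace myts maxsize 2 g
alter tablespace myts reduce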
© Copyright IBM Corporation 2010
Adding or dropping Automatic Storage Paths
• New Automatic Storage paths can be added to a database using
the ALTER DATABASE statement
alter database add storage on '/dbauto/path3','/dbauto/path4'
– Automatic Storage can be enabled in an existing database by adding
the first paths
– Existing Automatic tablespaces will grow using the previously
assigned storage paths until remaining space is used
– Newly created tablespaces will begin to use all defined paths
– Individual table spaces can be altered using REBALANCE to spread
data over all storage paths
• Storage paths can also be removed using ALTER DATABASE
alter database drop storage on '/dbauto/path1'
– The dropped path will be in drop pending state until an ALTER
TABLESPACE with the REBALANCE option is used for each table
space with containers on the dropped path
© Copyright IBM Corporation 2010
Alter Table Space Wizard
© Copyright IBM Corporation 2010
GET SNAPSHOT FOR TABLESPACES command
• Use GET SNAPSHOT FOR TABLESPACES ON <database>
– LIST TABLESPACES does not provide all of the information
available
• Example output of GET SNAPSHOT FOR TABLESPACES ON SAMPLE:
Tablespace name = SYSCATSPACE
Tablespace ID = 0
…
Automatic Prefetch size enabled = Yes
Buffer pool ID currently in use = 1
Buffer pool ID next startup = 1
Using automatic storage = Yes
Auto-resize enabled = Yes
…
Increase size (bytes) = AUTOMATIC
• Note that Using automatic storage = Yes and Increase size
(bytes) = AUTOMATIC are shown in the snapshot.
© Copyright IBM Corporation 2010
Using the SNAPCONTAINER view
• Use the SYSIBMADM.SNAPCONTAINER view to retrieve
container information
SELECT SNAPSHOT_TIMESTAMP, TBSP_NAME, TBSP_ID,
CONTAINER_NAME, CONTAINER_ID, CONTAINER_TYPE,
ACCESSIBLE, TOTAL_PAGES, USABLE_PAGES,
DBPARTITIONNUM FROM SYSIBMADM.SNAPCONTAINER ORDER
BY DBPARTITIONNUM, TBSP_NAME, CONTAINER_ID
© Copyright IBM Corporation 2010
Using the MON_GET_TABLESPACE function
• The MON_GET_TABLESPACE table function returns monitor metrics
for one or more table spaces.
Syntax
>>-MON_GET_TABLESPACE--(--tbsp_name--,--member--)---><
• Example
To list table spaces ordered by number of physical reads from table
space containers.
SELECT varchar(tbsp_name, 30) as tbsp_name,
member,
tbsp_type,
pool_data_p_reads
FROM TABLE(MON_GET_TABLESPACE('',-2)) AS t
ORDER BY pool_data_p_reads DESC
TBSP_NAME MEMBER TBSP_TYPE POOL_DATA_P_READS
------------------------------ ------ ---------- --------------------
SYSCATSPACE 0 DMS 79
USERSPACE1 0 DMS 34
TEMPSPACE1 0 SMS 0
© Copyright IBM Corporation 2010
Data placement considerations
• If index scans are required:
– Consider separate table spaces for index and data
• Consider placing index table space on fastest media available
• Consider placing entire table onto separate table space
– If accessed frequently
• Consider EXTENTSIZE
– Trade-off between space (very small tables) and performance
• I/O PREFETCH size
– Setting to AUTOMATIC is the default and recommended
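Separating data and index objects as a minimal sketch (the table space and table names are illustrative):
create tablespace data_ts
create tablespace index_ts
create table app.orders (order_id integer not null) in data_ts index in index_ts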
© Copyright IBM Corporation 2010
Maintain or List System Database Directory
• CLP
db2 ? CATALOG DATABASE
CATALOG DATABASE database-name [AS alias] [ON path | AT NODE nodename]
[AUTHENTICATION {SERVER | CLIENT | ... | SERVER_ENCRYPT}]
[WITH "comment-string"]
db2 'CATALOG DATABASE ourdb AS TESTDB ON /database WITH "Test Database"'
db2 LIST DATABASE DIRECTORY
...
Database alias = TESTDB
Database name = OURDB
Local database directory = /database
Database release level = a.00
Comment = Test Database
Directory entry type = Indirect
Catalog database partition number = 4
© Copyright IBM Corporation 2010
Additional DB2 System table spaces
• The SYSTOOLSPACE table space is a user data table space used by the DB2
administration tools and some SQL administrative routines for storing historical data
and configuration information:
– Design Advisor
– Alter table notebook
– Configure Automatic Maintenance wizard
– Storage Management tool
– db2look command
– Automatic statistics collection (including the Statistics Collection Required
health indicator)
– Automatic reorganization (including the Reorganization Required health
indicator)
– GET_DBSIZE_INFO stored procedure
– ADMIN_COPY_SCHEMA stored procedure
– ADMIN_DROP_SCHEMA stored procedure
– SYSINSTALLOBJECTS stored procedure
– ALTOBJ stored procedure
• The SYSTOOLSTMPSPACE table space is a user temporary table space used by
the REORGCHK_TB_STATS, REORGCHK_IX_STATS and the ADMIN_CMD
stored procedures for storing temporary data.
© Copyright IBM Corporation 2010
Database configuration
db2 get db cfg for musicdb show detail
db2 update db cfg for musicdb using <parm> <value>
© Copyright IBM Corporation 2010
Database Manager configuration
db2 get dbm cfg show detail
db2 update dbm cfg using <parm> <value>
© Copyright IBM Corporation 2010
ACTIVATE and DEACTIVATE the database
• ACTIVATE DATABASE
– Activates the specified database and starts up all necessary
database services, so that the database is available for connection
and use by any application.
• Incurs the database startup processing
db2 activate db <db_name> user <user> using
<password>
• DEACTIVATE DATABASE
– Databases initialized by ACTIVATE DATABASE can be shut down
using the DEACTIVATE DATABASE command, or using the
db2stop command.
db2 deactivate db <db_name>
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Review specifics of creating a database
• Explore the System Catalog tables and views
• Compare DMS versus SMS table spaces
• Describe how to set up and manage a DB2 database with
Automatic Storage enabled
• Differentiate between table spaces, containers, extents, and
pages
• Define table spaces
• Use the get snapshot for tablespaces command to display
table space statistics
• Explore Database configuration parameters
© Copyright IBM Corporation 2010
Student exercise
© Copyright IBM Corporation 2010
Creating database objects
© Copyright IBM Corporation 2010
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• List DB2 object hierarchy and physical directories and files
• Create the following objects:
• Schema, Table, View, Alias, Index
• Review the use of Temporary Tables
• Explore the use and implementation of Check Constraints,
Referential Integrity and Triggers
• Explore the need for and the use of Large Objects
• Recognize XML and its native store as critical infrastructure
for emerging technologies
© Copyright IBM Corporation 2010
DB2 object hierarchy
Instance 1 (dbm configuration file) contains Database 1 and Database 2.
Each database has its own DB configuration file, log, and SYSCATSPACE
catalog (with views such as View1 and View2), plus table spaces
(TSSMS1, TSDMSLRG1, TSDMSREG2, TSDMSLRG3, USERSPACE1) holding tables
(Table1 through Table3), indexes (Index1, Index2), and BLOBs.
© Copyright IBM Corporation 2010
Database physical directories and files
/path/[DB2INSTANCE]/[NODE0000]/
  [sqldbdir]
  [SQL00001]/                    (database directory)
    SQLDBCON, SQLDBCONF, SQLINSLK, SQLTMPLK
    db2rhist.asc, db2rhist.bak
    SQLBP.1, SQLBP.2
    SQLOGCTL.LFH.1, SQLOGCTL.LFH.2, SQLOGMIR.LFH
    SQLSGF.1, SQLSGF.2
    SQLSPCS.1, SQLSPCS.2
    [db2event]/[db2detaildeadlock]: 00000000.EVT, DB2EVENT.CTL
    [SQLOGDIR]: S0000000.LOG, S0000001.LOG
  [SAMPLE]/                      (automatic storage path)
    [T0000000]: SQLCRT.FLG, C0000000.CAT
    [T0000001]: SQLCRT.FLG, [C0000000.TMP]
    [T0000002]: SQLCRT.FLG, C0000000.LRG
    [T0000003]: SQLCRT.FLG, C0000000.LRG
    [T0000004]: SQLCRT.FLG, C0000000.LRG
    [T0000005]: SQLCRT.FLG, C0000000.LRG
  [your_SMS_tablespace]/
    [CONT1]: SQLTAG.NAM
    [CONT2]: SQLTAG.NAM, SQL00041.DAT
© Copyright IBM Corporation 2010
Create Schema
CREATE SCHEMA PAYROLL AUTHORIZATION DB2ADMIN;
COMMENT ON Schema PAYROLL IS 'schema for
payroll application';
© Copyright IBM Corporation 2010
Set current schema
connect to musicdb user Keith;
select * from employee;
• Will select from KEITH.EMPLOYEE
set current schema = 'PAYROLL';
select * from employee;
• Will select from PAYROLL.EMPLOYEE
© Copyright IBM Corporation 2010
Create Table
© Copyright IBM Corporation 2010
Create Table statement
connect to musicdb;
create table artists
(artno smallint not null,
name varchar(50) with default 'abc',
classification char(1) not null,
bio clob(100K) logged,
picture blob(10M) not logged compact)
in dms01
index in dms02
long in dms03;
© Copyright IBM Corporation 2010
Application Temporary Tables
• Applications can utilize global temporary tables:
– Global temporary tables are stored in User temporary table spaces
– The associated data is deleted when the application connection ends
– Each connection accesses a private copy of the table, so it only sees data put
into the table by that one connection
– Normal table locking is not needed for global temporary tables since the data is
always limited to a single connection
– Declared Global Temporary table:
• These are defined during application execution using the DECLARE GLOBAL
TEMPORARY TABLE statement
• No DB2 Catalog information is required
• The schema for a declared global temporary table is always ‘SESSION’
– Created Global Temporary table:
• These are defined using a CREATE GLOBAL TEMPORARY TABLE statement
with a user selected schema
• The table and any associated indexes may be created before an application
connection starts, but only the catalog definition exists
• The catalog definition is used when the application references the table
• Each application connection still works only with the data it stores in the table
© Copyright IBM Corporation 2010
Example of a Declared Temporary Table
• A System Administrator creates the user temporary table space
CREATE USER TEMPORARY TABLESPACE "USR_TEMP_TS"
PAGESIZE 4 K MANAGED BY AUTOMATIC STORAGE
BUFFERPOOL IBMDEFAULTBP ;
• The application uses SQL statements to declare and access the
table
DECLARE GLOBAL TEMPORARY TABLE T1
LIKE REAL_T1
ON COMMIT DELETE ROWS
NOT LOGGED
IN USR_TEMP_TS;
INSERT INTO SESSION.T1
SELECT * FROM REAL_T1 WHERE DEPTNO=:mydept;
• /* do other work on T1 */
• /* when connection ends, table is automatically dropped */
© Copyright IBM Corporation 2010
Example of a Created Temporary Table
• A System Administrator creates the user temporary table
space and defines a global temporary table and indexes
CREATE USER TEMPORARY TABLESPACE "USR_TEMP_TS2"
PAGESIZE 4 K MANAGED BY AUTOMATIC STORAGE ;
CREATE GLOBAL TEMPORARY TABLE APP1.DEPARTMENT
LIKE PROD.DEPARTMENT
ON COMMIT DELETE ROWS
NOT LOGGED IN USR_TEMP_TS2;
CREATE INDEX APP1.DEPTIX1 ON APP1.DEPARTMENT (DEPTNO);
• The application uses SQL statements to reference the
temporary table; no DECLARE is needed
INSERT INTO APP1.DEPARTMENT
SELECT * FROM PROD.DEPARTMENT WHERE DEPTNO=:mydept;
SELECT * FROM APP1.DEPARTMENT WHERE LASTNAME = 'STOPFER'
© Copyright IBM Corporation 2010
Large objects: The need
Large object types store data such as news articles, images, audio,
and whole books inside DB2:
– BLOB: Binary Large OBject
– CLOB: Character Large OBject
– DBCLOB: Double-Byte Character Large OBject
© Copyright IBM Corporation 2010
Table partitioning
• Data organization scheme in which table data is divided across
multiple storage objects called data partitions or ranges:
– Each data partition is stored separately
– These storage objects can be in different table spaces, in the same
table space, or a combination of both
• Benefits:
– Easier roll-in and roll-out of table data
– Allows large data roll-in (ATTACH) or roll-out (DETACH) with a
minimal impact to table availability for applications
– Supports very large tables
– Indexes can be either partitioned (local) or non-partitioned (global)
– Table and Index scans can use partition elimination when access
includes predicates for the defined ranges
– Hierarchical Storage Management (HSM) solutions can be used to
manage older, less frequently accessed ranges
© Copyright IBM Corporation 2010
Example of a partitioned table
• The PARTITION BY RANGE clause defines a set of data ranges
CREATE TABLE PARTTAB.HISTORYPART ( ACCT_ID INTEGER NOT NULL ,
TELLER_ID SMALLINT NOT NULL ,
BRANCH_ID SMALLINT NOT NULL ,
BALANCE DECIMAL(15,2) NOT NULL ,
……….
TEMP CHAR(6) NOT NULL )
PARTITION BY RANGE (BRANCH_ID)
(STARTING FROM (1) ENDING (20) IN TSHISTP1 INDEX IN TSHISTI1 ,
STARTING FROM (21) ENDING (40) IN TSHISTP2 INDEX IN TSHISTI2 ,
STARTING FROM (41) ENDING (60) IN TSHISTP3 INDEX IN TSHISTI3 ,
STARTING FROM (61) ENDING (80) IN TSHISTP4 INDEX IN TSHISTI4 ) ;
CREATE INDEX PARTTAB.HISTPIX1 ON PARTTAB.HISTORYPART (TELLER_ID)
PARTITIONED ;
CREATE INDEX PARTTAB.HISTPIX2 ON PARTTAB.HISTORYPART (BRANCH_ID)
PARTITIONED ;
• In this example the data objects and index objects for each data range
are stored in different table spaces
• The table spaces used must be defined with the same options, such as
type of management, extent size and page size
© Copyright IBM Corporation 2010
Create View
© Copyright IBM Corporation 2010
Create View statement
CONNECT TO TESTDB;
CREATE VIEW EMPSALARY
AS SELECT EMPNO, EMPNAME, SALARY
FROM PAYROLL, PERSONNEL
WHERE EMPNO=EMPNUMB AND SALARY > 30000.00;
SELECT * FROM EMPSALARY
EMPNO EMPNAME SALARY
------ ----------------- ----------
10 John Smith 1000000.00
20 Jane Johnson 300000.00
30 Robert Appleton 250000.00
...
© Copyright IBM Corporation 2010
Create Alias
© Copyright IBM Corporation 2010
Create Alias statement
• Cannot be the same as an existing table, view, or alias
CONNECT TO SAMPLE;
CREATE ALIAS ADMIN.MUSICIANS FOR ADMIN.ARTISTS;
COMMENT ON Alias ADMIN.MUSICIANS IS 'MUSICIANS
alias for ARTISTS';
CONNECT RESET;
© Copyright IBM Corporation 2010
Create Index
© Copyright IBM Corporation 2010
Create Index statements
create unique index dba1.empno on
dba1.employee (empno asc)
pctfree 10
minpctused 10
allow reverse scans
page split symmetric
collect sampled detailed statistics ;
create unique index itemno on albums (itemno) ;
create index item on stock (itemno) cluster ;
create unique index empidx on employee (empno)
include (lastname, firstname) ;
© Copyright IBM Corporation 2010
Overview of Referential Integrity
Department Table (Parent Table): DEPT, DEPTNAME
– PRIMARY KEY = DEPT
Employee Table (Dependent Table): EMPNO, NAME, WKDEPT
– FOREIGN KEY = WKDEPT (delete rule: RESTRICT)
• Place constraints between tables
• Constraints specified with Create and Alter table statements
• Database services enforce constraints: inserts, updates, deletes
• Removes burden of constraint checking from application programs
© Copyright IBM Corporation 2010
RI: Add a foreign key
© Copyright IBM Corporation 2010
Referential Integrity: CREATE TABLE statement
CREATE TABLE DEPARTMENT
(DEPT CHAR (3) NOT NULL,
DEPTNAME CHAR(20) NOT NULL,
CONSTRAINT DEPT_NAME
UNIQUE (DEPTNAME),
PRIMARY KEY (DEPT))
IN DMS99 .... ;
CREATE TABLE EMPLOYEE
(EMPNO CHAR(6) NOT NULL,
NAME CHAR(30),
WKDEPT CHAR(3) NOT NULL,
CONSTRAINT DEPT
FOREIGN KEY
(WKDEPT)
REFERENCES DEPARTMENT
ON DELETE RESTRICT)
IN DMS100 ... ;
© Copyright IBM Corporation 2010
Unique Key considerations
• Multiple keys in one table can be foreign key targets
CREATE TABLE PAY.EMPTAB
(EMPNO SMALLINT NOT NULL PRIMARY KEY,
NAME CHAR(20),
DRIVERLIC CHAR(17) NOT NULL,
CONSTRAINT DRIV_UNIQ UNIQUE(DRIVERLIC)
)IN TBSP1 ;
– Unique indexes PAY.DRIV_UNIQ and SYSIBM.yymmddhhmmssxxx are created
– Unique key columns must be NOT NULL
• Deferral of unique checking until end of statement processing
– UPDATE EMPTAB SET EMPNO=EMPNO + 1
© Copyright IBM Corporation 2010
Overview of Check Constraints: The need
DATA RULE: No value in the US_SL column in the table SPEED_LIMITS
can exceed 65.
With application enforcement, every program (PGM 1 ... PGM n) must
carry its own copy of the rule; with database enforcement, the rule
is defined once in DB2 and applied to all programs.
© Copyright IBM Corporation 2010
Check Constraints: Definition
CREATE TABLE SPEED_LIMITS
(ROUTE_NUM SMALLINT,
CANADA_SL INTEGER NOT NULL,
US_SL INTEGER NOT NULL
CHECK (US_SL <=65) ) ;
ALTER TABLE SPEED_LIMITS
ADD
CONSTRAINT SPEED65
CHECK (US_SL <=65) ;
© Copyright IBM Corporation 2010
Implementation of Check Constraints
© Copyright IBM Corporation 2010
Overview of triggers: The need
BUSINESS RULES: INSERT, UPDATE, and DELETE actions against a table
may require another action or set of actions to occur to meet the
needs of the business.
Enforcement of these rules moves from the application (each program)
to the DB2 database.
© Copyright IBM Corporation 2010
Implementation of triggers (1 of 2)
© Copyright IBM Corporation 2010
Implementation of triggers (2 of 2)
© Copyright IBM Corporation 2010
Create Trigger statement
create trigger reorder
after update
of qty on stock
referencing new as n
for each row mode db2sql
when (n.qty <=5)
insert into reorder values (n.itemno, current
timestamp) ;
© Copyright IBM Corporation 2010
Data Row Compression summary
• Data row compression was introduced in DB2 9.1
• COMPRESS option for CREATE and ALTER TABLE is used
to specify compression
• Compression uses a static dictionary:
– Dictionary stored in data object, about 100K in size
– Compression Dictionary needs to be built before a row can be
compressed
– Dictionary can be built or rebuilt using REORG TABLE offline, which
also compresses existing data
– A dictionary will be built automatically when a table reaches a
threshold size (about 2 MB). This applies to SQL INSERT as well as
IMPORT and LOAD (DB2 9.5).
• Compression is intended to:
– Reduce disk storage requirements
– Reduce I/Os for scanning tables
– Reduce buffer pool memory for storing data
• Compression requires DB2 Storage Optimization feature
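Enabling compression on an existing table, as a minimal sketch (the table name is illustrative):
db2 alter table sales.history compress yes
db2 reorg table sales.history
(the offline REORG builds the dictionary and compresses the existing rows)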
© Copyright IBM Corporation 2010
Additional compression features added in DB2 9.7
• Multiple algorithms for automatic index compression
• Automatic compression for system and user
temporary tables
(Diagram: temporary tables created for ORDER BY processing are compressed automatically)
• Intelligent compression of large objects and XML
© Copyright IBM Corporation 2010
db2look utility
• DDL and statistics extraction tool, used to
capture table definitions and generate the
corresponding DDL.
• In addition to capturing the DDL for a set of tables, db2look can
create a test system that mimics a production system by generating
the following:
– The UPDATE statements used to replicate the
statistics on the objects
– The UPDATE DB CFG and DBM CFG statements
for replicating configuration parameter settings
– The db2set statements for replicating registry
settings
– Statements for creating partition groups, buffer
pools, table spaces, and so on
© Copyright IBM Corporation 2010
db2look examples
• To capture all of the DDL for a database (includes all tables, views,
RI, constraints, triggers, and so on):
db2look -d proddb -e -o statements.sql
{Edit the output file and change the database name}
db2 -tvf statements.sql
• To capture the DDL for one table in particular (table1 in this
example):
db2look -d proddb -e -t table1 -o statements.sql
{Edit the output file and change the database name}
db2 -tvf statements.sql
• To capture the DDL for all of the tables belonging to a particular
schema (db2user in this example):
db2look -d proddb -e -z db2user -o statements.sql
{Edit the output file and change the database name}
db2 -tvf statements.sql
© Copyright IBM Corporation 2010
What is XML? (1 of 2)
• XML = eXtensible Markup Language
• XML is self-describing data
• Here is some data from an ordinary delimited flat file:
47; John Doe; 58; Peter Pan; Database systems; 29; SQL;
relational
• What does this data mean?
<book>
  <authors>
    <author id="47">John Doe</author>
    <author id="58">Peter Pan</author>
  </authors>
  <title>Database systems</title>
  <price>29</price>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>
XML describes data; HTML describes display.
© Copyright IBM Corporation 2010
What is XML? (2 of 2)
The parts of an XML document (start tags, end tags, attributes,
elements, and data), illustrated on the same document:
<book>                                  <-- start tag
  <authors>
    <author id="47">John Doe</author>   <-- id="47" is an attribute; John Doe is data
    <author id="58">Peter Pan</author>
  </authors>                            <-- end tag
  <title>Database systems</title>       <-- a complete element
  <price>29</price>
  <keywords>
    <keyword>SQL</keyword>
    <keyword>relational</keyword>
  </keywords>
</book>
© Copyright IBM Corporation 2010
The XML document tree
XML parsing turns the document into a tree: the root node book has
children authors, title (Database Systems), price (29), and keywords;
authors has two author nodes (with id attributes 47 and 58), and
keywords has two keyword nodes (SQL, relational).
Each node has a path:
/book
/book/authors
/book/authors/author
/book/authors/author/@id
(...)
© Copyright IBM Corporation 2010
Integration of XML and relational capabilities
• Applications combine XML and relational data
• Native XML data type (server and client side)
• XML capabilities in all DB2 components
DB2 Server: a customer client application goes through the DB2 client
to the SQL/XML and XQuery interfaces of the DB2 engine, which stores
both relational and XML data natively.
© Copyright IBM Corporation 2010
Two ways to query XML data
• Given the following DDL used to create the table:
CREATE TABLE CUSTOMER (
CID BIGINT NOT NULL,
INFO XML,
HISTORY XML )
IN IBMDB2SAMPLEXML;
• Using XQuery as the primary language
XQUERY db2-fn:xmlcolumn ("CUSTOMER.INFO")
– Returns all Documents in column INFO
• Using SQL as the primary language
Select Pid from PRODUCT xmlp
WHERE XMLExists('declare default element namespace
"http://posample.org";$p/product/description[price
< 100]' passing xmlp.description as "p");
© Copyright IBM Corporation 2010
For more information (1 of 2)
• To learn more about DB2 and XML,
including the use of xQuery and
SQL/XML:
• Course CL131 – 5 days
Query and Manage XML Data with
DB2 9
• Course CL121 – 3 days
Query XML data with DB2 9
• Documentation:
pureXML Guide (SC27-2465)
DB2 XQuery Reference (SC27-2466)
© Copyright IBM Corporation 2010
For more information (2 of 2)
Some of our other courses:
• Course CE120 – 2 days
DB2 SQL Workshop
• Course CF18 – 4 days
Relational Database Design
• Course CE130 – 2.5 days
DB2 SQL Workshop for Experienced
Users
• Course DW11 – 4.5 days
Building the Data Warehouse
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• List DB2 object hierarchy and physical directories and files
• Create the following objects:
• Schema, Table, View, Alias, Index
• Review the use of Temporary Tables
• Explore the use and implementation of Check Constraints,
Referential Integrity and Triggers
• Explore the need for and the use of Large Objects
• Recognize XML and its native store as critical infrastructure
for emerging technologies
© Copyright IBM Corporation 2010
Student exercise
© Copyright IBM Corporation 2010
Moving data
© Copyright IBM Corporation 2010
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Discuss the INSERT statement and recognize its limitations
• Explain the differences between IMPORT and LOAD
• Explain the EXPORT, IMPORT, and LOAD syntax
• Create and use Exception Tables and Dump-Files
• Distinguish and resolve Table States:
– Load Pending and Set Integrity Pending
• Use the SET INTEGRITY command
• Discuss the db2move and db2look commands
© Copyright IBM Corporation 2010
Review INSERT statement
• The SQL INSERT statement can insert one or more rows of
data into tables, nicknames, or views:
– SQL overhead is imposed, such as obeying RI, or Check
Constraints, or Uniqueness, or executing triggers.
– As INSERTs occur, the activity is also stored in logs
• The SQL INSERT might not be the best or fastest method to
load massive amounts of data into a database
Example inserts:
insert into artists (artno, name,
classification) values (100, 'Patti & Cart
Wheels', 'S') ;
and
Insert into emptemp select * from employee ;
© Copyright IBM Corporation 2010
EXPORT/IMPORT overview
• Must be connected prior to call
• Workstation file formats: ASCII, DEL (delimited ASCII), WSF, PC/IXF
• EXPORT and IMPORT move data between workstation files and DB2 for
Linux, UNIX, and Windows or DRDA databases
• IMPORT activity is written to the logs
© Copyright IBM Corporation 2010
EXPORT data wizard
© Copyright IBM Corporation 2010
EXPORT command syntax (Basic)
EXPORT TO filename OF filetype
  [LOBS TO lob-path, ...] [LOBFILE filename, ...]
  [MODIFIED BY filetype-mod ...]
  [MESSAGES message-file]
  select-statement
© Copyright IBM Corporation 2010
EXPORT command example
• Exports data from database tables to file
• Check message for error or warning messages
db2 connect to musicdb
db2 export to artexprt of ixf messages artmsg
    select artno, name from artists
© Copyright IBM Corporation 2010
IMPORT data wizard
© Copyright IBM Corporation 2010
IMPORT command syntax (Basic)
IMPORT FROM filename OF filetype
  [LOBS FROM lob-path, ...]
  [ALLOW NO ACCESS | ALLOW WRITE ACCESS]
  [MODIFIED BY filetype-mod ...]
  [COMMITCOUNT {n | AUTOMATIC}] [RESTARTCOUNT n] [MESSAGES message-file]
  {INSERT | INSERT_UPDATE | REPLACE | REPLACE_CREATE}
      INTO table-name [(insert-column, ...)]
   | CREATE INTO table-name [(insert-column, ...)] |tblspace-specs|

tblspace-specs:
  IN tablespace-name [INDEX IN tablespace-name] [LONG IN tablespace-name]
© Copyright IBM Corporation 2010
IMPORT command example
• Imports data from a file to a database table
db2 connect to musicdb
db2 import from artexprt of ixf
messages artmsg create into artists
insert
insert_update
replace
replace_create
• Imports data from a file to three DMS table spaces
db2 connect to musicdb
db2 import from artexprt of ixf
messages artmsg create into artists [(column_list)]
in <tablespace>
index in <indextablespace>
long in <largetablespace>
© Copyright IBM Corporation 2010
Differences between IMPORT and LOAD
IMPORT:
• Slow when moving large amounts of data
• Creation of table/index supported with IXF format
• Import into tables, views, aliases
• Option to ALLOW WRITE ACCESS
• All rows logged
• Triggers fired, constraints enforced
• Inserts can use space freed by deleted rows

LOAD:
• Faster for large amounts of data
• Tables and indexes must exist
• Load into tables only
• Existing data can still be seen by read applications
• Minimal logging; can make copy
• Triggers not fired; unique constraints enforced, RI and check
constraints via SET INTEGRITY
• LOAD builds new extents
© Copyright IBM Corporation 2010
Four phases of Load
1. LOAD
Load data into tables
Collect index keys / sort
Consistency points at SAVECOUNT
Invalid data rows in dump file; messages in message file
2. BUILD
Indexes created or updated
3. DELETE
Unique Key Violations placed in Exception Table
Messages generated for unique key violations
Deletes Unique Key Violation Rows
4. INDEX COPY
Copy indexes from temp table space to index
table space
© Copyright IBM Corporation 2010
LOAD: Load phase
• During the LOAD phase, data is stored in a table and index
keys are collected
• Save points are established at intervals
• Messages indicate the number of input rows successfully
loaded
• If a failure occurs in this phase, use the RESTART option for
LOAD to restart from the last successful consistency point
(Phase sequence: Load -> Build -> Delete -> Index Copy)
© Copyright IBM Corporation 2010
LOAD: Build phase
• During the BUILD phase, indexes are created based on the
index keys collected in the Load phase
• The index keys are sorted during the Load phase
• If a failure occurs during this phase, LOAD restarts from the
BUILD phase
(Phase sequence: Load -> Build -> Delete -> Index Copy)
© Copyright IBM Corporation 2010
LOAD: Delete phase
• During the DELETE phase, all rows that have violated a
unique constraint are deleted
• If a failure occurs, LOAD restarts from the DELETE phase
• Once the database indexes are rebuilt, information about the
rows containing the invalid keys is contained in an exception
table, WHICH SHOULD BE CREATED BEFORE LOAD
• Finally, any duplicate keys found are deleted
(Phase sequence: Load -> Build -> Delete -> Index Copy)
© Copyright IBM Corporation 2010
LOAD: Index Copy phase
• The index data is copied from a system temporary table
space to the original table space.
• This will only occur if a system temporary table space was
specified for index creation during a load operation with the
READ ACCESS option specified.
(Phase sequence: Load -> Build -> Delete -> Index Copy)
© Copyright IBM Corporation 2010
Load data wizard
© Copyright IBM Corporation 2010
LOAD command syntax (Basic)
LOAD [CLIENT] FROM {file | pipe | device | cursorname} [, ...]
  OF {ASC | DEL | IXF | CURSOR}
  [MODIFIED BY filetype-mod ...]
  [SAVECOUNT n] [ROWCOUNT n] [WARNINGCOUNT n] [MESSAGES msg-file]
  {INSERT | REPLACE | RESTART | TERMINATE}
      INTO table-name [(insert-column, ...)] [FOR EXCEPTION table-name]
  [statistics options] [copy options]
  [ALLOW NO ACCESS | ALLOW READ ACCESS [USE tablespace-name]]
  [set integrity pending options] [LOCK WITH FORCE]
© Copyright IBM Corporation 2010
LOAD scenario
INPUT: calpar.del (delimited input file, sorted on the primary key)
OUTPUT: cal.par (table with a unique index on the primary key),
cal.parexp (exception table), par.msgs (message file),
dump.fil.000 (dump file)
(Diagram: valid rows are loaded into cal.par; duplicate-key rows land in
the exception table cal.parexp with a timestamp and message; rows failing
the not null, numeric column land in the dump file as rows not loaded)
Steps:
create tables/indexes
obtain delimited input file in sorted format
create exception table
db2 load from calpar.del of del
    modified by dumpfile=<path>/dump.fil
    warningcount 100 messages par.msgs
    insert into cal.par for exception cal.parexp
db2 load query table cal.par
examine par.msgs file
examine cal.parexp exception table
© Copyright IBM Corporation 2010
Rules and methods for creating Exception Tables
• The first n columns are the same as the base table
• No constraints and no trigger definitions
• Column n + 1: TIMESTAMP
• Column n + 2: CLOB (32 KB)
• The user needs INSERT privilege

Method 1:
CREATE TABLE T1EXC LIKE T1
ALTER TABLE T1EXC
    ADD COLUMN TS TIMESTAMP
    ADD COLUMN MSG CLOB(32K)

Method 2:
CREATE TABLE T1EXC AS
    (SELECT T1.*,
            CURRENT TIMESTAMP AS TS,
            CLOB('', 32767) AS MSG
     FROM T1)
    DEFINITION ONLY
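After a load, the exception table can be queried directly (a minimal
sketch, assuming the T1EXC layout above):
db2 "select ts, substr(msg, 1, 80) from t1exc"
The MSG column carries information about the constraints each row violated.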
© Copyright IBM Corporation 2010
Offline versus Online Load
• ALLOW NO ACCESS (offline load):
(Timeline: the table lock is requested and granted at the start; from then
until the load commits, the load allows no access; read/write access is
possible before and after)
• ALLOW READ ACCESS (online load):
(Timeline: a drain is requested and granted at the start, and a
super-exclusive lock is requested and granted briefly at load commit;
readers keep access to the pre-load data while the load runs, and
read/write access resumes after commit)
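A minimal sketch of an online load, reusing the table names from the
scenario earlier in this unit:
db2 load from calpar.del of del
    insert into cal.par for exception cal.parexp
    allow read access
Readers continue to see the pre-load data until the load commits.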
© Copyright IBM Corporation 2010
Table states
• (Load pending, Set Integrity Pending)
• LOAD QUERY TABLE <table-name>
• Tablestate:
Normal
Set Integrity Pending
Load in Progress
Load Pending
Reorg Pending
Read Access Only
Unavailable
Not Load Restartable
Unknown
• A table can be in several states at the same time, for example:
Tablestate:
Set Integrity Pending
Load in Progress
Read Access Only
© Copyright IBM Corporation 2010
Checking Load status: Load query
db2 load query table prod.table1
SQL3501W The table space(s) in which the table resides will not be placed in
backup pending state since forward recovery is disabled for the database.
SQL3109N The utility is beginning to load data from file
"C:\cf45\reorg\savehist.del".
SQL3500W The utility is beginning the "LOAD" phase at time "03/29/2005
21:30:24.468073".
SQL3519W Begin Load Consistency Point. Input record count = "0".
SQL3520W Load Consistency Point was successful.
........................
SQL3519W Begin Load Consistency Point. Input record count = "48314".
SQL3520W Load Consistency Point was successful.
SQL0289N Unable to allocate new pages in table space "LOADTSPD".
SQLSTATE=57011
SQL3532I The Load utility is currently in the "LOAD" phase.
Number of rows read = 48314
Number of rows skipped = 0
Number of rows loaded = 48314
Number of rows rejected = 0
Number of rows deleted = 0
Number of rows committed = 48314
Number of warnings = 0
Tablestate:
Load Pending
© Copyright IBM Corporation 2010
Load monitoring: LIST UTILITIES
db2 LIST UTILITIES SHOW DETAIL
ID = 2
Type = LOAD
Database Name = MUSICDB
Partition Number = 0
Description = ONLINE LOAD DEL AUTOMATIC INDEXING INSERT NON
-RECOVERABLE ADMIN.LOA
Start Time = 03/30/2005 15:57:42.354567
Progress Monitoring:
Phase Number = 1
Description = SETUP
Total Work = 0 bytes
Completed Work = 0 bytes
Start Time = 03/30/2005 15:57:42.354573
Phase Number = 2
Description = LOAD
Total Work = 10000 rows
Completed Work = 10000 rows
Start Time = 03/30/2005 15:57:50.669476
Phase Number [Current] = 3
Description = BUILD
Total Work = 2 indexes
Completed Work = 2 indexes
Start Time = 03/30/2005 15:57:51.182767
© Copyright IBM Corporation 2010
Load Pending state:
Recovering from LOAD failure
• Restart the Load:
– Check Messages files
– Use Restart option
– Load operation automatically continues from last consistency point
in Load or Build phase
• or Delete phase if ALLOW NO ACCESS
• Replace the whole table
– LOAD ... REPLACE
• Terminate the Load:
– If LOAD ... INSERT, returns table to state preceding Load
– If LOAD ... REPLACE, table will be truncated to an empty state
• Create backup copy prior to Load to return to state preceding Load
Do not tamper with Load temporary files
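A minimal sketch of terminating a failed load, reusing the scenario names
from earlier in this unit:
db2 load from calpar.del of del terminate into cal.par
After LOAD ... INSERT this returns the table to its state preceding the
load; after LOAD ... REPLACE the table is truncated to an empty state.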
© Copyright IBM Corporation 2010
Backup Pending state: COPY options
• When loading data and forward recovery is enabled:
– COPY NO (default)
• During the load: Backup Pending and Load in Progress states
• After the load: the table space remains in Backup Pending until a
  backup is taken
– COPY YES
• A copy of the loaded data is made, so the table space is not placed
  in Backup Pending
– NONRECOVERABLE
• No copy made and no backup required
Copy image naming (UNIX and Windows):
Databasealias.Type.Instancename.Nodename.Catnodename.Timestamp.number
Type: 0 = Full Backup
      3 = Table Space Backup
      4 = Copy from Table Load
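A minimal sketch of a load that stays recoverable by writing its own copy
image (the /db2/copies path is an assumption):
db2 load from calpar.del of del
    insert into cal.par
    copy yes to /db2/copies
With COPY YES, the table space avoids Backup Pending because the copy
image can be used during roll forward recovery.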
© Copyright IBM Corporation 2010
Set Integrity Pending table state
• Load turns OFF constraint checking:
– Leaves table in Set Integrity Pending state
– If parent table is in Set Integrity Pending, then dependents may also
be in Set Integrity Pending
– LOAD INSERT, ALLOW READ ACCESS
• Loaded table in Set Integrity Pending with read access
– LOAD INSERT, ALLOW NO ACCESS
• Loaded table in Set Integrity Pending with no access
– LOAD REPLACE, SET INTEGRITY PENDING CASCADE
DEFERRED
• Loaded table in Set Integrity Pending with no access
– LOAD REPLACE, SET INTEGRITY PENDING CASCADE
IMMEDIATE
• Loaded table and descendant foreign key tables are in Set Integrity
Pending with no access
© Copyright IBM Corporation 2010
SET INTEGRITY command syntax (Basic)
SET INTEGRITY
    FOR table-name [, ...] OFF
  | FOR table-name [, ...] IMMEDIATE CHECKED [exception-clause]
  | FOR table-name [, ...] {ALL | FOREIGN KEY | CHECK} IMMEDIATE UNCHECKED

exception-clause:
    FOR EXCEPTION IN table-name USE table-name
        [, IN table-name USE table-name ...]
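A minimal sketch of turning checking back on without verifying existing
rows, for data already known to be clean (table name reused from the next
example):
db2 set integrity for cal.par all immediate unchecked
IMMEDIATE UNCHECKED removes the Set Integrity Pending state without
examining the data, so it should only be used when the loaded data is
trusted.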
© Copyright IBM Corporation 2010
Set Integrity Pending state
BEFORE
SYSCAT.TABLES:
TABNAME  DEFINER  STATUS  CONST_CHECKED  ACCESS_MODE
par      cal      C       NYYY...        R
for      cal      C       YNYY...        R
(Diagram: cal.par holds the primary-key rows; cal.for holds foreign-key
rows referencing them)
AFTER
(Diagram: rows of cal.par violating the primary key are moved to the
cal.parexp exception table, and rows of cal.for with unmatched foreign
keys are moved to the cal.forexp exception table, each with a timestamp
and message)
db2 SET INTEGRITY for
    cal.par, cal.for IMMEDIATE CHECKED
    FOR EXCEPTION
    IN cal.par USE cal.parexp,
    IN cal.for USE cal.forexp
© Copyright IBM Corporation 2010
Meaningful steps for LOAD
• Create tables and indexes
• Create exception tables
• Sort data
• Back up TS/DB (if using REPLACE)
• Consider freespace
• Load .... for Exception ... Savecount ... Warningcount...
• Examine xx.msg and dumpfile (after LOAD completes)
• Examine exception tables (after LOAD completes)
• Back up table space if log retain=recovery and COPY NO
• Set Integrity for .... (only if table in Set Integrity Pending state)
• Update statistics (if necessary)
© Copyright IBM Corporation 2010
db2move utility
• Facilitates the moving/copying of large numbers
of tables between databases
• Can be used with db2look to move/copy
a database between different platforms
(for example, AIX to Windows)
• Usage: db2move <dbName> EXPORT/IMPORT/LOAD/COPY [options]
• EXPORT: The system catalogs are queried, a list of tables is
compiled (based on the options), and the tables are exported in
IXF format -- additionally, a file called db2move.lst is created
• IMPORT: The db2move.lst file is used to import the IXF files
created in the EXPORT step
• LOAD: The db2move.lst file is used to load the PC/IXF data files
created in the EXPORT step
• COPY: Duplicates schema(s) into a target database
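A minimal sketch of the cross-platform pattern mentioned above: extract
the DDL with db2look, then move the data with db2move (database names are
assumptions):
db2look -d dbsrc -e -o dbsrc.ddl    (extract the DDL on the source)
db2move dbsrc export                (export the tables in IXF format)
(on the target platform)
db2 -tvf dbsrc.ddl                  (re-create the objects)
db2move dbtgt import                (import the IXF files listed in db2move.lst)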
© Copyright IBM Corporation 2010
db2move options: Export/Import/Load
• EXPORT:
-tc table creator list
-tn table name list
-sn schema name list
-ts table space list
-tf file contains list of tables
db2move sample export -tc admin -tn EXPL*
• IMPORT:
-io defaults to REPLACE_CREATE
(other modes: CREATE, INSERT, INSERT_UPDATE, REPLACE)
db2move sample import -io replace
• LOAD:
-lo defaults to INSERT (other mode: REPLACE)
db2move sample load -lo insert
Sample db2move.lst:
!"ADMIN"."EXPLAIN_INSTANCE"!tab1.ixf!tab1.msg!
!"ADMIN"."EXPLAIN_STATEMENT"!tab2.ixf!tab2.msg!
!"ADMIN"."EXPLAIN_ARGUMENT"!tab3.ixf!tab3.msg!
!"ADMIN"."EXPLAIN_OBJECT"!tab4.ixf!tab4.msg!
!"ADMIN"."EXPLAIN_OPERATOR"!tab5.ixf!tab5.msg!
!"ADMIN"."EXPLAIN_PREDICATE"!tab6.ixf!tab6.msg!
!"ADMIN"."EXPLAIN_STREAM"!tab7.ixf!tab7.msg!
© Copyright IBM Corporation 2010
db2move COPY option
Copy one or more schemas between DB2 databases
Uses a -co option to specify:
• Target Database:
"TARGET_DB <db name> [USER <userid> USING <password>]"
• MODE:
– DDL_AND_LOAD: Creates all supported objects from the source schema,
and populates the tables with the source table data (default)
– DDL_ONLY: Creates all supported objects from the source schema, but
does not repopulate the tables
– LOAD_ONLY: Loads all specified tables from the source database to the
target database; the tables must already exist on the target
• SCHEMA_MAP: Allows user to rename schema when copying to target
• TABLESPACE_MAP: Table space name mappings to be used
• Load Utility option: COPY NO or Nonrecoverable
• Owner: Change the owner of each new object created in the target
schema
© Copyright IBM Corporation 2010
db2move COPY schema examples
• To duplicate schema schema1 from source database dbsrc
to target database dbtgt, issue:
db2move dbsrc COPY -sn schema1 -co TARGET_DB dbtgt
    USER myuser1 USING mypass1
• To duplicate schema schema1 from source database dbsrc
to target database dbtgt, rename the schema to newschema1
on the target, and map source table space ts1 to ts2 on the
target, issue:
db2move dbsrc COPY -sn schema1 -co TARGET_DB dbtgt
    USER myuser1 USING mypass1
    SCHEMA_MAP ((schema1,newschema1))
    TABLESPACE_MAP ((ts1,ts2), SYS_ANY)
Output files generated (timestamped): COPYSCHEMA.msg, COPYSCHEMA.err,
LOADTABLE.msg, LOADTABLE.err
(Diagram: db2move copies the schema from database dbsrc to database dbtgt)
© Copyright IBM Corporation 2010
Online Table Move stored procedure
• The ADMIN_MOVE_TABLE procedure, available with DB2 9.7,
is designed to move data from a source table to a target table
with minimal impact on application access
– Changes that can be made using ADMIN_MOVE_TABLE:
• New Data, Index or Long table spaces, which could have a different
page size, extent size or type of table space management (like
moving from SMS to Automatic Storage)
• Data compression could be implemented during the move
• MDC clustering can be added or changed
• Range partitions can be added or changed
• Distribution keys can be changed for DPF tables
• Columns can be added, removed or changed
– Multiple phased processing allows write access to the source table
except for a short outage required to swap access to the target table
© Copyright IBM Corporation 2010
ADMIN_MOVE_TABLE: Processing phases
(Diagram: the SYSTOOLS.ADMIN_MOVE_TABLE protocol table, with columns
tabschema, tabname, key, and value, tracks the move)
1. INIT phase: create the triggers, the target table, and the staging table
2. COPY phase: rows (c1, c2, ..., cn) are copied from the source table to
the target table
3. REPLAY phase: keys of rows changed by the online workload (INSERTs,
UPDATEs, and DELETEs captured via triggers into the staging table)
identify the rows that are re-copied from the source table to the target
4. SWAP phase: rename Target -> Source
© Copyright IBM Corporation 2010
ADMIN_MOVE_TABLE procedure methods
• There are two methods of calling ADMIN_MOVE_TABLE:
The first method specifies how to define the target table:
ADMIN_MOVE_TABLE ( tabschema, tabname,
                   data_tbsp, index_tbsp, lob_tbsp,
                   mdc_cols, partkey_cols, data_part,
                   coldef, options..., operation )
The second method allows a predefined table to be specified as the
target for the move:
ADMIN_MOVE_TABLE ( tabschema, tabname,
                   target_tabname, options..., operation )
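A minimal sketch of the second method; the pre-created target table
HIST1_NEW is an assumption:
call SYSPROC.ADMIN_MOVE_TABLE
    ( 'MDC', 'HIST1', 'HIST1_NEW', '', 'MOVE' )
'MOVE' runs all phases in one call; the operation argument also accepts
the individual phases (INIT, COPY, REPLAY, SWAP) for finer control.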
© Copyright IBM Corporation 2010
Example: Move a table to new table space
call SYSPROC.ADMIN_MOVE_TABLE
( 'MDC', 'HIST1', 'MDCTSP2', 'MDCTSPI', 'MDCTSP2', NULL, NULL, NULL, NULL,
  'KEEP, REORG',
  'MOVE' )
(KEEP: the source table is kept; REORG: the data is reorganized during
the move)
KEY VALUE
-------------------------------- --------------------------------------------------
AUTHID INST461
CLEANUP_END 2009-06-01-10.57.55.282196
CLEANUP_START 2009-06-01-10.57.55.050882
COPY_END 2009-06-01-10.57.36.243768
COPY_INDEXNAME HIST1IX1
COPY_INDEXSCHEMA MDC
COPY_OPTS ARRAY_INSERT,CLUSTER_OVER_INDEX
COPY_START 2009-06-01-10.57.25.132774
COPY_TOTAL_ROWS 100000
INDEXNAME MDC.HIST1IX3
INDEXSCHEMA INST461
INDEX_CREATION_TOTAL_TIME 7
INIT_END 2009-06-01-10.57.24.880488
INIT_START 2009-06-01-10.57.21.220930
ORIGINAL HIST1AATOhxo
REPLAY_END 2009-06-01-10.57.53.870351
REPLAY_START 2009-06-01-10.57.36.244234
REPLAY_TOTAL_ROWS 0
REPLAY_TOTAL_TIME 0
STATUS COMPLETE
SWAP_END 2009-06-01-10.57.54.995848
SWAP_RETRIES 0
SWAP_START 2009-06-01-10.57.54.255941
VERSION 09.07.0000
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Discuss the INSERT statement and recognize its limitations
• Explain the differences between IMPORT and LOAD
• Explain the EXPORT, IMPORT, and LOAD syntax
• Create and use Exception Tables and Dump-Files
• Distinguish and resolve Table States:
– Load Pending and Set Integrity Pending
• Use the SET INTEGRITY command
• Discuss the db2move and db2look commands
© Copyright IBM Corporation 2010
Student exercise
© Copyright IBM Corporation 2010
Backup and recovery
© Copyright IBM Corporation 2010
Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.4
Unit objectives
After completing this unit, you should be able to:
• Describe the major principles and methods for backup and
recovery
• State the three types of recovery used by DB2
• Explain the importance of logging for backup and recovery
• Describe how data logging takes place, including circular
logging and archival logging
• Use the BACKUP, RESTORE, ROLLFORWARD and
RECOVER commands
• Perform a table space backup and recovery
• Restore a database to the end of logs or to a point-in-time
• Discuss the configuration parameters and the recovery
history file and use these to handle various backup and
recovery scenarios
© Copyright IBM Corporation 2010
Types of failure
(Diagram: types of failure include media, operational, hardware,
software, and power failures)
© Copyright IBM Corporation 2010
Database objects
DB2 recovery is based on restoring at the database or table space level.
(Diagram: a database containing the table spaces SYSCATSPACE, USERSPACE1,
and TBSP1; each table space holds tables and indexes and is stored in one
or more containers, for example Container 0 and Container 1)
© Copyright IBM Corporation 2010
DB2 Database recovery methods
(Diagram: an offline database-level backup of the database and its table
spaces is taken at 1 AM; DB2 logs Log 1 through Log 6 record all
subsequent changes; the database is later restored and rolled forward to
its 3 PM state)
1: Crash Recovery: the active logs are replayed automatically to bring
the database back to a consistent state after a failure
2: Version Recovery: the Restore DB operation returns the database to its
state at the time of the backup
3: Roll Forward Recovery: after the restore, the logs are applied to
recover the changes made since the backup
© Copyright IBM Corporation 2010
Introduction to logging
(Diagram: insert, update, and delete requests change rows in the buffer
pool; for each change, the new row, and the old row for updates and
deletes, is recorded in the log buffer. The log buffer is externalized to
the log files on commit or when the log buffer is full, while requests
for reads and writes continue against the tables)
© Copyright IBM Corporation 2010
Circular logging (Non-recoverable database)
(Diagram: "n" primary log files are reused in a circular fashion; up to
"n" secondary log files can be allocated on demand)
© Copyright IBM Corporation 2010
Archival logging (Recoverable database)
(Diagram: log files 12 through 16 are archived, manually or
automatically, as they fill)
• ACTIVE: contains information for non-committed or non-externalized
transactions
• ONLINE ARCHIVE: contains information for committed and externalized
transactions; stored in the ACTIVE log subdirectory
• OFFLINE ARCHIVE: archive moved from the ACTIVE log subdirectory; may
also be on other media
© Copyright IBM Corporation 2010
Log file information
(Diagram: over time, log files progress from online archive logs through
active log files to unused log files; the loghead pointer marks the first
active log)
SQLOGCTL.LFH (1 and 2) contains:
— Log configuration
— Log files to be archived
— Active logs
— Start point for crash recovery
© Copyright IBM Corporation 2010
Location of log files
(Diagram: by default, log files S~0.LOG, S~1.LOG, S~2.LOG, ... reside in
the SQLOGDIR subdirectory of the database directory,
Instance/NODE0000/SQL0000n. NEWLOGPATH can place them elsewhere, for
example /your/choice, on a different disk drive and potentially a
different disk controller unit. MIRRORLOGPATH maintains a second copy of
the logs, for example in /2nd/choice)
© Copyright IBM Corporation 2010
Configure database logs: Parameters
• Primary log files (LOGPRIMARY)
• Secondary log files (LOGSECOND)
• Log size (LOGFILSIZ)
• Log buffer (LOGBUFSZ)
• New log path (NEWLOGPATH)
• Mirror log path (MIRRORLOGPATH)
• Primary log archive method (LOGARCHMETH1)
• Secondary log archive method (LOGARCHMETH2)
• Number of commits to group (MINCOMMIT)
• Log records to write before soft checkpoint (SOFTMAX)
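A minimal sketch of switching the course database to archival logging
(the archive path is an assumption):
db2 update db cfg for musicdb using LOGARCHMETH1 DISK:/db2/archlogs
Setting LOGARCHMETH1 to a value other than OFF makes the database
recoverable; a full offline backup is then required before the database
can be used.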
© Copyright IBM Corporation 2010
Configure database logs: GUI (Control Center)
© Copyright IBM Corporation 2010
Configure database logs: Wizard (CC)
© Copyright IBM Corporation 2010
Configure database logs: Command line
• Windows: db2 get db cfg for musicdb | find /i "log"
• Linux/UNIX: db2 get db cfg for musicdb | grep -i log
• Update: db2 update db cfg for musicdb using logprimary 15
Log retain for recovery status = NO
User exit for logging status = NO
Catalog cache size (4KB) (CATALOGCACHE_SZ) = 260
Log buffer size (4KB) (LOGBUFSZ) = 98
Log file size (4KB) (LOGFILSIZ) = 1024
Number of primary log files (LOGPRIMARY) = 13
Number of secondary log files (LOGSECOND) = 4
Changed path to log files (NEWLOGPATH) =
Path to log files = C:\DB2\NODE0000\SQL00003\SQLOGDIR\
Overflow log path (OVERFLOWLOGPATH) =
Mirror log path (MIRRORLOGPATH) =
First active log file =
Block log on disk full (BLK_LOG_DSK_FUL) = NO
Percent max primary log space by transaction (MAX_LOG) = 0
Num. of active log files for 1 active UOW(NUM_LOG_SPAN) = 0
Percent log file reclaimed before soft chckpt (SOFTMAX) = 520
Log retain for recovery enabled (LOGRETAIN) = OFF
User exit for logging enabled (USEREXIT) = OFF
HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
First log archive method (LOGARCHMETH1) = OFF
Options for logarchmeth1 (LOGARCHOPT1) =
Second log archive method (LOGARCHMETH2) = OFF
Options for logarchmeth2 (LOGARCHOPT2) =
Failover log archive path (FAILARCHPATH) =
Number of log archive retries on error (NUMARCHRETRY) = 5
Log archive retry Delay (secs) (ARCHRETRYDELAY) = 20
Log pages during index build (LOGINDEXBUILD) = OFF
© Copyright IBM Corporation 2010
DB2 recovery-related system files
• Recovery History File - db2rhist.asc:
– Created during the CREATE DATABASE command
– Updated by DB2 utility execution:
• Back up database or table space
• Restore database or table space
• Roll forward database or table space
• Load table
• Reorg table
• DB2 log file archival
• Create/Alter/Rename table space
• Quiesce table spaces for table
• Drop table (optional)
– Included in each DB2 backup file
– Viewed with the db2 LIST HISTORY command
• Log Control Files – SQLOGCTL.LFH.1 and SQLOGCTL.LFH.2:
– Used during crash recovery
– Disk copy updated at the end of each log file
– Use a SOFTMAX value < 100 (% of LOGFILSIZ) to refresh pointers more often
– Included at the end of each DB2 backup
(Diagram: both files reside in the database directory,
Instance/NODE0000/SQL0000n)
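A minimal sketch of querying the recovery history file for past backups
of the course database:
db2 list history backup all for musicdb
The report includes each backup's timestamp, type (database or table
space), and associated log range.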
© Copyright IBM Corporation 2010
BACKUP command
BACKUP DATABASE MUSICDB
(Diagram: a user with SYSADM, SYSCTRL, or SYSMAINT authority backs up the
MUSICDB database to a local or remote target: disk or tape directories,
TSM, XBSA, or a vendor backup shared library; the buffer size and number
of buffers can be specified)
© Copyright IBM Corporation 2010
Backup Utility options
BACKUP DATABASE database-alias [USER username [USING password]]
    [TABLESPACE (tblspace-name [, tblspace-name ...])] [ONLINE]
    [INCREMENTAL [DELTA]]
    [USE {TSM | XBSA} [OPEN num-sess SESSIONS]
        [OPTIONS {options-string | options-filename}]
     | TO dir/dev [, dir/dev ...]
     | LOAD lib-name [OPEN num-sess SESSIONS]
        [OPTIONS {options-string | options-filename}]]
    [WITH num-buff BUFFERS] [BUFFER buffer-size] [PARALLELISM n]
    [COMPRESS [COMPRLIB lib-name [EXCLUDE]] [COMPROPTS options-string]]
    [UTIL_IMPACT_PRIORITY [priority]]
    [{INCLUDE | EXCLUDE} LOGS] [WITHOUT PROMPTING]
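A minimal sketch of an online, compressed backup that stores the needed
logs in the image (the target path is an assumption):
db2 backup database musicdb online to /db2/backups compress include logs
Online backups require archival logging; INCLUDE LOGS places the log
files needed to roll forward to the end of the backup inside the backup
image.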
© Copyright IBM Corporation 2010
BACKUP using the GUI interface
© Copyright IBM Corporation 2010
Online versus Offline Backup
© Copyright IBM Corporation 2010
The backup files
• File name for backup images on disk has:
– Database alias
– Type of backup (0=full, 3=table space, 4=copy from table load)
– Instance name
– Node number
– Catalog node number
– Timestamp of backup (YYYYMMDDHHMMSS)
– Sequence number (for multiple files)
MUSICDB.0.DB2.NODE0000.CATN0000.20071013150616.001
• Tape images are not named, but internally contain the same
information in the backup header for verification purposes
• Backup history provides key information in easy-to-use format
© Copyright IBM Corporation 2010
RESTORE command
RESTORE DATABASE MUSICDB
(Diagram: a user with SYSADM, SYSCTRL, or SYSMAINT authority restores the
MUSICDB database from a local or remote source: disk or tape, TSM, XBSA,
or a vendor backup shared library; the buffer size and number of buffers
can be specified)
© Copyright IBM Corporation 2010
Syntax of the RESTORE command
RESTORE {DATABASE | DB} source-database-alias
    {restore-options | CONTINUE | ABORT}

Restore options:
    [USER username [USING password]]
    [TABLESPACE [ONLINE]
     | TABLESPACE (tablespace-name [, ...]) [ONLINE]
     | HISTORY FILE [ONLINE]]
    [TAKEN AT date-time]
    [USE TSM [OPEN num-sessions SESSIONS]
     | FROM {directory | device} [, ...]
     | LOAD shared-library [OPEN num-sessions SESSIONS]]
    [TO target-directory] [INTO target-database-alias]
    [NEWLOGPATH directory]
    [WITH num-buffers BUFFERS] [BUFFER buffer-size]
    [REPLACE EXISTING] [REDIRECT]
    [WITHOUT ROLLING FORWARD] [WITHOUT PROMPTING]
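A minimal sketch of restoring the backup image shown earlier into a new
database (the path and new alias are assumptions):
db2 restore database musicdb from /db2/backups
    taken at 20071013150616 into musicdb2
If the database is recoverable, the restored copy is left in roll forward
pending state unless WITHOUT ROLLING FORWARD is specified.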
© Copyright IBM Corporation 2010
RESTORE using the GUI interface
© Copyright IBM Corporation 2010
Backup/restore table space considerations
• Roll-forward must be enabled
• Can choose to restore a subset of table spaces
• Generally best to put multiple spaces in one backup image:
– Makes table space recovery strategy easier
– Provides access to related table spaces and coherent
management of these table spaces
• Handling of long/LOB/XML data requires a correlated strategy
• Point-in-time recovery is supported, but has requirements
• Faster recovery for Catalogs using Tablespace Level backup
• Critical business application tables should obviously be the
focus of the backup/restore, but other tables are needed in
support of these tables
© Copyright IBM Corporation 2010
Roll forward pending state
• Roll forward pending is set as a result of:
– Restore of offline database backup omitting the command option
WITHOUT ROLLING FORWARD
– Restore of an online database backup
– Restore of any table space level backup
– DB2 detects media failure isolated at a table space
• Scope of pending state managed by DB2:
– Database in pending state will not permit any activity
– Table spaces in pending state will permit access to other table
spaces
© Copyright IBM Corporation 2010
Syntax of the ROLLFORWARD command
ROLLFORWARD {DATABASE | DB} database-alias
    [TO {isotime [ON ALL DBPARTITIONNUMS]
            [USING UTC TIME | USING LOCAL TIME]
         | END OF BACKUP [ON ALL DBPARTITIONNUMS]
         | END OF LOGS [On Database Partition clause]}
        [AND COMPLETE | AND STOP]
     | {COMPLETE | STOP | CANCEL} [On Database Partition clause]
     | QUERY STATUS [USING UTC TIME | USING LOCAL TIME]]
    [TABLESPACE ONLINE
     | TABLESPACE (tablespace-name [, ...]) [ONLINE]]
    [OVERFLOW LOG PATH (log-directory [, Log Overflow clause])]
    [NORETRIEVE]
© Copyright IBM Corporation 2010
ROLLFORWARD: GUI interface
© Copyright IBM Corporation 2010
ROLLFORWARD: How far?
• END OF LOGS: (Apply as many changes as possible):
– Rollforward will apply all available logs beginning with the logs associated with
the backup that was restored
– Archived logs will be retrieved unless NORETRIEVE is specified
• Point-in-time (PIT): (Apply changes up to a specified time):
– Specified in Coordinated Universal Time (UTC) via command
– Specified in local time on server with USING LOCAL TIME
– Specified in local time on the client via GUI interface
– Format: yyyy-mm-dd-hh.mm.ss.nnnnnn
• END OF BACKUP: (Apply as few changes as possible):
– Allows a Database to be recovered from an online database backup and to end
the ROLLFORWARD processing at the earliest point where the database is
consistent.
– Recovery history file (RHF) shows logs associated with online backups
• Table space point-in-time considerations:
– Minimum roll forward time maintained for each table space – requires roll
forward at least to the last DDL change (create, alter, drop) in a table space
– Table spaces are placed in backup pending when the roll forward completes to
ensure future recoverability
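A minimal sketch of a point-in-time roll forward in server local time
(the timestamp is an assumption):
db2 rollforward database musicdb
    to 2010-03-30-15.57.00.000000 using local time and complete
AND COMPLETE finishes the roll forward, rolls back any incomplete
transactions, and takes the database out of roll forward pending state.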
© Copyright IBM Corporation 2010
DB2 RECOVER command
db2 recover database salesdb
– The RECOVER command performs both RESTORE and
ROLLFORWARD command processing using Recovery
History file data
– Can use full or incremental database-level backup images
– Allows a partitioned (DPF) database to be recovered using
a single command
– Database can be recovered to end of logs or to a
point in time
– RESTART option forces a failed RECOVER command to
redo the RESTORE even if the previous restore completed
– USING HISTORY FILE option allows disaster recovery
with no existing database
– No table space level recovery options
(Diagram: RECOVER locates the needed backups and archived logs through
the recovery history file)
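A minimal sketch of recovering to a point in time with a single command
(the timestamp is an assumption):
db2 recover database salesdb to 2010-03-30-16.00.00
RECOVER selects the most suitable backup image from the recovery history
file, restores it, and rolls forward to the requested time.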
© Copyright IBM Corporation 2010
Log file naming
Log files are named Sxxxxxxx.LOG, numbered from S0000000.LOG to
S9999999.LOG.
(Diagram: S~0.LOG S~1.LOG S~2.LOG | BACKUP | S~3.LOG S~4.LOG)
The naming sequence restarts:
• When roll-forward is disabled (the files are reused after each backup)
• When S9999999.LOG is filled
• When roll-forward is enabled and either:
– RESTORE DATABASE is run without rolling forward, or
– ROLLFORWARD DATABASE stops at some point in, for example, S~2.LOG
© Copyright IBM Corporation 2010
Disaster recovery considerations
Should you use database recovery, table space recovery, or a
combination?
• Database level backups:
– Offline backup does not require a roll forward
– Online backup will include the ‘required’ logs for a roll forward TO END OF
BACKUP
– Entire database can be restored on another system
– Reduces backup management
• Table space level backups:
– Can be used to supplement database backups
– A complete set of tablespace backups could be used to REBUILD a database
for disaster recovery
– Increases the number of backup images
• Using the REBUILD WITH option of RESTORE simplifies creating a full
or partial copy of a database using either database or table space
backup images
© Copyright IBM Corporation 2010
Questions to consider
• Will the log files and the database files be on separate physical
disks?
• What medium can best handle the volume of log data to be
generated?
• Should copies of log files, backups, or both be made and
managed?
• Can the OS support mirroring? Should a mirror log path be
defined?
• How far back should recovery be enabled?
• When will all log files and backups be discarded?
• Are paging space/swap files on the same device or drive as
DB2 logs?
• Have you backed up database system support files, such as
db2profile and db2nodes.cfg, and your shell scripts?
© Copyright IBM Corporation 2010
Logging/backup requirements summary
Consideration           CRASH                VERSION              ROLLFORWARD
Logging                 Circular or Archive  Circular or Archive  Archive
Backup                  N/A                  OFFLINE              OFFLINE or ONLINE
Resources               Low                  Medium               High
Management              Low                  Medium               High
Currency after
Media Failure           Not Provided         Latest Backup        Last Committed Work
Table Space Recovery    N/A                  NO                   YES
© Copyright IBM Corporation 2010
DB2 Integrated Cluster Manager support
• HA Cluster Manager Integration:
– Coupling of DB2 and TSA (SA MP) on Linux, Solaris, and AIX.
– TSA can be installed and maintained with DB2 installation
procedures.
– DB2 interface to configure cluster.
– DB2 to maintain cluster configuration, add node, add table space,
and so on. Exploitation of new vendor independent layering (VIL),
providing support for any cluster manager.
• NO SCRIPTING REQUIRED!
– One set of embedded scripts that are used by all cluster managers.
• Automates HADR failover
– Exploit HA cluster manager integration to automate HADR Takeover
on Standby.
© Copyright IBM Corporation 2010
DB2: Cluster configuration simplified
• DB2 utility db2haicu:
– DB2 High Availability Instance Configuration Utility
– Sets up DB2 with the Cluster Manager
• For HADR or Shared Storage cluster
– Interactive or XML file driven interface
– Sets the DBM CFG option cluster_mgr
• DB2 communicates with the Cluster Manager:
– Keeps the Cluster Manager in sync with DB2 changes:
• CM needs to be aware of new or changed table spaces, containers,
and so on
• Avoids failover problems due to missing resources
– Keeps the XML cluster configuration file up to date
– Automates HADR failover
© Copyright IBM Corporation 2010
High Availability Disaster Recovery: HADR
(Diagram: clients connect over the network to the primary DB2 database
server; a direct database-to-database TCP/IP link ships log data from the
primary copy of the DB2 database to the standby DB2 database server,
which maintains the standby copy of the database)
HADR is supported on all editions of DB2 9.
For DB2 Express 9, HADR is part of the separately priced High Availability Feature.
© Copyright IBM Corporation 2010
DB2 additional recovery facilities
• On-demand log archiving
• Infinite active logs
• Block transactions on log directory full
• Split mirror database copies:
– SET WRITE SUSPEND/RESUME commands
– db2inidb command modes:
• SNAPSHOT: Database copy for testing or reporting
• STANDBY: Database copy to create standby database for quick
recovery
• MIRROR: Use split mirror database copy instead of RESTORE
• Incremental and delta database and table space backups
• Relocating a database or a table space:
– RESTORE UTILITY with REDIRECT option
– db2relocatedb command
• Full and partial database REBUILD support
• Integrated Cluster Failover support
• High Availability Disaster Recovery (HADR)
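A minimal sketch of the split mirror flow: suspend writes, split the
mirror at the storage layer, then initialize the copy (the storage-level
split itself happens outside DB2):
db2 set write suspend for database
(split the mirror using the storage subsystem)
db2 set write resume for database
(on the host with the split copy)
db2inidb musicdb as snapshot
SNAPSHOT starts crash recovery on the copy so it can be used for testing
or reporting.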
© Copyright IBM Corporation 2010
DB2 for LUW Advanced Database Recovery
CL491: DB2 for LUW Advanced Database Recovery (4 days
with Labs):
• Automated Recovery log management
• Dropped table recovery
• Table space recovery
• Incremental backup and restore
• Database crash recovery
• Database and table space relocation methods
• Additional recovery facilities
• Recovery history file
• Disaster recovery of DB2 databases
• DB2 high availability and split mirror support
• DB2 partitioned database recovery considerations
• DB2 high availability disaster recovery (HADR) implementation
© Copyright IBM Corporation 2010
Unit summary
Having completed this unit, you should be able to:
• Describe the major principles and methods for backup and
recovery
• State the three types of recovery used by DB2
• Explain the importance of logging for backup and recovery
• Describe how data logging takes place, including circular
logging and archival logging
• Use the BACKUP, RESTORE, ROLLFORWARD and
RECOVER commands
• Perform a table space backup and recovery
• Restore a database to the end of logs or to a point-in-time
• Discuss the configuration parameters and the recovery
history file and use these to handle various backup and
recovery scenarios
© Copyright IBM Corporation 2010
Student exercise
© Copyright IBM Corporation 2010