DataStage Designer
Designer Guide
Published by Ascential Software Corporation. © 1997-2003 Ascential Software Corporation. All rights reserved. Ascential, DataStage, QualityStage, AuditStage, ProfileStage, and MetaStage are trademarks of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Windows is a trademark of Microsoft Corporation. Unix is a registered trademark of The Open Group. Adobe and Acrobat are registered trademarks of Adobe Systems Incorporated. Other marks are the property of the owners of those marks. This product may contain or utilize third party components subject to the user documentation previously provided by Ascential Software Corporation or contained herein. Documentation Team: Mandy deBelin
Table of Contents
Preface
    Organization of This Manual
    Documentation Conventions
    User Interface Conventions
    DataStage Documentation
Chapter 1. Introduction
    About Data Warehousing
    Operational Databases Versus Data Warehouses
    Constructing the Data Warehouse
    Defining the Data Warehouse
    Data Extraction
    Data Aggregation
    Data Transformation
    Advantages of Data Warehousing
    About DataStage
    Client Components
    Server Components
    DataStage Projects
    DataStage Jobs
    DataStage NLS
    Character Set Maps and Locales
    DataStage Terms and Concepts
    Defining Table Definitions
    Developing a Job
    Adding Stages
    Linking Stages
    Editing the Stages
    Editing the UniVerse Stage
    Editing the Transformer Stage
    Editing the Sequential File Stage
    Compiling a Job
    Running a Job
    Analyzing Your Data Warehouse
    Parallel Job Stages
    Other Stages
    Links
    Linking Server Stages
    Linking Parallel Jobs
    Linking Mainframe Stages
    Link Ordering
    Developing the Job Design
    Adding Stages
    Moving Stages
    Renaming Stages
    Deleting Stages
    Linking Stages
    Editing Stages
    Cutting or Copying and Pasting Stages
    Using the Data Browser
    Using the Performance Monitor
    Compiling Server Jobs and Parallel Jobs
    Running Server Jobs and Parallel Jobs
    Generating Code for Mainframe Jobs
    Job Properties
    Server Job and Parallel Job Properties
    Specifying Job Parameters
    Job Control Routines
    Specifying Job Dependencies
    Specifying Performance Enhancements
    Specifying Maps and Locales for Server Jobs
    Specifying Maps and Locales for Parallel Jobs
    Generated OSH Page
    Specifying Execution Page Options
    Specifying Parallel Job Defaults
    Mainframe Job Properties
    Specifying Mainframe Job Parameters
    Specifying Mainframe Job Environment Properties
    Specifying Extension Variable Values
    Specifying Operational Meta Data
Chapter 5. Containers
    Local Containers
    Creating a Local Container
    Viewing or Modifying a Local Container
    Using Input and Output Stages
    Deconstructing a Local Container
    Shared Containers
    Creating a Shared Container
    Viewing or Modifying a Shared Container Definition
    Editing Shared Container Definition Properties
    Using a Shared Container in a Job
    Converting Containers
    Administrating Templates
    Creating a Job from a Template
    Using the Data Migration Assistant
Preface
This manual describes the features of the DataStage Designer. It is intended for application developers and system administrators who want to use DataStage to design and develop data warehousing applications.

If you are new to DataStage, read the first two chapters for an overview of data warehousing and the concepts and use of DataStage.

The manual contains enough information to get you started in designing DataStage jobs. For more detailed information about particular types of data source or data target, refer to DataStage Server: Server Job Developer's Guide, DataStage Enterprise Edition: Parallel Job Developer's Guide, and DataStage Enterprise MVS Edition: Mainframe Job Developer's Guide.
Documentation Conventions
This manual uses the following conventions:

Convention
    Usage

Bold
    In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.

UPPERCASE
    In syntax, uppercase indicates BASIC statements and functions and SQL statements and keywords.

Italic
    In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.

Plain
    In text, plain indicates Windows NT commands and options, file names, and path names.

Courier
    Courier indicates examples of source code and system output.

Courier Bold
    In examples, courier bold indicates characters that the user types or keys the user presses (for example, <Return>).

[ ]
    Brackets enclose optional items. Do not type the brackets unless indicated.

{ }
    Braces enclose nonoptional items from which you must select at least one. Do not type the braces.

itemA | itemB
    A vertical bar separating items indicates that you can choose only one item. Do not type the vertical bar.

...
    Three periods indicate that more of the same type of item can optionally follow.

➤
    A right arrow between menu commands indicates you should choose each command in sequence. For example, "Choose File ➤ Exit" means you should choose File from the menu bar, then choose Exit from the File pull-down menu.

This line ➥ continues
    The continuation character is used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen.

The following conventions are also used:

• Syntax definitions and examples are indented for ease in reading.
• All punctuation marks included in the syntax (for example, commas, parentheses, or quotation marks) are required unless otherwise indicated.
• Syntax lines that do not fit on one line in this manual are continued on subsequent lines. The continuation lines are indented. When entering syntax, type the entire syntax entry, including the continuation lines, on the same input line.
User Interface Conventions

[Figure: example DataStage dialog box with callouts labeling the Browse Button, Check Box, Option Button, and Button.]
The DataStage user interface makes extensive use of tabbed pages, sometimes nesting them to enable you to reach the controls you need from within a single dialog box. At the top level these are called pages; at the inner level they are called tabs. In the example above, we are looking at the General tab of the Inputs page. When using context-sensitive online help you will find that each page has a separate help topic, but each tab uses the help topic for the parent page. You can jump to the help pages for the separate tabs from within the online help.
DataStage Documentation
DataStage documentation includes the following:

• DataStage Designer Guide. This guide describes the DataStage Designer, and gives a general description of how to create, design, and develop a DataStage application.
• DataStage Manager Guide. This guide describes the DataStage Manager and describes how to use and maintain the DataStage Repository.
• DataStage Server: Server Job Developer Guide. This guide describes the tools that are used in building a server job, and it supplies programmer's reference information.
• DataStage Enterprise Edition: Parallel Job Developer Guide. This guide describes the tools that are used in building a parallel job, and it supplies programmer's reference information.
• DataStage Enterprise MVS Edition: Mainframe Job Developer Guide. This guide describes the tools that are used in building a mainframe job, and it supplies programmer's reference information.
• DataStage Director Guide. This guide describes the DataStage Director and how to validate, schedule, run, and monitor DataStage server jobs.
• DataStage Administrator Guide. This guide describes DataStage setup, routine housekeeping, and administration.
• DataStage Install and Upgrade Guide. This guide contains instructions for installing DataStage on Windows and UNIX platforms, and for upgrading existing installations of DataStage.
• DataStage NLS Guide. This guide contains information about using the NLS features that are available in DataStage when NLS is installed.

These guides are also available online in PDF format. You can read them with the Adobe Acrobat Reader supplied with DataStage. See DataStage Install and Upgrade Guide for details about installing the manuals and the Adobe Acrobat Reader.

You can use the Acrobat search facilities to search the whole DataStage document set. To use this feature, first choose Edit ➤ Search ➤ Select Indexes, then add the index dstage7.pdx from the DataStage client docs directory (for example, C:\Program Files\Ascential\DataStage\Docs). You can then choose Edit ➤ Search ➤ Query and enter the word or phrase you are searching for. (If the Search item is not available in your version of Acrobat Reader, install the version supplied with DataStage.)

Extensive online help is also supplied. This is especially useful when you have become familiar with using DataStage and need to look up particular pieces of information.
Chapter 1. Introduction
This chapter is an overview of data warehousing and DataStage.

The last few years have seen the continued growth of IT (information technology) and the requirement of organizations to make better use of the data they have at their disposal. This involves analyzing data in active databases and comparing it with data in archive systems.

Although consolidating data into a data mart or data warehouse offered a competitive edge, the cost of doing so was high. It also required the use of data warehousing tools from a number of vendors and the skill to create a data warehouse.

Developing a data warehouse or data mart involves design of the data warehouse and development of operational processes to populate and maintain it. In addition to the initial setup, you must be able to handle ongoing evolution to accommodate new data sources, processing, and goals.

DataStage simplifies the data warehousing process. It is an integrated product that supports extraction of the source data, cleansing, decoding, transformation, integration, aggregation, and loading of target databases.

Although primarily aimed at data warehousing environments, DataStage can also be used in any data handling, data migration, or data reengineering projects.
database can be accessed by all users, ensuring that each group in an organization is accessing valuable, stable data. A data warehouse is a snapshot of the operational databases combined with data from archives. The data warehouse can be created or updated at any time, with minimum disruption to operational systems. Any number of analyses can be performed on the data, which would otherwise be impractical on the operational sources.
The person who constructs the data warehouse must know the needs of users who will use the data warehouse or data marts. This means knowing the data contained in each operational database and how each database is related (if at all).
Data Extraction
The data in operational or archive systems is the primary source of data for the data warehouse. Operational databases can be indexed files, networked databases, or relational database systems. Data extraction is the process used to obtain data from operational sources, archives, and external data sources.
Data Aggregation
An operational data source usually contains records of individual transactions such as product sales. If the user of a data warehouse only needs a summed total, you can reduce records to a more manageable number by aggregating the data. The summed (aggregated) total is stored in the data warehouse. Because the number of records stored in the data warehouse is greatly reduced, it is easier for the end user to browse and analyze the data.
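As a rough illustration of the aggregation described above, the following Python sketch reduces individual sales transactions to one summed total per product. The record layout and the `aggregate_sales` name are assumptions made for this example; they are not part of DataStage.

```python
from collections import defaultdict

# Hypothetical transaction records (product, amount); not taken from any real source.
sales = [
    ("widget", 10.0),
    ("gadget", 4.5),
    ("widget", 2.5),
    ("gadget", 1.5),
]


def aggregate_sales(records):
    """Reduce individual transactions to one summed total per product."""
    totals = defaultdict(float)
    for product, amount in records:
        totals[product] += amount
    return dict(totals)


# Four transaction records become two aggregated rows.
totals = aggregate_sales(sales)
print(totals)
```

Only the aggregated rows would be stored in the data warehouse, which is why the number of records the end user has to browse drops.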
Data Transformation
Because the data in a data warehouse comes from many sources, the data may be in different formats or be inconsistent. Transformation is the process that converts data to a required definition and value.
Data is transformed using routines based on a transformation rule. For example, product codes can be mapped to a common format using a transformation rule that applies only to product codes. After data has been transformed, it can be loaded into the data warehouse in a recognized and required format.
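The idea of a transformation rule that applies only to product codes can be sketched in Python as follows. The target format (three letters, a hyphen, four digits), the column name `product_code`, and both function names are hypothetical choices for this example, not DataStage definitions.

```python
import re


def normalize_product_code(code):
    """Hypothetical rule: map differently formatted codes to the form AAA-9999."""
    # Drop separators such as spaces, dots, or slashes, then upper-case.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", code).upper()
    match = re.fullmatch(r"([A-Z]{3})(\d{1,4})", cleaned)
    if match is None:
        raise ValueError(f"unrecognized product code: {code!r}")
    prefix, number = match.groups()
    return f"{prefix}-{int(number):04d}"


# The rule is registered for one column only; other columns pass through unchanged.
RULES = {"product_code": normalize_product_code}


def transform_row(row, rules=RULES):
    """Apply each column's rule, if it has one, and return a new row."""
    return {col: rules[col](val) if col in rules else val for col, val in row.items()}


print(transform_row({"product_code": "abc 12", "qty": 3}))
```

Keeping the rule in a per-column table mirrors the point made above: the rule for product codes does not touch any other data.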
About DataStage
DataStage has the following features to aid the design and processing required to build a data warehouse:

• Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.
• Extracts data from any number or type of database.
• Handles all the meta data definitions required to define your data warehouse. You can view and modify the table definitions at any point during the design of your application.
• Aggregates data. You can modify SQL SELECT statements used to extract data.
• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms to use.
• Loads the data warehouse.

DataStage consists of a number of client and server components. For more information, see "Client Components" and "Server Components".

DataStage server and parallel jobs are compiled and run on the DataStage server. The job will connect to databases on other machines as necessary, extract data, process it, then write the data to the target data warehouse.

DataStage mainframe jobs are compiled and run on a mainframe. Data extracted by such jobs is then loaded into the data warehouse.
Client Components
DataStage has four client components which are installed on any PC running Windows 2000 or Windows NT 4.0 with Service Pack 4 or later:

• DataStage Designer. A design interface used to create DataStage applications (known as jobs). Each job specifies the data sources, the transforms required, and the destination of the data. Jobs are compiled to create executables that are scheduled by the Director and run by the Server (mainframe jobs are transferred and run on the mainframe).
• DataStage Director. A user interface used to validate, schedule, run, and monitor DataStage server jobs and parallel jobs.
• DataStage Manager. A user interface used to view and edit the contents of the Repository.
• DataStage Administrator. A user interface used to perform administration tasks such as setting up DataStage users, creating and moving projects, and setting up purging criteria.
Server Components
There are three server components:

• Repository. A central store that contains all the information required to build a data mart or data warehouse.
• DataStage Server. Runs executable jobs that extract, transform, and load data into a data warehouse.
• DataStage Package Installer. A user interface used to install packaged DataStage jobs and plug-ins.
DataStage Projects
You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to attach to a project. Each project contains:

• DataStage jobs.
• Built-in components. These are predefined components used in a job.
• User-defined components. These are customized components created using the DataStage Manager. Each user-defined component performs a specific task in a job.

A complete project may contain several jobs and user-defined components.

There is a special class of project called a protected project. Normally nothing can be added, deleted, or changed in a protected project. Users can view objects in the project, and perform tasks that affect the way a job runs rather than the job's design. Users with Production Manager status can import existing DataStage components into a protected project and manipulate projects in other ways.
DataStage Jobs
There are three basic types of DataStage job:

• Server jobs. These are compiled and run on the DataStage server. A server job will connect to databases on other machines as necessary, extract data, process it, then write the data to the target data warehouse.
• Parallel jobs. These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP, MPP, and cluster systems.
• Mainframe jobs. These are available only if you have Enterprise MVS Edition installed. A mainframe job is compiled and run on the mainframe. Data extracted by such jobs is then loaded into the data warehouse.

There are two other entities that are similar to jobs in the way they appear in the DataStage Designer, and are handled by it. These are:

• Shared containers. These are reusable job elements. They typically comprise a number of stages and links. Copies of shared containers can be used in any number of server jobs or parallel jobs and edited as required.
• Job Sequences. A job sequence allows you to specify a sequence of DataStage jobs to be executed, and actions to take depending on results.

DataStage jobs consist of individual stages. Each stage describes a particular database or process. For example, one stage may extract data from a data source, while another transforms it. Stages are added to a job and linked together using the Designer.

There are three basic types of stage:

• Built-in stages. Supplied with DataStage and used for extracting, aggregating, transforming, or writing data. All types of job have these stages.
• Plug-in stages. Additional stages that can be installed in DataStage to perform specialized tasks that the built-in stages do not support. Server jobs and parallel jobs can make use of these.
• Job Sequence Stages. Special built-in stages which allow you to define sequences of activities to run. Only Job Sequences have these.

The following diagram represents one of the simplest jobs you could have: a data source, a Transformer (conversion) stage, and the final database. The links between the stages represent the flow of data into or out of a stage.
[Figure: Data Source, linked to Transformer Stage, linked to Data Warehouse]
You must specify the data you want at each stage, and how it is handled. For example, do you want all the columns in the source data, or only a select few? Should the data be aggregated or converted before being passed on to the next stage? You can use DataStage with MetaBrokers in order to exchange meta data with other data warehousing tools. You might, for example, import table definitions from a data modelling tool.
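The simple three-stage job described above (a data source, a Transformer stage, and a target) can be pictured with a small Python sketch in which each stage is a function and each link is the stream of rows passed between them. This is only an analogy for how data flows along links; the function names, the sample rows, and the in-memory "warehouse" list are all hypothetical and do not reflect how DataStage compiles or runs jobs.

```python
# Hypothetical source rows; in a real job these would come from a database or file.
SOURCE_ROWS = [
    {"id": 1, "name": " alice ", "country": "us"},
    {"id": 2, "name": "Bob", "country": "gb"},
]


def source_stage(rows):
    """Data source: emit rows one at a time onto the outgoing link."""
    for row in rows:
        yield row


def transformer_stage(link, columns):
    """Transformer: keep only the selected columns and convert the name value."""
    for row in link:
        selected = {col: row[col] for col in columns}
        selected["name"] = selected["name"].strip().title()
        yield selected


def target_stage(link, table):
    """Target: append each incoming row to the table and count the rows loaded."""
    count = 0
    for row in link:
        table.append(row)
        count += 1
    return count


warehouse = []
rows_loaded = target_stage(
    transformer_stage(source_stage(SOURCE_ROWS), columns=["id", "name"]),
    warehouse,
)
print(rows_loaded, warehouse)
```

The `columns` argument corresponds to the question of whether you want all the columns in the source data or only a select few, and the conversion inside `transformer_stage` corresponds to handling the data before it is passed on to the next stage.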
DataStage NLS
DataStage has built-in National Language Support (NLS). With NLS installed, DataStage can do the following:

• Process data in a wide range of languages
• Accept data in any character set into most DataStage fields
• Use local formats for dates, times, and money (server jobs)
• Sort data according to local rules
• Convert data between different encodings of the same language (for example, for Japanese it can convert JIS to EUC)

DataStage NLS is optionally installed as part of the DataStage server. If NLS is installed, various extra features (such as dialog box pages and drop-down lists) appear in the product. If NLS is not installed, these features do not appear.

NLS is implemented in different ways for server jobs and parallel jobs, and each has its own set of maps:

• For server jobs, NLS is implemented by the DataStage server engine.
• For parallel jobs, NLS is implemented using the ICU library.

Data is processed in Unicode format. This is an international standard character set that contains nearly all the characters used in languages around the world. DataStage maps data to or from Unicode format as required.

For more detailed information about DataStage's implementation of NLS, see DataStage NLS Guide.
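The general idea of mapping data to Unicode and back out to another encoding can be shown with Python's standard codecs. This sketch assumes that "JIS" corresponds to Python's `iso2022_jp` codec and "EUC" to `euc_jp`; the `convert_encoding` helper is hypothetical, and DataStage's own character set maps are separate from these codecs.

```python
def convert_encoding(data, source_map, target_map):
    """Decode bytes to Unicode text, then encode the text in the target encoding."""
    text = data.decode(source_map)   # map from the source encoding to Unicode
    return text.encode(target_map)   # map from Unicode to the target encoding


# Sample Japanese text encoded as JIS (assumed here to mean ISO-2022-JP).
jis_bytes = "日本語".encode("iso2022_jp")

# Convert JIS to EUC by way of Unicode.
euc_bytes = convert_encoding(jis_bytes, "iso2022_jp", "euc_jp")
print(euc_bytes.decode("euc_jp"))
```

Passing through Unicode means each encoding needs only one map to and from Unicode, rather than a separate converter for every pair of encodings.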
built-in transforms
column definition
Term
    Description

Column Export stage
    A parallel job stage that exports a column of another type to a string or binary column.

Column Import stage
    A parallel job stage that imports a column from a string or binary column.

Combine Records stage
    A parallel job stage that combines several columns associated by a key field to build a vector.

Compare stage
    A parallel job stage that performs a column by column compare of two pre-sorted data sets.

Complex Flat File stage
    A mainframe source stage that extracts data from a flat file containing complex data structures, such as arrays, groups, and redefines.

Compress stage
    A parallel job stage that compresses a data set.

container
    A group of stages and links in a job design.

Container stage
    A built-in stage type that represents a group of stages and links in a job design.

Copy stage
    A parallel job stage that copies a data set.

custom transform
    A transform function defined by the DataStage developer.

Data Browser
    A tool used from within the DataStage Manager or DataStage Designer to view the content of a table or file.

data element
    A specification that describes the type of data in a column and how the data is converted. (Server jobs only.)

DataStage Administrator
    A tool used to configure DataStage projects and users. For more details, see DataStage Administrator Guide.

DataStage Designer
    A graphical design tool used by the developer to design and develop a DataStage job.

DataStage Director
    A tool used by the operator to run and monitor DataStage server jobs.

DataStage Manager
    A tool used to view and edit definitions in the Repository.

DataStage Package Installer
    A tool used to install packaged DataStage jobs and plug-ins.
Data Set stage
    A parallel job stage. Stores a set of data.

DB2 stage
    A parallel stage that allows you to read and write a DB2 database.

DB2 Load Ready Flat File stage
    A mainframe target stage. It writes data to a flat file in Load Ready format and defines the meta data required to generate the JCL and control statements for invoking the DB2 Bulk Loader.

Decode stage
    A parallel job stage that uses a UNIX command to decode a previously encoded data set.

[term missing]
    A mainframe target stage that writes data to a delimited flat file.

[term missing]
    The person designing and developing DataStage jobs.

[term missing]
    A parallel job stage that compares two data sets and works out the difference between them.

[term missing]
    A parallel job stage that encodes a data set using a UNIX command.

[term missing]
    A parallel job stage that expands a previously compressed data set.

[term missing]
    An interactive editor that helps you to enter correct expressions into a Transformer stage in a DataStage job design.

[term missing]
    A parallel job stage that uses an external program to filter a data set.

External Routine stage
    A mainframe processing stage that calls an external routine and passes row elements to it.

External Source stage
    A mainframe source stage that allows a mainframe job to read data from an external source. A parallel job stage that allows a parallel job to read a data source.

[term missing]
    A mainframe target stage that allows a mainframe job to write data to an external source. A parallel job stage that allows a parallel job to write to a data source.
File Set stage
    Parallel job stage. A set of files used to store data.

Filter stage
    Parallel job stage. Filters out records from an input data set.

Fixed-Width Flat File stage
    A mainframe source/target stage. It extracts data from binary fixed-width flat files, or writes data to such a file.

FTP stage
    A mainframe post-processing stage that generates JCL to perform an FTP operation.

Funnel stage
    A parallel job stage that copies multiple data sets to a single data set.

Generator stage
    A parallel job stage that generates a dummy data set.

Graphical performance monitor
    A monitor that displays status information and performance statistics against links in a job open in the DataStage Designer canvas as the job runs in the Director or debugger.

[term missing]
    A stage that extracts data from or loads data into a database that contains hashed files. (Server jobs only)

Head stage
    A parallel job stage that copies the specified number of records from the beginning of a data partition.

[term missing]
    A parallel job stage that allows you to read and write an Informix XPS database.

[term missing]
    DataStage comes complete with a number of intelligent assistants. These lead you step by step through some of the basic DataStage operations.

Inter-process stage
    A server job stage that allows you to run server jobs in parallel on an SMP system.
job
    A collection of linked stages, data elements, and transforms that define how to extract, cleanse, transform, integrate, and load data into a target database. Jobs can be server jobs, parallel jobs, or mainframe jobs.

job control routine
    A routine that is used to create a controlling job, which invokes and runs other jobs.

job sequence
    A controlling job which invokes and runs other jobs, built using the graphical job sequencer.

Join stage
    A mainframe processing stage or parallel job active stage that joins two input sources.

Link collector stage
    A server job stage that collects previously partitioned data together.

Link partitioner stage
    A server job stage that allows you to partition data so that it can be processed in parallel on an SMP system.

local container
    A container which is local to the job in which it was created.

Lookup stage
    A mainframe processing stage and parallel job active stage that performs table lookups.

Lookup File stage
    A parallel job stage that provides storage for a lookup table.

mainframe job
    A job that is transferred to a mainframe, then compiled and run there.

Make Subrecord stage
    A parallel job stage that combines a number of vectors to form a subrecord.

Make Vector stage
    A parallel job stage that combines a number of fields to form a vector.

Merge stage
    A parallel job stage that combines data sets.

meta data
    Data about data, for example, a table definition describing columns in which data is structured.

MetaBroker
    A tool that allows you to exchange meta data between DataStage and other data warehousing tools.
1-14
Term MPP
Description Type of system providing parallel processing. In MPP (massively parallel processing) systems, there are multiple processors, and each has its own hardware resources such as disk and memory. A parallel job stage that alters the column definitions of the output data set. A mainframe source stage that handles different formats in flat file data sources. National Language Support. With NLS enabled, DataStage can support the handling of data in a variety of character sets. The conversion of records in NF2 (non-first-normal form) format, containing multivalued data, into one or more 1NF (first normal form) rows. A special value representing an unknown value. This is not the same as 0 (zero), a blank, or an empty string. A stage that extracts data from or loads data into a database that implements the industry standard Open Database Connectivity API. Used to represent a data source, an aggregation step, or a target data table. (Server jobs only) The person scheduling and monitoring DataStage jobs. A plug-in stage supplied with DataStage that bulk loads data into an Oracle 7 database table. (Server jobs only) A parallel job stage that allows you to read and write an Oracle database. The DataStage option that allows you to run parallel jobs. A type of DataStage job that allows you to take advantage of parallel processing on SMP, MPP, and cluster systems.
normalization
null value
ODBC stage
Description A parallel job stage that prints column values to the screen as records are copied from its input data set to one or more output data sets. A definition for a plug-in stage. A stage that performs specific processing that is not supported by the standard server job or parallel job stages. A parallel job stage that promotes the members of a subrecord to a top level field. A mainframe source/target stage that reads from or writes to an MVS/DB2 database. A parallel job stage that removes duplicate entries from a data set. A DataStage area where projects and jobs are stored as well as definitions for all standard and user-defined data elements, transforms, and stages. A parallel job stage that allows you to run SAS applications from within the DataStage job. A parallel job stage that provides storage for SAS data sets. A parallel job stage that samples a data set. A stage that extracts data from, or writes data to, a text file. (Server job and parallel job only) A job that is compiled and run on the DataStage server. A container which exists as a separate item in the Repository and can be used by any server job in the project. DataStage supports both server and parallel shared containers. Type of system providing parallel processing. In SMP (symmetric multiprocessing) systems, there are multiple processors, but these share other hardware resources such as disk and memory. A mainframe processing stage or parallel job active stage that sorts input columns.
SAS stage Parallel SAS Data Set stage Sample stage Sequential File stage server job shared container
SMP
Sort stage
Term source
Description A source in DataStage terms means any database, whether you are extracting data from it or writing data to it. A parallel job stage that separates a number of subrecords into top level columns. A parallel job stage that separates a number of vector members into separate columns. A component that represents a data source, a processing step, or the data mart in a DataStage job. A parallel job stage which splits an input data set into different output sets depending on the value of a selector field. A definition describing the data you want including information about the data table and the columns associated with it. Also referred to as meta data. A parallel job stage that copies the specified number of records from the end of a data partition. A parallel stage that allows you to read and write a Teradata database. A function that takes one value and computes another value from it.
Switch stage
table definition
Tail stage
Chapter 2. Your First DataStage Project
This chapter describes the steps you need to follow to create your first data warehouse, using the sample data provided. The example builds a server job and uses a UniVerse table called EXAMPLE1, which is automatically copied into your DataStage project during server installation.
EXAMPLE1 represents an SQL table from a wholesaler who deals in car parts. It contains details of the wheels they have in stock. There are approximately 255 rows of data and four columns:
CODE. The product code for each type of wheel.
PRODUCT. A text description of each type of wheel.
DATE. The date new wheels arrived in stock (given in terms of year, month, and day).
QTY. The number of wheels in stock.
The aim of this example is to develop and run a DataStage job that:
Extracts the data from the file.
Converts (transforms) the data in the DATE column from a complete date (YYYY-MM-DD) stored in internal date format, to a year and month (YYYY-MM) stored as a string.
Loads data from the DATE, CODE, and QTY columns into a data warehouse.
The data warehouse is a sequential file that is created when you run the job.
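To make the shape of the source data concrete, the following is a minimal sketch of a table with the four EXAMPLE1 columns and a query that extracts only the three columns the job loads. It is not how DataStage reads UniVerse: SQLite is used purely as a stand-in so the sketch can run anywhere, the two rows are invented for illustration, and the column types and the extract_rows helper are assumptions.

```python
import sqlite3

# SQLite stands in for the UniVerse source; the column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE EXAMPLE1 ("CODE" TEXT, "PRODUCT" TEXT, "DATE" INTEGER, "QTY" INTEGER)'
)

# Two invented rows. DATE holds an illustrative internal day number,
# not a value taken from the real sample data.
conn.executemany(
    "INSERT INTO EXAMPLE1 VALUES (?, ?, ?, ?)",
    [("W001", "Alloy wheel", 10000, 12), ("W002", "Steel wheel", 10031, 40)],
)


def extract_rows(connection):
    """Return only the columns the example job loads: DATE, CODE, and QTY."""
    cursor = connection.execute(
        'SELECT "DATE", "CODE", "QTY" FROM EXAMPLE1 ORDER BY "CODE"'
    )
    return cursor.fetchall()


rows = extract_rows(conn)
```

The PRODUCT column is deliberately left out of the query, mirroring the fact that the job loads only three of the four columns.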
To load a data mart or data warehouse, you must do the following:
Set up your project
Create a job
Develop the job
Edit the stages in the job
Compile the job
Run the job
This chapter describes the minimum tasks required to create a DataStage job. In the example, you will use the built-in settings and options supplied with DataStage. However, because DataStage allows you to customize and extend the built-in functionality provided, it is possible to perform additional processing at each step. Where this is possible, additional procedures are listed under a section called Advanced Procedures. These advanced procedures are discussed in detail in subsequent chapters.
This dialog box appears when you start the DataStage Designer, Manager, or Director client components from the DataStage program folder. In all cases, you must attach to a project by entering your logon details.
Note: The program group may be called something other than DataStage, depending on how DataStage was installed.
To connect to a project:
1. Enter the name of your host in the Host system field. This is the name of the system where the DataStage Server components are installed.
2. Enter your user name in the User name field. This is your user name on the server system.
3. Enter your password in the Password field.
Note: If you are connecting to the server via LAN Manager, you can select the Omit check box. The User name and Password fields gray out and you log on to the server using your Windows NT Domain account details.
4. Choose the project to connect to from the Project drop-down list box. This list box displays all the projects installed on your DataStage server. Choose your project from the list box. At this point, you may only have one project installed on your system and this is displayed by default.
5. Click OK. The DataStage Designer window appears with the New dialog box open, ready for you to create a new job:
Creating a Job
When a DataStage project is installed, it is empty and you must create the jobs you need. Each DataStage job can load one or more data tables in the final data warehouse. The number of jobs you have in a project depends on your data sources and how often you want to extract data or load the data warehouse.
Jobs are created using the DataStage Designer. For this example, you need to create a server job, so double-click the New Server Job icon. The diagram window appears, in the right pane of the Designer, along with the Tool palette for server jobs. You can now save the job and give it a name.
To save the job:
1. Choose File > Save. The Create new job dialog box appears:
2. Enter Example1 in the Job name field.
3. Enter Example in the Category field.
4. Click OK to save the job. The updated DataStage Designer window displays the name of the saved job.
3. Click OK. The updated Import Meta data (UniVerse Tables) dialog box displays all the files for the chosen data source name:
Note: The screen shot shows an example of tables found under localuv. Your system may contain different files from the ones shown here.
4. Select project.EXAMPLE1 from the Tables list box, where project is the name of your DataStage project.
5. Click OK. The column information from EXAMPLE1 is imported into DataStage. A table definition is created and is stored under the Table Definitions > UniVerse > localuv branch in the Repository. The updated DataStage Designer window displays the new table definition entry in the Repository window.
To view the new table definition, double-click the project.EXAMPLE1 item in the Repository window. The Table Definition dialog box appears. This dialog box has up to five pages. Click the tabs to display each page.
The General page contains information about where the data is found and when the definition was created.
The Columns page contains information about the columns in the data source table. You should see the following columns for project.EXAMPLE1:
The Format page contains information describing how the data would be formatted when written to a sequential file. You do not need to edit this page.
The Relationships page gives foreign key information about the table. We are not using foreign keys in this exercise, so you do not need to edit this page.
The NLS page is present if you have NLS installed. It shows the current character set map for the table definitions. The map defines the character set that the data is in. You do not need to edit this page.
Advanced Procedures
To manually enter table definitions, see Chapter 7, Intelligent Assistants.
Developing a Job
Jobs are designed and developed using the Designer. The job design is developed in the Diagram window (the one with grid lines). Each data source, the data warehouse, and each processing step is represented by a stage in the job design. The stages are linked together to show the flow of data.
This example requires three stages:
A UniVerse stage to represent EXAMPLE1 (the data source).
A Transformer stage to convert the data in the DATE column from a YYYY-MM-DD date in internal date format to a string giving just year and month (YYYY-MM).
A Sequential File stage to represent the file created at run time (the data warehouse in this example).
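One way to picture a job design is as a small directed graph: stages are the nodes and links are the edges along which data flows. The sketch below is only an illustration of that idea, not DataStage's internal representation. The Job class, its methods, and the stage names used as identifiers are all hypothetical, and flow_order assumes a simple linear chain like the one in this example.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical model of a job design: named stages joined by directed links."""
    stages: list = field(default_factory=list)
    links: list = field(default_factory=list)  # (source stage, target stage) pairs

    def add_stage(self, name):
        self.stages.append(name)

    def add_link(self, source, target):
        # A link must join two stages that are already on the canvas.
        if source not in self.stages or target not in self.stages:
            raise ValueError("both ends of a link must be existing stages")
        self.links.append((source, target))

    def flow_order(self):
        """Walk the links from the stage with no incoming link (linear jobs only)."""
        targets = {target for _, target in self.links}
        current = next(stage for stage in self.stages if stage not in targets)
        following = dict(self.links)
        order = [current]
        while current in following:
            current = following[current]
            order.append(current)
        return order


job = Job()
for stage_name in ("UniVerse", "Transformer", "Sequential File"):
    job.add_stage(stage_name)
job.add_link("UniVerse", "Transformer")
job.add_link("Transformer", "Sequential File")
```

Calling job.flow_order() walks the data flow from the source stage to the target stage, which corresponds to reading the Diagram window from left to right.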
Adding Stages
Stages are added using the tool palette. This palette contains icons that represent the components you can add to a job. The palette has different groups to organize the tools available. Click the group title to open the group. A typical tool palette is shown below:
To add a stage:
1. Click the stage button on the tool palette that represents the stage type you want to add.
2. Click in the Diagram window where you want the stage to be positioned. The stage appears in the Diagram window as a square.
You can also drag items from the palette to the Diagram window. We recommend that you position your stages as follows:
Data sources on the left
Data warehouse on the right
Transformer stage in the center
When you add stages, they are automatically assigned default names. These names are based on the type of stage and the number of the item in the Diagram window. You can use the default names in the example. Once all the stages are in place, you can link them together to show the flow of data.
Linking Stages
You need to add two links:
One between the UniVerse and Transformer stages
One between the Transformer and Sequential File stages
Links are always made in the direction the data will flow, that is, usually left to right. When you add links, they are assigned default names. You can use the default names in the example.
To add a link:
1. Right-click the first stage, hold the mouse button down, and drag the link to the Transformer stage. Release the mouse button.
2. Right-click the Transformer stage and drag the link to the Sequential File stage. The following screen shows how the Diagram window looks when you have added the stages and links:
3. Keep the Designer open as you will need it for the next step.
Advanced Procedures
For more advanced procedures, see the following topics in Chapter 4:
Moving Stages on page 4-28
Renaming Stages on page 4-28
Deleting Stages on page 4-28
This dialog box has two pages:
Stage. Displayed by default. This page contains the name of the stage you are editing. The General tab specifies where the file is found and the connection type.
Outputs. Contains information describing the data flowing from the stage. You edit this page to describe the data you want to extract from the file. In this example, the output from this stage goes to the Transformer stage.
To edit the UniVerse stage:
1. Check that you are displaying the General tab on the Stage page. Choose localuv from the Data source name drop-down list. localuv is where EXAMPLE1 is copied to during installation. The remaining parameters on the General and Details tabs are used to enter logon details and describe where to find the file. Because EXAMPLE1 is installed in localuv, you do not have to complete these fields, which are disabled.
2. Click the Outputs tab. The Outputs page appears:
The Outputs page contains the name of the link the data flows along and the following four tabs:
General. Contains the name of the table to use and an optional description of the link.
Columns. Contains information about the columns in the table.
Selection. Used to enter an optional SQL SELECT clause (an Advanced procedure).
View SQL. Displays the SQL SELECT statement used to extract the data.
3. Choose dstage.EXAMPLE1 from the Available tables drop-down list.
4. Click Add to add dstage.EXAMPLE1 to the Table names field.
5. Click the Columns tab. The Columns tab appears at the front of the dialog box. You must specify the columns contained in the file you want to use. Because the column definitions are stored in a table definition in the Repository, you can load them directly.
6. Click Load. The Table Definitions window appears with the UniVerse > localuv branch highlighted.
7. Select dstage.EXAMPLE1. The Select Columns dialog box appears, allowing you to select which column definitions you want to load.
8. In this case you want to load all available column definitions, so just click OK. The column definitions specified in the table definition are copied to the stage. The Columns tab contains definitions for the four columns in EXAMPLE1:
9. You can use the Data Browser to view the actual data that is to be output from the UniVerse stage. Click the View Data button to open the Data Browser window.
10. Click OK to save the stage edits and close the UniVerse Stage dialog box. Notice that a small table icon appears on the output link to indicate that it now has column definitions associated with it.
11. Choose File > Save to save your job design so far.
Note: In server jobs, column definitions are attached to a link. You can view or edit them at either end of the link. If you change them in a stage at one end of the link, the changes are automatically seen in the stage at the other end of the link. This is how column definitions are propagated through all the stages in a DataStage server job, so the column definitions you loaded into the UniVerse stage are viewed when you edit the Transformer stage.
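The note above can be pictured as two stages holding a reference to the same link object, so that column definitions stored on the link are visible from either end. The following is a minimal sketch of that idea only. The Link and Stage classes are hypothetical and are not DataStage's own data model; DSLink3 is the default link name that appears later in this example.

```python
class Link:
    """Hypothetical link that owns the column definitions travelling along it."""

    def __init__(self, name):
        self.name = name
        self.columns = []


class Stage:
    """Hypothetical stage that refers to its links instead of copying their columns."""

    def __init__(self, name, input_link=None, output_link=None):
        self.name = name
        self.input_link = input_link
        self.output_link = output_link


# One link object is shared by the stage at each end.
link = Link("DSLink3")
universe_stage = Stage("UniVerse", output_link=link)
transformer_stage = Stage("Transformer", input_link=link)

# Loading column definitions in the UniVerse stage stores them on the link...
universe_stage.output_link.columns.extend(["CODE", "PRODUCT", "DATE", "QTY"])

# ...so the Transformer stage sees them on its input without any copying.
seen_by_transformer = transformer_stage.input_link.columns
```

Because both stages point at one object, an edit made at either end is the same edit, which is the behavior the note describes.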
Double-click the Transformer stage to edit it. The Transformer Editor appears:
Input columns are shown on the left, output columns on the right. The upper panes show the columns together with derivation details, the lower panes show the column meta data. In this case, input columns have already been defined for input link DSLink3. No output columns have been defined for output link DSLink4, so the right panes are blank.
The next steps are to define the columns that will be output by the Transformer stage, and to specify the transform that will enable the stage to convert the type and format of dates before they are output.
1. Working in the upper-left pane of the Transformer Editor, select the input columns that you want to derive output columns from. Click on the CODE, DATE, and QTY columns while holding down the Ctrl key.
2. Click the left mouse button again and, keeping it held down, drag the selected columns to the output link in the upper-right pane. Drop the columns over the Column Name field by releasing the mouse button.
The columns appear in the top pane and the associated meta data appears in the lower-right pane:
The next step is to edit the meta data for the input and output links. You will be transforming dates from YYYY-MM-DD, presented in internal date format, to strings containing the date in the form YYYY-MM. You need to select a data element for the input DATE column, to specify that the date is input to the transform in internal format, and a new SQL type and data element for the output DATE column, to specify that it will be carrying a string. You do this in the lower-left and lower-right panes of the Transformer Editor.
3. In the Data element field for the DSLink3.DATE column, select Date from the drop-down list.
4. In the SQL type field for the DSLink4 DATE column, select Char from the drop-down list.
5. In the Length field for the DSLink4 DATE column, enter 7.
6. In the Data element field for the DSLink4 DATE column, select MONTH.TAG from the drop-down list.
Next you will specify the transform to apply to the input DATE column to produce the output DATE column. You do this in the upper-right pane of the Transformer Editor.
7. Double-click the Derivation field for the DSLink4 DATE column. The Expression Editor box appears. At the moment, the box contains the text DSLink3.DATE, which indicates that the output DATE column is directly derived from the input DATE column. Select the text DSLink3 and delete it by pressing the Delete key.
8. Right-click in the Expression Editor box to open the Suggest Operand menu:
9. Select DS Transform from the menu. The Expression Editor then displays the transforms that are applicable to the MONTH.TAG data element:
10. Select the MONTH.TAG transform. It appears in the Expression Editor box with the argument field [%Arg1%] highlighted.
11. Right-click to open the Suggest Operand menu again. This time, select Input Column. A list of available input columns appears:
12. Select DSLink3.DATE. This then becomes the argument for the transform.
13. Click OK to save the changes and exit the Transformer Editor. Once more the small icon appears on the output link from the Transformer stage to indicate that the link now has column definitions associated with it.
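The effect of the transform set up in the steps above can be sketched outside DataStage as a small function that turns an internal date into a seven-character YYYY-MM string. This is an illustration, not the implementation of the built-in MONTH.TAG transform. It assumes the UniVerse convention that an internal date is a count of days with day 0 falling on 31 December 1967, and the names UNIVERSE_DAY_ZERO and month_tag are hypothetical.

```python
from datetime import date, timedelta

# Assumption: internal dates count days from day 0 = 31 December 1967.
UNIVERSE_DAY_ZERO = date(1967, 12, 31)


def month_tag(internal_date):
    """Convert an internal day number to a year-and-month string (YYYY-MM)."""
    calendar_date = UNIVERSE_DAY_ZERO + timedelta(days=internal_date)
    return f"{calendar_date.year:04d}-{calendar_date.month:02d}"


# Day 1 is 1 January 1968, so only the year and month survive the conversion.
example = month_tag(1)
```

The result is always seven characters long, which is why the output DATE column is given the SQL type Char with a length of 7.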
This dialog box has two pages:
Stage. Displayed by default. This page contains the name of the stage you are editing and two tabs. The General tab specifies the line termination type, and the NLS tab specifies a character set map to use with the stage (this appears if you have NLS installed).
Inputs. Describes the data flowing into the stage. This page only appears when you have an input to a Sequential File stage. You do not need to edit the column definitions on this page, because they were all specified in the Transformer stage.
To edit the Sequential File stage:
1. Click the Inputs tab. The Inputs page appears. This page contains:
The name of the link. This is automatically set to the link name used in the job design.
General tab. Contains the pathname of the file, an optional description of the link, and update action choices. You can use the default settings for this example, but you may want to enter a file name (by default the file is named after the input link).
Format tab. Determines how the data is written to the file. In this example, the data is written using the default settings, that is, as a comma-delimited file.
Columns tab. Contains the column definitions for the data you want to extract. This tab contains the column definitions specified in the Transformer stage's output link.
2. Enter the pathname of the text file you want to create in the File name field, for example, seqfile.txt. By default the file is placed in the server project directory (for example, c:\Ascential\DataStage\Projects\datastage) and is named after the input link, but you can enter, or browse for, a different directory.
3. Click OK to close the Sequential File Stage dialog box.
4. Choose File > Save to save the job design.
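As a rough picture of what a comma-delimited target file looks like, the sketch below writes rows with the three output columns to a text stream, one record per line. It is only an approximation: Python's csv module is a stand-in, the quoting and line-termination choices are assumptions and may not match the stage's Format tab defaults, and the write_rows helper and the sample row are invented for illustration.

```python
import csv
import io


def write_rows(rows, target):
    """Write each row as one comma-delimited line to an open text target."""
    writer = csv.writer(target, lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# An in-memory buffer stands in for a file such as seqfile.txt.
buffer = io.StringIO()
write_rows([("W001", "1968-01", 12)], buffer)
written = buffer.getvalue()
```

To write a real file instead, the same helper could be given a file opened with open("seqfile.txt", "w", newline="").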
Compiling a Job
When you finish your design you must compile it to create an executable job. Jobs are compiled using the Designer. To compile the job, do one of the following:
Choose File > Compile.
Click the Compile button on the toolbar.
The Compile Job window appears:
The job is compiled. The result of the compilation appears in the display area. If the result of the compilation is "Job successfully compiled with no errors", you can go on to schedule or run the job. The executable version of the job is stored in your project along with your job design. If an error is displayed, click Show Error. The stage where the problem occurs is highlighted in the job design. Check that all the input and output column definitions have been specified correctly, and that you have entered directory paths and file or table names where appropriate. For more information about the error, click More. Click Close to close the Compile Job window.
Running a Job
Executable jobs are scheduled by the DataStage Director and run by the DataStage Server. You can start the Director from the Designer by choosing Tools > Run Director. When the Director is started, the DataStage Director window appears with the status of all the jobs in your project:
Highlight your job in the Job name column. To run the job, choose Job > Run Now or click the Run button on the toolbar. The Job Run Options dialog box appears and allows you to specify any parameter values and to specify any job run limits. In this case, just click Run. The status changes to Running. When the job is complete, the status changes to Finished. Choose File > Exit to close the DataStage Director window. Refer to the DataStage Director Guide for more information about scheduling and running jobs.
Advanced Procedures
It is possible to run a job from within another job. For more information, see Job Control Routines on page 4-67 and Chapter 6, Job Sequences.
Chapter 3. DataStage Designer Overview
This chapter describes the main features of the DataStage Designer. It tells you how to start the Designer and takes a quick tour of the user interface.
You can also start the Designer from the shortcut icon on the desktop, or from the DataStage Suite applications bar if you have it installed.
You must connect to a project as follows:
1. Enter the name of your host in the Host system field. This is the name of the system where the DataStage Server components are installed.
2. Enter your user name in the User name field. This is your user name on the server system.
3. Enter your password in the Password field.
Note: If you are connecting to the server via LAN Manager, you can select the Omit check box. The User name and Password fields gray out and you log on to the server using your Windows NT Domain account details.
4. Choose the project to connect to from the Project drop-down list box. This list box displays all the projects installed on your DataStage server.
5. Click OK. The DataStage Designer window appears, by default with the New dialog box open, allowing you to choose a type of job to create. You can set options to specify that the Designer opens with an empty server or mainframe job, or nothing at all; see Specifying Designer Options on page 3-24.
Note: You can also start the DataStage Designer directly from the DataStage Manager or Director by choosing Tools > Run Designer.
Server Shared containers. These are reusable job elements. Copies of shared containers can be used in any number of server jobs and edited as required. They can also be used in parallel jobs to make server job functionality available.
Parallel Shared containers. These are reusable job elements. Copies of shared containers can be used in any number of parallel jobs and edited as required.
Job Sequences. A job sequence allows you to specify a sequence of DataStage server and parallel jobs to be executed, and actions to take depending on results.
New Data Migration Job. This quickly generates a parallel job to move data from a single source to a single target.
New Slowly Changing Dimension Job. This assistant generates a collection of jobs and a job sequence that coordinates the execution of these jobs. These jobs help you implement one of three types of slowly changing dimension operation.
Or you can choose to open an existing job of any of these types. You can use the DataStage options to specify that the Designer always opens a new server or mainframe job, shared container, or job sequence when it starts. The initial appearance of the DataStage Designer is shown below:
The design pane on the right side and the Property browser are both empty, and a limited number of menus appear on the menu bar. To see a more fully populated Designer window, choose File > New and choose the type of job to create from the New dialog box (this process will be familiar to you if you worked through the example in Chapter 2, Your First DataStage Project). For the purposes of this example, we created a server job.
Menu Bar
There are nine pull-down menus. The commands available in each menu change depending on whether you are currently displaying a server job, parallel job, or a mainframe job.
File. Creates, opens, closes, and saves DataStage jobs. Also sets up printers, compiles server and parallel jobs, runs server and parallel jobs, generates and uploads mainframe jobs, and exits the Designer.
Edit. Allows you to undo and redo actions, and cut or copy items on the current diagram and paste them into another job or a new shared container. Renames or deletes stages and links in the Diagram window. Defines job properties (Job Properties item), and displays the stage dialog boxes (Properties item). Allows you to construct local or shared containers, deconstruct local containers, and convert local containers to shared containers and vice versa. Selects all items in a diagram window.
View. Determines what is displayed in the DataStage Designer window. Displays or hides the toolbar, tool palette, status bar, Repository window, and Property browser. For server jobs and server shared containers only, allows you to display or hide the debug bar. Other commands allow you to customize the tool palette and refresh the view of the Repository items in the Repository window.
Diagram. Determines what actions are performed in the Diagram window. Displays or hides the grid or print lines, enables or disables annotations, activates or deactivates the Snap to Grid option, zooms in or out of the Diagram window, and aligns selected items to the grid. Also turns performance monitoring on for server or parallel jobs. The snap to grid and zoom properties are applied to the job or container window currently selected. The settings are saved when the job or container is saved and restored when it is opened. The other settings are personal to you, and are saved between DataStage sessions ready for you to use again. When you change personal settings they affect all open windows immediately.
Debug. This menu is available only for server jobs and server shared containers. Gives access to the debugger commands.
Tools. Allows you to define the Designer options. Starts the DataStage Manager or Director, and, if they are installed, MetaStage, QualityStage, ProfileStage, AuditStage, and Version Control. If you are running Parallel Extender on a UNIX server, you can open the Data Set Manager and create new stage types. Also lets you invoke third-party applications, or add third-party applications to the Designer. Window. Allows you to close the current window, or all windows. Specifies how the windows are displayed and arranges the icons.
Help. Invokes the Help system. Help is available from all areas of the Designer.
For links, the property browser gives: Name Input link description Output link description
Detailed information is in DataStage Developer's Help and DataStage Manager Guide. A guide to defining and editing table definitions is given in this guide (Chapter 7) because table definitions are so central to job design.
In the Designer Repository window you can perform any of the actions that you can perform from the Repository tree in the Manager. When you select a category in the tree, a shortcut menu allows you to create a new item under that category or a new subcategory, or, for Table Definition categories, import a table definition from a data source. When you select an item in the tree, a shortcut menu allows you to perform various tasks depending on the type of item selected:
Data elements, machine profiles, routines, transforms, IMS Databases, IMS Viewsets. You can create a copy of these items, rename them, delete them, and display the properties of the item. Provided the item is not read-only, you can edit the properties.
Jobs, shared containers. You can create a copy of these items, add them to the palette, rename them, delete them, and edit them in the diagram window.
Stage types. You can add stage types to the diagram window palette and display their properties. If the stage belongs in a shortcut container, DataStage will add it there. Otherwise it will add it to the correct group. Provided the item is not read-only, you can edit the properties.
Table definitions. You can create a copy of table definitions, rename them, delete them, and display the properties of the item. Provided the item is not read-only, you can edit the properties. You can also import table definitions from data sources.
It is a good idea to choose View > Refresh from the main menu bar before acting on any Repository items to ensure that you have a completely up-to-date view. You can drag certain types of item from the Repository window onto a diagram window, or onto specific components within a job:
Jobs. The job opens in a new diagram window or, if dragged to a job sequence window, is added to the job sequence.
Shared containers. If you drag one onto an open diagram window, the shared container appears in the job. If you drag a shared container onto the background, a new diagram window opens showing the contents of the shared container.
Stage types. Drag a stage type onto an open diagram window to add it to the job or container. You can also drag it to the tool palette to add it as a tool.
Table definitions. Drag a table definition onto a link to load the column definitions for that link. The Select Columns dialog box allows you to select a subset of columns from the table definition to load if required.
You can also drag items of these types to the palette for easy access.
diagram window has a colored background. You can turn this off using the Options dialog box (see Default Options on page 3-27). Most of the screenshots in this guide have the background turned off.
The diagram window is the canvas on which you design and display your job. This window has the following components: Title bar. Displays the name of the job or shared container. Page tabs. If you use local containers in your job, the contents of these containers are displayed in separate windows within the jobs diagram window. Switch between views using the tabs at the bottom of the diagram window.
3-12
Grid lines. Allow you to position stages more precisely in the window. The grid lines are not displayed by default. Choose Diagram Show Grid Lines to enable them. Scroll bars. Allow you to view the job components that do not fit in the display area. Print lines. Display the area that is printed when you choose File Print. The print lines also indicate page boundaries. When you cross these, you have the choice of printing over several pages or scaling to fit a single page when printing. The print lines are not displayed by default. Choose Diagram Show Print Lines to enable them. You can use the resize handle or the Maximize button to resize a diagram window. To resize the contents of the window, use the zoom commands in the Diagram shortcut menu. If you maximize a window an additional menu appears to the left of the File menu, giving access to Diagram window controls. By default, any stages you add to the Diagram window will snap to the grid lines. You can, however, turn this option off by unchecking Diagram Snap to Grid, clicking the Snap to Grid button in the toolbar, or from the Designer Options dialog box. The diagram window has a shortcut menu which gives you access to the settings on the Diagram menu (see Menu Bar on page 3-5) plus cut, copy, and paste:
Toolbar
The Designer toolbar contains the following buttons:
[Figure: the Designer toolbar. Button labels recoverable from the figure include Open job, Save, Save All, Job Properties, Cut, Copy, Paste, and Undo.]
The toolbar appears under the menu bar by default, but you can drag and drop it anywhere on the screen. It will dock and undock as required. Alternatively, you can hide the toolbar by choosing View > Toolbar.
Tool Palette
The tool palette contains shortcuts to the components you can add to your job design. By default the tool palette is docked to the DataStage Designer, but you can drag and drop it anywhere on the screen. It will dock and undock as required. Alternatively, you can hide the tool palette by choosing View > Palette.
There is a separate tool palette for server jobs, parallel jobs, mainframe jobs, and job sequences (parallel shared containers use the parallel job palette, server shared containers use the server job palette). Which one is displayed depends on what is currently active in the Designer.
The palette has different groups to organize the tools available. Click the group title to open the group. The Favorites group allows you to drag frequently used tools there so you can access them quickly. You can also drag other items there from the Repository window, such as jobs and shared containers.
Each group and each shortcut has properties which you can edit by choosing Properties from the shortcut menu. The following is an example parallel job tool palette:
To add a stage to the Diagram window, choose its shortcut from the tool palette and click the Diagram window. The stage is added at the insertion point in the diagram window. If you click and drag on the diagram window to draw a rectangle as an insertion point, the stage will be sized to fit that rectangle. You can also drag stages from the tool palette or from the Repository window and drop them on the Diagram window.
Some of the shortcuts on the tool palette give access to several stages. These are called shortcut containers, and you can recognize them because down arrows appear when you hover the mouse pointer over them. Click the arrow to see the list of items this icon gives access to:
You can add the default shortcut container item in the same way as ordinary palette items: it can be dragged and dropped, renamed, deleted, and so on. Shortcut containers also have properties you can view in the same way as palette groups and ordinary shortcuts do, and these allow you to change the default item.
To link two stages, choose the Link button. Click the first stage, then drag the mouse to the second stage. The stages are linked when you release the mouse button.
You can customize the tool palette to add or remove various shortcuts. You can add the shortcuts for plug-ins you have installed, and remove the shortcuts for stages you know you will not use. You can also add your own shortcut containers. There are various ways in which you can customize the palette:
In the palette itself.
From the Repository window.
From the Customize Palette dialog box.
To customize the tool palette from the palette itself:
To remove an existing item from the palette, select it and choose Edit > Delete Shortcut.
To move an item to another position in the palette, select it and drag it to the desired position.
To reset to the default settings for DataStage, choose Customization > Reset to factory default or Customization > Reset to compact default from the palette shortcut menu.
To customize the palette from the Repository window:
To add an additional item to the palette, select it in the Repository window and drag it to the palette, or select Add to Palette from the item's shortcut menu. In addition to stage types, you can also add other Repository items such as table definitions and shared containers.
To customize the palette using the Customize Palette dialog box, open the dialog box by doing one of the following:
Choose View > Customize Palette from the main menu.
Choose Customization > Customize from the palette shortcut menu.
Choose Customize Palette from the background shortcut menu.
The Customize Palette dialog box shows all the Repository items and the default palette groups and shortcut containers in two tree structures in the left pane and all the available palette groups in the right pane. Use the right arrows to add items from the trees on the left to the groups on the right, or use drag and drop. Use the left arrow to remove an item from a palette group. Use the up and down arrows to rearrange items within a palette group.
Status Bar
The status bar appears at the bottom of the DataStage Designer window. It displays one-line help for the window components and information on the current state of job operations, for example, compilation of server jobs. You can hide the status bar by choosing View > Status Bar.
Debugger Toolbar
Server jobs
DataStage has a built-in debugger that can be used with server jobs or server shared containers. The debugger toolbar contains buttons representing debugger functions. You can hide the debugger toolbar by choosing View > Debug Bar. The debug bar has a drop-down list displaying currently open server jobs, allowing you to select one of these as the debug focus.
[Figure: the debugger toolbar. Button labels recoverable from the figure include Target debug job, Go, Step to Next Link, Step to Next Row, Stop Job, Job Parameters, Edit Breakpoints, Toggle Breakpoint, Clear All Breakpoints, View Job Log, and Debug Window.]
Shortcut Menus
There are a number of shortcut menus available which you display by clicking the right mouse button. The menu displayed depends on where you clicked.
Background. Appears when you right-click on the background area in the left of the Designer (i.e. the space around Diagram windows), or in any of the toolbar background areas. Gives access to the same items as the View menu (see page 3-6).
Diagram window background. Appears when you right-click on a window background. Gives access to the same items as the Diagram menu (see page 3-6).
Stage. Appears when you click the right mouse button on a highlighted stage. The menu contents depend on what type of stage you have clicked on. All menus enable you to open the stage editor by choosing Properties, to rename and delete the stage, and to delete the stage complete with its links. If the stage has links, you can choose the link name to open the stage editor on the page giving details of that link. If there is data associated with the link of a server job passive, built-in stage or parallel job file or database stage, you can choose View link data to open the Data Browser on that link. The Transformer stage shortcut menu offers additional items. Choose Propagate columns to propagate columns from a selected input link to a selected output link. Choose Auto-Match columns to map columns on the selected input link to columns on the selected output link with matching names.
Link. Appears when you click the right mouse button on a highlighted link. This menu contains options to move or delete the link, change the link type, and, for server jobs only, toggle any breakpoint on the link, or open the Edit Breakpoints dialog box.
Palette Group. Appears when you right-click on the background area in the palette. Allows you to add a new shortcut container; add, delete, or rename a group; and view the group properties. The Customization item gives access to a range of customization options:
  Open the Customize Palette dialog box to add more items to the group.
  Display the group items as small icons or large icons, with or without text labels.
  Sort the items by name.
  Show or hide the various groups in the current palette.
  Clean up the palette so that any icons that point to items no longer in the DataStage repository are removed.
  Load a previously saved palette configuration.
  Load the default settings for the project.
  Save the current palette configuration to a file.
  Make the current palette configuration the default for the project.
  Reset the palette to the original configuration as at initial install.
  Reset the palette to use small icons, without text, all in one group. This is similar to how the palette appeared in previous versions of DataStage.
Palette Item. Allows you to select an item, view its properties, delete, or rename it. Also gives the same customization options as described for palette groups.
Using Annotations
DataStage allows you to insert notes into a Diagram window. These are called annotations and there are two types:
Annotation. You enter the text for this yourself. Use it to annotate stages and links in your job design.
Description Annotation. This displays either the short or full description from the job properties. You can edit the description within the annotation if required. There can only be one of these per job.
You can use annotations in server, parallel, or mainframe jobs, job sequences, or shared containers. The following example shows a server job with a description annotation and an ordinary annotation:
The Toggle Annotations button in the toolbar allows you to specify whether the annotations are shown or not. To insert an annotation, ensure the annotation option is on, then drag the annotation icon from the tool palette onto the Diagram window. An annotation box appears; you can resize it as required using the controls in the boundary box. Alternatively, click an Annotation button in the tool palette, then draw a bounding box of the required size on the Diagram window. Annotations always appear behind normal stages and links.
Annotations have a shortcut menu containing the following commands:
Properties. Select this to open the properties dialog box. There is a different dialog for annotations and description annotations.
Edit Text. Select this to put the annotation into edit mode.
Delete. Select this to delete the annotation.
Description Annotation Properties
The properties of a description annotation are as follows:
Annotation text. Displays the text in the annotation. You can edit this here if required.
Vertical Justification. Choose whether the text aligns to the top, middle, or bottom of the annotation box.
Horizontal Justification. Choose whether the text aligns to the left, center, or right of the annotation box.
Font. Click this to open a dialog box which allows you to specify a different font for the annotation text.
Color. Click this to open a dialog box which allows you to specify a different color for the annotation text.
Background color. Click this to open a dialog box which allows you to specify a different background color for the annotation.
Border. Select this to specify that the border of the annotation is visible.
Transparent. Select this to choose a transparent background.
Description Type. Choose whether the Description Annotation displays the full description or short description from the job properties.
Annotation Properties
The Annotation Properties dialog box is as follows:
The properties are the same as described for description annotations, except there are no Description Type options.
Appearance Options
The Appearance options branch lets you control the appearance of the DataStage Designer. It gives access to four pages: General, Repository Tree, Palette, and Graphical Performance Monitor.
General
Appearance. These options allow you to decide how the Designer background canvas is displayed and how the stage icons appear on the canvas. By default the canvas has a background image which tells you whether you are editing a server job, parallel job, mainframe job, shared container, or job sequence. Clear the Show background images check box to replace these with a white background. By default the stage icons have no background. To display them with a blue background, select the Show stage outlines check box. You can also choose to show or hide the Ascential banner, and to limit the display to standard Windows colors for the Designer client (as in previous versions of DataStage). Unattached links. This option lets you choose the display color for unattached links. This is red by default. To change it, click on the color button and choose a new color in the Color dialog box that appears.
Repository Tree
This section allows you to choose what type of items are displayed in the Repository tree in the Designer.
Select the check boxes for each type of item you want to be displayed. By default all types of item are displayed.
Palette
This section allows you to control how your tool palette is displayed.
It controls the following options:
Animate group when expanding. Select this to have the groups scroll down and scroll up when opening and closing. Otherwise they open and close instantly.
Show expanded group in bold text. Select this to show the title of an expanded group in bold text.
Different customization for each project. This is selected by default, and allows you to specify different palette options for different projects.
Common customization between all Projects. Select this if you want the palette on all projects to share the same options as set in this project.
Automatically add newly installed stage types. Select this to have any new stage types added to the appropriate group in the appropriate palette.
Show text labels. Select this to display the names of items in the palette. Otherwise just the icons are displayed.
Icons. Choose whether to display large icons or small icons in the palette.
Default Options
The Default options branch gives access to two pages: General and Mainframe.
General
The General page determines how the DataStage Designer behaves when started.
The page has three areas:
When Designer starts. Determines whether the Designer automatically opens a new job when started, or prompts you for a job to create or open.
  Nothing Open. This is the default option. The Designer opens with no jobs, shared containers, or job sequences open. You can then decide whether to open an existing item or create a new one.
  Prompt for. Select this and choose New, Existing, or Recent from the drop-down list. The New dialog box appears when you start the DataStage Designer, with the New, Existing, or Recent page on top, allowing you to choose an item to open.
  Create new. Select this and choose Server, Mainframe, Parallel, Sequence job, or Shared container from the drop-down list. If this is selected, a new job of the specified type is automatically created when the DataStage Designer is started.
New job/container view attributes. Determines whether the snap to grid option will be on or not for any new jobs, job sequences, or shared containers that are opened.
Shared Container/Job Activity double-click action. This allows you to specify the action the Designer takes when you double-click on a shared container or job activity on the canvas. The options are to show the shared container's or job activity's properties (this is the default), or to open the shared container or job the item references.
Mainframe
This page allows you to specify options that apply to mainframe jobs only.
Base location for generated code. This specifies the base location on the DataStage client where the generated code and JCL files for a mainframe job are held. Each mainframe job holds these files in a subdirectory of the base location reflecting the server host name, project name, and job. For example, where the base location is c:\Ascential\DataStage\Gencode, a complete pathname might be c:\Ascential\DataStage\Gencode\R101\dstage\mjob1.
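As a rough illustration of how the subdirectory described above is composed, the following Python sketch joins a base location with the server host name, project name, and job name. The function name and its parameters are hypothetical and are not part of DataStage; the values are taken from the example pathname above.

```python
from pathlib import PureWindowsPath


def generated_code_dir(base_location, server_host, project, job):
    """Join the base location with host, project, and job name.

    Hypothetical helper for illustration only; DataStage builds this
    path itself when it generates code for a mainframe job.
    """
    return PureWindowsPath(base_location) / server_host / project / job


path = generated_code_dir(
    r"c:\Ascential\DataStage\Gencode", "R101", "dstage", "mjob1"
)
print(path)
```

PureWindowsPath is used so that the sketch produces Windows-style separators regardless of the platform it is run on.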
Source Viewer. Allows you to specify which program should be used for viewing generated source. This defaults to Notepad. Column push option. This option is selected by default. With the option on, all the columns loaded in a mainframe source stage are selected and appear on the output link without you needing to visit any of the Output pages. Just define the necessary information on the Stage page, and click OK. The columns defined for the input link of a mainframe active stage are similarly automatically mapped to the columns of the output link. Clear the option to specify that all selection and mapping is done manually.
The Expression Editor branch gives access to the server and parallel page, which allows you to specify the features available in the DataStage Expression Editor. For more details on how to use the Expression Editor, see DataStage Server Job Developer's Guide and DataStage Parallel Job Developer's Guide.
There are four check boxes on this page:
Check expression syntax
Check variable names in expressions
Suggest expression elements
Complete variable names in expressions
These check boxes are selected by default. The settings are stored in the Repository and are used when you edit any job on this client machine.
SMTP Defaults
This page allows you to specify default details for Email Notification activities in job sequences.
SMTP Mail server name. The name of the server or its IP address.
Sender's email address. Given in the form bill.gamsworth@paddock.com.
Recipient's email address. Given in the form bill.gamsworth@paddock.com.
Include job status in email. Select this to include available job status information in the message.
You can also specify whether the Select Columns dialog box is always shown when you drag and drop meta data, or whether you need to hold down ALT as you drag and drop in order to display it.
Printing Options
The Printer branch allows you to specify the printing orientation. When you choose File > Print, the default printing orientation is determined by the setting on this page. You can choose portrait or landscape orientation. To use portrait, select the Portrait orientation check box. The default setting for this option is cleared, i.e., landscape orientation is used.
Prompting Options
The Prompting branch gives access to pages which determine the level of prompting displayed when you perform various operations in the Designer. There are three pages: General, Mainframe, and Server.
General
This page determines various prompting actions for server jobs and parallel jobs.
Automatic Actions. Allows you to set options relating to the saving, compiling, and debugging of jobs.
  Autosave before compile. Select this to specify that a job will be automatically saved, without prompting, when you compile it.
  Autocompile before debug. Select this to specify that a job will be automatically compiled, without prompting, when you debug it.
  Autosave referenced Shared Containers before compile. Select this to specify that a shared container referenced by a job will be automatically saved, without prompting, when you compile the job.
Container actions.
  Generate names automatically on name conflicts. If name conflicts occur when constructing or deconstructing containers, you are normally prompted for replacement names. Choose this option to have replacement names generated automatically instead.
Confirmation
This page has options for specifying whether you should be warned when performing various deletion and construction actions, allowing you to confirm that you really want to carry out the action. Tick the boxes to have the warnings, clear them otherwise.
Transformer Options
The Transformer branch allows you to specify colors used in the Transformer editor. (Selected column highlight and relationship arrow colors are set by altering the Windows active title bar color from the Windows Control Panel).
Chapter 4. Developing a Job
The DataStage Designer is used to create and develop DataStage jobs. A DataStage job populates one or more tables in the target database. There is no limit to the number of jobs you can create in a DataStage project. This chapter gives an overview of how to develop a job and how to specify job properties using the Designer.
A job design contains:
Stages to represent the processing steps required
Links between the stages to represent the flow of data
There are three different types of job in DataStage:
Server jobs. These are available if you have installed Server. They run on the DataStage Server, connecting to other data sources as necessary.
Mainframe jobs. These are available only if you have installed Enterprise MVS Edition. Mainframe jobs are uploaded to a mainframe, where they are compiled and run.
Parallel jobs. These are available only if you have installed the Enterprise Edition. These run on DataStage servers that are SMP, MPP, or cluster systems.
There are two other entities that are similar to jobs in the way they appear in the DataStage Designer, and are handled by it. These are:
Shared containers. These are reusable job elements. They typically comprise a number of stages and links. Copies of shared containers can be used in any number of server jobs and parallel jobs and edited as required. Shared containers are described in Chapter 5.
Job Sequences. A job sequence allows you to specify a sequence of DataStage server or parallel jobs to be executed, and actions to take depending on results. Job sequences are described in Chapter 6.
Note: If you want to use the DataStage Manager Reporting Tool (described in DataStage Manager Guide) you should ensure that the names of your DataStage components (jobs, stage, links etc.) do not exceed 64 characters.
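The 64-character limit in the note above is easy to check before components are created. The following Python sketch is one way to do so; the function and the sample names are hypothetical, and the check is run against a plain list of names rather than against a DataStage repository.

```python
MAX_NAME_LENGTH = 64  # limit stated in the note above


def names_exceeding_limit(names, limit=MAX_NAME_LENGTH):
    """Return the names longer than `limit` characters, in input order."""
    return [name for name in names if len(name) > limit]


# Hypothetical job, stage, and link names used only for illustration.
component_names = ["LoadCustomerDim", "x" * 65, "lnk_orders_to_transformer"]
too_long = names_exceeding_limit(component_names)
print(too_long)
```

A name of exactly 64 characters passes the check, because the note only rules out names that exceed 64 characters.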
The Diagram window appears, in the right pane of the Designer, along with the Toolbox for the chosen type of job. You can now save the job and give it a name.
Choose File > Open Job, or click the Open button on the toolbar. The Open dialog box appears. This is the same as the New dialog box, except that it appears with the Existing page on top, allowing you to pick an existing job from the tree structure. The job last opened is highlighted, so you can simply click OK if you want to reopen the last job you worked on. Otherwise, choose the job you want to open and click OK. Alternatively, you can select the Recent page to see a list of the most recently opened jobs.
You can also find the job in the tree in the Repository window and double-click it, or select it and choose Edit from its shortcut menu, or drag it onto the background to open it. The updated DataStage Designer window displays the chosen job in a Diagram window.
Saving a Job
To save the job:
1. Choose File > Save. If this is the first time you have saved the job, the Create new job dialog box appears:
2. Enter the name of the job in the Job name field.
3. Type a category for the job or select a category from the existing categories shown in the tree structure by clicking it. It appears in the Category box. (If you have already specified a job category in the Job Properties dialog box, this will be displayed in the Category box when you open the Create new job dialog box.)
4. Click OK. If the job name is unique, the job is created and saved in the Repository. If the job name is not unique, a message box appears. You must acknowledge this message before you can enter an alternative name.
To save an existing job with a different name, choose File > Save As and fill in the Create new job dialog box, specifying the new name and the category in which the job is to be saved.
Organizing your jobs into categories gives faster operation of the DataStage Director when displaying job status.
Stages
A job consists of stages linked together which describe the flow of data from a data source to a data target (for example, a final data warehouse). A stage usually has at least one data input and/or one data output. However, some stages can accept more than one data input, and output to more than one stage. The different types of job have different stage types. The stages that are available in the DataStage Designer depend on the type of job that is currently open in the Designer.
DataStage offers several built-in stage types for use in server jobs. These are used to represent data sources, data targets, or conversion stages. These stages are either passive or active stages. A passive stage handles access to databases for the extraction or writing of data. Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data, and converting data from one data type to another.
As well as using the built-in stage types, you can also use plug-in stages for specific operations that the built-in stages do not support.
The Palette organizes stage types into different groups, according to function:
Database
File
PlugIn
Processing
Real Time
Stages and links can be grouped in a shared container. Instances of the shared container can then be reused in different server jobs. You can also define a local container within a job. This groups stages and links into a single unit, but can only be used within the job in which it is defined.
Each stage type has a set of predefined and editable properties. These properties are viewed or edited using stage editors. A stage editor exists for each stage type and these are described in detail in individual chapters in DataStage Server Job Developer's Guide.
At this point in your job development you need to decide which stage types to use in your job design. The following built-in stage types are available for server jobs:
Database
ODBC. Extracts data from or loads data into databases that support the industry standard Open Database Connectivity API. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.
UniVerse. Extracts data from or loads data into UniVerse databases. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.
UniData. Extracts data from or loads data into UniData databases. This is a passive stage.
Oracle 7 Load. Bulk loads an Oracle 7 database. Previously known as ORABULK.
Sybase BCP Load. Bulk loads a Sybase 6 database. Previously known as BCPLoad.
File
Hashed File. Extracts data from or loads data into databases that contain hashed files. Also acts as an intermediate stage for quick lookups. This is a passive stage.
Sequential File. Extracts data from, or loads data into, operating system text files. This is a passive stage.
Processing
Aggregator. Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job. This is an active stage.
BASIC Transformer. Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job. This is an active stage.
Folder. Folder stages are used to read or write data as files in a directory located on the DataStage server.
Inter-process. Provides a communication channel between DataStage processes running simultaneously in the same job. This is a passive stage.
Link Partitioner. Allows you to partition a data set into up to 64 partitions. Enables server jobs to run in parallel on SMP systems. This is an active stage.
Link Collector. Collects partitioned data from up to 64 partitions. Enables server jobs to run in parallel on SMP systems. This is an active stage.
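To make the idea behind the Link Partitioner and Link Collector stages more concrete, the following Python sketch splits a list of rows into partitions and then collects them again. It is only a conceptual model: round-robin distribution is assumed here for simplicity and is not stated in this guide, and the function names are hypothetical. The limit of 64 partitions is taken from the descriptions above.

```python
MAX_PARTITIONS = 64  # upper limit given in the stage descriptions above


def partition_round_robin(rows, n_partitions):
    """Distribute rows over n_partitions lists, one row at a time."""
    if not 1 <= n_partitions <= MAX_PARTITIONS:
        raise ValueError("n_partitions must be between 1 and 64")
    partitions = [[] for _ in range(n_partitions)]
    for index, row in enumerate(rows):
        partitions[index % n_partitions].append(row)
    return partitions


def collect_round_robin(partitions):
    """Interleave rows, taking one row from each partition in turn."""
    collected = []
    longest = max((len(p) for p in partitions), default=0)
    for index in range(longest):
        for partition in partitions:
            if index < len(partition):
                collected.append(partition[index])
    return collected


rows = list(range(10))
parts = partition_round_robin(rows, 3)
restored = collect_round_robin(parts)
```

With round-robin partitioning followed by round-robin collection, the rows come back in their original order. Other partitioning schemes would not generally give that guarantee.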
Real Time
RTI Source. Entry point for a Job exposed as an RTI service. The Table Definition specified on the output link dictates the input arguments of the generated RTI service.
RTI Target. Exit point for a Job exposed as an RTI service. The Table Definition on the input link dictates the output arguments of the generated RTI service.
Containers
Server Shared Container. Represents a group of stages and links. The group is replaced by a single Shared Container stage in the Diagram window. Shared Container stages are handled differently to other stage types: they do not appear on the palette. You insert specific shared containers in your job by dragging them from the Repository window (Server group).
Local Container. Represents a group of stages and links. The group is replaced by a single Container stage in the Diagram window (these are similar to shared containers but are entirely private to the job they are created in and cannot be reused in other jobs). These appear in a shortcut container in the General group.
Container Input and Output. Represent the interface that links a container stage to the rest of the job design. These appear in a shortcut container in the General group.
You may find that the built-in stage types do not meet all your requirements for data extraction or transformation. In this case, you need to use a plug-in stage. The function and properties of a plug-in stage are determined by the particular plug-in specified when the stage is inserted. Plug-ins are written to perform specific tasks, for example, to bulk load data into a data warehouse. Plug-ins are supplied with DataStage for you to install if required.
DataStage offers several built-in stage types for use in mainframe jobs. These are used to represent data sources, data targets, or conversion stages.
The Palette organizes stage types into different groups, according to function:
Database
File
Processing
Each stage type has a set of predefined and editable properties. Some stages can be used as data sources and some as data targets. Some can be used as both. Processing stages read data from a source, process it, and write it to a data target. These properties are viewed or edited using stage editors. A stage editor exists for each stage type and these are fully described in individual chapters in Mainframe Job Developer's Guide.
At this point in your job development you need to decide which stage types to use in your job design. The following stages are available for mainframe jobs:
Database
IMS. This is a source stage. It extracts data from an IMS database or viewset.
Relational. This can be a source or a target stage. It reads data from, or writes data to, a relational database.
Teradata. This can be a source or a target stage. It reads data from, or writes data to, a Teradata database.
File
Complex Flat File. This is a source stage. It reads data from a complex flat file.
DB2 Load Ready Flat File. This is a target stage. It is used to write data to a DB2 load ready flat file.
Delimited Flat File. This can be a source or target stage. It is used to read data from, or write it to, a delimited flat file.
External Source. This is a source stage. It is used to read data from an external program.
External Target. This is a target stage. It is used to write data to an external program.
Fixed Width Flat File. This can be a source or target stage. It is used to read data from, or write it to, a fixed width flat file.
Multi-Format Flat File. This is a source stage. It is used to read data from a file containing multiple record types.
Processing
Transformer. This performs data transformation on extracted data.
Aggregator. Groups data from a single input link and performs aggregation functions such as count, sum, average, first, last, min, and max.
Business Rule. Allows you to perform complex transformations using SQL business rule logic.
External Routine. This calls COBOL subroutines in libraries external to DataStage.
FTP. This transfers files to another machine.
Join. This is used to join data from two input tables and produce one output table.
Link Collector.
Lookup. Allows you to perform table lookups.
Sort. Allows you to perform Sort operations.
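The grouping and aggregation functions listed for the Aggregator stage above can be pictured with a short Python sketch. This is a conceptual model only, not how DataStage implements the stage; the function name, column names, and sample rows are hypothetical.

```python
def aggregate(rows, key, value):
    """Group rows by `key` and summarize the `value` column per group."""
    groups = {}
    for row in rows:
        # Values are kept in arrival order so that first/last are meaningful.
        groups.setdefault(row[key], []).append(row[value])
    return {
        group: {
            "count": len(values),
            "sum": sum(values),
            "average": sum(values) / len(values),
            "first": values[0],
            "last": values[-1],
            "min": min(values),
            "max": max(values),
        }
        for group, values in groups.items()
    }


# Hypothetical input rows arriving on a single input link.
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 4},
    {"region": "east", "amount": 30},
]
summary = aggregate(rows, "region", "amount")
```

Each group produces one output row of summary values, which corresponds to the stage passing one summarized row per group to the next stage.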
Each stage type has a set of predefined and editable properties. These properties are viewed or edited using stage editors. A stage editor exists for each stage type and these are fully described in individual chapters in DataStage Parallel Job Developer's Guide. At this point in your job development you need to decide which stage types to use in your job design.
Database Stages
DB2/UDB Enterprise. Allows you to read and write a DB2 database.
Informix Enterprise. Allows you to read and write an Informix XPS database.
Oracle Enterprise. Allows you to read and write an Oracle database.
Teradata Enterprise. Allows you to read and write a Teradata database.
Development/Debug Stages
Row Generator. Generates a dummy data set.
Column Generator. Adds extra columns to a data set.
Head. Copies the specified number of records from the beginning of a data partition.
Peek. Prints column values to the screen as records are copied from its input data set to one or more output data sets.
Sample. Samples a data set.
Tail. Copies the specified number of records from the end of a data partition.
Write range map. Enables you to carry out range map partitioning on a data set.
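The Head and Tail stages described above work on each data partition separately. The following Python sketch models that behavior with plain lists standing in for partitions; the function names and the sample data are hypothetical.

```python
def head(partition, n):
    """Return the first n records of one partition."""
    return partition[:n]


def tail(partition, n):
    """Return the last n records of one partition."""
    # partition[-0:] would return every record, so n == 0 is handled explicitly.
    return partition[-n:] if n > 0 else []


# Two hypothetical partitions of the same data set.
partitions = [[1, 2, 3, 4], [5, 6, 7]]
heads = [head(p, 2) for p in partitions]
tails = [tail(p, 2) for p in partitions]
```

Because the functions are applied per partition, a data set with two partitions yields up to two records from each partition, not two records overall.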
File Stages
Data set. Stores a set of data.
External source. Allows a parallel job to read an external data source.
External target. Allows a parallel job to write to an external data source.
File set. A set of files used to store data.
Lookup file set. Provides storage for a lookup table.
SAS data set. Provides storage for SAS data sets.
Sequential file. Extracts data from, or writes data to, a text file.
Processing Stages
Transformer. Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job.
Aggregator. Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job.
Change apply. Applies a set of captured changes to a data set.
Change Capture. Compares two data sets and records the differences between them.
Compare. Performs a column by column compare of two pre-sorted data sets.
Compress. Compresses a data set.
Copy. Copies a data set.
Decode. Uses an operating system command to decode a previously encoded data set.
Difference. Compares two data sets and works out the difference between them.
Encode. Encodes a data set using an operating system command.
Expand. Expands a previously compressed data set.
External Filter. Uses an external program to filter a data set.
Filter. Transfers, unmodified, the records of the input data set which satisfy requirements that you specify, and filters out all other records.
Funnel. Copies multiple data sets to a single data set.
Generic. Allows Orchestrate experts to specify their own custom commands.
Join. Joins two input sources.
Lookup. Performs table lookups.
Merge. Combines data sets.
Modify. Alters the record schema of its input data set.
Remove duplicates. Removes duplicate entries from a data set.
SAS. Allows you to run SAS applications from within the DataStage job.
Sort. Sorts input columns.
Switch. Takes a single data set as input and assigns each input record to an output data set based on the value of a selector field.
Real Time
RTI Source. Entry point for a Job exposed as an RTI service. The Table Definition specified on the output link dictates the input arguments of the generated RTI service.
RTI Target. Exit point for a Job exposed as an RTI service. The Table Definition on the input link dictates the output arguments of the generated RTI service.
Restructure
Column export. Exports a column of another type to a string or binary column.
Column import. Imports a column from a string or binary column.
Combine records. Combines several columns associated by a key field to build a vector.
Make subrecord. Combines a number of vectors to form a subrecord.
Make vector. Combines a number of fields to form a vector.
Promote subrecord. Promotes the members of a subrecord to a top level field.
Split subrecord. Separates a number of subrecords into top level fields.
Split vector. Separates a number of vector members into separate columns.
Other Stages
Parallel Shared Container. Represents a group of stages and links. The group is replaced by a single Parallel Shared Container stage in the Diagram window. Parallel Shared Container stages are handled differently from other stage types: they do not appear on the palette. You insert specific shared containers in your job by dragging them from the Repository window.
Local Container. Represents a group of stages and links. The group is replaced by a single Container stage in the Diagram window. Local containers are similar to shared containers but are entirely private to the job they are created in and cannot be reused in other jobs.
Container Input and Output. Represent the interface that links a container stage to the rest of the job design.
Links
Links join the various stages in a job together and are used to specify how data flows when the job is run.
Passive stages in server jobs (for example, ODBC stages, Sequential File stages, and UniVerse stages) are used to read data from, or write data to, a data source. The read/write link to the data source is represented by the stage itself, and connection details are given on the Stage general tabs.
Input links connected to the stage generally carry data to be written to the underlying data target. Output links carry data read from the underlying data source. The column definitions on an input link define the data that will be written to a data target. The column definitions on an output link define the data to be read from a data source.
An important point to note about linking stages in server jobs is that column definitions actually belong to, and travel with, the links as opposed to the stages. When you define column definitions for a stage's output link, those same column definitions will appear at the other end of the link where it is input to another stage. If you move either end of a link to another stage, the column definitions will appear on the new stage. If you change the details of a column definition at one end of a link, those changes will appear in the column definitions at the other end of the link.
There are rules covering how links are used, depending on whether the link is an input or an output and what type of stages are being linked.
DataStage server jobs support two types of input link:
Stream. A link representing the flow of data. This is the principal type of link, and is used by both active and passive stages.
Reference. A link representing a table lookup. Reference links are only used by active stages. They are used to provide information that might affect the way data is changed, but do not supply the data to be changed.
The two link types are displayed differently in the Designer Diagram window: stream links are represented by solid lines and reference links by dotted lines.
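The way column definitions travel with links can be pictured with a small model. The following Python sketch is purely illustrative and is not part of DataStage; the Column and Link classes, the columns_seen_by helper, and the stage and link names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    name: str
    sql_type: str


@dataclass
class Link:
    """Hypothetical model: the link owns the column definitions, not the stages."""
    name: str
    columns: List[Column] = field(default_factory=list)
    source: Optional[str] = None   # name of the stage the link is output from
    target: Optional[str] = None   # name of the stage the link is input to


def columns_seen_by(stage: str, links: List[Link]) -> List[str]:
    """Return the column names on every link attached to the named stage."""
    return [c.name for l in links if stage in (l.source, l.target) for c in l.columns]


link = Link("DSLink1", [Column("CUST_ID", "Integer")], source="ODBC_0", target="Xfm_1")
links = [link]

# Both ends of the link show the same definitions.
print(columns_seen_by("ODBC_0", links))
print(columns_seen_by("Xfm_1", links))

# Moving one end of the link carries the definitions to the new stage.
link.target = "Seq_2"
print(columns_seen_by("Seq_2", links))
print(columns_seen_by("Xfm_1", links))
```

Because the definitions are stored once, on the link, a change made at either end is automatically visible at the other.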
There is only one type of output link, although some stages permit an output link to be used as a reference input to the next stage and some do not.
Built-in stages have maximum and minimum numbers of links as follows:
Stage Type         Stream Inputs     Reference Inputs  Outputs           Reference
                   Max       Min     Max       Min     Max       Min     Outputs?
Container          no limit  0       no limit  0       no limit  0       yes
ODBC               no limit  0       0         0       no limit  0       yes
UniVerse           no limit  0       0         0       no limit  0       yes
Hashed File        no limit  0       0         0       no limit  0       yes
UniData            no limit  0       0         0       no limit  0       yes
Sequential File    no limit  0       0         0       no limit  0       no
Folder             no limit  0       0         0       no limit  0       yes
Inter-process      1         1       0         0       1         1       no
Transformer        1         1       no limit  0       no limit  1       no
Aggregator         1         1       0         0       no limit  1       no
Link Partitioner   1         1       0         0       64        1       no
Link Collector     64        1       0         0       1         1       no
Plug-in stages supplied with DataStage generally have the following maximums and minimums:
Stage Type         Stream Inputs     Reference Inputs  Outputs           Reference
                   Max       Min     Max       Min     Max       Min     Outputs?
Active             1         1       no limit  0       no limit  1       no
Passive            no limit  0       0         0       no limit  0       yes
When designing your own plug-ins, you can specify maximum and minimum inputs and outputs as required.
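Link limits of this kind lend themselves to a simple table-driven check. The following Python sketch is illustrative only and is not how the Designer itself validates jobs; the SERVER_LIMITS dictionary and the check_links function are hypothetical names, and only four of the built-in stage types from the table above are included.

```python
from typing import Dict, List, Optional, Tuple

# (min, max) for each kind of link; a max of None means "no limit".
Limits = Dict[str, Tuple[int, Optional[int]]]

# Hypothetical lookup table holding a few rows of the built-in stage table.
SERVER_LIMITS: Dict[str, Limits] = {
    "Transformer": {"stream_in": (1, 1), "ref_in": (0, None), "out": (1, None)},
    "Aggregator": {"stream_in": (1, 1), "ref_in": (0, 0), "out": (1, None)},
    "Link Collector": {"stream_in": (1, 64), "ref_in": (0, 0), "out": (1, 1)},
    "Sequential File": {"stream_in": (0, None), "ref_in": (0, 0), "out": (0, None)},
}


def check_links(stage_type: str, counts: Dict[str, int]) -> List[str]:
    """Return a list of limit violations (empty if the link counts are valid)."""
    problems = []
    for kind, (lo, hi) in SERVER_LIMITS[stage_type].items():
        n = counts.get(kind, 0)
        if n < lo:
            problems.append(f"{stage_type}: needs at least {lo} {kind} link(s), has {n}")
        if hi is not None and n > hi:
            problems.append(f"{stage_type}: allows at most {hi} {kind} link(s), has {n}")
    return problems


# A Transformer with one stream input, three reference inputs and two outputs is valid.
print(check_links("Transformer", {"stream_in": 1, "ref_in": 3, "out": 2}))
# An Aggregator cannot take a reference input.
print(check_links("Aggregator", {"stream_in": 1, "ref_in": 1, "out": 1}))
```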
Link Marking
Server jobs
For server jobs, meta data is associated with a link, not a stage. If you have link marking enabled, a small icon attaches to the link to indicate if meta data is currently associated with it.
Link marking is enabled by default. To disable it, click on the link mark icon in the Designer toolbar, or deselect it in the Diagram menu, or the Diagram shortcut menu.
Unattached Links
You can add links that are only attached to a stage at one end, although they will need to be attached to a second stage before the job can successfully compile and run. Unattached links are shown in a special color (red by default but you can change this using the Options dialog, see page 3-24). By default, when you delete a stage, any attached links and their meta data are left behind, with the link shown in red. You can choose Delete including links from the Edit or shortcut menus to delete a selected stage along with its connected links.
File and database stages in parallel jobs (for example, Data Set stages, Sequential File stages, and Oracle stages) are used to read data from, or write data to, a data source. The read/write link to the data source is represented by the stage itself, and connection details are given in the stage properties.
Input links connected to the stage generally carry data to be written to the underlying data target. Output links carry data read from the underlying data source. The column definitions on an input link define the data that will be written to a data target. The column definitions on an output link define the data to be read from a data source.
Active stages generally have an input link carrying data to be processed, and an output link passing on processed data.
An important point to note about linking stages in parallel jobs is that column definitions actually belong to, and travel with, the links as opposed to the stages. When you define column definitions for a stage's output link, those same column definitions will appear at the other end of the link where it is input to another stage. If you move either end of a link to another stage, the column definitions will appear on the new stage. If you change the details of a column definition at one end of a link, those changes will appear in the column definitions at the other end of the link.
There are rules covering how links are used, depending on whether the link is an input or an output and what type of stages are being linked.
DataStage parallel jobs support three types of link:
Stream. A link representing the flow of data. This is the principal type of link, and is used by all stage types.
Reference. A link representing a table lookup. Reference links can only be input to Lookup stages, and they can only be output from certain types of stage.
Reject. Some parallel job stages allow you to output records that have been rejected for some reason onto an output link. Note that reject links derive their meta data from the associated output link and this cannot be edited.
You can usually only have an input stream link or an output stream link on a File or Database stage; you cannot have both together.
The three link types are displayed differently in the Designer Diagram window: stream links are represented by solid lines, reference links by dotted lines, and reject links by dashed lines.
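Two of the parallel job linking rules described above can be expressed as a short check. The following Python sketch is illustrative only; the Stage and Link classes, the link_problems function, and the stage names are hypothetical, and the File/Database rule is applied strictly here even though the guide says it holds only "usually".

```python
from dataclasses import dataclass
from typing import List

# Line styles used in the Diagram window for each link type.
LINE_STYLE = {"stream": "solid", "reference": "dotted", "reject": "dashed"}


@dataclass
class Stage:
    name: str
    stage_type: str   # for example "Lookup" or "Sequential File"
    category: str     # "File", "Database" or "Active"


@dataclass
class Link:
    kind: str         # "stream", "reference" or "reject"
    source: str       # name of the stage the link is output from
    target: str       # name of the stage the link is input to


def link_problems(stage: Stage, links: List[Link]) -> List[str]:
    """Return the linking rules that the given stage breaks (empty if none)."""
    inputs = [l for l in links if l.target == stage.name]
    outputs = [l for l in links if l.source == stage.name]
    problems = []
    # Reference links can only be input to Lookup stages.
    if stage.stage_type != "Lookup" and any(l.kind == "reference" for l in inputs):
        problems.append(f"{stage.name}: reference links can only be input to Lookup stages")
    # Assumption: treat the "usually" rule for File and Database stages as strict.
    if stage.category in ("File", "Database"):
        has_in = any(l.kind == "stream" for l in inputs)
        has_out = any(l.kind == "stream" for l in outputs)
        if has_in and has_out:
            problems.append(f"{stage.name}: has both a stream input and a stream output")
    return problems


seq = Stage("Seq_0", "Sequential File", "File")
job_links = [Link("stream", "Gen_0", "Seq_0"), Link("stream", "Seq_0", "Xfm_1")]
print(link_problems(seq, job_links))
print(LINE_STYLE["reject"])
```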
Stage Name           Stage Type
...                  Active
...                  Active
...                  Active
...                  Active
Column import        Active
Combine records      Active
Compare              Active
Copy                 Active
Decode               Active
Difference           Active
Encode               Active
External filter      Active
Funnel               Active
Generator            Active
Head                 Active
Join                 Active
Lookup               Active
Make subrecord       Active
Make vector          Active
Merge                Active
Peek                 Active
Promote subrecord    Active
Remove duplicates    Active
SAS                  Active
Sort                 Active
Split subrecord      Active
Split vector         Active
Tail                 Active
Transformer          Active
Data set             File
Lookup file          File
Parallel SAS data set  File
Sequential file      File
Write range map      File
DB2                  Database
...                  Database
...                  Database
...                  Database
Link Marking
Parallel jobs
For parallel jobs, meta data is associated with a link, not a stage. If you have link marking enabled, a small icon attaches to the link to indicate if meta data is currently associated with it. Link marking also shows you how data is partitioned or collected between stages, and whether data is sorted. The following diagram shows the different types of link marking. For an explanation, see DataStage Parallel Job Developer's Guide. If you double-click on a partitioning/collecting marker, the stage editor for the stage the link is input to is opened on the Partitioning tab.
[Diagram: link markings, showing the auto partition marker, the partition marker, and the collection marker]
Link marking is enabled by default. To disable it, click on the link mark icon in the Designer toolbar, or deselect it in the Diagram menu, or the Diagram shortcut menu.
Unattached Links
You can add links that are only attached to a stage at one end, although they will need to be attached to a second stage before the job can successfully compile and run. Unattached links are shown in a special color (red by default but you can change this using the Options dialog, see page 3-24).
By default, when you delete a stage, any attached links and their meta data are left behind, with the link shown in red. You can choose Delete including links from the Edit or shortcut menus to delete a selected stage along with its connected links.
Target stages in mainframe jobs are used to write data to a data target. Source stages are used to read data from a data source. Some stages can act as a source or a target. The read/write link to the data source is represented by the stage itself, and connection details are given on the Stage general tabs.
Links to and from source and target stages are used to carry data to or from a processing or post-processing stage.
For source and target stage types, column definitions are associated with stages rather than with links. You decide what appears on the output links of a stage by selecting column definitions on the Selection page. You can set the Column Push Option to specify that stage column definitions be automatically mapped to output columns (this happens if you set the option, define the stage columns, then click OK to leave the stage without visiting the Selection page).
There are rules covering how links are used, depending on whether the link is an input or an output and what type of stages are being linked.
Mainframe stages have only one type of link, which is shown as a solid line. (A table lookup function is supplied by the Lookup stage, and the input link to this stage that acts as a reference is shown with a dotted line to illustrate its function.)
The following rules apply to linking mainframe stages:
Stage Name Stage Type Inputs Number Source Type
multiple processing
source or target
NA
NA
multiple
NA NA
NA NA
multiple multiple
Stage Name
Stage Type
Delimited Flat File External Target DB2 Load Ready Flat File Relational FTP Join Lookup
source or multiple target post-process- single ing processing two processing two
Aggregator Sort
processing processing
source single (reference or primary link), processing (primary link) source single source source, processing source, processing single single single
Link Marking
Mainframe jobs
For mainframe jobs, meta data is associated with the stage and flows down the links. If you have link marking enabled, a small icon attaches to the link to indicate if meta data is currently associated with it.
Link marking is enabled by default. To disable it, click on the link mark icon in the Designer toolbar, or deselect it in the Diagram menu, or the Diagram shortcut menu.
Unattached Links
Unlike server and parallel jobs, you cannot have unattached links in a mainframe job; both ends of a link must be attached to a stage. If you delete a stage, the attached links are automatically deleted too.
Link Ordering
The Transformer stage in server jobs and various active stages in parallel jobs allow you to specify the execution order of links coming into and/or going out from the stage. When looking at a job design in DataStage, there are two ways to look at the link execution order:
Place the mouse pointer over a link that is an input to or an output from a Transformer stage. A ToolTip appears displaying the message Input execution order = n for input links, and Output execution order = n for output links. In both cases n gives the link's place in the execution order. If an input link is no. 1, then it is the primary link. Where a link is an output from the Transformer stage and an input to another Transformer stage, the output link information is shown when you rest the pointer over it.
Select a stage and right-click to display the shortcut menu. Choose Input Links or Output Links to list all the input and output links for that Transformer stage and their order of execution.
Adding Stages
There is no limit to the number of stages you can add to a job. We recommend you position the stages as follows in the Diagram window:
Server jobs
    Data sources on the left
    Data targets on the right
    Transformer or Aggregator stages in the middle of the diagram
Parallel jobs
    Data sources on the left
    Data targets on the right
    Active stages in the middle of the diagram
Mainframe jobs
    Source stages on the left
    Processing stages in the middle
    Target stages on the right
There are a number of ways in which you can add a stage:
Click the stage button on the tool palette. Click in the Diagram window where you want to position the stage. The stage appears in the Diagram window.
Click the stage button on the tool palette. Drag it onto the Diagram window.
Select the desired stage type in the tree in the Repository window and drag it to the Diagram window.
When you insert a stage by clicking (as opposed to dragging) you can draw a rectangle as you click on the Diagram window to specify the size and shape of the stage you are inserting as well as its location.
Each stage is given a default name which you can change if required (see Renaming Stages on page 4-28).
If you want to add more than one stage of a particular type, press Shift after clicking the button on the tool palette and before clicking on the Diagram window. You can continue to click the Diagram window without having to reselect the button. Release the Shift key when you have added the stages you need; press Esc if you change your mind.
Moving Stages
Once positioned, stages can be moved by clicking and dragging them to a new location in the Diagram window. If you have the Snap to Grid option activated, the stage is attached to the nearest grid position when you release the mouse button. If stages are linked together, the link is maintained when you move a stage.
Renaming Stages
There are a number of ways to rename a stage:
You can change its name in its stage editor.
You can select the stage in the Diagram window and then edit the name in the Property Browser.
You can select the stage in the Diagram window, press Ctrl-R, choose Rename from its shortcut menu, or choose Edit ➤ Rename from the main menu and type a new name in the text box that appears beneath the stage.
Select the stage in the Diagram window and start typing.
Deleting Stages
Stages can be deleted from the Diagram window. Choose one or more stages and do one of the following:
Press the Delete key.
Choose Edit ➤ Delete.
Choose Delete from the shortcut menu.
A message box appears. Click Yes to delete the stage or stages and remove them from the Diagram window. (This confirmation prompting can be turned off if required.)
When you delete stages in mainframe jobs, attached links are also deleted. When you delete stages in server or parallel jobs, the links are left behind, unless you choose Delete including links from the Edit or shortcut menu.
Linking Stages
You can link stages in three ways:
Using the Link button. Choose the Link button from the tool palette. Click the first stage and drag the link to the second stage. The link is made when you release the mouse button.
Using the mouse. Select the first stage. Position the mouse cursor on the edge of a stage until the mouse cursor changes to a circle. Click and drag the mouse to the other stage. The link is made when you release the mouse button.
Using the mouse. Point at the first stage and right-click, then drag the link to the second stage and release it.
Each link is given a default name which you can change.
Moving Links
Once positioned, a link can be moved to a new location in the Diagram window. You can choose a new source or destination for the link, but not both. To move a link:
1. Click the link to move in the Diagram window. The link is highlighted.
2. Click in the box at the end you want to move and drag the end to its new location.
In server and parallel jobs you can move one end of a link without reattaching it to another stage. In mainframe jobs both ends must be attached to a stage.
Deleting Links
Links can be deleted from the Diagram window. Choose the link and do one of the following:
Press the Delete key.
Choose Edit ➤ Delete.
Choose Delete from the shortcut menu.
A message box appears. Click Yes to delete the link. The link is removed from the Diagram window.
Note: For server jobs, meta data is associated with a link, not a stage. If you delete a link, the associated meta data is deleted too. If you want to retain the meta data you have defined, do not delete the link; move it instead.
Renaming Links
There are a number of ways to rename a link:
You can select it and start typing in a name in the text box that appears.
You can select the link in the Diagram window and then edit the name in the Property Browser.
You can select the link in the Diagram window, press Ctrl-R, choose Rename from its shortcut menu, or choose Edit ➤ Rename from the main menu and type a new name in the text box that appears beneath the link.
Select the link in the Diagram window and start typing.
Editing Stages
When you have added the stages and links to the Diagram window, you must edit the stages to specify the data you want to use and any aggregations or conversions required.
Data arrives into a stage on an input link and is output from a stage on an output link. The properties of the stage and the data on each input and output link are specified using a stage editor.
To edit a stage, do one of the following:
Double-click the stage in the Diagram window.
Select the stage and choose Properties from the shortcut menu.
Select the stage and choose Edit ➤ Properties.
A dialog box appears. The content of this dialog box depends on the type of stage you are editing. See the individual stage chapters in DataStage Server Job Developer's Guide, DataStage Parallel Job Developer's Guide or Mainframe Job Developer's Guide for a detailed description of the stage dialog box.
The data on a link is specified using column definitions. The column definitions for a link are specified by editing a stage at either end of the link. Column definitions are entered and edited identically for each stage type.
Press Ctrl-E.
2. Enter a category name in the Data source type field. The name entered here determines how the definition will be stored under the main Table Definitions branch. By default, this field contains Saved.
3. Enter a name in the Data source name field. This forms the second part of the table definition identifier and is the name of the branch created under the data source type branch. By default, this field contains the name of the stage you are editing.
4. Enter a name in the Table/file name field. This is the last part of the table definition identifier and is the name of the leaf created under the data source name branch. By default, this field contains the name of the link you are editing.
5. Optionally enter a brief description of the table definition in the Short description field. By default, this field contains the date and time you clicked Save. The format of the date and time depends on your Windows setup.
6. Optionally enter a more detailed description of the table definition in the Long description field.
7. Click OK. The column definitions are saved under the specified branches in the Repository.
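The three-part table definition identifier and its defaults can be summarized in a few lines. The following Python sketch is illustrative only; the function name table_definition_path and the stage and link names are hypothetical, and the result is returned as a tuple of branch names because the separator DataStage uses between the parts is not stated here.

```python
from typing import Optional, Tuple


def table_definition_path(
    stage_name: str,
    link_name: str,
    data_source_type: str = "Saved",
    data_source_name: Optional[str] = None,
    table_name: Optional[str] = None,
) -> Tuple[str, str, str]:
    """Return (data source type, data source name, table/file name).

    The defaults mirror the dialog box: the category defaults to Saved, the
    data source name to the stage being edited, and the table/file name to
    the link being edited.
    """
    return (
        data_source_type,
        data_source_name or stage_name,
        table_name or link_name,
    )


# Accepting every default while editing link DSLink3 on stage ODBC_0:
print(table_definition_path("ODBC_0", "DSLink3"))
# Overriding the category and the table name:
print(table_definition_path("ODBC_0", "DSLink3", data_source_type="Sales", table_name="ORDERS"))
```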
Most stages allow you to selectively load columns, that is, specify the exact columns you want to load. To load column definitions:
1. Click Load. The Table Definitions window appears. This window displays all the table definitions in your project in the form of a table definition tree. The table definition categories are listed alphabetically in the tree.
2. Double-click the appropriate category branch.
3. Continue to expand the categories until you see the table definition items.
4. Select the table definition you want.
Note: You can use Find to enter the name of the table definition you want. The table definition is selected in the tree when you click OK.
5. If you cannot find the table definition, you can click Import ➤ Data source type to import a table definition from a data source (see Importing a Table Definition on page 8-11 for details).
6. Click OK. One of two things happens, depending on the type of stage you are editing:
If the stage type does not support selective meta data loading, all the column definitions from the chosen table definition are copied into the Columns grid.
If the stage type does support selective meta data loading, the Select Columns dialog box appears, allowing you to specify which column definitions you want to load.
Use the arrow buttons to move columns back and forth between the Available columns list and the Selected columns list. The single arrow buttons move highlighted columns; the double arrow buttons move all items. By default all columns are selected for loading. Click Find to open a dialog box which lets you search for a particular column. The shortcut menu also gives access to Find and Find Next. Click OK when you are happy with your selection. This closes the Select Columns dialog box and loads the selected columns into the stage.
For mainframe stages and certain parallel stages where the column definitions derive from a CFD file, the Select Columns dialog box may also contain a Create Filler check box. This happens when the table definition the columns are being loaded from represents a fixed-width table. Select this to cause sequences of unselected columns to be collapsed into filler items. Filler columns are sized appropriately, their datatype set to character, and name set to FILLER_XX_YY where XX is the start offset and YY the end offset.
Using fillers results in a smaller set of columns, saving space and processing time and making the column set easier to understand.
If you are importing column definitions that have been derived from a CFD file into server or parallel job stages, you are warned if any of the selected columns redefine other selected columns. You can choose to carry on with the load or go back and select columns again.
7. Click OK to proceed. If the stage you are loading already has column definitions of the same name, you are prompted to confirm that you want to overwrite them. The Merge Column Meta Data check box is selected by default and specifies that, if you confirm the overwrite, the Derivation, Description, Display Size and Field Position from the existing definition will be preserved (these contain information that is not necessarily part of the table definition and that you have possibly added manually).
8. Click Yes or Yes to All to confirm the load. Changes are saved when you save your job design.
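The effect of the Create Filler check box described in the loading procedure can be sketched in code. The following Python example is illustrative only and is not the Designer's implementation; the Column class, the load_with_fillers function, and the sample column names and offsets are hypothetical, and it assumes that the end offset in FILLER_XX_YY is inclusive.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Column:
    name: str
    start: int            # start offset within the fixed-width record
    length: int           # width of the column
    datatype: str = "Char"

    @property
    def end(self) -> int:
        # Assumption: the end offset is inclusive.
        return self.start + self.length - 1


def load_with_fillers(columns: List[Column], selected: Set[str]) -> List[Column]:
    """Keep selected columns; collapse each run of unselected columns into one filler."""
    result: List[Column] = []
    run: List[Column] = []   # current run of consecutive unselected columns

    def flush() -> None:
        if run:
            first, last = run[0], run[-1]
            result.append(Column(
                name=f"FILLER_{first.start}_{last.end}",
                start=first.start,
                length=last.end - first.start + 1,
                datatype="Char",
            ))
            run.clear()

    for col in columns:
        if col.name in selected:
            flush()
            result.append(col)
        else:
            run.append(col)
    flush()
    return result


table = [
    Column("ID", 1, 4),
    Column("NAME", 5, 20),
    Column("ADDR", 25, 30),
    Column("PHONE", 55, 10),
    Column("BAL", 65, 8, "Decimal"),
]
for col in load_with_fillers(table, {"ID", "BAL"}):
    print(col.name, col.start, col.length, col.datatype)
```

In this example the three unselected columns between ID and BAL become a single 60-character filler, so the loaded column set has three entries instead of five.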
Choose the drive you want from the drop-down list. The Directory list box is automatically updated when you choose a drive.
Left list. Lists all the files in the currently selected directory.
Right list. Lists the directories in the current directory. You can double-click on one to make that the current directory, or double-click .. to move up one level in the directory structure.
To paste a stage into a new shared container, select Edit ➤ Paste Special ➤ Into new Shared Container. The Paste Special into new Shared Container dialog box appears. This allows you to select a category and name for the new shared container, enter a description and optionally add a shortcut to the tool palette.
If you want to cut or copy meta data along with the stages, you should select source and destination stages, which will automatically select links and associated meta data. These can then be cut or copied and pasted as a group.
The Data Browser allows you to view the actual data that will flow through a server job or parallel stage. You can browse the data associated with the input or output links of any server job built-in passive stage or with the links to certain parallel job stages as follows:
Data Set stage
External Source stage
File Set stage (output links)
DB2 stage (output links)
Informix XPS stage (output links)
Oracle stage (output links)
Teradata stage (output links)
SAS Parallel Data Set stage
Row Generator stage (output links)
The Data Browser is invoked by clicking the View Data button from a stage Inputs or Outputs page, or by choosing the View link Data option from the shortcut menu.
For parallel job stages a supplementary dialog box lets you select a subset of data to view by specifying the following:
Rows to display. Specify the number of rows of data you want the Data Browser to display.
Skip count. Skip the specified number of rows before viewing data.
Period. Display every Pth record where P is the period. You can start after records have been skipped by using the Skip property. P must be greater than or equal to 1.
If your administrator has enabled the Generated OSH Visible option in the DataStage Administrator, the supplementary dialog box also has a Show OSH button. Click this to open a window showing the OSH that will be run to generate the data view. It is intended for expert users.
The Data Browser displays a grid of rows in a window. If a field contains a linefeed character, the field is shown in bold, and you can, if required, resize the grid to view the whole field.
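The way the Rows to display, Skip count and Period settings combine can be illustrated with a short sketch. The following Python example is not DataStage code; the function browse_subset is hypothetical, and it assumes that the skip is applied first, that sampling starts with the first row after the skipped rows, and that the row limit applies to the sampled rows.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def browse_subset(
    rows: Iterable[T],
    rows_to_display: int,
    skip_count: int = 0,
    period: int = 1,
) -> Iterator[T]:
    """Skip skip_count rows, then yield every period-th row, up to rows_to_display rows."""
    if period < 1:
        raise ValueError("period must be greater than or equal to 1")
    sampled = islice(rows, skip_count, None, period)
    return islice(sampled, rows_to_display)


# Skip 10 rows, then show every third row, five rows in total.
print(list(browse_subset(range(100), rows_to_display=5, skip_count=10, period=3)))
# → [10, 13, 16, 19, 22]
```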
The Data Browser uses the meta data defined for that link. If there is insufficient data associated with a link to allow browsing, the View Data button and shortcut menu command used to invoke the Data Browser are disabled. If the Data Browser requires you to input some parameters before it can determine what data to display, the Job Run Options dialog box appears and collects the parameters (see The Job Run Options Dialog Box on page 4-86).
Note: You cannot specify $ENV as an environment variable value when using the Data Browser.
The Data Browser grid has the following controls:
You can select any row or column, or any cell within a row or column, and press CTRL-C to copy it.
You can select the whole of a very wide row by selecting the first cell and then pressing SHIFT+END.
If a cell contains multiple lines, you can double-click the cell to expand it. Single-click to shrink it again.
You can view a row containing a specific data item using the Find button. The Find dialog box will reposition the view to the row containing the data you are interested in. The search is started from the current row.
The Display button invokes the Column Display dialog box. This allows you to simplify the data displayed by the Data Browser by choosing to hide some of the columns. For server jobs, it also allows you to normalize multivalued data to provide a 1NF view in the Data Browser. This dialog box lists all the columns in the display, all of which are initially selected. To hide a column, clear it. For server jobs, the Normalize on drop-down list box allows you to select an association or an unassociated multivalued column on which to normalize the data. The default is Un-normalized, which displays the data in NF2 form with each row shown on a single line. Alternatively you can select Un-Normalized (formatted), which displays multivalued rows split over several lines.
In the example, the Data Browser would display all columns except STARTDATE. The view would be normalized on the association PRICES.
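Normalizing on an association can be pictured as expanding one multivalued (NF2) row into several single-valued (1NF) rows. The following Python sketch is illustrative only and is not how the Data Browser is implemented; the normalize function and the column names PRODID, LISTPRICE and MINPRICE are hypothetical stand-ins for the columns of an association such as PRICES.

```python
from typing import Any, Dict, List


def normalize(row: Dict[str, Any], association: List[str]) -> List[Dict[str, Any]]:
    """Expand one NF2 row into 1NF rows, one per value position in the association.

    Columns outside the association are repeated unchanged on every output row.
    """
    depth = max(len(row[c]) for c in association)
    out = []
    for i in range(depth):
        flat = dict(row)
        for c in association:
            values = row[c]
            # Pad shorter multivalued columns with None.
            flat[c] = values[i] if i < len(values) else None
        out.append(flat)
    return out


nf2_row = {"PRODID": 100, "LISTPRICE": [10.0, 12.0], "MINPRICE": [8.0, 9.5]}
for flat_row in normalize(nf2_row, ["LISTPRICE", "MINPRICE"]):
    print(flat_row)
```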
The Performance monitor is a useful diagnostic aid when designing DataStage server jobs and parallel jobs. When you turn it on and compile a job it displays information against each link in the job. When you run the job, either through the DataStage Director or the debugger, the link information is populated with statistics to show the number of rows processed on the link and the speed at which they were processed. The links change color as the job runs to show the progress of the job.
To use the performance monitor:
1. With the job open and compiled in the Designer choose Diagram ➤ Show performance statistics. Performance information appears against the links. If the job has not yet been run, the figures will be empty.
2. Run the job (either from the Director or by choosing Debug ➤ Go). Watch the links change color as the job runs and the statistics populate with number of rows and rows/sec.
If you alter anything on the job design you will lose the statistical information until the next time you compile the job.
The colors that the performance monitor uses are set via the Options dialog box. Choose Tools ➤ Options and select Graphical Performance Monitor under the Appearance branch to view the default colors and change them if required. You can also set the refresh interval at which the monitor updates the information while the job is running.
When you have finished developing a server or a parallel job, you need to compile it before you can actually run it. Server jobs and parallel jobs are compiled on the DataStage server, and are subsequently run on the server using the DataStage Director.
To compile a job, open the job in the Designer and do one of the following:
Choose File ➤ Compile.
Click the Compile button on the toolbar.
If the job has unsaved changes, you are prompted to save the job by clicking OK. The Compile Job dialog box appears. This dialog box contains a display area for compilation messages and has the following buttons:
Re-Compile. Recompiles the job if you have made any changes.
Show Error. Highlights the stage that generated a compilation error. This button is only active if an error is generated during compilation.
More. Displays the output that does not fit in the display area. Some errors produced by the compiler include detailed BASIC output.
Close. Closes the Compile Job dialog box.
Help. Invokes the Help system.
The job is compiled as soon as this dialog box appears. You must check the display area for any compilation messages or errors that are generated.
For parallel jobs there is also a force compile option. By default, the compilation of parallel jobs is optimized so that transformer stages are only recompiled if they have changed since the last compilation. The force compile option overrides this and causes all transformer stages in the job to be compiled. To select this option, choose File ➤ Force Compile.
Key expressions. If you have key fields specified in your column definitions, the compiler checks whether there are key expressions joining the data tables. Transforms. If you have specified a transform, the compiler checks that this is a suitable transform for the data type.
Successful Compilation
If the Compile Job dialog box displays the message Job successfully compiled with no errors, you can:
Validate the job
Run or schedule the job
Release the job
Package the job for deployment on other DataStage systems
Jobs are validated and run using the DataStage Director. See DataStage Director Guide for additional information. More information about compiling, releasing, and debugging DataStage server jobs is in DataStage Server Job Developer's Guide. More information about compiling and releasing parallel jobs is in DataStage Parallel Job Developer's Guide.
The dscc command takes the following arguments:
/H hostname. Specifies the DataStage server where the job or jobs reside.
/O. Equivalent to ticking the Omit box in the Attach dialog box. You do not need to specify a username or password if you use this option.
/U username. The username to use when attaching to the project.
/P password. The password to use when attaching to the project.
project_name. The project to which the job or jobs belong.
/J jobname | * | category_name\*. Specifies the jobs to be compiled. Use jobname to specify a single job, * to compile all jobs in the project, and category_name\* to compile all jobs in that category (this does not include categories within that category).
/R routinename | * | category_name\*. Specifies the routines to be compiled. Use routinename to specify a single routine, * to compile all routines in the project, and category_name\* to compile all routines in that category (this does not include categories within that category).
/F. Force compile (for parallel jobs).
/OUC. Only compile uncompiled jobs.
/RD reportname. Specifies a name and destination for a compilation report. Specify DESKTOP\filename to write it to your desktop or .\filename to write it to the current working directory.
The options are not case sensitive. For example:
dscc /h=r101 /u=fellp /p=plaintextpassword dstageprj /J mybigjob
This connects to the machine r101 with the username fellp and the password plaintextpassword, attaches to the project dstageprj, and compiles the job mybigjob.
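If you script compilations, it can help to assemble the dscc arguments in one place. The following Python sketch is illustrative only: the function name build_dscc_args and its keyword names are hypothetical, and it writes each flag and its value as separate tokens (/H hostname), following the argument list above. The example above joins them with = instead, so check which form your installation accepts before relying on either.

```python
from typing import List, Optional


def build_dscc_args(
    project: str,
    jobs: str,
    host: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
    omit_credentials: bool = False,
    force: bool = False,
    only_uncompiled: bool = False,
    report: Optional[str] = None,
) -> List[str]:
    """Assemble a dscc argument list from the options described above.

    Hypothetical helper: it only builds the list, it does not run dscc.
    """
    args = ["dscc"]
    if host:
        args += ["/H", host]
    if omit_credentials:
        # /O stands in for the Omit box, so no username or password is passed.
        args.append("/O")
    else:
        if username:
            args += ["/U", username]
        if password:
            args += ["/P", password]
    args.append(project)
    args += ["/J", jobs]
    if force:
        args.append("/F")        # force compile (parallel jobs)
    if only_uncompiled:
        args.append("/OUC")      # only compile uncompiled jobs
    if report:
        args += ["/RD", report]  # compilation report name and destination
    return args


print(" ".join(build_dscc_args("dstageprj", "mybigjob", host="r101",
                               username="fellp",
                               password="plaintextpassword")))
```

Returning a list rather than a single string keeps each argument separate, which avoids quoting problems if the list is later passed to a process launcher.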
Compiler Wizard
DataStage also has a compiler wizard that will guide you through the process of compiling jobs. You can start the wizard from the Tools menu of the Designer, Manager, or Director clients. Select Tools ➤ Run Multiple Job Compile. The wizard proceeds as follows:
1. A screen prompts you to specify the criteria for selecting jobs to compile. Choose one or more of:
Server
Parallel
Sequence
Mainframe
You can also specify that only currently uncompiled jobs will be compiled, and that you want to manually select the jobs to compile.
2. Click Next>. If you chose the Show job selection page option, the Job Selection screen appears. Choose jobs in the left pane and add them to the right pane by using the arrow buttons. All the jobs in the right pane will be compiled.
3. Click Next>. The Compiler Options screen appears, allowing you to specify the following:
Force compile (for parallel jobs).
An upload profile for mainframe jobs you are generating code for.
4. Click Next>. The Compile Process screen appears, displaying the names of the selected jobs and their current compile status. As the compilation proceeds, the status changes from Not Compiled to Compiling to Compiled OK or Failed. To see more details, click on a job name in the selected jobs list.
5. Click Next>. The job compilation report screen appears, displaying the report generated by the compilation. You can view this in your default HTML browser if required.
6. Click Finish.
When you have finished developing a mainframe job, you need to generate the code for the job. This code is then transferred to the mainframe machine, where the job is compiled and run. You can also generate code from the command line or using the compile wizard (see Compiling from the Command Line on page 4-47 and Compiler Wizard on page 4-48).
To generate code for a job, open the job in the Designer and do one of the following:
Choose File ➤ Generate Code.
Click the Generate Code button on the toolbar.
If the job has unsaved changes, you are prompted to save the job by clicking OK. The Mainframe Job Code Generation dialog box appears. This dialog box contains details of the code generation files and a display area for compilation messages. It has the following buttons:
Generate. Click this to validate the job design and generate the COBOL code and JCL files for transfer to the mainframe.
View. This allows you to view the generated files.
Upload job. This button is enabled if the code generation is successful. Clicking it opens the Remote System dialog box, which allows you to specify a machine to which to upload the generated code.
Status messages are displayed in the Validation and code generation status window in the dialog box. For more information about generating code, see Mainframe Job Developer's Guide.
Job Validation
Validation of a mainframe job design involves:
Checking that all stages in the job are connected in one continuous flow, and that each stage has the required number of input and output links.
Checking the expressions used in each stage for syntax and semantic correctness.
A status message is displayed as each stage is validated. If a stage fails, the validation stops.
Code Generation
Code generation first validates the job design. If the validation fails, code generation stops. Status messages about validation are in the Validation and code generation status window. They give the names and locations of the generated files, and indicate the database name and user name used by each relational stage. Three files are produced during code generation:
COBOL program file, which contains the actual COBOL code that has been generated.
Compile JCL file, which contains the JCL that controls the compilation of the COBOL code on the target mainframe machine.
Run JCL file, which contains the JCL that controls the running of the job on the mainframe once it has been compiled.
Job Upload
Once you have successfully generated the mainframe code, you can upload the files to the target mainframe, where the job is compiled and run. To upload a job, choose File ➤ Upload Job. The Remote System dialog box appears, allowing you to specify information about connecting to the target mainframe system. Once you have successfully connected to the target machine, the Job Upload dialog box appears, allowing you to actually upload the job. For more details about uploading jobs, see Mainframe Job Developer's Guide.
JCL Templates
DataStage uses JCL templates to build the required JCL files when you generate a mainframe job. DataStage comes with a set of building-block JCL templates suitable for tasks such as:
Allocate file
Cleanup existing file
Cleanup nonexistent file
Create
Compile and link
DB2 compile, link, and bind
DB2 load
DB2 run
FTP
JOBCARD
New file
Run
Sort
The supplied templates are in a directory called JCL Templates under the DataStage server install directory. There are also copies of the templates held in the DataStage Repository for each DataStage project. You can edit the templates to meet the requirements of your particular project. This is done using the JCL Templates dialog box from the DataStage Manager. Open the JCL Templates dialog box by choosing Tools ➤ JCL Templates in the DataStage Manager. It contains the following fields and buttons:
Platform type. Displays the installed platform types in a drop-down list.
Template name. Displays the available JCL templates for the chosen platform in a drop-down list.
Short description. Briefly describes the selected template.
Template. The code that the selected template contains.
Save. This button is enabled if you edit the code, or subsequently reset a modified template to the default code. Click Save to save your changes.
Reset. Resets the template code back to that of the default template.
If there are system-wide changes that will apply to every project, you can edit the template defaults. Changes made here will be picked up by every DataStage project on that DataStage server. The JCL Templates directory contains two sets of template files: a default set that you can edit, and a master set which is read-only. You can always revert to the master templates if required, by copying the read-only masters over the default templates. Use a standard editing tool, such as Microsoft Notepad, to edit the default templates. More details about JCL templates are given in Appendix A of the Mainframe Job Developer's Guide.
Code Customization
When you check the Generate COPY statement for customization box in the Code generation dialog box, DataStage provides four places in the generated COBOL program that you can customize. You can add code to be executed at program initialization or termination, or both. However, you cannot add code that would affect the row-by-row processing of the generated program. When you check Generate COPY statement for customization, four additional COPY statements are added to the generated COBOL program:
COPY ARDTUDAT. This statement is generated just before the PROCEDURE DIVISION statement. You can use this to add WORKING-STORAGE variables and/or a LINKAGE SECTION to the program.
COPY ARDTUBGN. This statement is generated just after the PROCEDURE DIVISION statement. You can use this to add your own program initialization code. If you included a LINKAGE SECTION in ARDTUDAT, you can use this to add the USING clause to the PROCEDURE DIVISION statement.
COPY ARDTUEND. This statement is generated just before each STOP RUN statement. You can use this to add your own program termination code.
COPY ARDTUCOD. This statement is generated as the last statement in the COBOL program. You use this to add your own paragraphs to the code. These paragraphs are those which are PERFORMed from the code in ARDTUBGN and ARDTUEND.
DataStage provides default versions of these four COPYLIB members. As provided, ARDTUDAT, ARDTUEND, and ARDTUCOD contain only comments, and ARDTUBGN contains comments and a period. You can either preserve these members and create your own COPYLIB, or you can create your own members in the DataStage runtime COPYLIB. If you preserve the members, then you must modify the DataStage compile and link JCL templates to include the name of your COPYLIB before the DataStage runtime COPYLIB. If you replace the members in the DataStage COPYLIB, you do not need to change the JCL templates.
Job Properties
Each job in a project has properties, including optional descriptions and job parameters. To view and edit the job properties from the Designer, open the job in the Diagram window and choose Edit ➤ Job Properties or, if the job is not currently open, select it in the Repository window and choose Properties from the shortcut menu. The Job Properties dialog box appears. The dialog box differs depending on whether it is a server job, parallel job, or a mainframe job. A server job has up to six pages: General, Parameters, Job control, Dependencies, Performance, and NLS. Parallel job properties are the same as server job properties except they have an Execution page rather than a Performance page, and also have a Generated OSH and Defaults page. A mainframe job has five pages: General, Parameters, Environment, Extensions, and Operational meta data.
It has the following fields:
Category. The category to which the job belongs.
Job version number. The version number of the job. A job version number has several components:
The version number N.n.n. This number checks the compatibility of the job with the version of DataStage installed. This number is automatically set when DataStage is installed and cannot be edited.
The release number n.N.n. This number is automatically incremented every time you release a job. For more information about releasing jobs, see DataStage Server Job Developer's Guide and DataStage Parallel Job Developer's Guide.
The bug fix number n.n.N. This number reflects minor changes to the job design or properties. To change this number, select it and enter a new value directly or use the arrow buttons to increase the number.
Before-job subroutine and Input value. Optionally contain the name (and input parameter value) of a subroutine that is executed before the job runs. For example, you can specify a routine that prepares the data before processing starts.
Choose a routine from the drop-down list box. This list box contains all the built routines defined as a Before/After Subroutine under the Routines branch in the Repository. Enter an appropriate value for the routine's input argument in the Input value field.
If you use a routine that is defined in the Repository, but which was edited and not compiled, a warning message reminds you to compile the routine when you close the Job Properties dialog box.
If you installed or imported a job, the Before-job subroutine field may reference a routine which does not exist on your system. In this case, a warning message appears when you close the Job Properties dialog box. You must install or import the missing routine or choose an alternative one to use.
A return code of 0 from the routine indicates success. Any other code indicates failure and causes a fatal error when the job is run.
After-job subroutine and Input value. Optionally contain the name (and input parameter value) of a subroutine that is executed after the job has finished. For example, you can specify a routine that sends an electronic message when the job finishes.
Choose a routine from the drop-down list box. This list box contains all the built routines defined as a Before/After Subroutine under the Routines branch in the Repository. Enter an appropriate value for the routine's input argument in the Input value field.
If you use a routine that is defined in the Repository, but which was edited and not compiled, a warning message reminds you to compile the routine when you close the Job Properties dialog box.
A return code of 0 from the routine indicates success. Any other code indicates failure and causes a fatal error when the job is run.
Only run after-job subroutine on successful job completion. This option is enabled if you have selected an After-job subroutine. If you select the option, then the After-job subroutine will only be run if the job has successfully completed running all its stages.
Enable Runtime Column Propagation for new links. This checkbox appears for parallel jobs if you have selected Enable Runtime Column propagation for Parallel jobs for this project in the DataStage Administrator. Check it to enable runtime column propagation by default for all new links on this job (see DataStage Parallel Job Developer's Guide for a description of runtime column propagation).
RTI Service Enabled. This checkbox only appears for server jobs. By selecting this checkbox, you indicate that the job is eligible to become an RTI Service. Eligible jobs will be available for deployment as services from the RTI console.
Allow Multiple Instance. Select this to enable the DataStage Director to run multiple instances of this job.
Enable hashed file cache sharing. Check this to enable multiple processes to access the same hashed file in cache (the system checks if this is appropriate). This can save memory resources and speed up execution where you are, for example, running multiple instances of the same job. This applies to server jobs and to parallel jobs that use server functionality in container stages.
Short job description. An optional brief description of the job.
Full job description. An optional detailed description of the job.
If you installed or imported a job, the After-job subroutine field may reference a routine that does not exist on your system. In this case, a warning message appears when you close the Job Properties dialog box. You must install or import the missing routine or choose an alternative one to use.
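The three-part job version number described under Job version number can be modeled in a few lines. This Python sketch is an illustration, not DataStage code: the class JobVersion and its method names are hypothetical, and it assumes that releasing a job leaves the bug fix number unchanged, which the text above does not state.

```python
from typing import NamedTuple


class JobVersion(NamedTuple):
    """Hypothetical model of a job version number written N.n.n."""
    version: int   # N.n.n: set when DataStage is installed, not editable
    release: int   # n.N.n: incremented each time the job is released
    bug_fix: int   # n.n.N: editable, reflects minor design changes

    @classmethod
    def parse(cls, text: str) -> "JobVersion":
        parts = text.split(".")
        if len(parts) != 3:
            raise ValueError(f"expected three dot-separated numbers, got {text!r}")
        return cls(*(int(part) for part in parts))

    def released(self) -> "JobVersion":
        # Assumption: only the release number changes on release.
        return self._replace(release=self.release + 1)

    def with_bug_fix(self, value: int) -> "JobVersion":
        return self._replace(bug_fix=value)

    def __str__(self) -> str:
        return f"{self.version}.{self.release}.{self.bug_fix}"


current = JobVersion.parse("50.0.3")   # illustrative value
print(current.released())
print(current.with_bug_fix(4))
```

The version component has no setter here, mirroring the statement that it cannot be edited.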
DSSendMail. This routine is an interface to the local send mail program.
DSWaitForFile. This routine is called to suspend a job until a named file either exists or does not exist.
DSJobReport. This routine can be called at the end of a job to write a job report to a file. The routine takes an argument comprising two or three elements separated by semi-colons as follows:
Report type. 0, 1, or 2 to specify report detail. Type 0 produces a text string containing start/end time, time elapsed, and status of job. Type 1 is the same as the basic (type 0) report but also contains information about individual stages and links within the job. Type 2 produces a text string containing a full XML report.
Directory. Specifies the directory in which the report will be written.
XSL stylesheet. Optionally specifies an XSL style sheet to format an XML report.
If the job has an alias ID, the report is written to JobName_alias.txt or JobName_alias.xml, depending on report type. If the job does not have an alias, the report is written to JobName_YYYYMMDD_HHMMSS.txt or JobName_YYYYMMDD_HHMMSS.xml, depending on report type.
ExecDOS. This routine executes a command via an MS-DOS shell. The command executed is specified in the routine's input argument.
ExecTCL. This routine executes a command via a DataStage Engine shell. The command executed is specified in the routine's input argument.
ExecSH. This routine executes a command via a UNIX Korn shell.
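To make the DSJobReport argument format and file naming concrete, here is a small Python sketch. It is not the routine itself: the function names are hypothetical, and it assumes that only type 2 (XML) reports use the .xml extension, which the description implies but does not spell out.

```python
from datetime import datetime
from typing import Optional, Tuple


def parse_report_argument(argument: str) -> Tuple[int, str, Optional[str]]:
    """Split 'type;directory[;stylesheet]' into its elements."""
    parts = argument.split(";")
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three elements separated by semi-colons")
    report_type = int(parts[0])
    if report_type not in (0, 1, 2):
        raise ValueError("report type must be 0, 1, or 2")
    stylesheet = parts[2] if len(parts) == 3 and parts[2] else None
    return report_type, parts[1], stylesheet


def report_file_name(job_name: str, report_type: int,
                     alias: Optional[str] = None,
                     when: Optional[datetime] = None) -> str:
    """Build the report file name following the naming rule above."""
    # Assumption: only the XML report (type 2) is written as .xml.
    extension = "xml" if report_type == 2 else "txt"
    if alias:
        return f"{job_name}_{alias}.{extension}"
    stamp = (when or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{job_name}_{stamp}.{extension}"


print(parse_report_argument("2;/tmp/reports;style.xsl"))
print(report_file_name("DailyJob1", 0, when=datetime(2003, 1, 5, 9, 30, 0)))
```

The directory and stylesheet names in the example are placeholders.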
Job parameters allow you to design flexible, reusable jobs. If you want to process data based on the results for a particular week, location, or product you can include these settings as part of your job design. However, when you want to use the job again for a different week or product, you must edit the design and recompile the job.
Instead of entering inherently variable factors as part of the job design, you can set up parameters which represent processing variables. For server and parallel jobs, you are prompted for values when you run or schedule the job. Job parameters are defined, edited, and deleted in the Parameters page of the Job Properties dialog box. All job parameters are defined by editing the empty row in the Job Parameters grid. For more information about adding and deleting rows, or moving between the cells of a grid, see Appendix A, Editing Grids.
CAUTION: Before you remove a job parameter definition, you must make sure that you remove the references to this parameter in your job design. If you do not do this, your job may fail to run.
You can also use the Parameters page to set different values for environment variables while the job runs. The settings only take effect at run-time; they do not affect the permanent settings of environment variables. The server job Parameters page is as follows:
The Job Parameters grid has the following columns:
Parameter name. The name of the parameter.
Prompt. Text used as the field name in the run-time dialog box.
Type. The type of the parameter (to enable validation).
Default Value. The default setting for the parameter.
Help text. The text that appears if a user clicks Property Help in the Job Run Options dialog box when running the job.
Job Parameters
Specify the type of the parameter by choosing one of the following from the drop-down list in the Type column: String. The default type. Encrypted. Used to specify a password. The default value is set by double-clicking the Default Value cell to open the Setup Password dialog box. Type the password in the Encrypted String field and
Integer. Long int (-2147483648 to +2147483647).
Float. Double (-1.79769313486232E308 to -4.94065645841247E-324 and 4.94065645841247E-324 to 1.79769313486232E308).
Pathname. Enter a default pathname or file name by typing it into Default Value or double-click the Default Value cell to open the Browse dialog box.
List. A list of valid string variables. To set up a list, double-click the Default Value cell to open the Setup List and Default dialog box. Build a list by typing each item into the Value field, then clicking Add. The item then appears in the List box. To remove an item, select it in the List box and click Remove. Select one of the items from the Set Default drop-down list box to be the default.
Date. Date in the ISO format yyyy-mm-dd. Time. Time in the format hh:mm:ss.
DataStage uses the parameter type to validate any values that are subsequently supplied for that parameter, be it in the Director or the Designer.
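The kind of check implied by each parameter type can be sketched as follows. This Python function is an approximation for illustration, not DataStage's own validation logic: the function name is hypothetical, the Integer range is the 32-bit signed range, Float is treated as any finite double, and String, Encrypted, and Pathname values are accepted without further checks.

```python
import math
import re
from datetime import datetime
from typing import Optional, Sequence


def is_valid_parameter(param_type: str, value: str,
                       choices: Optional[Sequence[str]] = None) -> bool:
    """Return True if value is acceptable for the given parameter type."""
    try:
        if param_type == "Integer":
            return -2**31 <= int(value) <= 2**31 - 1
        if param_type == "Float":
            return math.isfinite(float(value))
        if param_type == "Date":
            # ISO format yyyy-mm-dd, and the date must exist in the calendar.
            if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
                return False
            datetime.strptime(value, "%Y-%m-%d")
            return True
        if param_type == "Time":
            # Format hh:mm:ss on a 24-hour clock.
            if not re.fullmatch(r"[0-9]{2}:[0-9]{2}:[0-9]{2}", value):
                return False
            datetime.strptime(value, "%H:%M:%S")
            return True
        if param_type == "List":
            return value in (choices or ())
    except ValueError:
        return False
    # Assumption: these types are free text in this sketch.
    return param_type in ("String", "Encrypted", "Pathname")


print(is_valid_parameter("Integer", "2147483648"))
print(is_valid_parameter("Date", "2003-02-28"))
```

Conversion errors are mapped to False so that a caller can treat every type uniformly.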
Table name field on the General tab on the Inputs or Outputs page
WHERE clause field on the Selection tab on the Outputs page
Value cell on the Parameters tab, which appears in the Outputs page when you use a stored procedure (ODBC stage only)
Expression field on the Derivation dialog box, opened from the Derivation column in the Outputs page of a UniVerse or ODBC Stage dialog box
In Hashed File stages. You can use job parameters in the following fields in the Hashed File Stage dialog box:
Use account name or Use directory path fields on the Stage page
File name field on the General tab on the Inputs or Outputs page
In UniData stages. You can use job parameters in the following fields in the UniData Stage dialog box:
Server, Database, User name, and Password fields on the Stage page
File name field on the General tab on the Inputs or Outputs page
In Folder stages. You can use job parameters in the following fields in the Folder stage dialog box:
Properties in the Properties tab of the Stage page
Properties in the Properties tab of the Outputs page
Before and after subroutines. You can use job parameters to specify argument values for before and after subroutines.
Note: You can also use job parameters in the Property name field on the Properties tab in the stage type dialog box when you create a plugin. For more information, see DataStage Server Job Developer's Guide.
Environment Variables
To set a runtime value for an environment variable:
1. Click Add Environment Variable at the bottom of the Parameters page. The Choose environment variable list appears. This shows a list of the available environment variables (the example shows parallel job environment variables).
2. Click on the environment variable you want to override at runtime. It appears in the parameter grid, distinguished from job parameters by being preceded by a $. You can also click New at the top of the list to define a new environment variable. A dialog box appears allowing you to specify name and prompt. The new variable is added to the Choose environment variable list and you can click on it to add it to the parameters grid.
3. Set the required value in the Default Value column. This is the only field you can edit for an environment variable. Depending on the type of variable, a further dialog box may appear to help you enter a value.
When you run the job and specify a value for the environment variable, you can specify the special value $ENV, which instructs DataStage to use the current setting for the environment variable. Environment variables are set up using the DataStage Administrator (see DataStage Administrator Guide).
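The effect of the $ prefix and the special value $ENV can be illustrated with a short Python sketch. It is a model of the behavior described above, not DataStage code: the function name is hypothetical, the variable names and values in the example are illustrative, and skipping a variable that has no current setting is an assumption.

```python
from typing import Dict, Mapping

USE_CURRENT = "$ENV"  # special value described above


def resolve_environment(run_values: Mapping[str, str],
                        current_env: Mapping[str, str]) -> Dict[str, str]:
    """Work out the environment variable settings for one job run."""
    resolved = {}
    for name, value in run_values.items():
        if not name.startswith("$"):
            continue  # ordinary job parameter, not an environment variable
        variable = name[1:]
        if value == USE_CURRENT:
            # Keep the current setting; assumption: skip it if it is unset.
            if variable in current_env:
                resolved[variable] = current_env[variable]
        else:
            resolved[variable] = value
    return resolved


run_values = {
    "$APT_CONFIG_FILE": "$ENV",   # use whatever is currently set
    "$APT_DUMP_SCORE": "1",       # override for this run only
    "Week": "12",                 # job parameter, ignored here
}
current = {"APT_CONFIG_FILE": "/opt/cfg/default.apt"}
print(resolve_environment(run_values, current))
```

The result holds only the settings for the run, which matches the statement that overrides do not change the permanent settings.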
A job control routine provides the means of controlling other jobs from the current job. A set of one or more jobs can be validated, run, reset, stopped, and scheduled in much the same way as the current job can be. You can, if required, set up a job whose only function is to control a set of other jobs. The graphical job sequence editor (see Chapter 6) produces a job control routine when you compile a job sequence (you can view this in the Job Sequence properties), but you can set up your own control job by entering your own routine on the Job control page of the Job Properties dialog box. The routine uses a set of BASIC functions provided for the purpose. For more information about these routines, see DataStage Developer's Help, DataStage Server Job Developer's Guide, or DataStage Parallel Job Developer's Guide. You can use this same code for running parallel jobs.
The Job control page provides a basic editor to let you construct a job control routine using the functions. The toolbar contains buttons for cutting, copying, pasting, and formatting code, and for activating Find (and Replace). The main part of this page consists of a multiline text box with scroll bars. The Add Job button provides a drop-down list box of all the server and parallel jobs in the current project. When you select a compiled job from the list and click Add, the Job Run Options dialog box appears, allowing you to specify any parameters or run-time limits to apply when the selected job is run. The job will also be added to the list of dependencies (see Specifying Job Dependencies on page 4-70). When you click OK in the Job Run Options dialog box, you return to the Job control page, where you will find that DataStage has added job control code for the selected job. The code sets any required job parameters and/or limits, runs the job, waits for it to finish, then tests for success.
Alternatively, you can type your routine directly into the text box on the Job control page, specifying jobs, parameters, and any run-time limits directly in the code. The following is an example of a job control routine. It schedules two jobs, waits for them to finish running, tests their status, and then schedules another one. After the third job has finished, the routine gets its finishing status.
* get a handle for the first job
Hjob1 = DSAttachJob("DailyJob1",DSJ.ERRFATAL)
* set the job's parameters
Dummy = DSSetParam(Hjob1,"Param1","Value1")
* run the first job
Dummy = DSRunJob(Hjob1,DSJ.RUNNORMAL)
* get a handle for the second job
Hjob2 = DSAttachJob("DailyJob2",DSJ.ERRFATAL)
* set the job's parameters
Dummy = DSSetParam(Hjob2,"Param2","Value2")
* run the second job
Dummy = DSRunJob(Hjob2,DSJ.RUNNORMAL)
* Now wait for both jobs to finish before scheduling the third job
Dummy = DSWaitForJob(Hjob1)
Dummy = DSWaitForJob(Hjob2)
* Test the status of the first job (failure causes routine to exit)
J1stat = DSGetJobInfo(Hjob1, DSJ.JOBSTATUS)
If J1stat = DSJS.RUNFAILED Then
   Call DSLogFatal("Job DailyJob1 failed","JobControl")
End
* Test the status of the second job (failure causes routine to
* exit)
J2stat = DSGetJobInfo(Hjob2, DSJ.JOBSTATUS)
If J2stat = DSJS.RUNFAILED Then
   Call DSLogFatal("Job DailyJob2 failed","JobControl")
End
* Now get a handle for the third job
Hjob3 = DSAttachJob("DailyJob3",DSJ.ERRFATAL)
* and run it
Dummy = DSRunJob(Hjob3,DSJ.RUNNORMAL)
* then wait for it to finish
Dummy = DSWaitForJob(Hjob3)
* Finally, get the finishing status for the third job and test it
J3stat = DSGetJobInfo(Hjob3, DSJ.JOBSTATUS)
If J3stat = DSJS.RUNFAILED Then
   Call DSLogFatal("Job DailyJob3 failed","JobControl")
End
Possible status conditions returned for a job are as follows.
A job that is in progress is identified by:
DSJS.RUNNING      Job running; this is the only status that means the job is actually running.
Jobs that are not running may have the following statuses:
DSJS.RUNOK        Job finished a normal run with no warnings.
DSJS.RUNWARN      Job finished a normal run with warnings.
DSJS.RUNFAILED    Job finished a normal run with a fatal error.
DSJS.VALOK        Job finished a validation run with no warnings.
DSJS.VALWARN      Job finished a validation run with warnings.
DSJS.VALFAILED    Job failed a validation run.
DSJS.RESET        Job finished a reset run.
DSJS.STOPPED      Job was stopped by operator intervention (cannot tell run type).
Note: If a job has an active select list, but then calls another job, the second job will effectively wipe out the select list.
The Dependencies page of the Job Properties dialog box allows you to specify any dependencies a job has. These may be functions, routines, or other jobs that the job requires in order to run successfully. This is to ensure that, if the job is packaged for use on another system, all the required components will be included in the package.
Enter details as follows:
Type. The type of item upon which the job depends. Choose from the following:
Job. Released or unreleased job. If you have added a job on the Job control page (see page 4-67), this will automatically be included in the dependencies. If you subsequently delete the job from the job control routine, you must remove it from the dependencies list manually.
Local. Locally cataloged BASIC functions and subroutines (i.e., Transforms and Before/After routines).
Global. Globally cataloged BASIC functions and subroutines (i.e., Custom UniVerse functions).
File. A standard file.
ActiveX. An ActiveX (OLE) object (not available on UNIX-based systems).
Name. The name of the function or routine. The name required varies according to the Type of the dependency:
Job. The name of a released, or unreleased, job.
Local. The catalog name.
Global. The catalog name.
File. The file name.
ActiveX. Server jobs only. The Name entry is actually irrelevant for ActiveX objects. Enter something meaningful to you (ActiveX objects are identified by the Location field).
Location. The location of the dependency. A browse dialog box is available to help with this. This location can be an absolute path, but it is recommended you specify a relative path using the following environment variables:
%SERVERENGINE%    DataStage engine account directory (normally C:\Ascential\DataStage\ServerEngine).
%PROJECT%         Current project directory.
%SYSTEM%          System directory on Windows NT or /usr/lib on UNIX.
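A relative location written with these variables stays valid when the job is packaged for another system, because only the base directory changes. The Python sketch below illustrates the substitution; it is not how DataStage resolves locations internally. The function name is hypothetical, and the %PROJECT% and %SYSTEM% directories and the file name are made-up examples.

```python
from typing import Mapping


def expand_location(location: str, base_dirs: Mapping[str, str]) -> str:
    """Replace %NAME% placeholders with the matching base directory."""
    for name, directory in base_dirs.items():
        location = location.replace(f"%{name}%", directory)
    return location


# Illustrative values for one Windows server.
base_dirs = {
    "SERVERENGINE": r"C:\Ascential\DataStage\ServerEngine",
    "PROJECT": r"C:\Ascential\DataStage\Projects\dstageprj",  # hypothetical
    "SYSTEM": r"C:\WINNT\System32",                            # hypothetical
}

print(expand_location(r"%PROJECT%\lib\helper.dll", base_dirs))
```

On a different server, supplying a different base_dirs mapping resolves the same stored location to that server's directories.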
The Browse Files dialog box is shown below. You cannot navigate to the parent directory of an environment variable.
When browsing for the location of a file on a UNIX server, there is an entry called Root in the base locations drop-down list (called Drives on mk56 in the above example).
The Performance page allows you to improve the performance of the job by specifying the way the system divides jobs into processes. For a full explanation of this, see Chapter 2 of DataStage Server Job Developers Guide.
These settings can also be made on a project-wide basis using the DataStage Administrator (see DataStage Administrator Guide).
The settings are:
Use Project Defaults. Select this to use whatever settings have been made in the DataStage Administrator for the project to which this job belongs.
Enable Row Buffering. There are two mutually exclusive types of row buffering:
  In process. You can improve the performance of most DataStage jobs by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row.
  Inter process. Use this if you are running server jobs on an SMP parallel system. This enables the job to run using a separate process for each active stage, which will run simultaneously on a separate processor.
Note: You cannot use row buffering of either sort if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.
Buffer size. Specifies the size of the buffer used by in-process or inter-process row buffering. Defaults to 128 Kb.
Timeout. Only applies when inter-process row buffering is used. Specifies the time one process will wait to communicate with another via the buffer before timing out. Defaults to 10 seconds.
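The constraints among these settings can be summarized in a small Python model. This is a sketch under stated assumptions: the class, field names and messages are hypothetical, and only the defaults (128 Kb, 10 seconds) and the rules come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PerformanceSettings:
    """Illustrative model of the Performance page; not a DataStage structure."""
    use_project_defaults: bool = True
    row_buffering: Optional[str] = None  # None, "in_process" or "inter_process"
    buffer_size_kb: int = 128            # documented default
    timeout_seconds: int = 10            # documented default

    def timeout_applies(self):
        """The timeout is only used with inter-process row buffering."""
        return self.row_buffering == "inter_process"

    def validate(self, uses_common_blocks=False):
        """Return a list of problems; an empty list means the settings are consistent."""
        problems = []
        if self.row_buffering not in (None, "in_process", "inter_process"):
            problems.append("unknown row buffering type")
        if self.row_buffering is not None and uses_common_blocks:
            problems.append("row buffering cannot be combined with COMMON blocks")
        if self.buffer_size_kb <= 0:
            problems.append("buffer size must be positive")
        return problems
```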
You can ensure that DataStage uses the correct character set map and formatting rules for your server job by specifying character set maps and locales on the NLS page of the Job Properties dialog box.
Note: The list contains all character set maps that are loaded and ready for use. You can view other maps that are supplied with DataStage by clicking Show all maps, but these maps cannot be used unless they are loaded using the DataStage Administrator. For more information, see DataStage Administrator Guide.
You can ensure that DataStage uses the correct character set map and collate formatting rules for your parallel job by specifying character set maps and collation locale on the NLS page of the Job Properties dialog box.
The character set map defines the character set DataStage uses for this job. You can select a specific character set map from the list or accept the default setting for the whole project. The locale determines the order for sorted data in the job. Select the project default or choose one from the list.
From this page you can switch tracing on for parallel jobs to help you debug them. You can also specify a collation sequence file and set the default runtime column propagation setting for this job.
The page has the following options:
Compile in trace mode. Select this so that you can use the tracing facilities after you have compiled this job.
Force Sequential Mode. Select this to force the job to run sequentially on the conductor node.
Limits per partition. These options enable you to limit data in each partition to make problems easier to diagnose:
  Number of Records per Link. This limits the number of records that will be included in each partition.
Log Options Per Partition. These options enable you to specify how log data is handled for partitions. This can cut down the data in the log to make problems easier to diagnose.
  Skip count. Set this to N to skip the first N records in each partition.
  Period. Set this to N to print every Nth record per partition, starting with the first record. N must be >= 1.
Advanced Runtime Options. This field allows experienced Orchestrate users to enter parameters that are added to the OSH command line. Under normal circumstances this should be left blank.
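The effect of Skip count and Period on one partition can be pictured with the short Python sketch below. It assumes that the skip is applied first and that the period then counts from the first remaining record; the text above does not state how the two options combine, so treat that ordering as an assumption. The function name is hypothetical.

```python
def records_to_log(records, skip=0, period=1):
    """Return the records of one partition that would be printed to the log.

    Assumption: skip is applied before period.
    """
    if skip < 0:
        raise ValueError("skip count must not be negative")
    if period < 1:
        raise ValueError("period must be >= 1")
    remaining = records[skip:]   # drop the first `skip` records
    return remaining[::period]   # then keep every `period`-th record
```

With ten records numbered 0 to 9, a skip count of 2 and a period of 3 would select records 2, 5 and 8 under this assumption.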
The page shows the current defaults for date, time, timestamp, and decimal separator. To change the default, clear the corresponding Project default check box, then either select a new format from the drop down list or type in a new format.
Category. The category to which the job belongs.
Job version number. The version number of the job. A job version number has several components:
  The version number N.n.n. This number checks the compatibility of the job with the version of DataStage installed. This number is automatically set when DataStage is installed and cannot be edited.
  The release number n.N.n. This number is automatically incremented every time you release a job.
  The bug fix number n.n.N. This number reflects minor changes to the job design or properties. To change this number, select it and enter a new value directly or use the arrow buttons to increase the number.
Century break year. Where a two-digit year is used in the data, this is used to specify the year that is used to separate 19nn years from 20nn years.
Date format. Specifies the default date format for the job. Choose a setting from the drop-down list. Possible settings are:
  MM/DD/CCYY
  DD.MM.CCYY
  CCYY-MM-DD
The default date is used by a number of stages to interpret the date field in their column definitions. It is also used where a date type from an active stage is mapped to character or other data types in a following passive stage. The default date is also specified at project level using the DataStage Administrator client. The job default overrides the project default.
Perform expression semantic checking. Click this to enable semantic checking in the mainframe expression editor. Be aware that selecting this can incur performance overheads. This is most likely to affect jobs with large numbers of column derivations.
Generate operational meta data. Click this to have the job generate operational meta data for use in MetaStage. Clicking this enables the Operational meta data page (see Specifying Operational Meta Data on page 4-85).
Short job description. An optional brief description of the job.
Full job description. An optional detailed description of the job.
Click OK to record your changes in the job design. Changes are not saved to the Repository until you save the job design.
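As a rough illustration of the century break year, the Python sketch below expands a two-digit year. It assumes the break year is supplied as a two-digit value and that years at or above it fall in 19nn while years below it fall in 20nn; the text above does not state which side the break year itself falls on, so that detail is an assumption. The function name is hypothetical.

```python
def expand_two_digit_year(two_digit_year, century_break_year):
    """Expand a two-digit year to four digits using a century break year.

    Assumption: values >= century_break_year map to 19nn, others to 20nn.
    """
    if not 0 <= two_digit_year <= 99:
        raise ValueError("two_digit_year must be between 0 and 99")
    if not 0 <= century_break_year <= 99:
        raise ValueError("century_break_year must be between 0 and 99")
    if two_digit_year >= century_break_year:
        return 1900 + two_digit_year
    return 2000 + two_digit_year
```

With a break year of 30, the value 75 would be read as 1975 and the value 12 as 2012.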
Instead of entering inherently variable factors as part of the job design, you can set up parameters which represent processing variables. For mainframe jobs the parameter values are placed in a file that is accessed when the job is compiled and run on the mainframe.
Job parameters are defined, edited, and deleted in the Parameters page of the Job Properties dialog box. All job parameters are defined by editing the empty row in the Job Parameters grid. For more information about adding and deleting rows, or moving between the cells of a grid, see Appendix A, Editing Grids. CAUTION: Before you remove a job parameter definition, you must make sure that you remove the references to this parameter in your job design. If you do not do this, your job may fail to run. The mainframe job Parameters page is as follows:
It contains the following fields and columns:
Parameter file name. The name of the file containing the parameters.
COBOL DD name. The DD name for the location of the file.
Name. The name of the parameter.
Type. The type of the parameter. It can be one of:
  Char. A fixed-length string where the Length attribute is used to determine its length. The COBOL program defines this parameter with PIC X(length).
  Decimal. A COBOL signed zoned-decimal number; the precision is indicated by Length and the scale by Scale. The COBOL program defines this parameter with PIC S9(length-scale)V9(scale).
  Integer. A COBOL signed zoned-decimal number, where the Length attribute is used to define its length. The COBOL program defines this parameter with PIC S9(length).
Length. The length of a char or a decimal parameter.
Scale. The precision of a decimal parameter.
Description. Optional description of the parameter.
Save As... Allows you to save the set of job parameters as a table definition in the DataStage Repository.
Load... Allows you to load the job parameters from a table definition in the DataStage Repository.
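The mapping from parameter type to COBOL picture clause can be sketched as follows. This Python function is hypothetical and only restates the rules above; it reads the Decimal picture as S9(length - scale)V9(scale), which is an interpretation of the text rather than a confirmed detail of the code generator.

```python
def pic_clause(param_type, length, scale=0):
    """Return the COBOL PIC clause for a mainframe job parameter (illustrative)."""
    kind = param_type.lower()
    if length < 1:
        raise ValueError("length must be at least 1")
    if kind == "char":
        return f"PIC X({length})"
    if kind == "integer":
        return f"PIC S9({length})"
    if kind == "decimal":
        if not 0 <= scale <= length:
            raise ValueError("scale must be between 0 and length")
        # Digits before the implied decimal point, then digits after it.
        return f"PIC S9({length - scale})V9({scale})"
    raise ValueError(f"unknown parameter type: {param_type}")
```

Under this reading, a Decimal parameter with Length 9 and Scale 2 would be defined as PIC S9(7)V9(2).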
The environment properties of a mainframe job in the Job Properties dialog box allow you to specify information that is used when code is generated for mainframe jobs.
It contains the following fields:
DBMS. If your design includes relational stages, the code generation process looks here for database details to include in the JCL files. If these fields are blank, it will use the project defaults as specified in the DataStage Administrator.
  System name. The name of the database used by the relational stages in the job. If not specified, the project default is used.
  User name and Password. These will be used throughout the job. If not specified, the project default is used.
  Rows per commit. Defines the number of rows that are written to a DB2 database before they are committed. The default setting is 0, which means to commit after all rows are processed. If you enter a number, the commit occurs after the specified number of rows are processed. For inserts, only one row is written. For updates or deletes, multiple rows may be written. However, if an error is detected, a rollback occurs.
Teradata. If your design includes Teradata stages, the code generation process looks here for database details to include in the JCL files.
  TDP id and Account id. The connection details used in Teradata stages throughout the job.
  User ID and Password. These will be used throughout the job.
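The Rows per commit setting can be illustrated by listing the points at which commits would occur. The Python sketch below is hypothetical; it assumes that a final partial batch is committed once all rows are processed, which the text above does not state explicitly.

```python
def commit_points(total_rows, rows_per_commit=0):
    """Return the processed-row counts at which a commit would be issued.

    rows_per_commit == 0 means a single commit after all rows are processed.
    Assumption: a trailing partial batch is committed at the end.
    """
    if total_rows < 0 or rows_per_commit < 0:
        raise ValueError("row counts must not be negative")
    if rows_per_commit == 0:
        return [total_rows] if total_rows else []
    points = list(range(rows_per_commit, total_rows + 1, rows_per_commit))
    if total_rows % rows_per_commit:
        points.append(total_rows)
    return points
```

For ten rows and a setting of 4, commits would fall after rows 4, 8 and 10 under this assumption.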
It contains a grid with the following columns: Name. The name of the extension variable. The name must begin with an alphabetic character and can contain only alphabetic or numeric characters. It can be upper or lower case or mixed. Value. The value that the extension variable will take in this job. No validation is done on the value.
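The naming rule for extension variables can be expressed as a single pattern. The Python sketch below is illustrative and assumes that "alphabetic" means the ASCII letters A to Z in either case; the function name is hypothetical.

```python
import re

# One leading letter followed by any number of letters or digits.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_valid_extension_name(name):
    """Return True if name begins with a letter and is purely alphanumeric."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

Names such as Var1 or region pass this check, while 1var and my_var do not.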
The fields are:
Machine Profile. If you have already specified a machine profile that contains some or all of the required details, you can select it from the drop-down list and the relevant fields will be automatically filled in.
IP address. IP Host name/address for the machine running your program and generating the operational meta data.
File exchange method. Choose between FTP and connect direct.
User name. The user name for connecting to the machine.
Password. The password for connecting to the machine.
XML file target directory and Dataset name for XML file. Specify the target directory and dataset name for the XML file which will record the operational meta data.
When the DataStage Designer needs you to specify information about the running of a server job or parallel job, it displays the Job Run Options dialog box. It has two pages: one to collect any parameters the job requires and one to let you specify any run-time limits. This dialog box may appear when you are using the Data Browser, specifying a job control routine, or using the debugger.
The Parameters page lists any parameters that have been defined for the job. If default values have been specified, these are displayed too. You can enter a value in the Value column, edit the default, or accept the default as it is. Click Set to Default to set a parameter to its default value, or click All to Default to set all parameters to their default values. Click Property Help to display any help text that has been defined for the selected parameter (this button is disabled if no help has been defined). Click OK when you are satisfied with the values for the parameters. When setting a value for an environment variable, you can specify the special value $ENV, which instructs DataStage to use the current setting for the environment variable. Note that you cannot use $ENV when viewing data on Parallel jobs. You will be warned if you try to do this.
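The way a parameter value might be resolved, including the special value $ENV, can be sketched in Python. The function and its arguments are hypothetical, and the sketch assumes the parameter name is the name of the environment variable; it is not how DataStage itself is implemented.

```python
import os


def resolve_parameter(name, value, default, environ=None):
    """Return the effective value for a job parameter (illustrative).

    value is what was typed in the Value column, or None to accept the default.
    The special value "$ENV" takes the current setting of the environment
    variable called `name`; None is returned if that variable is not set.
    """
    env = os.environ if environ is None else environ
    if value is None:
        value = default
    if value == "$ENV":
        return env.get(name)
    return value
```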
The Limits page allows you to specify whether stages in the job should be limited in how many rows they process and whether run-time error warnings should be ignored.
To specify a row limit:
1. Click the Stop stages after option button.
2. Select the number of rows from the drop-down list box.
To specify that the job should abort after a certain number of warnings:
1. Click the Abort job after option button.
2. Select the number of warnings from the drop-down list box.
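The two limits can be pictured with a simplified Python simulation. It treats the job as a single stream of "row" and "warning" events, which is a simplification (the row limit applies to stages), and it assumes the job aborts as soon as the warning count reaches the limit. All names are hypothetical.

```python
def run_with_limits(events, row_limit=None, warning_limit=None):
    """Simulate a run over a sequence of "row" and "warning" events.

    Returns (rows_processed, warnings_seen, outcome), where outcome is
    "finished", "stopped" (row limit reached) or "aborted" (warning limit reached).
    """
    rows = 0
    warnings = 0
    for event in events:
        if event == "row":
            if row_limit is not None and rows >= row_limit:
                return rows, warnings, "stopped"
            rows += 1
        elif event == "warning":
            warnings += 1
            if warning_limit is not None and warnings >= warning_limit:
                return rows, warnings, "aborted"
    return rows, warnings, "finished"
```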
5
Containers
Server jobs and Parallel jobs
A container is a group of stages and links. Containers enable you to simplify and modularize your server job designs by replacing complex areas of the diagram with a single container stage. DataStage provides two types of container:
Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window. Local containers can be used in server jobs or parallel jobs.
Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. There are two types of shared container:
  Server shared container. Used in server jobs (can also be used in parallel jobs).
  Parallel shared container. Used in parallel jobs.
You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).
Local Containers
Server jobs and Parallel jobs
The main purpose of using a DataStage local container is to simplify a complex design visually to make it easier to understand in the Diagram window. If the DataStage job has lots of stages and links, it may be easier to create additional containers to describe a particular sequence of steps. Containers are linked to other stages or containers in the job by input and output stages.
You can create a local container from scratch, or place a set of existing stages and links within a container. A local container is only accessible to the job in which it is created.
To insert an empty local container, to which you can add stages and links, click the Container icon in the General group on the tool palette and click on the Diagram window, or drag and drop it onto the Diagram window. A Container stage is added to the Diagram window. Double-click the stage to open it, then add stages and links to the container. You can rename, move, and delete a container stage in the same way as any other stage in your job design (see Stages on page 4-5).
Select the container and choose Properties from the shortcut menu. You can edit the stages and links in a container in the same way you do for a job. See Using Input and Output Stages on page 5-3 for details on how to link the container to other stages in the job.
The first ODBC stage links to a stage in the container, and is represented by a Container Input stage. A different stage in the container links to the second ODBC stage, which is represented by a Container Output stage. The container Diagram window includes the input and output stages required to link to the two ODBC stages. Note that the link names match those used for the links between the ODBC stages and the container in the main Diagram window.
The way in which the Container Input and Output stages are used depends on whether you construct a local container using existing stages and links or create a new one:
If you construct a local container from an existing group of stages and links, the input and output stages are automatically added. The link between the input or output stage and the stage in the container has the same name as the link in the main job Diagram window.
If you create a new container, you must add stages to the container Diagram window between the input and output stages. Link the stages together and edit the link names to match the ones in the main Diagram window.
You can have any number of links into and out of a local container, but all of the link names inside the container must match the link names into and out of it in the job. Once a connection is made, editing meta data on either side of the container edits the meta data on the connected stage in the job.
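The link-name rule can be checked mechanically. The Python sketch below is a hypothetical helper that compares the link names on the job side with those inside the container and reports any that have no counterpart.

```python
def unmatched_link_names(job_links, container_links):
    """Return (names only in the job, names only in the container), each sorted.

    Both results are empty when every link name matches.
    """
    job = set(job_links)
    inside = set(container_links)
    return sorted(job - inside), sorted(inside - job)
```

For example, job links in1 and out1 against container links in1 and out2 would report out1 and out2 as unmatched.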
Select the container stage in the Job Diagram window and choose Edit ➤ Deconstruct Container from the main menu.
DataStage prompts you to confirm the action (you can disable this prompt if required). Click OK and the constituent parts of the container appear in the Job Diagram window, with existing stages and links shifted to accommodate them.
If any name conflicts arise during the deconstruction process between stages from the container and existing ones, you are prompted for new names. You can click the Use Generated Names checkbox to have DataStage allocate new names automatically from then on. If the container has any unconnected links, these are discarded. Connected links remain connected.
Deconstructing a local container is not recursive. If the container you are deconstructing contains other containers, they move up a level but are not themselves deconstructed.
Shared Containers
Server jobs and Parallel jobs
Shared containers also help you to simplify your design but, unlike local containers, they are reusable by other jobs. You can use shared containers to make common job components available throughout the project. You can also insert a server shared container into a parallel job as a way of making server job functionality available. For example, you could use it to give the parallel job access to the functionality of a plug-in stage. (Note that you can only use server shared containers on SMP systems, not MPP or cluster systems.)
Shared containers comprise groups of stages and links and are stored in the Repository like DataStage jobs. When you insert a shared container into a job, DataStage places an instance of that container into the design. When you compile the job containing an instance of a shared container, the code for the container is included in the compiled job. You can use the DataStage debugger on instances of shared containers used within jobs.
When you add an instance of a shared container to a job, you will need to map meta data for the links into and out of the container, as these may vary in each job in which you use the shared container. If you change the contents of a shared container, you will need to recompile those jobs that use the container in order for the changes to take effect. For parallel shared containers, you can take advantage of runtime column propagation to avoid the need to map the meta data. If you enable runtime column propagation, then, when the job runs, meta data will be automatically propagated across the boundary between the shared container and the stage(s) to which it connects in the job (see Parallel Job Developers Guide for a description of runtime column propagation).
Note that there is nothing inherently parallel about a parallel shared container - although the stages within it have parallel capability. The stages themselves determine how the shared container code will run. Conversely, when you include a server shared container in a parallel job, the server stages have no parallel capability, but the entire container can operate in parallel because the parallel job can execute multiple instances of it.
You can create a shared container from scratch, or place a set of existing stages and links within a shared container.
Note: If you encounter a problem when running a job which uses a server shared container in a parallel job, you could try increasing the value of the DSIPC_OPEN_TIMEOUT environment variable in the Parallel Operator specific category of the environment variable dialog box in the DataStage Administrator (see DataStage Administrator Guide).
container as container parameters. The instance created has all its parameters assigned to corresponding job parameters. To create an empty shared container, to which you can add stages and links, choose File ➤ New from the DataStage Designer menu. The New dialog box appears. Choose the server Shared Container icon or parallel shared container icon as appropriate and click OK.
A new Diagram window appears in the Designer, along with a Tool palette which has the same content as for server jobs or parallel jobs, depending on the type of shared container. You can now save the shared container and give it a name. This is exactly the same as saving a job (see Saving a Job on page 4-4).
Select its icon in the job design and select Open from the shortcut menu.
Choose File ➤ Open from the main menu and select the shared container from the Open dialog box.
A Diagram window appears, showing the contents of the shared container. You can edit the stages and links in a container in the same way you do for a job.
Note: The shared container is edited independently of any job in which it is used. Saving a job, for example, will not save any open shared containers used in that job.
Category. The category containing the shared container.
Version. The version number of the shared container. A version number has several components:
  The version number N.n.n. This number checks the compatibility of the shared container with the version of DataStage installed. This number is automatically set when DataStage is installed and cannot be edited.
  The bug fix number n.n.N. This number reflects minor changes to the shared container design or properties. To change this number, select it and enter a new value directly or use the arrow buttons to increase the number.
Enable Runtime Column Propagation for new links. This checkbox appears for parallel shared containers if you have selected Enable Runtime Column propagation for Parallel jobs for this project in the DataStage Administrator. Check it to enable runtime column propagation by default for all new links in this shared container (see DataStage Parallel Job Developers Guide for a description of runtime column propagation).
Short Container Description. An optional brief description of the shared container.
Full Container Description. An optional detailed description of the shared container.
Shared containers use parameters to ensure that the container is reusable in different jobs. Any properties of the container that are likely to change between jobs can be supplied by a parameter, and the actual value for that parameter specified in the job design. Container parameters can be used in the same places as job parameters; see Using Job Parameters in Server Jobs on page 4-62.
Parameter name. The name of the parameter. Type. The type of the parameter. Help text. The text that appears in the Job Container Stage editor to help the designer add a value for the parameter in a job design (see Using a Shared Container in a Job on page 5-10).
Double-click the container stage in the Diagram window.
Select the container stage and choose Edit ➤ Properties... .
Select the container stage and choose Properties from the shortcut menu.
The Shared Container Stage editor appears:
This is similar to a general stage editor, and has Stage, Inputs, and Outputs pages, each with subsidiary tabs.
Stage Page
Stage Name. The name of the instance of the shared container. You can edit this if required. Shared Container Name. The name of the shared container of which this is an instance. You cannot change this. The General tab enables you to add an optional description of the container instance.
The Properties tab allows you to specify values for container parameters. You need to have defined some parameters in the shared container properties for this tab to appear.
Name. The name of the expected parameter.
Value. Enter a value for the parameter. You must enter values for all expected parameters here as the job does not prompt for these at run time. (You can leave string parameters blank; an empty string will be inferred.)
Insert Parameter. You can use a parameter from a parent job (or container) to supply a value for a container parameter. Click Insert Parameter to be offered a list of available parameters from which to choose.
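The rule that every expected parameter needs a value, with blank strings inferred as empty, can be sketched in Python. The function, the dictionary layout and the lowercase type name "string" are assumptions for illustration only.

```python
def resolve_container_parameters(expected, supplied):
    """Work out the values a container instance would receive (illustrative).

    expected: dict of parameter name -> type name (hypothetical type names).
    supplied: dict of parameter name -> value entered on the Properties tab.
    Returns (resolved values, names that still need a value).
    """
    resolved = {}
    missing = []
    for name, type_name in expected.items():
        value = supplied.get(name)
        if value is None or value == "":
            if type_name == "string":
                resolved[name] = ""   # blank string parameters become ""
            else:
                missing.append(name)  # no run-time prompt, so this must be set
        else:
            resolved[name] = value
    return resolved, missing
```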
The Advanced tab appears when you are using a server shared container within a parallel job. It has the same fields and functionality as the Advanced tab on all parallel stage editors. See Chapter 3 of DataStage Parallel Job Developers Guide for details.
Inputs Page
When inserted in a job, a shared container instance already has meta data defined for its various links. This meta data must match that on the link that the job uses to connect to the container exactly in all properties. The inputs page enables you to map meta data as required. The only exception to this is where you are using runtime column propagation (RCP) with a parallel shared container. If RCP is enabled for the job, and specifically for the stage whose output connects to the shared container input, then meta data will be propagated at run time, so there is no need to map it at design time. In all other cases, in order to match, the meta data on the links being matched must have the same number of columns, with corresponding properties for each.
The Inputs page for a server shared container has an Input field and two tabs, General and Columns. The Inputs page for a parallel shared container, or a server shared container used in a parallel job, has an additional tab: Partitioning. Input. Choose the input link to the container that you want to map. The General page is as follows:
Map to Container Link. Choose the link within the shared container to which the incoming job link will be mapped. Changing the link triggers a validation process; you are warned if the meta data does not match and offered the option of reconciling the meta data as described below.
Validate. Click this to request validation of the meta data on the two links. You are warned if validation fails and given the option of reconciling the meta data. If you choose to reconcile, the meta data on the container link replaces that on the job link. Surplus columns on the job link are removed. Job link columns that have the same name as a container column but different properties will have the properties overwritten, but derivation information preserved.
Note: You can use a Transformer stage within the job to manually map data between a job stage and the container stage in order to supply the meta data that the container requires. Description. Optional description of the job input link. The Columns page shows the meta data defined for the job stage link in a standard grid. You can use the Reconcile option on the Load button to overwrite meta data on the job stage link with the container link meta data in the same way as described for the Validate option.
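The validation and reconcile behavior described for the Inputs page can be sketched in Python. The column representation (a dict with "name", "properties" and an optional "derivation") is hypothetical, and the sketch assumes that columns are compared position by position; it illustrates the rules, not the Designer's implementation.

```python
def metadata_matches(job_columns, container_columns):
    """True if both links have the same columns with the same properties."""
    if len(job_columns) != len(container_columns):
        return False
    return all(
        job["name"] == cont["name"] and job["properties"] == cont["properties"]
        for job, cont in zip(job_columns, container_columns)
    )


def reconcile(job_columns, container_columns):
    """Return new job link columns after reconciling with the container link.

    Container meta data replaces the job link meta data, surplus job columns
    are dropped, and derivations on same-named job columns are preserved.
    """
    job_by_name = {col["name"]: col for col in job_columns}
    result = []
    for cont in container_columns:
        new_col = {"name": cont["name"], "properties": dict(cont["properties"])}
        existing = job_by_name.get(cont["name"])
        if existing is not None and "derivation" in existing:
            new_col["derivation"] = existing["derivation"]
        result.append(new_col)
    return result
```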
The Partitioning tab appears for parallel shared containers and when you are using a server shared container within a parallel job. It has the same fields and functionality as the Partitioning tab on all parallel stage editors. See Chapter 3 of DataStage Parallel Job Developers Guide for details.
Outputs Page
The Outputs page enables you to map meta data between a container link and the job link which connects to the container on the output side. It has an Outputs field and a General tab and Columns tab which perform functions equivalent to those described for the Inputs page. The Columns tab for parallel shared containers has a Runtime column propagation check box. This is visible provided RCP is enabled for the job. It shows whether RCP is switched on or off for the link the container link is mapped onto. When RCP is switched on, there is no need to map the meta data.
Converting Containers
Server jobs and Parallel jobs
You can convert local containers to shared containers and vice versa. By converting a local container to a shared one you can make the functionality available to all jobs in the project. You may want to convert a shared container to a local one if you want to slightly modify its functionality within a job. You can also convert a shared container to a local container and then deconstruct it into its constituent parts as described in Deconstructing a Local Container on page 5-4.
To convert a container, select its stage icon in the job Diagram window and do one of the following:
Choose Convert from the shortcut menu.
Choose Edit ➤ Convert Container from the main menu.
DataStage prompts you to confirm the conversion. Containers nested within the container you are converting are not affected.
When converting from shared to local, you are warned if link name conflicts occur and given a chance to resolve them. A shared container cannot be converted to a local container if it has a parameter with the same name as a parameter in the parent job (or container) which is not derived from the parent's corresponding parameter. You are warned if this occurs and must resolve the conflict before the container can be converted.
Note: Converting a shared container instance to a local container has no effect on the original shared container.
6
Job Sequences
DataStage provides a graphical Job Sequencer which allows you to specify a sequence of server jobs or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job.
Note: This tool is provided in addition to the batch job facilities of the DataStage Director and the job control routine facilities of the DataStage Designer.
Designing a job sequence is similar to designing a job. You create the job sequence in the DataStage Designer, add activities (as opposed to stages) from the tool palette, and join these together with triggers (as opposed to links) to define control flow. Each activity has properties that can be tested in trigger expressions and passed to other activities further on in the sequence. Activities can also have parameters, which are used to supply job parameters and routine arguments. The job sequence itself has properties, and can have parameters, which can be passed to the activities it is sequencing.
The sample job sequence shows a sequence that will run the job Demo. If Demo runs successfully, the Success trigger causes the Overnightrun1 job to run. If Demo fails, the Failure trigger causes the Failure job to run.
The Diagram window appears, in the right pane of the Designer, along with the Tool palette for job sequences. You can now save the job sequence and give it a name. This is exactly the same as saving a job (see Saving a Job on page 4-4).
You can open an existing job sequence in the same way you would open an existing job (see Opening an Existing Job on page 4-2).
Activities
The job sequence supports the following types of activity:
Job. Specifies a DataStage server or parallel job.
Routine. Specifies a routine. This can be any routine in the DataStage Repository (but not transforms).
ExecCommand. Specifies an operating system command to execute.
Email Notification. Specifies that an email notification should be sent at this point of the sequence (uses SMTP).
Wait-for-file. Waits for a specified file to appear or disappear.
Run-activity-on-exception. There can only be one of these in a job sequence. It is executed if a job in the sequence fails to run (other exceptions are handled by triggers).
To add an activity to your job sequence, drag the corresponding icon from the tool palette and drop it on the Diagram window. You can also add particular jobs or routines to the design as activities by dragging the icon representing that job or routine from the DataStage Designer's Repository window and dropping it in the Diagram window. The job or routine appears as an activity in the Diagram window. Activities can be named, moved, and deleted in the same way as stages in an ordinary server or parallel job (see Chapter 4, Developing a Job).
Triggers
The control flow in the sequence is dictated by how you interconnect activity icons with triggers.
To add a trigger, select the trigger icon in the tool palette, click the source activity in the Diagram window, then click the target activity. Triggers can be named, moved, and deleted in the same way as links in an ordinary server or parallel job (see Chapter 4, Developing a Job). Other trigger features are specified by editing the properties of their source activity.
Activities can only have one input trigger, but can have multiple output triggers. Trigger names must be unique for each activity. For example, you could have several triggers called success in a job sequence, but each activity can only have one trigger called success.
There are three types of trigger:
• Conditional. A conditional trigger fires the target activity if the source activity fulfills the specified condition. The condition is defined by an expression, and can be one of the following types:
  - OK. Activity succeeds.
  - Failed. Activity fails.
  - Warnings. Activity produced warnings.
  - ReturnValue. A routine or command has returned a value.
  - Custom. Allows you to define a custom expression.
  - User status. Allows you to define a custom status message to write to the log.
• Unconditional. An unconditional trigger fires the target activity once the source activity completes, regardless of what other triggers are fired from the same activity.
• Otherwise. An otherwise trigger is used as a default where a source activity has multiple output triggers, but none of the conditional ones have fired.
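The three trigger types described above can be pictured with a small Python sketch. This is only an illustration of the firing rules as stated in this section, not DataStage code: the `Trigger` class, the `fired_triggers` function, and the status strings are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trigger:
    """Hypothetical model of an output trigger on a source activity."""
    name: str
    kind: str  # "conditional", "unconditional", or "otherwise"
    condition: Optional[Callable[[str], bool]] = None  # used by conditional triggers only


def fired_triggers(triggers: List[Trigger], status: str) -> List[str]:
    """Return the names of the triggers that fire for a completed activity."""
    # Conditional triggers fire only when their condition holds.
    conditional = [
        t.name for t in triggers
        if t.kind == "conditional" and t.condition is not None and t.condition(status)
    ]
    # Unconditional triggers fire regardless of what else fires.
    unconditional = [t.name for t in triggers if t.kind == "unconditional"]
    # Otherwise triggers fire only if no conditional trigger fired.
    otherwise = []
    if not conditional:
        otherwise = [t.name for t in triggers if t.kind == "otherwise"]
    return conditional + unconditional + otherwise


triggers = [
    Trigger("success", "conditional", lambda s: s == "OK"),
    Trigger("failure", "conditional", lambda s: s == "Failed"),
    Trigger("log", "unconditional"),
    Trigger("fallback", "otherwise"),
]
print(fired_triggers(triggers, "OK"))
print(fired_triggers(triggers, "Warnings"))
```

In this sketch a status that matches no conditional trigger falls through to the otherwise trigger, while the unconditional trigger fires in every case.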
Different activities can output different types of trigger:

Activity Type                    Trigger Type

Wait-for-file, ExecuteCommand    Unconditional
                                 Otherwise
                                 Conditional - OK
                                 Conditional - Failed
                                 Conditional - Custom
                                 Conditional - ReturnValue

Routine                          Unconditional
                                 Otherwise
                                 Conditional - OK
                                 Conditional - Failed
                                 Conditional - Custom
                                 Conditional - ReturnValue

Job                              Unconditional
                                 Otherwise
                                 Conditional - OK
                                 Conditional - Failed
                                 Conditional - Warnings
                                 Conditional - Custom
                                 Conditional - UserStatus

Nested condition                 Unconditional
                                 Otherwise
                                 Conditional - Custom

                                 Unconditional
Note: If a job fails to run, for example because it was in the aborted state when due to run, this will not fire a trigger. Job activities can only fire triggers if they run. Non-running jobs can be handled by exception activities, or by choosing an execution action of reset then run rather than just run for jobs (see page 6-16).
Control Entities
The Job Sequencer provides additional control entities to help control execution in a job sequence. Nested Conditions and Sequences are represented in the job design by icons and joined to activities by triggers.
Nested Conditions
A nested condition allows you to further branch the execution of a sequence depending on a condition. For example, you could use a nested condition to implement the following control sequence:
Load/init jobA
Run jobA
If ExitStatus of jobA = OK then        /*tested by trigger*/
    If Today = Wednesday then          /*tested by nested condition*/
        run jobW
    If Today = Saturday then
        run jobS
Else
    run JobB
Each nested condition can have one input trigger and will normally have multiple output triggers. You specify the condition it branches on by editing the expressions attached to the output triggers in the Triggers page of its Properties dialog box (see Nested Condition Properties on page 6-23).
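The control sequence above can be restated as a short Python sketch to show which test belongs to the trigger and which to the nested condition. The function name `jobs_to_run` and the status and day strings are hypothetical; in a real sequence these tests are expressions attached to triggers, not Python code.

```python
def jobs_to_run(job_a_status: str, today: str) -> list:
    """Return the jobs that follow jobA, given its exit status and the day."""
    if job_a_status == "OK":            # tested by the trigger on jobA
        jobs = []
        if today == "Wednesday":        # tested by the nested condition
            jobs.append("jobW")
        if today == "Saturday":         # second output trigger of the nested condition
            jobs.append("jobS")
        return jobs
    return ["JobB"]                     # the Else branch


print(jobs_to_run("OK", "Wednesday"))
print(jobs_to_run("Failed", "Wednesday"))
```

Note that on any other day a successful jobA leads to no further job in this sketch, which matches the control sequence as written.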
Sequencer
A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes:
• ALL mode. In this mode all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire.
• ANY mode. In this mode, output triggers can be fired if any of the sequencer inputs are TRUE.
Sequencer details are specified by editing its properties.
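The two sequencer modes correspond to a logical AND and a logical OR over the input triggers. A minimal Python sketch, assuming each input is represented as a boolean (the function name `sequencer_fires` is hypothetical):

```python
def sequencer_fires(mode: str, inputs: list) -> bool:
    """Decide whether the sequencer's output triggers may fire."""
    if mode == "ALL":
        # Every input must be TRUE.
        return all(inputs)
    if mode == "ANY":
        # A single TRUE input is enough.
        return any(inputs)
    raise ValueError(f"unknown sequencer mode: {mode!r}")


print(sequencer_fires("ALL", [True, False]))
print(sequencer_fires("ANY", [True, False]))
```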
The General page contains:
• Category. The job category containing the job sequence.
• Version number. The version number of the job sequence. A version number has several components:
  - The version number N.n.n. This number checks the compatibility of the job with the version of DataStage installed. This number is automatically set when DataStage is installed and cannot be edited.
  - The release number n.N.n. This number is automatically incremented every time you release a job sequence. (You can release a job sequence in the same way as you release a job.)
  - The bug fix number n.n.N. This number reflects minor changes to the job sequence design or properties. To change this number, select it and enter a new value directly or use the arrow buttons to increase the number.
• Allow Multiple Instance. Select this to enable the DataStage Director to run multiple instances of this job sequence.
• Short Description. An optional brief description of the job sequence.
• Full Description. An optional detailed description of the job sequence.
The Parameters page is as follows:
The Parameters page allows you to specify parameters for the job sequence. Values for the parameters are collected when the job sequence is run in the Director. The parameters you define here are available to all the activities in the job sequence, so where you are sequencing jobs that have parameters, you need to make these parameters visible here. For example, if you were scheduling three jobs, each of which expected a file name to be provided at run time, you would specify three parameters here, calling them, for example, filename1, filename2, and filename3. You would then edit the Job page of each of these job activities in your sequence to map the job's filename parameter onto filename1, filename2, or filename3 as appropriate (see Job Activity Properties on page 6-16). When you run the job sequence, the Job Run Options dialog box appears, prompting you to enter values for filename1, filename2, and filename3. The appropriate filename is then passed to each job as it runs.
The Parameters grid has the following columns:
• Parameter name. The name of the parameter.
• Prompt. Text used as the field name in the run-time dialog box.
• Type. The type of the parameter (to enable validation).
• Default Value. The default setting for the parameter.
• Help text. The text that appears if a user clicks Property Help in the Job Run Options dialog box when running the job sequence.
The Job Control page displays the code generated when the job sequence is compiled.
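The filename1, filename2, filename3 example on the Parameters page can be pictured as a two-level lookup: the sequence collects values once, and each job activity maps its own parameter name onto one of them. The Python sketch below is illustrative only; the job names, file names, and the `resolve_job_params` helper are hypothetical.

```python
# Values collected for the job sequence at run time (hypothetical examples).
sequence_values = {
    "filename1": "customers.txt",
    "filename2": "orders.txt",
    "filename3": "products.txt",
}

# For each job activity: job parameter name -> sequence parameter name.
activity_mappings = {
    "LoadCustomers": {"filename": "filename1"},
    "LoadOrders": {"filename": "filename2"},
    "LoadProducts": {"filename": "filename3"},
}


def resolve_job_params(mapping: dict, values: dict) -> dict:
    """Return the parameter values actually passed to one job."""
    return {job_param: values[seq_param] for job_param, seq_param in mapping.items()}


for job, mapping in activity_mappings.items():
    print(job, resolve_job_params(mapping, sequence_values))
```

Each job sees only a parameter called filename, while the sequence keeps the three values distinct under its own names.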
The Dependencies page of the Properties dialog box shows you the dependencies the job sequence has. These may be functions, routines, or jobs that the job sequence runs. Listing the dependencies of the job sequence here ensures that, if the job sequence is packaged for use on another system, all the required components will be included in the package. The details are as follows:
• Type. The type of item upon which the job sequence depends:
  - Job. Released or unreleased job. If you have added a job to the sequence, this will automatically be included in the dependencies. If you subsequently delete the job from the sequence, you must remove it from the dependencies list manually.
  - Local. Locally cataloged BASIC functions and subroutines (i.e., Transforms and Before/After routines).
  - Global. Globally cataloged BASIC functions and subroutines (i.e., Custom UniVerse functions).
  - File. A standard file.
  - ActiveX. An ActiveX (OLE) object (not available on UNIX-based systems).
• Name. The name of the function or routine. The name required varies according to the Type of the dependency:
  - Job. The name of a released, or unreleased, job.
  - Local. The catalog name.
  - Global. The catalog name.
  - File. The file name.
  - ActiveX. The Name entry is actually irrelevant for ActiveX objects. Enter something meaningful to you (ActiveX objects are identified by the Location field).
• Location. The location of the dependency. A browse dialog box is available to help with this. This location can be an absolute path, but it is recommended you specify a relative path using the following environment variables:
  - %SERVERENGINE%. DataStage engine account directory (normally C:\Ascential\DataStage\ServerEngine).
  - %PROJECT%. Current project directory.
  - %SYSTEM%. System directory on Windows NT or /usr/lib on UNIX.
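To illustrate why a relative Location is more portable than an absolute path, the following Python sketch expands the %NAME% variables against values for a given system. It is not how DataStage resolves locations internally; the `expand_location` function, the example file names, and the project path are assumptions made for the example.

```python
import re


def expand_location(location: str, variables: dict) -> str:
    """Replace %NAME% tokens in a dependency location with per-system values."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: %{name}%")
        return variables[name]

    # A function replacement is used so that backslashes in Windows paths
    # are inserted literally rather than treated as escape sequences.
    return re.sub(r"%([A-Z]+)%", substitute, location)


# Hypothetical values for two different systems.
windows_vars = {"SERVERENGINE": r"C:\Ascential\DataStage\ServerEngine"}
unix_vars = {"PROJECT": "/data/projects/demo", "SYSTEM": "/usr/lib"}

print(expand_location(r"%SERVERENGINE%\lookup.dll", windows_vars))
print(expand_location("%PROJECT%/routines/lookup.o", unix_vars))
```

The same stored Location string resolves correctly on each system, which is the reason for preferring the variables over an absolute path when packaging a job sequence.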
Activity Properties
When you have outlined your basic design by adding activities and triggers to the diagram window, you fill in the details by editing the properties of the activities. To edit an activity, do one of the following:
• Double-click the activity in the Diagram window.
• Select the activity and choose Properties from the shortcut menu.
• Select the activity and choose Edit Properties.
The format of the Properties dialog box depends on the type of activity. All have a General page, however, and any activities with output triggers have a Triggers page.
The General page contains:
• Name. The name of the activity. You can edit the name here if required.
• Description. An optional description of the activity.
• Logging text. The text that will be written to the Director log when this activity is about to run.
The Triggers page contains:
• Name. The name of the output trigger.
• Expression Type. The type of expression attached to the trigger. Choose a type from the drop-down list (see Triggers on page 6-4 for an explanation of the different types).
• Expression. The expression associated with the trigger. For most predefined conditions, this is fixed and you cannot edit it. For Custom conditions you can enter an expression, and for UserStatus conditions you can enter a text message.
You can use variables when defining trigger expressions for Custom and ReturnValue conditional triggers. The rules are given in the following table:

Activity Type    Variable                   Use

Job              stage_label.$JobStatus     Value of job completion status
                 stage_label.$UserStatus    Value of job's user status

ExecCommand                                 Command status

Routine                                     Value of routine's return code

Wait-for-file                               Value returned by DSWaitForFile before/after subroutine

stage_label is the name of the activity stage as given in the Diagram window. You can also use the job parameters from the job sequence itself. Custom conditional triggers in Nested condition and Sequencer activities can use any of the variables in the above table used by the activities connected to them. The specific pages for particular activities are described in the following sections.
The Job page contains:
• Job name. Allows you to specify the name of the job the activity is to run. You can select a job by browsing the Repository. If you have added the activity by dragging a job from the Repository window, the Job name will already be filled in.
• Invocation ID Expression. This only appears if the job identified by Job Name has Allow Multiple Instance enabled. Enter an expression identifying the instance of the job the activity will run. Enclose literal values in inverted commas, e.g. "instance_1". You can also click the browse button to be presented with a list of available job parameters you could use. You cannot leave this field blank.
• Execution Action. Allows you to specify what the activity will do with the job. Choose one of the following from the drop-down list:
  - Run (the default)
  - Reset if required then run
  - Validate only
  - Reset only
• Parameters. Allows you to provide values for any parameters that the job requires. The grid displays all the parameters expected by the job. You can:
  - Type in an expression giving a value for the parameter in the Value Expression column. Literal values must be enclosed in inverted commas.
  - Select a parameter and click Insert Parameter Value to use another parameter or argument in the sequence to provide the value. A dialog box appears displaying a tree of all the available parameters and arguments occurring in the sequence before the current activity. This includes parameters that you have defined for the job sequence itself in the Job Sequence Properties dialog box (see Job Sequence Properties on page 6-8). Choose the required parameter or argument and click OK. You can use this feature to determine control flow through the sequence.
  - Click Clear to clear the value expression from the selected parameter.
  - Click Clear All to clear the expression values from all parameters.
  - Select a parameter and click Set to Default to enter the default for that parameter as defined in the job itself.
  - Click All to Default to set all the parameters to their default values.
When you select the icon representing a job activity, you can choose Open Job from the shortcut menu to open the job in the Designer ready for editing.
The Routine page contains:
• Routine name. Allows you to specify the name of the routine the activity is to execute. You can select a routine by browsing the Repository. If you have added the activity by dragging a routine from the Repository window, the Routine name will already be filled in.
• Parameters. Allows you to provide values for any arguments that the routine requires. The grid displays all the arguments expected by the routine. You can:
  - Type in an expression giving the value for the argument in the Value Expression column. Literal values must be enclosed in inverted commas.
  - Click Clear to clear the value expression from the selected parameter.
  - Click Clear All to clear the expression values from all parameters.
  - Select an argument and click Insert Parameter Value to use another parameter or argument in the sequence to provide the value. A dialog box appears displaying a tree of all the available parameters and arguments occurring in the sequence before the current activity. Choose the required parameter or argument and click OK. You can use this feature to determine control flow through the sequence.
When you select the icon representing a routine activity, you can choose Open Routine from the shortcut menu to open the Routine dialog box for that routine ready to edit.
systems generally require you to specify a Sender's email address whereas NT systems do not. So specifying this field may be mandatory in a UNIX system, but have no effect in an NT system.
The Notification page contains:
• SMTP Mail server name. The name of the server or its IP address.
• Sender's email address. Given in the form bill.gamsworth@paddock.com.
• Recipient's email address. The address the email is to be sent to, given in the form bill.gamsworth@paddock.com.
• Email subject. The text to appear as the email subject.
• Email body. The actual message to be sent.
• Include job status in email. Select this to include available job status information in the message.
The Wait For File page contains:
• Filename. The full pathname of the file that the activity is to wait for.
• Wait for file to appear. Select this if the activity is to wait for the specified file to appear.
• Wait for file to disappear. Select this if the activity is to wait for the specified file to disappear.
• Timeout Length (hh:mm:ss). The amount of time to wait for the file to appear or disappear before the activity times out and completes.
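The behavior configured on the Wait For File page amounts to polling for a file until a timeout expires. The Python sketch below shows one way such a wait could work, under stated assumptions: the function names, the polling interval, and the boolean return value are hypothetical and do not describe the DSWaitForFile routine itself.

```python
import os
import time


def parse_timeout(hhmmss: str) -> int:
    """Convert a timeout given as hh:mm:ss into seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def wait_for_file(path: str, appear: bool = True,
                  timeout: str = "00:00:10", poll_seconds: float = 0.1) -> bool:
    """Wait for a file to appear (or disappear); return False on timeout."""
    deadline = time.monotonic() + parse_timeout(timeout)
    while True:
        # The wait ends as soon as the file's existence matches what we want.
        if os.path.exists(path) == appear:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


print(parse_timeout("01:30:00"))
```

The check is made once before the deadline test, so a timeout of 00:00:00 still performs a single existence check.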
The ExecCommand page contains:
• Command. The full pathname of the command to execute. This can be an operating system command, a batch command file, or an executable.
• Parameters. Allows you to pass parameters to the command. These should be entered in the format that the command expects them.
An exception activity can only have a single unconditional output trigger, so does not require a Triggers page. It has no input triggers. It serves as a starting point for a sequence of activities to run if an exception has occurred somewhere in the main sequence. Its Properties dialog box contains only a General page.
Sequencer Properties
In addition to the General and Triggers pages, the Properties dialog box for a Sequencer control contains a Sequencer page.
The Sequencer page contains: Mode. Choose All or Any to select the mode of operation (see Sequencer on page 6-7 for an explanation of modes).
You can also change the mode by selecting the activity and using the shortcut menu. The sequencer has a slightly different icon depending on whether All or Any mode is operational.
Link the activities with triggers appropriate for your job sequence.
3. For each QualityStage ExecCommand activity:
   a. Open the Properties dialog box.
   b. Select the ExecCommand page.
   c. Fill in the Command property with the full path of the QualityStage Parallel Extender mode script to execute. For example, to run the TEST.par Parallel Extender script that is located in /Projects/TEST/Scripts, type the following in the Command property:
      /Projects/TEST/Scripts/TEST.par
   d. Fill in the Parameters property with the parameters of the QualityStage Parallel Extender mode script. Parameters for QualityStage Parallel Extender scripts are as follows:
      ipe.env job_env_file ipe.env proj_env_file noimport 1
      job_env_file is the full path of the environment file associated with the job. It is located in the Scripts directory. Its file name is the name of the procedure with an .env extension.
      proj_env_file is the full path of the project environment file. It is located in the project directory. Its file name is ipe.env.sh.
      noimport 1 indicates that the script should treat input and output data as Parallel Extender persistent data sets rather than as text files.
      For example, the parameters associated with the command specified in the example from item c above are:
      ipe.env /Projects/TEST/Scripts/TEST.env ipe.env /Projects/TEST/ipe.env.sh noimport 1
   e. Fill in any other properties required (see ExecCommand Activity Properties on page 6-22).
4. To set up each DataStage Parallel job activity, follow the instructions in Job Activity Properties on page 6-16.
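The Parameters value described in item d follows mechanically from the script path in item c, so it can be derived rather than typed by hand. The Python sketch below reproduces the TEST.par example. It assumes, as in that example, that the project directory is the parent of the Scripts directory and that the procedure name equals the script's base name; the function name `quality_stage_parameters` is hypothetical.

```python
import posixpath


def quality_stage_parameters(script_path: str) -> str:
    """Build the Parameters string for a QualityStage Parallel Extender script."""
    scripts_dir = posixpath.dirname(script_path)
    # Assumption: the project directory is the parent of the Scripts directory.
    project_dir = posixpath.dirname(scripts_dir)
    # Assumption: the procedure name is the script file name without extension.
    procedure = posixpath.splitext(posixpath.basename(script_path))[0]
    job_env_file = posixpath.join(scripts_dir, procedure + ".env")
    proj_env_file = posixpath.join(project_dir, "ipe.env.sh")
    return f"ipe.env {job_env_file} ipe.env {proj_env_file} noimport 1"


print(quality_stage_parameters("/Projects/TEST/Scripts/TEST.par"))
```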
QualityStage Parallel Extender jobs read and write persistent data sets with schemas defined as follows:
• There is only one field defined in the record.
• The field length = total record length.
• The type of the field is raw.
A DataStage parallel job that interfaces with a QualityStage Parallel Extender job must account for the schema requirements of the QualityStage job in order for it to work properly.
7
Intelligent Assistants
DataStage provides intelligent assistants which guide you through basic DataStage tasks. Specifically they allow you to:
• Create a template from a server, parallel, or mainframe job. You can subsequently use this template to create new jobs. New jobs will be copies of the original job.
• Create a new job from a previously created template.
• Create a simple parallel data migration job. This extracts data from a source and writes it to a target.
project are displayed. Since job sequences are not supported, they are not displayed.
2. Select the job to be used as a basis for the template. Click OK. Another dialog box appears in order to collect details about your template:
3. Enter a template name, a template category, and an informative description of the job captured in the template. The template name and category must follow Windows naming restrictions. The description is displayed in the dialog for creating jobs from templates. Click OK. The Template-From-Job Assistant creates the template and saves it in the template directory specified during installation. Templates are saved in XML notation.
Administrating Templates
To delete a template, start the Job-From-Template Assistant and select the template. Click the Delete button. Use the same procedure to select and delete empty categories. The Assistant stores all the templates you create in the directory you specified during your installation of DataStage. You browse this directory when you create a new job from a template. Typically, all the developers using the Designer save their templates in this single directory. After installation, no dialog is available for changing the template directory. You can, however, change the registry entry for the template directory. The default registry value is:
[HKLM/SOFTWARE/Ascential Software/DataStage Client/currentVersion/Intelligent Assistant/Templates]
2. Select the template to be used as the basis for the job. All the templates in your template directory are displayed. If you have custom templates authored by Consulting or other authorized personnel, and you select one of these, a further dialog box prompts you to enter job customization details until the Assistant has sufficient information to create your job.
3. When you have answered the questions, click Apply. You may cancel at any time if you are unable to enter all the information. Another dialog appears in order to collect the details of the job you are creating:
4. Enter a new name and category for your job. The job name must follow DataStage naming restrictions (i.e., job names should start with a letter and consist of alphanumeric characters). Select OK. DataStage creates the job in your project and automatically loads the job into the DataStage Designer.
5.
2.
Stage, you may need to enter a user name or database name or the host server to provide connection details for the database.
3. When you have chosen your source, and supplied any information required, click Next. The DataStage Select Table dialog box appears in order to let you choose a table definition. The table definition specifies the columns that the job will read. If the table definition for your source data isn't there, click Import in order to import a table definition directly from the data source (see Chapter 8, Table Definitions, for more details).
4. Select a Table Definition from the tree structure and click OK. The name of the chosen table definition is displayed in the wizard screen. If you want to change this, click Change to open the Table Definition dialog box again. This screen also allows you to specify the table name or file name for your source data (as appropriate for the type of data source).
5. Click Next to go to the next screen. This allows you to specify details about the target where your job will write the extracted data.
6. Select one of these stages to receive your data: Data Set, DB2, InformixXPS, Oracle, Sequential File, or Teradata. Enter additional information when prompted by the dialog.
7. Click Next. The screen that appears shows the table definition that will be used to write the data (this is the same as the one used to extract the data). This screen also allows you to specify the table name or file name for your data target (as appropriate for the type of data target).
8. Click Next. The next screen invites you to supply details about the job that will be created. You must specify a job name and optionally specify a job category. The job name should follow DataStage naming
9. Select Create Job to trigger job generation. A screen displays the progress of the job generation. Using the information you entered, the DataStage generation process gathers meta data, creates a new job, and adds the created job to the current project.
10. When the job generation is complete, click Finish to exit the dialog.
All jobs consist of one source stage, one transformer stage, and one target stage. In order to support password maintenance, all passwords in your generated jobs are parameterized and are prompted for at run time.
8
Table Definitions
Table definitions are the key to your DataStage project and specify the data to be used at each stage of a DataStage job. Table definitions are stored in the Repository and are shared by all the jobs in a project. You need, as a minimum, table definitions for each data source and one for each data target in the data warehouse. When you develop a DataStage job you will typically load your stages with column definitions from table definitions held in the Repository. You can import, create, or edit a table definition using either the DataStage Designer or the DataStage Manager. (If you are dealing with a large number of table definitions, we recommend that you use the Manager).
This dialog box has up to six pages:
• General
• Columns
• Format
• NLS
• Relationships
• Parallel
sources, this is the last component of the directory path where the sequential file is found.
• Table/file name. The table or file name containing the data.
• Owner. Gives the owner of the table where the table definition comes from a relational database.
• Mainframe platform type. The type of mainframe platform that the table definition applies to. Where the table definition does not apply to a mainframe data source, it displays <Not applicable>.
• Mainframe access type. Where the table definition has been imported from a mainframe or is applicable to a mainframe, this specifies the type of database. If it is not a mainframe-type table definition, the field is set to <Not applicable>.
• Meta data supports Multi-valued fields. Select this check box if the meta data supports multivalued data. If the check box is selected, three extra grid columns used for multivalued data support will appear on the Columns page. The check box is disabled for ODBC, mainframe, and stored procedure table definitions.
• ODBC quote character. Allows you to specify what character an ODBC data source uses as a quote character. Specify 000 to suppress the quote character.
• Short description. A brief description of the data.
• Long description. A full description of the data.
The combination of the data source type, data source name, and table or file name forms a unique identifier for the table definition. No two table definitions can have the same identifier.
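The uniqueness rule for table definitions can be modeled as a dictionary keyed by the three-part identifier. The Python sketch below is only an illustration of that rule; the `TableDefinitionRegistry` class and the example names are hypothetical and say nothing about how the Repository is actually stored.

```python
class TableDefinitionRegistry:
    """Hypothetical store that enforces the three-part unique identifier."""

    def __init__(self):
        self._definitions = {}

    def add(self, source_type: str, source_name: str, table_name: str, columns: list):
        # The identifier is the combination of all three parts.
        key = (source_type, source_name, table_name)
        if key in self._definitions:
            raise ValueError(f"table definition already exists: {key}")
        self._definitions[key] = columns

    def __len__(self):
        return len(self._definitions)


registry = TableDefinitionRegistry()
registry.add("ODBC", "sales_dsn", "CUSTOMERS", ["ID", "NAME"])
# Same table name under a different data source name is a different identifier.
registry.add("ODBC", "archive_dsn", "CUSTOMERS", ["ID", "NAME"])
print(len(registry))
```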
• Association. The name of the association (if any) that the column belongs to.
• Position. The field number.
• Type. The nesting type, which can be S, M, MV, or MS.
The following column may appear if NLS is enabled:
• NLS Map. This property is visible only if NLS is enabled and Allow per-column mapping has been selected on the NLS page of the Table Definition dialog box. It allows you to specify a separate character set map for a column (which overrides the map set for the project or table).
The following columns appear if the table definition is derived from a COBOL file definition mainframe data source:
• Level number. The COBOL level number.
Mainframe table definitions also have the following columns, but due to space considerations, these are not displayed on the Columns page. To view them, choose Edit Row from the Columns page shortcut menu. The Edit Column Meta Data dialog box appears, displaying the following fields in the COBOL tab:
• Occurs. The COBOL OCCURS clause.
• Sign indicator. Indicates whether the column can be signed or not.
• Sign option. If the column is signed, gives the location of the sign in the data.
• Sync indicator. Indicates whether this is a COBOL-synchronized clause or not.
• Usage. The COBOL USAGE clause.
• Redefined field. The COBOL REDEFINES clause.
• Depending on. A COBOL OCCURS-DEPENDING-ON clause.
• Storage length. Gives the storage length in bytes of the column as defined.
• Picture. The COBOL PICTURE clause.
For more information about these fields, see page 8-16.
The Columns page for each link also contains a Clear All and a Load button. The Clear All button deletes all the column definitions. The Load button loads (copies) the column definitions from a table definition elsewhere in the Repository. A shortcut menu available in grids allows you to edit a cell, delete a row, or add a row. For more information about editing the columns grid, see Appendix A, Editing Grids.
Server jobs
There are three check boxes on this page:
• Fixed-width columns. Specifies whether the sequential file contains fixed-width fields. This check box is cleared by default, that is, the file does not contain fixed-width fields. When this check box is selected, the Spaces between columns field is enabled.
• First line is column names. Specifies whether the first line in the file contains the column names. This check box is cleared by default, that is, the first row in the file does not contain the column names.
• Omit last new-line. Specifies whether the last newline character in the file is ignored. By default this check box is cleared, that is, if a newline character exists in the file, it is used.
The rest of this page contains five fields. The available fields depend on the settings for the check boxes.
• Spaces between columns. Specifies the number of spaces used between the columns in the file. This field appears when you select Fixed-width columns.
• Delimiter. Contains the delimiter that separates the data fields. By default this field contains a comma. You can enter a single printable character or a decimal or hexadecimal number to represent the ASCII code for the character you want to use. Valid ASCII codes are in the range 1 to 253. Decimal values 1 through 9 must be preceded with a zero. Hexadecimal values must be prefixed with &h. Enter 000 to suppress the delimiter.
• Quote character. Contains the character used to enclose strings. By default this field contains a double quotation mark. You can enter a single printable character or a decimal or hexadecimal number to represent the ASCII code for the character you want to use. Valid ASCII codes are in the range 1 to 253. Decimal values 1 through 9 must be preceded with a zero. Hexadecimal values must be prefixed with &h. Enter 000 to suppress the quote character.
• NULL string. Contains characters that are written to the file when a column contains SQL null values.
• Padding character. Contains the character used to pad missing columns. This is # by default.
The Sync Parallel button is only visible if your system supports parallel jobs. When pressed, it causes the properties set on the Parallel tab to mirror the properties set on this page. A dialog box appears asking you to confirm this action. If you confirm, the Parallel tab appears and lets you view the settings.
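The entry rules for the Delimiter and Quote character fields can be summarized in a small parser. The Python sketch below follows the rules as stated on this page (single printable character, decimal code, &h-prefixed hexadecimal code, 000 to suppress, valid codes 1 to 253). The function name `parse_character_field` is hypothetical, and the exact validation DataStage performs may differ.

```python
from typing import Optional


def parse_character_field(spec: str) -> Optional[str]:
    """Interpret a Delimiter or Quote character entry; None means suppressed."""
    if spec == "000":
        return None  # 000 suppresses the delimiter or quote character
    if len(spec) == 1:
        # A single character is taken literally. This is why decimal codes
        # 1 through 9 need a leading zero: "9" means the character 9.
        if not spec.isprintable():
            raise ValueError("single characters must be printable")
        return spec
    if spec.lower().startswith("&h"):
        code = int(spec[2:], 16)   # hexadecimal ASCII code
    elif spec.isdigit():
        code = int(spec)           # decimal ASCII code
    else:
        raise ValueError(f"cannot interpret {spec!r}")
    if not 1 <= code <= 253:
        raise ValueError("ASCII code must be in the range 1 to 253")
    return chr(code)


print(repr(parse_character_field(",")))
print(repr(parse_character_field("09")))
print(repr(parse_character_field("&h7C")))
```

Here "09" yields a tab character and "&h7C" yields a vertical bar, while a bare "9" would be read as the digit itself.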
The information given here is the same as on the Format tab in one of the following parallel job stages:
• Sequential File Stage
• File Set Stage
• External Source Stage
• External Target Stage
• Column Import Stage
• Column Export Stage
See DataStage Parallel Job Developer's Guide for details.
The Defaults button gives access to a shortcut menu offering the choice of:
• Save current as default. Saves the settings you have made in this dialog box as the default ones for your table definition.
• Reset defaults from factory settings. Resets to the defaults that DataStage came with.
• Set current from default. Set the current settings to the default (this could be the factory default, or your own default if you have set one up).
Click the Show schema button to open a window showing how the current table definition is generated into an OSH schema. This shows how DataStage will interpret the column definitions and format properties of the table definition in the context of a parallel job stage.
Server jobs and Parallel jobs
The page contains two grids:
• Foreign Keys. This shows which columns in the table definition are foreign keys and which columns and tables they reference. You can define foreign keys manually by entering the information yourself. The table you reference does not have to exist in the DataStage Repository, but you will be informed if it doesn't. Referencing and referenced table do have to be in the same category.
• Tables which reference this table. This gives details of where other table definitions in the Repository reference this one using a foreign key. You cannot edit the contents of this grid.
If NLS is enabled, this page contains the name of the map to use for the table definitions. The map should match the character set used in the definitions. By default, the list box shows all the maps that are loaded and ready to use with server jobs. Show all Server maps lists all the maps that are shipped with DataStage. Show all Parallel maps lists the maps that are available for use with parallel jobs.
Note: You cannot use a server map unless it is loaded into DataStage. You can load different maps using the DataStage Administrator. For more information, see DataStage Administrator Guide. Select Allow per-column mapping if you want to assign different character set maps to individual columns.
You can import table definitions from an ODBC data source, certain plug-in stages (including Sybase Open Client and Oracle OCI), a UniVerse table, a hashed (UniVerse) file, a UniData file, or a sequential file. DataStage connects to the specified data source and extracts the required table definition meta data. You can use the Data Browser to view the actual data in data sources from which you are importing table definitions.
To import table definitions in this way:
1. Select the Table Definitions branch in the DataStage Designer Repository window and choose Import > Table Definitions > Data Source Type from the shortcut menu. For most data source types, a dialog box appears enabling you to connect to the data source (for plug-in data sources, a wizard appears and guides you through the process).
2. Fill in the required connection details and click OK. Once a connection to the data source has been made successfully, the updated dialog box gives details of the table definitions available for import.
3. Select the required table definitions and click OK. The table definition meta data is imported into the DataStage Repository.
Specific information about importing from particular types of data source is in DataStage Developer's Help.
CFD and DCLGen files
You can also import meta data from CFD files and DCLGen files. The import derives the meta data from table definition files which are generated on a mainframe and transferred to the DataStage client. The table definitions are then derived from these files. The Data Browser is not available when importing meta data in this way.
To import table definitions in this way:
1. Select the Table Definitions branch in the DataStage Designer Repository window and choose Import > Table Definitions > COBOL File Definitions or Import > Table Definitions > DCLGen File Definitions from the shortcut menu. The Import Meta Data dialog box appears, allowing you to enter details of the file to import.
2. Enter details of the file, including name, location, and start position, then click Refresh. A list of table definitions appears in the Tables list.
3. Select the table definitions you want to import, or click Select all to select all of them. Click OK. The table definition meta data is imported into the DataStage Repository.
More detailed information about importing from mainframe data sources is in DataStage Developer's Help.
3.
4.
5.
6. If you are entering a mainframe table definition, choose the platform type from the Mainframe platform type drop-down list, and the access type from the Mainframe access type drop-down list. Otherwise leave both of these items set to <Not applicable>.
7. Select the Meta data supports Multi-valued fields check box if the meta data supports multivalued data.
8. If required, specify what character an ODBC data source uses as a quote character in ODBC quote character.
9. Enter a brief description of the data in the Short description field. This is an optional field.
10. Enter a more detailed description of the data in the Long description field. This is an optional field.
11. Click the Columns tab. The Columns page appears at the front of the Table Definition dialog box. You can now enter or load column definitions for your data.
The exact fields that appear in this dialog box depend on the type of table definition as set on the General page of the Table Definition dialog box.
2. Enter the general information for each column you want to define as follows:
Column name. Type in the name of the column. This is the only mandatory field in the definition.
Key. Select Yes or No from the drop-down list.
Native type. For data sources with a platform type of OS390, choose the native data type from the drop-down list. The contents of the list are determined by the Access Type you specified on the General page of the Table Definition dialog box. (The list is blank for non-mainframe data sources.)
SQL type. Choose from the drop-down list of supported SQL types. If you are adding a table definition for platform type OS390, you cannot manually enter an SQL type; it is automatically derived from the Native type.
Length. Type a number representing the length or precision of the column.
Scale. If the column is numeric, type a number to define the number of decimal places.
Nullable. Select Yes or No from the drop-down list. This is set to indicate whether the column is subject to a NOT NULL constraint. It does not itself enforce a NOT NULL constraint.
Date format. Choose the date format that the column uses from the drop-down list of available formats.
Description. Type in a description of the column.
Server Jobs. If you are specifying meta data for a server job type data source or target, then the Edit Column Meta Data dialog box appears with the Server tab on top. Enter any required information that is specific to server jobs:
Data element. Choose from the drop-down list of available data elements.
Display. Type a number representing the display length for the column.
Position. Visible only if you have specified Meta data supports Multi-valued fields on the General page of the Table Definition dialog box. Enter a number representing the field number.
Type. Visible only if you have specified Meta data supports Multi-valued fields on the General page of the Table Definition dialog box. Choose S, M, MV, MS, or blank from the drop-down list.
Association. Visible only if you have specified Meta data supports Multi-valued fields on the General page of the Table Definition dialog box. Type in the name of the association that the column belongs to (if any).
NLS Map. Visible only if NLS is enabled and Allow per-column mapping has been selected on the NLS page of the Table Definition dialog box. Choose a separate character set map for a column, which overrides the map set for the project or table. (The per-column mapping feature is available only for sequential, ODBC, or generic plug-in data source types.)
Null String. This is the character that represents null in the data.
Padding. This is the character used to pad missing columns. Set to # by default.
Mainframe Jobs. If you are specifying meta data for a mainframe job type data source, then the Edit Column Meta Data dialog box appears with the COBOL tab on top. Enter any required information that is specific to mainframe jobs:
Level number. Type in a number giving the COBOL level number in the range 02 to 49. The default value is 05.
Occurs. Type in a number giving the COBOL occurs clause. If the column defines a group, this gives the number of elements in the group.
Usage. Choose the COBOL usage clause from the drop-down list. This specifies which COBOL format the column will be read in. These formats map to the formats in the Native type field, and changing one will normally change the other. Possible values are:
  COMP - Binary
  COMP-1 - single-precision Float
  COMP-2 - double-precision Float
  COMP-3 - packed decimal
  COMP-5 - used with NATIVE BINARY native types
  DISPLAY - zone decimal, used with Display_numeric or Character native types
  DISPLAY-1 - double-byte zone decimal, used with Graphic_G or Graphic_N
Sign indicator. Choose Signed or blank from the drop-down list to specify whether the column can be signed or not. The default is blank.
Sign option. If the column is signed, choose the location of the sign in the data from the drop-down list. Choose from the following:
  LEADING - the sign is the first byte of storage
  TRAILING - the sign is the last byte of storage
  LEADING SEPARATE - the sign is in a separate byte that has been added to the beginning of storage
  TRAILING SEPARATE - the sign is in a separate byte that has been added to the end of storage
Selecting either LEADING SEPARATE or TRAILING SEPARATE will increase the storage length of the column by one byte.
Sync indicator. Choose SYNC or blank from the drop-down list to indicate whether this is a COBOL-synchronized clause or not.
Redefined field. Optionally specify a COBOL REDEFINES clause. This allows you to describe data in the same storage area using a different data description. The redefining column must be the same length as, or smaller than, the column it redefines. Both columns must have the same level, and a column can only redefine the immediately preceding column with that level.
Depending on. Optionally choose a COBOL OCCURS DEPENDING ON clause from the drop-down list.
Storage length. Gives the storage length in bytes of the column as defined. The field cannot be edited.
Picture. Gives the COBOL PICTURE clause, which is derived from the column definition. The field cannot be edited.
The Server tab is still accessible, but the Server page only contains the Data Element and Display fields.
The following table shows the relationships between native COBOL types and SQL types:
Native Data Type    SQL Type    Precision (p)    Scale (s)
BINARY              SmallInt    1 to 4           n/a
BINARY              Integer     5 to 9           n/a
BINARY              Decimal     10 to 18         n/a
                    Char        n                n/a
                    Decimal     p+s              s
                    Decimal     p+s              s
4 8 n*2 n*2
4 8 n*2 n*2
SQL Type
Char
Precision (p)
n
Scale (s)
n/a
NATIVE BINARY
PIC S9 to S9(4) COMP-5 PIC S9(5) to S9(9) COMP-5 PIC S9(10) to S9(18) COMP-5 PIC S9(4) COMP PIC X(n) PIC S9(4) COMP PIC G(n) DISPLAY-1 PIC S9(4) COMP PIC N(n)
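Two of the rules above are mechanical enough to sketch in Python: the BINARY rows of the table (precision 1 to 4 gives SmallInt, 5 to 9 gives Integer, 10 to 18 gives Decimal) and the extra byte added by the SEPARATE sign options. The function names are hypothetical, and the one-byte-per-digit size used for a zoned DISPLAY column is an assumption taken from general COBOL practice, not from this guide.

```python
def binary_sql_type(precision):
    """Map the precision of a BINARY column to the SQL type shown in the table."""
    if 1 <= precision <= 4:
        return "SmallInt"
    if 5 <= precision <= 9:
        return "Integer"
    if 10 <= precision <= 18:
        return "Decimal"
    raise ValueError("precision must be in the range 1 to 18")


# Sign options that place the sign in its own byte.
SEPARATE_SIGN_OPTIONS = ("LEADING SEPARATE", "TRAILING SEPARATE")


def display_storage_length(digits, sign_option=""):
    """Storage length in bytes of a zoned DISPLAY numeric column.

    Assumes one byte per digit; a SEPARATE sign option adds one byte.
    """
    extra = 1 if sign_option in SEPARATE_SIGN_OPTIONS else 0
    return digits + extra
```

With these definitions, a five-digit column occupies five bytes with a TRAILING sign and six bytes with a TRAILING SEPARATE sign.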
Parallel Jobs. If you are specifying meta data for a parallel job type data source or target, then the Edit Column Meta Data dialog box appears with the Parallel tab on top. This allows you to enter detailed information about the format of the column.
Field level
This has the following properties:
Bytes to Skip. Skip the specified number of bytes from the end of the previous field to the beginning of this field.
Delimiter. Specifies the trailing delimiter of the field. Type an ASCII character or select one of whitespace, end, none, or null.
  whitespace. A whitespace character is used.
  end. Specifies that the last field in the record is composed of all remaining bytes until the end of the record.
  none. No delimiter.
  null. Null character is used.
Delimiter string. Specify a string to be written at the end of the field. Enter one or more ASCII characters.
Drop on Input. Specify this property when you must fully define the layout of your input data but do not want this field actually read into the data set.
Generate on output. Creates a field and sets it to the default value.
Prefix bytes. Specifies that each field in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the field's length or the tag value for a tagged field.
Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.
Start position. Specifies the starting position of a field in the record. The starting position can be either an absolute byte offset from the first record position (0) or the starting position of another field.
Tag case value. Explicitly specifies the tag value corresponding to a subfield in a tagged subrecord. By default the fields are numbered 0 to N-1, where N is the number of fields. (A tagged subrecord is a field whose type can vary. The subfields of the tagged subrecord are the possible types. The tag case value of the tagged subrecord selects which of those types is used to interpret the field's value for the record.)
User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.
String Type
This appears for char, varchar, and longvarchar data types and has the following properties:
Character Set. Choose from ASCII or EBCDIC.
Default. The value to substitute for a field that causes an error.
Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.
Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.
Is link field. Selected to indicate that a field holds the length of another, variable-length field of the record or of the tag value of a tagged record field.
Field max width. The maximum number of bytes in a field represented as a string. Enter a number.
Field width. The number of bytes in a field represented as a string. Enter a number.
Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.
String Type (Unicode)
This appears for char, varchar, and longvarchar data types if you have selected the Extended (Unicode) option, and for nchar, nvarchar, and longnvarchar data types. It has the following properties:
Default. The value to substitute for a field that causes an error.
Is link field. Selected to indicate that a field holds the length of another, variable-length field of the record or of the tag value of a tagged record field.
Field max width. The maximum number of bytes in a field represented as a string. Enter a number.
Field width. The number of bytes in a field represented as a string. Enter a number.
Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.
Date Type
This appears for date data types and has the following properties:
Byte order. Specifies how multiple byte data types are ordered. Choose from:
  little-endian. The high byte is on the right.
  big-endian. The high byte is on the left.
  native-endian. As defined by the native format of the machine.
Character Set. Choose from ASCII or EBCDIC.
Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.
Default. Set a default value for the column.
Format. Specifies the data representation format of a field. Choose from:
  binary
  text
Format string. The string format of a date. By default this is %yyyy-%mm-%dd.
Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
Time Type
This appears for Time or Time (microseconds) data types and has the following properties:
Byte order. Specifies how multiple byte data types are ordered. Choose from:
  little-endian. The high byte is on the right.
  big-endian. The high byte is on the left.
  native-endian. As defined by the native format of the machine.
Character Set. Choose from ASCII or EBCDIC.
Default. Set a default value for the column.
Data Format. Specifies the data representation format of a field. Choose from:
  binary
  text
Format string. Specifies the format of fields representing time as a string. By default this is %hh:%nn:%ss.
Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.
Timestamp Type
This appears for Timestamp or Timestamp (microseconds) data types and has the following properties:
Byte order. Specifies how multiple byte data types are ordered. Choose from:
  little-endian. The high byte is on the right.
  big-endian. The high byte is on the left.
  native-endian. As defined by the native format of the machine.
Character Set. Choose from ASCII or EBCDIC.
Default. Set a default value for the column.
Data Format. Specifies the data representation format of a field. Choose from:
  binary
  text
Format string. Specifies the format of a field representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.
Integer Type
This appears for all int or uint types:
Byte order. Specifies how multiple byte data types are ordered. Choose from:
  little-endian. The high byte is on the right.
  big-endian. The high byte is on the left.
  native-endian. As defined by the native format of the machine.
Character Set. Choose from ASCII or EBCDIC.
C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf().
Default. The value to substitute for a field that causes an error.
Format. Specifies the data representation format of a field. Choose from:
  binary
  text
Is link field. Selected to indicate that a field holds the length of another, variable-length field of the record or of the tag value of a tagged record field.
Field max width. The maximum number of bytes in a field represented as a string. Enter a number.
Field width. The number of bytes in a field represented as a string. Enter a number.
In_format. Format string used for conversion of data from string to integer or floating-point data. This is passed to sscanf().
Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf().
Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.
Decimal Type
This appears for Decimal data types.
Allow all zeros. Specifies whether to treat a packed decimal field containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.
Character Set. Choose from ASCII or EBCDIC.
Data Format. Specifies the data representation format of a field. Choose from:
  binary
  text
Default. The value to substitute for a field that causes an error.
Format. Specifies the data representation format of a field. Choose from:
  binary
  text
Field max width. The maximum number of bytes in a field represented as a string. Enter a number.
Field width. The number of bytes in a field represented as a string. Enter a number.
Packed. Select Yes to specify that the decimal fields contain data in packed decimal format or No to specify that they contain unpacked decimal with a separate sign byte. This property has two dependent properties as follows:
  Check. Select Yes to verify that data is packed, or No to not verify.
  Signed. Select Yes to use the existing sign when writing decimal fields. Select No to write a positive sign (0xf) regardless of the field's actual sign value.
Precision. Specifies the precision where a decimal field is written in text format. Enter a number.
Rounding. Specifies how to round a decimal field when writing it. Choose from:
  up (ceiling). Truncate source field towards positive infinity.
  down (floor). Truncate source field towards negative infinity.
  nearest value. Round the source field towards the nearest representable value.
  truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.
Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.
Float Type
This appears for Float and Double data types.
C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf().
Character Set. Choose from ASCII or EBCDIC.
Data Format. Specifies the data representation format of a field. Choose from:
  binary
  text
Default. The value to substitute for a field that causes an error.
Is link field. Selected to indicate that a field holds the length of another, variable-length field of the record or of the tag value of a tagged record field.
Field max width. The maximum number of bytes in a field represented as a string. Enter a number.
Field width. The number of bytes in a field represented as a string. Enter a number.
In_format. Format string used for conversion of data from string to integer or floating-point data. This is passed to sscanf().
Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf().
Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.
Nullable
This appears for nullable fields.
Actual field length. Specifies the number of bytes skipped or filled with the null value or pad character when a field is identified as null.
Null field length. Specifies the value of the length prefix of a variable-length field that contains a null.
Null field value. On import, specifies the value given to a field containing a null. On export, specifies the value given to a field if the source is set to null. Can be a number, string, or C-type literal escape character.
Generator
Specifies how the field behaves in a Generator stage (see DataStage Parallel Job Developer's Guide).
Vectors
If the row you are editing represents a variable length vector, tick the Variable check box. The Vector properties appear. These give the size of the vector in one of two ways:
Link Field Reference. The name of a field containing the number of elements in the variable length vector. This should have an integer or float type, and have its Is Link field property set.
Vector prefix. Specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.
If the row you are editing represents a vector of known length, enter the number of elements in the Vector box.
Extended
For certain data types the Extended check box appears to allow you to modify the data type as follows:
Char, varchar, longvarchar. Select to specify that the underlying data type is a ustring.
Time. Select to indicate that the time field includes microseconds.
Timestamp. Select to indicate that the timestamp field includes microseconds.
Int types. Select to indicate that the underlying data type is the equivalent uint field.
Use the buttons at the bottom of the Edit Column Meta Data dialog box to continue adding or editing columns, or to save and close. The buttons are:
Previous and Next. View the meta data in the previous or next row. These buttons are enabled only where there is a previous or next row enabled. If there are outstanding changes to the current row, you are asked whether you want to save them before moving on.
Close. Close the Edit Column Meta Data dialog box. If there are outstanding changes to the current row, you are asked whether you want to save them before closing.
Apply. Save changes to the current row.
Reset. Remove all changes made to the row since the last time you applied changes.
Click OK to save the column definitions and close the Edit Column Meta Data dialog box.
Remember, you can also edit a column definition grid using the general grid editing controls, described in Editing the Grid Directly on page A-5.
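The numeric date and time representations described for the Date and Time types (Days since, Is Julian, and Is midnight seconds) can be illustrated with a short Python sketch. The function names are hypothetical, and the guide does not say how DataStage treats the half-day offset implied by the noon epoch of the Julian day, so the sketch returns the conventional whole Julian day number for a calendar date.

```python
from datetime import date, time


def days_since(value, base):
    """Signed number of days from the base date to value (Days since)."""
    return (value - base).days


def julian_day_number(value):
    """Julian day number of a Gregorian calendar date (Is Julian).

    date.toordinal() counts 0001-01-01 as day 1, and the Julian day
    number of that date is 1721426, hence the constant offset.
    """
    return value.toordinal() + 1721425


def midnight_seconds(value):
    """Whole seconds elapsed since the previous midnight (Is midnight seconds)."""
    return value.hour * 3600 + value.minute * 60 + value.second
```

For instance, days_since(date(2003, 1, 1), date(2003, 1, 2)) is -1, which is why the property stores a signed integer.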
This dialog box displays all the table definitions in the project in the form of a table definition tree.
2. Double-click the appropriate branch to display the table definitions available.
3. Select the table definition you want to use.
Note: You can use the Find button to enter the name of the table definition you want. The table definition is automatically highlighted in the tree when you click OK. You can use the Import button to import a table definition from a data source.
4. If you cannot find the table definition, you can click Import > Data source type to import a table definition from a data source (see Importing a Table Definition on page 8-11 for details).
5. Click OK. The Select Columns dialog box appears. It allows you to specify which column definitions from the table definition you want to load.
Use the arrow buttons to move columns back and forth between the Available columns list and the Selected columns list. The single arrow buttons move highlighted columns; the double arrow buttons move all items. By default all columns are selected for loading. Click Find to open a dialog box which lets you search for a particular column. The shortcut menu also gives access to Find and Find Next. Click OK when you are happy with your selection. This closes the Select Columns dialog box and loads the selected columns into the stage.
For mainframe stages and certain parallel stages where the column definitions derive from a CFD file, the Select Columns dialog box may also contain a Create Filler check box. This happens when the table definition the columns are being loaded from represents a fixed-width table. Select this to cause sequences of unselected columns to be collapsed into filler items. Filler columns are sized appropriately, their datatype set to character, and name set to FILLER_XX_YY where XX is the start offset and YY the end offset. Using fillers results in a smaller set of columns, saving space and processing time and making the column set easier to understand.
If you are importing column definitions that have been derived from a CFD file into server or parallel job stages, you are warned if any of the selected columns redefine other selected columns. You can choose to carry on with the load or go back and select columns again.
6. Save the table definition by clicking OK.
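The effect of the Create Filler check box can be pictured with the following Python sketch, which collapses each run of unselected columns into a single FILLER_XX_YY column. It is an illustration only: the Column class and collapse_fillers function are hypothetical, and the offsets are assumed to be zero-based byte offsets with an inclusive end offset, which the text above does not specify.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    offset: int           # start offset in bytes (assumed zero-based)
    length: int           # storage length in bytes
    datatype: str = "Char"


def collapse_fillers(columns, selected):
    """Keep selected columns and replace each run of unselected columns
    with one character filler named FILLER_XX_YY."""
    result = []
    run = []  # unselected columns collected since the last selected one

    def flush():
        if run:
            start = run[0].offset
            end = run[-1].offset + run[-1].length - 1  # inclusive end (assumption)
            result.append(Column("FILLER_%d_%d" % (start, end),
                                 start, end - start + 1, "Char"))
            run.clear()

    for column in columns:
        if column.name in selected:
            flush()
            result.append(column)
        else:
            run.append(column)
    flush()
    return result
```

Given columns A (offset 0, length 4), B (4, 2), C (6, 2), and D (8, 10) with only A and D selected, the sketch yields A, FILLER_4_7, and D, so three column definitions describe the same fixed-width record as the original four.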
You can edit the table definition to remove unwanted column definitions, assign data elements, or change branch names.
Propagating Values
You can propagate the values for the properties set in a column to several other columns. Select the column whose values you want to propagate, then hold down SHIFT and select the columns you want to propagate to. Choose Propagate values... from the shortcut menu to open the dialog box.
In the Property column, click the check box for the property or properties whose values you want to propagate. The Usage field tells you if a particular property is applicable to certain types of job only (e.g. server, mainframe, or parallel) or certain types of table definition (e.g. COBOL). The Value field shows the value that will be propagated for a particular property.
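Conceptually, propagating values copies the checked properties from the source column to each target column and leaves all other properties alone. A minimal Python sketch, assuming column definitions are held as plain dictionaries (a hypothetical representation, not the Repository format):

```python
def propagate_values(source, targets, properties):
    """Copy the chosen property values from source into every target column.

    Only the listed properties are overwritten; other properties of the
    target columns keep their existing values.
    """
    for target in targets:
        for name in properties:
            target[name] = source[name]
```

For example, propagating only Nullable from one column to two others changes their Nullable setting but not their SQL type or length.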
When importing table definitions from a data source, you can view the actual data in the tables using the Data Browser. The Data Browser can be used when importing table definitions from the following sources:
ODBC table
UniVerse table
Hashed (UniVerse) file
Sequential file
UniData file
Some types of plug-in
The Data Browser is opened by clicking the View Data button on the Import Meta Data dialog box. The Data Browser window appears:
The Data Browser uses the meta data defined in the data source. If there is no data, a Data source is empty message appears instead of the Data Browser.
The Data Browser grid has the following controls:
You can select any row or column, or any cell within a row or column, and press CTRL+C to copy it.
You can select the whole of a very wide row by selecting the first cell and then pressing SHIFT+END.
If a cell contains multiple lines, you can double-click the cell to expand it. Single-click to shrink it again.
You can view a row containing a specific data item using the Find button. The Find dialog box repositions the view to the row containing the data you are interested in. The search is started from the current row.
The Display button opens the Column Display dialog box. It allows you to simplify the data displayed by the Data Browser by choosing to hide some of the columns. It also allows you to normalize multivalued data to provide a 1NF view in the Data Browser. This dialog box lists all the columns in the display, and initially these are all selected. To hide a column, clear it. The Normalize on drop-down list box allows you to select an association or an unassociated multivalued column on which to normalize the data. The default is Un-Normalized, and choosing Un-Normalized will display the data in NF2 form with each row shown on a single line. Alternatively you can select Un-Normalized (formatted), which displays multivalued rows split over several lines.
In the example, the Data Browser would display all columns except STARTDATE. The view would be normalized on the association PRICES.
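The kind of view described here, hiding one column and normalizing on an association, can be sketched in Python. The row layout is an assumption made for illustration: each row is a dictionary, and the columns belonging to the association hold lists of values. The column names PRICE and CURRENCY are hypothetical members of the PRICES association.

```python
def normalize_row(row, association_columns):
    """Return the 1NF view of one row: one output row per value position."""
    depth = max([1] + [len(row[name]) for name in association_columns])
    normalized = []
    for position in range(depth):
        flat = dict(row)
        for name in association_columns:
            values = row[name]
            # Shorter multivalued columns are padded with None.
            flat[name] = values[position] if position < len(values) else None
        normalized.append(flat)
    return normalized


def browser_view(rows, association_columns, hidden=()):
    """Normalize every row on the association and drop the hidden columns."""
    view = []
    for row in rows:
        for flat in normalize_row(row, association_columns):
            view.append({k: v for k, v in flat.items() if k not in hidden})
    return view
```

A row with two prices therefore appears as two rows in the normalized view, each repeating the single-valued columns, and STARTDATE is omitted when it is passed in hidden.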
If your DataStage jobs will be reading data from or writing data to a database via an ODBC connection, you can use a stored procedure to define the data to use. A stored procedure can:
Have associated parameters, which may be input or output
Return a value (like a function call)
Create a result set in the same way as an SQL SELECT statement
Note: DataStage supports the use of stored procedures with or without input arguments and the creation of a result set, but does not support output arguments or return values. A stored procedure may have a return value defined, but it is ignored at run time. A stored procedure may not have output parameters.
The definition for a stored procedure (including the associated parameters and meta data) can be stored in the Repository. These stored procedure definitions can be used when you edit an ODBC stage in your job design. For more information about the use of stored procedures in ODBC stages, see DataStage Server Job Developer's Guide.
You can import, create, or edit a stored procedure definition using the DataStage Manager or DataStage Designer.
2.
3.
When you create, edit, or view a stored procedure definition, the Table Definition dialog box appears. This dialog box is described in The Table Definition Dialog Box on page 8-2. The dialog box for stored procedures has additional pages, with up to six pages in all:
General. Contains general information about the stored procedure. The Data source type field on this page must contain StoredProcedures to display the additional Parameters page.
Columns. Contains a grid displaying the column definitions for each column in the stored procedure result set. You can add new column definitions, delete unwanted definitions, or edit existing ones. For more information about editing a grid, see Editing Column Definitions on page 8-29.
Parameters. Contains a grid displaying the properties of each input parameter.
Note: If you cannot see the Parameters page, you must enter StoredProcedures in the Data source type field on the General page.
The grid has the following columns:
Column name. The name of the parameter column.
Key. Indicates whether the column is part of the primary key.
SQL type. The SQL data type.
Length. The data precision. This is the length for CHAR data and the maximum length for VARCHAR data.
Scale. The data scale factor.
Nullable. Specifies whether the column can contain null values. This is set to indicate whether the column is subject to a NOT NULL constraint. It does not itself enforce a NOT NULL constraint.
Display. The maximum number of characters required to display the column data.
Data element. The type of data in the column.
Description. A text description of the column.
To manually enter a stored procedure definition, first create the definition. You can then enter suitable settings for the general properties, before specifying definitions for the columns in the result set and the input parameters. Note: You do not need to edit the Format page for a stored procedure definition.
7. In the Server tab, enter the maximum number of characters required to display the parameter data in the Display cell.
8. In the Server tab, choose the type of data the column contains from the drop-down list in the Data element cell. This list contains all the built-in data elements supplied with DataStage and any additional data elements you have defined. You do not need to edit this cell to create a column definition. You can assign a data element at any point during the development of your job.
9. Click APPLY and CLOSE to save and close the Edit Column Meta Data dialog box.
10. You can continue to add more parameter definitions by editing the last row in the grid. New parameters are always added to the bottom of the grid, but you can select and drag the row to a new position in the grid.
You can view or modify any stored procedure definition in your project. To view a stored procedure definition, select it in the Repository window and do one of the following:
• Choose Properties from the shortcut menu.
• Double-click the stored procedure definition in the display area.
The Table Definition dialog box appears. You can edit or delete any of the column or parameter definitions.
9
Programming in DataStage
This chapter describes the programming tasks that you can perform in DataStage. The programming tasks that might be required depend on whether you are working on server jobs, parallel jobs, or mainframe jobs. This chapter provides a general introduction to the subject, telling you what you can do. Details of programming tasks are in DataStage Server: Server Job Developer's Guide, DataStage Enterprise Edition: Parallel Job Developer's Guide, and DataStage Enterprise MVS Edition: Mainframe Job Developer's Guide.
Note: When using shared libraries you will need to ensure that the libraries are in the right order in the LD_LIBRARY_PATH environment variable (UNIX servers).
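The note about library order can be checked with a short script. The following Python sketch is not a DataStage utility; the function names are hypothetical. It assumes the usual UNIX convention that the directories in LD_LIBRARY_PATH are separated by colons and searched from left to right, so the first directory that contains a given file name is the one that supplies it. Empty entries are skipped in this sketch.

```python
import os
from typing import List, Optional


def search_order(path_value: str) -> List[str]:
    """Split an LD_LIBRARY_PATH-style string into directories, in search order."""
    return [directory for directory in path_value.split(":") if directory]


def first_provider(path_value: str, library: str) -> Optional[str]:
    """Return the first directory in search order that contains `library`."""
    for directory in search_order(path_value):
        if os.path.isfile(os.path.join(directory, library)):
            return directory
    return None  # no listed directory holds the library
```

Calling first_provider(os.environ.get("LD_LIBRARY_PATH", ""), "libexample.so") (with a library name of your own in place of the invented libexample.so) shows which copy would be picked up if several directories hold a file of that name.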
• Defining custom routines to use as building blocks in other programming tasks. For example, you may define a routine which will then be reused by several custom transforms. You can view, edit, and create your own BASIC routines using the DataStage Manager.
• Defining custom transforms. The function specified in a transform definition converts the data in a chosen column.
• Defining derivations, key expressions, and constraints while editing a Transformer stage.
• Defining before-stage and after-stage subroutines. These subroutines perform an action before or after a stage has processed data. These subroutines can be specified for Aggregator, Transformer, and some plug-in stages.
• Defining before-job and after-job subroutines. These subroutines perform an action before or after a job is run and are set as job properties.
• Defining job control routines. These subroutines can be used to control other jobs from within the current job.
Programming Components
There are different types of programming components used in server jobs. They fall within these three broad categories:
• Built-in. DataStage has several built-in programming components that you can reuse in your server jobs as required. Some of the built-in components are accessible using the DataStage Manager or DataStage Designer, and you can copy code from these. Others are only accessible from the Expression Editor, and the underlying code is not visible.
• Custom. You can also define your own programming components using the DataStage Manager or DataStage Designer, specifically routines and custom transforms. These are stored in the DataStage Repository and can be reused for other jobs and by other DataStage users.
• External. You can use certain types of external component from within DataStage. If you have a large investment in custom UniVerse functions or ActiveX (OLE) functions, then it is possible to call these from within DataStage. This is done by defining a wrapper routine which in turn calls the external functions. Note that the mechanism for including custom UniVerse functions is different from including ActiveX (OLE) functions.
The following sections discuss programming terms you will come across when programming server jobs.
Routines
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them using the Routine dialog box. The following program components are classified as routines:
• Transform functions. These are functions that you can use when defining custom transforms. DataStage has a number of built-in transform functions which are located in the Routines > Examples > Functions branch of the Repository. You can also define your own transform functions in the Routine dialog box.
• Before/After subroutines. When designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage. DataStage has a number of built-in before/after subroutines, which are located in the Routines > Built-in > Before/After branch in the Repository. You can also define your own before/after subroutines using the Routine dialog box.
• Custom UniVerse functions. These are specialized BASIC functions that have been defined outside DataStage. Using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions from within DataStage. These functions are stored under the Routines branch in the Repository. You specify the category when you create the routine. If NLS is enabled, you should be aware of any mapping requirements when using custom UniVerse functions. If a function uses data in a particular character set, it is your responsibility to map the data to and from Unicode.
• ActiveX (OLE) functions. You can use ActiveX (OLE) functions as programming components within DataStage. Such functions are made accessible to DataStage by importing them. This creates a wrapper that enables you to call the functions. After import, you can view and edit the BASIC wrapper using the Routine dialog box. By default, such functions are located in the Routines > Class name branch in the Repository, but you can specify your own category when importing the functions.
When using the Expression Editor, all of these components appear under the DS Routines command on the Suggest Operand menu.
A special case of routine is the job control routine. Such a routine is used to set up a DataStage job that controls other DataStage jobs. Job control routines are specified in the Job control page on the Job Properties dialog box. Job control routines are not stored under the Routines branch in the Repository.
Transforms
Transforms are stored in the Transforms branch of the DataStage Repository, where you can create, view or edit them using the Transform dialog box. Transforms specify the type of data transformed, the type it is transformed into, and the expression that performs the transformation. DataStage is supplied with a number of built-in transforms (which you cannot edit). You can also define your own custom transforms, which are stored in the Repository and can be used by other DataStage jobs. When using the Expression Editor, the transforms appear under the DS Transform command on the Suggest Operand menu.
Functions
Functions take arguments and return a value. The word function is applied to many components in DataStage:
• BASIC functions. These are one of the fundamental building blocks of the BASIC language. When using the Expression Editor, you can access the BASIC functions via the Function command on the Suggest Operand menu.
• DataStage BASIC functions. These are special BASIC functions that are specific to DataStage. These are mostly used in job control routines. DataStage functions begin with DS to distinguish them from general BASIC functions. When using the Expression Editor, you can access the DataStage BASIC functions via the DS Functions command on the Suggest Operand menu.
The following items, although called functions, are classified as routines and are described under Routines on page 9-3. When using the Expression Editor, they all appear under the DS Routines command on the Suggest Operand menu.
• Transform functions
• Custom UniVerse functions
• ActiveX (OLE) functions
Expressions
An expression is an element of code that defines a value. The word expression is used both as a specific part of BASIC syntax, and to describe portions of code that you can enter when defining a job. Areas of DataStage where you can use such expressions are:
• Defining breakpoints in the debugger
• Defining column derivations, key expressions and constraints in Transformer stages
• Defining a custom transform
In each of these cases the DataStage Expression Editor guides you as to what programming elements you can insert into the expression.
Subroutines
A subroutine is a set of instructions that perform a specific task. Subroutines do not return a value. The word subroutine is used both as a specific part of BASIC syntax and to refer particularly to before/after subroutines, which carry out tasks either before or after a job or an active stage. DataStage has many built-in before/after subroutines, or you can define your own.
Before/after subroutines are included under the general routine classification as they are accessible under the Routines branch in the Repository.
Macros
DataStage has a number of built-in macros. These can be used in expressions, job control routines, and before/after subroutines. The available macros are concerned with ascertaining job status. When using the Expression Editor, they appear under the DS Macro command on the Suggest Operand menu.
Expressions
Expressions are defined using a built-in language based on SQL3. For more information about this language, see Mainframe Job Developer's Guide. You can use expressions to specify:
• Column derivations
• Key expressions
• Constraints
• Stage variables
You specify these in various mainframe job stage editors as follows:
• Transformer stage: column derivations for output links, stage variables, and constraints for output links
• Relational stage: key expressions in output links
• Complex Flat File stage: key expressions in output links
• Fixed-Width Flat File stage: key expressions in output links
• Join stage: key expression in the join predicate
• External Routine stage: constraint in each stage instance
The Expression Editor helps you with entering appropriate programming elements. It operates for mainframe jobs in much the same way as it does for server jobs. It helps you to enter correct expressions and can:
• Facilitate the entry of expression elements
• Validate variable names and the complete expression
When the Expression Editor is used in a mainframe job it offers a set of the following, depending on the context in which you are using it:
• Link columns
• Variables
• Job parameters
• SQL3 functions
• Constants
More details about using the mainframe Expression Editor are given in Mainframe Job Developer's Guide.
Routines
The External Routine stage enables your job to call a COBOL subroutine that exists in a library external to DataStage. You must first define the routine, details of the library, and its input and output arguments. The routine definition is stored in the DataStage Repository and can be referenced from any number of External Routine stages in any number of mainframe jobs. Defining and calling external routines is described in more detail in the Mainframe Job Developer's Guide.
system the functions could be executed in any order as processing power becomes available. In such a situation you should write functions so that the required information can be passed through a pipe.
Expressions
Expressions are used to define:
• Column derivations
• Constraints
• Stage variables
Expressions are defined using a built-in language. The Expression Editor available from within the Transformer stage helps you with entering appropriate programming elements. It operates for parallel jobs in much the same way as it does for server jobs and mainframe jobs. It helps you to enter correct expressions and can:
• Facilitate the entry of expression elements
• Validate variable names and the complete expression
For more details about the Expression Editor, and about the built-in language, see DataStage Parallel Job Developer's Guide.
Functions
For many expressions you can choose ready-made functions from the built-in ones supplied with DataStage. You can also define your own functions that can be accessed from the Expression Editor. Such functions must be supplied within a UNIX shared library or in a standard object file (filename.o) and then referenced by defining a parallel routine within the DataStage project which calls them. For details of how to define a function, see DataStage Manager Guide.
Routines
Parallel jobs can also execute routines before or after an active stage executes. These routines are defined and stored in the DataStage Repository, and then called in the Triggers page of the particular Transformer stage Properties dialog box (see Parallel Job Developer's Guide for more details). These routines must be supplied in a UNIX shared library or an object file, and do not return a value. For details of how to define a routine, see DataStage Manager Guide.
A
Editing Grids
DataStage uses grids in many dialog boxes for displaying data. This system provides an easy way to view and edit tables. This appendix describes how to navigate around grids and edit the values they contain.
Grids
The following screen shows a typical grid used in a DataStage dialog box:
On the left side of the grid is a row selector button. Click this to select a row, which is then highlighted and ready for data input, or click any of the cells in a row to select them. The current cell is highlighted by a chequered border. The current cell is not visible if you have scrolled it out of sight. Some cells allow you to type text, some to select a checkbox and some to select a value from a drop-down list. You can move columns within the definition by clicking the column header and dragging it to its new position. You can resize columns to the available space by double-clicking the column header splitter.
You can move rows within the definition by right-clicking on the row header and dragging it to its new position. The numbers in the row header are incremented/decremented to show its new position.
The grid has a shortcut menu containing the following commands:
• Edit Cell. Opens a cell for editing.
• Find Row. Opens the Find dialog box (see Finding Rows in the Grid on page A-5).
• Edit Row. Opens the relevant Edit Row dialog box (see Editing in the Grid on page A-5).
• Insert Row. Inserts a blank row at the current cursor position.
• Delete Row. Deletes the currently selected row.
• Propagate values. Allows you to set the properties of several rows at once (see Propagating Values on page A-7).
• Properties. Opens the Properties dialog box (see Grid Properties on page A-3).
Grid Properties
The Grid Properties dialog box allows you to select certain features on the grid.
• Select and order columns. Allows you to select what columns are displayed and in what order. The Grid Properties dialog box displays the set of columns appropriate to the type of grid. The example shows columns for a server job columns definition. You can move columns within the definition by right-clicking on them and dragging them to a new position. The numbers in the position column show the new position.
• Allow freezing of left columns. Choose this to freeze the selected columns so they never scroll out of view. Select the columns in the grid by dragging the black vertical bar from next to the row headers to the right side of the columns you want to freeze.
• Allow freezing of top rows. Choose this to freeze the selected rows so they never scroll out of view. Select the rows in the grid by dragging the black horizontal bar from under the column headers to the bottom edge of the rows you want to freeze.
• Enable row highlight. Select this to enable highlighting of selected rows, disable it to highlight only the current cell.
• Excel style headers. Select this to specify that selected row and column header buttons should be shown as raised when selected.
• Apply settings to current display only. Select this to apply the selected properties to only this grid.
• Save settings for future display. Select this to apply the settings to all future displays of grids of this type.
When you start editing, the current contents of the cell are highlighted ready for editing. If the cell is currently empty, an edit cursor appears. Table A-2 shows the keys that are used for editing in the grid.
Table A-2. Keys Used in Grid Editing
Key           Action
Esc           Cancel the current edit. The grid leaves edit mode, and the cell reverts to its previous value. The focus does not move.
Enter         Accept the current edit. The grid leaves edit mode, and the cell shows the new value. When the focus moves away from a modified row, the row is validated. If the data fails validation, a message box is displayed, and the focus returns to the modified row.
Up Arrow      Move the selection up a drop-down list or to the cell immediately above.
Down Arrow    Move the selection down a drop-down list or to the cell immediately below.
Left Arrow    Move the insertion point to the left in the current value. When the extreme left of the value is reached, exit edit mode and move to the next cell on the left.
Right Arrow   Move the insertion point to the right in the current value. When the extreme right of the value is reached, exit edit mode and move to the next cell on the right.
Ctrl-Enter    Enter a line break in a value.
Adding Rows
You can add a new row by entering data in the empty row. When you move the focus away from the row, the new data is validated. If it passes validation, it is added to the table, and a new empty row appears. Alternatively, press the Insert key or choose Insert row from the shortcut menu, and a row is inserted with the default column name Newn, ready for you to edit (where n is an integer providing a unique Newn column name).
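The default naming of inserted rows can be pictured with a few lines of Python. This is a sketch only: the text above says that n is an integer giving a unique Newn name, but does not say how n is chosen, so picking the smallest unused positive integer is an assumption, and the function name is hypothetical.

```python
from typing import Iterable


def next_default_name(existing: Iterable[str], prefix: str = "New") -> str:
    """Return prefix + n for the smallest n >= 1 not already used as a column name."""
    taken = set(existing)
    n = 1
    while f"{prefix}{n}" in taken:
        n += 1  # try the next integer until the name is free
    return f"{prefix}{n}"
```

With existing column names New1, CUST, and New2, this sketch would suggest New3 for the inserted row.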
Deleting Rows
To delete a row, click anywhere in the row you want to delete to select it. Press the Delete key or choose Delete row from the shortcut menu. To delete multiple rows, hold down the Ctrl key and click in the row selector column for the rows you want to delete and press the Delete key or choose Delete row from the shortcut menu.
Propagating Values
You can propagate the values for the properties set in a grid to several rows in the grid. Select the column whose values you want to propagate, then hold down Shift and select the columns you want to propagate to. Choose Propagate values... from the shortcut menu to open the dialog box.
In the Property column, click the check box for the property or properties whose values you want to propagate. The Usage field tells you if a particular property is applicable to certain types of job only (e.g. server, mainframe, or parallel) or certain types of table definition (e.g. COBOL). The Value field shows the value that will be propagated for a particular property.
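The effect of propagating values can be sketched as copying the checked properties from one source row to each selected target row while leaving every other property alone. The Python below is an illustration under that reading, not DataStage code; the function name and the dictionary representation of a grid row are hypothetical.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]  # one grid row: property name -> value (hypothetical model)


def propagate_values(rows: List[Row], source: int,
                     targets: Iterable[int], properties: Iterable[str]) -> None:
    """Copy the checked properties from the source row to each target row, in place."""
    checked = list(properties)
    for index in targets:
        if index == source:
            continue  # nothing to copy onto the source itself
        for name in checked:
            rows[index][name] = rows[source][name]
```

Only the properties named in the call change on the target rows, which matches the idea of checking individual properties in the Property column.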
Descriptions of each of the fields in this dialog box are in Entering Column Definitions on page 8-13.
2. Enter the general information for each column you want to define.
3. If you are specifying meta data for a server job type data source or target, then the Edit Column Meta Data dialog box appears with the Server tab on top. Enter any required information that is specific to server jobs.
4. If you are specifying meta data for a parallel job type data source or target, then the Edit Column Meta Data dialog box appears with the Parallel tab on top. Enter any format information that is required by a parallel job.
5. If you are specifying meta data for a mainframe job type data source or target, then the Edit Column Meta Data dialog box appears with the COBOL tab on top. Enter any required information that is specific to mainframe jobs.
6. Use the buttons at the bottom of the Edit Column Meta Data dialog box to continue adding or editing columns, or to save and close. The buttons are:
• <Previous and Next>. View the meta data in the previous or next row. These buttons are enabled only where there is a previous or next row enabled. If there are outstanding changes to the current row, you are asked whether you want to save them before moving on.
• Close. Close the Edit Column Meta Data dialog box. If there are outstanding changes to the current row, you are asked whether you want to save them before closing.
• Apply. Save changes to the current row.
• Reset. Remove all changes made to the row since the last time you applied changes.
7. Click OK to save the column definitions and close the Edit Column Meta Data dialog box.
You can also edit a columns definition grid using the general grid editing controls, described in Editing the Grid Directly on page A-5.
The dialog box for the Fixed Width Flat File and Relational stages is:
The Edit Routine Argument dialog box for external source routines is as follows:
The Edit Routine Argument dialog box for external target routines is as follows:
B
Troubleshooting
This appendix describes problems you may encounter with DataStage and gives solutions.
then you must edit the UNIAPI.INI file and change the value of the PROTOCOL variable. In this case, change it from 11 to 12:
PROTOCOL = 12
then uncheck the Use International Settings checkbox in the DB-Library option page of the SQL Server Client Network Utility. If your job uses data in the upper 128 characters of the character set and the data is not appearing correctly on the database, then uncheck the Automatic ANSI to OEM conversion checkbox in the DB-Library option page of the SQL Server Client Network Utility.
The solution is to recompile, rerelease, and, if necessary, repackage jobs under the later release of DataStage.
Take the following steps:
1. Replace all occurrences of <langdef> with the locale used by the server (the locale must be one of those listed when you use the locale -a command).
2. Remove the #s at the start of the lines.
3. Stop and restart the DataStage server:
To stop the server:
$DSHOME/bin/uv -admin -stop
To start the server:
$DSHOME/bin/uv -admin -start
Ensure that you allow sufficient time between executing stop and start commands (minimum of 30 seconds recommended).
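The two text edits in this procedure (substituting the locale for <langdef> and removing the leading #s) can be sketched in Python as below. This is a hedged illustration rather than a supported tool: the function name is hypothetical, and the sketch assumes that the lines to uncomment are exactly those that contain the <langdef> placeholder, which the text above does not state. It works on a string and leaves reading and writing the actual file to you.

```python
PLACEHOLDER = "<langdef>"


def enable_locale_lines(text: str, locale_name: str) -> str:
    """Uncomment lines that mention <langdef> and substitute the given locale.

    Assumption: only lines containing the placeholder are touched;
    other comment lines are left exactly as they are.
    """
    result = []
    for line in text.splitlines():
        if PLACEHOLDER in line:
            # Drop the leading #s, then fill in the locale name.
            line = line.lstrip("#").replace(PLACEHOLDER, locale_name)
        result.append(line)
    return "\n".join(result)
```

The locale name passed in should be one that locale -a lists on the server. After saving the edited file, the server still has to be stopped and restarted as described in step 3, allowing the recommended pause between the two commands.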
Miscellaneous Problems
Browsing for Directories
When browsing directories within DataStage, you may find that only local drive letters are shown. If you want to use a remote drive letter, you should type it in rather than relying on the browse feature.
Index
A
ActiveX (OLE) functions programming functions 9-4 adding stages 2-9, 4-26 administrator, definition 1-10 after-job subroutines 4-56 definition 1-10 after-stage subroutines, definition 1-10 aggregating data 1-3 Aggregator stages 4-7, 4-13 definition 1-10 Annotation 1-10 Attach to Project dialog box 2-3, 3-1 cluster 1-10 code customization 4-53 column definitions column name 8-4 data element 8-4 defining for a stage 4-31 definition 1-10 deleting 4-32, 8-29, 8-40 editing 4-31, 8-29, 8-40 using the Columns grid 4-31 using the Edit button 4-32 entering 8-13 inserting 4-32 key fields 8-4 length 8-4 loading 4-33, 8-27 scale factor 8-4 Column Export stage 1-11 Column export stage 4-15 Column Import stage 1-11 Column import stage 4-15 Columns grid 4-31, 8-4 Combine Records stage 1-11 Combine records stage 4-15 Compare stage 1-11, 4-13 compiling jobs 2-23 Complex Flat File stages, definition 1-11 Compress stage 1-11 Container Input stage 4-8 Container Output stage 4-8, 4-15 Container stages definition 1-11 containers 4-5 creating 5-2 definition 1-11 editing 5-2 Index-1
B
BASIC routines, writing 9-2 BCPLoad stages, definition 1-10 before-job subroutines 4-56 definition 1-10 before-stage subroutines, definition 1-10 browsing server directories 4-37, B-4 built-in data elements definition 1-10 built-in transforms, definition 1-10
C
Change Apply stage 1-10 Change apply stage 4-13 Change Capture stage 1-10, 4-13 character set maps, specifying 4-74, 4-76
viewing 5-2 Copy stage 1-11, 4-13 creating containers 5-2 data warehouses 1-2 jobs 2-4 stored procedure definitions 8-37 table definitions 8-12 currency formats 4-75 current cell in grids A-1 custom transforms, definition 1-11 customizing COBOL code 4-53
D
data aggregating 1-3 extracting 1-3 sources 1-17 transforming 1-3 Data Browser 2-14, 8-31 definition 1-11 using 4-39 data elements definition 1-11 Data Migration Assistant 7-6 Data Set stage 1-12 Data set stage 4-12 data warehouses advantages of 1-4 example 2-1 DataStage client components 1-5 concepts 1-10 jobs 1-6 programming in 9-1 projects 1-6 server components 1-6 terms 1-10 DataStage Administrator 1-5 definition 1-11 DataStage Designer 1-5, 3-1 Index-2
definition 1-11 exiting 3-38 main window 3-2 options 4-74 starting 3-1 DataStage Designer window 3-2 menu bar 3-6 shortcut menus 3-19 status bar 3-18 tool palette 3-15 toolbar 3-15 DataStage Director 1-5 definition 1-11 DataStage Manager 1-5 definition 1-11 starting 2-3 DataStage Manager window 2-4 DataStage Package Installer 1-6 definition 1-11 DataStage Repository 1-6 definition 1-16 DataStage Server 1-6 date formats 4-75 DB2 Load Ready Flat File stages, definition 1-12 DB2 stage 1-12, 4-11, 4-15 debugger toolbar 3-18 Decode stage 1-12, 4-13 defining data warehouses 1-3 locales 4-74, 4-76 maps 4-74, 4-76 table definitions 2-4 deleting column definitions 4-32, 8-29, 8-40 links 4-30 stages 4-28 Delimited Flat File stages, definition 1-12 developer, definition 1-12 developing jobs 2-9, 3-1, 4-1, 4-26 Diagram window 3-12 Difference stage 1-12 Ascential DataStage Designer Guide
E
edit mode in grids A-5 editing column definitions 4-31, 8-29, 8-40 using the Columns grid 4-31 using the Edit button 4-32 containers 5-2 grids A-1, A-6 job properties 4-54 stages 2-11, 4-30 stored procedure definitions 8-39 table definitions 8-29 email notification activity 6-4 Encode stage 1-12, 4-14 entering column definitions 8-13 errors and UniData stage B-1 examples of projects 2-1 ExecCommand ativity 6-4 exiting DataStage Designer 3-38 Expand stage 1-12 Expression Editor 3-31, 4-62 configuring 3-31 definition 1-12 External Filter stage 1-12, 4-14 External Routine stages, definition 1-12 External Source stage 1-12 External source stage 4-12 External Target stage 1-13 External target stage 4-12 extracting data 1-3
File set stage 4-12 Filter stage 1-13 Find dialog box 8-32 Fixed-Width Flat File stages, definition 1-13 Folder stages 4-7 FTP stages, definition 1-13 Funnel stage 1-13, 4-14
G
generating code customizing code 4-53 Generator stage 1-13, 4-12 graphical performance monitor 1-13 grids A-1 adding rows in A-6 current cell A-1 deleting rows in A-7 editing A-1, A-5 keys used for navigating in A-4 keys used in editing A-6 navigating in A-4 row selector button A-1
H
Hashed File stages 4-6 definition 1-13 Head stage 1-13, 4-12
I
importing stored procedure definitions 8-34 table definitions 2-6, 8-11 Informix XPS stage 1-13, 4-11 input parameters, specifying 8-38 inserting column definitions 4-32 Intelligent Assistant 1-13 intelligent assistants 7-1 Inter-process Stage 4-7
F
File Set stage 1-13
J
JCL templates 4-52 job 1-14 job activity 6-4 job control routines 4-67 definition 1-14 job parameters 4-62 job properties 4-54 editing 4-54 saving 4-80 viewing 4-54 job sequence definition 1-14 jobs compiling 2-23 creating 2-4 defining locales 4-74, 4-76 defining maps 4-74, 4-76 definition 1-14 dependencies, specifying 4-70 developing 2-9, 3-1, 4-1, 4-26 mainframe 1-14 opening 4-2 overview 1-6 properties of 4-54 running 2-24 server 1-16 version number 4-55, 4-79 Join stage 4-14 Join stages, definition 1-14
link partitioner stage 1-14 linking stages 2-10, 4-28 links deleting 4-30 moving 4-29 multiple 4-30 renaming 4-30 loading column definitions 4-33, 8-27 local container definition 1-14 Local container stages 4-8, 4-15 locales and jobs 4-75 specifying 4-74, 4-76 Lookup File stage 1-14 Lookup file stage 4-12 Lookup stage 4-14 Lookup stages, definition 1-14
M
mainframe job stages Multi-Format Flat File 1-15 mainframe jobs 1-7 definition 1-14 Make Subrecord stage 1-14 Make subrecord stage 4-15 Make vector stage 4-15 Make Vector stageparallel job stages Make Vector 1-14 manually entering stored procedure definitions 8-36 table definitions 8-12 massively parallel processing 1-15 menu bar in DataStage Designer window 3-6 Merge stage 1-14, 4-15 meta data definition 1-14 importing from a UniData database B-2 MetaBrokers
K
key field 8-4, 8-35
L
link collector stage 1-14 Link Partitioner Stage 4-7
definition 1-14 Modify stage 1-15 monetary formats 4-75 moving links 4-29 stages 4-28 MPP 1-15 Multi-Format Flat File stage 1-15 multiple links 4-30 multivalued data 8-3
of projects 1-6
P
parallel extender 1-15 parallel job routines 9-8 parallel job stages Data Set 1-12 File Set 1-13 Filter 1-13 Informix XPS 1-13 Lookup File 1-14 Make Subrecord 1-14 Modify 1-15 Oracle 1-15 Switch 1-17 Teradata 1-17 parallel job, definition 1-15 Parallel SAS Data Set stage 1-16 parallel stages DB2 1-12 parameter definitions data element 8-35 key fields 8-35 length 8-35 scale factor 8-35 Parameters grid 8-35 Peek 1-16 Peek stage 4-12 performance monitor 4-43 plug-in stages 4-8 definition 1-16 plug-ins definition 1-16 programming in DataStage 9-1 projects example 2-1 overview 1-6 setting up 2-2 Promote Subrecord stage 1-16 Promote subrecord stage 4-15
N
Name Editor 3-12 navigation in grids A-4 New Job from Template 7-4 New Template from Job 7-1 NLS (National Language Support) definition 1-15 overview 1-8 NLS page of the Sequential File Stage dialog box 2-21 of the Table Definition dialog box 2-8, 8-10 normalization 4-42 definition 1-15 null values 8-4, 8-35 definition 1-15 number formats 4-75
O
ODBC stages 4-6 definition 1-15 opening a job 4-2 operator, definition 1-15 Oracle 7 Load stages, definition 1-15 Oracle stage 1-15, 4-11 overview of jobs 1-6 of NLS 1-8
R
reference links 4-16, 4-19 Relational stages, definition 1-16 Remove duplicates stage 1-16, 4-14 renaming links 4-30 stages 4-28 Repository 1-6 definition 1-16 routine activity 6-4 Routine dialog box 9-6 routines parallel jobs 9-8 routines, writing 9-2 row selector button A-1 Run-activity-on-exception activity 6-4 running a job 2-24
S
Sample stage 1-16 SAS stage 1-16, 4-14 saving job properties 4-80 Sequential file stage 4-13 Sequential File stages 4-6 definition 1-16 Server 1-6 server directories, browsing 4-37 server jobs 1-6 definition 1-16 server shared container stages 4-8, 4-15 setting up a project 2-2 shared container definition 1-16 shortcut menus in DataStage Designer window 3-19 SMP 1-16 sort order 4-75, 4-76 Sort stage 4-14 Sort stages, definition 1-16 Index-6
source, definition 1-17 specifying Designer options 3-25 input parameters for stored procedures 8-38 job dependencies 4-70 Split Subrecord stage 1-17 Split subrecord stage 4-15 Split Vector stage 1-17 Split vector. stage 4-15 SQL data precision 8-4, 8-35 data scale factor 8-4, 8-35 data type 8-4, 8-35 display characters 8-4, 8-35 stages 4-5 adding 2-9, 4-26 Aggregator 1-10, 4-7, 4-13 BCPLoad 1-10 column definitions for 4-31 Complex Flat File 1-11 Container 1-11 Container Input 4-8 Container Output 4-8, 4-15 DB2 Load Ready Flat File 1-12 definition 1-17 deleting 4-28 Delimited Flat File 1-12 editing 2-11, 4-30 External Routine 1-12 Fixed-Width Flat File 1-13 Folder 4-7 FTP 1-13 Hashed File 1-13, 4-6 Join 1-14 linking 2-10 local container 4-8, 4-15 Lookup 1-14 moving 4-28 ODBC 1-15, 4-6 Oracle 7 load 1-15 plug-in 1-16, 4-8 Relational 1-16 Ascential DataStage Designer Guide
renaming 4-28 Sequential File 1-16, 4-6 server shared container 4-8, 4-15 Sort 1-16 specifying 4-26 Transformer 1-10, 4-7, 4-13 UniData 4-6 starting DataStage Designer 3-1 DataStage Manager 2-3 starting DataStage B-4 status bar in DataStage Designer window 3-18 Stopping DataStage B-4 stored procedure definitions 4-42, 8-32 creating 8-37 editing 8-39 importing 8-34 manually defining 8-36 result set 8-38 viewing 8-39 stored procedures 8-33 stream link 4-16, 4-19 Switch stage 1-17 symmetric multiprocessing 1-16
Tail stage 1-17, 4-15 Teradata stage 1-17, 4-11 terms and concepts 1-10 time formats 4-75 tool palette 2-9, 3-15 toolbars debugger 3-18 Designer 3-15 transform functions, definition 1-17 Transformer stages 1-10, 4-7, 4-13 transforming data 1-3 transforms, definition 1-17 custom 1-11 troubleshooting B-1
U
Unicode 1-9 UniData stages 4-6 troubleshooting B-1 UniVerse stages 4-6 using Data Browser 4-39 job parameters 4-62
V T
Table Definition dialog box 8-2 for stored procedures 8-34 Format page 8-6 General page 8-2, 8-4, 8-6, 8-10 NLS page 8-10 Parameters page 8-34 table definitions creating 8-12 defining 2-4 definition 1-17 editing 8-29 importing 2-6, 8-11 manually entering 8-12 viewing 8-29 version number for a job 4-55, 4-79 viewing containers 5-2 job properties 4-54 stored procedure definitions 8-39 table definitions 8-29
W
wait-for-file activity 6-4 writing BASIC routines 9-2