Quality stage Notes: Bhaskar Reddy
DataStage Interview Questions
A list of top frequently asked DataStage Interview Questions and answers are given
below.
1) What is IBM DataStage?
DataStage is one of the most powerful ETL tools. It comes with the feature of graphical
visualizations for data integration. It extracts, transforms, and loads data from source to the
target.
DataStage is an integrated set of tools for designing, developing, running, compiling, and
managing applications. It can extract data from one or more data sources, achieve multi-
part conversions of the data, and load one or more target files or databases with the
resultant data.
2) Describe the Architecture of DataStage?
DataStage follows the client-server model. It has different types of client-server architecture
for different versions of DataStage.
DataStage architecture contains the following components.
o Projects
o Jobs
o Stages
o Servers
o Client Components
Quality stage Notes: Bhaskar Reddy
3) Explain the DataStage Parallel Extender (PX) or Enterprise
Edition (EE)?
DataStage PX is an IBM data integration tool. It is one of the most widely used extractions,
transformation, and loading (ETL) tools in the data warehousing industry. This tool collects
the information from various sources to perform transformations as per the business needs
and load data into respective data warehouses.
DataStage PX is also called as DataStage Enterprise Edition.
4) Describe the main features of DataStage?
The main features of DataStage are as follows.
o DataStage provides partitioning and parallel processing techniques which allow the
DataStage jobs to process an enormous volume of data quite faster.
o It has enterprise-level networking.
o It's a data integration component of IBM InfoSphere information server.
o It's a GUI based tool.
o In DataStage, we need to drag and drop the DataStage objects, and also we can
convert it to DataStage code.
Quality stage Notes: Bhaskar Reddy
o DataStage is used to perform the various ETL operations (Extract, transform, load)
o It provides connectivity with different sources & multiple targets at the same time
5) What are some prerequisites for DataStage?
For DataStage, following set ups are necessary.
o InfoSphere
o DataStage Server 9.1.2 or above
o Microsoft Visual Studio .NET 2010 Express Edition C++
o Oracle client (full client, not an instant client) if connecting to an Oracle database
o DB2 client if connecting to a DB2 database
6) How to read multiple files using a single DataStage job if files
have the same metadata?
o Search if the metadata of files is different or same then specify file names in the
sequential stage.
o Attach the metadata with a sequential stage in its properties.
o Select Read method as 'Specific File(s)'then add all files by selecting 'file' property
from the 'available properties to add.'
It will look like:
1. File= /home/myFile1.txt
2. File= /home/myFile2.txt
3. File= /home/myFile3.txt
4. Read Method= Specific file(s) fcec
7) Explain IBM InfoSphere information server and highlight its
main features?
IBM InfoSphere Information Server is a leading data integration platform which contains a
group of products that enable you to understand, filter, monitor, transform, and
deliver data. The scalable solution facilitates with massively parallel processing capabilities
to help you to manage small and massive data volumes. It assists you in forwarding reliable
Quality stage Notes: Bhaskar Reddy
information to your key business goals such as big data and analytics, data warehouse
modernization, and master data management.
Features of IBM InfoSphere information server
o IBM InfoSphere can connect with multiple source systems as well as write to various
target systems. It acts as a single platform for data integration.
o It is based on centralized layers. All the modules of the suit can share the baseline
architecture of the suite.
o It has some additional layers for the unified repository, for integrated metadata
services, and sharing a parallel engine.
o It has tools for analysis, monitoring, cleansing, transforming and delivering data.
o It has extremely parallel processing capabilities that provide high-speed processing.
8) What is IBM DataStage Flow Designer?
IBM DataStage Flow Designer allows you to create, edit, load, and run jobs in DataStage.
DFD is a thin client, web-based version of DataStage. Its a web-based UI for DataStage
than DataStage Designer, which is a Window-based thick client.
9) How do you run DataStage job from the command line?
To run a DataStage job, use command"dsjob" command as follows.
1. 'dsjob -run -jobstatus projectname jobname
10) What are some different alternative commands associated
with "dsjob"?
Many alternative optional commands can be used with dsjob command to perform any
specific task. These commands are used in the below format.
1. $dsjob -run alternative command
A list of commonly used alternative options of dsjob command is given below.
Stop: it is used to stop the running job
Lprojects: it is used to list the projects
Quality stage Notes: Bhaskar Reddy
ljobs: it is used to list the jobs in project
lparams: it is used to list the parameters in a job
paraminfo: it returns the parameters info
Linkinfo: It returns the link information
Logdetail: it is used to display details like event_id, time, and message
Lognewest: it is used to display the newest log id.
log: it is used to add a text message to log.
Logsum: it is used to display the log.
lstages: it is used to list the stages present in the job.
Llinks: it is used to list the links.
Projectinfo: it returns the project information (hostname and project name)
Jobinfo: it returns the job information (Job-status, job runtime,end time, etc.)
Stageinfo: it returns the stage name, stage type, input rows, etc.)
Report: it is used to display a report which contains Generated time, start time, elapsed
time, status, etc.
Jobid: it is used to provide Job id information.
11) What is a Quality Stage in DataStage tool?
A Quality Stage helps in integrating different types of data from multiple sources.
It is also termed as the Integrity Stage.
12) What is the process of killing a job in DataStage?
To kill a job, you must destroy the particular processing ID.
Quality stage Notes: Bhaskar Reddy
13) What is a DS Designer?
DataStage Designer is used to design the job. It also develops the work area and adds
various links to it.
14) What are the Stages in DataStage?
Stages are the basic structural blocks in InfoSphere DataStage. It provides a rich, unique
set of functionality to perform advanced or straightforward data integration task. Stages
hold and represent the processing steps that will be performed on the data.
15) What are Operators in DataStage?
The parallel job stages are made on operators. A single-stage might belong to a single
operator or a number of operators. The number of operators depends on the properties you
have set. During compilation, InfoSphere DataStage estimates your job design and
sometimes will optimize operators.
16) Explain connectivity between DataStage with DataSources?
IBM InfoSphere Information Server supports connectors and enables jobs for data transfer
between InfoSphere Information Server and data sources.
IBM InfoSphere DataStage and QualityStage jobs can access data from enterprise
applications and data sources such as:
o Relational databases
o Mainframe databases
o Enterprise Resource Planning (ERP) or Customer Relationship Management (CRM)
databases
o Online Analytical Processing (OLAP) or performance management databases
o Business and analytic applications
17) Describe Stream connector?
Quality stage Notes: Bhaskar Reddy
The Stream connector allows integration between the Streams and the DataStage.
InfoSphere Stream connector is used to send data from a DataStage job to a Stream job
and vice versa.
InfoSphere Streams can perform close to real-time analytic processing in parallel to the
data loading into a data warehouse. Alternatively, the InfoSphere Streams job
performs RTAP processing. After RTAP processing, it forwards the data to InfoSphere
DataStage to transform, enrich, and store the details for archival purposes.
18) What is the use of HoursFromTime() Function in Transformer
Stage in DataStage?
HoursFromTime Function is used to return hour portion of the time. Its input is time, and
Output is hours (int8).
Examples: If myexample1.time contains the time 22:30:00, then the following two
functions are equivalent and return the integer value 22.
1. HoursFromTime(myexample1.time)
2. HoursFromTime("22:30:00")
19) What is the Difference between Informatica and DataStage?
The DataStage and Informatica both are powerful ETL tools. Both tools do almost the same
work in nearly the same manner. In both tools, the performance, maintainability, and
learning curve are similar and comparable. Below are the few differences between both
tools.
Parameter DataStage Informatica
Multiple DataStage's pipeline partitioning uses multiple Informatica offers to pa
Partitions partitions. partitioning.
User Interface DataStage offers 3 GUIs Informatica offers 4 GUI
IBM DataStage Designer Informatica PowerDesig
Job Sequence Designer(workflow design) Repository Manager
Director (for monitoring) Workflow Designer
Workflow Manager.
Quality stage Notes: Bhaskar Reddy
Data Encryption Data encryption needs to be done before Informatica allows "Dat
reaching the DataStage Server. Transformation" inside
Designer as a separate
Transformations_ DataStage becomes a powerful transformation Informatica allows abou
engine by using functions (Oconv and IConv) transformations to proce
and routines. It offers about 40 data
transforming stages/objects. Almost any
transformation can be performed in DataStage.
Reusability We can achieve re-usability of a job in It offers access of re-us
DataStage by through Mapplets and
using containers(local&shared). To re-use a using mappings and wo
Job Sequence, you will have to make a copy, improves performance.
compile it, and run.
20) How We Can Covert Server Job To A Parallel Job?
We can convert a server job into a parallel job by using Link Collector and IPC Collector.
21) What are the different layers in the information server
architecture?
The different layers of information server architecture are as follows.
o Unified user interface
o Common services
o Unified parallel processing
o Unified Metadata
o Common connectivity
22) If you want to use the same piece of code in different jobs,
how will you achieve it?
Quality stage Notes: Bhaskar Reddy
DataStage facilitates with a feature called shared containers which allows sharing the
same piece of code for a different job. The containers are shared for reusability. A shared
container consists of a reusable job element of stages and links. We can call a shared
container in, unlike DataStage jobs.
23) How many types of Sorting methods are available in
DataStage?
There are two types of sorting methods available in DataStage for parallel jobs.
o Link sort
o Standalone Sort stage
24) Describe Link Sort?
The Link sort supports fewer options than other sorts. It is easy to maintain in a DataStage
job as there are only few stages in the DataStage job canvas.
Link sort is used unless a specific option is needed over Sort Stage. Most often, the Sort
stage is used to specify the Sort Key mode for partial sorts.
Sorting on a link option is available on the input/partitioning stage options. We cannot
specify a keyed partition if we use auto partition method.
25) Which commands are used to import and export the
DataStage jobs?
We use the following commands for the given operations.
For Import: we use the dsimport.exe command
For Export, we use the dsexport.exe command
26) Describe routines in DataStage? Enlist various types of
routines.
Quality stage Notes: Bhaskar Reddy
Routine is a set of tasks which are defined by the DS manager. It is run via the transformer
stage.
There are three kinds of routines
o Parallel routines
o Mainframe routines
o Server routines
27) What is the different type of jobs in DataStage?
There are two types of jobs in DataStage
o Server jobs: These jobs run in a sequential manner
o Parallel jobs: These jobs get executed in a parallel way
28) State the difference between an Operational DataStage and a
Data Warehouse?
An Operational DataStage can be considered as a presentation area for user processing and
real-time analysis. Thus, operational DataStage is a temporary repository. Whereas the
Data Warehouse is used for durable data storage needs and holds the complete data of the
entire business.
29) What is the importance of the exception activity in DataStage?
The reason behind the importance of exception activity is that during the job execution,
exception activity handles all the unfamiliar error activity.
30) What is "Fatal Error/RDBMS Code 3996" error?
This error occurs while testing jobs in DataStage 8.5 during Teradata 13 to 14 upgrade.
It is because the user tries to assign a longer string to a shorter string destination, and
sometimes if the length of one or more range boundaries in a RANGE_N function is a string
literal with a length higher than that of the test value
Quality stage Notes: Bhaskar Reddy