Datastage Performance Guide
Datastage Performance Guide
than one column. For a Merge Stage to work properly master dataset and
update dataset should contain unique records. Merge Stage is generally
used to combine datasets or files.
Sort Stage :
The Sort Stage is an active Stage. The Sort Stage is used to sort input
dataset either in Ascending or Descending order. The Sort Stage offers a
variety of options of retaining first or last records when removing
duplicate records, Stable sorting, can specify the algorithm used for
sorting to improve performance, etc. Even though data can be sorted on a
link, Sort Stage is used when the data to be sorted is huge.When we sort
data on link ( sort / unique option) once the data size is beyond the fixed
memory limit , I/O to disk takes place, which incurs an overhead.
Therefore, if the volume of data is large explicit sort stage should be used
instead of sort on link.Sort Stage gives an option on increasing the buffer
memory used for sorting this would mean lower I/O and better
performance.
Transformer Stage :
The Transformer Stage is an active Stage, which can have a single input
link and multiple output links. It is a very robust Stage with lot of inbuilt
functionality. Transformer Stage always generates C-code, which is then
compiled to a parallel component. So the overheads for using a
transformer Stage are high. Therefore, in any job, it is imperative that the
use of a transformer is kept to a minimum and instead other Stages are
used, such as:Copy Stage can be used for mapping input links with
multiple output links without any transformations. Filter Stage can be used
for filtering out data based on certain criteria. Switch Stage can be used to
map single input link with multiple output links based on the value of a
selector field. It is also advisable to reduce the number of transformers in
a Job by combining the logic into a single transformer rather than having
multiple transformers .
Funnel Stage
Funnel Stage is used to combine multiple inputs into a single output
stream. But presence of a Funnel Stage reduces the performance of a job.
It would increase the time taken by job by 30% (observations). When a
Funnel Stage is to be used in a large job it is better to isolate itself to one
job. Write the output to Datasets and funnel them in new job. Funnel
Stage should be run in continuous mode, without hindrance.
transformations, then the design of jobs could be done in such a way that
this Stage is put in a separate job all together (more resources for the
transformer Stage!!!). Also while designing jobs, care must be taken that
unnecessary column propagation is not done. Columns, which are not
needed in the job flow, should not be propagated from one Stage to
another and from one job to the next. As far as possible, RCP (Runtime
Column Propagation) should be disabled in the jobs. Sorting in a job
should be taken care try to minimise number sorts in a job. Design a job in
such a way as to combine operations around same sort keys, if possible
maintain same hash keys. Most often neglected option is dont sort if
previously sorted in sort Stage, set this option to true. This improves
the Sort Stage performance a great deal. In Transformer Stage Preserve
Sort Order can be used to maintain sort order of the data and reduce
sorting in the job.In a transformer minimum of Stage variables should be
used. More the no of Stage variable lower is the performance. An
overloaded transformer can choke the data flow and lead to bad
performance or even failure of job at some point. In order to minimise the
load on transformer we can Avoid some unnecessary function calls. For
example to convert a varchar field with date value can be type cast into
Date type by simple formatting the input value. We need not use
StringToDate function, which is used to convert a String to Date type.
Implicit conversion of data types.
Reduce the number of Stage variables used. It was observed in our
previous project by removing 5 Stage variables and 6 function calls,
runtime for the job was reduced from 2 hours to 1 hour 10 min
(approximately) with 100 million records input.Try to balance load on
transformers by sharing the transformations across existing transformers.
This would ensure smooth flow of data.If you require type casting,
renaming of columns or addition of new columns, use Copy or Modify
Stages to achieve thisWhenever you have to use Lookups on large tables,
look at the options such as unloading the lookup tables to datasets and
using, user defined join SQL to reduce the look up volume with the help of
temp tables, etc.The Copy stage should be used instead of a Transformer
for simple operations including:o Job Design placeholder between stages o
Renaming Columnso Dropping Columnso Implicit (default) Type
Conversions The upsert works well if the data is sorted on the primary
key column of the table which is being loaded. Or Determine , if the record
already exists or not to have Insert and Update separately.It is
sometimes possible to re-arrange the order of business logic within a job
flow to leverage the same sort order, partitioning, and groupings. Dont
read from a Sequential File using SAME partitioning. Unless more than one
source file is specified, this scenario will read the entire file into a single
partition, making the entire downstream flow run sequentially (unless it is
repartitioned)
You may get many errors in datastage while compiling the jobs or running the jobs.
Some of the errors are as follows
a)Source file not found.
If you are trying to read the file, which was not there with that name.
b)Some times you may get Fatal Errors.
c) Data type mismatches.
This will occur when data type mismaches occurs in the jobs.
d) Field Size errors.
e) Meta data Mismach
f) Data type size between source and target different
g) Column Mismatch
i) Pricess time out.
If server is busy. This error will come some time.
ds_Trailer_Rec: When checking operator: When binding output schema variable "outRec":
When binding output interface field "TrailerDetailRecCount" to field
"TrailerDetailRecCount": Implicit conversion from source type "ustring" to result type
"string[max=255]": Possible truncation of variable length ustring when converting to
string using codepage ISO-8859-1.
Solution:I resolved changing the extended col under meta data of the
transformer to unicode
When checking operator: A sequential operator cannot preserve the
partitioning
of the parallel data set on input port 0.
Solution:I resolved by changing the preserve partioning to 'clear' under
transformer properties
Syntax error: Error in "group" operator: Error in output redirection: Error in output
parameters: Error in modify adapter: Error in binding: Could not find type: "subrec", line
35
Solution:Its the issue of level number of those columns which were being added in transformer.
Their level number was blank and the columns that were being taken from cff file had it as 02.
Added the level number and job worked.
Out_Trailer: When checking operator: When binding output schema variable "outRec": When binding output interface field
"STDCA_TRLR_REC_CNT" to field "STDCA_TRLR_REC_CNT": Implicit conversion from source type "dfloat" to result
type "decimal[10,0]": Possible range/precision limitation.
CE_Trailer: When checking operator: When binding output interface field "Data" to field "Data": Implicit conversion from
source type "string" to result type "string[max=500]": Possible truncation of variable length string.
Implicit conversion from source type "dfloat" to result type "decimal[10,0]": Possible range/precision limitation.
Solution: Used to transformer function'DFloatToDecimal'. As target field is Decimal. By default the output from aggregator
output is double, getting the above by using above function able to resolve the warning.
When binding output schema variable "outputData": When binding output interface field "RecordCount" to field
"RecordCount": Implicit conversion from source type "string[max=255]" to result type "int16": Converting string to number.
Problem(Abstract)
Jobs that process a large amount of data in a column can abort with this error:
the record is too big to fit in a block; the length requested is: xxxx, the max block length is: xxxx.
Resolving the problem
1.
2.
3.
To fix this error you need to increase the block size to accommodate the record size:
Log into Designer and open the job.
Open the job properties--> parameters-->add environment variable and select:
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
You can set this up to 256MB but you really shouldn't need to go over 1MB.
NOTE: value is in KB
For example to set the value to 1MB:
APT_DEFAULT_TRANSPORT_BLOCK_SIZE=1048576
The default for this value is 128kb.
When setting APT_DEFAULT_TRANSPORT_BLOCK_SIZE you want to use the smallest possible value since
this value will be used for all links in the job.
For example if your job fails with APT_DEFAULT_TRANSPORT_BLOCK_SIZE set to 1 MB and succeeds at 4
MB you would want to do further testing to see what it the smallest value between 1 MB and 4 MB that will allow
the job to run and use that value. Using 4 MB could cause the job to use more memory than needed since all the
links would use a 4 MB transport block size.
NOTE: If this error appears for a dataset use APT_PHYSICAL_DATASET_BLOCK_SIZE.
.
While connecting Remote Desktop, Terminal server has been exceeded maximum number of allowed
connections
SOL: In Command Prompt, type mstsc /v: ip address of server /admin
OR
2.
SQL20521N. Error occurred processing a conditional compilation directive near string. Reason code=rc.
Following link has issue description:
http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.messages.sql.doc%2Fdoc
%2Fmsql20521n.html
3.
SK_RETAILER_GROUP_BRDIGE,1: runLocally() did not reach EOF on its input data set 0.
4.
While connecting to Datastage client, there is no response, and while restarting websphere services,
following errors occurred
[root@poluloro01 bin]# ./stopServer.sh server1 -user wasadmin -password Wasadmin0708
ADMU0116I: Tool information is being logged in file
/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log
ADMU0128I: Starting tool with the default profile
ADMU3100I: Reading configuration for server: server1
ADMU0111E: Program exiting with error: javax.management.JMRuntimeException:
ADMN0022E: Access is denied for the stop operation on Server MBean
because of insufficient or empty credentials.
ADMU4113E: Verify that username and password information is on the command line
(-username and -password) or in the <conntype>.client.props file.
ADMU1211I: To obtain a full trace of the failure, use the -trace option.
ADMU0211I: Error details may be seen in the file:
/opt/ibm/WebSphere/AppServer/profiles/default/logs/server1/stopServer.log
SOL:
Wasadmin and XMeta passwords needs to be reset and commands are below..
[root@poluloro01 bin]# cd /opt/ibm/InformationServer/ASBServer/bin/
5.
SOL: Most of the time "The specified field: XXXXXX does not exist in the view adapted schema" occurred when
we missed a field to map. Every stage has got an output tab if used in the between of the job. Make sure you
have mapped every single field required for the next stage.
Sometime even after mapping the fields this error can be occurred and one of the reason could be that the view
adapter has not linked the input and output fields. Hence in this case the required field mapping should be
dropped and recreated.
Just to give an insight on this, the view adapter is an operator which is responsible for mapping the input and
output fields. Hence DataStage creates an instance of APT_ViewAdapter which translate the components of the
operator input interface schema to matching components of the interface schema. So if the interface schema is
not having the same columns as operator input interface schema then this error will be reported.
1)When we use same partitioning in datastage transformer stage we get the following warning in 7.5.2 version.
TFCP000043
2
3
input_tfm: Input dataset 0 has a partitioning method other than entire specified;
disabling memory sharing.
This is known issue and you can safely demote that warning into informational by adding this warning to Project
specific message handler.
2) Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0
Resolution: Clear the preserve partition flag before Sequential file stages.
3)DataStage parallel job fails with fork() failed, Resource temporarily unavailable
On aix execute following command to check maxuproc setting and increase it if you plan to run multiple jobs at
the same time.
lsattr -E -l sys0 | grep maxuproc
maxuproc
1024
Maximum number of PROCESSES allowed per user
True
4)TFIP000000
3
Agg_stg: When checking operator: When binding input interface field
CUST_ACT_NBR to field CUST_ACT_NBR: Implicit conversion from source type string[5] to result type
dfloat: Converting string to number.
Resolution: use the Modify stage explicitly convert the data type before sending to aggregator stage.
5)Warning: A user defined sort operator does not satisfy the requirements.
Resolution:check the order of sorting columns and make sure use the same order when use join stage after sort
to joing two inputs.
6)TFTM000000
2
3
Stg_tfm_header,1: Conversion error calling conversion routine
timestamp_from_string data may have been lost
TFTM000000
may have been lost
Resolution:check for the correct date format or decimal format and also null values in the date or decimal fields
before passing to datastage StringToDate, DateToString,DecimalToString or StringToDecimal functions.
7)TOSO000119
2
3
Join_sort: When checking operator: Data claims to already be sorted on the
specified keys the sorted option can be used to confirm this. Data will be resorted as necessary. Performance
may improve if this sort is removed from the flow
Resolution: Sort the data before sending to join stage and check for the order of sorting keys and join keys and
make sure both are in the same order.
8)TFOR000000
2
1
Join_Outer: When checking operator: Dropping component CUST_NBR
because of a prior component with the same name.
Resolution:If you are using join,diff,merge or comp stages make sure both links have the differnt column names
Run ps -ef | grep dsapi_slave to check if any dsapi_slave processes exist. If so, kill them.
Run netstat -a | grep dsprc to see if any processes have sockets that are ESTABLISHED, FIN_WAIT, or
CLOSE_WAIT. These will prevent the dsprcd from starting. The sockets with status FIN_WAIT or CLOSE_WAIT
will eventually time out and disappear, allowing you to restart DataStage.
Then Restart DSEngine.
b)
c)
Click on Command
d)
e)
Select option 2
f)
g)
Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0
SOL:
3.
Warning: A user defined sort operator does not satisfy the requirements.
SOL: Check the order of sorting columns and make sure use the same order when use join stage after sort to
joing two inputs.
4. Conversion error calling conversion routine timestamp_from_string data may have been lost. xfmJournals,1:
Conversion error calling conversion routine decimal_from_string data may have been lost
SOL: check for the correct date format or decimal format and also null values in the date or decimal fields
before passing to datastage StringToDate, DateToString,DecimalToString or StringToDecimal functions.
5.
SOL:
cd /opt/ibm/InformationServer/Server/DSEngine/bin
./dsjob -ljobs <project_name>
6.
While attempting to compile job, failed to invoke GenRunTime using Phantom process helper
RC:
SOL:
a)
b)
c)
Ask the application team to close the active or stale sessions running from applications user.
If they have closed the sessions, but sessions are still there, then kill those sessions.
Check for output for below command before stopping Datastage services.
netstat a|grep dsrpc
If any processes are in established, check any job or stale or active or osh sessions are not running.
If any processes are in close_wait, then wait for some time, those processes
will not be visible.
Wait for 10 to 15 min for shared memory to be released by process holding them.
Start the Datastage services.
./uv admin start
If asking for dsadm password while firing the command , then enable impersonation.through root user
${DSHOME}/scripts/DSEnable_impersonation.sh
Datasets
Managing and operating on large data volumes is usually an extremely complex process. To
facilitate that, Orchestrate might extend relational database management system support in
Datastage by implementing Datasets.
By and large, datasets might be interpreted as uniform sets of rows within the internal
representation of Framework. Commonly, the two types of datasets might be distinguished:
According to the scheme above, there are the two groups of Datasets - persistent and
virtual.
The first type, persistent Datasets are marked with *.ds extensions, while for second type,
virtual datasets *.v extension is reserved. (It's important to mention, that no *.v files might be
visible in the Unix file system, as long as they exist only virtually, while inhabiting RAM
memory. Extesion *.v itself is characteristic strictly for OSH - the Orchestrate language of
scripting).
Further differences are much more significant. Primarily, persistent Datasets are being
stored in Unix files using internal Datastage EE format, while virtual Datasets are never
stored on disk - they do exist within links, and in EE format, but in RAM memory. Finally,
persistent Datasets are readable and rewriteable with the DataSet Stage, and virtual
Datasets - might be passed through in memory.
Accurateness demands mentioning the third type of Datasets - they're called filesets, while
their storage places are diverse Unix files and they're human-readable. Filesets are generally
marked with the *.fs extension.
There are a few features specific for Datasets that have to be mentioned to complete the
Datasets policy. Firstly, as a single Dataset contains multiple records, it is obvious that all of
them must undergo the same processes and modifications. In a word, all of them must go
through the same successive stage.
Secondly, it should be expected that different Datasets usually have different schemas,
therefore they cannot be treated commonly.
More accurately, typically Orchestrate Datasets are a single-file equivalent for the whole
sequence of records. With datasets, they might be shown as a one object. Datasets themselves
help ignoring the fact that demanded data really are compounded of multiple and diverse files
remaining spread across different sectors of processors and disks of parallel computers. Along
that, complexity of programming might be significantly reduced, as shown in the example
below.
Primary multiple files - shown on the left side of the scheme - have been bracketed together,
what resulted in five nodes. While using datasets, all the files, all the nodes might be boiled
down to the only one, single Dataset - shown on the right side of the scheme. Thereupon, you
might program only one file and get results on all the input files. That significantly shorten
time needed for modifying the whole group of separate files, and reduce the possibility of
engendering accidental errors. What are its measurable profits? Mainly, significantly
increasing speed of applications basing on large data volumes.
All in all, is it worth-while? Surely. Especially, while talking about technological advance and
succeeding data-dependence. They do coerce using larger and larger volumes of data, and - as
a consequence - systems able to rapid cooperate on them became a necessity.
Scenario 1:
We have two datasets with 4 cols each with different names. We should create a dataset with
4 cols 3 from one dataset and one col with the record count of one dataset.
We can use aggregator with a dummy column and get the count from one dataset and do a
look up from other dataset and map it to the 3 rd dataset
Something similar to the below design:
Scenario 2:
Following is the existing job design. But requirement got changed to: Head and trailer
datasets should populate even if detail records is not present in the source file. Below
job don't do that job.
Used row generator with a copy stage. Given default value( zero) for col( count) coming in
from row generator. If no detail records it will pick the record count from row generator.
We have a source which is a sequential file with header and footer. How to remove the header
and footer while reading this file using sequential file stage of Datastage?
Sol:Type command in putty: sed '1d;$d' file_name>new_file_name (type
this in job before job subroutine then use new file in seq stage)
IF I HAVE SOURCE LIKE COL1 A A B AND TARGET LIKE COL1 COL2 A 1 A 2 B1.
HOW TO ACHIEVE THIS OUTPUT USING STAGE VARIABLE IN TRANSFORMER
STAGE?
If keyChange =1 Then 1 Else stagevaraible+1
Suppose that 4 job control by the sequencer like (job 1, job 2, job 3, job 4 )if job 1 have
10,000 row ,after run the job only 5000 data has been loaded in target table remaining are not
loaded and your job going to be aborted then.. How can short out the problem.Suppose job
sequencer synchronies or control 4 job but job 1 have problem, in this condition should go
director and check it what type of problem showing either data type problem, warning
massage, job fail or job aborted, If job fail means data type problem or missing column action
.So u should go Run window ->Click-> Tracing->Performance or In your target table
->general -> action-> select this option here two option
(i) On Fail -- commit , Continue
(ii) On Skip -- Commit, Continue.
First u check how many data already load after then select on skip option then continue and
what remaining position data not loaded then select On Fail , Continue ...... Again Run the job
defiantly u get successful massage
Question: I want to process 3 files in sequentially one by one how can i do that. while
processing the files it should fetch files automatically .
Ans:If the metadata for all the files r same then create a job having file name as parameter
then use same job in routine and call the job with different file name...or u can create
sequencer to use the job..
Parameterize the file name.
Build the job using that parameter
Build job sequencer which will call this job and will accept the parameter for file name.
Write a UNIX shell script which will call the job sequencer three times by passing different
file each time.
RE: What Happens if RCP is disable ?
In such case Osh has to perform Import and export every time when the job runs and the
processing time job is also increased...
Runtime column propagation (RCP): If RCP is enabled for any job and specifically for those
stages whose output connects to the shared container input then meta data will be propagated
at run time so there is no need to map it at design time.
If RCP is disabled for the job in such case OSH has to perform Import and export every time
when the job runs and the processing time job is also increased.
Then you have to manually enter all the column description in each stage.RCP- Runtime
column propagation
Question:
Source:
Eno
1
2
3
Target
Ename
a,b
c,d
e,f
LOCATION
HYD
BAN
CHE
HYD
Eno
1
2
3
Ename
a
b
c
TCS
IBM
HCL
HCL
CHE
BAN
BAN
CHE
LIKE THIS.......
THEN THE OUTPUT LOOKS LIKE THIS....
Company loc count
TCS HYD 3
BAN
CHE
IBM HYD 3
BAN
CHE
HCL HYD 3
BAN
CHE
2)input is like this:
no,char
1,a
2,b
3,a
4,b
5,a
6,a
7,b
8,a
But the output is in this form with row numbering of Duplicate occurence
output:
no,char,Count
"1","a","1"
"6","a","2"
"5","a","3"
"8","a","4"
"3","a","5"
"2","b","1"
"7","b","2"
"4","b","3"
3)Input is like this:
file1
10
20
10
10
20
30
Output is like:
file2
file3(duplicates)
10
10
20
10
30
20
4)Input is like:
file1
10
20
10
10
20
30
Output is like Multiple occurrences in one file and single occurrences in one file:
file2
file3
10
30
10
10
20
20
5)Input is like this:
file1
10
20
10
10
20
30
Output is like:
file2
file3
10
30
20
6)Input is like this:
file1
1
2
3
4
5
6
7
8
9
10
Output is like:
file2(odd)
file3(even)
1
2
3
4
5
6
7
8
9
10
7)How to calculate Sum(sal), Avg(sal), Min(sal), Max(sal) with out using Aggregator stage..
8)How to find out First sal, Last sal in each dept with out using aggregator stage
9)How many ways are there to perform remove duplicates function with out using Remove
duplicate stage..
Scenario:
source has 2 fields like
COMPANY
IBM
TCS
IBM
HCL
TCS
IBM
HCL
HCL
LOCATION
HYD
BAN
CHE
HYD
CHE
BAN
BAN
CHE
LIKE THIS.......
THEN THE OUTPUT LOOKS LIKE THIS....
Company loc count
TCS HYD 3
BAN
CHE
IBM HYD 3
BAN
CHE
HCL HYD 3
BAN
CHE
Solution:
Seqfile......>Sort......>Trans......>RemoveDuplicates..........Dataset
Sort
Key=Company
Sort order=Asc
Company1:',':in.Location
create keychange=True
Trans:
create stage variable as Company1
Company1=If(in.keychange=1) then in.Location Else
RemoveDup:
Key=Company
Duplicates To Retain=Last
11)The input is
Shirt|red|blue|green
Pant|pink|red|blue
Output should be,
Shirt:red
Shirt:blue
Shirt:green
pant:pink
pant:red
pant:blue
Solution:
it is reverse to pivote stage
use
seq------sort------tr----rd-----tr----tg
in the sort stage use create key change column is true
in trans create stage variable=if colu=1 then key c.value else key v::colum
rd stage use duplicates retain last
tran stage use field function superate columns
similar Scenario: :
source
col1 col3
1 samsung
1 nokia
1 ercisson
2 iphone
2 motrolla
3 lava
3 blackberry
3 reliance
Expected Output
col 1 col2
col3 col4
1
samsung nokia ercission
2
iphone motrolla
3
lava
blackberry reliance
Ok
First Read and Load the data into your source file( For Example Sequential File )
And in Sort stage
Go to Transformer stage
Create one stage variable.
You can do this by right click in stage variable go to properties and name it as your
wish ( For example temp)
and in expression write as below
if keychange column =1
This column name is the one you want in the required column with delimited
commas.
On remove duplicates stage key is col1 and set option duplicates retain to--> Last.
in transformer drop col3 and define 3 columns like col2,col3,col4
in col1 derivation give Field(InputColumn,",",1) and
in col1 derivation give Field(InputColumn,",",2) and
in col1 derivation give Field(InputColumn,",",3)
Scenario:
12)Consider the following employees data as source?
employee_id, salary
------------------10,
1000
20,
2000
30,
3000
40,
5000
Create a job to find the sum of salaries of all employees and this sum should repeat for
all the rows.
The output should look like as
employee_id, salary, salary_sum
------------------------------10,
1000, 11000
20,
2000, 11000
30,
3000, 11000
40,
5000, 11000
Scenario:
I have two source tables/files numbered 1 and 2.
In the the target, there are three output tables/files, numbered 3,4 and 5.
Scenario:
Create a job to find the sum of salaries of all employees and this sum should repeat for
all the rows.
The output should look like as
employee_id, salary, salary_sum
------------------------------10,
1000, 11000
20,
2000, 11000
30,
3000, 11000
40,
5000, 11000
sltn:
Take Source --->Transformer(Add new Column on both the output links and assign a
value as 1 )------------------------>
1) Aggregator (Do group by
using that new column)
2)lookup/join( join on that new column)-------->tgt.
Scenario:
sno,sname,mark1,mark2,mark3
1,rajesh,70,68,79
2,mamatha,39,45,78
3,anjali,67,39,78
4,pavani,89,56,45
5,indu,56,67,78
out put is
sno,snmae,mark1,mark2,mark3,delimetercount
1,rajesh,70,68,79,4
2,mamatha,39,45,78,4
3,anjali,67,39,78,4
4,pavani,89,56,45,4
5,indu,56,67,78,4
seq--->trans--->seq
create one stage variable as delimiter..
and put derivation on stage as DSLink4.sno : "," : DSLink4.sname : "," :
DSLink4.mark1 : "," :DSLink4.mark2 : "," : DSLink4.mark3
and do mapping and create one more column count as integer type.
and put derivation on count column as Count(delimter, ",")
scenario:
sname
total_vowels_count
Allen
2
Scott
1
Ward
1
Under Transformer Stage Description:
total_Vowels_Count=Count(DSLink3.last_name,"a")+Count(DSLink3.last_name,"e")
+Count(DSLink3.last_name,"i")+Count(DSLink3.last_name,"o")
+Count(DSLink3.last_name,"u").
Scenario:
1)On daily we r getting some huge files data so all files metadata is same we have to
load in to target table how we can load?
Use File Pattern in sequential file
2) One column having 10 records at run time we have to send 5th and 6th record to
target at run time how we can send?
Can get through,by using UNIX command in sequential file filter option
How can we get 18 months date data in transformer stage?
Use transformer stage after input seq file and try this one as constraint in
transformer stage :
DaysSinceFromDate(CurrentDate(), DSLink3.date_18)<=548 OR
DaysSinceFromDate(CurrentDate(), DSLink3.date_18)<=546
where date_18 column is the column having that date which needs to be less or
equal to 18 months and 548 is no. of days for 18 months and for leap year it is
546(these numbers you need to check).
What is differences between Force Compile and Compile ?
My source is Like
Sr_no, Name
10,a
10,b
20,c
30,d
30,e
40,f
My target Should Like:
Target 1:(Only unique means which records r only once)
20,c
40,f
Target 2:(Records which r having more than 1 time)
10,a
10,b
30,d
30,e
How to do this in DataStage....
**************
use aggregator and transformer stages
source-->aggregator-->transformat-->target
perform count in aggregator, and take two op links in trasformer, filter data count>1 for one llink
and put count=1 for second link.
Scenario:
in my i/p source i have N no.of records
In output i have 3 targets
i want o/p like 1st rec goes to 1st targt and
2nd rec goes to 2nd target and
3rd rec goes to 3rd target again
4th rec goes to 1st taget ............ like this
do this ""without using partition techniques "" remember it.
*****************
source--->trans---->target
in trans use conditions on constraints
mod(empno,3)=1
mod(empno,3)=2
mod(empno,3)=0
Scenario:
im having i/p as
col A
a_b_c
x_F_I
DE_GH_IF
we hav to mak it as
col1 col 2 col3
abc
xfi
de gh if
*********************
Transformer
create 3 columns with derivation
col1 Field(colA,'_',1)
col2 Field(colA,'_',2)
col3 Field(colA,'_',3)
**************
Field function divides the column based on the delimeter,
if the data in the col is like A,B,C
then
Field(col,',',1) gives A
Field(col,',',2) gives B
Field(col,',',3) gives C
DataSet in DataStage
Inside a InfoSphere DataStage parallel job, data is moved around in data sets. These carry
meta data with them, both column definitions and information about the configuration that
was in effect when the data set was created. If for example, you have a stage which limits
execution to a subset of available nodes, and the data set was created by a stage using all
nodes, InfoSphere DataStage can detect that the data will need repartitioning.
If required, data sets can be landed as persistent data sets, represented by a Data Set stage
.This is the most efficient way of moving data between linked jobs. Persistent data sets are
stored in a series of files linked by a control file (note that you should not attempt to
manipulate these files using UNIX tools such as RM or MV. Always use the tools provided
with InfoSphere DataStage).
there are the two groups of Datasets - persistent and virtual.
The first type, persistent Datasets are marked with *.ds extensions, while for second type,
virtual datasets *.v extension is reserved. (It's important to mention, that no *.v files might be
visible in the Unix file system, as long as they exist only virtually, while inhabiting RAM
memory. Extesion *.v itself is characteristic strictly for OSH - the Orchestrate language of
scripting).
Further differences are much more significant. Primarily, persistent Datasets are being
stored in Unix files using internal Datastage EE format, while virtual Datasets are never
stored on disk - they do exist within links, and in EE format, but in RAM memory. Finally,
persistent Datasets are readable and rewriteable with the DataSet Stage, and virtual
Datasets - might be passed through in memory.
A data set comprises a descriptor file and a number of other files that are added as the data set
grows. These files are stored on multiple disks in your system. A data set is organized in
terms of partitions and segments.
Each partition of a data set is stored on a single processing node. Each data segment contains
all the records written by a single job. So a segment can contain files from many partitions,
and a partition has files from many segments.
Firstly, as a single Dataset contains multiple records, it is obvious that all of them must
undergo the same processes and modifications. In a word, all of them must go through the
same successive stage.
Secondly, it should be expected that different Datasets usually have different schemas,
therefore they cannot be treated commonly.
Alias names of Datasets are
1) Orchestrate File
2) Operating System file
And Dataset is multiple files. They are
a) Descriptor File
b) Data File
c) Control file
d) Header Files
In Descriptor File, we can see the Schema details and address of data.
In Data File, we can see the data in Native format.
And Control and Header files reside in Operating System.
2. Select the data set you want to manage and click OK. The Data Set Viewer appears.
From here you can copy or delete the chosen data set. You can also view its schema
(column definitions) or the data it contains.
3. SCD Type 2
4. Slowly changing dimension Type 2 is a model where the whole history is stored in
the database. An additional dimension record is created and the segmenting between
the old record values and the new (current) value is easy to extract and the history is
clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension
and the fact table usually stores dimension key and version number.
CUST_
CUST_
CUST_
CUST_
REC_
NAME
GROUP
_ID
TYPE_
ID
COUNTRY
_ID
VERSI
ON
DRBOUA
7
Dream
Basket
EL
PL
ETIMAA
5
ETL
tools
info
BI
FI
FAMMFA
0
Fajatso
FD
CD
FICILA0
First
Pactonic
FD
IT
FRDXXA
2
Frasir
EL
SK
GAMOPA
9
GGMOPA
9
Ganpa
LTD.
GG
electroni
FD
EL
C
S
US
RU
1
1
REC
_
EFF
DT
01102006
2909200
6
27092006
25092006
23092006
21092006
1909-
REC_
CURRENT_
IND
Y
Y
Y
Y
Y
Y
Y
cs
GLMFIA6
Glasithkli
ni
FD
PL
GLMPEA
9
Globitele
co
TC
FI
GONDW
A5
Goli
Airlines
BN
GB
2006
17091 2006
15091 2006
13091 2006
Y
Y
Y
The most important facts and stages of the CUST_SCD2 job processing:
The dimension table with customers is refreshed daily and one of the data sources is a text
file. For the purpose of this example the CUST_ID=ETIMAA5 differs from the one stored in
the database and it is the only record with changed data. It has the following structure and
data:
SCD 2 - Customers file extract:
There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming
from the text file.
A T001_Lookups transformer does a lookup into a hashed file and maps new and old
values to separate columns.
SCD 2 lookup transformer
A T003 transformer handles the UPDATE and INSERT actions of a record. The old record is
updated with current indictator flag set to no and the new record is inserted with current
indictator flag set to yes, increased record version by 1 and the current date.
SCD 2 insert-update record transformer
CUST_I
D
DRBOUA
7
ETIMAA
5
FAMMFA
0
FICILA0
FRDXXA
2
GAMOPA
9
GGMOPA
9
Ganpa
LTD.
GG
electroni
cs
GLMFIA6
Glasithkli
ni
FD
PL
GLMPEA
9
Globitele
co
TC
FI
GONDW
A5
Goli
Airlines
BN
GB
FD
US
EL
RU
21092006
19092006
17092006
15092006
13092006
Y
Y
Y
Y
Y
IBM INFOSPHERE
DATASTAGE PERFORMANCE
TUNING: OVERVIEW OF BEST
PRACTICES
INTRODUCTION
Data integration processes are very time and resource consuming. The
amount of data and the size of the datasets are constantly growing but
data and information are still expected to be delivered on-time.
Performance is therefore a key element in the success of a Business
Intelligence & Data Warehousing project and in order to guarantee the
agreed level of service, management of data warehouse performance and
performance tuning have to take a full role during the data warehouse and
ETL development process.
Tuning, however, is not always straightforward. A chain is only as strong
as its weakest link. In this context, there are five crucial domains that
require attention when tuning an IBM Infosphere DataStage environment :
System Infrastructure
Network
Database
A projection is then executed using the model selected. The results show
the total CPU needed, disk space requirements, scratch space
requirements, and other relevant information.
Different projections can be run with different data volumes and each can
be saved. Graphical charts are also available for analysis, which allow the
user to drill into each stage and each partition. A report can be generated
or printed with the estimations.
This feature will greatly assist developers in estimating the time and
machine resources needed for job execution. This kind of analysis can
help when analyzing the performance of a job, but IBM DataStage also
offers another possibility to analyze job performance.
Detailed charts are then available for that specific job run including:
Job timeline
Record Throughput
CPU Utilization
Job Timing
Job Memory Utilization
Physical Machine Utilization (shows what else is happening overall
on the machine, not just the DataStage activity).
Each partitions information is available in different tabs.
Transformer stages
It is best practice to avoid having multiple stages where the functionality
could be incorporated into a single stage, and use other stage types to
perform simple transformation operations. Try to balance load on
Transformers by sharing the transformations across existing Transformers.
This will ensure a smooth flow of data.
When type casting, renaming of columns or addition of new columns is
required, use Copy or Modify Stages to achieve this. The Copy stage, for
example, should be used instead of a Transformer for simple operations
including :
Job Design placeholder between stages
Renaming Columns
Dropping Columns
Implicit (default) Type Conversions
A developer should try to minimize the stage variables in a Transformer
stage because the performance of a job decreases as stage variables are
Sorting
A sort done on a database is usually a lot faster than a sort done in
DataStage. So if possible try to already do the sorting when reading
data from the database instead of using a Sort stage or sorting on the
input link. This could also mean a big performance gain in the job,
although it is not always possible to avoid needing a Sort stage in jobs.
Careful job design can improve the performance of sort operations, both in
standalone Sort stages and in on-link sorts specified in other stage types,
when not being able to make use of the database sorting power.
If data has already been partitioned and sorted on a set of key columns,
specify the dont sort, previously sorted option for the key columns in the
Sort stage. This reduces the cost of sorting and takes more advantage of
pipeline parallelism. When writing to parallel data sets, sort order and
partitioning are preserved. When reading from these data sets, try to
maintain this sorting if possible by using theSame partitioning method.
The stable sort option is much more expensive than non-stable sorts, and
should only be used if there is a need to maintain row order other than as
needed to perform the sort.
The performance of individual sorts can be improved by increasing the
memory usage per partition using the Restrict Memory Usage(MB) option
of the Sort stage. The default setting is 20 MB per partition. Note that sort
memory usage can only be specified for standalone Sort stages, it cannot
be changed for inline (on a link) sorts.
Sequential files
While handling huge volumes of data, the Sequential File stage can itself
become one of the major bottlenecks as reading and writing from this
stage is slow. Certainly do not use sequential files for intermediate storage
between jobs. It causes performance overhead, as it needs to do data
conversion before writing and reading from a file. Rather Dataset stages
should be used for intermediate storage between different jobs.
Datasets are key to good performance in a set of linked jobs. They help in
achieving end-to-end parallelism by writing data in partitioned form and
maintaining the sort order. No repartitioning or import/export conversions
are needed.
In order to have faster reading from the Sequential File stage the number
of readers per node can be increased (default value is one). This means,
for example, that a single file can be partitioned as it is read (even though
the stage is constrained to running sequentially on the conductor mode).
This is an optional property and only applies to files containing fixedlength records. But this provides a way of partitioning data contained in a
single file. Each node reads a single file, but the file can be divided
according to the number of readers per node, and written to separate
partitions. This method can result in better I/O performance on an SMP
(Symmetric Multi Processing) system.
It can also be specified that single files can be read by multiple nodes.
This is also an optional property and only applies to files containing fixed-
length records. Set this option to Yes to allow individual files to be read
by several nodes. This can improve performance on cluster systems.
IBM DataStage knows the number of nodes available, and using the fixed
length record size, and the actual size of the file to be read, allocates to
the reader on each node a separate region within the file to process. The
regions will be of roughly equal size.
The options Read From Multiple Nodes and Number of Readers Per
Node are mutually exclusive.
In all cases, the size of the reference data sets is a concern. If these take
up a large amount of memory relative to the physical RAM memory size of
the computer DataStage is running on, then a Lookup stage might crash
because the reference datasets might not fit in RAM along with everything
else that has to be in RAM. This results in a very slow performance since
each lookup operation can, and typically will, cause a page fault and an
I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A
join does a high-speed sort on the driving and reference datasets. This can
involve I/O if the data is big enough, but the I/O is all highly optimized and
sequential. After the sort is over, the join processing is very fast and never
involves paging or other I/O.
Databases
The best choice is to use Connector stages if available for the database.
The next best choice is the Enterprise database stages as these give
maximum parallel performance and features when compared to plug-in
stages. The Enterprise stages are:
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
SQLServer Enterprise
Sybase Enterprise
ODBC Enterprise
iWay Enterprise
Netezza Enterprise
Avoid generating target tables in the database from the IBM DataStage job
(that is, using the Create write mode on the database stage) unless they
are intended for temporary storage only. This is because this method does
not allow, for example, specifying target table space, and inadvertently
data-management policies on the database can be violated.
When there is a need to create a table on a target database from within a
job, use the Open command property on the database stage to explicitly
create the table and allocate table space, or any other options required.
CONCLUSION
Performance tuning can be a labor intensive and quite costly process. That
is exactly the reason why care for optimization and performance should be
taken into account from the beginning of the development process. With
the combination of best practices, performance guidelines and past
experience, the majority of performance problems can be avoided during
the development process.
If performance issues still occur even when performance guidelines have
been taken into account during development, then these issues can be
tackled and analyzed using the available, discussed tools such as
Resource Estimation and Performance Analysis functionalities.
LOOPING IN TRANSFORMER STAGE DATASTAGE 8.5 : EXAMPLE 2
Performing aggregation using a Transformer:
Code
ABC
ABC
ABC
DEF
For the first two records the function will return 0 but
for the last record ABC,3 it will return 1 indicating
that it is the last record for the group where student
name is ABC
GetSavedInputRecord(): This function returns the
record that was stored in cache by the function
SaveInputRecord()
Back to the task at hand, we need 7 stage variables to
perform the aggregation operation successfully.
1. LoopNumber: Holds the value of number of records
stored in cache for a student
2. LoopBreak: This is to identify the last record for a
particular student
3. SumSub1: This variable will hold the final sum of marks
for each student in subject 1
4. IntermediateSumSub1: This variable will hold the sum
of marks until the final record is evaluated for a student
(subject 1)
5. SumSub2: Similar to SumSub1 (for subject 2)
6. IntermediateSumSub2: Similar to IntermediateSumSub1
(for subject 2)
7. LoopBreakNum: Holds the value for the number of times
the loop has to run
To explain the above use of variables When the first record comes to stage variables, it is saved
in the cache using the function SaveInputRecord() in first
stage variableLoopNumber
The second stage variable checks if it is the last record for
this particular student, if it is it stores 1 else 0
The third SumSub1 is executed only if the record is the
last record
The fourth IntermediateSumSum1 is executed when the
input record is not the last record, thereby storing the
intermediate sum of the subject for a student
Fifth and sixth are the same as 3 and 4 stage variables
Seven will have the first value as 1 and for the second
record also if the same student is fetched it will change to
2 and so on
The loop variable will be executed until the final record for
a student is identified and the GetSavedInputRecord()
function will make sure the current record is processed
before the next record is brought for processing.
What the above logic does is for each and every record it
will send the sum of marks scored by each student to the
output. But our requirement is to have only one record per
student in the output.
So we simply add a remove duplicates stage and add the
student name as a primary key
Once done our job design is done it will look like this
Run the job and the output will be according to our initial
expectation
State
Name1
Name2
Name3
xy
FGH
Sam
Dean
Winchester
State
Name
xy
FGH
Sam
xy
FGH
Dean
xy
FGH
Winchester
In the adjacent image you can see a new box called Loop
Condition. This where we are going to control the loop
variables.
Below is the screenshot when we expand the Loop
Condition box
If@ITERATION=1ThenDSLink2.Name1ElseIf@ITERATION=2ThenDSLink2.Name2
ElseDSLink2.Name3
PARALLEL DEBUGGING IN
DATASTAGE 8.5
IBM has added a new debug feature for DataStage from
the 8.5 version. It provides basic debugging features for
testing/debugging the job design.
As you can see the above job design is very simple; I have
a row generator stage that will generate 10 records for
the column number
The final stage is the peek stage that we all have been
using for debugging our job design all these years.
Now that we understand the job desing, well start
debugging it.
Add breakpoints to links
Right click on the link where you wish to check/debug
the data and the following drop down menu appears
As you can see in the above screenshot, the data for that
particular link is displayed along with the corresponding
column name.
Once we are done analyzing the data we can request the
debugger to either go to the next record of run till the end
DATASTAGE COMMON
ERRORS/WARNINGS AND SOLUTIONS
3
1. While connecting Remote Desktop, Terminal server
has been exceeded maximum number of allowed
connections
SOL: In Command Prompt, type mstsc /v: ip address of
server /admin
OR
mstsc /v: ip
address /console
2. SQL20521N. Error occurred processing a conditional
compilation directive near string. Reason code=rc.
Following link has issue description:
http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?
topic=%2Fcom.ibm.db2.luw.messages.sql.doc%2Fdoc
%2Fmsql20521n.html
3. SK_RETAILER_GROUP_BRDIGE,1: runLocally() did not
reach EOF on its input data set 0.
SOL: Warning will be disappeared by regenerating SK
File.
4.
While connecting to Datastage client, there is no
response, and while restarting websphere services,
following errors occurred
[root@poluloro01 bin]# ./stopServer.sh server1 -user
wasadmin -password Wasadmin0708
ADMU0116I: Tool information is being logged in file
/opt/ibm/WebSphere/AppServer/profiles/default/logs/
server1/stopServer.log
ADMU0128I: Starting tool with the default profile
ADMU3100I: Reading configuration for server: server1
ADMU0111E: Program exiting with error:
javax.management.JMRuntimeException:
ADMN0022E: Access is denied for the stop operation on
Server MBean because of insufficient or empty
credentials.
ADMU4113E: Verify that username and password
information is on the command line
(-username and -password) or in the
<conntype>.client.props file.
ADMU1211I: To obtain a full trace of the failure, use the
-trace option.
a)
b)
c)
Click on Command
d)
e)
Select option 2
f)
g)
ps ef
| grep dsadm
cd /opt/ibm/InformationServer/server/DSEngine
-password
Output link
The output link has two columns. The jobname and one for
the function names. Per input row, there are 10 rows
outputted. Even if the fields are NULL. Divide the
fieldnames with spaces, not with any other delimiter.
If the date is valid date, it will return one else zero. But, if
we have the date as 1/1/1999 though the date is valid it
will return 0, as date and month fields have only one digit.
Therefore, we will create a Parallel Routine, which will
take date in various format viz. mm/dd/yyyy, m/dd/yyyy,
mm/d/yyyy, mm/dd/yy, m/d/yy etc and return it in a fixed
format of mmddyyyy, which we can use in the function
specified above.
C/C++ Code:
The code written below takes a date string in any of the
formats mentioned above and returns a date string in
mmddyyyy format (if the year is provided in two digits yy
then it will return 19nn if nn>=50 and 20nn if nn<50 as
the year part).
//FileName:date_format.cpp
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
char*date_format(charstr[20])
{
char*fst,*lst,date[10];
intlen,decade;
len=strlen(str);
fst=strchr(str,'/');
lst=strrchr(str,'/');
//handlingthedatepart
if((fststr)==1)
{
strcpy(date,"0");
strncat(date,str,1);
}
elseif((fststr)==2)
{
strncpy(date,str,2);
}
else
return("Invalid");
//handlingthemonthpart
if((lstfst1)==1)
{
strcat(date,"0");
strncat(date,fst+1,1);
}
elseif((lstfst1)==2)
{
strncat(date,fst+1,2);
}
else
return("Invalid");
//handlingtheyearpart
if((len(lststr)1)==2)
{
decade=atoi(lst+1);
if(decade>=50)
{
strcat(date,"19");
}
else
{
strcat(date,"20");
}
strncat(date,lst+1,2);
}
elseif((len(lststr)1)==4)
{
strncat(date,lst+1,4);
}
else
return("Invalid");
return(date);
}
DATASTAGE COMMON
ERRORS/WARNINGS AND SOLUTIONS
1
1.
While running ./NodeAgents.sh start command
getting the following error: LoggingAgent.sh process
stopped unexpectedly
SOL:
SOL:
cd/opt/ibm/InformationServer/Server/DSEngine/bin
./dsjobljobs<project_name>
6.
Error trying to query dsadm[]. There might be an
issue in database server
SOL:
SOL:
Go to
/opt/ibm/InformationServer/server/Projects/proj_name/
ls RT_SCT* then
rm f RT_SCTEMP
then try to restart it.
10.
While attempting to compile job, failed to invoke
GenRunTime using Phantom process helper
RC:
CODE
ERROR TOKEN
DESCRIPTION
DSJE_NOERROR
-1
DSJE_BADHANDLE
Invalid JobHandle.
-2
DSJE_BADSTATE
-3
DSJE_BADPARAM
-4
DSJE_BADVALUE
-5
DSJE_BADTYPE
-6
DSJE_WRONGJOB
CODE
0
ERROR TOKEN
DSJE_NOERROR
DESCRIPTION
No InfoSphere DataStage API error has
occurred.
current process.
-7
DSJE_BADSTAGE
-8
DSJE_NOTINSTAGE
-9
DSJE_BADLINK
-10
DSJE_JOBLOCKED
-11
DSJE_JOBDELETED
-12
DSJE_BADNAME
-13
DSJE_BADTIME
-14
DSJE_TIMEOUT
-15
DSJE_DECRYPTERR
-16
DSJE_NOACCESS
-99
DSJE_REPERROR
-100
DSJE_NOTADMINUSER
-101
DSJE_ISADMINFAILED
-102
DSJE_READPROJPROPERTY
-103
DSJE_WRITEPROJPROPERTY
-104
DSJE_BADPROPERTY
-105
DSJE_PROPNOTSUPPORTED
Unsupported property.
-106
DSJE_BADPROPVALUE
-107
DSJE_OSHVISIBLEFLAG
-108
DSJE_BADENVVARNAME
CODE
ERROR TOKEN
DESCRIPTION
DSJE_NOERROR
-109
DSJE_BADENVVARTYPE
-110
DSJE_BADENVVARPROMPT
No prompt supplied.
-111
DSJE_READENVVARDEFNS
-112
DSJE_READENVVARVALUES
-113
DSJE_WRITEENVVARDEFNS
-114
DSJE_WRITEENVVARVALUES
-115
DSJE_DUPENVVARNAME
-116
DSJE_BADENVVAR
-117
DSJE_NOTUSERDEFINED
-118
DSJE_BADBOOLEANVALUE
-119
DSJE_BADNUMERICVALUE
-120
DSJE_BADLISTVALUE
-121
DSJE_PXNOTINSTALLED
-122
DSJE_ISPARALLELLICENCED
-123
DSJE_ENCODEFAILED
-124
DSJE_DELPROJFAILED
-125
DSJE_DELPROJFILESFAILED
-126
DSJE_LISTSCHEDULEFAILED
CODE
ERROR TOKEN
DESCRIPTION
DSJE_NOERROR
-127
DSJE_CLEARSCHEDULEFAILED
-128
DSJE_BADPROJNAME
-129
DSJE_GETDEFAULTPATHFAILED
-130
DSJE_BADPROJLOCATION
-131
DSJE_INVALIDPROJECTLOCATION
-132
DSJE_OPENFAILED
-133
DSJE_READUFAILED
-134
DSJE_ADDPROJECTBLOCKED
-135
DSJE_ADDPROJECTFAILED
-136
DSJE_LICENSEPROJECTFAILED
-137
DSJE_RELEASEFAILED
-138
DSJE_DELETEPROJECTBLOCKED
-139
DSJE_NOTAPROJECT
-140
DSJE_ACCOUNTPATHFAILED
-141
DSJE_LOGTOFAILED
-201
DSJE_UNKNOWN_JOBNAME
1001
DSJE_NOMORE
1002
DSJE_BADPROJECT
ProjectName is not a
known InfoSphere DataStage project.
1003
DSJE_NO_DATASTAGE
1004
DSJE_OPENFAIL
CODE
ERROR TOKEN
DESCRIPTION
DSJE_NOERROR
1005
DSJE_NO_MEMORY
1006
DSJE_SERVER_ERROR
1007
DSJE_NOT_AVAILABLE
1008
DSJE_BAD_VERSION
1009
DSJE_INCOMPATIBLE_SERVER
ERROR NUMBER
DESCRIPTION
39121
39134
80011
80019
Now you will have to create an obj file for the above code
that can be done by using the command
/usr/vacpp/bin/xlC_rOcqspill=32704<FILENAME>
COMPILER SPECIFICATION
IN DATASTAGE
One of the pre-requisites to writing Parallel rotines using
C/C++ is to check for the compiler specification. This post
guides in seeing the C/C++ compiler available in our
installation. To find compiler, we need to login into
Administrator and go to the path below
.cp
p
DATASTAGE LEARNINGS-4:
REUSABILITY IN DATASTAGE
Below are some of the ways through which reusability can
be achieved in DataStage.
Multiple Instance Jobs.
Parallel Shared Container
After-job Routines.
Multiple-Instance Jobs:
Generally, in data warehousing, there would be
scenarios/requirement to load the maintenance tables in
addition to the tables in which we load the actual data.
These maintenance tables typically have data like
1.Count of records loaded in the actual target tables.
2.Batch Load Number/Day on which the load occurred
3.The last processed sequence id for a particular load
This information can also be used for data count
validation, email notification to users. So, for loading
these maintenance tables, it is always suggested to
design a single multiple-instance job instead of different
jobs with the same logic. In the Job Properties tab, check
the option Allow Multiple Instance. The job activity
which attach a multiple-instance job shows an additional
tab Invocation Id .So we can have different job
sequences triggering the same job with different
invocation ids. Note that a job triggered by different
After-job Routines:
After/Before job subroutines are types of routines which
run after/before the job to which the routine is attached.
We might have a scenario where in we shouldnt have any
of the input records to be rejected by any of the stages in
the job. So we design a job which have reject links for
different stages in the job and then code a common afterjob routine which counts the number of records in the
reject links of the job and aborts the job when the count
exceeds a pre-defined limit.
MAR
Below are the important points to be taken care of, for
proper restartability to be implemented in Job Sequence
Checkpoint information has to be maintained and
sequence has to be made restartable. So, the option
Add checkpoints so sequence is restartable in the
Job Sequence Properties, should be checked. Also
only then we would find the option Do not
checkpoint run under the Job tab in each of the
job activities present in the job sequence.
The option Do not checkpoint run in job activity
should not be checked because if another job later in
the sequencer fails, and the sequencer is restarted,
this job will not rerun. However if the logic demands,
then it has to be checked.
The option Reset if required, then run has to be
chosen in the job activity so that when an aborted
sequence is re-run then, the aborted job(due to
which sequence is aborted) will re-run successfully.
Below are the screenshots :
.$DSHOME/dsenv
proj=$1
jobname=$2
#Youhavetosearchwhereexactlyyourprojectisplaced.
#Inourcaseitisinbelowmentionedpath
cd/data/projectlib/$proj
echoLocationforthejob$jobnameis
$DSHOME/bin/uvshSELECTCATEGORYFROMDS_JOBSWHERENAME=$jobname;
Entire partition has all data across the nodes So while matching(in lookup) the records all data
should be present across all nodes.
For lookup sorting is not required.so when we are not using entire partition then reference data
splits into all nodes. Then each primary record need check with all nodes for matched reference
record.Then we face performance issue.If we use entire in lookup then one primary record needs
to look into 1 node is enough.if match found then that record goes to target otherwise it move to
reject,drop etc(based on requirement)no need check in another node.In this case if we are running
job in 4 nodes then at a time 4 records should process.
Note:Please remember we go for lookup only we have small reference data.If we go for big data
it is performance issue(I/O work will increase here) and also some times job will abort.
What is stable sort?
Since keeping track of relative record location means more work, setting Stable to "False" will
speed up performance.
Stable sort means "if you have two or more records that have the same exact keys, keep them in
the same order on output that they were on input".
HLD: It refers to the functionlity to be achieved to meet the client requirement. Precisely
speaking it is a
diagramatic representation of clients operational systems,staging areas, dwh n
datamarts. also how n what frequency
the data is extracted n loaded into the target database.
LLD: It is prepared for every mapping along with unit test plan. It contains the names of
source definitions, target
definitions, transformatoins used, column names, data types, business logic written n
source to target field
matrix, session name, mapping name.
How to handle values in Sequential file stage?
Open Sequential File---> Go to Format--->click on Field Defaults--->bottom right side you
will find AVAILABLE Properties to ADD, Under that select, "NULL FIELD VALUE" and give
the value as " 0" [zero]. You'll get the Null records in your output sequential file.
Have you ever used or use RCP option in your project?
Yes, to Load Data from Multiple Sources to Target Tables/Files with out Much
Transformation i.e Straight Load.
How do you protect your project?
In Administrator
Can aggregator and transformer stage used for sorting data? How
Yes , in Transformer , I want to take count in stage Variable for rec the records should be
sorted.
What are XML files? How do you read data from XML files and which stage to
be used?
5) Pipeline Parallelism
Pipeline Parallelism is the process, the extraction, transformation and loading will be occurred
simultaneously.
Re- Partitioning: The distribution of distributed data is Re-Partitioning.
Reverse Partitioning: Reverse Partitioning is called as Collecting.
Collecting methods are
Ordered
Round Robin
Sort Merge
Auto
a) Hash
b) Modulus
c) Range
d) DB2
Key Less Techniques are
a) Same
b) Entire
c) Round Robin
d) Random
PARTITIONS IN DATASTAGE
Partition Techniques are used in datastage to get good performance.
They are different types of Partition Techniques in datastag. They are
a) Key Based Partition Techniques
b) Key Less Partition Techniques
Key Less Techniques are
1) Same
2) Entire
3) Round Robin
4) Random
a) Same: This technique is used in oder to do not alter the existing partition technique in the previous
stage.
b) Entire: Each Partition gets the entire dataset. That is rows are duplicated.
c) Round Robin :In Round Robin Technique rows are evenly distributed among the Partition.
d) Random: Partition a row is assigned to is Random.
WHAT IS ETL PROJECT PHASE | PROJECT PHASE WITH ETL TOOL( DATASTAGE)
ETL Project contains with four phases to implement the project.
ETL Means Extraction Transformation Loading
ETL is the tool used to extract the data
Transformation the Job
And to Load the Data
It is used for Business Developments
And four phases are
1) Data Profiling
2) Data Quality
3) Data Transformation
4) Meta data management
Data Profiling:Data Profiling performs in 5 steps. Data Profiling will analysis weather the source data is good or dirty
or not.
And these 5 steps are
a) Column Analysis
b) Primary Key Analysis
c) Foreign Key Analysis
d) Cross domain Analysis
e) Base Line analysis
After completing the Analysis, if the data is good not a problem. If your data is dirty, it will be sent for
cleansing. This will be done in the second phase.
Data Quality:Data Quality, after getting the dirty data it will clean the data by using 5 different ways.
They are
a) Parsing
b) Correcting
c) Standardize
d) Matching
e) Consolidate
Data Transformation:After competing the second phase, it will gives the Golden Copy.
Golden copy is nothing but single version of truth.
That means , the data is good one now.
In 2nd Partition
20,20
In 3rd Partition
30
In 4th Partition
40
job. They are 1. APT_CONFIG_FILE
2. APT_RECORD_COUNT
3.
APT_DUMP_SCORE
How to create user defined environment variable (parameter)?u can create parameter in 2 ways
1.job level params
2.project level params.Job-Level: when u want to create job level params ,go to job properties and
use it.
Project -Level :when u want to create project level, go to DS-Admin and click on environment variable
and define which parameter u need and open design go to job properties and call those params
(which u defined in DS admin) in job properties and use it.
30 jobs are running in unix.i want to find out my job.how to do this? Give me command?
ps -ef|grep PID
how to move project from developement to uat?
By using the Datastage Manager we can move the project from Dev to Uat. Through datastage
manager Export the project into your local machine as .dsx format (project.dsx) from DEV server. The
same .dsx (project.dsx) import into UAT server by using the datastage manager.
How to do error handling in datastage?
Error handling can be done by using the reject file link.what are the errors coming through job needs
to be capture in sequential file and that file needs to be fetch in job which will load this exceptions or
errors in database.
How can u Call the Shell Scripting/Unix Commands in Job Sequence?
There are two scenarios where u myt want to call a script
Scenario 1(Dependency exists between script and a job): Where a job has to be executed first then
the script has to run, upon completion of script execution only the sec job has to be invoked. In this
case develop a sequencer job where first job activity will invoke the first job then using Execute
command activity call the script u would desire to invoke by typing "sh <script name>" in the
command property of the activity, then with the other job activity call the second job.
Scenario 2: (Script and job are independent) : In this case right in your parallel job say job1, under job
properties u can find "After-job subroutine" where u need to select "ExecSH" and pass the script
name which you would like to execute. By doing this once the job1 execution completes the script
gets invoked. The job succeeding the job1 say job2 doesnt wait for the execution of the script.
HOW TO REMOVE SPECIAL CHARACTERS DATA AND LOAD REST OF THE DATA
Here we are going to know how to remove Special characters data rows and load rest of the rows into
the target.
Some times we get the data with special characters added for some of the rows.
If we like to remove those special characters mixed rows in the column. We can use Alphafunction.
Alpha Function is used for this Job.
If we use "Alpha" function. It will drop the special characters mixed rows and loads the rest of the rows
into the target.
So you can take sequential file to read the data and you can take Transformer stage to apply business
logic.
In Transformer stage in Constrain you can write the Alpha function.
And Drag and Drop into the Target. .
Then Compile and Run.
True
4)TFIP000000
3
Agg_stg: When checking operator: When binding input
interface field CUST_ACT_NBR to field CUST_ACT_NBR: Implicit conversion
from source type string[5] to result type dfloat: Converting string to number.
Resolution: use the Modify stage explicitly convert the data type before sending to aggregator
stage.
5)Warning: A user defined sort operator does not satisfy the requirements.
Resolution:check the order of sorting columns and make sure use the same order when use
join stage after sort to joing two inputs.
6)TFTM000000
2
3
Stg_tfm_header,1: Conversion error calling conversion
routine timestamp_from_string data may have been lost
TFTM000000
1
xfmJournals,1: Conversion error calling conversion routine
decimal_from_string data may have been lost
Resolution:check for the correct date format or decimal format and also null values in the
date or decimal fields before passing to datastage StringToDate,
DateToString,DecimalToString or StringToDecimal functions.
7)TOSO000119
2
3
Join_sort: When checking operator: Data claims to already
be sorted on the specified keys the sorted option can be used to confirm this. Data will
be resorted as necessary. Performance may improve if this sort is removed from the
flow
Resolution: Sort the data before sending to join stage and check for the order of sorting keys
and join keys and make sure both are in the same order.
8)TFOR000000
2
1
Join_Outer: When checking operator: Dropping
component CUST_NBR because of a prior component with the same name.
Resolution:If you are using join,diff,merge or comp stages make sure both links have the
differnt column names other than key columns
9)TFIP000022
1
oci_oracle_source: When checking operator: When binding
output interface field MEMBER_NAME to field MEMBER_NAME: Converting a
nullable source to a non-nullable result;
Resolution:If you are reading from oracle database or in any processing stage where
incoming column is defined as nullable and if you define metadata in datastage as nonnullable then you will get above issue.if you want to convert a nullable field to non nullable
make sure you apply available null functions in datastage or in the extract query.
Null function in oracle:NVL,NVL2
Datastage:IsNull,NullToEmpty,NullToZero