Differences in Data Stage Tool

Server generates datastage BASIC, parallel generates Orchestrate shell script (osh) and C++, mainframe generates COBOL and JCL. Orchestrate itself is an ETL tool with extensive parallel processing capabilities and running on UNIX platform.

Uploaded by

pankaj55555
Copyright
© Attribution Non-Commercial (BY-NC)

Differences in Data stage Tool (PANKAJ DAS ORATOR---BANGALORE)

What is the difference between server jobs and parallel jobs?


--- Server jobs generate DataStage BASIC, parallel jobs generate Orchestrate shell script (osh) and C++, and mainframe
jobs generate COBOL and JCL.
--- In server and mainframe jobs you tend to do most of the work in the Transformer stage. In parallel jobs you tend to
use specific stage types for specific tasks (and the Transformer stage doesn't do lookups). There are many more stage
types for parallel than for server or mainframe, and parallel stages correspond to Orchestrate operators.
--- The parallel environment also partitions and collects data automatically, which would have to be managed manually
(if at all) in the server environment.

What is the difference between the Switch and Filter stages in DataStage?


Filter:
1) Conditions can be written on multiple fields.
2) It supports one input link and any number of output links.
Switch:
1) Conditions are written on a single field (column).
2) It supports one input link and up to 128 output links.
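
The routing difference can be sketched in Python. This is only an illustrative model, not DataStage code: the function names, the row format (dictionaries) and the link numbering are assumptions, and the filter model assumes a row is sent to every output whose condition it satisfies.

```python
# Illustrative model only; these functions are not DataStage APIs.

def filter_stage(rows, predicates):
    """Filter: one condition per output link; a condition may test any columns."""
    outputs = [[] for _ in predicates]
    for row in rows:
        for link, predicate in enumerate(predicates):
            if predicate(row):
                outputs[link].append(row)
    return outputs


def switch_stage(rows, selector, cases):
    """Switch: a single selector column; each case value maps to an output link."""
    outputs = {link: [] for link in set(cases.values())}
    unmatched = []  # rows whose selector value matches no case
    for row in rows:
        link = cases.get(row[selector])
        if link is None:
            unmatched.append(row)
        else:
            outputs[link].append(row)
    return outputs, unmatched


rows = [
    {"region": "EU", "amount": 50},
    {"region": "US", "amount": 500},
    {"region": "APAC", "amount": 5},
]

# Filter: conditions combine several columns.
big_us, small = filter_stage(
    rows,
    [
        lambda r: r["region"] == "US" and r["amount"] > 100,
        lambda r: r["amount"] < 100,
    ],
)

# Switch: every case tests the same column.
by_region, unmatched = switch_stage(rows, "region", {"EU": 0, "US": 1})
```

Here the filter conditions look at both region and amount, while the switch can only branch on the value of region.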

What is the difference between the hash and modulus partitioning methods?


Hash: records are distributed by a hash of the key column(s), so related records (those with the same key value) stay
together in one partition.

E.g. if the key is a month column, all records for a given month go to the same partition.
Modulus: the partition is the key value modulo the number of partitions. E.g. if the key is a numeric store number, all
customers related to one store go to the same partition.

For performance, use modulus if the key field is numeric, otherwise use hash. Modulus is simply the modulus operation
from mathematics, so it can be performed only on numeric fields. Hash can be used on fields of any data type; it always
assigns identical values to the same partition.

Modulus                    Hash
1. For numerics            1. For numerics and characters
2. Data type specific      2. Not data type specific
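
The two methods can be sketched in Python as follows. This is a minimal illustration, not DataStage's implementation: the function names are made up, and CRC32 is only a stand-in for whatever hash function the engine actually uses.

```python
import zlib


def modulus_partition(key: int, num_partitions: int) -> int:
    """Modulus: numeric keys only; partition = key mod number of partitions."""
    return key % num_partitions


def hash_partition(key, num_partitions: int) -> int:
    """Hash: works for any key type; equal keys always give the same partition.

    CRC32 is an assumed stand-in hash, chosen because it is deterministic.
    """
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % num_partitions


# Store number 10 with 4 partitions goes to partition 10 % 4 = 2.
store_partition = modulus_partition(10, 4)

# A character key cannot use modulus, but it can be hashed.
name_partition = hash_partition("STORE-A", 4)
```

Modulus skips the hashing step entirely, which is why it is the cheaper choice when the key is already an integer.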

What is the difference between a function and a procedure?


A function must return a value, but a procedure may or may not return a value.
A function can be used in a SELECT statement, whereas a procedure cannot.

Orchestrate vs DataStage Parallel Extender?


Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform.
DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate the parallel processing capabilities.
The DataStage vendor then purchased Orchestrate, integrated it with DataStage XE and released a new version,
DataStage 6.0, i.e. Parallel Extender.

Difference between a hashed file and a sequential file?

A hashed file stores data using a hash algorithm on a key value. A sequential file is just a file with no key column.
In server jobs a hashed file can be used as a reference for lookups; a sequential file cannot. Searching for a record is
much faster in a hashed file because the hash key gives the address of the record directly, whereas a sequential file
has to be searched record by record. Duplicate records can also be removed based on the hash key in a hashed file,
which is not possible in a sequential file.
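
The access-pattern difference can be illustrated with a Python dictionary standing in for a hashed file and a list standing in for a sequential file. This is an analogy only: the helper names are hypothetical, and the sketch assumes that a later record with the same key overwrites the earlier one.

```python
def build_hashed_file(records, key):
    """Keyed store: one entry per key value, addressed directly by the key."""
    hashed = {}
    for record in records:
        # Assumption: a later record with the same key replaces the earlier one,
        # which is how duplicates on the key disappear.
        hashed[record[key]] = record
    return hashed


def sequential_lookup(records, key, value):
    """Sequential file: no key, so scan record by record until a match."""
    for record in records:
        if record[key] == value:
            return record
    return None


records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "c"},
    {"id": 1, "name": "b"},  # duplicate key
]

hashed = build_hashed_file(records, "id")
direct = hashed[1]                               # one keyed access
scanned = sequential_lookup(records, "id", 2)    # walks the list
```

The keyed access cost does not grow with the number of records, while the scan does.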

What is the difference between a sequential file and a dataset? When to use the Copy stage?
The Sequential File stage is typically used for smaller amounts of data and can access a file with any extension,
whereas a dataset is used to store huge amounts of data and is opened only through a file with the extension .ds.
The Copy stage copies a single input dataset to a number of output datasets. Each record of the input dataset is
copied to every output dataset. Records can be copied without modification, or you can drop columns or change their
order.
Sequential file: you can view the data on the Unix box or with the View Data button.
You can delete the file in Unix with rm followed by the file name.

Dataset: you cannot read the data on the Unix box with ordinary tools, because it is stored in an internal format.
To delete a dataset, use $orchadmin < delete | del | rm > [-f | -x] descriptorfiles

What is the exact difference between the Join, Merge and Lookup stages?


The three stages differ mainly in the memory they use.
DataStage doesn't know how large your data is, so it cannot make an informed choice whether to combine data using a
Join stage or a Lookup stage. Here's how to decide which to use:
If the reference datasets are small enough to be held in memory, use a Lookup. If they are big enough to cause trouble,
use a Join. A Join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big
enough, but the I/O is all highly optimized and sequential. Once the sort is over, the join processing is very fast and
never involves paging or other I/O.
Unlike the Join and Lookup stages, the Merge stage allows you to specify several reject links, as many as there are
update input links.
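
The memory trade-off between the two approaches can be sketched in Python. This is a simplified model, not how the stages are implemented: the function names are illustrative, both functions do an inner join, and the lookup version assumes the reference keys are unique.

```python
def lookup_join(stream, reference, key):
    """Lookup style: hold the whole reference set in memory, probe per row."""
    table = {ref[key]: ref for ref in reference}  # memory grows with reference size
    return [{**row, **table[row[key]]} for row in stream if row[key] in table]


def sort_merge_join(left, right, key):
    """Join style: sort both inputs on the key, then merge them in one pass."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair this left row with every right row that has the same key.
            k = j
            while k < len(right) and right[k][key] == lk:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out


stream = [{"id": 2, "qty": 5}, {"id": 1, "qty": 7}, {"id": 3, "qty": 1}]
reference = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

via_lookup = lookup_join(stream, reference, "id")
via_join = sort_merge_join(stream, reference, "id")
```

After sorting, the merge only ever looks at the current position in each input, which is why a join copes with reference data that is too large to keep in memory.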

What is a data set? And what is a file set?


File set: the File Set stage allows you to read data from or write data to a file set. The stage can have a single input
link, a single output link, and a single rejects link. It only executes in parallel mode. The data files and the file that
lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the
size of a file and you need to distribute files among nodes to prevent overruns.
Data sets are used to bring data into parallel jobs, much as ODBC is used in server jobs.

Difference between Dataset and Fileset?


A dataset uses DataStage's internal format. The main points to consider before using a dataset are:
1) It stores data in binary, in DataStage's internal format, so reading from or writing to a dataset takes less time
than with any other source/target.
2) It preserves the partitioning scheme, so you don't have to partition the data again.
3) You cannot view the data without DataStage.

Now, about the fileset:


1) It stores data in a format similar to a sequential file.
2) The only advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.
3) You can view the data, but in the order defined by the partitioning scheme.

What is the difference between a file set and a data set?


Dataset: Datasets are operating system files, each referred to by a control file, which has the suffix .ds. PX jobs use
datasets to manage data within a job. The data in a dataset is stored in internal format.
A dataset consists of two parts:
Descriptor file: contains the metadata and the data location.
Data file: contains the data.
-- The Dataset stage is used to read data from or write data to a dataset. It allows you to store data in persistent
form, which can then be used by other jobs.
Fileset: DataStage can generate and name exported files, write them to their destination, and list the files it has
generated in a file with the extension .fs. The data files and the file that lists them are called a fileset.
A fileset consists of two parts:
Descriptor file: contains the location of the raw data files and the metadata.
Individual raw data files.
-- The Fileset stage is used to read data from or write data to a fileset.

Difference between DataStage and QualityStage?
DataStage is an ETL (Extract, Transform and Load) tool, while QualityStage is a data cleansing tool.

What is the difference between DataStage version 7.5 and version 8.1?
1. In 7.5.2, Manager is a separate client. In 8.0.1 there is no Manager client; its functions are embedded in the
Designer client.
2. In 7.5.2, QualityStage has a separate designer. In 8.0.1, QualityStage is integrated into Designer.
3. 7.5.2 requires operating system authentication. 8.0.1 requires operating system authentication and
DataStage authentication.
4. 7.5.2 has no range lookup. 8.0.1 has range lookup.
5. In 7.5.2, a single Join stage can't support multiple references. In 8.0.1, a single Join stage can support multiple
references.
6. In 7.5.2, when one developer has a job open, another developer cannot open the same job. In 8.0.1, the second
developer can open it as a read-only job.
7. In 8.0.1, a compare utility is available to compare two jobs, e.g. one in development and one in production. In
7.5.2 this is not possible.
8. Version 8 has the SCD stage; version 7 does not.
9. In 8.0.1, the Quick Find and Advanced Find features are available; in 7.5.2 they are not.
10. In 7.5.2, the first time a job runs, surrogate keys are generated from the initial value to n. When the same job is
compiled and run again, surrogate keys are again generated from the initial value to n; the surrogate key is not
incremented automatically across runs. In 8.0.1, the surrogate key is incremented automatically: a state file is used
to store the maximum value of the surrogate key.
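
The state-file idea in point 10 can be sketched in Python. This is a hypothetical model of the mechanism, not the format or API DataStage uses: the function name, the plain-text state file and its path are all assumptions.

```python
import os


def next_surrogate_keys(state_path, count):
    """Return `count` new keys, continuing from the maximum stored in the state file."""
    last = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            last = int(f.read().strip() or 0)
    keys = list(range(last + 1, last + count + 1))
    # Persist the new maximum so that the next run continues from here.
    with open(state_path, "w") as f:
        f.write(str(last + count))
    return keys
```

With an empty state, a first run asking for three keys gets 1, 2, 3, and a second run asking for two keys gets 4, 5. Without the state file, each run would start again from the initial value, which is the 7.5.2 behaviour described above.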

DataStage parallel jobs vs DataStage server jobs


1) The basic difference between server and parallel jobs is the degree of parallelism that PX offers.
2) In parallel jobs we can choose to display the OSH, which gives information about how the job works.
3) The parallel Transformer cannot take a reference link, whereas the server Transformer can. Parallel jobs can use
both BASIC and parallel-oriented functions.
4) Server jobs are executed by the DataStage server environment, whereas parallel jobs are executed under the control
of the DataStage runtime environment.
5) Server jobs are compiled into BASIC (interpreted pseudo-code) and parallel jobs are compiled to OSH (Orchestrate
shell script).
6) Debugging and testing stages are available only in Parallel Extender.
7) Lookup against a sequential file is possible in parallel jobs.
8) In parallel jobs we can specify multiple file paths to fetch data from by using a file pattern, similar to the Folder
stage in server jobs, while in server jobs we can specify one file name per output link.
9) There is a difference in the file size restriction.
Sequential file size in server: 2 GB
Sequential file size in parallel: no limit
10) The parallel Sequential File stage has filter options too, and lets you specify a file pattern.

PX takes advantage of both pipeline parallelism and partitioning parallelism. Pipeline parallelism
means that as soon as data is available between stages (in pipes or links), it can be exchanged between them without
waiting for the entire record set to be read. Partitioning parallelism means that the entire record set is partitioned
into smaller sets that are processed on different nodes (logical processors). For example, with 100 records and 4
logical nodes, each node would process about 25 records. This greatly increases the speed at which loading takes place.
Imagine situations where billions of records have to be loaded daily: this is where DataStage PX is a boon for the ETL
process.
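
Both ideas can be modelled in a few lines of Python. This is a conceptual sketch only: it runs in a single process, the function names are made up, and round-robin distribution is an assumption chosen because it gives the even 25-per-node split in the example.

```python
def round_robin_partition(records, num_nodes):
    """Partitioning: spread the record set evenly across logical nodes."""
    partitions = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        partitions[i % num_nodes].append(record)
    return partitions


def read_stage(records):
    """Upstream stage: hands over each record as soon as it is read."""
    for record in records:
        yield record


def transform_stage(rows):
    """Downstream stage: processes each row as it arrives, without waiting
    for the whole record set."""
    for row in rows:
        yield row * 2


# 100 records over 4 logical nodes: 25 records per node.
partitions = round_robin_partition(list(range(100)), 4)
sizes = [len(p) for p in partitions]

# Each partition flows through its own pipeline of stages.
results = [list(transform_stage(read_stage(p))) for p in partitions]
```

The generators model pipelining (a row moves downstream before the next one is read), and the list of partitions models the work that a real engine would run on separate nodes at the same time.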

What is the difference between stages and operators?
--- Stages are the generic user interface through which we read from and write to files and databases, and
troubleshoot and develop jobs; stages can also process data.
--- Operators are the basic functional units of an Orchestrate application. In the Orchestrate framework, a DataStage
stage generates an Orchestrate operator directly.

What is the difference between a routine, a transform and a function?

What is the difference between job control and a job sequence?


Job control is used specifically to control a job: through it we can pass parameters and conditions and handle log
file information, dashboard information, load recovery, etc.

A job sequence is used to run a group of jobs based on some conditions. For final/incremental processing we keep all
the jobs in a separate sequence and run them together by defining triggers.
