Ab Initio Introduction

Ab Initio is Latin for 'from the beginning'. Ab-Initio software works on the client-server model.

The client is called the Graphical Development Environment (GDE for short). It resides on the user's desktop. The server, or back end, is called the Co-Operating System. The Co-Operating System can reside on a mainframe or a remote UNIX machine.

The Ab-Initio code is called a graph, which has a .mp extension. The graph is deployed from the GDE as a corresponding .ksh script. On the Co-Operating System, that .ksh script is run to do the required job.

How an Ab-Initio Job Is Run

What happens when you push the 'Run' button?
1. Your graph is translated into a script that can be executed in the shell.
2. This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
3. The script is invoked (via REXEC or TELNET) on the server.
4. The script creates and runs a job that may run across many hosts.
5. Monitoring information is sent back to the GDE client.
Ab-Initio Environment

The advantage of Ab-Initio code is that it can run in both serial and multi-file system environments.
Serial Environment: the normal UNIX file system.
Multi-File System: a Multi-File System (MFS) is meant for parallelism. In an MFS, a particular file is physically stored across different partitions of a machine, or even across different machines, but it is pointed to by a logical file stored in the Co-Operating System. The logical file is the control file that holds the pointers to the physical locations.
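As a conceptual illustration only (a Python sketch, not the real mfs control-file format; the paths and record counting are hypothetical), the logical file can be thought of as a small control structure that lists the physical partitions, each of which is processed independently:

    # Conceptual sketch only -- not the actual mfs control-file format.
    # A "logical file" is modeled as a list of physical partition paths,
    # and each partition is processed independently (data parallelism).
    from multiprocessing import Pool

    CONTROL_FILE = {
        "logical_name": "customers.dat",
        "partitions": [                      # hypothetical partition locations
            "/data/part0/customers.dat",
            "/data/part1/customers.dat",
            "/data/part2/customers.dat",
            "/data/part3/customers.dat",
        ],
    }

    def count_records(path):
        # Each worker reads only its own partition.
        with open(path) as f:
            return sum(1 for _ in f)

    if __name__ == "__main__":
        with Pool(len(CONTROL_FILE["partitions"])) as pool:
            counts = pool.map(count_records, CONTROL_FILE["partitions"])
        print("total records:", sum(counts))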
About Ab-Initio Graphs: An Ab-Initio graph comprises a number of components that serve different purposes. Data is read or written by a component according to its dml (record format; do not confuse this with the database 'data manipulation language'). The most commonly used components are described in the following sections.

Co Operating System

The Co Operating System is a program provided by AbInitio which runs on top of the operating system and is the base for all AbInitio processes. It can be installed on a variety of system environments such as Unix, HP-UX, Linux, IBM AIX and Windows, and it provides additional features known as air commands. The AbInitio Co Operating System provides the following features:
- Manages and runs AbInitio graphs and controls the ETL processes
- Provides AbInitio extensions to the operating system
- ETL process monitoring and debugging
- Metadata management and interaction with the EME

AbInitio GDE (Graphical Development Environment)

GDE is a graphical application for developers which is used for designing and running AbInitio graphs. The ETL process in AbInitio is represented by AbInitio graphs; graphs are formed by components (from the standard components library or custom), flows (data streams) and parameters. The GDE also provides:
- A user-friendly frontend for designing Ab Initio ETL graphs
- The ability to run and debug Ab Initio jobs and trace execution logs
Compiling an AbInitio graph in the GDE generates a UNIX shell script which may be executed on a machine without the GDE installed.

AbInitio EME

Enterprise Meta Environment (EME) is an AbInitio repository and environment for storing and managing metadata. It provides the capability to store both business and technical metadata. EME metadata can be accessed from the Ab Initio GDE, a web browser, or the Ab Initio CoOperating System command line (air commands).

Conduct It

Conduct It is an environment for creating enterprise Ab Initio data integration systems. Its main role is to create Ab Initio Plans, which are a special type of graph constructed of other graphs and scripts. Ab Initio provides both a graphical and a command-line interface to Conduct It.

Data Profiler

The Data Profiler is an analytical application that can analyze data range, scope, distribution, variance, and quality. It runs in a graphical environment on top of the Co Operating System.

Component Library
The Ab Initio Component Library is a set of reusable software modules for sorting, data transformation, and high-speed database loading and unloading. It is a flexible and extensible tool which adapts at runtime to the formats of the records it receives and allows the creation and incorporation of new components obtained from any program, permitting integration and reuse of external legacy code and storage engines.

Questions Set 1

Informatica vs Ab Initio

Feature | Ab Initio | Informatica
About the tool | Code-based ETL | Engine-based ETL
Parallelism | Supports three types of parallelism (component, data and pipeline) | Supports one type of parallelism
Scheduler | No built-in scheduler | Scheduling through scripts available
Error handling | Can attach error and reject files per component | One log file for all
Robustness | More robust in terms of functionality | Basic in terms of robustness
Feedback | Provides performance metrics for each component executed | Debug mode, but slow implementation
Delimiters while reading | Supports multiple delimiters | Only a dedicated delimiter

Q. What is the relationship between the EME, the GDE and the Co-Operating System?
EME stands for Enterprise Meta Environment, GDE for Graphical Development Environment, and the Co-Operating System can be referred to as the Ab Initio server. The relationship among them is as follows: the Co-Operating System is the Ab Initio server and is installed on a specific OS platform, which is called the native OS. The EME is just like the repository in Informatica; it holds the metadata, transformations, dbconfig files, and source and target information. The GDE is the end-user environment where we develop the graphs (mappings, just like in the Informatica Designer); the developer uses the GDE to design graphs and saves them to the EME or a sandbox. The GDE is at the user side, whereas the EME is at the server side.

Q. What are the benefits of data processing according to you?

Well, processing of data provides a very large number of benefits. Users can easily separate out the many factors that matter to them. In addition to this, with the help of this approach, one can easily keep up the pace simply by deriving data into different structures from a totally unstructured format. Processing is also useful in eliminating various bugs that are often associated with the data and cause problems at a later stage. Other than this, data processing has wide application in a number of tasks.

Q. What exactly do you understand by the term data processing, and why can businesses trust this approach?
Processing is basically a procedure that simply converts data from a useless form into a useful one without a lot of effort. However, the same may vary depending on factors such as the size of the data and its format. A sequence of operations is generally carried out to perform this task and, depending on the type of data, this sequence could be automatic or manual. Because in the present scenario most of the devices that perform this task are PCs, the automatic approach is more popular than ever before. Users are free to obtain data in forms such as tables, vectors, images, graphs, charts and so on. This is one of the best things that business owners can simply enjoy.

Q. How is data processed and what are the fundamentals of this approach?
There are certain activities which require the collection of data, and the best thing is that processing largely depends on it in many cases. The fact is that data needs to be stored and analyzed before it is actually processed. This task depends on some major factors; they are:

1. Collection of Data
2. Presentation
3. Final Outcomes
4. Analysis
5. Sorting

These are also regarded as the basic fundamentals that can be trusted to keep up the pace in this matter.

Q. What would be the next step after collecting the data?


Once the data is collected, the next important task is to enter it into the concerned machine or system. Well, gone are the days when storage depended on paper. In the present time, data sizes are very large, and storage needs to be performed in a reliable manner. The digital approach is a good option for this, as it simply lets users perform this task easily and in fact without compromising anything. A large set of operations then needs to be performed for meaningful analysis. In many cases conversion also matters a great deal, and users are always free to consider the outcomes which best meet their expectations.

Q. What is a data processing cycle and what is its significance?


Data often needs to be processed continuously, and it is used at the same time; this is known as the data processing cycle. The cycle provides results that are quick or may take extra time depending on the type, size and nature of the data. This is boosting the complexity of this approach, and thus there is a need for methods that are more reliable and advanced than the existing approaches. The data cycle simply ensures that complexity is avoided to the greatest possible extent and without much effort.

Q. What are the factors on which storage of data depends?


Basically, it depends on the sorting and filtering. In addition to this, it largely depends on the
software one uses.

Q. Do you think effective communication is necessary in data processing? What is your strength in terms of the same?

The biggest ability that one could have in this domain is the ability to rely on the data or the information. Of course, communication matters a lot in accomplishing several important tasks such as the representation of the information. There are many departments in an organization, and communication ensures that things are good and reliable for everyone.

Q. Suppose we assign you a new project. What would be your initial point and the key steps that you follow?
The first thing that largely matters is defining the objective of the task and then engaging the team in it. This provides a solid direction for the accomplishment of the task. This is important when one is working on a set of data which is completely unique or fresh. After this, the next big thing that needs attention is effective data modeling. This includes finding the missing values and data validation. The last thing is to track the results.

Q. Suppose you find the term Validation mentioned with a set of data. What does that simply represent?
It represents that the concerned data is clean and correct and can thus be used reliably without worrying about anything. Data validation is widely regarded as one of the key points in the processing system.
Q. What do you mean by data sorting?
It is not always necessary that data remains in a well-defined sequence; in fact, it is often a random collection of objects. Sorting is nothing but arranging the data items in desired sets or in a desired sequence.
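As a simple illustration (a minimal Python sketch, not Ab Initio code; the record layout and field names are made up), sorting rearranges a random collection of records into a defined sequence on a chosen key:

    # Records arrive in no particular order.
    records = [
        {"cust_id": 3, "amount": 250.0},
        {"cust_id": 1, "amount": 120.5},
        {"cust_id": 2, "amount": 310.0},
    ]

    # Sorting arranges the items in a desired sequence, here ascending by cust_id.
    sorted_records = sorted(records, key=lambda r: r["cust_id"])
    print(sorted_records)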

Q. Name the technique you can use to simply combine multiple data sets?
It is known as Aggregation
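For illustration, here is a minimal Python sketch of aggregation (the data sets and field names are hypothetical), combining records from multiple data sets and summarizing them by key:

    from collections import defaultdict

    # Two hypothetical input data sets.
    sales_q1 = [("east", 100), ("west", 80)]
    sales_q2 = [("east", 120), ("west", 90), ("north", 40)]

    # Aggregation combines the sets and summarizes the values per key.
    totals = defaultdict(int)
    for region, amount in sales_q1 + sales_q2:
        totals[region] += amount

    print(dict(totals))   # {'east': 220, 'west': 170, 'north': 40}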

Q. How is scientific data processing different from commercial data processing?


Scientific data processing simply means data processing with a great amount of computation, i.e. arithmetic operations. In this, a limited amount of data is provided as input and bulk data is there at the outcome. On the other hand, commercial data processing is different: the outcome is limited compared to the input data, and the computational operations are limited.

Q. What are the benefits of data analyzing


It ensures the following:

1. Explanation of development related to the core tasks can be assured
2. Testing hypotheses with an integration approach is always possible
3. Pattern detection in a reliable manner

Q. What are the key elements of a data processing system?


These are the Converter, Aggregator, Validator, Analyzer, Summarizer, and Sorter.

Q. Name any two stages of the data processing cycle and provide your answer in terms of a comparative study of them.
The first is Collection and the second one is Preparation of data. Of course, collection is the first stage and preparation is the second in a cycle dealing with data processing. The first stage provides a baseline to the second, and the success and simplicity of the second depend on how accurately the first has been accomplished. Preparation is mainly the manipulation of important data. Collection breaks data sets apart, while preparation joins them together.
Q. What do you mean by overflow errors?
While processing data, calculations that are bulky are often present, and it is not always necessary that they fit in the memory allocated for them. For example, if a value needing more than 8 bits is stored in an 8-bit field, this error simply results.
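A minimal Python sketch of the idea (using ctypes to emulate an 8-bit signed field; the values are illustrative only):

    import ctypes

    field = ctypes.c_int8(127)      # largest value an 8-bit signed field can hold
    field.value = field.value + 1   # 128 does not fit in 8 bits...
    print(field.value)              # ...so the stored value wraps around to -128 (overflow)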

Q. What are the facts that can compromise data integrity?


There are several errors that can cause this issue and give rise to many other problems. These are:

- Bugs and malware
- Human error
- Hardware error
- Transfer errors, which generally include data compression beyond a limit

Q. What is data encoding?


Data needs to be kept confidential in many cases, and it can be done through this approach. It simply ensures that information remains in a form that no one other than the sender and the intended receiver can understand.

Q. What does EDP stand for?


It means Electronic Data Processing

Q. Name one method which is generally used by remote workstations when it comes to processing.
Distributed processing

Q. What do you mean by a transaction file and how is it different from a Sort file?
A transaction file is generally considered to hold input data for the time when a transaction is under process; all the master files can simply be updated with it. A sort file, on the other hand, is used to assign a fixed location to the data files.
Q. What is the use of aggregation when we have rollup as we know rollup component in
Ab Initio is used to summarize a group of data records. Then where will we use aggregation?

Aggregation and Rollup can both summarize the data, but rollup is much more convenient to use. In terms of understanding how a particular summarization is happening, rollup is much more explanatory compared to aggregate. Rollup can also do some other functionality, like input and output filtering of records. Aggregate and rollup perform the same action, but rollup displays intermediate results in main memory, whereas aggregate does not support intermediate results.


Q. What kinds of layouts does Ab Initio support?


Basically, there are serial and parallel layouts supported by AbInitio, and a graph can have both at the same time. The parallel layout depends on the degree of data parallelism: if the multi-file system is n-way parallel, then a component with that layout runs n ways in parallel, i.e. the depth of the layout is defined to be the same as the degree of parallelism.

Q. How do you add default rules in a transformer?


Double click on the transform parameter in the Parameters tab of the component properties; this will open the Transform Editor. In the Transform Editor, click on the Edit menu and then select Add Default Rules from the dropdown. It will show two options: 1) Match Names 2) Use Wildcard.

Q. Do you know what a local lookup is?


If your lookup file is a multifile and is partitioned/sorted on a particular key, then a local lookup function can be used ahead of the ordinary lookup function call. It is local to a particular partition, depending on the key.
A lookup file consists of data records which can be held in main memory. This makes the transform function retrieve the records much faster than retrieving them from disk, and it allows the transform component to process the data records of multiple files quickly.
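Conceptually (a hedged Python sketch, not Ab Initio DML; the key and field names are made up), a lookup file behaves like an in-memory index keyed on the lookup key, so the transform avoids a disk read for every record:

    # Build the in-memory lookup once, keyed on cust_id (analogous to a lookup file).
    lookup_rows = [
        {"cust_id": 1, "region": "east"},
        {"cust_id": 2, "region": "west"},
    ]
    lookup_index = {row["cust_id"]: row for row in lookup_rows}

    # The "transform" then resolves each incoming record from memory, not from disk.
    def enrich(record):
        match = lookup_index.get(record["cust_id"])
        record["region"] = match["region"] if match else None
        return record

    print(enrich({"cust_id": 2, "amount": 99.0}))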

Q. What is the difference between a look-up file and a look-up, with a relevant example?

Generally, a lookup file represents one or more serial files (flat files). The amount of data is small enough to be held in memory. This allows transform functions to retrieve records much more quickly than they could be retrieved from disk.
Q. How many components are in your most complicated graph?

It depends on the type of components you use. Usually, avoid using an overly complicated transform function in a graph.

Q. Have you worked with packages?


Multistage transform components use packages by default. However, users can create their own set of functions in a transform function and can include this in other transform functions.

Q. Can sorting and storing be done through a single piece of software, or do you need different software for these approaches?
Well, it actually depends on the type and nature of the data. Although it is possible to accomplish both these tasks through the same software, many software packages have their own specialization, and it would be good to adopt such an approach to get quality outcomes. There are also some pre-defined sets of modules and operations that largely matter. If the conditions imposed by them are met, users can perform multiple tasks with the same software. The output file can be provided in various formats.

Q. What are the different forms of output that can be obtained after processing of data?
These are

1. Tables
2. Plain Text files
3. Image files
4. Maps
5. Charts
6. Vectors
7. Raw files

Sometimes data is required to be produced in more than one format and therefore the
software accomplishing this task must have features available in it to keep up the pace in
this matter.

Q. Give one reason when you need to consider multiple data processing?
When the required files are not the complete outcomes which are required and need
further processing.

Q. What are the types of data processing you are familiar with?
The very first one is the manual data approach. In this, the data is generally processed without any dependency on a machine, and thus it contains several errors. In the present time this technique is not generally followed, or only limited data is processed with this approach. The second type is mechanical data processing. Mechanical devices have some important roles in this approach; when the data is a combination of different formats, this approach is adopted. The next approach is electronic data processing, which is regarded as the fastest and is widely adopted in the current scenario. It has top accuracy and reliability.

Q. Name the different types of processing based on the steps that you know about?
They are:

1. Real-Time processing
2. Multiprocessing
3. Time Sharing
4. Batch processing
5. Adequate Processing

Q. Why do you think data processing is important?


The fact is that data is generally collected from different sources, and thus it may vary largely in a number of terms. This data needs to be passed through various analyses and other processes before it is stored. This process is not as easy as it seems in most cases; thus, processing matters. A lot of time can be saved by processing the data to accomplish the various tasks that largely matter. The dependency on various factors for reliable operation can also be avoided to a good extent.

Q. What is common between data validity and data integrity?


Both these approaches deal with data errors and ensure a smooth flow of the operations that largely matter.
Q. What do you mean by the term data warehousing? Is it different from Data Mining?
Many times there is a need for data retrieval, and warehousing can simply be considered to assure the same without affecting the efficiency of operational systems. It simply supports decision making and works alongside business applications such as Customer Relationship Management within the warehouse architecture. Data mining is closely related to this approach; it assures simple finding of the required information from the warehouse.

Q. What exactly do you know about the typical data analysis?


It generally involves the organization as well as the collection of the important data files. The main aim is to know the exact relation between the full industrial data and the portion which is analyzed. Some experts also call it one of the best available approaches to find errors. It entails the ability to spot problems and enables the operator to find out the root causes of the errors.

Q. Have you used the rollup component? Describe how?

If the user wants to group the records by specific field values, then rollup is the best way to do that. Rollup is a multi-stage transform function and it contains the following mandatory functions:

1. Initialize
2. Rollup
3. Finalize

You also need to declare a temporary variable if you want to get counts of a particular group. For each group, it first calls the initialize function once, followed by rollup function calls for each of the records in the group, and finally calls the finalize function once at the end of the last rollup call, as sketched below.
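A hedged Python sketch of that call sequence (not Ab Initio DML; the record layout and the count/sum logic are illustrative assumptions):

    from itertools import groupby

    records = [
        {"dept": "A", "salary": 100},
        {"dept": "A", "salary": 200},
        {"dept": "B", "salary": 150},
    ]

    def initialize(key):
        # Called once per group: sets up the temporary variables (e.g. a count).
        return {"dept": key, "count": 0, "total": 0}

    def rollup(temp, record):
        # Called once per record in the group.
        temp["count"] += 1
        temp["total"] += record["salary"]
        return temp

    def finalize(temp):
        # Called once at the end of the last rollup call for the group.
        return temp

    results = []
    for key, group in groupby(sorted(records, key=lambda r: r["dept"]),
                              key=lambda r: r["dept"]):
        temp = initialize(key)
        for record in group:
            temp = rollup(temp, record)
        results.append(finalize(temp))

    print(results)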

Q. How do you add default rules in a transformer?

Add Default Rules opens the Add Default Rules dialog. Select one of the following: Match Names, which generates a set of rules that copies input fields to output fields with the same name, or Use Wildcard (.*) Rule, which generates one wildcard rule that copies input fields to output fields with the same name.
1) If it is not already displayed, display the Transform Editor grid.
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.
In the case of a Reformat, if the destination field names are the same as, or a subset of, the source field names, then there is no need to write anything in the reformat xfr, unless you want to do a real transform beyond reducing the set of fields or splitting the flow into a number of flows.

Q. What is the difference between partitioning by key and round-robin partitioning?

Partition by Key (hash partition): this is a partitioning technique used to partition data when the keys are diverse. If one key value is present in large volume, there can be large data skew, but this method is used more often for parallel data processing.
Round-robin partition is another partitioning technique, used to distribute the data uniformly across the destination data partitions. The skew is zero in this case when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is distributed among 4 players in a round-robin manner.
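A minimal Python sketch contrasting the two techniques (illustrative only; 4 partitions and the key values are assumptions):

    PARTITIONS = 4
    records = [{"key": k} for k in ["a", "b", "c", "a", "a", "d", "e", "a"]]

    # Partition by key: the same key always lands in the same partition,
    # so a heavily repeated key can cause data skew.
    by_key = [[] for _ in range(PARTITIONS)]
    for rec in records:
        by_key[hash(rec["key"]) % PARTITIONS].append(rec)

    # Round robin: records are dealt out like a pack of cards,
    # so the partitions stay evenly loaded regardless of the key values.
    round_robin = [[] for _ in range(PARTITIONS)]
    for i, rec in enumerate(records):
        round_robin[i % PARTITIONS].append(rec)

    print([len(p) for p in by_key])       # may be skewed
    print([len(p) for p in round_robin])  # evenly balanced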

Q. How do you improve the performance of a graph?


There are many ways in which the performance of a graph can be improved:
1) Use a limited number of components in a particular phase
2) Use optimum values of max-core for the sort and join components
3) Minimize the number of sort components
4) Minimize sorted joins and, if possible, replace them with in-memory joins/hash joins
5) Use only the required fields in the sort, reformat and join components
6) Use phasing/flow buffers in the case of merges and sorted joins
7) If the two inputs are huge then use a sorted join; otherwise use a hash join with the proper driving port
8) For large datasets, do not use broadcast as a partitioner
9) Minimize the use of regular expression functions like re_index in the transform functions
10) Avoid repartitioning data unnecessarily
11) Try to run the graph as long as possible in MFS; for this, the input files should be partitioned and, if possible, the output file should also be partitioned
Q. How do you truncate a table?
From Ab Initio, run a SQL component that issues the DDL statement 'truncate table', or use the Truncate Table component in Ab Initio.

Q. Have you ever encountered an error called 'depth not equal'?


When two components are linked together, if their layout does not match, then this problem
can occur during the compilation of the graph. A solution to this problem would be to use a
partitioning component in between if there was a change in layout.

Q. What are primary keys and foreign keys?


In an RDBMS, the relationship between two tables is represented as a primary key and foreign key relationship, where the primary key table is the parent table and the foreign key table is the child table. The criterion for both tables is that there should be a matching column.

Q. What is an outer join?

An outer join is used when one wants to select all the records from a port–whether it has
satisfied the join criteria or not.
