Ab Initio Architecture
By:
Arun Ravindranath
172055
L1/L2 Application Support
TCS Confidential
About Ab Initio
Ab Initio is a general-purpose data processing platform for
enterprise-class, mission-critical applications such as data
warehousing, clickstream processing, data movement, data
transformation and analytics.
It supports integration of arbitrary data sources and
programs, and provides complete metadata management
across the enterprise.
It is a proven, best-of-breed ETL solution.
Applications of Ab Initio:
ETL for data warehouses, data marts and operational
data sources.
Parallel data cleansing and validation.
Parallel data transformation and filtering.
High-performance analytics.
Real-time, parallel data capture.
Ab Initio Architecture
[Figure: layered architecture diagram. Labeled blocks:]
Applications
Ab Initio Metadata Repository
Application Development Environments: Graphical, C++, Shell
Component Library, User-defined Components, Third Party Components
Ab Initio Co>Operating System
Native Operating System: UNIX, Windows NT
Ab Initio Overview
[Figure: overview diagram showing users and four blocks:]
GDE: Create all your graphs. A graph, when deployed, generates a .ksh file.
Co>Operating System: Run all your graphs.
EME: Stores all variables in a repository, is also used for control, and
collects all metadata about graphs developed in the GDE.
DTM: Used to schedule graphs developed in the GDE. It can also maintain
dependencies between graphs.
Co>Operating System
The Co>Operating System is core software that unites a
network of computing resources (CPUs, storage disks,
programs, datasets) into a production-quality data
processing system with scalable performance and
mainframe reliability.
The Co>Operating System is layered on top of the native
operating systems of a collection of computers. It provides
a distributed model for process execution, file
management, process monitoring, checkpointing, and
debugging.
Co>Operating System
The Graphical Development Environment (GDE) provides a
graphical user interface to the services of the
Co>Operating System.
Scalability: Data parallelism yields speedups proportional
to the hardware resources provided. Ideally, doubling the
number of CPUs halves the execution time.
Flexibility: Provides a powerful and efficient data
transformation engine and an open component model for
extending and customizing Ab Initio's functionality.
Portability: Runs heterogeneously across a wide variety of
operating systems and hardware platforms.
Graphical Development Environment (GDE)
The GDE lets you create applications by dragging and dropping
components onto a canvas, configuring them with familiar,
intuitive point-and-click operations, and connecting them into
executable flowcharts.
These diagrams are architectural documents that
developers and managers alike can understand and use, but
they are not mere pictures: the Co>Operating System
executes these flowcharts directly. This means that there is a
seamless and solid connection between the abstract picture of
the application and the concrete reality of its execution.
Ab Initio Software Versions & File Extensions
Software Versions
Co>Operating System Version =>
GDE Version =>
File Extensions
.mp     Stored Ab Initio graph or graph component
.mpc    Program or custom component
.mdc    Dataset or custom dataset component
.dml    Data Manipulation Language file or record type definition
.xfr    Transform function file
.dat    Data file (either serial file or multifile)
Connecting to Co>op Server from GDE
Host Profile Setting
1. Choose Settings from the Run menu.
2. Check the Use Host Profile Setting checkbox.
3. Click the Edit button to open the Host Profile dialog.
4. If running Ab Initio on your local NT system, check the Local
   Execution (NT) checkbox and go to step 6.
5. If running Ab Initio on a remote UNIX system, fill in the
   path to the Host and the Host Login and Password.
6. Type the full path of the Host directory.
7. Select the Shell Type from the pull-down menu.
8. Test the login and, if necessary, make changes.
Host Profile
Enter the Host, Login, Password & Host directory.
Select the Shell Type.
Ab Initio Components
Components provided by Ab Initio. Datasets, Partition, Transform,
Sort, and Database are frequently used.
Creating Graph
Type the Label.
Specify the input .dat file.
Creating Graph - DML
Specify the .dml file.
Propagate from Neighbors: Copy record formats from a connected flow.
Same As: Copy record formats from a specific component's port.
Path: Store record formats in a local file, a host file, or the
Ab Initio repository.
Embedded: Type the record format directly in a string.
Creating Graph - DML
DML is Ab Initio's Data Manipulation Language.
DML describes data in terms of:
Record Formats that list the fields and format of input, output, and
intermediate records.
Expressions that define simple computations, for example, selection.
Transform Functions that control reformatting, aggregation, and other
data transformations.
Keys that specify groupings, ordering, and partitioning relationships
between records.
Editing a .dml file through the Record Format Editor Grid View.
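As a rough illustration only (this is Python, not DML syntax), the four ideas listed for DML can be mimicked with a small in-memory model. The `Customer` record, its field names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class Customer:
    """Stands in for a record format: the list of fields and their types."""
    customer_id: int
    name: str
    balance: float


def is_active(rec: Customer) -> bool:
    """Stands in for an expression: a simple computation used for selection."""
    return rec.balance > 0


def to_summary(rec: Customer) -> dict:
    """Stands in for a transform function: reformats one record."""
    return {"id": rec.customer_id, "name": rec.name.upper()}


# Stands in for a key: the field that drives ordering and grouping.
by_customer_id = attrgetter("customer_id")

records = [Customer(2, "bo", 0.0), Customer(1, "al", 12.5)]
selected = [
    to_summary(r) for r in sorted(records, key=by_customer_id) if is_active(r)
]
print(selected)
```

The point of the sketch is the separation of concerns: the record layout, the selection expression, the transform, and the key are each defined once and then combined.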
Creating Graph - Transform
A transform function is either a DML file or
a DML string that describes how you
manipulate your data.
Ab Initio transform functions mainly
consist of a series of assignment
statements. Each statement is called a
business rule.
When Ab Initio evaluates a transform
function, it performs the following tasks:
1. Initializes local variables.
2. Evaluates statements.
3. Evaluates rules.
Transform function files have the .xfr
extension.
Specify the .xfr file.
Creating Graph - xfr
Transform function: A set of rules that compute output values from
input values.
Business rule: Part of a transform function that describes how you
manipulate one field of your output data.
Variable: Optional part of a transform function that provides
storage for temporary values.
Statement: Optional part of a transform function that assigns
values to variables in a specific order.
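The parts of a transform function defined above can be pictured with a small Python analogue. This is a sketch, not .xfr syntax; the function name and the field names (`order_id`, `quantity`, `unit_price`) are hypothetical.

```python
def reformat_order(in_rec: dict) -> dict:
    """Hypothetical transform: computes one output record from one input record."""
    # Variable: temporary storage local to the transform (initialized first).
    line_total = 0.0

    # Statement: assigns a value to the variable (evaluated second).
    line_total = in_rec["quantity"] * in_rec["unit_price"]

    # Business rules: each one describes how a single output field is
    # computed (evaluated last).
    out_rec = {}
    out_rec["order_id"] = in_rec["order_id"]
    out_rec["total"] = round(line_total, 2)
    out_rec["is_large"] = line_total >= 100.0
    return out_rec


print(reformat_order({"order_id": 7, "quantity": 3, "unit_price": 40.0}))
```

The ordering inside the function mirrors the evaluation order the slides describe: variables, then statements, then rules.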
Sample Components
Sort
Dedup
Join
Replicate
Rollup
Filter by Expression
Merge
Lookup
Reformat etc.
Creating Graph - Sort Component
Specify the key for the sort.
Sort: The Sort component reorders data. It takes two parameters:
key and max-core.
Key: The key parameter describes the collation order.
Max-core: The max-core parameter controls how often the Sort
component dumps data from memory to disk.
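To see why a memory limit determines how often a sort writes to disk, here is a minimal external-sort sketch in Python. It is not how the Sort component is implemented. As a simplifying assumption, `max_core` here counts records rather than bytes, and all function names are hypothetical.

```python
import heapq
import pickle
import tempfile


def _spill(sorted_run):
    """Write one sorted run to a temporary file and return the open file."""
    f = tempfile.TemporaryFile()
    for rec in sorted_run:
        pickle.dump(rec, f)
    f.seek(0)
    return f


def _read_run(f):
    """Yield the records of a spilled run back in their stored order."""
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        f.close()


def external_sort(records, key, max_core):
    """Sort records, buffering at most max_core of them before spilling.

    Returns (sorted_list, number_of_spills).
    """
    runs, buffer = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_core:  # buffer full: dump a sorted run to disk
            runs.append(_spill(sorted(buffer, key=key)))
            buffer = []
    spills = len(runs)
    streams = [_read_run(f) for f in runs] + [iter(sorted(buffer, key=key))]
    return list(heapq.merge(*streams, key=key)), spills


print(external_sort([5, 3, 9, 1, 7, 2, 8], key=lambda x: x, max_core=3))
```

A smaller limit means more, shorter runs on disk; a limit larger than the input means no spill at all.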
Creating Graph - Dedup Component
The Dedup component removes duplicate records.
The dedup criterion is one of unique-only, first, or last.
Select the dedup criterion.
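The three dedup criteria can be illustrated with a short Python sketch. It assumes the input is already sorted (or at least grouped) on the key, and the function name `dedup_sorted` and the `keep` argument are hypothetical labels chosen for this example.

```python
from itertools import groupby


def dedup_sorted(records, key, keep="first"):
    """Remove duplicates from records that are already grouped by key.

    keep: "first", "last", or "unique-only" (drop every key that
    occurs more than once).
    """
    if keep not in ("first", "last", "unique-only"):
        raise ValueError(f"unknown keep value: {keep!r}")
    out = []
    for _, group in groupby(records, key=key):
        group = list(group)
        if keep == "first":
            out.append(group[0])
        elif keep == "last":
            out.append(group[-1])
        elif len(group) == 1:  # unique-only: keep keys seen exactly once
            out.append(group[0])
    return out


rows = [("a", 1), ("a", 2), ("b", 3)]
for mode in ("first", "last", "unique-only"):
    print(mode, dedup_sorted(rows, key=lambda r: r[0], keep=mode))
```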
Creating Graph - Replicate Component
Replicate combines the data records from its inputs into one flow
and writes a copy of that flow to each of its output ports.
Use Replicate to support component parallelism.
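A minimal Python analogue of this behavior, with flows modeled as plain lists. The function name and the list-based model are assumptions made for illustration.

```python
import copy
from itertools import chain


def replicate(inputs, n_outputs):
    """Combine all input flows into one, then return n_outputs copies of it."""
    combined = list(chain.from_iterable(inputs))
    # Each output gets its own copy so downstream steps cannot affect each other.
    return [copy.deepcopy(combined) for _ in range(n_outputs)]


outputs = replicate([[1, 2], [3]], n_outputs=2)
print(outputs)
```

Because every output is an independent copy, two different downstream components can work on the same data at the same time.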
Creating Graph - Join Component
Specify the key for the join.
Specify the type of join.
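The two settings named on this slide, the key and the join type, can be illustrated with a small Python sketch. The slide does not list the available join types, so the two used here ("inner" and "full_outer") and all names in the code are assumptions for illustration only.

```python
from collections import defaultdict


def join(left, right, key, join_type="inner"):
    """Join two lists of dict records on a shared key field.

    join_type: "inner" keeps only matching keys; "full_outer" keeps every
    record and pairs unmatched ones with None.
    """
    if join_type not in ("inner", "full_outer"):
        raise ValueError(f"unknown join type: {join_type!r}")
    right_by_key = defaultdict(list)
    for rrec in right:
        right_by_key[rrec[key]].append(rrec)

    out, matched = [], set()
    for lrec in left:
        matches = right_by_key.get(lrec[key], [])
        if matches:
            matched.add(lrec[key])
            out.extend((lrec, rrec) for rrec in matches)
        elif join_type == "full_outer":
            out.append((lrec, None))
    if join_type == "full_outer":
        out.extend((None, rrec) for rrec in right if rrec[key] not in matched)
    return out


customers = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 2, "item": "x"}, {"id": 3, "item": "y"}]
print(join(customers, orders, key="id", join_type="inner"))
```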
Database Configuration (.dbc)
A database configuration file is a file with a .dbc extension
that provides the GDE with the information it needs to connect
to a database. A configuration file contains the following
information:
The name and version number of the database to which
you want to connect.
The name of the computer on which the database
instance or server to which you want to connect runs,
or on which the database remote access software is
installed.
The name of the database instance, server, or provider
to which you want to connect.
You generate a configuration file by using the Properties
dialog box for one of the Database components.
Creating Parallel Applications
Types of Parallel Processing
Component-level Parallelism: An application with
multiple components running simultaneously on
separate data uses component parallelism.
Pipeline Parallelism: An application with multiple
components running simultaneously on the same data
uses pipeline parallelism.
Data Parallelism: An application with data divided into
segments that operates on each segment
simultaneously uses data parallelism.
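Data parallelism, as defined above, can be sketched in Python with a thread pool: the data is divided into segments and the same operation runs on every segment at once. The `clean` operation and the function names are hypothetical, and the sketch shows the structure only. It makes no claim about actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(segment):
    """The per-segment work: strip and upper-case each value."""
    return [value.strip().upper() for value in segment]


def run_data_parallel(data, n_partitions):
    """Split data into n_partitions segments and process them concurrently."""
    # Divide the data into segments (round-robin by position).
    segments = [data[i::n_partitions] for i in range(n_partitions)]
    # Apply the same operation to every segment at the same time.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        return list(pool.map(clean, segments))


print(run_data_parallel([" a", "b ", "c", " d "], n_partitions=2))
```

Component parallelism would instead run different operations on different data, and pipeline parallelism would chain operations so that a later step starts on records while an earlier step is still producing them.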
Partition Components
Partition by Expression: Divides data according to a DML
expression.
Partition by Key: Groups data by a key.
Partition with Load Balance: Balances load dynamically.
Partition by Percentage: Distributes data so that the output is
proportional to fractions of 100.
Partition by Range: Divides data evenly among nodes,
based on a key and a set of partitioning ranges.
Partition by Round-robin: Distributes data evenly, in
block-size chunks, across the output partitions.
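Three of these strategies are easy to picture in code. The Python sketch below is an illustration under assumptions, not the components' actual algorithm: the key partitioner uses a CRC32 hash, the range partitioner takes a sorted list of upper bounds called `splitters`, and all function names are hypothetical.

```python
import bisect
import zlib


def partition_by_key(records, key, n):
    """Records with the same key value always land in the same partition."""
    parts = [[] for _ in range(n)]
    for rec in records:
        digest = zlib.crc32(str(key(rec)).encode("utf-8"))
        parts[digest % n].append(rec)
    return parts


def partition_by_round_robin(records, n, block_size=1):
    """Deal records out in block_size chunks, cycling over the partitions."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[(i // block_size) % n].append(rec)
    return parts


def partition_by_range(records, key, splitters):
    """Partition i holds keys up to splitters[i]; the last holds the rest."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        parts[bisect.bisect_left(splitters, key(rec))].append(rec)
    return parts


print(partition_by_round_robin(list(range(6)), n=2, block_size=2))
print(partition_by_range([5, 10, 15, 25], key=lambda x: x, splitters=[10, 20]))
```

Key partitioning matters when a later step (for example a join or rollup) needs all records for one key in the same partition; round-robin only aims at an even spread.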
Departition Components
Concatenate: Produces a single output flow that contains first
all the records from the first input partition, then all the
records from the second input partition, and so on.
Gather: Collects inputs from multiple partitions in an
arbitrary manner and produces a single output flow. It does
not maintain sort order.
Interleave: Collects records from many sources in
round-robin fashion.
Merge: Collects inputs from multiple sorted partitions and
maintains the sort order.
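The difference between these components is the order in which records leave. A minimal Python sketch, with partitions modeled as lists and hypothetical function names:

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # marks exhausted partitions during interleaving


def concatenate(partitions):
    """All of partition 0, then all of partition 1, and so on."""
    return list(chain.from_iterable(partitions))


def interleave(partitions):
    """One record from each partition in turn (round-robin)."""
    out = []
    for row in zip_longest(*partitions, fillvalue=_MISSING):
        out.extend(rec for rec in row if rec is not _MISSING)
    return out


def merge(partitions, key=None):
    """Combine partitions that are each sorted, keeping the sort order."""
    return list(heapq.merge(*partitions, key=key))


parts = [[1, 7], [2, 3, 9], [5]]
print(concatenate(parts))
print(interleave(parts))
print(merge(parts))
```

Gather is not modeled separately: since it promises no particular order, any of the orderings above would be an acceptable Gather result.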
Multifile systems
A multifile system is a specially created set of directories,
possibly on different machines, which have identical
substructure.
Each directory is a partition of the multifile system. When a
multifile is placed in a multifile system, its partitions are
files within each of the partitions of the multifile system.
A multifile system can give better performance than a flat file
system because it can divide your data among multiple disks
or CPUs.
Typically (an SMP machine is the exception) a multifile system
is created with the control partition on one node and data
partitions on other nodes to distribute the work and
improve performance.
To do this, use full URLs that specify file and
directory names and locations on remote machines.
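The directory layout can be pictured with a small local Python sketch. It is a simplification under stated assumptions: all partitions live on one machine, the control partition is omitted, and the directory names (`part0`, `in_data`) and function names are hypothetical. It does not use any Ab Initio tooling.

```python
import tempfile
from pathlib import Path


def create_multifile_system(root, n_partitions, subdirs=("in_data",)):
    """Create partition directories with identical substructure under root."""
    partitions = []
    for i in range(n_partitions):
        part = Path(root) / f"part{i}"
        for sub in subdirs:
            (part / sub).mkdir(parents=True, exist_ok=True)
        partitions.append(part)
    return partitions


def write_multifile(partitions, relative_path, records):
    """Write one file per partition at the same relative path."""
    n = len(partitions)
    for i, part in enumerate(partitions):
        # Partition i receives every n-th record, starting at position i.
        lines = "".join(f"{rec}\n" for rec in records[i::n])
        (part / relative_path).write_text(lines)


with tempfile.TemporaryDirectory() as root:
    parts = create_multifile_system(root, n_partitions=2)
    write_multifile(parts, "in_data/customers.dat", ["a", "b", "c"])
    print([(p / "in_data/customers.dat").read_text() for p in parts])
```

The one logical file `in_data/customers.dat` exists as a separate physical file in each partition directory, which is what allows each partition to be read and processed independently.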
Multifile
SANDBOX
A sandbox is a collection of graphs and related files that
are stored in a single directory tree, and treated as a group
for purposes of version control, navigation, and migration.
A sandbox can be a file system copy of a datastore project.
In the graph, instead of specifying the entire path for a
file location, we specify only the sandbox parameter
variable, for example $AI_IN_DATA/customer_info.dat, where
$AI_IN_DATA contains the entire path with reference to the
sandbox $AI_HOME variable.
The actual in_data directory is $AI_HOME/in_data in the sandbox.
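How such a parameter reference resolves to a full path can be sketched in Python. The parameter names come from the text above; the sandbox root `/hypothetical/sandbox` and the `expand` helper are assumptions made for illustration, not part of Ab Initio.

```python
from string import Template


def expand(text, params, max_depth=10):
    """Expand $NAME references repeatedly until none remain."""
    for _ in range(max_depth):
        if "$" not in text:
            return text
        text = Template(text).substitute(params)
    if "$" in text:
        raise ValueError("parameters did not resolve; possible cycle")
    return text


sandbox_params = {
    "AI_HOME": "/hypothetical/sandbox",  # hypothetical sandbox root
    "AI_IN_DATA": "$AI_HOME/in_data",    # defined relative to AI_HOME
}
print(expand("$AI_IN_DATA/customer_info.dat", sandbox_params))
```

Changing only `AI_HOME` moves every dependent path at once, which is why graphs that reference parameters rather than absolute paths are easier to move between environments.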
SANDBOX
The sandbox provides an excellent mechanism to maintain
uniqueness while moving from the development to the production
environment by means of switch parameters.
We can define parameters in the sandbox that can be used
across all the graphs pertaining to that sandbox.
The topmost variable, $PROJECT_DIR, contains the path of
the home directory.
Deploying
Every graph, after validation and testing, has to be deployed
as a .ksh file into the run directory on UNIX.
This .ksh file is an executable file which is the backbone for
the entire automation/wrapper process.
The wrapper automation consists of .run, .env, dependency
list, job list, etc.
For a detailed description of the wrapper and the different
directories and files, please refer to the documentation on
the wrapper / UNIX presentation.