Data Profiling:
Data profiling is a technique used to analyze the structure of source data.
Content of source data and quality of source data.
By using the Data Profiling task, we can
• Analyze the source data more effectively by validating business rules
on source data.
• For better understand the structure of source data.
• Prevent data quality problems before they are introduced into the data
warehouse.
Data quality is important to every business.
Defining the metrics for data quality can be difficult, because they
are specific to the domain or specific to the application or specific to the
business. One common approach to defining data quality is data profiling.
Data Profiling Warehouse:
The Data Profiling warehouse is a set of tables that stores the results
which are generated from profile sessions. You can create a Data Profiling
warehouse on any relational database that PowerCenter supports as a source or
target database. Create a Data Profiling warehouse for each PowerCenter
repository you want to store data profiles in.
Data Profiling Process
The following steps describe the data profiling process:
1. Create a data profile. Use the Profile Wizard in the Designer to create a
data profile based on a source definition and a set of functions. The
Profile Wizard generates a mapping and a session based on criteria that
you provide.
2. Run the profile session. You can choose to run the profile session when
Page1
you finish the Profile Wizard, or you can run it from the Profile Manager.
The Integration Service runs the session and loads the profile results to
the Data Profiling warehouse.
3. View the reports. View the Data Profiling report associated with the
profile session. Based on the type of profile report, you can view reports
from the Profile Manager or from Data Analyzer.
Steps for Profiling Source Data
1. Create a data profile.
2. Run a profile session.
3. View profile reports.
1. Create a Data Profile
To profile source data, you create a data profile based on a source or
Mapplet in the repository. Data profiles contain functions that perform
calculations on the source data. For example, you can use a function to
validate business rules in a data profile. You can apply profile functions to
a column within a source, to a single source, or to multiple sources.
You can create the following types of data profiles:
Auto profile: Contains a predefined set of functions for profiling
source data. Use an auto profile during mapping development to learn more
about source data.
Creating an Auto Profile
Create an auto profile to learn more about source data or mapplet output data
during mapping development. When you create an auto profile, the Designer
creates a data profile with the following functions:
Aggregate functions: Calculates an aggregate value for numeric or
string values in a column. Use aggregate functions to count null values,
determine average values, determine minimum or maximum values, and minimum
and maximum length for strings values.
Candidate Key Evaluation: Calculates the number and percentage
of unique values in one or more source columns.
Distinct Value Count: Returns the number of distinct values for the
column. You can configure the auto profile to load verbose data to the Data
Profiling warehouse.
Domain Inference: Reads all values in a column and infers a pattern
that fits the data. You can configure the Profile Wizard to filter the Domain
Inference results.
Redundancy Evaluation: Calculates the number of duplicate
values in one or more source columns.
Row Count: Counts the number of rows read from the source during the
profile session. When you create a data profile that uses the Row Count
function with data samples, the Row Count function estimates the total row
count.
Custom profile: A data profile you define with the functions you need
to profile source data. Use a custom profile during mapping development to
Page1
validate documented business rules about the source data. You can also use a
custom profile to monitor data quality or validate the results of BI reports.
You use the Designer to create a data profile. When you create a profile, the
Designer generates a mapping and a session based on the profile information.
You can configure a data profile to write verbose data to the Data Profiling
warehouse during a profile session.
Verbose data provides more details about the data those results from a
profile function. For example, for a function that validates business rules,
verbose data may include the invalid rows in the source. For a function that
determines the number of distinct values, verbose data can include a list of
distinct values.
After you create a data profile, you can view profile details from the
Profile Manager. You can also edit and delete the data profile.
Steps to Create an Auto Profile:
When you create an auto profile, you can profile groups or columns in the
source. Or, you can profile the entire source. Auto profiling large sources
impacts performance.
To create an auto profile:
1. Select the source definition in the Source Analyzer or mapplet in
the Mapplet Designer you want to profile.
2. Launch the Profile Wizard from the following Designer tools:
Source Analyzer. Click Sources > Profiling > Create Auto Profile.
Mapplet Designer. Click Mapplets > Profiling > Create Auto Profile.
Inference tuning settings.
3. Optionally, click Description to add a description for the data
profile. Click OK.
Enter a description up to 200 characters.
4. Optionally, select the groups or columns in the source that you
want to profile.
By default, all columns or groups are selected.
5. Select Load Verbose Data if you want the Integration Service to
write verbose data to the Data Profiling warehouse during the profile
session.
By default, Load Verbose Data option is disabled.
Loading verbose data for large sources may impact system performance.
Note: If you load verbose data for columns with a precision greater than
1,000 characters, the Integration Service writes truncated data to the
Data Profiling warehouse during the profile session.
6. Click Next.
7. Select additional functions to include in the auto profile. You
Page1
can also clear functions you do not want to include.
The Profile Wizard selects the profile functions you specified in the
default data profile options.
8. Optionally, click Save As Default to create new default functions
based on the functions selected here.
9. Optionally, click Profile Settings to enter settings for domain
inference and structure inference tuning.
The Profile Settings dialog box displays the default domain inference
tuning and structure inference settings.
10. Optionally, modify the default profile settings and click OK.
11. Click Configure Session to configure the session properties after
you create the data profile.
12. Click Next if you selected Configure Session, or click Finish if
you disabled Configure Session.
The Designer generates a data profile and profile mapping based on the
profile functions.
13. Configure the Profile Run options and click Next.
The Session Setup page appears.
14. Configure the Session Setup options.
15. Click Finish.
If you selected Run Session, the Profile Manager starts the session.
Creating a Custom Profile
You can create a custom profile from the following Designer tools:
Source Analyzer. Click Sources > Profiling > Create Custom Profile.
Mapplet Designer. Click Mapplets > Profiling > Create Custom Profile.
Profile Manager. Click Profile > Create Custom.
Designer tool: If you create a custom profile this way, you can only
profile that source. If you need to include multiple sources in the profile,
or if you want to create an inter source function, use the Designer menu
commands.
You can also edit or delete a data profile.
To create a custom profile, complete the following steps:
Step 1: Enter a Data Profile Name and Description
When you start the Profile Wizard, the General Properties page prompts you to
Page1
enter a name for the data profile and add a description.
Data profile names must start with a letter and cannot contain the following
characters: .+-=~`!%^&*()[]{}'\";:/?,<>\\|\t\r\n @
When you are finished, click next.
Step 2: Add Sources to the Data Profile
The Profile Sources page prompts you to select the sources you want to
profile. If you selected the source definitions or mapplets you want to
profile before you launched the Profile Wizard, this page does not display
unless you selected a multi-group source definition.
If you want to profile multiple sources, you can create a mapplet that
combines multiple sources and create a data profile based on the mapplet
output data.
Note: If you use a source as a lookup source within a data profile, it cannot
be used as a non-lookup source within the same data profile. For example,
when you create a Domain Validation function using a Column Lookup domain,
the source you use for the column lookup cannot be a profiled source in the
same data profile. If two profile sources attempt to validate data against
each other, the Designer creates an invalid mapping.
Step 3: Add Functions and Enable Session
Configuration
After you add sources to the data profile, use the Function-Level Operations
page to complete the following tasks:
Add functions: When you add functions to the data profile, the Profile
Wizard opens the Function Details page for you to configure details about the
functions.
Edit functions: You can edit existing functions for the data profile.
Delete functions: You can remove functions from the data profile.
Organize functions: Use the Up and Down arrows to organize the
functions in a data profile. The order of the functions does not affect the
data profile results.
Select columns to load in verbose mode: When you
configure a function, you can select the columns to load in verbose mode.
Enable session configuration: When you enable session
configuration, the Profile Wizard prompts you to
Configure the profile session for the mapping: If you
configured the default data profile options to always run profile sessions
interactively, this option is selected by default.
If you finish adding functions to the data profile and you have not enabled
session configuration, click Finish. The Profile Wizard generates the profile
mapping.
If you finish adding functions to the data profile and you enabled session
configuration, click next. The Profile Wizard prompts you to configure the
profile session.
Page1
Step 4: Configure Profile Functions
When you add a function to the data profile, the Function Details page
prompts you to complete the following tasks:
Name the function: Function names are not case sensitive and cannot
contain spaces. Function names must start with a letter and cannot contain
the following characters: .+-=~`!%^&*()[]{}'\";:/?,<>\\|\t\r\n @
Enter a description of the function: Optionally, enter text
to describe the function.
Select the type of function: You can select
source-level,
column-level,
or intersource functions.
Select a function: The functions you can configure depends on the
type of function you choose.
If you added multiple sources to the data profile, you must select the source
you want to apply the function to.
If you select an intersource function, you must select at least two sources
or two groups from different sources to apply the function to.
After you select the function type and function, click next. The Profile
Wizard prompts you to specify the function details for the function. The
function Details window and available options change depending on the type of
function you select.
Each function type has a subset of functionality you can configure to perform
calculations on the source data.
When you finish configuring the function, the Profile Wizard returns to the
function Level Operations page. From the Function Level Operations page, you
can continue to add and configure functions for the data profile.
Configuring a Function with Group By Columns
Some functions let you generate profile data in a profile session run by
group. When you configure a function, you can determine the column by which
you want to group the data.
To select a group by column:
1. Configure a source-level function or column-level function.
2. Select Generate Profile Data By Group.
3. Click Group By. The Group By Columns dialog box appears.
4. Select the columns you want to group by. You can select up to three
columns. You cannot group by the column for which you created the function.
For example, if you created a Business Rule Validation function for the
column Agreement_Status, you cannot select this column to group by.
5. Click OK.
Configuring a Function for Verbose Mode
Page1
When you configure a function for verbose mode, the Integration Service
writes verbose data to the Data Profiling warehouse during a profile session.
You can configure verbose mode for the following functions:
Source-level Business Rule Validation
Column-level Business Rule Validation
Domain Validation
Distinct Value Count
Row Uniqueness
The type of verbose data the Integration Service can load to the target
depends on the function for which you configure verbose mode.
Step 5: Configure the Profile Session
If you enabled session configuration on the Function-Level Operations page,
the Profile Wizard opens the Profiler run page. You can configure and run a
profile session or save the session configuration and run the profile session
at another time.
Generating the Profile Mapping
After you create a data profile, the Designer generates a mapping based on
the data profile metadata. You must save changes to store the new data
profile and profile mapping in the repository. The Designer saves the data
profile and profile mapping in the repository folder that stores the source
or mapplet output you profiled.
You can view profile mappings in the Designer. The Designer denotes profile
mappings in the Repository Navigator with a Profile Mappings icon. The
profile mapping name is based on the data profile name. By default, the
mapping name contains the prefix m_DP_.
For example, if you name the data profile SalaryValidation, the mapping name
for the data profile is m_DP_SalaryValidation.
You can change the naming convention for profile mappings in the default data
profile options.
Working with Functions Overview
You include functions in a data profile to perform calculations on sources
during a profile session. When you create an auto profile, the Designer adds
a predefined set of functions to the data profile. When you create a custom
profile, you create functions that meet your business needs, and add them to
the data profile. You can add the following types of functions to a data
profile:
Source-level functions: Perform calculations on two or more
source columns, source group, or mapplet group.
Column-level functions: Perform calculations on one column in a
source.
Intersource functions: Perform calculations on two or more
sources, source groups, or mapplet groups.
Source-Level Functions
Source-level functions perform calculations on two or more columns of a
source, source group, or mapplet group.
Page1
You can add the following source-level functions to a data profile:
Row Count: Counts the number of rows read from the source during
the profile session. If you enable data sampling, the Row Count function
returns an estimated row count based on the rows sampled.
Business Rule Validation: Calculates the number of rows for one
or more source columns that satisfy a specified business rule, and
evaluates those rows that do satisfy the business rule.
Candidate Key Evaluation: Calculates the number and percentage
of unique values in one or more source columns. This helps you identify
source columns you might use as a primary key.
Redundancy Evaluation: Calculates the number of duplicate
values in one or more source columns. This helps you identify columns to
normalize into separate tables.
Row Uniqueness: Calculates the number of unique and duplicate
values in the source based on the columns selected. You can profile all
columns in the source row or choose individual columns to profile.
Functional Dependencies Analysis: Determines exact and
approximate dependencies between columns and column sets in a source.
Column-Level Functions
Column-level functions perform a calculation on one column in a source. You
can add the following column-level
functions to a data profile:
Business Rule Validation: Calculates the number of rows in a
single source column that satisfy and do not satisfy a specified
business rule, and evaluates those rows that do satisfy the business
rule.
Domain Validation: Calculates the number of values in the
profile source column that fall within a specified domain and the
number of values that do not. When you create a Domain Validation
function, you include domains in the function.
Domain Inference: Reads all values in the column and infers a
pattern that fits the data.
Aggregate Functions: Calculates an aggregate value for a
numeric or string value in a column.
Distinct Value Count: Returns the number of distinct values for
the column. When you specify a column-level function on the Function
Details page of the Profile Wizard, the Profile Wizard prompts you to
configure the function. The options available on the Function Role
Details page for column-level functions depend on the function you
select.
Intersource Functions
Intersource functions perform calculations on two or more sources, source
groups from different sources, or mapplet output groups, and generate
Page1
information about their relationship. You can add the following intersource
functions to a data profile:
Referential Integrity Analysis: Compares the values of columns
in two sources to determine orphan values.
Join Complexity Evaluation: Measures the columns in multiple
sources that satisfy a join condition.
Intersource Structure Analysis: Determines the primary key-
foreign key relationships between sources.
To create a custom profile and include an intersource function, use the
Designer menu commands. If you rightclick a source to create a custom
profile, you cannot include additional sources in the profile.
Viewing Profile Results Overview
After you run a profile session, you can view the session results in a
report. There are two types of Data Profiling reports:
PowerCenter Data Profiling reports: Reports you can view
from the Profile Manager after running a profile session. PowerCenter Data
Profiling reports display data for the latest session run. Use PowerCenter
Data Profiling reports to quickly view profile results during mapping
development.
Data Analyzer Data Profiling reports: Reports you can view
from Data Analyzer after running a profile session. Data Analyzer reports
provide a historic view of data. They also display information about rejected
rows in the profile results. Use Data Analyzer Data Profiling reports when
you want to monitor data quality during production.
Page1