IN 1041 DataQualityPerformanceTuningGuide en
10.4.1
This software and documentation are provided only under a separate license agreement containing restrictions on use and disclosure. No part of this document may be
reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC.
Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United States and many jurisdictions throughout the world. A
current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html. Other company and product names may be trade
names or trademarks of their respective owners.
Subject to your opt-out rights, the software will automatically transmit to Informatica in the USA information about the computing and network environment in which the
Software is deployed and the data usage and system statistics of the deployment. This transmission is deemed part of the Services under the Informatica privacy policy
and Informatica will use and otherwise process this information in accordance with the Informatica privacy policy available at
https://www.informatica.com/in/privacy-policy.html. You may disable usage collection in the Administrator tool.
U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial
computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such,
the use, duplication, disclosure, modification, and adaptation is subject to the restrictions and license terms set forth in the applicable Government contract, and, to the
extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License.
Portions of this software and/or documentation are subject to copyright held by third parties. Required third party notices are included with the product.
The information in this documentation is subject to change without notice. If you find any problems in this documentation, report them to us at
infa_documentation@informatica.com.
Informatica products are warranted according to the terms and conditions of the agreements under which they are provided. INFORMATICA PROVIDES THE
INFORMATION IN THIS DOCUMENT "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.
Preface
Refer to the Informatica® Data Quality Performance Tuning Guide to learn how to configure your data quality
transformations to optimize run-time performance and to learn about the memory requirements that apply to
Informatica reference data.
Informatica Resources
Informatica provides you with a range of product resources through the Informatica Network and other online
portals. Use the resources to get the most from your Informatica products and solutions and to learn from
other Informatica users and subject matter experts.
Informatica Network
The Informatica Network is the gateway to many resources, including the Informatica Knowledge Base and
Informatica Global Customer Support. To enter the Informatica Network, visit
https://network.informatica.com.
To search the Knowledge Base, visit https://search.informatica.com. If you have questions, comments, or
ideas about the Knowledge Base, contact the Informatica Knowledge Base team at
KB_Feedback@informatica.com.
Informatica Documentation
Use the Informatica Documentation Portal to explore an extensive library of documentation for current and
recent product releases. To explore the Documentation Portal, visit https://docs.informatica.com.
If you have questions, comments, or ideas about the product documentation, contact the Informatica
Documentation team at infa_documentation@informatica.com.
Informatica Product Availability Matrices
Product Availability Matrices (PAMs) indicate the versions of the operating systems, databases, and types of
data sources and targets that a product release supports. You can browse the Informatica PAMs at
https://network.informatica.com/community/informatica-network/product-availability-matrices.
Informatica Velocity
Informatica Velocity is a collection of tips and best practices developed by Informatica Professional Services
and based on real-world experiences from hundreds of data management projects. Informatica Velocity
represents the collective knowledge of Informatica consultants who work with organizations around the
world to plan, develop, deploy, and maintain successful data management solutions.
You can find Informatica Velocity resources at http://velocity.informatica.com. If you have questions,
comments, or ideas about Informatica Velocity, contact Informatica Professional Services at
ips@informatica.com.
Informatica Marketplace
The Informatica Marketplace is a forum where you can find solutions that extend and enhance your
Informatica implementations. Leverage any of the hundreds of solutions from Informatica developers and
partners on the Marketplace to improve your productivity and speed up time to implementation on your
projects. You can find the Informatica Marketplace at https://marketplace.informatica.com.
To find your local Informatica Global Customer Support telephone number, visit the Informatica website at
the following link:
https://www.informatica.com/services-and-training/customer-success-services/contact-us.html.
To find online support resources on the Informatica Network, visit https://network.informatica.com and
select the eSupport option.
Chapter 1
• Overview
• Basic On-Disk Installation Size
• General runtime memory size
• Capacity Planning
Overview
You can verify and enhance the performance of your Informatica Data Quality and Data Engineering Quality
installations by monitoring and updating key parameters in your Informatica domain and applications.
Factors affecting performance include the memory available to the applications and the properties that you
select on the transformations and mappings that you configure.
The following table provides a guide to the reference data footprint in a standard installation:
The following table describes the disk space necessary for additional address reference data files:
Geocoding 11 GB
Fast completion 7 GB
Consumer segmentation 2 GB
The Virtual Set is the total virtual memory used, and the Working Set is the physically resident memory used.
The Content Management Service dictates how the address reference data files are loaded. You can view the
Content Management Service settings in the Administrator tool. The Data Integration Service applies the
Content Management Service settings when it loads the address reference data. The Data Integration Service
loads the data in the same way for all users.
The average size in memory of each loaded element is approximately the same as the disk footprint. For
example, if a user runs a mapping that uses a 533 MB address reference data file in batch or interactive
mode, the process memory size will grow by approximately 533 MB.
The memory remains in use while any address validation mapping runs. The Data Integration Service unloads
the address reference data and releases the memory when the mapping finishes.
Capacity Planning in Data Engineering Quality
The following table shows the performance results for a range of mappings in Data Engineering Quality:
• Set the maximum length on string ports accurately. Do not set values that far exceed the physical size of
the data that the ports will carry.
• Configure your input ports to be the same type as the corresponding output ports. For example, cast
numbers to numbers and strings to strings. Casting from one type to another when not necessary will
have a performance impact.
• Ensure that the Tracing Level in all transformations is set to Normal, as performance is degraded when
the level is set to a more verbose option.
Address Validator
Several configuration options affect the performance of the Address Validator transformation. You can
review and update the configuration properties on the Content Management Service.
Preloading Method
When the preloading method is set to MAP, Data Quality does not load the address reference data for every
process that uses it. Instead, the reference data is shared across processes. This is significant when more
than one Data Integration Service process or job is set to run out-of-process. If the reference data is
preloaded for another process, it will not be loaded again.
Memory Usage
The Memory Usage option controls the amount of memory that the Data Integration Service can use to
preload address reference data. If the amount of memory is insufficient to preload the required data, the Data
Integration Service will attempt to partially preload the data. If the amount of available memory is too low, the
service will not preload any data.
Cache Size
The Cache Size value affects the PARTIAL and NONE preloading options and can yield a small improvement
in performance. Increasing the cache size can improve the loading of address data, particularly below the
Locality level.
• You can store the reference data on a fast hard disk, solid-state disk, or even a flash disk (high-speed USB
stick).
• Where possible, install sufficient memory to allow all databases to fully pre-load into memory.
• Preload at least the databases of frequently used countries. At a minimum, the available memory should
equal the aggregate size of the most often-used country databases plus 256 MB.
• If you will use reference data from all countries simultaneously, add memory to cover the size of the
databases.
• Use a 64-bit environment to preload more than 3 GB of reference data.
• Do not set a country code as a No Preload value. Enter a No Preload value of ALL to avoid using a country
code.
• Minimize the access latency (average access time).
• If you use a solid-state disk, do not preload the databases. Set a LARGE cache size in the Content
Management Service instead.
• Do not use the same drive to store address reference data and source or target files.
• When enough memory is available, processor speed directly determines the speed of address processing.
• Try to sort your address records by country or postcode prior to processing. Validation also benefits from
internal and operating system caches for sorted addresses as opposed to addresses in random order.
• The Max Thread Count value must be greater than or equal to the number of partitions.
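The memory guideline in the list above (aggregate size of the most frequently used country databases plus 256 MB) can be turned into a quick sizing check. The following Python sketch is only an illustration: the function name, the country codes, and the database sizes are hypothetical and are not taken from an Informatica installation.

```python
# Sizes are in megabytes. The country codes and sizes used in the example are
# made-up illustration values, not measurements from a real installation.
PRELOAD_HEADROOM_MB = 256  # extra memory named in the guideline above


def min_preload_memory_mb(db_sizes_mb, frequent_countries):
    """Return the minimum memory (MB) suggested for preloading the given countries."""
    countries = set(frequent_countries)  # ignore accidental duplicates
    missing = sorted(c for c in countries if c not in db_sizes_mb)
    if missing:
        raise KeyError("no database size recorded for: " + ", ".join(missing))
    return sum(db_sizes_mb[c] for c in countries) + PRELOAD_HEADROOM_MB


example_sizes_mb = {"USA": 533, "DEU": 410, "GBR": 380}  # hypothetical sizes
print(min_preload_memory_mb(example_sizes_mb, ["USA", "GBR"]))  # 533 + 380 + 256
```

Compare the result with the memory that is actually free on the Data Integration Service machine before you choose a full preload.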
Association
If you run the Association transformation on a large data set, the transformation may not be able to store all
associated records in memory, and some records will be written to disk. The Cache File Size property on the
transformation specifies the amount of memory available.
A cache size value below 65536 represents megabytes, and any higher value represents bytes.
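The unit rule for the cache size value is easy to misread, so a small sketch may help. It converts a configured value to bytes according to the rule stated above. The function name is hypothetical, and the conversion assumes that 1 MB means 1024 x 1024 bytes, which this guide does not state.

```python
MB_THRESHOLD = 65536          # values below this are read as megabytes
BYTES_PER_MB = 1024 * 1024    # assumption: binary megabytes


def cache_size_in_bytes(value):
    """Interpret a cache size setting and return the size in bytes."""
    if value < 0:
        raise ValueError("cache size must not be negative")
    if value < MB_THRESHOLD:
        return value * BYTES_PER_MB  # small values are megabytes
    return value                     # 65536 and above are already bytes


print(cache_size_in_bytes(400))        # 400 MB expressed in bytes
print(cache_size_in_bytes(400000000))  # already bytes, returned unchanged
```

Under this rule, a value such as 65536 means 65,536 bytes, not 65,536 MB, so a setting just above the threshold yields a very small cache.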
The Cache File Directory identifies a storage area for the temporary files that the association operations
create. Configure the cache directory on the smallest, fastest disk for performance improvements.
B-Tree Considerations
The Association transformation makes extensive use of B-tree file-based storage. Each column that the
transformation reads has its own B-tree, and a general B-tree is used to store all input data rows. The
Informatica B-tree is space-efficient but not compressed.
Consolidation
The Consolidation transformation uses standard Informatica sorting techniques. By default, the techniques
give the transformation as much memory as possible without affecting system performance.
You can set a limit on the amount of main memory that the transformation uses to sort data. This increases
on-disk temporary memory use, as the transformation must store all data rows.
Association 13

Key Generator
The Key Generator transformation can create a set of unique identifiers for the rows in a data set. Use the
transformation to create sequence ID values for a Match transformation.
The Key Generator transformation includes an option to sort the output data. To maximize performance, do
not check this option. To sort the data, pass the output to a Sorter transformation and configure the cache
settings on the Sorter.
Match
To optimize performance in a Match transformation, you must understand the concepts that underpin match
analysis.
Group size has a significant impact on performance. For example, applying the formula above to a group of
2,000 records will produce 1,999,000 record comparisons. Applying the formula to a group of 5,000 records will
produce 12,497,500 comparisons, or over six times the amount.
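The two figures quoted above are consistent with the standard count of unique pairs in a group of n records, n(n - 1)/2. The following Python sketch reproduces them. The function name is illustrative only.

```python
def comparisons_in_group(n):
    """Number of unique record pairs in a single group of n records."""
    if n < 0:
        raise ValueError("group size must not be negative")
    return n * (n - 1) // 2


for size in (2000, 5000, 10000):
    print(size, comparisons_in_group(size))
```

Because the count grows with the square of the group size, doubling a group roughly quadruples the comparison work.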
For optimal performance, groups of more than 10,000 records are not recommended. Group sizes should be
meaningful, so that you do not miss possible matches, but they should not be too large.
If you perform matching on a large data set, the Match transformation may not be able to store all
comparison pairs in memory, and some pairs will be written to disk. The Cache Size property on the
transformation determines the amount of memory available.
A cache size value below 65536 is measured in megabytes, and any higher value is measured in bytes.
The Cache Directory property identifies a storage area for the temporary files that match analysis creates.
Configure the cache directory on the smallest, fastest disk for performance improvements.
The Match transformation can generate Link Score and Driver Score values that represent the degrees of
similarity between different pairs of records in a cluster of matching records.
For optimum performance, choose Link Scores and not Driver Scores. Choosing Driver Scores will greatly
decrease the performance of your match mapping, as Driver Scores write more information to disk.
Selecting the Filter Exact Match property significantly improves match performance if the data contains a
significant number of exactly matched pairs. Otherwise the option has a negligible performance impact.
n × m, where n is the number of records in group 1 in data set 1 and m is the number of records in group 1 in
data set 2.
For example, if data set 1 includes a group with 3,000 rows and the same group exists in data set 2 with
2,000 rows, match analysis will generate 6,000,000 record pairs.
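The n × m rule can be applied group by group to estimate the total workload of a dual-source match. The sketch below assumes that records are compared only when the same group key exists in both data sets and that the per-group products are simply summed. The function name and the group keys are hypothetical.

```python
def dual_source_pairs(groups_1, groups_2):
    """Sum n x m over the group keys that occur in both data sets.

    groups_1 and groups_2 map a group key to the number of records
    that carry that key in data set 1 and data set 2 respectively.
    """
    return sum(n * groups_2.get(key, 0) for key, n in groups_1.items())


# The example from the text: one shared group of 3,000 and 2,000 rows.
print(dual_source_pairs({"group 1": 3000}, {"group 1": 2000}))
```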
Identity Matching
The use of groups in identity matching is optional but recommended. As is the case in field matching, very
large group sizes will result in considerably slower performance.
To significantly improve identity matching performance, increase the number of execution instances on the
transformation. When you increase the number of execution instances, the Data Integration Service splits the
workload over multiple threads. The availability of execution instances depends on the number of processor
cores on the Data Integration Service machine.
The performance improvement will not be linear. The complete matching process cannot be split over
multiple threads. Part of the process must be completed in a single thread.
Note: For optimal performance with identity matching, set your execution instances to the number of
processor cores minus 1.
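As a rough aid, the "cores minus 1" guideline can be computed as follows. This is a sketch only: os.cpu_count() reports the logical processors of the machine where the script runs, so run it on the Data Integration Service machine or pass the core count explicitly. The function name is hypothetical.

```python
import os


def recommended_execution_instances(cores=None):
    """Return the processor core count minus 1, but never less than 1."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, cores - 1)


print(recommended_execution_instances(8))  # explicit core count
print(recommended_execution_instances())   # cores of the local machine
```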
Field matching
The following formula is a guide to the quantity of disk space in MB required to run field matching on a
data set, generating only the link score:
where d = the sum of the Match transformation input port precisions, n = the number of records, and
0.0000025 = the memory required per character.
If the mapping has dual sources, n in the above formula represents the total of the two sources.
The result above will double when the driver score is required.
Identity matching
The following formula is a guide to the quantity of disk space in MB required to run identity matching on
a data set, generating only the link score:
where d = the sum of the Match transformation input port precisions, n = the number of records, and
0.000005 = the memory required per character.
If the mapping has dual sources, n in the above formula represents the total of the two sources.
The report represents a profile of the data and includes a table that describes the composition of the groups.
Note: The Developer tool shows the first 16,000 groups. To see the full makeup of the data, export the report
to a file.
Modify the minimum and maximum group sizes to evaluate the likely effect of different group sizes on the
mapping performance.
• Overview
• Human Tasks
• Probabilistic Models
• Classifier Models
• Hadoop
• Web Services
• Multithreading
Overview
You can configure several data quality components in addition to the transformations that define data
analysis and enhancement operations.
Human Tasks
Informatica implements the ActiveVOS engine to allow you to run workflows and resolve issues and conflicts
manually with Human tasks. An administrator provides a database connection for the Human task metadata.
If you encounter bottlenecks, you may need to increase the number of connections.
Data Integration Service Java Heap Size
The memory heap size allocated to the Data Integration Service is set as an advanced process property on
the Data Integration Service. Set this value to a minimum of 1024 MB for Human tasks and to 3072 MB for a
heavily-used Workflow Orchestration Service.
The following image shows the Maximum Heap Size property on the Data Integration Service in the
Administrator tool:
For information on additional fine tuning on this value, contact Informatica Global Customer Support.
You can choose from the following options when configuring load balancing:
• By number of items.
- Number of items per task. The number of tasks created will be determined by the amount of work to be
reviewed.
- Number of tasks. The specified number of tasks will be created, and the workload will be split across the
tasks.
• By data value. The number of tasks created is unknown prior to run time.
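The two "by number of items" options differ in which quantity you fix in advance. The following sketch illustrates the arithmetic. It assumes that records are divided as evenly as possible, which this guide does not confirm, and both function names are hypothetical.

```python
def tasks_for_items_per_task(total_items, items_per_task):
    """Fixed items per task: the task count follows from the workload."""
    if total_items < 0 or items_per_task <= 0:
        raise ValueError("total_items must be >= 0 and items_per_task > 0")
    return (total_items + items_per_task - 1) // items_per_task  # ceiling


def items_for_fixed_tasks(total_items, task_count):
    """Fixed task count: the workload is split across the tasks."""
    if total_items < 0 or task_count <= 0:
        raise ValueError("total_items must be >= 0 and task_count > 0")
    base, extra = divmod(total_items, task_count)
    # Assumption: any remainder is spread one item at a time over the first tasks.
    return [base + 1 if i < extra else base for i in range(task_count)]


print(tasks_for_items_per_task(1050, 100))
print(items_for_fixed_tasks(1050, 4))
```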
Probabilistic Models
Probabilistic models are reference data files that identify the type of information in each value in a text
string. Consider the information below when you add probabilistic models to a Data Quality installation.
Design-Time Considerations
When you add training data to compile the model, all rows are loaded in memory. The default heap size of
768 MB available to the Developer tool means you might have access to approximately 500,000 rows. To edit
larger training data sets, increase the heap size.
To increase the heap memory available, update the -Xmx768M value in the developer.ini file. Tests indicate
that 100,000 rows of data require 100 MB of memory.
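The figures above can be combined into a rough estimate of the -Xmx value to set in developer.ini. The sketch below uses the stated ratio of 100 MB per 100,000 rows. The headroom of 268 MB is an inference from the statement that the 768 MB default handles roughly 500,000 rows, not a documented value, and the function names are hypothetical.

```python
DEFAULT_HEAP_MB = 768        # default -Xmx value named in the text
MB_PER_100K_ROWS = 100       # test result quoted in the text
TOOL_HEADROOM_MB = 268       # assumption: 768 MB minus ~500 MB for 500,000 rows


def training_data_mb(rows):
    """Approximate memory (MB) that the training rows occupy, rounded up."""
    return (rows * MB_PER_100K_ROWS + 99999) // 100000


def suggested_xmx(rows):
    """Return an -Xmx setting that is never below the default."""
    needed = training_data_mb(rows) + TOOL_HEADROOM_MB
    return "-Xmx{}M".format(max(DEFAULT_HEAP_MB, needed))


print(suggested_xmx(500000))
print(suggested_xmx(1000000))
```

Treat the output as a starting point and confirm it against the memory behavior that you observe in the Developer tool.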
When you compile a model, check the Content Management Service logs at the following location for errors:
$INFA_HOME/logs/[nodename]/services/ContentManagementService
If there is insufficient memory to compile the model, you will see an error like this:
com.informatica.cms.service.webapp.ContentManagementServiceServlet$ContentManagementServiceDefaultUncaughtExceptionHandler uncaughtException
WARNING: uncaughtException in CMS - Java heap space
When you see such an error, the Content Management Service has insufficient memory to compile the model,
and you must increase the Java heap size for the process. You can increase the Java heap size in the
Administrator tool. Navigate to the Processes tab on the Content Management Service, and update the
Maximum Heap Size advanced property.
You specify the heap size in megabytes, for example 2048M, or gigabytes, for example 2G. Note the syntax in
this case, as mistakes are common. After you update the heap size, restart the Content Management Service.
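Because syntax mistakes in the heap size value are common, a quick check before you paste the value into the Administrator tool may help. The sketch below accepts only the two forms shown above, a whole number followed by M or G. Whether the product also accepts other spellings, such as lowercase units, is not stated in this guide, so the check is deliberately strict. The function name is hypothetical.

```python
import re

# A positive whole number followed directly by M (megabytes) or G (gigabytes).
_HEAP_PATTERN = re.compile(r"([1-9][0-9]*)([MG])")


def heap_size_mb(value):
    """Validate a heap size string such as 2048M or 2G and return megabytes."""
    match = _HEAP_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError("expected a value such as 2048M or 2G, got %r" % value)
    number, unit = match.groups()
    return int(number) * (1024 if unit == "G" else 1)


print(heap_size_mb("2048M"))
print(heap_size_mb("2G"))
```

Values with spaces or spelled-out units, for example "2 GB" or "2048 MB", are rejected with a ValueError.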
Note: If you are not concerned about the probabilistic score output from a Labeler or Parser transformation,
leaving the score port unconnected on the transformation will improve performance.
Classifier Models
Classifier models are reference data files that identify the most common type of information in long text
strings.
Consider the information below when you add classifier models to a Data Quality installation.
Design-Time Considerations
When designing a Classifier model, the resources required on the Developer tool are similar to the resources
required for probabilistic models. You should not need to increase the Java heap size attributed to the Java
process unless you are editing hundreds of thousands of rows of data. If cases exist where you are editing
such volumes, you may need to increase the heap size in the developer.ini file.
Model Creation
Classifier models are compiled under the Content Management Service in the same way as probabilistic
models. The classifier compilation process does not require the amount of memory that the probabilistic
models require. As a result, you do not need to increase the resources available to the Content Management
Service in order to successfully compile classifier models.
Hadoop
In some cases, the default Java heap size is insufficient for execution of mappings on a Hadoop cluster. For
example, if you push down a mapping that reads a probabilistic model, you may need to increase the heap
size. If such an issue occurs, increase the default Java heap size to eliminate errors.
The following code fragment increases the Java heap size to 1024 MB:
$INFA_HOME/services/shared/Hadoop/conf/hadoopEnv.properties
infapdo.java.opts=-Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin -Xmx1024M -Djava.security.egd=file:/dev/./urandom
You may see a Java heap size error in the Jobtracker logs, which are typically available at
http://<NameNode>:50030/jobtracker.jsp.
Web Services
A web service client can connect to an Informatica web service to access, transform, or deliver data. An
external application or a Web Service Consumer transformation can connect to a web service as a web
service client. You can create an Informatica web service in the Developer tool.
Consider the information below when you configure Data Quality for web services.
The DTM Keep Alive Time value defines how long to keep the DTM running after it has dealt with a request.
Configure the keep alive time as high as the available resources permit.
When a web service request is received and no DTM is available, Data Quality initializes a new DTM to deal
with the request. The initialization process may mean loading reference tables, address reference data, or
probabilistic models into memory. The process may take seconds to complete and considerably increase the
response time. If all the above are preloaded to the DTM, the response time will be milliseconds as opposed
to seconds.
You can configure the DTM Keep Alive Time value as a Data Integration Service property in the Administrator
tool.
The following image shows the DTM Keep Alive Time property on the Data Integration Service:
You can also set a web service-specific keep alive time. This value takes priority over the value set on the
Data Integration Service. To set the value, browse to the web service under the Applications tab of the Data
Integration Service in the Administrator tool.
Note: The DTM Keep Alive Time value is specified in milliseconds unless you set a negative integer value. Set
a negative integer in the DTM Keep Alive Time property for the web service to specify that the web service
will read the property value from the Data Integration Service.
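The precedence between the two keep alive settings can be summarized in a few lines. The sketch below models only the rule described above and does not call any Informatica API. The function and parameter names are hypothetical.

```python
def effective_keep_alive_ms(service_value_ms, web_service_value_ms=None):
    """Return the keep alive time (milliseconds) that applies to a web service.

    A web service value takes priority over the Data Integration Service
    value. A negative web service value, or no value at all, means that the
    Data Integration Service value applies.
    """
    if web_service_value_ms is None or web_service_value_ms < 0:
        return service_value_ms
    return web_service_value_ms


print(effective_keep_alive_ms(3000, -1))      # falls back to the service value
print(effective_keep_alive_ms(3000, 600000))  # web service value wins
```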
Note: Once you have developed your web service, you can turn off logging on the web service. This will
increase performance as no logs will be written for each request.
Multithreading
You can run mappings in a multi-threaded or parallel manner. The Execution Options on the Data Integration
Service define the maximum number of parallel mappings that the service can run.
The following image shows the maximum parallelism option on the Data Integration Service:
Parallelism also applies to the transformations within a mapping. Set the maximum permitted parallelism
within a given mapping as a run-time property on the mapping.
The following image shows the maximum parallelism option on the mapping:
All data quality transformations can be multithreaded except for the Exception, Association, Classifier, and
Consolidation transformations. You can configure a Decision transformation to be partitionable.
The following graphs show the increase in throughput that you can achieve by enabling partitioning on a
Standardizer and Parser transformation:
Similar performance increases are observed for other data quality transformations.
• The number of execution instances that you set on a Match transformation or Address Validator
transformation must not exceed the maximum parallelism values that you set on the Data Integration
Service or on the mapping that contains the transformation.
• If you set the maximum parallelism value on a mapping to Auto, the mapping uses the maximum
parallelism value on the Data Integration Service. This may result in diminished performance, depending
on the number of mappings that run concurrently.
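The two rules above can be checked together before you deploy a mapping. The following sketch treats a mapping value of Auto as equal to the Data Integration Service value, as the second rule describes. It is an illustration only and does not read any setting from the domain. The function name is hypothetical.

```python
def check_execution_instances(instances, service_max, mapping_max="Auto"):
    """Return a list of rule violations. An empty list means no violation."""
    # Auto means the mapping inherits the Data Integration Service value.
    mapping_limit = service_max if mapping_max == "Auto" else int(mapping_max)
    problems = []
    if instances > service_max:
        problems.append(
            "execution instances (%d) exceed the service maximum parallelism (%d)"
            % (instances, service_max)
        )
    if instances > mapping_limit:
        problems.append(
            "execution instances (%d) exceed the mapping maximum parallelism (%d)"
            % (instances, mapping_limit)
        )
    return problems


print(check_execution_instances(4, 8, "Auto"))  # no violations
print(check_execution_instances(6, 8, 4))       # exceeds the mapping value
```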