Best practices for IBM Master Data Management performance tuning
The IBM Master Data Management service processes and analyzes large volumes of record data to create master data entities. By following the guidelines and best practices in this topic, you can help to ensure that your IBM Master Data Management service deployment runs optimally.
For information about editing the IBM Master Data Management custom resource (mdm-cr), see Modifying the CR to customize your IBM Master Data Management installation.
Choose the right scaling configuration
- level_1 (x-small): For demonstration and development purposes only.
- level_2 (small_mincpureq): Similar to the level_3 size, but with smaller CPU request sizes.
- level_3 (small): The default size, for smaller deployments.
- level_4 (medium): For medium volume deployments.
- level_5 (large): For large volume deployments.
The level_1 scaling configuration requires the minimum system requirements, as documented in Hardware requirements.
If your estimated deployment configuration size falls between two of the standard sizes, start with the lower scaling specification. You can then adjust your resource allocation as needed.
Ensure that you have sufficient hardware resources for the size that you select. For details and guidance, contact your IBM® Sales representative or IBM Support.
For more information about sizing options and considerations, download the component scaling guidance document. For more information about how to download the PDF, see Downloading the component scaling guidance PDF from the IBM Entitled Registry.
Load data in bulk by using the API
For bulk data load jobs that involve large volumes of data, use the bulk load API instead of the IBM Master Data Management user interface to load the data. For more information about API methods, see the IBM Master Data Management API Reference.
- To simplify failure handling, divide large files into multiple parts and load them one at a time.
- Data load jobs are very I/O intensive, so it is important to configure the right executor count. The right count depends on the available resources, but it should normally be less than 20. Set Spark parallelism to a multiple of the executor count. For example, for N executors, configure Spark parallelism between N and 4N.
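The two guidelines above can be sketched in Python. This is a minimal illustration, not part of the IBM Master Data Management API: the helper names `parallelism_range` and `split_into_parts` are hypothetical, and the part size is a value you would choose for your own environment.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def parallelism_range(executors: int) -> Tuple[int, int]:
    """Return the (low, high) Spark parallelism range of N to 4N for N executors."""
    if executors < 1:
        raise ValueError("executors must be at least 1")
    return executors, 4 * executors


def split_into_parts(lines: Iterable[str], lines_per_part: int) -> Iterator[List[str]]:
    """Yield consecutive parts of at most lines_per_part lines each.

    Each part can then be written to its own file and loaded separately,
    so that a failure affects only one part.
    """
    if lines_per_part < 1:
        raise ValueError("lines_per_part must be at least 1")
    iterator = iter(lines)
    while True:
        part = list(islice(iterator, lines_per_part))
        if not part:
            return
        yield part


if __name__ == "__main__":
    low, high = parallelism_range(8)
    print(f"8 executors: Spark parallelism between {low} and {high}")
    for index, part in enumerate(split_into_parts(["r1", "r2", "r3"], 2), start=1):
        print(f"part {index}: {len(part)} records")
```

Loading parts one at a time means that a failed part can be retried without reloading the parts that already succeeded.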
Tune the service for bulk derive jobs
For bulk derive jobs, you might need to adjust your system's Apache Spark parallelism. Spark parallelism of 128 is sufficient for 10 million records when there is 6 GB of executor memory. As the volume of data increases, you must increase Spark parallelism proportionately. For any given volume, the parallelism number should be higher if executor memory is reduced.
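As a rough starting point, the derive guidance can be expressed as a small Python estimator. The baseline of 128 for 10 million records at 6 GB comes from the guidance above. The inverse-proportional memory adjustment is an assumption made for illustration only, because the guidance says just that parallelism should be higher when executor memory is reduced. The function name is hypothetical.

```python
import math

# Baseline taken from the guidance: parallelism 128 for 10 million records at 6 GB.
BASELINE_RECORDS = 10_000_000
BASELINE_PARALLELISM = 128
BASELINE_EXECUTOR_MEMORY_GB = 6.0


def estimate_derive_parallelism(
    records: int,
    executor_memory_gb: float = BASELINE_EXECUTOR_MEMORY_GB,
) -> int:
    """Estimate a starting Spark parallelism for a bulk derive job."""
    if records < 0 or executor_memory_gb <= 0:
        raise ValueError("records must be >= 0 and executor memory must be > 0")
    # Scale proportionately with data volume, as the guidance states.
    volume_factor = records / BASELINE_RECORDS
    # ASSUMPTION: scale inversely with executor memory. The guidance only
    # says "higher", not by how much.
    memory_factor = BASELINE_EXECUTOR_MEMORY_GB / executor_memory_gb
    return max(1, math.ceil(BASELINE_PARALLELISM * volume_factor * memory_factor))


if __name__ == "__main__":
    print(estimate_derive_parallelism(10_000_000))       # baseline case
    print(estimate_derive_parallelism(30_000_000))       # three times the volume
    print(estimate_derive_parallelism(10_000_000, 3.0))  # half the executor memory
```

Treat the result as an initial value to validate against actual job behavior, not as a guaranteed setting.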
Tune the service for bulk match jobs
Similar to bulk derive jobs, you might need to adjust your system's Apache Spark parallelism for bulk match jobs. The required parallelism value depends on your specific matching algorithm setup, the total volume of records in your system, and the amount of memory allocated to Spark executors. Spark parallelism of 256 should be sufficient for data sets with 10 million records, assuming 6 GB of executor memory.
For any given volume, the parallelism number should be higher if executor memory is reduced.
As the volume of data increases, you must increase Spark parallelism at least proportionately. Generally, the required parallelism grows more rapidly than the volume.
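The faster-than-proportional growth for match jobs can be sketched as follows. The baseline of 256 for 10 million records at 6 GB is from the guidance above. The `growth_exponent` value of 1.2 and the inverse memory scaling are hypothetical placeholders. The real growth rate depends on your matching algorithm and data, so calibrate it from observed job runs.

```python
import math

# Baseline taken from the guidance: parallelism 256 for 10 million records at 6 GB.
BASELINE_RECORDS = 10_000_000
BASELINE_PARALLELISM = 256
BASELINE_EXECUTOR_MEMORY_GB = 6.0


def estimate_match_parallelism(
    records: int,
    executor_memory_gb: float = BASELINE_EXECUTOR_MEMORY_GB,
    growth_exponent: float = 1.2,  # ASSUMPTION: illustrative value only
) -> int:
    """Estimate a starting Spark parallelism for a bulk match job."""
    if records < 0 or executor_memory_gb <= 0:
        raise ValueError("records must be >= 0 and executor memory must be > 0")
    if growth_exponent < 1.0:
        raise ValueError("growth_exponent below 1.0 contradicts the guidance")
    # An exponent above 1.0 makes parallelism grow faster than the volume.
    volume_factor = (records / BASELINE_RECORDS) ** growth_exponent
    # ASSUMPTION: scale inversely with executor memory.
    memory_factor = BASELINE_EXECUTOR_MEMORY_GB / executor_memory_gb
    return max(1, math.ceil(BASELINE_PARALLELISM * volume_factor * memory_factor))


if __name__ == "__main__":
    for volume in (10_000_000, 20_000_000, 50_000_000):
        print(volume, estimate_match_parallelism(volume))
```

With an exponent of exactly 1.0 the estimate is purely proportional. Any larger exponent yields more than double the parallelism when the volume doubles.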
If executors fail with OutOfMemory errors, adjust your configuration to use higher Spark parallelism or a higher memory overhead. The memory overhead is defined by the spark.kubernetes.memoryOverheadFactor parameter in the mdm_matching section of the IBM Master Data Management CR.
spark_driver:
  containers:
    - resources:
        limits:
          ephemeral-storage: 20Gi
        requests:
          ephemeral-storage: 20Gi
  tolerations:
    - effect: NoSchedule
      key: cp4dmdm
      operator: Equal
      value: "true"
spark_executor:
  containers:
    - resources:
        limits:
          ephemeral-storage: 20Gi
        requests:
          ephemeral-storage: 20Gi
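To see how a memory overhead factor affects the memory that each executor pod requests, the following Python sketch approximates the calculation. It assumes the general Apache Spark on Kubernetes behavior, where the overhead is the executor memory multiplied by the factor, with a floor of 384 MiB. How the IBM Master Data Management operator applies the value is not covered here, and the function name is hypothetical.

```python
# Minimum overhead used by Apache Spark, in MiB.
MIN_OVERHEAD_MIB = 384


def executor_pod_memory_mib(executor_memory_mib: int, overhead_factor: float) -> int:
    """Approximate the total memory requested for one executor pod, in MiB."""
    if executor_memory_mib <= 0 or overhead_factor < 0:
        raise ValueError("executor memory must be > 0 and factor must be >= 0")
    # Overhead is a fraction of executor memory, but never below the minimum.
    overhead = max(int(executor_memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return executor_memory_mib + overhead


if __name__ == "__main__":
    six_gib = 6 * 1024
    for factor in (0.1, 0.2, 0.4):
        total = executor_pod_memory_mib(six_gib, factor)
        print(f"factor {factor}: about {total} MiB per executor pod")
```

Raising the factor increases the memory request of every executor pod, so confirm that the cluster has enough capacity before you increase it.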
Tune the service for bulk synchronization or export jobs
Depending on the volume of data being synchronized or exported, you might need to adjust Spark parallelism. Spark parallelism of 100 should be sufficient for 10 million records, assuming 6 GB of executor memory. With any increase in volume, the Spark parallelism must be increased proportionately.
Tune the runtime REST APIs
For highly concurrent operations targeting more than 300 transactions per second (TPS), you might need to add Nginx pods by editing the IBM Software Hub zenservices CR. Run the command oc edit zenservices to modify the configuration of the spec:Nginx:replicas parameter.
Tune the Neo4j database
With a Neo4j graph database, the data partition (/data) holds the majority of the contents. To avoid potential problems, you must size the data partition appropriately for the data volume. As a best practice, store transaction logs in a different partition (/transactions). The size of a partition can only be increased. It cannot be reduced.
Database memory requirements depend on the volume of data being stored and processed. A larger page cache size can help to reduce physical I/O operations, which improves performance.
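One way to reason about page cache size is to aim for a cache that can hold the stored data plus some headroom, limited by the memory that remains after the heap and operating system are accounted for. The Python sketch below illustrates that reasoning. The headroom factor of 1.2, the 1 GB operating system reserve, and the function name are assumptions for illustration, not values documented for IBM Master Data Management.

```python
def suggest_page_cache_gb(
    store_size_gb: float,
    total_memory_gb: float,
    heap_gb: float,
    os_reserve_gb: float = 1.0,  # ASSUMPTION: illustrative reserve
    headroom: float = 1.2,       # ASSUMPTION: illustrative growth headroom
) -> float:
    """Suggest a page cache size in GB, capped by the memory that is available."""
    desired = store_size_gb * headroom
    available = total_memory_gb - heap_gb - os_reserve_gb
    if available <= 0:
        raise ValueError("no memory left for the page cache after heap and OS reserve")
    return round(min(desired, available), 1)


if __name__ == "__main__":
    # Store fits in memory: the cache covers the data plus headroom.
    print(suggest_page_cache_gb(store_size_gb=10, total_memory_gb=32, heap_gb=8))
    # Store is larger than memory: the cache is capped, so expect more physical I/O.
    print(suggest_page_cache_gb(store_size_gb=50, total_memory_gb=32, heap_gb=8))
```

When the suggested value is capped by available memory, part of the data must be read from disk, which is the case where adding memory is most likely to improve performance.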