Best practices for IBM Master Data Management performance tuning

The IBM Master Data Management service processes and analyzes large volumes of record data to create master data entities. By following the guidelines and best practices in this topic, you can help to ensure that your IBM Master Data Management service deployment runs optimally.

For information about editing the IBM Master Data Management custom resource (mdm-cr), see Modifying the CR to customize your IBM Master Data Management installation.

Important: Changing the default resource configuration has an impact on the performance of the service. Only modify the configuration options if you are certain of your changes. Be prepared to revert your changes if necessary.

Choose the right scaling configuration

When you deploy IBM Master Data Management, you can choose between different preconfigured standard sizes:
  • level_1 (x-small) - For demonstration and development purposes only.
  • level_2 (small_mincpureq) - Similar to the level_3 size, but with smaller CPU request sizes.
  • level_3 (small) - The default size, for smaller deployments.
  • level_4 (medium) - For medium volume deployments.
  • level_5 (large) - For large volume deployments.
The level_1 scaling configuration needs only the minimum system requirements, as documented in Hardware requirements.

If your estimated deployment configuration size falls in between two of the standard sizes, then start with the lower scaling specification. You can then adjust your resource allocation as needed.

Ensure that you have sufficient hardware resources for the size that you select. For details and guidance, contact your IBM® Sales representative or IBM Support.

For more information about sizing options and considerations, download the component scaling guidance document. For more information about how to download the PDF, see Downloading the component scaling guidance PDF from the IBM Entitled Registry.

Load data in bulk by using the API

For bulk data load jobs that involve large volumes of data, use the bulk load API instead of the IBM Master Data Management user interface to load the data. For more information about API methods, see the IBM Master Data Management API Reference.

Additionally, consider the following guidance about loading large volumes of data:
  • To simplify failure handling, divide large files into multiple parts, and then load the parts one at a time.
  • Data load jobs are very I/O intensive, so it is important to configure the right executor count. The right executor count depends on the available resources, but it should normally be less than 20. Set Spark parallelism to a multiple of the executor count. For example, for N executors, configure a Spark parallelism between N and 4N.
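The two points above can be illustrated with a minimal Python sketch. It is not part of the IBM Master Data Management API: the function names chunk_records and parallelism_range are hypothetical helpers, and the sketch assumes that the records are already available as an in-memory sequence or iterator.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def chunk_records(records: Iterable[T], part_size: int) -> Iterator[List[T]]:
    """Yield consecutive parts of at most part_size records each.

    Hypothetical helper: each part would be submitted as its own load job,
    so that a failure only requires reloading one part.
    """
    if part_size < 1:
        raise ValueError("part_size must be at least 1")
    iterator = iter(records)
    while True:
        part = list(islice(iterator, part_size))
        if not part:
            return
        yield part


def parallelism_range(executors: int) -> Tuple[int, int]:
    """Return the (N, 4N) Spark parallelism range for N executors.

    The guidance above suggests normally staying below 20 executors;
    this sketch does not enforce that limit.
    """
    if executors < 1:
        raise ValueError("executors must be at least 1")
    return executors, 4 * executors


if __name__ == "__main__":
    print(list(chunk_records(range(5), 2)))
    print(parallelism_range(8))
```

With 8 executors, for example, the range is 8 to 32, so a parallelism of 16 or 32 keeps every executor evenly loaded.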

Tune the service for bulk derive jobs

For bulk derive jobs, you might need to adjust your system's Apache Spark parallelism. Spark parallelism of 128 is sufficient for 10 million records when there is 6 GB of executor memory. As the volume of data increases, you must increase Spark parallelism proportionately. For any given volume, the parallelism number should be higher if executor memory is reduced.
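As a rough illustration of this scaling rule, the following Python sketch derives a starting parallelism value from the documented baseline (128 for 10 million records with 6 GB of executor memory). The function name is hypothetical. Scaling inversely with executor memory is an assumption: the guidance only states that parallelism should be higher when memory is reduced, not by how much. Treat the result as a starting point to validate against your own workload.

```python
import math

# Baseline taken from the guidance above.
BASELINE_RECORDS = 10_000_000
BASELINE_PARALLELISM = 128
BASELINE_EXECUTOR_MEMORY_GB = 6.0


def estimate_derive_parallelism(records: int, executor_memory_gb: float = 6.0) -> int:
    """Estimate a starting Spark parallelism for a bulk derive job.

    Scales the baseline proportionately with volume. The inverse scaling
    with executor memory is an assumption, not a documented formula.
    The result never drops below the baseline value.
    """
    if records <= 0 or executor_memory_gb <= 0:
        raise ValueError("records and executor_memory_gb must be positive")
    volume_factor = max(1.0, records / BASELINE_RECORDS)
    memory_factor = max(1.0, BASELINE_EXECUTOR_MEMORY_GB / executor_memory_gb)
    return math.ceil(BASELINE_PARALLELISM * volume_factor * memory_factor)


if __name__ == "__main__":
    print(estimate_derive_parallelism(25_000_000))       # 2.5 times the baseline volume
    print(estimate_derive_parallelism(10_000_000, 3.0))  # half the baseline memory
```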

Tune the service for bulk match jobs

Similar to bulk derive jobs, you might need to adjust your system's Apache Spark parallelism for bulk match jobs. The required parallelism value depends on your specific matching algorithm setup, the total volume of records in your system, and the amount of memory allocated to Spark executors. Spark parallelism of 256 should be sufficient for data sets with 10 million records, assuming 6 GB of executor memory.

For any given volume, the parallelism number should be higher if executor memory is reduced.

As the volume of data increases, you must increase Spark parallelism proportionately. Generally, the required parallelism grows more rapidly than the volume.
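Because the required parallelism for match jobs grows faster than the volume, a proportional estimate is only a lower bound. The following Python sketch makes that explicit. The function name and the growth_exponent parameter are hypothetical: the guidance does not quantify how much faster parallelism grows, so any exponent above 1.0 is an assumption that you would need to calibrate against your own matching algorithm and data.

```python
import math

# Baseline taken from the guidance above: 256 for 10 million records
# with 6 GB of executor memory.
BASELINE_RECORDS = 10_000_000
BASELINE_MATCH_PARALLELISM = 256


def estimate_match_parallelism(records: int, growth_exponent: float = 1.0) -> int:
    """Estimate a starting Spark parallelism for a bulk match job.

    With growth_exponent == 1.0 the result is the proportional lower bound.
    Values above 1.0 model faster-than-linear growth; the right value is
    workload-specific and is not documented, so it is an assumption here.
    """
    if records <= 0:
        raise ValueError("records must be positive")
    if growth_exponent < 1.0:
        raise ValueError("growth_exponent must be at least 1.0")
    ratio = max(1.0, records / BASELINE_RECORDS)
    return math.ceil(BASELINE_MATCH_PARALLELISM * ratio ** growth_exponent)


if __name__ == "__main__":
    print(estimate_match_parallelism(40_000_000))        # proportional lower bound
    print(estimate_match_parallelism(40_000_000, 1.5))   # assumed faster growth
```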

If you experience executors failing due to OutOfMemory errors, adjust your configuration to have higher Spark parallelism or a higher memory overhead. The memory overhead is defined by the spark.kubernetes.memoryOverheadFactor parameter in the mdm_matching section of the IBM Master Data Management CR.
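To see what a higher memory overhead factor means in practice, the following Python sketch estimates the total memory that an executor pod requests. It reflects how Apache Spark on Kubernetes generally sizes executor pods: the overhead is the executor memory multiplied by the factor, with a 384 MiB minimum. Whether IBM Master Data Management applies the factor in exactly this way is an assumption, and the function name is hypothetical.

```python
# Minimum memory overhead that Apache Spark applies per executor, in MiB.
MIN_OVERHEAD_MIB = 384


def executor_pod_memory_mib(executor_memory_mib: int, overhead_factor: float) -> int:
    """Estimate executor pod memory as executor memory plus overhead.

    Overhead is executor memory times the factor, truncated to whole MiB,
    and never less than MIN_OVERHEAD_MIB.
    """
    if executor_memory_mib <= 0 or overhead_factor < 0:
        raise ValueError("memory must be positive and factor non-negative")
    overhead = max(int(executor_memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return executor_memory_mib + overhead


if __name__ == "__main__":
    for factor in (0.1, 0.4):
        print(factor, executor_pod_memory_mib(6144, factor))
```

Raising the factor increases the memory that each executor pod requests, so confirm that your worker nodes have enough capacity before you change it.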

At higher volumes, you might need more ephemeral storage for Spark executors and drivers. For example, for a volume of 10 million records, add 20 Gi of ephemeral storage. Alternatively, you can leave ephemeral storage unlimited by not specifying a limit. The following example sets 20 Gi of ephemeral storage for both the driver and the executors:
  spark_driver:
    containers:
    - resources:
        limits:
          ephemeral-storage: 20Gi
        requests:
          ephemeral-storage: 20Gi
    tolerations:
    - effect: NoSchedule
      key: cp4dmdm
      operator: Equal
      value: "true"
  spark_executor:
    containers:
    - resources:
        limits:
          ephemeral-storage: 20Gi
        requests:
          ephemeral-storage: 20Gi
 

Tune the service for bulk synchronization or export jobs

Depending on the volume of data being synchronized or exported, you might need to adjust Spark parallelism. Spark parallelism of 100 should be sufficient for 10 million records, assuming 6 GB of executor memory. With any increase in volume, the Spark parallelism must be increased proportionately.

Tune the runtime REST APIs

For highly concurrent operations that target more than 300 transactions per second (TPS), you might need to add Nginx pods by editing the IBM Software Hub zenservices CR. Run the command oc edit zenservices, and then modify the spec:Nginx:replicas parameter.
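If you prefer to script this change instead of editing the CR interactively, the following Python sketch builds a JSON merge patch body for the spec:Nginx:replicas parameter. The function name is hypothetical, and the replica count in the example is an arbitrary illustration rather than a sizing recommendation.

```python
import json


def nginx_replicas_patch(replicas: int) -> str:
    """Build a JSON merge patch body that sets spec.Nginx.replicas."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return json.dumps({"spec": {"Nginx": {"replicas": replicas}}})


if __name__ == "__main__":
    # Example value only; choose a replica count that fits your workload.
    print(nginx_replicas_patch(4))
```

A body like this could be supplied to oc patch with the --type merge option. The name of the zenservices CR and its project are deployment-specific and are not shown here.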

Tune the Neo4j database

With a Neo4j graph database, the data partition (/data) holds the majority of the contents. To avoid potential problems, you must size the data partition appropriately for your data volume. As a best practice, store transaction logs in a different partition (/transactions). The size of a partition can only be increased. It cannot be reduced.

Database memory requirements depend on the volume of data being stored and processed. A larger page cache size can help to reduce physical I/O operations, which improves performance.