Compare Members in Bulk job (mpxcomp utility)

The Compare Members in Bulk job (mpxcomp utility) enables the comparison of records and is one of the processes used during bulk cross matches (BXM) and incremental cross matches (IXM).

When run, this utility selects candidates, compares member records, and assigns comparison scores. The mpxcomp utility must be run once for each type of entity (for example, identity and household) implemented, as the comparison algorithm is specific to each entity type.

The mpxcomp utility loads the entire input data set into memory for processing. If you are working with large files, this might cause system memory and performance issues, because your machine must have sufficient contiguous memory to accommodate the data files. For large data sets, you can elect to use the *Part options to conserve system memory and optimize performance. Use of these options (-nMemParts, -nBktParts, -minBktPart, -maxBktPart and -maxParts) partitions the data to avoid pulling the entire set into memory at one time. The BktParts option should be the first option adjusted to accommodate available memory.

If you plan to partition data, a partitioning strategy should be devised before beginning data derivation. Data must be partitioned consistently between the derivation step (mpxdata, mpxfsdvd, mpxprep, and mpxredvd utilities), the comparison step (mpxcomp utility), and the linking step (mpxlink utility). In addition to the derived data binary files, you must have the following in place before running the mpxcomp utility:

  • an InfoSphere® MDM instance if you are running the utility from MDM Workbench; if running from a command line, the operational server instance is not required
  • an operational server configured with your algorithm and data dictionary (includes threshold settings)
  • weights

The BXM process uses the weights to create an aggregate comparison score, which is then compared to the threshold settings to determine auto-links and tasks.
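As a rough illustration of the scoring step, the following Python sketch classifies an aggregate comparison score against two thresholds. The function name, the parameter names, and the numeric values are hypothetical. Actual thresholds come from your operational server configuration, and the sketch assumes that the task threshold sits at or below the auto-link threshold and that both boundaries are inclusive.

```python
def classify_pair(score: float, autolink_threshold: float, task_threshold: float) -> str:
    """Classify a compared pair by its aggregate score (illustrative only)."""
    # Assumption: tasks cover the band just below the auto-link threshold.
    if task_threshold > autolink_threshold:
        raise ValueError("task threshold is expected to be at or below the auto-link threshold")
    if score >= autolink_threshold:
        return "auto-link"
    if score >= task_threshold:
        return "task"
    return "no-action"


# Hypothetical thresholds: auto-link at 10.0, task at 6.0.
for score in (12.5, 7.0, 3.2):
    print(score, classify_pair(score, autolink_threshold=10.0, task_threshold=6.0))
```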

The output is additional binary files that represent the entity link and task groupings and comparison scores. This output is the input to the next phase of BXM, which is Link Entities (mpxlink utility).

When to run Compare Members in Bulk job (mpxcomp utility)

You run the Compare Members in Bulk job (mpxcomp utility):
  • During the initial stage of implementation to generate baseline comparison scores.
  • During the reiterate step of the implementation process. After going through the entire set of implementation steps and analyzing your data results, you might determine that modifications to your algorithm and data dictionary are necessary. If so, you typically re-derive your data (by using mpxdata, mpxfsdvd, or mpxredvd utilities) and run another BXM.
  • After implementation, if you modify the attributes that are used by your comparison functions (for example, adding an alias to a name comparison) or change your bucketing configuration. Comparison function and bucket changes require new weights, re-derivation of data, and a new BXM.
  • When running an IXM.
Table 1. Compare Members in Bulk job (mpxcomp utility) options
Workbench Command line Description
Entity Type -entType This option identifies the type of entity being computed. If you are implementing multiple entity types (for example, id for Identity and hh for Household), you must run Compare Members in Bulk job (mpxcomp utility) for each type.

Default: varies depending on available types

Inputs and Outputs
Input directory -bxmInpDir This is the directory where the input binary (.bin) files from Derive Data and Create UNLs job (mpxdata utility) are stored. This is typically the work directory created under <MDM_INSTALL_HOME>/inst/mpinet_<MDM_INSTANCE_NAME>/work. Multiple servers might share the same <MDM_INSTALL_HOME> directory, but have a different <MDM_INSTANCE_NAME>. The operational server instance that you connect to dictates where the work directory lives.

You can list multiple directories for this option; use commas to separate multiple directories.

Default: bxm

Output directory -bxmOutDir This is the directory where the Compare Members in Bulk job (mpxcomp utility) output binary files are written. This is typically relative to the work directory on the server hosting the configuration.

Default: bxm
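The entity type and directory options can be assembled programmatically. The Python sketch below builds only the option portion of a command line from the -entType, -bxmInpDir, and -bxmOutDir options described in this table. The helper name and the second input directory (bxm2) are hypothetical, and how the utility itself is launched in your environment is not shown.

```python
from typing import List, Sequence


def build_io_args(ent_type: str, input_dirs: Sequence[str], output_dir: str) -> List[str]:
    """Return the entity type and directory options as an argument list."""
    if not input_dirs:
        raise ValueError("at least one input directory is required")
    return [
        "-entType", ent_type,
        # Multiple input directories are separated by commas.
        "-bxmInpDir", ",".join(input_dirs),
        "-bxmOutDir", output_dir,
    ]


# "bxm2" is a made-up second input directory used only for illustration.
args = build_io_args("id", ["bxm", "bxm2"], "bxm")
print(" ".join(args))  # -entType id -bxmInpDir bxm,bxm2 -bxmOutDir bxm
```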

Performance tuning

The term “partitions” as used in these options refers to breaking the member, bucket, or query data files into pieces. The derivation utilities (mpxdata, mpxprep, mpxfsdvd, and mpxredvd) produce a set of initial BXM files that are used by other utilities down-stream to do a cross match or generate weights. If the data set is large, the BXM files are subsequently large. The utilities that read these files (for example, mpxcomp) need to be able to fit this data into the available memory (RAM). If the memory requirement is larger than the available memory, the processes might swap or even run out of memory and fail. By breaking the data into pieces (partitions), the utilities can read pieces of the BXM data at a time and run within the available memory.

Keep in mind:
  1. Both BktParts and MemParts options are specified when doing data derivation. The output of the DVD utilities becomes the input for Compare Members in Bulk job (mpxcomp utility). The value specified for the mpxcomp utility should match the value set for the data derivation processes.
  2. minBktPart and maxBktPart settings override any value set for nBktParts.
  3. To expedite the BXM process, set minBktPart and maxBktPart to run multiple processes against the same pool of bucket part files. For example, with a pool of ten files (mpx_bxmbktd.001 through mpx_bxmbktd.010), running the Compare Members in Bulk job (mpxcomp) set with -minBktPart 5 and -maxBktPart 10 instructs the mpxcomp utility to consume only mpx_bxmbktd.005 through mpx_bxmbktd.010.
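The part-file selection in item 3 can be sketched as follows. This Python helper lists the bucket part files that a given -minBktPart and -maxBktPart range would cover. The helper name is hypothetical, and the three-digit, zero-padded suffix is inferred from the example file names above rather than from a documented naming rule.

```python
from typing import List


def bucket_part_files(min_bkt_part: int, max_bkt_part: int,
                      prefix: str = "mpx_bxmbktd") -> List[str]:
    """List the bucket part file names covered by an inclusive part range."""
    if min_bkt_part < 1 or max_bkt_part < min_bkt_part:
        raise ValueError("expected 1 <= min_bkt_part <= max_bkt_part")
    # Suffix format (three digits, zero padded) is an assumption based on
    # the example names mpx_bxmbktd.001 through mpx_bxmbktd.010.
    return [f"{prefix}.{part:03d}" for part in range(min_bkt_part, max_bkt_part + 1)]


for name in bucket_part_files(5, 10):
    print(name)
```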
Number of threads -nthreads The number of threads to use for the Compare Members in Bulk job (mpxcomp utility) process. The number of threads set can have an effect on system performance. The value should correspond to the number of CPUs available on the system (for example, if the machine has 4 CPUs, set the number of threads for the mpxcomp utility to 4 to optimize the time it takes to process).

Default: 1

Maximum value: 64

Recommended value: 1 thread per processor
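One way to derive a value for -nthreads is to take the CPU count reported by the operating system and cap it at the documented maximum of 64. The Python sketch below does that. The function name is hypothetical, and the CPU count reported by the operating system might include logical processors, so treat the result as a starting point.

```python
import os
from typing import Optional

MAX_THREADS = 64  # documented maximum value for -nthreads


def recommended_nthreads(cpu_count: Optional[int] = None) -> int:
    """Return one thread per processor, bounded to the range 1..64."""
    if cpu_count is None:
        # os.cpu_count() can return None when the count is unknown.
        cpu_count = os.cpu_count() or 1
    return max(1, min(cpu_count, MAX_THREADS))


print(["-nthreads", str(recommended_nthreads())])
```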

Number of member partitions -nMemParts This option identifies the data partitions consumed by Compare Members in Bulk job (mpxcomp). When this option is defined for mpxdata, mpxprep, mpxredvd or mpxfsdvd utilities, the processes break up the data from memHead and memCmpd (comparison data). While using this option can cut your memory usage significantly, your setting can affect performance. The higher the memParts setting, the slower your comparison process, simply because the operational server is forced to do more duplicate comparisons. In order to compare every member that shares at least one bucket, the operational server compares each memPart against itself and then against all other memParts. For example, if you had memParts set to 3, you would have parts A, B, and C. For each BktPart you would compare:

A > A, A > B, A > C

B > B, B > C

C > C

If it is necessary to use this option, specify a minimum setting of 3. Since the comparison has to bring in two parts to compare against each other, only splitting the data in half does not save memory. You should always use just enough parts to get all of the comparison data into physical memory.

Default: 1
Maximum value: 100
Recommended value: 1. The value used for the data derivation process should be the same value used for Compare Members in Bulk job (mpxcomp utility) and Link Entities job (mpxlink utility).
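The memPart pairing described for -nMemParts can be enumerated with a short Python sketch. It reproduces the A, B, C example, counts the part-to-part passes for each bucket part, and estimates the share of comparison data held in memory at once. The function names are hypothetical, and the memory estimate assumes equal-size parts with at most two parts loaded at a time.

```python
from itertools import combinations_with_replacement
from typing import List, Sequence, Tuple


def mem_part_pairs(parts: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair every part with itself and with each later part."""
    return list(combinations_with_replacement(parts, 2))


def comparisons_per_bkt_part(n_mem_parts: int) -> int:
    """Number of part-to-part passes: n * (n + 1) / 2."""
    return n_mem_parts * (n_mem_parts + 1) // 2


def peak_fraction_in_memory(n_mem_parts: int) -> float:
    """Rough share of comparison data resident at once (equal-size parts assumed)."""
    return min(2, n_mem_parts) / n_mem_parts


for left, right in mem_part_pairs(["A", "B", "C"]):
    print(f"{left} > {right}")
```

Under these assumptions, two parts still require all of the data in memory for the A > B pass, while three parts reduce the peak to about two thirds, at the cost of six passes instead of one.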

Number of bucket partitions -nBktParts This option breaks up the member bucket data (membktd) into smaller, more manageable chunks to optimize memory use. This option can assist sort performance on large data sets. All members that share a given bucket value end up in the same part and are compared to one another. The number of bucket parts you set when running Compare Members in Bulk job (mpxcomp utility) should match the number you specified for the derivation process (mpxdata, mpxfsdvd, or mpxredvd utilities). If running an IXM, set this to the number used in the original data load. BktParts should be the first option adjusted to accommodate available memory.

If -minBktPart and -maxBktPart options in Compare Members in Bulk job (mpxcomp) are used, they will override any settings for -nBktParts. When attempting to reduce your memory footprint, increase nBktParts before adjusting MemParts. Running a utility with the -noexec option outputs memory usage requirements that can help you determine how to adjust the -n*Part settings.

Default: 10
Maximum value: 100
Recommended value: Should match the number of bktParts created in the bxmInpDir.

Minimum bucket partitions to process -minBktPart Determines the lowest-numbered bucket part to process. Both -minBktPart and -maxBktPart are performance options that allow Compare Members in Bulk job (mpxcomp) to process a range of bucket parts. This is often used when running Compare Members in Bulk job (mpxcomp) across multiple machines. For example, you can run bucket parts 1 through 5 on machine 1 and parts 6 through 10 on machine 2.

Default: 0
Maximum value: Less than or equal to the value of maxBktPart
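Dividing bucket parts across machines, as in the two-machine example for -minBktPart, can be sketched in Python as follows. The function names are hypothetical. The sketch gives every machine at least two parts, because the utility is documented to fail when minBktPart is greater than or equal to maxBktPart.

```python
from typing import List, Tuple


def split_bucket_parts(n_bkt_parts: int, n_machines: int) -> List[Tuple[int, int]]:
    """Split parts 1..n_bkt_parts into contiguous (min, max) ranges, one per machine."""
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    base, extra = divmod(n_bkt_parts, n_machines)
    if base < 2:
        # Each range needs min < max, so it must span at least two parts.
        raise ValueError("not enough bucket parts for that many machines")
    ranges = []
    start = 1
    for index in range(n_machines):
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def range_args(min_bkt_part: int, max_bkt_part: int) -> List[str]:
    """Format one range as -minBktPart and -maxBktPart arguments."""
    return ["-minBktPart", str(min_bkt_part), "-maxBktPart", str(max_bkt_part)]


for machine, (low, high) in enumerate(split_bucket_parts(10, 2), start=1):
    print(f"machine {machine}:", " ".join(range_args(low, high)))
```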

Maximum bucket partitions to process -maxBktPart Determines the highest-numbered bucket part to process.

Default: 0
Maximum value: 100

Maximum number of output partitions -nMxmParts This option partitions the output of Compare Members in Bulk job (mpxcomp utility) into smaller chunks for use by Link Entities job (mpxlink utility). The value set for the mpxcomp utility determines how many partitioned file segments are passed to Link Entities job (mpxlink utility), thus the MxmParts value for both must be the same.

Default: 1
Maximum value: 100
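Because the partition values must agree between the derivation, comparison, and linking steps, a pre-flight check can catch mismatches before a long run starts. The Python sketch below compares partition settings recorded for each step and reports any option whose values differ. The function name, the step labels, and the dictionary layout are hypothetical; how you record the values that each step actually used is up to you.

```python
from typing import Dict, List


def find_partition_mismatches(settings: Dict[str, Dict[str, int]]) -> List[str]:
    """Report every option that is set to different values in different steps."""
    by_option: Dict[str, Dict[str, int]] = {}
    for step, options in settings.items():
        for option, value in options.items():
            by_option.setdefault(option, {})[step] = value
    problems = []
    for option in sorted(by_option):
        values = by_option[option]
        if len(set(values.values())) > 1:
            detail = ", ".join(f"{step}={value}" for step, value in sorted(values.items()))
            problems.append(f"{option} differs: {detail}")
    return problems


# Hypothetical settings in which nMxmParts differs between mpxcomp and mpxlink.
planned = {
    "derivation": {"nMemParts": 3, "nBktParts": 10},
    "mpxcomp": {"nMemParts": 3, "nBktParts": 10, "nMxmParts": 4},
    "mpxlink": {"nMemParts": 3, "nMxmParts": 2},
}
for problem in find_partition_mismatches(planned):
    print(problem)
```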

Attention: The mpxcomp utility fails when:
  • Either minBktPart or maxBktPart is set to 0 (which means bktPart is not used). The min and max bucket parts must be set to a valid range (for example, -minBktPart 1 -maxBktPart 5).
  • The value of minBktPart is greater than or equal to the value of maxBktPart.
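The two failure conditions above can be checked before the job is submitted. The following Python sketch is one interpretation: it treats both values at their default of 0 as "range not used" (an assumption based on the documented defaults) and rejects the case where only one value is 0 or where the minimum is not below the maximum. The function name and the messages are hypothetical.

```python
from typing import Optional


def check_bkt_part_range(min_bkt_part: int, max_bkt_part: int) -> Optional[str]:
    """Return None when the range looks usable, otherwise a short reason."""
    if min_bkt_part == 0 and max_bkt_part == 0:
        # Assumption: both defaults left at 0 means no range is requested.
        return None
    if min_bkt_part == 0 or max_bkt_part == 0:
        return "minBktPart and maxBktPart must both be set to a valid range"
    if min_bkt_part >= max_bkt_part:
        return "minBktPart must be less than maxBktPart"
    return None


for low, high in ((1, 5), (0, 5), (5, 5)):
    print((low, high), check_bkt_part_range(low, high) or "ok")
```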
Maximum bucket set size -maxbktsize Maximum bucket size determines the maximum number of members that can have the same bucket value for candidate selection. For example, using the default of 500, if more than 500 members have the same bucket value, Compare Members in Bulk job (mpxcomp utility) ignores those members for comparison. The value set depends on a variety of factors, including the number of members in the database and the bucketing strategy. The log reports bucket hash values exceeding this parameter, as well as 5 members for you to examine. If set appropriately, the values reported in the log might indicate a bucket value that should be defined as an anonymous value.

Default: 500
Maximum value: 1048576
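The effect of -maxbktsize on candidate selection can be illustrated with a small Python sketch. It groups members by bucket value, drops any bucket that holds more members than the limit, and keeps a few sample members from each dropped bucket, similar to what the log reports. The function name and the input layout (pairs of member ID and bucket hash) are hypothetical, and the sketch does not model how the utility stores or reads bucket data.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def select_candidate_buckets(
    member_buckets: Iterable[Tuple[int, Hashable]],
    max_bkt_size: int = 500,
    sample_size: int = 5,
) -> Tuple[Dict[Hashable, List[int]], Dict[Hashable, List[int]]]:
    """Return (buckets kept for comparison, sample members of oversized buckets)."""
    buckets: Dict[Hashable, List[int]] = defaultdict(list)
    for member_id, bucket_hash in member_buckets:
        buckets[bucket_hash].append(member_id)
    kept: Dict[Hashable, List[int]] = {}
    oversized: Dict[Hashable, List[int]] = {}
    for bucket_hash, members in buckets.items():
        if len(members) > max_bkt_size:
            # Too many members share this value; keep only a sample to examine.
            oversized[bucket_hash] = members[:sample_size]
        else:
            kept[bucket_hash] = members
    return kept, oversized


# Tiny limit so that the effect is visible with four members.
pairs = [(1, "x"), (2, "x"), (3, "x"), (4, "y")]
kept, oversized = select_candidate_buckets(pairs, max_bkt_size=2, sample_size=2)
print(kept)
print(oversized)
```

A bucket value that repeatedly appears in the oversized group is a candidate for review as an anonymous value.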

Options
Encoding -encoding Determines the encoding of the .unl files. Options are Latin1, UTF8, and UTF16.

Default: Latin1

Minimum bucket role -minBktTag Minimum bucket tag to use. Bucket tags are used for speed optimization and allow bucket data to be created on bucket roles greater than or equal to the minimum bucket tag and less than or equal to the maximum bucket tag. Using the bucket tag option enables you to eliminate roles that do not have any impact on the linking outcome.

Default: 0
Maximum value: 15

Maximum bucket role -maxBktTag Maximum bucket tag to use.

Default: 0
Maximum value: 15

Write linkage item records -{no}bxmLink Determines whether to write output for linkage records. Output is written to a file. Disable this option if you do not want to write records.

Default: -bxmLink (write linkage records)

Write task item records -{no}bxmTask Determines whether to write output for task records. Disable this option if you do not want to write records.

Default: -bxmTask (write task records)

Write review identifier item records -{no}bxmRvid Determines whether to write output for Review Identifier task records. Disable this option if you do not want to write records.

Default: -bxmRvid (write Review Identifier tasks)

Do only different source comparisons -{no}DiffSrcOnly Use this option if you want to compare members from one source only to records in a different source. For example, records from Source A are compared to those in Source B and C, but not against other records in A.

Default: -noDiffSrcOnly (compares records across all sources)

Do only same source comparisons -{no}SameSrcOnly Use this option to compare records only against those from the same source. For example, records from Source A are compared against A, but not against B and C.

Default: -noSameSrcOnly (compares records across all sources)
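The two source filters can be expressed as a single predicate over a pair of source codes. The Python sketch below is illustrative only. The function name is hypothetical, and rejecting the combination of both filters is a choice made in this sketch (together they would exclude every pair), not documented utility behavior.

```python
def should_compare(source_a: str, source_b: str,
                   diff_src_only: bool = False,
                   same_src_only: bool = False) -> bool:
    """Decide whether two members are compared, based on their sources."""
    if diff_src_only and same_src_only:
        # Sketch-level guard: both filters together would leave no pairs.
        raise ValueError("choose at most one of diff_src_only and same_src_only")
    if diff_src_only:
        return source_a != source_b
    if same_src_only:
        return source_a == source_b
    return True  # default: compare across all sources


print(should_compare("A", "B", diff_src_only=True))   # True
print(should_compare("A", "A", diff_src_only=True))   # False
print(should_compare("A", "B", same_src_only=True))   # False
```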

Create a dense memhead lookup table -{no}dense When this option is enabled (-dense), Compare Members in Bulk job (mpxcomp utility) creates a memhead lookup table that is used during the comparison operation. The lookup table replaces runtime computation with a simple array indexing operation. The -dense option uses more memory, but is faster than -nodense. Specify -dense only when you have sufficient memory for the data set and if you have large gaps in your memRecno ranges.

Default: -nodense
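The trade-off between -dense and -nodense can be pictured with a generic lookup-table sketch in Python. The dense table is indexed directly by record number, so its size depends on the largest record number rather than on the number of members. The non-dense path is modeled here as a binary search, which is an assumption; the source states only that the table replaces runtime computation with array indexing. All names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence


def build_dense_table(sorted_recnos: Sequence[int]) -> List[int]:
    """Map record number -> position, with -1 for unused record numbers."""
    if not sorted_recnos:
        return []
    table = [-1] * (sorted_recnos[-1] + 1)
    for position, recno in enumerate(sorted_recnos):
        table[recno] = position
    return table


def lookup_dense(table: Sequence[int], recno: int) -> int:
    """Direct array indexing."""
    return table[recno] if 0 <= recno < len(table) else -1


def lookup_search(sorted_recnos: Sequence[int], recno: int) -> int:
    """Stand-in for the non-dense path: binary search (assumed, not documented)."""
    index = bisect_left(sorted_recnos, recno)
    if index < len(sorted_recnos) and sorted_recnos[index] == recno:
        return index
    return -1


recnos = [3, 7, 1000]
table = build_dense_table(recnos)
print(len(table), lookup_dense(table, 7), lookup_search(recnos, 7))
```

In this example three members require a table of 1001 entries, which shows why the dense form costs more memory.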

Enable incremental cross match -ixmMode Use this option to enable incremental cross matching. In IXM mode, a subset of members is compared rather than the entire member set. If running a BXM, use the default of false. If running an IXM, set this to true.

Default: disabled

Generate memory usage information only The memory usage information generated by this option is viewable in MDM Workbench by executing the Get job results action on the Compare Members in Bulk job (mpxcomp utility) log job returned by the operational server.

Default: disabled

Comparison mode -cmpMode This option controls the Compare Members in Bulk job (mpxcomp utility) comparison behavior and is intended to improve performance by excluding comparisons configured only for searching. Comparison modes can be set in your algorithm as follows: cmpmode 1 = match and link members, cmpmode 2 = search members, cmpmode 3 = search, match and link members. The mode set in your algorithm does not have to match the option specified for mpxcomp utility. The Compare Members in Bulk job (mpxcomp utility) option acts as a filter for selecting which comparison functions are used for comparison. For example, if you specify match and link (option 1), the mpxcomp utility uses only the comparison functions that are set to match and link.

Typically you would use Compare Members in Bulk job (mpxcomp utility) with match, link and search (comparison mode 3). If this option is not set, all comparison modes configured in your algorithm are compared. If set to match and link only, comparison modes 1 and 3 are compared. If set to search only, comparison modes 2 and 3 are compared.

Default: Match, link and search (comparison mode 3)
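The filtering behavior described for -cmpMode can be modeled by treating the mode values as bit flags (1 for match and link, 2 for search, 3 for both). The bit-flag reading is an interpretation that happens to reproduce the rules above; it is not stated in the source. The function names and the comparison function names in the example are hypothetical.

```python
from typing import Dict, List

MATCH_LINK = 1         # cmpmode 1: match and link members
SEARCH = 2             # cmpmode 2: search members
MATCH_LINK_SEARCH = 3  # cmpmode 3: search, match and link members


def is_function_used(function_cmpmode: int, option_cmpmode: int) -> bool:
    """A comparison function is used when its mode overlaps the selected option."""
    return (function_cmpmode & option_cmpmode) != 0


def select_functions(functions: Dict[str, int], option_cmpmode: int) -> List[str]:
    """Return the names of the comparison functions that pass the filter."""
    return [name for name, mode in functions.items() if is_function_used(mode, option_cmpmode)]


# Hypothetical comparison functions and their configured modes.
algorithm = {"name_cmp": MATCH_LINK_SEARCH, "ssn_cmp": MATCH_LINK, "phone_cmp": SEARCH}
print(select_functions(algorithm, MATCH_LINK))
print(select_functions(algorithm, SEARCH))
print(select_functions(algorithm, MATCH_LINK_SEARCH))
```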

Log Options
Trace logging Produces a trace of activity as interactions flow through the system. This option is very verbose and should only be used for short periods of time.

Default: disabled

Debug logging   Produces low-level diagnostics used internally by IBM® to identify what was happening in the system before an error condition occurred. This option generates a large amount of output per activity and should only be used for short periods of time.
Attention: Debug logging can potentially include personal member information such as member identification number, name, and so forth.

Default: disabled

Timer logging   Produces timings on certain operations to help identify where significant processing time is elapsing.

Default: disabled

SQL logging   Outputs the SQL that is sent by the InfoSphere MDM database layer to the RDBMS. This helps in diagnosing database-related issues. This option can produce large amounts of output depending on the activity.

Default: disabled

Audit logging   Produces activity information and non-critical warnings. Often, this option is used when a new system is first implemented to monitor activity.

Default: disabled

Algorithm logging   A separate logging level for algorithm-related debug information without the risk of including protected health information (PHI).

Default: disabled