Compare Members in Bulk job (mpxcomp utility)
The Compare Members in Bulk job (mpxcomp utility) compares member records and is one of the processes used during bulk cross matches (BXM) and incremental cross matches (IXM).
When run, this utility selects candidates, compares member records, and assigns comparison scores. The mpxcomp utility must be run once for each type of entity (for example, identity and household) implemented, as the comparison algorithm is specific to each entity type.
The mpxcomp utility loads the entire input data set into memory for processing. With large files, this can cause memory and performance problems, because the machine must have enough contiguous memory to hold the data files. For large data sets, you can use the *Part options (-nMemParts, -nBktParts, -minBktPart, -maxBktPart, and -maxParts) to conserve system memory and improve performance. These options partition the data so that the entire set is not pulled into memory at one time. Adjust the -nBktParts option first to accommodate available memory.
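The -minBktPart and -maxBktPart options select a range of bucket parts, which the option table later on this page describes as a way to spread one comparison run across several machines. The following is a minimal sketch of how such ranges might be planned, not part of the product. The function name `split_bucket_parts` is hypothetical, and the 1-based part numbering is an assumption taken from the "parts 1 through 5" example in the table.

```python
def split_bucket_parts(n_bkt_parts: int, n_machines: int) -> list[tuple[int, int]]:
    """Divide bucket parts 1..n_bkt_parts into contiguous (min, max) ranges.

    Hypothetical planning helper: it only computes the values you might pass
    to -minBktPart and -maxBktPart on each machine. Part numbering is assumed
    to start at 1.
    """
    if n_bkt_parts < 1 or n_machines < 1:
        raise ValueError("both counts must be positive")
    # Never hand a machine an empty range.
    n_machines = min(n_machines, n_bkt_parts)
    base, extra = divmod(n_bkt_parts, n_machines)
    ranges = []
    start = 1
    for i in range(n_machines):
        # The first `extra` machines take one additional part each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    for machine, (lo, hi) in enumerate(split_bucket_parts(10, 2), start=1):
        print(f"machine {machine}: -minBktPart {lo} -maxBktPart {hi}")
```

With 10 bucket parts and 2 machines, the sketch yields the ranges 1 to 5 and 6 to 10, matching the example given for -minBktPart.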
If you plan to partition data, a partitioning strategy should be devised before beginning data derivation. Data must be partitioned consistently between the derivation step (mpxdata, mpxfsdvd, mpxprep, and mpxredvd utilities), the comparison step (mpxcomp utility), and the linking step (mpxlink utility). In addition to the derived data binary files, you must have the following in place before running the mpxcomp utility:
- an InfoSphere® MDM instance if you are running the utility from MDM Workbench; if running from a command line, the operational server instance is not required
- an operational server configured with your algorithm and data dictionary (includes threshold settings)
- weights
The BXM process uses the weights to create an aggregate comparison score, which is then compared to the threshold settings to determine auto-links and tasks.
The output is additional binary files that represent the entity link and task groupings and comparison scores. This output is the input to the next phase of BXM, which is Link Entities (mpxlink utility).
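The scoring step described above (weights combined into an aggregate score, which is then compared to thresholds) can be pictured with a small sketch. This is an illustration only and does not reproduce the product's comparison algorithm. The names `aggregate_score`, `classify_pair`, `autolink_threshold`, and `task_threshold` are hypothetical, and both the simple sum and the "greater than or equal" comparison are assumptions.

```python
def aggregate_score(attribute_weights):
    """Combine per-attribute weights into one score (assumed here to be a sum)."""
    return sum(attribute_weights)


def classify_pair(score, autolink_threshold, task_threshold):
    """Map a score to an outcome using two hypothetical thresholds.

    Assumes autolink_threshold >= task_threshold. Scores at or above the
    auto-link threshold become links; scores between the two thresholds
    become tasks for review; everything else produces no output.
    """
    if score >= autolink_threshold:
        return "link"
    if score >= task_threshold:
        return "task"
    return "none"


if __name__ == "__main__":
    # Made-up weights for three compared attributes of one member pair.
    score = aggregate_score([4.5, 3.0, -1.5])
    print(score, classify_pair(score, autolink_threshold=10.0, task_threshold=6.0))
```

In a real deployment the thresholds come from the operational server configuration and the weights come from weight generation; the values here are invented for the example.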
When to run Compare Members in Bulk job (mpxcomp utility)
- During the initial stage of implementation to generate baseline comparison scores.
- During the reiterate step of the implementation process. After going through the entire set of implementation steps and analyzing your data results, you might determine that modifications to your algorithm and data dictionary are necessary. If so, you typically re-derive your data (by using mpxdata, mpxfsdvd, or mpxredvd utilities) and run another BXM.
- After implementation, if you modify the attributes that are used by your comparison functions (for example, adding an alias to a name comparison) or change your bucketing configuration. Comparison function and bucket changes require new weights, re-derivation of data, and a new BXM.
- When running an IXM.
Workbench | Command line | Description |
---|---|---|
Entity Type | -entType |
This option identifies the type of entity being computed. If
you are implementing multiple entity types (for example, id for Identity
and hh for Household), you must run Compare Members in Bulk job (mpxcomp
utility) for each type. Default: varies depending on available types |
Inputs and Outputs | ||
Input directory | -bxmInpDir |
This is the directory where the input binary (.bin)
files from Derive Data and Create UNLs job (mpxdata utility) are stored.
This is typically the work directory created under <MDM_INSTALL_HOME>/inst/mpinet_<MDM_INSTANCE_NAME>/work .
Multiple servers might share the same <MDM_INSTALL_HOME> directory,
but have a different <MDM_INSTANCE_NAME>. The
operational server instance that you connect to dictates where the
work directory lives. You can list multiple directories for this option; use commas to separate multiple directories. Default: bxm |
Output directory | -bxmOutDir |
This is the directory in which you want the Compare Members
in Bulk job (mpxcomp utility) output binary files located. This is
typically relative to the work directory on the server hosting the
configuration. Default: bxm |
Performance tuning | The term “partitions” as used in these options refers to breaking the member, bucket, or query data files into pieces. The derivation utilities (mpxdata, mpxprep, mpxfsdvd, and mpxredvd) produce a set of initial BXM files that other utilities use downstream to do a cross match or generate weights. If the data set is large, the BXM files are correspondingly large. The utilities that read these files (for example, mpxcomp) must be able to fit this data into the available memory (RAM). If the memory requirement is larger than the available memory, the processes might swap or even run out of memory and fail. By breaking the data into pieces (partitions), the utilities can read the BXM data one piece at a time and run within the available memory. Keep in mind:
|
|
Number of threads | -nthreads |
The number of threads to use for the Compare Members in Bulk
job (mpxcomp utility) process. The number of threads set can have
an effect on system performance. The value should correspond to the
number of CPUs available on the system (for example, if the machine
has 4 CPUs, set the number of threads for the mpxcomp utility to 4
to optimize the time it takes to process). Default:
1 Maximum value: 64 Recommended value: 1 thread per processor |
Number of member partitions | -nMemParts |
This option identifies the data partitions consumed by Compare
Members in Bulk job (mpxcomp). When this option is defined for mpxdata,
mpxprep, mpxredvd, or mpxfsdvd utilities, the process breaks up the
data from memHead and memCmpd (comparison data). While using this
option can cut your memory usage significantly, your setting can affect
performance. The higher the memParts setting, the slower your comparison
process simply because the operational server is forced to do more
duplicate comparisons. In order to compare every member that shares
at least one bucket, the operational server compares each memPart
against itself and then against all other memParts. For example, if
you set memParts to 3, you would have parts A, B, and C. For each
BktPart you would compare: A > A, A > B, A > C; B > B, B > C; C > C. If it is necessary to use this option, specify a minimum setting of 3. Because the comparison has to bring in two parts to compare against each other, splitting the data only in half does not save memory. Always use just enough parts to fit all of the comparison data into physical memory. Default:
1 |
Number of bucket partitions | -nBktParts |
This option breaks up the member bucket data (membktd) into
smaller, more manageable chunks to optimize memory use. This option
can assist sort performance on large data sets. All members that share
a given bucket value end up in the same part and are compared to one
another. The number of bucket parts you set when running Compare Members
in Bulk job (mpxcomp utility) should match the number you specified
for the derivation process (mpxdata, mpxfsdvd, or mpxredvd utilities).
If running an IXM, set this to the number used in the original data
load. BktParts should be the first option adjusted to accommodate
available memory. If -minBktPart and -maxBktPart options in Compare Members in Bulk job (mpxcomp) are used, they will override any settings for -nBktParts. When attempting to reduce your memory footprint, increase nBktParts before adjusting MemParts. Running a utility with the -noexec option outputs memory usage requirements that can help you determine how to adjust the -n*Part settings. Default:
10 |
Minimum bucket partitions to process | -minBktPart |
Determines the minimum number of bucket parts to process. Both -minBktPart and -maxBktPart are performance options that allow Compare Members in Bulk job (mpxcomp) to process a range of bucket parts. This is often used when running Compare Members in Bulk job (mpxcomp) across multiple machines. For example, you can run bucket parts 1 through 5 on machine 1 and parts 6 through 10 on machine 2. Default: 0 |
Maximum bucket partitions to process | -maxBktPart |
Determines the maximum number of bucket parts to process. Default: 0 |
Maximum number of output partitions | -nMxmParts |
This option partitions the output of Compare Members in Bulk
job (mpxcomp utility) into smaller chunks for use by Link Entities
job (mpxlink utility). The value set for the mpxcomp utility determines
how many partitioned file segments are passed to Link Entities job
(mpxlink utility), thus the MxmParts value for both must be the same. Default: 1 Attention: The mpxcomp utility fails when:
|
Maximum bucket set size | -maxbktsize |
Maximum bucket size determines the maximum number of members
that can have the same bucket value for candidate selection. For example,
using the default of 500, if more than 500 members have the same bucket
value, Compare Members in Bulk job (mpxcomp utility) ignores those
members for comparison. The value set depends on a variety of factors,
including the number of members in the database and the bucketing
strategy. The log reports bucket hash values exceeding this parameter,
as well as 5 members for you to examine. If set appropriately, the
values reported in the log might indicate a bucket value that should
be defined as an anonymous value. Default: 500 |
Options | ||
Encoding | -encoding |
Determines the encoding of the .unl files.
Options are Latin1, UTF8, and UTF16. Default: Latin1 |
Minimum bucket role | -minBktTag |
Minimum bucket tag to use. Bucket tags are used for speed optimization
and allow bucket data to be created on bucket roles greater than or
equal to the minimum bucket tag and less than or equal to the maximum
bucket tag. Using the bucket tag option enables you to eliminate roles
that do not have any impact on the linking outcome. Default:
0 |
Maximum bucket role | -maxBktTag |
Maximum bucket tag to use. Default: 0 |
Write linkage item records | -{no}bxmLink |
Determines whether to write output for linkage records. Output
is written to a file. Disable this option if you do not want to write
records. Default: -bxmLink (write linkage records) |
Write task item records | -{no}bxmTask |
Determines whether to write output for task records. Disable
this option if you do not want to write records. Default: -bxmTask (write task records) |
Write review identifier item records | -{no}bxmRvid |
Determines whether to write output for Review Identifier task
records. Disable this option if you do not want to write records. Default: -bxmRvid (write Review Identifier tasks) |
Do only different source comparisons | -{no}DiffSrcOnly |
Use this option if you want to compare members from one source
only to records in a different source. For example, records from Source
A are compared to those in Source B and C, but not against other records
in A. Default: -noDiffSrcOnly (compares records across all sources) |
Do only same source comparisons | -{no}SameSrcOnly |
Use this option to compare records only against those from
the same source. For example, records from Source A are compared against
A, but not against B and C. Default: -noSameSrcOnly (compares records across all sources) |
Create a dense memhead lookup table | -{no}dense |
When this option is enabled (-dense), Compare Members in Bulk
job (mpxcomp utility) creates a memhead lookup table that is used
during the comparison operation. The lookup table replaces runtime
computation with a simple array indexing operation. The -dense option
uses more memory, but is faster than -nodense. Specify -dense only
when you have sufficient memory for the data set and if you have large
gaps in your memRecno ranges. Default: -nodense |
Enable incremental cross match | -ixmMode | Use this option to enable incremental cross matching. In IXM
mode, a subset of members is compared rather than the entire member
set. If running a BXM, use the default of false. If running an IXM,
set this to true. Default: disabled |
Generate memory usage information only | The memory usage information generated by this option is viewable
in MDM Workbench by
executing the Get job results action on the
Compare Members in Bulk job (mpxcomp utility) log job returned by
the operational server. Default: disabled |
|
Comparison mode | -cmpMode |
This option controls the Compare Members in Bulk job (mpxcomp
utility) comparison behavior and is intended to improve performance
by excluding comparisons configured only for searching. Comparison
modes can be set in your algorithm as follows: cmpmode 1 = match and
link members, cmpmode 2 = search members, cmpmode 3 = search, match
and link members. The mode set in your algorithm does not have to
match the option specified for mpxcomp utility. The Compare Members
in Bulk job (mpxcomp utility) option acts as a filter for selecting
which comparison functions are used for comparison. For example, if
you specify match and link (option 1), the mpxcomp utility uses only
the comparison functions that are configured for match and link (comparison modes 1 and 3). Typically you would use Compare Members in Bulk job (mpxcomp utility) with match, link and search (comparison mode 3). If this option is not set, all comparison modes configured in your algorithm are compared. If set to match and link only, comparison modes 1 and 3 are compared. If set to search only, comparison modes 2 and 3 are compared. Default: Match, link and search (comparison mode 3) |
Log Options | ||
Trace logging | Produces a trace of activity as interactions flow through the
system. This option is very verbose and should only be used for short
periods of time. Default: disabled |
|
Debug logging | Produces low-level diagnostics used internally by IBM® to identify what was happening in the system
before an error condition occurred. This option generates a large
amount of output per activity and should only be used for short periods
of time. Attention: Debug logging can potentially
include personal member information such as member identification
number, name, and so forth.
Default: disabled |
|
Timer logging | Produces timings on certain operations to help identify where
significant processing time is elapsing. Default: disabled |
|
SQL logging | Outputs the SQL that is sent by the InfoSphere MDM database
layer to the RDBMS. This helps in diagnosing database-related issues.
This option can produce large amounts of output depending on the activity. Default: disabled |
|
Audit logging | Produces activity information and non-critical warnings. Often,
this option is used when a new system is first implemented to monitor
activity. Default: disabled |
|
Algorithm logging | A separate logging level for algorithm-related debug information
without the risk of including protected health information (PHI). Default: disabled |
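The -nMemParts description in the table above explains that each member part is compared against itself and against every other part, and that a setting of 2 saves no memory. The sketch below illustrates that reasoning; it is not product code. The function names are hypothetical, and the memory estimate assumes equally sized parts with at most two parts loaded at once.

```python
from itertools import combinations_with_replacement
from string import ascii_uppercase


def mem_part_pairs(n_mem_parts):
    """List the part-to-part comparisons for a given member part count.

    Parts are labelled A, B, C, ... to match the example in the table,
    which limits this illustration to 26 parts.
    """
    if not 1 <= n_mem_parts <= len(ascii_uppercase):
        raise ValueError("this illustration supports 1 to 26 parts")
    labels = ascii_uppercase[:n_mem_parts]
    # Each part against itself, then against every later part.
    return list(combinations_with_replacement(labels, 2))


def peak_memory_fraction(n_mem_parts):
    """Rough share of the member data held in memory at once.

    Assumes equal-sized parts and two parts resident during a comparison,
    so 1 or 2 parts both need the full data set.
    """
    if n_mem_parts < 1:
        raise ValueError("part count must be positive")
    return min(1.0, 2.0 / n_mem_parts)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        pairs = mem_part_pairs(n)
        print(n, len(pairs), round(peak_memory_fraction(n), 2))
    print(" ".join(f"{a}>{b}" for a, b in mem_part_pairs(3)))
```

The number of part comparisons grows as n(n+1)/2 per bucket part, which is why the table advises using only as many member parts as are needed to fit the comparison data into physical memory.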