Generate Threshold Calculator Input Files

Generates the usamp, dsamp, and msamp files used by the Threshold Calculator for its calculations.

The usamp, dsamp, and msamp files are normally generated by the Weight Generation job, but they should be regenerated whenever the generated weights are modified.

If you have generated weights and then manually modified them, run the Generate Threshold Calculator Input Files job to regenerate the frequency data based on your updated weights.

Table 1. Generate threshold calculator input files options
Workbench Description
Entity Type Specifies the entity type (enttype) for which you want to generate weights. All member types within the specified enttype are processed. If you are implementing multiple entity types (for example, id for Identity and hh for Household), you must generate weights separately for each type.
Inputs and Outputs
Working directory Specifies the directory where calculator input files are saved within the MDM Workbench project. The default is the sampfiles directory. All files are saved to a subdirectory within the specified Working directory named for the entity type.
Attention: The working directory is created each time the job is run. If the default working directory (sampfiles) is used for every job, be sure to import each set of calculator input files into the MDM Workbench project before running the job again for the next entity type. If you do not import the files after each job, the next job will overwrite the files from the previous job.
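
The overwrite risk described in the Attention note can be checked before a run. The following is a minimal sketch, assuming only what the option text states: output is saved in subdirectories of the working directory, one per entity type. The function name `existing_output` is hypothetical and is not part of MDM Workbench; the script only inspects the file system and does not import anything into the project.

```python
from pathlib import Path


def existing_output(working_dir: str) -> list[str]:
    """Return the names of subdirectories of working_dir that contain files.

    Assumption: each subdirectory is named for an entity type, as described
    for the Working directory option.
    """
    root = Path(working_dir)
    if not root.is_dir():
        return []
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and any(child.iterdir())
    )


if __name__ == "__main__":
    # "sampfiles" is the default working directory named in the table.
    pending = existing_output("sampfiles")
    if pending:
        print("Import these sets before running the job again:", pending)
    else:
        print("No earlier calculator input files found.")
```

If the list is not empty, import those files into the MDM Workbench project before starting the job for the next entity type.
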
BXM input directory Specifies the input directory from which the bulk-cross-match results are read. This directory must match the Output Directory used by the mpx utility that generated the derived data.
Weight file input directory Specifies the input directory from which the weight files are read. The default is the weights/final directory.
Performance Tuning
Number of threads This value should correspond to the number of CPUs available on the operational server, so that all available processing resources are used. For example, if the operational server runs on a computer with four CPUs, set the number of threads to 4 to keep all four CPUs busy and minimize the time that Generate Frequency Stats (mpxfreq utility) takes to run. If you set it to 2, only two CPUs are used and processing takes longer. Setting the value too high can cause the operational server to switch back and forth between running threads and threads that are waiting for available CPU cycles.

Default: 1
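
One way to pick a starting value is to ask the operating system how many CPUs it reports. The sketch below is an illustration, not an IBM tool: it must be run on the operational server host itself, and the `reserved_cpus` parameter is a hypothetical convenience for leaving some CPUs to other work.

```python
import os


def suggested_thread_count(reserved_cpus: int = 0) -> int:
    """Suggest a thread count from the CPUs reported by the operating system.

    reserved_cpus is a hypothetical knob for leaving CPUs free; the result
    is never lower than 1, which is also the documented default.
    """
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, cpus - reserved_cpus)


if __name__ == "__main__":
    print("Suggested number of threads:", suggested_thread_count())
```

The reported count can include logical CPUs, so treat the result as a starting point and adjust it if the server shows signs of thread contention.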

Number of comparison bucket partitions This value must match the “Maximum number of Bucket partitions” value selected in the utility that was used to generate the derived data. The default value might not match that setting, so verify it before you run the job.

Default: 10
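
Two options in this table must agree with the run that generated the derived data: the BXM input directory and the number of comparison bucket partitions. The sketch below shows one way to compare the two sets of values before starting the job. It assumes that you have recorded both sets of settings yourself; the dictionary keys and the function name `find_mismatches` are hypothetical and do not correspond to real MDM property names.

```python
def find_mismatches(derive_settings: dict, job_settings: dict) -> list[str]:
    """Compare settings that must agree between the two runs.

    Both dictionaries use hypothetical keys chosen for this example only.
    """
    # (key in the derived-data run, key in this job)
    pairs = [
        ("output_directory", "bxm_input_directory"),
        ("max_bucket_partitions", "comparison_bucket_partitions"),
    ]
    problems = []
    for derive_key, job_key in pairs:
        if derive_settings[derive_key] != job_settings[job_key]:
            problems.append(
                f"{job_key}={job_settings[job_key]!r} does not match "
                f"{derive_key}={derive_settings[derive_key]!r}"
            )
    return problems


if __name__ == "__main__":
    # Example values are invented for illustration.
    derive = {"output_directory": "work/bxm", "max_bucket_partitions": 20}
    job = {"bxm_input_directory": "work/bxm", "comparison_bucket_partitions": 10}
    for problem in find_mismatches(derive, job):
        print("Mismatch:", problem)
```

In this invented example the job still uses the default of 10 partitions while the derived data was generated with 20, so one mismatch is reported.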

Number of random pairs bucket partitions This is the number of random bucket partitions to use when executing the “Generate random pairs of members” step.

Default: 5

Maximum number of input/output partitions Default: 10
Number of random pairs to generate Specifies the “Number of random pairs to generate” used by the mpxpair utility. This corresponds to the mad.wgtgen.upair.count property.

Default: 3000000

Interval for reporting processed records Default: 10000
Maximum bucket set size Default: 1000
Minimum weight for writing item records Default: 4.0
Number of member partitions Default: 1
Options
Encoding Choices are Latin1, UTF8, and UTF16.

Default: Latin1

Comparison mode Limits the comparison function. Use the default value except in rare circumstances, such as when generating weights for a search-only attribute. Valid options are:
  • Match and link only
  • Search only
  • Match, link and search (default)
Skip last step because of too few attributes If the data dictionary does not have enough attributes on which to compare members, the operational server cannot make valid matching decisions. When there are too few attributes, the operational server does not run the step that generates matched statistics (or any of the steps that follow). Instead, it generates a set of weights based on the recommended boot weights. For example, if there are only two attributes, such as name and address, only address is available for deriving the matched statistics for name, and you cannot derive a good matched set from address alone. This option enables you to skip the automatic step that generates the matched set and matched statistics, while still performing the iteration step that checks for convergence of weights.
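
The name and address example can be made concrete with a small illustration. The sketch below does not reproduce the operational server's algorithm. It assumes only what the example states, that the matched statistics for one attribute are derived from the other attributes, and lists what remains as evidence for each one. The function name `evidence_attributes` is hypothetical.

```python
def evidence_attributes(attributes: list[str]) -> dict[str, list[str]]:
    """For each attribute, list the other attributes left to derive its
    matched statistics from (illustrative assumption, see lead-in)."""
    return {
        target: [other for other in attributes if other != target]
        for target in attributes
    }


if __name__ == "__main__":
    for target, others in evidence_attributes(["name", "address"]).items():
        print(f"{target}: derived from {others}")
    # prints:
    # name: derived from ['address']
    # address: derived from ['name']
```

With only two attributes, each one is left with a single attribute as evidence, which is the situation this option is meant for.
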
Log Options
Trace logging Produces a trace of activity as interactions flow through the system. This option is very verbose and should only be used for short periods of time.

Default: disabled

Debug logging Produces low-level diagnostics used internally by IBM® to identify what was happening on the system before an error condition occurred. This option generates a large amount of output per activity and should only be used for short periods of time.

Debug logging can potentially include personal member information such as member identification number, name, and so forth.

Default: disabled

Timer logging Produces timings on certain operations to help identify where significant processing time is elapsing.

Default: disabled

SQL logging Outputs the SQL that is sent by the InfoSphere® MDM database layer to the RDBMS. This helps in diagnosing database-related issues. This option can produce large amounts of output depending on the activity.

Default: disabled

Audit logging Produces activity information and non-critical warnings. Often, this option is used when a new system is first implemented to monitor activity.

Default: disabled

Algorithm logging Provides a separate logging level for algorithm-related debug information, without the risk of including protected health information (PHI).

Default: disabled