Generate Threshold Calculator Input Files
Generates the usamp, dsamp, and msamp files used by the Threshold Calculator for its calculations.
The usamp, dsamp, and msamp files are normally generated by the Weight Generation job, but they should be regenerated whenever the generated weights are modified.
If you generated weights and then manually modified them, run the Generate Threshold Calculator Input Files job to regenerate the frequency data based on your updated weights.
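
As a rough illustration of the regeneration rule above, the following sketch flags calculator input files that are older than the weight files they were derived from. It is a sketch under stated assumptions, not product behavior: the helper name `inputs_are_stale`, the use of file modification times as the trigger, and the idea that every file under the two directories is relevant are all assumptions made for illustration.

```python
from pathlib import Path


def inputs_are_stale(weights_dir: Path, samp_dir: Path) -> bool:
    """Return True when the calculator input files should be regenerated.

    Hypothetical helper (not part of MDM Workbench). It treats the inputs
    as stale when any weight file was modified after the oldest file in
    the calculator input directory, or when no input files exist yet.
    """
    weight_files = [p for p in weights_dir.rglob("*") if p.is_file()]
    samp_files = [p for p in samp_dir.rglob("*") if p.is_file()]
    if not samp_files:
        # Nothing has been generated yet, so the job needs to run.
        return True
    if not weight_files:
        # No weights to compare against; nothing indicates staleness.
        return False
    newest_weight = max(p.stat().st_mtime for p in weight_files)
    oldest_samp = min(p.stat().st_mtime for p in samp_files)
    return newest_weight > oldest_samp
```

For example, `inputs_are_stale(Path("weights/final"), Path("sampfiles/id"))` would compare the default weight directory against the calculator input files for a hypothetical `id` entity type.
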
Workbench | Description |
---|---|
Entity Type | Specifies the entity type (enttype) for which you want to generate weights. All member types within the specified enttype are processed. If you are implementing multiple entity types (for example, id for Identity and hh for Household), you must generate weights separately for each type. |
Inputs and Outputs | |
Working directory | Specifies the directory where calculator input files are saved within the MDM Workbench project. The default is the sampfiles directory. All files are saved to a subdirectory within the specified Working directory named for the entity type. Attention: The working directory is created each time the job is run. If the default working directory (sampfiles) is used for every job, be sure to import each set of calculator input files into the MDM Workbench project before running the job again for the next entity type. If you do not import the files after each job, the next job will overwrite the files from the previous job. |
BXM input directory | Specifies the input directory from which the bulk-cross-match results are read. This directory must match the Output Directory used by the mpx utility that generated the derived data. |
Weight file input directory | Specifies the input directory from which the weight files are read. The default is the weights/final directory. |
Performance Tuning | |
Number of threads | This value should correspond to the number of CPUs available on the operational server. The goal is to take advantage of all the processing resources available. For example, if running the operational server on a computer with four CPUs, set the number of threads to 4 to keep all four CPUs busy and minimize the time Generate Frequency Stats (mpxfreq utility) takes to run. If you set it to 2, only two CPUs are used, and the processing time is longer. Setting it too high can cause the operational server to switch back and forth between running threads and threads waiting for available CPU cycles. Default: 1 |
Number of comparison bucket partitions | This value must match the “Maximum number of Bucket partitions” value selected in the utility that was used to generate the derived data. The default value might not be correct. Default: 10 |
Number of random pairs bucket partitions | This is the number of random bucket partitions to use when executing the “Generate random pairs of members” step. Default: 5 |
Maximum number of input/output partitions | Default: 10 |
Number of random pairs to generate | Specifies the “Number of random pairs to generate” used by the mpxpair utility. This corresponds to the mad.wgtgen.upair.count property. Default: 3000000 |
Interval for reporting processed records | Default: 10000 |
Maximum bucket set size | Default: 1000 |
Minimum weight for writing item records | Default: 4.0 |
Number of member partitions | Default: 1 |
Options | |
Encoding | Choices are Latin1, UTF8, and UTF16. Default: latin1 |
Comparison mode | Used to limit the comparison function. Typically, you want to use the default value except for rare circumstances, such as generating weights for a search-only attribute. Valid options are: |
Skip last step because of too few attributes | If the data dictionary does not have enough attributes on which to compare members, the operational server cannot make valid matching decisions. When you have too few attributes, the operational server does not run the step to generate matched statistics (or any of the steps that follow). Instead, the operational server generates a set of weights based on the recommended boot weights. For example, if there are only two attributes, such as name and address, then to derive the matched statistics for name you have only address, and you cannot derive a good matched set from address alone. This option enables you to skip the automatic step that generates the matched set and matched statistics, while still performing the iteration step, which checks for convergence of weights. |
Log Options | |
Trace logging | Produces a trace of activity as interactions flow through the system. This option is very verbose and should only be used for short periods of time. Default: disabled |
Debug logging | Produces low-level diagnostics used internally by IBM® to identify what was happening on the system before an error condition occurred. This option generates a large amount of output per activity and should only be used for short periods of time. Debug logging can potentially include personal member information such as member identification number, name, and so forth. Default: disabled |
Timer logging | Produces timings on certain operations to help identify where significant processing time is elapsing. Default: disabled |
SQL logging | Outputs the SQL that is sent by the InfoSphere® MDM database layer to the RDBMS. This helps in diagnosing database-related issues. This option can produce large amounts of output depending on the activity. Default: disabled |
Audit logging | Produces activity information and non-critical warnings. Often, this option is used when a new system is first implemented to monitor activity. Default: disabled |
Algorithm logging | A separate logging level for algorithm-related debug information without the risk of including protected health information (PHI). Default: disabled |
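
The performance-tuning guidance in the table can be summarized as a small pre-flight check. The sketch below is illustrative only: the `JobSettings` class, its field names, and the `check_settings` function are hypothetical and are not part of MDM Workbench. The default values are the ones listed in the table. Pass the CPU count of the operational server as `cpu_count`; falling back to the local machine's CPU count is an assumption that holds only when the check runs on that server.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobSettings:
    """Hypothetical container; defaults mirror the table above."""
    threads: int = 1
    comparison_bucket_partitions: int = 10
    random_pairs_bucket_partitions: int = 5
    random_pairs: int = 3_000_000
    encoding: str = "latin1"


def check_settings(settings: JobSettings,
                   derived_data_partitions: int,
                   cpu_count: Optional[int] = None) -> List[str]:
    """Return warnings for the settings; an empty list means none were found."""
    # Assumption: use the local CPU count only when none is supplied.
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    warnings: List[str] = []
    if settings.threads < cpus:
        idle = cpus - settings.threads
        warnings.append(f"threads={settings.threads} leaves {idle} CPU(s) idle")
    elif settings.threads > cpus:
        warnings.append(
            f"threads={settings.threads} exceeds {cpus} CPU(s); "
            "threads will wait for available CPU cycles"
        )
    # Must equal the bucket partition count used when the derived data was generated.
    if settings.comparison_bucket_partitions != derived_data_partitions:
        warnings.append(
            "comparison bucket partitions do not match the value "
            "used to generate the derived data"
        )
    if settings.encoding.lower() not in {"latin1", "utf8", "utf16"}:
        warnings.append(f"unsupported encoding: {settings.encoding}")
    return warnings
```

With a four-CPU server, `check_settings(JobSettings(threads=2), derived_data_partitions=10, cpu_count=4)` returns a single warning about idle CPUs, matching the example in the Number of threads row.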