Calculating thresholds

The Threshold Calculator lets you use sample data from your operational server database to calculate appropriate Clerical Review and Auto-link thresholds. Based on the weights files you generated, the Threshold Calculator generates a ROC curve. A ROC curve (Receiver Operating Characteristic curve) is a plot of true positive rate against false positive rate for different threshold values.

About this task

Optionally, to enable false positive calculations and task estimates, you can run the Generate Threshold Analysis Pairs job, and combine the sample pairs that are created by that utility with the weight generation files. See Database Tools jobs.

Attention: Run the Compare Members in Bulk (mpxcomp) and Link Entities (mpxlink) utilities before running the Generate Threshold Analysis Pairs utility.

You must be in the InfoSphere® MDM Workbench Configuration perspective.

Procedure

  1. Select Expert from the View list.
  2. In the View list, select Thresholds.
  3. Click Threshold Calculator to open the Threshold Calculator wizard.
  4. Provide the Source directory for calculations or accept the default. The directory defaults to the project weights\entity-type directory, for example weights\id. This directory is where the distribution sample (dsamp), unmatched sample (usamp), and matched sample (msamp) files are stored by default for the particular entity type. The files were generated during the last step in the weight generation process and copied to the directory when you performed the Get job results function on the Generate Weights job.
  5. Optionally, provide an Analyzed Sample Pair file for calculations. You can browse to the sample pair .xls file generated by the Generate Threshold Analysis Pairs job. The default directory is the Workbench workspace\project_name directory. If multiple sample pair files were generated and evaluated, they must be in the same directory and uploaded at the same time. The Threshold Calculator uses the sample pair files to enable false positive calculations and task estimates. If you do not provide a sample pair file, the Threshold Calculator uses the usamp weights file.
  6. Optionally, provide a Source directory for comparison. You can choose to generate a separate set of weight files and compare how the different weights influence False Positive Rate, False Negative Rate, and the thresholds. Be sure to specify the output directory when running the Generate Weights job to keep from overwriting your existing weights files. Providing a source directory for comparison populates the Comparison Values column on the subsequent panel of the Threshold Calculator.
  7. Optionally, provide an Analyzed Sample Pair file for comparison. As with the Analyzed Sample Pair file for calculations, you can provide a sample pair .xls file to enable false positive calculations and task estimates. In this case, the sample pair file applies to the weight files provided for comparison. If you do not provide a sample pair file, the Threshold Calculator uses the usamp weights file in the Source directory for comparison.
  8. Optionally, indicate the Number of members for this entity type in production database. To complete the procedure in a reasonable time, run the Threshold Calculator on a subset of the total number of records in your database. Unless you are running against the full database, be sure to indicate the total number of members for the entity type in the production database. The Threshold Calculator adjusts the task estimate to reflect the total number of members. For example, when you run the Threshold Calculator against a sample of 1 million records and do not indicate the Number of members for this entity type in production database, the Estimated number of tasks on the subsequent window might display 500. If, however, you indicate that the Number of members for this entity type in production database is 10 million, the Threshold Calculator adjusts the Estimated number of tasks to reflect the total, so the adjusted value might be 5,000. By default, the field is blank. Any number you supply applies both to the baseline weights and to the comparison weights. However, the Threshold Calculator provides a task estimate only if you provided a sample pair .xls file, whether in the Analyzed Sample Pair file for calculations field or in the Analyzed Sample Pair file for comparison field.
  9. Click Next.
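
The adjustment described in step 8 can be sketched as a simple proportion. This is only an illustration that reproduces the 500 to 5,000 example; the function name and the assumption of linear scaling are hypothetical, and the Threshold Calculator's actual formula is not documented here.

```python
def scale_task_estimate(sample_tasks, sample_members, production_members=None):
    """Scale a task estimate from a sample to the full member population.

    Assumption: simple linear proportionality, which matches the
    500 -> 5,000 example in step 8. The real calculation may differ.
    """
    if production_members is None:
        # Field left blank: report the estimate for the sample as-is.
        return sample_tasks
    return round(sample_tasks * production_members / sample_members)


# Sample of 1 million records, production field left blank.
print(scale_task_estimate(500, 1_000_000))              # 500
# Same sample, production database holds 10 million members.
print(scale_task_estimate(500, 1_000_000, 10_000_000))  # 5000
```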

Results

If you supplied a set of comparison weight files, the Comparison Values column displays the relevant results. If you supplied a sample pair file, the Estimated Number of Tasks displays the relevant read-only results. Adjust the values to see how changes affect the other fields. The values update automatically.

You cannot calculate thresholds by the comparison value. However, as you change certain values in the left column (such as the Clerical Review Threshold), the corresponding value in the Comparison Values column updates. Similarly, for False Positive Rate, the left column differs from the Comparison Values value because the left column value is based on the usamp file, while the Comparison Values value is based on the sample pairs file. By contrast, if you change the False Negative Rate, the value in the Comparison Values column does not change because the Threshold Calculator refers to the same weight files for both values.

The different tabs display different graphical views of the data and calculator results:

ROC Curve

A Receiver Operating Characteristic (ROC) curve is a plot of true positive rate against false positive rate for different threshold values. The Threshold Calculator generates the ROC curve based on the usamp, msamp, and dsamp weight files. Use the ROC curve to determine optimal Clerical Review Thresholds and Auto-link Thresholds. The ROC curve uses the actual data in your system, which is represented through statistical estimates that are captured in the usamp file (random sampling), msamp file (matched sampling), and dsamp file (distributed sampling). The false positive rates and the corresponding Auto-link Threshold are calculated using the usamp file. The false negative rates and the corresponding Clerical Review Threshold are calculated using the msamp and dsamp statistics.
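
To make the definition concrete, the following minimal sketch builds ROC points from two lists of comparison scores. It is the textbook construction from labeled scores, not the Threshold Calculator's method, which works from the statistical estimates in the usamp, msamp, and dsamp files. The function name, the sample scores, and the rule that a pair is linked when its score is at or above the threshold are all assumptions made for illustration.

```python
def roc_points(matched_scores, unmatched_scores, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    matched_scores:   scores of pairs known to be the same member
    unmatched_scores: scores of pairs known to be different members
    Assumption: a pair is treated as linked when score >= threshold.
    """
    points = []
    for t in thresholds:
        # Different-member pairs that would be linked in error.
        fp = sum(1 for s in unmatched_scores if s >= t)
        # Same-member pairs that would be linked correctly.
        tp = sum(1 for s in matched_scores if s >= t)
        points.append((t, fp / len(unmatched_scores), tp / len(matched_scores)))
    return points


# Hypothetical scores, for illustration only.
matched = [4.0, 6.5, 8.0, 9.5, 12.0]
unmatched = [-3.0, -1.0, 0.5, 2.0, 7.0]

for t, fpr, tpr in roc_points(matched, unmatched, [0.0, 5.0, 10.0]):
    print(f"threshold={t:5.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Raising the threshold lowers the false positive rate but also lowers the true positive rate, which is the trade-off the ROC curve displays.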

In the ROC curve graph, the X axis represents the false positive rate as a power of 10. For example, a value of -3 on the X axis corresponds to a false positive rate of 10^-3, or 1 false positive in 1,000.

The Y axis represents (1 - false negative rate). For example, a value of 0.95 on the Y axis corresponds to a false negative rate of 0.05 or, in other words, 1 false negative in 20.
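
The two axis readings above can be checked with a short calculation. The helper names are hypothetical; the conversions follow directly from the axis definitions given in the text.

```python
def fpr_from_x(x):
    """The X axis value is the base-10 exponent of the false positive rate."""
    return 10.0 ** x


def fnr_from_y(y):
    """The Y axis value is 1 - false negative rate."""
    return 1.0 - y


fpr = fpr_from_x(-3)
fnr = fnr_from_y(0.95)
print(f"FPR = {fpr:g} (about 1 in {round(1 / fpr):,})")
print(f"FNR = {fnr:.2f} (about 1 in {round(1 / fnr):,})")
```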

Attention: A ROC curve can also be generated using sample pairs, which are based on user review. When the sample pair file is supplied as an input, the ROC curve uses it to adjust the false positive rate and Auto-link Threshold instead of using the usamp file. Using sample pairs does not influence the false negative rate calculation.

ROC Curve (Comparison)

If you supplied a value for Source directory for comparison, the tab displays the ROC Curve generated from the comparison weights.

This curve is populated only when you have another set of ROC values to compare. If you have two sets of algorithms that operate on the same data, you can supply two sets of inputs and get two ROC curves. You can then use them to compare error rates and thresholds.

Sample Pair Evaluations

This graph shows the number of pairs that are the same versus not the same for a particular score. This graph is drawn using the evaluations the user provides in the sample pair file, for example Yes, No, and Maybe. Only the counts for Yes and No are used to populate the graph.

This graph is populated only if you supplied a valid sample pair .xls file to the Threshold Calculator.
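
The counting behind this graph can be sketched as follows. The list of (score, evaluation) tuples is a hypothetical in-memory stand-in for the rows of the reviewed sample pair file, whose actual column layout is not described here; the function name is likewise an assumption.

```python
from collections import Counter


def tally_evaluations(pairs):
    """Count Yes and No evaluations per score; Maybe is ignored.

    pairs: iterable of (score, evaluation) tuples (hypothetical layout).
    Returns two Counters keyed by score: same (Yes) and not_same (No).
    """
    same, not_same = Counter(), Counter()
    for score, evaluation in pairs:
        answer = evaluation.strip().lower()
        if answer == "yes":
            same[score] += 1
        elif answer == "no":
            not_same[score] += 1
        # Any other answer, such as Maybe, does not contribute to the graph.
    return same, not_same


# Hypothetical reviewed pairs, for illustration only.
reviewed = [(8, "Yes"), (8, "No"), (8, "Yes"), (9, "Maybe"), (9, "Yes"), (7, "No")]
same, not_same = tally_evaluations(reviewed)

for score in sorted(set(same) | set(not_same)):
    print(f"score={score}  same={same[score]}  not_same={not_same[score]}")
```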

Sample Pair FPR

Using the number of pairs that are not the same (from the Sample Pair Evaluations graph) and the score distribution (from the Score Distribution graph), this graph is populated to show the probability of false positives by score.

This graph is populated only if you supplied a valid sample pair .xls file to the Threshold Calculator.

Score Distribution

This graph represents the number of pairs that scored at a particular weight.

This graph is populated only if you supplied a valid sample pair .xls file to the Threshold Calculator.

What to do next

When the values are set, click Finish. The Save Thresholds dialog opens. Select the source combinations to which you want to apply the calculated Clerical Review Threshold and Auto-link Threshold values. The Select All, Deselect All, Select Same, and Select Different buttons, and the Set Same and Set Different check boxes, provide a quick way to select combinations. You can also press Shift and click or press Ctrl and click to select multiple source combinations and check or clear their select boxes. Click Save to finish.