Edit distance comparison functions

Edit distance functions compare two strings and count the insertions, deletions, or transpositions needed to make the two strings identical.

For example, suppose that you have a three-dimensional (3Dim) comparison of x(ZIP + address + phone) to y(ZIP + address + phone). The edit distance is the number of edits needed to make the x and y strings match exactly.

You can have one-dimensional, two-dimensional, three-dimensional, or four-dimensional comparisons. Each dimension corresponds to a wgtdim table (mpi_wgt1dim, mpi_wgt2dim, mpi_wgt3dim, and mpi_wgt4dim).

Edit distance functions report their results by using a common set of numbers:

  • 0 (zero) means that one string is missing from the comparison input.
  • 1 means that the edit distance is 0: both strings match exactly.
  • 2 means that the edit distance is 1: one edit (insertion, deletion, or transposition) must be made to make the strings match.
  • 3 means that two edits must be made for a match, and so on.
  • >10 means that the strings are a mismatch.
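
The result numbers above can be sketched in Python as follows. This is an illustration only, not the product's implementation: the helper names osa_distance and result_code are hypothetical, and the sketch assumes that replacing one character with another also counts as a single edit, which the list above does not state.

```python
from typing import Optional


def osa_distance(a: str, b: str) -> int:
    """Count the edits needed to turn a into b.

    Assumption: insertions, deletions, substitutions, and transpositions
    of two adjacent characters each cost one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, for example "34" versus "43".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def result_code(x: Optional[str], y: Optional[str]) -> int:
    """Map a pair of strings to the result numbers listed above."""
    if not x or not y:
        return 0  # one string is missing from the comparison input
    # 1 = exact match, 2 = one edit, 3 = two edits, and so on.
    # Per the list above, a value greater than 10 is read as a mismatch.
    return osa_distance(x, y) + 1


print(result_code("5551234", "5551234"))  # identical phone numbers
print(result_code("5551234", "5551243"))  # last two digits transposed
print(result_code(None, "5551234"))       # one value is missing
```

Under these assumptions, identical strings produce 1, a single transposed pair of digits produces 2, and a missing value produces 0.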

Most of the time, you can deduce what an edit distance function does from its name. Names are formatted as DR<r>D<d><t>, where:

  • DR<r> means edit Distance Role, and <r> indicates the number of roles (or standardized attributes) that are being compared. This is typically a number between 1 and 4. If you are comparing a phone number, you have one role. If you are comparing ZIP + address + phone, you have three roles.
  • D<d> represents Dimensions with <d> being the number of dimensions (tokens or fields that make up an attribute) that are being compared in each role.
  • <t> is a letter representing the type of comparison being performed. Options are:
    • A = simple edit distance comparison resulting in match or no match, so each dimension of the weight table has exactly two entries.
    • B = quick edit distance comparison that returns an integer representing the edit distance. The returned value can sometimes be higher than the true edit distance, but the calculation is much more efficient. This comparison is suggested for long strings.
    • C = real, or true, edit distance comparison. This is the most accurate comparison, but it can consume more system resources. This comparison is suggested for short strings or in instances where absolute accuracy is vital.

For example, the name DR3D1A tells you that the function is an edit distance comparison that uses three comparison roles (cmproles), has one dimension per role, and returns a simple match or no-match result.
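
The naming convention can also be decoded programmatically. The following Python sketch is a hypothetical helper (parse_name and EditDistanceName are not part of the product) that splits a name of the form DR<r>D<d><t> into its parts. It assumes that <r> and <d> are written as plain decimal digits and that <t> is one of A, B, or C.

```python
import re
from typing import NamedTuple


class EditDistanceName(NamedTuple):
    roles: int            # <r>: number of roles being compared
    dimensions: int       # <d>: number of dimensions per role
    comparison_type: str  # <t>: "A", "B", or "C"


# Assumption: digits for <r> and <d>, a single letter A, B, or C for <t>.
_NAME_PATTERN = re.compile(r"DR(\d+)D(\d+)([ABC])")

TYPE_LABELS = {
    "A": "simple match or no match",
    "B": "quick edit distance",
    "C": "true edit distance",
}


def parse_name(name: str) -> EditDistanceName:
    """Split a function name such as DR3D1A into roles, dimensions, and type."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not an edit distance function name: {name!r}")
    roles, dimensions, comparison_type = match.groups()
    return EditDistanceName(int(roles), int(dimensions), comparison_type)


parsed = parse_name("DR3D1A")
print(parsed.roles, parsed.dimensions, TYPE_LABELS[parsed.comparison_type])
```

Names that do not follow the pattern raise a ValueError instead of being guessed at.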

Tip: If you have performance concerns, you might want to avoid using an edit distance calculation for too many attributes in your algorithm.