Fuzzy string search functions

The Netezza Performance Server SQL language supports two fuzzy string search functions: Levenshtein Edit Distance and Damerau-Levenshtein Edit Distance.

A fuzzy string search is a form of approximate string matching that is based on defined techniques or algorithms. These functions compare two strings to show how similar or different they are. These functions support VARCHAR and CHAR data types (Latin9 encoding) only.

Levenshtein edit distance

SQL syntax:
le_dst(<str_expr_1>, <str_expr_2>)

The return value is an integer (int4) that indicates how different the two input strings are calculated according to the Levenshtein edit distance algorithm. A value of 0 indicates that the strings are equivalent without any modifications. The algorithm computes the number of modifications that are required to change the first string into the second string. The strings are case-sensitive. A modification is a change such as the addition, deletion, case change, or substitution of a single character.

For example:
  • le_dst('sow','show') returns a value of 1 (the addition of the character h).
  • le_dst('hello','Hollow') returns a value of 3 (the substitution of e for o, the capitalization of H, and the addition of w).
To ignore case differences, use functions the upper or lower function to change the case of strings before the comparison. For example:
  • select le_dst('Smith','SMYTH') returns a value of 4 (three uppercase letter changes and the i-to-Y letter substitution).
  • select le_dst(upper('Smith'),'SMYTH') returns a value of 1 (the I-to-Y letter substitution).

Damerau-Levenshtein edit distance

SQL syntax:
dle_dst (<str_expr_1>, <str_expr_2>)
The value that is returned indicates how different the two input strings are calculated according to the Damerau-Levenshtein edit distance algorithm. Similar to the Levenshtein algorithm:
  • A modification is a change such as the addition, deletion, case change, or substitution of a single character.
  • The return value is an integer (int4) that indicates the number of modifications required to change the first string into the second string.
  • A value of 0 indicates that the strings are equivalent without any modifications.
  • The string comparisons are case-sensitive.
However, in the Damerau-Levenshtein algorithm, transposition of neighboring characters counts as a single modification, not two modifications. For example:
  • le_dst('two','tow') returns a value of 2 (changing the w to an o and the o to a w).
  • dle_dst('two','tow') returns a value of 1 (transposing the w and o).