Phonetic matching functions
The Netezza Performance Server SQL language supports two phonetic matching functions that you can use to encode names into phonetic representations by using the SoundEx NYSIIS or Double Metaphone algorithms.
By encoding names phonetically, you can match names that are based on their pronunciation and reduce misses that might result from spelling variations. The phonetic matching functions are not case-sensitive comparisons; the phonetic representations are the same for two strings that have the same spelling but different letter casing. These functions support VARCHAR and CHAR data types and specifically the ASCII subset of Latin9 encoding only. The functions ignore any characters outside the ASCII subset.
SoundEx NYSIIS
<varchar(6) value> = nysiis(<str_expr>)
Soundex
is a well-known phonetic algorithm for indexing names by sound as
pronounced in English. This function converts a string into its Soundex
representation by using the New York State Identification and Intelligence
System (NYSIIS) variation of Soundex. The return value is a string
of up to six characters that identifies the pronunciation of the input
string. For example, the function nysiis('Washington')
returns
the string ‘wasang’, while the function nysiis('brown')
returns
the value ‘bran’.
Double Metaphone
<int4 value> = dbl_mp(<str_expr>)
Double Metaphone is another phonetic algorithm for indexing strings by their pronunciation. Similar to Soundex, it uses a different set of rules for English and alternate pronunciation. The function returns two 4-character string encodings, a primary key and secondary (or alternate) key, for pronunciation of the input string. Similar sounding words share keys, though they might be of variable length and spelling. For example, the double metaphone primary and secondary keys for the name ‘washington’ are 'AXNK' and 'FXNK'.
For improved performance,
the dbl_mp function maps the 4-character keys to
16-bit numbers and returns a composite 32-bit value (Netezza Performance Server type
int4) that holds both the 16-bit primary and secondary keys. So, the
function dbl_mp('washington')
returns the value 781598358.
There are three helper functions (pri_mp, sec_mp, and score_mp) that you can use to extract the primary and secondary keys as strings from the return value, and to run key comparisons for scoring relevance.
Primary Metaphone
<varchar(4) value> = pri_mp(<int4 dbl_mp return value>)
This helper function takes the value that is returned
by a call to the dbl_mp function and returns the
corresponding four-character primary metaphone string. For example, pri_mp(781598358)
returns
the primary key AXNK.
Secondary Metaphone
<varchar(4) value> = sec_mp(<int4 dbl_mp return value>)
This helper function takes the value that is returned
by a call to the dbl_mp function and returns the
corresponding four-character secondary metaphone string. For example, sec_mp(781598358)
returns
the secondary key FXNK.
Score Metaphones
<varchar(4) value> = score_mp(<int4 dbl_mp value 1>, <int4 dbl_mp
value 1>, <int4 strong match value>, <int4 normal match value>,
<int4 minor match value>, <int4 no match value>)
- Strongest Match
- Primary Key (1) = Primary Key (2)
- Normal Match
- Secondary Key (1) = Primary Key (2)
Primary Key (1) = Secondary Key (2)
- Minimal Match
- Secondary Key (1) = Secondary Key (2)
For the four match value arguments, you can specify values such as 1, 2, 3, and 4 (for strongest, normal, minimal, or no match). You can also use weighted values such as 100, 50, 25, and 0 to return more points for better match results.
score_mp(781598358,781596310,1,2,3,4)
The function returns the value 1, which indicates a strongest match.
score_mp(781598358,1050031766,100,50,25,0)
The function returns the value 50, which indicates a normal match.