NGRAM

Use the NGRAM generation function to improve candidate selection when attribute values are similar but might vary in spelling or in token position.

The NGRAM bucket generation function can be used with the supporting bucketing functions of ATTR, BXNM, CXNM, and PXNM.

An N-gram is a contiguous sub-sequence of size N generated from an input sequence. Collectively, N-grams are all the sub-sequences of size N that can be generated from the input sequence. The NGRAM function creates buckets based on the N-grams generated from each data item.

The value, or size, of “N” is user-defined and set as a dvdArg in mpi_dvdybkt. If not specified, the default value of 3 is used. For names, a gram size of 3 or 4 is suggested. For phone numbers, a gram size of 7 is suggested.

An example of NGRAM output is:

N = 3 with the input of “MOHAMMED” results in MOH OHA HAM AMM MME MED

N = 4 with the input of “6345111” results in 6345 3451 4511 5111
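
The sliding-window behavior in these examples can be sketched in Python. This is only an illustration of the concept, not the product's implementation; the function name ngrams is hypothetical.

```python
def ngrams(token, n=3):
    """Return every contiguous sub-sequence of length n, left to right."""
    # A token shorter than n yields an empty list, because the range is empty.
    return [token[i:i + n] for i in range(len(token) - n + 1)]


print(" ".join(ngrams("MOHAMMED", 3)))  # MOH OHA HAM AMM MME MED
print(" ".join(ngrams("6345111", 4)))   # 6345 3451 4511 5111
```

In this sketch, a token of length L produces L - N + 1 grams when L is at least N, and none otherwise.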

Two additional dvdArgs, TINYGRAMS and REMOVEVOWELS, are available to further break down the output and enhance bucket results. Multiple dvdArgs can be specified by using either a comma (,) or a plus sign (+) as the separator.

  • TINYGRAMS

    The TINYGRAMS dvdArg generates grams from tokens with a length less than N. By default, NGRAM does not use TINYGRAMS and generates buckets only from tokens whose length is equal to or greater than N. Use the TINYGRAMS option when some name tokens are shorter than your chosen N but might be significant for bucketing purposes and would otherwise be ignored.

    For example, consider the name AL AS'AD BEN HANI with N = 3. Without TINYGRAMS, the output is:

    ASA SAD BEN HAN ANI

    With TINYGRAMS, the output is:

    AL ASA SAD BEN HAN ANI

    Without TINYGRAMS, the token 'AL' is ignored in bucketing. In many cases, the presence of this token could be significant to the comparison results.

  • REMOVEVOWELS

    The REMOVEVOWELS dvdArg removes all occurrences of A, E, I, O, and U from medial and final token positions; a vowel in the initial position is kept. The default is to keep all vowels. Use this option to increase the chance that similar names that differ only in their vowels fall into the same bucket.

    For example, use the name MOHAMMAD HASSAN AKHUND. Without REMOVEVOWELS, the output is:

    MOH OHA HAM AMM MMA MAD HAS ASS SSA SAN AKH KHU HUN UND

    With REMOVEVOWELS, the output is:

    MHM HMM MMD HSS SSN AKH KHN HND
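
The effect of the two options can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the names remove_vowels and bucket_grams are hypothetical, and the sketch assumes that vowels are removed before grams are generated and that TINYGRAMS emits a short token whole as a single gram.

```python
VOWELS = set("AEIOU")


def remove_vowels(token):
    """Drop A, E, I, O, U everywhere except the first character."""
    return token[:1] + "".join(ch for ch in token[1:] if ch not in VOWELS)


def bucket_grams(name, n=3, tinygrams=False, removevowels=False):
    """Generate grams for each whitespace-separated token of a name."""
    grams = []
    for token in name.upper().split():
        if removevowels:
            # Assumption: vowel removal happens before gram generation.
            token = remove_vowels(token)
        if len(token) >= n:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
        elif tinygrams:
            # Assumption: a token shorter than n becomes one gram as-is.
            grams.append(token)
    return grams


print(" ".join(bucket_grams("AL BEN HANI")))                  # BEN HAN ANI
print(" ".join(bucket_grams("AL BEN HANI", tinygrams=True)))  # AL BEN HAN ANI
print(" ".join(bucket_grams("MOHAMMAD HASSAN AKHUND", removevowels=True)))
# MHM HMM MMD HSS SSN AKH KHN HND
```

Under these assumptions, REMOVEVOWELS can shorten a token below N (for example, BEN becomes BN when N = 3), in which case the token produces a gram only if TINYGRAMS is also set.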

Because the NGRAM function can generate a large number of buckets, use it together with Frequency-Based-Bucketing.