IBM PureData System for Analytics, Version 7.1

Hashing functions

You can use hashing functions to encode data, transforming the input into a “hash code” or “hash value.” The hash algorithm is designed to minimize the chance of a collision, which occurs when two different inputs produce the same hash value.

You can use hashing functions to speed up the retrieval of data records (simple one-way lookups), for the validation of data (“checksums”), and for cryptography. For lookups, the hash code is used as an index into a hash table, which contains a pointer to the data record. For checksums, the hash code is computed for the data before storage or transmission and then recomputed afterward to verify data integrity; if the two hash codes do not match, the data was altered or corrupted. Cryptographic hash functions, which are designed so that it is impractical to recover the input from the hash value or to find two inputs with the same hash value, are used for data security.
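
The lookup and checksum uses described above can be sketched in Python. This is a minimal illustration built on the standard `hashlib` module, not on the hashing functions of the product itself. The names `checksum`, `bucket_index`, `put`, and `get`, the table size, and the sample data are all hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 hash code of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def bucket_index(key: str, table_size: int) -> int:
    """Turn a key's hash code into an index into a hash table (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def put(table, key, record):
    """Store the record in the bucket selected by the key's hash code."""
    table[bucket_index(key, len(table))].append((key, record))


def get(table, key):
    """Search only the one bucket that the key hashes to."""
    for stored_key, record in table[bucket_index(key, len(table))]:
        if stored_key == key:
            return record
    return None


# Lookup: the hash code selects a bucket, so no full scan is needed.
table = [[] for _ in range(8)]
put(table, "alice", {"id": 1})
put(table, "bob", {"id": 2})
found = get(table, "bob")

# Checksum: compute before storage or transmission, recompute afterward.
original = b"customer_id=42;balance=100.00"
stored_checksum = checksum(original)
intact = checksum(b"customer_id=42;balance=100.00") == stored_checksum
corrupted = checksum(b"customer_id=42;balance=900.00") != stored_checksum
```

Each bucket holds a short list so that two keys that collide on the same index can still be told apart by comparing the stored keys.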

Some common use cases for hashing functions include:

Detect duplicated records. Because the keys of duplicate records hash to the same “bucket” in the hash table, the task reduces to scanning buckets that contain two or more records, a much faster method than sorting and comparing each record in the file. (The same technique can be used to find similar records when the hashing function preserves locality, so that similar keys hash to contiguous buckets. The search for similar records can then be limited to those buckets.)
Locate points that are near each other. Applying a hashing function to spatial data effectively partitions the space that is being modeled into a grid, and as in the previous example, the retrieval/comparison time is greatly reduced because only contiguous cells in the grid need to be searched. This same technique works for other types of spatial data, such as shapes and images.
Verify message integrity. A hash of the message (a message digest) is computed both before and after transmission, and the two hash values are compared to determine whether the message was corrupted.
Verify passwords. During authentication, the login credentials of a user are hashed and this value is compared with the hashed password stored for that user.
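
Three of the use cases above (duplicate detection, locating nearby points, and password verification) can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the product's implementation: every function name here is hypothetical, as are the grid cell size and the choice of a salted PBKDF2 hash for passwords.

```python
import hashlib
import hmac
import os
from collections import defaultdict


def find_duplicates(records):
    """Group records by hash code and return the groups with two or more records."""
    buckets = defaultdict(list)
    for rec in records:
        key = hashlib.sha256(rec.encode("utf-8")).hexdigest()
        buckets[key].append(rec)
    # A cheaper, collision-prone hash would need a final equality check here.
    return [group for group in buckets.values() if len(group) >= 2]


def grid_cell(x, y, cell_size=1.0):
    """Map a point to the integer coordinates of its grid cell."""
    return (int(x // cell_size), int(y // cell_size))


def near_candidates(points, query, cell_size=1.0):
    """Return the points in the query's cell or in one of its 8 neighboring cells."""
    grid = defaultdict(list)
    for p in points:
        grid[grid_cell(p[0], p[1], cell_size)].append(p)
    cx, cy = grid_cell(query[0], query[1], cell_size)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            found.extend(grid.get((cx + dx, cy + dy), []))
    return found


def hash_password(password, salt=None, iterations=100_000):
    """Hash a password with PBKDF2-HMAC-SHA256 (salt and iterations are assumptions)."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password, salt, stored_digest, iterations=100_000):
    """Hash the login attempt and compare it with the stored hash."""
    _, candidate = hash_password(password, salt, iterations)
    return hmac.compare_digest(candidate, stored_digest)


duplicates = find_duplicates(["a", "b", "a", "c"])
nearby = near_candidates([(0.5, 0.5), (1.2, 0.9), (5.0, 5.0)], (0.9, 0.9))
salt, stored = hash_password("s3cret")
accepted = verify_password("s3cret", salt, stored)
rejected = verify_password("wrong", salt, stored)
```

The points returned by `near_candidates` are only candidates: a caller would still compute exact distances, but only for the few points in adjacent cells rather than for the whole data set.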