A hashing function encodes data by transforming an input of any size
into a fixed-size “hash code” or “hash value.” A good hash algorithm is
designed to minimize the chance that two different inputs produce the
same hash value, an event termed a collision.
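As a small illustration of these terms, the sketch below uses Python's
standard hashlib module (Python is an assumption here; the text names no
language). The helpers hash_value, tiny_hash, and find_collision are
hypothetical names. tiny_hash deliberately allows only 8 possible hash
values, so a collision is unavoidable once more than 8 distinct inputs
are hashed.

```python
import hashlib


def hash_value(data: str) -> str:
    """Return the SHA-256 hash value of a string as 64 hex characters."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def tiny_hash(data: str, buckets: int = 8) -> int:
    """Hypothetical toy hash: reduce SHA-256 to one of `buckets` values."""
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % buckets


def find_collision(inputs, buckets: int = 8):
    """Return the first pair of distinct inputs that share a tiny_hash value.

    Returns None if no two distinct inputs collide.
    """
    seen = {}  # hash value -> first input that produced it
    for item in inputs:
        h = tiny_hash(item, buckets)
        if h in seen and seen[h] != item:
            return seen[h], item
        seen[h] = item
    return None
```

Calling find_collision(str(i) for i in range(9)) must return a pair,
because nine distinct inputs cannot fit into eight hash values without
at least two of them sharing one.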
You can use hashing functions to speed up the retrieval of data
records (simple one-way lookups), to validate data (“checksums”),
and for cryptography. For lookups, the hash code is used as an index
into a hash table, which contains a pointer to the data record. For
checksums, the hash code is computed for the data before storage or
transmission and then recomputed afterward to verify data integrity;
if the two hash codes do not match, the data has been corrupted or
altered. Cryptographic hash functions, which are designed so that it is
impractical to recover the input from its hash value or to construct
two inputs with the same hash value, are used for data security.
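The lookup and checksum uses described above can be sketched as follows.
This is a minimal illustration in Python, not a description of any
particular product's implementation: the HashTable class, its size, and
the checksum and is_intact helpers are all hypothetical, and SHA-256 is
an assumed choice of hash algorithm.

```python
import hashlib


class HashTable:
    """Toy hash table: the hash code of a key selects a slot (bucket)."""

    def __init__(self, size: int = 16):
        # Each slot holds a list of (key, record) pairs to absorb collisions.
        self.slots = [[] for _ in range(size)]

    def _index(self, key: str) -> int:
        # Use the first 8 bytes of the hash code as an index into the table.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.slots)

    def put(self, key: str, record) -> None:
        bucket = self.slots[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, record)  # overwrite an existing entry
                return
        bucket.append((key, record))

    def get(self, key: str):
        # Only one bucket is scanned, not the whole table.
        for existing_key, record in self.slots[self._index(key)]:
            if existing_key == key:
                return record
        return None


def checksum(data: bytes) -> str:
    """Compute a hash code for the data before storage or transmission."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected: str) -> bool:
    """Recompute the hash code afterward and compare it with the original."""
    return checksum(data) == expected
```

A caller would store checksum(data) alongside the data and later pass
the retrieved bytes to is_intact; a False result means the bytes differ
from what was originally hashed.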
Some common use cases for hashing functions include:
- Detect duplicated records. Because duplicate keys hash to the same
“bucket” in the hash table, the task reduces to scanning buckets that
hold two or more records, a much faster method than sorting and
comparing each record in the file. (The same technique can be used to
find similar records, provided the hashing function is
locality-preserving, so that similar keys hash to the same or
contiguous buckets; the search for similar records can then be limited
to those buckets.)
- Locate points that are near each other. Applying a hashing function
to spatial data effectively partitions the space that is being modeled
into a grid and, as in the previous example, the retrieval/comparison
time is greatly reduced because only the cell that contains a point and
the cells contiguous to it need to be searched. This same technique
works for other types of spatial data, such as shapes and images.
- Verify message integrity. A hash of the message (a “message digest”)
is computed both before and after transmission, and the two hash values
are compared to determine whether the message was corrupted.
- Verify passwords. During authentication, the password that the user
submits is hashed, and this value is compared with the hashed password
stored for that user.
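Three of the use cases above (duplicate detection, locating nearby
points, and password verification) can be sketched in Python as follows.
All function names, the bucket count, the cell size, and the use of a
salt with PBKDF2 are assumptions made for illustration; the text does
not prescribe any of them.

```python
import hashlib
import hmac
import math
from collections import defaultdict


def bucket_of(record: str, num_buckets: int) -> int:
    """Map a record to a bucket number using its hash code."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def find_duplicates(records, num_buckets: int = 64):
    """Return the set of records that occur more than once."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_of(rec, num_buckets)].append(rec)
    duplicates = set()
    for recs in buckets.values():
        if len(recs) < 2:
            continue  # a bucket with one record cannot hold a duplicate
        # Compare only within the bucket; this also rules out collisions.
        for i, rec in enumerate(recs):
            if rec in recs[i + 1:]:
                duplicates.add(rec)
    return duplicates


def cell_of(point, cell_size: float):
    """Hash a 2-D point to the grid cell that contains it."""
    x, y = point
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def build_grid(points, cell_size: float):
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p, cell_size)].append(p)
    return grid


def neighbors_within(grid, point, radius: float, cell_size: float):
    """Find points within `radius` of `point`.

    Assumes radius <= cell_size, so the point's own cell and the eight
    contiguous cells are the only ones that need to be searched.
    """
    cx, cy = cell_of(point, cell_size)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for q in grid.get((cx + dx, cy + dy), []):
                if q != point and math.dist(point, q) <= radius:
                    found.append(q)
    return found


def hash_password(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Hash a password with PBKDF2-HMAC-SHA256 (salt use is an assumption)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def verify_password(attempt: str, salt: bytes, stored_hash: bytes) -> bool:
    """Hash the submitted password and compare it with the stored hash."""
    return hmac.compare_digest(hash_password(attempt, salt), stored_hash)
```

In this sketch the grid search visits at most nine cells regardless of
how many points exist elsewhere, which is the source of the speedup
described above. The password comparison uses hmac.compare_digest, a
constant-time comparison, rather than ==.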