Topic
  • No replies
DaveyC
DaveyC
56 Posts

Pinned topic tr1::hash<unsigned long long> considered harmful!

‏2011-06-04T09:28:07Z |
I've discovered that unordered_hash/unordered_set with STCK values as a key results in pathological performance which is almost O(n).
I noticed this when testing a real-world test case. STCK values are used as unique keys on z/OS for UOW tokens for CICS/IMS/DB2 etc.
Our products use STCK as keys to queues. In the case of unordered_set/unordered_map this showed up like a sore thumb! I can
circumvnet the problem by using a good 64bit hash function (thanks Thomas Wang) but would like IBM to address this issue.

The implementation (below) is is not a hash function, it's cast which casts a 64bit integer to a 32bit integer which
just takes the low 32 bits. In the case of a STCK value, this results in a sub-optimal value (it's a shocker)! FWIW, I also tested
the boost::hash<unsigned long long> which is just as pathetic. But the boost hash tables use a prime number for the bucket size
which fixes bad hashes. The IBM version uses a linear hash table, which is great - much faster for expands/shrinks - but does not
protect against bad hash functions.



// TEMPLATE CLASS hash<unsigned long long> template struct hash<unsigned 

long long> : 

public unary_function<unsigned 

long long, size_t> 
{ size_t operator()(unsigned 

long 

long _Val) 

const 
{

return static_cast<size_t>(_Val); 
} 
};


I've coded a simple hash test which, IIRC, comes from Kerinhan & Pikes book "The art of programming". It's based on markov
chains an prints a histogram of the length of synonym chains.


#define __IBMCPP_TR1__ 1   #include <iostream> #include <iomanip> #include <string> #include <cmath> #include <ctime> #include <vector> #include <algorithm> #include <numeric> #include <function> #include <builtins.h> #include <inttypes.h>     #include <functional>   using namespace std; 

double cputime( 

void  ) 
{ 

double time; time = (

double) clock(); time = time / CLOCKS_PER_SEC; 

return time; 
} 
// thanks to Thomas Wong - adapted from Java struct GoodHash : 

public unary_function<uint64_t, size_t> 
{ size_t operator()(uint64_t key) 

const 
{ key = (~key) + (key << 18); key = key ^ (key >> 31); key = (key + (key << 2)) + (key << 4); key = key ^ (key >> 11); key = key + (key << 6); key = key ^ (key >> 22); 

return static_cast<uint32_t>( key ); 
} 
};   struct CalcAverage 
{ CalcAverage() : numBuckets(0), sum(0) 
{ 
} 

void operator() ( 

const size_t& n ) 
{ 

if ( n > 0 ) 
{ sum++; numBuckets += n; 
} 
} size_t result() 
{ 

if ( sum > 0 ) 
{ 

return numBuckets / sum; 
} 

return 0; 
} size_t numBuckets; size_t sum; 
}; bool isPowerOfTwo( 

int n ) 
{ 

return n && !( n & ( n - 1 ) ); 
} bool isPrime( 

const 

int x ) 
{ 

const 

int max = static_cast<int>( std::sqrt( static_cast<double>( x ) ) ) + 1; 

for ( 

int i = 2; i != max; ++i ) 
{ 

if ( x % i == 0 ) 

return 

false; 
} 

return 

true; 
} template <typename K, typename H> 

class HashTester 
{ 

public: HashTester( size_t size );   

public: 

void insert( 

const K& k ); 

void printReport();   

private: H m_hasher; size_t m_size; size_t m_numInserts; std::vector<uint32_t> m_buckets; 

double m_hashTime; 
};   template <typename K, typename H> HashTester<K,H>::HashTester( size_t size ) : m_size( size ) , m_numInserts( 0 ) , m_buckets( m_size, 0 ) 
{ 
}     template <typename K, typename H> 

void HashTester<K,H>::insert( 

const K& k ) 
{ m_numInserts++; 

double startTime = cputime(); uint32_t hash = m_hasher( k ); m_hashTime += cputime() - startTime; m_buckets[hash % m_size]++; 
}     template <typename K, typename H> 

void HashTester<K,H>::printReport() 
{ uint32_t maxBucket = 0; 

const size_t freqsize = 16;        
// good size to fix on a screen  uint32_t freq[freqsize + 1] = 
{0
}; 
// +1 for overflow   
// build the frequency array 

for ( 

int i = 0; i < m_size; i++ ) 
{ maxBucket = max( maxBucket, m_buckets[i] ); 

if ( m_buckets[i] > freqsize ) 
{ freq[freqsize] += m_buckets[i]; 
} 

else 

if ( m_buckets[i] == 0 ) 
{ freq[0]++; 
} 

else 
{ freq[m_buckets[i]] += m_buckets[i]; 
} 
} cout.setf(ios::fixed); size_t average = for_each( m_buckets.begin(), m_buckets.end(), CalcAverage() ).result(); string tableType; 

if      ( isPrime( m_size) )        tableType = 
"(Prime)"; 

else 

if ( isPowerOfTwo( m_size ) )  tableType = 
"(Power of two)"; cout << 
"Statistics:\n"; cout << 
"   Hash table size . . . . . " << m_size << 
" " << tableType << 
"\n"; cout << 
"   Total inserts . . . . . . " << m_size << 
"\n"; cout << 
"   Maximum bucket size . . . " << maxBucket << 
"\n"; cout << 
"   Average bucket size . . . " << average   << 
"\n"; cout << 
"   Hash time in CPU secs . . " << left << setw(10) << setprecision(6) << m_hashTime  << 
"\n"; cout << 
"Histogram:\n"; cout << 
"  Size    Count   %total\n"; cout << 
" -----  -------  ------- *....1....2....3....4....5....6....7....8....9....*\n"; 

float  percent; size_t stars; 

for ( 

int i = 0; i < freqsize; i++ ) 
{ percent = i > 0 ? (freq[i] * 100.0) / m_numInserts : 0; stars = static_cast<int>( percent * 0.5 ); cout << setw(5) << right << i << 
"   " << setw(7) << freq[i] << 
"  " << setw(6) << setprecision(2) << percent << 
"%" << 
"  " << string( stars, 
'*' ) << 
"\n"; 
} percent = (freq[freqsize] * 100.0) / m_numInserts; stars = static_cast<int>( percent * 0.5 ); cout << 
"  >" << setw(2) << freqsize - 1 << 
"   " << setw(7) << freq[freqsize] << 
"  " << setw(6) << setprecision(2) << percent << 
"%" << 
"  " << string( stars, 
'*' ) << 
"\n"; 
} 

int main( 

int argc, 

char *argv[] ) 
{ 
// set defaults size_t tableSize = 16384; size_t inserts = tableSize * 8; 

switch ( argc ) 
{ 

case 2: tableSize = atoi( argv[1] ); inserts   = tableSize * 8; 

break; 

case 3: tableSize = atoi( argv[1] ); inserts   = atoi( argv[2] ); 
} typedef std::vector<uint64_t> KeyArray; KeyArray keys;   

while ( inserts-- ) 
{ unsigned 

long 

long s; __stck( &s ); keys.push_back( s ); 
} HashTester<uint64_t, std::tr1::hash<uint64_t> > bad( tableSize ); 

for ( KeyArray::iterator i = keys.begin(); i < keys.end(); i++ ) 
{ bad.insert( *i ); 
} bad.printReport(); HashTester<uint64_t, GoodHash> good( tableSize ); 

for ( KeyArray::iterator i = keys.begin(); i < keys.end(); i++ ) 
{ good.insert( *i ); 
} good.printReport(); 
}

David Crayford
Updated on 2012-03-29T07:49:53Z at 2012-03-29T07:49:53Z by DaveyC
  • DaveyC
    DaveyC
    56 Posts

    Re: tr1::hash&lt;unsigned long long&gt; considered harmful!

    ‏2011-08-09T15:25:28Z  
    Follow up: Here's a real world use case. I wrote a program to analyze a GETMAIN/FREEMAIN GTF trace and used an unordered_map<int,GTFRec>, where the key is the storage address. First problem was a compiler error due to some code that references the _VACPP_HASH_FUNCTION_CHECK. The code is a debug routine and should be enclosed in #ifdef/#endif!
    "//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN5274 (S) The name lookup for "_VACPP_HASH_FUNCTION_CHECK" did not find a
    declaration.
    "//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6226 (I) Declarations for non-dependent names are resolved in the template
    definition.
    "//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6227 (I) "_VACPP_HASH_FUNCTION_CHECK" does not depend on a template argument
    CCN0793(I) Compilation failed for file //'DOC.CPP(GFSDUMP)'. Object file not created.

    So after I compiled my program with the _VACPP_HASH_FUNCTION_CHECK switch on I realised that the same problem exists for hash<int>, it's not a hash function it's a cast! It looks to me like these hash functions have been copied from boost, which uses primes so gets away with it! As we all know, with linear hash tables (powers of 2) we need a good hash function. Here's a good one...

    
    struct Hasher 
    { uint32_t operator() ( uint32_t a ) 
    
    const 
    { a = ( a + 0X7ED55D16 ) + ( a << 12); a = ( a ^ 0XC761C23C ) ^ ( a >> 19); a = ( a + 0X165667B1 ) + ( a <<  5); a = ( a + 0XD3A2646C ) ^ ( a <<  9); a = ( a + 0XFD7046C5 ) + ( a <<  3); a = ( a ^ 0XB55A4F09 ) ^ ( a >> 16); 
    
    return a; 
    } 
    };
    


    Come on guys, we can do better than this! I can't be bothered to open a PMR but I'm guessing that all the hash functions are bad.

    David Crayford
  • ChrisCambly
    ChrisCambly
    1 Post

    Re: tr1::hash&lt;unsigned long long&gt; considered harmful!

    ‏2011-08-18T21:28:53Z  
    • DaveyC
    • ‏2011-08-09T15:25:28Z
    Follow up: Here's a real world use case. I wrote a program to analyze a GETMAIN/FREEMAIN GTF trace and used an unordered_map<int,GTFRec>, where the key is the storage address. First problem was a compiler error due to some code that references the _VACPP_HASH_FUNCTION_CHECK. The code is a debug routine and should be enclosed in #ifdef/#endif!
    "//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN5274 (S) The name lookup for "_VACPP_HASH_FUNCTION_CHECK" did not find a
    declaration.
    "//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6226 (I) Declarations for non-dependent names are resolved in the template
    definition.
    "//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6227 (I) "_VACPP_HASH_FUNCTION_CHECK" does not depend on a template argument
    CCN0793(I) Compilation failed for file //'DOC.CPP(GFSDUMP)'. Object file not created.

    So after I compiled my program with the _VACPP_HASH_FUNCTION_CHECK switch on I realised that the same problem exists for hash<int>, it's not a hash function it's a cast! It looks to me like these hash functions have been copied from boost, which uses primes so gets away with it! As we all know, with linear hash tables (powers of 2) we need a good hash function. Here's a good one...

    <pre class="jive-pre"> struct Hasher { uint32_t operator() ( uint32_t a ) const { a = ( a + 0X7ED55D16 ) + ( a << 12); a = ( a ^ 0XC761C23C ) ^ ( a >> 19); a = ( a + 0X165667B1 ) + ( a << 5); a = ( a + 0XD3A2646C ) ^ ( a << 9); a = ( a + 0XFD7046C5 ) + ( a << 3); a = ( a ^ 0XB55A4F09 ) ^ ( a >> 16); return a; } }; </pre>

    Come on guys, we can do better than this! I can't be bothered to open a PMR but I'm guessing that all the hash functions are bad.

    David Crayford
    The unordered associative containers use specializations of the class template hash as the default hash function. The class template is specialized on all built-in types in the <functional> header and all of the specializations currently do the same thing. The implementations are basically the identity function. The standard mandates that these functions must be provided but it does not specify what exactly they should do.

    At the C++ Language committee level the Library Working Group did not want to get involved in the specification of generic hash functions that are valid for each builtin types. A generic hash function is very difficult to provide since each users requirements and usage of hash tables is different. It was always the intent that the user would be required to write their own hash function and pick an appropriate hash function based on their own specific data set. The question about which generic hash function to provide for each users data set is best answered by the user since they are familiar with their own requirements.

    As you have stated, several implemenations use the identity function as the default hash function. It was likely chosen as the default because it is fast from an execution point of view not from a bucket distribution or usefulness view point. You have provided an excellent example of how one would write their own hash function. In addition, the unordered associative container also provides utility functions for monitoring as well as tuning the performance of the hash table in the functions: bucket_count(), bucket_size(n) and load_factor(). If you find that the performance of the hash table is not quite what you expect the hash function is the most likely candidate for replacement and the utility functions can be used to help determine if that is the case.

    The _VACPP_HASH_FUNCTION_CHECK problem appears to have been fixed already. If you could provide the specifics of your compiler version and OS I can check if a fix has been made available for your levels.
  • DaveyC
    DaveyC
    56 Posts

    Re: tr1::hash&lt;unsigned long long&gt; considered harmful!

    ‏2011-09-06T13:07:53Z  
    The unordered associative containers use specializations of the class template hash as the default hash function. The class template is specialized on all built-in types in the <functional> header and all of the specializations currently do the same thing. The implementations are basically the identity function. The standard mandates that these functions must be provided but it does not specify what exactly they should do.

    At the C++ Language committee level the Library Working Group did not want to get involved in the specification of generic hash functions that are valid for each builtin types. A generic hash function is very difficult to provide since each users requirements and usage of hash tables is different. It was always the intent that the user would be required to write their own hash function and pick an appropriate hash function based on their own specific data set. The question about which generic hash function to provide for each users data set is best answered by the user since they are familiar with their own requirements.

    As you have stated, several implemenations use the identity function as the default hash function. It was likely chosen as the default because it is fast from an execution point of view not from a bucket distribution or usefulness view point. You have provided an excellent example of how one would write their own hash function. In addition, the unordered associative container also provides utility functions for monitoring as well as tuning the performance of the hash table in the functions: bucket_count(), bucket_size(n) and load_factor(). If you find that the performance of the hash table is not quite what you expect the hash function is the most likely candidate for replacement and the utility functions can be used to help determine if that is the case.

    The _VACPP_HASH_FUNCTION_CHECK problem appears to have been fixed already. If you could provide the specifics of your compiler version and OS I can check if a fix has been made available for your levels.
    Chris,

    Thank you so much for your reply. I can understand that it's difficult to provide decent generic hash functions. What I want are hash functions that perform optionally on the platform I use. I've checked all the unordered associative containers and the hash functions in the popular STL implementations (z/OS, MSVC, GCC, Dinkumware, SGI) and the only one that performs poorly is z/OS. I also noticed that the only libraries that used the identity function also used a prime based scheme, and as you well know that circumvents poor hash functions.

    As a customer I don't expect to have to RYO my own hash function. The standard says that average case should be O(1) and worse case O(n). I would say the current implementation is crippled and is average close to O(n) for the majority of cases.

    This is a enterprise compiler and library. It's not cheap! I would expect out of the box performance to be world class! Maybe I have lofty expectations?

    David Crayford
  • DaveyC
    DaveyC
    56 Posts

    Re: tr1::hash&lt;unsigned long long&gt; considered harmful!

    ‏2012-03-29T07:49:53Z  
    • DaveyC
    • ‏2011-09-06T13:07:53Z
    Chris,

    Thank you so much for your reply. I can understand that it's difficult to provide decent generic hash functions. What I want are hash functions that perform optionally on the platform I use. I've checked all the unordered associative containers and the hash functions in the popular STL implementations (z/OS, MSVC, GCC, Dinkumware, SGI) and the only one that performs poorly is z/OS. I also noticed that the only libraries that used the identity function also used a prime based scheme, and as you well know that circumvents poor hash functions.

    As a customer I don't expect to have to RYO my own hash function. The standard says that average case should be O(1) and worse case O(n). I would say the current implementation is crippled and is average close to O(n) for the majority of cases.

    This is a enterprise compiler and library. It's not cheap! I would expect out of the box performance to be world class! Maybe I have lofty expectations?

    David Crayford
    I've just compiled this on a z/OS 1.13 system and the _VACPP_HASH_FUNCTION_CHECK problem has not been fixed! Shall I open a PMR?

    David Crayford