Topic
  • 17 replies
  • Latest Post - ‏2008-11-19T00:30:52Z by SystemAdmin
SystemAdmin
SystemAdmin
10114 Posts

Pinned topic cryptography - md5 hash

‏2008-09-16T03:00:28Z |
Hi, has anyone seen the blackhat paper by nick breese on http://www.security-assessment.com/ ?

He initially claimed to be able to do >1billion md5 hash /second and later revised this number to 80million/second. I have checked his code, available here and this version indeed runs at over >1billion md5hash /second.

However, this code hashes the same value over and over again. The moment i start to increment the hash data and use this hash data, the iterations immediately drop to ~80million. My additional code is negligible.

Does anyone know the reason for this disparity?
Updated on 2008-11-19T00:30:52Z at 2008-11-19T00:30:52Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-09-16T08:27:41Z  
    Hi,

    How did you measure these results?
    The code you referred does not do any iterations if I compile it with -DSPEEDTEST.
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-09-16T08:36:36Z  
    Hi,

    How did you measure these results?
    The code you referred does not do any iterations if I compile it with -DSPEEDTEST.
    The way I did it was to include the line #define SPEEDTEST in spu_md5.c

    Also include some code in ppu_md5.c to calculate the time. Not the most elegant ways but they work
    time(&start);
    ...
    time(&end);
    double dif = difftime (end,start);
    printf("\nSuccessfully executed in %.2lf seconds.\n", dif);
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-09-16T08:39:41Z  
    Hi,

    How did you measure these results?
    The code you referred does not do any iterations if I compile it with -DSPEEDTEST.
    Nick breese claims a compiler optimisation caught him out. Is this possible?

    thanks
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-09-16T11:06:23Z  
    Nick breese claims a compiler optimisation caught him out. Is this possible?

    thanks
    Interesting application. It looks very much like compiler optimizations are the problem. Looking at the generated assembly, defining SPEEDTEST causes the md5_process function to be optimized out entirely! This is because it has no side-effects. I've attached a modified version. Note that it must be compiled with the following additional option to force inlining of the huge function:
    --param max-inline-insns-single=2000

    My changes are commented (my comments prefixed with 'jmt'). In summary they are:
    • Add volatile variables where necessary to prevent dead code stripping
    • Trivial instruction for very slight pipelining improvement
    • Calculate another four values in parallel in order to fill pipeline stalls with useful work.

    At a quick glance there doesn't seem to be much to do to improve the instruction mix. It would be great to shift some work onto the odd pipeline, but I can't see any obvious way to do that.

    Interestingly, a bonus from doing two calculations in parallel is that this happens to push the constant-initialization code out of the loop, which gives a slight performance boost. In my experience spu-gcc is not very good at promoting such instructions out of loops, even if there are spare registers available. It seems pot luck whether it does so or not.

    The modified code calculates 8 values on each of 2.5M iterations on 6 SPUs, making 120M calculations per second. I'm 99% certain the published performance values are meaningless as they are not actually timing full execution of the algorithm.

    Hope this is of interest to the forum...
    Jonny
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-09-16T14:12:14Z  
    Interesting application. It looks very much like compiler optimizations are the problem. Looking at the generated assembly, defining SPEEDTEST causes the md5_process function to be optimized out entirely! This is because it has no side-effects. I've attached a modified version. Note that it must be compiled with the following additional option to force inlining of the huge function:
    --param max-inline-insns-single=2000

    My changes are commented (my comments prefixed with 'jmt'). In summary they are:
    • Add volatile variables where necessary to prevent dead code stripping
    • Trivial instruction for very slight pipelining improvement
    • Calculate another four values in parallel in order to fill pipeline stalls with useful work.

    At a quick glance there doesn't seem to be much to do to improve the instruction mix. It would be great to shift some work onto the odd pipeline, but I can't see any obvious way to do that.

    Interestingly, a bonus from doing two calculations in parallel is that this happens to push the constant-initialization code out of the loop, which gives a slight performance boost. In my experience spu-gcc is not very good at promoting such instructions out of loops, even if there are spare registers available. It seems pot luck whether it does so or not.

    The modified code calculates 8 values on each of 2.5M iterations on 6 SPUs, making 120M calculations per second. I'm 99% certain the published performance values are meaningless as they are not actually timing full execution of the algorithm.

    Hope this is of interest to the forum...
    Jonny
    "Trivial instruction for very slight pipelining improvement" = "Trivial instruction reordering for very slight pipelining improvement"

    Also, improved instruction selection for rotate_left and the and/or-with-complement operations, which nearly doubles the speed.

    I'm fairly certain that there are no compiler optimizations removing/merging redundant code now, but that may merit further checking.

    Performance now 225M iterations/sec, code actually attached this time.
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-09-16T14:18:27Z  
    "Trivial instruction for very slight pipelining improvement" = "Trivial instruction reordering for very slight pipelining improvement"

    Also, improved instruction selection for rotate_left and the and/or-with-complement operations, which nearly doubles the speed.

    I'm fairly certain that there are no compiler optimizations removing/merging redundant code now, but that may merit further checking.

    Performance now 225M iterations/sec, code actually attached this time.
    Why aren't my attachments showing up?

    Code posted at:
    http://www.dur.ac.uk/j.m.taylor/spu_md5.c
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-09-17T01:06:04Z  
    Why aren't my attachments showing up?

    Code posted at:
    http://www.dur.ac.uk/j.m.taylor/spu_md5.c
    Hi, thanks. great analysis
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-10-22T14:19:55Z  
    Question about this code. I'm new to this so bear with me. The comment says it is processing four values: 1234aaa, 2345aaa, 3456aaa, and 4567aaa. But the hex doesn't include the aaa in each value.

    vec_uint4 data = { 0x34333231, 0x35343332, 0x36353433, 0x37363534 };

    Obviously there is something here I am not understanding. If someone could explain this a learner....I would be grateful.
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-10-23T00:33:17Z  
    Question about this code. I'm new to this so bear with me. The comment says it is processing four values: 1234aaa, 2345aaa, 3456aaa, and 4567aaa. But the hex doesn't include the aaa in each value.

    vec_uint4 data = { 0x34333231, 0x35343332, 0x36353433, 0x37363534 };

    Obviously there is something here I am not understanding. If someone could explain this a learner....I would be grateful.
    Hi, this is the ascii value of 1234, 2345, 3456 etc..

    He stores it backwards as 4321, 5432,6543 etc because of big/little endian conversion (Which i feel actually isnt neccessary and just confuses things)
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-10-23T13:11:47Z  
    Hi, this is the ascii value of 1234, 2345, 3456 etc..

    He stores it backwards as 4321, 5432,6543 etc because of big/little endian conversion (Which i feel actually isnt neccessary and just confuses things)
    Thanks for the reply? Do you think there is a reason he hard-coded the aaa into the vector?

    datachunk[1]=(vec_uint4){0x80616161,0x80616161,0x80616161,0x80616161};
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-10-24T00:51:00Z  
    Thanks for the reply? Do you think there is a reason he hard-coded the aaa into the vector?

    datachunk[1]=(vec_uint4){0x80616161,0x80616161,0x80616161,0x80616161};
    hi, i think the code is very simple, purely for speed testing purpose. Taken in this context, I think hard-coding aaa is just for convenience. :)
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-10-29T08:01:04Z  
    Why aren't my attachments showing up?

    Code posted at:
    http://www.dur.ac.uk/j.m.taylor/spu_md5.c
    The slight revival of this thread reminded me of a bit of extra comparison people might find interesting. I wanted to compare the PS3 performance to that of a high-end desktop. I was very surprised to find that a Mac Pro (running all 8 cores) is pretty much exactly the same speed as the PS3 (running all 6 SPUs) for this task.

    I had assumed that this task was something the PS3 would excel at. While that is true, it's also a task that the Xeon is fairly well suited to! The problem doesn't require a large working register set, and only miminal memory access, and is sufficiently parallel on an instruction-by-instruction basis to keep the CPU well fed.

    This is in contrast to my own scientific code which runs about 3 times as fast on the PS3 than on the Mac Pro. In the MD5 case, though, it seems that the PS3 wins but only on purchase cost...
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-10-30T01:04:46Z  
    The slight revival of this thread reminded me of a bit of extra comparison people might find interesting. I wanted to compare the PS3 performance to that of a high-end desktop. I was very surprised to find that a Mac Pro (running all 8 cores) is pretty much exactly the same speed as the PS3 (running all 6 SPUs) for this task.

    I had assumed that this task was something the PS3 would excel at. While that is true, it's also a task that the Xeon is fairly well suited to! The problem doesn't require a large working register set, and only miminal memory access, and is sufficiently parallel on an instruction-by-instruction basis to keep the CPU well fed.

    This is in contrast to my own scientific code which runs about 3 times as fast on the PS3 than on the Mac Pro. In the MD5 case, though, it seems that the PS3 wins but only on purchase cost...
    yup, i think performance really depends on coding style. I am still experimenting and little things can cause large pipeline stalls even when using compiler optimisation.

    I am currently looking at bitsliced implementations of DES, has anyone tried this yet?
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-10-31T18:57:58Z  
    yup, i think performance really depends on coding style. I am still experimenting and little things can cause large pipeline stalls even when using compiler optimisation.

    I am currently looking at bitsliced implementations of DES, has anyone tried this yet?
    I hadn't, but you got me curious. As well as the large number of registers being potentially useful for a bitsliced implementation, it strikes me that the presence of a single-instruction spu_sel could potentially give the SPUs quite an advantage over an x86 machine when it comes to implementing the S-boxes.

    As I already had some code which deals with a different instruction-minimization problem, I had a go at applying it to the case of the S-boxes. I can get the sequence down to an average of about 45 operations per S-box, which is better than the value of 51 quoted here (which doesn't use a 'select' primitive): http://www.darkside.com.au/bitslice/

    I'm optimistic that better search code could get it down significantly further. Since in some cases my code generates a worse sequence than his, despite the availability of an extra primitive, my code is clearly not doing such an effective job of searching the parameter space...
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-11-18T05:08:40Z  
    I hadn't, but you got me curious. As well as the large number of registers being potentially useful for a bitsliced implementation, it strikes me that the presence of a single-instruction spu_sel could potentially give the SPUs quite an advantage over an x86 machine when it comes to implementing the S-boxes.

    As I already had some code which deals with a different instruction-minimization problem, I had a go at applying it to the case of the S-boxes. I can get the sequence down to an average of about 45 operations per S-box, which is better than the value of 51 quoted here (which doesn't use a 'select' primitive): http://www.darkside.com.au/bitslice/

    I'm optimistic that better search code could get it down significantly further. Since in some cases my code generates a worse sequence than his, despite the availability of an extra primitive, my code is clearly not doing such an effective job of searching the parameter space...
    Marc Bevand has kindly posted his bitsliced DES sbox optimiser for PS3 at http://www.epitech.eu/v4/perso/~bevand_m/

    Like you, he achieves 45 instructions on average per sbox. Marc Bevand achieves <40 instructions here http://www.hyperelliptic.org/SPEED/slides/Osvik_cell-speed.pdf

    Theoratically, without optimisation and just 'selb', the PS3 only needs 12+ 4x(8+4+2+1) = 72 gates so this decrease to <40 is not too surprising

    Thanks jmt for your prompt responses so far. I better get down to understanding this instruction minimisation problem!!
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-11-18T10:04:36Z  
    Marc Bevand has kindly posted his bitsliced DES sbox optimiser for PS3 at http://www.epitech.eu/v4/perso/~bevand_m/

    Like you, he achieves 45 instructions on average per sbox. Marc Bevand achieves <40 instructions here http://www.hyperelliptic.org/SPEED/slides/Osvik_cell-speed.pdf

    Theoratically, without optimisation and just 'selb', the PS3 only needs 12+ 4x(8+4+2+1) = 72 gates so this decrease to <40 is not too surprising

    Thanks jmt for your prompt responses so far. I better get down to understanding this instruction minimisation problem!!
    Very interesting, thanks for posting those links.

    I had a skim through the PDF by Osvik & Tromer, but couldn't find any reference to <40 instructions. Could you point me to the bit where that is mentioned?

    When I have a chance I will post my code, but unfortunately at the moment my own (unrelated) research is taking up all my available time. I'm doing a few overnight runs at the moment, though, playing with a modified algorithm that I think should potentially yield slightly shorter sequences...
  • SystemAdmin
    SystemAdmin
    10114 Posts

    Re: cryptography - md5 hash

    ‏2008-11-19T00:30:52Z  
    Very interesting, thanks for posting those links.

    I had a skim through the PDF by Osvik & Tromer, but couldn't find any reference to <40 instructions. Could you point me to the bit where that is mentioned?

    When I have a chance I will post my code, but unfortunately at the moment my own (unrelated) research is taking up all my available time. I'm doing a few overnight runs at the moment, though, playing with a modified algorithm that I think should potentially yield slightly shorter sequences...
    Hi its at slide 16/26, under the header 'selb'