No replies
1 Post

Pinned topic optimisation on blowfish s-box

‏2012-09-10T11:15:50Z |

The context

Actually I'm searching to optimize the blowfish algorithm to do some "number crunching" (trying to decode something like 10k test vectors from 40 bits widths keys), but I'm not really successful.

After some months of multiples tests and thoughts , I'm hitting a wall (I'm getting only 1400 keys/s for 1 spu, which is slightly better than a not so much manually optimized on a phenom 3.Ghz (for 1 core)).

The algorithm
The blowfish is a simple algorithm :

The core of the algorithm is to get two 32 bits blocks "L" and "R" , and some key-generated (all uint32) constant : P[0-17], and the s-boxes S1[256], S2[256] , S3[256], S4[256] (yes four S-boxes of 256 uint32).

then you apply a simple feistel scheme, which is best describe in pseudo code (and in the picture images of wikipedia :
the F function :
All the algortithm :
The blowfish page :
a pseudo code example of the F function:

uint8_t offset[4]=(uint8_t * )L; uint32_t a1=S1[offset[1]]; uint32_t a2=S2[offset[1]]; uint32_t a3=S3[offset[1]]; uint32_t a4=S4[offset[1]];   uint32_t result=a1+a2; result=result^a3;
//the binary xor result=result+a4;

My approach

To use the vectorization of the spe, I use 4 test vectors by register.

( vector L = {L_tv1,L_tv2,L_tv3,L_tv4} ; vector R = {R_tv1,R_tv2,R_tv3,R_tv4}; Where "L_tv1" is the L block of the first test vector)
The P and S constant are identical across theirs vectors ( vector P1 = {P1,P1,P1,P1};... ; vector S1[256] = { { S1[1],S1[1],S1[1],S1[1]}, {S1[2],S1[2],S1[2],S1[2]},...} ).

It is not the best way to use the memory, but all could be store inside the LS.

With this, my software could be see as the example code

while (not finish) 
{ subkey=generate_key
//512*16 calls to the F function 

for i in all_testvector 
// approx 2700 rows 
{ vector L=Ltest[i]; vector R=rtest[i]; deciph (L,R,subkey); 
// 16 call to the F function 

if (L == goodLvalue && R == goodRvalue) printf (
"the key %llx is ok for on of the test vector at index %d\n",key,i); key++; 

My problem
The trouble with this way , is that the F function become increasingly complex with 4x4 "offset" to get, load and reassemble to do the computation.
And since this function is call more than 50 thousand times for one key, it's optimization is crucial.
my F function is doing approximately 70 lines :
(I'm not a professionnal programmer so I'm sorry if the code isn't so good
I've used so much temporary variable to avoid gcc create false dependency and help it dual issuing instruction)

__attribute__((always_inline))  inline 

static vector unsigned 

int getS(vector unsigned 

int myL) 

static vector unsigned 

int r1,r2,r3,r4; vector unsigned temp; vector unsigned 

int a1,a2,a3,a4, c1,c2,c3,c4;   unsigned 

int a,b,c,d, e,f,g,h, i,j,k_,l, m,n,o,p ;   

const vector unsigned 

int selpata=

const vector unsigned 

int selpat=
};     vector unsigned Xa,Xe,Xi,Xm; vector unsigned Ya,Ye,Yi,Ym; 
//first 32 bit bloc Ya=spu_rlmask(myL,-24); Xa=spu_and(Ya,0xFF); a=si_to_uint(Xa); b=spu_extract(Xa,1); c=spu_extract(Xa,2); d=spu_extract(Xa,3);   a1=spu_sel(S1offset[a],S1offset[b],selpata); c1=spu_sel(S1offset[c],S1offset[d],selpata); r1=spu_sel(a1,c1,selpat);   
//second 32 bit bloc Ye=spu_rlmask(myL,-16); Xe=spu_and(Ye,0xFF); e=si_to_uint(Xe); f=spu_extract(Xe,1); g=spu_extract(Xe,2); h=spu_extract(Xe,3); a2=spu_sel(S2offset[e],S2offset[f],selpata); c2=spu_sel(S2offset[g],S2offset[h],selpata); r2=spu_sel(a2,c2,selpat);   temp=spu_add(r1,r2);   
//third Yi=spu_rlmask(myL,-8); Xi=spu_and(Yi,0xFF); i=si_to_uint(Xi); j=spu_extract(Xi,1); k_=spu_extract(Xi,2); l=spu_extract(Xi,3); a3=spu_sel(S3offset[i],S3offset[j],selpata); c3=spu_sel(S3offset[k],S3offset[l],selpata); r3=spu_sel(a3,c3,selpat); temp=spu_xor(temp,r3); 
//fourth Ym=myL;
//Y=spu_rlmask(myL,0); Xm=spu_and(Ym,0xFF); m=si_to_uint(Xm); n=spu_extract(Xm,1); o=spu_extract(Xm,2); p=spu_extract(Xm,3); a4=spu_sel(S4offset[m],S4offset[n],selpata); c4=spu_sel(S4offset[o],S4offset[p],selpata); r4=spu_sel(a4,c4,selpat); temp=spu_add(temp,r4); 

return temp;   

As you can see, I take much more time to extract the rights 8 bit value than to do the calculations.
(I use spu_sel instead of spu_shuffle , even if the arithmetic pipeline is more used than the load pipeline, because the code is dependant of theses values so 4 cycles instead of 6 are better).

I have done some other optimization (like interleaving two decipher blocks and getting something like 65% of dual issues), but the "critical path" is really this function.
Do you have any idea which can simplify/optimize the design and/or the code ?