Topic
• No replies
bria
1 Post

# Pinned topic optimisation on blowfish s-box

‏2012-09-10T11:15:50Z |
Hello,

#### The context

Actually I'm searching to optimize the blowfish algorithm to do some "number crunching" (trying to decode something like 10k test vectors from 40 bits widths keys), but I'm not really successful.

After some months of multiples tests and thoughts , I'm hitting a wall (I'm getting only 1400 keys/s for 1 spu, which is slightly better than a not so much manually optimized on a phenom 3.Ghz (for 1 core)).

##### The algorithm
The blowfish is a simple algorithm :

The core of the algorithm is to get two 32 bits blocks "L" and "R" , and some key-generated (all uint32) constant : P[0-17], and the s-boxes S1[256], S2[256] , S3[256], S4[256] (yes four S-boxes of 256 uint32).

then you apply a simple feistel scheme, which is best describe in pseudo code (and in the picture images of wikipedia :
the F function : http://en.wikipedia.org/wiki/File:BlowfishFFunction.svg
All the algortithm : http://en.wikipedia.org/wiki/File:BlowfishDiagram.png
The blowfish page : http://en.wikipedia.org/wiki/Blowfish_%28cipher%29
)
a pseudo code example of the F function:
``````
uint8_t offset[4]=(uint8_t * )L; uint32_t a1=S1[offset[1]]; uint32_t a2=S2[offset[1]]; uint32_t a3=S3[offset[1]]; uint32_t a4=S4[offset[1]];   uint32_t result=a1+a2; result=result^a3;
//the binary xor result=result+a4;```
```

#### My approach

To use the vectorization of the spe, I use 4 test vectors by register.

( vector L = {L_tv1,L_tv2,L_tv3,L_tv4} ; vector R = {R_tv1,R_tv2,R_tv3,R_tv4}; Where "L_tv1" is the L block of the first test vector)
The P and S constant are identical across theirs vectors ( vector P1 = {P1,P1,P1,P1};... ; vector S1[256] = { { S1[1],S1[1],S1[1],S1[1]}, {S1[2],S1[2],S1[2],S1[2]},...} ).

It is not the best way to use the memory, but all could be store inside the LS.

With this, my software could be see as the example code
``````

while (not finish)
{ subkey=generate_key
{key
};
//512*16 calls to the F function

for i in all_testvector
// approx 2700 rows
{ vector L=Ltest[i]; vector R=rtest[i]; deciph (L,R,subkey);
// 16 call to the F function

if (L == goodLvalue && R == goodRvalue) printf (
"the key %llx is ok for on of the test vector at index %d\n",key,i); key++;
}```
```

##### My problem
The trouble with this way , is that the F function become increasingly complex with 4x4 "offset" to get, load and reassemble to do the computation.
And since this function is call more than 50 thousand times for one key, it's optimization is crucial.
my F function is doing approximately 70 lines :
(I'm not a professionnal programmer so I'm sorry if the code isn't so good
I've used so much temporary variable to avoid gcc create false dependency and help it dual issuing instruction)
``````
__attribute__((always_inline))  inline

static vector unsigned

int getS(vector unsigned

int myL)
{

static vector unsigned

int r1,r2,r3,r4; vector unsigned temp; vector unsigned

int a1,a2,a3,a4, c1,c2,c3,c4;   unsigned

int a,b,c,d, e,f,g,h, i,j,k_,l, m,n,o,p ;

const vector unsigned

int selpata=
{0x0,0xFFFFFFFF,0x0,0xFFFFFFFF
};

const vector unsigned

int selpat=
{0x0,0x0,0xFFFFFFFF,0xFFFFFFFF
};     vector unsigned Xa,Xe,Xi,Xm; vector unsigned Ya,Ye,Yi,Ym;
//first 32 bit bloc Ya=spu_rlmask(myL,-24); Xa=spu_and(Ya,0xFF); a=si_to_uint(Xa); b=spu_extract(Xa,1); c=spu_extract(Xa,2); d=spu_extract(Xa,3);   a1=spu_sel(S1offset[a],S1offset[b],selpata); c1=spu_sel(S1offset[c],S1offset[d],selpata); r1=spu_sel(a1,c1,selpat);
//second 32 bit bloc Ye=spu_rlmask(myL,-16); Xe=spu_and(Ye,0xFF); e=si_to_uint(Xe); f=spu_extract(Xe,1); g=spu_extract(Xe,2); h=spu_extract(Xe,3); a2=spu_sel(S2offset[e],S2offset[f],selpata); c2=spu_sel(S2offset[g],S2offset[h],selpata); r2=spu_sel(a2,c2,selpat);   temp=spu_add(r1,r2);
//third Yi=spu_rlmask(myL,-8); Xi=spu_and(Yi,0xFF); i=si_to_uint(Xi); j=spu_extract(Xi,1); k_=spu_extract(Xi,2); l=spu_extract(Xi,3); a3=spu_sel(S3offset[i],S3offset[j],selpata); c3=spu_sel(S3offset[k],S3offset[l],selpata); r3=spu_sel(a3,c3,selpat); temp=spu_xor(temp,r3);
//fourth Ym=myL;