The library provides SPE APIs only for certain routines. These APIs do not conform to the existing BLAS standard. There are constraints on the functionality (range of strides, sizes, and so on) supported by these routines. Prototypes of these routines are listed in blas_s.h:
- sscal_spu : Scales a vector by a constant. BLAS Level 1.
- scopy_spu : Copies a vector from source to destination. BLAS Level 1.
- saxpy_spu : Scales a source vector and element-wise adds it to the destination vector. BLAS Level 1.
- sdot_spu : Performs dot product of two vectors. BLAS Level 1.
- isamax_spu : Determines the (first occurring) index of the largest element in a vector. BLAS Level 1.
- sgemv_spu : Multiplies a matrix and a vector, adding the result to a resultant vector with suitable scaling. BLAS Level 2.
- sgemm_spu : Multiplies two matrices, A and B, and adds the result to the resultant matrix C after suitable scaling. BLAS Level 3.
- ssyrk_64x64 : Multiplies matrix A with its transpose A^{T} and adds the result to the resultant matrix C after suitable scaling. BLAS Level 3.
- strsm_spu : Solves a system of equations involving a triangular matrix with multiple right-hand sides. BLAS Level 3.
- strsm_64x64 : Solves a system of equations involving a triangular matrix with multiple right-hand sides. BLAS Level 3.
The following sections will provide details on each.
sscal_spu
This BLAS 1 routine scales a vector by a constant. The following operation is performed: x <= a x, where x is a vector and a is a constant. Unlike the equivalent PPE API, the SPE interface is designed for stride 1 only: n consecutive elements, starting with the first element, are scaled. The routine has limitations on the value of n and on vector alignment: n must be a multiple of 32, and the x vector must be aligned on a 16-byte boundary.
Form: void sscal_spu ( float *sx, float sa, int n )
- The parameter sx is a pointer to the vector of floats to scale.
- The parameter sa is a float constant by which the vector elements are scaled.
- The parameter n is an integer giving the number of vector elements to scale. (Must be a multiple of 32.)
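Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define len 1024
float buf_x[len] __attribute__ (( aligned (16) )) ;

int main()
{
    int size = len, k ;
    float alpha = 0.6476 ;

    /* fill the vector with 0, 1, 2, ... */
    for (k = 0; k < size; k++) {
        buf_x[k] = (float)k ;
    }

    sscal_spu( buf_x, alpha, size ) ;   /* buf_x <= alpha * buf_x */
    return 0 ;
}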
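For checking results on a host without an SPE, the operation and constraints described for sscal_spu can be mirrored in portable C. The following is a minimal sketch under that assumption: `sscal_ref` is a hypothetical name, not part of the library, and it says nothing about how sscal_spu is actually implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference routine (not part of the library):
 * x <= a * x for n stride-1 elements, enforcing the constraints
 * documented for sscal_spu. */
void sscal_ref(float *sx, float sa, int n)
{
    int k;
    assert(n % 32 == 0);                  /* n must be a multiple of 32 */
    assert(((uintptr_t)sx & 15u) == 0);   /* x must be 16-byte aligned  */
    for (k = 0; k < n; k++)
        sx[k] *= sa;
}
```

Asserting the constraints in the reference version makes an invalid length or a misaligned buffer fail loudly during host-side testing, before the same call is made on the SPE.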
scopy_spu
This BLAS 1 routine copies a vector from source to destination. The following operation is performed: y <= x, where x and y are vectors. Unlike the equivalent PPE API, this routine supports only stride 1: n consecutive elements, starting with the first element, are copied. The routine places no restrictions on the value of n or on vector alignment.
Form: void scopy_spu (float *sx, float *sy, int n)
- The parameter sx is a pointer to the source vector of floats.
- The parameter sy is a pointer to the destination vector of floats.
- The parameter n is an integer giving the number of vector elements to copy.
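Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define len 1024

int main()
{
    int size = len, k ;
    float buf_x[len] ;   /* no alignment requirement for scopy_spu */
    float buf_y[len] ;

    for (k = 0; k < size; k++) {
        buf_x[k] = (float)k ;
    }

    scopy_spu( buf_x, buf_y, size ) ;   /* buf_y <= buf_x */
    return 0 ;
}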
saxpy_spu
This BLAS 1 routine scales a source vector and element-wise adds it to the destination vector. The following operation is performed: y <= a x + y, where x and y are vectors and a is a constant. Unlike the equivalent PPE API, the SPE interface is designed for stride 1 only: n consecutive elements, starting with the first element, are operated on. The routine has limitations on the value of n and on vector alignment: n must be a multiple of 32, and the x and y vectors must be aligned on a 16-byte boundary.
Form: void saxpy_spu (float *sx, float *sy, float sa, int n)
- The parameter sx is a pointer to the source vector (x) of floats.
- The parameter sy is a pointer to the destination vector (y) of floats.
- The parameter sa is a float constant by which the elements of vector x are scaled.
- The parameter n is an integer giving the number of vector elements to scale and add.
Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define len 1024
float buf_x[len] __attribute__ (( aligned (16) )) ;
float buf_y[len] __attribute__ (( aligned (16) )) ;

int main()
{
    int size = len, k ;
    float alpha = 0.6476 ;

    for (k = 0; k < size; k++) {
        buf_x[k] = (float)k ;
        buf_y[k] = (float)(k * 0.23) ;
    }

    saxpy_spu( buf_x, buf_y, alpha, size ) ;   /* buf_y <= alpha * buf_x + buf_y */
    return 0 ;
}
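Because saxpy_spu accepts only lengths that are multiples of 32, a caller with an arbitrary n has to split the work somehow. One possible approach, sketched here, hands the largest multiple-of-32 prefix to the kernel and finishes the remainder with a scalar loop. The names `saxpy_fn`, `saxpy_ref`, and `saxpy_any_n` are hypothetical and introduced only for illustration; `saxpy_ref` is a portable stand-in with the same signature as saxpy_spu so that the sketch can run off the SPE.

```c
#include <assert.h>

/* Signature shared by saxpy_spu and the stand-in below. */
typedef void (*saxpy_fn)(float *sx, float *sy, float sa, int n);

/* Hypothetical portable stand-in for saxpy_spu: y <= a*x + y, stride 1. */
void saxpy_ref(float *sx, float *sy, float sa, int n)
{
    int k;
    assert(n % 32 == 0);    /* same length constraint as saxpy_spu */
    for (k = 0; k < n; k++)
        sy[k] += sa * sx[k];
}

/* Hypothetical wrapper: the kernel handles the multiple-of-32 prefix,
 * a scalar loop handles the remaining n % 32 elements. */
void saxpy_any_n(saxpy_fn kernel, float *sx, float *sy, float sa, int n)
{
    int n_main = n - (n % 32);
    int k;
    if (n_main > 0)
        kernel(sx, sy, sa, n_main);
    for (k = n_main; k < n; k++)
        sy[k] += sa * sx[k];
}
```

On the SPE, saxpy_spu would be passed as `kernel`. The prefix starts at the original pointers, so whatever alignment the caller provided for x and y is what the kernel sees.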
sdot_spu
This BLAS 1 routine performs the dot product of two vectors. The following operation is performed: result <= x . y, where x and y are vectors. Unlike the equivalent PPE API, the SPE interface is designed for stride 1 only: n consecutive elements, starting with the first element, are operated on. The routine has limitations on the value of n and on vector alignment: n must be a multiple of 32, and the x and y vectors must be aligned on a 16-byte boundary.
Form: float sdot_spu ( float *sx, float *sy, int n )
- The parameter sx is a pointer to the first vector (x) of floats.
- The parameter sy is a pointer to the second vector (y) of floats.
- The parameter n is an integer giving the number of vector elements.
Return value: float : Dot product of the two vectors.
Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define len 1024
float buf_x[len] __attribute__ (( aligned (16) )) ;
float buf_y[len] __attribute__ (( aligned (16) )) ;

int main()
{
    int size = len, k ;
    float sum = 0.0 ;

    for (k = 0; k < size; k++) {
        buf_x[k] = (float)k ;
        buf_y[k] = buf_x[k] ;
    }

    sum = sdot_spu( buf_x, buf_y, size ) ;   /* sum <= buf_x . buf_y */
    return 0 ;
}
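A scalar version of the dot product can serve as a host-side cross-check. The sketch below is a minimal one with a hypothetical name, `sdot_ref`; it accumulates in index order, whereas the summation order used by sdot_spu is not stated here, so the two results may differ in their low-order bits.

```c
#include <assert.h>

/* Hypothetical reference routine (not part of the library):
 * returns x . y for n stride-1 elements, accumulated in index order. */
float sdot_ref(const float *sx, const float *sy, int n)
{
    float sum = 0.0f;
    int k;
    assert(n % 32 == 0);    /* length constraint documented for sdot_spu */
    for (k = 0; k < n; k++)
        sum += sx[k] * sy[k];
    return sum;
}
```

For the 1024-element example, the exact sum of squares is 357,389,824, which is well above 2^24, so single-precision rounding occurs during accumulation. Comparing the two results with a relative tolerance rather than for exact equality is therefore the safer choice.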
isamax_spu
This BLAS 1 routine determines the (first occurring) index of the largest element in a vector. The following operation is performed: result <= first k such that x[k] = max(x[i]), where x is a vector. The routine is designed for stride 1 only: n consecutive elements, starting with the first element, are operated on. The routine has limitations on the value of n and on vector alignment: n must be a multiple of 64, and the x vector must be aligned on a 16-byte boundary.
Form: int isamax_spu ( float *sx, int n)
- The parameter sx is a pointer to the vector (x) of floats.
- The parameter n is an integer giving the number of vector elements.
Return value: int : Index of (first occurring) largest element. (Indices start with 0.)
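Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define len 1024
float buf_x[len] __attribute__ (( aligned (16) )) ;

int main()
{
    int size = len, k ;
    int index ;

    for (k = 0; k < size; k++) {
        buf_x[k] = (float)k ;
    }

    index = isamax_spu( buf_x, size ) ;   /* 0-based index of the largest element */
    return 0 ;
}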
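The "first occurring" rule matters when the maximum appears more than once. The following sketch shows one way to express it in portable C; `isamax_ref` is a hypothetical name. Note one assumption: the standard BLAS isamax compares absolute values, while the description here speaks only of the largest element, so this sketch compares the raw values. Whether isamax_spu does the same is not confirmed by this text.

```c
#include <assert.h>

/* Hypothetical reference routine (not part of the library):
 * 0-based index of the first element equal to the maximum of
 * x[0..n-1]. Compares raw values, not absolute values (assumption). */
int isamax_ref(const float *sx, int n)
{
    int best = 0, k;
    assert(n > 0 && n % 64 == 0);   /* length constraint for isamax_spu */
    for (k = 1; k < n; k++)
        if (sx[k] > sx[best])       /* strict '>' keeps the first occurrence */
            best = k;
    return best;
}
```

Using a strict comparison means a later element that merely ties the current maximum never replaces it, which is what yields the first index.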
sgemv_spu
This BLAS 2 routine multiplies a matrix and a vector, adding the result to a resultant vector with suitable scaling. The following operation is performed: y <= a A x + y, where x and y are vectors, A is a matrix, and a is a scalar. Unlike the equivalent PPE interface, the SPE interface for this routine supports only a stride (increment) of one for vectors x and y. m must be a multiple of 32, and n must be a multiple of 8. All input vectors and the matrix must be 16-byte aligned. Matrix A is assumed to be stored in column-major order.
Form: void sgemv_spu ( int m, int n, float alpha, float *a, float *x, float *y)
- The parameter m is an integer specifying the number of rows in matrix A.
- The parameter n is an integer specifying the number of columns in matrix A.
- The parameter alpha is a float constant by which the matrix-vector product AX is scaled.
- The parameter a is a pointer to matrix A.
- The parameter x is a pointer to vector X.
- The parameter y is a pointer to vector Y.
Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define M 512
#define N 32

float Y[M] __attribute__ (( aligned (16) )) ;
float A[M*N] __attribute__ (( aligned (16) )) ;
float X[N] __attribute__ (( aligned (16) )) ;

int main()
{
    int k ;
    float alpha = 1.2 ;

    for (k = 0; k < M; k++)
        Y[k] = (float) k ;
    for (k = 0; k < M*N; k++)
        A[k] = (float) k ;
    for (k = 0; k < N; k++)
        X[k] = (float) k ;

    sgemv_spu(M, N, alpha, A, X, Y) ;   /* Y <= alpha * A * X + Y */
    return 0 ;
}
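Since sgemv_spu expects A in column-major order, the indexing is easy to get wrong when the surrounding code is row-major C. The sketch below spells out the operation in portable C. `sgemv_ref` is a hypothetical name, and the sketch assumes the columns are stored contiguously with no padding, that is, element (i, j) of A lives at a[i + j*m]; the interface has no leading-dimension parameter, which is why this is assumed.

```c
#include <assert.h>

/* Hypothetical reference routine (not part of the library):
 * y <= alpha*A*x + y, with A stored column-major so that element
 * (i, j) is at a[i + j*m] (assumed: no padding between columns). */
void sgemv_ref(int m, int n, float alpha,
               const float *a, const float *x, float *y)
{
    int i, j;
    assert(m % 32 == 0 && n % 8 == 0);   /* constraints for sgemv_spu */
    for (j = 0; j < n; j++) {
        float ax = alpha * x[j];
        for (i = 0; i < m; i++)
            y[i] += ax * a[i + j * m];   /* walk down column j */
    }
}
```

Iterating over columns in the outer loop keeps the accesses to A contiguous, which matches the storage order.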
sgemm_spu
This BLAS 3 routine multiplies two matrices, A and B, and adds the result to the resultant matrix C after suitable scaling. The following operation is performed: C <= A B + C, where A, B, and C are matrices. The matrices must be 16-byte aligned and stored in row-major order. m must be a multiple of 4, n must be a multiple of 16, and k must be a multiple of 4.
Form: void sgemm_spu (int m, int n, int k, float *a, float *b, float *c)
- The parameter m is an integer specifying the number of rows in matrices A and C.
- The parameter n is an integer specifying the number of columns in matrices B and C.
- The parameter k is an integer specifying the number of columns in matrix A and rows in matrix B.
- The parameter a is a pointer to matrix A.
- The parameter b is a pointer to matrix B.
- The parameter c is a pointer to matrix C.
Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define M 64
#define N 16
#define K 32

float A[M * K] __attribute__( (aligned (16)) ) ;
float B[K * N] __attribute__( (aligned (16)) ) ;
float C[M * N] __attribute__( (aligned (16)) ) ;

int main()
{
    int i, j ;

    /* fill C in row-major order */
    for( i = 0 ; i < M ; i++ )
        for( j = 0; j < N ; j++ )
            C[ ( N * i ) + j ] = (float) i ;

    /* Similar code to fill in other matrix arrays . . . . */

    sgemm_spu( M, N, K, A, B, C) ;   /* C <= A * B + C */
    return 0 ;
}
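The row-major layout and the roles of m, n, and k can be made concrete with a plain triple loop. This is a minimal sketch with a hypothetical name, `sgemm_ref`, intended only for cross-checking results on a host; it assumes densely packed matrices (row length equal to the column count) and applies no scaling, matching the operation C <= A B + C given for this routine.

```c
#include <assert.h>

/* Hypothetical reference routine (not part of the library):
 * C <= A*B + C with all matrices row-major and densely packed:
 * A is m x k, B is k x n, C is m x n. */
void sgemm_ref(int m, int n, int k,
               const float *a, const float *b, float *c)
{
    int i, j, p;
    assert(m % 4 == 0 && n % 16 == 0 && k % 4 == 0);   /* sgemm_spu constraints */
    for (i = 0; i < m; i++)
        for (p = 0; p < k; p++) {
            float aip = a[i * k + p];
            for (j = 0; j < n; j++)
                c[i * n + j] += aip * b[p * n + j];
        }
}
```

The i-p-j loop order walks B and C along their rows, which suits row-major storage.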
ssyrk_64x64
This BLAS 3 routine multiplies matrix A with its transpose A^{T} and adds the result to the resultant matrix C after suitable scaling. The following operation is performed: C <= a A A^{T} + C, where only the lower triangular elements of matrix C are updated (the remaining elements are unchanged). The matrices must be 16-byte aligned and stored in row-major order. Also, the matrices must be of size 64x64.
Form: void ssyrk_64x64(float *blkA, float *blkC, float *Alpha)
- The parameter blkA is a pointer to input matrix A.
- The parameter blkC is a pointer to input matrix C; this matrix is updated with the result.
- The parameter Alpha is a pointer to the scalar value by which the product A A^{T} is scaled.
Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define MY_M 64
#define MY_N 64

float myA[ MY_M * MY_N ] __attribute__((aligned (16))) ;
float myC[ MY_M * MY_M ] __attribute__((aligned (16))) ;

int main()
{
    int i, j ;
    float alpha = 2.0 ;

    for( i = 0 ; i < MY_M ; i++ )
        for( j = 0; j < MY_N ; j++ )
            myA[ ( MY_N * i ) + j ] = (float)i ;

    for( i = 0 ; i < MY_M ; i++ )
        for( j = 0 ; j < MY_M ; j++ )
            myC[ ( MY_M * i ) + j ] = (float)i ;

    /* Alpha is passed by pointer */
    ssyrk_64x64( myA, myC, &alpha ) ;
    return 0 ;
}
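The "lower triangle only" behaviour is easiest to see in a scalar loop whose inner index stops at the diagonal. The sketch below uses a hypothetical name, `ssyrk_ref_64x64`, and assumes that the diagonal counts as part of the lower triangle, which is the usual convention but is not spelled out in this text.

```c
#include <assert.h>

#define SYRK_DIM 64   /* the routine is fixed at 64 x 64 */

/* Hypothetical reference routine (not part of the library):
 * C <= alpha*A*A^T + C, row-major, updating only entries with
 * j <= i (lower triangle including the diagonal, by assumption). */
void ssyrk_ref_64x64(const float *blkA, float *blkC, const float *Alpha)
{
    int i, j, p;
    assert(blkA && blkC && Alpha);
    for (i = 0; i < SYRK_DIM; i++)
        for (j = 0; j <= i; j++) {
            float sum = 0.0f;
            /* (A*A^T)(i,j) is the dot product of rows i and j of A */
            for (p = 0; p < SYRK_DIM; p++)
                sum += blkA[i * SYRK_DIM + p] * blkA[j * SYRK_DIM + p];
            blkC[i * SYRK_DIM + j] += *Alpha * sum;
        }
}
```

Entries above the diagonal are never written, so a caller that needs the full symmetric result has to mirror the lower triangle afterwards.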
strsm_spu
This BLAS 3 routine solves a system of equations involving a triangular matrix with multiple right-hand sides. The following equation is solved, and the result overwrites matrix B: AX = B, where A is a lower triangular n x n matrix and B is an n x m regular matrix. The routine has limitations on the supported matrix sizes and on matrix alignment: n must be a multiple of 4, and m must be a multiple of 8. Matrices A and B must be aligned on a 16-byte boundary and must be stored in row-major order.
Form: void strsm_spu (int m, int n, float *a, float *b )
- The parameter m is an integer specifying the number of columns of matrix B.
- The parameter n is an integer specifying the number of rows of matrix B.
- The parameter a is a pointer to matrix A.
- The parameter b is a pointer to matrix B.
Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define MY_M 32
#define MY_N 32

float myA[ MY_N * MY_N ] __attribute__( (aligned (16)) ) ;
float myB[ MY_N * MY_M ] __attribute__( (aligned (16)) ) ;

int main()
{
    int i, j ;

    /* lower triangular A: non-zero on and below the diagonal */
    for( i = 0 ; i < MY_N ; i++ ) {
        for( j = 0; j <= i ; j++ )
            myA[ ( MY_N * i ) + j ] = (float)(i + 1) ;
        for( j = i+1; j < MY_N ; j++ )
            myA[ ( MY_N * i ) + j ] = 0 ;
    }

    for( i = 0 ; i < MY_N ; i++ )
        for( j = 0 ; j < MY_M ; j++ )
            myB[ ( MY_M * i ) + j ] = (float)(i+1)*(j +1) ;

    strsm_spu( MY_M, MY_N, myA, myB ) ;   /* myB is overwritten with X */
    return 0 ;
}
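Solving AX = B with a lower triangular A is classically done by forward substitution, one row of X at a time. The sketch below illustrates that textbook method in portable C; it is not a description of how strsm_spu works internally. `strsm_ref` is a hypothetical name, and the sketch assumes a stored, non-zero diagonal (as the example matrix suggests) rather than an implicit unit diagonal.

```c
#include <assert.h>

/* Hypothetical reference routine (not part of the library):
 * solves A*X = B by forward substitution and overwrites B with X.
 * A is n x n lower triangular, B is n x m, both row-major.
 * Assumes a stored, non-zero diagonal. */
void strsm_ref(int m, int n, const float *a, float *b)
{
    int i, j, p;
    assert(n % 4 == 0 && m % 8 == 0);   /* constraints for strsm_spu */
    for (i = 0; i < n; i++) {
        /* subtract the contributions of the rows already solved */
        for (p = 0; p < i; p++) {
            float aip = a[i * n + p];
            for (j = 0; j < m; j++)
                b[i * m + j] -= aip * b[p * m + j];
        }
        /* divide row i by the diagonal entry A(i,i) */
        for (j = 0; j < m; j++)
            b[i * m + j] /= a[i * n + i];
    }
}
```

A simple host-side check is to multiply A by the returned X and compare the product against a saved copy of the original B.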
strsm_64x64
This BLAS 3 routine solves a system of equations involving a triangular matrix with multiple right-hand sides. The following equation is solved, and the result overwrites matrix B: AX = B, where A is a lower triangular 64 x 64 matrix and B is a 64 x 64 regular matrix. This routine is similar in operation to strsm_spu but is designed specifically for a matrix size of 64 x 64, so it performs better on 64 x 64 matrices than the more generic strsm_spu. Matrices A and B must be aligned on a 16-byte boundary and must be stored in row-major order.
Form: void strsm_64x64 (float *a, float *b )
- The parameter a is a pointer to matrix A.
- The parameter b is a pointer to matrix B.
Example:
#include <blas_s.h>   /* prototypes of the SPE routines */

#define MY_M 64
#define MY_N 64

float myA[ MY_N * MY_N ] __attribute__( (aligned (16)) ) ;
float myB[ MY_N * MY_M ] __attribute__( (aligned (16)) ) ;

int main()
{
    int i, j ;

    /* lower triangular A: non-zero on and below the diagonal */
    for( i = 0 ; i < MY_N ; i++ ) {
        for( j = 0; j <= i ; j++ )
            myA[ ( MY_N * i ) + j ] = (float)(i + 1) ;
        for( j = i+1; j < MY_N ; j++ )
            myA[ ( MY_N * i ) + j ] = 0 ;
    }

    for( i = 0 ; i < MY_N ; i++ )
        for( j = 0 ; j < MY_M ; j++ )
            myB[ ( MY_M * i ) + j ] = (float)(i+1)*(j +1) ;

    strsm_64x64( myA, myB ) ;   /* myB is overwritten with X */
    return 0 ;
}
Taken from the Basic Linear Algebra Subprograms Programmer's Guide and API Reference.