Topic
24 replies Latest Post - ‏2012-01-25T11:39:52Z by masterzorag
joker5bb
joker5bb
3 Posts
ACCEPTED ANSWER

Pinned topic Pyrit with opencl on the PS3

‏2010-01-17T17:46:56Z |
I wanted to know how much is opencl cell optimized to do bruteforcing on the ps3?

i want to use this Project:
http://code.google.com/p/pyrit/

source is available, it has opencl support

is it compatible?
Updated on 2012-01-25T11:39:52Z at 2012-01-25T11:39:52Z by masterzorag
  • JoaquinM
    JoaquinM
    5 Posts
    ACCEPTED ANSWER

    Re: Pyrit with opencl on the PS3

    ‏2010-01-29T17:38:34Z  in response to joker5bb
    It looks like it should run fine. They define the WORKGROUP_SIZE as a #define in the OpenCL code. You'll need to change the OpenCL kernel to include the attribute "__attribute__((reqd_work_group_size(X, Y, Z)))" in the kernel for maximum performance on Cell.

    Give it a try and let us know the results!

    Joaquin
    • masterzorag
      masterzorag
      25 Posts
      ACCEPTED ANSWER

      Re: Pyrit with opencl on the PS3

      ‏2010-03-22T00:38:36Z  in response to JoaquinM
      I've OpenCL working on a PS3 as core in pyrit, see here:
      http://masterzorag.blogspot.com/2010/03/pyrit-opencl-ps3.html

      But I've noticed that adding the include dir /opt/ibmcmp/xlc/opencl/0.1/include the cpyrit_opencl/setup.py build fails with something related to a missing '__attribute__’ before ‘__vastart’, see log line 12

      Can be "__attribute__((reqd_work_group_size(X, Y, Z)))" as you suggest?
      So my video show a version of a module without include from IBM? It is compiled without, becouse without that dir compiled fine and runs!

      Any help is appreciated!

      Attachments

      • SystemAdmin
        SystemAdmin
        131 Posts
        ACCEPTED ANSWER

        Re: Pyrit with opencl on the PS3

        ‏2010-03-22T14:29:22Z  in response to masterzorag
        why are you including that directory? you don't need to do that (and from the error, shouldn't do that). it's an internal compiler include directory, and it already gets included at the correct point of the compile.

        as you working code shows, you do not need to include the directory.

        .bri.
        • masterzorag
          masterzorag
          25 Posts
          ACCEPTED ANSWER

          Re: Pyrit with opencl on the PS3

          ‏2010-03-22T16:42:14Z  in response to SystemAdmin
          Thank you for this, now I've understood.
          But, what about this:
          You'll need to change the OpenCL kernel to include the attribute "__attribute__((reqd_work_group_size(X, Y, Z)))" in the kernel for maximum performance on Cell.
          Where can I include this attribute? I want to see the maximum performance on Cell!
          Thanks
          • SystemAdmin
            SystemAdmin
            131 Posts
            ACCEPTED ANSWER

            Re: Pyrit with opencl on the PS3

            ‏2010-03-23T11:15:52Z  in response to masterzorag
            read up about the clEnqueueNDRange call, and the parts about the local work group size. specifying and making your kernel work with a larger (larger than 1,1,1) local work group size will usually give better performance, because each invocation of the kernel will do more work, and the compiler may be able to schedule things better.

            on IBM's current release, we have a restriction that in order to use a local work group size greater than (1,1,1), you must specify the reqd_work_group_wize attribute on the __kernel definition. that attribute is describe in section 6.7.2, page 156 of the 1.0 spec.

            .bri.
            • ebfe
              ebfe
              1 Post
              ACCEPTED ANSWER

              Re: Pyrit with opencl on the PS3

              ‏2010-03-24T08:14:02Z  in response to SystemAdmin
              Hi, I'm the author of Pyrit. Masterzorag pointed me to this thread. As far as I can see, the need to specify a value for reqd_work_group_size is an incredible bad shortcoming of IBM's OpenCL-implementation:

              Pyrit has no need for work-groups (in OpenCL-terms) as every task in completely independent from the other. Therefor we leave local_work_size to NULL when calling clEnqueueNDRange(). The specifications say that the platform has to figure out "appropiate" values for this parameter itself if it's meaning is unneeded (as in our case).

              The need to specify some arbitrary required minimum work-group-size undermines this approach completely: As you said, we get non-optimal performance if the value is too low (e.g. (1,1,1) or undefined). However we might get even worse results if we specifiy some other, arbitrary value (e.g. (64,1,1)) as register allocation and shared memory usage might force platforms like GPUs to run with lower occupancy-figures. In the end, we force many "threads" to be bound into a work-group without any need for that.
              Can you estimate the performance-gain by specifying reqd_work_group_size? I can't test this myself, as I don't have access to a Cell-platform...
              • masterzorag
                masterzorag
                25 Posts
                ACCEPTED ANSWER

                Re: Pyrit with opencl on the PS3

                ‏2010-03-25T15:30:46Z  in response to ebfe
                I've to leave my PS3, should I come back from holiday next week.
                If you are interested, from April, I can give you ssh access to mine one, mail me.
            • masterzorag
              masterzorag
              25 Posts
              ACCEPTED ANSWER

              Re: Pyrit with opencl on the PS3

              ‏2010-04-13T00:19:53Z  in response to SystemAdmin
              Thanks, I've read and added __attribute__((reqd_work_group_size(X, Y, Z))), but got about the same results!
              Computed 963.93 PMKs/s total.

              with (1, 1, 1) or others numbers give me
              SystemError: Failed to execute kernel (CL_INVALID_WORK_GROUP_SIZE)

              The other relevant line is:
              clEnqueueNDRangeKernel(self->dev_queue, self->dev_kernel, 1, NULL, gWorksize, NULL, 1, &clEvents[0], &clEvents[1])

              Can we do better?
              • SystemAdmin
                SystemAdmin
                131 Posts
                ACCEPTED ANSWER

                Re: Pyrit with opencl on the PS3

                ‏2010-04-13T16:54:38Z  in response to masterzorag
                spec says that if you set a reqd_work_group_size attribute, then you need to pass that size into the clEnqueueNDRangeKernel call. something like:
                size_t lWorksize[3] = { 16,1,1 };

                clEnqueueNDRangeKernel(self->dev_queue, self->dev_kernel, 1, NULL, gWorksize, lWorksize, 1, &clEvents[0], &clEvents[1])
                if you have 16,1,1 in the reqd_work_group_size.

                .bri.
                • masterzorag
                  masterzorag
                  25 Posts
                  ACCEPTED ANSWER

                  Re: Pyrit with opencl on the PS3

                  ‏2010-04-13T23:37:45Z  in response to SystemAdmin
                  Thanks for the hints, this really changes performances, some examples of pyrit benchmark:

                  16,1,1 Computed 869.67 PMKs/s total.
                  16,16,1 Computed 99.02 PMKs/s total.

                  64,1,1 Computed 858.74 PMKs/s total.
                  64,16,1Computed 118.80 PMKs/s total.

                  4,1,1 Computed 849.12 PMKs/s total
                  4,4,4 Computed 97.94 PMKs/s total

                  But seems that optimal performances is reached without __attribute__, or with X,Y,Z;
                  Here I don't know the optimal numbers!
                • masterzorag
                  masterzorag
                  25 Posts
                  ACCEPTED ANSWER

                  Re: Pyrit with opencl on the PS3

                  ‏2010-04-19T12:09:08Z  in response to SystemAdmin
                  Testing a 500000 words batching:
                  with 1,1,1 got 1300;
                  with 256,1,1 got about 1100
                  This can explain about "Pyrit has no need for work-groups (in OpenCL-terms) as every task in completely independent from the other"

                  But I've see demo program like perlin, where specifing different workgroupsize values I got 28x performances in computing;
                  and blacksholes that uses five kernel cores to do different tasks in parallel!

                  I think that special improvement can be made adapting the entire _cpyrit_oclkernel.cl code to do some work in parallel, like the sha1process in it.
                  Maybe I'm wrong but the _cpyrit_opencl implementation was derived from the cuda intructions flow (as a fast addon), so
                  I think it's about generic in opencl terms, I think that an optimized __cpyrit_cellcl can impress!

                  If fact, as this change http://code.google.com/p/pyrit/source/detail?r=248 with one optimization
                  cell passes from about 1000 to about 1200 pmks/s, than,
                  adding the __attribute__ with 1,1,1 cell gots 1300 pmks/s

                  Here is an example with several types of compute kernel that work well on Cell
                  http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14385403

                  We should have an optimized interface to pyrit to get real cell benefits,
                  something like the _cpyrit_calpp is doing now for ati cards, in the pyrit's land
                  • masterzorag
                    masterzorag
                    25 Posts
                    ACCEPTED ANSWER

                    Re: Pyrit with opencl on the PS3

                    ‏2010-05-07T12:52:29Z  in response to masterzorag
                    How pass -cl-fast-relaxed-math to kernel?

                    Pyrit uses only int operations, as defined in _cpyrit_opencl.h
                    #define uint32_t unsigned int
                    The OpenCL application should use __global memory buffers whose type's size is a multiple of a quad-word (16 bytes) ... due to alignment checks
                    uint should be 4 bytes, we should use uint4?

                    But pyrit uses this structs:

                    typedef struct {
                    uint32_t h0,h1,h2,h3,h4;
                    } SHA_DEV_CTX;

                    typedef struct {
                    SHA_DEV_CTX ctx_ipad;
                    SHA_DEV_CTX ctx_opad;
                    SHA_DEV_CTX e1;
                    SHA_DEV_CTX e2;
                    } gpu_inbuffer;

                    About alignment check, I've found in cl_platform.h
                    typedef uint32_t cl_uint4[4] __attribute__((aligned(16)));
                    How can pad the data structure to 16 bytes?
                    • SystemAdmin
                      SystemAdmin
                      131 Posts
                      ACCEPTED ANSWER

                      Re: Pyrit with opencl on the PS3

                      ‏2010-05-07T14:30:47Z  in response to masterzorag
                      -cl-fast-relaxed-math is a kernel compile option that is set by including the string in the options string parameter of clBuildProgram.

                      The aligned attribute (e.g. "__attribute__((aligned(16)))" is not an alignment check, but instead informs the compiler/linker that the variable must be placed in memory on a 16 byte alignment. This facility can be used to ensure h0 is aligned on a 16-byte boundary. For example:

                      
                      typedef struct 
                      { uint32_t h0 __attribute__ ((aligned(16))); uint32_t h1, h2, h3, h4; 
                      } SHA_DEV_CTX;
                      


                      Adjacent instances of SHA_DEV_CTX will then be padded with 3 32-bit words. Alternatively, you can just add pad variables to your structure, but that won't guarantee the alignment of the structure beyond word (32-bit) aligned.
                      • masterzorag
                        masterzorag
                        25 Posts
                        ACCEPTED ANSWER

                        Re: Pyrit with opencl on the PS3

                        ‏2010-05-11T16:30:17Z  in response to SystemAdmin
                        Thanks for aswering, I've added:
                        const char *options = "-cl-fast-relaxed-math";
                        clBuildProgram(self->dev_prog, 0, NULL, options, NULL, NULL);
                        1) About syntax, how can I add this inline (without declare a variable)?

                        I've read in guide:
                        "Memory buffer objects used in conjunction with an SPU accelerator device should be aligned at 128 bytes for best performance."
                        We have on host:
                        cl_mem g_inbuffer, g_outbuffer;
                        g_inbuffer = clCreateBuffer(self->dev_ctx, CL_MEM_READ_ONLY, gWorksize[0]*sizeof(gpu_inbuffer), NULL, &errcode);
                        clSetKernelArg(self->dev_kernel, 0, sizeof(cl_mem), &g_inbuffer);

                        According to:
                        #ifndef uint32_t
                        #define uint32_t unsigned int
                        #endif

                        typedef struct {
                        SHA_DEV_CTX ctx_ipad;
                        SHA_DEV_CTX ctx_opad;
                        SHA_DEV_CTX e1;
                        SHA_DEV_CTX e2;
                        } gpu_inbuffer;

                        typedef struct {
                        uint32_t h0,h1,h2,h3,h4;
                        } SHA_DEV_CTX;

                        We have in kernel:
                        __global gpu_inbuffer *inbuffer, so working over inbufferidx, this should be made of 20 uint = 80KB, isn't true?
                        2) We should align to 128KB for best performance every g_inbuffer?

                        I'm thinking to aggregate 4 uint in uint4 vector in this way:
                        uint4 h0 = { inbufferidx.ctx_ipad.h0, inbufferidx.ctx_opad.h0, inbufferidx.e1.h0, inbufferidx.e2.h0 }
                        uint4 h1 = { inbufferidx.ctx_ipad.h1, inbufferidx.ctx_opad.h1, inbufferidx.e1.h1, inbufferidx.e2.h1 }
                        ... for h2, h3, h4 to operate over 5 uint4 vector = 80KB
                        Then perform sha1_process over a uint4 vector

                        "The maximum number of compute units on an SPU accelerator device is 16. The SPU accelerator device has a maximum local memory size of 256KB."
                        3) So, in PS3 we have 256 x 6 = 1536KB of local memory shared between __global, __local, __private, variables, temp data, kernel?
                        4) There is a cl_builtin function that give us values of mem used/free when running a kernel? How compute free mem?
                        • SystemAdmin
                          SystemAdmin
                          131 Posts
                          ACCEPTED ANSWER

                          Re: Pyrit with opencl on the PS3

                          ‏2010-05-12T17:03:52Z  in response to masterzorag
                          If I understand your questions correctly

                          1) About syntax, how can I add this inline (without declare a variable)?

                          clBuildProgram(self->dev_prog, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);

                          2) We should align to 128KB for best performance every g_inbuffer?

                          For the SPU Accelerator is best that transfers are aligned to 128 byte (not 128KB) and that the transfers are multiples of 128 bytes. Allocated OpenCL memory objects are aligned to at least 128 bytes.
                          3) So, in PS3 we have 256 x 6 = 1536KB of local memory shared between __global, __local, __private, variables, temp data, kernel?

                          The SPE's 256 KB local storage is shared between the OpenCL runtime, the program kernels, __local and __private variables, stack, and the SW cache used for global memory accesses.

                          4) There is a cl_builtin function that give us values of mem used/free when running a kernel? How compute free mem?

                          I'm not aware of a such a built-in.
                          • SystemAdmin
                            SystemAdmin
                            131 Posts
                            ACCEPTED ANSWER

                            Re: Pyrit with opencl on the PS3

                            ‏2010-05-13T13:22:38Z  in response to SystemAdmin
                            To elaborate on some of the responses from brokensh...

                            2) OpenCL allows the memory object alignment size to be queried on a per-device basis through the CL_DEVICE_MEM_BASE_ADDR_ALIGN attribute. This attribute represents the alignment in BITS of memory objects supported by the given device. As brokensh mentioned, on the SPU device the alignment is 128 bytes or 1024 bits.

                            4) As brokensh mentioned, there currently is no builtin to query available/free mem. As well, no allocator builtin is currently supported within a kernel, so I'm not sure what good the query would be anyhow. Kernel memory allocation is controlled through the arguments passed to the kernel as well as statically declared constant storage.

                            Presumably you are interested in the amount of free LOCAL memory, in which case this value can be approximated by comparing the kernel CL_KERNEL_LOCAL_MEM_SIZE against the device CL_DEVICE_LOCAL_MEM_SIZE. The difference of these two should give you the amount of free device local memory. This calculation takes into account the kernel text and data usage as well as kernel argument allocations. This does not account for kernel stack usage.
                            • masterzorag
                              masterzorag
                              25 Posts
                              ACCEPTED ANSWER

                              Re: Pyrit with opencl on the PS3

                              ‏2010-05-19T01:30:03Z  in response to SystemAdmin
                              Connecting to IBM ACCELERATOR CellBE processor...
                              CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS 3
                              CL_DEVICE_MAX_WORK_ITEM_SIZES 256
                              CL_DEVICE_MAX_WORK_GROUP_SIZE 256
                              CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE 65536

                              CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT 4
                              That's why we should use uint4;

                              CL_DEVICE_LOCAL_MEM_SIZE 248832
                              It's the total of entire Cell/Be, or is for every SPE ( about 256Kb, so should be considered x6 in PS3 )?

                              CL_DEVICE_MEM_BASE_ADDR_ALIGN 1024
                              That's why I'm tuning pyrit to pass a fixed set of 1024 ( num of PMKs in a single array to be processed by the kernel ) x 80 ( Kb for a single PMK data set )

                              gWorksize[0] is computed against size of array and rounded to nearest 256, so I got 1024 at 99%;
                              g_inbuffer = clCreateBuffer ... gWorksize[0]*sizeof(gpu_inbuffer) ... so I got 1024 x 80 = 81920Kb of memRO
                              clEnqueueWriteBuffer ... the same ... so I got 1024 x 80 = 81920Kb
                              g_outbuffer = clCreateBuffer ... gWorksize[0]*sizeof(gpu_outbuffer) ... so I got 1024 x 40 = 40960Kb of memWO
                              g_inbuffer and g_outbuffer are __global pointers, so question is:
                              I'm respecting the CL_DEVICE_MEM_BASE_ADDR_ALIGN?

                              Another one question:
                              How can pack 4 input fields into an uint4 vector, native uint4 vector are available while processing the kernel!
                              I want to pack 4 struct of 4 x 5 uint = 4 x 80Kb = 320Kb into 20 uint4 = 20 x 16Kb = 320Kb

                              typedef struct {
                              uint32_t h0,h1,h2,h3,h4;
                              } SHA_DEV_CTX;

                              typedef struct {
                              SHA_DEV_CTX ctx_ipad;
                              SHA_DEV_CTX ctx_opad;
                              SHA_DEV_CTX e1;
                              SHA_DEV_CTX e2;
                              } gpu_inbuffer;

                              typedef struct {
                              cl_uint4 ctx_ipad_h0, ctx_ipad_h1, ctx_ipad_h2, ctx_ipad_h3, ctx_ipad_h4;
                              cl_uint4 ctx_opad_h0, ctx_opad_h1, ctx_opad_h2, ctx_opad_h3, ctx_opad_h4;
                              cl_uint4 e1_h0, e1_h1, e1_h2, e1_h3, e1_h4;
                              cl_uint4 e2_h0, e2_h1, e2_h2, e2_h3, e2_h4;
                              } vector;
  • SystemAdmin
    SystemAdmin
    131 Posts
    ACCEPTED ANSWER

    Re: Pyrit with opencl on the PS3

    ‏2010-01-29T17:40:12Z  in response to joker5bb
    I am not aware of anyone trying the OpenCL pyrit on Cell. If it was well written and didn't utilize any OpenCL extensions or features that have yet to be implemented (see the OpenCL Dev Kit Install and User's Guide for details on the limitations), then I see no reason it shouldn't work. I would like to hear your results.
  • joker5bb
    joker5bb
    3 Posts
    ACCEPTED ANSWER

    Re: Pyrit with opencl on the PS3

    ‏2010-02-01T21:48:28Z  in response to joker5bb
    well i could not get the cell opencl sdk to install on ubuntu,

    how can i compile it for 64bit ppc
    • SystemAdmin
      SystemAdmin
      131 Posts
      ACCEPTED ANSWER

      Re: Pyrit with opencl on the PS3

      ‏2010-02-01T22:52:44Z  in response to joker5bb
      you can't -- currently, we are only shipping 32bit OpenCL libraries.

      .bri.
  • masterzorag
    masterzorag
    25 Posts
    ACCEPTED ANSWER

    On the isolated pmk kernel...

    ‏2010-07-12T23:44:34Z  in response to joker5bb
    I've only isolated the opencl core used by pyrit to be run under control to see if improvement can be made;
    host, header and kernel included; I'm using newest 0.2, compile with:
    gcc -o try2 try2.c -lm -lstdc++ -lOpenCL -std=c99
    by default it setup an array of 20480 empty structs and perform the in/out compute part to obtain the pmk
    1. ./try2

    Starting...
    calloc for size: 1638400, array count: 20480
    Building kernel...
    Connecting to IBM ACCELERATOR CellBE processor...
    memR: 1638400, EWB: 1638400
    Executing with CL_KERNEL_WORK_GROUP_SIZE 1...
    Timing...
    kernel execution time: 16.40
    number of computes/sec: 1248.54
    Reading back, ERB: 819200
    Done!

    Quite, this is only the opencl pyrit's part, this doesn't look at any input data loading functions, as memory accesses should be as pyrit's does.
    This host code only runs the untouched the pyrit opencl code, as we can see the svn code computes about 1200 per second!

    By the way, great 0.2 release, with clu and extensions like numa and hugepages there are many things to test...
    • masterzorag
      masterzorag
      25 Posts
      ACCEPTED ANSWER

      pmk, dissected

      ‏2011-02-20T13:20:56Z  in response to masterzorag
      Some time ago was speaking here about how improve pyrit kernel, but I miss some knowledges...
      How improve cl code without knowing about C, bits and byte, mem alignments, math, and so on!
      Here I'm back, forget some of what I wrote about foolish questions, we can start improve for real.

      I have an host code that do the prepare part in C, linked with openssl, as pyrit does.
      after this prepare function pyrit stores entire data for 1 pmk (80b) into a struct of scalars uint:
      ipad.h0 to ipad.h4, opad.h0 to opad.h4, e1.h0 to e1.h4 and e2.h0 to e2.h4 = 4x5x4b = 80b
      pyrit pass arrays of this input structs to kernel, after process output is half (40b):
      pmk1.h0 to pmk1.h4, pmk2.h0 to pmk2.h4 = 2x5x4b = 40b
      after readed back, with first 32b of this 40b the host code compose final pmk and store this as result.

      I'm doing this instead:
      I setup 4 arrays of uint4, *ipad, *opad, *e1, *e2, to make use of vectors in kernel computing
      after prepare, I fold data for 4pmk into x, y, z and w of those arrays as this:
      data for first pmk, pmk[0]:
      ipad[0].h0 to ipad[0].h4 into ipad0 to 4.x
      opad[0].h0 to opad[0].h4 into opad0 to 4.x
      e1[0].h0 to e1[0].h4 into e10 to 4.x
      e2[0].h0 to e2[0].h4 into e20 to 4.x
      ... data for fourth pmk, pmk[3]:
      ipad[3].h0 to ipad[3].h4 into ipad0 to 4.w
      opad[3].h0 to opad[3].h4 into opad0 to 4.w
      e1[3].h0 to e1[3].h4 into e10 to 4.w
      e2[3].h0 to e2[3].h4 into e20 to 4.w

      As result I have 4 arrays (*ipad, *opad, *e1, *e2), of 5 (from 0 to 4) uint4 (with data for pmk[0] in .x, pmk[1] in .y, pmk[2] in .z, pmk[3] in .w) = 4x5x16 = 320b
      I'm doing 4pmk, as 320/4 = 80b

      This is my solution to: aligning memory at 16b (a uint4 is 16b), made use of native uint4 in kernel to do x4 in same time, and to coalesce data accesses.
      What about this?

      Putting simple: with this 4pmk case I do global=1, so kernel does 1 workitem to compute all 4.
      My kernel does gid5=get_global_id(0) *5 to grab 5 uint and process over sha1, output is for 4pmk, works!
      After reading back I've included a verify function that tell me if data output for 4pmk are valid, about this verify method I compare results over valid ones computed from original pyrit kernel.

      Now I've entire control (with verify working I know if I'm doing right computations) over pmk algo, I will start target the different address spaces and use of _local.

      Stay curious.
      • SystemAdmin
        SystemAdmin
        131 Posts
        ACCEPTED ANSWER

        Re: pmk, dissected

        ‏2011-02-21T17:56:35Z  in response to masterzorag
        masterzorag,

        Your strategy of changing your data structures from an array of structures (AOS) to a structure of arrays (SOA) in which each array is sized to matched the device's preferred vector width should yield improved performance. This is especially true on CPU devices and SPE accelerator devices that are optimized for SIMD execution.

        Dan B.
        • masterzorag
          masterzorag
          25 Posts
          ACCEPTED ANSWER

          pmk, I've got improvements!

          ‏2012-01-25T11:39:52Z  in response to SystemAdmin
          Check this one: http://masterzorag.blogspot.com/2012/01/opencl-ps3-improvements.html
          I've to optimize more now!
          Second kernel uses _local and async_work_group_copy to move data, but a different SHA-1 function that left me set localsize = 64!
          My faster kernel operate on _private and is not using _local at all, but it needs a localsize = 1 to run!
          It's a more vectorized version operating on uint8: we need only 8192 SHA-1 rounds, doing outbuffer1 + outbuffer2 at same time!