4 replies Latest Post - ‏2013-01-30T19:57:43Z by db808
2092 Posts

Pinned topic NSDPerf Values

‏2013-01-28T17:21:38Z |
Hi everyone,

I'm currently trying to track down an issue in a cluster implementation.

When using NSDPerf on two of our compute nodes (PureFlex x240) in a single chassis, I see 3 GB/s read and 1.5 GB/s write performance.

Using ib_write_bw or ib_read_bw, I do get the full 6 GB/s for both read and write.

Any idea where I can start investigating?

The real problem is that we don't get any better GPFS throughput to the 4 NSD servers than 3 GB/s read and write, even with multiple clients, although the servers themselves are capable of putting 12 GB/s to the disk subsystem (we verified raw read and write performance). We also see that the full bandwidth to the NSD servers is never used, and we therefore suspect problems in the IB fabric. The fabric is clean and on the latest FW levels.

Thanks for any hints and tips...

Updated on 2013-01-30T19:57:43Z by db808
  • SystemAdmin
    2092 Posts

    Re: NSDPerf Values

    ‏2013-01-28T17:30:14Z  in response to SystemAdmin
    Few things:

    • what version of GPFS on the NSD servers?
    • can you post the output from mmlsconfig on your NSD server cluster?
    • SystemAdmin
      2092 Posts

      Re: NSDPerf Values

      ‏2013-01-29T07:49:57Z  in response to SystemAdmin

      Here is the output of mmlsconfig:

      Configuration data for cluster rb-hpc-eth:

      myNodeConfigNumber 1
      clusterName rb-hpc-eth
      clusterId 6969073756151928916
      autoload yes
      dmapiFileHandleSize 32
      pagepool 8192M
      maxMBpS 11200
      maxblocksize 2048K
      minQuorumNodes 2
      verbsPorts mlx4_0/1
      verbsRdma enable
      healthCheckInterval 20
      worker1Threads 96
      prefetchThreads 288
      nsdThreadsPerDisk 16
      nsdMinWorkerThreads 512
      nsdMaxWorkerThreads 1024
      nsdbufspace 70
      verbsRdmasPerConnection 48
      verbsRdmasPerNode 192
      adminMode central

      File systems in cluster rb-hpc-eth:


      • SystemAdmin
        2092 Posts

        Re: NSDPerf Values

        ‏2013-01-29T18:08:32Z  in response to SystemAdmin
        (I'm not a GPFS dev, so any IBMers here reading this, if there is anything wrong here, please scold the hell out of me, :-) )

        Okay, the first thing I would do is make sure your single-client-to-filesystem bandwidth is what you would expect, which it seems you have done. Provided you don't have routing issues and your opensm setup isn't horked, GPFS+RDMA should be pretty straightforward. If you've absolutely confirmed that your storage subsystem is set up correctly, and made sure all of the GPFS blocksize bits match, you might need to play with a few GPFS tunables.

        I believe (GPFS devs would need to confirm this) that nsdThreadsPerDisk is ignored in GPFS 3.5.

        I know that in GPFS 3.5 the IO queues are split into large and small queues. There are also two new tunables, nsdThreadsPerQueue and nsdSmallThreadRatio.

        Here is what I have, which seems to work pretty well:

        mmlsconfig <SNIPPED>
        nsdThreadsPerQueue 12
        nsdSmallThreadRatio 1

        Setting nsdSmallThreadRatio=1 creates an equal number of small and large queues. If you run mmfsadm dump nsd (while idle or during IO), you should see something like the following:

        mmfsadm dump nsd
        Derived config parms:
        threadRatio: 1, threadsPerQueue: 12, numLargeQueues 24, numSmallQueues: 24
        largeBufferSize: 8388608, smallBufferSize: 65536, desiredThreadsForType 288

        During IO, run mmfsadm dump nsd again and check how many waiting requests you have in your large queues. If requests are piling up there, you might need a higher nsdThreadsPerQueue.
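        As a quick sanity check on the numbers in the dump above, here is a minimal Python sketch of the arithmetic. The values are copied from the two "Derived config parms" lines; the idea that desiredThreadsForType is threadsPerQueue times the number of queues of one type is my assumption from these numbers, not something confirmed by IBM.

```python
# Values copied from the "Derived config parms" lines of the dump above.
threads_per_queue = 12
num_large_queues = 24
num_small_queues = 24
large_buffer_size = 8388608   # bytes
small_buffer_size = 65536     # bytes
desired_threads_for_type = 288

# Assumption (not confirmed by IBM documentation): desiredThreadsForType is
# threadsPerQueue multiplied by the number of queues of one type.
threads_per_type = threads_per_queue * num_large_queues

# Convert the buffer sizes to more readable units.
large_buffer_mib = large_buffer_size // (1024 * 1024)
small_buffer_kib = small_buffer_size // 1024

print("threads per type:", threads_per_type)
print("matches dump:", threads_per_type == desired_threads_for_type)
print("large buffers:", large_buffer_mib, "MiB")
print("small buffers:", small_buffer_kib, "KiB")
```

        If the product stops matching after you change nsdThreadsPerQueue, the assumption above is wrong for your GPFS level and the dump output is the thing to trust.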
        Just my thoughts...

        Hope this helps.
  • db808
    86 Posts

    Re: NSDPerf Values

    ‏2013-01-30T19:57:43Z  in response to SystemAdmin
    Hi Volker,

    When I first read your question, the fact that you were seeing 1/4 of the total expected bandwidth from your 4 NSD servers led me to believe that your remote NSD configuration might be grossly unbalanced.

    In general, GPFS will send the inter-cluster NSD request to the NSD server that is listed first in the definition of the NSD. You can change this ordering with the mmchnsd command.

    What does the NSD configuration look like? You should have each of your 4 NSD servers as the first remote NSD server in 1/4 of the NSDs ... if the 4 NSD servers are directly attached to all LUNs. If you are using the typical GPFS topology with a LUN only visible to 2 NSD servers, then you would want 1/2 of the LUN's NSD configurations to specify one NSD server, and the other 1/2 specify the second NSD server as the first member.

    If your four NSD servers are all directly connected to the storage (call them nsd_server1, nsd_server2, nsd_server3, and nsd_server4) ... and all the NSDs are configured with "nsd_server1" as the first remote NSD server, then ALL the traffic from the compute nodes will be directed to "nsd_server1" and the other three NSD servers will be mostly idle.
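    To make the balance check concrete, here is a minimal Python sketch that counts how often each server appears first in the server list. The NSD names, server names, and both mappings are hypothetical; in practice you would transcribe the ordered server list of each NSD from your own cluster's NSD definitions.

```python
from collections import Counter

SERVERS = ["nsd_server1", "nsd_server2", "nsd_server3", "nsd_server4"]


def rotated(i):
    """Server list rotated so that a different server comes first for each NSD."""
    k = i % len(SERVERS)
    return SERVERS[k:] + SERVERS[:k]


def first_server_counts(mapping):
    """Count how many NSDs list each server as their first (preferred) server."""
    return Counter(servers[0] for servers in mapping.values())


# Hypothetical unbalanced layout: every NSD lists nsd_server1 first.
unbalanced = {f"NSD_{i:02d}": list(SERVERS) for i in range(8)}

# Hypothetical balanced layout: the first server rotates from NSD to NSD.
balanced = {f"NSD_{i:02d}": rotated(i) for i in range(8)}

print("unbalanced:", dict(first_server_counts(unbalanced)))
print("balanced:  ", dict(first_server_counts(balanced)))
```

    In the unbalanced case one server carries all eight NSDs; in the balanced case each server is first for two of them, which is the 1/4 split described above.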

    Being able to aggregate sequential performance across multiple NSD servers is also somewhat dependent on the ordering of the NSDs when the file system was created.

    I'll give a simple example. Let's assume that you have 40 LUNs (and NSDs) of the same size, and:
    NSD_00 - NSD_09 are on nsd_server1
    NSD_10 - NSD_19 are on nsd_server2
    NSD_20 - NSD_29 are on nsd_server3
    NSD_30 - NSD_39 are on nsd_server4

    If you create the file system with the NSD ordering of:
    NSD_00, NSD_01, ... NSD_09, NSD_10 ...NSD_19, NSD_20 ... NSD_29, NSD_30 ... NSD_39

    GPFS will try to create 40-way stripes in basically the above ordering (if the LUNs are the same size). There will be small variations caused by files that are less than 40 x blocksize in length, and by the uneven modulo (40 x blocksize) remainder at the end of the files ... but the big picture will be a 40-way stripe.

    In this example, on average, the first 10 read-aheads go to the same NSD server, since those NSDs are owned by the same NSD server. You do not get the second NSD server involved until the 11th read-ahead. The third NSD server is not involved until the 21st read-ahead, and the fourth NSD server is not involved until the 31st read-ahead. Yes, GPFS can perform deep read-aheads, but that depth is difficult to sustain.

    As an alternative, if you created the file system with the following ordering:
    NSD_00, NSD_10, NSD_20, NSD_30, NSD_01, NSD_11, NSD_21, NSD_31 .... you are rapidly alternating across NSDs owned by different NSD servers. No two adjacent NSDs in the list are on the same NSD server.

    In this latter case, a GPFS 4-deep read ahead engages all four NSD servers, and multi-NSD-server performance will ramp much faster and be easier to sustain.
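    The two orderings can be compared with a minimal Python sketch. It uses the 40-NSD example from this post and the simplifying assumption that consecutive read-aheads walk the NSD list strictly in creation order; real GPFS striping has the small variations mentioned earlier, so treat the numbers as illustrative.

```python
from itertools import chain

SERVERS = ["nsd_server1", "nsd_server2", "nsd_server3", "nsd_server4"]

# The example layout: NSD_00-09 on server1, NSD_10-19 on server2, and so on.
owner = {f"NSD_{i:02d}": SERVERS[i // 10] for i in range(40)}

# First ordering: NSD_00, NSD_01, ..., NSD_39.
sequential = sorted(owner)

# Second ordering: take one NSD from each server's group in turn.
groups = [[n for n in sequential if owner[n] == s] for s in SERVERS]
interleaved = list(chain.from_iterable(zip(*groups)))


def reads_until_all_servers(order):
    """Number of consecutive NSDs, counted from the start of the list, that
    must be touched before every server has received at least one request."""
    seen = set()
    for count, nsd in enumerate(order, start=1):
        seen.add(owner[nsd])
        if len(seen) == len(SERVERS):
            return count
    return None


print(interleaved[:8])
print("sequential: ", reads_until_all_servers(sequential))
print("interleaved:", reads_until_all_servers(interleaved))
```

    Under that assumption the first ordering needs 31 consecutive requests before the fourth server sees any work, while the interleaved ordering needs only 4.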

    Within the NSDs owned by one NSD server, you can also manipulate the NSD ordering such that you rapidly alternate across multiple storage arrays rather than clump all the NSDs from the same storage array together.

    I don't know if you can dump GPFS IO history from your compute node. If it works, it should show you the last 512 IO requests and the NSD servers that they came from. Are the requests alternating across all four NSD servers?

    Hope this helps.

    Dave B.