I'm currently trying to track down an issue in a cluster implementation.
When using NSDPerf on two of our compute nodes (pureFlex x240) in a single chassis, I see 3GB/s read and 1.5GB/s write performance.
Using IB_WRITE_BW or IB_READ_BW I do get the full 6GB/s for both read and write...
Any idea where I can start investigating?
The real problem is that we don't get any better GPFS throughput to the 4 NSD servers than 3GB/s read and write, even with multiple clients, although the servers themselves are capable of putting 12GB/s to the disk subsystem (we verified raw read and write performance). We also see that the full bandwidth to the NSD servers is never used and therefore suspect problems in the IB fabric. The fabric is clean and on the latest FW levels.
Thanks for any hints and tips...
Re: NSDPerf Values 2013-01-29T07:49:57Z
- SystemAdmin 110000D4XK
Here is the output of mmlsconfig:
Configuration data for cluster rb-hpc-eth:
File systems in cluster rb-hpc-eth:
Re: NSDPerf Values 2013-01-29T18:08:32Z
- SystemAdmin 110000D4XK
Okay, the first thing I would do is make sure your single-client-to-filesystem bandwidth is what you would expect, which it seems you have done. Provided you don't have routing issues and your opensm setup isn't horked, GPFS+RDMA should be pretty straightforward. If you've absolutely confirmed that your storage subsystem is set up correctly, and made sure all of the GPFS blocksize bits match, you might need to play with a few GPFS tunables.
I believe (GPFS devs would need to confirm this) that nsdThreadsPerDisk is ignored in GPFS 3.5.
I know that in GPFS 3.5 the queues for IO are split into large and small queues. There are also two new tunables, nsdThreadsPerQueue and nsdSmallThreadRatio.
Here is what I have, which seems to work pretty well:
Setting nsdSmallThreadRatio=1 will create an equal number of small and large queues. If you run mmfsadm dump nsd (while idle, or during IO), you should see the following:
mmfsadm dump nsd
Derived config parms:
threadRatio: 1, threadsPerQueue: 12, numLargeQueues 24, numSmallQueues: 24
largeBufferSize: 8388608, smallBufferSize: 65536, desiredThreadsForType 288
During IO, dump nsd, and check how many waiting requests you have in your large queues. You might need more threadsPerQueue.
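The numbers in the dump above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the relationship desiredThreadsForType = threadsPerQueue x number of queues of one type is inferred from the values shown (12 x 24 = 288), not taken from GPFS documentation, and the variable names are just illustrative.

```python
# Values copied from the "Derived config parms" dump above.
threads_per_queue = 12
num_large_queues = 24
num_small_queues = 24
large_buffer_size = 8388608  # bytes
small_buffer_size = 65536    # bytes

# Assumption: threads per queue type = threadsPerQueue * queues of that type.
threads_for_large = threads_per_queue * num_large_queues
threads_for_small = threads_per_queue * num_small_queues

# Buffer sizes in more readable units.
large_buffer_mib = large_buffer_size // (1024 * 1024)
small_buffer_kib = small_buffer_size // 1024

print(f"threads for large queues: {threads_for_large}")
print(f"threads for small queues: {threads_for_small}")
print(f"large buffer: {large_buffer_mib} MiB, small buffer: {small_buffer_kib} KiB")
```

If the inference holds, raising threadsPerQueue scales the thread count for each queue type linearly.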
Just my thoughts...
Hope this helps.
- db808 270002HU3E
Re: NSDPerf Values 2013-01-30T19:57:43Z
Hi Volker,
When I first read your question, the fact that you were seeing 1/4 of the total expected bandwidth from your 4 NSD servers led me to believe that your remote NSD configuration might be grossly unbalanced.
In general, GPFS will send the inter-cluster NSD request to the NSD server that is listed first in the definition of the NSD. You can change this ordering with the mmchnsd command.
What does the NSD configuration look like? You should have each of your 4 NSD servers as the first remote NSD server in 1/4 of the NSDs ... if the 4 NSD servers are directly attached to all LUNs. If you are using the typical GPFS topology with a LUN only visible to 2 NSD servers, then you would want 1/2 of the LUN's NSD configurations to specify one NSD server, and the other 1/2 specify the second NSD server as the first member.
If your four NSD servers are all directly connected to the storage (call them nsd_server1, nsd_server2, nsd_server3, and nsd_server4) ... and all the NSDs are configured with "nsd_server1" as the first remote NSD server, then ALL the traffic from the compute nodes will be directed to "nsd_server1" and the other three NSD servers will be mostly idle.
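As a rough illustration of the balanced case, the sketch below rotates the server list so that each of the four servers is listed first for 1/4 of the NSDs. It assumes all four servers see every LUN; the NSD names and the helper function are hypothetical, and the resulting lists would still have to be applied with mmchnsd in whatever descriptor format your GPFS level expects.

```python
from collections import Counter

def balanced_server_lists(nsd_names, servers):
    """Rotate the server list so each server is first for an equal share of NSDs."""
    plan = {}
    for i, nsd in enumerate(nsd_names):
        k = i % len(servers)
        # Server k goes first; the remaining servers follow as backups.
        plan[nsd] = servers[k:] + servers[:k]
    return plan

servers = ["nsd_server1", "nsd_server2", "nsd_server3", "nsd_server4"]
nsds = [f"NSD_{i:02d}" for i in range(40)]  # hypothetical NSD names

plan = balanced_server_lists(nsds, servers)
first_counts = Counter(order[0] for order in plan.values())

for nsd in nsds[:4]:
    print(nsd, ",".join(plan[nsd]))
print(dict(first_counts))
```

With 40 NSDs and 4 servers, each server ends up as the first server for 10 NSDs.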
Being able to aggregate sequential performance across multiple NSD servers is also somewhat dependent on the ordering of the NSDs when the file system was created.
I'll give a simple example. Let's assume that you have 40 LUNs (and NSDs) of the same size, and:
NSD_00 - NSD_09 are on nsd_server1
NSD_10 - NSD_19 are on nsd_server2
NSD_20 - NSD_29 are on nsd_server3
NSD_30 - NSD_39 are on nsd_server4
If you create the file system with the NSD ordering of:
NSD_00, NSD_01, ... NSD_09, NSD_10 ...NSD_19, NSD_20 ... NSD_29, NSD_30 ... NSD_39
GPFS will try to create 40-way stripes in basically the above ordering (if the LUNs are the same size). There will be small variations caused by files that are less than 40 x blocksize in length and by the modulo 40 x blocksize remainder unevenness at the end of the files ... but the big picture will be a 40-way stripe.
In this example, on average, the first 10 read-aheads go to the same NSD server, since the NSDs are owned by the same NSD server. You do not get the second NSD server involved until the 11th read-ahead. The third NSD server is not involved until the 21st read-ahead, and the fourth NSD server is not involved until the 31st read-ahead. Yes, GPFS can perform deep read-aheads, but that is difficult to sustain.
As an alternative, if you created the file system with the following ordering:
NSD_00, NSD_10, NSD_20, NSD_30, NSD_01, NSD_11, NSD_21, NSD_31 .... you are rapidly alternating across NSDs owned by different NSD servers. No two adjacent NSDs in the list are on the same NSD server.
In this latter case, a GPFS 4-deep read ahead engages all four NSD servers, and multi-NSD-server performance will ramp much faster and be easier to sustain.
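To make the difference between the two orderings concrete, here is a small sketch that counts how many distinct NSD servers a read-ahead of a given depth touches. It is a simplified model (one block per NSD, stripe walked strictly in list order, 10 NSDs per server as in the example), and the function names are made up for illustration.

```python
def grouped_order(n_servers=4, per_server=10):
    """NSD_00 .. NSD_39 in plain numeric order (all of one server's NSDs together)."""
    return list(range(n_servers * per_server))

def interleaved_order(n_servers=4, per_server=10):
    """NSD_00, NSD_10, NSD_20, NSD_30, NSD_01, NSD_11, ... (round-robin over servers)."""
    return [s * per_server + j for j in range(per_server) for s in range(n_servers)]

def servers_engaged(order, depth, per_server=10):
    """Number of distinct NSD servers hit by the first `depth` blocks of a stripe."""
    return len({nsd // per_server for nsd in order[:depth]})

grouped = grouped_order()
interleaved = interleaved_order()

for depth in (4, 11, 21, 31):
    print(depth, servers_engaged(grouped, depth), servers_engaged(interleaved, depth))
```

In this model the grouped ordering needs a 31-deep read-ahead before all four servers are busy, while the interleaved ordering reaches all four at depth 4.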
Within the NSDs owned by one NSD server, you can also manipulate the NSD ordering such that you rapidly alternate across multiple storage arrays rather than clump all the NSDs from the same storage array together.
I don't know if you can dump GPFS IO history from your compute node. If it works, it should show you the last 512 IO requests and the NSD servers that they came from. Are the requests alternating across all four NSD servers?
Hope this helps.