We are building a new GPFS cluster as an evolution of our old one. The new cluster will consist of 2 building blocks (it will probably be expanded to 4 later on). Each BB is composed of:
- 2x NSD servers (4 servers total)
- 2x IBM DS3512 NAS storage systems (4 total)
- 48x 3TB disks per storage system (96 disks per BB, 192 disks and ~430TB total)
- 4x SAS cables connecting the two servers to the two controllers (within the BB).
The building blocks are independent of each other (servers in BB1 cannot see disks in BB2), and all the servers are connected to a dedicated InfiniBand network.
In our experience with our current GPFS cluster (2 BBs, each consisting of 2 servers and 1 IBM DS3512 storage system with 36 disks => 4 servers, 72 disks, 150TB total), we have had problems with metadata access performance. We reached a point where the filesystem held 29 million files, 23 million of which were 4KB or smaller, and some processes were accessing a significant fraction of them.
Our old filesystem is composed of 6 NSDs, each defined as a RAID5 LUN using 8+1 disks. Data and metadata were mixed in the NSDs. The controllers' cache (2GB per controller) was enabled for all the LUNs but, since the NSDs contained both data and metadata, the cache was completely overwhelmed by data, rendering the metadata caching almost useless.
Willing to learn from our past mistakes, we are planning our new filesystem as follows:
- 20x Raid5 (8+1) groups.
- Each raid group contains a big (22TB) LUN for data and a small (50GB) LUN for metadata.
- The controller's cache is enabled only for the small LUNs.
- Splitting of data and metadata in different NSDs at filesystem level.
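For concreteness, the data/metadata split above could be expressed with an NSD stanza file roughly like this (device paths, NSD names, server names and pool names are placeholders of ours; the exact stanza keywords should be checked against your GPFS release):

```shell
# Hypothetical NSD stanza for one RAID group of BB1: the big LUN carries
# data only, the small cache-enabled LUN carries metadata only.
cat > nsd.stanza <<'EOF'
%nsd: device=/dev/mapper/bb1_rg01_data nsd=bb1_data01 servers=bb1-nsd1,bb1-nsd2 usage=dataOnly failureGroup=1 pool=data
%nsd: device=/dev/mapper/bb1_rg01_meta nsd=bb1_meta01 servers=bb1-nsd1,bb1-nsd2 usage=metadataOnly failureGroup=1 pool=system
EOF

# Create the NSDs from the stanza file:
mmcrnsd -F nsd.stanza
```

(One such pair of stanzas per RAID group, 20 in total.)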
We are aware that carving multiple LUNs out of a single RAID group creates a lot of I/O contention, but this way we can dedicate the controllers' cache (2GB x 2 controllers x 4 = 16GB) exclusively to metadata, while still spreading metadata across all 180 disk heads.
Is this a good idea? Should we dismiss it, ignore/disable the controllers' cache and mix data and metadata together in a single LUN/NSD? Or should we keep the caching but use metadata-exclusive RAID groups, even if that drops the head count from 180 to, say, 9?
And what about snapshots or MD replication in any of these scenarios? (we are too poor to afford a proper backup of our data)
We were also thinking about buying some SSDs for each server and using them as metadata-only NSDs, with replication among different servers. But we have some concerns about metadata integrity and its behaviour with building blocks:
We want the metadata disks in the 2 servers of BB1 to be identical and to replicate each other. We cannot allow metadata from BB1 to be replicated on BB2's disks, because if one server from BB1 failed and one server from BB2 failed too, all data disks would still be accessible but the metadata would be compromised.
Is this true or is there a way to enable explicit replication from NSDx to NSDy ?
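To make the question concrete, this is the kind of setup we had in mind, using one failure group per server (the failureGroup numbers, names and replication flags are our guesses at the mechanism; as far as we understand, GPFS only promises that the two metadata copies land in *different* failure groups, not in a specific pair of NSDs, which is exactly what worries us):

```shell
# Hypothetical SSD metadata NSDs, one per server, each in its own failure group:
cat > ssd.stanza <<'EOF'
%nsd: device=/dev/ssd0 nsd=bb1_ssd_meta1 servers=bb1-nsd1 usage=metadataOnly failureGroup=11 pool=system
%nsd: device=/dev/ssd0 nsd=bb1_ssd_meta2 servers=bb1-nsd2 usage=metadataOnly failureGroup=12 pool=system
%nsd: device=/dev/ssd0 nsd=bb2_ssd_meta1 servers=bb2-nsd1 usage=metadataOnly failureGroup=21 pool=system
%nsd: device=/dev/ssd0 nsd=bb2_ssd_meta2 servers=bb2-nsd2 usage=metadataOnly failureGroup=22 pool=system
EOF
mmcrnsd -F ssd.stanza

# Filesystem with 2 metadata replicas; nothing here pins BB1's replicas
# to failure groups 11+12, which is the guarantee we are looking for:
mmcrfs gpfs1 -F ssd.stanza -m 2 -M 2
```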
Thanks in advance,