Topic
  • 6 replies
  • Latest Post - 2013-03-12T16:43:27Z by SystemAdmin
SystemAdmin

Pinned topic architecture question -- large GPFS servers or SAN-attached GPFS clients

2013-03-11T19:14:56Z
I'm in the process of planning a new HPC cluster, and I'd appreciate
getting some feedback on different approaches to the GPFS architecture.

The cluster will have about 25~50 nodes initially (up to 1000 CPU-cores),
expected to grow to about 50~80 nodes.

The jobs are primarily independent and single-threaded, with a mixture of
small- to medium-sized I/O and a lot of random access. It is very common to
have hundreds of jobs, each accessing the same directories and often
overlapping on the same data files.

For example, many jobs on different nodes will use the same executable
and the same baseline data models, but will differ in individual data
files to compare to the model.

My goal is to ensure reasonable performance, particularly when there is
a lot of contention from multiple jobs accessing the same metadata and
some of the same data files.

My question here is about a choice between two GPFS architecture designs.
The storage array configurations, drive types, RAID types, etc. are
being examined separately. I'd really like to hear any suggestions
about these (or other) configurations:

[1] Large GPFS servers
  • About 5 GPFS servers with significant RAM. Each GPFS server would be connected to storage via an 8Gb/s fibre SAN (multiple paths) to storage arrays.
  • Each GPFS server would provide NSDs via 10Gb/s and 1Gb/s (for legacy servers) ethernet to the GPFS clients (the compute nodes).
  • Since the GPFS clients would not be SAN-attached with direct access to block storage, and many clients (~50) will access similar data (and the same directories) for many jobs, it seems like it would make sense to do a lot of caching on the GPFS servers. Multiple clients would benefit from reading the same cached data on the servers.

  • I'm thinking of sizing the caches to handle 1~2GB per core in the compute nodes, divided across the GPFS servers. This would mean a cache (pagepool, along with large maxFilesToCache and maxStatCache values) of about 200GB+ on each GPFS server.

Questions:

Is there any way to configure GPFS so that the GPFS servers can do a large amount of caching without requiring the same resources on the GPFS clients?

Is there any way to configure the GPFS clients so that their RAM can be used primarily for computational jobs?
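To make the intent concrete, here is roughly the kind of differential tuning I have in mind. This is only a sketch with placeholder values, assuming the predefined nsdNodes and clientNodes node classes can be used with mmchconfig:

    # Sketch only; the values below are placeholders, not recommendations.
    # Large cache on the GPFS/NSD servers:
    mmchconfig pagepool=200G,maxFilesToCache=500000,maxStatCache=1000000 -N nsdNodes
    # Keep the compute nodes lean so their RAM stays available for jobs:
    mmchconfig pagepool=1G,maxFilesToCache=10000,maxStatCache=20000 -N clientNodes
    # pagepool changes normally take effect after GPFS is restarted on the affected nodes.
    mmlsconfig pagepool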

[2] Direct-attached GPFS clients
  • About 3~5 GPFS servers with modest resources (8 CPU cores, ~60GB RAM).

  • Each GPFS server and client (HPC compute node) would be directly connected to the SAN (8Gb/s Fibre Channel, iSCSI over 10Gb/s ethernet, or FCoE over 10Gb/s ethernet); see the NSD sketch after this list.

  • Either 10Gb/s or 1Gb/s ethernet for communication between GPFS nodes.

  • Since this is a relatively small cluster in terms of total node count, the added cost of HBAs, switches, and cabling for direct-connecting all nodes to the storage shouldn't be excessive.
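For reference, my understanding is that the NSD definitions themselves would look the same in either design; with the stanza format it would be something like the sketch below (device, NSD, and server names are placeholders). A node that can reach the LUN over the SAN should do its block I/O directly, while the servers list is the fallback path for nodes without SAN access:

    # nsd.stanza (sketch only; all names are placeholders)
    %nsd:
      device=/dev/mapper/lun01
      nsd=nsd01
      servers=gpfs-srv1,gpfs-srv2
      usage=dataAndMetadata
      failureGroup=1

    # then create the NSDs with: mmcrnsd -F nsd.stanza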

Thanks,

Mark
Updated on 2013-03-12T16:43:27Z by SystemAdmin
  • SystemAdmin
    Re: architecture question -- large GPFS servers or SAN-attached GPFS clients
    2013-03-11T19:28:17Z

    I think you may be operating under a misconception. AFAIK, an NSD server does not cache disk data. It is simply a pass-through service providing access to its locally attached disks over a non-SAN network. I believe it was modeled after the old AIX VSD.

    See also: http://miguelsprofessional.blogspot.com/2009/12/gpfs-tuning-recommendations.html
  • SystemAdmin
    Re: architecture question -- large GPFS servers or SAN-attached GPFS clients
    2013-03-11T19:41:19Z

    > marcofGPFS wrote:
    > I think you may be operating under a misconception. AFAIK, an NSD server does not cache disk data. It is simply a pass-through

    Perhaps I'm confused, but the "GPFS Concepts, Planning, and Installation Guide" says:



    The pinned memory is called the pagepool and is configured by setting the pagepool cluster
    configuration parameter. This pinned area of memory is used for storing file data and for
    optimizing the performance of various data access patterns. In a non-pinned area of the shared
    segment, GPFS keeps information about open and recently opened files. This information is held
    in two forms:
    1. A full inode cache
    2. A stat cache

    Pinned and non-pinned memory
    Pinned memory

    GPFS uses pinned memory (also called pagepool memory) for storing file data and metadata in support
    of I/O operations. With some access patterns, increasing the amount of pagepool memory can increase
    I/O performance.



    > See also: http://miguelsprofessional.blogspot.com/2009/12/gpfs-tuning-recommendations.html

    Right...which says "GPFS does not use the regular file buffer cache of the operating system (f.e. non-computational memory in AIX) but uses its own mechanism to implement caching. GPFS uses pinned computational memory to maintain its file buffer cache, called the pagepool, which is used to cache user file data and file system metadata."
  • SystemAdmin
    Re: architecture question -- large GPFS servers or SAN-attached GPFS clients
    2013-03-11T22:30:54Z

    > Right...which says "GPFS uses pinned computational memory to maintain its file buffer cache, called the pagepool, which is used to cache user file data and file system metadata."
    Yes, GPFS caches data and metadata. BUT that's above the NSD/disk access layer. Within the software structure, all file operations eventually translate into reads and writes of disk sectors. Those disk operations are either issued to a device handled by the local OS (a Unix device) or, if the disk cannot be reached via the local OS facilities, passed over a network attachment at the NSD level, which has no notion of caching higher-level GPFS objects.

    OTOH, modern disk controllers do implement caches...
  • SystemAdmin
    Re: architecture question -- large GPFS servers or SAN-attached GPFS clients
    2013-03-11T23:15:12Z

    > - Since the GPFS clients would not be SAN attached with direct access to block storage, and many clients (~50) will access similar data (and the same directories) for many jobs, it seems like it would make sense to do a lot of caching on the GPFS servers. Multiple clients would benefit by reading from the same cached data on the servers.

    GPFS only does server-side caching when GPFS Native RAID (GNR) is used. Traditional NSD servers don't cache anything. More generally, GPFS is a cluster file system, not a client-server config, so all "client" nodes are responsible for doing their own IO, which could go through the local block device interface or the NSD protocol. GPFS does caching on the nodes where the data access actually occurs.
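    If you want to check which path a particular node is actually using, mmlsdisk with the -m option should show, for the node it is run on, whether I/O to each disk is satisfied locally or through an NSD server:

      # Run on a compute node; "fs1" is a placeholder file system name.
      # The report lists, per disk, whether I/O is performed on the local node
      # or routed through an NSD server.
      mmlsdisk fs1 -m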

    yuri
  • HajoEhlers
    Re: architecture question -- large GPFS servers or SAN-attached GPFS clients
    2013-03-12T14:36:31Z

    Some ideas:

    1) pNFS (NFS v4.1) - your NSD servers would run as pNFS servers as well.
    2) The GPFS WAN caching option - I forget its real name - caches large amounts of data, so why not use it on a local LAN? ;-) Even small blades can have over 200GB of RAM.

    You might also think about a non-symmetric access approach: read via pNFS or the GPFS WAN cache to benefit from the very large cache, but write via NSD.
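    If the read path went through pNFS/NFS, the compute nodes could mount that export read-only for the shared models while keeping their writes on the native GPFS/NSD path. A rough sketch, with placeholder server, export, and mount-point names:

      # Placeholder names; assumes the NSD servers also export the file system
      # over NFSv4.1 (pNFS).
      mount -t nfs4 -o ro,minorversion=1 gpfs-srv1:/gpfs/fs1/models /mnt/models
      # Writes would still go through the normal GPFS mount (e.g. /gpfs/fs1).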

    Happy testing
    Hajo
  • SystemAdmin
    Re: architecture question -- large GPFS servers or SAN-attached GPFS clients
    2013-03-12T16:43:27Z

    > GPFS only does server-side caching when GPFS Native RAID (GNR) is used. Traditional NSD servers don't cache anything.
    Native RAID caching? Interesting. So I suppose you could configure that to get block-level caching at the NSD server, though now somewhat below the NSD layer, where the GPFS Native RAID software essentially takes the place of an outboard RAID controller. Which, by the way, is great stuff if you want to build ultra-big, ultra-reliable disk storage out of JBODs.