Topic
  • 11 replies
  • Latest Post - ‏2014-08-29T07:15:24Z by Ronaldo@IBM
mduff
mduff
35 Posts

Pinned topic Reads from replicated metadata from one failure group

‏2012-11-08T00:06:12Z |
Hello,

We are currently seeing reads from metadataOnly NSDs use only one failure group. This is a local filesystem; no remote clusters are being used.

This can be seen while running a find command or a GPFS LIST policy. iostat is showing only one failure group being used for reads.
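
For reference, the iostat check is along these lines, where sdb and sdc are placeholders for one LUN in each failure group:

    iostat -xm sdb sdc 2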

Shouldn't reads of replicated metadata be using both failure groups to provide faster access? The nodes running the find or the GPFS policy have direct connections to all drives in both failure groups.

The readReplicaPolicy setting didn't make a difference (either default or local), which is expected, as I believe this only applies when using remote clusters.
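
For reference, the replication settings and failure-group layout can be confirmed with something like the following, where fsname is a placeholder for the file system name:

    mmlsfs fsname -m -M -r -R    (default and maximum metadata/data replicas)
    mmlsdisk fsname              (failure group of each NSD)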

Thank you
Updated on 2012-12-20T23:42:47Z by mduff
  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2012-11-08T00:13:30Z  
    It should unless there are suspended LUNs in the other FG.

    Or possibly if you created the filesystem with only one FG and then later changed it with mmchfs $fsname -m 2 and ran mmrestripefs -R to replicate the existing metadata. GPFS does not rebalance the metadata.
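
    For reference, that sequence would look roughly like this, with $fsname as a placeholder for the file system name:

    mmchfs $fsname -m 2        (raise the default number of metadata replicas to 2)
    mmrestripefs $fsname -R    (re-replicate existing files to match the new setting)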
  • mduff
    mduff
    35 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2012-11-28T00:18:00Z  
    • dlmcnabb
    • ‏2012-11-08T00:13:30Z
    It should unless there are suspended LUNs in the other FG.

    Or possibly if you created the filesystem with only one FG and then later changed it with mmchfs $fsname -m 2 and ran mmrestripefs -R to replicate the existing metadata. GPFS does not rebalance the metadata.
    Cheers for that, Dan.

    We lost an entire metadata failure group, and then ran the restripe after it was recovered.

    Should we run "mmrestripefs -b" to rebalance the metadata?

    Thank you
  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2012-11-28T06:58:39Z  
    • mduff
    • ‏2012-11-28T00:18:00Z
    Cheers for that, Dan.

    We lost an entire metadata failure group, and then ran the restripe after it was recovered.

    Should we run "mmrestripefs -b" to rebalance the metadata?

    Thank you
    Unfortunately, rebalance does not work on metadata.
  • mduff
    mduff
    35 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2012-11-28T16:11:22Z  
    • dlmcnabb
    • ‏2012-11-28T06:58:39Z
    Unfortunately, rebalance does not work on metadata.
    I'm not sure of the terminology, but if we call the MD copies primary and secondary, and we know that all of the primary copies are in one failure group (unbalanced), what are the criteria for using the secondary copy to improve read performance?
  • pce
    pce
    57 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2012-12-20T14:47:06Z  
    • mduff
    • ‏2012-11-28T16:11:22Z
    I'm not sure of the terminology, but if we call the MD copies primary and secondary, and we know that all of the primary copies are in one failure group (unbalanced), what are the criteria for using the secondary copy to improve read performance?
    I'd like to confirm the settings. Please run "mmfsadm dump config | grep readReplica" on the nodes of interest to see if they are all set to the same readReplicaPolicy.

    readReplicaPolicy is not a function of remote clusters. It specifies whether to preferentially use local (SAN) access. On reads, the code by default will read the first replica. When readReplicaPolicy is set to "local", and the first replica cannot be accessed locally on the node, the code will check whether the second replica can be, and use that instead. Since both replicas are locally accessible, I don't see how it can be used to select between replicas in general.
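
    As an aside, the setting can be compared across several nodes in one step with something like:

    mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump config | grep readReplicaPolicy"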
  • mduff
    mduff
    35 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2012-12-20T15:42:55Z  
    • pce
    • ‏2012-12-20T14:47:06Z
    I'd like to confirm the settings. Please run "mmfsadm dump config | grep readReplica" on the nodes of interest to see if they are all set to the same readReplicaPolicy.

    readReplicaPolicy is not a function of remote clusters. It specifies whether to preferentially use local (SAN) access. On reads, the code by default will read the first replica. When readReplicaPolicy is set to "local", and the first replica cannot be accessed locally on the node, the code will check whether the second replica can be, and use that instead. Since both replicas are locally accessible, I don't see how it can be used to select between replicas in general.
    Thank you for this, pce.

    Dan actually has an informative post about this, which I have already summarized, and we have tested both settings.

    Here is what I sent:


    There is an undocumented configuration parameter that applies to where
    data is read.

    The readReplicaPolicy parameter can be set to either local or default.
    The only difference is that the 'local' setting specifies that the reads
    will be from local disk or an NSD server that is on the same subnet.  This
    is as opposed to 'default', which will choose the first available replica.

    GPFS can only differentiate between:
    1) Direct attach               vs   NSD attach
    2) NSD attach on local subnet  vs   NSD attach not on local subnet

    Would you be able to test reads going across all four NSD servers, with
    and without the readReplicaPolicy set to local?

    You can find the current setting of readReplicaPolicy with mmfsadm dump
    config:

    mmfsadm dump config | grep -i readReplicaPolicy
      readReplicaPolicy default

    Change the value with:

    mmchconfig readReplicaPolicy=local
    mmchconfig: Command successfully completed
    mmchconfig: Propagating the cluster configuration data to all
     affected nodes.  This is an asynchronous process.

    Here are the current settings:

    :/# /usr/lpp/mmfs/bin/mmfsadm dump config | grep readReplica
       readReplicaPolicy default
    []# /usr/lpp/mmfs/bin/mmfsadm dump config | grep readReplica
       readReplicaPolicy default
    /# /usr/lpp/mmfs/bin/mmfsadm dump config | grep readReplica
       readReplicaPolicy default
    /# /usr/lpp/mmfs/bin/mmfsadm dump config | grep readReplica
       readReplicaPolicy default
     
  • pce
    pce
    57 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2012-12-20T15:59:56Z  
    • mduff
    • ‏2012-12-20T15:42:55Z
    Thank you for this, pce.

    Dan actually has an informative post about this, which I have already summarized, and we have tested both settings.

    Here is what I sent:


    There is an undocumented configuration parameter that applies to where
    data is read.

    The readReplicaPolicy parameter can be set to either local or default.
    The only difference is that the 'local' setting specifies that the reads
    will be from local disk or an NSD server that is on the same subnet.  This
    is as opposed to 'default', which will choose the first available replica.

    GPFS can only differentiate between:
    1) Direct attach               vs   NSD attach
    2) NSD attach on local subnet  vs   NSD attach not on local subnet

    Would you be able to test reads going across all four NSD servers, with
    and without the readReplicaPolicy set to local?

    You can find the current setting of readReplicaPolicy with mmfsadm dump
    config:

    mmfsadm dump config | grep -i readReplicaPolicy
      readReplicaPolicy default

    Change the value with:

    mmchconfig readReplicaPolicy=local
    mmchconfig: Command successfully completed
    mmchconfig: Propagating the cluster configuration data to all
     affected nodes.  This is an asynchronous process.

    Here are the current settings:

    :/# /usr/lpp/mmfs/bin/mmfsadm dump config | grep readReplica
       readReplicaPolicy default
    []# /usr/lpp/mmfs/bin/mmfsadm dump config | grep readReplica
       readReplicaPolicy default
    /# /usr/lpp/mmfs/bin/mmfsadm dump config | grep readReplica
       readReplicaPolicy default
    /# /usr/lpp/mmfs/bin/mmfsadm dump config | grep readReplica
       readReplicaPolicy default
     
    The policy did not change to 'local'; it is still at 'default'.

    Use "mmchconfig readReplicaPolicy=local -i" to get an immediate effect, then try your experiments.
  • mduff
    mduff
    35 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2012-12-20T23:42:47Z  
    • pce
    • ‏2012-12-20T15:59:56Z
    The policy did not change to 'local'; it is still at 'default'.

    Use "mmchconfig readReplicaPolicy=local -i" to get an immediate effect, then try your experiments.
    We have tried both settings and there is no difference.
  • Ronaldo@IBM
    Ronaldo@IBM
    2 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2014-08-28T14:14:05Z  
    • mduff
    • ‏2012-12-20T23:42:47Z
    We have tried both settings and there is no difference.

    I do not agree, but this differs per code level. I am using GPFS 3.5.0.17 at this time.

     

    gpfsnode1:~ # mmlsconfig
    Configuration data for cluster slesgpfscluster.gpfsnode1:
    ---------------------------------------------------------
    myNodeConfigNumber 1
    clusterName slesgpfscluster.gpfsnode1
    clusterId 13187353207204870340
    autoload no
    dmapiFileHandleSize 32
    minReleaseLevel 3.5.0.11
    restripeOnDiskFailure yes
    adminMode central

    File systems in cluster slesgpfscluster.gpfsnode1:
    --------------------------------------------------
    /dev/sapmnt

    gpfsnode1:~ # mmfsadm dump config | grep readReplicaPolicy
     # readReplicaPolicy default
     

    gpfsnode1:~ # mmchconfig readReplicaPolicy=local
    mmchconfig: Command successfully completed
    mmchconfig: Propagating the cluster configuration data to all
      affected nodes.  This is an asynchronous process.
     

    gpfsnode1:~ # mmlsconfig
    Configuration data for cluster slesgpfscluster.gpfsnode1:
    ---------------------------------------------------------
    myNodeConfigNumber 1
    clusterName slesgpfscluster.gpfsnode1
    clusterId 13187353207204870340
    autoload no
    dmapiFileHandleSize 32
    minReleaseLevel 3.5.0.11
    restripeOnDiskFailure yes
    readReplicaPolicy local
    adminMode central

    File systems in cluster slesgpfscluster.gpfsnode1:
    --------------------------------------------------
    /dev/sapmnt
    gpfsnode1:~ # mmfsadm dump config | grep readReplicaPolicy
     # readReplicaPolicy default
    gpfsnode1:~ #
     

    Still the same, but:

     

    gpfsnode1:~ # mmchconfig readReplicaPolicy=local -i
    mmchconfig: Command successfully completed
    mmchconfig: Propagating the cluster configuration data to all
      affected nodes.  This is an asynchronous process.
    gpfsnode1:~ # mmlsconfig
    Configuration data for cluster slesgpfscluster.gpfsnode1:
    ---------------------------------------------------------
    myNodeConfigNumber 1
    clusterName slesgpfscluster.gpfsnode1
    clusterId 13187353207204870340
    autoload no
    dmapiFileHandleSize 32
    minReleaseLevel 3.5.0.11
    restripeOnDiskFailure yes
    readReplicaPolicy local
    adminMode central

    File systems in cluster slesgpfscluster.gpfsnode1:
    --------------------------------------------------
    /dev/sapmnt
    gpfsnode1:~ # mmfsadm dump config | grep readReplicaPolicy
     # readReplicaPolicy local

     

    You can also see the difference on the nodes: reads done on a node no longer trigger reads from other nodes.
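
    For instance, one way to see this from a node is the recent GPFS I/O history, which lists the disk that served each recent I/O:

    gpfsnode1:~ # mmdiag --iohist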

    But this only works with FPO GPFS clusters.

     

  • yuri
    yuri
    210 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2014-08-28T17:14:24Z  

    I do not agree, but this differs per code level. I am using GPFS 3.5.0.17 at this time.

     

    gpfsnode1:~ # mmlsconfig
    Configuration data for cluster slesgpfscluster.gpfsnode1:
    ---------------------------------------------------------
    myNodeConfigNumber 1
    clusterName slesgpfscluster.gpfsnode1
    clusterId 13187353207204870340
    autoload no
    dmapiFileHandleSize 32
    minReleaseLevel 3.5.0.11
    restripeOnDiskFailure yes
    adminMode central

    File systems in cluster slesgpfscluster.gpfsnode1:
    --------------------------------------------------
    /dev/sapmnt

    gpfsnode1:~ # mmfsadm dump config | grep readReplicaPolicy
     # readReplicaPolicy default
     

    gpfsnode1:~ # mmchconfig readReplicaPolicy=local
    mmchconfig: Command successfully completed
    mmchconfig: Propagating the cluster configuration data to all
      affected nodes.  This is an asynchronous process.
     

    gpfsnode1:~ # mmlsconfig
    Configuration data for cluster slesgpfscluster.gpfsnode1:
    ---------------------------------------------------------
    myNodeConfigNumber 1
    clusterName slesgpfscluster.gpfsnode1
    clusterId 13187353207204870340
    autoload no
    dmapiFileHandleSize 32
    minReleaseLevel 3.5.0.11
    restripeOnDiskFailure yes
    readReplicaPolicy local
    adminMode central

    File systems in cluster slesgpfscluster.gpfsnode1:
    --------------------------------------------------
    /dev/sapmnt
    gpfsnode1:~ # mmfsadm dump config | grep readReplicaPolicy
     # readReplicaPolicy default
    gpfsnode1:~ #
     

    Still the same, but:

     

    gpfsnode1:~ # mmchconfig readReplicaPolicy=local -i
    mmchconfig: Command successfully completed
    mmchconfig: Propagating the cluster configuration data to all
      affected nodes.  This is an asynchronous process.
    gpfsnode1:~ # mmlsconfig
    Configuration data for cluster slesgpfscluster.gpfsnode1:
    ---------------------------------------------------------
    myNodeConfigNumber 1
    clusterName slesgpfscluster.gpfsnode1
    clusterId 13187353207204870340
    autoload no
    dmapiFileHandleSize 32
    minReleaseLevel 3.5.0.11
    restripeOnDiskFailure yes
    readReplicaPolicy local
    adminMode central

    File systems in cluster slesgpfscluster.gpfsnode1:
    --------------------------------------------------
    /dev/sapmnt
    gpfsnode1:~ # mmfsadm dump config | grep readReplicaPolicy
     # readReplicaPolicy local

     

    You can also see the difference on the nodes: reads done on a node no longer trigger reads from other nodes.

    But this only works with FPO GPFS clusters.

     

    readReplicaPolicy is not related to FPO.  This tunable controls the choice of the replica to read when multiple replicas are available.  It is applicable in FPO clusters as well as in more traditional GPFS replicated environments, e.g. a "stretch cluster" active/active DR setup.  And yes, the "-i" parameter must be passed to mmchconfig if you want the change to be effective immediately, as opposed to the next GPFS restart.

    The tricky part about readReplicaPolicy=local is how to communicate the "disk locality" information to GPFS.  The currently available mechanism leverages subnet topology: if the NSD server for one of the replicas is on the same subnet as the client, while NSD servers for other replicas are not, then that replica is more local than others.  That mechanism, while practical in some cases, has some inherent flaws.  We're well aware of those, and are working on implementing a better mechanism (I can't discuss future function availability dates in this forum).  
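
    For illustration only (the addresses are placeholders, and this is not necessarily required for the locality check in every layout), cluster subnets are normally declared to GPFS with the subnets option:

    mmchconfig subnets="10.10.1.0 10.10.2.0"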

    Note that over the years the existing readReplicaPolicy code has been tweaked.  In particular, at one point the code would bypass replica locality determination if all replicas were available via a global SAN, even if NSD servers were defined.  That was fixed a year or so ago; now the code will select a more local replica even if the I/O is done through the local block device interface, provided NSD servers are defined.

    yuri

  • Ronaldo@IBM
    Ronaldo@IBM
    2 Posts

    Re: Reads from replicated metadata from one failure group

    ‏2014-08-29T07:15:24Z  
    • yuri
    • ‏2014-08-28T17:14:24Z

    readReplicaPolicy is not related to FPO.  This tunable controls the choice of the replica to read when multiple replicas are available.  It is applicable in FPO clusters as well as in more traditional GPFS replicated environments, e.g. a "stretch cluster" active/active DR setup.  And yes, the "-i" parameter must be passed to mmchconfig if you want the change to be effective immediately, as opposed to the next GPFS restart.

    The tricky part about readReplicaPolicy=local is how to communicate the "disk locality" information to GPFS.  The currently available mechanism leverages subnet topology: if the NSD server for one of the replicas is on the same subnet as the client, while NSD servers for other replicas are not, then that replica is more local than others.  That mechanism, while practical in some cases, has some inherent flaws.  We're well aware of those, and are working on implementing a better mechanism (I can't discuss future function availability dates in this forum).  

    Note that over the years the existing readReplicaPolicy code has been tweaked.  In particular, at one point the code would bypass replica locality determination if all replicas were available via a global SAN, even if NSD servers were defined.  That was fixed a year or so ago; now the code will select a more local replica even if the I/O is done through the local block device interface, provided NSD servers are defined.

    yuri

    Hi Yuri,

    I agree with your reply. Thanks for the update.

    Ronald.