Topic
28 replies Latest Post - 2012-06-19T18:49:58Z by osc
botemout
70 Posts

Pinned topic Migrating to metadataOnly disks

2012-05-22T20:57:40Z
Greetings,

I've just acquired six 15K SAS drives which I'll be adding, as metadataOnly NSDs, to a file system that presently contains 6 SATA NSDs (dataAndMetadata) in the system pool. I'm hoping this might improve our poor metadata performance. My question is about the best way to make the change.

Would it be as simple as:
  • create the NSDs for the 6 SAS drives
  • add them to the filesystem as metadataOnly
  • use mmchdisk to change the type of the 6 SATA NSDs to dataOnly
  • restripe the filesystem?

Is that safe? Can the filesystem remain online while this happens (I'm aware that it'll be under-performing while running the restripe)?
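
In command terms, I'm picturing something roughly like this (just a sketch; "myFS", the descriptor file, and the NSD names below are placeholders for my real ones):

  mmcrnsd -F /tmp/sas_meta.desc         # descriptors define the 6 SAS LUNs as metadataOnly NSDs in pool 'system'
  mmadddisk myFS -F /tmp/sas_meta.desc  # add the new NSDs to the existing filesystem
  mmchdisk myFS change -d "sata1:::dataOnly;sata2:::dataOnly"   # ...and so on for all 6 SATA NSDs
  mmrestripefs myFS -b                  # rebalance; metadata migrates onto the metadataOnly disks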

Thanks much,
JR
Updated on 2012-06-19T18:49:58Z by osc
  • SystemAdmin
    2092 Posts

    Re: Migrating to metadataOnly disks

    2012-05-22T21:17:36Z  in response to botemout
    Yes.... BUT

    1) Consider using policy and pools. That will give you the flexibility to store important data on the SAS/system drives.

    2) Test before messing with your precious production system files.
    • botemout
      70 Posts

      Re: Migrating to metadataOnly disks

      2012-05-22T21:44:52Z  in response to SystemAdmin
      Thanks Marc,

      If there's any risk to doing this with a live system, we would schedule some time to bring the filesystem down.

      As for policies and pools: I hadn't thought about doing it this way. I can imagine that I'd use a file migration policy, but I'm having trouble thinking through how this would work. Since I have 197TB of space and only 139TB is in use, I could remove an NSD and add it back in a new pool. I could then write a migrate rule to move the data to that pool. Something like that? I'm not sure how a fileset would be used (or whether it would be needed). Perhaps you/someone could point me to some more specific documentation/examples that do this?

      Thanks much,
      John
      • botemout
        70 Posts

        Re: Migrating to metadataOnly disks

        2012-05-22T22:00:47Z  in response to botemout
        Hmmm ... is it as simple as:

        • add 6 metadataOnly NSDs

        • remove an existing NSD (I have room to remove one) and re-add it into a different storage pool, say dataOnlySP

        • create a placement rule so that all new files go to the new pool; something like:
        RULE 'all_new_files_to_dataOnlySP' SET POOL 'dataOnlySP'

        • create a migration rule something like:
        RULE 'migrate_to_dataOnlySP' MIGRATE FROM POOL 'system' TO POOL 'dataOnlySP' LIMIT (95%)

        Does that sound close?
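
        If that's roughly right, I'm guessing the mechanics would be something like this (untested sketch; the policy file name is made up):

          /* /tmp/dataOnlySP.pol -- placement rule plus migration rule, as above */
          RULE 'all_new_files_to_dataOnlySP' SET POOL 'dataOnlySP'
          RULE 'migrate_to_dataOnlySP' MIGRATE FROM POOL 'system' TO POOL 'dataOnlySP' LIMIT(95)

          # install the rules (placement takes effect immediately), then run the migration when convenient
          mmchpolicy myFS /tmp/dataOnlySP.pol
          mmapplypolicy myFS -P /tmp/dataOnlySP.pol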

        Thanks
      • SystemAdmin
        2092 Posts

        Re: Migrating to metadataOnly disks

        2012-05-23T10:45:02Z  in response to botemout
        Yes, your plan is safe and will work, with the proviso that once you have marked your existing dataAndMetadata disks as dataOnly, you do not, under any circumstances, unmount the file system until you have completed the restripe.

        If you do then bad things happen.
        • botemout
          70 Posts

          Re: Migrating to metadataOnly disks

          2012-05-23T14:20:39Z  in response to SystemAdmin
          Hi Jonathan,

          Do you think the original plan is safer and easier than using policy rules? I suppose it's possible that we could have a power failure while the restripe is happening (sure, we have UPSs, but ...), or that the tiebreaker disk will fail and the filesystem will be unmounted. Unlikely but not impossible. What exactly would happen if the filesystem was unmounted from all nodes? An unrecoverable situation?

          I'm curious whether the policy recipe I sketched out above might be better and safer? It certainly sounds less convenient (since I'd have to restart the migrate policy multiple times). Of course, I'm not sure whether it's actually viable. For instance, after the new dataOnly disks are in and I start the migrate, will the metadata be migrated properly to the metadataOnly disks?

          John
          • dlmcnabb
            1012 Posts

            Re: Migrating to metadataOnly disks

            2012-05-23T14:40:41Z  in response to botemout
            Nothing bad will happen as long as you don't lose any disks. Losing disks may mean you lose copies of some system metadata blocks which may render the filesystem unmountable.

            Stopping the restripe is safe since the metadata is still either on the old disk or the new disk.
            • SystemAdmin
              2092 Posts

              Re: Migrating to metadataOnly disks

              2012-05-23T15:20:37Z  in response to dlmcnabb
              Policy is preferable if there is the possibility that you may want to store some data on the new "mostly metadata" disks.

              Which depends on your application mix and your tiering strategy.

              For example:

              If you are running apps that would benefit from having some/all of their files on fast seeking disks.
              E.g. "hot" database tables.

              If you can distinguish new and frequently accessed data, from seldom accessed "historic" data.

              The only hassle may be getting started, because the GPFS pools facility wasn't designed to make this particular change easy. Specifically (see the Admin Guide): "To move a disk from one storage pool to another, use the mmdeldisk and mmadddisk commands."

              Which means you have to move the very data you want to store on your old disks off of those disks, change the disk pool designation of those disks, and then move the data back! Sorry. If you have 50% free space, this isn't too terrible:

              1) Add all the new "fast" disks as system pool disks.

              2) deldisk half of your old disks. mmdeldisk will move their data to any available space on the other (new and old) disks, all still in pool 'system'. Then adddisk the same disks (NSDs) back, but set them to a new pool name like 'data'.

              3) Use policy to migrate data TO POOL 'data' - BUT use the -I defer option, so you avoid actually moving any data blocks at this step - which means it will go pretty quickly. The files will all be marked as "supposed to be in pool 'data'", but the data blocks will remain in pool 'system' until the next restripe.

              4) deldisk the other old disks. Then adddisk ...

              5) Restripe the entire filesystem.
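
               As a rough command-level sketch of steps 2 through 5 (the disk, pool, and file names here are invented for illustration):

                 # step 2: drain half of the old NSDs, then re-add the very same NSDs into a new pool 'data'
                 mmdeldisk myFS "old1;old2;old3"
                 mmadddisk myFS -F /tmp/old_nsds_as_pool_data.desc

                 # step 3: mark files for pool 'data' without moving any data blocks yet
                 mmapplypolicy myFS -P /tmp/migrate_to_data.pol -I defer

                 # step 4: repeat the mmdeldisk/mmadddisk dance for the remaining old NSDs

                 # step 5: restripe; the deferred migrations are carried out here
                 mmrestripefs myFS -b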
              • SystemAdmin
                2092 Posts

                Re: Migrating to metadataOnly disks

                2012-05-23T15:25:29Z  in response to SystemAdmin
                Oh, if step 4 doesn't succeed for lack of disk space, you can mmrestripefs at that point, which will move the data blocks that belong in pool 'data'. Then retry step 4.
              • botemout
                70 Posts

                Re: Migrating to metadataOnly disks

                2012-05-23T16:40:51Z  in response to SystemAdmin
                Marc,

                I don't think I'll need to write any data to these fast drives. I have a filesystem composed of 10 SSDs that we use for that.

                I'm leaning toward using the non-policy approach. I'm a little confused as to the danger of the filesystem being unmounted before the restripe completes. At one point Jonathon said that it needs to finish, then later he said that it might be okay if it had just started.
                • SystemAdmin
                  2092 Posts

                  Re: Migrating to metadataOnly disks

                  2012-05-23T18:51:17Z  in response to botemout
                  SSD disks? Hmmm... if you had it to do over again, you might consider combining everything into one filesystem.
                  Then you could use your presumably fast SSD disks for the metadata ...

                  This is a scenario where pools and policy can work to your advantage.
                  Administratively and from an application point of view, it's often easier and simpler to have all files in one filesystem with one namespace. But if you are so lucky as to have several classes of disk storage, with different performance characteristics, each goes into the appropriate pool ...

                  I'm not saying you should convert today. We all have "legacy baggage". Just consider it down the road.
                  • botemout
                    70 Posts

                    Re: Migrating to metadataOnly disks

                    2012-05-23T19:53:18Z  in response to SystemAdmin
                    Marc,

                    The SSD drives are only 100G so a mere 10 won't hold all my metadata. Also, we've had several occasions now where we've had to fsck our filesystem and the pain is non-trivial. I'm a bit more cautious about creating gigantic filesystems than I was when I first encountered GPFS ;-)
          • SystemAdmin
            2092 Posts

            Re: Migrating to metadataOnly disks

            2012-05-23T15:59:25Z  in response to botemout
            It looks like a rather simple system: six NSDs that are currently mixed data/metadata, gaining six extra disks (I am presuming they are actually RAID arrays) for pure metadata. A metadata restripe should be pretty quick; how many files do you have?

            If you unmount the file system before the restripe has at least started, it can become unmountable; I have the badge. We were able to recover the situation by marking the original mixed disks as data/metadata again and mounting the file system. This was a gamble, after IBM told us to restore from backup, which is not funny when that restore is over 100TB.
            • botemout
              70 Posts

              Re: Migrating to metadataOnly disks

              2012-05-23T16:28:16Z  in response to SystemAdmin
              Hi Jonathon,

              Yes, this filesystem is pretty straightforward. The existing mixed disks are actually 8+P R6 LUNs (~24TB). Looks like this:

              Raid cabinet #1:
              3 mixed SATA drives
              3 15K 600G SAS drives
              failure group = 3000

              Raid cabinet #2:
              3 mixed SATA drives
              3 15K 600G SAS drives
              failure group = 3001

              The newly added six 15K 600G drives are not RAIDed, though I'm replicating metadata.

              I don't expect that it will be especially fast. I have about 500 million used inodes (last time I looked there were about 420 million actual files).

              So, are you saying that if the restripe starts and, say, runs for 5 mins, but then stops and the filesystem is unmounted, that that's sufficient to allow the filesystem to be mounted again? (Seems odd.)

              Thanks much,
            • SystemAdmin
              2092 Posts

              Interrupting restripe ...

              2012-05-23T18:44:53Z  in response to SystemAdmin
              By design, restripe is interruptible; even a hard system crash should be recoverable.

              The system copies a datablock, then logs changes to the metadata, to record the new datablock location.

              If you crash before the log record is flushed, then it's as if it never happened.
              If you crash after the log record is flushed, then part of the system bringup replays the log...

              Conceptually, pretty simple. As always the "devil is in the details".

              But seriously, if it doesn't work, that's a bug we need to know about and fix.
              • SystemAdmin
                2092 Posts

                Re: Interrupting restripe ...

                2012-05-24T12:42:54Z  in response to SystemAdmin
                The situation was that we had a system with lots of RAID6 1TB SATA drives, spread over two DS4700 arrays, built into a single file system with the disks as mixed data/metadata.

                The performance was not good, so we added to each DS4700 fifteen RAID1 arrays of 147GB 15krpm disks. These disks were marked as metadata disks and added to the file system. The mixed data/metadata disks were then changed to dataOnly disks. Everything was fine.

                Then, before the restripe of the file system took place, the whole lot was unmounted so the disk firmware could be upgraded on the DS4700s. When we then attempted to remount the file system, it would not mount.

                I can dig out the PMR number, but the basics were that IBM told us to restore the file system. At that point, with nothing to lose, we decided to change the RAID6 disks back to data/metadata and try the remount. It took rather a long time, but it did in the end mount. We then did an offline fsck, followed by changing the disks back to dataOnly and a restripe.
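
                In command terms the recovery amounted to roughly this (reconstructed from memory; the filesystem and NSD names here are invented):

                  mmchdisk bigFS change -d "sata1:::dataAndMetadata;sata2:::dataAndMetadata"   # all of the old RAID6 NSDs
                  mmmount bigFS -a       # took a long time, but it mounted
                  mmumount bigFS -a
                  mmfsck bigFS           # offline fsck
                  mmchdisk bigFS change -d "sata1:::dataOnly;sata2:::dataOnly"
                  mmrestripefs bigFS -b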

                So, for certain, bad stuff happens if you unmount before the restripe starts. I have not been daft enough to repeat the exercise, and we left the file system mounted for the restripe. So you might be alright after it has done a bit of restriping, but who knows; I would not want to experiment.
                • SystemAdmin
                  2092 Posts

                  Re: Interrupting restripe ...

                  2012-05-24T15:09:59Z  in response to SystemAdmin
                  Hmmm... big variable there was the change of firmware. Of course, that procedure was also designed to be data-preserving, but who knows?

                  For example, most controllers these days have a non-volatile RAM (NVRAM) cache that should be flushed to disk before making changes to the controller...
                  Crucial DB and filesystem log records and hot control blocks are most likely to be sitting in that cache! Unmounting does not change that; only the disk controller knows.

                  marc.
                  • SystemAdmin
                    2092 Posts

                    Re: Interrupting restripe ...

                    2012-05-25T09:26:11Z  in response to SystemAdmin
                    The firmware was of no consequence. Remember that a firmware update of the actual hard disks on a DS4000, or any other Engenio storage array, requires complete quiescence of all activity on the controller for the duration of the upgrade; that was why the file system was unmounted. Besides which, the file system's metadata was replicated anyway.

                    My working hypothesis is that there were critical bits of metadata on disks that were originally mixed data/metadata, and these needed updating on mounting the file system. When they are on a disk that is marked dataOnly, GPFS won't update the metadata and consequently won't mount the file system.

                    However, from my perspective it is not something worth investigating; it is simpler to just not do that.
                    • dlmcnabb
                      1012 Posts

                      Re: Interrupting restripe ...

                      2012-05-25T18:30:26Z  in response to SystemAdmin
                      Your hypothesis is not correct. It does not matter to GPFS which disk the metadata is located on.

                      The disk usage type only matters when allocating new objects, and also to the mmrestripefs command which will notice that something is not on the correct type of disk and will move it.
                      • SystemAdmin
                        2092 Posts

                        Re: Interrupting restripe ...

                        2012-05-28T10:18:20Z  in response to dlmcnabb
                        Then do you want to explain why, until I marked the disks as dataAndMetadata again, the file system was unmountable?

                        Regardless of how you may think GPFS works, until the disks were marked as being able to store metadata again, the file system refused to mount. This is empirical evidence that it needed to update metadata on disks that were marked as dataOnly and bailed out when it could not.

                        Clearly there is an issue around this, at least in the old 3.2.x GPFS I was using at the time. Maybe it has changed in newer GPFS, but it emphatically was an issue in the past.
                        • SystemAdmin
                          2092 Posts

                          Re: Interrupting restripe ...

                          2012-05-28T14:35:56Z  in response to SystemAdmin
                          Just to note that Dan DOES know how GPFS works -- he's one of several IBMers who have been working on GPFS since before it was GPFS.
                          • SystemAdmin
                            2092 Posts

                            Re: Interrupting restripe ...

                            2012-05-29T08:55:52Z  in response to SystemAdmin
                            Sure, but I have a data point (admittedly with an older version of GPFS than is current) that contradicts what Dan thinks GPFS should do. So while I fully accept that Dan thinks adding extra metadataOnly disks and marking all your dataAndMetadata disks as dataOnly should not cause the file system to become unmountable before a restripe takes place, in practice, for me at least, it did. Given that the file system mounted after marking the disks as dataAndMetadata again, this is very strong evidence that, at least in the past, GPFS wanted to fiddle with metadata on mount, and if that metadata was on a disk marked dataOnly it gave up on the mount.

                            If there is another explanation I am all ears.

                            Like in science experimental data (with the proviso that it is verified) trumps all theories every time with no exceptions.
                            • SystemAdmin
                              2092 Posts

                              Experiment done - NTF

                              2012-05-29T14:12:02Z  in response to SystemAdmin
                              So let's do an experiment:

                              
                               [root@fin44 gpfs-git]# mmdf xxx
                               disk                disk size  failure holds    holds              free KB             free KB
                               name                    in KB    group metadata data        in full blocks        in fragments
                               --------------- ------------- -------- -------- ----- -------------------- -------------------
                               Disks in storage pool: system (Maximum disk size allowed is 7.2 GB)
                               dx                     256000        1 No       Yes          253440 ( 99%)           496 ( 0%)
                               dxp                    262144        1 Yes      No           119296 ( 46%)           616 ( 0%)
                                               -------------                         -------------------- -------------------
                               (pool total)           518144                                372736 ( 72%)          1112 ( 0%)

                               Disks in storage pool: p2 (Maximum disk size allowed is 2.9 GB)
                               dx2                    102400        1 No       Yes               0 (  0%)             0 ( 0%)
                                               -------------                         -------------------- -------------------
                               (pool total)           102400                                     0 (  0%)             0 ( 0%)

                               [root@fin44 gpfs-git]# mmrestripefs xxx -b
                               Scanning file system metadata, phase 1 ...
                               Scan completed successfully.
                               Scanning file system metadata, phase 2 ...
                               Scanning file system metadata for p2 storage pool
                               Scan completed successfully.
                               Scanning file system metadata, phase 3 ...
                               Scan completed successfully.
                               Scanning file system metadata, phase 4 ...
                               Scan completed successfully.
                               Scanning user file metadata ...
                                100.00 % complete on Tue May 29 06:43:59 2012
                               Scan completed successfully.
                              

                              Notice that dx is dataOnly and dxp is metadataOnly.
                              I've restriped the filesystem so every bit of data and metadata is now on the proper disks.

                              Let's mount the filesystem and check that we can read a file:

                              
                               [root@fin44 gpfs-git]# mmmount xxx
                               Tue May 29 06:44:31 PDT 2012: mmmount: Mounting file systems ...

                               [root@fin44 gpfs-git]# mmlsattr -L /xxx/tdir/*
                               file name:            /xxx/tdir/jonathan
                               metadata replication: 1 max 2
                               data replication:     1 max 2
                               immutable:            no
                               appendOnly:           no
                               flags:
                               storage pool name:    system
                               fileset name:         root
                               snapshot name:
                               creation time:        Tue May 29 06:45:37 2012
                               Windows attributes:   ARCHIVE

                               [root@fin44 gpfs-git]# cat /xxx/tdir/jonathan
                               Hello!
                              

                              While the system is mounted...
                              Let's flip metadataOnly and dataOnly designation of the disks, and check that we can still read a file:
                              
                               [root@fin44 gpfs-git]# mmchdisk xxx change -d "dxp:::dataOnly"
                               Attention: No metadata disks remain.
                               Verifying file system configuration information ...
                               mmchdisk: Propagating the cluster configuration data to all affected nodes.
                                 This is an asynchronous process.

                               [root@fin44 gpfs-git]# mmchdisk xxx change -d "dx:::metadataOnly"
                               Verifying file system configuration information ...
                               mmchdisk: Propagating the cluster configuration data to all affected nodes.
                                 This is an asynchronous process.

                               [root@fin44 gpfs-git]# mmdf xxx
                               disk                disk size  failure holds    holds              free KB             free KB
                               name                    in KB    group metadata data        in full blocks        in fragments
                               --------------- ------------- -------- -------- ----- -------------------- -------------------
                               Disks in storage pool: system (Maximum disk size allowed is 7.2 GB)
                               dx                     256000        1 Yes      No           253696 ( 99%)           248 ( 0%)
                               dxp                    262144        1 No       Yes          119040 ( 45%)           864 ( 0%)
                                               -------------                         -------------------- -------------------
                               (pool total)           518144                                372736 ( 72%)          1112 ( 0%)

                               Disks in storage pool: p2 (Maximum disk size allowed is 2.9 GB)
                               dx2                    102400        1 No       Yes               0 (  0%)             0 ( 0%)
                                               -------------                         -------------------- -------------------
                               (pool total)           102400                                     0 (  0%)             0 ( 0%)

                               [root@fin44 gpfs-git]# mmlsattr -L /xxx/tdir/*
                               file name:            /xxx/tdir/jonathan
                               metadata replication: 1 max 2
                               data replication:     1 max 2
                               immutable:            no
                               appendOnly:           no
                               flags:
                               storage pool name:    system
                               fileset name:         root
                               snapshot name:
                               creation time:        Tue May 29 06:45:37 2012
                               Windows attributes:   ARCHIVE

                               [root@fin44 gpfs-git]# cat /xxx/tdir/jonathan
                               Hello!
                              


                              Now while the data and metadata are still on the "wrong disks", let's unmount, re-mount and check
                              that worked:

                              
                               [root@fin44 gpfs-git]# mmumount xxx
                               Tue May 29 06:54:31 PDT 2012: mmumount: Unmounting file systems ...
                               [root@fin44 gpfs-git]# cat /xxx/tdir/jonathan
                               cat: /xxx/tdir/jonathan: No such file or directory
                               [root@fin44 gpfs-git]# mmmount xxx
                               Tue May 29 06:54:51 PDT 2012: mmmount: Mounting file systems ...
                               [root@fin44 gpfs-git]# cat /xxx/tdir/jonathan
                               Hello!
                              

                               NTF! (no trouble found)

                               Just for good measure and neatness, let's restripe again... test... unmount, remount, and test again...
                              
                               [root@fin44 gpfs-git]# mmrestripefs xxx -b
                               Scanning file system metadata, phase 1 ...
                               Scan completed successfully.
                               Scanning file system metadata, phase 2 ...
                               Scanning file system metadata for p2 storage pool
                               Scan completed successfully.
                               Scanning file system metadata, phase 3 ...
                               Scan completed successfully.
                               Scanning file system metadata, phase 4 ...
                               Scan completed successfully.
                               Scanning user file metadata ...
                                100.00 % complete on Tue May 29 07:09:57 2012
                               Scan completed successfully.
                               [root@fin44 gpfs-git]# cat /xxx/tdir/jonathan
                               Hello!
                               [root@fin44 gpfs-git]# mmumount xxx
                               Tue May 29 07:10:09 PDT 2012: mmumount: Unmounting file systems ...
                               [root@fin44 gpfs-git]# cat /xxx/tdir/jonathan
                               cat: /xxx/tdir/jonathan: No such file or directory
                               [root@fin44 gpfs-git]# mmmount xxx
                               Tue May 29 07:10:23 PDT 2012: mmmount: Mounting file systems ...
                               [root@fin44 gpfs-git]# cat /xxx/tdir/jonathan
                               Hello!
                              

                              NTF!
                            • dlmcnabb
                              1012 Posts

                              Re: Interrupting restripe ...

                              2012-05-29T15:39:21Z  in response to SystemAdmin
                              Did you turn quotas on or off? The quota files are very special. When they are active, their blocks need to be on metadata disks, and when inactive they must reside on data disks. If the files did not exist, when the first mount happens, they need to be created on metadata disks. If quota was being turned off, then existing quota files need to be moved from the metadata disks to data disks.

                              Without traces on the FS manager, we may never know why it failed. But it was not because of the disk usage designation.
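
                               If it ever happens again, capturing the quota and disk state at the time would help; something like the following (device name hypothetical):

                                 mmlsfs myFS -Q     # which quota types are enforced
                                 mmlsdisk myFS -L   # per-NSD usage type (dataOnly / metadataOnly / dataAndMetadata) and status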
  • botemout
    70 Posts

    Re: Migrating to metadataOnly disks

    2012-05-24T21:41:58Z  in response to botemout
    Okay, sounds like the original plan is the best approach (or, at least, the easiest). Could someone confirm that the following recipe is sane?
    • create the NSDs for the 6 SAS drives and add them to the filesystem as metadataOnly
    • use mmchdisk to change the type of the 6 SATA NSDs to dataOnly using:

    mmchdisk myFS change -d "disk1:::dataOnly;<others>"

    • restripe the filesystem

    mmrestripefs myFS -b -N <all NSD servers and fast linux clients>

    • cross my fingers and hope I don't lose any disks while this is running ;-)
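
    And I assume I can sanity-check afterwards with something like this, to confirm where the metadata ended up (sketch):

      mmlsdisk myFS    # which NSDs hold metadata vs. data
      mmdf myFS        # free space per disk and per pool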

    Getting back to safety: would there be any gain in doing this in stages, i.e., adding the new metadataOnly disks but then only changing, say, half of the data drives to dataOnly?

    It turns out that I was wrong about the number of NSDs. About a month ago I added 3 more NSDs (i.e., another storage cabinet) when we were very close to filling up the filesystem. These 3 NSDs are dramatically less full than the oldest 6, which is why I think I should select -b as the option to mmrestripefs. However, I worry that all this IO will crater our performance. Is there anything built into GPFS that moderates the impact of this operation (i.e., some threshold of impact that it won't go beyond)? I doubt it, but just wondering. Many of the files on the system are large and invariant, so I think it will help us a lot.

    Thanks to all for the information and help,
    John
    • SystemAdmin
      2092 Posts

      Re: Migrating to metadataOnly disks

      2012-05-24T21:55:48Z  in response to botemout
      Reliability (surviving disk crashes) is gained by using GPFS replication and/or disk controller duplication and/or RAID.
      • botemout
        70 Posts

        Re: Migrating to metadataOnly disks

        2012-05-25T00:20:48Z  in response to SystemAdmin
        Marc, we have a duplicate copy of our data at a DR site. We also replicate metadata and the LUNs are Raid 6.
  • osc
    11 Posts

    Re: Migrating to metadataOnly disks

    2012-06-19T18:49:58Z  in response to botemout
    I've wondered about this for some time. Another simple solution would be to create the new disks as replicas of the old ones, in a separate failure group, but as metadataOnly. Then, when replication is complete, change the mixed disks to dataOnly, and the restripe should take almost no time at all.

    I was going to test this as I could not recall if this was supported, but I have yet to have the time. Anyone tried this?
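
    Untested, but I imagine the mechanics would be roughly this (a sketch only, assuming the filesystem allows metadata replication of 2; names are placeholders):

      mmadddisk myFS -F /tmp/new_meta.desc      # new metadataOnly NSDs in their own failure group
      mmchfs myFS -m 2                          # default metadata replicas = 2, if not already
      mmrestripefs myFS -R                      # apply the new replication, i.e. copy metadata onto the new failure group
      mmchdisk myFS change -d "old1:::dataOnly;old2:::dataOnly"
      mmrestripefs myFS -b                      # should have very little metadata left to move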