Topic
  • 20 replies
  • Latest Post - ‏2013-03-27T13:30:07Z by robinsguk
robinsguk
robinsguk
48 Posts

Pinned topic Excessive save and restore times for mksysb

‏2013-03-25T11:05:27Z |
Hello,

AIX 5.3 TL9

I've been working with a customer to migrate them from P5 to P7.

We are running a mksysb on one of the existing LPARs and it's taking around 14 hours. This produces a mksysb file which is around 7.5GB in size.

Initially we found that the /var/spool/mqueue folder had over 3 million entries. This has been cleared down now.

However the mksysb still takes around 14 hours. I believe the problem may be to do with the number of inodes.

I noticed during the mksysb creation that the find command is executed and this seems to be taking the largest proportion of the overall 14 hours.
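
Since the suspicion here is the sheer number of files that find has to walk, one way to check is to count files per directory and list the worst offenders. The following is a minimal sketch and not part of mksysb itself: top_dirs is a hypothetical helper name, and it assumes a POSIX shell with find, sed, sort, uniq and head available.

```shell
# top_dirs ROOT [N]: print the N directories under ROOT that hold the most files.
# Hypothetical helper for illustration only.
top_dirs() {
    root=$1
    n=${2:-10}
    # -xdev keeps find on the filesystem that ROOT lives on.
    # The first sed expression reduces each file path to its parent directory;
    # the second maps an empty result (a file directly under /) back to "/".
    find "$root" -xdev -type f -print 2>/dev/null |
        sed -e 's|/[^/]*$||' -e 's|^$|/|' |
        sort | uniq -c | sort -rn | head -n "$n"
}
```

For example, `top_dirs /var 5` would list the five directories under /var with the most files, each prefixed by its file count. With millions of queue files, /var/spool/mqueue would be expected near the top.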

When we recreate the LPAR on P7 using the mksysb it takes over 40 hours!!!!!

Any suggestions on how we might reduce the save and restore times of the mksysb?

Thanks

Glenn
Updated on 2013-03-27T13:30:07Z at 2013-03-27T13:30:07Z by robinsguk
  • unixgrl
    unixgrl
    185 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T14:34:34Z  
    Take a look at the options you are using for mksysb and see if they make sense for what you have.

    -e excludes the files and directories matched by the patterns listed in /etc/exclude.rootvg.

    If you have things in rootvg that really don't belong, exclude them. Also exclude directories such as /tmp or /var/tmp.
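
As a rough illustration of the advice above, the sketch below adds patterns to an exclude file without duplicating entries. add_exclude is a hypothetical helper, the ^./tmp/ pattern form follows the example in the mksysb documentation, and the backup target in the final comment is a placeholder path.

```shell
# add_exclude FILE PATTERN: append PATTERN to FILE unless that exact line is already present.
# Hypothetical helper for illustration only.
add_exclude() {
    file=$1
    pattern=$2
    # -x matches whole lines, -F treats the pattern as a fixed string, -q suppresses output.
    # A missing file makes grep fail, so the pattern is appended and the file is created.
    if ! grep -qxF -e "$pattern" "$file" 2>/dev/null; then
        printf '%s\n' "$pattern" >> "$file"
    fi
}

# Typical use on the LPAR (as root); not executed here:
#   add_exclude /etc/exclude.rootvg '^./tmp/'
#   add_exclude /etc/exclude.rootvg '^./var/tmp/'
#   mksysb -e -i /backup/lpar.mksysb    # placeholder target path
```

Keeping the helper idempotent means it can be rerun after each round of housekeeping without bloating the exclude file.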
  • robinsguk
    robinsguk
    48 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T14:44:16Z  
    Sorry, I should have said that we are using -e and that /etc/exclude.rootvg is populated to exclude

    ^/var/spool/mqueue

    and some other directories.

    Without this the mksysb is over 16GB.

    It looks like mksysb runs find, which takes forever, but then refers to /etc/exclude.rootvg when it starts saving the data.
    Glenn
  • alethad
    alethad
    286 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T14:52:25Z  
    I know this is a stab in the dark, but could you also be having some type of adapter issue? Have you checked for any errors on the adapter or VIO, or some other network error maybe? What device are you sending the mksysb to? Or are you using NIM, which would be network related?
    That is a really long time for just a simple mksysb. Is the rootvg really getting hit hard by users or paging or something?

    Just 2 cents.
    Good luck.

    You've got to continue to grow, or you're just like last night's cornbread -- stale & dry Loretta Lynn alethad
  • robinsguk
    robinsguk
    48 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T15:00:33Z  
    I haven't checked the adapters but, if that were the case on the P5 box, I'd expect the restore of the mksysb to fly through on the P7 box as there are no issues there.

    You make a fair point about the activity on the system. There is plenty, but I'm seeing the find command at the top of the list of CPU and I/O consumers. Again, I'd expect the restore of the mksysb to fly as there is no other activity on the system other than VIOS and NIM.

    I've run a mksysb save and restore on a couple of other LPARs and they are all done and dusted in less than 25 mins.

    I'm deploying the mksysb from NIM. I've also tried deploying the mksysb using ISO images in VIOS. It makes no difference; it still takes over 40 hours to restore.
    Glenn
  • alethad
    alethad
    286 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T15:16:15Z  
    Ok.
    There's got to be a difference somewhere. What about the resources you have allocated to this LPAR as compared to others? Is it a tiny LPAR or under-allocated resources or something? Is the VIO set up differently? Has this always been an issue for this LPAR? Where is your rootvg located? Locally to the system or across network?

    But you still have the network as a point of failure with NIM. As in, are you seeing dropped packets to this LPAR and not the others?
    Since I don't know your environment I can only guess. Just trying to put some ideas out there for you in case you haven't thought of them, although I'm sure you have. I'm not sure I wouldn't run a performance monitor on it just for kicks. Strange.
    Maybe someone else has some better ideas for you.
    Good luck.

  • robinsguk
    robinsguk
    48 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T15:21:29Z  
    NIM is on the same box and it flies with the other LPARS.

    This is specific to this LPAR I believe. As with the other LPARS rootvg is on the internal disks.

    This, for me, is something to do with the large number of files/inodes in use.

    Glenn
  • alethad
    alethad
    286 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T15:43:10Z  
    Yeah, but 40 hours? Even if it is 16G, that's still a couple of days. I mean, I get 60G an hour with normal backups across my network. Are the other LPARs' mksysb images this big?
  • robinsguk
    robinsguk
    48 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T16:02:16Z  
    Absolutely agree.

    The NIM server is on the same P7 as the LPARs.

    It flies through the mksysb restore until it gets to about 29% and then grinds to a halt. We've tried this four times now with a new mksysb after some housekeeping/modifying exclude.rootvg and it slows down at the same point.

    The box is a 16c 720 with 128GB memory. The LPAR I'm restoring to has 4 cores and 48GB memory.

    VIOS and NIM run just fine.

    It's something weird with this one LPAR not the P5 or P7.

    Like you, I've done a shed load of these over the years, and they're pretty straightforward normally. This one has me scratching my head.
  • alethad
    alethad
    286 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T16:16:50Z  
    yeah like wow. Not envying you with this one. Man. And believe me I've had some doozies over the years too. Not this but other stuff.

    So it's got to be whatever files it's hitting at the 29% mark. File locks maybe? Did you do a verbose listing to see the files at that point? Dumb question I know.

    I guess the only thing I can think of beyond that is to test it by excluding the main filesystems one at a time to at least identify which one of those is the culprit. That would be painful though.

    Ok I'll shut up now. Hope you find it soon.

  • MarkTaylor
    MarkTaylor
    230 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T16:48:37Z  
    So, going back to basics, what is in rootvg that the find command is getting hung up on? Do you have a filesystem in rootvg with millions of files, or a filesystem with very, very large files?

    If any of this is the case, then you need to rethink your strategy. rootvg should be as small as possible, meaning recovery from mksysb is as quick as possible. If you have applications running out of a rootvg filesystem, then consider migrating those filesystems into an appvg, etc.

    If you don't have any of the above, then proctree the find command to see if you can ID where it's hanging. Maybe procstack, trace, or truss will give you clues. Collect some 1-second nmon data and tie it all together, blah blah, debugging 101.
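
The debugging steps above could be wrapped up roughly as follows. This is only a sketch: inspect_proc is a hypothetical helper, it assumes the AIX proc tools are installed, and procwdx (which prints a process's current working directory) is an addition not mentioned in the thread. On a system without these tools it simply reports them as unavailable.

```shell
# inspect_proc PID: run the available AIX proc tools against one process.
# Hypothetical helper for illustration only.
inspect_proc() {
    pid=$1
    for tool in proctree procwdx procstack; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "== $tool $pid"
            "$tool" "$pid"
        else
            echo "== $tool not available on this system"
        fi
    done
}
```

One way to use it while the mksysb is running would be to pick the PID of the find process from `ps -ef | grep '[f]ind'` and pass it to `inspect_proc`. Repeating the call a few minutes apart should show whether the working directory is moving or stuck in one place. `truss -p <pid>` and an nmon recording such as `nmon -f -s 1 -c 600` (one-second samples for ten minutes) can then be lined up against that.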

    HTH
    Mark Taylor
  • MarkTaylor
    MarkTaylor
    230 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T16:49:17Z  
    Also, 5.3 TL9!!! Really? You know it's out of support, right? :)
  • robinsguk
    robinsguk
    48 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T16:58:42Z  
    I understand the 5.3 support issue and the rootvg size comment.

    This is a customer situation I've been put on to sort out in the last 6 weeks so I can't comment on the how's and why's.

    We all know that stuff like this happens in the 'real world', I just want to fix the issue for the customer.

    They've done some more housekeeping today and will run a mksysb at 2am tonight. Hopefully we will get some better news.

    Thanks for all the feedback, appreciated.

    Glenn
  • alethad
    alethad
    286 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T18:50:52Z  
    You know it's going to be some crazy piece of third-party app that the customer just happened to forget they still had running out there, one that just happens to do file locks or some other stupid process that sucks the life out of the LPAR. And there you are, caught up in the mess.

    :)
    Sorry. I just had to say it.
    I get held hostage by vendors on OS levels too.
    Good luck tonight.
  • robinsguk
    robinsguk
    48 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T18:55:04Z  
    I thought I'd leave this bit out until someone mentioned the app.

    It's SAP running Oracle.

    :-)
  • alethad
    alethad
    286 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T19:06:20Z  
    I was just making a joke. They aren't running any part of Oracle in rootvg, are they?

  • alethad
    alethad
    286 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-25T19:20:04Z  
    Forgot to ask: what versions of those are you running?
  • jklotz
    jklotz
    27 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-26T12:13:18Z  
    hello,

    Just a few suggestions :
    • 'du -ms *' in the root folder gives the size of each subdirectory in MB, so you can drill down to the biggest one;
    • 'df -g' also displays the number of inodes used per filesystem (the Iused and %Iused columns), which can be useful;
    • lines in /etc/exclude.rootvg should be '^./filesystem', not '^/filesystem'. Not sure about the impact, but it could make this entry worthless;
    • 'restore -Tvqf mksysb.file' lists the contents (.TOC) of the mksysb, which is useful for checking whether a directory was included in the backup.
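
The point about '^./' versus '^/' can be checked with plain grep. The sketch below assumes that mksysb builds its file list with paths relative to / (that is, ./var/...), as the documented ^./tmp/ example suggests, and that exclude patterns follow grep matching rules. It only simulates the filtering and never calls mksysb; the sample paths and the filter_paths helper are made up for illustration.

```shell
# filter_paths EXCLUDE_FILE: read paths on stdin and print those NOT matched
# by any pattern in EXCLUDE_FILE (grep-style matching). Hypothetical helper.
filter_paths() {
    grep -v -f "$1"
}

workdir=${TMPDIR:-/tmp}/exclude_demo.$$
mkdir -p "$workdir"

# Made-up sample of a relative file list.
printf '%s\n' './etc/hosts' './var/spool/mqueue/qf001' './var/adm/wtmp' > "$workdir/paths"

# Pattern as used earlier in the thread, and the form suggested above.
printf '%s\n' '^/var/spool/mqueue' > "$workdir/exclude.bad"
printf '%s\n' '^./var/spool/mqueue/' > "$workdir/exclude.good"

bad_kept=$(filter_paths "$workdir/exclude.bad" < "$workdir/paths" | wc -l | tr -d ' ')
good_kept=$(filter_paths "$workdir/exclude.good" < "$workdir/paths" | wc -l | tr -d ' ')

echo "paths kept with ^/ pattern:  $bad_kept"
echo "paths kept with ^./ pattern: $good_kept"

rm -rf "$workdir"
```

Under these assumptions the '^/' pattern keeps all three sample paths, because every line starts with a dot and the pattern never matches, while the '^./' pattern keeps two and drops the mqueue entry.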
  • MarkTaylor
    MarkTaylor
    230 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-27T12:10:06Z  
    Glenn, I'm not too sure you have taken on board my comments in the vein they were intended. They were specifically about how to fix this for the customer. For example, if they have a rootvg filesystem with millions of files and that's where the mksysb find command is hanging up (before you get to the exclude), then there is zilch you can do about it without moving the filesystem onto a different VG. Unless, that is, you hack the mksysb script to exclude that filesystem from the find command, which obviously would not be supported but is do-able. What is your main concern: the backup time, the restore time, or both? If the hangup is the find command, using an exclude file will save you some time on the backup part, but the restore will be a lot faster.

    You mentioned the find command was the longest part of the process. Did you get anywhere in debugging where and why, etc.? This will help the investigation.

    Rgds
    Mark Taylor
  • MarkTaylor
    MarkTaylor
    230 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-27T12:28:36Z  
    The mksysb command is a tad backwards and could probably do with a rewrite. Search the mksysb script for "-fstype jfs -o -fstype jfs2" and you will see the find command as part of a loop. The variable that you are looping through comes from IMAGE_DATA, so you could just hack the image data rather than the mksysb script, and make sure you don't regenerate it at the start of the mksysb. Try it out.

    HTH
    Mark Taylor
  • robinsguk
    robinsguk
    48 Posts

    Re: Excessive save and restore times for mksysb

    ‏2013-03-27T13:30:07Z  
    Just heard from the customer that the mksysb is now down to 15 minutes :-)

    Seems we just needed to clear down as many of the files in /var/spool/mqueue as possible, as well as redirecting output from SAP cron jobs away from the mailq.
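
Cron mails anything a job writes to stdout or stderr to the crontab owner, so jobs without redirection keep feeding the mail queue. A minimal sketch for spotting such entries follows. unredirected_jobs is a hypothetical helper, the job path in the comment is invented, and the check is deliberately crude: any '>' on the line counts as redirection.

```shell
# unredirected_jobs: read crontab text on stdin and print the job lines
# that contain no output redirection at all. Hypothetical helper.
unredirected_jobs() {
    # First grep drops comment lines, second drops blank lines,
    # third drops lines that already redirect output somewhere.
    grep -v '^[[:space:]]*#' |
        grep -v '^[[:space:]]*$' |
        grep -v '>'
}

# Typical use; not executed here:
#   crontab -l | unredirected_jobs
# A listed job could then be changed along these lines (invented script path):
#   0 2 * * * /usr/local/bin/sap_job.sh >/dev/null 2>&1
```

Sending the output to a log file instead of /dev/null would keep it available for troubleshooting while still staying out of the mail queue.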

    Thanks all.

    Glenn