1 reply Latest Post - ‏2013-05-03T18:32:46Z by bodily
GregioRenato
5 Posts

Pinned topic Cross-Site LVM Mirroring Problem

2013-04-24T20:10:32Z

 

Hi everyone, 
 
I am asking for help with the following scenario:
I have an AIX 6.1 environment running PowerHA Cross-Site LVM Mirroring with SAP and Oracle. It consists of 2 LPARs at different sites on P795 hardware, with Cross-Site LVM Mirroring between 2 DS8700 storage systems. The sites are connected by a DWDM link, exactly as in the attached image.
 
With Cross-Site LVM Mirroring, my copies are in a perfect state (every PP on a disk at Site A is mirrored to a PP on a disk at Site B), and all my VGs have quorum disabled.
When my DWDM link fails, it needs 3 seconds to automatically switch to the alternate (redundant) link. When that happens (the link to the other storage system is lost), AIX logs an error in the errpt error log, and my VG detects that I have lost access to the disks at the remote site and marks the LVs as stale.
Last month I had exactly that problem: the link between the sites was lost for 3 seconds, so I lost access to the redundant storage and my systems kept accessing disks from the local storage only.
My HACMP didn't detect errors, which was expected because I have Cross-Site LVM Mirroring, but I had several other problems that caused a big impact on Oracle:
LVs marked as stale (expected)
AIX generated the error "PATH HAS FAILED" for the disks at the remote site (expected)
AIX generated the error "I/O ERROR DETECTED BY LVM" (not expected)
Oracle could not access the filesystem and locked up
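
For anyone who wants to confirm the stale state after such an event, here is a minimal sketch that picks stale logical volumes out of `lsvg -l` output. The VG and LV names and the sample text are hypothetical and only illustrate the usual column layout; they are not taken from my system. On a real node the text would come from running `lsvg -l <vgname>`.

```python
from typing import List


def stale_logical_volumes(lsvg_output: str) -> List[str]:
    """Return the names of LVs whose LV STATE column ends in '/stale'."""
    stale = []
    for line in lsvg_output.splitlines():
        fields = line.split()
        # Skip the "vgname:" line, the column header, and blank lines.
        # Data rows have at least 6 columns: name, type, LPs, PPs, PVs, state.
        if len(fields) < 6 or fields[0] == "LV":
            continue
        name, state = fields[0], fields[5]
        if state.endswith("/stale"):
            stale.append(name)
    return stale


# Hypothetical sample in the layout printed by `lsvg -l`.
SAMPLE = """\
oraclevg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
loglv01             jfs2log    1       2       2    open/syncd    N/A
oradatalv           jfs2       400     800     2    open/stale    /oracle/data
"""

if __name__ == "__main__":
    print(stale_logical_volumes(SAMPLE))
```

Once the remote disks are reachable again, the stale partitions still have to be resynchronized (for example with `syncvg -v <vgname>`); the sketch only reports them.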
 
After stabilizing the environment, I opened a PMR with IBM, and I am trying to identify why I get "I/O ERROR DETECTED BY LVM" when my Cross-Site LVM Mirroring copies are fully in sync.
 

 

I think this problem may be related to some disk tuning parameters, such as "hcheck_interval" or "rw_timeout": the disks wait a long time for a response from the second mirror copy, and Oracle cannot wait that long. So I am planning to tune these parameters, setting them to around 3 seconds.
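
To make the planned change concrete, here is a rough dry-run sketch that only prints the `lsattr` and `chdev` commands for a set of disks and does not execute anything. The disk names and attribute values are placeholders, not recommendations: the values a disk driver accepts can be listed with `lsattr -Rl`, and 3 seconds may well be outside the allowed range for a given attribute.

```python
from typing import Dict, List


def build_tuning_commands(disks: List[str], attrs: Dict[str, int]) -> List[str]:
    """Build (but do not run) AIX commands to inspect and change disk attributes."""
    commands = []
    for disk in disks:
        # Show the current values first so the change can be compared or reverted.
        show = " ".join(f"-a {name}" for name in attrs)
        commands.append(f"lsattr -El {disk} {show}")
        # List the values the driver accepts for each attribute.
        for name in attrs:
            commands.append(f"lsattr -Rl {disk} -a {name}")
        # -P records the change in the ODM only; it takes effect once the
        # device is reconfigured (typically at the next reboot), which is
        # needed while the disk is in use by a varied-on VG.
        change = " ".join(f"-a {name}={value}" for name, value in attrs.items())
        commands.append(f"chdev -l {disk} {change} -P")
    return commands


# Placeholder disks and values, for illustration only.
REMOTE_DISKS = ["hdisk10", "hdisk11"]
PLAN = {"hcheck_interval": 60, "rw_timeout": 30}

if __name__ == "__main__":
    for command in build_tuning_commands(REMOTE_DISKS, PLAN):
        print(command)
```

Printing the commands instead of running them keeps the plan reviewable, which seems safer to me while the PMR is still open.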
 
Can someone help me find a solution to this problem?

Thanks,
Renato Gregio

  • bodily
    33 Posts

    Re: Cross-Site LVM Mirroring Problem

    ‏2013-05-03T18:32:46Z  in response to GregioRenato

    You are correct on "My HACMP didn't detect errors": HACMP does not need to do anything in that case. This is purely AIX LVM. You would have the EXACT same results without HACMP in this scenario.

    In testing, I have seen I/O hangs to the primary (only remaining) copy in the 3-5 minute range before. The PowerHA 6.1 Enterprise Edition Redbook actually documented this result:

    "The status of the resource group pokrg is still available in node Zhifa for around 5 minutes, and during that time the application appears to be hung and the users cannot write or read to the disks."

    I would like to think there are some tuning parameters that help, but I can't say I've had any luck. I tried fast_fail on the FC adapter, hcheck_interval, and queue_depth, and the results were still about 90% the same.

    My inclination is that it is more fiber related. I have a long history with LVM mirroring, and I don't recall seeing delays this significant in the SCSI and SSA storage days. But I MAY have selective memory these days.

    I would be very curious to hear whether support gives you some options that help with this, as I would like to make note of it and push it out in our pubs if possible.