As you might already know, Flex nodes can now be managed by an HMC (*). If you did not know, have a look at IBM Announcement Letter ZG13-0206 (that is the EMEA one). The minimum required version is V7R7.7.0M2 (MH01354). What I was unaware of, and you might be too, is that the security model in the Flex FSP has changed. Forget about the users "admin", "general" and "HMC". They are still there but one cannot use them anymore.
When I joined the team I found two HMCs and two Flex nodes. However they had only three connections and not the full set of four I would have expected. One of the HMCs was connected only to its local node and was not managing the other node. The other HMC was managing both nodes fine. So I had to use the well-known "Add Connection" action.
Those with any HMC experience on Power know that the latter requires the "HMC" password. I didn't have it. You know how it works on Power5/6/7, right? No problem, I can reset it using the "admin" user and set up the new connection. The password reset will break the authentication of the other HMC but I can update the new password there. However, my colleagues were unable to supply me with the "admin" password either. Fine, one more password to reset, this time the "admin" one. Remember the Nac Mac Feegles: "Nae problemo!" We could call IBM Support to the rescue. They would come with their magic "celogin" user which can reset the lost password for "admin". We know how to do the rest.
Alas! It had served me well for the last 10 years but not this time. The option to reset any passwords was missing. My world was in shambles and my knowledge was useless. I had to openly admit I had asked Support the wrong question. Instead of asking "Could you please help with resetting the 'admin' password?" I should have asked "How can one connect an HMC to a Flex node?"
I thought I knew the answer and I was proven wrong. Time to go back to school. Unfortunately neither the Flex Knowledge Center nor the Flex-related redbooks were of help. Feeding different search patterns to IBM Search and Google returned links to documents I had already read. The HMC-to-Flex connection was still a mystery. There was little comfort in the fact that all my contacts in IBM would have advised what I've just described above. They knew Power perfectly, and were coming up with the same advice I would have given in cases like this. This was a bit of a concern to me but bearable. The Sev.3 call I had already opened should sooner or later reach the right people.
Remember Murphy's Law? If it can go wrong, it will. I had to work out one more minor nuisance. Long ago someone had mismatched the VIOS/VIOC slot numbers and there was a serviceable event on the HMC. As a result the Attention LED was lit. The "Attention" state was propagated to the CMM, which is right. I closed the HMC events and switched off the attention indicator. However the Flex node was still showing as "Critical" in the CMM. Advised by Support I restarted the CMM. The chassis holds the node managed by both HMCs and they both lost the connection. After the CMM restarted I was able to log in to it but the HMCs were now reporting Connection State: "Failed Authentication" and Connection Error: "Incorrect LDAP password". The node was not manageable any more. I will bold, hyperlink and tag these messages in the hope the search engines will index them. If they had been searchable I would have resolved my nuisance before it bit me.
From a concern it escalated to a problem. No (D)LPAR operations were possible. I had to ask IBM Support to raise my call to Sev.2. These days we cannot survive long without DLPAR, can we? Here is one piece of advice you should be aware of, and which I wish had been worded better:
The celogin account does not have access to edit user account information. The FSP for power nodes should be picking up the credentials used to login into the CMM of the chassis it's installed in. In other words, if one sets the USERID password to 'fred' for CMM access, they can also log into the FSP using USERID / fred.
If the FSP is not properly obtaining the user account information from the CMM, try a virtual or physical reseat of the node to ensure the FSP is restarted and rediscovered by the CMM.
1. If the customer would like a user id specifically for HMC use, they can create or modify an account on the CMM.
The user "celogin" has indeed lost the power to change the user password ... on Flex. The proper wording could have been something like "The 'celogin' account on Flex nodes does not have access to edit..." All other Power servers still follow the well-known security model and the Knowledge Center rightly advises contacting Support to get "celogin" access.
The information that one can log in to the FSP as "USERID" (the CMM default administrator) was of some use because I was now able to log in via the ASMI. I was even able to enable "celogin1". It was good to have a human login available as it gave me the chance to see the FSP logs. They showed the failed "admin" logins, which proves the account still exists. The question is how to reset its password, as "USERID" comes with the role "unknown". I still needed the HMC login though.
Here be dragons! You think a logon is harmless, right? I had the right to log on, I was not in breach of any security policy, and I was advised to do so by the vendor of the equipment. Following this train of thought I was confident, and was bitterly proven wrong again. Logging in to the working FSP as "USERID" triggered something and the connection to it was also severed. I now had both Flex nodes reporting "Failed Authentication". It was time to panic because the active sides of some important clusters were now not manageable. Luckily I was already in the middle of a Sev.1 and the wheels were spinning fast in the background.
The advice to define a user on the CMM is a misleading misinterpretation. I tried to create a user "HMC" on the CMM. Attempts to use its password as the HMC-password failed. Even attempts to log in as "HMC" via the ASMI failed.
I was reluctant to give in to the suggestion to re-seat the nodes, especially when it comes to a physical pull-out and slam-in, because I have to justify my actions and to back them with solid reasoning. In the UNIX land we have no other option but to have an enterprise attitude because we are under tight SLAs.
The correct answer is:
The password one should use as the HMC-password is the very same password used to log in as "USERID" on the CMM. If you change it on the CMM, you have to update it on the HMC(s) as well.
(Thank you Mihai, I am indebted to you. I do not like the thrill of the Sev.1's)
Yes, the HMC-password entered on the HMC is sent to the FSP, and later is somehow (read "via LDAP") authenticated on the CMM. Now it all clicked into place. The chassis and the nodes were installed half a year ago, and the "USERID" password had expired once before. As a matter of fact it expired again while my Sev.3 was finding its way through. Who could have known it was the root cause?! As often happens it was buried in the logs three months ago, and was disguised as a rather innocent-looking everyday activity. The HMC-FSP connection was running fine for months, and there was no need for any authentication. Once the connection was severed the HMC had to re-authenticate. The authentication was failing because of the changed password. The HMC had the old password cached.
How the logon triggered a re-authentication and broke the other node's connection is a mystery to me. The HMC was open but not touched at the time, and I detected the connection loss within a minute or two. It might have something to do with the "USERID" password being cached not at the HMC but at the FSP, and the authentication having flushed the cached copy. It is a wild guess, and I can raise several objections against it myself. I hope IBM will enlighten us some day.
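The failure sequence can be replayed with a toy model. This is only a sketch of the behaviour as I observed it, not of how IBM implements it: the class names, the method names and the "Connected"/"Disconnected" state labels are my own inventions, and the FSP relay and the LDAP step are collapsed into a single password comparison. Only the "Failed Authentication" state comes from the real HMC.

```python
class CMM:
    """Toy stand-in for the chassis account store (hypothetical, not an IBM API)."""

    def __init__(self, userid_password):
        self._userid_password = userid_password

    def change_userid_password(self, new_password):
        # E.g. after the password expired or was pulled via break-glass.
        self._userid_password = new_password

    def authenticate(self, password):
        return password == self._userid_password


class HMCConnection:
    """Toy stand-in for one HMC-to-node connection with a stored password."""

    def __init__(self, cmm, stored_password):
        self.cmm = cmm
        self.stored_password = stored_password
        self.state = "Disconnected"

    def connect(self):
        # The password is checked only when a connection is (re)established.
        ok = self.cmm.authenticate(self.stored_password)
        self.state = "Connected" if ok else "Failed Authentication"
        return self.state

    def sever(self):
        # E.g. a CMM restart drops the session.
        self.state = "Disconnected"

    def update_password(self, new_password):
        self.stored_password = new_password


cmm = CMM("old-secret")
hmc = HMCConnection(cmm, "old-secret")
hmc.connect()                               # connection comes up
cmm.change_userid_password("new-secret")    # nothing breaks yet
state_after_change = hmc.state
hmc.sever()
state_after_restart = hmc.connect()         # stale password is rejected
hmc.update_password("new-secret")
state_after_fix = hmc.connect()
```

The point of the model is the gap between the password change and the failure: the established connection survives the change, so the breakage shows up only at the next reconnect, possibly months later.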
Later Anthony was able to point me to IBM Support document N1010350 which describes the HMC-Flex connection:
The password field requires the Flex Node's password for USERID. The node authenticates against its ldap server which by default for an unmanaged chassis is CMM. If the CMM is configured to use an external LDAP server, the server must have an account of USERID.
Add the power node to a second HMC if desired. Dual HMCs are supported; both HMCs must be at the same version an(d) release.
Unfortunately the document did not come up in any searches, even when I was using only "Flex HMC". Had I opened it I might have read it in full, but the description is still deeply buried. It is nice to have the dual-HMC support statement though.
A few words about the fallout (as I see it):
The "USERID" cannot be deleted. Its name is hardcoded somewhere, probably in the FSP firmware.
If the CMM is set to "High" security, the default password age is 90 days. If a "Break-glass" process is implemented (those who deal with SOX would have it), the password has to be changed after being used. Thus every time I pull the CMM password for "USERID" I am going to break my HMC connection(s), and have to pull the HMC password(s) too in order to fix it. An alternative could be to deal with an awful lot of red tape in order to get a security waiver, and to exclude the CMM from the "Break-glass".
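Since the expiry is what bit me, it may be worth tracking the date outside the CMM. Below is a minimal sketch of the arithmetic only, assuming the 90-day age quoted above. The function names and the 14-day warning window are my own choices, and the date of the last password change has to be recorded by hand because nothing here talks to the CMM.

```python
from datetime import date, timedelta

# Assumption: the default password age for "High" security, as quoted above.
MAX_PASSWORD_AGE_DAYS = 90


def expiry_date(last_changed, max_age_days=MAX_PASSWORD_AGE_DAYS):
    """Date on which a password changed on `last_changed` reaches its maximum age."""
    return last_changed + timedelta(days=max_age_days)


def days_left(last_changed, today, max_age_days=MAX_PASSWORD_AGE_DAYS):
    """Days until expiry; negative once the password has already expired."""
    return (expiry_date(last_changed, max_age_days) - today).days


def needs_rotation(last_changed, today, warn_days=14,
                   max_age_days=MAX_PASSWORD_AGE_DAYS):
    """True when the password is expired or inside the warning window."""
    return days_left(last_changed, today, max_age_days) <= warn_days


if __name__ == "__main__":
    changed = date(2014, 1, 1)  # hypothetical date of the last USERID change
    print(expiry_date(changed), days_left(changed, date.today()))
```

Whoever rotates the "USERID" password on the CMM would then also schedule the matching password update on every HMC connected to the nodes in that chassis.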
The level of access needed to operate the HMC-FSP connection could be documented somewhere but I haven't found such a document. If reduced access is sufficient, a potential workaround could be to create a valid human "superadmin" account and to demote "USERID". Please consider this only as a working hypothesis as it still needs to be tested. Currently "USERID" is a "superadmin", and IBM might expect it to stay that way.
I hope IBM development will revisit the design of the security model, and will once again separate the HMC (application) authentication from the administrator (human) one.
* List of the Power-related acronyms used:
ASMI: Advanced System Management Interface
CMM: Chassis Management Module
FSP: Flexible Service Processor
HMC: Hardware Management Console
VIOC: Virtual I/O Client partition
VIOS: Virtual I/O Server