Upgrading to version 2.0.2

Follow this procedure to upgrade Cloud Pak for Data System to version 2.0.2. The end-to-end upgrade time from 2.0.1.1 to 2.0.2 ranges from 24 to 30 hours. This includes the system, firmware, and OCP/OCS component upgrades.

The 2.0.2 upgrade is split into two phases:
  • Platform upgrade (extended outage required)
  • OCP upgrade

The extended outage upgrade applies to both Cloud Pak for Data System and Netezza® Performance Server.

Before you begin

Upgrade prerequisites:

  • Your system must be on version 2.0.x and Cloud Pak for Data 4.0.2 to upgrade with the following instructions. For more information, see Verifying software components version.
  • If your version of Cloud Pak for Data System is 2.0.0, you must run the following command before you start the upgrade:
    sed -i -e '/apadmin/s/ibmapadmins/ibmapsysadmins/g' /opt/ibm/appliance/apupgrade/modules/ibm/ca/repository/yosemite_repository.py
    This command avoids an upgrade failure that is caused by an incorrect RPM user and group combination in the 2.0.0 upgrade script.
  • If FIPS is enabled on your system, you must disable it before starting the upgrade. For more information, see Configuring FIPS on pre-2.0.2 Cloud Pak for Data System. If you do not disable FIPS, apupgrade will fail.
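    To verify the current FIPS state before you disable it, you can, for example, run the standard RHEL check on the node. This assumes that the fips-mode-setup utility that ships with RHEL 8 is available on the node; it is not an apupgrade command:
    fips-mode-setup --check
    If the output reports that FIPS mode is enabled, disable it by using the procedure referenced above before you run apupgrade.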
Network setup prerequisites:
  • If the system already has a custom network configuration, it must be configured by using the playbooks in /opt/ibm/appliance/platform/apos-comms/customer_network_config/ansible with a System_Name.yml file:

    Before you upgrade, ensure that in /opt/ibm/appliance/platform/apos-comms/customer_network_config/ansible there is a System_Name.yml file specifying the house network configuration.

    To locate the file, run the following command from /opt/ibm/appliance/platform/apos-comms/customer_network_config/ansible:
    ls -t *yml | grep -v template | head -1

    If the file does not exist, you must create it. Otherwise, your network configuration might break during the upgrade. For more information on the file and how to create it, see the Node side network configuration section, and the following link specifically: Editing the network configuration YAML file.

  • To connect to the node1 management interface, you must use the custom_hostname value under the node1 section or the ip value under the network1 section of your System_Name.yml file. For example:
    all:
      children:
        control_nodes:
          hosts:
            node1:
              custom_hostname: sbpoc04a.svl.ibm.com
              management_network:
                network1:
                  ip: 9.30.106.111
  • If apupgrade detects a custom network configuration but no YAML file, it fails at the pre-check step.
  • If you are upgrading a new system with no network configuration, apupgrade does not stop at the check for System_Name.yml, but continues the upgrade process.
  • Before you start the upgrade, from /opt/ibm/appliance/platform/apos-comms/customer_network_config/ansible directory, you must run:
    ANSIBLE_HASH_BEHAVIOUR=merge ansible-playbook -i ./System_Name.yml playbooks/house_config.yml --check -v
    If any changes are listed in the --check -v output, ensure that they are expected. If they are unexpected, you must edit the YAML file so that it contains only the expected changes. You can rerun this command as necessary until you see no errors.
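    A clean dry run typically ends with an Ansible play recap that reports changed=0 for each host; the host name and task counts below are illustrative only:
    node1 : ok=25 changed=0 unreachable=0 failed=0
    A non-zero changed count means that the playbook would modify the running configuration; compare those changes against your System_Name.yml file before you proceed.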
Netezza prerequisites:
  • Before you start the 2.0.2 upgrade, you must stop NPS by completing the following steps. A scripted sketch of the monitoring shutdown in step 1 is provided after this list.
    1. Stop NPS monitoring. Run:
      1. oc get magnetomonitor -o=custom-columns='NAME:metadata.name,NUM_OF_MONITORS:spec.number_of_monitors,TYPE:spec.monitor_type' -n ap-magneto
        and save the number of monitors for monitor type: nps
      2. For each monitor of the nps type, run:
        oc patch magnetomonitor <magneto_monitor_name_of_nps_type> --type json -p '[{"op": "replace", "path": "/spec/number_of_monitors", "value": 0}]' -n ap-magneto
        For example:
        oc patch magnetomonitor magneto-monitor-nps-nps-1 --type json -p '[{"op": "replace", "path": "/spec/number_of_monitors", "value": 0}]' -n ap-magneto
        oc get magnetomonitor  -o=custom-columns='NAME:metadata.name,NUM_OF_MONITORS:spec.number_of_monitors,TYPE:spec.monitor_type' -n ap-magneto
        NAME                        NUM_OF_MONITORS   TYPE
        magneto-gateway             1                 gateway
        magneto-monitor-node        3                 node
        magneto-monitor-node2       1                 node2
        magneto-monitor-nps-nps-1   1                 nps
        magneto-monitor-ocs         1                 ocs
        where the monitor of type nps is magneto-monitor-nps-nps-1 and has num_of_monitors=1
    2. Stop NPS. For each NPS <nps_namespace_name> instance, run:
      1. Stop NPS by using nzstop. Run:
        oc --namespace=<nps_namespace_name> exec -t ipshost-0 -- su - nz -c "nzstop"
      2. Scale down ipshost and SPU statefulsets. Run:
        oc scale sts --all -n <nps_namespace_name> --replicas=0
      3. Run:
        oc get sts -n <nps_namespace_name>
        to confirm that all NPS statefulsets are scaled down. The expected value in the READY column is 0/0 for each statefulset. For example:
        NAME      READY   AGE
        ipshost   0/0     41d
      4. Run:
        oc get mcp
        to determine the machineconfigpool name that is associated with the NPS <nps_namespace_name> instance.
      5. Un-pause the associated machineconfigpool. Run:
        oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/nps-shared
      6. Run:
        oc get machineconfigpool/nps-shared -o yaml | grep paused
        to confirm that the associated machineconfigpool updates are unpaused. The expected result is as follows:
          f:paused: {}
          paused: false
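    The following is a minimal scripted sketch of the monitoring shutdown in step 1, based on the commands above. The file /tmp/nps_monitor_counts.txt is an arbitrary location chosen here to record the saved counts for the post-upgrade restore:
      # Select the monitors of type nps, save each current monitor count, then scale it to 0.
      for mon in $(oc get magnetomonitor -n ap-magneto --no-headers -o=custom-columns='NAME:metadata.name,TYPE:spec.monitor_type' | awk '$2=="nps" {print $1}'); do
        count=$(oc get magnetomonitor "$mon" -n ap-magneto --no-headers -o=custom-columns='NUM:spec.number_of_monitors')
        echo "$mon $count" >> /tmp/nps_monitor_counts.txt
        oc patch magnetomonitor "$mon" --type json -p '[{"op": "replace", "path": "/spec/number_of_monitors", "value": 0}]' -n ap-magneto
      done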

During the extended outage upgrade, after the firmware update phase, SPU nodes might power down because the PXE boot source (the NPS host) is unavailable, as it was shut down during the procedure. This is expected upgrade behavior. The SPUs are powered on as part of the Netezza Performance Server post-upgrade steps.

Procedure

  1. Connect to the node1 management interface by using the custom_hostname or ip values from your System_Name.yml file.
  2. Verify that e1n1 is the hub:
    1. Check for the hub node by verifying that the dhcpd service is running:
      systemctl is-active dhcpd
    2. If the dhcpd service is running on a node other than e1n1, bring the service down on that other node:
      systemctl stop dhcpd
    3. On e1n1, run:
      systemctl start dhcpd
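    For example, to check all control nodes in one pass, you can run a loop like the following from e1n1. The host names e1n2 and e1n3 are assumptions; adjust the list to match the control nodes in your system:
      for node in e1n1 e1n2 e1n3; do echo -n "$node: "; ssh $node systemctl is-active dhcpd; done
    Only the hub node is expected to report active.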
  3. Download the icpds_system-2.0.2.0_*.tar.gz bundle from Fix Central and copy it to /localrepo on e1n1.
    Note: The upgrade bundle requires a significant amount of free space. Make sure you delete all bundle files from previous releases.
  4. From the /localrepo directory on e1n1, run:
    mkdir 2.0.2.0_release
    and move the icpds_system-2.0.2.0_*.tar.gz bundle into that directory.
    The directory that is used here must be uniquely named; that is, no previous upgrade on the system can have been run out of a directory with the same name.
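    For example, assuming the bundle was copied into /localrepo in the previous step:
    cd /localrepo
    mkdir 2.0.2.0_release
    mv icpds_system-2.0.2.0_*.tar.gz 2.0.2.0_release/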
  5. Verify the status of your appliance by running:
    ap issues
    ap version -s
    ap sw
  6. Run apupgrade with the --upgrade-details option to view details about the specific upgrade version:
    apupgrade --upgrade-details --use-version 2.0.2.0_release --upgrade-directory /localrepo --phase platform
    To help you work through potential upgrade issues, a text file with the applicable workarounds is available in the following locations:
    • For the platform bundle: /localrepo/<2.0.2.0 upgrade dir>/EXTRACT/ocp/Workarounds-2.0.2.0.txt
    • For OCP: /opt/ibm/appliance/storage/platform/localrepo/<2.0.2.0 upgrade dir>/EXTRACT/ocp/Workarounds-2.0.2.0.txt
  7. Upgrade the apupgrade command to get the new command options:
    apupgrade --upgrade-apupgrade --use-version 2.0.2.0_release --upgrade-directory /localrepo --phase platform
  8. Before you start the upgrade process, depending on your requirements, run one of the following checks:
    • Run the preliminary checks with the --preliminary-check option if you want only to check for potential issues and cannot accept any system disruption:
      apupgrade --preliminary-check --use-version 2.0.2.0_release --upgrade-directory /localrepo --phase platform
      This check is non-invasive and you can rerun it as necessary. You can expect the following output after you run the preliminary checks:
      All preliminary checks complete
      Finished running pre-checks.
    • Optional: Run the preliminary checks with the --preliminary-check-with-fixes option if you want to check for potential issues and attempt to fix them automatically:
      apupgrade --preliminary-check-with-fixes --use-version 2.0.2.0_release --upgrade-directory /localrepo --phase platform
      Run this option only if you can accept disruption to your system, as this command might cause the nodes to reboot.

    The value for the --use-version parameter is the same as the name of the directory that you created in step 4.

  9. Run:
    ap issues
    to ensure that no pending issues are identified on the system.
  10. Initiate the upgrade of the platform.
    • Run:
      apupgrade --upgrade --use-version 2.0.2.0_release --upgrade-directory /localrepo --phase platform
      Note: You can monitor the extended outage upgrade by tracking the status file located at: /var/log/appliance/apupgrade/2.0.2.0_status/e1n1.apupgrade.2.0.2.0_platform.status

      The operator upgrade can be monitored by tracking the status file at: /var/log/appliance/apupgrade/2.0.2.0_status/e1n1.apupgrade.2.0.2.0_platform_operator.status

      You can expect the following output after the platform upgrade completes. For example:
      Broadcast message from root@sbpoc04a.svl.ibm.com (somewhere) (Thu Oct  6 16:26:
      
      A running upgrade completed successfully.
      
      
      Broadcast message from root@sbpoc04a.svl.ibm.com (somewhere) (Thu Oct  6 16:26:
      
      See details at /var/log/appliance/apupgrade/20221006/apupgrade20221006155514.log for further information
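      To follow the platform upgrade from another session, you can, for example, tail the status file that is listed in the note above:
      tail -f /var/log/appliance/apupgrade/2.0.2.0_status/e1n1.apupgrade.2.0.2.0_platform.status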

In 2.0.2, the firmware upgrade is enabled by default in the upgrade process. Unlike in Cloud Pak for Data System 1.0, the --skip-firmware option is not allowed in 2.0.x. During the extended outage upgrade, the firmware on all the nodes is upgraded at once. For Netezza Performance Server systems, there is downtime during the extended outage upgrade.

You now need to upgrade OCP and its components. In the Cloud Pak for Data System 2.0.2 upgrade, OCP cannot be upgraded directly from version 4.6 to 4.8. During the first hop, OCP and OCS are upgraded from 4.6 to 4.7, and in the second hop they are upgraded to 4.8. OCP and OCS are marked install_complete and postinstall_complete after the first hop, and started before the second hop. Also, the CLO upgrade involves applying a machine configuration because Red Hat changed the repository names between the 4.6 and 5.3.4.14 versions, which requires an update to the imagecontentsource policies on the old cluster.

Note: You can monitor the progress of the OCP upgrade by tracking the status file located at: /var/log/appliance/apupgrade/2.0.2.0_status/e1n1.apupgrade.2.0.2.0_ocp.status

The upgrade directory is located in /opt/ibm/appliance/storage/platform/localrepo. You need the icpds_ocp-2.0.2.0_*.tar.gz bundle to perform the OCP upgrade.
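In addition to the status file, you can observe each OCP hop directly with standard OpenShift client commands, for example:

  oc get clusterversion
  oc get clusteroperators

These are generic OpenShift checks, not apupgrade commands: oc get clusterversion shows the target version and progress of the current hop, and oc get clusteroperators shows which cluster operators are still updating.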

  1. For Cloud Pak for Data System versions 2.0.0 or 2.0.1.0 only: to upgrade to 2.0.2 or greater, you must reconfigure the OCS storage nodes. For more information, see OCS storage nodes reconfiguring is required when upgrading from 2.0.0 or 2.0.1.0 version to 2.0.2 or greater.
  2. Create a directory for the icpds_ocp-2.0.2.0_*.tar.gz bundle: /opt/ibm/appliance/storage/platform/localrepo/<release>. For example:
    mkdir -p /opt/ibm/appliance/storage/platform/localrepo/2.0.2.0_release
  3. Download the icpds_ocp-2.0.2.0_*.tar.gz bundle from Fix Central.
  4. Move icpds_ocp-2.0.2.0_*.tar.gz to the upgrade directory.
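    For example, if the bundle is in the current directory:
    mv icpds_ocp-2.0.2.0_*.tar.gz /opt/ibm/appliance/storage/platform/localrepo/2.0.2.0_release/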
  5. Run:
    apupgrade --upgrade --use-version 2.0.2.0_release --phase ocp --upgrade-directory /opt/ibm/appliance/storage/platform/localrepo
    Note: You can restart the upgrade at any point by rerunning the same command.

    apupgrade monitors the oc adm upgrade result until the upgrade is completed.

Netezza Performance Server post-upgrade steps

Perform the following actions to start NPS.

Procedure

  1. Start each NPS <nps_namespace_name> instance that was stopped before the upgrade by following these steps:
    1. Pause the machineconfigpool that is associated with the NPS <nps_namespace_name> instance. Run:
      oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/nps-shared
    2. Power on SPU nodes. Run:
      for dev in $(oc describe ns <nps_namespace_name> | grep bm_spu | awk -F'=' '{print $2}'); do ipmitool -I lanplus -H $dev -U USERID -P PASSW0RD power on; done
    3. Scale up ipshost and SPU statefulsets. Run:
      oc scale sts --all -n <nps_namespace_name> --replicas=1
  2. Start NPS monitoring. Run:
    oc patch magnetomonitor <magneto_monitor_name_of_nps_type> --type json -p '[{"op": "replace", "path": "/spec/number_of_monitors", "value": <value saved in first step of deactivation>}]' -n ap-magneto
    For example:
    oc patch magnetomonitor magneto-monitor-nps-nps-1 --type json -p '[{"op": "replace", "path": "/spec/number_of_monitors", "value": 1}]' -n ap-magneto
  3. To monitor NPS state for <nps_namespace_name> instance, run:
    oc exec -it ipshost-0 -n <nps_namespace_name> -- runuser -l nz -c 'nzstate -local'
    Note: After successful completion of platform and OCP upgrades, you must ensure that the Platform Manager is active and NPS is online. Next, you must run:
    sys_hw_config node -f --ssd-nvme-storage
    to upgrade the node NVME firmware.
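    When the instance is fully started, nzstate is expected to report that the system is online, for example:
    System state is 'Online'.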

FIPS post-upgrade steps

After you upgrade to version 2.0.2 and try to re-enable FIPS, the command fails with the following error: dracut: installkernel failed in module kernel-modules-extra. This happens because old kernel-modules-extra RPMs remain on your system along with the new kernel RPM after the upgrade. For more information, see Version 2.0.2 release notes.
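To check whether multiple kernel-modules-extra packages are still installed on a node, you can, for example, run:

  rpm -q kernel-modules-extra

If more than one version is listed, the older packages are the cause of the dracut failure described above; see the release notes for the supported resolution.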

Cloud Pak for Data post-upgrade steps

If you have Cloud Pak for Data installed on your system, after you upgrade Cloud Pak for Data System to version 2.0.2, you must ensure that an unexpected Cloud Pak for Data upgrade is not triggered. To prevent that, existing 2.x Cloud Pak for Data System customers must pin the zen operand version to 4.2.0 (the Cloud Pak for Data 4.0.2 level) in the ap-console namespace. A Cloud Pak for Data upgrade to a higher version must be done only on the zen namespace. Perform the following actions after your Cloud Pak for Data System 2.0.2 upgrade completes.

Procedure

  1. Run:
    oc get zenservice lite-cr -n ap-console -o json | jq .spec.version
    to determine the current version.
  2. Depending on the version that is returned, perform the following actions:
    • If you get a version above 4.2.0, contact IBM Cloud Pak for Data Support.
    • If you get 4.2.0 as the current version, no action is required.
    • If you get a version below 4.2.0, run:
      oc patch zenservice lite-cr  --namespace ap-console  --type=merge --patch '{"spec": {"version":"4.2.0"}}'
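      To confirm the change, you can rerun the command from step 1 and verify that it now returns "4.2.0":
      oc get zenservice lite-cr -n ap-console -o json | jq .spec.version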