Backup & Restore service installation and upgrade issues

Use this troubleshooting information to resolve install and upgrade problems that are related to Backup & Restore service.

Fresh installation of Backup & Restore may hang or fail

Cause
It is because the isf-prereq operator deletes the OADP operator artifacts during installation, which creates a race condition.
Symptoms
Check the Installed Operators in the ibm-backup-restore namespace for any of the following issues:
  • OADP is not installed
  • An error in OADP operator
  • An error in OAPD subscription
  • guardian-dm-operator or guardian-dp-operator in failed state.
Resolution
Important:
  • This is applicable only for the fresh installation of IBM Storage Fusion Backup & Restore 2.8.1 on OpenShift® Container Platform 4.15 and earlier. Do not apply this patch if you are upgrading from an earlier version.
  • Backup & Restore 2.8.1 is not supported on OpenShift Container Platform 4.16.
  1. Download and run uninstall-backup-restore.sh to clean the failed installation.
  2. Download and run br_pre_install_patch281.sh.
  3. Install Backup & Restore service (hub or spoke).
Note: After the Backup & Restore is installed successfully, run the following oc command to patch spectrumfusion CR:
oc -n <FusionNS> patch --type json --patch='[ { op: remove, path: /spec/configuration/services/skipOnBoardingServices } ]' spectrumfusion <SpectrumFusionCR>

Download and run br-post-install-patch281.sh to update the supported OADP version in the cluster service version.

Application information not repopulated after service reinstallation

Problem statement
The application information gets removed during Backup & Restore service uninstallation. It does not get repopulated in MongoDB after the service reinstallation. Restore jobs validation fail with the following error message.
postJobRequest: Application Info not returned from application service for applicationId
Resolution
The workaround is to restart the isf-application-operator-controller-manager pod in the IBM Storage Fusion namespace.

Backup & Restore server installation displays an Invalid Input error

Problem statement
The Backup & Restore service deployment fails with the following error in the IBM Storage Fusion user interface:
Install Error: Invalid Input: Both the api Server and bootstrapToken fields need to be populated
Resolution
As a resolution, try with a private browser window.

Backup & Restore hub service in a custom namespace

Problem statement
In general, the Backup & Restore service is installed or upgraded to the default ibm-backup-restore namespace. To avoid issues during the installation or upgrade of the service with custom namespace, do the resolution steps.
Resolution
  1. Go to Operators > Installed Operators > Fusion Service Definition.
  2. Open the ibm-backup-restore-service CR and go to the YAML tab.
  3. In spec.onboarding.parameters, search for parameter with name as namespace and change the defaultValue to the custom-namespace where the service must be installed.
    Example:
    
    parameters:
          - dataType: string
            name: namespace
            defaultValue: <custom-namespace>
            userInterface: false
            required: true
            descriptionCode: BMYSRV00003
            displayNameCode: BMYSRV00004
  4. Store the following YAML in server-fsi.yaml, and replace <custom-namespace> with the namespace where the service must be installed.
    
    apiVersion: service.isf.ibm.com/v1
    kind: FusionServiceInstance
    metadata:
      name: ibm-backup-restore-service-instance
      namespace: ibm-spectrum-fusion-ns
    spec:
      parameters:
        - name: doInstall
          provided: false
          value: 'true'
        - name: namespace
          provided: false
          value: <custom-namespace>
        - name: storageClass
          provided: true
          value: lvms-lmvg
      serviceDefinition: ibm-backup-restore-service
      triggerUpdate: false
      enabled: true
      doInstall: true
  5. Run the following command to apply the changes:
    oc apply -f server-fsi.yaml 
  6. For the upgrade procedure, upgrade the IBM Storage Fusion operator, repeat step 1 of the resolution, and then upgrade the service.

Pods in Crashloopbackoff state after upgrade

Problem statement
The Backup & Restore service health changes to unknown and two pods go into Crashlookbackoff state.
Resolution

In the resource settings of guardian-dp-operator pod that is in ibm-backup-restore namespace, set the value of IBM Storage Fusion operator memory limits to 1000 mi.

Example:

resources:    
          limits:    
            cpu: 1000m    
            memory: 1000Mi    
          requests:    
            cpu: 500m    
            memory: 250Mi 

storage-operator and isf-data-protection-operator-controller-manager pods crash with CrashLoopBackOff status

Problems statement

The storage-operator and isf-data-protection-operator-controller-manager crashes due to OOM error.

Resolution
To resolve the error, increase the memory limit:
  1. Log in to the Red Hat® OpenShift web console as an administrator.
  2. Select the project ibm-spectrum-fusion-ns.
  3. Go to Operators > Installed operators, click IBM Storage Fusion, and go to the YAML tab.
  4. Change the memory limits:
    isf-storage-operator:
    1. Search for control-plane: isf-storage-operator in the YAML file.
    2. In the YAML, change the memory limit for isf-storage-operator container from 300Mi to 500Mi.
      
      containers:
          - resources:
              limits:
                cpu: 100m
                memory: 500Mi
    3. Click Save.
    4. Wait until a new isf-storage-operator pod comes up.
    isf-data-protection-operator-controller-manager:
    1. Search for control-plane: isf-data-protection-operator-controller-manager in the YAML file.
    2. In the YAML, change the memory limit for isf-storage-operator container from 500Mi to 1000Mi.
      
      containers:
          - resources:
              limits:
                cpu: 500m
                memory: 1000Mi
    3. Click Save.
    4. Wait until a new isf-data-protection-operator-controller-manager pod comes up.

Backup & Restore service goes into unknown state

Problem statement
This Backup & Restore service might go into unknown state when you upgrade IBM Storage Fusion from 2.6.x to 2.8.0.
Resolution
It automatically shows healthy on the IBM Storage Fusion user interface after you upgrade the Backup & Restore service.

MongoDB pod crashes with CrashLoopBackOff status

Problem statement
The MongoDB pod crashes due to an OOM error.
Resolution
To resolve the error, increase the memory limit from 256Mi to 512Mi. Do the following steps to change the memory limit:
  1. Log in to the Red Hat OpenShift web console as an administrator.
  2. Go to Workloads > StatefulSet.
  3. Select the project ibm-backup-restore.
  4. Select the MongoDB pod, and go to the YAML tab.
  5. In the YAML, change the memory limit for MongoDB container from 256Mi to 512Mi.
  6. Click Save.

Backup & Restore stuck at 5% during upgrade

Diagnosis
  1. In the OpenShift user interface, go to the Installed Operator and filter on ibm-backup-restore namespace.
  2. Click IBM Storage Fusion Backup and Restore Server.
  3. Go to Subscriptions.
  4. If you see the following errors, then do the workaround steps to resolve the issue:
    "error validating existing CRs against new CRD's schema for "guardiancopyrestores.guardian.isf.ibm.com": error validating custom resource against new schema for GuardianCopyRestore"
Resolution
  1. From command line or command prompt, log in to the cluster and run the following commands to delete the guardiancopyrestore CRs and 2.6.0 cs:
    oc -n ibm-backup-restore delete guardiancopyrestore.guardian.isf.ibm.com --all
    oc -n ibm-backup-restore delete csv guardian-dm-operator.v2.6.0 
    
  2. From the OpenShift Container Platform console, go to Installed Operator and filter on ibm-backup-restore namespace.
  3. Click IBM Storage Fusion Backup and Restore Server.
  4. Go to Subscriptions.
  5. Find the failing installplan and delete it.
  6. Go to Installed Operator and go to ibm-backup-restore namespace.
  7. Find IBM Storage Fusion Backup and Restore Server and click Upgrade available and approve the Install plan for IBM Storage Fusion Backup and Restore Server.
  8. Wait for the service upgrade to resume.

Backup & Restore service installation gets stuck after upgrade to IBM Storage Fusion 2.7.0

Problem statement
If the installation or upgrade of the Backup & Restore service does not reach 100% completion, then check for failed startup probes. Run the following command to look for any pods that are not in the READY state:
oc get pods -n ibm-backup-restore
Example output:
NAME                                                              READY   STATUS      RESTARTS        AGE
applicationsvc-855746ffbf-2pmz7                                   0/1     Running     5 (2m8s ago)    17m
...
job-manager-845dc56b8d-r5w6j                                      0/1     Running     5 (2m50s ago)   17m
...
  1. Go to OpenShift Container Platform.
  2. Go to the Installed Operators view with the selected ibm-backup-restore project or namespace.
  3. Click IBM Storage Fusion Backup & Restore Server.
  4. Select the Data Protection Server tab and click the instance.
  5. Select the YAML tab and update the following:
    triggerUpgrade: false
    to
    triggerUpgrade: true
  6. Save and reload the YAML.

    After some time, all the pods must be READY and the Backup & Restore service must show healthy in the IBM Storage Fusion user interface.

Diagnosis

If found, then describe each of the pods and look at the event section to determine whether there is a startup probe failure.

Example output that shows probe failure:

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  ... 
  Warning  Unhealthy       3m40s (x5 over 6m40s)  kubelet            Startup probe failed: HTTP probe failed with statuscode: 503
If one or more pods show startup probe failures, then patch the probes to provide additional start time. Update the following script and run to patch the probes. To determine the respective deployment names, update the DEPLOYMENT_NAMES variable with the list of deployments that have failing startup probes and run the script.
oc get deployment -n ibm-backup-restore

Example output:

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
...
applicationsvc                                0/1     1            1           12h
...
job-manager                                   0/1     1            1           12h
...
Now, check whether the pods are online. If the startup probes continue to fail, then increase the STARTUP_DELAY_SEC parameter and try again.

If it is an upgrade and the pods come online, but the progress does not reach 100% in the IBM Storage Fusion user interface even after 30 minutes, then you must update the Backup & Restore Server CR to re-trigger the upgrade.

Resolution
Update the Backup & Restore Server CR to re-trigger the upgrade

Backup & Restore service install and upgrade issue with the IBM Storage Protect Plus catalog source

Problem statement
The backups can fail with a FailedValidation error after you install or upgrade the IBM Storage Fusion and Backup & Restore service to 2.8.0.
validationErrors': ['an existing backup storage location wasn't specified at backup creation time and the default guardian-minio wasn't found.
Error: BackupStorageLocation.velero.io "guardian-minio" not found'
It occurs when the IBM Storage Protect Plus catalog exists in the OpenShift Container Platform clusters.
Resolution
The workaround is to upgrade the OADP, installed in the ibm-backup-restore namespace, to version 1.3.