Troubleshooting for Analytics Engine Powered by Apache Spark

Use these resources to resolve the problems that you might encounter with the Analytics Engine powered by Apache Spark service.

Delay in starting the Spark application execution

If you notice that your Spark applications are slow to start and you are using observability agents like Instana or Dynatrace, the issue might be related to the additional load imposed by these agents. These agents collect various types of monitoring data from the application and pod environment, which can slow down the Spark driver process.

To resolve the issue, allocate one additional CPU core to the Spark driver process.
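
How you allocate the extra core depends on how you submit the application. As one hedged example, if your submission payload accepts Spark properties in a conf section, you could raise spark.driver.cores, which is a standard Spark property; the endpoint and payload shape below are illustrative placeholders, not the exact API schema.

# Illustrative sketch only: the endpoint and payload shape are placeholders.
# spark.driver.cores is a standard Spark property; "2" gives the driver one
# core more than the single-core default.
curl -k -X POST "https://<CloudPakforData_URL>/<spark-applications-endpoint>" \
  -H "Authorization: ZenApiKey ${TOKEN}" \
  -H "content-type: application/json" \
  -d '{
        "application_details": {
          "application": "<path-to-your-application>",
          "conf": { "spark.driver.cores": "2" }
        }
      }'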

Removing libraries, Spark event directories and log files after notebook execution

If you don't want to persist any of the following information after you have executed a Spark notebook, then run the respective code snippet in the last cell of the notebook.

Directory paths
Directory path                     Description
/home/spark/shared/conda           Conda or Python libraries that are installed from the current Spark notebook
/home/spark/shared/user-libs       Contains the python3.7, python3.8, R, and spark2 directories. Each directory contains the downloaded packages or JARs that are included in the class path
/home/spark/shared/spark-events    Spark event directory
/home/spark/shared/log             Spark master, worker, and driver logs

Example: To avoid persisting the user-libs directory, you can use one of the following code snippets.

Scala

import scala.reflect.io.Directory
import java.io.File

val directory = new Directory(new File("/home/spark/shared/user-libs"))
directory.deleteRecursively()

R

if (dir.exists("/home/spark/shared/user-libs")) {
  # Delete the directory if it exists
  unlink("/home/spark/shared/user-libs", recursive = TRUE)
}

Python

!rm -rf /home/spark/shared/user-libs
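
If you want to remove all of the directories listed in the table above in one go, for example from the last cell of a Python notebook using the same shell-escape style as the snippet above, a minimal sketch:

!rm -rf /home/spark/shared/conda /home/spark/shared/user-libs /home/spark/shared/spark-events /home/spark/shared/log
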
Note: If you missed adding the above code snippet to remove the unwanted directories, a system administrator can help you remove them from the **files-api-claim** PVC.

Deleting an Analytics Engine powered by Apache Spark instance

Before you can delete an Analytics Engine powered by Apache Spark instance, you must first delete the deployment space associated with it. However, you can't delete the space if any jobs are stuck in the Starting or Running state.

To enable deleting the deployment space, change all the jobs that are stuck in the Starting or Running state to the Failed state:

  1. From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the Analytics Engine powered by Apache Spark instance and click it to view the instance details.

  2. Click the open and close list of options icon on the right of the instance details page and select Deployment Space. The deployment space opens on the Jobs tab, where you can view the Spark jobs.

  3. Find the jobs that are stuck in Starting or Running state.

  4. For each stuck job, get the run_id and space_id from the URL in the browser. For example:

    https://<CloudPakforData_URL>/jobs/<job_id>/runs/<run_id>?space_id=<space_id>&context=icp4data
    
  5. Run the following API call to update the state of each stuck job. See Generating an API authorization token (a sketch for setting the TOKEN variable follows these steps).

    space_id=<space_id copied from the URL in the browser>
    run_id=<run_id copied from the URL in the browser>
    
    curl -ik -X PATCH -H "content-type: application/json" "https://<CloudPakforData_URL>/v2/assets/${run_id}/attributes/job_run?space_id=${space_id}" -H "Authorization: ZenApiKey ${TOKEN}" -d '[{"op": "replace","path": "/state","value": "Failed"}]'
    

    When all the jobs are in Failed state, you can delete the deployment space and then the Analytics Engine powered by Apache Spark instance. See Managing Analytics Engine powered by Apache Spark instances for how to delete the space and then the instance.
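
As a sketch of how the variables used in the PATCH call above are typically prepared, assuming you have already generated a platform API key for your user as described in Generating an API authorization token (the user name and API key below are placeholders):

# Placeholders for illustration; substitute your own user name and API key.
# A ZenApiKey token is the base64 encoding of "<username>:<api_key>".
export TOKEN=$(printf "%s:%s" "<username>" "<api_key>" | base64 -w0)
export space_id=<space_id copied from the URL in the browser>
export run_id=<run_id copied from the URL in the browser>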

Recovery steps if you upgrade Analytics Engine powered by Apache Spark and Cloud Pak for Data at the same time

You should never upgrade Analytics Engine powered by Apache Spark and Cloud Pak for Data at the same time, because this might lead to database inconsistencies. In addition, the upgrade process of Analytics Engine powered by Apache Spark might fail and not be able to recover from this state.

You cannot use Analytics Engine powered by Apache Spark unless the Analytics Engine powered by Apache Spark database has been successfully restored and all Spark tables are available.

You need to have Cloud Pak for Data project administration rights to perform the following recovery steps.

To resolve upgrade issues if you do accidentally upgrade both services at the same time:

  1. Define the environment variables you need, in particular PROJECT_CPD_INST_OPERANDS. For details on setting environment variables, see Setting up installation environment variables.

  2. After you have defined PROJECT_CPD_INST_OPERANDS, run the following commands to set DOCKER_IMAGE and CONFIDENTIAL_PROP:

    export DOCKER_IMAGE=`oc get cronjob spark-hb-job-cleanup-cron -o jsonpath='{..image}' -n ${PROJECT_CPD_INST_OPERANDS}`
    export CONFIDENTIAL_PROP=`oc get secret spark-hb-confidential-properties -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath="{.data.confidential\.properties}" | base64 -d | grep "dbUrl" | cut -d '=' -f 2-`
    

    Check that these variables have meaningful values.
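
    For example, a quick check (note that CONFIDENTIAL_PROP contains the database URL, so treat the output as sensitive):

    echo "DOCKER_IMAGE=${DOCKER_IMAGE}"
    echo "CONFIDENTIAL_PROP=${CONFIDENTIAL_PROP}"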

  3. Now set the environment variable DB_VERSION depending on the Cloud Pak for Data version you are working on.

    Choose the value you need for DB_VERSION:

    • Cloud Pak for Data 4.0: 6

    • Cloud Pak for Data 4.5.x: 12

    • Cloud Pak for Data 4.6.x: 16

      For example, for Cloud Pak for Data 4.6.x, enter:

      export DB_VERSION=16
      
  4. Create and deploy a K8s job:

    1. Create the following load-spark-db-schema.yml file:

      # This is a YAML-formatted file.
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: spark-hb-load-db-specs
        labels:
          app: analyticsengine
          app.kubernetes.io/component: analyticsengine
          app.kubernetes.io/instance: ibm-analyticsengine-prod
          app.kubernetes.io/managed-by: analyticsengine
          app.kubernetes.io/name: analyticsengine
          component: analyticsengine
          function: spark-hb-load-db-specs
          icpdsupport/addOnId: spark
          icpdsupport/app: api
          release: ibm-analyticsengine-prod
      spec:
        template:
          metadata:
            annotations:
              cloudpakId: "eb9998dcc5d24e3eb5b6fb488f750fe2"
              cloudpakInstanceId: ""
              cloudpakName: IBM Cloud Pak for Data
              hook.activate.cpd.ibm.com/command: '[]'
              hook.deactivate.cpd.ibm.com/command: '[]'
              productChargedContainers: All
              productCloudpakRatio: "1:1"
              productID: eb9998dcc5d24e3eb5b6fb488f750fe2
              productMetric: VIRTUAL_PROCESSOR_CORE
              productName: Analytics Engine powered by Apache Spark
              productVersion: 4.6.1
            labels:
              app: analyticsengine
              app.kubernetes.io/component: analyticsengine
              app.kubernetes.io/instance: ibm-analyticsengine-prod
              app.kubernetes.io/managed-by: analyticsengine
              app.kubernetes.io/name: analyticsengine
              component: analyticsengine
              function: spark-hb-load-db-specs
              icpdsupport/addOnId: spark
              icpdsupport/app: api
              job-name: spark-hb-load-db-specs
              release: ibm-analyticsengine-prod
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: kubernetes.io/arch
                      operator: In
                      values:
                      - amd64
            restartPolicy: "OnFailure"
            serviceAccount: zen-viewer-sa
            serviceAccountName: zen-viewer-sa
            automountServiceAccountToken: false
            hostNetwork: false
            hostPID: false
            hostIPC: false
            containers:
            - name: "spark-hb-load-db-specs"
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop:
                  - ALL
                runAsNonRoot: true
                privileged: false
                readOnlyRootFilesystem: false
              image: $DOCKER_IMAGE
              imagePullPolicy: Always
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                  ephemeral-storage: 100Mi
                limits:
                  cpu: 100m
                  memory: 128Mi
                  ephemeral-storage: 100Mi
              env:
              - name: DB_URL
                value: "$CONFIDENTIAL_PROP"
              command: ["/bin/bash", "-c"]
              args:
              - "bash /opt/ibm/entrypoint/load-db-specs.sh /opt/ibm/entrypoint/ cp.icr.io/cp/cpd/spark-hb-python $DB_VERSION /opt/hb/confidential_config/zenmetastore_certs /tmp/zenmetastore_certs_temp"
              volumeMounts:
              - name: "spark-hb-load-db-specs-script"
                mountPath: "/opt/ibm/entrypoint/"
              - name: "metastore-secret"
                mountPath: "/tmp/zenmetastore_certs_temp"
            volumes:
            - name: "spark-hb-load-db-specs-script"
              configMap:
                name: "spark-hb-load-db-specs-script"
            - name: "spark-hb-zen-metstore-certs"
              secret:
                secretName: "zen-service-broker-secret"
            - name: "metastore-secret"
              secret:
                secretName: "metastore-secret"
      
    2. Now deploy this job by running the following command:

      envsubst < load-spark-db-schema.yml | oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f -
      
    3. After you have deployed the job, check the job status:

      oc get job spark-hb-load-db-specs -n ${PROJECT_CPD_INST_OPERANDS}
      

      If the job ran successfully, you should see the following response:

      NAME                     COMPLETIONS   DURATION   AGE
      spark-hb-load-db-specs   1/1           13s        22s
      
    4. Delete the job if the recovery was successful:

      oc delete job spark-hb-load-db-specs -n ${PROJECT_CPD_INST_OPERANDS}
      

      The Analytics Engine powered by Apache Spark database is now repaired and recovered. Analytics Engine powered by Apache Spark should be successfully reconciled and reach "Completed" state.
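
      To confirm, you can inspect the Analytics Engine custom resource with the same command that is used in the next section and check that its status reports Completed:

      oc get ae -n ${PROJECT_CPD_INST_OPERANDS} -o yaml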

The Analytics Engine custom resource is stuck in InProgress state

While installing or upgrading Analytics Engine powered by Apache Spark, the Analytics Engine custom resource (AE CR) might become stuck in InProgress state for a long time.

If this happens, you can check for possible errors by running the following command. You need to have Cloud Pak for Data project administration rights to fix this issue.

oc get ae -n ${PROJECT_CPD_INST_OPERANDS} -o yaml

If the response contains Register dataplane task failed, perform the following steps to recover from this situation:

  1. Get all ibm-nginx pods:

    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep ibm-nginx
    
  2. Restart nginx in the ibm-nginx pods. You must run the command for all the ibm-nginx pods returned by the previous command.

    oc exec -n ${PROJECT_CPD_INST_OPERANDS} <ibm-nginx-pod-name> -- nginx -s reload
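
    If there are several ibm-nginx pods, a small loop saves repeating the command (this sketch assumes the pods run in the ${PROJECT_CPD_INST_OPERANDS} namespace):

    for pod in $(oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -o name | grep ibm-nginx); do
      oc exec -n ${PROJECT_CPD_INST_OPERANDS} "$pod" -- nginx -s reload
    done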
    
  3. Delete the spark-hb-register-dataplane pod:

    oc delete pod <spark-hb-register-dataplane-pod> -n ${PROJECT_CPD_INST_OPERANDS}
    

    Wait for the pod to restart, which might take about 6 to 8 minutes. When the spark-hb-register-dataplane pod is running again, the AE CR should move to the Completed state.
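
    A sketch for watching the pod come back up (press Ctrl+C to stop watching):

    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -w | grep spark-hb-register-dataplane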

Using a CA certificate to connect to internal servers from the platform

If you want to enable the Cloud Pak for Data platform to use your company's CA certificate to validate certificates from your internal servers, you must create a secret that contains the CA certificate. Additionally, if your internal servers use an SSL certificate that is signed using your company's CA certificate, you must create this secret to enable the platform to connect to the servers. For details, see Using a CA certificate to connect to internal servers from the platform.
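
As a general sketch of creating such a secret from a CA certificate file (the secret name, key, and file path below are placeholders; the exact names that the platform expects are described in the linked topic):

oc create secret generic <ca-cert-secret-name> \
  --from-file=ca.crt=/path/to/company-ca.crt \
  -n ${PROJECT_CPD_INST_OPERANDS}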