Troubleshooting for Analytics Engine powered by Apache Spark
Use these resources to resolve the problems that you might encounter with the Analytics Engine powered by Apache Spark service.
- Removing libraries, Spark event directories and log files after notebook execution
- Deleting an Analytics Engine powered by Apache Spark instance
- Recovery steps if you upgrade Analytics Engine powered by Apache Spark and Cloud Pak for Data at the same time
- The Analytics Engine Custom Resource stuck in InProgress state
- Using a CA certificate to connect to internal servers from the platform
- Delay in starting the Spark application execution
Delay in starting the Spark application execution
If your Spark applications are slow to start and you are using observability agents such as Instana or Dynatrace, the delay might be caused by the additional load that these agents impose. The agents collect various types of monitoring data from the application and pod environment, which can slow down the Spark driver process.
To resolve the issue, allocate one additional CPU core to the Spark driver process.
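For example, if you submit Spark applications through the service's REST API, you can raise the driver core count in the application payload. The following sketch is illustrative only: the <instance_url> placeholder, the example application path, and the exact payload layout are assumptions, while spark.driver.cores is the standard Spark property that controls the number of driver cores.
# Illustrative sketch only: adjust the endpoint and payload to match your instance.
# spark.driver.cores is the standard Spark property for the driver core count.
curl -k -X POST \
  -H "Authorization: ZenApiKey ${TOKEN}" \
  -H "content-type: application/json" \
  "<instance_url>/spark_applications" \
  -d '{
        "application_details": {
          "application": "/myapp/my_spark_app.py",
          "conf": {
            "spark.driver.cores": "2"
          }
        }
      }'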
Removing libraries, Spark event directories and log files after notebook execution
If you don't want to persist any of the following information after you have executed a Spark notebook, then run the respective code snippet in the last cell of the notebook.
Directory Path | Description |
---|---|
/home/spark/shared/conda | Conda or Python libraries that are installed from the current Spark notebook |
/home/spark/shared/user-libs | Contains the python3.7, python3.8, R, and spark2 directories. Each directory contains the respective downloaded packages or jars that are included in the class path |
/home/spark/shared/spark-events | Spark event directory |
/home/spark/shared/log | Spark master, worker, and driver logs |
Example: To avoid persisting the user-libs directory, use one of the following code snippets.
Scala
import scala.reflect.io.Directory
import java.io.File
val directory = new Directory(new File("/home/spark/shared/user-libs"))
directory.deleteRecursively()
R
if (dir.exists("/home/spark/shared/user-libs")) {
  # Delete the directory if it exists
  unlink("/home/spark/shared/user-libs", recursive = TRUE)
}
Python
!rm -rf /home/spark/shared/user-libs
Deleting an Analytics Engine powered by Apache Spark instance
Before you can delete an Analytics Engine powered by Apache Spark instance, you must first delete the deployment space that is associated with it. However, you can't delete the space if any jobs are stuck in Starting or Running state.
To enable deleting the deployment space, you need to change all the jobs stuck in Starting or Running state to Failed state:
- From the Navigation menu on the IBM Cloud Pak for Data web user interface, click Services > Instances, find the Analytics Engine powered by Apache Spark instance, and click it to view the instance details.
- Click the menu icon on the right of the instance details page and select Deployment Space to open the deployment space on the Jobs tab, where you can view the Spark jobs.
- Find the jobs that are stuck in Starting or Running state.
- For each stuck job, get the run_id and space_id from the URL in the browser. For example:
https://<CloudPakforData_URL>/jobs/<job_id>/runs/<run_id>?space_id=<space_id>&context=icp4data
- Run the following API call to update the state of each stuck job. See Generating an API authorization token.
space_id=<copy the space_id from the URL in the browser>
run_id=<copy the run_id from the URL in the browser>
curl -ik -X PATCH -H "content-type: application/json" \
  "https://<CloudPakforData_URL>/v2/assets/${run_id}/attributes/job_run?space_id=${space_id}" \
  -H "Authorization: ZenApiKey ${TOKEN}" \
  -d '[{"op": "replace","path": "/state","value": "Failed"}]'
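If several runs are stuck, you can repeat the same PATCH call for each of them. The following is a minimal sketch that reuses the command above; <run_id_1> and <run_id_2> are hypothetical placeholders for the run IDs that you collected from the browser URLs.
# Patch every stuck run to the Failed state.
space_id=<copy the space_id from the URL in the browser>
for run_id in <run_id_1> <run_id_2>; do
  curl -ik -X PATCH -H "content-type: application/json" \
    "https://<CloudPakforData_URL>/v2/assets/${run_id}/attributes/job_run?space_id=${space_id}" \
    -H "Authorization: ZenApiKey ${TOKEN}" \
    -d '[{"op": "replace","path": "/state","value": "Failed"}]'
done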
When all the jobs are in Failed state, you can delete the deployment space and then the Analytics Engine powered by Apache Spark instance. See Managing Analytics Engine powered by Apache Spark instances for how to delete the space and then the instance.
Recovery steps if you upgrade Analytics Engine powered by Apache Spark and Cloud Pak for Data at the same time
You should never upgrade Analytics Engine powered by Apache Spark and Cloud Pak for Data at the same time, because this might lead to database inconsistencies. In addition, the upgrade process of Analytics Engine powered by Apache Spark might fail and not be able to recover from this state.
You cannot use Analytics Engine powered by Apache Spark unless the Analytics Engine powered by Apache Spark database has been successfully restored and all Spark tables are available.
You need to have Cloud Pak for Data project administration rights to perform the following recovery steps.
To resolve upgrade issues if you do accidentally upgrade both services at the same time:
- Define the environment variables you need, in particular PROJECT_CPD_INST_OPERANDS. For details on setting environment variables, see Setting up installation environment variables.
- After you have defined PROJECT_CPD_INST_OPERANDS, run the following commands to set DOCKER_IMAGE and CONFIDENTIAL_PROP:
export DOCKER_IMAGE=`oc get cronjob spark-hb-job-cleanup-cron -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{..image}'`
export CONFIDENTIAL_PROP=`oc get secret spark-hb-confidential-properties -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath="{.data.confidential\.properties}" | base64 -d | grep "dbUrl" | cut -d '=' -f 2-`
Check that these variables have meaningful values.
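For example, you can print both values to confirm that they are not empty. Keep in mind that CONFIDENTIAL_PROP contains the database URL, so treat the output as confidential.
# Both variables must be non-empty before you continue.
echo "DOCKER_IMAGE=${DOCKER_IMAGE}"
echo "CONFIDENTIAL_PROP=${CONFIDENTIAL_PROP}"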
- Now set the environment variable called DB_VERSION, depending on the Cloud Pak for Data version you are working on. Choose the value you need for DB_VERSION:
  - Cloud Pak for Data 4.0: 6
  - Cloud Pak for Data 4.5.x: 12
  - Cloud Pak for Data 4.6.x: 16
For example, for Cloud Pak for Data 4.6.x enter:
export DB_VERSION=16
- Create and deploy a K8s job:
  - Create the following load-spark-db-schema.yml file:
# This is a YAML-formatted file.
apiVersion: batch/v1
kind: Job
metadata:
  name: spark-hb-load-db-specs
  labels:
    app: analyticsengine
    app.kubernetes.io/component: analyticsengine
    app.kubernetes.io/instance: ibm-analyticsengine-prod
    app.kubernetes.io/managed-by: analyticsengine
    app.kubernetes.io/name: analyticsengine
    component: analyticsengine
    function: spark-hb-load-db-specs
    icpdsupport/addOnId: spark
    icpdsupport/app: api
    release: ibm-analyticsengine-prod
spec:
  template:
    metadata:
      annotations:
        cloudpakId: "eb9998dcc5d24e3eb5b6fb488f750fe2"
        cloudpakInstanceId: ""
        cloudpakName: IBM Cloud Pak for Data
        hook.activate.cpd.ibm.com/command: '[]'
        hook.deactivate.cpd.ibm.com/command: '[]'
        productChargedContainers: All
        productCloudpakRatio: "1:1"
        productID: eb9998dcc5d24e3eb5b6fb488f750fe2
        productMetric: VIRTUAL_PROCESSOR_CORE
        productName: Analytics Engine powered by Apache Spark
        productVersion: 4.6.1
      labels:
        app: analyticsengine
        app.kubernetes.io/component: analyticsengine
        app.kubernetes.io/instance: ibm-analyticsengine-prod
        app.kubernetes.io/managed-by: analyticsengine
        app.kubernetes.io/name: analyticsengine
        component: analyticsengine
        function: spark-hb-load-db-specs
        icpdsupport/addOnId: spark
        icpdsupport/app: api
        job-name: spark-hb-load-db-specs
        release: ibm-analyticsengine-prod
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - amd64
      restartPolicy: "OnFailure"
      serviceAccount: zen-viewer-sa
      serviceAccountName: zen-viewer-sa
      automountServiceAccountToken: false
      hostNetwork: false
      hostPID: false
      hostIPC: false
      containers:
      - name: "spark-hb-load-db-specs"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          privileged: false
          readOnlyRootFilesystem: false
        image: $DOCKER_IMAGE
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
            ephemeral-storage: 100Mi
          limits:
            cpu: 100m
            memory: 128Mi
            ephemeral-storage: 100Mi
        env:
        - name: DB_URL
          value: "$CONFIDENTIAL_PROP"
        command: ["/bin/bash", "-c"]
        args:
        - "bash /opt/ibm/entrypoint/load-db-specs.sh /opt/ibm/entrypoint/ cp.icr.io/cp/cpd/spark-hb-python $DB_VERSION /opt/hb/confidential_config/zenmetastore_certs /tmp/zenmetastore_certs_temp"
        volumeMounts:
        - name: "spark-hb-load-db-specs-script"
          mountPath: "/opt/ibm/entrypoint/"
        - name: "metastore-secret"
          mountPath: "/tmp/zenmetastore_certs_temp"
      volumes:
      - name: "spark-hb-load-db-specs-script"
        configMap:
          name: "spark-hb-load-db-specs-script"
      - name: "spark-hb-zen-metstore-certs"
        secret:
          secretName: "zen-service-broker-secret"
      - name: "metastore-secret"
        secret:
          secretName: "metastore-secret"
  - Now deploy this job by running the following command:
envsubst < load-spark-db-schema.yml | oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f -
- After you have deployed the job, check the job status:
oc get job spark-hb-load-db-specs -n ${PROJECT_CPD_INST_OPERANDS}
If the job ran successfully, you should see the following response:
NAME                     COMPLETIONS   DURATION   AGE
spark-hb-load-db-specs   1/1           13s        22s
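If the job does not reach 1/1 completions, you can inspect the logs of the pod that the job created before you retry. This is a generic oc command shown as a sketch, not part of the documented recovery procedure:
# Show the logs of the job's pod.
oc logs job/spark-hb-load-db-specs -n ${PROJECT_CPD_INST_OPERANDS}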
- Delete the job if the recovery was successful:
oc delete job spark-hb-load-db-specs -n ${PROJECT_CPD_INST_OPERANDS}
The Analytics Engine powered by Apache Spark database is now repaired and recovered. Analytics Engine powered by Apache Spark should be successfully reconciled and reach the "Completed" state.
The Analytics Engine Custom Resource stuck in InProgress state
While installing or upgrading Analytics Engine powered by Apache Spark, the Analytics Engine custom resource (AE CR) might become stuck in the InProgress state for a long time.
If this happens, you can check for possible errors by running the following command. You need to have Cloud Pak for Data project administration rights to fix this issue.
oc get ae -n ${PROJECT_CPD_INST_OPERANDS} -o yaml
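For example, to check directly for the dataplane registration failure that is described next, you can filter the CR output:
# Look for the dataplane registration failure message in the CR status.
oc get ae -n ${PROJECT_CPD_INST_OPERANDS} -o yaml | grep -i "Register dataplane task failed"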
If the response contains Register dataplane task failed, perform the following steps to recover from this situation:
- Get all ibm-nginx pods:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep ibm-nginx
- Restart nginx in the ibm-nginx pods. You must run the command for all the ibm-nginx pods returned by the previous command:
oc exec <ibm-nginx-pod-name> -- nginx -s reload
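To reload nginx in every ibm-nginx pod in one pass, you can loop over the pod names. This is a minimal sketch that combines the two commands above:
# Reload nginx in each ibm-nginx pod.
for pod in $(oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -o name | grep ibm-nginx); do
  oc exec -n ${PROJECT_CPD_INST_OPERANDS} "$pod" -- nginx -s reload
done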
- Delete the spark-hb-register-dataplane pod:
oc delete pod <spark-hb-register-dataplane-pod> -n ${PROJECT_CPD_INST_OPERANDS}
Wait for the pod to restart, which might take about 6 to 8 minutes. When the spark-hb-register-dataplane pod is running again, the AE CR should move to the Completed state.
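To confirm that the pod is back and that the CR has finished reconciling, you can recheck both resources with the commands that are already used in this topic:
# Check that the spark-hb-register-dataplane pod is Running again.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep spark-hb-register-dataplane
# Check the AE CR; it should eventually report the Completed state.
oc get ae -n ${PROJECT_CPD_INST_OPERANDS}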
Using a CA certificate to connect to internal servers from the platform
If you want to enable the Cloud Pak for Data platform to use your company's CA certificate to validate certificates from your internal servers, you must create a secret that contains the CA certificate. Additionally, if your internal servers use an SSL certificate that is signed using your company's CA certificate, you must create this secret to enable the platform to connect to the servers. For details, see Using a CA certificate to connect to internal servers from the platform.
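As an illustration only, a secret of this kind is typically created from the CA certificate file with oc. The secret name, key, and file name below are hypothetical placeholders; use the exact names and namespace that are described in the linked topic.
# Hypothetical sketch: create a secret from your company's CA certificate file.
# Replace <secret-name> and <ca-cert-file>.pem with the values from the linked topic.
oc create secret generic <secret-name> \
  --from-file=ca.crt=<ca-cert-file>.pem \
  -n ${PROJECT_CPD_INST_OPERANDS}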