Troubleshooting common issues in IBM Maximo Visual Inspection

Following are some problems that you might encounter when you use IBM® Maximo® Visual Inspection, along with steps to fix them.


Action detection training fails, or video inference does not process full video, or auto-capture does not capture frames in full video

Problem

Operations on a video (auto-capture, frame capture, or video inference) do not process the full video, or training of an action detection model fails.

IBM Maximo Visual Inspection uses video processing utilities to read frames from the videos and some videos cannot be fully processed.

If an action detection training fails, the vision-service log can be checked for specific errors that indicate this failure. For example, the following command shows the ERROR for a training that failed because only 613 of the 741 frames in the video could be read:

# kubectl logs  `kubectl get pods -o custom-columns=NAME:.metadata.name | grep vision-service` | grep -A2 -C2 ERROR
root        : INFO     processing 000500/000741 ...
root        : INFO     processing 000600/000741 ...
root        : ERROR    Could not read frame 614.
root        : INFO     Extract video as RGB frame is completed 614
root        : INFO     complete extracting video /opt/ibm/vision/data/admin/datasets/fd1d7222-2800-4585-b711-000120592811/training/fc575915-5be5-42c8-8d8b-125d5c7a85e2/96c9b3d2-da24-4e1d-b624-d8ce3141be97.mp4
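
To confirm outside of IBM Maximo Visual Inspection that a video cannot be fully read, you can compare the frame count that the container metadata declares with the number of frames that can actually be decoded. The following is a minimal sketch with ffprobe (assuming ffprobe is installed; the file name is a placeholder):

# Compare the declared frame count (nb_frames) with the frames that can
# actually be decoded (nb_read_frames); a mismatch indicates a video that
# cannot be fully processed
ffprobe -v error -count_frames -select_streams v:0 \
  -show_entries stream=nb_frames,nb_read_frames -of csv=p=0 input.mp4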

Solution

Try one of these options to solve this problem:


Action detection training fails in some instances with the error "Internal Server Error - Generic exception thrown"

Problem

Multiple action detection labels are selected and renamed. When training is attempted, it fails with a generic error.

Solution

Rename each label individually. Then, train the model again.


Postgres Kubernetes pod fails to start

Problem

The IBM Maximo Visual Inspection application does not start, and the Kubernetes pod status indicates that the postgres pod is in the CrashLoopBackOff state. The following output is a sample:

vision-elasticsearch-59b9b89b56-8h9bt       1/1    Running           0         2m41s
vision-fpga-device-plugin-q9p7b             1/1    Running           0         2m41s
vision-keycloak-98d6cf9db-jfvgl             0/1    Init:0/1          0         2m41s
vision-logstash-7778f58977-b2bqg            1/1    Running           0         2m41s
vision-mongodb-5c9956d784-ws8h4             1/1    Running           0         2m41s
vision-postgres-769698d5c4-j5wtm            0/1    CrashLoopBackOff  4         2m41s
vision-service-6c48b5688b-lmcs2             1/1    Running           0         2m41s
vision-taskanaly-6c8bbb9868-t4xxr           1/1    Running           0         2m41s
vision-ui-589dbd466-sk9tk                   1/1    Running           0         2m41s
vision-video-microservice-5678fbdcbc-kfn85  1/1    Running           0         2m41s

Solution

The problem might be related to a system configuration that prevents the postgres pod from running successfully. When the log is captured for the pod, it shows:

./kubectl logs vision-postgres-769698d5c4-sbpm2 -p
...
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 10
selecting default shared_buffers ... 400kB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
Bus error (core dumped)
child process exited with exit code 135
initdb: removing contents of data directory "/var/lib/postgresql/data"
running bootstrap script ...

The exit code 135 can occur when huge pages are enabled in the system configuration. For more information, see https://github.com/docker-library/postgres/issues/451.

If vm.nr_hugepages was set to a nonzero value in /etc/sysctl.conf, set it to 0, then restart the system for the new configuration to take effect.
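
A minimal sketch of that check and fix follows (it assumes that huge pages are configured directly in /etc/sysctl.conf; adjust if your system manages sysctl settings elsewhere):

# Check the current setting; a nonzero value can trigger the bus error
sysctl vm.nr_hugepages

# Edit /etc/sysctl.conf so that it reads vm.nr_hugepages = 0,
# then restart the system for the change to take effect
sudo vi /etc/sysctl.conf
sudo reboot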


When you import a DICOM format file, the "Waiting for import..." notification does not go away

Problem

When you use the IBM Maximo Visual Inspection user interface on Microsoft™ Windows™ to import a DICOM file, the "Waiting for import..." notification does not automatically close.

Solution

The import of the image is not impacted. Check to make sure that the image is visible in the data set view. The notification can safely be closed or deleted.


IBM Maximo Visual Inspection seems to be connected to IBM® Watson™ IoT Platform, but no data is showing up

After configuration, IBM Watson IoT Platform displays a green dot and a "Connected" status to indicate that an IBM Maximo Visual Inspection instance is connected.

Problem

IBM Maximo Visual Inspection appears to be connected to IBM Watson IoT Platform, but no inference data is showing up.

Solution

Two possible solutions to this problem are:


The IBM Maximo Visual Inspection user interface does not work

Problem

You cannot label objects, view training charts, or create categories.

Solution

Verify that you are using a supported web browser. The following web browsers are supported:


Resource pages are not being populated in the user interface

Problem

Resource pages, such as data sets and models, are not being populated. Notifications indicate that an error occurred while the resource page was being obtained. For example, "Error obtaining data sets."

Solution

Check the status of the vision-service pod. This pod provides the data to the user interface, and until it is ready (1/1) with a status of Running, these errors are expected.

If the application is restarting, a delay is expected before all services are available and fully functioning. Otherwise, this problem might indicate an unexpected termination (error) of the vision-service pod. If that happens, follow these instructions: Gathering logs and contacting support.
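
For example, on a kubectl-based deployment (as in the other examples in this topic), you can check the pod directly:

# The pod is ready when READY shows 1/1 and STATUS shows Running
kubectl get pods | grep vision-service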


Unexpected or old pages are displayed when you access the user interface

Problem

After you update, reinstall, or restart IBM Maximo Visual Inspection, the browser presents pages that are from the previous version or are stale.

Solution

This problem is typically caused when the browser uses a cached version of the page. To solve the problem, try one of these methods:


IBM Maximo Visual Inspection does not play video

Problem

You cannot upload a video, or the video does not play after it is uploaded.

Solution

Verify that your video is a supported type:

If your video is not in a supported format, transcode your video by using a conversion utility. Such utilities are available under various free and paid licenses.
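
For example, ffmpeg, a widely available free utility, can transcode most videos to an H.264 MP4 (a sketch; the file names are placeholders, and you should confirm that the target format is in the supported list):

# Re-encode the source video to H.264 video in an MP4 container
ffmpeg -i input.avi -c:v libx264 -pix_fmt yuv420p output.mp4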


IBM Maximo Visual Inspection cannot train or deploy models after restart

Problem

On Red Hat Enterprise Linux™ 7.6 systems with CUDA 10.1, the SELinux context of NVIDIA GPU device files is lost at restart time. SELinux then prevents IBM Maximo Visual Inspection from using the GPUs for training and deployment.

Solution

Restart IBM Maximo Visual Inspection by running vision-stop.sh then vision-start.sh. Restarting the application resets the problematic SELinux contexts if they are incorrect, restoring the ability to access GPUs for training and inference.
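
A minimal sketch of that restart (assuming that the scripts are run from the directory where IBM Maximo Visual Inspection is installed):

# Stop and restart the application; the restart restores the SELinux
# contexts on the NVIDIA device files
sudo ./vision-stop.sh
sudo ./vision-start.sh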


A Tiny YOLO v2 model fails to train with a small data set

Problem

When you use a small data set with fewer than 15 labeled images, training a Tiny YOLO v2 model sometimes fails.

The vision-service log shows exceptions because of the ratio of images:

  # kubectl logs  `kubectl get pods -o custom-columns=NAME:.metadata.name | grep vision-service` | grep Exception | grep ratio
  root        : INFO     Exception: Based on the ratio of 0.8, the data set was split into 13 training images and 0 test images.
    At least one test image is required.
    Please set a lower ratio or add more images to the data set.

Solution

Try the training again, or follow the guidance in the message and increase the number of labeled images in the data set. For more information, see Data set considerations.


A Tiny YOLO v2 model fails to train when you use a data set with many similar bounding boxes

Problem

When you use a data set with many images and similar bounding boxes, training a Tiny YOLO v2 model might fail. The model requires unique bounding box anchors to be successfully trained.

The vision-service log shows an error because of the ratio of images:

# kubectl logs  `kubectl get pods -o custom-columns=NAME:.metadata.name | grep vision-service` | grep Error | grep anchor
root        : INFO     root        : ERROR    Requested to create 5 initial anchors but only found 4 unique bounding box sizes in the data set. Please label more objects or make sure there are at least 5 uniquly sized boxes.

Solution

Modify existing bounding boxes or use data augmentation to create new images with different bounding box anchors, then try the training again. See Augmenting the data set for instructions.


Tiny YOLO v2 models do not train, but other models train

Problem

An object detection model cannot be trained by using Tiny YOLO v2, but other object detection models, such as SSD and FR-CNN, train successfully.

Solution

Verify that the NVIDIA GPUs are not configured to run in "exclusive" mode. You can use the nvidia-smi command to set the GPUs to "default" mode:

nvidia-smi -c 0

Compare the following nvidia-smi output to standard nvidia-smi output, checking the final column (Compute M.). In this problematic example, the value is E. Process instead of the expected Default.

+---------------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00      Driver Version: 418.87.00      CUDA Version: 10.1     |
|-------------------------------+----------------------+--------------------------+
| GPU Name Persistence-M        | Bus-Id Disp.A        |     Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap   | Memory-Usage         |     GPU-Util Compute M.  |
|===============================+======================+==========================|
| 0 Tesla P100-SXM2... On       | 00000002:01:00.0 Off |   0                      |
| N/A 30C P0 31W / 300W         | 0MiB / 16280MiB      |   0% E. Process          |
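
You can also query the compute mode directly by using standard nvidia-smi query options:

# "Default" is required; "Exclusive_Process" prevents Tiny YOLO v2 training
nvidia-smi --query-gpu=index,compute_mode --format=csv
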

I forgot my username or password

Problem

You forgot your username or password and cannot log in to the IBM Maximo Visual Inspection GUI.

Solution

IBM Maximo Visual Inspection is part of Maximo Application Suite and uses single sign-on to log users in from Maximo Application Suite. Contact your Maximo Application Suite administrator to reset or check your credentials. For more information, see Manage users.


GPUs are not available for training or deployment

Problem

I can log in to Maximo Visual Inspection. However, when I try to train or deploy a model, I get an error that no GPUs are available. The available GPU indicator in the UI shows 0 GPUs available.

Solution

Check the following items:

To check these items, run the following command:

$ oc describe node | egrep '^Name|^Capacity:|^Allocatable:|nvidia.com/gpu:'

Check the output to see whether GPUs are allocatable.

In the following sample output, one node that is named worker-0 has 2 GPUs, which are both allocatable:

Name: master-0.example.cluster.com
Capacity:
Allocatable:
Name: master-1.example.cluster.com
Capacity:
Allocatable:
Name: master-2.example.cluster.com
Capacity:
Allocatable:
Name: worker-0.example.cluster.com
Capacity:
nvidia.com/gpu: 2
Allocatable:
nvidia.com/gpu: 2

If you do not see capacity or allocatable GPUs, your OpenShift® environment is not properly configured for GPU resources. See No nodes have available GPUs.


No nodes have available GPUs

Problem

The NVIDIA GPU operator is installed, but no nodes have available GPUs.

Solution

Make sure that the prerequisites for the NVIDIA GPU operator are met, including installation of the Red Hat® Node Feature Discovery (NFD) operator.


The NVIDIA GPU operator attempts to create a driver container, but never completes

Problem

In the OpenShift® administrator console or from an OpenShift® command line, the NVIDIA GPU driver container starts and crashes repeatedly. The output indicates CrashLoopBackOff and a failure that describes a missing package. Sample status:

nvidia-gpu-driver-container-rhel8-tzbr9      0/1     CrashLoopBackOff   6          6m43s

Solution

Make sure that the OpenShift® cluster is configured to build entitled containers. A cluster-wide entitlement enables all workloads that run on the cluster to work correctly. This cluster-wide entitlement includes future nodes or driver containers. See https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift.


Maximo Visual Inspection has access to fewer GPUs than before

Problem

An instance of Maximo Visual Inspection used to have access to multiple GPUs, but now has fewer GPUs. The value for Total GPUs in the Training and Model pages is less than expected.

Solution

GPUs belong to compute nodes in OpenShift®. A node that contains one or more GPUs has been temporarily or permanently removed from the cluster. Maximo Visual Inspection shows GPU availability in real time.


IBM Maximo Visual Inspection cannot train a model

Problem

The model training process might fail if your system does not have enough GPU resources or if insufficient information is defined in the model.

Note: Starting with version 1.2.0.1, training jobs are added to a queue, so training jobs do not fail due to lack of GPU resources.

Solution


Thumbnails do not load or image previews are missing

Problem

IBM Maximo Visual Inspection data set or model thumbnails do not load, or images are not visible when you label or preview an image.

Solution

Disable ad blockers, or exempt the IBM Maximo Visual Inspection user interface from ad blockers.


Training or deployment hangs - Kubernetes pod cleanup

Problem

You submit a job for training or deployment, but it never completes. Sometimes, pods that are running previous jobs are not stopped correctly by the Kubernetes services. These stale pods hold GPUs, so no new training or deployment jobs can complete, and the new jobs remain in a permanent Scheduled state.

To verify the problem, run kubectl get pods and review the output. The last column shows the age of each pod. If a pod is older than a few minutes, use the information in the Solution section to solve the problem.

Example:

kubectl get pods
vision-infer-ic-06767722-47df-4ec1-bd58-91299255f6hxxzk 1/1 Running 0 22m
vision-infer-ic-35884119-87b6-4d1e-a263-8fb645f0addqd2z 1/1 Running 0 22m
vision-infer-ic-7e03c8f3-908a-4b52-b5d1-6d2befec69ggqw5 1/1 Running 0 5h
vision-infer-od-c1c16515-5955-4ec2-8f23-bd21d394128b6k4 1/1 Running 0 3h

Solution

Follow these steps to manually delete the deployments that are hanging:

  1. Determine the running deployments and look for deployments that are running longer than a few minutes (a sketch that sorts the output by age follows these steps):

    kubectl get deployments
    
  2. Delete the deployments that were identified as hanging in the previous step.

    kubectl delete deployment deployment_id
    
  3. You can now try the training or deployment operation again, assuming that GPUs are available.

    Note: When a deployment is manually deleted, vision-service might try to re-create it when it is restarted. The only way to force Kubernetes to permanently delete it is to remove the failing model from IBM Maximo Visual Inspection.
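
As referenced in step 1, you can make long-running deployments easier to spot by sorting the output by creation time (a standard kubectl option):

# List deployments oldest-first so that hanging deployments appear at the top
kubectl get deployments --sort-by=.metadata.creationTimestamp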


Training fails with the error, "You must retrain the model"

Problem

Long label names can result in training failures. This problem occurs when label or class names in the data set are longer than 64 characters, or when the names contain non-ASCII characters that have a multi-byte representation.

Solution

Limit label and class names to 64 characters or fewer, and avoid characters that have a multi-byte representation. If the problem persists, restrict label and class names to 64 ASCII characters or fewer.
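
A minimal sketch for screening label names (labels.txt is a hypothetical file with one label name per line; grep -P requires GNU grep):

# Flag label names that are longer than 64 characters
awk 'length($0) > 64' labels.txt

# Flag label names that contain non-ASCII (multi-byte) characters
grep -nP '[^\x00-\x7F]' labels.txt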


Model training and inferencing fail

Problem

The NVIDIA GPU device is not accessible by the IBM Maximo Visual Inspection Docker containers. To confirm this problem, run kubectl logs -f <vision-service-ID> and then check pod_<vision-service-ID>_vision-service.log for an error that indicates error == cudaSuccess (30 vs. 0):

F0731 20:34:05.334903    35 common.cpp:159] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
*** Check failure stack trace: ***
/opt/py-faster-rcnn/FRCNN/bin/train_frcnn.sh: line 24:    35 Aborted                 (core dumped) _train_frcnn.sh

Solution

Use sudo to alter the SELinux context of all of the NVIDIA devices so that they are accessible by the IBM Maximo Visual Inspection Docker containers.

sudo chcon -t container_file_t /dev/nvidia*
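
To verify the change, list the SELinux contexts of the device files:

# Each /dev/nvidia* entry should now show the container_file_t type
ls -lZ /dev/nvidia*
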

Model import generates an error alert

Problem

When you import a model, you get the error "The model was not imported. You can import .zip files only if they were exported from an IBM Maximo Visual Inspection model."

Solution

This error can occur when you import models from older versions of IBM Maximo Visual Inspection that do not include model metadata, such as the model name or thumbnail image, in the exported file. Despite the error, the model is imported successfully and is available on the model details page for deployment.


Model accuracy value is unexpected

Problem

A trained model has an unexpected value for accuracy, such as 0%, 100%, or "Unknown". This problem happens when insufficient data prevents training from working properly.

Solution

Ensure that the data set has enough images for each category or object label. For more information, see Data set considerations.


Deployed models stuck in "Starting"

Problem

IBM Maximo Visual Inspection models remain in "Starting" state and do not become available for inference operations.

Solution

Delete and redeploy the models. One possible cause is that the IBM Maximo Visual Inspection models were deployed in a prior version of the product that is not compatible with the currently installed version. For example, this problem can happen after you upgrade IBM Maximo Visual Inspection.


Auto-labeling of a data set returns "Auto Label Error"

Problem

You cannot auto-label a data set that has no unlabeled images, unless some of the images were previously labeled by the auto label function.

Solution

Ensure that the Objects section of the data set side bar shows objects that are "Unlabeled". If no "Unlabeled" objects are listed in the side bar, add new images that are unlabeled or remove labels from some images, then run auto label again.


Uploading a large file fails

Problem

When you upload a large file into a data set, the file is broken up into smaller chunks that are uploaded individually. The upload of each file chunk must complete within 30 minutes to avoid a security access timeout: the IBM® Maximo® Visual Inspection API expires its security access tokens every 30 minutes, and an API call that starts but does not complete within 30 minutes fails.

When you upload a large file, you might see the upload start (showing a progress bar) but then see an error message in the user interface. This error happens because of an Nginx timeout, where the upload of a file chunk takes longer than the defined 5-minute Nginx timeout.

Solution

Despite the notification error, the large file is uploaded successfully. Refresh the page to show the uploaded files in the data set.

Also verify that IPv6 is enabled on the system. IBM Maximo Visual Inspection requires IPv6.
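
A quick way to check whether IPv6 is enabled is the standard Linux sysctl key:

# A value of 0 means IPv6 is enabled; 1 means it is disabled
sysctl net.ipv6.conf.all.disable_ipv6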


IBM Maximo Visual Inspection API keys do not work

Problem

IBM Maximo Visual Inspection API keys do not work.

Solution

Check the following items: