Following are some problems that you might encounter when you use IBM® Maximo® Visual Inspection, along with steps to fix them.
Operations on a video (auto-capture, frame capture, or video inference) do not process the full video, or training of an action detection model fails.
IBM Maximo Visual Inspection uses video processing utilities to read frames from the videos and some videos cannot be fully processed.
If an action detection training fails, check the vision-service log for specific errors that indicate the failure. For example, the following command shows the ERROR for a training that failed because only 613 of the 741 frames in the video were read:
# kubectl logs `kubectl get pods -o custom-columns=NAME:.metadata.name | grep vision-service` | grep -A2 -C2 ERROR
root : INFO processing 000500/000741 ...
root : INFO processing 000600/000741 ...
root : ERROR Could not read frame 614.
root : INFO Extract video as RGB frame is completed 614
root : INFO complete extracting video /opt/ibm/vision/data/admin/datasets/fd1d7222-2800-4585-b711-000120592811/training/fc575915-5be5-42c8-8d8b-125d5c7a85e2/96c9b3d2-da24-4e1d-b624-d8ce3141be97.mp4
Try one of these options to solve this problem:
Multiple action detection labels are selected and renamed. When training is attempted, it fails with a generic error.
Rename each label individually. Then, train the model again.
The IBM Maximo Visual Inspection application does not start, and the Kubernetes pod status indicates that the postgres pod is in the CrashLoopBackOff state. The following status is a sample:
vision-elasticsearch-59b9b89b56-8h9bt        1/1   Running            0   2m41s
vision-fpga-device-plugin-q9p7b              1/1   Running            0   2m41s
vision-keycloak-98d6cf9db-jfvgl              0/1   Init:0/1           0   2m41s
vision-logstash-7778f58977-b2bqg             1/1   Running            0   2m41s
vision-mongodb-5c9956d784-ws8h4              1/1   Running            0   2m41s
vision-postgres-769698d5c4-j5wtm             0/1   CrashLoopBackOff   4   2m41s
vision-service-6c48b5688b-lmcs2              1/1   Running            0   2m41s
vision-taskanaly-6c8bbb9868-t4xxr            1/1   Running            0   2m41s
vision-ui-589dbd466-sk9tk                    1/1   Running            0   2m41s
vision-video-microservice-5678fbdcbc-kfn85   1/1   Running            0   2m41s
The problem might be related to a system configuration setting that prevents the postgres pod from running successfully. When the log for the pod is captured, it shows:
./kubectl logs vision-postgres-769698d5c4-sbpm2 -p
...
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 10
selecting default shared_buffers ... 400kB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... Bus error (core dumped)
child process exited with exit code 135
initdb: removing contents of data directory "/var/lib/postgresql/data"
The exit code 135 can occur when huge pages are enabled in the system configuration. For more information, see https://github.com/docker-library/postgres/issues/451.
If vm.nr_hugepages is set to a nonzero value in /etc/sysctl.conf, set it to 0, then restart the system for the new configuration to take effect.
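For example, a minimal sketch of checking and changing the setting, assuming that vm.nr_hugepages is defined in /etc/sysctl.conf:
sysctl vm.nr_hugepages
sudo sed -i 's/^vm\.nr_hugepages.*/vm.nr_hugepages = 0/' /etc/sysctl.conf
sudo reboot
The first command shows the current value; a nonzero value can trigger the crash. The sed command rewrites the setting to 0 before the restart.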
When you use the IBM Maximo Visual Inspection user interface on Microsoft™ Windows™ to import a DICOM file, the "Waiting for import..." notification does not automatically close.
The import of the image is not impacted. Check to make sure that the image is visible in the data set view. The notification can safely be closed or deleted.
After configuration, IBM Watson IoT Platform displays a green dot and a "Connected" status to indicate that an IBM Maximo Visual Inspection instance is connected.
IBM Maximo Visual Inspection appears to be connected to IBM Watson IoT Platform, but no inference data is showing up.
Two possible causes of this problem are:
The inference operation did not produce classification data. If the inference results do not include any classification data (image classification or object detection), no data is published to IBM Watson IoT Platform.
The connection was not fully established with IBM Watson IoT Platform. When the connection is reestablished, any pending inference results are published to IBM Watson IoT Platform, and all new inference results are published as they occur.
You cannot label objects, view training charts, or create categories.
Verify that you are using a supported web browser. The following web browsers are supported:
Resource pages, such as data sets and models, are not being populated. Notifications indicate that an error occurred when the resource page was obtained. For example, "Error obtaining data sets."
Check the status of the vision-service pod. This pod provides the data to the user interface, and until it is ready (1/1) with a status of Running, these errors are expected.
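For example, a quick way to check the pod status:
kubectl get pods | grep vision-service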
If the application is restarting, a delay is expected before all services are available and fully functioning. Otherwise, this problem might indicate an unexpected termination (error) of the vision-service pod. If that happens, follow these instructions: Gathering logs and contacting support.
After you update, reinstall, or restart IBM Maximo Visual Inspection, the browser presents pages that are from the previous version or are stale.
This problem is typically caused when the browser uses a cached version of the page. To solve the problem, try one of these methods:
Hold down ⇧ Shift and press your browser's Reload button to bypass the cached page. Alternatively, clear the browser's cache and reload the page.
You cannot upload a video, or after the video is uploaded the video does not play.
Verify that your video is a supported type:
If your video is not in a supported format, transcode your video by using a conversion utility. Such utilities are available under various free and paid licenses.
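For example, ffmpeg, one such freely available utility, can transcode a video to H.264 in an MP4 container, a widely supported combination. The input file name here is a placeholder:
ffmpeg -i input.avi -c:v libx264 -pix_fmt yuv420p output.mp4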
On Red Hat Enterprise Linux™ 7.6 systems with CUDA 10.1, the SELinux context of NVIDIA GPU files is lost at restart time. SELinux then prevents IBM Maximo Visual Inspection from using the GPUs for training and deployment.
Restart IBM Maximo Visual Inspection by running vision-start.sh. Restarting the application resets the problematic SELinux contexts if they are incorrect, restoring the ability to access GPUs for training and deployment.
When you use a small data set with fewer than 15 labeled images, training a Tiny YOLO v2 model sometimes fails.
The vision-service log shows exceptions because of the ratio of images:
# kubectl logs `kubectl get pods -o custom-columns=NAME:.metadata.name | grep vision-service` | grep Exception | grep ratio
root : INFO Exception: Based on the ratio of 0.8, the data set was split into 13 training images and 0 test images. At least one test image is required. Please set a lower ratio or add more images to the data set.
Try the training again, or follow the guidance in the message and increase the number of labeled images in the data set. For more information, see Data set considerations.
When you use a data set with many images and similar bounding boxes, training a Tiny YOLO v2 model might fail. The model requires unique bounding box anchors to be successfully trained.
The vision-service log shows an error about bounding box anchors:
# kubectl logs `kubectl get pods -o custom-columns=NAME:.metadata.name | grep vision-service` | grep Error | grep anchor
root : INFO
root : ERROR Requested to create 5 initial anchors but only found 4 unique bounding box sizes in the data set. Please label more objects or make sure there are at least 5 uniquly sized boxes.
Modify existing bounding boxes or use data augmentation to create new images with different bounding box anchors, then try the training again. See Augmenting the data set for instructions.
An object detection model cannot be trained by using Tiny YOLO v2, but other object detection models, such as SSD and FR-CNN, train successfully.
Verify that the NVIDIA GPUs are not configured to run in "exclusive" mode. The following nvidia-smi command sets the GPUs to "default" mode:
nvidia-smi -c 0
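To check the current mode without changing it, you can query it directly, for example:
nvidia-smi --query-gpu=index,compute_mode --format=csv
A result of Default for each GPU means that the mode is correct for IBM Maximo Visual Inspection.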
Compare your nvidia-smi output with the following sample, and check the final column (Compute M.). In this sample, the value is E. Process rather than Default, which indicates that the GPU is configured in exclusive mode:
+---------------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00      Driver Version: 418.87.00      CUDA Version: 10.1     |
|-------------------------------+----------------------+--------------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC     |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util      Compute M. |
|===============================+======================+==========================|
|   0  Tesla P100-SXM2...  On   | 00000002:01:00.0 Off |                        0 |
| N/A   30C    P0    31W / 300W |      0MiB / 16280MiB |      0%       E. Process |
You forgot your username or password and cannot log in to the IBM Maximo Visual Inspection GUI.
IBM Maximo Visual Inspection is part of Maximo Application Suite and uses single sign-on to log users in from Maximo Application Suite. Contact your Maximo Application Suite administrator to reset or check your credentials. For more information, see Manage users.
I can log in to Maximo Visual Inspection. However, when I try to train or deploy a model, I get an error that no GPUs are available. The available GPU indicator in the UI shows 0 GPUs available.
Check whether the cluster nodes report GPU capacity and whether those GPUs are allocatable. To check these items, run the following command:
$ oc describe node | egrep '^Name|^Capacity:|^Allocatable:|nvidia.com/gpu:'
Check the output to see whether GPUs are allocatable.
In the following sample output, one node, named worker-0, has 2 GPUs, which are both allocatable:
Name:           master-0.example.cluster.com
Capacity:
Allocatable:
Name:           master-1.example.cluster.com
Capacity:
Allocatable:
Name:           master-2.example.cluster.com
Capacity:
Allocatable:
Name:           worker-0.example.cluster.com
Capacity:
  nvidia.com/gpu:  2
Allocatable:
  nvidia.com/gpu:  2
If you do not see capacity or allocatable GPUs, your OpenShift® environment is not properly configured for GPU resources. See No nodes have available GPUs.
The NVIDIA GPU operator is installed, but no nodes have available GPUs.
Make sure that the prerequisites for the NVIDIA GPU operator are met, including installation of the Red Hat® Node Feature Discovery (NFD) operator.
In the OpenShift® administrator console or from an OpenShift® command line, the NVIDIA GPU driver container starts and crashes repeatedly. The output indicates CrashLoopBackOff and a failure that describes a missing package. Sample output:
nvidia-gpu-driver-container-rhel8-tzbr9   0/1   CrashLoopBackOff   6   6m43s
Make sure that the OpenShift® cluster is configured to build entitled containers. A cluster-wide entitlement enables all workloads that run on the cluster to work correctly. This cluster-wide entitlement includes future nodes or driver containers. See https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift.
An instance of Maximo Visual Inspection used to have access to multiple GPUs, but now has fewer GPUs. The value for Total GPUs in the Training and Model pages is less than expected.
GPUs belong to compute nodes in OpenShift®. A node that contains one or more GPUs has been temporarily or permanently removed from the cluster. Maximo Visual Inspection shows GPU availability in real time.
The model training process might fail if your system does not have enough GPU resources or if insufficient information is defined in the model.
Note: Starting with version 220.127.116.11, training jobs are added to a queue, so training jobs do not fail due to lack of GPU resources.
If you are training a data set for image classification, verify that at least two image categories are defined, and that each category has a minimum of five images.
If you are training a data set for object detection, verify that at least one object label is used. You must also verify that each object is labeled in a minimum of five images.
Check the status and availability of the GPUs on the system.
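For example, two quick checks, the second of which applies to OpenShift installations:
nvidia-smi
oc describe node | egrep '^Name|nvidia.com/gpu:'
The nvidia-smi command, run on the GPU node, confirms that the driver can see the GPUs; the oc command confirms that the cluster reports allocatable GPU resources.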
IBM Maximo Visual Inspection data set or model thumbnails do not load, or images are not visible when you label or preview an image.
Disable ad blockers, or exempt the IBM Maximo Visual Inspection user interface from ad blockers.
You submit a job for training or deployment, but it never completes. When you run training or deployment operations, pods that are running previous jobs are sometimes not stopped correctly by the Kubernetes services. In turn, those jobs hold GPUs, so no new training or deployment jobs can complete. The jobs are left in a permanent Running state.
To verify the problem, run kubectl get pods and review the output. The last column shows the age of the pod. If a pod is older than a few minutes, use the information in the Solution section to solve the problem.
kubectl get pods
vision-infer-ic-06767722-47df-4ec1-bd58-91299255f6hxxzk   1/1   Running   0   22m
vision-infer-ic-35884119-87b6-4d1e-a263-8fb645f0addqd2z   1/1   Running   0   22m
vision-infer-ic-7e03c8f3-908a-4b52-b5d1-6d2befec69ggqw5   1/1   Running   0   5h
vision-infer-od-c1c16515-5955-4ec2-8f23-bd21d394128b6k4   1/1   Running   0   3h
Follow these steps to manually delete the deployments that are hanging:
Determine the running deployments and look for deployments that are running longer than a few minutes:
kubectl get deployments
Delete the deployments that were identified as hanging in the previous step:
kubectl delete deployment deployment_id
You can now try the training or deployment operation again, assuming that GPUs are available.
Note: When a deployment is manually deleted, vision-service might try to re-create it when it is restarted. The only way to force Kubernetes to permanently delete it is to remove the failing model from IBM Maximo Visual Inspection.
Long label names can result in training failures. This problem occurs when label or class names that are used in the data set are longer than 64 characters, or when they contain non-ASCII characters that have multi-byte representations.
Limit label and class names to 64 characters or fewer, and avoid characters that have multi-byte representations. If the problem persists, restrict label and class names to 64 ASCII characters or fewer.
The NVIDIA GPU device is not accessible by the IBM Maximo Visual Inspection Docker containers. To confirm this problem, run kubectl logs -f <vision-service-ID> and then check pod_<vision-service-ID>_vision-service.log for an error that indicates error == cudaSuccess (30 vs. 0):
F0731 20:34:05.334903    35 common.cpp:159] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
/opt/py-faster-rcnn/FRCNN/bin/train_frcnn.sh: line 24:    35 Aborted (core dumped) _train_frcnn.sh
Use sudo to alter the SELinux permissions for all of the NVIDIA devices so that they are accessible by the IBM Maximo Visual Inspection Docker containers:
sudo chcon -t container_file_t /dev/nvidia*
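To confirm the change, list the SELinux context of the device files and check that the type field shows container_file_t:
ls -Z /dev/nvidia*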
When you import a model, you get the error "The model was not imported. You can import .zip files only if they were exported from an IBM Maximo Visual Inspection model."
This error can occur when models are imported from older versions of IBM Maximo Visual Inspection that do not include model metadata, such as the model name or thumbnail image, in the exported file. Despite the error message, the model is imported successfully and is available on the model details page for deployment.
A trained model has an unexpected value for accuracy, such as 0%, 100%, or "Unknown". This problem happens when insufficient data prevents training from working properly.
Ensure that the data set has enough images for each category or object label. For more information, see Data set considerations.
IBM Maximo Visual Inspection models remain in "Starting" state and do not become available for inference operations.
Delete and redeploy the models. One possible cause is that the IBM Maximo Visual Inspection models were deployed in a prior version of the product that is not compatible with the currently installed version. For example, this problem can happen after you upgrade IBM Maximo Visual Inspection.
You cannot auto-label a data set that does not have unlabeled images, unless some of the images were previously labeled by the auto label function.
Ensure that the Objects section of the data set side bar shows objects that are "Unlabeled". If no "Unlabeled" objects are listed in the side bar, add new images that are unlabeled or remove labels from some images, then run auto label again.
When you upload a large file into a data set, the file is broken up into smaller chunks that are uploaded individually. The upload of each file chunk must be completed within 30 minutes to avoid a security access timeout. The IBM® Maximo® Visual Inspection API expires its security access tokens every 30 minutes. If an API client begins an API call but does not complete the call within 30 minutes, that call fails.
When you upload a large file, you might see the upload start (showing a progress bar) but then see an error message in the user interface. This error happens because of an Nginx timeout: the upload of a file chunk took longer than the defined 5-minute Nginx timeout.
Despite the error notification, the large file is uploaded successfully. Refresh the page to see the uploaded files in the data set.
IBM Maximo Visual Inspection requires IPv6. Enable IPv6 on the system.
IBM Maximo Visual Inspection API keys do not work.
Check the following items:
Make sure that the API key is passed in the X-Auth-Token header. For information about using the API key, refer to the IBM Maximo Visual Inspection Learning Path.
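For example, a hypothetical curl call that passes an API key; the host name, endpoint, and API_KEY variable are placeholders to substitute with values for your installation:
curl -k -H "X-Auth-Token: $API_KEY" https://<mvi-host>/api/datasets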