Troubleshooting common issues

The following are some problems that you might encounter when using PowerAI Vision, along with steps to fix them.

The PowerAI Vision GUI does not work
Unexpected / old pages displayed when accessing the user interface
Uploading a large file fails
Uploading a large number of files fails
PowerAI Vision does not play video
Out of space error from load_images.sh
I forgot my user name or password
PowerAI Vision cannot train a model
Training or deployment hangs
Model training and inference fails
Object detection model training fails using images with non-standard aspect ratios
Auto labeling of a data set returns "Auto Label Error"
PowerAI Vision does not start
PowerAI Vision fails to start - Kubernetes connection issue
PowerAI Vision startup hangs - helm issue
Helm status errors when starting PowerAI Vision
Some PowerAI Vision functions don't work
Command line tool fails - missing options

The PowerAI Vision GUI does not work

Problem

You cannot label objects, view training charts, or create categories.

Solution

Verify that you are using a supported web browser. The following web browsers are supported:

Google Chrome Version 60, or later
Firefox Quantum 59.0, or later

Unexpected / old pages displayed when accessing the user interface

Problem

After updating, reinstalling, or restarting PowerAI Vision, the browser presents pages that are from the previous version or are stale.

Solution

This problem is typically caused by the browser using a cached version of the page. To solve the problem, try one of these methods:

Use a Firefox Private Window to access the user interface.
Use a Chrome Incognito Window to access the user interface.
Bypass the browser cache:
- In most Windows and Linux browsers: Hold down Ctrl and press F5.
- In Chrome and Firefox for Mac: Hold down ⌘ Cmd and ⇧ Shift and press R.

Uploading a large file fails

When uploading files into a data set, there is a 2 GB size limit per upload session. This limit applies to a single .zip file or a set of files. When you upload a large file that is under 2 GB, you might see the upload start (showing a progress bar) and then receive an error message in the user interface. This error is caused by an Nginx timeout: the file upload takes longer than the defined 5-minute Nginx timeout.

Despite the notification error, the large file has been uploaded. Refreshing the page will show the uploaded files in the data set.
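Because the limit applies to the whole upload session, it can help to total the file sizes before you upload. The following Python sketch shows one way to do that. The function names are hypothetical, and the sketch assumes that 2 GB means 2 × 1024³ bytes and that the total must stay strictly under the limit; this topic does not state either detail.

```python
import os

# Assumption: "2 GB" is interpreted as 2 * 1024**3 bytes. The exact byte
# value of the limit is not stated in this topic.
UPLOAD_LIMIT_BYTES = 2 * 1024**3


def total_size(paths):
    """Return the combined size, in bytes, of the given files."""
    return sum(os.path.getsize(path) for path in paths)


def fits_upload_limit(paths, limit=UPLOAD_LIMIT_BYTES):
    """Return True if the files together are strictly under the limit."""
    return total_size(paths) < limit
```

If `fits_upload_limit` returns False for a set of files, consider splitting the set across several upload sessions.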

Uploading a large number of files fails

Problem

This problem occurs only on a Microsoft Windows system using the Chrome browser. When using the Import Files button in the PowerAI Vision user interface to add images or videos to a data set, you select a large number of files to upload, but nothing happens after you confirm the selection in the file picker.

Solution

This is a known bug with the Chrome browser on Windows where the file names selected are too long, causing the file picker to fail silently. Try the following solutions:

Create a zip of the files and upload that instead.
Use Firefox Quantum 59.0 or later to upload the files.
Drag-and-drop the files onto the area in the user interface labeled "Drop files here".
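The first option, creating a zip of the files, can be scripted. The following is a minimal Python sketch that uses the standard zipfile module; the function name is hypothetical and is not part of PowerAI Vision.

```python
import os
import zipfile


def zip_files(paths, archive_path):
    """Write the given files into one .zip archive, stored by base name."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # arcname drops the directory components so the archive is flat.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path
```

Because the archive is flat, files from different directories that share a base name would collide; rename such files before zipping them.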

PowerAI Vision does not play video

Problem

You cannot upload a video, or after the video is uploaded the video does not play.

Solution

Verify that your video is a supported type:

Ogg Vorbis (.ogg)
VP8 or VP9 (.webm)
H.264 encoded videos with MP4 format (.mp4)

If your video is not in a supported format, transcode your video by using a conversion utility. Such utilities are available under various free and paid licenses.
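As a quick first check, you can test the file extension against the list above. The following Python sketch does only that; the function name is hypothetical. It does not inspect the codec, so a file can pass this check and still fail to play (for example, an .mp4 file that is not H.264 encoded).

```python
import os

# Extensions taken from the list of supported video types in this topic.
SUPPORTED_VIDEO_EXTENSIONS = {".ogg", ".webm", ".mp4"}


def has_supported_extension(filename):
    """Return True if the file name ends in a supported video extension."""
    _, extension = os.path.splitext(filename)
    return extension.lower() in SUPPORTED_VIDEO_EXTENSIONS
```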

Out of space error from load_images.sh

Problem

When installing the product, the load_images.sh script is used to load the PowerAI Vision Docker images. Even though the script terminates with "INFO: All images loaded successfully.", the output should be checked to ensure there were not any problems.

For example, the /var/lib/docker file system can run out of space, resulting in a message indicating that an image was not fully loaded. The following output shows that the Docker image powerai-vision-dnn was not able to be fully loaded because of insufficient file system space:

Loaded image: powerai-vision-dnn:1.1.1.0

5f38fd05125c: Loading layer [==================================================>] 826.8 MB/826.8 MB

a95ac7216ffb: Loading layer [==================================================>]  20.3 MB/20.3 MB

Error processing tar file(exit status 1): write /usr/lib/libavcodec.so.57.107.100: no space left on device

INFO:  All images loaded successfully.

This situation can also be seen in the output from /opt/powerai-vision/bin/kubectl get pods. That command, which is described in Checking the application and environment, shows images that could not be loaded with a status of ErrImagePull or ImagePullBackOff.
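Because the final "All images loaded successfully" message is not reliable, you might scan the captured script output for error lines. The following Python sketch assumes that you saved the output of load_images.sh to a string; the function name and the marker list are hypothetical and cover only the messages shown in the sample output above.

```python
# Substrings that indicate a failed load. Both appear in the sample output
# in this topic; other failure messages may exist and are not covered here.
ERROR_MARKERS = ("Error processing tar file", "no space left on device")


def find_load_errors(output):
    """Return the lines of load_images.sh output that contain an error marker."""
    return [
        line.strip()
        for line in output.splitlines()
        if any(marker in line for marker in ERROR_MARKERS)
    ]
```

An empty list means that none of the known markers were found; it does not prove that every image loaded.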

Solution

The file system space for /var/lib/docker needs to be increased, even if the file system is not completely full. There might still be space in the file system where /var/lib/docker is located, but insufficient space for the PowerAI Vision Docker images. There are operating system mechanisms to do this, including moving or mounting /var/lib/docker to a file system partition with more space.
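Before you rerun the script, you can check how much space is free in the file system that holds /var/lib/docker. The following Python sketch uses shutil.disk_usage from the standard library. The function names are hypothetical, and the required amount is supplied by the caller because this topic does not state how much space the images need.

```python
import shutil


def free_gib(path):
    """Return the free space, in GiB, of the file system that holds path."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path, required_gib):
    """Return True if at least required_gib GiB are free at path."""
    return free_gib(path) >= required_gib
```

For example, `has_enough_space("/var/lib/docker", 50)` checks for 50 GiB, where 50 is only a placeholder value.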

I forgot my user name or password

Problem: You forgot your user name or password and cannot log in to the PowerAI Vision GUI.
Solution: PowerAI Vision uses an internally managed user account database. To change your user name or password, see Logging in to PowerAI Vision.

PowerAI Vision cannot train a model

Problem

The model training process might fail if your system does not have enough GPU resources.

Solution

If you are training a data set for image classification, verify that at least two image categories are defined, and that each category has a minimum of five images.
If you are training a data set for object detection, verify that at least one object label is used. You must also verify that each object is labeled in a minimum of five images.
Ensure that enough GPUs are available. PowerAI Vision assigns one GPU to each active training job or deployed deep learning API. For example, if a system has four GPUs and you have two deployed web APIs, there are two GPUs available for active training jobs. If a training job appears to be hanging, it might be waiting for another training job to complete, or there might not be a GPU available to run it.
To determine how many GPUs are in use on the system, run the sudo /opt/powerai-vision/bin/kubectl.sh describe nodes script and review the NvidiaGPU Limits column.

The following is an example of the output from sudo /opt/powerai-vision/bin/kubectl.sh describe nodes that shows two GPUs currently in use:
```
Name:               127.0.0.1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=ppc64le
                    beta.kubernetes.io/os=linux
                    gpu/nvidia=TeslaV100-SXM2-16GB
                    kubernetes.io/hostname=127.0.0.1
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true...
Allocated resources:
                   (Total limits may be over 100 percent, i.e., overcommitted.)
                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  NvidiaGPU Limits
                   --------------------------------------------------------------------------
                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2 (50%)
Events:         <none>
```
If all of the system's GPUs are in use, you can either delete a deployed web API (making the API unavailable for inference) or stop a model that is currently training.
- To delete a deployed model, click Deployed Models. Next, select the model that you want to delete and click Delete. The trained model is not deleted from PowerAI Vision. You can redeploy the model later when more GPUs are available.
- To stop a training model that is running, click Models. Next, select the model that has a status of Training in Progress and click Stop Training.
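The GPU check described above can be automated by reading the NvidiaGPU Limits value from the describe nodes output. The following Python sketch is a hypothetical helper that assumes the table layout shown in the example output, with the GPU column last; other Kubernetes versions may format this table differently.

```python
import re


def gpus_in_use(describe_output):
    """Return (count, percent) from the NvidiaGPU Limits column, or None."""
    lines = [line for line in describe_output.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        if "NvidiaGPU Limits" not in line:
            continue
        for row in lines[index + 1:]:
            if set(row.strip()) == {"-"}:
                continue  # skip the dashed rule under the header
            cells = re.findall(r"(\d+)\s*\((\d+)%\)", row)
            if cells:
                count, percent = cells[-1]  # the GPU column is assumed last
                return int(count), int(percent)
            return None
    return None
```

For the example output above, the helper returns a count of 2 at 50 percent, which matches two of four GPUs in use.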

Training or deployment hangs

Problem

You submit a job for training or deployment, but it never completes. During training or deployment, pods that are running previous jobs sometimes get out of sync with the vision-service MongoDB and hang forever instead of being terminated within a minute or so. These pods hold GPUs, so new training or deployment jobs cannot complete and remain in the Scheduled state indefinitely.

To verify that this is the problem, run kubectl get pods and review the output. The last column shows the age of the pod. If it is older than a few minutes, use the information in "Solution" to solve the problem.

Example:

kubectl get pods 

powerai-vision-infer-ic-06767722-47df-4ec1-bd58-91299255f6hxxzk 1/1 Running 0 22m 

powerai-vision-infer-ic-35884119-87b6-4d1e-a263-8fb645f0addqd2z 1/1 Running 0 22m 

powerai-vision-infer-ic-7e03c8f3-908a-4b52-b5d1-6d2befec69ggqw5 1/1 Running 0 5h 

powerai-vision-infer-od-c1c16515-5955-4ec2-8f23-bd21d394128b6k4 1/1 Running 0 3h
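To pick out old pods from a long listing, you can parse the AGE column. The following Python sketch is a hypothetical helper that assumes the five-column layout shown above (NAME, READY, STATUS, RESTARTS, AGE) and treats "a few minutes" as 300 seconds, which is an assumption rather than a documented threshold.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def age_to_seconds(age):
    """Convert a kubectl AGE value such as '22m', '5h', or '3h5m' to seconds."""
    parts = re.findall(r"(\d+)([smhd])", age)
    if not parts or "".join(number + unit for number, unit in parts) != age:
        raise ValueError(f"unrecognized age: {age!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def old_pods(get_pods_output, max_age_seconds=300):
    """Return (name, age) for pods whose AGE exceeds max_age_seconds."""
    result = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        # Expect NAME READY STATUS RESTARTS AGE; skip headers and other lines.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        try:
            seconds = age_to_seconds(fields[4])
        except ValueError:
            continue
        if seconds > max_age_seconds:
            result.append((fields[0], fields[4]))
    return result
```

An old pod is not necessarily hung; deployed models that are working normally also show a large age, so review the list before deleting anything.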

Solution

Follow these steps to manually delete the deployments that are hanging.

Determine the running deployments and look for those that have been running longer than a few minutes:
```
kubectl get deployments
```
Delete the deployments that were identified as hanging in the previous step.
```
kubectl delete deployment deployment_id
```
You can now try the training or deployment again, assuming that GPUs are available.

Note: When a deployment is manually deleted, vision-service might try to recreate it when it is restarted. The only way to force Kubernetes to permanently delete it is to remove the failing model from PowerAI Vision.

Model training and inference fails

Problem

The NVIDIA GPU device is not accessible by the PowerAI Vision Docker containers. To confirm this, run kubectl logs -f _powerai-vision-portal-ID_ and then check pod_powerai-vision-portal-ID_powerai-vision-portal.log for an error indicating error == cudaSuccess (30 vs. 0):

F0731 20:34:05.334903    35 common.cpp:159] Check failed: error == cudaSuccess (30 vs. 0)  unknown error

*** Check failure stack trace: ***

/opt/py-faster-rcnn/FRCNN/bin/train_frcnn.sh: line 24:    35 Aborted                 (core dumped) _train_frcnn.sh

Solution

Use sudo to alter the SELinux permissions for all of the NVIDIA devices so that they are accessible to the PowerAI Vision Docker containers:

sudo chcon -t container_file_t /dev/nvidia*

Object detection model training fails using images with non-standard aspect ratios

Problem

The training of an object detection model fails with the error message "An error occurred training model-name. You must retrain model-name.", where model-name is the name of the model being trained. However, repeated attempts to train the model fail with the same error.

Solution

Examine the data set for images that were cropped to a non-standard aspect ratio and are much longer on one edge, for example, 10 times longer on the horizontal edge than on the vertical edge. These images cause object detection training to fail and should be cropped or adjusted so that the model can train. The images must follow these guidelines:

They must be at least 130 pixels on the shortest edge.
If the longer edge is greater than 1000 pixels, the image is scaled down when the model is trained. When the longest side is scaled down to 1000 pixels, the shorter edge must still be at least 130 pixels.
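The two guidelines above can be expressed as a short check. The following Python sketch is a hypothetical helper that assumes scaling preserves the aspect ratio and ignores any rounding that the product might apply.

```python
MIN_SHORT_EDGE = 130   # minimum pixels on the shortest edge
MAX_LONG_EDGE = 1000   # longer edges are scaled down to this size


def meets_size_guidelines(width, height):
    """Return True if an image of this size satisfies both guidelines."""
    short, long_ = sorted((width, height))
    if short < MIN_SHORT_EDGE:
        return False
    if long_ > MAX_LONG_EDGE:
        # Scaling the long edge to 1000 px shrinks the short edge by the
        # same factor; the result must still be at least 130 px.
        scaled_short = short * MAX_LONG_EDGE / long_
        return scaled_short >= MIN_SHORT_EDGE
    return True
```

For example, a 1300 x 130 image fails: after the long edge is scaled to 1000 pixels, the short edge is only 100 pixels.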

Auto labeling of a data set returns "Auto Label Error"

Problem: Auto labeling cannot be performed on a data set that does not have unlabeled images, unless some of the images were previously labeled by the auto label function.
Solution: Ensure that the Objects section of the data set side bar shows there are objects that are "Unlabeled". If there are none, that is, if "Unlabeled (0)" is displayed in the side bar, add new images that are unlabeled or remove labels from some images, then run auto label again.

PowerAI Vision does not start

Problem

When you enter the URL for PowerAI Vision from a supported web browser, nothing is displayed. You see a 404 error or Connection Refused message.

Solution

Complete the following steps to solve this problem:

Verify that IP version 4 (IPv4) port forwarding is enabled by running the /sbin/sysctl net.ipv4.conf.all.forwarding command and verifying that the value for net.ipv4.conf.all.forwarding is set to 1.
If IPv4 port forwarding is not enabled, run the /sbin/sysctl -w net.ipv4.conf.all.forwarding=1 command. For more information about port forwarding with Docker, see UCP requires IPv4 IP Forwarding in the Docker success center.

If IPv4 port forwarding is enabled and the docker0 interface is a member of the trusted zone, check the Helm chart status by running this script:

sudo /opt/powerai-vision/bin/helm.sh status vision

In the script output, verify that the PowerAI Vision components are available by locating the Deployment section and identifying that the AVAILABLE column has a value of 1 for each component. The following is an example of the output from the helm.sh status vision script that shows all components are available:


RESOURCES:

==> v1beta1/Deployment

NAME                              DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE

powerai-vision-mongodb            1        1        1           1          4d

powerai-vision-portal             1        1        1           1          4d

powerai-vision-postgres           1        1        1           1          4d

powerai-vision-taskanaly          1        1        1           1          4d

powerai-vision-ui                 1        1        1           1          4d

powerai-vision-video-nginx        1        1        1           1          4d

powerai-vision-video-portal       1        1        1           1          4d

powerai-vision-video-rabmq        1        1        1           1          4d

powerai-vision-video-redis        1        1        1           1          4d

powerai-vision-video-test-nginx   1        1        1           1          4d

powerai-vision-video-test-portal  1        1        1           1          4d

powerai-vision-video-test-rabmq   1        1        1           1          4d

powerai-vision-video-test-redis   1        1        1           1          4d

If you recently started PowerAI Vision and some components are not available, wait a few minutes for these components to become available. If any components remain unavailable, gather the logs and contact IBM® Support, as described in this topic: Gather PowerAI Vision logs and contact support.

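If the docker0 interface is a member of the trusted zone and all PowerAI Vision components are available, ensure that the firewall allows communication through port 443 (used to connect to PowerAI Vision) by running the following command. Because the --permanent option changes only the saved configuration, reload the firewall afterward (for example, with sudo firewall-cmd --reload) for the rule to take effect.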
```
sudo firewall-cmd --permanent --zone=public --add-port=443/tcp
```
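The check of the Deployment section in the helm.sh status vision output can also be scripted. The following Python sketch is a hypothetical helper that assumes the column layout shown in the example output. It compares AVAILABLE with DESIRED, which generalizes the "value of 1" rule stated in this topic.

```python
def unavailable_deployments(status_output):
    """Return names of deployments whose AVAILABLE count is below DESIRED."""
    missing = []
    columns = None
    for line in status_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("==>", "NAME"):
            # A new table starts; keep the header only for deployment tables.
            is_deployment = "AVAILABLE" in fields and "DESIRED" in fields
            columns = fields if is_deployment else None
            continue
        if columns and len(fields) == len(columns):
            row = dict(zip(columns, fields))
            if row["AVAILABLE"].isdigit() and row["DESIRED"].isdigit():
                if int(row["AVAILABLE"]) < int(row["DESIRED"]):
                    missing.append(row["NAME"])
    return missing
```

An empty list means that every listed component reports as available.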

PowerAI Vision fails to start - Kubernetes connection issue

Problem

If the host system does not have a default route defined in the networking configuration, the Kubernetes cluster will fail to start with connection issues. For example:

$ sudo /opt/powerai-vision/bin/powerai_vision_start.sh

INFO: Setting up GPU...

[...]

Checking kubernetes cluster status...

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #1:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #2:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #3:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #4:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #5:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #6:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #7:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #8:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #9:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #10:

The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?

INFO: Probing cluster status #11:

ERROR: Retry timeout. Error in starting kubernetes cluster, please check /opt/powerai-vision/log/kubernetes for logs.

Solution

Define a default route in the networking configuration. For instructions to do this on Red Hat Enterprise Linux (RHEL), refer to 2.2.4 Static Routes and the Default Gateway in the Red Hat Customer Portal.
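To confirm whether a default route exists, one option on Linux is to inspect the kernel routing table in /proc/net/route, where an IPv4 default route has a destination and mask of 00000000. The following Python sketch is a hypothetical helper that takes the file contents as text; it covers IPv4 only and does not check route flags.

```python
def has_default_route(route_table):
    """Return True if /proc/net/route text contains an IPv4 default route."""
    lines = route_table.splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        # Columns: Iface Destination Gateway Flags RefCnt Use Metric Mask ...
        if len(fields) >= 8 and fields[1] == "00000000" and fields[7] == "00000000":
            return True
    return False
```

On the host, you might call it as `has_default_route(open("/proc/net/route").read())`.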

PowerAI Vision startup hangs - helm issue

Problem

PowerAI Vision startup hangs with the message "Unable to start helm within 30 seconds - trying again." For example:

root> sudo /opt/powerai-vision/bin/powerai_vision_start.sh 

Checking ports usage...

Checking ports completed, no confict port usage detected.

[ INFO ] Setting up the GPU...

         Init cuda devices...

         Devices init completed!

         Persistence mode is already Enabled for GPU 00000004:04:00.0.

         Persistence mode is already Enabled for GPU 00000004:05:00.0.

         Persistence mode is already Enabled for GPU 00000035:03:00.0.

         Persistence mode is already Enabled for GPU 00000035:04:00.0.

         All done.

[ INFO ] Starting kubernetes...

         Checking kubernetes cluster status...

         Probing cluster status #1: NotReady

         Probing cluster status #2: NotReady

         Probing cluster status #3: NotReady

         Probing cluster status #4: Ready

         Booting up ingress controller...

         Initializing helm...

         [ WARN ] Unable to start helm within 30 seconds - trying again.  If this continues, contact support.

         [ WARN ] Unable to start helm within 30 seconds - trying again.  If this continues, contact support.

         [ WARN ] Unable to start helm within 30 seconds - trying again.  If this continues, contact support.

         [ WARN ] Unable to start helm within 30 seconds - trying again.  If this continues, contact support.

Solution

To solve this problem, you must follow these steps exactly as written:

Cancel PowerAI Vision startup by pressing Ctrl+C.

Stop PowerAI Vision by running this command:

sudo /opt/powerai-vision/bin/powerai_vision_stop.sh

Modify the Red Hat Enterprise Linux (RHEL) settings as follows:

sudo nmcli connection modify docker0 connection.zone trusted

sudo systemctl stop NetworkManager.service

sudo firewall-cmd --permanent --zone=trusted --change-interface=docker0

sudo systemctl start NetworkManager.service

sudo nmcli connection modify docker0 connection.zone trusted

sudo systemctl restart docker.service

Start PowerAI Vision again:

sudo /opt/powerai-vision/bin/powerai_vision_start.sh

If the above commands do not fix the startup issue, check for a cgroup leak that can impact Docker. A Kubernetes/Docker issue can cause this situation, and even after the firewall issue is fixed, startup can still fail if cgroups have leaked.

One symptom of this situation is that the df command is slow to respond. To check for excessive cgroup mounts, run the mount command:

$ mount | grep cgroup | wc -l

If the cgroup count is in the thousands, reboot the system to clear the cgroups.
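The same count can be taken in a script. The following Python sketch mirrors the mount | grep cgroup | wc -l pipeline on text that you pass in, such as captured mount output. The function names are hypothetical, and the threshold of 1000 is only one reading of "in the thousands".

```python
def count_cgroup_mounts(mount_output):
    """Count lines that mention 'cgroup', like `mount | grep cgroup | wc -l`."""
    return sum(1 for line in mount_output.splitlines() if "cgroup" in line)


def looks_like_cgroup_leak(mount_output, threshold=1000):
    """Return True if the cgroup mount count reaches the assumed threshold."""
    return count_cgroup_mounts(mount_output) >= threshold
```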

Helm status errors when starting PowerAI Vision

Problem

There is an issue in some RHEL releases that causes the startup of PowerAI Vision to fail after the host system is restarted. When this is the problem, the system tries to initialize Helm at 30-second intervals but never succeeds, so the startup never completes. You can verify this status by running the helm status vision command:

# /opt/powerai-vision/bin/helm status vision

Result:

Error: getting deployed release "vision": Get https://10.10.0.1:443/api/v1/namespaces/kube-system/configmaps[...]: dial tcp 10.10.0.1:443: getsockopt: no route to host

Solution

To solve this problem, you must follow these steps exactly as written:

Cancel PowerAI Vision startup by pressing Ctrl+C.

Stop PowerAI Vision by running this command:

sudo /opt/powerai-vision/bin/powerai_vision_stop.sh

Modify the Red Hat Enterprise Linux (RHEL) settings as follows:

sudo nmcli connection modify docker0 connection.zone trusted

sudo systemctl stop NetworkManager.service

sudo firewall-cmd --permanent --zone=trusted --change-interface=docker0

sudo systemctl start NetworkManager.service

sudo nmcli connection modify docker0 connection.zone trusted

sudo systemctl restart docker.service

Start PowerAI Vision again:

sudo /opt/powerai-vision/bin/powerai_vision_start.sh

If the above commands do not fix the startup issue, check for a cgroup leak that can impact Docker. One symptom of a cgroup leak is that the df command is slow to respond. To check for excessive cgroup mounts, run the mount command:

$ mount | grep cgroup | wc -l

If the cgroup count is in the thousands, reboot the system to clear the cgroups.

Some PowerAI Vision functions don't work

Problem

PowerAI Vision seems to start correctly, but some functions, like automatic labeling or automatic frame capture, do not function.

To verify that this is the problem, run /opt/powerai-vision/bin/kubectl.sh get pods and check whether one or more pods are in the CrashLoopBackOff state. For example:

kubectl get pods

NAME                                                              READY     STATUS             RESTARTS   AGE

...

powerai-vision-video-rabmq-5d5d786f9f-7jfk9                       0/1       CrashLoopBackOff   2          54s

Solution

PowerAI Vision requires IPv6. Enable IPv6 on the system.
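To see whether IPv6 is currently disabled on a Linux host, you can read the kernel setting net.ipv6.conf.all.disable_ipv6, which is exposed under /proc/sys. The following Python sketch is a hypothetical helper. It assumes that a missing file means that IPv6 support is not loaded at all, and it checks only the "all" setting, so individual interfaces can still differ.

```python
import os


def ipv6_enabled(flag_path="/proc/sys/net/ipv6/conf/all/disable_ipv6"):
    """Return True if the kernel reports IPv6 as enabled for all interfaces."""
    if not os.path.exists(flag_path):
        # Assumption: a missing file means IPv6 support is not loaded.
        return False
    with open(flag_path) as flag:
        # The file holds "0" when IPv6 is enabled and "1" when disabled.
        return flag.read().strip() == "0"
```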

Command line tool fails - missing options

Problem: You receive errors using one of the PowerAI Vision command line tools, indicating that parameters are missing or options are not recognized.
Solution: Validate all of the hyphen (-) characters used to specify command line options. When using international keyboards, a similar but different character might have been entered for command line options in a shell. For example, Unicode contains multiple similar characters: hyphen-minus (the ASCII hyphen), hyphen, em dash, and others. Only the hyphen-minus character is valid to indicate a command line option.
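Look-alike characters are hard to spot by eye, so a small script can help. The following Python sketch reports the position and Unicode name of each look-alike dash in a command string. The function name is hypothetical, and the character list is illustrative rather than exhaustive.

```python
# Characters that resemble the ASCII hyphen-minus (U+002D) but are not
# valid option prefixes. This list is illustrative, not exhaustive.
LOOKALIKE_DASHES = {
    "\u2010": "HYPHEN",
    "\u2011": "NON-BREAKING HYPHEN",
    "\u2012": "FIGURE DASH",
    "\u2013": "EN DASH",
    "\u2014": "EM DASH",
    "\u2212": "MINUS SIGN",
}


def find_lookalike_dashes(command):
    """Return (index, name) for each look-alike dash found in command."""
    return [
        (index, LOOKALIKE_DASHES[char])
        for index, char in enumerate(command)
        if char in LOOKALIKE_DASHES
    ]
```

An empty result means that none of the listed characters are present; retype any reported character as a plain hyphen-minus.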