Troubleshooting known issues - PowerAI Vision standard install

Following are some problems you might encounter when using PowerAI Vision, along with steps to fix them.

The PowerAI Vision GUI does not work

Problem
You cannot label objects, view training charts, or create categories.
Solution
Verify that you are using a supported web browser. The following web browsers are supported:
  • Google Chrome Version 60, or later
  • Firefox Quantum 59.0, or later

Resource pages are not being populated in the user interface

Problem
Resource pages, such as data sets and models, are not being populated. Notifications indicate that there is an error obtaining the resource. For example, "Error obtaining data sets."
Solution

Check the status of the powerai-vision-portal pod. This pod provides the data to the user interface, and until it is ready (1/1) with a status of Running, these errors will occur. See Checking Kubernetes node status for instructions.

If the application is restarting, there is an expected delay before all services are available and fully functioning. Otherwise, this may indicate an unexpected termination (error) of the powerai-vision-portal pod. If that happens, follow these instructions: Gather PowerAI Vision logs and contact support.
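The pod check described above can be scripted. A minimal sketch, run here against captured output (the pod name suffix is illustrative); in practice, pipe the output of /opt/powerai-vision/bin/kubectl.sh get pods into the function:

```shell
# portal_ready reads a `kubectl get pods` listing on stdin and succeeds only
# when the powerai-vision-portal line shows READY 1/1 and STATUS Running.
portal_ready() {
  awk '/powerai-vision-portal/ && $2 == "1/1" && $3 == "Running" { ok = 1 }
       END { exit ok ? 0 : 1 }'
}

# Example against captured output:
printf '%s\n' \
  'NAME                           READY  STATUS   RESTARTS  AGE' \
  'powerai-vision-portal-abc123   1/1    Running  0         4d' \
  | portal_ready && echo "portal is ready"
```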

Unexpected / old pages displayed when accessing the user interface

Problem
After updating, reinstalling, or restarting PowerAI Vision, the browser presents pages that are from the previous version or are stale.
Solution
This problem is typically caused by the browser using a cached version of the page. To solve the problem, try one of these methods:
  • Use a Firefox Private Window to access the user interface.
  • Use a Chrome Incognito Window to access the user interface.
  • Bypass the browser cache:
    • In most Windows and Linux browsers: Hold down Ctrl and press F5.
    • In Chrome and Firefox for Mac: Hold down ⌘ Cmd and ⇧ Shift and press R.

PowerAI Vision does not play video

Problem
You cannot upload a video, or after the video is uploaded the video does not play.
Solution
Verify that your video is a supported type:
  • Ogg Vorbis (.ogg)
  • VP8 or VP9 (.webm)
  • H.264 encoded videos with MP4 format (.mp4)
If your video is not in a supported format, transcode your video by using a conversion utility. Such utilities are available under various free and paid licenses.
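As a first-pass filter, the supported container formats can be checked by file extension. This sketch inspects only the file name; an .mp4 must still be H.264-encoded, which an extension check cannot prove. One common conversion utility is ffmpeg (for example, ffmpeg -i input.avi -c:v libx264 output.mp4).

```shell
# supported_video succeeds when the file name uses one of the supported
# container extensions (.ogg, .webm, .mp4). Extension only -- the codec
# inside the container is not checked.
supported_video() {
  case "$1" in
    *.ogg|*.webm|*.mp4) return 0 ;;
    *) return 1 ;;
  esac
}

supported_video clip.avi || echo "transcode clip.avi before uploading"
```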

Auto detection video does not play in Firefox browser

Problem
The Firefox browser reports "The media playback was aborted due to a corruption problem or because the media used features your browser did not support". This happens in versions of the Firefox browser that do not support YUV444 chroma subsampling, which prevents the video from being played successfully.
Solution
Use a version of Firefox that supports YUV444 chroma subsampling or use a different browser (such as Chrome) that does support it.

Out of space error from load_images.sh

Problem
When installing the product, the load_images.sh script is used to load the PowerAI Vision Docker images. The script might terminate with errors, the most frequent issue being insufficient disk space for loading the Docker images.

For example, the /var/lib/docker file system can run out of space, resulting in a message indicating that an image was not fully loaded. The following output shows that the Docker image powerai-vision-dnn was not able to be fully loaded because of insufficient file system space:

root@kottos-vm1:~# df --output -BG "/var/lib/docker/"
Filesystem     Type  Inodes  IUsed   IFree IUse% 1G-blocks  Used Avail Use% File             Mounted on
/dev/vda2      ext4 8208384 595697 7612687    8%      124G   81G   37G  70% /var/lib/docker/ /
root@kottos-vm1:~#

******************************************************************************************
892d6f64ce41: Loading layer [==================================================>]  21.26MB/21.26MB
785af1d0c551: Loading layer [==================================================>]  1.692MB/1.692MB
dc102f4a3565: Loading layer [==================================================>]  747.9MB/747.9MB
aac4b03de02a: Loading layer [==================================================>]  344.1MB/344.1MB
d0ea7f5f6aab: Loading layer [==================================================>]  2.689MB/2.689MB
62d3d10c6cc2: Loading layer [==================================================>]  9.291MB/9.291MB
240c4d86e5c7: Loading layer [==================================================>]    778MB/778MB
889cd0648a86: Loading layer [==================================================>]  2.775MB/2.775MB
56bbb2f20054: Loading layer [==================================================>]  3.584kB/3.584kB
3d3c7acb72e2: Loading layer [================================>                  ]  2.117GB/3.242GB
Error processing tar file(exit status 1): write /usr/bin/grops: no space left on device


[ FAIL ] Some images failed to load
[ FAIL ]   Failure info:
             Loading the PowerAI Vision docker images...
root@kottos-vm1:~#

This situation can also be seen in the output from /opt/powerai-vision/bin/kubectl get pods, which shows the images that could not be loaded with a status of ErrImagePull or ImagePullBackOff. This command is described in Checking the application and environment.

Solution
The file system space for /var/lib/docker needs to be increased, even if the file system is not completely full. There might still be free space in the file system where /var/lib/docker is located, but not enough for the PowerAI Vision Docker images. Use operating system mechanisms to do this, such as moving or mounting /var/lib/docker to a file system partition with more space.

After the error situation has been addressed by increasing or cleaning up disk space on the /var/lib/docker/ file system, re-run the load_images.sh script to continue loading the images. No clean up of the previous run of load_images.sh is required.
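Before re-running load_images.sh, it can help to confirm that /var/lib/docker now has room. A sketch, where NEEDED_GB is an assumption (size it to the images that still have to load); the df command in the comment is one way to obtain the available figure:

```shell
# NEEDED_GB is a placeholder threshold -- adjust it for your image set.
NEEDED_GB=40

enough_space() {
  # $1 = available space in GB, e.g. from:
  #   df -BG --output=avail /var/lib/docker | tail -1 | tr -dc '0-9'
  [ "$1" -ge "$NEEDED_GB" ]
}

enough_space 37 || echo "free up or add space before re-running load_images.sh"
```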

All PowerAI Vision GPU jobs are scheduled to GPU 0

Problem
All model training and deployments are running on GPU 0.
Solution
Ensure that the nvidia-container-runtime-hook package is not installed on the system:
# rpm -qa | grep nvidia-container-runtime
nvidia-container-runtime-hook-1.4.0-2.ppc64le

Uninstall the package if it is installed, since this affects the visibility of GPUs and usage by the PowerAI Vision containers.
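The check and removal can be combined in a short sketch; the sample package string is taken from the output above, and in practice the listing comes from rpm -qa:

```shell
# has_runtime_hook succeeds when the package listing on stdin contains the
# nvidia-container-runtime-hook package.
has_runtime_hook() {
  grep -q 'nvidia-container-runtime-hook'
}

# In practice: rpm -qa | has_runtime_hook
if echo 'nvidia-container-runtime-hook-1.4.0-2.ppc64le' | has_runtime_hook; then
  echo "remove it with: sudo rpm -e nvidia-container-runtime-hook"
fi
```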

I forgot my user name or password

Problem
You forgot your user name or password and cannot log in to the PowerAI Vision GUI.
Solution
PowerAI Vision uses an internally managed user account database. To change your user name or password, see Logging in to PowerAI Vision.

GPUs are not available for training or inference

Problem

If PowerAI Vision cannot perform training or inference operations, check the following:

  • Verify that the nvidia-smi output shows all relevant information about the GPU devices. For example, the following output shows Unknown Error messages, indicating that the GPUs are not in the proper state:
    Mon Dec  3 15:43:07 2018       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
    | N/A   31C    P0    49W / 300W | Unknown Error        |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    ...
  • Verify that the nvidia-persistenced service is enabled and running (active) by using the command sudo systemctl status nvidia-persistenced:
    # systemctl status nvidia-persistenced
    *  nvidia-persistenced.service - NVIDIA Persistence Daemon
        Loaded: loaded (/etc/systemd/system/nvidia-persistenced.service; enabled; vendor preset: disabled)
        Active: active (running) since Tue 2018-11-13 08:41:22 CST; 2 weeks 6 days ago
    ...
Solution
If the GPU status indicates errors and the nvidia-persistenced service is not enabled and active, enable and start the service:
  1. Enable the service:
    sudo systemctl enable nvidia-persistenced
  2. Start the service:
    sudo systemctl start nvidia-persistenced
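A quick scripted version of the first check: scan the nvidia-smi output for the Unknown Error state shown above. The sample line is taken from that output; in practice, pipe nvidia-smi into the function:

```shell
# gpus_healthy succeeds when the nvidia-smi output on stdin contains no
# "Unknown Error" entries.
gpus_healthy() {
  ! grep -q 'Unknown Error'
}

printf '| N/A   31C    P0    49W / 300W | Unknown Error        |\n' \
  | gpus_healthy || echo "enable and start nvidia-persistenced"
```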

PowerAI Vision cannot train a model

Problem
The model training process might fail if your system does not have enough GPU resources.
Solution
  • If you are training a data set for image classification, verify that at least two image categories are defined, and that each category has a minimum of five images.
  • If you are training a data set for object detection, verify that at least one object label is used. You must also verify that each object is labeled in a minimum of five images.
  • Ensure that enough GPUs are available. PowerAI Vision assigns one GPU to each active training job or deployed deep learning API. For example, if a system has four GPUs and you have two deployed web APIs, there are two GPUs available for active training jobs. If a training job appears to be hanging, it might be waiting for another training job to complete, or there might not be a GPU available to run it.

    To determine how many GPUs are available on the system, run the sudo /opt/powerai-vision/bin/kubectl.sh describe nodes command and review the NvidiaGPU Limits column.

    The following is an example of the output from sudo /opt/powerai-vision/bin/kubectl.sh describe nodes that shows two GPUs currently in use:

    Name:               127.0.0.1
    Roles:              <none>
    Labels:             beta.kubernetes.io/arch=ppc64le
                        beta.kubernetes.io/os=linux
                        gpu/nvidia=TeslaV100-SXM2-16GB
                        kubernetes.io/hostname=127.0.0.1
    Annotations:        node.alpha.kubernetes.io/ttl=0
                        volumes.kubernetes.io/controller-managed-attach-detach=true...
    Allocated resources:
                       (Total limits may be over 100 percent, i.e., overcommitted.)
                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  NvidiaGPU Limits
                       --------------------------------------------------------------------------
                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         2 (50%)
    Events:         <none>
    If all of the system's GPUs are in use, you can either delete a deployed web API (making the API unavailable for inference) or you can stop a training model that is running.
    • To delete a deployed model, click Deployed Models. Next, select the model that you want to delete and click Delete. The trained model is not deleted from PowerAI Vision. You can redeploy the model later when more GPUs are available.
    • To stop a training model that is running, click Models. Next, select the model that has a status of Training in Progress and click Stop Training.
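The in-use count can be extracted from the describe nodes output rather than read by eye. A sketch that parses the NvidiaGPU Limits column shown above (the value row sits two lines below the header); in practice, pipe the output of sudo /opt/powerai-vision/bin/kubectl.sh describe nodes into the function:

```shell
# gpus_in_use prints the NvidiaGPU Limits count (e.g. "2" from "2 (50%)")
# from an "Allocated resources" table read on stdin.
gpus_in_use() {
  awk '/NvidiaGPU Limits/ { header = NR }                   # column header row
       header && NR == header + 2 { print $(NF-1); exit }'  # value row below the dashes
}

printf '%s\n' \
  'CPU Requests  CPU Limits  Memory Requests  Memory Limits  NvidiaGPU Limits' \
  '--------------------------------------------------------------------------' \
  '0 (0%) 0 (0%) 0 (0%) 0 (0%) 2 (50%)' \
  | gpus_in_use   # prints 2
```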

Training or deployment hangs - Kubernetes pod cleanup

Problem
You submit a job for training or deployment, but it never completes. Sometimes pods that are running previous jobs are not terminated correctly by the Kubernetes services. These pods continue to hold GPUs, so new training or deployment jobs cannot complete; they remain in the Scheduled state indefinitely.

To verify that this is the problem, run kubectl get pods and review the output. The last column shows the age of the pod. If it is older than a few minutes, use the information in the Solution section to solve the problem.

Example:

kubectl get pods 
powerai-vision-infer-ic-06767722-47df-4ec1-bd58-91299255f6hxxzk 1/1 Running 0 22m 
powerai-vision-infer-ic-35884119-87b6-4d1e-a263-8fb645f0addqd2z 1/1 Running 0 22m 
powerai-vision-infer-ic-7e03c8f3-908a-4b52-b5d1-6d2befec69ggqw5 1/1 Running 0 5h 
powerai-vision-infer-od-c1c16515-5955-4ec2-8f23-bd21d394128b6k4 1/1 Running 0 3h
Solution
Follow these steps to manually delete the deployments that are hanging.
  1. Determine the running deployments and look for those that have been running longer than a few minutes:
    kubectl get deployments
  2. Delete the deployments that were identified as hanging in the previous step.
    kubectl delete deployment deployment_id
  3. You can now try the training or deploy again, assuming there are available GPUs.
Note: When a deployment is manually deleted, vision-service might try to recreate it when it is restarted. The only way to force Kubernetes to permanently delete it is to remove the failing model from PowerAI Vision.
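When scripting the cleanup, the AGE column (22m, 5h, 4d in the example above) can be normalized to minutes before comparing against a threshold. A sketch:

```shell
# age_to_minutes converts a kubectl AGE value (e.g. 90s, 22m, 5h, 4d) to minutes.
age_to_minutes() {
  case "$1" in
    *d) echo $(( ${1%d} * 1440 )) ;;
    *h) echo $(( ${1%h} * 60 )) ;;
    *m) echo "${1%m}" ;;
    *s) echo 0 ;;
  esac
}

age_to_minutes 5h   # prints 300
```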

Model training and inference fails

Problem
The NVIDIA GPU device is not accessible by the PowerAI Vision Docker containers. To confirm this, run kubectl logs -f powerai-vision-portal-ID and then check pod_powerai-vision-portal-ID_powerai-vision-portal.log for an error indicating error == cudaSuccess (30 vs. 0):
F0731 20:34:05.334903    35 common.cpp:159] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
*** Check failure stack trace: ***
/opt/py-faster-rcnn/FRCNN/bin/train_frcnn.sh: line 24:    35 Aborted                 (core dumped) _train_frcnn.sh
Solution
Use sudo to alter the SELinux permissions for all of the NVIDIA devices so that they are accessible by the PowerAI Vision Docker containers:
sudo chcon -t container_file_t /dev/nvidia*

Auto labeling of a data set returns "Auto Label Error"

Problem
Auto labeling cannot be performed on a data set that has no unlabeled images, unless some of the images were previously labeled by the auto label function.
Solution
Ensure that the Objects section of the data set side bar shows there are objects that are "Unlabeled". If there are none, that is, if "Unlabeled (0)" is displayed in the side bar, add new images that are unlabeled or remove labels from some images, then run auto label again.

PowerAI Vision does not start

Problem
When you enter the URL for PowerAI Vision from a supported web browser, nothing is displayed. You see a 404 error or Connection Refused message.
Solution
Complete the following steps to solve this problem:
  1. Verify that IP version 4 (IPv4) port forwarding is enabled by running the /sbin/sysctl net.ipv4.conf.all.forwarding command and verifying that the value for net.ipv4.conf.all.forwarding is set to 1.

    If IPv4 port forwarding is not enabled, run the /sbin/sysctl -w net.ipv4.conf.all.forwarding=1 command. For more information about port forwarding with Docker, see UCP requires IPv4 IP Forwarding in the Docker success center.

  2. If IPv4 port forwarding is enabled and the docker0 interface is a member of the trusted zone, check the Helm chart status by running this script:
    sudo /opt/powerai-vision/bin/helm.sh status vision

    In the script output, verify that the PowerAI Vision components are available by locating the Deployment section and identifying that the AVAILABLE column has a value of 1 for each component. The following is an example of the output from the helm.sh status vision script that shows all components are available:

    RESOURCES:
    ==> v1beta1/Deployment
    NAME                              DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
    powerai-vision-mongodb            1        1        1           1          4d
    powerai-vision-portal             1        1        1           1          4d
    powerai-vision-postgres           1        1        1           1          4d
    powerai-vision-taskanaly          1        1        1           1          4d
    powerai-vision-ui                 1        1        1           1          4d
    powerai-vision-video-nginx        1        1        1           1          4d
    powerai-vision-video-portal       1        1        1           1          4d
    powerai-vision-video-rabmq        1        1        1           1          4d
    powerai-vision-video-redis        1        1        1           1          4d
    powerai-vision-video-test-nginx   1        1        1           1          4d
    powerai-vision-video-test-portal  1        1        1           1          4d
    powerai-vision-video-test-rabmq   1        1        1           1          4d
    powerai-vision-video-test-redis   1        1        1           1          4d

    If you recently started PowerAI Vision and some components are not available, wait a few minutes for these components to become available. If any components remain unavailable, gather the logs and contact IBM® Support, as described in this topic: Gather PowerAI Vision logs and contact support.

  3. If the docker0 interface is a member of a trusted zone and all PowerAI Vision components are available, configure the firewall to allow communication through port 443 (the port used to connect to PowerAI Vision) by running this command:
    sudo firewall-cmd --permanent --zone=public --add-port=443/tcp
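Step 1 above can be scripted by parsing the sysctl output. A sketch, run here against a captured line; in practice, pipe the output of /sbin/sysctl net.ipv4.conf.all.forwarding into the function:

```shell
# forwarding_enabled reads "net.ipv4.conf.all.forwarding = N" on stdin and
# succeeds only when N is 1.
forwarding_enabled() {
  awk -F' = ' '$1 == "net.ipv4.conf.all.forwarding" { on = ($2 == 1) }
               END { exit on ? 0 : 1 }'
}

echo 'net.ipv4.conf.all.forwarding = 1' | forwarding_enabled \
  && echo "IPv4 forwarding is enabled"
```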

PowerAI Vision fails to start - Kubernetes connection issue

Problem
If the host system does not have a default route defined in the networking configuration, the Kubernetes cluster will fail to start with connection issues. For example:
$ sudo /opt/powerai-vision/bin/powerai_vision_start.sh
INFO: Setting up GPU...
[...]
Checking kubernetes cluster status...
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #1:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #2:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #3:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #4:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #5:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #6:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #7:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #8:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #9:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #10:
The connection to the server 127.0.0.1:8080 was refused - did you specify the right host or port?
INFO: Probing cluster status #11:
ERROR: Retry timeout. Error in starting kubernetes cluster, please check /opt/powerai-vision/log/kubernetes for logs.
Solution
Define a default route in the networking configuration. For instructions to do this on Red Hat Enterprise Linux (RHEL), refer to 2.2.4 Static Routes and the Default Gateway in the Red Hat Customer Portal.
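Whether a default route is present can be checked from the ip route show output. A sketch against captured output (the addresses are illustrative):

```shell
# has_default_route succeeds when the routing table on stdin contains a
# default route entry.
has_default_route() {
  grep -q '^default '
}

# In practice: ip route show | has_default_route
printf 'default via 192.168.1.1 dev eth0\n192.168.1.0/24 dev eth0\n' \
  | has_default_route && echo "default route present"
```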

PowerAI Vision startup hangs - helm issue

Problem
PowerAI Vision startup hangs with the message "Unable to start helm within 30 seconds - trying again." For example:
root> sudo /opt/powerai-vision/bin/powerai_vision_start.sh 
Checking ports usage...
Checking ports completed, no confict port usage detected.
[ INFO ] Setting up the GPU...
         Init cuda devices...
         Devices init completed!
         Persistence mode is already Enabled for GPU 00000004:04:00.0.
         Persistence mode is already Enabled for GPU 00000004:05:00.0.
         Persistence mode is already Enabled for GPU 00000035:03:00.0.
         Persistence mode is already Enabled for GPU 00000035:04:00.0.
         All done.
[ INFO ] Starting kubernetes...
         Checking kubernetes cluster status...
         Probing cluster status #1: NotReady
         Probing cluster status #2: NotReady
         Probing cluster status #3: NotReady
         Probing cluster status #4: Ready
         Booting up ingress controller...
         Initializing helm...
         [ WARN ] Unable to start helm within 30 seconds - trying again.  If this continues, contact support.
         [ WARN ] Unable to start helm within 30 seconds - trying again.  If this continues, contact support.
         [ WARN ] Unable to start helm within 30 seconds - trying again.  If this continues, contact support.
         [ WARN ] Unable to start helm within 30 seconds - trying again.  If this continues, contact support.
Solution
To solve this problem, you must follow these steps exactly as written:
  1. Cancel PowerAI Vision startup by pressing Ctrl+C.
  2. Stop PowerAI Vision by running this command:
    sudo /opt/powerai-vision/bin/powerai_vision_stop.sh
  3. Modify the RHEL settings as follows:
    sudo nmcli connection modify docker0 connection.zone trusted
    sudo systemctl stop NetworkManager.service
    sudo firewall-cmd --permanent --zone=trusted --change-interface=docker0
    sudo systemctl start NetworkManager.service
    sudo nmcli connection modify docker0 connection.zone trusted
    sudo systemctl restart docker.service
  4. Start PowerAI Vision again:
    sudo /opt/powerai-vision/bin/powerai_vision_start.sh 

If the above commands do not fix the startup issue, check for a cgroup leak that can impact Docker. A Kubernetes/Docker issue can cause this situation, and after fixing the firewall issue the startup can still fail if there was cgroup leakage.

One symptom of this situation is that the df command is slow to respond. To check for excessive cgroup mounts, run the mount command:
$ mount | grep cgroup | wc -l
If the cgroup count is in the thousands, reboot the system to clear the cgroups.

Helm status errors when starting PowerAI Vision

Problem
There is an issue in some RHEL releases that causes the startup of PowerAI Vision to fail after restarting the host system. When this is the problem, the system tries to initialize Helm at 30 second intervals but never succeeds, so the startup never completes. You can verify this status by running the helm status vision command:
# /opt/powerai-vision/bin/helm status vision
Result:
Error: getting deployed release "vision": Get https://10.10.0.1:443/api/v1/namespaces/kube-system/configmaps[...]: dial tcp 10.10.0.1:443: getsockopt: no route to host
Solution
To solve this problem, you must follow these steps exactly as written:
  1. Cancel PowerAI Vision startup by pressing Ctrl+C.
  2. Stop PowerAI Vision by running this command:
    sudo /opt/powerai-vision/bin/powerai_vision_stop.sh
  3. Modify the RHEL settings as follows:
    sudo nmcli connection modify docker0 connection.zone trusted
    sudo systemctl stop NetworkManager.service
    sudo firewall-cmd --permanent --zone=trusted --change-interface=docker0
    sudo systemctl start NetworkManager.service
    sudo nmcli connection modify docker0 connection.zone trusted
    sudo systemctl restart docker.service
  4. Start PowerAI Vision again:
    sudo /opt/powerai-vision/bin/powerai_vision_start.sh 

If the above commands do not fix the startup issue, check for a cgroup leak that can impact Docker. A Kubernetes/Docker issue can cause this situation, and after fixing the firewall issue the startup can still fail if there was cgroup leakage.

One symptom of this situation is that the df command is slow to respond. To check for excessive cgroup mounts, run the mount command:
$ mount | grep cgroup | wc -l
If the cgroup count is in the thousands, reboot the system to clear the cgroups.

Uploading a large file fails

Problem
When uploading files into a data set, there is a 2 GB size limit per upload session. This limit applies to a single .zip file or to a set of files. When you upload a large file that is under 2 GB, you might see the upload start (showing a progress bar), but then an error message is displayed in the user interface. This error is caused by an Nginx timeout: the file upload takes longer than the defined 5-minute Nginx timeout.

Solution
Despite the notification error, the large file has been uploaded. Refresh the page to see the uploaded files in the data set.
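To avoid hitting the 2 GB per-session limit in the first place, a file's size can be checked before uploading. A sketch; the stat invocation in the comment is one way to obtain the size in bytes:

```shell
# Per-session upload limit: 2 GB, expressed in bytes.
LIMIT=$((2 * 1024 * 1024 * 1024))

under_upload_limit() {
  # $1 = file size in bytes, e.g. from: stat -c %s dataset.zip
  [ "$1" -le "$LIMIT" ]
}

under_upload_limit $((3 * 1024 * 1024 * 1024)) \
  || echo "split the upload into smaller batches"
```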

Some PowerAI Vision functions don't work

Problem
PowerAI Vision seems to start correctly, but some functions, like automatic labeling or automatic frame capture, do not function.
To verify that this is the problem, run /opt/powerai-vision/bin/kubectl.sh get pods and verify that one or more pods are in state CrashLoopBackOff. For example:
kubectl get pods
NAME                                                              READY     STATUS             RESTARTS   AGE
...
powerai-vision-video-rabmq-5d5d786f9f-7jfk9                       0/1       CrashLoopBackOff   2          54s
Solution
PowerAI Vision requires IPv6. Enable IPv6 on the system.
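The IPv6 state can be confirmed from the disable_ipv6 sysctl, where 0 means IPv6 is enabled. A sketch checking that value; in practice obtain it with sysctl -n net.ipv6.conf.all.disable_ipv6:

```shell
# ipv6_enabled succeeds when the kernel's disable_ipv6 flag is 0 (IPv6 on).
ipv6_enabled() {
  # $1 = value of net.ipv6.conf.all.disable_ipv6
  [ "$1" = "0" ]
}

ipv6_enabled "1" || echo "enable IPv6, then restart PowerAI Vision"
```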