Example: Detecting objects in a video

In this fictional scenario, you want to create a deep learning model to monitor traffic on a busy road. You have a video that shows the traffic during the day. From this video, you want to know how many cars are on the road each day and the peak times when the most cars are on the road.

The video file used in this scenario is available for download here: Download car video.

Import a video and create a data set

First, create a data set and add videos to it.
  1. Log in to PowerAI Vision.
  2. Click Data Sets in the navigation bar to open the Data Sets page.
  3. From the Data Sets page, click the icon to create a new data set and name it Traffic Video.
  4. To add a video to the data set, click the Traffic Video data set and click Import file or drag the video to the + area.
    Important: Do not navigate away from PowerAI Vision or refresh the browser until the upload completes. You can, however, navigate to different pages within PowerAI Vision during the upload.

Labeling objects in a video

The next step is to label objects in the video. For object detection, each object must be labeled in at least five frames. We will create Car and Motorcycle objects and will label at least five frames that contain cars and at least five frames that contain motorcycles.

  1. Select the video from your data set and select Label Objects.
  2. Capture frames by using one of these methods:
    • Click Auto capture frames and specify a value for Capture Interval (Seconds) that will result in at least five frames. We will select this option and specify 10 seconds.
      Note: Depending on the length and size of the video and the interval you specified to capture frames, the process to capture frames can take several minutes.
    • Click Capture frame to manually capture frames. If you use this option, you must capture a minimum of five frames from the video.
  3. If you used Auto capture frames, verify that there are enough of each object type in the video frames. If not, follow these steps to add new frames to the existing data set.

    In this scenario, the motorcycle appears in only a single automatically captured frame, at 40 seconds. Therefore, we must manually capture at least four more frames that contain the motorcycle. The motorcycle comes into view at 36.72 seconds. To correctly capture the motorcycle in motion, we will capture a frame there and create extra frames at 37.79 seconds, 41.53 seconds, and 42.61 seconds.

    1. Play the video. When the frame you want is displayed, click pause.
    2. Click Capture Frame.
  4. Create new object labels for the data set by clicking Add new by the Objects list. Enter Car, click Add, then enter Motorcycle, then click OK.
    Note: If you later want to delete the label, it must be done at the data set level. It cannot be done from an individual frame or image.
  5. Label the objects in the frames:
    • Select the first frame in the carousel.
    • Select the correct object label, for example, "Car".
    • Choose Box or Polygon at the bottom left, depending on the shape you want to draw around each object. Boxes are faster to label and train, but less accurate. Only Detectron models support polygons. If you label your objects with polygons but then use the data set to train a model that does not support polygons, bounding boxes are derived from the polygons and used instead. Draw the appropriate shape around the object.
      Note: When Box or Polygon is selected, you have to hold down the Alt key for non-drawing interactions in the image. This includes trying to select, move, or edit previously drawn shapes in the image, and panning the image by using the mouse. To return to the normal mouse interactions, deselect the Box or Polygon button.
    Review the following tips about identifying and drawing objects in video frames and images:
    • Do not label part of an object. For example, do not label a car that is only partially in the frame.
    • If an image has more than one object, you must label all objects. For example, if you have cars and motorcycles defined as objects for the data set, and there is an image with both cars and motorcycles in it, you must label the cars and the motorcycles. Otherwise, you decrease the accuracy of the model.
    • Label each individual object. Do not label groups of objects. For example, if two cars are right next to each other, you must draw a label around each car.
    • Draw the shape as close to the objects as possible. Do not leave blank space around the objects.
    • You can draw shapes around objects that touch or overlap. For example, if one object is behind another object, you can label them both. However, it is recommended that you only label objects if the majority of the object is visible.
    • Use the zoom buttons (+ and -) on the bottom right side of the editing panels to help draw more accurate shapes.
      Note: If you are zoomed in on an image and use the right arrow key to move all the way to the right edge, you might have to click the left arrow key several times to start panning in the other direction.
    • Shapes cannot extend off the edge of the frame.
    • After defining a shape, you can copy and paste it elsewhere in the same image or in a different image by using standard keyboard shortcuts. After pasting the shape, it can be selected and dragged to the desired location in the image. The shape can also be edited to add or remove points in the outline.
      Note: To copy and paste a shape from one image to another, both images have to be available in the image carousel. From the data set, select all images that will share shapes, then click Label objects. All images will be listed in the image carousel in the left side of the Label objects window.
    • After a shape has been defined, you will no longer see the points on the outline. To edit a defined box, exit drawing mode, then edit the points as necessary. To exit drawing mode, do one of the following:
      • Click the object name on the right side of the window.
      • Alt+click (option +click) inside the defined box.
      After moving a defined point, drawing mode is automatically enabled again.
    • The video object preview does not support non-ASCII labels. This is a limitation of the module that generates the displayed label from the label name. A non-ASCII label is converted to a label that is all question marks: "?????".
    • Labeling with polygons
      • After a shape has been defined, you will no longer see the points on the outline. To edit a defined shape, exit drawing mode, then edit the points as necessary. To exit drawing mode, do one of the following:
        • Click the object name on the right side of the window.
        • Click inside the defined shape.
        When you are done editing the shape, click outside the shape to enter drawing mode again.
      • To delete a point from an outline, ctrl+click (or cmd+click).
      • To add a point to an outline, click the translucent white square between any two points on the outline.
      • To move a point on the outline, click it and drag.
The following figure displays the captured video frame at 41.53 seconds with object labels of Car and Motorcycle. Figure 1 also displays a box around the five frames (four of the frames were added manually) in the carousel that required object labels for the motorcycle that is in each frame.
Figure 1. Labeling objects in PowerAI Vision
The image shows the PowerAI Vision GUI: a screen capture of the video frame with object labels for the cars and the motorcycle. Below the video frame is an image carousel that contains frames from the video with time stamps.
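The minimum-frame requirement in step 2 can be checked with quick arithmetic. The sketch below estimates how many frames a given capture interval yields and the largest interval that still produces at least five frames. It assumes a frame is captured at 0 seconds and at every interval boundary, which may differ slightly from the exact behavior of Auto capture frames.

```python
import math

def auto_capture_frame_count(duration_s: float, interval_s: float) -> int:
    """Estimate frames captured at a fixed interval, starting at 0 s.
    (Assumes one frame per interval boundary; the exact Auto capture
    behavior in PowerAI Vision may differ slightly.)"""
    return math.floor(duration_s / interval_s) + 1

def max_interval_for(duration_s: float, min_frames: int = 5) -> float:
    """Largest capture interval that still yields at least min_frames."""
    return duration_s / (min_frames - 1)
```

For example, a 60-second video captured every 10 seconds yields about 7 frames, comfortably above the minimum of five per object, provided the object is visible in enough of them.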

Training a model

With all the object labels that are identified in your data set, you can now train your deep learning model. To train a model, complete the following steps:

  1. From the Data set page, click Train.
  2. Fill out the fields on the Train Data set page, ensuring that you select Object Detection. We will choose Accuracy (faster R-CNN) for Model selection.
  3. Click Train.
  4. (Optional - Only supported when training for object detection.) Stop the training process by clicking Stop training > Keep Model > Continue.
    You can wait for the entire model training process to complete, but you can optionally stop training when the lines in the training graph start to flatten out, as shown in the figure below. Improvements in training quality tend to plateau over time, so the fastest way to deploy a model and refine the data set is to stop the process before quality stops improving.
    Note: Use early stop with caution when training segmented object detection models (such as with Detectron), because larger iteration counts and training times have been demonstrated to improve accuracy even when the graph indicates that accuracy is plateauing. The precision of the labels can still be improving even after the accuracy of identifying the object location has stopped improving.
    Figure 2. Model training graph
    The image shows loss on the vertical axis and iterations on the horizontal axis. As the number of iterations increases, the loss line converges toward a flat line.
    Important: If the training graph converges quickly and has 100% accuracy, the data set does not have enough information. The same is true if the accuracy of the training graph fails to rise or the errors in the graph do not decrease at the end of the training process. For example, a model with high accuracy might be able to discover all instances of different race cars, but might have trouble differentiating between specific race cars or those that have different colors. In this situation, add more images, video frames, or videos to the data set, label them, then try the training again.
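The "lines flatten out" judgment in step 4 can be approximated in code. The following is a generic early-stop sketch, not a PowerAI Vision feature: the window size and improvement threshold are arbitrary illustrative choices.

```python
def loss_has_plateaued(losses, window=5, threshold=0.01):
    """Return True when the average per-iteration improvement in loss
    over the last `window` iterations falls below `threshold`."""
    if len(losses) < window + 1:
        return False
    recent = losses[-(window + 1):]
    # mean decrease in loss per iteration across the window
    avg_improvement = (recent[0] - recent[-1]) / window
    return avg_improvement < threshold
```

A steeply falling loss curve returns False (keep training); once the curve flattens, the function returns True and stopping early becomes a reasonable trade-off.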

Deploying a trained model

To deploy the trained model, complete the following steps. GPUs are used as follows:
  • Each Tiny YOLO V2, Detectron, Single Shot Detector (SSD), Structured segment network (SSN), or custom deployed model takes one GPU. The GPU group is listed as '-', which indicates that this model uses a full GPU and does not share the resource with any other deployed models.
  • Multiple Faster R-CNN and GoogLeNet models are deployed to a single GPU. PowerAI Vision uses packing to deploy the models. That is, the model is deployed to the GPU that has the most models deployed on it, if there is sufficient memory available on the GPU. The GPU group can be used to determine which deployed models share a GPU resource. To free up a GPU, all deployed models in a GPU group must be deleted (undeployed).
    Note: PowerAI Vision leaves a 500 MB buffer on the GPU.
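The packing rule described above — deploy to the GPU that already hosts the most models, provided enough memory remains after the buffer — can be sketched as a small scheduling function. The data structure and memory figures here are invented for illustration; PowerAI Vision does not expose this logic directly.

```python
BUFFER_MB = 500  # buffer PowerAI Vision leaves on each GPU

def pick_gpu(gpus, model_mb):
    """Pick the GPU with the most deployed models that still has room.

    `gpus` is a list of dicts such as
    {"id": 0, "free_mb": 4000, "models": 3} (illustrative fields).
    Returns the chosen GPU's id, or None if no GPU has enough memory.
    """
    candidates = [g for g in gpus if g["free_mb"] - BUFFER_MB >= model_mb]
    if not candidates:
        return None
    # pack: prefer the GPU that already has the most models deployed
    return max(candidates, key=lambda g: g["models"])["id"]
```

Because models are packed onto the busiest GPU first, the remaining GPUs stay free for model types (such as Tiny YOLO V2 or Detectron) that require a full GPU.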
  1. Click Models from the menu.
  2. Select the model you created in the previous section and click Deploy.
  3. Specify a name for the model, and click Deploy. The Deployed Models page is displayed, and the model is deployed when the status column displays Ready.
  4. Double-click the deployed model to get the API endpoint and test other videos or images against the model. For information about using the API, see the Vision Service API documentation.
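Once the model shows Ready, you can score images against its endpoint programmatically. The sketch below is hypothetical: the /dlapis/{model_id} path, the "files" form-field name, and the base URL are assumptions that you should verify against the Vision Service API documentation for your installation.

```python
import mimetypes
import urllib.request
import uuid

BASE_URL = "https://host/powerai-vision/api"  # replace with your server (assumption)

def inference_url(base_url: str, model_id: str) -> str:
    """Build the inference endpoint URL for a deployed model.
    (The /dlapis/{model_id} path is an assumption; confirm it against
    the Vision Service API documentation.)"""
    return f"{base_url}/dlapis/{model_id}"

def score_image(model_id: str, image_path: str) -> bytes:
    """POST an image to the deployed model and return the raw response."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(image_path)[0] or "application/octet-stream"
    with open(image_path, "rb") as f:
        payload = f.read()
    # build a minimal multipart/form-data body by hand (stdlib only)
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; '
        f'filename="{image_path}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode() + payload + f"\r\n--{boundary}--\r\n".encode()
    req = urllib.request.Request(
        inference_url(BASE_URL, model_id),
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The response is typically JSON describing the detected objects and their confidence scores; inspect a real response from your deployment before parsing it.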

Automatically label frames in a video

You can use the auto label function to automatically identify objects in the frames of a video after a model has been deployed.

In this scenario, you have only nine frames. To improve the accuracy for your deep learning model, you can add more frames to the data set. Remember, you can rapidly iterate by stopping the training on a model and checking the results of the model against a test data set. You can also use the model to auto label more objects in your data set. This process improves the overall accuracy of your final model.

To use the auto label function, complete the following steps:
Note: Any frames that were previously captured by using auto capture and were not manually labeled are deleted before auto labeling. This helps avoid labeling duplicate frames. Manually captured frames are not deleted.
  1. Click Data sets from the menu, and select the data set that you used to create the previously trained model.
  2. Select the video in the data set that had nine frames, and click Label Objects.
  3. Click Auto label.
  4. Specify how often you want to capture frames and automatically label them. Select the name of the trained model that you deployed in the previous section, and click Auto label. In this scenario, you previously captured frames every 10 seconds. To improve the accuracy of the deep learning model by capturing and labeling more frames, you can specify 6 seconds.
  5. After the auto label process completes, the new frames are added to the carousel. Click the new frames and verify that the objects have the correct labels. Object labels that were added automatically are shown in green; labels that you added manually are shown in blue. In this scenario, the carousel now has 17 frames.

Next steps

You can manipulate (move or resize) the labels that were automatically generated. You can also save or reject individual labels, or you can reject them all by selecting Clear all. Saving or manipulating a label converts it to a manually added label. Rejecting a label deletes it. If you run Auto label again, any images or frames that now have manually added labels are skipped.
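The skip rule described above can be modeled simply: a new Auto label run processes only frames that carry no manually added (or saved, and hence converted) labels. A sketch of that filter follows; the field names are invented for illustration.

```python
def frames_to_auto_label(frames):
    """Return the frames a new Auto label run would process.

    `frames` is a list of dicts such as
    {"time_s": 10, "labels": [{"source": "auto"}, {"source": "manual"}]}
    (illustrative fields). Frames with any manual label are skipped.
    """
    return [
        f for f in frames
        if not any(lbl["source"] == "manual" for lbl in f["labels"])
    ]
```

Note that a frame with no labels at all is still processed, which is why rejecting all of a frame's labels with Clear all makes it eligible for the next Auto label pass.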

You can continue to refine the data set as much as you want. When you are satisfied with the data set, you can retrain the model by repeating the steps in Training a model. This time, you might want to train the model for a longer time to improve its overall accuracy. The loss lines in the training graph should converge to a stable flat line; the lower the loss lines are, the better. After training completes, you can redeploy the model by repeating the steps in Deploying a trained model. You can double-click the deployed model to get the API endpoint and test other videos or images against the model.