April 27, 2020 4 min read

I recently wrote about how important data preparation is in the lifecycle of an AI project. Providing good data to an AI model is key to making it thrive. However, training it also plays an important role and can impose a few challenges.

An AI model learns by repetition. If you don’t train a model enough, it can’t properly adjust the weights of its neural network. (Read my overview of AI stages for a primer on training and inferencing.) The way you train it also impacts its usefulness and accuracy when it tries to provide an answer to an input with real data. Effectively training an AI model is comparable in some ways to teaching a child in school through repeated instruction, exercises and tests.

Let’s analyze how this works. When you’re training an AI model, first you need to feed it with data. It will try to learn and then go through a test. In each test, the model is scored against the expected answer to determine the accuracy of the model. Ideally, we want an accuracy as close as possible to 100 percent. When the AI model begins learning from training data, it yields low accuracy, meaning it doesn’t yet understand what we’re trying to make it achieve. Thus, to improve its accuracy, it is repetitively exposed to rounds of training. Each round is called an epoch (iteration), and the model is benchmarked again at the end. A child at school, similarly, might not do well in early tests but through repetitive learning can improve until a desired score is reached.

Figure 1 shows the training of a model. Notice how accuracy improves with the number of epochs. The final accuracy of this model, however, is still at around just 60 percent.

Figure 1: Model accuracy for 20K epochs [Source: Ex. IBM Watson Machine Learning Accelerator]

## AI model training challenges

As with learning at school, some problems may occur with an AI model. Suppose that a student goes over a round of exercises; then, at the end, you apply a test that has some of the exact same exercises he or she learned from. That would be an easy test since the child has seen the answers before. That can happen with a model as well, if you benchmark it using the same data used to train it. Therefore, a model needs to be benchmarked for accuracy using different data than what it was trained with. A data scientist must be careful to plan for that during the training.

Some other common challenges may happen during training. One is overfitting, which happens when the model picks up too much detail from the training data, to the point of understanding noise as part of the attributes it’s supposed to learn. It benchmarks well with the training data but fails in the real world. The opposite of overfitting is underfitting, a situation in which the model can’t even comprehend the training data and shows low accuracy even in the training phase. When underfitting happens, it’s usually a sign to suggest a change in the model itself. There are many other problems that may occur in the training phase, such as gradient explosion, overflow, saturation, divergence. Although I won’t mention each one specifically, they are widely explained elsewhere.

## Hyperparameters

Now that you’re aware of challenges with training, let’s suppose for a minute that you were able to avoid all of them and that you chose a specific algorithm to train your model with. Even in that scenario, you’d still be left with the choice of some parameters that this algorithm uses to control the learning of the neural network. These are called hyperparameters. They are chosen before the training begins and are kept at that same value throughout it. So how do you choose the best set of hyperparameters to use in order to get the best result? This is tough to answer, and usually it’s done by trial and error or by running a hyperparameter search where a round of smaller trainings is run with different values for the parameters, and the winner becomes the set of parameters with the best accuracy overall.

## Hardware makes a difference

Training can be a difficult task on its own, and often it’s a step that’s repeated until you’re satisfied with the quality of the resulting neural network. The faster you can run training, the faster you can fail and learn from your errors to try again. I don’t say this to undermine anyone’s enthusiasm, but to mention that using top-notch hardware to train these AI models makes a big difference. The IBM Power System AC922 supports up to six NVIDIA V100 GPUs, interconnected with NVLink technology and featuring PCIe Gen4 slots — all of which tremendously helps to speed and refine your training models.

Detecting failure early can also save you time. Solutions such as IBM Watson Machine Learning Accelerator with Deep Learning Impact (DLI) has an interesting characteristic of detecting training problems such as underfitting, overfitting, gradient explosion, overflow, saturation and divergence while the training is taking place. So, you don’t need to wait for hours until the training is complete. Should any of those situations be detected, the tool simply warns you about it and you can stop the training, make some adjustments and start over. Finally, DLI can also perform hyperparameter searches and provide you with a set of better values to use. For the data scientists out there: You can control the search with four types of algorithms (random, Tree-based Parzen Estimator, Bayesian and Hyperband), select the universe for the search (parameter range) and amount of time to look for a better answer.

I hope this blog post provided you with an introduction to what’s involved during the training phase of an AI project. In my next post, I’ll talk about what happens after the model is declared suited for the real world: inferencing!

Meanwhile, if you’re looking for support with an AI project on IBM Power Systems, don’t hesitate to contact IBM Systems Lab Services.