Training a neural network typically involves running many epochs, each of which exposes the network to the full training data set, until accuracy stops improving appreciably. For a lengthy training run, it is useful to save progress periodically so that, if training is interrupted for any reason, it can be resumed rather than restarted.
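The save-and-resume idea itself is framework-agnostic. As a minimal sketch (using Python's pickle and a toy stand-in for a model rather than TensorFlow; the file name, state fields, and update step here are all hypothetical), periodic checkpointing might look like:

```python
import os
import pickle

CKPT = "train_state.pkl"  # hypothetical checkpoint file

def train(total_epochs, save_every=500):
    # Resume from a checkpoint if one exists; otherwise start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"epoch": 0, "weight": 0.0}

    for epoch in range(state["epoch"], total_epochs):
        state["weight"] += 0.001           # stand-in for one epoch of training
        state["epoch"] = epoch + 1
        if (epoch + 1) % save_every == 0:  # periodically persist progress
            with open(CKPT, "wb") as f:
                pickle.dump(state, f)
    return state
```

If the process dies between checkpoints, at most `save_every` epochs of work are lost; rerunning `train` picks up from the last saved state.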
After any epoch of training, the state of training can be saved using essentially the same technique we used to save a fully trained model and restore it in a production environment for inferencing. To simulate a training interruption in my Bankloan sample, I broke the 3000-epoch training cell into two cells. The first cell runs the same training code, except that it stops at 1500 epochs and saves progress every 500 epochs using the following:
```python
if (epoch % 500) == 499:
    save_path = saver.save(training_session,
                           "../datasets/Neural Net2/Neural Net.ckpt",
                           global_step=epoch + 1)
    print(epoch + 1, " training progress saved to ", save_path)
```
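With zero-based epoch indexing, the `epoch % 500 == 499` test fires on epochs 499, 999, and 1499, so a checkpoint is written as each of epochs 500, 1000, and 1500 completes:

```python
# 1-based epoch numbers at which the condition above saves a checkpoint
saved = [epoch + 1 for epoch in range(1500) if (epoch % 500) == 499]
print(saved)  # [500, 1000, 1500]
```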
The next cell resumes training with the same training code, except that it runs only the latter 1500 epochs after restoring the TensorFlow state using the following code:
```python
with tf_training2.Session() as training2_session:
    inf_saver = tf_training2.train.import_meta_graph(
        '../datasets/Neural Net2/Neural Net.ckpt-1500.meta')
    inf_saver.restore(training2_session,
                      tf_training2.train.latest_checkpoint('../datasets/Neural Net2/'))
    graph = tf_training2.get_default_graph()
    training2_op = graph.get_operation_by_name("train/GradientDescent")
    X2 = graph.get_tensor_by_name("X:0")
    y2 = graph.get_tensor_by_name("y:0")
    accuracy2 = graph.get_tensor_by_name("test/accuracy:0")
    outputs2 = graph.get_tensor_by_name("nn/nn_output:0")
```
Relative to restoring a model for the purpose of inference, there are only a few small differences. First, the name of the checkpoint file we read includes the epoch number. Second, we have to retrieve a few more Python variables: the y tensor to which we feed the correct output values during training, the training operation itself, and the accuracy-testing tensor. Third, getting the training operation requires a slightly different call, because it is an operation rather than a tensor. Fourth and last, retrieving the accuracy tensor requires that we give it a name when constructing the graph, because a tensor cannot be looked up by name after restoration unless a name was saved with it. This can be accomplished by simply wrapping the previous accuracy tensor in an identity node, like this:
```python
with tf.name_scope("test"):
    correct = tf.nn.in_top_k(tf.cast(outputs, tf.float32), y, 1)
    accuracy = tf.identity(
        tf.reduce_mean(tf.cast(correct, tf.float32)), "accuracy")
```
This gives the accuracy tensor the name 'accuracy' within the 'test' name scope. Reviewing the code, one may note that the training operation is never explicitly named. That is because TensorFlow itself assigns the default name 'GradientDescent' to the operation when it is created, which here happened inside the 'train' name scope.
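The two lookups are distinguished by name form: a tensor name carries an output-index suffix such as ':0' (output 0 of the producing operation), while an operation name has no suffix. A small framework-free helper (hypothetical, purely for illustrating the convention) that classifies a name accordingly:

```python
def is_tensor_name(name):
    """True if `name` follows TensorFlow's tensor-name convention
    ("<operation_name>:<output_index>"); False for a bare operation name."""
    op_name, sep, index = name.rpartition(":")
    return bool(sep) and index.isdigit()

print(is_tensor_name("test/accuracy:0"))        # True: a tensor name
print(is_tensor_name("train/GradientDescent"))  # False: an operation name
```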
Speaking of the code, you can go here to download a copy of the notebook. Instead of one cell for all of the training, the training is split into two cells, with the second cell reloading the checkpoint and resuming where the first left off. Finally, note that you can fully simulate an interrupted training run by stopping the Python kernel after the first half of training. Once you restart the kernel, rerun the cells that load the training data, then run the second half of training. The only difference will be a negligibly different accuracy result, relative to training all epochs in one kernel session, because the random number seed is regenerated when the Python kernel restarts.
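The seed effect is easy to see in plain Python (illustrative only; the notebook's randomness actually comes from NumPy/TensorFlow rather than the `random` module): reseeding with the same value reproduces the same stream, whereas a restarted kernel effectively reseeds with a fresh value.

```python
import random

random.seed(42)
first_run = [random.random() for _ in range(3)]

random.seed(42)   # same seed -> identical sequence
second_run = [random.random() for _ in range(3)]

random.seed()     # fresh system seed, as after a kernel restart
third_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True
print(first_run == third_run)   # almost surely False
```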