Edit a TensorFlow training model for distributed training with IBM Fabric

Before uploading a TensorFlow training model, edit the model to work with the distributed training engine option in IBM Spectrum Conductor Deep Learning Impact. The distributed training engine must use a fabricmodel.py file.

Before you begin

Before editing your TensorFlow training model to work with IBM Spectrum Conductor Deep Learning Impact, consider the following limitations:
  • The tf.placeholder() data input schema is not supported. Models must use TensorFlow's multithreaded queue mechanism for data input to achieve high performance; see the queue-based input sketch after this list. To learn more about multithreaded queues in TensorFlow, see Threading and Queues.
    Note: Due to this limitation, the distributed training with IBM Fabric option is most suitable for object detection and object classification deep learning.
  • Models are automatically deployed to single-node, multi-node, and multi-GPU environments. Make sure that you do not define the tf.device() operation in your TensorFlow model.
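
For reference, the following is a minimal sketch of a queue-based input pipeline. The TFRecord file name, feature names, and shapes are purely illustrative; adapt them to your dataset.

    import tensorflow as tf

    # Hypothetical TFRecord file; replace with your training data.
    filename_queue = tf.train.string_input_producer(["train_data.tfrecord"])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={
            "image": tf.FixedLenFeature([784], tf.float32),  # illustrative shape
            "label": tf.FixedLenFeature([], tf.int64),
        })

    # tf.train.batch starts background threads that fill an internal queue,
    # so the graph consumes batches directly instead of tf.placeholder() feeds.
    # (Queue runners are started by the training session at run time.)
    images, labels = tf.train.batch(
        [features["image"], features["label"]],
        batch_size=64, num_threads=4, capacity=1024)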

About this task

Edit the TensorFlow model so that it provides the following operations:

  • The train accuracy operation
  • The train loss operation
  • The test accuracy operation
  • The test loss operation
  • The global step operation
  • The gradients and variables (grads_and_vars) operation, which can be obtained by calling optimizer.compute_gradients(train_loss).
  • The apply gradient operation, which can be obtained by calling optimizer.apply_gradients(grads_and_vars, global_step=global_step). A minimal sketch follows this list.
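
For example, these operations are typically wired together as follows. The train_loss tensor is assumed to come from your model code, and the optimizer choice here is illustrative:

    # Assumes train_loss is a scalar loss tensor defined by your model code.
    global_step = tf.Variable(0, name='global_step', trainable=False)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

    # The gradients and variables (grads_and_vars) operation.
    grads_and_vars = optimizer.compute_gradients(train_loss)

    # The apply gradient operation, which also increments the global step.
    optApplyOp = optimizer.apply_gradients(grads_and_vars, global_step=global_step)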

Procedure

  1. Rename your distributed TensorFlow model file to fabricmodel.py. IBM Spectrum Conductor Deep Learning Impact requires that the file be named fabricmodel.py in order for it to be used with the distributed training with IBM Fabric option.
  2. In the fabricmodel.py file, import the necessary API code as follows. The required APIs meta_writer.py and tf_meta_pb2.py are included with the IBM Spectrum Conductor Deep Learning Impact Python APIs. The os and tensorflow modules are also required by the code in the later steps. For more information on APIs, see Reference.
    import os
    import tensorflow as tf
    from meta_writer import *
  3. Define the command-line flags and their default file names as follows.
    DEFAULT_CKPT_DIR = './train'
    DEFAULT_WEIGHT_FILE = ''
    DEFAULT_MODEL_FILE = 'mymodel.model'
    DEFAULT_META_FILE = 'mymodel.meta'
    DEFAULT_GRAPH_FILE = 'mymodel.graph'

    FLAGS = tf.app.flags.FLAGS
    tf.app.flags.DEFINE_string('weights', DEFAULT_WEIGHT_FILE,
       "Weight file to load in order to validate, run inference, or continue training")
    tf.app.flags.DEFINE_string('train_dir', DEFAULT_CKPT_DIR,
       "Checkpoint directory used to resume a previous training run and/or snapshot the current one, default to \"%s\"" % (DEFAULT_CKPT_DIR))
    tf.app.flags.DEFINE_string('model_file', DEFAULT_MODEL_FILE,
      "Model file name to export, default to \"%s\"" % (DEFAULT_MODEL_FILE))
    tf.app.flags.DEFINE_string('meta_file', DEFAULT_META_FILE,
      "Meta file name to export, default to \"%s\"" % (DEFAULT_META_FILE))
    tf.app.flags.DEFINE_string('graph_file', DEFAULT_GRAPH_FILE,
      "Graph file name to export, default to \"%s\"" % (DEFAULT_GRAPH_FILE))
    
    where:
    • DEFAULT_CKPT_DIR specifies the TensorFlow checkpoint directory.
    • DEFAULT_WEIGHT_FILE specifies the weight files used for continued training.
    • DEFAULT_MODEL_FILE specifies the TensorFlow graph .protobuf file.
    • DEFAULT_META_FILE specifies the meta data file.
    • DEFAULT_GRAPH_FILE specifies the TensorFlow graph .txt file.
    Note: The DEFAULT_MODEL_FILE, DEFAULT_META_FILE, and DEFAULT_GRAPH_FILE files are automatically generated by the API.
  4. In the fabricmodel.py file, the main method must define the deep learning model and then call write_meta. The write_meta call generates the following files, which are used by the distributed training with IBM Fabric engine: DEFAULT_MODEL_FILE, DEFAULT_META_FILE, and DEFAULT_GRAPH_FILE.
    def main(argv=None):
        # Deep learning model code starts.
        # ...
        # Deep learning model code ends.

        # Finally, call the write_meta API as follows.
        # The path to save the model checkpoint.
        checkpoint_file = os.path.join(FLAGS.train_dir, "model.ckpt")

        # The path to load a prior weight file.
        restore_file = FLAGS.weights

        # The snapshot interval at which to save checkpoints.
        snapshot_interval = 100

        write_meta(
            tf,                  # the TensorFlow module
            None,                # the input placeholders; must be None
            train_accuracy,      # the train accuracy operation
            train_loss,          # the train loss operation
            test_accuracy,       # the test accuracy operation
            test_loss,           # the test loss operation
            optApplyOp,          # the apply gradient operation
            grads_and_vars,      # the gradients and variables
            global_step,         # the global step
            FLAGS.model_file,    # the path to save the TensorFlow graph protobuf file
            FLAGS.meta_file,     # the path to save the metadata file
            FLAGS.graph_file,    # the path to save the TensorFlow graph text file
            restore_file,        # the path to a prior weight file for continued training
            checkpoint_file,     # the path to save the model checkpoint file
            snapshot_interval    # the interval for saving model checkpoint files
        )

    if __name__ == '__main__':
        tf.app.run()
    
    
    Note: The tensor values for the train accuracy, test accuracy, and test loss operations must be TensorFlow scalars. If the model does not compute these values, they can be set to tf.constant(). For example:
    train_accuracy = tf.constant(0.5, dtype=tf.float32)
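    Because the export file names are exposed as command-line flags (step 3), you can optionally run the script locally to verify that the export files are generated, for example:
    python fabricmodel.py --train_dir ./train --model_file mymodel.model \
        --meta_file mymodel.meta --graph_file mymodel.graph
    This local run is only a sanity check of the export; it is not how the distributed training with IBM Fabric engine uses the file.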

Results

The edited TensorFlow model is ready for distributed training with IBM Fabric.

What to do next

Add the model to IBM Spectrum Conductor Deep Learning Impact; see Create a training model.