Create a dataset

Create a dataset using IBM Spectrum Conductor Deep Learning Impact 1.1.0. IBM Spectrum Conductor Deep Learning Impact supports LMDB, TFRecord, and other dataset types. Each dataset can include training data, test data, and validation data.

IBM Spectrum Conductor Deep Learning Impact requires that the dataset has at least training and test data. However, if you plan to use the dataset for validation, make sure to include all three data types as part of your dataset. Data types include:
  • Training data: The sample of data used for learning.
  • Test data: The sample of data used to evaluate the model during the training phase.
  • Validation data: The sample of data used to evaluate the final model.
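As a rough illustration of how raw samples might be partitioned into these three data types before creating a dataset (the split ratios, seed, and file names here are assumptions for the sketch, not values required by IBM Spectrum Conductor Deep Learning Impact):

```python
import random

def split_samples(samples, train_frac=0.7, test_frac=0.2, seed=42):
    """Partition a list of labeled samples into training, test, and
    validation subsets. The 70/20/10 split is an arbitrary example."""
    rng = random.Random(seed)           # fixed seed for a repeatable split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    return (shuffled[:n_train],                     # training data
            shuffled[n_train:n_train + n_test],     # test data
            shuffled[n_train + n_test:])            # validation data

# Hypothetical file names, for illustration only.
train, test, val = split_samples([f"img_{i}.jpg" for i in range(100)])
```

If you do not plan to use the dataset for validation, the validation subset can simply be left out and only the training and test portions kept.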

IBM Spectrum Conductor Deep Learning Impact assumes that you have collected your raw data and labeled it, either with a label file or by organizing the data into folders. To create a dataset, place the raw data in a folder on the shared file system that IBM Spectrum Conductor Deep Learning Impact has access to. The raw data must be in one of the formats accepted by IBM Spectrum Conductor Deep Learning Impact. The egoadmin user and the execution user must have read and write permissions on the folder.
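The two labeling layouts described above can be sketched as follows. This is a minimal illustration only: the root path, file names, class names, and permission mode are assumptions, and in practice the folder would live on your cluster's shared file system rather than in a temporary directory.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for a location on the shared file system; the
# actual path depends on your cluster configuration.
root = os.path.join(tempfile.mkdtemp(), "raw_data")

# Layout 1: images in one folder plus a label file mapping each image
# to its class.
os.makedirs(os.path.join(root, "images"))
with open(os.path.join(root, "labels.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([("images/cat_001.jpg", "cat"),
                      ("images/dog_001.jpg", "dog")])

# Layout 2: one subfolder per class, with each image stored under the
# folder named for its label.
for label in ("cat", "dog"):
    os.makedirs(os.path.join(root, "by_class", label))

# Grant read/write/execute permissions to the owning group so that both
# the egoadmin user and the execution user can access the folder
# (assumes both users belong to that group).
for dirpath, dirnames, filenames in os.walk(root):
    os.chmod(dirpath, 0o770)
```

Either layout works as a starting point; the label-file form is convenient when images cannot be moved, while the folder-per-class form needs no separate label file.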

Depending on the deep learning framework you are using, different IBM Spectrum Conductor Deep Learning Impact dataset types can be used. If you are using Caffe, you can use the following dataset types: LMDBs and Images for object classification. If you are using TensorFlow, you can use the following dataset types: TensorFlow Records, Images for object classification, Images for object detection, Images for vector output, CSV files, and other generic types.
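For the CSV dataset type, the raw data is a plain comma-separated file with one sample per row. A minimal sketch of producing such a file follows; the column names and values are illustrative assumptions, not a schema mandated by IBM Spectrum Conductor Deep Learning Impact.

```python
import csv
import io

# Illustrative feature rows; the column layout here is an assumption.
rows = [
    {"feature1": 0.12, "feature2": 3.4, "label": 0},
    {"feature1": 0.98, "feature2": 1.1, "label": 1},
]

# Write to an in-memory buffer for the sketch; in practice this would
# be a file on the shared file system.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["feature1", "feature2", "label"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

Each row pairs the input features with the expected label, which is the general shape a CSV-based training file takes regardless of framework.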

Note: A dataset belongs to a Spark instance group. If you have more than one Spark instance group configured for running deep learning workloads, you might have to create the same dataset on multiple Spark instance groups.