Managing data for deployments

This topic describes various ways of adding and promoting data assets to a space. It also contains important information about data types used in batch jobs.

Adding data sources to a space (Watson Machine Learning)

Add data sources to a deployment space to use with batch deployment jobs. Data can be:

  • A data file such as a .csv file
  • A connection to data that resides in a repository, such as a database
  • Connected data that resides in a storage bucket, such as a data file in a Cloud Object Storage bucket or a storage volume (NFS)

Notes:

  • Depending on your configuration and the type of data connection, large data sets, typically more than 2 GB, can time out when you promote them to a space or catalog.
  • Although you can promote any kind of data connection to a space, where you can use the connection is governed by factors such as model and deployment type. For example, you can access any of the connected data using a script, but in batch deployments you are limited to particular types of data, as listed in Batch deployment details by framework.
  • If you promoted or added a connection or connected data asset that uses Cloud Pak for Data credentials, make sure that the option “Use your Cloud Pak for Data credentials to authenticate to the data source” is selected after you add it to a space. This ensures that connected data that uses those credentials continues to work properly.

Data added to a space is managed in a similar way to data added to a Watson Studio project. For example:

  • Adding data to a space creates a new copy of the asset and its attachments within the space, maintaining a reference back to the project asset. If an asset such as a data connection requires access credentials, they persist and are the same whether you are accessing the data from a project or from a space.
  • Just as with data connections in a project, you can edit data connection details from the space.
  • Data assets are stored in a space in the same way they are stored in a project, using the same file structure for the space as the structure used for the project.

You can add data to a space in one of these ways:

  • Promote a data source, such as a file or connection, from an associated project
  • Add a data file, connection, or connected data directly to a space
  • Save a data asset to a space programmatically

For details on how Watson Studio connects to data, see Accessing data.

Promoting data sources from a project

To promote data from a project:

  1. Save a data source, data connection, or connected data to a project.
  2. In the project Assets page, from the action item for the data asset, choose Promote.

The promoted data asset displays in the space and is available for use as an input data source in a deployment job.

Adding data to a space

To add data directly to a space:

  1. From the Assets page of the deployment space, click Add to space.
  2. Choose the type of data asset to add:
    • Data to specify a file to upload
    • Connection to specify a connection to a data repository such as Db2
    • Connected data to connect to data in a storage object such as a Cloud Object Storage bucket
  3. Complete the steps to add the data.

The data asset displays in the space and is available for use as an input data source in a deployment job.

Using data from a Cloud Object Storage connection

  1. Create a connection to IBM Cloud Object Storage by adding a Connection to your project or space and selecting Cloud Object Storage (infrastructure) as the connection type. Provide the access key, secret key, and login URL.
  2. Add input and output files to the deployment space as connected data using the COS connection you created.
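The credentials gathered in step 1 can be sketched as a simple properties mapping. This is a minimal illustration only, assuming HMAC-style COS credentials; the key names and the endpoint placeholder below are assumptions for the sketch, not the literal field names of the connection form:

```python
# Sketch of the values needed for a Cloud Object Storage (infrastructure)
# connection. Key names and the endpoint are illustrative assumptions.
cos_connection_properties = {
    "access_key": "<HMAC access key id>",      # from the COS service credentials
    "secret_key": "<HMAC secret access key>",  # from the COS service credentials
    "url": "<login URL, e.g. a regional COS endpoint>",
}

# All three values must be supplied before the connection can be created.
missing = [k for k, v in cos_connection_properties.items() if not v]
print("missing fields:", missing)
```

The same access key and secret key pair is what the batch deployment later needs to read connected data from the bucket.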

Using data from a Storage volume (NFS) connection

For details on using data from a networked file system, see Storage volume connection.

Data sources for batch jobs

Input data can be supplied to a batch job as:

  • Inline data - With this method, you specify the input data for batch processing in the batch deployment job’s payload. For example, you can pass a CSV file as the deployment input in the UI, or as the value of the scoring.input_data parameter in a notebook. When the batch deployment job completes, the output is written to the corresponding job’s scoring.predictions metadata parameter.
  • Data reference - With this method, the input and output data for batch processing are stored in a remote data source, such as a Cloud Object Storage bucket or an SQL or NoSQL database, or as a local or managed data asset in a deployment space. Details for data references include:

    • input_data_references.type and output_data_reference.type must be data_asset

    • The references to input data must be specified as a /v2/assets href in the input_data_references.location.href parameter in the deployment job’s payload. The data asset specified here can be a reference to a local or connected data asset.

    • If the batch deployment job’s output data has to be persisted in a remote data source, the references to output data must be specified as a /v2/assets href in output_data_reference.location.href parameter in the deployment job’s payload.

    • If the batch deployment job’s output data has to be persisted in a deployment space as a local asset, output_data_reference.location.name must be specified. Once the batch deployment job is completed successfully, the asset with the specified name will be created in the space.

    • If the output data reference points to a data asset in a remote database, you can specify whether the batch output is appended to the table or whether the table is truncated and the output data inserted. Use the output_data_reference.location.write_mode parameter to specify the value truncate or append. Note the following:

      • Specifying truncate truncates the table and inserts the batch output data.
      • Specifying append appends the batch output data to the remote database table.
      • write_mode applies only to the output_data_reference parameter.
      • write_mode applies only to data assets for remote databases. The parameter does not apply to a local data asset or a COS-based data asset.
    • Any input and output data asset references must be in the same space as the batch deployment.

    • If the connected data asset references a Cloud Object Storage instance as the source, for example, a file in a Cloud Object Storage bucket, you must supply the HMAC credentials for the COS bucket. Include an access key and a secret key in your IBM Cloud Object Storage connection to enable access to the stored files.
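The two input methods above can be sketched as job payloads. These are illustrative Python dictionaries assuming the payload shape described in this section; the asset and space IDs, column names, and row values are placeholders, and the exact href for an asset should be taken from your own space:

```python
# Sketch 1: inline data. The input rows travel inside the payload itself,
# and results come back in the job's scoring.predictions metadata.
inline_job_payload = {
    "scoring": {
        "input_data": [
            {
                "fields": ["AGE", "INCOME"],           # illustrative column names
                "values": [[34, 52000], [29, 41000]],  # illustrative rows
            }
        ]
    }
}

# Sketch 2: data references. Input is read from a data asset; output is
# written to a remote database table, with write_mode controlling whether
# the table is truncated first or the rows are appended.
reference_job_payload = {
    "scoring": {
        "input_data_references": [
            {
                "type": "data_asset",  # required type for data asset references
                "location": {"href": "/v2/assets/<input asset id>?space_id=<space id>"},
            }
        ],
        "output_data_reference": {
            "type": "data_asset",
            "location": {
                "href": "/v2/assets/<output asset id>?space_id=<space id>",
                # write_mode applies only to remote database tables:
                "write_mode": "append",  # or "truncate"
            },
        },
    }
}
```

To persist the output as a local asset in the space instead, omit href and write_mode from output_data_reference.location and set its name field; an asset with that name is created in the space when the job completes successfully.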