Managing Data Refinery flows
A Data Refinery flow is an ordered set of steps to cleanse, shape, and enhance data. As you refine your data by applying operations to a data set, you dynamically build a customized Data Refinery flow that you can modify in real time and save for future use.
These are actions that you can do while you refine your data:
Working with the Data Refinery flow
Steps
- Undo or redo a step
- Edit, duplicate, insert, or delete a step
- View the Data Refinery flow steps in a "snapshot view"
- Export the Data Refinery flow data to a CSV file
Working with the data sets
- Change the source of a Data Refinery flow
- Edit the sample size
- Edit the source properties
- Change the target of a Data Refinery flow
- Edit the target properties
- Change the name of the Data Refinery flow target
Actions on the project page
- Reopen a Data Refinery flow to continue working
- Duplicate a Data Refinery flow
- Delete a Data Refinery flow
- Promote a Data Refinery flow to a space
- Export the Data Refinery flow data with project assets
Working with the Data Refinery flow
Save a Data Refinery flow
Save a Data Refinery flow by clicking the Save Data Refinery flow icon in the Data Refinery toolbar. Data Refinery flows are saved to the project that you're working in. Save a Data Refinery flow so that you can continue refining a data set later.
The default output of the Data Refinery flow is saved as a data asset source-file-name_shaped.csv. For example, if the source file is mydata.csv, the default name and output for the Data Refinery flow is mydata_csv_shaped.
You can edit the name and add an extension by changing the target of a Data Refinery flow.
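The default naming can be sketched in a few lines of Python. This is only an illustration inferred from the mydata.csv example above: the function name default_target_name is hypothetical, and the rule that every dot becomes an underscore before _shaped is appended is an assumption, not documented product behavior.

```python
from pathlib import PurePosixPath


def default_target_name(source_file_name: str) -> str:
    """Guess the default target name for a source file (illustrative only).

    Assumption: dots in the file name become underscores and "_shaped"
    is appended, which matches the single documented example.
    """
    base_name = PurePosixPath(source_file_name).name  # drop any folder prefix
    return base_name.replace(".", "_") + "_shaped"
```

For example, default_target_name("mydata.csv") returns "mydata_csv_shaped", which matches the documented default.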
Run or schedule a job for a Data Refinery flow
Data Refinery supports large data sets, which can be time-consuming and unwieldy to refine. So that you can work quickly and efficiently, Data Refinery operates on a sample subset of rows in the data set. The sample size is 1 MB or 10,000 rows, whichever comes first. When you run a job for the Data Refinery flow, the entire data set is processed. When you run the job, you select the runtime and you can add a one-time or repeating schedule.
In Data Refinery, click the Jobs icon on the Data Refinery toolbar, and then select Save and create a job or Save and view jobs.
After you save a Data Refinery flow, you can also create a job for it from the Project page. Go to the Assets tab, select the Data Refinery flow, and choose New job from the Overflow icon.
You must have the Admin or Editor role to edit or run the job. With the Viewer role for the project, you can only view the job details.
For more information about jobs, see Creating jobs in Data Refinery.
Rename a Data Refinery flow
On the Data Refinery toolbar, open the Info pane. Or click the Flow settings icon and go to the General tab.
Steps
Undo or redo a step
Click the Undo icon or the Redo icon on the toolbar.
Edit, duplicate, insert, or delete a step
In the Steps pane, click the Overflow icon on the step for the operation that you want to change. Select the action (Edit, Duplicate, Insert step before, Insert step after, or Delete).
- If you select Edit, Data Refinery goes into edit mode and displays the operation to be edited either on the command line or in the Operation pane. Apply the edited operation.
- If you select Duplicate, the duplicated step is inserted after the selected step. The Duplicate action is not available for the Join or Union operations.
Data Refinery updates the Data Refinery flow to reflect the changes and reruns all the operations.
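The step actions can be pictured as edits to an ordered list. The following Python sketch models the steps as plain strings and shows where a duplicated or inserted step lands. The function names and step labels are hypothetical and do not correspond to any Data Refinery API.

```python
from typing import List


def duplicate_step(steps: List[str], index: int) -> List[str]:
    """Return a new list with a copy of steps[index] placed right after it."""
    return steps[: index + 1] + [steps[index]] + steps[index + 1:]


def insert_step(steps: List[str], index: int, new_step: str, before: bool) -> List[str]:
    """Return a new list with new_step placed before or after steps[index]."""
    position = index if before else index + 1
    return steps[:position] + [new_step] + steps[position:]


def delete_step(steps: List[str], index: int) -> List[str]:
    """Return a new list without steps[index]."""
    return steps[:index] + steps[index + 1:]
```

Each function returns a new list instead of changing the original, which mirrors the idea that the flow is rebuilt and all operations are rerun after a change.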
View the Data Refinery flow steps in a "snapshot view"
To see what your data looked like at any point in time, click a previous step to put Data Refinery into snapshot view. For example, if you click Data source, you see what your data looked like before you started refining it. Click any operation step to see what your data looked like after that operation was applied. To leave snapshot view, click Viewing step x of y or click the same step that you selected to get into snapshot view.
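Conceptually, a snapshot at step x is the result of applying only the first x operations to the source data. The following Python sketch illustrates that idea with rows as dictionaries and operations as functions. The names snapshot, drop_missing_age, and uppercase_name are hypothetical, and this is not how Data Refinery is implemented.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def snapshot(source: List[Row], steps: List[Step], step_number: int) -> List[Row]:
    """Return the data as it looks after the first step_number operations.

    step_number 0 corresponds to the untouched data source.
    """
    if not 0 <= step_number <= len(steps):
        raise ValueError("step_number is outside the flow")
    data = list(source)
    for step in steps[:step_number]:
        data = step(data)
    return data


def drop_missing_age(rows: List[Row]) -> List[Row]:
    """Example operation: remove rows that have no age value."""
    return [row for row in rows if row.get("age") is not None]


def uppercase_name(rows: List[Row]) -> List[Row]:
    """Example operation: convert the name column to uppercase."""
    return [{**row, "name": str(row["name"]).upper()} for row in rows]
```

With steps set to [drop_missing_age, uppercase_name], snapshot(source, steps, 0) corresponds to clicking Data source, and snapshot(source, steps, 1) shows the data after the first operation only.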
Export the Data Refinery flow data to a CSV file
Click the Export icon on the toolbar to export the data at the current step in your Data Refinery flow to a CSV file without saving or running a Data Refinery flow job. Use this option, for example, if you want quick output of a Data Refinery flow that is in progress. When you export the data, a CSV file that contains the data at the current step is created and downloaded to your computer's Downloads folder (or the download location that you specified). If you are in snapshot view, the CSV file contains the data as of the step that you clicked. If you are viewing a sample (subset) of the data, only the sample data is included in the output.
If your CSV file contains a malicious payload (for example, a formula) in an input field, that payload might be executed when the file is opened.
You can also export a Data Refinery flow by exporting the project assets. For more information, see Exporting project assets.
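If you plan to open an exported CSV file in a spreadsheet application, one common mitigation for formula payloads is to neutralize cells that start with a formula character before you open the file. The following Python sketch uses only the standard csv module. It is not a Data Refinery feature, the function names are hypothetical, and the set of prefix characters is an assumption. Prefixing also changes values such as negative numbers, so apply it only where that trade-off is acceptable.

```python
import csv
import io

# Assumption: characters that spreadsheet applications commonly treat as the start of a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def neutralize_cell(value: str) -> str:
    """Prefix a single quote so that the cell is read as text, not as a formula."""
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


def neutralize_csv(text: str) -> str:
    """Return CSV text in which every formula-like cell has been neutralized."""
    reader = csv.reader(io.StringIO(text))
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in reader:
        writer.writerow([neutralize_cell(cell) for cell in row])
    return output.getvalue()
```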
Working with the data sets
Change the source of a Data Refinery flow
You can change the source of a Data Refinery flow to run the same flow with a different source data set. There are two ways to change the source:
- In the Steps pane: Click the Overflow icon next to Data source, select Edit, and then choose a different source data set.
- In the Flow settings: Use this method if you want to change more than one data source in the same place, for example, for a Join or a Union operation. On the toolbar, click the Flow settings icon. Go to the Source data sets tab and click the Overflow icon next to the data source. Select Replace data source, and then choose a different source data set.
For best results, the new data set should have a schema that is compatible with the original data set (for example, column names, number of columns, and data types). If the new data set has a different schema, operations that won't work with the schema will show errors. You can edit or delete the operations, or change the source to one that has a more compatible schema.
If you choose a connection for the new source, you can only use a connection from the list of Supported data sources for Data Refinery.
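Before you replace a source, it can help to compare the two schemas yourself. The following Python sketch compares column names and data types that you supply as dictionaries. The function name, the dictionary layout, and the type labels are all hypothetical. Data Refinery does not expose this function.

```python
from typing import Dict, List


def schema_differences(original: Dict[str, str], replacement: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two schemas given as {column name: data type} dictionaries.

    Returns the columns that are missing from the replacement, the columns
    that are new in the replacement, and the columns whose type changed.
    """
    shared = original.keys() & replacement.keys()
    return {
        "missing": sorted(original.keys() - replacement.keys()),
        "added": sorted(replacement.keys() - original.keys()),
        "retyped": sorted(col for col in shared if original[col] != replacement[col]),
    }
```

Operations that refer to a column listed under missing or retyped are the ones most likely to show errors after the source is changed.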
Edit the sample size
When you run the job for the Data Refinery flow, the operations are performed on the full data set. However, when you apply the operations interactively in Data Refinery, depending on the size of the data set, you view only a sample of the data.
Increase the sample size to see results that will be closer to the results of the Data Refinery flow job, but be aware that it might take longer to view the results in Data Refinery. The maximum is a top-row count of 10,000 rows or 1 MB, whichever comes first. Decrease the sample size to view faster results. Depending on the size of the data and the number and complexity of the operations, you might want to experiment with the sample size to see what works best for the data set.
On the toolbar, click the Flow settings icon. Go to the Source data sets tab, click the Overflow icon next to the data source, and select Edit sample.
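The "10,000 rows or 1 MB, whichever comes first" limit can be illustrated with a short Python sketch that takes rows from the top of a data set until either limit is reached. This is a rough model only. Reading 1 MB as 1,000,000 bytes, measuring rows as UTF-8 text, and excluding the row that would cross the size limit are assumptions, and the function name is hypothetical.

```python
from typing import Iterable, List

MAX_ROWS = 10_000
MAX_BYTES = 1_000_000  # assumption: "1 MB" interpreted as 10**6 bytes


def top_rows_sample(rows: Iterable[str], max_rows: int = MAX_ROWS, max_bytes: int = MAX_BYTES) -> List[str]:
    """Collect rows from the top until the row limit or the size limit is hit."""
    sample: List[str] = []
    used_bytes = 0
    for row in rows:
        row_bytes = len(row.encode("utf-8"))
        # Stop at whichever limit would be exceeded first.
        if len(sample) >= max_rows or used_bytes + row_bytes > max_bytes:
            break
        sample.append(row)
        used_bytes += row_bytes
    return sample
```

With narrow rows the row limit is reached first, and with wide rows the size limit is reached first, which is why the number of sampled rows can differ between data sets.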
Edit the source properties
The available properties depend on the data source. Different properties are available for data assets and for data from different kinds of connections. Change the file format only if the inferred file format is incorrect. If you change the file format, the source is read with the new format, but the source file remains unchanged. Changing the format source properties might be an iterative process. Inspect your data after you apply an option.
On the toolbar, click the Flow settings icon. Go to the Source data sets tab, click the Overflow icon next to the data source, and select Edit format.
Change the target of a Data Refinery flow
By default, the target of the Data Refinery flow is saved as a data asset in the project that you're working in.
To change the target location, click the Flow settings icon from the toolbar. Go to the Target data set tab, click Select target, and select a different target location.
If you choose a connection for a target, you can only use a connection from the list of Supported data sources for Data Refinery. Some of these connections can only be used as a source for a Data Refinery flow.
Edit the target properties
The available properties depend on the data source. Different properties are available for data assets and for data from different kinds of connections.
To change the target data set's properties, click the Flow settings icon from the toolbar. Go to the Target data set tab, and click Edit properties.
Change the name of the Data Refinery flow target
The name of the target data set is included in the fields that you can change when you edit the target properties.
By default, the target of the Data Refinery flow is saved as a data asset source-file-name_shaped.csv in the project. For example, if the source is mydata.csv, the default name and output for the Data Refinery flow is the data asset mydata_csv_shaped.
Different properties and naming conventions apply to a target data set from a connection. For example, if the data set is in Cloud Object Storage, the data set is identified in the Bucket and File name fields. If the data set is in a Db2 database, the data set is identified in the Schema name and Table name fields.
For more information, see Target connection options.
Actions on the project page
Reopen a Data Refinery flow to continue working
To reopen a Data Refinery flow and continue refining your data, go to the project’s Assets tab. Under Asset types, expand Flows, and click Data Refinery flow. Click the Data Refinery flow name.
Duplicate a Data Refinery flow
To create a copy of a Data Refinery flow, go to the project's Assets tab, expand Flows, and click Data Refinery flow. Select the Data Refinery flow, and then select Duplicate from the Overflow icon. The Data Refinery flow is added to the Data Refinery flows list as "original-name copy 1".
Delete a Data Refinery flow
To delete a Data Refinery flow, go to the project's Assets tab, expand Flows, and click Data Refinery flow. Select the Data Refinery flow, and then select Delete from the Overflow icon.
Promote a Data Refinery flow to a space
Deployment spaces are used to manage a set of related assets in a separate environment from your projects. You use a space to prepare data for a deployment job for watsonx.ai Runtime. You can promote Data Refinery flows from multiple projects to a single space. Complete the steps in the Data Refinery flow before you promote it because the Data Refinery flow is not editable in a space.
To promote a Data Refinery flow to a space, go to the project's Assets tab, expand Flows, and click Data Refinery flow. Select the Data Refinery flow. Click the Overflow icon for the Data Refinery flow, and then select Promote. The source file for the Data Refinery flow and any other dependent data are promoted as well.
To create or run a job for the Data Refinery flow in a space, go to the space’s Assets tab, scroll down to the Data Refinery flow, and select New job from the Overflow icon. If you've already created the job, go to the Jobs tab to edit the job or view the job run details. The shaped output of the Data Refinery flow job will be available on the space’s Assets tab. You must have the Admin or Editor role to edit or run the job. With the Viewer role for the project, you can only view the job details. You can use the shaped output as input data for a job in watsonx.ai Runtime.
When you promote a Data Refinery flow from a project to a space and the target of the Data Refinery flow is a connected data asset, you must manually promote the connected data asset. This action ensures that the connected data asset's data is updated when you run the Data Refinery flow job in the space. Otherwise, a successful run of the Data Refinery flow job will create a new data asset in the space.
For information about spaces, see Deployment spaces.
Export the Data Refinery flow data with project assets
You can export a Data Refinery flow by exporting the project assets. For more information, see Exporting project assets.