Quick start: Refine data

You can save data preparation time by quickly transforming large amounts of raw data into consumable, high-quality information that is ready for analytics. Read about the Data Refinery tool, then watch a video and take a tutorial that’s suitable for beginners and does not require coding.

Service The Data Refinery service is not available by default. An administrator must install either the Watson Studio service or the Watson Knowledge Catalog service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the Data Refinery service is enabled.

Required service
Watson Studio or Watson Knowledge Catalog

Your basic workflow includes these tasks:

  1. Create a project. Projects are where you can collaborate with others to work with data.
  2. Add your data to the project. You can add CSV files or data from a remote data source through a connection.
  3. Open the data in Data Refinery.
  4. Perform steps using operations to refine the data.
  5. Create and run a job to transform the data.

Read about Data Refinery

Use Data Refinery to cleanse and shape tabular data with a graphical flow editor. You can also use interactive templates to code operations, functions, and logical operators. When you cleanse data, you fix or remove data that is incorrect, incomplete, improperly formatted, or duplicated. When you shape data, you customize it by filtering, sorting, combining or removing columns, and performing operations.

You create a Data Refinery flow as a set of ordered operations on data. Data Refinery includes a graphical interface to profile your data to validate it and over 20 customizable charts that give you perspective and insights into your data. When you save the refined data set, you typically load it to a different location than where you read it from. In this way, your source data remains untouched by the refinement process.

Read more about refining data

Watch a video about refining data

Watch Video Watch this video to see how to refine data.

This video provides a visual method as an alternative to following the written steps in this documentation.

Try a tutorial to refine data

In this tutorial, you will complete these tasks:

This tutorial will take approximately 30 minutes to complete.

Task 1: Create a project

You need a project to store the data and the Data Refinery flow. Follow these steps to create a project:

  1. From the Cloud Pak for Data navigation menu Navigation menu, choose Projects > All projects.

  2. If you have an existing project, open it.

  3. If you don't have an existing project, then click New project.

  4. Select Create an empty project.

  5. Enter a name and optional description for the project.

  6. Click Create.

Checkpoint icon for The following image shows a new, empty project. Check your progress

The following image shows a new, empty project.

The following image shows a new, empty project.

For more information or to watch a video, see Creating a project.

Task 2: Open the data set in Data Refinery

Follow these steps to add a data asset to your project and create a Data Refinery flow:

  1. Download the airline-data.csv file (1.5 MB).

  2. From your project, click the Upload asset to project icon upload icon.

  3. In the Load panel that opens, browse to select the airline-data.csv file, and click Open. Stay on the page until the load completes.

  4. On the Assets tab, click the airline-data.csv data asset to preview its content.

  5. Click Prepare data to open a sample of the file in Data Refinery, and wait until Data Refinery reads and processes a sample of the data.

  6. Close the Information and Steps panels.

Checkpoint icon for The following image shows the airline data asset open in Data Refinery. Check your progress

The following image shows the airline data asset open in Data Refinery.

The following image shows the airline data asset open in Data Refinery.

Task 3: Review the data with Profile and Visualizations

Watson Knowledge Catalog automatically profiles and classifies the content of an asset based on the values in those columns. Follow these steps to use the Profile and Visualizations tabs to explore the data.

  1. Click the Profile tab to review the frequency distribution of the data so that you can find the outliers. The statistics show the interquartile range, minimum, maximum, median and standard deviation in each column. The following image shows the Profile tab:
    Profile tab

  2. Click the Visualizations tab. Select the UniqueCarrier column to visualize. Suggested charts have a blue dot next to their icons. Use the different perspectives available in the charts to identify patterns, connections, and relationships within the data. The following image shows the Visualizations tab:
    Visualizations tab

Tip: Use the Profile and Visualizations pages to view changes in the data as you refine it.

Data Refinery operations

Data Refinery uses two kinds of operations to refine data, GUI operations and coding operations. You will use both kinds of operations in this tutorial.

  • GUI operations can consist of multiple steps. Select an operation from New step. A subset of the GUI operations is also available from each column's overflow menu (overflow menu).

    When you open a file in Data Refinery, the Convert column type operation is automatically applied as the first step to convert any non-string data types to inferred data types (for example, to Integer, Date, Boolean, etc.). You can undo or edit this step.

  • Coding operations are interactive templates for coding operations, functions, and logical operators. Most of the operations have interactive help. Click the operation name in the command-line text box to see the coding operations and their syntax options.

Task 4: Refine the data

Refining data is a series of steps to build a Data Refinery flow. As you go through this task, view the Steps panel to follow your progress. You can select a step to delete or edit it. If you make a mistake, you can also click the Undo icon Undo icon. Follow these steps to refine the data:

  1. Go back to the Data tab.

  2. Select the Year column. Click the Overflow menu (Overflow menu) and choose Sort descending.

  3. Click Steps to see the new step in the Steps panel.

  4. Focus on the delays for a specific airline. This tutorial uses United Airlines (UA), but you can choose any airline.

    1. Click New step, and then choose the GUI operation Filter.

    2. Choose the UniqueCarrier column.

    3. For Operator, choose Is equal to.

    4. For Value, type the string for the airline for which you want to see delay information. For example, UA.
      Filter operation

    5. Click Apply. Scroll to the UniqueCarrier column to see the results.

  5. Create a new column that adds the arrival and departure delay times together.

    1. Select the DepDelay column.

    2. Notice that the Convert column type operation was automatically applied as the first step to convert the String data types in all the columns whose values are numbers to Integer data types.

    3. Click New step, and then choose the GUI operation Calculate.

    4. For Operator, choose Addition.

    5. Select Column, and then choose the ArrDelay column.

    6. Select Create new column for results.

    7. For New column name, type TotalDelay.
      Calculate operation

    8. You can position the new column at the end of the list of columns or next to the original column. In this case, select Next to original column.

    9. Click Apply. The new column, TotalDelay, is added.

  6. Move the new TotalDelay column to the beginning of the data set:

    1. In the command-line text box, choose the select operation.

    2. Click the word select, and then choose select(`<column>`, everything()).

    3. Click `<column>`, and then choose the TotalDelay column. When you finish, the command should look like this:

      select(`TotalDelay`, everything())
      
    4. Click Apply. The TotalDelay column is now the first column.

  7. Reduce the data to four columns: Year, Month, DayofMonth, and TotalDelay. Use the group_by coding operation to divide the columns into groups of year, month, and day.

    1. In the command-line text box, choose the group_by operation.

    2. Click <column>, and then choose the Year column.

    3. Before the closing parenthesis, type: ,Month,DayofMonth. When you finish, the command should look like this:

      group_by(`Year`,Month,DayofMonth)
      
    4. Click Apply.

    5. Use the select coding operation for the TotalDelay column. In the command-line text box, select the select operation.
      Click <column>, and choose the TotalDelay column. The command should look like this:

      select(`TotalDelay`)
      
    6. Click Apply. The shaped data now consists of the Year, Month, DayofMonth, and TotalDelay columns.

      The following screen image shows the first four rows of the data.
      The first four rows of the Data Refinery flow with the Year, Month, DayofMonth, and TotalDelay columns

  8. Show the mean of the values of the TotalDelay column, and create a new AverageDelay column:

    1. Make sure the TotalDelay column is selected, and click New step, and then choose the GUI operation Aggregate.

    2. For AGGREGATION 1, select Mean.

    3. For Name of the aggregated column, type AverageDelay.
      Aggregate operation

    4. Click Apply.

      The new column AverageDelay is the average of all the delay times.

Checkpoint icon for The following screen image shows the first four rows of the data. Check your progress

The following image shows the first four rows of the data.

The following screen image shows the first four rows of the data.

Task 5: Run a job for the Data Refinery flow

When you run a job for the Data Refinery flow, the steps are run on the entire data set. You select the runtime and add a one-time or repeating schedule. The output of the Data Refinery flow is added to the data assets in the project. Follow these steps to run a job to create the refined data set.

  1. From the Data Refinery toolbar, click the Jobs icon, and select Save and create a job.
    Save and create a job

  2. Type a name and description for the job, and click Next.

  3. Select a runtime environment, and click Next.

  4. (Optional) Click the toggle button to schedule a run. Specify the date, time and if you would like the job to repeat, and click Next.

  5. (Optional) Turn on notifications for this job, and click Next.

  6. Review the details, and click Create and run to run the job immediately.
    create job

  7. When the job is created, click the job details link in the notification to view the job in your project. Alternatively, you can navigate to the Jobs tab in the project, and click the job name to open it.

  8. When the Status for the job is Completed, use the project navigation trail to navigate back to the Assets tab in the project.

  9. Click the Data > Data assets section to see the output of the Data Refinery flow, airline-data_shaped.csv.

  10. Click the Flows > Data Refinery flows section to see the Data Refinery flow, airline-data.csv_flow.

Checkpoint icon for The following image shows the Assets tab with the Data Refinery flow and shaped asset. Check your progress

The following image shows the Assets tab with the Data Refinery flow and shaped asset.

The following image shows the Assets tab with the Data Refinery flow and shaped asset.

Task 6: Create another data asset from the Data Refinery flow

Follow these steps to further refine the data set by editing the Data Refinery flow:

  1. Click airline-data.csv_flow to open the flow in Data Refinery.

  2. Sort the AverageDelay column in descending order.

    1. Select the AverageDelay column.

    2. Click the column Overflow menu (Overflow menu), and then select Sort descending.

  3. Click the Flow settings settings icon.

  4. Click the Target data set panel.

  5. Click Edit properties.

    1. In the Format target properties dialog, change the data asset name to airline-data_sorted_shaped.csv.
      changed output file name

    2. Click Save to return to the Flow settings.

  6. Click Apply to save the settings.

  7. From the Data Refinery toolbar, click the Jobs icon and select Save and view jobs.
    Save and view jobs

  8. Select the job for the airline data, and then click View.

  9. From the Job window toolbar, click the Run job icon.
    Run jobs icon

Checkpoint icon for The following image shows the completed job details. Check your progress

The following image shows the completed job details.

The following image shows the completed job details.

Task 7: View the data assets and your Data Refinery flow in your project

Now follow these steps to view the three data assets, the original, the first refined data set, and the second refined data set:

  1. When the job completes, go to the project page.

  2. Click the Assets tab.

  3. In the Data assets section, you will see the original data set that you uploaded and the output of the two Data Refinery flows.

    • airline-data_sorted_shaped.csv
    • airline-data_csv_shaped
    • airline-data.csv
  4. Click the airline-data_csv_shaped data asset to see the mean delay unsorted. Navigate back to the Assets tab.

  5. Click airline-data_sorted_shaped.csv data asset to see the mean delay sorted in descending order. Navigate back to the Assets tab.

  6. Click the Flows > Data Refinery flows section shows the Data Refinery flow: airline-data.csv_flow.

Checkpoint icon for The following image shows the Assets tab with all of the assets displayed. Check your progress

The following image shows the Assets tab with all of the assets displayed.

The following image shows the Assets tab with all of the assets displayed.

Next steps

Now the data is ready to be used. For example, you or other users can do any of these tasks:

Additional resources