Data preparation and refining – Now integrated in Watson Data Platform
In data science, the time required to bring data up to good quality is a recurring barrier to gaining insights. Data scientists and analysts spend the bulk of their effort cleaning data with a variety of handwritten scripts. IBM Watson Data Platform’s data refining tools aim to reduce the pain of creating good quality data. The tool has an intuitive user interface and templates backed by powerful operations to shape and clean data. It also provides metrics and data visualizations that help at every step of the process. Incremental snapshots of the results let you gauge success after each iterative change. Because you can save, edit, and run the steps within projects, you can refine data of almost any size within the Watson Data Platform.
IBM’s Data Refining tools are now available in Watson Data Platform’s open beta. We invite you to try these data refining capabilities, offered as an integrated experience in the Watson Data Platform.
Here are five of the top new data refining features, among many:
1. Quick transformation to refine many types of data sources
You can connect to data sources in the public cloud or on-premises and refine the data within Watson Data Platform. Most common file formats, such as CSV, delimiter-separated, JSON, Parquet, and Avro, are supported, as are relational and non-relational databases.
Here is a sample process: The tool selects a subset of the dataset, to which you can iteratively apply transformations. If these transformations produce the required result, they can be applied to the full dataset. Complex data manipulation can be performed quickly. For example, you can split a single column into multiple columns by auto-detecting separators, by providing regular expressions, or by specifying positions in the data. You can merge multiple columns into one and calculate columns from existing ones using custom formulas or conditions.
Fig 1: Auto-detecting delimiters to split columns
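To make the split, merge, and calculate operations concrete, here is a minimal Python sketch. It is not the tool's own code or API. The helper names (`detect_delimiter`, `split_column`), the candidate delimiter list, and the sample rows are all assumptions made for illustration.

```python
def detect_delimiter(values, candidates=(",", ";", "|", "\t", ":")):
    """Pick the candidate that occurs the same non-zero number of times in every value."""
    best, best_count = None, 0
    for candidate in candidates:
        counts = {value.count(candidate) for value in values}
        if len(counts) == 1:          # consistent across all sampled values
            count = counts.pop()
            if count > best_count:
                best, best_count = candidate, count
    return best


def split_column(rows, column, new_names):
    """Split one column into several, using the auto-detected delimiter."""
    delimiter = detect_delimiter([row[column] for row in rows])
    if delimiter is None:
        raise ValueError(f"no consistent delimiter found in column {column!r}")
    result = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        new_row.update(zip(new_names, row[column].split(delimiter)))
        result.append(new_row)
    return result


# Hypothetical sample data.
rows = [
    {"id": 1, "name": "Ada|Lovelace", "price": "12.5", "qty": "4"},
    {"id": 2, "name": "Alan|Turing", "price": "3.0", "qty": "10"},
]

split_rows = split_column(rows, "name", ["first", "last"])
for row in split_rows:
    # Merge two columns into one.
    row["full_name"] = f'{row["first"]} {row["last"]}'
    # Calculate a new column from existing ones with a custom formula.
    row["total"] = float(row["price"]) * int(row["qty"])
```

The detection rule here is deliberately simple: a delimiter must appear equally often in every sampled value. The real tool may well use a different heuristic.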
A variety of frequently used data cleaning functions are provided, such as data de-duplication, empty row removal, and missing value replacement. Text operations, including sub-string replacement using regular expressions, string concatenation, character padding, and case conversion, are supported, as are math operations including absolute value, ceiling, floor, and square root. The tool also provides core data refining operations such as filtering, sorting, and column removal. Operations such as join, merge, and transposition of multiple data sets are being added.
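As a rough illustration of what several of these cleaning functions do, the following Python sketch removes empty rows, replaces missing values, drops duplicates, and then applies a few text operations. The function names, the default values, and the sample rows are hypothetical and do not reflect how the tool implements these operations.

```python
import re

EMPTY = (None, "")


def clean_rows(rows, defaults):
    """Remove empty rows, fill missing values from defaults, and drop duplicates."""
    seen = set()
    result = []
    for row in rows:
        # Empty row removal: skip rows in which every value is missing.
        if all(value in EMPTY for value in row.values()):
            continue
        # Missing value replacement: use the per-column default when one exists.
        filled = {
            key: defaults.get(key, value) if value in EMPTY else value
            for key, value in row.items()
        }
        # De-duplication: keep only the first occurrence of each row.
        fingerprint = tuple(filled[key] for key in sorted(filled))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        result.append(filled)
    return result


def tidy_text(row):
    """Case conversion, regex sub-string replacement, and character padding."""
    return {
        "name": row["name"].strip().title(),
        "code": re.sub(r"[^0-9]", "", row["code"]).zfill(6),
    }


# Hypothetical sample data.
raw = [
    {"name": "  alice ", "code": "ab-12"},
    {"name": "", "code": ""},
    {"name": "bob", "code": ""},
    {"name": "  alice ", "code": "ab-12"},
]

cleaned = [tidy_text(row) for row in clean_rows(raw, defaults={"code": "0"})]
```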
2. Advanced data shaping operations
Watson Data Platform provides advanced data shaping operations that are easy to use. The coding editor provides templates to help you build the structure of the command. The tool provides content assist to convert the command into executable code. Click on the templates and build advanced transformations to refine data.
You can select or reorder columns using name pattern matches or ranges. Templates provide a rich set of options for each transformation.
Fig 2: Template guided and content assisted coding
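Selecting columns by name pattern or by range can be pictured with the short Python sketch below. It uses shell-style wildcards from the standard library as a stand-in. The pattern syntax the tool's templates actually accept is not described here, and the helper and column names are assumptions.

```python
import fnmatch


def select_by_pattern(columns, pattern):
    """Return the columns whose names match a shell-style wildcard pattern."""
    return [name for name in columns if fnmatch.fnmatchcase(name, pattern)]


def select_by_range(columns, first, last):
    """Return every column from `first` through `last`, inclusive."""
    start, end = columns.index(first), columns.index(last)
    return columns[start:end + 1]


def project(rows, selected):
    """Keep only the selected columns, in the order given (this also reorders)."""
    return [{name: row[name] for name in selected} for row in rows]


# Hypothetical column names and data.
columns = ["id", "sales_q1", "sales_q2", "region", "notes"]
rows = [{"id": 1, "sales_q1": 10, "sales_q2": 12, "region": "east", "notes": ""}]

sales_columns = select_by_pattern(columns, "sales_*")
tail_columns = select_by_range(columns, "sales_q2", "notes")
reordered = project(rows, ["region"] + sales_columns)
```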
You can also apply advanced operations conditionally on multiple columns using a rich set of built-in functions and expression syntax. The tool also has a rich library of built-in aggregation (sum, count, average, etc.), summary, and sorting functions.
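A conditional operation across several columns, followed by a grouped aggregation, might look like the following Python sketch. The helpers `apply_if` and `aggregate` are hypothetical names chosen for this example and are not the tool's built-in functions or expression syntax.

```python
from collections import defaultdict


def apply_if(rows, columns, condition, transform):
    """Apply `transform` to each listed column whose value satisfies `condition`."""
    return [
        {
            key: transform(value) if key in columns and condition(value) else value
            for key, value in row.items()
        }
        for row in rows
    ]


def aggregate(rows, group_by, value):
    """Compute sum, count, and average of `value` for each `group_by` key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[value])
    return {
        key: {"sum": sum(vals), "count": len(vals), "average": sum(vals) / len(vals)}
        for key, vals in sorted(groups.items())   # sorted by group key
    }


# Hypothetical sample data.
rows = [
    {"region": "east", "q1": -5, "q2": 20},
    {"region": "east", "q1": 15, "q2": -2},
    {"region": "west", "q1": 8, "q2": 4},
]

# Replace negative values with zero in both quarter columns.
capped = apply_if(rows, {"q1", "q2"}, lambda v: v < 0, lambda v: 0)
summary = aggregate(capped, "region", "q1")
```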
3. Profile view
A quick way to understand your data is to look at the metrics of its distribution. Seeing how the data distribution changes after each step in the “refining” flow helps you build the right steps in the iterative cycle. Data distributions show how frequently each value occurs in the data, along with counts of missing values.
Fig 3: Data distributions for integer and string columns
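The kind of metrics a profile reports, value frequencies plus a count of missing values, can be sketched in a few lines of Python. This is an illustration only. The `profile_column` helper, its output shape, and the sample rows are assumptions, not the tool's profile format.

```python
from collections import Counter


def profile_column(rows, column):
    """Return the missing-value count and value frequencies for one column."""
    values = [row.get(column) for row in rows]
    present = [value for value in values if value not in (None, "")]
    frequencies = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(frequencies),
        # Most frequent values first.
        "frequencies": frequencies.most_common(),
    }


# Hypothetical sample data with one string column and one integer column.
rows = [
    {"city": "Austin", "age": 30},
    {"city": "", "age": 30},
    {"city": "Boston", "age": None},
    {"city": "Austin", "age": 41},
]

city_profile = profile_column(rows, "city")
age_profile = profile_column(rows, "age")
```

Running the same function before and after a refining step is one simple way to see how that step changed the distribution.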
4. Data visualization
Another way to understand the data is to look at its distribution visually. Watson Data Platform has a large selection of built-in visualization tools. It suggests an appropriate chart based on the data in the selected columns. You can take the suggested chart or choose one manually and customize the visualization to your liking.
Fig 4: Visualizations recommended by the tool for the selected columns
5. Iterative flow development and management
Cleaning data is almost always a multi-step, iterative process. You can choose from a variety of connections to access the data. The data can come from over 25 diverse types of sources, from flat files to relational and non-relational databases. Once connected, the refining process involves building a flow with multiple data cleaning and manipulation steps applied to a sample dataset. You can save the steps in your flow into a project and modify the flow later. Once the flow produces clean data on the sample dataset, it is ready to be applied to the full dataset using Spark. Once the job is completed, the transformed data is saved into the target location.
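The idea of building a flow on a sample and then applying the same steps to the full dataset can be sketched as follows. This is plain Python over in-memory lists, not the platform's flow format and not Spark code. The step functions, `run_flow`, and the data are hypothetical.

```python
def drop_empty_names(rows):
    """Step 1: remove rows that have no name."""
    return [row for row in rows if row["name"]]


def uppercase_country(rows):
    """Step 2: normalize country codes to upper case."""
    return [{**row, "country": row["country"].upper()} for row in rows]


def run_flow(steps, rows):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical full dataset; the first two rows stand in for the sample.
full_dataset = [
    {"name": "Ada", "country": "uk"},
    {"name": "", "country": "us"},
    {"name": "Grace", "country": "us"},
    {"name": "Linus", "country": "fi"},
]
sample = full_dataset[:2]

# The flow is an ordered list of steps that can be saved, edited, and re-run.
flow = [drop_empty_names, uppercase_country]

preview = run_flow(flow, sample)        # iterate on the sample first
result = run_flow(flow, full_dataset)   # then apply the same steps to everything
```

Keeping the flow as an ordered list of steps is what makes it repeatable: the same list can be run against a sample for quick feedback and against the full data once the result looks right.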
From the data flow details view in the project, you can monitor the execution status of current or previous runs, the sources and targets used, and the amount of data processed. You can re-use or enhance frequently used flows by opening them in the refiner, enhancing them, optionally changing target locations, and re-running them.
Data refining tools are now seamlessly integrated in Watson Data Platform. They allow both data scientists, who like to code, and data analysts, who prefer visual tools, to build repeatable data refining flows iteratively through rapid visual feedback and a rich set of transformations. The shaped data can be analyzed in the same project using Data Science Experience to produce valuable insights.