By: Ronald Miller

Powerful data transformation and visualization with IBM Data Refinery

As a data scientist, you probably spend a lot of time cleansing, shaping and formatting your data before you can analyze it. According to a recent report, data scientists spend up to 80 percent of their time finding and preparing data. And 57 percent of data scientists said that cleaning and organizing data is the least enjoyable part of their job. The problem isn’t limited to data scientists. Business analysts face similar struggles to obtain the data they need to build reports, and often have to wait weeks for their IT team to extract data from the source systems.

To address the issue, IBM has built a tool that makes fast, self-service data preparation a reality. IBM® Data Refinery is an intuitive cloud-based data preparation service, where you can quickly source, shape and share your data sets.

Powerful data transformation

IBM Data Refinery allows you to explore your data via a web interface and specify a series of operations, much like a recipe, to transform it into the format your downstream process needs for analysis. Once you’re happy with the sample output, Data Refinery applies the recipe to the entire data set using Apache Spark as the execution engine. This removes the limitations of working with spreadsheets, which generally can’t cope with data sets that are too large to fit in memory.
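
The recipe idea can be illustrated with a minimal sketch in plain Python. This is not Data Refinery's API and it does not use Spark; the step functions, the `apply_recipe` helper and the sample rows are all hypothetical names invented for illustration. It only shows the pattern described above: define an ordered list of operations, preview the result on a sample, then run the same list over the whole data set.

```python
from typing import Callable, Iterable, Iterator, Optional

Row = dict
# A step takes a row and returns the transformed row, or None to drop it.
Step = Callable[[Row], Optional[Row]]


def trim_name(row: Row) -> Optional[Row]:
    """Strip surrounding whitespace from the 'name' field."""
    return {**row, "name": row["name"].strip()}


def drop_missing_amount(row: Row) -> Optional[Row]:
    """Filter out rows whose 'amount' field is empty."""
    return row if row["amount"] not in ("", None) else None


def amount_to_float(row: Row) -> Optional[Row]:
    """Convert the 'amount' field from text to a number."""
    return {**row, "amount": float(row["amount"])}


# The "recipe": an ordered list of operations (hypothetical example steps).
RECIPE = [trim_name, drop_missing_amount, amount_to_float]


def apply_recipe(rows: Iterable[Row], recipe: list) -> Iterator[Row]:
    """Run every step, in order, on each row; skip rows that a step drops."""
    for row in rows:
        for step in recipe:
            row = step(row)
            if row is None:
                break
        else:
            yield row


# Made-up data for illustration only.
full_data = [
    {"name": " Ada ", "amount": "10.5"},
    {"name": "Bob", "amount": ""},
    {"name": " Cy", "amount": "3"},
]

# Preview the recipe on a small sample first, then apply it to everything.
preview = list(apply_recipe(full_data[:2], RECIPE))
result = list(apply_recipe(full_data, RECIPE))
```

Because the recipe is just data (an ordered list of steps), the same definition can be checked on a sample and then handed to a larger execution engine unchanged, which is the design point the paragraph above makes.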

At the same time, Data Refinery is also substantially easier to use than traditional ETL tools. It provides a user-friendly point-and-click interface for selecting and combining a wide range of built-in operations. Data Refinery also allows you to use the power of the R programming language for more complex operations, and provides in-context documentation to help users become productive with R syntax quickly.

Combined metrics and visualization

Data-shaping is an iterative and time-consuming process. In a traditional workflow, you might use one set of tools to apply various transformations to your data, and then load it into another tool to visualize and evaluate the results. Over many cycles, this continual tool-hopping becomes a source of frustration and delay.

Data Refinery eases the pain by integrating both profiling metrics and visualizations as two tabs in a single interface, so you can move between views with a single click. The Profile tab contains descriptive statistics to help you understand the distribution of values in each field. The Visualization tab allows you to select a combination of fields to build a chart. It automatically suggests appropriate plots (bar charts, heat maps, and so on), based on the data types.
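
As a rough sketch of what profiling a field and suggesting a plot from data types might look like, consider the following plain Python. The rules here are assumptions made purely for illustration and do not reflect how Data Refinery actually chooses its suggestions; `profile`, `suggest_chart` and the sample fields are hypothetical names.

```python
import collections
import statistics


def profile(values: list) -> dict:
    """Return simple descriptive statistics for one field."""
    if not values:
        raise ValueError("cannot profile an empty field")
    if all(isinstance(v, (int, float)) for v in values):
        return {
            "type": "numeric",
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
    counts = collections.Counter(values)
    return {
        "type": "categorical",
        "count": len(values),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0],
    }


def suggest_chart(x_profile: dict, y_profile: dict) -> str:
    """Pick a plot from the two field types (illustrative rules only)."""
    kinds = {x_profile["type"], y_profile["type"]}
    if kinds == {"numeric"}:
        return "scatter plot"
    if kinds == {"categorical"}:
        return "heat map"
    return "bar chart"  # one categorical field and one numeric field


# Made-up fields for illustration only.
region = ["east", "west", "east"]
sales = [10, 20, 30]

region_profile = profile(region)
sales_profile = profile(sales)
suggestion = suggest_chart(region_profile, sales_profile)
```

Keeping the profile and the chart suggestion as two functions over the same field summaries mirrors the two-tab layout described above: both views are derived from the same data without moving it to another tool.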

Connecting to data sources

Data Refinery comes with a comprehensive set of 25 prebuilt data connectors, which allow you to set up connections to a wide range of commonly used on-premises and cloud data stores. If an appropriate connector is available, and you have the proper credentials to log into the system, you can access and import data without needing help from your IT team.

Since Data Refinery runs entirely in the cloud, there is no need to download any data onto your own machine, which means you don’t have to worry about storage capacity, and the data is protected if your PC breaks down or gets lost or stolen. In addition, you aren’t limited to the resources on your laptop, since Data Refinery can scale out your heavier jobs by using powerful Spark compute clusters in the cloud.

Integrated data governance

In the modern world, data governance is not just a “nice to have”; it is a “must have”. Even within your organization, you will almost certainly need to define and enforce a set of governance policies to ensure that only authorized users can access sensitive data sets. In many cases, you may also want to monitor who is creating and using your data. With traditional, fragmented data science tools, this level of control is extremely difficult to achieve, and any gaps in governance can result in a data breach.

Data Refinery combines with IBM Data Catalog, another service within IBM Watson® Data Platform, to solve these problems. All data sets within Data Refinery are governed by the rules and policies set in the Data Catalog, and can be controlled and managed from end to end.

Fully integrated in IBM Watson Data Platform

Seamless integration with IBM Data Catalog is just one of the ways in which Data Refinery works with IBM Watson Data Platform to simplify the data science workflow. It also integrates fully with IBM Data Science Experience, which allows the data coming out of Data Refinery to be analyzed without extra steps. In fact, Watson Data Platform provides a complete, end-to-end solution for all data science tasks, from data ingestion and storage through to model training, evaluation and deployment.

To learn how to get started with Data Refinery, check out these demo videos.

