Spot the Difference: The role of visualization in effective data preparation

IBM Data Refinery helps accelerate the data cleansing process by visualizing relationships between fields instantly

  • Visualize correlations with just a few clicks

  • Gain insight to help reshape data sets more quickly

  • Focus on relevant fields for further analysis

It’s clear to many in the business community that big data analytics is the next mountain to climb. However, while the final destination may be obvious to everyone, ascending to the summit will need careful planning. In many cases, even reaching base camp can be a tortuous process.

For example, before large data sets can be properly analyzed by data scientists, or incorporated into machine learning models or neural networks, the data itself must be sourced, cleaned and reshaped into more useful structures. It’s estimated that many data scientists are spending as much as 80 percent of their working hours on these initial data refinement tasks.

The more time data scientists spend in these dark foothills of data science, the less time they have to acclimatize themselves to the more rarefied atmosphere of algorithm design and model training.

As a result, businesses are failing to reach the peaks of insight; instead of gaining a clear view to the horizon, they are lost in the detail of their data.

Visualizing the way forward

IBM Data Refinery, a service within IBM Watson Data Platform, provides a smooth and efficient data preparation process, which aims to lighten the load on individual data scientists and help them ascend to the next stage of the workflow more quickly.

One of the most important aspects of Data Refinery is its ability to represent data visually at the click of a mouse. It allows users to select multiple fields in their data set, and plot those fields against each other in a wide variety of charts, maps, and other graphical representations.

Whether a business scenario calls for a two-axis bar graph or a heat map of revenue across geographic zones, these visualizations can be quickly generated based on a representative data sample, creating an immediate visual reference.

Making the smart choice

Data Refinery makes it easy to gain insight by using smart algorithms to select appropriate visualizations automatically, depending on the types of data that the user wishes to plot.

For example, if the visualization engine detects that a certain data field contains country names or ZIP codes, it might suggest a map-based chart. On the other hand, if one axis is composed of discrete categories, while the other holds continuous numerical values, it might recommend a bar or line chart.
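The selection logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Data Refinery's actual algorithm: the keyword list, the function names, and the "scatter" and "table" fallbacks are all assumptions made for the example.

```python
from numbers import Number

# Hypothetical name hints for geographic fields; a real engine would
# inspect the values themselves, not only the column name.
GEO_HINTS = ("country", "zip", "postal", "state", "city")


def infer_kind(name, values):
    """Classify a field as 'geo', 'numeric', or 'categorical' (illustrative rules)."""
    if any(hint in name.lower() for hint in GEO_HINTS):
        return "geo"
    if values and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    ):
        return "numeric"
    return "categorical"


def suggest_chart(x_name, x_values, y_name, y_values):
    """Suggest a chart type for a pair of fields."""
    kinds = sorted((infer_kind(x_name, x_values), infer_kind(y_name, y_values)))
    if "geo" in kinds:
        return "map"      # country names, ZIP codes, and similar fields
    if kinds == ["categorical", "numeric"]:
        return "bar"      # discrete categories against continuous values
    if kinds == ["numeric", "numeric"]:
        return "scatter"  # assumption: not mentioned in the article
    return "table"        # assumption: fallback when nothing else fits


print(suggest_chart("country", ["US", "Mexico"], "revenue", [120.0, 95.5]))
print(suggest_chart("product", ["A", "B"], "revenue", [120.0, 95.5]))
```

The point of the sketch is that the decision depends only on the types of the two fields, which is why a suggestion can be made before anything is plotted.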

By providing a shortlist of potentially useful chart types and allowing the user to render them with a single click, the solution eliminates hours of frustrating fiddling with R or Python plotting libraries, and delivers insight in seconds.

More than a pretty picture

Unlike a traditional business intelligence tool, Data Refinery’s visualization capabilities aren’t intended to create attractive graphics that a data scientist can present to their coworkers.

Instead, they act as a crucial tool in the data refinement process itself. By quickly plotting the data, the user can get an instant feel for the “shape” of the data set, and the contents of each field.

If there are inexplicable outliers in one field of a data set, then plotting that field against other fields may reveal why those anomalous values exist. As they diagnose the problem, data scientists can alternate between shaping and visualizing the data, iteratively building a new, clean, high-quality data set that can be used as the basis for higher-level analysis.
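A plot makes outliers visible at a glance; the same check can be approximated numerically. The sketch below uses the common interquartile-range (IQR) rule, which is a stand-in chosen for this example rather than anything Data Refinery is documented to use. The function name, the multiplier of 1.5, and the sample values are assumptions.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample: mostly modest values plus two suspiciously large ones.
sample = [20, 35, 42, 55, 60, 75, 88, 100, 400, 2000]
print(flag_outliers(sample))
```

A rule like this only says that some values are unusual. Working out why they are unusual still means looking at them alongside the other fields, which is where plotting earns its keep.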

Visualization in practice

Imagine you have a data set from your sales team—a vast table of transactions, with a column that purports to contain the dollar value of each transaction. The vast majority of the values fall within a reasonably consistent range—let’s say between $20 and $100. But there is also a small number of transactions that seem to be significantly outside that range, between $400 and $2,000.

Are those your most valuable clients? Was there a promotion that encouraged customers to buy in bulk? Or is something odd going on with your data?

With Data Refinery, you can quickly plot the “dollar value” column against other fields in the data set, and uncover relationships that might help you understand the reason for the discrepancy.

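In this case, perhaps by plotting the “dollar value” column against sales territory, you see that while the transactions within your “normal” range all took place within the US, the outlier sales were all made in Mexico. At the time of writing, the Mexican peso is trading at around 19 pesos to the US dollar, so a $20 to $100 purchase would be recorded as roughly 380 to 1,900 pesos, which closely matches the outlier range. It seems likely that the values are correct but recorded in local currency, and that the data set is simply missing a column specifying the currency of the transaction.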
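The comparison by territory can also be expressed as a simple grouping. The following is a minimal sketch with made-up rows; the field layout and the function name are hypothetical, and the numbers are invented to mirror the scenario above.

```python
from collections import defaultdict

# Invented sample rows: (sales territory, recorded "dollar value").
transactions = [
    ("US", 25.0), ("US", 60.0), ("US", 98.0),
    ("Mexico", 475.0), ("Mexico", 1140.0), ("Mexico", 1900.0),
]


def range_by_territory(rows):
    """Return {territory: (smallest value, largest value)}."""
    grouped = defaultdict(list)
    for territory, value in rows:
        grouped[territory].append(value)
    return {t: (min(vals), max(vals)) for t, vals in grouped.items()}


summary = range_by_territory(transactions)
for territory, (lowest, highest) in summary.items():
    print(f"{territory}: {lowest} to {highest}")
```

Seeing two ranges that do not overlap, split cleanly by territory, is the numeric equivalent of the pattern that the chart reveals.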

Once you have identified the problem, you may be able to combine data from other fields to fill in the missing currency information, improving the quality of the data set and making it much more valuable for future analysis.
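As a rough sketch of that repair step, the missing currency can be derived from the sales territory and used to normalize every value to US dollars. The territory-to-currency mapping, the field names, and the fixed rate of 19 pesos per dollar (taken from the example above) are all assumptions; a real pipeline would look up the rate that applied on each transaction date.

```python
# Assumed lookup tables for this example only.
TERRITORY_CURRENCY = {"US": "USD", "Mexico": "MXN"}
UNITS_PER_USD = {"USD": 1.0, "MXN": 19.0}


def add_currency(rows):
    """Return copies of the rows with 'currency' and 'value_usd' filled in."""
    cleaned = []
    for row in rows:
        currency = TERRITORY_CURRENCY.get(row["territory"])
        if currency is None:
            value_usd = None  # unknown territory: leave for manual review
        else:
            value_usd = round(row["value"] / UNITS_PER_USD[currency], 2)
        cleaned.append({**row, "currency": currency, "value_usd": value_usd})
    return cleaned


rows = [
    {"territory": "US", "value": 60.0},
    {"territory": "Mexico", "value": 1900.0},
]
for row in add_currency(rows):
    print(row)
```

Keeping the original value next to the derived columns, rather than overwriting it, makes it possible to re-check the conversion later if the currency assumption turns out to be wrong.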

Plotting the next stage of the data science journey

The visualization capabilities of Data Refinery show data sets in a new light, revealing the problems and pitfalls of poor data quality, and helping to prevent missteps on the data science journey. The combination of visualization and data-shaping tools makes it easy for data scientists to remediate data quality issues iteratively, and to gain much greater confidence in the validity of their data.

As a result, Data Refinery helps data scientists plot a safe ascent through the lower levels of data preparation, and spend more of their time on the higher planes of analysis and modeling.

Want to try IBM Data Refinery for yourself? Sign up for IBM Watson Data Platform here, and get started in minutes.
