Getting closer to the source: the power of self-service data preparation

By: Sonali Surange

IBM Data Refinery, a feature of Watson Data Platform, helps reduce reliance on IT and give knowledge workers faster access to high-quality data

  • Eliminate the need to wait weeks or months for IT to develop new ETL processes

  • Empower data scientists and business analysts to clean and shape data sets for themselves

  • Increase confidence in analytics by enabling users to track data lineage back to the source

What attracts people to a career in data science? Many data scientists would probably claim that they love the challenge of developing algorithms and building machine learning models that turn previously unusable data into valuable insight.

That may be the dream, but what’s the reality? These days, most data scientists are spending up to 80 percent of their time sourcing and preparing data, leaving them very little time to focus on the more complex, interesting and valuable parts of their job.

ETL pain-points

In a typical data science workflow today, the first step is to identify relevant data sources, request data from different departments or source systems, and wait for the data to be delivered. This may require the IT team to build new custom extract, transform and load (ETL) processes to collect, integrate and reshape data from source systems – a development project that may take weeks or months.

By now, how much time has passed? Are your data scientists still on track to meet that important business deadline? In many cases, you are probably still waiting for your ETL request to reach the top of IT’s development queue – by which time, the window of opportunity for analysis may be long past.

Taking a shortcut

What if you could shortcut the whole process and get your data directly from the source? What if, instead of defining a fixed list of requirements for IT to build you an ETL job, you could prepare and shape the data yourself in a much more agile, interactive way?

That’s what IBM is helping you achieve with its new generation of self-service data preparation services in IBM® Watson® Data Platform.

Enabling self-service data preparation

IBM Data Refinery, a key feature of IBM Watson® Data Platform, provides a self-service data preparation environment where data scientists, analysts and engineers can quickly source, shape and share new data sets.

From the perspective of today’s typical data science workflows, this is a fundamentally new capability. It enables a much wider range of users – from data scientists and data engineers to business analysts – to prepare data for themselves, regardless of their technical skill level or their understanding of the underlying data storage infrastructure.

Interactive data transformation

Data Refinery enables users to combine and reshape data from various sources directly and interactively via an intuitive web-based user interface. Instead of having to specify their requirements up-front, the user can experiment with different data transformations in a free-flowing, iterative process – adding, removing and re-ordering steps until they find the right “recipe” to shape the data for future analysis.

In-context support and documentation are also available to aid users throughout the process, explaining how each transformation works and automatically suggesting common functions that the user might want to apply. If the result isn’t what the user was looking for, they can simply undo the last operation and try something else – with no need to worry that any data is getting lost or corrupted.
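
To make the "recipe" idea concrete, here is a minimal Python sketch of an ordered list of transformation steps that can be added to, previewed and undone. It is only an illustration of the concept, not Data Refinery's actual interface or API; the helper names (drop_column, keep_rows, run_recipe) and the sample rows are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_column(name: str) -> Step:
    """Return a step that removes one column from every row."""
    return lambda rows: [{k: v for k, v in r.items() if k != name} for r in rows]


def keep_rows(predicate: Callable[[Row], bool]) -> Step:
    """Return a step that keeps only the rows matching the predicate."""
    return lambda rows: [r for r in rows if predicate(r)]


def run_recipe(rows: List[Row], recipe: List[Step]) -> List[Row]:
    """Apply each step in order; the input rows are never mutated."""
    for step in recipe:
        rows = step(rows)
    return rows


# Hypothetical sample data used only for this illustration.
sample = [
    {"region": "EU", "sales": 120, "notes": "ok"},
    {"region": "US", "sales": 0, "notes": "missing"},
    {"region": "EU", "sales": 75, "notes": "ok"},
]

recipe: List[Step] = []
recipe.append(keep_rows(lambda r: r["sales"] > 0))  # add a step
recipe.append(drop_column("notes"))                 # add another step
preview = run_recipe(sample, recipe)

recipe.pop()                                        # undo the last step
undone = run_recipe(sample, recipe)
```

Because the recipe is just an ordered list, undoing or re-ordering a step is a list operation, and the sample itself is left untouched between attempts.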

Risk-free experimentation

Effectively, Data Refinery acts as a sandbox where the user can experiment without risk. None of the transformations are actually applied to the full data set until the user is completely happy with the recipe they have built, and is confident that the output will be robust.

Moreover, when the transformations are finally applied, Data Refinery saves the results as a completely new data set – so there is no risk of overwriting the original data.
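
The sandbox behavior described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than how Data Refinery is implemented: the functions preview and apply_to_new_data_set, the uppercase_region step and the sample rows are all hypothetical. The point is that both the preview and the final run work on copies, so the original data is never overwritten.

```python
import copy
from typing import Callable, Dict, List

Row = Dict[str, object]
Recipe = Callable[[List[Row]], List[Row]]


def uppercase_region(rows: List[Row]) -> List[Row]:
    """A deliberately in-place step, to show that copies protect the source."""
    for row in rows:
        row["region"] = str(row["region"]).upper()
    return rows


def preview(source: List[Row], recipe: Recipe, sample_size: int = 2) -> List[Row]:
    """Run the recipe on a copied sample only; the full data set is untouched."""
    sample = copy.deepcopy(source[:sample_size])
    return recipe(sample)


def apply_to_new_data_set(source: List[Row], recipe: Recipe) -> List[Row]:
    """Run the recipe on a full copy and return it as a separate data set."""
    return recipe(copy.deepcopy(source))


# Hypothetical source data used only for this illustration.
original = [
    {"region": "eu", "sales": 120},
    {"region": "us", "sales": 40},
    {"region": "apac", "sales": 75},
]
snapshot = copy.deepcopy(original)

shown = preview(original, uppercase_region)                  # experiment on a sample
refined = apply_to_new_data_set(original, uppercase_region)  # commit when satisfied
```

After both calls, original still holds the lowercase values, while refined is an independent result.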

Getting closer to the data

Self-service data preparation is essential, not only because it frees up valuable time, but also because it brings the right users closer to the data.

Data scientists, business analysts and other line-of-business users are often the people who have the best operational understanding of the data, so they are in the best position to prepare and shape it for productive analysis. They are also much more likely to be able to identify, diagnose and remediate data quality issues early in the process, potentially saving hours of wasted effort further down the line.

By helping these users work with data sources more independently, self-service data preparation also avoids bottlenecks between teams, reducing the impact of competing priorities on business goals and deadlines.

Tracking data back to the source

Finally, harnessing the self-service capabilities offered by IBM Data Refinery not only enables users to interact with data directly at the source; it also allows them to trace back all the transformations applied to a data set by previous users.

By exploring the data flow in IBM Data Refinery, they can easily see what changes the previous user has made – for example, dropping a column or filtering some of the rows – and can follow the trail back all the way to the original source if necessary.

If their project requires a different approach, they can simply backtrack to any earlier point in the data set’s history, apply a different recipe of transformations, and output a new version that meets their specific requirements. Every step in the data set’s evolution is clear, and none of this valuable history is ever lost.
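
As a rough illustration of lineage tracking and backtracking, the Python sketch below stores a data flow as a source name plus an ordered, immutable sequence of described steps. The DataFlow class, its methods and the file name weather.csv are hypothetical and are not taken from Data Refinery; they only show how keeping every step makes it possible to read the history and branch from an earlier point.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Step = Tuple[str, Callable[[List[Row]], List[Row]]]


@dataclass(frozen=True)
class DataFlow:
    """An immutable record of a source plus the steps applied to it."""
    source_name: str
    steps: Tuple[Step, ...] = ()

    def then(self, description: str, fn: Callable[[List[Row]], List[Row]]) -> "DataFlow":
        """Return a new flow with one more step; the existing flow is unchanged."""
        return DataFlow(self.source_name, self.steps + ((description, fn),))

    def lineage(self) -> List[str]:
        """List the source followed by every step description, in order."""
        return [self.source_name] + [description for description, _ in self.steps]

    def backtrack(self, n_steps: int) -> "DataFlow":
        """Return the flow as it was after its first n_steps steps."""
        return DataFlow(self.source_name, self.steps[:n_steps])

    def run(self, source_rows: List[Row]) -> List[Row]:
        """Replay every step against the source rows and return the result."""
        rows = source_rows
        for _, fn in self.steps:
            rows = fn(rows)
        return rows


# Hypothetical source data and file name, used only for this illustration.
source = [
    {"id": 1, "city": "Paris", "temp": 21},
    {"id": 2, "city": "Oslo", "temp": 9},
]

flow = (
    DataFlow("weather.csv")
    .then("drop column 'id'",
          lambda rows: [{k: v for k, v in r.items() if k != "id"} for r in rows])
    .then("keep rows where temp > 10",
          lambda rows: [r for r in rows if r["temp"] > 10])
)

# A second user backtracks to the first step and applies a different filter.
branch = flow.backtrack(1).then(
    "keep rows where temp <= 10",
    lambda rows: [r for r in rows if r["temp"] <= 10],
)
```

Calling flow.lineage() returns the trail from the source through each step, and the branch produces a new version without altering the earlier flow.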

Focusing on higher tasks

In a nutshell, IBM Data Refinery can give your data scientists more time to focus on exploring, analyzing and modeling data, and can help improve the quality of the data itself. They can get back to doing the parts of the job that they really enjoy – and that really add value – while IBM Data Refinery takes care of the rest.

To get started with IBM Data Refinery, or to learn more about how IBM Watson Data Platform can transform your organization’s data science capabilities, sign up for a free trial.