
How to prepare and clean data through quick operations, data profiles and visualization

In a typical data science workflow, the initial steps are to identify relevant data sources and request data from various departments and source systems. The bulk of time is then spent cleaning and transforming the data before models can be built.

Feature engineering pain points

In most environments, data scientists have to wait for experts to provide the requested data. Depending on the effort involved and on competing priorities, it can take weeks or months before data is available for analysis. The business impact of these delays includes crippled analytics projects, missed windows of opportunity, and unmet deadlines.

Taking a shortcut

In the ideal scenario, data scientists have the tools to engineer features themselves and can build models without depending on other teams. IBM offers agile, interactive self-service data preparation services for data scientists.

graph showing Watson Data Refinery

Self-service data preparation

IBM Watson Studio and IBM Watson Knowledge Catalog include Data Refinery for self-service data preparation. Data Refinery puts data pre-processing and feature engineering in the hands of data scientists, enabling faster data insights. Data visualization helps with the tedious, iterative tasks of cleansing and shaping data.

Watson Knowledge Catalog enables data scientists to access curated data sets that are known and trusted by the organization. Immediate access to ‘source of truth’ data can simplify or eliminate a number of process steps.

Data Refinery key capabilities include:

  • Intuitive data transformation: Shape and clean data through graphical interfaces, or in code using a library of templates populated with powerful data transformation operations
  • Rapid feedback on the data shaping process: Data visualizations and profiles help to guide data preparation steps, reveal data quality issues, and avoid missteps
  • Increased confidence in data quality: Incremental snapshots enable data scientists to gauge success over time, and visualization and data-shaping tools make it easy to iteratively remediate data quality issues
  • Support for large data sets: Steps can be saved in projects, edited, and run against larger data sets
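
To make the idea of a data profile concrete, here is a minimal sketch in plain Python. It is not Data Refinery's implementation or API; the `profile_column` helper and the sample records are hypothetical, invented purely for illustration. It shows the kind of per-column summary (row count, missing values, distinct values, most frequent values) that can reveal data quality issues before any shaping begins.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: row count, missing count, distinct count, top values.

    Hypothetical helper for illustration only; not part of any IBM product.
    """
    values = [row.get(column) for row in rows]
    # Treat None and blank or whitespace-only strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "top": counts.most_common(3),
    }


# Hypothetical sample records, made up for this sketch.
sales = [
    {"customer_id": "C1", "region": "East"},
    {"customer_id": "C2", "region": ""},
    {"customer_id": "C1", "region": "East"},
    {"customer_id": "C3", "region": None},
    {"customer_id": "C4", "region": "West"},
]

region_profile = profile_column(sales, "region")
print(region_profile)
```

A profile like this would flag that two of the five `region` values are missing, which is the kind of signal that guides the next cleaning step.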

Use case example: Identify potential repeat customers from sales history data

In the following video, I’ll demonstrate the visual capabilities of IBM’s data shaping tools using an end-to-end example:

  • Refine depersonalized, sensitive data in Watson Knowledge Catalog
  • Clean and refine data using quick, built-in operations or code
  • Leverage incremental snapshots, data profiles, and visualizations to guide the preparation of data for analysis
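
For readers who want a feel for the logic behind this use case, the following is a rough sketch in plain Python, not the steps shown in the video and not Data Refinery code. The `find_repeat_customers` function, the field names `customer_id` and `order_date`, and the sample orders are all assumptions made for illustration. It cleans the customer identifier, drops rows that cannot be attributed to a customer, and treats anyone with purchases on at least two distinct dates as a potential repeat customer.

```python
from collections import defaultdict
from datetime import date


def find_repeat_customers(orders, min_orders=2):
    """Return sorted IDs of customers who purchased on at least `min_orders` distinct dates.

    Hypothetical helper; the field names are assumptions for this sketch.
    """
    dates_by_customer = defaultdict(set)
    for order in orders:
        # Clean step: normalize the ID by trimming whitespace and upper-casing.
        customer = (order.get("customer_id") or "").strip().upper()
        if not customer or order.get("order_date") is None:
            continue  # Skip rows that cannot be attributed to a customer or date.
        # Using a set means duplicate rows for the same date count once.
        dates_by_customer[customer].add(order["order_date"])
    return sorted(
        customer
        for customer, dates in dates_by_customer.items()
        if len(dates) >= min_orders
    )


# Hypothetical sales history, made up for this sketch.
orders = [
    {"customer_id": "c1 ", "order_date": date(2019, 1, 5)},
    {"customer_id": "C1", "order_date": date(2019, 2, 9)},
    {"customer_id": "C2", "order_date": date(2019, 1, 7)},
    {"customer_id": "C2", "order_date": date(2019, 1, 7)},  # duplicate row
    {"customer_id": None, "order_date": date(2019, 3, 1)},  # missing ID
    {"customer_id": "C3", "order_date": date(2019, 1, 2)},
    {"customer_id": "C3", "order_date": date(2019, 4, 2)},
]

repeat_customers = find_repeat_customers(orders)
print(repeat_customers)
```

The threshold of two distinct purchase dates is an arbitrary choice for the sketch; a real analysis would pick the definition of "repeat" that fits the business question.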

Stay tuned for Part 2 of this blog where I’ll show examples of combining financial data with sales data through the power of collaboration.

Part 3 of this blog will provide a deeper dive into preparing and analyzing a mixture of structured and unstructured data sources.

To sign up for Watson Studio, click here.

To sign up for Watson Knowledge Catalog, click here.

Learn more about the Watson Data Refinery tool, available via Watson Studio and Watson Knowledge Catalog.

Lead Architect, Senior Technical Staff Member, IBM Watson Data and AI
