Don’t let data preparation get in the way of cognitive analytics

IBM Watson Analytics is opening up and democratizing analytics processes – helping data scientists explore new data sets and empowering line-of-business analysts with a user-friendly environment that helps them access advanced analytical tools in an intuitive way. However, before users can harness Watson Analytics’ cognitive capabilities, there are often other obstacles that need to be overcome. Accessing the data and preparing it for analysis can still be very time-consuming, preventing data scientists from delivering insights quickly, and discouraging line-of-business analysts from taking their first steps. I spoke with Edward Calvesbert, Program Director, Offering Management at IBM, about how IBM’s Watson Analytics and DataWorks teams are working together to remove these roadblocks and get data flowing smoothly through the analytics process.

By integrating IBM DataWorks as its data preparation tool of choice, IBM Watson Analytics helps citizen analysts and data scientists focus on business insights.

Edward, thanks for finding the time to speak with me. What are the biggest problems that people face when they’re trying to turn data into insight?

There are two major hurdles that any analyst traditionally faces – and these apply equally, regardless of whether they are a professional data scientist or a “citizen analyst” in a line-of-business role who just wants an answer to a pressing business question.

First, there’s the struggle to get hold of the data in the first place. If you are lucky, you might have all of your data in a spreadsheet – but if not, you will probably need to contact your IT department and ask them to extract it for you from some kind of source system, be it a database or an application. That’s going to delay you, and it’s also a burden on IT, because they are constantly being distracted from their core role to respond to ad-hoc data requests from the business.

Furthermore, there’s a real risk that the data you ask for isn’t the data that you actually get – if you don’t understand the data sources and structures, or the IT team misinterprets your request, you might need to go through several iterations of this process to actually get the information you need.

Second, once you have the data, you need to prepare it for whatever kind of analysis you are going to do – which might mean joining multiple data sets together, matching records, eliminating duplicates, classifying data types, correcting errors, and transforming fields. Again, this can be a time-consuming process, and it usually needs to be done each time you perform an analysis – so if you’re creating a daily, weekly or monthly report, you may need to go through the same data cleansing process for each reporting cycle.

As a consequence, it’s estimated that up to 80 percent of analysts’ time is spent on gathering and preparing data. The need to spend so much time on preparation diverts analysts from more valuable data interpretation work – and in some cases might discourage them from even attempting it.
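
To make those preparation steps concrete, here is a minimal Python sketch of what an analyst might otherwise do by hand: eliminating duplicates, converting data types, setting aside rows with errors, and joining to a lookup table. The sample rows, field names, and the `prepare` function are all hypothetical and purely illustrative; this is not how Watson Analytics or DataWorks works internally.

```python
from datetime import datetime

# Hypothetical sample data: a raw sales extract and a customer lookup table.
sales = [
    {"order_id": "1001", "customer_id": "C1", "amount": "250.00", "date": "2016-03-01"},
    {"order_id": "1002", "customer_id": "C2", "amount": " 99.5 ", "date": "2016-03-02"},
    {"order_id": "1002", "customer_id": "C2", "amount": " 99.5 ", "date": "2016-03-02"},  # duplicate
    {"order_id": "1003", "customer_id": "C9", "amount": "n/a", "date": "2016-03-03"},     # bad amount
]
customers = {"C1": "Acme Corp", "C2": "Globex"}


def prepare(sales_rows, customer_names):
    """Deduplicate, type, and join raw sales rows; return (clean, rejected)."""
    seen, clean, rejected = set(), [], []
    for row in sales_rows:
        # Eliminate duplicates: keep only the first row for each order_id.
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        try:
            # Classify data types: text to number and text to date.
            amount = float(row["amount"].strip())
            date = datetime.strptime(row["date"], "%Y-%m-%d").date()
        except ValueError:
            # Set aside rows that cannot be parsed so they can be corrected later.
            rejected.append(row)
            continue
        clean.append({
            "order_id": int(row["order_id"]),
            # Join: look up the customer name, flagging unmatched keys.
            "customer": customer_names.get(row["customer_id"], "UNKNOWN"),
            "amount": amount,
            "date": date,
        })
    return clean, rejected


clean, rejected = prepare(sales, customers)
```

Even in this toy form, every step has to be remembered and repeated for each reporting cycle, which is exactly the overhead described above.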

So users might give up before they even get to the actual “analysis” part of the process? How is IBM addressing this problem, especially in terms of Watson Analytics?

One of the biggest drivers for the Watson Analytics team is to build a product that removes all of the complexity from analyzing data: it uses natural language processing to allow users to write queries in plain English rather than learning SQL code; and it automatically highlights correlations in the data and suggests potentially interesting areas for analysis.

However, there’s no point in making the analysis easy if the user never gets to it because the initial data collection and preparation are too difficult. So we’ve seamlessly embedded IBM DataWorks, a cloud-native data access and preparation tool, into Watson Analytics to address this issue.

Cool – so what does DataWorks bring to the table?

First, instead of just importing CSV files or spreadsheets, users will now be able to securely connect to both cloud-based and on-premise databases and applications and feed the data directly into the Watson Analytics service. And we’re not just talking about IBM data sources – we’re building a wide range of connectors for technologies such as and Amazon S3, to name a few. So instead of asking IT to extract data for you and then uploading it manually, you’ll have access to everything you need with just a few clicks.

Second, the intuitive DataWorks interface within Watson Analytics enables you to apply all sorts of transformations to prepare your data for analysis. For example, it can recognize data types such as telephone numbers, dates, social security numbers and zip codes automatically, to help standardize messy data sets. We are even working on adding IBM’s rich entity resolution technology to DataWorks, allowing it to compare multiple records and use fuzzy matching techniques to identify possible matches or duplicates.
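
As a rough illustration of the two ideas in that answer, the Python sketch below guesses a value's type from its format and flags likely duplicate names with fuzzy matching. The patterns, the 0.85 threshold, and the function names are assumptions made for this example. The matching uses `difflib.SequenceMatcher` from Python's standard library as a simple stand-in; it is not IBM's entity resolution technology, and it does not reflect how DataWorks recognizes data types.

```python
import re
from difflib import SequenceMatcher

# Illustrative US-style patterns only; real recognizers handle far more variation.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "phone": re.compile(r"(\(\d{3}\) ?|\d{3}[-.])\d{3}[-.]\d{4}"),
}


def guess_type(value):
    """Return the name of the first pattern that matches the whole value."""
    text = value.strip()
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(text):
            return name
    return "text"


def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def possible_duplicates(names, threshold=0.85):
    """Return pairs of names whose similarity meets the (assumed) threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                pairs.append((names[i], names[j]))
    return pairs


column_types = [guess_type(v) for v in ["2016-03-01", "123-45-6789", "10504", "(914) 555-0100"]]
suspects = possible_duplicates(["Jon Smith", "John Smith", "Jane Doe"])
```

Comparing every pair of records, as this sketch does, only works for small data sets; it is meant to show the idea of fuzzy matching, not a scalable approach.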

You can also switch freely between data preparation, analytics, and visualization, so you can iterate on and refine both your transformations and your analyses until you are confident that your results are valid and valuable. And once you have reached that stage, you can save your work so that you can repeat the same process whenever you need to – a huge time-saver when you need to produce reports on a regular schedule.
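
The value of saving your work can be sketched in a few lines of Python: define the transformations once as data, store them, and replay them against each new extract. The step names, the `STEPS` registry, and the JSON "recipe" format below are hypothetical, invented for this example; the interview does not describe how DataWorks actually stores saved work.

```python
import json

# Hypothetical registry of named transformation steps.
# Each step takes a row and a field name and returns an updated copy of the row.
STEPS = {
    "strip_whitespace": lambda row, field: {**row, field: row[field].strip()},
    "uppercase": lambda row, field: {**row, field: row[field].upper()},
    "to_float": lambda row, field: {**row, field: float(row[field])},
}


def apply_recipe(recipe, rows):
    """Apply each saved step, in order, to every row."""
    out = []
    for row in rows:
        for step in recipe:
            row = STEPS[step["op"]](row, step["field"])
        out.append(row)
    return out


# Define the preparation steps once...
recipe = [
    {"op": "strip_whitespace", "field": "region"},
    {"op": "uppercase", "field": "region"},
    {"op": "to_float", "field": "amount"},
]
saved = json.dumps(recipe)  # ...save them as plain text...

# ...and replay them on each new reporting cycle's data.
week_1 = apply_recipe(json.loads(saved), [{"region": " emea ", "amount": "120.50"}])
week_2 = apply_recipe(json.loads(saved), [{"region": "apac", "amount": "80"}])
```

Because the recipe is stored as data rather than as a sequence of manual actions, each cycle applies exactly the same steps in the same order.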


Let’s say I’m a financial analyst or a marketing manager who wants to do some kind of weekly sales analysis – can you walk me through how the process is different with Watson Analytics and DataWorks, compared to the tools I’m using today?

OK, most probably what you’re doing today is asking your IT team to send you an extract of data from your sales system, and then you’re manipulating it in Excel, using VLOOKUPs and joining or splitting columns, and then generating some kind of pivot tables or graphs, or using another analytics tool. Every time you want to do that analysis, you need to go through the same steps, one by one, and make sure you haven’t forgotten anything or done anything differently from last time.

With Watson Analytics and DataWorks, you get the data directly from the source systems in a matter of seconds, you apply the transformations that you’ve already set up in DataWorks, and then you simply ask whatever questions you have of the data. Watson Analytics performs the predictive analysis and produces the visualizations for you automatically, and you can easily share them with your colleagues.

Effectively, we’re taking all of the pain out of the data preparation process and making the whole data-analysis-insight-presentation cycle seamless.

What if my organization uses Watson Analytics for specific tasks such as data exploration, but also sees the value of a data preparation solution in other areas?

DataWorks isn’t just available as part of Watson Analytics – it’s also a standalone product that can feed other cloud data services such as IBM Cloudant, dashDB, and DB2 on Cloud. So if you like what it does within Watson Analytics, you can use it with these other data targets as well.

Great! So how can data scientists and citizen analysts take a look at what Watson Analytics and DataWorks have to offer?

If you’re already a Watson Analytics Plus or Professional user, you’ll now be using DataWorks by default when you log in, so you can start benefiting from it right away. If you are using the free version, or if you don’t have a Watson Analytics account yet, you can sign up at for the Plus or Professional service.

Alternatively, if you want to try DataWorks on its own, you can check out the DataWorks service available on IBM Bluemix at
