What’s behind data preparation for AI?

4 minute read | February 25, 2020

Data is a fundamental element of AI. If you have no data, you won’t get any meaningful insights from the AI techniques out there. Luckily, we live in a world that generates tons of data every day, and storage has become so affordable that we can now put zillions of bytes of data to use with AI, right? Well, not quite. Having lots of data isn’t enough to help you play well in the AI game. You have to know how to work with it.

Consider this example: Imagine your company is doing market research for a global client in the consumer goods industry. The client wants to use AI to analyze trends in consumer behavior, and it has sales data on its products from supermarkets in many countries. However, you quickly run into a challenge: one product — say a detergent — can have a different brand name in almost every country. How can your AI model provide meaningful insights in such a situation?

This problem is just one example of a scenario in which data is not ready to be used, even though you have it. There are many other issues that might arise. For example:

  • Data formats might differ; for instance, one supermarket chain stores its data as CSV files while others use Excel spreadsheets
  • Different values could mean the same thing; for instance, some sources could use M / F for gender while others use Male / Female
  • Some sources might lack information that others collect; for instance, state and city
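To make these issues concrete, here is a minimal sketch in Python that maps records from two sources onto one common schema. It uses only the standard library and is an illustration, not a description of any particular tool. The brand names, the `BRAND_TO_PRODUCT` and `GENDER_CODES` lookup tables, and the `normalize` helper are all hypothetical. JSON stands in for the second source format to keep the sketch dependency-free; reading Excel spreadsheets would require a third-party library.

```python
import csv
import io
import json

# Hypothetical lookup tables. In practice these would come from the client
# or a master-data system rather than being hard-coded.
BRAND_TO_PRODUCT = {
    "cleanwave": "detergent-001",   # invented brand name used in country A
    "blancomar": "detergent-001",   # invented brand name used in country B
}
GENDER_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def normalize(record):
    """Map one raw sales record onto a common schema."""
    brand = (record.get("brand") or "").strip().lower()
    gender = (record.get("gender") or "").strip().lower()
    return {
        # Unknown brands become None so they can be reviewed, not guessed.
        "product_id": BRAND_TO_PRODUCT.get(brand),
        "gender": GENDER_CODES.get(gender),
        # Some sources do not collect the city; keep that gap explicit.
        "city": record.get("city") or None,
        "units": int(record["units"]),
    }


# Source 1: a chain that exports CSV and uses M / F.
csv_text = "brand,gender,city,units\nCleanWave,M,Lyon,3\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Source 2: a different format, different constants, and no city field.
json_text = '[{"brand": "BlancoMar", "gender": "Female", "units": "5"}]'
json_rows = json.loads(json_text)

unified = [normalize(row) for row in csv_rows + json_rows]
for row in unified:
    print(row)
```

After normalization, both records point to the same product identifier even though the brand names differ, and the missing city shows up as an explicit `None` instead of silently disappearing.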

These are all details you have to attend to when dealing with data. That’s why data preparation is so important before you can begin to analyze your data with AI.

Data preparation

It’s often said that data preparation takes up about 80 percent of a data science project’s lifecycle. This is because a data scientist needs to clean the data before it’s used in an AI model.

The data preparation process may include filling in missing values (but with what? a default value? something else?), removing duplicate entries, standardizing attributes (gender or product names, for instance), masking sensitive data where the law requires it, and more. It’s also important to decide which part of the data to use for training a model, which for validating it and which for testing it (this is called data partitioning). Otherwise, your model will run into issues during its training phase, as we’ll discuss in future blog posts.
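The steps just listed can be sketched in plain Python. This is a minimal example under stated assumptions, not a definitive implementation: the field names, the `prepare` and `partition` helpers, the "unknown" default and the 70/15/15 split are all hypothetical choices made for illustration.

```python
import hashlib
import random


def prepare(records, default_city="unknown"):
    """Deduplicate, fill missing values, and mask a sensitive field."""
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec["customer_email"], rec["product_id"], rec["date"])
        if key in seen:  # drop repeats of the same sale
            continue
        seen.add(key)
        email_hash = hashlib.sha256(rec["customer_email"].encode()).hexdigest()
        cleaned.append({
            # Masking with a truncated one-way hash is an assumption here.
            # What the law actually requires may be stricter.
            "customer": email_hash[:12],
            "product_id": rec["product_id"],
            "date": rec["date"],
            # A default value is only one option; dropping the row or
            # inferring the value from other fields are alternatives.
            "city": rec.get("city") or default_city,
        })
    return cleaned


def partition(rows, train_pct=70, validate_pct=15, seed=42):
    """Shuffle rows and split them into training, validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    n_train = len(rows) * train_pct // 100
    n_val = len(rows) * validate_pct // 100
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


# Invented sample data: 20 sales, half of them without a city.
raw = [
    {
        "customer_email": f"user{i}@example.com",
        "product_id": "detergent-001",
        "date": "2020-02-01",
        "city": "Lyon" if i % 2 else None,
    }
    for i in range(20)
]
raw.append(dict(raw[0]))  # simulate a duplicate entry

cleaned = prepare(raw)
train_set, validation_set, test_set = partition(cleaned)
print(len(cleaned), len(train_set), len(validation_set), len(test_set))
```

With this sample data, the duplicate is removed, leaving 20 rows that are split into 14 for training, 3 for validation and 3 for testing. Each row lands in exactly one of the three sets, so the model is never evaluated on data it was trained on.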

Figure 1: The lifecycle of a data science project

Data scientists report that the task of data preparation is often tedious and error-prone. It is, however, the most important step for ensuring more accurate insights from AI. If you teach children a false answer in school, they’ll give a false answer when they encounter a similar problem in real life. AI follows the same rationale. If an algorithm learns a wrong answer from incorrect, imprecise data, its insights will not be useful, and might even point in the wrong direction.

If I were to give you one hint about the AI game, it would be this: invest in data preparation!

Tools that can help

So, you might be wondering, are there tools to help data scientists prepare data? Of course! Tooling exists to help with performing data transformation, deduplication, filtering, aggregation, partitioning, visualization and more. IBM Watson Studio Local and IBM Cloud Pak for Data include tools to aid with data preparation. For example, an IBM POWER9 cluster can perform data preparation with Watson Studio Local, as well as training and inferencing with Watson Machine Learning Accelerator.

Whichever tools you decide to use, have some goals in mind:

  • Automation of data preparation: Can the tool handle all required data operations?
  • Ability to work with raw data that has never been formatted
  • Scalability: Data sets can be very large, so your tool needs to handle that, especially in today’s world of a zillion bytes of data
  • Collaboration across business units, because you don’t want to create more data silos
  • Ease of use: How easily can this tool perform the operations you need and connect to the data sources you have?
  • Support for new data sources, so that more current data can be prepared and used and AI models can identify changing trends more quickly
  • Interoperability with cloud solutions

Data preparation is by no means simple, and there are many more nuances to it than I described in this blog post. I hope, though, that this has helped you understand how important data preparation for AI is. In my next post, I’ll talk about model training.

Meanwhile, if you have questions about what solutions IBM has for data management and getting data ready for AI, or if you’re looking to consult with experienced technical professionals on an AI solution for your business, contact IBM Systems Lab Services.