Data Science

Closing the loop: the case for building a coherent platform for data science and machine learning

Many organizations have started to explore the value that machine learning can bring—from illuminating previously “dark data” such as images and videos, to creating models that help to guide or even automate business decision-making.

However, very few companies have gone beyond pilots and prototypes, or made the transition from one-off projects to a scalable, repeatable workflow. Too often, machine learning exists in a bubble of its own, instead of being understood in the context of the broader data science workflow.

When a false dichotomy exists between machine learning engineers and other data scientists, it becomes very difficult to operationalize the machine learning process. Data sets do not simply appear out of thin air, in the right format to be turned into machine learning models. The preparatory stages of exploring the data, understanding its problem-solving potential, and defining the right questions to ask are inseparable from a successful machine learning strategy.

Since both machine learning and data science are relatively new disciplines, their codependence hasn’t always been recognized, even by their practitioners. In many organizations, machine learning projects involve different teams, different tools and different processes compared to other data science activities, with little or no coordination between them.

Dangers of a disjointed workflow

From a business perspective, this division between two different aspects of the same domain is less than ideal. Instead of data flowing from capture to exploration to modeling in a seamless, efficient process, there are time-consuming handoffs between different teams, and labor-intensive, manual processes to carry data from one environment to another.

Even if a single team is responsible for the whole process from end to end, there's typically a lot of context-switching, as the tools, languages and frameworks used for data exploration don't integrate well with the tools used for model building. As a result, data scientists spend a lot of time wrangling their data sets into whatever format is needed for the next step in their workflow, instead of focusing on the task at hand. In fact, it's estimated that up to 80 percent of a data scientist's time is spent on this kind of data preparation work, leaving only 20 percent for real analysis.
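
To make that claim concrete, here's a minimal sketch of the kind of routine preparation work that eats up that time before any modeling can begin. It uses pandas, and the file name and column names are entirely hypothetical placeholders:

```python
import pandas as pd

# Hypothetical raw export: the file, columns and types stand in for
# whatever an upstream system actually produces.
df = pd.read_csv("transactions_raw.csv")

# Typical pre-modeling chores: none of this is analysis, yet all of
# it has to happen before a model can be trained.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df.dropna(subset=["customer_id", "order_date"])
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.drop_duplicates(subset=["transaction_id"])

# Only now is the data in a shape the modeling step can consume.
df.to_parquet("transactions_clean.parquet")
```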

How a data science platform can help

At IBM, we’re working on a solution to this problem called the IBM® Watson® Data Platform. The platform includes a set of best-of-breed experiences to handle different parts of the data science workflow—including IBM Data Catalog for cataloging, discovery and governance; IBM Data Science Experience for data exploration, visualization and initial model design; and IBM Watson Machine Learning for training, testing, deploying and monitoring machine learning models and neural networks.

The important point about Watson Data Platform is that it doesn’t treat these solutions as independent, unrelated tools. Instead, it provides all the underlying pipework necessary to link them together into a seamless end-to-end workflow, as well as APIs to help you interact with them.

Connecting the dots

From a user experience perspective, there’s a single unified user interface that provides immediate access to each of the tools, with no need to switch context or log into a new application. With just a few clicks, you can search Data Catalog for the data sets you need, open them in IBM Data Science Experience for initial analysis, and then leverage them with Watson Machine Learning to start creating a model. The architecture of the entire platform is built on microservices that communicate via well-defined APIs. As a result, it is easy to move data from one tool to another, and to save the results back into the catalog in a fully traceable and auditable way. The microservices approach also allows for maximum scalability, and enables the use of common services for functions like logging and error handling, which helps to simplify maintenance and troubleshooting.
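
As a rough illustration of what such an API-driven workflow could look like, here's a sketch in Python. The host, endpoint paths and payloads below are hypothetical placeholders chosen for illustration, not the platform's actual API:

```python
import requests

# Hypothetical endpoints and payloads -- illustrative only; the real
# Watson Data Platform API paths and schemas may differ.
BASE = "https://api.example.ibm.com"           # placeholder host
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credentials

# 1. Find a governed data set in the catalog.
assets = requests.get(f"{BASE}/catalog/v1/assets",
                      params={"query": "customer_churn"},
                      headers=HEADERS).json()
asset_id = assets["results"][0]["id"]

# 2. Attach it to a project for exploration in Data Science Experience.
requests.post(f"{BASE}/projects/v1/my-churn-project/assets",
              json={"asset_id": asset_id}, headers=HEADERS)

# 3. Kick off model training in Watson Machine Learning against the
#    same asset -- no manual export/import steps in between.
training = requests.post(f"{BASE}/ml/v1/training_runs",
                         json={"asset_id": asset_id,
                               "framework": "spark-mllib"},
                         headers=HEADERS).json()
print(training["status"])
```

The point of the sketch is the shape of the workflow: each step hands an asset identifier to the next service, rather than a data scientist hand-carrying files between tools.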

The same principles apply to the interface and documentation: all the tools use the same shared terminology and have the same type of documentation, so barriers to entry are kept as low as possible.

Flexible, modular design

At the same time, the platform aims to be flexible and modular. If you only need IBM Data Catalog, you can use it as a standalone product. But as your data science requirements grow, you can easily take advantage of other components too. Each of the tools is also designed to support as wide a range of existing languages, frameworks and technologies as possible: for example, Data Catalog supports more than 30 types of data sources, while Watson Machine Learning lets you build models with Spark MLlib, SPSS Modeler, or various Python machine learning libraries.
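
For example, a Spark MLlib model of the kind Watson Machine Learning can train might be assembled like this. This is a generic MLlib sketch with hypothetical data and column names, not platform-specific code:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-example").getOrCreate()

# Hypothetical training data: file and column names are placeholders.
df = spark.read.parquet("transactions_clean.parquet")

# A standard MLlib pipeline: index the label, assemble the feature
# vector, then fit a logistic regression classifier.
label = StringIndexer(inputCol="churned", outputCol="label")
features = VectorAssembler(inputCols=["amount", "tenure_months"],
                           outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[label, features, lr]).fit(df)
model.write().overwrite().save("churn_model")  # ready for deployment
```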

With IBM Watson Data Platform, we’re hoping to take data scientists on the same journey that software developers have experienced with the DevOps revolution. Instead of an archipelago of tiny islands of expertise, each rehearsing its own strange rituals, we’ll have a single well-connected community that can manage the entire end-to-end data science workflow efficiently.

To learn more and explore the roadmap, please visit the IBM Watson Data Platform website.

