Closing the loop: the case for building a coherent platform for data science and machine learning
Many organizations have started to explore the value that machine learning can bring—from illuminating previously “dark data” such as images and videos, to creating models that help to guide or even automate business decision-making.
However, very few companies have gone beyond pilots and prototypes, or made the transition from one-off projects to a scalable, repeatable workflow. Too often, machine learning exists in a bubble of its own, instead of being understood in the context of the broader data science workflow.
When a false dichotomy exists between machine learning engineers and other data scientists, it becomes very difficult to operationalize the machine learning process. Data sets do not simply appear out of thin air, in the right format to be turned into machine learning models. The preparatory stages of exploring the data, understanding its problem-solving potential, and defining the right questions to ask are inseparable from a successful machine learning strategy.
Since both machine learning and data science are relatively new disciplines, their interdependence hasn’t always been recognized, even by their practitioners. In many organizations, machine learning projects involve different teams, tools and processes from those used for other data science activities, with little or no coordination between them.
Dangers of a disjointed workflow
From a business perspective, this division between two different aspects of the same domain is less than ideal. Instead of data flowing from capture to exploration to modeling in a seamless, efficient process, there are time-consuming handoffs between different teams, and labor-intensive, manual processes to carry data from one environment to another.
Even if a single team is responsible for the whole process from end to end, there’s typically a lot of context-switching, as the tools, languages and frameworks used for data exploration don’t integrate well with the tools used for model building. As a result, data scientists spend a lot of time wrangling their data sets into whatever format is needed for the next step in their workflow, instead of focusing on the task at hand. In fact, it’s commonly estimated that up to 80 percent of a data scientist’s time is spent on this kind of data preparation work, leaving only 20 percent for real analysis.
How a data science platform can help
At IBM, we’re working on a solution to this problem called the IBM® Watson® Data Platform. The platform includes a set of best-of-breed experiences to handle different parts of the data science workflow: IBM Data Catalog for cataloging, discovery and governance; IBM Data Science Experience for data exploration, visualization and initial model design; and IBM Watson Machine Learning for training, testing, deploying and monitoring machine learning models and neural networks.
The important point about Watson Data Platform is that it doesn’t treat these solutions as independent, unrelated tools. Instead, it provides all the underlying pipework necessary to link them together into a seamless end-to-end workflow, as well as APIs to help you interact with them.
Connecting the dots
From a user experience perspective, there’s a single unified user interface that provides immediate access to each of the tools, with no need to switch context or log into a new application. With just a few clicks, you can search Data Catalog for the data sets you need, open them in IBM Data Science Experience for initial analysis, and then leverage them with Watson Machine Learning to start creating a model.
The architecture of the entire platform is built on microservices that communicate via well-defined APIs. As a result, it is easy to move data from one tool to another, and to save the results back into the catalog in a fully traceable and auditable way. The microservices approach also allows for maximum scalability, and enables the use of common services for functions like logging and error handling, which helps to simplify maintenance and troubleshooting.
The same principles apply to the interface and documentation: all the tools use the same shared terminology and have the same type of documentation, so barriers to entry are kept as low as possible.
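To make the idea of saving results back into a catalog "in a fully traceable and auditable way" more concrete, here is a minimal Python sketch. It is purely illustrative and does not use or represent any Watson Data Platform API: the names CatalogEntry, InMemoryCatalog, save, get and lineage are all hypothetical, and the catalog is a simple in-memory stand-in for what would in practice be a service behind an API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """A data set or model registered in the (hypothetical) catalog."""
    name: str
    payload: object
    derived_from: list = field(default_factory=list)  # names of parent entries
    created_by: str = "unknown"
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog service, kept in memory."""

    def __init__(self):
        self._entries = {}

    def save(self, entry):
        # Refuse results whose parents are not themselves cataloged,
        # so every entry can be traced back to its sources.
        for parent in entry.derived_from:
            if parent not in self._entries:
                raise KeyError(f"unknown parent entry: {parent}")
        self._entries[entry.name] = entry
        return entry

    def get(self, name):
        return self._entries[name]

    def lineage(self, name):
        """Return the names of all ancestors of an entry, nearest first."""
        seen = []
        queue = list(self._entries[name].derived_from)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self._entries[current].derived_from)
        return seen


catalog = InMemoryCatalog()
catalog.save(CatalogEntry("raw_sales", [3.0, 5.0, None, 7.0], created_by="ingest"))

# Exploration step: clean the data and save the result back with its parent.
raw = catalog.get("raw_sales").payload
clean = [x for x in raw if x is not None]
catalog.save(CatalogEntry("clean_sales", clean, ["raw_sales"], created_by="explore"))

# Modeling step: a trivial "model" (the mean) registered with its training data.
model = sum(clean) / len(clean)
catalog.save(CatalogEntry("sales_mean_model", model, ["clean_sales"], created_by="train"))

print(catalog.lineage("sales_mean_model"))  # ['clean_sales', 'raw_sales']
```

The design choice worth noting is that each step records what it was derived from at the moment it saves its result, so an audit trail falls out of the normal workflow rather than being reconstructed afterwards.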
Flexible, modular design
At the same time, the platform aims to be flexible and modular. If you only need IBM Data Catalog, you can use it as a standalone product. But as your data science requirements grow, you can easily take advantage of other components too. Each of the tools is also designed to support as wide a range of existing languages, frameworks and technologies as possible: for example, Data Catalog supports more than 30 types of data sources, while Watson Machine Learning lets you build models with Spark MLlib, SPSS Modeler, or various Python machine learning libraries.
With IBM Watson Data Platform, we’re hoping to take data scientists on the same journey that software developers have experienced with the DevOps revolution. Instead of an archipelago of tiny islands of expertise, each rehearsing its own strange rituals, we’ll have a single well-connected community that can manage the entire end-to-end data science workflow efficiently.
To learn more and explore the roadmap, please visit the IBM Watson Data Platform website.