Hitting the ground running: how to get your data science initiatives off to a flying start
The core features comprising Watson Data Platform, Data Science Experience and Data Catalog on IBM Cloud, along with additional embedded AI services, including machine learning and deep learning, are now available in Watson Studio and Watson Knowledge Catalog. Get started for free at https://ibm.co/watsonstudio.
Data science is rapidly being established as the new frontier for analytics, as it moves from niche interest to the mainstream. Combining elements of statistics, computer science, applied mathematics and visualization, it offers a powerful new set of tools and techniques to enable more effective decision-making.
If you’re not already excited about what data science could mean for your organization, then you should be. But for those at the starting line, the way forward can seem daunting. And even those who have already set out on the data science journey might not be finding the path as easy as they would like.
Often the biggest barrier can be knowing where to start or, if your data science initiative has stalled, how to get back on the road. In this blog, we’ll look at some of the biggest pitfalls companies face in the early stages of embracing data science, and how to overcome them.
Evaluating the playing field
Colleges and universities have only started offering courses in data science in the last few years. As a result, many data scientists are largely self-taught, drawing on the dynamic world of online communities to pick up skills and share breakthroughs. While no one wants to stifle the creativity of these communities, the lack of standardized tools and techniques can create obstacles when you seek to formally establish a data science department in your company.
For example, data scientists who are accustomed to working in online communities don’t always adapt well to the less open environment of a corporate data science team, where the information they need may be commercially sensitive and difficult to access. Unless you give them the ability to collaborate with each other and access shared resources both within and beyond your company’s walls, you can dramatically limit their productivity.
Similarly, data scientists are often big users of open-source technologies, attracted by the ethos of collaborative innovation. And indeed, many of the state-of-the-art data science technologies are open source projects, from Jupyter Notebooks to Apache Spark. However, a collection of best-of-breed components does not necessarily add up to a best-of-breed solution—and the lack of a coherent platform is where you can see the productivity of your data scientists break down.
What about the plumbing?
When it comes to building a technical architecture to support data science, the barriers to entry may seem insurmountable. First, there are the infrastructure costs, which can mount up very quickly when you are tackling big data problems. Second, since the architecture is likely to be built on dozens of different open-source and proprietary components, complexity soon starts to spiral out of control.
Third, with data growing in volume and variety along with the desire for rapid insight, you must be able to scale the infrastructure with ease. For optimal outcomes, you need to be able to match different workloads to the right infrastructure by quickly and flexibly provisioning server and storage resources.
Finally, you may need to devote resources you don’t have to deploying and administering the clusters that support your data science efforts, upgrading and patching software, and monitoring performance and availability.
Don’t worry: there is another way
Gartner’s 2017 Magic Quadrant for Data Science Platforms named IBM as the leader among 16 big-name vendors. So, who better to help you kick-start (or restart) your data science journey?
IBM is investing heavily in supporting and optimizing every step in the data lifecycle. Our aim is to provide a single platform that integrates a range of data and analytics services. This enables you either to start building your data science capabilities from scratch on a best-of-breed platform, or to pick and choose whichever services best complement your existing architecture, and integrate them easily.
Let’s dive into the solution
IBM Watson Data Platform acts as a comprehensive platform that can manage any—or every—aspect of your data lifecycle. Integrating seamlessly with a range of tools and programming languages, including open-source machine-learning frameworks, it is designed to deliver value very quickly.
All Watson Data Platform services are available on the IBM Cloud Platform. Consequently, users can take advantage of the elasticity and scalability of cloud computing, empowering them to kick off data science initiatives very quickly and at relatively low cost.
Alternatively, if you want or need to keep your data science platform behind a firewall, these services can also be deployed in a private cloud environment. Individual users can even experiment with the platform’s core capabilities on their desktop for a first taste of what is possible.
Now, let’s look at a user tool within Watson Data Platform. IBM Data Science Experience is an interactive, collaborative environment intended to help users master the art of data science. It gives your data scientists the ability to collaborate with peers both inside and outside the organization. Internally, they can share knowledge and code, and get feedback on their work, helping to streamline projects and accelerate speed to value. Externally, it provides that all-important access to frameworks, tools and support from the wider data science community, so they can keep up with the latest innovations and sharpen their skills.
Under the hood
IBM has launched the next generation of Hadoop and Spark cloud services through IBM Analytics Engine. It is the core data and analytics processing architecture for Watson Data Platform, providing an agile environment for developing and deploying advanced analytics applications.
With IBM Analytics Engine, you can provision Apache Hadoop and Apache Spark clusters from a single point of control, spinning them up and shutting them down within minutes as and when you need them. By making it possible to separate compute and storage elements in your architecture, it allows you to scale them independently, improving maintainability and slashing costs. Integration with Watson Data Platform tools helps you create a complete ecosystem for data science.
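To see why separating compute from storage can cut costs, here is a minimal back-of-the-envelope sketch in Python. It is not IBM's pricing model or any IBM API: the `Workload` class, both cost functions and every rate below are hypothetical, chosen only to illustrate the reasoning. It compares a cluster that must stay up all month because the data lives on its nodes with one that is spun up only while jobs are running.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # approximate hours in a month


@dataclass
class Workload:
    nodes: int          # compute nodes needed while jobs are running
    busy_hours: float   # hours per month the cluster actually runs jobs
    data_tb: float      # terabytes of data kept in storage


def always_on_cost(w, node_rate, storage_rate):
    """Coupled design: data lives on the cluster, so nodes run all month."""
    return w.nodes * HOURS_PER_MONTH * node_rate + w.data_tb * storage_rate


def on_demand_cost(w, node_rate, storage_rate):
    """Decoupled design: data sits in separate storage, nodes run only when busy."""
    return w.nodes * w.busy_hours * node_rate + w.data_tb * storage_rate


# Hypothetical figures, for illustration only (not IBM pricing).
workload = Workload(nodes=4, busy_hours=100, data_tb=10)
NODE_RATE = 0.5       # assumed cost per node-hour
STORAGE_RATE = 20.0   # assumed cost per terabyte-month

print(always_on_cost(workload, NODE_RATE, STORAGE_RATE))
print(on_demand_cost(workload, NODE_RATE, STORAGE_RATE))
```

With these made-up numbers the storage bill is identical in both cases, and the whole difference comes from paying for compute only during the hours it is busy. Plug in your own node counts, job hours and rates to see how the comparison looks for your workloads.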
Providing powerful integration capabilities for data management, monitoring and dashboards, IBM Analytics Engine will accelerate your data science efforts, while maintaining enterprise-level security. Moreover, intuitive interfaces for management and integration will simplify cluster administration and user control, including automated patching, upgrades and failure handling.
Offering a multi-purpose big data processing engine, IBM Analytics Engine can help you solve a range of real-world big data challenges. For example, you can use it to simplify data governance, reduce the cost of disaster recovery, and streamline data science and machine learning workflows. And that’s only the start! The potential impact on your business is huge.
What do I do next?
Whether you’re a new starter or a grizzled veteran in the data science game, let IBM show you how to play to win. Visit https://www.ibm.com/cloud/analytics-engine to start analyzing your data.