Putting the engine to work: how IBM Analytics Engine can help you harness Hadoop and Spark for business benefit

For many companies, the potential of big data analytics may seem both exciting and overwhelming. Technologies like Hadoop and Spark promise to unearth new sources of value from the vast mountains of unstructured data your business generates every day. The newfound opportunity to get insight from data that has been dormant for years could act as an energy source to power the next phase of business growth.

At the same time, though, the tracks of companies that have pioneered big data analytics are difficult to follow, and many attempts have already fallen by the wayside. Traditional Hadoop environments can be difficult to set up, complex to manage, and expensive to scale; as a result, too many companies’ data lake initiatives have ended up as dead ends.

Fortunately, a new generation of flexible, cloud-based big data analytics solutions is helping to address many of these issues. IBM Analytics Engine, for example, provides a simple way to spin up Hadoop and Spark clusters in the cloud within minutes—empowering data scientists and developers to extract insight from massive data sets without worrying about building a permanent cluster.

This may sound pretty good in theory, but how can your business take its first steps? To trigger some creative thinking around use cases that you could put into practice, let’s take a quick look at a few examples of how IBM Analytics Engine can solve real-world business problems.

From theory to practice: use cases for IBM Analytics Engine

1. Dipping your toes in the data lake

Your business may be keen to explore the value of data lakes and big data analytics, but unwilling to jump in at the deep end. In the past, setting up a Hadoop cluster typically required significant up-front investment in infrastructure and expertise, and connecting it to even a small number of data sources could take months. As a result, it was impractical to embark on experimental Hadoop projects because the cost of failure was too high.

With IBM Analytics Engine, these obstacles largely melt away. Instead of purchasing hardware, just navigate to the IBM Watson Data Platform website and sign up for an account. Instead of installing and configuring dozens of Hadoop components yourself, you can spin up your first cluster with a few mouse clicks. And instead of waiting months for your data engineers to write complex ETL scripts, you can connect to dozens of different databases just by filling in the relevant access credentials.

As a result, experimentation comes at virtually no cost. You can spin up a Hadoop or Spark cluster, load some data, try out some analyses, and then scrap the whole thing. You only pay for the resources you use, while you are using them. And if you conclude that the service isn’t right for your use case, you can walk away without any further obligation.

Hadoop adoption used to be a daunting hurdle and a risky investment; with IBM Analytics Engine, both the hurdle and the risk become almost insignificant. There is no good reason not to try it out.

2. Streamlining data science workflows

Perhaps your business is a little further along the big data analytics roadmap: you have already taken your first steps with Hadoop and Spark, but your data science team is still struggling to be productive.

Data scientists are hired for their skill and expertise in analyzing data and building models—yet with a traditional Hadoop environment, they are often distracted from these tasks by low-level infrastructure concerns. They either have to take on responsibility for cluster management themselves—a job which may be outside their immediate skillset—or rely on support from the IT team, which means competing for priority against other important IT projects.

With IBM Analytics Engine, all the infrastructure management issues are completely abstracted away, not only from the data science team, but from the IT team too. All the user needs to do is request a cluster, connect it to an object storage repository, and get started with their analysis.

In fact, the abstraction goes even further: users don’t even have to interact with IBM Analytics Engine directly. For example, if a data scientist is working on a Jupyter Notebook in IBM Data Science Experience (DSX), they can invoke a Spark cluster to perform analysis from within the DSX interface. With a few clicks, an IBM Analytics Engine instance can be “associated” with a notebook or a project; the cluster then spins up in the background without distracting the user from their data science task.

3. Dealing with different workloads

Even if your company’s big data capabilities are already well-established, and your data science workflow is seamlessly efficient, you may still face challenges as user numbers rise and demand increases.

It can be extremely difficult to build a single, permanent Hadoop cluster that can serve the needs of multiple different groups of users and types of workload. Segregating the cluster into secure zones for each group can be an administrative headache, and it may not be possible to find a configuration that is optimal for both daytime interactive workloads and overnight batch processes.

With IBM Analytics Engine, you can sidestep these problems altogether. Instead of a single permanent cluster shared by multiple teams, each user can spin up their own personal cluster whenever they need it. As a result, you don’t need to worry about managing security within the cluster—you can simply use the same identity and access management framework that you use for all your other cloud services.

Similarly, there is no need for a one-size-fits-all approach to different workloads. At the end of each day, you can spin up a cluster configured for your overnight batch processes. In the morning, you can instantiate a separate cluster with the tools required for ad-hoc, interactive queries. If your workload profile changes, you can scale either cluster up or down to match.
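To make the per-workload idea concrete, here is a minimal Python sketch. It is not the IBM Analytics Engine API: `ClusterSpec`, `spec_for`, and `resize` are hypothetical names, and the node counts and software lists are illustrative assumptions only. It simply shows how a team might keep one specification per workload type and adjust the size as demand changes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical description of a short-lived cluster (not an IBM API type)."""
    name: str
    compute_nodes: int
    software: tuple


# Illustrative values only; real sizing depends on your data and workloads.
SPECS = {
    "overnight_batch": ClusterSpec("batch", compute_nodes=8, software=("hadoop", "spark")),
    "interactive": ClusterSpec("adhoc", compute_nodes=2, software=("spark", "hive")),
}


def spec_for(workload: str) -> ClusterSpec:
    """Return the cluster specification registered for a workload type."""
    try:
        return SPECS[workload]
    except KeyError:
        raise ValueError(f"unknown workload type: {workload!r}") from None


def resize(spec: ClusterSpec, compute_nodes: int) -> ClusterSpec:
    """Return a copy of the spec with a new node count (scale up or down)."""
    if compute_nodes < 1:
        raise ValueError("a cluster needs at least one compute node")
    return replace(spec, compute_nodes=compute_nodes)
```

Because each specification is immutable, resizing produces a new specification rather than altering a shared one, which mirrors the idea that each team or job gets its own cluster.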

4. Simplifying disaster recovery

Disaster recovery has always been one of the pain points of traditional Hadoop deployments. With a permanent cluster that is constantly in use, it can be difficult to find an appropriate time to run backups. Finding an appropriate backup target is also a problem: few companies want to invest in an entire second cluster just to handle backups.

IBM Analytics Engine can bypass these problems because its storage architecture is completely different from a traditional Hadoop cluster. Instead of each node having its own local storage, the whole cluster is connected to a separate object storage repository.

Object storage solutions such as IBM Cloud Object Storage can automatically distribute multiple replicas of data across a cluster of storage systems in different data centers—or even different regions. This means that the data remains highly available even in the event of a failure of one or more storage nodes.
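The availability argument can be illustrated with a toy model. The Python sketch below is an assumption-laden simplification, not how IBM Cloud Object Storage is implemented: `StorageSite` and `ReplicatedStore` are hypothetical names, every object is copied in full to every online site, and real services may use other techniques such as erasure coding. It shows only the basic principle that a read succeeds as long as one site holding the data is still reachable.

```python
from dataclasses import dataclass, field


@dataclass
class StorageSite:
    """Hypothetical storage location, such as one data center."""
    name: str
    online: bool = True
    objects: dict = field(default_factory=dict)


class ReplicatedStore:
    """Toy object store that writes each object to every online site."""

    def __init__(self, sites):
        self.sites = list(sites)

    def put(self, key, data):
        # Store a full copy of the object at each site that is reachable.
        for site in self.sites:
            if site.online:
                site.objects[key] = data

    def get(self, key):
        # Serve the object from the first reachable site that holds it.
        for site in self.sites:
            if site.online and key in site.objects:
                return site.objects[key]
        raise KeyError(key)


# Usage: lose one site and read the object from a surviving replica.
sites = [StorageSite("dc-a"), StorageSite("dc-b"), StorageSite("dc-c")]
store = ReplicatedStore(sites)
store.put("sales.csv", b"region,total")
sites[0].online = False          # simulate the failure of one site
data = store.get("sales.csv")    # still readable from another site
```

Note that the compute cluster does not appear anywhere in this model: because the data lives outside the cluster, the cluster can be deleted and recreated without affecting what is stored.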

Essentially, with IBM Analytics Engine, disaster recovery comes as standard; there is no longer any need for Hadoop users or administrators to worry about how, where or when to back up their data.

Take the next steps

We hope these examples have captured your interest and prompted some new ideas about how your business could benefit from a more flexible cluster management architecture.

The key takeaway is that IBM Analytics Engine can completely eliminate many of the major issues that have prevented companies from taking the plunge into big data analytics, or from realizing the full benefits of technologies like Hadoop and Spark.

If you’d like to take a deeper dive, read this whitepaper. Or if you’re ready to take your first steps with IBM Analytics Engine, sign up for Watson Data Platform here.

IBM Cloud Marketing Manager, Global
