February 29, 2016 | Written by: Andrea Braida
Categorized: Community | Data Analytics
Hadoop is a powerful technology, but it’s not the easiest to get up and running, particularly for companies with no prior experience of big data technologies. After many years as a specialist in database technologies such as DB2, Jim Wankowski has made the transition to a role as Technical Sales Specialist for IBM Cloud Data Services. With deep experience of both the “old world” of relational databases and the “new world” of big data technologies, Jim is the perfect person to talk us through the challenges that many businesses face when adopting technologies such as Hadoop. We spoke to Jim about why there’s more than one “elephant in the room” when data practitioners and vendors alike grapple with Hadoop, and how to deal with its complexity.
Jim, good to have you with us. What is it about Hadoop that vendors don’t want customers to know?
Hadoop has, without question, transformed the way organizations deal with big data and opened up a whole host of capabilities that five or ten years ago would have been more or less impossible.
But because it was so groundbreaking and evolved so quickly, Hadoop has become quite a complex beast from a technical perspective. Setting up a Hadoop cluster doesn’t just involve core Hadoop and HDFS; depending on exactly what you’re trying to achieve, you also need to install and manage a lot of other tools – Hive, Pig, Lucene, ZooKeeper and so on. There is a whole ecosystem of possible components, and it’s difficult for non-specialists to work out exactly what they need.
Moreover, MapReduce jobs, which are the heartbeat of the whole Hadoop concept, are typically written in Java, which is not particularly easy to learn compared to languages like Python, Scala and R – the industry standards for data scientists – or SQL.
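To make the map/reduce model concrete: Hadoop Streaming (shipped with standard Hadoop distributions) lets you express the same two-phase logic in Python, reading and writing plain text on stdin/stdout. Here is a minimal word-count sketch; the core logic lives in plain functions so it also runs outside a cluster, and the `hadoop jar` invocation in the comment is illustrative rather than a complete command line.

```python
import sys
from itertools import groupby

def map_line(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_sorted(pairs):
    """Reduce phase: sum counts per word. Input must be sorted by key,
    which Hadoop's shuffle/sort step guarantees between the two phases."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, key=lambda kv: kv[0])]

if __name__ == "__main__" and len(sys.argv) > 1:
    # Under Hadoop Streaming this script is invoked twice, e.g. (illustrative):
    #   hadoop jar hadoop-streaming.jar -mapper "wc.py mapper" -reducer "wc.py reducer" ...
    if sys.argv[1] == "mapper":
        for line in sys.stdin:
            for word, count in map_line(line):
                print(f"{word}\t{count}")
    else:
        raw = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, total in reduce_sorted([(w, int(c)) for w, c in raw]):
            print(f"{word}\t{total}")
```

The same logic written against the native MapReduce API would require a Java class per phase plus a driver; the streaming interface trades some efficiency for that simplicity.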
As a consequence, if you want to build your own Hadoop cluster, you will either need to invest in a lot of training for your IT team, or hire new people who already have Hadoop experience – and that experience comes at a premium. On top of that, there are the infrastructure costs. Although Hadoop is designed to run on commodity hardware, it can still require a significant investment in data center space, power and cooling if you want to build a cluster of any real size.
So the complexity and cost of setting up a cluster can seem like a barrier to making productive use of Hadoop?
Yes, and there’s also the perceived risk. It’s in the nature of big data analytics that the business case is not always obvious until you actually start exploring the data and work out where the real value is. But you can’t start the exploration stage until you have the platform in place. So it can seem like a chicken-and-egg situation.
Many organizations are sitting on a significant store of unstructured data that they’re sure contains some kind of valuable insight for their business. But because they’re not sure exactly where that value lies, they’re struggling to make a solid business case for what seems like a big up-front investment.
Still, Hadoop adoption is on the increase… so how are organizations overcoming these obstacles?
The good news is that there’s no longer a need to jump straight into the deep end and set up your own Hadoop cluster from day one. Many vendors now offer managed services for Hadoop, which are really changing the game: they give you all the advantages of Hadoop without having to worry about investing in infrastructure or hiring new talent.
For example, at IBM we offer BigInsights on Cloud, which packages Hadoop with all the standard components defined by the Open Data Platform – a consortium of leading vendors who are working to standardize and simplify the Hadoop ecosystem.
If you want to spin up a Hadoop instance, you can just log in to IBM Bluemix, choose how large a cluster you want, and it will be provisioned for you automatically. That means you can completely ignore all the complexity of setting up servers, installing components and writing configuration scripts, and you can get up and running within days instead of weeks or months.
IBM’s Big Data experts handle all of the setup and ongoing maintenance, so you can be confident that the team supporting your cluster is one of the best in the business. This means you can focus on building a team of data scientists and analysts who will really add value, instead of just increasing headcount in your IT team. And the fact that you have dedicated experts maintaining the cluster at an enterprise-class IBM data center means you have a much more dependable and secure Hadoop environment than most organizations could provide with in-house resources.
Why should I pick BigInsights on Cloud, compared to other managed services for Hadoop?
BigInsights on Cloud augments Hadoop with a set of IBM tools that minimize the coding and technical knowledge that interacting with Hadoop usually requires, making it much easier for your data scientists to start delivering results to the business quickly.
For example, data scientists can use BigSQL to run SQL queries on Hadoop, instead of writing MapReduce jobs in Java. Similarly, BigSheets provides an Excel-like front-end that handles MapReduce jobs under the covers. And BigR allows you to take all the power of R and parallelize it across multiple nodes for rapid statistical analysis of even the largest data sets.
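To illustrate the first of those points: a multi-stage, hand-written MapReduce job often collapses into a single SQL aggregate once you can query Hadoop data directly. The sketch below builds such a query and shows how it might be submitted from Python; the table name, column name, host and credentials are all illustrative assumptions, and the `ibm_db` driver is shown only because BigSQL accepts DB2-protocol connections – check your own cluster’s documentation for the exact connection string.

```python
def count_query(table, column):
    """Build a standard SQL aggregate that replaces a hand-written
    MapReduce job: group rows by `column` and count each group."""
    return (f"SELECT {column}, COUNT(*) AS n "
            f"FROM {table} GROUP BY {column} ORDER BY n DESC")

def run_on_bigsql(sql):
    """Hypothetical submission path: BigSQL speaks the DB2 wire protocol,
    so the ibm_db driver can send it standard SQL. All connection
    details below are placeholders, not real endpoints."""
    import ibm_db  # pip install ibm_db; requires a reachable cluster
    conn = ibm_db.connect(
        "DATABASE=bigsql;HOSTNAME=example.cluster.ibm.com;PORT=51000;"
        "PROTOCOL=TCPIP;UID=user;PWD=secret;", "", "")
    stmt = ibm_db.exec_immediate(conn, sql)
    rows, row = [], ibm_db.fetch_assoc(stmt)
    while row:
        rows.append(row)
        row = ibm_db.fetch_assoc(stmt)
    return rows

# "web_logs" and "status_code" are made-up names for illustration.
sql = count_query("web_logs", "status_code")
```

The point is the shape of the work, not the driver: one declarative statement replaces mapper, reducer and driver code, and the engine decides how to parallelize it across the cluster.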
Moreover, because BigInsights on Cloud is a service on IBM Bluemix, you also get easy access to a wide range of other big data repositories and cloud data services. For example, if you want to combine structured data from a DB2 on Cloud database with unstructured data held in dashDB or Cloudant and move them into your Hadoop cluster, it’s very easy to coordinate all these services. And if you want to combine the scalability of Hadoop with the power of Spark for real-time analytics, we also offer seamless integration with Apache Spark.
That sounds great if we decide to take the cloud option – but what does IBM offer if we do want to build our own cluster?
If running your own cluster is the right choice for your business, that’s no problem: we can provide BigInsights on-premise too. That way, you get all the advantages of a standardized, Open Data Platform-compliant package of Hadoop, plus the IBM toolset on top.
Furthermore, since the on-premise version of BigInsights has the same configuration as the cloud version, it’s relatively easy to migrate from one to the other. So in the long term, if you do decide you want to move your cluster into the cloud, we can definitely help you with that.
What are some of the use cases that IBM is seeing with BigInsights?
We’ve seen all sorts of great stories from our clients – from a US parking services company that is using BigInsights on Cloud to analyze the way drivers interact with its innovative smart parking meters (Municipal Parking Services), to a leading global professional services company that leverages BigInsights to bring together multiple sources of structured and unstructured data and find previously unseen connections that can help its clients counter fraud (Big risks requires big data thinking: EY helping clients target and prevent fraud).
Sounds great! Where can I go to learn more about what BigInsights has to offer?
We invite you to pre-register for our BigInsights on Cloud beta service via Bluemix, and we’ll keep you posted as the service reaches general availability.