June 28, 2016 | Written by: Andrea Braida
Categorized: Community | Data Analytics
With the release of BigInsights 4.2, IBM is making self-service, powerful advanced analytics – including Apache Spark – available on an optimal Hadoop distribution. Priya Krishnan, Program Director and Product Manager for BigInsights at IBM, focuses on Hadoop and advanced analytics solutions. She came to IBM through an acquisition in 2010. Prior to this, she was a Data Scientist. Since joining IBM, Priya worked as a Product Manager with Master Data Management and moved into BigInsights early last year. She holds a Master’s degree in Operations Research, and lives in Austin, Texas, with her husband and daughter. We spoke to Priya about the BigInsights release, and how its comprehensive, open, and flexible architecture makes the delivery of big data analytics and business applications easier.
Thank you for joining us today. Let’s start with a quick overview of BigInsights – is this Hadoop on-premises or on-cloud? What is the offering?
IBM BigInsights is a complete Apache Hadoop solution, which includes Apache Spark, and is available both on-premises and in the cloud. IBM BigInsights provides the power of open source, IBM innovations and rich developer tools – and puts the full range of analytics for Hadoop, Spark, and SQL into the hands of big data analytics teams.
We know that a common barrier to Hadoop adoption is figuring out which distribution to go with. Before we get into what’s new in 4.2, can you please share a bit about BigInsights’ distribution?
Sure! IBM Open Platform (IOP) with Apache Spark and Apache Hadoop is IBM’s big data platform. IOP is built on 100% open source Apache ecosystem components – as if you had downloaded the components from Apache.org and built them yourself. IOP was designed with analytics and operations empowerment as its goals, and also has both top-rated security and ODPi compliance. (ODPi is a governing organization focused on standardizing the Apache Hadoop big data ecosystem with a common reference specification.)
That’s good stuff. Let’s move on to the new capabilities in BigInsights 4.2. What are they?
The new features include integration with Apache Spark 1.6.1, IBM Big SQL enhancements for RDBMS offload and consolidation, new Apache components (Ranger, Phoenix, and Titan), as well as currency updates to existing components (the notable ones are Ambari, Kafka, and Solr). The release also includes ODPi Runtime Certification. We’re one of the first Hadoop platforms to comply with the Open Data Platform (ODPi) Runtime Certification, which makes it easier for independent software vendors to adopt IOP as a platform, and ensures platform openness for customers. Just as exciting, 4.2 includes the introduction of IBM Big Replicate. IBM Big Replicate provides continuous availability and data consistency via a patented active-transactional replication technology. It is an optimized data replication capability for uninterrupted migration from other distributions to IBM, from cloud to on-prem, and vice versa.
Can you share a bit more about Big Replicate use cases please?
Before I move into use cases, let me talk a little about what Big Replicate is. IBM Big Replicate is based on WANdisco Fusion’s Active Replication technology, which delivers continuous availability, backup, and uninterrupted migration. Some of the use cases that this technology supports are:
- Data Lake using data shared across multiple clusters, vendors, and platforms
- Real-time analytics without loss of data or time for data spread across diverse locations
- Hybrid Cloud and Disaster Recovery, to move or share data across on-prem and on-cloud, and to provide a fail-safe mechanism in case of a failure at either end
I heard that the BigInsights Engineering team likes to say that Big SQL “eats other DBs.” Can you tell us about the technology behind this developer pride?
We’ve delivered an industry first with the enhanced Big SQL capabilities. Big SQL now understands SQL dialects from other vendors and products, such as Oracle, IBM DB2, and IBM Netezza. This makes Big SQL the ultimate platform for RDBMS offload and consolidation. The cross-SQL dialect understanding makes it much faster and much easier to offload old data from existing enterprise data warehouses or data marts, freeing up capacity while preserving most of the familiar SQL from those platforms. This is a huge step forward for IT users, and yes, we’re proud of this!
What are the benefits that an IT organization can expect from the Spark integration?
The benefits include the incredibly fast processing of Spark Core, which speeds up batch and ETL workloads when that is needed, and the advanced analytics capabilities of the Spark stack. These include near real-time analytics with Spark Streaming, built-in, highly extensible machine learning libraries with Spark MLlib, querying of unstructured data and more value from free-form text analytics with Spark SQL, and graph computation and graph analytics with Spark GraphX. Spark also unifies data access across the organization. By unifying data access, I mean that, in general, one line of code can be used to pull data from multiple data sources – which only Spark can do. It’s pretty amazing when you think about it!
What are the new and updated Apache components in the 4.2 release, and which ones are you most excited about?
The new Apache components include Apache Ranger, which provides centralized security management and auditing through a user interface and a REST interface; Apache Phoenix, which is a SQL interface for HBase; and Apache Titan, which is a graph database API for Hadoop. Updates to existing Apache components of the IBM Open Platform are numerous, and I’d like to point readers to our Release Overview, which has all the supporting detail. In my opinion, Apache Ranger is the most exciting new component. It provides a centralized security platform for managing authorization, access control, auditing, and data protection for data in Hadoop – so important! Along with the other new features and component upgrades, Ranger helps our platform provide an enterprise-grade solution for the most complex of Big Data projects.
What use cases do the new Apache components support that were not possible before?
New use cases include data protection and data security using Ranger, entity resolution using the graph database Apache Titan, and analytics on data in HBase through an easy-to-use SQL interface using Apache Phoenix.
This all sounds great! Where can I go to learn more about what BigInsights has to offer?
Please visit the IBM BigInsights Resources page to learn more.