Big data in its raw form rarely satisfies the Hadoop developer's requirements for data processing. Apache Oozie can help automate the preprocessing of that data with reusable workflows of different types.
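As a minimal sketch, a preprocessing pipeline of this kind can be described in an Oozie `workflow.xml`; the action name, script, and properties below are illustrative, not taken from the article:

```xml
<!-- Hypothetical workflow: run a Pig preprocessing script, then finish. -->
<workflow-app name="preprocess-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="clean-data"/>
    <action name="clean-data">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>clean.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Preprocessing failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Because the workflow is declarative, the same definition can be rerun against new input directories, which is what makes Oozie workflows reusable.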
Organizations — and the people in them — must have confidence in the big data or they simply will not use it to its fullest potential. In this discussion, Claudia Imhoff, president of Intelligent Solutions and founder of Boulder BI Brain Trust, talks with David Corrigan, director of InfoSphere product marketing, about the importance of data governance, integration, security, privacy, and working with Hadoop (8:23).
InfoSphere BigInsights Quick Start Edition is IBM's big data offering based on the open source Apache Hadoop project. It includes core Hadoop (Hadoop Distributed File System, MapReduce) and several other projects in the Hadoop ecosystem, such as Pig, Hive, HBase, and ZooKeeper. BigInsights includes a variety of IBM technologies that extend the platform's value, including advanced analytical facilities, application accelerators, development tools, platform improvements, and enterprise software integration. Many of these capabilities are available with the Quick Start Edition, which you can freely download for non-production use.
Download InfoSphere BigInsights Quick Start Edition
- Big data serialization using Apache Avro with Hadoop
Share serialized data among applications with Apache Avro, a framework that produces data in a compact, binary format that doesn't require proxy objects or code generation.
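For illustration, an Avro schema is plain JSON that readers and writers in any language can load directly, with no generated proxy classes; the record below is hypothetical:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name",  "type": "string"},
    {"name": "age",   "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The schema travels with the data file, so a consuming application can decode the compact binary records without any code shared ahead of time.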
- SQL to Hadoop and back again, Part 3: Direct transfer and live data exchange
Learn what makes Sqoop an efficient method of swapping data, enabling live transfer of data between your SQL and Hadoop environments.
- SQL to Hadoop and back again, Part 2: Leveraging HBase and Hive
Although the HBase and Hive systems seem similar, they have very different goals and aims. Take advantage of the differences in the way they exchange data with your SQL data stores.
- SQL to Hadoop and back again, Part 1: Basic data interchange techniques
How do you integrate your existing SQL-based data stores with Hadoop so you can take advantage of each technology when you need it? Find out in this series of articles, which examines a range of methods for integrating Hadoop with traditional SQL databases.
- ZooKeeper fundamentals, deployment, and applications
Explore the fundamentals of ZooKeeper, then learn how to set up and deploy a ZooKeeper cluster in a simulated miniature distributed environment. The author also examines the use of ZooKeeper in popular projects.
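A simulated ensemble on one machine can be approximated with three config files that differ only in `dataDir` and `clientPort`; the ports and paths below are illustrative, and each server also needs a distinct `myid` file in its data directory:

```
# zoo1.cfg — one of three nearly identical config files
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper/1
clientPort=2181
# Every config lists all ensemble members: server.N=host:quorumPort:electionPort
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
```

Starting three ZooKeeper server processes, one per config file, yields a working quorum for experimentation without multiple hosts.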
- Sqoop: Big data conduit between NoSQL and RDBMS
Get details on how to use the Sqoop CLI and the Java API to import data from an RDBMS, manipulate the data in a Hadoop environment, and export the manipulated data back to the RDBMS tables.
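The round trip described above can be sketched with the Sqoop CLI; the connection string, credentials, and table names here are hypothetical:

```shell
# Import an RDBMS table into HDFS (hypothetical MySQL database "shop")
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username etl --password-file /user/etl/.pw \
  --table orders \
  --target-dir /data/orders

# ... manipulate the data in Hadoop (MapReduce, Hive, Pig) ...

# Export the processed results back to an RDBMS table
sqoop export \
  --connect jdbc:mysql://dbhost/shop \
  --username etl --password-file /user/etl/.pw \
  --table orders_summary \
  --export-dir /data/orders_summary
```

The export target table must already exist in the RDBMS; Sqoop maps the HDFS records onto its columns.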
- Analyzing large datasets with Hive
Learn to use Apache Hive to analyze large datasets using Hadoop as a back end. Explore what the schema of a large production dataset might look like, and learn how to join two large datasets to form a correlation dataset.
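Joining two large datasets to form a correlation dataset looks like ordinary SQL in Hive; the tables and columns below are hypothetical:

```sql
-- Hypothetical: correlate web clicks with user records by user_id
CREATE TABLE click_user_corr AS
SELECT u.country, c.page, COUNT(*) AS views
FROM clicks c
JOIN users u ON c.user_id = u.user_id
GROUP BY u.country, c.page;
```

Hive compiles the join into MapReduce jobs behind the scenes, so the same statement scales from sample data to the full production dataset.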
- Moving ahead with Hadoop YARN
YARN, with its new capabilities and new complexity, will soon be coming to a Hadoop cluster near you. Find out what you need to know before you make the switch to this new Hadoop framework.
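Much of the switch to YARN comes down to configuration; a minimal sketch (the hostname is illustrative) tells MapReduce to submit jobs to YARN and points NodeManagers at the ResourceManager:

```xml
<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml: locate the ResourceManager and enable the shuffle service -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm-host.example.com</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```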
- What's the big deal about Big SQL?
Relational DBMS users, meet Big SQL. Big SQL brings an industry-standard query interface to IBM's Hadoop-based platform: InfoSphere BigInsights. While Big SQL doesn't turn BigInsights into a relational DBMS, it does provide experienced SQL users with a familiar on-ramp to an increasingly popular environment for analyzing and storing big data.
- Build a data library with Hive
Apache Hive allows database developers or data analysts to use Hadoop without knowing the Java programming language or MapReduce. Now, instead of writing challenging MapReduce code, you can design a star schema data warehouse or a normalized database. Learn how BI and analytic tools like IBM Cognos or SPSS Statistics can now connect to the Hadoop ecosystem.
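As a sketch, a star-schema fact table and one dimension could be declared in Hive like this; all table and column names are hypothetical:

```sql
-- Hypothetical dimension table
CREATE TABLE dim_product (
  product_id INT,
  name       STRING,
  category   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Hypothetical fact table, partitioned by day so queries can prune scans
CREATE TABLE fact_sales (
  product_id INT,
  store_id   INT,
  quantity   INT,
  revenue    DOUBLE
)
PARTITIONED BY (sale_date STRING);
```

BI tools then query these tables through Hive's JDBC/ODBC interfaces just as they would a conventional warehouse.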
Avro: Data serialization
Chukwa: Monitoring large clustered systems
Flume: Data collection and aggregation
HBase: Real-time read and write database
HCatalog: Table and storage management
HDFS: Hadoop Distributed File System
Hive: Data summarization and querying
Lucene: Text search
MapReduce: Programming paradigm
Oozie: Workflow and job orchestration