Discover how to use CombineFileInputFormat within the MapReduce framework to decouple the amount of data a Mapper consumes from the block size of the files in HDFS.
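As a quick illustration of the idea, here is a minimal, hypothetical MapReduce driver sketch using CombineTextInputFormat (a concrete subclass of CombineFileInputFormat). The class and job names are assumptions for illustration; the code requires the Hadoop client libraries and a configured cluster to actually run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineDriver.class);

        // CombineTextInputFormat packs many small files into each split,
        // so one Mapper consumes up to the configured maximum split size
        // regardless of the HDFS block size of the individual files.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB per split

        job.setMapperClass(Mapper.class); // identity mapper, just for the sketch
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Without the combine input format, a directory of thousands of small files would produce one Mapper per file; here each Mapper instead receives a split of up to 256 MB drawn from many files.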
The seventh conversation between IBM software lab specialists about the value of BigInsights, IBM's Hadoop offering, this time comparing it to Teradata, another vendor in the big data marketplace. (7:24)
Version 2.1 is available now. Find out what's new.
InfoSphere BigInsights Quick Start Edition is IBM's big data offering based on the open source Apache Hadoop project. It includes core Hadoop (Hadoop Distributed File System, MapReduce) and several other projects in the Hadoop ecosystem, such as Pig, Hive, HBase, and ZooKeeper. BigInsights includes a variety of IBM technologies that extend the platform's value, including advanced analytical facilities, application accelerators, development tools, platform improvements, and enterprise software integration. Many of these capabilities are available with the Quick Start Edition, which you can freely download for non-production use.
Download InfoSphere BigInsights Quick Start Edition
- Developing IBM PureData System for Hadoop applications with the Eclipse IDE
Learn how to set up OpenVPN, an open source implementation of VPN server and client software published under the GNU General Public License, on a connected client to enable secure access to the Hadoop cluster.
- Oozie workflow scheduler for Hadoop
Big data in its raw form rarely satisfies the Hadoop developer's requirements for data processing tasks. Let Apache Oozie help automate the preprocessing of data with different types of reusable workflows.
- Big data serialization using Apache Avro with
Share serialized data among applications with Apache Avro, a framework that produces data in a compact, binary format that doesn't require proxy objects or code generation.
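The generic-record path is what makes Avro work without code generation: a reader needs only the schema, not generated classes. Below is a hedged round-trip sketch using Avro's generic Java API; the `User` schema and field names are made up for illustration, and the code assumes the Apache Avro library is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // The schema is plain JSON; no proxy classes are generated from it.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"visits\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("visits", 42);

        // Serialize to Avro's compact binary encoding.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Deserialize using only the schema -- still no generated code.
        BinaryDecoder decoder = DecoderFactory.get()
                .binaryDecoder(new ByteArrayInputStream(out.toByteArray()), null);
        GenericRecord back = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(back.get("name") + " " + back.get("visits"));
    }
}
```

Because both sides share the schema, a Python or C consumer could read the same bytes; that cross-language reach is the point of the format.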
- SQL to Hadoop and back again, Part 3: Direct transfer and live data exchange
Learn what makes Sqoop an efficient way to exchange data, enabling live transfer between your SQL and Hadoop environments.
- SQL to Hadoop and back again, Part 2: Leveraging HBase and Hive
Although the HBase and Hive systems seem similar, they have very different goals. Take advantage of those differences in the way they exchange data with your SQL data stores.
- SQL to Hadoop and back again, Part 1: Basic data
How do you integrate your existing SQL-based data stores with Hadoop to take advantage of different technologies when you need them? Find out in this series of articles, which looks at a range of methods for integration between Hadoop and traditional SQL databases.
- ZooKeeper fundamentals, deployment, and
Explore the fundamentals of ZooKeeper, then learn how to set up and deploy a ZooKeeper cluster in a simulated miniature distributed environment. The author also examines the use of ZooKeeper in popular projects.
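For a feel of what client code against such a cluster looks like, here is a small hedged sketch using ZooKeeper's Java client API. The connection string, znode path, and payload are assumptions for illustration; running it requires the ZooKeeper client library and a reachable ensemble member.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkHello {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Hypothetical ensemble member on localhost:2181; the watcher
        // fires once the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // znodes form a small replicated hierarchical namespace;
        // an EPHEMERAL node vanishes when the creating session ends,
        // which is the basis for liveness tracking and leader election.
        zk.create("/demo", "hello".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        byte[] data = zk.getData("/demo", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```

Projects such as HBase use exactly this ephemeral-node mechanism to detect which region servers are alive.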
- Sqoop: Big data conduit between NoSQL and
Get details on how to use the Sqoop CLI and the Java API to import data from an RDBMS, manipulate the data in a Hadoop environment, and export the manipulated data back to the RDBMS tables.
- Analyzing large datasets with Hive
Learn to use Apache Hive to analyze large datasets using Hadoop as a back end. Explore what the schema of a large production dataset might look like, and learn how to join two large datasets to form a correlation dataset.
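To make the join scenario concrete, here is a hedged HiveQL sketch; the `page_views` and `users` tables, their columns, and the result table name are all hypothetical stand-ins for whatever a real production schema would contain.

```sql
-- Hypothetical tables: page_views (user_id, url, view_time)
-- and users (user_id, age, country), both managed by Hive over HDFS.
CREATE TABLE IF NOT EXISTS views_by_country AS
SELECT u.country,
       COUNT(*)                  AS total_views,
       COUNT(DISTINCT v.user_id) AS unique_visitors
FROM page_views v
JOIN users u ON v.user_id = u.user_id
GROUP BY u.country;
```

Hive compiles a query like this into one or more MapReduce jobs, so the join scales to datasets far larger than any single node's memory.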
- Moving ahead with Hadoop YARN
YARN, with its new capabilities and new complexity, will soon be coming to a Hadoop cluster near you. Find out what you need to know before you make the switch to this new Hadoop framework.
Avro: Data serialization
Chukwa: Monitoring large clustered systems
Flume: Data collection and aggregation
HBase: Real-time read and write database
HCatalog: Table and storage management
HDFS: Hadoop Distributed File System
Hive: Data summarization and querying
Lucene: Text search
MapReduce: Programming paradigm
Oozie: Workflow and job orchestration