What is Apache Hadoop®?

Apache Hadoop® is a highly scalable, open source framework designed to store and process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements.

MapReduce, the programming paradigm that enables this massive scalability, is the heart of Hadoop. The term refers to two separate and distinct tasks that Hadoop programs perform: a map task, which converts input data into intermediate key/value pairs, and a reduce task, which combines those pairs into a smaller set of aggregated results. Alongside MapReduce, Hadoop has two main components: HDFS for distributed storage and YARN for cluster resource management and job scheduling.
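The map/shuffle/reduce split can be sketched in plain Python (an illustrative simulation of the model, not the Hadoop Java API): the map step emits key/value pairs, a shuffle groups them by key, and the reduce step aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: turn each input line into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values into one result per key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insights", "big data at rest"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'insights': 1, 'at': 1, 'rest': 1}
```

In a real cluster the map and reduce tasks run in parallel across many nodes, with HDFS supplying the input splits and YARN scheduling the work; the data flow, however, follows exactly this shape.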


Key Big Data Use Cases for Hadoop

  1. Data Science Sandbox: Data Science is an interdisciplinary field that combines machine learning, statistics, advanced analysis, and programming to draw hidden insights out of data and put them to work.
  2. Data Lake Analytics: A data lake is a shared data environment that comprises multiple repositories and capitalizes on big data technologies. It provides data to an organization for a variety of analytics processes.
  3. Streaming Data / IoT Platform: Stream computing enables organizations to process data streams that are always on and never cease, helping them spot opportunities and risks across all of their data as it arrives.
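To make "always on" concrete: stream computing systems typically evaluate queries over moving windows of the stream rather than over a finite data set. A minimal sketch in plain Python (illustrative only, not IBM Streams or any specific engine) computes a tumbling-window average over sensor readings:

```python
def tumbling_window_average(readings, window_size):
    """Emit the average of each fixed-size (tumbling) window of a stream."""
    window = []
    for value in readings:
        window.append(value)
        if len(window) == window_size:
            yield sum(window) / window_size
            window = []  # start the next window

# A real sensor stream would be unbounded; a list stands in for it here.
sensor_stream = [10, 12, 14, 20, 22, 24]
averages = list(tumbling_window_average(sensor_stream, window_size=3))
print(averages)  # [12.0, 22.0]
```

Because results are emitted as each window closes, the computation never needs to see the whole stream, which is what lets such platforms run continuously.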

Learn more about Big Data


Recommended and Certified Distributions

The best Hadoop distributions give you all the benefits of open source, bundle the Hadoop ecosystem for you, and are certified to be interoperable, so you don't have to guess or worry about stability.

Apache Spark™ and Hadoop Together

Hadoop and Spark were made to work together: investing in both through an integrated solution helps you solve bigger problems faster by combining cost-effective storage with lightning-fast data processing.

100% Open Source

This 100% open source Hadoop platform is built for big data analytics and innovation on any type of data, whether at rest or in motion. Support services are also offered for clients looking for help working with the platform.

In the spotlight

Get started with Apache Hadoop®

The Hortonworks Data Platform for IBM offers a secure, enterprise-ready open source Hadoop distribution based on a centralized architecture. HDP for IBM addresses a range of data-at-rest use cases, powers real-time customer applications, and delivers robust analytics that accelerate decision-making and innovation.

Accelerate big data collection and dataflow management

Hortonworks DataFlow for IBM, powered by Apache NiFi, is the first integrated platform that solves the challenges of collecting and transporting data from a multitude of sources. HDF for IBM enables simple, fast data acquisition, secure data transport, prioritized data flow and clear traceability of data from the edge of your network to the core data center. It uses a combination of an intuitive visual interface, a high-fidelity access and authorization mechanism and an always-on chain of custody (data provenance) framework.

Accelerated and Stable Apache Hadoop®

The best way to move forward with Hadoop is to choose an installation package that simplifies interoperability so that a Hadoop environment remains as standardized as possible. The Open Data Platform Initiative (ODPi) is a multi-vendor standards association focused on advancing the adoption of Hadoop in the enterprise by promoting the interoperability of big data tools. ODPi simplifies and standardizes the Apache Hadoop big data ecosystem with a common reference specification called the ODPi Core.

Db2 Big SQL

A key capability is SQL querying, and this is where IBM Db2 Big SQL comes in: a data virtualization tool that lets you access, query, and summarize data from many platforms, including databases, data warehouses, NoSQL databases, and more. Db2 Big SQL can exploit Hive, HBase, and Spark concurrently through a single database connection, even within a single query.
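The kind of join-and-summarize query this enables can be illustrated with standard SQL. The snippet below uses Python's built-in sqlite3 purely as a stand-in engine, and the table names are invented for illustration; under Db2 Big SQL, the same style of ANSI SQL could run over tables backed by Hive, HBase, or other sources through one connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-ins for tables that, under Big SQL, could live in Hive and HBase.
cur.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 75.0)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "EMEA"), (2, "APAC")])

# One query joins and summarizes across both tables.
cur.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
""")
rows = cur.fetchall()
print(rows)  # [('APAC', 75.0), ('EMEA', 150.0)]
```

The value of data virtualization is that the application issues one query like this and never needs to know which underlying engine holds each table.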

Related Products

Hortonworks on Power

Maximize performance and efficiency to accelerate insights with Hortonworks Data Platform on IBM Power Systems

IBM BigIntegrate

A massively scalable, shared-nothing, in-memory data integration engine that runs natively in a Hadoop cluster, bringing robust enterprise capabilities to the data lake.

IBM BigQuality

A powerful data quality solution that provides a rich set of data profiling, cleansing and monitoring capabilities that execute on the data nodes of a Hadoop cluster.

IBM Big Replicate

This active-transactional replication technology delivers continuous availability, streaming backup, uninterrupted migration, hybrid cloud and burst-to-cloud, and data consistency across clusters any distance apart.

IBM Streams

An advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of data stream sources.


IBM Db2 Big SQL data sheet

Spark SQL, one of the fastest open source SQL engines available, amplifies the power of Apache Hadoop on IBM BigInsights to create insight. Spark SQL is helping make big data environments faster than ever.

Hortonworks Data Platform data sheet

HDP addresses a range of data-at-rest use cases, powers real-time customer applications and delivers robust analytics that accelerate decision-making and innovation.

Hortonworks Data Flow data sheet

HDF was designed to meet the challenges of collecting data from a wide range of sources securely, efficiently, and over a geographically dispersed and possibly fragmented network.