Apache Hadoop
An open-source software platform for the distributed processing of massive amounts of data across clusters of computers
Apache Hadoop Overview

Apache Hadoop® is an open source software framework that provides highly reliable distributed processing of large data sets using simple programming models. Hadoop, known for its scalability, is built on clusters of commodity computers, providing a cost-effective solution for storing and processing massive amounts of structured, semi-structured and unstructured data with no format requirements.

A data lake architecture that includes Hadoop can offer a flexible data management solution for your big data analytics initiatives. Because Hadoop is an open source software project that follows a distributed computing model, it can lower the total cost of ownership of a big data software and storage solution.

Hadoop can also be installed on cloud servers to better manage the compute and storage resources required for big data. Leading cloud vendors such as Amazon Web Services (AWS) and Microsoft Azure offer Hadoop solutions. Cloudera supports Hadoop workloads both on-premises and in the cloud, including options for one or more public cloud environments from multiple vendors.

The Hadoop ecosystem

The Hadoop framework, built by the Apache Software Foundation, includes:

Hadoop Common

The common utilities and libraries that support the other Hadoop modules. Also known as Hadoop Core.

Hadoop HDFS (Hadoop Distributed File System)

A distributed file system for storing application data on commodity hardware. It provides high-throughput access to data and high fault tolerance. The HDFS architecture features a NameNode to manage the file system namespace and file access and multiple DataNodes to manage data storage.
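
The division of labor between the NameNode and the DataNodes can be illustrated with a small in-memory sketch. This is a toy model, not the HDFS API: the class names, the 4-byte block size, the replication factor of 2 and the round-robin placement are all assumptions chosen for readability, and for brevity the NameNode here also moves the data, whereas in real HDFS clients exchange block data directly with the DataNodes.

```python
class DataNode:
    """Toy DataNode: stores block contents keyed by block ID."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Toy NameNode: tracks which DataNodes hold each block of each file.

    block_size and replication are illustrative values, far smaller than
    anything a real cluster would use.
    """

    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = datanodes
        self.block_size = block_size
        self.replication = replication
        self.namespace = {}  # path -> list of (block_id, [DataNode names])

    def write(self, path, data):
        entries = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{path}#{index}"
            # Hypothetical placement policy: round-robin over the DataNodes.
            targets = [
                self.datanodes[(index + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + self.block_size]
            entries.append((block_id, [node.name for node in targets]))
        self.namespace[path] = entries

    def read(self, path, failed=()):
        """Reassemble a file, skipping DataNodes listed in `failed`."""
        by_name = {node.name: node for node in self.datanodes}
        chunks = []
        for block_id, holders in self.namespace[path]:
            live = [name for name in holders if name not in failed]
            if not live:
                raise OSError(f"no live replica for {block_id}")
            chunks.append(by_name[live[0]].blocks[block_id])
        return b"".join(chunks)


if __name__ == "__main__":
    nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
    namenode = NameNode(nodes)
    namenode.write("/demo.txt", b"hello hadoop")
    # The file is still readable when one DataNode is unavailable.
    print(namenode.read("/demo.txt", failed={"dn1"}))
```

Because every block is stored on two of the three toy DataNodes, losing any single node still leaves one replica of each block, which is the basic idea behind the fault tolerance described above.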

Hadoop YARN

A framework for managing cluster resources and scheduling jobs. YARN stands for Yet Another Resource Negotiator. It supports workloads beyond MapReduce, such as interactive SQL, advanced modeling and real-time streaming.

Hadoop MapReduce

A YARN-based system for parallel processing of large data sets.
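
The MapReduce programming model can be sketched with the classic word-count example. The code below is a single-process illustration in plain Python, not the Hadoop MapReduce API; the function names are hypothetical, and on a real cluster the map and reduce steps would run as parallel tasks on many machines, with the framework handling the shuffle between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, which is what makes the map step
    # easy to parallelize across a cluster.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

Since each input line is mapped without reference to any other line, and each key is reduced without reference to any other key, both steps can be split across many workers.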

Hadoop Ozone

A scalable, redundant and distributed object store designed for big data applications.

IBM + Cloudera

See how they are driving advanced analytics with an enterprise-grade, secure, governed, open source-based data lake.

How to connect more data

Add a data lake to your data management strategy to integrate more unstructured data for deeper insights.

A robust, governed data lake for AI

Explore the storage and governance technologies needed for your data lake to deliver AI-ready data.

Data lake governance

See how proven governance solutions can drive better data integration, quality and security for your data lakes.

Big data analytics courses

Choose your learning path, based on skill level, from no-cost courses in data science, AI, big data and more.

Open source community

Join the IBM community for open source data management for collaboration, resources and more.

Next steps

Get started with Hadoop - Talk to an IBM big data specialist for 30 minutes at no cost.

Explore data lakes from IBM