Apache Hadoop® is an open source software framework that provides reliable, scalable distributed processing of large data sets using simple programming models. Hadoop is designed to run on clusters of commodity computers, providing a cost-effective way to store and process massive amounts of structured, semi-structured and unstructured data without imposing format requirements.
A data lake architecture including Hadoop can offer a flexible data management solution for your big data analytics initiatives. Because Hadoop is an open source software project and follows a distributed computing model, it can offer a lower total cost of ownership for a big data software and storage solution.
Hadoop can also be installed on cloud servers to better manage the compute and storage resources required for big data. Leading cloud vendors such as Amazon Web Services (AWS) and Microsoft Azure offer Hadoop-based solutions. Cloudera supports Hadoop workloads both on-premises and in the cloud, including options for one or more public cloud environments from multiple vendors.
Hadoop Common: The common utilities and libraries that support the other Hadoop modules. It is also known as Hadoop Core.
Hadoop Distributed File System (HDFS): A distributed file system for storing application data on commodity hardware. It provides high-throughput access to data and high fault tolerance. The HDFS architecture features a NameNode, which manages the file system namespace and file access, and multiple DataNodes, which manage data storage.
Hadoop YARN: A framework for managing cluster resources and scheduling jobs. YARN stands for Yet Another Resource Negotiator. It supports workloads beyond batch processing, such as interactive SQL, advanced modeling and real-time streaming.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Hadoop Ozone: A scalable, redundant and distributed object store designed for big data applications.
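To make the NameNode and DataNode roles in HDFS more concrete, here is a minimal single-process sketch in Python. It is a toy model, not the HDFS API: the class names, the tiny block size, the replication factor of 2 and the round-robin placement are all assumptions chosen for illustration. In a real cluster, clients also stream block data directly to DataNodes rather than through the NameNode.

```python
class DataNode:
    """Toy DataNode: stores raw blocks, knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Toy NameNode: owns the namespace (path -> blocks -> replica locations)."""

    def __init__(self, datanodes, block_size=4, replication=2):
        # block_size and replication are illustrative values, not HDFS defaults.
        self.datanodes = datanodes
        self.block_size = block_size
        self.replication = replication
        self.namespace = {}  # path -> [(block_id, [datanode names])]
        self._next = 0       # round-robin cursor for replica placement

    def write(self, path, data):
        """Split data into blocks and place each block on several DataNodes."""
        entries = []
        for offset in range(0, len(data), self.block_size):
            block_id = f"{path}#{offset // self.block_size}"
            chunk = data[offset:offset + self.block_size]
            targets = [
                self.datanodes[(self._next + i) % len(self.datanodes)]
                for i in range(self.replication)
            ]
            self._next += 1
            for node in targets:
                node.blocks[block_id] = chunk
            entries.append((block_id, [node.name for node in targets]))
        self.namespace[path] = entries

    def read(self, path, failed=()):
        """Reassemble a file, skipping DataNodes listed as failed."""
        by_name = {node.name: node for node in self.datanodes}
        data = b""
        for block_id, holders in self.namespace[path]:
            live = [name for name in holders if name not in failed]
            if not live:
                raise IOError(f"no live replica for {block_id}")
            data += by_name[live[0]].blocks[block_id]
        return data


nodes = [DataNode("dn0"), DataNode("dn1"), DataNode("dn2")]
namenode = NameNode(nodes)
namenode.write("/logs/app.txt", b"hello hadoop")
recovered = namenode.read("/logs/app.txt", failed={"dn1"})
```

Because every block is stored on two DataNodes, the read still succeeds when one DataNode is marked as failed, which is the basic idea behind the fault tolerance that replication provides.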
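The parallel processing model behind MapReduce can be illustrated with the classic word count. The sketch below runs in a single Python process and does not use the Hadoop API; the function names are hypothetical, and on a real cluster the map and reduce steps would run as distributed tasks scheduled through YARN, with the framework performing the shuffle between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big clusters"])
```

Each call to `map_phase` depends only on its own input line, and each call to `reduce_phase` depends only on one key, which is what allows a framework to spread both steps across many machines.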