Apache Hadoop® is an open source software framework that provides highly reliable distributed processing of large data sets using simple programming models. Known for its scalability, Hadoop runs on clusters of commodity hardware, offering a cost-effective way to store and process massive amounts of structured, semi-structured and unstructured data with no format requirements.
A data lake architecture including Hadoop can offer a flexible data management solution for your big data analytics initiatives. Because Hadoop is an open source software project and follows a distributed computing model, it can deliver a lower total cost of ownership for a big data software and storage solution.
Hadoop can also be installed on cloud servers to better manage the compute and storage resources required for big data. Leading cloud vendors offer managed Hadoop services, such as Amazon EMR on Amazon Web Services (AWS) and HDInsight on Microsoft Azure. Cloudera supports Hadoop workloads both on-premises and in the cloud, including options for one or more public cloud environments from multiple vendors.
The core Hadoop framework consists of four modules.

Hadoop Common: The common utilities and libraries that support the other Hadoop modules. Also known as Hadoop Core.
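As a brief illustration, the sketch below uses two classes from Hadoop Common, Configuration and FileSystem, to connect to a cluster's default file system. The fs.defaultFS address is a placeholder, not a real endpoint.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Minimal sketch using Hadoop Common's Configuration and FileSystem,
// the building blocks the other modules share. The NameNode address
// below is hypothetical.
public class CommonExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // loads core-site.xml etc. from the classpath
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);              // resolves the configured file system
        System.out.println("Working directory: " + fs.getWorkingDirectory());
        fs.close();
    }
}
```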
Hadoop Distributed File System (HDFS): A distributed file system for storing application data on commodity hardware. It provides high-throughput access to data and high fault tolerance. The HDFS architecture features a NameNode, which manages the file system namespace and file access, and multiple DataNodes, which manage data storage.
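The sketch below writes and then reads back a small HDFS file through the FileSystem API; the /tmp/hello.txt path is an arbitrary example.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal sketch of writing and reading an HDFS file. The NameNode tracks
// the file's metadata; its blocks are stored and replicated on DataNodes.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/hello.txt");  // example path

        try (FSDataOutputStream out = fs.create(path, true)) {  // overwrite if present
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);     // stream contents to stdout
        }
        fs.close();
    }
}
```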
Yet Another Resource Negotiator (YARN): A framework for managing cluster resources and scheduling jobs. By decoupling resource management from data processing, YARN lets Hadoop run workloads beyond MapReduce, such as interactive SQL, advanced modeling and real-time streaming.
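As a minimal sketch, the following uses the YarnClient API to ask the ResourceManager for its list of applications; it assumes a yarn-site.xml on the classpath pointing at a live cluster.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Minimal sketch: query the YARN ResourceManager for the applications
// it is tracking, whatever engine (MapReduce, Spark, etc.) submitted them.
public class YarnExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());  // reads yarn-site.xml from the classpath
        yarnClient.start();

        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```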
MapReduce: A YARN-based system for parallel processing of large data sets. A job's map tasks run in parallel over independent splits of the input, and its reduce tasks aggregate the intermediate results, as in the word count sketch below.
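The classic way to see MapReduce in action is word count, sketched here in the style of the Apache Hadoop tutorial: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word. Input and output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for each token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts gathered for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```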
Hadoop Ozone: A scalable, redundant and distributed object store designed for big data applications.
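A hedged sketch of Ozone's volume/bucket/key model using its Java client follows; the volume, bucket and key names are invented for illustration, and exact client signatures can vary between Ozone releases.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.ObjectStore;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.OzoneVolume;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

// Sketch of Ozone's hierarchy: volumes contain buckets, buckets contain keys.
// All names below are placeholders for illustration.
public class OzoneExample {
    public static void main(String[] args) throws Exception {
        try (OzoneClient client = OzoneClientFactory.getRpcClient(new OzoneConfiguration())) {
            ObjectStore store = client.getObjectStore();

            store.createVolume("assets");                  // hypothetical volume name
            OzoneVolume volume = store.getVolume("assets");
            volume.createBucket("images");                 // hypothetical bucket name
            OzoneBucket bucket = volume.getBucket("images");

            byte[] data = "hello ozone".getBytes(StandardCharsets.UTF_8);
            try (OzoneOutputStream out = bucket.createKey("greeting.txt", data.length)) {
                out.write(data);                           // store the object under the key
            }
        }
    }
}
```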