What is Apache Hadoop®?

Apache Hadoop® is a highly scalable storage and processing platform designed to handle very large data sets across hundreds to thousands of computing nodes operating in parallel. It provides cost-effective storage for large data volumes and imposes no format requirements on the data it stores.

MapReduce, the programming paradigm that enables this massive scalability, is the heart of Hadoop. The term MapReduce refers to two separate and distinct tasks that Hadoop programs perform: a map task, which converts a set of input data into intermediate key/value pairs, and a reduce task, which combines those intermediate pairs into a smaller, summarized result. Beyond MapReduce, Hadoop has two main components: HDFS, the distributed file system that stores the data, and YARN, the resource manager that schedules work across the cluster.
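To make the map and reduce phases concrete, here is a minimal sketch of the MapReduce pattern in plain Python: a word count over a handful of in-memory "documents". This runs on a single machine with no Hadoop involved; on a real cluster, the map and reduce tasks are distributed across nodes and the shuffle is handled by the framework.

```python
# Minimal single-machine sketch of the MapReduce pattern: word count.
# (Illustrative only -- not the Hadoop API.)
from collections import defaultdict

def map_phase(document):
    """Map task: emit (word, 1) pairs for each word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce task: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big insights", "data at rest and data in motion"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
result = reduce_phase(shuffle(mapped))
print(result["data"])  # 3
```

Each map call is independent of the others, which is what allows Hadoop to fan the map phase out across thousands of nodes before the shuffle brings matching keys back together for the reducers.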

Key Big Data Use Cases for Hadoop

  1. Data Science Sandbox: Data science is an interdisciplinary field that combines machine learning, statistics, advanced analysis, and programming. It draws out hidden insights and puts data to work in the cognitive era.
  2. Data Lake Analytics: A data lake is a shared data environment that comprises multiple repositories and capitalizes on big data technologies. It provides data to an organization for a variety of analytics processes.
  3. Streaming Data / IOT Platform: Stream computing enables organizations to process data streams which are always on and never ceasing. Stream computing helps organizations spot opportunities and risks across all data.
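As an illustration of the stream-computing idea behind the third use case (not any specific Hadoop or HDF API), the sketch below processes an always-on sequence of readings incrementally, flagging values that spike away from a moving average. The `detect_spikes` helper and the sample sensor values are hypothetical, invented for this example.

```python
# Illustrative sketch of stream computing: process readings one at a time,
# keeping only a small sliding window of recent state in memory.
from collections import deque

def sensor_readings():
    """Stand-in for an unbounded data stream; here, a finite sample."""
    for value in [21.0, 21.5, 22.0, 30.0, 21.2]:
        yield value

def detect_spikes(stream, window_size=3, threshold=5.0):
    """Flag readings that deviate sharply from the recent moving average."""
    window = deque(maxlen=window_size)  # bounded memory, suits endless streams
    alerts = []
    for value in stream:
        if window and abs(value - sum(window) / len(window)) > threshold:
            alerts.append(value)
        window.append(value)
    return alerts

print(detect_spikes(sensor_readings()))  # [30.0]
```

The key design point is that the stream is never materialized in full: state is bounded by the window size, so the same loop works whether the source yields five readings or runs forever.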

Learn more about Big Data

IBM Big Data Tools

Recommended and Certified Distributions

The best Hadoop distributions give you all the benefits of open source, bundle the Hadoop ecosystem for you, and are certified to be interoperable so that you don’t have to guess – or worry – about stability.

Apache Hadoop Scalability

Apache Spark™ and Hadoop Together

Hadoop and Spark were made to work together: investing in both with an integrated solution helps you solve bigger problems faster by combining cost-effective storage with lightning-fast data processing.

Advanced tools with Apache Hadoop

100% Open Source

This 100% open source Hadoop platform is built for big data analytics and innovation on any type of data, whether at rest or in motion. Support services are also offered for clients looking for help working with the platform.

Hortonworks DataFlow

HDF Data Motion Platform

Open Data Platform Initiative

Accelerated and Stable Apache Hadoop®

The best way to move forward with Hadoop is to choose an installation package that simplifies interoperability so that a Hadoop environment remains as standardized as possible. The Open Data Platform Initiative (ODPi) is a multi-vendor standards association focused on advancing the adoption of Hadoop in the enterprise by promoting the interoperability of big data tools. ODPi simplifies and standardizes the Apache Hadoop big data ecosystem with a common reference specification called the ODPi Core.

Learn more about ODPi

Key Value Added Capabilities

IBM Big SQL

One key capability is SQL querying, and this is where IBM Big SQL comes in: a data virtualization tool that lets you access, query, and summarize data from any platform, including databases, data warehouses, NoSQL databases, and more. Big SQL can exploit Hive, HBase, and Spark concurrently, using a single database connection, even within a single query.

Learn more

IBM Data Science Experience

Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.

Learn more


Resources related to Apache Hadoop®

Access analyst reports, data sheets, white papers and more.

IBM Big SQL data sheet

With Spark SQL, one of the fastest open source SQL engines available, amplify the power of Apache Hadoop on IBM BigInsights to create insight. Spark SQL is helping make big data environments faster than ever.

Hortonworks Data Platform Data Sheet

HDP addresses a range of data-at-rest use cases, powers real-time customer applications and delivers robust analytics that accelerate decision-making and innovation.

Hortonworks Data Flow Data Sheet

HDF was designed to meet the challenges of collecting data from a wide range of data sources securely and efficiently, over a geographically dispersed and possibly fragmented network.