What is Hadoop?

Apache™ Hadoop® is a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements.

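MapReduce, the programming paradigm that allows for this massive scalability, is the heart of Hadoop. The term MapReduce refers to two separate and distinct tasks that Hadoop programs perform: a map task, which converts input data into intermediate key/value pairs, and a reduce task, which combines the values that share a key into a smaller set of results. Hadoop has two main components: HDFS, the distributed file system, and YARN, the cluster resource manager.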
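
The two MapReduce tasks can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the map and reduce tasks would run in parallel on many nodes, and the framework would perform the grouping step between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map task: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce task: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big insight", "data lake"]))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be split across nodes without coordination, which is where the scalability comes from.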

Why Hadoop?

Some companies are delaying data opportunities because of organizational constraints, others are not sure what distribution to choose, and still others simply can’t find time to mature their big data delivery due to the pressure of day-to-day business needs.

Hadoop adopters won’t leave opportunity on the table; for them, it is nonnegotiable to pursue new revenue opportunities, beat their competition, and delight their customers with better, faster analytics and data applications.

The smartest Hadoop strategies start with choosing recommended distributions, then maturing the environment with modernized hybrid architectures, and adopting a data lake strategy based on Hadoop technology.

IBM BigData Tools

Recommended and Certified Distributions

The best Hadoop distributions give you all the benefits of open source, bundle the Hadoop ecosystem for you, and are certified to be interoperable so that you don’t have to guess – or worry – about stability.

Hadoop Scalability

Spark and Hadoop Together

Hadoop and Spark were made to work together; investing in both with an integrated solution helps you solve bigger problems faster by taking advantage of cost-effective storage and lightning-fast data processing.

Advanced tools with Apache Hadoop

Open and Hybrid Architecture

A mature Hadoop platform is just the beginning – a modernized architecture is a seamless, hybrid architecture across on-premises and cloud environments for data ingest, high availability and disaster recovery.


Open Data Platform Initiative

Streaming, Predictive and Graph Analytics with Apache Spark

Apache Spark™ is an open source, in-memory compute engine with a stack of advanced analytics capabilities.

Spark’s in-memory processing enables extremely fast processing speed – up to 100x faster than MapReduce – because it persists intermediate data in memory for reuse.

Spark’s advanced analytics and data science capabilities include near real-time streaming via micro batch processing, graph computation analysis, and built-in machine learning libraries which are highly extensible.
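
The micro-batch idea behind near real-time streaming can be sketched in plain Python. This is an illustration of the technique only, not Spark's API: the names micro_batches and count_per_batch, and the (timestamp, value) event shape, are assumptions made for this example. Events are assumed to arrive sorted by timestamp.

```python
from collections import Counter


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width windows.

    Yields one list of values per window, including empty windows.
    """
    batch, window_end = [], None
    for timestamp, value in events:
        if window_end is None:
            # The first event opens the first window.
            window_end = timestamp + batch_interval
        while timestamp >= window_end:
            # Close every window that ended before this event arrived.
            yield batch
            batch, window_end = [], window_end + batch_interval
        batch.append(value)
    if batch:
        yield batch


def count_per_batch(events, batch_interval):
    """Run a small batch job (a value count) on each micro-batch."""
    return [dict(Counter(batch)) for batch in micro_batches(events, batch_interval)]


if __name__ == "__main__":
    stream = [(0.0, "click"), (0.4, "view"), (1.1, "click"), (1.2, "click"), (3.5, "view")]
    for result in count_per_batch(stream, batch_interval=1.0):
        print(result)
```

Each window is processed as an ordinary small batch, so results lag the stream by roughly one batch interval; a shorter interval lowers latency at the cost of more frequent jobs.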

Spark and Hadoop – Better together

Although many think of Hadoop and Spark as alternatives, Spark was developed for use with Hadoop and does not include its own data store. The two big data frameworks are complementary and work together effectively as a single big data system. Hadoop MapReduce is used for batch processing of data stored in HDFS for fast and reliable analysis, whereas Apache Spark is used for data streaming and in-memory distributed processing for faster, near real-time analysis.

Data Lakes: From concept to big data’s future

A data lake is a large repository of vast, disparate data held in its native format until it is needed or queried. Data lakes can be supported by a Hadoop-based technology environment.
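
Keeping data in its native format until it is queried is often described as schema-on-read. The sketch below illustrates that idea in plain Python under stated assumptions: RAW_ZONE, read_records, and total_by_customer are hypothetical names, and the two sample payloads are invented for illustration. A real data lake would keep such files in a store like HDFS rather than in a Python list.

```python
import csv
import io
import json

# Hypothetical raw zone: each entry keeps its payload exactly as it arrived.
RAW_ZONE = [
    {"format": "json", "payload": '{"customer": "acme", "amount": 120.5}'},
    {"format": "csv", "payload": "customer,amount\nglobex,80\nacme,19.5\n"},
]


def read_records(entry):
    """Parse one raw entry into dictionaries only at query time."""
    if entry["format"] == "json":
        yield json.loads(entry["payload"])
    elif entry["format"] == "csv":
        yield from csv.DictReader(io.StringIO(entry["payload"]))
    else:
        raise ValueError(f"unsupported format: {entry['format']}")


def total_by_customer(raw_zone):
    """Query across formats: sum the amount field per customer."""
    totals = {}
    for entry in raw_zone:
        for record in read_records(entry):
            name = record["customer"]
            totals[name] = totals.get(name, 0.0) + float(record["amount"])
    return totals


if __name__ == "__main__":
    print(total_by_customer(RAW_ZONE))
```

Nothing is converted when data lands in the raw zone; the structure is applied only by the query that needs it, so new formats can be added by extending read_records.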

Using a data lake approach in a cloud or hybrid cloud infrastructure can address the challenges presented by traditional data marts, provide greater access to data across the company (even breaking down data silos), accelerate data gathering and preparation, and enhance the accuracy of the analysis process.

Value Added Capabilities

IBM Open Platform brings a range of open source components to your Hadoop implementation, but you may want to leverage additional tooling to integrate it with your data applications and architecture.

One key capability is SQL querying. IBM Big SQL is a data virtualization tool that lets you access, query, and summarize data from any platform, including databases, data warehouses, NoSQL databases, and more. Big SQL concurrently exploits Hive, HBase and Spark using a single database connection, or even a single query.

Hadoop Resources

Access analyst reports, data sheets, white papers and more.

The Horsepower of Hadoop

Fast and flexible insight with results

How are Clients Using Hadoop and Spark?

Hadoop your way to a hybrid cloud

Spark SQL: Faster Insights for Business

With Spark SQL, billed as the fastest open source SQL engine available, you can amplify the power of Apache Hadoop on IBM BigInsights to create insight. Spark SQL is helping make big data environments faster than ever.