What is Hadoop?

Apache™ Hadoop® is a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements.

MapReduce, the programming paradigm that allows for this massive scalability, is the heart of Hadoop. The term MapReduce refers to two separate and distinct tasks that Hadoop programs perform: a map task that converts a set of input data into key/value pairs, and a reduce task that combines those pairs into a smaller, aggregated set of results. Alongside MapReduce, Hadoop has two other core components: HDFS, its distributed file system, and YARN, its cluster resource manager.
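
To make the two phases concrete, here is a minimal, self-contained word-count sketch in plain Python; the sample input lines are illustrative only, and a real Hadoop job would spread this work across HDFS blocks and handle the shuffle/sort between the phases for you.

```python
# A minimal sketch of the two MapReduce phases (map, then reduce) in plain
# Python. The sample lines are illustrative; in Hadoop the framework runs
# the phases in parallel and shuffles/sorts the map output between them.
from itertools import groupby
from operator import itemgetter

lines = ["big data on hadoop", "hadoop stores big data"]

# Map phase: emit (key, value) pairs -- here, (word, 1).
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group pairs by key (done by the framework in Hadoop).
mapped.sort(key=itemgetter(0))

# Reduce phase: combine all values for each key into a single result.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'big': 2, 'data': 2, 'hadoop': 2, 'on': 1, 'stores': 1}
```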

Why Hadoop?

Some companies are delaying data opportunities because of organizational constraints, others are not sure what distribution to choose, and still others simply can’t find time to mature their big data delivery due to the pressure of day-to-day business needs.

Hadoop adopters won’t leave opportunity on the table; for them it is nonnegotiable to pursue new revenue opportunities, beat their competition, and delight their customers with better, faster analytics and data applications.

The smartest Hadoop strategies start with choosing a recommended distribution, then mature the environment with a modernized hybrid architecture, and adopt a data lake strategy built on Hadoop technology.

IBM Big Data Tools

Recommended and Certified Distributions

The best Hadoop distributions give you all the benefits of open source, bundle the Hadoop ecosystem for you, and are certified to be interoperable so that you don’t have to guess – or worry – about stability.

Hadoop Scalability

Spark and Hadoop Together

Hadoop and Spark were made to work together; investing in both with an integrated solution helps you solve bigger problems faster by taking advantage of cost-effective storage and lightning-fast data processing.

Advanced Tools with Apache Hadoop

Open and Hybrid Architecture

A mature Hadoop platform is just the beginning – a modernized deployment is a seamless, hybrid architecture spanning on-premises and cloud for data ingest, high availability, and disaster recovery.

Streaming, Predictive and Graph Analytics with Apache Spark

Apache Spark™ is an open source, in-memory compute engine with a stack of advanced analytics capabilities.

Spark’s in-memory processing enables extremely fast processing speed – up to 100x faster than MapReduce – because it persists intermediate data in memory for reuse.
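
As a rough illustration of that reuse, the PySpark sketch below caches an intermediate DataFrame in memory so later actions avoid recomputing it from disk; the HDFS path, column names, and application name are assumptions for the example, not details from this page.

```python
# A minimal PySpark sketch of keeping an intermediate dataset in memory
# so repeated actions reuse it instead of re-reading from disk.
# The HDFS path and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

events = spark.read.json("hdfs:///data/events")           # hypothetical HDFS path
errors = events.filter(events.level == "ERROR").cache()   # persist in memory

errors.count()                              # first action: computes and caches
errors.groupBy("service").count().show()    # reuses the cached data

spark.stop()
```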

Spark’s advanced analytics and data science capabilities include near real-time streaming via micro-batch processing, graph computation and analysis, and built-in machine learning libraries, all of which are highly extensible.
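
For example, the micro-batch streaming model can be sketched roughly as follows, counting words from a socket source in near real time; the host, port, and console sink are illustrative assumptions, and the pattern mirrors the standard Structured Streaming word count rather than any specific offering mentioned on this page.

```python
# A minimal Structured Streaming sketch of Spark's micro-batch model:
# each batch of lines arriving on a socket is tokenized and counted,
# and the running totals are printed to the console.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")   # hypothetical source
         .option("port", 9999)
         .load())

# Split each incoming line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```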

Spark and Hadoop – Better together

Although many think of Hadoop and Spark as alternatives, Spark was developed for use with Hadoop and does not include its own data store. The two big data frameworks are complementary and work together very effectively as a big data system: Hadoop MapReduce is used for batch processing of data stored in HDFS for fast and reliable analysis, whereas Apache Spark is used for data streaming and in-memory distributed processing for faster, near real-time analysis.
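
A rough sketch of that division of labour: a Spark job submitted to the Hadoop cluster's YARN resource manager can read data that batch jobs have already landed in HDFS and analyze it in memory. The paths and master setting below are assumptions for illustration, not a prescribed configuration.

```python
# A minimal sketch of the two frameworks working together: Spark scheduled
# by the Hadoop cluster's YARN resource manager, reading input that lives
# in HDFS. Paths and the master setting are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-on-hadoop")
         .master("yarn")                      # run on the Hadoop cluster
         .getOrCreate())

# Read files that a batch MapReduce job (or any other process) left in HDFS.
logs = spark.read.text("hdfs:///warehouse/raw_logs")

# Fast in-memory analysis over the HDFS-resident data.
print(logs.filter(logs.value.contains("ERROR")).count())

spark.stop()
```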

Data Lakes: From concept to big data’s future

A data lake is a large repository of vast and disparate data held in its native format until it is needed – or queried. Data lakes can be supported by a Hadoop-based technology environment.

Using a data lake approach in a cloud or hybrid cloud infrastructure can address the challenges presented by traditional data marts, provide greater access to data across the company (even breaking down data silos), accelerate data gathering and preparation, and improve the accuracy of analysis.

Get started with Hadoop

The IBM Open Platform (IOP) is IBM’s big data platform and Hadoop distribution. IOP is built on 100% open source Apache ecosystem components, including Apache Spark – as if you had downloaded components from Apache.org and built a distribution yourself.

IBM IOP was designed with analytics, operational excellence, and security empowerment in mind, and as such offers a unique and optimal combination of Apache components to maximize the development of big data applications. With ODPi certification, IOP provides an open and highly flexible platform which supports and accelerates the development of big data ecosystems.

IBM IOP with Spark and Hadoop is the foundation of IBM’s BigInsights family of products and Hadoop offerings.

Hadoop Resources

Access analyst reports, data sheets, white papers and more.

The Horsepower of Hadoop

Fast and flexible insight with results

How are Clients Using Hadoop and Spark?

Hadoop your way to a hybrid cloud

Taking data warehousing to the hybrid cloud