What is Apache ZooKeeper?

ZooKeeper is an open source Apache project that provides centralized infrastructure and services that enable synchronization across an Apache Hadoop cluster.

ZooKeeper maintains common objects needed in large cluster environments, such as configuration information and a hierarchical namespace. Applications use these services to coordinate distributed processing across large clusters.

How does it work?

Imagine a Hadoop cluster spanning 500 or more commodity servers. The entire cluster needs centralized management of naming, group and synchronization services, configuration management, and more. Other open source projects that run on Hadoop clusters also require cross-cluster services. Embedding ZooKeeper saves them from building synchronization services from scratch. Applications interact with ZooKeeper through its Java™ or C interfaces.

For applications, ZooKeeper provides an infrastructure for cross-node synchronization. It does this by maintaining status-type information in memory on ZooKeeper servers. A ZooKeeper server keeps a copy of the state of the entire system and persists this information in local log files. Large Hadoop clusters are supported by multiple ZooKeeper servers (a master server synchronizes the top-level servers).
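To make the idea of in-memory state backed by local log files concrete, here is a minimal Python sketch. It is a toy model, not ZooKeeper's actual storage format: the `LoggedState` class, the JSON-lines log layout, and the `/config/batch_size` key are all assumptions made up for illustration.

```python
import json
import os
import tempfile


class LoggedState:
    """Toy key-value state held in memory and persisted as an append-only log.

    Hypothetical illustration only; not ZooKeeper's real on-disk format.
    """

    def __init__(self, log_path: str) -> None:
        self.log_path = log_path
        self.state = {}
        self._replay()

    def _replay(self) -> None:
        # Rebuild the in-memory state by re-applying every logged update in order.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.state[entry["key"]] = entry["value"]

    def set(self, key: str, value: str) -> None:
        # Append to the log and force it to disk before changing memory,
        # so an update that has been applied survives a restart.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.state[key] = value


log_dir = tempfile.mkdtemp()
log_file = os.path.join(log_dir, "updates.log")

server = LoggedState(log_file)
server.set("/config/batch_size", "128")
server.set("/config/batch_size", "256")

# Simulate a restart: a fresh object recovers its state from the log alone.
restarted = LoggedState(log_file)
recovered = restarted.state["/config/batch_size"]
```

Reads are served from memory, while the log exists only so that the same state can be rebuilt after a restart. That is the trade-off the paragraph above describes.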

Within ZooKeeper, an application can create what is called a znode (a file that persists in memory on the ZooKeeper servers). The znode can be updated by any node in the cluster, and any node in the cluster can register to be informed of changes to that znode (in ZooKeeper parlance, a server can be set up to “watch” a specific znode). 
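The znode and watch behavior can be sketched with a small in-memory model in Python. This is a toy, not the ZooKeeper client API: `ToyZnodeStore`, its method names, and the `/app/status` path are hypothetical. The sketch assumes watches are one-shot and that a notification carries only the path, so the watcher reads the znode again to see the new value.

```python
from typing import Callable, Dict, List, Optional


class ToyZnodeStore:
    """In-memory toy model of znodes with one-shot watches.

    Hypothetical illustration only; not the ZooKeeper client API.
    """

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._watches: Dict[str, List[Callable[[str], None]]] = {}

    def create(self, path: str, data: bytes = b"") -> None:
        if path in self._data:
            raise KeyError(f"znode already exists: {path}")
        self._data[path] = data

    def get(self, path: str, watch: Optional[Callable[[str], None]] = None) -> bytes:
        # Reading a znode can optionally register a watch on it.
        value = self._data[path]
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return value

    def set(self, path: str, data: bytes) -> None:
        if path not in self._data:
            raise KeyError(f"no such znode: {path}")
        self._data[path] = data
        # Watches are removed as they fire, so each one triggers at most once.
        for callback in self._watches.pop(path, []):
            callback(path)


store = ToyZnodeStore()
store.create("/app/status", b"starting")

events = []
store.get("/app/status", watch=events.append)

store.set("/app/status", b"ready")     # fires the watch
store.set("/app/status", b"draining")  # no notification: the watch already fired
```

After these calls, `events` holds a single entry for `/app/status`. A client that wants to keep following the znode would register a new watch each time it reads the value again.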

Using this znode infrastructure (which offers far more than this section can cover), applications can synchronize their tasks across the distributed cluster by updating their status in a ZooKeeper znode, which then informs the rest of the cluster of a specific node’s status change. This cluster-wide status centralization service is essential for management and serialization tasks across a large distributed set of servers.
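As a rough illustration of status-based coordination, the Python sketch below holds back a "next stage" until every worker has reported that it is done. Everything in it is assumed for the example: `ToyStatusBoard` stands in for a set of status znodes, the `/tasks/worker-N` paths are invented, and the persistent listener is a simplification (ZooKeeper's own notification mechanism differs in detail).

```python
from typing import Callable, Dict, List


class ToyStatusBoard:
    """Toy stand-in for a set of status znodes; not the ZooKeeper client API."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self._listeners: List[Callable[[str, str], None]] = []

    def subscribe(self, listener: Callable[[str, str], None]) -> None:
        self._listeners.append(listener)

    def update(self, path: str, status: str) -> None:
        # Record the new status, then tell every subscriber about the change.
        self._status[path] = status
        for listener in list(self._listeners):
            listener(path, status)

    def snapshot(self) -> Dict[str, str]:
        return dict(self._status)


workers = ["/tasks/worker-1", "/tasks/worker-2", "/tasks/worker-3"]
board = ToyStatusBoard()
log: List[str] = []


def start_next_stage_when_ready(path: str, status: str) -> None:
    # Proceed only once every worker's status entry reads "done".
    current = board.snapshot()
    all_done = all(current.get(w) == "done" for w in workers)
    if all_done and "next stage started" not in log:
        log.append("next stage started")


board.subscribe(start_next_stage_when_ready)

for w in workers:
    board.update(w, "running")

board.update(workers[0], "done")
board.update(workers[1], "done")
started_early = bool(log)  # still False: worker-3 has not finished

board.update(workers[2], "done")
```

No worker talks to another directly. Each one only publishes its own status, and the coordinator logic reacts to the shared view, which is the pattern the paragraph above describes.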


Apache Hadoop

Use ZooKeeper with a Hadoop cluster. Apache Hadoop is a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements.

Db2 Big SQL

IBM Db2® Big SQL is a hybrid SQL engine for Hadoop that delivers easy data querying across the enterprise. Use a single database connection or query to connect to disparate sources such as HDFS, RDBMS, NoSQL databases, object stores and WebHDFS.


