ZooKeeper is an open source Apache project that provides centralized infrastructure and services that enable synchronization across an Apache Hadoop cluster.
ZooKeeper maintains common objects needed in large cluster environments, such as configuration information and a hierarchical naming space. Applications leverage these services to coordinate distributed processing across large clusters.
How does it work?
Imagine a Hadoop cluster spanning 500 or more commodity servers. There's a need for centralized management of the entire cluster in terms of naming, group, and synchronization services, configuration management, and more. Other open source projects leveraging Hadoop clusters require these cross-cluster services as well. Embedding ZooKeeper alleviates the need to build synchronization services from scratch. Interaction with ZooKeeper occurs by way of its Java™ or C interfaces.
For applications, ZooKeeper provides an infrastructure for cross-node synchronization. It does this by maintaining status-type information in memory on ZooKeeper servers. A ZooKeeper server keeps a copy of the state of the entire system and persists this information in local log files. Large Hadoop clusters are supported by multiple ZooKeeper servers (a master server keeps the top-level servers synchronized).
Within ZooKeeper, an application can create what is called a znode (a file that persists in memory on the ZooKeeper servers). The znode can be updated by any node in the cluster, and any node in the cluster can register to be informed of changes to that znode (in ZooKeeper parlance, a server can be set up to “watch” a specific znode).
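To make this concrete, the following Java sketch uses the standard ZooKeeper client API to create a znode and register a watch on it. The connection string (`localhost:2181`), the znode path, and the status payload are assumptions chosen for illustration; running it requires a live ZooKeeper ensemble and the `org.apache.zookeeper` client library on the classpath.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeWatchExample {
    // Hypothetical single-level znode path used for this sketch.
    static String statusPath(String nodeName) {
        return "/status-" + nodeName;
    }

    public static void main(String[] args) throws Exception {
        // Connect to an (assumed) ensemble at localhost:2181.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000,
                event -> System.out.println("Session event: " + event.getState()));

        String path = statusPath("worker-1");

        // Create the znode if it does not already exist.
        if (zk.exists(path, false) == null) {
            zk.create(path, "STARTING".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register a watch: ZooKeeper notifies this client the next time
        // the znode's data changes.
        Stat stat = zk.exists(path, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                System.out.println("znode changed: " + event.getPath());
            }
        });
        System.out.println("watching " + path + ", version " + stat.getVersion());

        zk.close();
    }
}
```

Note that a persistent znode survives client disconnects; ZooKeeper also offers ephemeral znodes, which are deleted automatically when the creating session ends.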
Using this znode infrastructure (and there is far more to it than we can do justice to in this section), applications can synchronize their tasks across the distributed cluster by updating their status in a ZooKeeper znode, which in turn informs the rest of the cluster of a specific node's status change. This cluster-wide status centralization service is essential for management and serialization tasks across a large distributed set of servers.
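The status-propagation pattern just described can be sketched as follows. One client publishes a status change with `setData`, and any watching client receives a `NodeDataChanged` event. Because ZooKeeper watches fire only once, the callback re-registers the watch after each notification. The znode path, status strings, and connection string are assumptions for the example, and the znode is presumed to exist already; a live ensemble and the `org.apache.zookeeper` library are required.

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatusBroadcast {
    // Status strings are stored in the znode as UTF-8 bytes.
    static byte[] encode(String status) {
        return status.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Read the current status and arm a one-shot watch; the callback
    // re-arms the watch so every subsequent change is observed too.
    static void watchStatus(ZooKeeper zk, String path) throws Exception {
        byte[] data = zk.getData(path, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                try {
                    watchStatus(zk, path); // re-register the watch
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, new Stat());
        System.out.println("current status: " + decode(data));
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, e -> {});
        String path = "/status-worker-1"; // assumed to exist already

        watchStatus(zk, path);

        // The worker publishes a state change; version -1 means
        // "update unconditionally", skipping the optimistic version check.
        zk.setData(path, encode("RUNNING"), -1);

        Thread.sleep(1000); // give the watch callback time to fire
        zk.close();
    }
}
```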