Monday, October 28, 2013

ZooKeeper - In my point of view

ZooKeeper: Because coordinating distributed systems is a Zoo

What is Apache ZooKeeper?

  • Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems.[1]
  • Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination. [2]
  • ZooKeeper is a high-performance coordination service for distributed applications. It exposes common services - such as naming, configuration management, synchronization, and group services - in a simple interface so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own, specific needs.[3]

Inherent Problems in Distributed Systems

Little things become complicated because of the distributed nature. For example, changing the configuration of an application running on a single machine means editing a configuration file and restarting the app. Distributed applications, however, run on many machines, and all of them need to see configuration changes and react to them. To make matters worse, machines may be temporarily down or partitioned from the network. Robust distributed applications can also incorporate new machines or decommission machines on the fly, which means the configuration of a distributed application must itself be dynamic.

Distributed applications need a service they can trust to oversee the distributed environment.
The service needs to be as simple and as easy to understand as possible, so that a developer has no trouble integrating it into an application.

The service needs to have good performance so that applications can use the service extensively. 

ZooKeeper aims to meet the above requirements by distilling the essence of these different services into a very simple interface to a centralized coordination service. The service itself is distributed and highly reliable. Consensus, group management, leader election and presence protocols are implemented by the service so that applications do not need to implement them on their own.

ZooKeeper recipes show how this simple service can be used to build much more powerful abstractions.

Originally ZooKeeper was a subproject of Hadoop, but in January 2011 it graduated to become a top-level Apache project. ZooKeeper has Java and C interfaces, and the project hopes to someday add Python, Perl, and REST interfaces for building applications and management interfaces.

ZooKeeper FileSystem / Data Model

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a tree-based file system. Every znode in ZooKeeper's name space is identified by a path: a sequence of path elements separated by slashes ("/"). Every znode has a parent whose path is a prefix of the znode's path with one less element; the exception to this rule is the root ("/"), which has no parent. Also, a znode cannot be deleted if it has any children. There are no renames, no append semantics and no partial reads or writes.
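The path rules above can be sketched in a few lines. This is an illustrative helper, not part of any ZooKeeper client library; it just shows how a znode's parent path is the same path with the last element removed, with "/" as the only path that has no parent:

```python
def parent(path):
    """Return the parent znode path, or None for the root ("/")."""
    if path == "/":
        return None  # the root has no parent
    # drop the last path element; an empty result means the parent is the root
    return path.rsplit("/", 1)[0] or "/"

print(parent("/app/config/db"))  # -> /app/config
print(parent("/app"))            # -> /
print(parent("/"))               # -> None
```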

Data is read and written in its entirety

The main differences between ZooKeeper and a standard file system are that every znode can have data associated with it (every "file" can also be a "directory" and vice versa) and that znodes are limited in the amount of data they can hold. ZooKeeper was designed to store coordination data: status information, configuration, location information, etc. This kind of meta-information is usually measured in kilobytes, if not bytes. ZooKeeper has a built-in sanity check of 1 MB, to prevent it from being used as a large data store.

Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.
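The version number enables conditional updates: a client reads data plus its version, then submits a write that only succeeds if the version is still current. The sketch below is a toy stand-in for a znode's data and version, not the real client API, but it mirrors the compare-and-set behavior of ZooKeeper's setData(path, data, version) call:

```python
class Znode:
    """Toy model of a znode's data + version (illustrative, not the real API)."""
    def __init__(self, data=b""):
        self.data = data
        self.version = 0

    def set_data(self, data, expected_version):
        # Like ZooKeeper's setData: reject the write if the znode has
        # changed since the client read it (a "BadVersion" error).
        if expected_version != self.version:
            raise ValueError("BadVersion: znode changed since it was read")
        self.data = data
        self.version += 1
        return self.version

z = Znode(b"v0")
ver = z.version              # client reads data and remembers the version
z.set_data(b"v1", ver)       # succeeds: version still matches
# z.set_data(b"v2", ver)     # would now fail: the stored version is 1, not 0
```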

1. The data stored at each znode in a namespace is read and written atomically.
2. Each node has an Access Control List (ACL) that restricts who can do what.
3. ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. And if the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification.
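The key property of watches is that they are one-shot: a watch fires at most once and is removed when it triggers, so a client that wants further notifications must re-register. Here is a minimal local sketch of that semantic (real watches are delivered by the server, of course):

```python
class WatchedZnode:
    """Sketch of ZooKeeper's one-shot watch semantics (illustrative only)."""
    def __init__(self):
        self._watchers = []

    def watch(self, callback):
        # Register a watch; it will fire at most once, on the next change.
        self._watchers.append(callback)

    def change(self):
        # A watch is removed when it is triggered, so swap the list out
        # before firing: a second change notifies nobody.
        fired, self._watchers = self._watchers, []
        for cb in fired:
            cb("znode changed")

events = []
z = WatchedZnode()
z.watch(events.append)
z.change()   # watch fires once...
z.change()   # ...and is gone: no second notification
print(events)  # -> ['znode changed']
```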

ZooKeeper Distributed Architecture

Main Facts
  • All servers store a copy of the data in memory 
  • The leader is elected at startup 
  • Followers respond to clients 
  • All updates go through the leader 
  • Responses are sent when a majority of servers have persisted the change
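The "majority of servers" rule has a simple arithmetic consequence: an ensemble of n servers tolerates the loss of fewer than half of them, which is why ZooKeeper ensembles are usually sized with an odd number of machines. A quick sketch of the quorum math:

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a quorum."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many servers can fail while a quorum survives."""
    return ensemble_size - majority(ensemble_size)

print(majority(5))            # -> 3
print(tolerated_failures(5))  # -> 2
print(tolerated_failures(4))  # -> 1  (an even ensemble adds no fault tolerance)
```

Note that a 4-server ensemble tolerates no more failures than a 3-server one, so the extra server only adds write overhead.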

The ZooKeeper service itself is replicated over a set of machines that comprise the service. These machines maintain an in-memory image of the data tree, along with transaction logs and snapshots in a persistent store. Because the data is kept in memory, ZooKeeper is able to achieve very high throughput and low latency. The downside of an in-memory database is that the size of the database ZooKeeper can manage is limited by memory. This limitation is a further reason to keep the amount of data stored in znodes small.

The servers that make up the ZooKeeper service must all know about each other. As long as a majority of the servers are available the ZooKeeper service will be available. Clients must also know the list of servers. The clients create a handle to the ZooKeeper service using this list of servers.

Clients connect to only a single ZooKeeper server at a time. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server. When a client first connects to the ZooKeeper service, the first ZooKeeper server will set up a session for the client. If the client needs to connect to another server, this session will be reestablished with the new server.

Given a cluster of ZooKeeper servers, only one acts as the leader, whose role is to accept and coordinate all writes (via a quorum). All other servers are called followers and are read-only replicas of the leader. Read requests sent by a ZooKeeper client are processed locally at the ZooKeeper server to which the client is connected. If a read request registers a watch on a znode, that watch is also tracked locally at that server. Write requests are forwarded to the leader and go through consensus before a response is generated. The rest of the ZooKeeper servers (followers) receive message proposals from the leader and agree upon message delivery. Since followers are replicas of the leader, if the leader goes down, any other server can pick up the slack and immediately continue serving requests.

The messaging layer takes care of replacing leaders on failures and syncing followers with leaders. Sync requests are also forwarded to the leader, but do not actually go through consensus. Thus, the throughput of read requests scales with the number of servers, while the throughput of write requests decreases with the number of servers.

Order is very important to ZooKeeper. (They tend to be a bit obsessive compulsive.) All updates are totally ordered. ZooKeeper actually stamps each update with a number that reflects this order. We call this number the zxid (ZooKeeper Transaction Id). Each update will have a unique zxid. Reads (and watches) are ordered with respect to updates. Read responses will be stamped with the last zxid processed by the server that services the read.
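Concretely, a zxid is a 64-bit number: the high 32 bits hold the epoch of the leader that issued the update, and the low 32 bits hold a counter that the leader increments for each proposal. Packed this way, zxids compare as plain integers, so any update from a newer epoch outranks every update from an older one. A small sketch of that packing:

```python
def make_zxid(epoch, counter):
    """Pack a leader epoch (high 32 bits) and a per-epoch proposal
    counter (low 32 bits) into a single 64-bit zxid."""
    return (epoch << 32) | counter

def epoch_of(zxid):
    return zxid >> 32

# A new leader's first update outranks every update from the old epoch:
print(make_zxid(2, 0) > make_zxid(1, 9999))   # -> True
# Within one epoch, the counter gives a total order:
print(make_zxid(2, 5) > make_zxid(2, 4))      # -> True
print(epoch_of(make_zxid(3, 42)))             # -> 3
```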

ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state. 

And last but not least, what if you wanted to create a node that only exists for the lifetime of your connection to ZooKeeper? That's what "ephemeral nodes" are for. Now, put all of these things together, and you have a powerful toolkit to solve many problems in distributed computing. ZooKeeper guarantees totally ordered updates and data versioning, supports conditional updates (CAS), and offers advanced features such as "ephemeral nodes", "generated names", and an async notification ("watch") API. 
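The "generated names" feature refers to sequential znodes: when a client creates a znode with the sequence flag, ZooKeeper appends a zero-padded 10-digit counter to the requested name, giving each sibling a unique, ordered name. This naming convention (shown below as an illustrative helper, not the real API) is what makes recipes like leader election work: the client owning the lowest-numbered ephemeral sequential znode is the leader.

```python
def sequential_name(prefix, counter):
    """Name a sequential znode the way ZooKeeper does: the requested
    prefix plus a zero-padded 10-digit monotonic counter."""
    return "%s%010d" % (prefix, counter)

# Three clients racing to create "/election/node-" get ordered names:
names = [sequential_name("/election/node-", i) for i in (0, 1, 2)]
print(names[0])        # -> /election/node-0000000000
print(min(names))      # the lowest name wins leader election
```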


ZooKeeper is Simple, Replicated, Ordered and Fast. It provides its clients high-throughput, low-latency, highly available, strictly ordered access to znodes. 

  1. High Availability: ZooKeeper's architecture supports high availability through redundant services. Clients can ask another ZooKeeper server if the first fails to answer. ZooKeeper nodes store their data in a hierarchical name space, much like a file system or a trie data structure. Clients can read from and write to the nodes and in this way have a shared configuration service. Updates are totally ordered.
  2. Performance: The performance aspect of ZooKeeper allows it to be used in large distributed systems. 
  3. Reliability: The reliability aspects keep it from becoming a single point of failure in big systems, and its strict ordering allows sophisticated synchronization primitives to be implemented at the client.


ZooKeeper is used by companies including WSO2, Rackspace and Yahoo!, as well as open-source enterprise search systems like Solr. Cloudera Inc. and Hortonworks Inc. are other organizations that use ZooKeeper. The Katta project, which describes itself as "Lucene in the Cloud", is a scalable, fault-tolerant, distributed indexing system capable of serving large replicated Lucene indexes at high loads.[8]