ZooKeeper

ZooKeeper is a centralized service for maintaining and synchronizing configuration data. ZooKeeper can also coordinate distributed locks, for example locking incoming data while a client ingests it. ZooKeeper runs as an ensemble with an odd number of servers for redundancy; it is best to have a minimum of three ZooKeeper servers.

About ZooKeeper

In distributed application engineering, the word node can refer to a generic host machine, a server, a member of an ensemble, a client process, and so on. In the ZooKeeper documentation, znodes are the data nodes; servers are the machines that make up the ZooKeeper service; quorum peers are the servers that make up an ensemble; and a client is any host or process that uses the ZooKeeper service.

Znodes have three characteristics:

Watches

Clients can set watches on znodes. A change to a watched znode triggers the watch and then clears it. When a watch triggers, ZooKeeper sends the client a notification. More information about watches can be found in the section ZooKeeper Watches. For example, HDFS HA sets a watcher on startup.
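Here is a quick sketch from the zookeeper-client shell (the /demo path and zk01 host are just for illustration; the trailing watch argument is the ZooKeeper 3.4 CLI syntax shipped with CDH). The get registers a one-time watch, and the following set triggers it:

[zk: zk01:2181(CONNECTED) 0] create /demo mydata
[zk: zk01:2181(CONNECTED) 1] get /demo watch
[zk: zk01:2181(CONNECTED) 2] set /demo newdata
WATCHER::
WatchedEvent state:SyncConnected type:NodeDataChanged path:/demo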

Data Access

The data stored at each znode in a namespace is read and written atomically: reads return all the data bytes associated with a znode, and a write replaces all the data. Each znode has an Access Control List (ACL) that restricts who can do what. The default znode type is persistent: persistent znodes remain until explicitly deleted, and they typically hold the important configuration details. When a new node joins the cluster, it reads its configuration information from persistent znodes.
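As a sketch in the zookeeper-client shell (/config is an illustrative path): create makes a persistent znode by default, get returns all of its bytes, set replaces them atomically, and getAcl shows who can do what.

[zk: zk01:2181(CONNECTED) 0] create /config "cluster-settings"
[zk: zk01:2181(CONNECTED) 1] get /config
[zk: zk01:2181(CONNECTED) 2] set /config "updated-settings"
[zk: zk01:2181(CONNECTED) 3] getAcl /config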

Ephemeral Nodes

ZooKeeper also has the notion of ephemeral znodes. These znodes exist as long as the session that created them is active; when the session ends, the znode is deleted. Because of this behavior, ephemeral znodes are not allowed to have children. Ephemeral znodes are mainly useful for keeping track of client applications in case of failures: when the application fails, its session ends and the znode disappears.
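A minimal illustration in the zookeeper-client shell (paths are made up): the parent must be persistent, since ephemerals cannot have children, and once quit closes the session ZooKeeper removes /workers/worker1 automatically.

[zk: zk01:2181(CONNECTED) 0] create /workers ""
[zk: zk01:2181(CONNECTED) 1] create -e /workers/worker1 ""
[zk: zk01:2181(CONNECTED) 2] ls /workers
[worker1]
[zk: zk01:2181(CONNECTED) 3] quit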

Configure ZooKeeper

Install ZooKeeper

Cloudera Manager distributes ZooKeeper as part of CDH. During installation, you select the hosts that will run the ZooKeeper Server role. It is generally best to install ZooKeeper servers in odd-numbered groups, typically three.

  • ZooKeeper Server (due to the CPU requirements of the ZooKeeper Server, only a few services should share the host with this role). Make sure a ZooKeeper Server is not co-located with an HBase RegionServer, an HDFS DataNode, or a YARN ResourceManager. I’ve had luck co-locating the ZooKeeper Server with an HDFS JournalNode.

ZooKeeper Configuration

Configure ZooKeeper with the following settings:

Data Directory (dataDir)
Description: The disk location that ZooKeeper will use to store its database snapshots.
Value: /space1/zookeeper
Calculation: Not on root.

Transaction Log Directory (dataLogDir)
Description: The disk location that ZooKeeper will use to store its transaction logs.
Value: /space1/zookeeper or /mnt/ramdisk1
Calculation: Not on root. Set the Transaction Log Directory on a ramdisk for better performance; see my notes on ramdisk below.

Maximum Client Connections (maxClientCnxns)
Description: The maximum number of concurrent connections (at the socket level) that a single client, identified by IP address, may make to a single member of the ZooKeeper ensemble. This setting is used to prevent certain classes of DoS attacks, including file descriptor exhaustion. To remove the limit on concurrent connections, set this value to 0.
Value: 600
Calculation: How many clients will connect?

Maximum Session Timeout (maxSessionTimeout)
Description: The maximum session timeout, in milliseconds, that the ZooKeeper Server will allow the client to negotiate.
Value: 60000
Calculation: In Azure we realized that connections were timing out too frequently and increased the timeout to compensate.

ZooKeeper Canary Connection Timeout
Description: The timeout, in seconds, used by the canary to establish a connection with the ZooKeeper servers.
Value: 15 (default is 10)
Calculation: At Walmart the connection timeout was too low and we were seeing many errors. In Azure we also saw slow connections. High memory usage on a host makes connection attempts to ZooKeeper slow, which is an early indication of a problem.
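For reference, here is a minimal hand-written zoo.cfg sketch with these values. Cloudera Manager generates this file for you, so treat this as illustration only; the canary timeout is a Cloudera Manager monitoring setting and does not appear in zoo.cfg. Note that inline comments are not allowed in this properties file.

# default tick length in milliseconds
tickTime=2000
# database snapshots
dataDir=/space1/zookeeper
# transaction logs, on ramdisk for performance
dataLogDir=/mnt/ramdisk1
# per-IP connection limit
maxClientCnxns=600
# maximum negotiable session timeout, in milliseconds
maxSessionTimeout=60000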

zookeeper-client

Connect to ZooKeeper with the zookeeper-client command:

zookeeper-client -server <zookeeper-alias>:2181
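A typical session looks like this (the hostname is illustrative, and the znodes listed will vary; a CDH cluster running HBase and Hive usually shows namespaces like these):

$ zookeeper-client -server zk01.example.com:2181
[zk: zk01.example.com:2181(CONNECTED) 0] ls /
[zookeeper, hbase, hive_zookeeper_namespace_hive]
[zk: zk01.example.com:2181(CONNECTED) 1] quit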

Create RAMDISK for ZooKeeper

Create the directory to be mounted

sudo mkdir -p /mnt/ramdisk1

Change permissions on the directory

sudo chmod -R 775 /mnt/ramdisk1
sudo chown -R zookeeper:zookeeper /mnt/ramdisk1

The tmpfs filesystem is a RAMDISK. Mount the tmpfs filesystem

sudo mount -t tmpfs -o size=8192M tmpfs /mnt/ramdisk1

To make the ramdisk permanently available, add it to /etc/fstab.

sudo grep /mnt/ramdisk1 /etc/mtab | sudo tee -a /etc/fstab
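The appended line should look something like the following (the size reflects the mount command above; the exact options depend on what mtab recorded):

tmpfs /mnt/ramdisk1 tmpfs rw,relatime,size=8388608k 0 0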

Troubleshooting

ZooKeeper Canary test failed to establish a connection or a client session to the ZooKeeper service – Too many connections

I once noticed that my Hive commands were failing. I looked at ZooKeeper and saw the error “Canary test failed to establish a connection or a client session to the ZooKeeper service”. I looked in the ZooKeeper service log and saw the following:

1:35:11.059 PM WARN   org.apache.zookeeper.server.NIOServerCnxnFactory

Too many connections from /192.168.210.253 - max is 60

1:35:11.059 PM WARN   org.apache.zookeeper.server.NIOServerCnxnFactory

Too many connections from /192.168.210.253 - max is 60

1:35:11.063 PM WARN   org.apache.zookeeper.server.NIOServerCnxnFactory

Too many connections from /192.168.210.253 - max is 60

Cause: I was running commands through Hive that were not disconnecting from ZooKeeper. There is a bug in Hue that causes this issue, especially with the command ALTER TABLE ADD PARTITION.

Resolution: Restarting Hive might work; otherwise restart the Hue service. You might also need to increase the maximum number of connections (see Maximum Client Connections above).
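Before restarting anything, the cons and stat four-letter words are handy for seeing which hosts are holding connections (run against any ensemble member; the hostname is illustrative and nc must be installed):

echo cons | nc zk01.example.com 2181    # one line per open connection, grouped by client IP
echo stat | nc zk01.example.com 2181    # summary, including the total connection count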

References:
http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/hiveserver2-fails-to-release-zookeeper-connection-in-CDH5-0-1/td-p/13598
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/WSt7nB84Lm0

Cannot Start ZooKeeper: Last Transaction Was Partial

The disk was full; after recovering disk space, you receive an error that ZooKeeper cannot start, with the exception “Last transaction was partial”. You must delete the snapshot file from the /var/lib/zookeeper/version-2/ data folder.

Log file: /var/log/zookeeper/zookeeper-cmf-zookeeper1-SERVER-servername01.log

2014-06-11 16:02:59,874 INFO org.apache.zookeeper.server.persistence.FileSnap: Reading snapshot /var/lib/zookeeper/version-2/snapshot.a59a5

2014-06-11 16:02:59,982 ERROR org.apache.zookeeper.server.persistence.Util: Last transaction was partial.

2014-06-11 16:02:59,983 ERROR org.apache.zookeeper.server.ZooKeeperServerMain: Unexpected exception, exiting abnormally

java.io.EOFException

at java.io.DataInputStream.readInt(DataInputStream.java:375)

at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)

at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)

2014-06-11 16:08:14,298 INFO org.apache.zookeeper.server.quorum.QuorumPeerConfig: Reading configuration from: /run/cloudera-scm-agent/process/110-zookeeper-server/zoo.cfg
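A hedged sketch of the cleanup on the affected server, using the snapshot name from the log above; I move the file aside instead of deleting it so the change can be rolled back:

# Stop the ZooKeeper server role on this host first (e.g., in Cloudera Manager)
sudo mv /var/lib/zookeeper/version-2/snapshot.a59a5 /var/tmp/
# Restart the role; the server replays its remaining logs or resyncs from the ensemble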

ZooKeeper SessionExpired Events: HBase RegionServer service stops

Error:

WARN org.apache.hadoop.hbase.util.Sleeper

We slept 247488ms instead of 3000ms, this is likely due to a long garbage collecting pause and it’s usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

FATAL org.apache.hadoop.hbase.regionserver.HRegionServer

ABORTING region server servername32,60020,1378856731842: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing servername32,60020,1378856731842 as dead server

at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:254)

at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:172)

at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1010)

Resolution:

The Master or RegionServers shut down with messages like the following in the logs:

WARN org.apache.zookeeper.ClientCnxn: Exception

closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec

java.io.IOException: TIMED OUT

at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)

WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000

INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT

INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT]

INFO org.apache.zookeeper.ClientCnxn: Server connection successful

WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e

java.io.IOException: Session Expired

at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)

at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)

at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired

The JVM is doing a long-running garbage collection, which pauses every thread (aka “stop the world”). Since the RegionServer’s local ZooKeeper client cannot send heartbeats during the pause, the session times out. By design, we shut down any node that isn’t able to contact the ZooKeeper ensemble after a timeout, so that it stops serving data that may already be assigned elsewhere.

  • Make sure you give HBase plenty of RAM (in hbase-env.sh); the default of 1 GB won’t be able to sustain long-running imports. A sketch follows this list.
  • Make sure you don’t swap; the JVM never behaves well under swapping.
  • Make sure you are not CPU-starving the RegionServer thread. For example, if you are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses.
  • Increase the ZooKeeper session timeout.
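For the RAM bullet above, a minimal hbase-env.sh sketch; the heap value is illustrative and should be sized to your hardware, and the GC logging flags help confirm whether pauses line up with the session expirations:

# in hbase-env.sh
export HBASE_HEAPSIZE=8192   # heap in MB; 8 GB is illustrative (the default is 1000)
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"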

If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds.

<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>

Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least that amount of time to be transferred to another RegionServer. For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (and hence have less garbage to collect per machine).

If this is happening during an upload which only happens once (like initially loading all your data into HBase), consider bulk loading.

See Section 13.11.2, “ZooKeeper, The Cluster Canary” for other general information about ZooKeeper troubleshooting.