{"id":1130,"date":"2016-05-11T14:29:14","date_gmt":"2016-05-11T22:29:14","guid":{"rendered":"http:\/\/www.developerscloset.com\/?page_id=1130"},"modified":"2018-06-12T13:24:58","modified_gmt":"2018-06-12T21:24:58","slug":"zookeeper","status":"publish","type":"page","link":"https:\/\/www.developerscloset.com\/?page_id=1130","title":{"rendered":"ZooKeeper"},"content":{"rendered":"<p><a href=\"http:\/\/www.developerscloset.com\/wp-content\/uploads\/2018\/05\/zookeeper-image.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1131 alignnone\" src=\"http:\/\/www.developerscloset.com\/wp-content\/uploads\/2018\/05\/zookeeper-image-251x300.png\" alt=\"\" width=\"159\" height=\"190\" srcset=\"https:\/\/www.developerscloset.com\/wp-content\/uploads\/2018\/05\/zookeeper-image-251x300.png 251w, https:\/\/www.developerscloset.com\/wp-content\/uploads\/2018\/05\/zookeeper-image.png 301w\" sizes=\"auto, (max-width: 159px) 100vw, 159px\" \/><\/a><\/p>\n<p>ZooKeeper is a centralized service for maintaining and synchronizing configuration data. ZooKeeper is used, for example, to set a distributed lock on incoming data that is then ingested by a client. 
ZooKeeper requires a majority (quorum) of its servers to be available, so ensembles use an odd number of servers for redundancy: an ensemble of 2n+1 servers tolerates n failures. It is best to run a minimum of three ZooKeeper servers.<\/p>\n<h1><span class=\"ez-toc-section\" id=\"About_ZooKeeper\"><\/span>About ZooKeeper<span class=\"ez-toc-section-end\"><\/span><\/h1>\n<p>In distributed application engineering, the word <em>node<\/em> can refer to a generic host machine, a server, a member of an ensemble, a client process, etc. In the ZooKeeper documentation, <em>znodes<\/em> are the data nodes; <em>servers<\/em> are the machines that make up the ZooKeeper service; <em>quorum peers<\/em> are the servers that make up an ensemble; and a <em>client<\/em> is any host or process that uses the ZooKeeper service.<\/p>\n<p>Znodes have three characteristics:<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Watches\"><\/span>Watches<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Clients can set watches on znodes. A change to a watched znode triggers the watch and then clears it. When a watch triggers, ZooKeeper sends the client a notification. More information about watches can be found in the section <a href=\"http:\/\/zookeeper.apache.org\/doc\/r3.1.2\/zookeeperProgrammers.html#ch_zkWatches\">ZooKeeper Watches<\/a>. 
For example, HDFS HA sets a watcher on startup.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Data_Access\"><\/span>Data Access<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode, and a write replaces all the data. Each znode has an Access Control List (ACL) that restricts who can do what. The default znode type is <em>persistent<\/em>: persistent znodes are always present and contain the important configuration details. When a new node is added to the ensemble, it reads its configuration information from the persistent znodes.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Ephemeral_Nodes\"><\/span>Ephemeral Nodes<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active; when the session ends, the znode is deleted. Because of this behavior, ephemeral znodes are not allowed to have children. Ephemeral nodes are mainly useful for keeping watch on client applications in case of failure: when the application fails, its znode dies with the session.<\/p>\n<h1><span class=\"ez-toc-section\" id=\"Configure_ZooKeeper\"><\/span>Configure ZooKeeper<span class=\"ez-toc-section-end\"><\/span><\/h1>\n<h2><span class=\"ez-toc-section\" id=\"Install_ZooKeeper\"><\/span>Install ZooKeeper<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Cloudera Manager distributes ZooKeeper in CDH. On install, you will select the nodes required to run the ZooKeeper Server. 
Generally it is best to install ZooKeeper servers in groups of three.<\/p>\n<ul>\n<li><strong>ZooKeeper Server<\/strong> &#8211; due to the CPU requirements of the ZooKeeper Server, only a few services should share its host. Make sure a ZooKeeper Server is not co-located with an HBase RegionServer, HDFS DataNode, or YARN ResourceManager; I&#8217;ve had luck co-locating the ZooKeeper Server with an HDFS JournalNode.<\/li>\n<\/ul>\n<h2 id=\"ZooKeeper-ZooKeeperConfiguration\"><span class=\"ez-toc-section\" id=\"ZooKeeper_Configuration\"><\/span>ZooKeeper Configuration<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Configure ZooKeeper with the following settings:<\/p>\n<div style=\"max-width: 100%;margin: auto;overflow: hidden\">\n<div style=\"width: 100%;overflow: auto\">\n<table>\n<tbody>\n<tr>\n<td class=\"confluenceTd\"><strong>Configuration<\/strong><\/td>\n<td class=\"confluenceTd\"><strong>Description<\/strong><\/td>\n<td class=\"confluenceTd\"><strong>Value<\/strong><\/td>\n<td class=\"confluenceTd\"><strong>Calculation<\/strong><\/td>\n<\/tr>\n<tr>\n<td class=\"confluenceTd\" colspan=\"1\">Data Directory<\/td>\n<td class=\"confluenceTd\" colspan=\"1\">The disk location that ZooKeeper will use to store its database snapshots.<\/td>\n<td class=\"confluenceTd\" colspan=\"1\">\/space1\/zookeeper<\/td>\n<td class=\"confluenceTd\" colspan=\"1\">Not on root.<\/td>\n<\/tr>\n<tr>\n<td class=\"confluenceTd\" colspan=\"1\">Transaction Log Directory<\/td>\n<td class=\"confluenceTd\" colspan=\"1\">The disk location that ZooKeeper will use to store its transaction logs.<\/td>\n<td class=\"confluenceTd\" colspan=\"1\">\/space1\/zookeeper<br \/>or<br \/>\/mnt\/ramdisk1<\/td>\n<td class=\"confluenceTd\" colspan=\"1\">Not on root.<br \/>Set the Transaction Log Directory on a ramdisk for better performance; see my notes on ramdisk below.<\/td>\n<\/tr>\n<tr>\n<td class=\"confluenceTd\"><strong>Maximum Client 
Connections<\/strong><\/td>\n<td class=\"confluenceTd\">The maximum number of concurrent connections (at the socket level) that a single client, identified by IP address, may make to a single member of the ZooKeeper ensemble. This setting is used to prevent certain classes of DoS attacks, including file descriptor exhaustion. To remove the limit on concurrent connections, set this value to 0.<\/td>\n<td class=\"confluenceTd\"><u>600<\/u><\/td>\n<td class=\"confluenceTd\">How many clients will connect?<\/td>\n<\/tr>\n<tr>\n<td class=\"confluenceTd\"><strong>Maximum Session Timeout<\/strong><br \/>maxSessionTimeout<\/td>\n<td class=\"confluenceTd\">The maximum session timeout, in milliseconds, that the ZooKeeper Server will allow the client to negotiate.<\/td>\n<td class=\"confluenceTd\">60000<\/td>\n<td class=\"confluenceTd\">In Azure we realized that connections were timing out too frequently and increased the timeout to compensate.<\/td>\n<\/tr>\n<tr>\n<td class=\"confluenceTd\" colspan=\"1\">ZooKeeper Canary Connection Timeout<\/td>\n<td class=\"confluenceTd\" colspan=\"1\">Configures the timeout used by the canary for connection establishment with ZooKeeper servers.<\/td>\n<td class=\"confluenceTd\" colspan=\"1\"><u>15<\/u><\/td>\n<td class=\"confluenceTd\" colspan=\"1\">Default is 10 seconds. At Walmart we realized that the connection timeout was too low and we were seeing many errors. In Azure we also saw slow connections. 
We realized that high memory usage causes slow connection attempts to ZooKeeper &#8211; this is an early indication of a problem.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<h2><span class=\"ez-toc-section\" id=\"zookeeper-client\"><\/span>zookeeper-client<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Connect with the zookeeper-client using the following command:<\/p>\n<p>zookeeper-client -server &lt;zookeeper-alias&gt;:2181<\/p>\n<h2 id=\"ZooKeeper-CreateRAMDISKforZookeeper\"><span class=\"ez-toc-section\" id=\"Create_RAMDISK_for_ZooKeeper\"><\/span>Create RAMDISK for ZooKeeper<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Create the directory to be mounted:<\/p>\n<p>sudo mkdir -p \/mnt\/ramdisk1<\/p>\n<p>Change permissions on the directory:<\/p>\n<p>sudo chmod -R 775 \/mnt\/ramdisk1<br \/>\nsudo chown -R zookeeper:zookeeper \/mnt\/ramdisk1<\/p>\n<p>The tmpfs filesystem is a RAM disk. Mount the tmpfs filesystem:<\/p>\n<p>sudo mount -t tmpfs -o size=8192M tmpfs \/mnt\/ramdisk1<\/p>\n<p>To make the ramdisk permanently available, add it to \/etc\/fstab:<\/p>\n<p>sudo grep \/mnt\/ramdisk1 \/etc\/mtab | sudo tee -a \/etc\/fstab<\/p>\n<h1 id=\"ZooKeeper-Troubleshooting\"><span class=\"ez-toc-section\" id=\"Troubleshooting\"><\/span>Troubleshooting<span class=\"ez-toc-section-end\"><\/span><\/h1>\n<h2 id=\"ZooKeeper-ZooKeeperCanarytestfailedtoestablishaconnectionoraclientsessiontotheZooKeeperservice-Toomanyconnections\"><span class=\"ez-toc-section\" id=\"ZooKeeper_Canary_test_failed_to_establish_a_connection_or_a_client_session_to_the_ZooKeeper_service_%E2%80%93_Too_many_connections\"><\/span>ZooKeeper Canary test failed to establish a connection or a client session to the ZooKeeper service &#8211; Too many connections<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>I once noticed that my Hive 
commands were failing. I looked at ZooKeeper and saw the &#8220;Canary test failed to establish a connection or a client session to the ZooKeeper service&#8221; error. I looked in the ZooKeeper service log and saw the following:<\/p>\n<p>1:35:11.059 PM WARN\u00a0\u00a0 org.apache.zookeeper.server.NIOServerCnxnFactory<\/p>\n<p>Too many connections from \/192.168.210.253 &#8211; max is 60<\/p>\n<p>1:35:11.059 PM WARN\u00a0\u00a0 org.apache.zookeeper.server.NIOServerCnxnFactory<\/p>\n<p>Too many connections from \/192.168.210.253 &#8211; max is 60<\/p>\n<p>1:35:11.063 PM WARN\u00a0\u00a0 org.apache.zookeeper.server.NIOServerCnxnFactory<\/p>\n<p>Too many connections from \/192.168.210.253 &#8211; max is 60<\/p>\n<p>Cause: I was running commands through Hive that were not disconnecting from ZooKeeper. There is a bug in Hue that causes this issue, especially with the command ALTER TABLE ADD PARTITION.<\/p>\n<p>Resolution: Restarting Hive might work; otherwise, restart the Hue service. You might also have to increase the maximum number of connections.<\/p>\n<p>References:\u00a0<a class=\"external-link\" href=\"http:\/\/community.cloudera.com\/t5\/Batch-SQL-Apache-Hive\/hiveserver2-fails-to-release-zookeeper-connection-in-CDH5-0-1\/td-p\/13598\" rel=\"nofollow\">http:\/\/community.cloudera.com\/t5\/Batch-SQL-Apache-Hive\/hiveserver2-fails-to-release-zookeeper-connection-in-CDH5-0-1\/td-p\/13598<\/a>,\u00a0<a class=\"external-link\" href=\"https:\/\/groups.google.com\/a\/cloudera.org\/forum\/#!topic\/cdh-user\/WSt7nB84Lm0\" rel=\"nofollow\">https:\/\/groups.google.com\/a\/cloudera.org\/forum\/#!topic\/cdh-user\/WSt7nB84Lm0<\/a><\/p>\n<h2 id=\"ZooKeeper-ZooKeeper:CannotStartZooKeeper:LastTransactionWasPartial\"><span class=\"ez-toc-section\" id=\"ZooKeeper_Cannot_Start_ZooKeeper_Last_Transaction_Was_Partial\"><\/span>ZooKeeper: Cannot Start ZooKeeper: Last Transaction Was Partial<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The disk was full, and after recovering you receive an error 
that you cannot start ZooKeeper with the exception &#8220;Last transaction was partial&#8221;. You must delete the snapshot file from the \/var\/lib\/zookeeper\/version-2\/ data folder.<\/p>\n<p>Log file: \/var\/log\/zookeeper\/zookeeper-cmf-zookeeper1-SERVER-servername01.log<\/p>\n<p>2014-06-11 16:02:59,874 INFO org.apache.zookeeper.server.persistence.FileSnap: Reading snapshot \/var\/lib\/zookeeper\/version-2\/snapshot.a59a5<\/p>\n<p>2014-06-11 16:02:59,982 ERROR org.apache.zookeeper.server.persistence.Util: Last transaction was partial.<\/p>\n<p>2014-06-11 16:02:59,983 ERROR org.apache.zookeeper.server.ZooKeeperServerMain: Unexpected exception, exiting abnormally<\/p>\n<p>java.io.EOFException<\/p>\n<p>at java.io.DataInputStream.readInt(DataInputStream.java:375)<\/p>\n<p>at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)<\/p>\n<p>at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)<\/p>\n<p>&#8230;<\/p>\n<p>2014-06-11 16:08:14,298 INFO org.apache.zookeeper.server.quorum.QuorumPeerConfig: Reading configuration from: \/run\/cloudera-scm-agent\/process\/110-zookeeper-server\/zoo.cfg<\/p>\n<h2 id=\"ZooKeeper-ZooKeeperSessionExpiredEvents:HBaseRegionServerservicestops\"><span class=\"ez-toc-section\" id=\"ZooKeeper_SessionExpired_Events_HBase_RegionServer_service_stops\"><\/span>ZooKeeper SessionExpired Events: HBase RegionServer service stops<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>Error:<\/strong><\/p>\n<p>WARN org.apache.hadoop.hbase.util.Sleeper<\/p>\n<p>We slept 247488ms instead of 3000ms, this is likely due to a long garbage collecting pause and it&#8217;s usually bad, see\u00a0<a class=\"external-link\" href=\"http:\/\/hbase.apache.org\/book.html#trouble.rs.runtime.zkexpired\" rel=\"nofollow\">http:\/\/hbase.apache.org\/book.html#trouble.rs.runtime.zkexpired<\/a><\/p>\n<p>FATAL org.apache.hadoop.hbase.regionserver.HRegionServer<\/p>\n<p>ABORTING region server servername32,60020,1378856731842: 
Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing servername32,60020,1378856731842 as dead server<\/p>\n<p>at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:254)<\/p>\n<p>at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:172)<\/p>\n<p>at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1010)<\/p>\n<p>&#8230;<\/p>\n<p><strong>Resolution:<\/strong><\/p>\n<p>The Master or RegionServers shut down with messages like these in the logs:<\/p>\n<p>WARN org.apache.zookeeper.ClientCnxn: Exception<\/p>\n<p>closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec<\/p>\n<p>java.io.IOException: TIMED OUT<\/p>\n<p>at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)<\/p>\n<p>WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000<\/p>\n<p>INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname\/IP:PORT<\/p>\n<p>INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=\/IP:PORT remote=hostname\/IP:PORT]<\/p>\n<p>INFO org.apache.zookeeper.ClientCnxn: Server connection successful<\/p>\n<p>WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to\u00a0sun.nio.ch.SelectionKeyImpl@3544d65e<\/p>\n<p>java.io.IOException: Session Expired<\/p>\n<p>at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)<\/p>\n<p>at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)<\/p>\n<p>at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)<\/p>\n<p>ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired<\/p>\n<p>The JVM is doing a long-running garbage collection, which pauses every thread (aka &#8220;stop the world&#8221;). 
Since the RegionServer&#8217;s local ZooKeeper client cannot send heartbeats, the session times out. By design, we shut down any node that isn&#8217;t able to contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that may already be assigned elsewhere.<\/p>\n<ul>\n<li>Make sure you give the RegionServer plenty of RAM (in hbase-env.sh); the default of 1GB won&#8217;t be able to sustain long-running imports.<\/li>\n<li>Make sure you don&#8217;t swap; the JVM never behaves well under swapping.<\/li>\n<li>Make sure you are not CPU-starving the RegionServer thread. For example, if you are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses.<\/li>\n<li>Increase the ZooKeeper session timeout.<\/li>\n<\/ul>\n<p>If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds. (ZooKeeper caps the negotiated session timeout at 20 times the tickTime, so although zookeeper.session.timeout requests 1200000 ms, the tickTime of 6000 ms yields an effective maximum of 20 &#215; 6000 = 120000 ms, i.e. 120 seconds.)<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n&lt;property&gt;\r\n  &lt;name&gt;zookeeper.session.timeout&lt;\/name&gt;\r\n  &lt;value&gt;1200000&lt;\/value&gt;\r\n&lt;\/property&gt;\r\n&lt;property&gt;\r\n  &lt;name&gt;hbase.zookeeper.property.tickTime&lt;\/name&gt;\r\n  &lt;value&gt;6000&lt;\/value&gt;\r\n&lt;\/property&gt;\r\n<\/pre>\n<p>Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least that amount of time to be transferred to another RegionServer. 
For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (hence less garbage to collect per machine).<\/p>\n<p>If this is happening during an upload that only happens once (like initially loading all your data into HBase), consider bulk loading.<\/p>\n<p>See\u00a0<a class=\"external-link\" href=\"http:\/\/hbase.apache.org\/book.html#trouble.zookeeper.general\" rel=\"nofollow\">Section\u00a013.11.2, \u201cZooKeeper, The Cluster Canary\u201d<\/a>\u00a0for other general information about ZooKeeper troubleshooting.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>ZooKeeper is a centralized service for maintaining and synchronizing configuration data. ZooKeeper is used to set a distributed lock on incoming data that is then [&#8230;]<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"jetpack_post_was_ever_published":false,"footnotes":""},"class_list":["post-1130","page","type-page","status-publish","hentry"],"jetpack_shortlink":"https:\/\/wp.me\/P1BQ8S-ie","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.developerscloset.com\/index.php?rest_route=\/wp\/v2\/pages\/1130","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.developerscloset.com\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.developerscloset.com\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.developerscloset.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.developerscloset.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1130"}],"version-history":[{"count":5,"href":"https:\/\/www.developerscloset.com\/index.php?rest_route=\/wp\/v2\/pages\/1130\/revisions"}],"predecessor-version":[{"id":1415,"hre
f":"https:\/\/www.developerscloset.com\/index.php?rest_route=\/wp\/v2\/pages\/1130\/revisions\/1415"}],"wp:attachment":[{"href":"https:\/\/www.developerscloset.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1130"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}