HDFS

Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute hosts throughout a cluster to enable reliable, extremely rapid computations.

Configure HDFS

Install HDFS

HDFS is distributed by CDH and requires the following services:

  • NameNode – The NameNode contains a global map of where the data is stored; when DataNodes start, they report the list of blocks they are hosting to the NameNode. Do not place the NameNode on the same server as the MapReduce JobTracker or YARN ResourceManager, because the memory needed to run these services is high.
  • SecondaryNameNode (not on the same host as the NameNode, and not required if HDFS HA is configured). The SecondaryNameNode contains information necessary to rebuild the NameNode, but it is NOT a backup NameNode – it only performs periodic checkpoints of the namespace and cannot take over for a failed NameNode.
  • JournalNode – Used for HDFS HA – there must be at least three JournalNode daemons, since edit log modifications must be written to a majority of JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons can reasonably be collocated on machines with other Hadoop services. You can also run more than three JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JournalNodes.
  • Failover Controller – Used for HDFS HA – there is one Failover Controller for each NameNode, and each FC MUST be placed on the same host as its NameNode. The ZKFC uses the ZooKeeper service for coordination in determining which is the Active NameNode and when to fail over to the Standby NameNode.
  • Balancer – required service (it is OK to have the Balancer on the same host as the NameNode). The Balancer is a service that redistributes blocks among DataNodes to keep storage utilization even across the cluster. It is helpful to run the Balancer after a large amount of data has been deleted or if a new DataNode has been added.
  • DataNodes – The DataNode can share a host with the HBase RegionServer and MapReduce TaskTracker or YARN NodeManager. For data locality, colocate the HBase RegionServer with a DataNode.
  • Gateway – The Gateway stores configuration information about HDFS, including the network topology. Install a Gateway on all APP servers.
  • HttpFS – HttpFS is a service that provides HTTP access to HDFS. HttpFS has a REST HTTP API supporting all HDFS File System operations (both read and write).
  • NFS Gateway – The NFS Gateway allows HDFS to be mounted as a local disk. The NFS Gateway server can be any host in the cluster, including the NameNode, a DataNode, or any HDFS client. The client can be any NFSv3-client-compatible machine (NOT supported on an Ubuntu client).

HDFS Configuration

The settings below list the recommended value for Small (8 GB memory), Medium (16 GB memory), and Large (28 GB memory) hosts, followed by the reasoning (Calculation) behind each value.

DataNode Data Directory (dfs.data.dir, dfs.datanode.data.dir)
Comma-delimited list of directories on the local file system where the DataNode stores HDFS block data.
Warning: be very careful when modifying this property. Removing or changing entries can result in data loss. If you want to hot swap drives, override the value of this property for the specific DataNode role instance whose drive is to be hot-swapped; do not modify the property value in the role group.
Small: /space1/dfs/dn | Medium: /space1/dfs/dn, /space2/dfs/dn | Large: /space1/dfs/dn, /space2/dfs/dn
Calculation: Typical values are /spaceN/dfs/dn for N = 1, 2, 3…

NameNode Data Directories (dfs.name.dir, dfs.namenode.name.dir)
Determines where on the local file system the NameNode should store the name table (fsimage). For redundancy, enter a comma-delimited list of directories to replicate the name table in all of the directories.
Small: /space1/dfs/nn | Medium: /space1/dfs/nn, /space2/dfs/nn | Large: /space1/dfs/nn, /space2/dfs/nn
Calculation: Typical values are /spaceN/dfs/nn for N = 1, 2, 3…

HDFS Checkpoint Directories (fs.checkpoint.dir, dfs.namenode.checkpoint.dir)
Determines where on the local file system the SecondaryNameNode should store the temporary images to merge. For redundancy, enter a comma-delimited list of directories to replicate the image in all of the directories.
Small: /space1/dfs/snn | Medium: /space1/dfs/snn | Large: /space1/dfs/snn
Calculation: Typical values are /spaceN/dfs/snn for N = 1, 2, 3…

Java Heap Size of NameNode in Bytes
Maximum size in bytes for the Java process heap memory. Passed to Java -Xmx.
Small: 1 GB | Medium: 1 GB | Large: 4 GB
Calculation: Plan for roughly 1 GB of NameNode heap per 1,000,000 blocks, or on really slow disk roughly 1 GB per 100,000 blocks. In cases of slow disk (Azure), bump this up to compensate.

Java Heap Size of DataNode in Bytes
Maximum size in bytes for the Java process heap memory. Passed to Java -Xmx.
Small: 768 MB | Medium: 1 GB | Large: 2 GB
Calculation: I have not seen a need for a heap larger than 1 GB for 500,000 blocks.

Java Heap Size of Secondary NameNode in Bytes
Maximum size in bytes for the Java process heap memory. Passed to Java -Xmx.
Small: 1 GB | Medium: 1 GB | Large: 4 GB
Calculation: On single-node developer servers the SecondaryNameNode memory can be reduced. On larger clusters keep the memory high enough to manage the edits (half the NameNode heap). Not used in HDFS HA.
Reserved Space for Non DFS Use
Reserved space in bytes per volume for non-Distributed File System (DFS) use.
Small: 10 GB (enough for a single-node VM) | Medium: 100 GB | Large: 100 GB
Calculation: Each DataNode that also runs a MapReduce TaskTracker or YARN NodeManager must have the Reserved Space for Non DFS Use increased; the MR services write too much local data to keep the default.
DataNode Failed Volumes Tolerated (dfs.datanode.failed.volumes.tolerated)
The number of volumes that are allowed to fail before a DataNode stops offering service. By default, any volume failure will cause a DataNode to shut down.
Small: 0 | Medium: 0 | Large: 0
Calculation: Protect HDFS from failed volumes (or what HDFS incorrectly assumes is a failed volume, like Azure shutting down a VM by first shutting down the volumes).

DataNode Volume Choosing Policy (dfs.datanode.fsdataset.volume.choosing.policy)
DataNode policy for picking which volume should get a new block.
Small: Available Space | Medium: Available Space | Large: Available Space
Calculation: By default a DataNode writes new block replicas to disk volumes solely on a round-robin basis. Change this to the Available Space volume-choosing policy, which causes the DataNode to take into account how much space is available on each volume when deciding where to place a new replica.

DataNode Data Directory Permissions (dfs.datanode.data.dir.perm)
Permissions for the directories on the local file system where the DataNode stores its blocks. The permissions must be octal; 755 and 700 are typical values.
Small: 755 | Medium: 755 | Large: 755
Calculation: We manage permissions by group so our service accounts can access the data.

Default Umask (dfs.umaskmode, fs.permissions.umask-mode)
Default umask for file and directory creation, specified as an octal value (with a leading 0). The default is 022.
Small: 002 | Medium: 002 | Large: 002
Calculation: We manage permissions by group, which makes the umask important. We set the umask to 002 (instead of the default 022), which allows users of the group to read and write.

Automatically Restart Process (DataNode, Failover Controller, HttpFS, JournalNode, NameNode)
When set, this role’s process is automatically (and transparently) restarted in the event of an unexpected failure.
Small: N/A | Medium: N/A | Large: Enabled
Calculation: If HA is enabled, the Failover Controller should be set to restart on failure. This service intermittently dies on Azure (it is a victim of low heap on clusters with low memory).

HttpFS Proxy User Groups (hadoop.proxyuser.httpfs.groups)
Comma-delimited list of groups to allow the HttpFS user to impersonate. The default ‘*’ allows all groups. To disable entirely, use a string that does not correspond to a group name, such as ‘_no_group_’.
All sizes: linux-dsiq-webhdfs, domain^users, hue
Calculation: We use this configuration to lock down read/write permissions on the HttpFS service.
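A quick way to spot-check what a host actually has in effect for these properties is hdfs getconf (the key names are the ones listed above; run this on the host whose configuration you want to verify):

hdfs getconf -confKey dfs.datanode.data.dir
hdfs getconf -confKey dfs.namenode.name.dir
hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated
hdfs getconf -confKey fs.permissions.umask-mode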

Administer HDFS

Start the DataNode locally (no SSH):

/opt/hadoop/hadoop-#/sbin/hadoop-daemon.sh --config ~/opt/hadoop/etc/hadoop/ start datanode

Configure HDFS High Availability

CM 5 and CDH 5 support Quorum-based Storage as the only HDFS HA implementation. Quorum-based Storage refers to the HA implementation that uses a Quorum Journal Manager (QJM).

In order for the Standby NameNode to keep its state synchronized with the Active NameNode in this implementation, both nodes communicate with a group of separate daemons called JournalNodes. When any namespace modification is performed by the Active NameNode, it durably logs a record of the modification to a majority of these JournalNodes. The Standby NameNode is capable of reading the edits from the JournalNodes, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

In order to provide a fast failover, it is also necessary that the Standby NameNode has up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and they send block location information and heartbeats to both.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active NameNode to safely proceed with failover.

In order to deploy an HA cluster using Quorum-based Storage, you should prepare the following:

  • NameNode machines – the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.
  • JournalNode machines – the machines on which you run the JournalNodes.
  • The JournalNode daemon is relatively lightweight, so these daemons can reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager.
  • Cloudera recommends that you deploy the JournalNode daemons on the “master” host or hosts (NameNode, Standby NameNode, JobTracker, etc.) so the JournalNodes’ local directories can use the reliable local storage on those machines. You should not use SAN or NAS storage for these directories.
  • There must be at least three JournalNode daemons, since edit log modifications must be written to a majority of JournalNodes. This will allow the system to tolerate the failure of a single machine. You can also run more than three JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JournalNodes, (three, five, seven, etc.) Note that when running with N JournalNodes, the system can tolerate at most (N – 1) / 2 failures and continue to function normally. If the requisite quorum is not available, the NameNode will not format or start, and you will see an error similar to this:

12/10/01 17:34:18 WARN namenode.FSEditLog: Unable to determine input streams from QJM to [10.0.1.10:8485, 10.0.1.10:8486, 10.0.1.10:8487]. Skipping.

java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.

Note: In an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. If you are reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled, you can reuse the hardware which you had previously dedicated to the Secondary NameNode.
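For reference, the core hdfs-site.xml properties behind Quorum-based Storage look roughly like the sketch below. Cloudera Manager generates the real values for you during the Enable High Availability workflow; the nameservice, NameNode IDs, and JournalNode host names here are placeholders only.

<!-- sketch of the QJM-related properties; CM generates the actual values -->
<property>
  <name>dfs.nameservices</name>
  <value>nameservice1</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameservice1</name>
  <value>namenode1,namenode2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn-host1:8485;jn-host2:8485;jn-host3:8485/nameservice1</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.nameservice1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>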

Enabling High Availability and Automatic Failover

In CM, the Enable High Availability workflow leads you through adding a second (standby) NameNode and configuring JournalNodes. During the workflow, Cloudera Manager creates a federated namespace.

  1. Go to the HDFS service.
  2. Click on Instances, stop the Secondary NameNode service. The Secondary NameNode MUST be stopped before continuing or the setup will fail.
  3. Select Actions > Enable High Availability. A screen showing the hosts that are eligible to run a standby NameNode and the JournalNodes displays.
    1. Specify a name for the nameservice or accept the default name nameservice1 and click Continue.
    2. In the NameNode Hosts field, click Select a host. The host selection dialog displays.
    3. Check the checkbox next to the hosts where you want the standby NameNode to be set up and click OK. The standby NameNode cannot be on the same host as the active NameNode, and the host that is chosen should have the same hardware configuration (RAM, disk space, number of cores, and so on) as the active NameNode.
    4. In the JournalNode Hosts field, click Select hosts. The host selection dialog displays.
    5. Check the checkboxes next to an odd number of hosts (a minimum of three) to act as JournalNodes and click OK. JournalNodes should be hosted on hosts with similar hardware specification as the NameNodes. It is recommended that you put a JournalNode each on the same hosts as the active and standby NameNodes, and the third JournalNode on similar hardware, such as the JobTracker.
    6. Failover Controller – make sure these are on the same hosts as the NameNodes. There are two FCs, one for each NN.
    7. Click Continue.
    8. In the JournalNode Edits Directory property, enter a directory location for the JournalNode edits directory into the fields for each JournalNode host (/space1/dfs/jn).
      1. You may enter only one directory for each JournalNode. The paths do not need to be the same on every JournalNode.
      2. The directories you specify should be empty, and must have the appropriate permissions.
  4. Extra Options: Decide whether Cloudera Manager should clear existing data in ZooKeeper, standby NameNode, and JournalNodes. If the directories are not empty (for example, you are re-enabling a previous HA configuration), Cloudera Manager will not automatically delete the contents—you can select to delete the contents by keeping the default checkbox selection. The recommended default is to clear the directories. If you choose not to do so, the data should be in sync across the edits directories of the JournalNodes and should have the same version data as the NameNodes.
  5. Click Continue.

Note: Cloudera Manager executes a set of commands that will stop the dependent services, delete, create, and configure roles and directories as appropriate, create a nameservice and failover controller, and restart the dependent services and deploy the new client configuration.

Another Note: you may see the following warning: The following manual steps must be performed after completing this wizard: For each of the Hive service(s) Hive, stop the Hive service, back up the Hive Metastore Database to a persistent store, run the service command “Update Hive Metastore NameNodes”, then restart the Hive services.

  1. If you want to use Hive, Impala, or Hue in a cluster with HA configured, follow the procedures in Configuring Other CDH Components to Use HDFS HA. The following manual steps must be performed after completing this wizard:
    1. Configure the HDFS Web Interface Role of Hue service(s) to be an HTTPFS role instead of a NameNode. Select the Hue server, click Configuration, and select the HTTPFS role. Click Save Changes, and start the Hue service.
    2. For each of the Hive service(s), stop the Hive service, back up the Hive Metastore Database to a persistent store, run the service command “Update Hive Metastore NameNodes”, then restart the Hive services.
      1. Go to the Hive service.
      2. Select Actions > Stop. Note: You may want to stop the Hue and Impala services first, if present, as they depend on the Hive service.
      3. Click Stop to confirm the command.
      4. Back up the Hive metastore database.
      5. Select Actions > Update Hive Metastore NameNodes and confirm the command.
      6. Select Actions > Start.
      7. Restart the Hue and Impala services if you stopped them prior to updating the metastore.
  2. Configure Oozie to use the HDFS nameservice instead of the URI in the <name-node> element of the workflow:

Example:

<action name="mr-node">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>hdfs://nameservice1

where nameservice1 is the value of dfs.nameservices in hdfs-site.xml.
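You can confirm the nameservice value configured on the cluster with:

hdfs getconf -confKey dfs.nameservices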

Manually Failing Over to the Standby NameNode

If you are running an HDFS service with HA enabled, you can manually cause the active NameNode to failover to the standby NameNode. This is useful for planned downtime—for hardware changes, configuration changes, or software upgrades of your primary host.

Manual failover:

hdfs haadmin -failover nn2 nn1   # fails over from nn2 (currently active) to nn1

Stop services on nn2

Once you’ve made sure that the nn2 node is inactive, you can stop services on that node (in this example, nn2): stop the NameNode, the ZKFC daemon if this is an automatic-failover deployment, and the JournalNode if you are moving it. Proceed as follows.

Stop the NameNode daemon:

$ sudo service hadoop-hdfs-namenode stop

Stop the ZKFC daemon if it is running:

$ sudo service hadoop-hdfs-zkfc stop

Stop the JournalNode daemon if it is running:

$ sudo service hadoop-hdfs-journalnode stop

Make sure these services are not set to restart on boot. If you are not planning to use nn2 as a NameNode again, you may want to remove the services.

In Cloudera Manager:

  1. Go to the HDFS service.
  2. Click the Instances tab.
  3. Select Actions > Manual Failover. (This option does not appear if HA is not enabled for the cluster.)
  4. From the pop-up, select the NameNode that should be made active, then click Manual Failover.
  5. When all the steps have been completed, click Finish.

Cloudera Manager transitions the NameNode you selected to be the active NameNode, and the other NameNode to be the standby NameNode. HDFS should never have two active NameNodes.
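To confirm from the command line which NameNode is currently active (nn1 and nn2 are the NameNode IDs for your nameservice; substitute your own):

sudo -u hdfs hdfs haadmin -getServiceState nn1
sudo -u hdfs hdfs haadmin -getServiceState nn2
# each command prints either "active" or "standby"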

Reference:

Configuring HDFS High Availability

High Availability for the Hadoop Distributed File System (HDFS)

Rename NameNode or Replace NameNode

If you rename the NameNode or replace the NameNode with a new NameNode, you will have to clear the Failover Controller (ZKFC) znodes in ZooKeeper.

Clear Failover Controller ZooKeeper

sudo -u hdfs hdfs zkfc -formatZK

Finally, refresh the DataNodes to pick up the new NameNodes.

# refresh each DataNode so it picks up the new NameNodes (default DataNode IPC port is 50020)
sudo -u hdfs hdfs dfsadmin -refreshNamenodes <datanode-host>:50020

Configure Uber Tasks (MapReduce)

The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of the job’s progress, as it will receive progress and completion reports from the tasks (step 6). Next, it retrieves the input splits computed in the client from the shared filesystem (step 7). It then creates a map task object for each split, and a number of reduce task objects determined by the mapreduce.job.reduces property.

The next thing the application master does is decide how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run them in the same JVM as itself, since it judges the overhead of allocating new containers and running tasks in them as outweighing the gain to be had in running them in parallel, compared to running them sequentially on one node. (This is different to MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or run as an uber task.

What qualifies as a small job? By default, one that has fewer than 10 mappers, only one reducer, and an input size smaller than one HDFS block. (These values may be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) It’s also possible to disable uber tasks entirely (by setting mapreduce.job.ubertask.enable to false).
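For reference, a minimal mapred-site.xml sketch of the properties named above; the values simply mirror the thresholds described in the paragraph (fewer than 10 maps, a single reduce), so adjust them for your own jobs or cluster:

<!-- uber task thresholds; values mirror the description above -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>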

Reference: http://sungsoo.github.io/2013/12/12/yarn-mapreduce2.html

NFS Gateway

NFS Gateway is not supported on Ubuntu – only on RHEL. There is a workaround for Ubuntu, but it is insecure – anyone can access HDFS from a remote machine if this workaround is in place.

After mounting HDFS to his or her local filesystem, a user can:

  • Browse the HDFS file system through the local file system
  • Upload and download files from the HDFS file system to and from the local file system.
  • Stream data directly to HDFS through the mount point.

File append is supported, but random write is not.
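Once the NFS Gateway role is running, the mount itself is a standard NFSv3 mount. A sketch (the gateway host name and local mount point are placeholders):

sudo mkdir -p /mnt/hdfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock <nfs-gateway-host>:/ /mnt/hdfs
ls /mnt/hdfs   # browse HDFS through the local file system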

HDFS Commands

File System Check

sudo -u hdfs hdfs fsck / -files -blocks -locations > ~/fsck-files-08162017.log

Example report:

Status: HEALTHY
 Total size: 1242886634788 B (Total open files size: 540 B)
 Total dirs: 24939
 Total files: 347544
 Total symlinks: 0 (Files currently being written: 9)
 Total blocks (validated): 300935 (avg. block size 4130083 B) (Total open file blocks (not validated): 8)
 Minimally replicated blocks: 300935 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.0217955
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 10
 Number of racks: 1
 FSCK ended at Wed Aug 16 14:39:06 PDT 2017 in 16085 milliseconds
The filesystem under path '/' is HEALTHY

List Corrupt Blocks

sudo -u hdfs hdfs fsck -list-corruptfileblocks > ~/fsck-corruptfileblocks-08162017.log

DataNode Report

sudo -u hdfs hdfs dfsadmin -report > ~/dfsadmin-report-08162017.log

Example report:

Configured Capacity: 68167702077440 (62.00 TB)
Present Capacity: 64158271559713 (58.35 TB)
DFS Remaining: 60359705043522 (54.90 TB)
DFS Used: 3798566516191 (3.45 TB)
DFS Used%: 5.92%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (10):
Name: 10.200.0.12:50010 (servername01)
Hostname: servername01
Rack: /rackname01
Decommission Status : Normal
Configured Capacity: 6816770207744 (6.20 TB)
DFS Used: 393658411357 (366.62 GB)
Non DFS Used: 396947193216 (369.69 GB)
DFS Remaining: 6026164603171 (5.48 TB)
DFS Used%: 5.77%
DFS Remaining%: 88.40%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 0 (0 B)
Cache Remaining: 4294967296 (4 GB)
Cache Used%: 0.00%
Cache Remaining%: 100.00%
Xceivers: 10
Last contact: Wed Aug 16 14:42:17 PDT 2017

Create a File in HDFS using touchz

hadoop fs -ls /user/username/
hadoop fs -touchz /user/username/test-file-10272017-1

Move a File from Windows to HDFS

1. Create a connection using FileZilla.

2. FileZilla moves the file from Windows to /home/username in Linux.

3. Move from Linux to Hadoop:

a. Bring up Cygwin and ssh to servername01

ssh -Y username@servername01

b. Log in with ADM account user name and password

c. I am currently in /home/username, view files in directory:

ls

d. Connect to Hadoop cluster

hadoop fs -ls / # list files at root

hadoop fs -ls /user/username # list a file that I want to upload data to in Hadoop

hadoop fs -copyFromLocal /home/username/20140713_TAB.txt /user/username

# the -f option will overwrite the file if it exists

4. Move from Hadoop back to Linux:

hadoop fs -copyToLocal /user/username/20140713_TAB.txt /home/username/20140713_TAB_a.txt

ls # to prove to myself it is there

5. Move from Linux to Windows – use FileZilla

Reference: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html

List Folder Structure

hadoop fs -ls /asset/

Get a summary of the file sizes

hadoop fs -du -s -h /user/*

hadoop fs -du -s -h /tmp/temp* | grep T # filter for temp* files that are over 1 TB in size

Copy data from local disk to HDFS

hadoop fs -copyFromLocal /tmp/from/linux/testfile.txt /tmp/to/hdfs/folder/

Copy data from HDFS to local

hadoop fs -copyToLocal /tmp/file.txt /tmp/local/folder

Delete Folder Recursively

Note: Be very careful that there are NO SPACES between the server:port name and the file or you will DELETE the entire root of HDFS.

hadoop fs -rm -r /tmp/folder-to-delete-temp/

Set Folder Permissions

sudo -u hdfs hadoop fs -chmod -R 775 /staging/wm/offer_scoring

Create User Home Directory

To create a user’s home directory you’ll have to sudo as hdfs:

sudo -u hdfs hadoop fs -mkdir /user/username

sudo -u hdfs hadoop fs -chown -R username:Users /user/username

A little simpler:

USER=username;sudo -u hdfs hadoop fs -mkdir -p /user/$USER;sudo -u hdfs hadoop fs -chown -R $USER /user/$USER;

WebHDFS and HttpFS API

Here is an example opening a file on a dev cluster:

http://nn.servername01:50070/webhdfs/v1/asset/reference/timeanddate/sample/romaniaholidays2014.json?op=OPEN

Note that the VM name of the NameNode is hardcoded into the URL; this is not ideal because the active NameNode will change once we use HDFS HA.

Another good API to use is HttpFS – the centralized service understands HDFS HA and is a better option. Here is an example:

http://fs.servername01:14000/webhdfs/v1/asset/reference/timeanddate/sample/romaniaholidays2014.json?user.name=username&op=OPEN

You can also create a file with ?op=CREATE, for example.
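A sketch of a create through WebHDFS – it is a two-step operation: the NameNode answers the first PUT with a 307 redirect to a DataNode, and a second PUT sends the data to that Location URL (the path, file, and user here are placeholders):

curl -i -X PUT "http://nn.servername01:50070/webhdfs/v1/tmp/testfile.txt?user.name=username&op=CREATE"
# copy the Location header from the 307 response, then:
curl -i -X PUT -T testfile.txt "<Location URL from the previous response>"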

Change Replication Factor

HDFS stores files as data blocks and distributes these blocks across the entire cluster. As HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated a number of times to ensure high data availability. The replication factor is a property set in the HDFS configuration file that controls the global replication factor for the entire cluster. For each block stored in HDFS, there will be n – 1 duplicated blocks distributed across the cluster. For example, if the replication factor is set to 3 (the default value in HDFS) there is one original block and two replicas.

SSH to the node and run the following command, where the number after -w is the new replication factor (-w makes the command wait for the replication to complete):

sudo -u hdfs hadoop dfs -setrep -R -w 1 /
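To verify the replication factor a file actually has (the second column of hadoop fs -ls is the replication count; the path is a placeholder):

hadoop fs -ls /path/to/file
hadoop fs -stat %r /path/to/file   # prints just the replication factor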

Copy data from one Hadoop cluster to another Hadoop cluster using DistCp

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

Example usage:

hadoop distcp hdfs://source-namenode:8020/contents/of/folder/from/* hdfs://destination-namenode:8020/folder/to/

By default, distcp will skip files that already exist in the destination, but they can be overwritten by supplying the -overwrite option. You can also update only files that have changed using the -update option.

distcp is implemented as a MapReduce job where the work of copying is done by maps that run in parallel across the cluster. There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data, by bucketing files into roughly equal allocations.
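If a copy is overwhelming the cluster, or you want to cap parallelism, the -m option limits the number of simultaneous maps DistCp will run. For example, to use at most 20 maps (source and destination as in the example above):

hadoop distcp -m 20 hdfs://source-namenode:8020/contents/of/folder/from/* hdfs://destination-namenode:8020/folder/to/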

The following command will copy the folder contents from one Hadoop 4.# cluster to a folder on a Hadoop 5.# cluster. The hftp is necessary because Hadoop 4 and 5 are not wire-compatible. The command must be run on the destination cluster. Be sure your user has access to write to the destination folder.

hadoop distcp -pb hftp://nn.servername01:50070/user/source/* hdfs://fs.servername02:8020/user/destination/

Note: The -pb option will only preserve the block size.

Double Note: For copying between two different versions of Hadoop we must use the HftpFileSystem, which is a read-only file system. So the distcp must be run on the destination cluster.

The following command will copy the folder contents from one Hadoop 5.# cluster to a folder on another Hadoop 5.# cluster.

hadoop distcp -pb hdfs://fs.servername01:8020/user/username/* hdfs://fs.servername02:8020/user/username/

hadoop distcp -pb hdfs://hdfs-nn.servername01:8020/user/username/* hdfs://hdfs-nn.servername02:8020/user/username/

For additional options, use -p <arg> to preserve the following attributes: (rbugpcaxt): (replication, block-size, user, group, permission, checksum-type, ACL, XATTR, timestamps).

Note: If -p is specified with no <arg>, then preserves replication, block size, user, group, permission, checksum type and timestamps.

Recover Under-Replicated, Missing, or Corrupt Blocks

If you run into a situation where there are a large number of under-replicated, missing, or corrupt blocks, you need to first make sure no volumes have failed. If all volumes are accounted for, then tackle the under-replicated blocks FIRST. As soon as all blocks have replicated you will have a clear picture of the missing or corrupt blocks. Next, take care of the missing blocks, and finally corrupt.

Read the following section fully before proceeding – you need to be careful with these instructions as you might delete blocks unnecessarily.

Fix any HDFS issues using fsck:

sudo -u hdfs hdfs fsck /

# save the results to a file

sudo -u hdfs hdfs fsck / > ~/hdfs-fsck-09082014.txt

# report on under-replicated, missing, or corrupt blocks

sudo -u hdfs hdfs dfsadmin -report

List only corrupt file blocks:

sudo -u hdfs hdfs fsck -list-corruptfileblocks > ~/list-corruptfileblocks-10262017.txt

To determine which files are having problems, look through the output for missing or corrupt blocks (ignore under-replicated blocks for now). This command is really verbose, especially on a large HDFS filesystem, so I normally get down to the meaningful output with:

hadoop fsck / | egrep -v '^\.+$' | grep -v eplica

which ignores lines with nothing but dots and lines talking about replication.

Once you find a file that is corrupt you can attempt to find out why it was marked as corrupt – where are the blocks, what servers are the blocks located on?

hadoop fsck /path/to/corrupt/file -locations -blocks -files

Use that output to determine where blocks might live. If the file is larger than your block size it might have multiple blocks.

You can use the reported block numbers to go around to the datanodes and the namenode logs searching for the machine or machines on which the blocks lived. Try looking for filesystem errors on those machines. Missing mount points, datanode not running, file system reformatted/re-provisioned. If you can find a problem in that way and bring the block back online that file will be healthy again.

If the blocks are completely missing:

0. blk_-6574099661639162407_21831 len=134217728 repl=3 [172.16.1.115:50010, 172.16.1.128:50010, 129.93.239.178:50010]
1. blk_-8603098634897134795_21831 len=134217728 repl=3 MISSING!
2. ...

In the above log output, all possible sources of the second block are gone and the namenode has no knowledge of any host with it. This can happen when nodes are completely off or have no network connection to the namenode. In this case, the easiest solution is to grep for the block ID “8603098634897134795” in the namenode logs in hopes of seeing the last place that block lived. Providing you keep namenode logs around and the logging level is set high enough [what is high enough anyway?] you will hopefully find a datanode containing the block. If you are able to bring the datanode back up and the blocks are readable from the hard drive(s) the namenode will replicate it back to the appropriate amount and the file corruption will be gone.

Lather rinse and repeat until all files are healthy or you exhaust all alternatives looking for the blocks.

Once you determine what happened and you cannot recover any more blocks, use the following command to DELETE the files with permanently missing blocks.

hadoop fs -rm /path/to/file/with/permanently/missing/blocks

Note: When you delete the file it might be moved into the Trash. The Trash protects you from accidental deletion, giving you one more chance to recover. To fully delete the file you need to remove it from your Trash.
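If you want to bypass the Trash entirely, or purge what is already in it:

hadoop fs -rm -skipTrash /path/to/file/with/permanently/missing/blocks   # delete without moving to Trash
hadoop fs -expunge   # empty the current user's Trash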

This gets your HDFS filesystem back to healthy so you can start tracking new errors as they occur.

Reference: https://twiki.grid.iu.edu/bin/view/Storage/HadoopRecovery

Checkpointing in HDFS

A typical edit ranges from 10s to 100s of bytes, but over time enough edits can accumulate to become unwieldy. A couple of problems can arise from these large edit logs. In extreme cases, it can fill up all the available disk capacity on a node, but more subtly, a large edit log can substantially delay NameNode startup as the NameNode reapplies all the edits. This is where checkpointing comes in.

Checkpointing is a process that takes an fsimage and edit log and compacts them into a new fsimage. This way, instead of replaying a potentially unbounded edit log, the NameNode can load the final in-memory state directly from the fsimage. This is a far more efficient operation and reduces NameNode startup time.

However, creating a new fsimage is an I/O- and CPU-intensive operation, sometimes taking minutes to perform. During a checkpoint, the namesystem also needs to restrict concurrent access from other users. So, rather than pausing the active NameNode to perform a checkpoint, HDFS defers it to either the SecondaryNameNode or Standby NameNode, depending on whether NameNode high-availability is configured.

In either case though, checkpointing is triggered by one of two conditions: if enough time has elapsed since the last checkpoint (dfs.namenode.checkpoint.period), or if enough new edit log transactions have accumulated (dfs.namenode.checkpoint.txns). The checkpointing node periodically checks if either of these conditions are met (dfs.namenode.checkpoint.check.period), and if so, kicks off the checkpointing process.
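For reference, here are the three properties as they would appear in hdfs-site.xml. The values shown are the commonly shipped defaults (one hour, one million transactions, a 60-second check interval); confirm them against your own CDH release:

<!-- checkpoint trigger settings; values are the usual defaults -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
</property>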

Reference: http://blog.cloudera.com/blog/2014/03/a-guide-to-checkpointing-in-hadoop/

HDFS Balancer

The HDFS Balancer is a tool used to balance data across the DataNodes. If you add new DataNodes you might notice that the data is not distributed equally across all nodes.

Start the Balancer

To start the HDFS Balancer, select the HDFS service from Cloudera Manager, click on Instances, and click on the Balancer service. From within the Balancer service, click Actions, and click Rebalance.
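The balancer can also be run by hand from any HDFS gateway host. The -threshold argument is the allowed deviation (in percent) of each DataNode’s utilization from the cluster average, so a smaller number produces a more even cluster but a longer run:

sudo -u hdfs hdfs balancer -threshold 10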

Find the Balancer’s log to discover the status of the balancing process (/run/cloudera-scm-agent/process/pid#-hdfs-BALANCER/logs/stdout.log) using a command like this:

ls -lrt /run/cloudera-scm-agent/process/ | grep BALANCER

In Cloudera Manager’s Web UI you can find the log by clicking on the Stdout log under the Balancer service under the node Processes.

Near the beginning of the log you will find how much work has to be done to balance the cluster and how many nodes are over or underutilized. For example:

2014-08-26 08:40:31,834 INFO  [main] balancer.Balancer (Balancer.java:logNodes(907)) – 1 over-utilized: [Source[192.168.210.202:50010, utilization=86.94839955017261]]

2014-08-26 08:40:31,835 INFO  [main] balancer.Balancer (Balancer.java:logNodes(907)) – 4 underutilized: [BalancerDatanode[192.168.210.224:50010, utilization=10.070419522861195], BalancerDatanode[192.168.210.64:50010, utilization=1.4040975994717294E-5], BalancerDatanode[192.168.210.229:50010, utilization=3.730515639121875], BalancerDatanode[192.168.210.208:50010, utilization=15.298922338705484]]

2014-08-26 08:40:31,861 INFO  [main] balancer.Balancer (Balancer.java:run(1344)) – Need to move 1.47 TB to make the cluster balanced.

2014-08-26 08:40:31,863 INFO  [main] balancer.Balancer (Balancer.java:chooseTarget(1038)) – Decided to move 10 GB bytes from 192.168.210.202:50010 to 192.168.210.224:50010

2014-08-26 08:40:31,863 INFO  [main] balancer.Balancer (Balancer.java:chooseSource(1086)) – Decided to move 10 GB bytes from 192.168.210.254:50010 to 192.168.210.64:50010

2014-08-26 08:40:31,863 INFO  [main] balancer.Balancer (Balancer.java:chooseSource(1086)) – Decided to move 10 GB bytes from 192.168.210.215:50010 to 192.168.210.229:50010

2014-08-26 08:40:31,863 INFO  [main] balancer.Balancer (Balancer.java:chooseSource(1086)) – Decided to move 10 GB bytes from 192.168.210.207:50010 to 192.168.210.208:50010

2014-08-26 08:40:31,863 INFO  [main] balancer.Balancer (Balancer.java:run(1358)) – Will move 40 GB in this iteration

Aug 26, 2014 8:40:31 AM           0                  0 B             1.47 TB              40 GB

2014-08-26 08:40:32,101 INFO  [pool-2-thread-3] balancer.Balancer (Balancer.java:dispatch(344)) – Moving block 8184095414603472774 from 192.168.210.207:50010 to 192.168.210.208:50010 through 192.168.210.254:50010 is succeeded.

Stop the Balancer

To stop the Balancer, click Running Commands in the upper right hand side of Cloudera Manager’s web UI, find the Rebalance process, and click Abort.

Reference: http://www.swiss-scalability.com/2013/08/hadoop-hdfs-balancer-explained.html

Some notes: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3CE5C6ED175FFCE34D974527016DF712FD6A90C5D31C@AMRXM3113.dir.svc.accenture.com%3E

Troubleshooting

Manually Start HDFS

Not done very often, but once in a while, you may have to manually start HDFS:

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

HDFS is in Safe Mode

In Cloudera Manager, browse to the NameNode service and click Actions. You will see an option in the menu to Enter and Leave Safe Mode.

To get the safe mode status:

sudo hdfs dfsadmin -safemode get

To take HDFS out of safe mode:

sudo -u hdfs hdfs dfsadmin -safemode leave

To turn safe mode on:

sudo -u hdfs hdfs dfsadmin -safemode enter

HDFS Checkpoint Age has become bad

If a Backup (Standby) or Secondary NameNode fails, the Active NameNode will have no other node to write checkpoints to (documenting changes to data) and the two NameNodes will slowly become disparate. This could lead to data loss if the Active NameNode fails, because we would be left with a NameNode that does not know about any of the new or changed data. So an alert is thrown: NAME_NODE_HA_CHECKPOINT_AGE has become bad.

To fix this problem, find the NameNode that is down, find out why it is down, and restart it. The Active NameNode will write its checkpoint and the alert will clear.

In Cloudera Manager, here is what this problem would look like:

In this screenshot you can see that although the Active NameNode is marked bad, it is actually still running and SHOULD NOT be restarted. The first NameNode in the list, shown with the ‘down’ symbol, is the one that should be restarted (a line through the red ball indicates the role is down).

To fix this, check the box next to the NameNode with the red symbol with a white line through it and select Restart. The NameNode will come back up as the Standby and the checkpoint will be written.

HDFS Started with Bad Health: Address is already in use

Error: HDFS started with ‘bad’ health. When you drill down to the DataNode and SecondaryNameNode, they are in ‘bad’ health.

The log file contains the error:

2013-02-19 13:20:37,437 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain java.net.BindException: Problem binding to [servername01:50010] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException

It turns out that ports 50010 and 50090, which the DataNode and SecondaryNameNode are trying to use, were already bound by other TCP processes.

This may also fail during configuration of a Single Node:

Checking if the name directories of the NameNode are empty. Formatting HDFS only if empty.

Failed to format NameNode.

Starting HDFS Service

Service did not start successfully; not all of the required roles started: Service hdfs1 does not have sufficient running NameNodes.

Solution:

vi /var/log/hadoop-hdfs/hadoop-cmf-hdfs1-SECONDARYNAMENODE-servername01.log.out

sudo netstat -a -t --numeric-ports -p | grep java | grep LISTEN | grep 500

OR

sudo lsof -P -n | grep LISTEN | grep 500

sudo kill -9 1147   # kill the process holding the port (1147 in this example)

Restart the HDFS service

sudo jps # to verify results

OR

Stop the hdfs services:

sudo service hadoop-hdfs-namenode stop

sudo service hadoop-hdfs-datanode stop

sudo service hadoop-hdfs-secondarynamenode stop

sudo service hadoop-0.20-mapreduce-tasktracker stop

sudo service hadoop-0.20-mapreduce-jobtracker stop

Reference: http://grokbase.com/t/cloudera/scm-users/133k2jtxc2/unable-to-start-namenode-and-hbase-master-on-cloudera-manager

HDFS NameNode Stopped: HDFS Service is Down

The BETA cluster NameNode has stopped after I decommissioned a node from HDFS. I have never seen this behavior before. Restarting the HDFS cluster did not resolve the problem.

I decommissioned HDFS DataNode blvbetahdp34. It seems that immediately after this task the NameNode, running on blvbetahdp02, crashed – although the service remained running on the server. This caused a situation where the /space1{2}/dfs/nn/in_use.lock file was locked by the running process. I moved the in_use.lock file to tmp, but that was unnecessary. I discovered the running NameNode by searching for the port and killed the running NameNode process. A restart of the NameNode initially failed on a canary test, but I restarted the entire cluster, which worked.

Some notes:

2014-03-31 11:39:43,862 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join

java.io.IOException: Cannot lock storage /space1/dfs/nn. The directory is already locked

at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:634)

at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:457)

at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:292)

at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:207)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:728)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:521)

at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:403)

at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:445)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:621)

at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:606)

at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1177)

at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1241)

2014-03-31 11:39:43,881 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

2014-03-31 11:39:43,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at servername01/192.168.210.253

************************************************************/

Move in_use.lock and restart the NameNode:

2014-03-31 12:21:12,076 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.

2014-03-31 12:21:12,077 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join

java.net.BindException: Problem binding to [servername01:8022] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException

at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:718)

at org.apache.hadoop.ipc.Server.bind(Server.java:403)

at org.apache.hadoop.ipc.Server$Listener.(Server.java:501)

at org.apache.hadoop.ipc.Server.(Server.java:1894)

at org.apache.hadoop.ipc.RPC$Server.(RPC.java:970)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.(ProtobufRpcEngine.java:375)

at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:350)

at org.apache.hadoop.ipc.RPC.getServer(RPC.java:695)

at org.apache.hadoop.ipc.RPC.getServer(RPC.java:684)

at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.(NameNodeRpcServer.java:221)

at org.apache.hadoop.hdfs.server.namenode.NameNode.createRpcServer(NameNode.java:468)

at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:447)

at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:621)

at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:606)

at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1177)

at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1241)

2014-03-31 12:21:12,093 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

2014-03-31 12:21:12,095 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at servername01/192.168.210.253

************************************************************/

netstat -a -t --numeric-ports -p | grep 8022

tcp        0      0 192.168.210.253:8022    0.0.0.0:*               LISTEN      31689/java

sudo kill 31689

Restart NameNode using Cloudera Manager (the canary test might fail) – then restart the entire HDFS cluster.

HDFS Error: Canary test failed – Permission denied: error=13

1. Initially this outage presented itself as an HDFS NameNode outage. The Canary test failed and the error messages that I could find pointed to the NameNode. From the logs in Cloudera Manager:

Canary test failed to read file in directory /tmp/.cloudera_health_monitoring_canary_files

The health test result for HDFS_HA_NAMENODE_HEALTH  has become bad: The active NameNode’s health is bad.

2. Looking deeper into the NameNode logs, I discovered a Permission Denied error when running a script in the /run folder:

http://servername01:50070/logs/hadoop-cmf-hdfs1-NAMENODE-servername01.log.out

2014-04-10 10:46:56,637 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2014_04_10-10_46_56 is closed by DFSClient_NONMAPREDUCE_244274405_96

2014-04-10 10:46:56,675 WARN org.apache.hadoop.net.ScriptBasedMapping: Exception running /run/cloudera-scm-agent/process/1080-hdfs-NAMENODE/topology.py 192.168.210.242

java.io.IOException: Cannot run program “/run/cloudera-scm-agent/process/1080-hdfs-NAMENODE/topology.py” (in directory “/run/cloudera-scm-agent/process/1080-hdfs-NAMENODE”): java.io.IOException: error=13, Permission denied

at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)

at org.apache.hadoop.util.Shell.runCommand(Shell.java:206)

at org.apache.hadoop.util.Shell.run(Shell.java:188)

at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:381)

at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:242)

at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:180)

at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)

at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.sortLocatedBlocks(DatanodeManager.java:334)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1343)

at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:413)

at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:172)

at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44938)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:396)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746)

Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied

at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)

at java.lang.ProcessImpl.start(ProcessImpl.java:65)

at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)

… 19 more

Or narrow down the search with this command:

tail -n 24 /var/log/hadoop-hdfs/hadoop-cmf-hdfs1-NAMENODE-servername01.log.out | grep "WARN org.apache.hadoop"

If you see the WARN exception, you’re in trouble…

3. This pointed me to evaluate our /run mount, which is mounted on tmpfs (shared memory). I discovered that Ubuntu had changed tmpfs to noexec, which will not allow a script to execute. Executing scripts in shared memory is usually not a good thing anyway, and Ubuntu must be locking down their OS. However, Cloudera still runs scripts out of the /run folder on tmpfs. To fix this I mounted tmpfs without noexec by editing the /etc/fstab file in the following manner (Note: you can temporarily fix the problem by mounting /run with the exec option (sudo mount -o remount,exec /run), but that’s less than ideal):

Running mount shows how tmpfs is already mounted:

tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)

Note the noexec flag on tmpfs.

sudo vi /etc/fstab

# add this line to the bottom of the fstab file:

tmpfs /run tmpfs rw,nosuid,size=10%,mode=0755 0 0

4. Save the file and mount the filesystem:

sudo mount -o remount /run

5. Restart HDFS.

Note: Look at the tmpfs settings:

sudo cat /lib/init/fstab

# These are the filesystems that are always mounted on boot, you can

# override any of these by copying the appropriate line from this file into

# /etc/fstab and tweaking it as you see fit.  See fstab(5).

HDFS JournalNode: FileNotFoundException: No such file or directory

Aug 15, 2:34:50.515 PM  INFO    org.apache.hadoop.ipc.Server   
IPC Server handler 4 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.heartbeat from 161.170.176.104:39343 Call#631824 Retry#0
java.io.FileNotFoundException: /space1/dfs/jn/nameservice1/current/last-promised-epoch.tmp (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
    at org.apache.hadoop.hdfs.util.PersistentLongFile.writeFile(PersistentLongFile.java:78)
    at org.apache.hadoop.hdfs.util.PersistentLongFile.set(PersistentLongFile.java:64)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.updateLastPromisedEpoch(Journal.java:327)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:435)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.heartbeat(Journal.java:418)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.heartbeat(JournalNodeRpcServer.java:155)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.heartbeat(QJournalProtocolServerSideTranslatorPB.java:172)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25423)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
     

Solution: The Storage Directory should have been created, and I am not sure how it was deleted. You can create the folder manually:

sudo mkdir -p /space1/dfs/jn/nameservice1/current/
sudo chown -R hdfs:hdfs /space1/dfs/jn/

Note: This solution will lead to the JournalNotFormattedException, explained below.

HDFS JournalNode: JournalNotFormattedException: Journal Storage Directory * not formatted

Error:

Aug 15, 2:46:01.285 PM  WARN    org.apache.hadoop.security.UserGroupInformation
PriviledgedActionException as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.hdfs.qjournal.protocol.JournalNotFormattedException: Journal Storage Directory /space1/dfs/jn/nameservice1 not formatted
Aug 15, 2:46:01.286 PM  INFO    org.apache.hadoop.ipc.Server   
IPC Server handler 3 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.getEditLogManifest from 161.170.176.10:37369 Call#83 Retry#0
org.apache.hadoop.hdfs.qjournal.protocol.JournalNotFormattedException: Journal Storage Directory /space1/dfs/jn/nameservice1 not formatted
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkFormatted(Journal.java:472)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:655)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:186)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:236)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

Solution: The JournalNode was missing the cluster configuration file (VERSION). Create the VERSION file in the Storage Directory and add the contents of the VERSION file from another JournalNode (I’ve pasted an example below).

sudo mkdir -p /space1/dfs/jn/nameservice1/current/
sudo chown -R hdfs:hdfs /space1/dfs/jn/nameservice1/
# create the file
sudo vi /space1/dfs/jn/nameservice1/current/VERSION
sudo chown hdfs:hdfs /space1/dfs/jn/nameservice1/current/VERSION

Example VERSION file contents (from the Azure dev cluster):

#Wed Jul 29 23:37:46 PDT 2015
namespaceID=1236816086
clusterID=cluster18
cTime=1438238263778
storageType=JOURNAL_NODE
layoutVersion=-60

HDFS Under Replicated Blocks

Details: This is an HDFS service-level health check that checks that the number of under-replicated blocks does not rise above some percentage of the cluster’s total blocks. A failure of this health check may indicate a loss of DataNodes. Use the HDFS fsck command to identify which files contain under replicated blocks.

There are reasons for managing the replication level of data on a running Hadoop system. For example, if you don’t have even distribution of blocks across your DataNodes, you can increase replication temporarily and then bring it back down.

To set replication of an individual file to 4:

sudo -u hdfs hadoop dfs -setrep -w 4 /path/to/file

You can also do this recursively. To change replication of entire HDFS to 1:

sudo -u hdfs hadoop dfs -setrep -R -w 1 /

To script a fix for under-replicated blocks in HDFS, try the following:

####Fix under-replicated blocks###
su - <$hdfs_user>
hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 3 $hdfsfile; done

HDFS Missing Blocks

To discover what blocks have not been replicated, browse to the NameNode Web UI under the hdfs Service in Cloudera Manager. Then click on the Warning link in the Cluster Summary section of the page.

Or browse to the page locally using the w3m web browser:

w3m 'http://NameNodeServerName:50070/corrupt_files.jsp'

On the Warning page you will have a list of Reported Corrupt Files. You can either replicate these files (if you can locate them), or delete these files.

To delete the files, copy the file name and run the following command on the NameNode server:

# delete a file

hadoop fs -rm hdfs://servername01:8020/user/username/asset/attempt_1416875798797.txt

# delete a folder or file recursively

sudo -u hdfs hadoop dfs -rmr hdfs://servername01:8020/user/oozie/.Trash/131007170000/user/oozie/share/lib/sqoop/hive-builtins-0.10.0-cdh4.4.0.jar

You may have to run the command more than once if it places the file in the hdfs user’s trash.

The process of deleting missing blocks has been automated with the Hadoop\Delete Missing Blocks from HDFS Orchestrator Runbook.