Developer's Closet A place where I can put my PHP, SQL, Perl, JavaScript, and VBScript code.

Copy data from one Hadoop cluster to another Hadoop cluster (running different versions of Hadoop)

I had to copy data from one Hadoop cluster to another recently. However, the two clusters ran different versions of Hadoop, which made using distcp a little tricky.

Some notes on distcp: By default, distcp skips files that already exist in the destination, but they can be overwritten by supplying the -overwrite option. You can also copy only the files that have changed by using the -update option. distcp is implemented as a MapReduce job where the work of copying is done by maps that run in parallel across the cluster. There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations.
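To make the three modes concrete, here is a small sketch that only prints the distcp command for each mode so it can be reviewed before being run. The host names and paths are placeholders I made up, and distcp_cmd is a hypothetical helper, not part of Hadoop.

```shell
# Hypothetical placeholders; substitute your own NameNode hosts and paths.
SRC="hftp://source-namenode:50070/tmp/logs"
DST="hdfs://destination-namenode/tmp/logs"

# Print (do not run) the distcp command for a given mode.
distcp_cmd() {
  case "$1" in
    skip)      echo "hadoop distcp $SRC $DST" ;;            # default: files already in the destination are skipped
    update)    echo "hadoop distcp -update $SRC $DST" ;;    # copy only files that have changed
    overwrite) echo "hadoop distcp -overwrite $SRC $DST" ;; # replace files that already exist
    *)         echo "usage: distcp_cmd skip|update|overwrite" >&2; return 1 ;;
  esac
}

distcp_cmd update
```

As far as I know, -update and -overwrite also change how a source directory is mapped onto the destination (its contents are copied rather than the directory itself), so it is worth trying the command on a small test folder first.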

The following command copies the contents of a folder on one Hadoop cluster to a folder on another Hadoop cluster. Using hftp is necessary because the clusters run different versions of Hadoop. The command must be run on the destination cluster. Be sure your user has write access to the destination folder.

hadoop distcp -pb hftp://source-namenode:50070/tmp/* hdfs://destination-namenode/tmp/

Note: The -pb option will preserve the block size.

Double Note: For copying between two different versions of Hadoop we must use the HftpFileSystem, which is a read-only file system. Since nothing can be written through hftp, the distcp must be run on the destination cluster.

The following command copies data between Hadoop clusters that run the same version.

hadoop distcp -pb hdfs://source-namenode/tmp/* hdfs://destination-namenode/tmp/
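The two cases above differ only in the source URI, so the choice can be sketched as a tiny helper. source_uri is a hypothetical function of mine and the host names are placeholders; port 50070 is the NameNode HTTP port used earlier in this post. The last line only prints the resulting command.

```shell
# Build the source URI for distcp.
# $1 = "same" or "different" (do both clusters run the same Hadoop version?)
# $2 = source NameNode host (placeholder), $3 = absolute path on the source
source_uri() {
  case "$1" in
    same)      echo "hdfs://$2$3" ;;
    different) echo "hftp://$2:50070$3" ;;  # read-only hftp over the NameNode HTTP port
    *)         echo "expected 'same' or 'different'" >&2; return 1 ;;
  esac
}

# Print the command to run on the destination cluster.
echo "hadoop distcp -pb $(source_uri different source-namenode /tmp/data) hdfs://destination-namenode/tmp/data"
```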


Using Cloudera Manager I found it very easy to configure single-node Hadoop development nodes that our developers can use to test their Pig scripts. However, out of the box dfs.replication is set to 3, which is great for a cluster, but on a single-node development workstation it throws warnings. I set dfs.replication to 1, but any blocks written previously are still reported as under-replicated blocks. Having a quick way to change the replication factor is very handy.

There are other reasons for managing the replication level of data on a running Hadoop system. For example, if you don’t have an even distribution of blocks across your DataNodes, you can increase replication temporarily and then bring it back down.
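A rough sketch of that raise-then-lower cycle, using the same -setrep command this post relies on. The path and the factors 4 and 3 are made-up example values, and run and bump_and_restore are hypothetical helpers. By default the script only prints the commands; setting DRY_RUN=0 would actually execute them.

```shell
# Hypothetical example path; adjust to your cluster.
TARGET="/path/to/dir"

# Print the command unless DRY_RUN=0, in which case execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# $1 = temporary higher replication factor, $2 = normal factor to restore
bump_and_restore() {
  run sudo -u hdfs hadoop dfs -setrep -R -w "$1" "$TARGET" &&
  run sudo -u hdfs hadoop dfs -setrep -R -w "$2" "$TARGET"
}

bump_and_restore 4 3
```

The -w flag makes each step wait until replication has caught up, so on a large directory this can take a while.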

To set replication of an individual file to 4:
sudo -u hdfs hadoop dfs -setrep -w 4 /path/to/file

You can also do this recursively. To change the replication of the entire HDFS to 1:
sudo -u hdfs hadoop dfs -setrep -R -w 1 /

I found these easy instructions on the Streamy Development Blog.
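To confirm that the under-replicated warnings are gone afterwards, one option is to read the summary that hadoop fsck prints. The sketch below is my own addition: under_replicated is a hypothetical helper, and it assumes the summary contains a line shaped like " Under-replicated blocks: 12 (40.0 %)". The sample line fed to it is made up, not real cluster output.

```shell
# Pull the under-replicated block count out of fsck summary text on stdin.
# Assumes a summary line shaped like " Under-replicated blocks:  12 (40.0 %)".
under_replicated() {
  awk '/Under-replicated blocks:/ { print $3; exit }'
}

# Made-up sample line, for illustration only.
printf ' Under-replicated blocks:\t12 (40.0 %%)\n' | under_replicated

# On a real cluster, something like:
#   sudo -u hdfs hadoop fsck / | under_replicated
```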