Using Cloudera Manager, I found it very easy to configure single-node Hadoop development nodes that our developers can use to test their Pig scripts. However, out of the box dfs.replication is set to 3, which is great for a cluster but triggers warnings on a single-node development workstation, because one DataNode can never hold three replicas of a block. I set dfs.replication to 1, but that setting only applies to newly written files, so any blocks written previously are still reported as under-replicated. Having a quick way to change the replication factor of existing files is very handy.
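To see how many blocks are affected, the fsck summary can be filtered down to a single number. The following is a minimal sketch, not a definitive tool: the helper name under_replicated_count is made up for illustration, and it assumes the summary contains a line of the form "Under-replicated blocks: <count> (<pct> %)", which may differ between Hadoop versions.

```shell
#!/bin/sh
# Hypothetical helper: read an fsck report on stdin and print the
# under-replicated block count from its summary.
# Assumption: the summary has a line like
#   "Under-replicated blocks: <count> (<pct> %)"
under_replicated_count() {
    awk -F: '/Under-replicated blocks/ { split($2, a, " "); print a[1]; exit }'
}

# Usage on a real node (needs a running HDFS, so it is left commented out):
# sudo -u hdfs hadoop fsck / | under_replicated_count
```

If the number printed is not zero after dfs.replication has been lowered, the existing files still carry their old replication factor.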

There are other reasons for managing the replication level of data on a running Hadoop system. For example, if blocks are not evenly distributed across your DataNodes, you can increase replication temporarily and then bring it back down.

To set the replication of an individual file to 4 (the -w flag makes the command wait until replication completes, which can take a long time):
sudo -u hdfs hadoop dfs -setrep -w 4 /path/to/file

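You can also do this recursively. To change the replication of the entire HDFS to 1 (on newer Hadoop releases, hadoop dfs is deprecated in favor of hdfs dfs, and -setrep on a directory is always recursive, with -R accepted only for backwards compatibility):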
sudo -u hdfs hadoop dfs -setrep -R -w 1 /
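After changing replication, it can be worth spot-checking a file. The sketch below relies on "hadoop fs -stat %r", which prints the replication factor recorded for a file. The function name check_replication and its output format are my own inventions for illustration, and it is meant for files rather than directories.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's replication factor with an expected value.
# Returns 0 on a match, 1 on a mismatch, 2 if the stat call itself fails.
check_replication() {
    target=$1
    expected=$2
    # "%r" asks -stat to print only the replication factor.
    actual=$(hadoop fs -stat %r "$target") || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK $target replication=$actual"
    else
        echo "MISMATCH $target replication=$actual expected=$expected"
        return 1
    fi
}

# Usage on a real node (left commented out because it needs a running HDFS):
# check_replication /path/to/file 1
```

Returning distinct exit codes keeps the helper usable in scripts, for example to stop a setup script when a file still has the old replication factor.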

I found these easy instructions on the Streamy Development Blog.
