Developer's Closet
A place where I can put my PHP, SQL, Perl, JavaScript, and VBScript code.

Copy data from one Hadoop cluster to another Hadoop cluster (running different versions of Hadoop)

I had to copy data from one Hadoop cluster to another recently. However, the two clusters ran different versions of Hadoop, which made using distcp a little tricky.

Some notes on distcp: by default, distcp skips files that already exist in the destination, but they can be overwritten with the -overwrite option, and files that have changed can be refreshed with the -update option. distcp is implemented as a MapReduce job in which the copying is done by maps running in parallel across the cluster; there are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations.
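Because each file goes to exactly one map, the data-per-map figure is easy to estimate when tuning distcp's -m (number of maps) option. A quick back-of-envelope sketch, with made-up sizes:

```shell
# Rough sizing for distcp's -m flag. The figures below are assumptions
# for illustration only, not from an actual cluster.
total_bytes=$((120 * 1024 * 1024 * 1024))   # pretend 120 GB to copy
maps=20                                      # e.g. hadoop distcp -m 20 ...
per_map_gb=$(( total_bytes / maps / 1024 / 1024 / 1024 ))
echo "~${per_map_gb} GB per map"
```

Keep in mind that if a handful of very large files dominate the copy, the allocations will be uneven no matter what -m is, because a single file is never split across maps.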

The following command copies a folder's contents from one Hadoop cluster to a folder on another Hadoop cluster. Using hftp is necessary because the clusters run different versions of Hadoop. The command must be run on the destination cluster. Be sure your user has access to write to the destination folder.

hadoop distcp -pb hftp://source-namenode:50070/tmp/* hdfs://destination-namenode/tmp/

Note: The -pb option preserves the block size.

Note also: copying between two different versions of Hadoop requires the HftpFileSystem, which is a read-only file system. This is why distcp must be run on the destination cluster.

The following command copies data between Hadoop clusters running the same version.

hadoop distcp -pb hdfs://source-namenode/tmp/* hdfs://destination-namenode/tmp/

Filed under: HDFS, Ubuntu
Useful Scripts


Find an Active Directory user with PowerShell ADSI:

$ou = [adsi]"LDAP://OU=Marketing,OU=Departments,DC=Company,DC=Domain";
$user = $ou.psbase.get_children().find('CN=UserName');


Back up a file with a timestamp suffix:

DATE=$(date +%Y%m%d%H%M%S)
sudo cp "$fname" "$fname.$DATE"

Comment out a matching line and the three lines that follow it:

i=0
while IFS='' read -r line; do
    if [ "$line" == " line item:" ] || { [ $i -gt 0 ] && [ $i -lt 4 ]; }; then
        printf '#%s\n' "$line" | sudo tee -a "$tname"
        i=$((i + 1))
    else
        printf '%s\n' "$line" | sudo tee -a "$tname"
    fi
done < "$fname"

sudo mv "$tname" "$fname"
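The comment-out loop above can be tried without sudo against throwaway files; the sample contents and the " line item:" marker below are stand-ins:

```shell
# Build a sample input file, then comment out the " line item:" line
# and the three lines that follow it, writing the result to a temp file.
fname=$(mktemp); tname=$(mktemp)
printf '%s\n' "header" " line item:" "a" "b" "c" "footer" > "$fname"

i=0
while IFS='' read -r line; do
  if [ "$line" = " line item:" ] || { [ "$i" -gt 0 ] && [ "$i" -lt 4 ]; }; then
    printf '#%s\n' "$line" >> "$tname"   # comment out this line
    i=$((i + 1))
  else
    printf '%s\n' "$line" >> "$tname"    # copy through unchanged
  fi
done < "$fname"

cat "$tname"
```

The counter only starts incrementing once the marker line is seen, so exactly the marker plus the next three lines are commented out; everything else is copied through.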


Upload a file with PuTTY's psftp from PowerShell and verify the transfer:

$SourcePath = 'C:\Path\';
$DestServer = 'ServerName';
$DestPath = '/path/';
$FileName = 'FileName';
$Output = @()
$cmd = @(
    "lcd $SourcePath",
    "cd $DestPath",
    "mput $FileName"
)

$Output = $cmd | & "C:\Program Files (x86)\Putty\psftp.exe" -v $DestServer 2>&1;
$Err = [String]($Output -like "*=>*");
If (($LastExitCode -ne 0) -or (-not $Err.Contains("=>"))) {
    throw "File Failed to Transfer! `n $($Output)";
}


Create a mount point, persist it in /etc/fstab, and mount the new volume:

ls /dev/;
sudo mkdir /space;
echo "/dev/space /space ext4 defaults 0 0" | sudo tee -a /etc/fstab;
sudo mount /dev/space /space;
df -h;


Disable Terminal Services remote control for a user with the Quest ActiveRoles cmdlets, then stop the current PowerShell process:

Add-PSSnapin Quest.ActiveRoles.ADManagement;
connect-QADService -service domain;
set-QADuser UserName -TSRemoteControl 0;
$objCurrentPSProcess = [System.Diagnostics.Process]::GetCurrentProcess();
Stop-Process -Id $objCurrentPSProcess.ID;


Sqoop Server Startup Failure: Upgrade required but not allowed

After an upgrade from CDH 5.0.2 to CDH 5.0.3, Sqoop failed to start with the following error: Server startup failure, Connector registration failed, Upgrade required but not allowed - Connector: generic-jdbc-connector.

To resolve this problem I had to add the following property to the Sqoop 2 Server Advanced Configuration Snippet (Safety Valve). You can find this setting under Cloudera Manager, Sqoop Service, Configuration, Sqoop 2 Server Default Group, Advanced:
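If memory serves, the property in question is the connector auto-upgrade switch (verify against the Cloudera documentation for your CDH release before relying on it):

```
org.apache.sqoop.connector.autoupgrade=true
```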


After the upgrade has completed successfully, the property can be removed.

Log File: /var/log/sqoop2/sqoop-cmf-sqoop-SQOOP_SERVER-servername01.ds-iq.corp.log.out

Server startup failure
org.apache.sqoop.common.SqoopException: CONN_0007:Connector registration failed
at org.apache.sqoop.connector.ConnectorManager.registerConnectors(
at org.apache.sqoop.connector.ConnectorManager.initialize(
at org.apache.sqoop.connector.ConnectorManager.initialize(
at org.apache.sqoop.core.SqoopServer.initialize(
at org.apache.sqoop.server.ServerInitializer.contextInitialized(
at org.apache.catalina.core.StandardContext.listenerStart(
at org.apache.catalina.core.StandardContext.start(
at org.apache.catalina.core.ContainerBase.addChildInternal(
at org.apache.catalina.core.ContainerBase.addChild(
at org.apache.catalina.core.StandardHost.addChild(
at org.apache.catalina.startup.HostConfig.deployWAR(
at org.apache.catalina.startup.HostConfig.deployWARs(
at org.apache.catalina.startup.HostConfig.deployApps(
at org.apache.catalina.startup.HostConfig.start(
at org.apache.catalina.startup.HostConfig.lifecycleEvent(
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(
at org.apache.catalina.core.ContainerBase.start(
at org.apache.catalina.core.StandardHost.start(
at org.apache.catalina.core.ContainerBase.start(
at org.apache.catalina.core.StandardEngine.start(
at org.apache.catalina.core.StandardService.start(
at org.apache.catalina.core.StandardServer.start(
at org.apache.catalina.startup.Catalina.start(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.catalina.startup.Bootstrap.start(
at org.apache.catalina.startup.Bootstrap.main(
Caused by: org.apache.sqoop.common.SqoopException: JDBCREPO_0026:Upgrade required but not allowed - Connector: generic-jdbc-connector
at org.apache.sqoop.repository.JdbcRepository$3.doIt(
at org.apache.sqoop.repository.JdbcRepository.doWithConnection(
at org.apache.sqoop.repository.JdbcRepository.doWithConnection(
at org.apache.sqoop.repository.JdbcRepository.registerConnector(
at org.apache.sqoop.connector.ConnectorManager.registerConnectors(
... 28 more

HBase All Regions in Transition: state=FAILED_OPEN

After I added a jar file to the HBase Master, I had a problem where regions failed to transition to a RegionServer. Below are the errors; removing the jar file from the hbase/lib folder resolved the problem (full path to the jar: /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/hbase/lib/). What tipped me off was the missing class definition: Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/ipc/CoprocessorProtocol.

Failed open of region=REGION.NAME,,4194066667839.6ea7d7ff9276f9c0e9b126c73e25bc54., starting to roll back the global memstore size.
java.lang.IllegalStateException: Could not instantiate a region instance.
at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor7.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
at java.lang.reflect.Constructor.newInstance(
at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(
... 10 more
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/ipc/CoprocessorProtocol
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(

9:10:19.721 AM INFO org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler
Opening of region {ENCODED => 6ea7d7ff9276f9c0e9b126c73e25bc54, NAME => 'REGION.NAME,,4194066667839.6ea7d7ff9276f9c0e9b126c73e25bc54.', STARTKEY => '', ENDKEY => ''} failed, transitioning from OPENING to FAILED_OPEN in ZK, expecting version 28

Cron file failed to load: (username~) ORPHAN (no passwd entry)

This problem bothers me a little. The authentication server failed during a cron job that referenced a specific account; Ubuntu could not authenticate the account, and id username failed. An orphaned entry was left under /var/spool/cron/crontabs/, and any time I tried to edit the cron file under /etc/cron.d/username-cron-file, the reload would fail:

cron[17959]: (*system*username-cron-file) RELOAD (/etc/cron.d/username-cron-file)
cron[17959]: Error: bad username; while reading /etc/cron.d/username-cron-file
cron[17959]: (*system*username-cron-file) ERROR (Syntax error, this crontab file will be ignored)
cron[17959]: (username~) ORPHAN (no passwd entry)

I deleted the spool entry and was able to recreate the cron file.
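To catch any other orphans before cron complains, the spool directory can be scanned for entries whose owner no longer resolves to a user. A sketch (the directory is a parameter, and the demo uses a throwaway directory with made-up names rather than /var/spool/cron/crontabs, which needs root):

```shell
# Print crontab spool entries whose filename is not a valid user,
# using the same passwd lookup cron relies on.
find_orphan_crontabs() {
  dir="$1"
  for f in "$dir"/*; do
    [ -e "$f" ] || continue               # skip if directory is empty
    u=$(basename "$f")
    if ! id "$u" >/dev/null 2>&1; then    # no passwd entry => orphan
      echo "orphan: $u"
    fi
  done
}

# Demo against a temporary directory instead of the real spool.
d=$(mktemp -d)
touch "$d/root" "$d/no_such_user_42xyz"
find_orphan_crontabs "$d"
```

Run against the real spool as root, any name this prints is a candidate for the same delete-and-recreate fix described above.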


Filed under: Linux, Ubuntu
Cloudera Manager HBase Check HFile Version

An under-documented feature in Cloudera Manager is the HBase Check HFile Version action. When upgrading from CDH 4.7 to CDH 5.0, I ran across instructions stating that the HBase upgrade will not succeed if any HFiles in version 1 format are present. Run "Check HFile Version" from the HBase service Actions menu to ensure that HBase is ready for the upgrade.

The check scans for HFiles in the v1 format. Once it reports "Process (###) has reached expected state", look for the message that no files with v1 were found, which means HBase can be upgraded. Otherwise, the affected HBase regions will have to be compacted first. I'll post more on that later.

In Stderr, look for the following:

INFO util.HFileV1Detector: Count of HFileV1: 0
INFO util.HFileV1Detector: Count of corrupted files: 0
INFO util.HFileV1Detector: Count of Regions with HFileV1: 0

Cannot Start HBase Master: SplitLogManager: Error Splitting

I could not start HBase within Cloudera Manager; the service reported errors. I was initially confused because I could start the Master when the RegionServers were stopped, but as soon as I started a RegionServer, the Master went down. I tracked the problem down to an unexpected reboot of the node running the HBase Master. After the Master restarted, HBase was not able to continue reading from the transaction log because it had become unusable (corrupt). I had to delete the broken file before restarting the Master node.

Digging through the logs: sudo tail /var/log/hbase/hbase-cmf-hbase1-MASTER-ServerName.log.out, I discovered: error or interrupted while splitting logs in [hdfs://ServerName:8020/hbase/.logs/ServerName,60020,1393982440484-splitting] Task = installed = 1 done = 0 error = 1

In the log file, look for the file that cannot be split:


Then search hdfs for the file:

sudo -u hdfs hadoop fs -ls /hbase/.logs

Note that the file is 0 KB. Next, move the offending file:

sudo -u hdfs hadoop fs -mv /hbase/.logs/ServerName,60020,1393982440484-splitting /tmp/ServerName,60020,1393982440484-splitting.old

Restart the HBase Master service. The splitting log file can be replayed back to recover any lost data, but I didn't look into that because there was no data to recover.

Note: Here is a fantastic HBase command to identify and fix any problems with HBase:

sudo -u hbase hbase hbck -fix

SQL Query for Recently Changed Objects

These queries may be useful to determine when objects have changed. From this info we might infer when the last formal deployment of a database occurred.


-- get most recently changed object of each type
SELECT   o.name AS 'ObjectName'
        ,SCHEMA_NAME(o.schema_id) AS 'SchemaName'
        ,COALESCE(OBJECT_NAME(o.parent_object_id), '') AS 'ParentName'
        ,o.type_desc AS 'ObjectType'
        ,o.create_date AS 'CreateDate'
        ,o.modify_date AS 'ModifyDate'
FROM    sys.objects o
JOIN    (
        SELECT    type
                 ,'LateDate' = MAX(modify_date)
        FROM      sys.objects
        WHERE     is_ms_shipped = 0
        AND       OBJECT_SCHEMA_NAME(object_id) <> 'tSQLt'
        GROUP BY  type
        ) xo
ON      o.type = xo.type
AND     o.modify_date = xo.LateDate
ORDER BY o.modify_date DESC




-- get 20 most recently changed objects
SELECT  TOP 20
         o.name AS 'ObjectName'
        ,SCHEMA_NAME(o.schema_id) AS 'SchemaName'
        ,COALESCE(OBJECT_NAME(o.parent_object_id), '') AS 'ParentName'
        ,o.type_desc AS 'ObjectType'
        ,o.create_date AS 'CreateDate'
        ,o.modify_date AS 'ModifyDate'
FROM    sys.objects o
WHERE   o.is_ms_shipped = 0
AND     o.parent_object_id = 0
--AND   o.type = 'U'
ORDER BY o.modify_date DESC


-- get 20 most recently changed objects that are not tSQLt or tSQLt tests
SELECT  TOP 20
         o.name AS 'ObjectName'
        ,SCHEMA_NAME(o.schema_id) AS 'SchemaName'
        ,COALESCE(OBJECT_NAME(o.parent_object_id), '') AS 'ParentName'
        ,o.type_desc AS 'ObjectType'
        ,o.create_date AS 'CreateDate'
        ,o.modify_date AS 'ModifyDate'
FROM    sys.objects o
WHERE   o.is_ms_shipped = 0
AND     o.parent_object_id = 0
--AND   o.type = 'U'
AND     OBJECT_SCHEMA_NAME(o.object_id) <> 'tSQLt'
ORDER BY o.modify_date DESC


Filed under: SQL, Windows
Fix an SQL Database Marked Suspect

If you have a database marked suspect and do not have a backup to restore from, you can attempt to repair the database. This may lead to data loss and is most certainly not the best option (see: database backup).

Make sure there are no open connections to the database: stop all services that might attempt a connection, and run sp_who2 to check for open connections.

Run the commands one at a time and fix any errors as they appear:

EXEC sp_resetstatus 'DatabaseName';
DBCC CHECKDB ('DatabaseName');

Restoring from a backup is the better way to fix this issue.

Filed under: SQL, Windows
Configure Solr Using Cloudera Manager

Solr, also known as Cloudera Search within Cloudera Manager, is a distributed service for indexing and searching data stored in HDFS.

Add the Solr Service

Using Cloudera Manager, add a Solr Server to a host that is not hosting ZooKeeper or Oozie. Solr will take a lot of processing power and memory. You can co-locate a Cloudera Search server (solr-server package) with a MapReduce TaskTracker (MRv1) and an HDFS DataNode. When co-locating with MapReduce TaskTrackers, be sure that the resources of the machine are not oversubscribed. It's safest to start with a small number of MapReduce slots and increase them gradually.

Here is Cloudera’s current Solr guide:

Creating Your First Solr Collection

To use Solr for the first time you will have to create collections. In Cloudera's guide, look under the heading: Creating Your First Solr Collection.

By default, the Solr server comes up with no collections. Make sure that you create your first collection using the instancedir that you provided to Solr in previous steps by using the same collection name. (numOfShards is the number of SolrCloud shards you want to partition the collection across. The number of shards cannot exceed the total number of Solr servers in your SolrCloud cluster):

solrctl collection --create collection1 -s <numOfShards>

You should be able to check that the collection is active. For example, you should be able to navigate to the collection's query URL, http://<solr-server>:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true, and verify that the collection responds. Similarly, you should be able to observe the topology of your SolrCloud in the Solr Admin UI's Cloud view at a URL similar to http://<solr-server>:8983/solr/#/~cloud.

You will then be able to create a new core.

Creating a New Core

In Cloudera Manager, browse to the Solr Service. On the right of the menu, click on Solr Web UI. On the lower left menu, click on Add a New Core. Enter the collection created above, give the core a name, and submit.