HBase: RowCount Script

I wrote a quick script that counts the rows in every HBase table. It works well on my Dev clusters, where tables keep growing and fill with clutter. The script starts one MapReduce RowCounter job per table. I have also used it in Prod, with mixed results: sometimes the tables are too large for the MR jobs to finish within 24 hours.


#!/bin/bash
# Filename: rc-start-rowcount.sh
# Description: start a row count for each table
#
# Example:
# /opt/scripts/rc-start-rowcount.sh

# 1. check if the row count is already running
# 2. if the row count is NOT running, then run a row count

cd /opt/cloudera/parcels/CDH/bin

ScriptDir="/opt/scripts/";
WorkingDir="/opt/scripts/rc-work";
Test="";
ListOfHBaseTables="rc-tables.txt";
ListOfRunningYarnJobs="rc-yarn-jobs.txt";
ScriptToRun="rc-script.sh";
LogDir="/var/log/scripts";
LogFile="rc-start-rowcount.log";

echo "`date`: Start" >> $LogDir/$LogFile;

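# exit if rc-parse-rowcount.sh is currently running
# (the second grep keeps only the bash process, not the grep itself)
StartTest=`ps ax|grep rc-parse-rowcount.sh|grep bash`
echo $StartTest
if [[ ! $StartTest == "" ]]; then
    echo "`date`: WARNING: rc-parse-rowcount.sh is running, exit" >> $LogDir/$LogFile;
    echo $StartTest >> $LogDir/$LogFile;
    exit;
fi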

# create the script
echo "#!/bin/bash" > $WorkingDir/$ScriptToRun

echo 'list; quit;' | hbase shell > $WorkingDir/$ListOfHBaseTables
sed -i '/^$/d' $WorkingDir/$ListOfHBaseTables
sed -i '$d' $WorkingDir/$ListOfHBaseTables

# get running applications from yarn
yarn application -list > $WorkingDir/$ListOfRunningYarnJobs

while read -r table; do
    #echo 'table:' $table
    # 1. check if row count is running
    # if Test is blank=NOT running, anything else=running
    # -F matches the table name literally (names may contain dots)
    Test=`grep -F -- "$table" $WorkingDir/$ListOfRunningYarnJobs`;
    if [[ $Test == "" ]]; then
        # 2. if the row count is NOT running, then run a row count
        echo "sleep 10;hbase org.apache.hadoop.hbase.mapreduce.RowCounter $table > $WorkingDir/$table.txt 2>&1 &" >> $WorkingDir/$ScriptToRun 2>&1
        #echo 'run this table:' $table
        echo "`date`: Process Table: $table" >> $LogDir/$LogFile;
    fi
done <$WorkingDir/$ListOfHBaseTables

# set the script to be executable
chmod +x $WorkingDir/$ScriptToRun

# show the generated script, then run it (it starts all the MapReduce jobs)
cat $WorkingDir/$ScriptToRun
$WorkingDir/$ScriptToRun

echo "`date`: End" >> $LogDir/$LogFile;
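
Each RowCounter job writes its output to $WorkingDir/<table>.txt, so the counts still have to be pulled out of those files. Here is a minimal sketch of one way to do that. It assumes the job's counter dump contains a line of the form ROWS=<number>. The function names get_row_count and report_row_counts are hypothetical and are not part of the script above.

```shell
# Hypothetical helper: print the ROWS counter found in one RowCounter output file.
# Assumption: the counter dump contains a line such as "ROWS=12345".
get_row_count() {
    local file="$1"
    # take the last ROWS=<digits> match and keep only the number
    grep -oE 'ROWS=[0-9]+' "$file" | tail -n 1 | cut -d= -f2
}

# Print "<table> <count>" for every per-table output file in a directory.
# Files without a ROWS line (job still running or failed) are reported as "n/a".
report_row_counts() {
    local dir="$1" file table count
    for file in "$dir"/*.txt; do
        [ -e "$file" ] || continue
        table=$(basename "$file" .txt)
        # skip the helper files that the start script keeps in the same directory
        case "$table" in
            rc-tables|rc-yarn-jobs) continue ;;
        esac
        count=$(get_row_count "$file")
        echo "$table ${count:-n/a}"
    done
}
```

Once the jobs have finished, calling report_row_counts /opt/scripts/rc-work would print one line per table. A table that shows n/a either has a job still running or failed, and its output file is the place to look.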
