Hive

Hive is a data warehouse system that facilitates reading, writing, and managing large datasets residing in distributed storage. Structure can be projected onto data already in storage. Hive offers a SQL-like language called HiveQL. Hive also offers connectors for various external applications like Tableau.

Configure Hive

Install Hive

Cloudera Manager distributes Hive in CDH and offers the following services:

  • Hive Metastore Server (required) – The Hive Metastore service stores the metadata for Hive tables and partitions in a relational database, and provides clients (including Hive) access to this information via the Metastore service API. By default the Hive Metastore Server uses a Derby database. It is best to run the Hive Metastore Server on the same server as the database because large queries can cause performance issues if the database and the Hive Metastore Server are on separate servers. It is also recommended to install the Hive Metastore Server on the same node as an HDFS NameNode. Although Hive can be configured to use more than one Hive Metastore Server, Cloudera does not support having multiple Hive Metastore Servers as this may result in problems such as concurrency errors.
  • HiveServer2 (required – same node as the Hive Metastore Server) – HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. HiveServer2 differs from HiveServer, and Beeline is the supported CLI to communicate with HiveServer2. HiveServer2 supports multiple clients making many simultaneous requests, which is an improvement over HiveServer.
  • Gateway (required) – add a Hive Gateway to any host from which you will run the Hive CLI, such as an APP server, and to the host where the Hive Metastore Server is installed. If you are not sure, add the Hive Gateway to all APP servers and to the server where you installed the Hive Metastore Server.
  • WebHCat Server (optional) – can be installed onto any node. WebHCat is the REST API for HCatalog, a table and storage management layer for Hadoop. We do not use this service.

Hive Configuration

For each configuration below, the recommended values are listed for Small (8 GB memory), Medium (16 GB memory), and Large (32 GB memory) deployments, followed by the calculation or reasoning behind the recommendation.
Java Heap Size of HiveServer2

Maximum size in bytes for the Java process heap memory, passed to Java as -Xmx.

Small: 256 MB. Medium: 2 GB. Large: 4 GB.

Calculation: We were able to reliably run against 70 GB files on 256 MB of heap, but beyond that we were often hit by out-of-memory errors. I'd recommend 2 GB for each 1 TB of files that you intend to query, but this is not a regular slope: the requirement is much heavier on the lower end (around 100 GB of files) and grows more slowly as the file repository increases in size.
HiveServer2 Port

hive.server2.thrift.port

Port on which HiveServer2 listens for connections.

Small: 10000. Medium: 10000. Large: 10000.

Calculation: The standard HiveServer2 port is 10000.

Hive Metastore Connection Timeout

hive.metastore.client.socket.timeout

Timeout for requests to the Hive Metastore Server. Consider increasing this if you have tables with a lot of metadata and see timeout errors. Used by most Hive Metastore clients, such as the Hive CLI and HiveServer2, but not by Impala, which has a separately configured timeout.

Small: 5 minutes. Medium: 5 minutes. Large: 30 minutes.

Calculation: Default = 5 minutes. Use a larger value for bigger tables.
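
If a single session hits metastore timeouts, the value can also be raised for just that session (assuming the property is not on your cluster's restricted list):

set hive.metastore.client.socket.timeout=1800; -- value in seconds (30 minutes)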

Service Monitor Client Config Overrides

For advanced use only: a list of configuration properties that will be used by the Service Monitor instead of the current client configuration for the service.

Small: 60. Medium: 60. Large: <property><name>hive.metastore.client.socket.timeout</name><value>1800</value></property>

Calculation: Default = 60 seconds. This configuration is set by Cloudera's installer. We override the value to match hive.metastore.client.socket.timeout because the two values should match.

hive.execution.engine

The default execution engine for running Hive queries. Can be set to mr for MapReduce or spark for Spark. Set this to spark to use Hive on Spark instead of the default MapReduce.

For Hive to work on Spark, you must deploy Spark gateway roles on the same machine that hosts HiveServer2. Otherwise, Hive on Spark cannot read from Spark configurations and cannot submit Spark jobs.

After installation, run the following command in Hive so that Hive will use Spark as the back-end engine for all subsequent queries.
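
set hive.execution.engine=spark;

This takes effect for the current session; to make Spark the default for all sessions, set hive.execution.engine in the Hive service configuration instead.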

Reference: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/admin_hos_config.html#hos_config

hive.stats.fetch.column.stats

Whether column stats for a table are fetched during explain.

Small: true. Medium: true. Large: true.

Calculation: For improved performance, Cloudera recommends enabling this property for Hive.

hive.optimize.index.filter

Whether to use the indexing optimization for all queries.

Small: true. Medium: true. Large: true.

Calculation: For improved performance, Cloudera recommends enabling this property for Hive.
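
While testing, both properties can also be enabled for a single session before being made permanent in the service configuration:

set hive.stats.fetch.column.stats=true;
set hive.optimize.index.filter=true;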

Spark Executor Maximum Java Heap Size

spark.executor.memory

Maximum size of each Spark executor’s Java heap memory when Hive is running on Spark.

  • Compute a memory size equal to yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores), and then split that between spark.executor.memory and spark.yarn.executor.memoryOverhead (a worked example follows this list).
  • spark.yarn.executor.memoryOverhead should be 15-20% of the total memory size.
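
A worked example, assuming a hypothetical NodeManager with yarn.nodemanager.resource.memory-mb = 102400 (100 GB) and yarn.nodemanager.resource.cpu-vcores = 20, running 5-core executors. The session-level overrides below are only a sketch; in practice you would set these values in the service configuration:

-- per-executor share: 102400 MB * (5 / 20) = 25600 MB
-- overhead at 20%: 25600 MB * 0.20 = 5120 MB
-- executor heap: 25600 MB - 5120 MB = 20480 MB (20 GB)
set spark.executor.cores=5;
set spark.executor.memory=20g;
set spark.yarn.executor.memoryOverhead=5120; -- value in MB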

For general Spark configuration recommendations, see Configuring Spark on YARN Applications.

Executor memory size can have a number of effects on Hive. Increasing executor memory increases the number of queries for which Hive can enable mapjoin optimization. However, too much executor memory makes garbage collection take longer. Also, some experiments show that HDFS doesn't handle concurrent writers well, so a race condition may occur if an executor has too many cores.

Cloudera recommends that you set the value of spark.executor.cores to 5, 6, or 7, depending on what the host's core count is divisible by. For example, if yarn.nodemanager.resource.cpu-vcores is 19, you would set the value to 6. Executors must all have the same number of cores: if you set the value to 5, only three executors with 5 cores each can be launched, leaving four cores unused; if you set the value to 7, only two executors are used, and five cores are unused; a value of 6 launches three executors and leaves just one core unused. If the number of cores is 20, set the value to 5 so that each host runs four executors and no cores are unused.

Spark Executor Memory Overhead

spark.yarn.executor.memoryOverhead

This is the amount of extra off-heap memory that can be requested from YARN per executor process. This, together with spark.executor.memory, is the total memory that YARN can use to create a JVM for an executor process. See spark.executor.memory for more details.
hive.stats.collect.rawdatasize

Not found in Cloudera's configuration.

Small: true. Medium: true. Large: true.

Calculation: Hive on Spark uses statistics to determine the threshold for converting a common join to a map join. There are two types of statistics about the size of data:

  • totalSize: the approximate size of the data on disk
  • rawDataSize: the approximate size of the data in memory

When both metrics are available, Hive chooses rawDataSize.
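
With hive.stats.collect.rawdatasize enabled, table statistics are populated by ANALYZE TABLE and can be inspected with DESCRIBE FORMATTED, where totalSize and rawDataSize appear under Table Parameters (default.tablename is a placeholder):

ANALYZE TABLE default.tablename COMPUTE STATISTICS;
DESCRIBE FORMATTED default.tablename;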

Hive Auto Convert Join Noconditional Size

hive.auto.convert.join.noconditionaltask.size

If Hive auto convert join is on, and the sum of the sizes of n-1 of the tables/partitions in an n-way join is smaller than this value, the join is converted directly to a map join (with no conditional task).

Small: 20 MB. Medium: 20 MB. Large: 256 MB.

Calculation: This is the threshold on the sum of the small table sizes (by default, rawDataSize) for map join conversion. You can increase the value to improve performance by converting more common joins to map joins, but if you set it too high, tasks may fail because the data from the small tables uses too much memory.
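
As a session-level sketch, the related map join properties can be set together; the values below match the Large recommendation (the size is given in bytes):

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=268435456; -- 256 MB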

Table Lock Manager

For the Table Lock Manager, make the following configuration changes before using HiveServer2. The Table Lock Manager provides concurrency support, which is a must in databases: at a minimum, Hive should support concurrent readers and writers whenever possible. Because all locks are acquired implicitly, it is also useful to have a mechanism to discover which locks are currently held. Failing to make these changes may result in unpredictable behavior. To enable Hive's Table Lock Manager:

  1. Open the Cloudera Manager UI, browse to Services > Hive, and click Configuration.
  2. Expand HiveServer2 Default Group, click Advanced, and add the following to the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml (substitute your ZooKeeper servers in the hive.zookeeper.quorum <value>):
<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>zk.servername12,zk.servername13</value>
</property>
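
After restarting HiveServer2, you can confirm from a Hive session that the lock manager is active; SHOW LOCKS lists the locks currently held (tablename is a placeholder):

SHOW LOCKS;
SHOW LOCKS tablename;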

Live Long and Process (LLAP)

Live Long and Process (LLAP) functionality was added in Hive 2.0 (HIVE-7926 and associated tasks). HIVE-9850 links documentation, features, and issues for this enhancement.

For configuration of LLAP, see LLAP Section of Configuration Properties.

https://cwiki.apache.org/confluence/display/Hive/LLAP

http://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/

Test build: http://www.lewuathe.com/blog/2015/08/12/try-hive-llap/

Administer Hive

Test Hive

Enter the Hive shell and list the databases:

hive

show databases;

Debug Hive using the following commands:

hive -hiveconf hive.root.logger=ALL,console

show databases;

Query Examples

Query a Hive table and output the results to an HDFS folder:

set mapred.job.name = operations-report-file;
INSERT OVERWRITE DIRECTORY '/path/to/hdfs/folder/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT
TO_DATE(from_unixtime(UNIX_TIMESTAMP(split(rowkey,":")[3], 'yyyy-MM-dd'))) AS datetime
, count(*) AS rowcount
FROM default.tablename
WHERE TO_DATE(from_unixtime(UNIX_TIMESTAMP(split(rowkey,":")[3], 'yyyy-MM-dd'))) > date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'), 14)
GROUP BY split(rowkey,":")[3]
ORDER BY datetime;

Split Function

select split(rowkey,":")[3] from tablename limit 10;

Create Table

Create a Hive table using a file stored in HDFS (date is a reserved word in newer Hive versions, so the column name is backquoted):

CREATE TABLE default.operations_report_file (
`date` DATE,
rowcount INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/ops/operations-report-file.tsv' INTO TABLE default.operations_report_file;
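
A quick sanity check that the load worked:

select * from default.operations_report_file limit 10;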

Create a Hive table using a local file:

CREATE DATABASE test;

CREATE EXTERNAL TABLE test.hivetest1 (
CustomerNumber INT,
DepartmentNumber INT,
CategoryNumber INT,
VisitTime INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH 'customers.csv' OVERWRITE INTO TABLE test.hivetest1;

Create a Hive table against an HBase table:

CREATE EXTERNAL TABLE hbase_test6c (
rowKey string,
ac1 string,
ac2 string,
ac3 string,
ac4 string,
ac5 string,
ac6 string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:c1,cf:c2,cf:c3,cf:c4,cf:c5,cf:c6")
TBLPROPERTIES ("hbase.table.name" = "test6c");

Additional example against an HBase table (note that hbase.columns.mapping entries must not contain whitespace, since spaces are interpreted as part of the column name):

create external table if not exists hbasepersontable (id INT, firstname STRING, lastname STRING)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ('hbase.columns.mapping' = ':key,firstname:firstname,lastname:lastname')
tblproperties ('hbase.table.name' = 'PERSON');
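
Once mapped, an HBase-backed table can be queried like any other Hive table, for example:

select id, firstname, lastname from hbasepersontable limit 10;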

Scripting Against Hive

Return results from Hive using bash:

# the -S silences the connection messages
# use -e to pass a command
my_value=`hive -S -e 'select count(*) from salestrigger.external_triggered_stores;'|tail -n1`;
echo $my_value;

ODBC Connector

I found a useful ODBC Connector to connect to Hive’s HiveServer2: http://kb.tableau.com/articles/knowledgebase/hadoop-hive-connection

Troubleshooting

Hive shell debug mode

You can enter a debug mode in the hive shell with the following command:

hive -hiveconf hive.root.logger=DEBUG,console

Hive Metastore Servers will not start – Hive Metastore canary failed to create a database

The Hive Server will not connect to the Hive Metastore, and Cloudera Manager reports that the canary failed to create a database. In the Hive Metastore logs you see the following error: Metastore Thrift Server threw an exception…

MetaException(message:Hive Schema version 0.13.0 does not match metastore’s schema version 0.12.0 Metastore is not upgraded or corrupt)

Resolution: You will need to update the database schema; most likely an update to the Hive Server missed the database, or the database has become corrupt. Stop the Hive service, then in Cloudera Manager, within the Hive Metastore instance, click Actions and select Update Database Schema. Without Cloudera Manager, you'll find the update-database-schema script in the bin folder. After the schema has been successfully updated, start the Hive Metastore and Hive Server.

From the Hive Metastore log:

8:26:23.133 AM ERROR  org.apache.hadoop.hive.metastore.HiveMetaStore

Metastore Thrift Server threw an exception…

MetaException(message:Hive Schema version 0.13.0 does not match metastore’s schema version 0.12.0 Metastore is not upgraded or corrupt)

at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6311)

at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6282)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

8:26:22.774 AM ERROR  org.apache.hadoop.hive.metastore.MetaStoreDirectSql

Database initialization failed; direct SQL is disabled

javax.jdo.JDOException: Exception thrown when executing query

at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:596)

at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:230)

at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.<init>(MetaStoreDirectSql.java:119)

org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:224)

at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)

at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)

at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)

NestedThrowablesStackTrace:

org.postgresql.util.PSQLException: ERROR: column A0.OWNER_NAME does not exist

Position: 122

at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)

at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)

at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)

From the Hive Server log:

8:22:20.788 PM ERROR  org.apache.hive.service.CompositeService

Error starting services HiveServer2

org.apache.hive.service.ServiceException: Unable to connect to MetaStore!

at org.apache.hive.service.cli.CLIService.start(CLIService.java:140)

at org.apache.hive.service.CompositeService.start(CompositeService.java:70)

at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73)

Caused by: MetaException(message:Got exception: org.apache.thrift.transport.TTransportException java.net.SocketException: Broken pipe)

at org.apache.hadoop.hive.metastore.MetaStoreUtils.logAndThrowMetaException(MetaStoreUtils.java:1114)

at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabases(HiveMetaStoreClient.java:839)

at org.apache.hive.service.cli.CLIService.start(CLIService.java:138)

… 9 more

Out of Memory Error – Within Hive, Before Being Submitted to YARN

Resolution:

It appears that the Hive query was not being submitted to YARN, which points to an out-of-memory error in HiveServer2. Increase the memory under Java Heap Size of HiveServer2.

Error:

vi /var/log/hive/hadoop-cmf-hive-HIVESERVER2-servername12.log.out

2015-05-07 10:30:09,925 WARN org.apache.hive.service.cli.thrift.ThriftCLIService: Error executing statement:

java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space

at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84)

at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)

at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)

Caused by: java.lang.OutOfMemoryError: Java heap space