Impala provides a real-time SQL query interface for data stored in HDFS and HBase. Impala requires the Hive service and shares the Hive Metastore with Hue. Impala also offers connectors for external applications such as Tableau.

Configure Impala

Install Impala

Cloudera Manager distributes Impala in CDH and offers the following services:

Impala StateStore – Tracks the location and status of all Impala Daemon instances in the cluster. Run one instance of this daemon in your cluster. Cloudera recommends running the StateStore on a separate server from the Impala Daemon, preferably on the server running the HDFS NameNode. Most production deployments follow this layout, often using node #02.
Impala Catalog Server – Relays metadata changes from Impala SQL statements to all the nodes in the cluster. The Catalog Server runs as a daemon process named catalogd, and you need only one in the cluster. Cloudera recommends running it on the same host as the StateStore daemon. Do not run the Catalog Server on a server that is running an Impala Daemon.
Impala Daemon – Plans and executes queries against HDFS and HBase data. Run one Impala Daemon on each server in the cluster that has an HDFS DataNode, but not on the node running the Impala StateStore daemon. Also avoid running an Impala Daemon on a server running an HDFS NameNode, because memory use can become too high. As data use increases, the daemon's memory use increases as well.
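The placement rules above can be checked mechanically before roles are assigned. The following is a minimal sketch, not part of Cloudera Manager: the function name `check_impala_layout` and the role labels (`statestore`, `catalog`, `impalad`, `namenode`, `datanode`) are hypothetical names chosen for illustration.

```python
def check_impala_layout(roles_by_host: dict[str, set[str]]) -> list[str]:
    """Return a list of violations of the placement rules described above.

    roles_by_host maps a host name to the set of roles planned for it.
    The role labels are illustrative, not Cloudera Manager identifiers.
    """
    problems = []
    statestore_hosts = sorted(h for h, r in roles_by_host.items() if "statestore" in r)
    catalog_hosts = sorted(h for h, r in roles_by_host.items() if "catalog" in r)

    # One StateStore and one Catalog Server per cluster, on the same host.
    if len(statestore_hosts) != 1:
        problems.append(f"expected exactly one StateStore, found {len(statestore_hosts)}")
    if len(catalog_hosts) != 1:
        problems.append(f"expected exactly one Catalog Server, found {len(catalog_hosts)}")
    if len(statestore_hosts) == 1 and len(catalog_hosts) == 1 and statestore_hosts != catalog_hosts:
        problems.append("Catalog Server is not on the StateStore host")

    for host in sorted(roles_by_host):
        roles = roles_by_host[host]
        if "impalad" in roles:
            # Impala Daemons must not share a host with these roles.
            for other in ("statestore", "catalog", "namenode"):
                if other in roles:
                    problems.append(f"{host}: Impala Daemon shares a host with {other}")
        elif "datanode" in roles and not roles & {"statestore", "catalog", "namenode"}:
            # Every plain DataNode host should also run an Impala Daemon.
            problems.append(f"{host}: DataNode without an Impala Daemon")
    return problems


if __name__ == "__main__":
    layout = {
        "node02": {"namenode", "statestore", "catalog"},
        "node03": {"datanode", "impalad"},
        "node04": {"datanode", "impalad"},
    }
    print(check_impala_layout(layout) or "layout follows the placement rules")
```

An empty result means the planned layout matches the recommendations. Each string in a non-empty result names one rule that the layout breaks.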

Impala Configuration

Configuration	Description	Small (< 16 GB memory on a node)	Large (> 16 GB memory on a node)	Calculation
Impala Daemon Memory Limit (mem_limit)	Memory limit in bytes for the Impala Daemon, enforced by the daemon itself. If the limit is reached, queries running on the Impala Daemon may be killed. Leave it blank to let Impala pick its own limit. Use a value of -1 B to specify no limit.	256 MB	1 GB	For HDFS data, base the calculation on the block count used in Impala joins: (block_count / 100,000) * 0.5 GB
HBase RPC Timeout (hbase.rpc.timeout)	Timeout in seconds for all HBase RPCs made by Impala. Overrides the configuration in the HBase service.	3 seconds	9 seconds	On Azure we needed to raise the 3-second timeout to 9 seconds to allow for the network slowness inherent to Azure.
Process Swap Memory Thresholds (Impala Daemon Default Group)	The health test thresholds on the swap memory usage of the process.	Critical: Never	Critical: Never	Swap is bad for Impala, but there are times when a warning is enough.
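As a worked example of the mem_limit rule of thumb in the table, (block_count / 100,000) * 0.5 GB, here is a minimal sketch. The function name `estimate_mem_limit_gb` is hypothetical, and the block counts are made-up inputs rather than measurements from a real cluster.

```python
def estimate_mem_limit_gb(block_count: int) -> float:
    """Apply the table's rule of thumb: (block_count / 100,000) * 0.5 GB."""
    if block_count < 0:
        raise ValueError("block_count must be non-negative")
    return (block_count / 100_000) * 0.5


if __name__ == "__main__":
    # Illustrative block counts only.
    for blocks in (50_000, 200_000, 1_000_000):
        print(f"{blocks:>9,} blocks -> {estimate_mem_limit_gb(blocks):.2f} GB")
```

Treat the result as a starting point for the Impala Daemon Memory Limit setting, to be adjusted against observed query behavior.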

During join operations, portions of data from each joined table are loaded into memory. Data sets can be very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate running.

While requirements vary according to data set size, the following is generally recommended:

Memory – 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query processing on a particular node exceed the amount of memory available to Impala on that node, the query writes temporary work data to disk, which can lead to long query times. Note that because the work is parallelized, and intermediate results for aggregate queries are typically smaller than the original data, Impala can query and join tables that are much larger than the memory available on an individual node.
Storage – DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be querying.

For further details, see the Cluster Sizing Calculator: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_cluster_sizing.html

Administer Impala

ODBC Connector

Connect to any Impala Daemon over ODBC. Version 2 and higher of the Impala ODBC driver connect to Impala on port 21050. Impala supports Kerberos authentication with all supported versions of the driver. LDAP username/password authentication requires the Impala ODBC driver 2.05.13 or later. Download the ODBC Connector: https://www.cloudera.com/downloads/connectors/impala/odbc/2-5-41.html

Impala Query Editor (Hue)

Hue offers a stripped-down query editor that displays databases and lets users save scripts, explain queries, and run queries against databases. While Hue’s query editor is limited, it can come in handy for a quick overview.