Driven Agent Guide: Driven Agent for Hive

version 2.2.6

Driven Agent for Hive

The Driven Agent for Hive is a JVM-level agent library that enables monitoring of Apache Hive queries in Driven. The agent runs alongside the main execution of Hive and sends telemetry data to a remote Driven server. All common types of Hive deployment are supported: the fat client (hive), the newer Beeline client, HiveServer2, and Apache Oozie workflows containing Hive queries. Queries that applications send through JDBC or even ODBC can be monitored as well.

For instructions on downloading and installing the Driven Agent, see the sections on downloading and installing the Agent.

Hive Version Requirements

The Driven Agent for Hive can be used with any version of Hive newer than 0.13.0 when running on the MapReduce execution engine. If you are using Hive with the Tez execution engine, you must use at least Hive 0.14.0 and Tez 0.5.2. In a Tez deployment you must also ensure that the YARN Application Timeline Server (ATS) is properly configured: Hive itself works without ATS, but the Driven Agent requires a functioning ATS to monitor all resource usage. Refer to the Tez project documentation for how to configure ATS properly.
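As a quick sanity check before enabling the agent, you can verify that ATS is reachable via its REST endpoint (the hostname below is a placeholder, and 8188 is only the default webapp port; check yarn.timeline-service.webapp.address in your yarn-site.xml):

# verify that the YARN Application Timeline Server responds
curl http://timeline-host:8188/ws/v1/timeline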

Known limitations

Hive Version: 0.13.x
Problem: When creating a table based on a select query (e.g. create table foo as select a, b from bar), the output resources are not correctly identified.
Solution: Update Hive to at least 0.14.0.

Hive Version: 2.0.0
Problem: Hive 2.0.0 is currently not supported.

HiveServer2 limitations

The Driven Agent for Hive monitors Units of Work that are executed by HiveServer2. Please note that the Driven Agent for Hive is not a replacement for process monitoring of the HiveServer2 process itself. The Agent will detect and correctly report failures that occur within a Unit of Work, but it cannot detect problems with the general setup of HiveServer2. Failure cases such as an unavailable Metastore or incorrect permissions to access HDFS are outside the scope of the Agent. The Agent assumes a working HiveServer2 deployment in order to monitor all queries run by the server.

The Agent will recognize a shutdown of the HiveServer2 process and will mark it as stopped in Driven. However, if the HiveServer2 receives a SIGKILL signal (a.k.a. kill -9), the application will still appear as running in Driven: since the process is killed by the kernel without allowing any clean-up to happen, the agent cannot record the final state of the application. Make sure that the management scripts used to start and stop the HiveServer2 allow for a proper shutdown and do not use the SIGKILL signal.
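For example, a stop script should send SIGTERM rather than SIGKILL (the pid file path below is an assumption and varies by deployment):

# stop HiveServer2 gracefully so the agent can record the final application state
kill -TERM "$(cat /var/run/hive/hiveserver2.pid)"  # pid file location is illustrative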

Metadata Support

The Driven Agent for Hive enables Driven to recognize application metadata, such as a name, version number, or tags, which is sent along with the other telemetry data of an application. The following table shows the application-level properties supported by the agent:

Table 1. Properties for sending App metadata to Driven

Name                Example                                          Explanation
driven.app.name     driven.app.name=tps-report                       Name of the application
driven.app.version  driven.app.version=1.1.5                         Version of the application
driven.app.tags     driven.app.tags=cluster:prod,tps,dept:marketing  Comma-separated list of tags

The installing the Agent section explains how these properties can be given on the agent command line. If that is not flexible enough for your use case, the Driven Agent for Hive offers more options:

The properties can be set within a given HiveQL script via set commands, in an initialization file, or on the command line. They can also be added to the hive-site.xml file. With HiveServer2, you can additionally pass the properties as JDBC parameters. In short, any way you would normally pass parameters to a Hive query is supported.
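For illustration, here are a few of these mechanisms (script names, hostnames, and property values are examples only):

-- inside a HiveQL script or interactive session
set driven.app.name=tps-report;
set driven.app.version=1.1.5;

# on the hive command line
hive --hiveconf driven.app.tags=cluster:prod,tps -f report.hql

# against HiveServer2, as Hive configuration parameters in the Beeline JDBC URL
beeline -u "jdbc:hive2://hs2-host:10000/default?driven.app.name=tps-report"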

In addition to application-level metadata, the Driven Agent for Hive allows the user to set Unit of Work metadata.

Table 2. Properties for sending Unit of Work metadata to Driven

Name                     Example                             Explanation
driven.flow.name         driven.flow.name=tps-report         Name of the unit of work
driven.flow.description  driven.flow.description=Coversheet  Description of the unit of work

If no driven.flow.name property is set, the internal Hive Query Id is used.

Note
Setting the name or description applies to all subsequent queries, so set these properties again whenever the value should change for the queries that follow.
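For example, a multi-step HiveQL script could rename the Unit of Work before each logical step (all table and flow names here are made up for illustration):

-- step 1
set driven.flow.name=load-raw-events;
insert into table events select * from staging_events;

-- step 2
set driven.flow.name=build-daily-summary;
insert overwrite table daily_summary select dt, count(*) from events group by dt;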

Using the Hive Agent Artifact

To download the latest Driven Agent for Hive, follow the instructions in the downloading the Agent section.

Enable the agent by extending the HADOOP_OPTS environment variable before starting a hive fat client, an embedded Beeline client, or HiveServer2. Use the following command format:

export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-<version>.jar"
Note
You have to set the HADOOP_OPTS variable. Setting the YARN_OPTS variable, even on a YARN-based cluster, has no effect.

The agent must be installed and configured on the host where Hive queries are executed. In the case of the fat client, it is sufficient to set the environment variable in the shell where hive will be launched. The same applies to the newer Beeline client, when used without HiveServer2.
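Putting it together for the fat client, a session might look like this (the jar path and script name are placeholders):

export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-<version>.jar"
hive -f my-query.hql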

In the case of a HiveServer2 deployment, the agent must be installed on the machine where the server is running, and the HADOOP_OPTS variable must be set in the environment in which the server runs. Typically this involves modifying the startup script of HiveServer2. Some distributions ship with graphical cluster administration tools that let you customize a hive-env.sh script to manage the HiveServer2 environment.
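One common approach is to append the agent to HADOOP_OPTS in hive-env.sh (the exact location of the file varies by distribution; this is a sketch):

# in hive-env.sh: preserve existing options and add the agent
export HADOOP_OPTS="$HADOOP_OPTS -javaagent:/path/to/driven-agent-hive-<version>.jar"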

Note
Each HiveServer2 instance appears as one long-running application in Driven from the time that the first query is executed on the server. When using the Driven Agent for Hive, an application is defined as one JVM. As a result, all queries that a HiveServer2 instance runs are displayed as processes of the same application in Driven.

What Cluster Work Is Monitored

The Driven Agent for Hive monitors all queries that use resources (CPU and memory) on the cluster. Queries that do not use cluster computing resources, even if they modify the state of the system, are currently not tracked. Examples of untracked queries are DDL statements that run only against the Metastore and statements like select current_database() or use somedb;.
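As a rough illustration (table names are invented), the first statement below would be tracked while the remaining ones would not:

-- tracked: launches work on the cluster
select dept, count(*) from employees group by dept;

-- not tracked: no cluster computing resources are used
use somedb;
select current_database();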