Driven Agent Guide: Driven Agent for Hive
version 2.2.6

- 1. Prerequisites
- 2. Installing the Driven Agent
- 3. Configuring the Driven Agent
  - 3.1. Testing the Agent
  - 3.2. Agent Common Options
- 4. Driven Agent for MapReduce
  - 4.1. Running on Hadoop or YARN
  - 4.2. MapReduce Versions
- 5. Driven Agent for Hive
  - 5.1. Hive Version Requirements
  - 5.2. Metadata Support
- 6. Driven Agent for Apache Spark
  - 6.1. Spark Version Requirements
  - 6.2. Supported APIs
  - 6.3. Spark Runtimes
  - 6.4. Supported Runtimes
- 7. Using Driven Agent with Apache Oozie
- 8. Advanced Installation
- 9. Troubleshooting the Driven Agent
Driven Agent for Hive
The Driven Agent for Hive is a JVM-level agent library that enables monitoring
of Apache Hive queries in Driven. The agent runs in parallel with the main
execution of Hive and sends telemetry data to a remote Driven server. Any type
of Hive deployment is supported, such as the fat client (hive), the newer
Beeline client, HiveServer2, or Apache Oozie workflows containing Hive queries.
Any application queries that are sent through JDBC or even ODBC can be
monitored as well.
For instructions on downloading and installing the Driven Agent, see the Installing the Driven Agent section.
Hive Version Requirements
The Driven Agent for Hive can be used with any version of Hive newer than 0.13.0 if you are using the MapReduce execution engine. If you are using Hive with the Tez execution engine, you must use at least Hive 0.14.0 and Tez 0.5.2. In the Tez deployment case, you must also ensure that the YARN Application Timeline Server (ATS) is properly configured. Hive works without ATS, but the Driven Agent requires a functioning ATS to monitor all resource usage. Refer to the Tez project documentation for how to properly configure ATS.
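As a rough sketch of what a working ATS setup involves, the timeline service is typically enabled in yarn-site.xml and Tez is pointed at it in tez-site.xml. The property names below are standard Hadoop/Tez configuration keys, but the hostname is a placeholder and your distribution may manage these settings for you; refer to the Tez documentation for the authoritative list.

```xml
<!-- yarn-site.xml: enable the Application Timeline Server -->
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.hostname</name>
  <value>timeline-host.example.com</value> <!-- placeholder hostname -->
</property>

<!-- tez-site.xml: make Tez log its history to ATS -->
<property>
  <name>tez.history.logging.service.class</name>
  <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
</property>
```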
Known limitations
Hive Version | Problem | Solution
---|---|---
0.13.x | Creating a table based on a select query (e.g. CREATE TABLE … AS SELECT …) is not handled correctly | Update Hive to at least 0.14.0
2.0.0 | Hive 2.0.0 is currently not supported | 
HiveServer2 limitations
The Driven Agent for Hive monitors Units of Work that are executed by the
HiveServer2. Please note that the Driven Agent for Hive is not a replacement
for process monitoring of the HiveServer2 process itself. The Agent will
detect and correctly report failures that occur within a Unit of Work, but it
cannot detect problems with the general setup of the HiveServer2 itself.
Failure cases, such as an unavailable Metastore or incorrect permissions to
access HDFS, are outside the realm of the Agent. The Agent assumes a working
HiveServer2 deployment to monitor all queries run by a server.
The Agent will recognize a shutdown of the HiveServer2 process and will mark
it as stopped in Driven. However, if the HiveServer2 receives a SIGKILL signal
(a.k.a. kill -9), the application will still appear as running in Driven.
Since the process is killed by the kernel without allowing any clean-up to
happen, the agent cannot record the final state of the application. You should
make sure that the management scripts used to start and stop the HiveServer2
allow for a proper shutdown and do not use the SIGKILL signal.
Metadata Support
The Driven Agent for Hive enables Driven to recognize application metadata, such as a name, version number, or tags, which can be sent along with other telemetry data from an application. The following table shows the application-level properties supported by the agent:
Name | Example | Explanation
---|---|---
driven.app.name | driven.app.name=tps-report | Name of the application
driven.app.version | driven.app.version=1.1.5 | Version of the application
driven.app.tags | driven.app.tags=cluster:prod,tps,dept:marketing | Comma-separated list of tags
The Installing the Driven Agent section explained how these properties can be given on the agent command line. If that is not flexible enough for your use case, the Driven Agent for Hive offers more options:
The properties can be set within a given HiveQL script via set commands, in an
initialization file, or on the command line. It is also possible to add them
to the hive-site.xml file. With HiveServer2, you can also pass the properties
as JDBC parameters. Essentially, any way you would normally send parameters to
a Hive query is supported.
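For illustration, the metadata properties from the table above can be set at the top of a HiveQL script with set commands; the property names are from this guide, while the table name sales and all values are placeholders:

```sql
-- Set Driven application metadata before the actual queries.
set driven.app.name=tps-report;
set driven.app.version=1.1.5;
set driven.app.tags=cluster:prod,tps,dept:marketing;

-- Queries that follow are reported under the metadata set above.
SELECT COUNT(*) FROM sales;
```

With HiveServer2, the same properties can be appended to the JDBC connection URL in its Hive configuration section, for example jdbc:hive2://host:10000/default?driven.app.name=tps-report (host, port, and database are placeholders).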
In addition to application-level metadata, the Driven Agent for Hive allows the user to set Unit of Work metadata.
Name | Example | Explanation
---|---|---
driven.flow.name | driven.flow.name=tps-report | Name of the unit of work
driven.flow.description | driven.flow.description=Coversheet | Description of the unit of work
If no driven.flow.name property is set, the internal Hive Query Id is used.
Note: Setting the name or description will make it the name or description for all subsequent queries, so users should set it to a meaningful value as often as necessary.
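Because the name and description stick to all subsequent queries, one way to keep them meaningful is to reset the properties before each logical step of a script. A sketch (table names and values are placeholders):

```sql
set driven.flow.name=load-daily-sales;
set driven.flow.description=Load raw daily sales data;
-- This unit of work is reported under the name set above.
INSERT OVERWRITE TABLE sales_staging SELECT * FROM raw_sales;

set driven.flow.name=aggregate-daily-sales;
set driven.flow.description=Aggregate staged sales per region;
-- Subsequent queries pick up the new name and description.
SELECT region, SUM(amount) FROM sales_staging GROUP BY region;
```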
Using the Hive Agent Artifact
To download the latest Driven Agent for Hive, follow the instructions in the Downloading the Agent section.
Enable the agent by extending the HADOOP_OPTS environment variable before
starting a hive fat client, an embedded Beeline client, or HiveServer2.
Use the following command format:
export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-<version>.jar"
Note: You have to set the HADOOP_OPTS variable. Setting the YARN_OPTS variable, even on a YARN-based cluster, has no effect.
The agent must be installed and configured on the host where Hive queries are
executed. In the case of the fat client, it is sufficient to set the
environment variable in the shell where hive will be launched. The same
applies to the newer Beeline client, when used without HiveServer2.
In the case of a HiveServer2 deployment, the agent must be installed on the
machine where the server is running. For the agent to work, the HADOOP_OPTS
variable must be set in the environment where the server is running. Typically
this involves modifying the startup script of HiveServer2. Some distributions
ship with graphical cluster administration tools, with which you can customize
a hive-env.sh script to administer the HiveServer2.
Note: Each HiveServer2 instance appears as one long-running application in Driven from the time that the first query is executed on the server. When using the Driven Agent for Hive, an application is defined as one JVM. As a result, queries that a HiveServer2 runs are displayed as processes of the same application in Driven.
What Cluster Work Is Monitored
The Driven Agent for Hive monitors all queries that use resources (CPU and
memory) on the cluster. Queries that do not use cluster resources in terms of
computing power, even if they modify the state of the system, are currently
not tracked. Examples of queries that are not tracked are all DDL statements
or statements like select current_database(), use somedb;, etc.