Driven Agent Guide
version 2.1.4
Getting Started
The Driven Agent is a collection of JVM libraries that enables monitoring of the following Hadoop applications in Driven:
- Apache Hive
- MapReduce
The agent is compatible with the following scheduler:
- Apache Oozie
The Driven Agent is a Java agent wrapper for the Driven Plugin. There is one agent JAR file for Hive, and another JAR file for MapReduce.
All agent JAR files are bundled with the Driven Plugin JAR.
Note: To monitor only Cascading applications with Driven, the Driven Agent is not necessary. However, an installation of the Driven Plugin JAR is required. See the Driven documentation for details.
Downloading the Driven Agent
Select the agent library that is applicable to your framework.
From a terminal/shell, execute one of the following commands.
# latest Hive agent bundle
> wget -i http://files.concurrentinc.com/driven/2.1/driven-agent/latest-driven-agent-hive-bundle.txt
# latest MapReduce agent bundle
> wget -i http://files.concurrentinc.com/driven/2.1/driven-agent/latest-driven-agent-mr-bundle.txt
Note: The -i switch downloads the latest-…-bundle.txt file, and then downloads the link in the file, which is always the latest release for the current version of Driven.
Installing the Driven Agent
Note for Apache Oozie users: Follow the Using Driven Agent with Apache Oozie installation documentation instead of the following procedure.
The following steps assume that Hadoop applications are being launched from:
- the command line, via bin/yarn jar … or bin/hadoop jar …
- an interactive shell like Beeline when using Hive with a "thick" client
- jobs that start from a long-running server like HiveServer2 or from an application server like Tomcat, JBoss, Spring, etc.
Note: Driven defines an application context as the JVM instance driving and orchestrating the client side of Hadoop applications. Each Hive query or MapReduce job appears as a single Unit of Work in that application. A single application context can contain thousands of queries. Restarting the JVM creates a new application instance.
Variables are used in many of the commands in the following sections:
- [framework] stands for Hive (hive) or MapReduce (mr)
- <version> stands for the current agent version
Agent Quick Start:
Step 1: Create a new directory named driven-agent in your home directory.
Step 2: Copy the downloaded installation JAR file into the driven-agent directory.
Step 3: Create a driven-agent.properties file with the appropriate settings for your environment. See the Configuring the Driven Agent section to properly configure both the drivenHosts and drivenAPIKey settings (if an API key is required).
Tip: Creating a different driven-agent.properties file for each unique application enables the display of application-specific values (like name and tags) in Driven and lets you assign applications to specific teams via the Driven team API key.
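For reference, a minimal driven-agent.properties might contain just the two settings from Step 3; this is a sketch, and the host name and key value are placeholders:
# Driven Server host(s) and port(s) to send telemetry to
drivenHosts=driven.example.com:8080
# team API key, if your Driven Server requires one
drivenAPIKey=0123456789abcdef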
Step 4: In the current console or within a bash script, use either export HADOOP_OPTS or export YARN_CLIENT_OPTS (depending on your environment) to pass the options in the following command:
export YARN_CLIENT_OPTS="-javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar=optionsFile=driven-agent.properties"
Step 5: Run your application.
After installing the agent and running your application, log in to the Driven Server to see your application’s performance information.
Tip: The URL to the current application will be printed in the logs.
Note: Putting the agent on the runtime CLASSPATH will have no effect. Be sure to place the -javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar switch on the JVM command line before the application JAR.
Configuring the Driven Agent
The Driven Agent accepts various configuration options after the path to the Driven Agent JAR file.
java -javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar[=key1=value1;key2=value2,value3] <other java arguments>
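For example, the following sketch enables the Hive agent and sets two documented options inline; the path, host, and application name are illustrative:
java "-javaagent:/path/to/driven-agent-hive-bundle-<version>.jar=drivenHosts=driven.example.com:8080;appName=order-pipeline" -cp my-app.jar com.example.Main
Note the quotes around the -javaagent argument: the semicolon separating options would otherwise be interpreted by the shell.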
Available agent options can be printed to the console by running the Driven Agent JAR with the following command:
java -jar /path/to/driven-agent-[framework]-bundle-<version>.jar
The agent also accepts a properties file via the optionsFile option. To generate a template file with defaults, run the following command (with a dash as the only argument):
java -jar /path/to/driven-agent-[framework]-bundle-<version>.jar - > driven-agent.properties
This creates a driven-agent.properties template in the current directory.
Tip: The file specified by optionsFile is resolved relative to the JVM's current working directory, unless the path is absolute. If it is not found there, it is resolved relative to the Driven Agent JAR file location.
Note: Some of the following configuration options might not be available for all frameworks.
Agent-Specific Options
optionsFile - Specifies a file that provides option values for the Driven Agent. All values in the file take precedence over the agent argument values. The path is resolved relative to the current directory; if not found there, it is resolved relative to the agent's JAR directory.
agentDisableLogging - Disables all Driven Agent and Driven Plugin logging.
agentDebugLogging - Enables debug logging in the Driven Agent and the Driven Plugin.
agentExitOnlyOnJobCompletion - Forces the JVM to remain active until the monitored jobs complete, fail, or are killed. The appCompletionTimeout option is not supported. Default is TRUE.
agentKillJobsOnExit - Kills all running jobs when the JVM exits. Work is marked as STOPPED if System.exit is called when detaching the client.
agentLogSystemExitCalls - Enables logging of the stack trace of System and Runtime exit calls. The option also installs a custom SecurityManager if no other SecurityManager has been installed.
agentLogSubmitStackTrace - Enables logging of the stack trace of submit() calls to the cluster, which helps in diagnosing the root main class and function.
Plugin-Specific Options
drivenHosts - Specifies the server host names and ports where data is to be sent. Values should be entered in this format: host1:80,host2:8080. The http:// or https:// prefix may be placed before the host name.
Note: If you are using the Early Access Program (EAP) or the Hosted Trial, drivenHosts must be set to https://driven.cascading.io/ or https://trial.driven.io/, respectively.
drivenAPIKey - Specifies the API key that is associated with application executions.
Note: If you are using the EAP or the Hosted Trial, drivenAPIKey must be set in order to see your applications in Driven after logging in. This requires an account, which you can get on the Driven Trial Options website.
drivenArchiveDir - Indicates the local directory where copies of transmitted data are to be stored.
drivenDisabled - Disables the sending of data to the Driven Server.
drivenSuppressSliceData - Disables sending slice-level data and detailed low-level performance visualizations; overrides server settings. This option can reduce network traffic, load on any history servers, and indexing latency.
drivenContinuousSliceData - Enables frequent updates of slice-level data before slice completion (update on completion is the default); overrides server settings. This option can increase network traffic, load on any history server, and indexing latency.
Note: Some platforms do not support retrieving intermediate results at this level.
drivenSuppressJavaCommandData - Disables sending command-line argument data; overrides server settings. This option prevents sending sensitive information that might appear on the command line.
Application-Specific Options
appName - Names an application. The default name is the JAR file name without version information.
appVersion - Specifies the version of an application. The default version is parsed from the JAR file name.
appTags - Assigns tags that should be associated with the application, for example: cluster:prod,dept:engineering
appCompletionTimeout - Specifies the timeout (in milliseconds) to wait to send all completed application details before shutdown.
appFailedOnAnyUoWFail - Indicates that if any Unit of Work fails, the application is marked as FAILED. The default is to mark an app as FAILED only if the last Unit of Work fails.
appFailedOnAnyUoWPending - Indicates that if any Unit of Work is not started, the application is marked as FAILED. The default is to mark an app as FAILED only if the last Unit of Work does not start.
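For example, the application options above might be set in a per-application driven-agent.properties file; this is a sketch, and the values are illustrative:
appName=order-pipeline
appVersion=2.3.0
appTags=cluster:prod,dept:engineering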
UnitOfWork-Specific Options
uoWSanitizeStatements - Enables query statement sanitization. This option is enabled by default.
Driven Agent for MapReduce
The agent for MapReduce enables Driven to perform real-time monitoring of any application written as one or more MapReduce jobs.
For instructions on downloading and installing the Driven Agent, see the Downloading the Driven Agent and Installing the Driven Agent sections. If you plan to use the agent with Apache Oozie, see the additional installation documentation in Using Driven Agent with Apache Oozie.
MapReduce Versions
The Driven Agent works with both the mapred.* and mapreduce.* APIs (sometimes known as the old and new APIs) on either Hadoop 1.x or 2.x releases.
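For example, a plain MapReduce job might be launched with the agent enabled as follows; this is a sketch, and the JAR and class names are illustrative:
export HADOOP_OPTS="-javaagent:$HOME/driven-agent/driven-agent-mr-bundle-<version>.jar=optionsFile=$HOME/driven-agent/driven-agent.properties"
hadoop jar my-mr-job.jar com.example.WordCount input/ output/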
Driven Agent for Hive
The Driven Agent for Hive is a JVM-level agent library that enables monitoring of Apache Hive queries in Driven. The agent runs in parallel with the main execution of Hive and sends telemetry data to a remote Driven server. Any type of Hive deployment is supported, such as the fat client (hive), the newer Beeline, HiveServer2, or Apache Oozie workflows containing Hive queries. Any application queries that are sent through JDBC or even ODBC can be monitored as well.
Hive Version Requirements
The Driven Agent for Hive can be used with any version of Hive newer than 0.13.0 if you are using the MapReduce execution engine. If you are using Hive with the Tez execution engine, you must use at least Hive 0.14.0 and Tez 0.5.2. In the Tez deployment case, you must also ensure that the YARN Application Timeline Server (ATS) is properly configured. Hive works without ATS, but the Driven Agent requires a functioning ATS to monitor all resource usage. Refer to the Tez project documentation for how to properly configure ATS.
Known limitations
Hive Version | Problem | Solution
---|---|---
0.13.x | When creating a table based on a select query (e.g. …) | Update Hive to at least 0.14.0
2.0.0 | Hive 2.0.0 is currently not supported |
HiveServer2 limitations
The Driven Agent for Hive monitors Units of Work that are executed by HiveServer2. Note that the Driven Agent for Hive is not a replacement for process monitoring of the HiveServer2 process itself. The agent will detect and correctly report failures that occur within a Unit of Work, but it cannot detect problems with the general setup of HiveServer2 itself. Failure cases, such as an unavailable Metastore or incorrect permissions to access HDFS, are outside the realm of the agent. The agent assumes a working HiveServer2 deployment in order to monitor all queries run by a server.
The agent will recognize a shutdown of the HiveServer2 process and will mark it as stopped in Driven. However, if the HiveServer2 receives a SIGKILL signal (a.k.a. kill -9), the application will still appear as running in Driven. Since the process is killed by the kernel without allowing any clean-up to happen, the agent cannot record the final state of the application. You should make sure that the management scripts used to start and stop the HiveServer2 allow for a proper shutdown and do not use the SIGKILL signal.
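For example, a stop script might send SIGTERM (the default signal) instead of SIGKILL so that shutdown hooks can run; the PID file path below is illustrative:
# graceful stop: lets the agent record the final application state
kill -TERM "$(cat /var/run/hive/hiveserver2.pid)"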
Metadata Support
The Driven Agent for Hive enables Driven to recognize application metadata, such as the name, version number, or tags, which can be sent with other telemetry data from an application. The following table shows the application-level properties supported by the agent:
Name | Example | Explanation
---|---|---
driven.app.name | driven.app.name=tps-report | Name of the application
driven.app.version | driven.app.version=1.1.5 | Version of the application
driven.app.tags | driven.app.tags=cluster:prod,tps,dept:marketing | Comma-separated list of tags
Getting Started explained how these properties can be given on the agent command line. If that is not flexible enough for your use case, the Driven Agent for Hive offers more options: the properties can be set within a given HiveQL script via set commands, in an initialization file, or on the command line. It is also possible to add them to the hive-site.xml file. With HiveServer2, you can also pass the properties as JDBC parameters. Basically, any way you would normally send parameters to a Hive query is supported.
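For example, the properties can be set at the top of a HiveQL script via set commands; the values here are taken from the table above:
set driven.app.name=tps-report;
set driven.app.version=1.1.5;
set driven.app.tags=cluster:prod,tps,dept:marketing;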
In addition to application-level metadata, the Driven Agent for Hive allows the user to set Unit of Work metadata.
Name | Example | Explanation
---|---|---
driven.flow.name | driven.flow.name=tps-report | Name of the unit of work
driven.flow.description | driven.flow.description=Coversheet | Description of the unit of work
If no driven.flow.name property is set, the internal Hive Query Id is used.
Note: Setting the name or description makes it the name or description for all subsequent queries, so set these properties to a meaningful value as often as necessary.
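For example, a script might rename the unit of work before each query so that every query appears under a meaningful name; this is a sketch, and the table and queries are illustrative:
set driven.flow.name=daily-order-count;
SELECT COUNT(*) FROM orders;
set driven.flow.name=daily-revenue;
SELECT SUM(total) FROM orders;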
Using the Hive Agent Artifact (Unbundled Form)
To download the latest Driven Agent for Hive, follow the instructions in Getting Started.
Enable the agent by extending the HADOOP_OPTS environment variable before starting a hive fat client, an embedded Beeline client, or HiveServer2. Use the following command format:
export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-<version>.jar"
Note: You have to set the HADOOP_OPTS variable. Setting the YARN_OPTS variable, even on a YARN-based cluster, has no effect.
The agent must be installed and configured on the host where Hive queries are executed. In the case of the fat client, it is sufficient to set the environment variable in the shell where hive will be launched. The same applies to the newer Beeline client, when used without HiveServer2.
In the case of a HiveServer2 deployment, the agent must be installed on the machine where the server is running. For the agent to work, the HADOOP_OPTS variable must be set in the environment where the server is running. Typically this involves modifying the startup script of HiveServer2. Some distributions ship with graphical cluster administration tools, with which you can customize a hive-env.sh script to administer HiveServer2.
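For example, on a distribution that uses a hive-env.sh script, the agent might be enabled with a line such as the following; this is a sketch, and the installation path is illustrative:
# in hive-env.sh, read before HiveServer2 starts
export HADOOP_OPTS="$HADOOP_OPTS -javaagent:/opt/driven-agent/driven-agent-hive-bundle-<version>.jar=optionsFile=/opt/driven-agent/driven-agent.properties"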
Note: Each HiveServer2 instance appears as one long-running application in Driven from the time that the first query is executed on the server. When using the Driven Agent for Hive, an application is defined as one JVM. As a result, queries that a HiveServer2 runs are displayed as processes of the same application in Driven.
Using the Hive Agent Bundle Artifact
The Hive agent bundle has the same functionality as the plain agent, but the bundle simplifies the installation and configuration of the agent in certain deployment scenarios. Users of Oozie should always use the Hive agent bundle. See Using Driven Agent with Apache Oozie for further information.
To download the latest Driven Agent for Hive, see the Getting Started documentation.
What Cluster Work Is Monitored
The Driven Agent for Hive monitors all queries that use resources (CPU and memory) on the cluster. Queries that do not use cluster computing resources, even if they modify the state of the system, are currently not tracked. Examples of queries that are not tracked are DDL statements or statements like select current_database(), use somedb;, etc.
Using Driven Agent with Apache Oozie
Apache Oozie is a popular workflow management solution in the Hadoop ecosystem that supports running a variety of different technologies. Oozie operates by executing computational tasks called actions. The actions are arranged in directed acyclic graphs (DAGs), which are referred to as workflows. The Driven Agent can be used to monitor the execution of HiveActions and MapReduceActions that are managed by Oozie.
Oozie uses a client-server architecture, but the Oozie server does not run any user code itself. Instead, the server uses a LauncherMapper to drive each action in a given workflow.
The LauncherMapper is a single Map task, which is sent cluster-side and acts as the driving program for the action. Any node of the cluster can potentially be the machine that drives a given Hive query or runs a MapReduce job. Therefore, every machine in the cluster must have access to the Driven Agent for Hive, and every machine must be able to communicate with the Driven Server. Your firewall rules should be set accordingly.
Note: Apache Oozie users should always install and configure the Hive agent bundle for Driven.
Driven Agent Bundle Configuration
Instead of installing the Driven Agent JAR files on every machine of the cluster, the Driven Agents can be installed in Oozie’s sharelib on HDFS:
Given a sharelib directory on HDFS of /user/oozie/share/lib/lib_20150721160609, the Driven Agent could be installed as follows:
> hadoop fs -mkdir /user/oozie/share/lib/lib_20150721160609/driven
> hadoop fs -copyFromLocal /path/to/driven-agent-<framework>-bundle-<version>.jar \
/user/oozie/share/lib/lib_20150721160609/driven
Some distributions require a restart of the Oozie server after modifying the sharelib. Check the documentation of your distribution.
Now that the Driven Agent is available on HDFS, the agent must be configured at the global workflow level or in single-action XML.
The following property adds the -javaagent JVM option that loads the agent. The JAR file name must match the one on HDFS.
<property>
<name>oozie.launcher.mapred.child.java.opts</name>
<value>-javaagent:$PWD/driven-agent-<framework>-bundle-<version>.jar</value>
</property>
For HiveActions, the following property configures the Oozie Hive action to include JAR files from the hive and driven subdirectories of the currently active sharelib.
<property>
<name>oozie.action.sharelib.for.hive</name>
<value>hive,driven</value>
</property>
For MapReduceActions, the map-reduce directory may need to be created; verify that the directory exists in the sharelib. The configuration should look like this:
<property>
<name>oozie.action.sharelib.for.map-reduce</name>
<value>map-reduce,driven</value>
</property>
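Alternatively, depending on your Oozie deployment, the sharelib selection can typically also be set per job in job.properties instead of in the workflow XML; a sketch:
oozie.use.system.libpath=true
oozie.action.sharelib.for.hive=hive,driven
oozie.action.sharelib.for.map-reduce=map-reduce,driven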
Finally, the following properties configure the Driven Server location and API key for the bundled agent to use. Depending on your deployment and needs, you can freely choose on which level to set these properties. Setting the properties on the workflow level enables the agent for all supported actions in that workflow. Setting the properties on the action level only enables them for that specific action.
<property>
<name>cascading.management.document.service.hosts</name>
<value>http://<hostname>:<port>/</value>
</property>
<property>
<name>cascading.management.document.service.apikey</name>
<value><apikey></value>
</property>
Advanced Installation
Advanced users may wish to script the Driven Agent installation, or use the Driven Agent with Amazon Elastic MapReduce.
Scripted Installation
For advanced users, the Driven Agent can be installed with the following script:
# to download the script
> wget http://files.concurrentinc.com/driven/2.1/driven-plugin/install-driven-plugin.sh
# for Hive installation
> bash install-driven-plugin.sh --hive --bundle
# for MapReduce installation
> bash install-driven-plugin.sh --mapreduce --bundle
Note: The --bundle switch forces only the bundled version of the agent to be downloaded and installed. Leaving off the switch downloads and installs an un-bundled version of the agent along with the latest version of the Driven Plugin, which the Driven Agent requires.
Alternately, as a one-liner:
# for Hive installation
> export USE_AGENT_BUNDLE=true; export AGENT=hive; curl http://files.concurrentinc.com/driven/2.1/driven-plugin/install-driven-plugin.sh | sh
# for MapReduce installation
> export USE_AGENT_BUNDLE=true; export AGENT=mr; curl http://files.concurrentinc.com/driven/2.1/driven-plugin/install-driven-plugin.sh | sh
This script will create a .driven-plugin directory in the current user's home directory, download the latest Driven Agent bundle JAR, and create a symbolic link referencing the latest version of driven-agent-[framework]-bundle.jar.
Re-running the script can be used to safely upgrade the agent.
Note: driven-agent-[framework]-bundle.jar is a Unix symbolic link to the latest downloaded version of the agent JAR file. The link is created or updated by the install script.
Amazon Elastic MapReduce
For Amazon Elastic MapReduce users, the install-driven-plugin.sh script, introduced above, doubles as a bootstrap action. The only addition is to also use the Amazon-provided configure-daemons bootstrap action (s3://elasticmapreduce/bootstrap-actions/configure-daemons) with the following arguments:
--client-opts=-javaagent:/home/hadoop/.driven-plugin/driven-agent-[framework]-bundle.jar
Replace [framework] with hive for Hive, and mr for MapReduce.
EMR 4.x
Amazon introduced a set of changes in EMR version 4.0 that directly affect how the Driven Agent is installed. One important change is that bootstrap actions can no longer modify the Hadoop installation, since Hadoop is only deployed after all bootstrap actions have been executed. The new way of changing the Hadoop installation with user-defined settings is the application configuration feature.
Driven provides a set of configurations to be used with the different agents. On the command line, simply add the --configurations switch for your framework:
--configurations http://files.concurrentinc.com/driven/2.1/hosted/driven-plugin/configurations-[framework].json
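For example, with the AWS CLI, a cluster launch might combine the bootstrap action and the configurations switch as follows; this is a sketch, where the bucket, instance settings, and release label are illustrative and the install script is assumed to have been uploaded to S3 first:
aws emr create-cluster --name "driven-monitored" --release-label emr-4.2.0 \
  --applications Name=Hadoop Name=Hive \
  --bootstrap-actions Path=s3://mybucket/install-driven-plugin.sh \
  --configurations http://files.concurrentinc.com/driven/2.1/hosted/driven-plugin/configurations-hive.json \
  --instance-type m3.xlarge --instance-count 3 --use-default-roles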