Driven Agent Guide

version 2.1.4

Getting Started

The Driven Agent is a collection of JVM libraries that enables monitoring of the following Hadoop applications in Driven:

Apache Hive
MapReduce

The agent is compatible with the following scheduler:

Apache Oozie

The Driven Agent is a Java agent wrapper for the Driven Plugin. There is one agent JAR file for Hive, and another JAR file for MapReduce.

All agent JAR files are bundled with the Driven Plugin JAR.

Note	To monitor only Cascading applications with Driven, the Driven Agent is not necessary. However, an installation of the Driven Plugin JAR is required. See Driven documentation for details.

Downloading the Driven Agent

Select the agent library that is applicable to your framework.

From a terminal/shell, execute one of the following commands.

# latest Hive agent bundle
> wget -i http://files.concurrentinc.com/driven/2.1/driven-agent/latest-driven-agent-hive-bundle.txt

# latest MapReduce agent bundle
> wget -i http://files.concurrentinc.com/driven/2.1/driven-agent/latest-driven-agent-mr-bundle.txt

Note	The `-i` switch downloads the `latest-…-bundle.txt` file, and then downloads the link in the file, which is always the latest release for the current version of Driven.
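
What `wget -i` does here can be sketched as two explicit steps, which may help when the download has to be scripted with a different tool. The helper name `first_url` is hypothetical, and the sketch assumes that the `.txt` file lists the bundle URL on a line of its own.

```shell
# first_url FILE: print the first line of FILE that looks like an http(s) URL.
# Assumption: the latest-...-bundle.txt file lists the JAR URL on its own line.
first_url() {
  grep -m 1 -E '^https?://' "$1"
}

# Example usage (requires network access, so it is left commented out):
#   curl -O http://files.concurrentinc.com/driven/2.1/driven-agent/latest-driven-agent-hive-bundle.txt
#   curl -O "$(first_url latest-driven-agent-hive-bundle.txt)"
```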

Installing the Driven Agent

Note for Apache Oozie users: Follow the installation documentation in Using Driven Agent with Apache Oozie instead of the following procedure.

The following steps assume that Hadoop applications are being launched from:

the command line, via bin/yarn jar … or bin/hadoop jar …
an interactive shell like Beeline when using Hive with a "thick" client
a long-running server like Hive Server, or an application server like Tomcat, JBoss, Spring, etc.

Note	Driven defines an application context as the JVM instance driving and orchestrating the client side of Hadoop applications. Each Hive query or MapReduce job appears as a single Unit of Work in that application. In a single application context, there can be thousands of queries. Each instance of the application entails a shutdown and restart.

Variables are used in many of the commands in the following sections:

[framework] stands for Hive (hive) or MapReduce (mr)
<version> stands for the current agent version

Agent Quick Start:

Step 1: Create a new directory named driven-agent in your home directory.

Step 2: Copy the downloaded installation JAR file into the driven-agent directory.

Step 3: Create a driven-agent.properties file with the appropriate settings for your environment. See the Configuring the Driven Agent section to properly configure both the drivenHosts and drivenAPIKey settings (if an API key is required).

Tip	Creating a different `driven-agent.properties` file for each unique application enables the display of application-specific values (like name and tags) in Driven and lets you assign applications to specific teams via the Driven team API key.

Step 4: In the current console or within a bash script, use either export HADOOP_OPTS or export YARN_CLIENT_OPTS (depending on your environment) to pass the options in the following command:

export YARN_CLIENT_OPTS="-javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar=optionsFile=driven-agent.properties"

Step 5: Run your application.

After installing the agent and running your application, log in to the Driven Server to see your application’s performance information.

Tip	The URL to the current application will be printed in the logs.

Note	Putting the agent on the runtime CLASSPATH will have no effect. Be sure to place the `-javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar` switch on the JVM command line before the application jar.
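
The quick start steps can be sketched as a short script. This is a minimal sketch under stated assumptions, not a definitive installer: the host name, API key, application name, and tags are placeholder values, and the properties file is assumed to use `key=value` lines named after the agent options described in Configuring the Driven Agent.

```shell
# Hypothetical values; replace them with the ones for your environment.
AGENT_DIR="${AGENT_DIR:-$HOME/driven-agent}"
AGENT_JAR="$AGENT_DIR/driven-agent-hive-bundle-<version>.jar"   # keep <version> as downloaded

# Step 1: create the directory (Step 2, copying the JAR into it, is left to you).
mkdir -p "$AGENT_DIR"

# Step 3: write a minimal properties file.
# Assumption: the file uses key=value lines named after the agent options.
cat > "$AGENT_DIR/driven-agent.properties" <<'EOF'
drivenHosts=host1:8080
drivenAPIKey=YOUR_API_KEY
appName=tps-report
appTags=cluster:prod,dept:engineering
EOF

# Step 4: pass the agent to the client JVM. An absolute optionsFile path avoids
# depending on the working directory of the JVM.
export YARN_CLIENT_OPTS="-javaagent:${AGENT_JAR}=optionsFile=${AGENT_DIR}/driven-agent.properties"
```

Generating the template with the agent JAR itself, as described in Configuring the Driven Agent, is the safer way to obtain the exact set of supported keys.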

Configuring the Driven Agent

The Driven Agent accepts various configuration options after the path to the Driven Agent JAR file.

java -javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar[=key1=value1;key2=value2,value3] <other java arguments>
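
The inline option syntax above separates options with semicolons and multiple values with commas. Because an unquoted semicolon ends a shell command, the whole switch has to be quoted. The following sketch assembles such a switch; the JAR path, host name, and the helper name `join_options` are hypothetical.

```shell
# Hypothetical JAR path; substitute the real location of the agent bundle.
AGENT_JAR="/path/to/driven-agent-hive-bundle.jar"

# join_options KEY=VALUE...: join all arguments with ';' (runs in a subshell
# so that the changed IFS does not leak into the calling shell).
join_options() (
  IFS=';'
  printf '%s' "$*"
)

AGENT_OPTS="$(join_options \
  "drivenHosts=host1:8080" \
  "appName=tps-report" \
  "appTags=cluster:prod,dept:engineering")"

# Quote the variable wherever it is used so the shell does not split on ';'.
JAVA_AGENT_ARG="-javaagent:${AGENT_JAR}=${AGENT_OPTS}"
printf '%s\n' "$JAVA_AGENT_ARG"
```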

Available agent options can be printed to the console by running the Driven Agent JAR with the following command:

java -jar /path/to/driven-agent-[framework]-bundle-<version>.jar

The agent also accepts a properties file via the optionsFile option. To generate a template file with defaults, run the following command (with a dash as the only argument):

java -jar /path/to/driven-agent-[framework]-bundle-<version>.jar - > driven-agent.properties

This creates a driven-agent.properties template in the current directory.

Tip	The file specified by `optionsFile` will be treated relative to the JVM current working directory, unless the path is absolute. If not found, the file will be relative to the Driven Agent JAR file location.
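
The lookup order described in this tip can be illustrated with a small shell function. It only mirrors the documented order (absolute path, then working directory, then JAR directory) and is not the agent's actual implementation; the function name `resolve_options_file` is hypothetical.

```shell
# resolve_options_file OPTIONS_FILE AGENT_JAR
# Prints the path that would be used, following the order described above.
resolve_options_file() {
  case "$1" in
    /*)
      # Absolute paths are used as given.
      printf '%s\n' "$1"
      return 0
      ;;
  esac
  if [ -f "$PWD/$1" ]; then
    # Relative path found in the current working directory.
    printf '%s\n' "$PWD/$1"
  else
    # Otherwise fall back to the directory that contains the agent JAR.
    printf '%s\n' "$(dirname "$2")/$1"
  fi
}
```

For example, `resolve_options_file driven-agent.properties /opt/driven/agent.jar` prints the copy in the current directory if one exists, and `/opt/driven/driven-agent.properties` otherwise.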

Note	Some of the following configuration options might not be available for all frameworks.

Agent-Specific Options

optionsFile: Specifies the file that provides option values for the Driven Agent. All values take precedence over the agent argument values. A relative path is resolved against the current working directory of the JVM. If the file is not found there, it is resolved against the directory of the agent’s JAR file.
agentDisableLogging: Disables all Driven Agent and Driven Plugin logging.
agentDebugLogging: Enables debug logging in the Driven Agent and the Driven Plugin.
agentExitOnlyOnJobCompletion: Forces the JVM to remain active until the monitored jobs complete, fail, or are killed. The appCompletionTimeout option is not supported. Default is TRUE.
agentKillJobsOnExit: Kills all running jobs when JVM is exited. Work is marked as STOPPED if System.exit is called when detaching the client.
agentLogSystemExitCalls: Enables logging of the stack trace making System and Runtime exit calls. The option also installs a custom SecurityManager if no other SecurityManager has been installed.
agentLogSubmitStackTrace: Enables logging of the stack trace making submit() calls to the cluster, which helps in diagnosing the root main class and function.

Plugin-Specific Options

drivenHosts: Specifies the server host names and ports where data is to be sent. Values should be entered in this format: host1:80,host2:8080. The http:// or https:// prefix may be placed before the host name.

Note	If you are using the Early Access Program (EAP) or the Hosted Trial, `drivenHosts` must be set to `https://driven.cascading.io/` or `https://trial.driven.io/`, respectively.

drivenAPIKey: Specifies the API key that is associated with application executions.

Note	If you are using the EAP or the Hosted Trial, `drivenAPIKey` must be set in order to see your applications in Driven after logging in. This requires an account, which you can get on the Driven Trial Options website.

drivenArchiveDir: Indicates the local directory where copies of transmitted data are to be stored.
drivenDisabled: Disables the sending of data to the Driven Server.
drivenSuppressSliceData: Disables sending slice-level data and detailed low-level performance visualizations; overrides server settings. This option can reduce network traffic, load on any history servers, and indexing latency.
drivenContinuousSliceData: Enables frequent updates of slice-level data before slice completion (update on completion is the default); overrides server settings. This option can increase network traffic, load on any history server, and indexing latency.

Note	Some platforms do not support retrieving intermediate results at this level.

drivenSuppressJavaCommandData: Disables sending command-line argument data; overrides server settings. This option prevents sending sensitive information that might appear on the command line.

Application-Specific Options

appName: Names an application. The default name is the JAR file name without version information.
appVersion: Specifies the version of an application. The default version is parsed from the JAR file name.
appTags: Assigns tags that should be associated with the application, for example: cluster:prod,dept:engineering
appCompletionTimeout: Specifies timeout (in milliseconds) to wait to send all completed application details before shutdown.
appFailedOnAnyUoWFail: Indicates that if any Unit of Work fails, then the application is marked as FAILED. The default is to mark an app as FAILED only if the last Unit of Work fails.
appFailedOnAnyUoWPending: Indicates that if any Unit of Work is not started, then the application is marked as FAILED. The default is to mark an app as FAILED only if the last Unit of Work does not start.

Unit of Work-Specific Options

uoWSanitizeStatements: Enables query statement sanitization. This option is enabled by default.

Driven Agent for MapReduce

The agent for MapReduce enables Driven to perform real-time monitoring of any application written as one or more MapReduce jobs.

For instructions on downloading and installing the Driven Agent, see Downloading the Driven Agent and Installing the Driven Agent. If you plan to use the agent with Apache Oozie, see the additional installation documentation in Using Driven Agent with Apache Oozie.

MapReduce Versions

The Driven Agent works for both the mapred.* and mapreduce.* APIs (sometimes known as the old and new APIs) on either Hadoop 1.x or 2.x releases.

Driven Agent for Hive

The Driven Agent for Hive is a JVM-level agent library that enables monitoring of Apache Hive queries in Driven. The agent runs in parallel with the main execution of Hive and sends telemetry data to a remote Driven server. Any type of Hive deployment is supported, such as the fat client (hive), the newer Beeline, HiveServer2, or Apache Oozie workflows containing Hive queries. Any application queries that are sent through JDBC or even ODBC can be monitored as well.

Hive Version Requirements

The Driven Agent for Hive can be used with any version of Hive newer than 0.13.0 if you are using the MapReduce execution engine. If you are using Hive with the Tez execution engine, you must use at least Hive 0.14.0 and Tez 0.5.2. In a Tez deployment, you must also ensure that the YARN Application Timeline Server (ATS) is properly configured. Hive works without ATS, but the Driven Agent requires a functioning ATS to monitor all resource usage. Refer to the Tez project documentation for how to properly configure ATS.

Known limitations

Hive Version	Problem	Solution
0.13.x	When creating a table based on a select query (e.g. `create table foo as select a, b from bar`) the output resources are not correctly identified.	Update Hive to at least 0.14.0
2.0.0	Hive 2.0.0 is currently not supported

HiveServer2 limitations

The Driven Agent for Hive monitors Units of Work that are executed by HiveServer2. Please note that the Driven Agent for Hive is not a replacement for the process monitoring of the HiveServer2 process itself. The Agent will detect and correctly report failures that occur within a Unit of Work, but it cannot detect problems with the general setup of HiveServer2 itself. Failure cases, such as an unavailable Metastore or incorrect permissions to access HDFS, are outside the realm of the Agent. The Agent assumes a working HiveServer2 deployment to monitor all queries run by a server.

The Agent will recognize a shutdown of the HiveServer2 process and will mark it as stopped in Driven. However, if HiveServer2 receives a SIGKILL signal (a.k.a. kill -9), the application will still appear as running in Driven. Since the process is killed by the kernel without allowing any clean-up to happen, the agent cannot record the final state of the application. You should make sure that the management scripts used to start and stop HiveServer2 allow for a proper shutdown and do not use the SIGKILL signal.
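
A stop routine that respects this advice might look like the following sketch. It sends SIGTERM, waits for the process to exit, and deliberately never escalates to SIGKILL. The function name `stop_gracefully` and the default timeout are hypothetical, and how you obtain the HiveServer2 process ID depends on your distribution.

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM and waits for the process to exit. It never sends SIGKILL,
# so the agent gets the chance to record the final application state.
stop_gracefully() {
  pid="$1"
  timeout="${2:-30}"

  # If the process is already gone, there is nothing to stop.
  kill -TERM "$pid" 2>/dev/null || return 0

  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "process $pid still running after ${timeout}s; not sending SIGKILL" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A non-zero return value signals that the process needs manual attention rather than a forced kill.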

Metadata Support

The Driven Agent for Hive enables Driven to recognize application metadata, such as name, version number, or tags, which can be sent with other telemetry data from an application. The following table shows the application-level properties supported by the agent:

Table 1. Properties for sending App metadata to Driven
Name	Example	Explanation
driven.app.name	driven.app.name=tps-report	Name of the application
driven.app.version	driven.app.version=1.1.5	Version of the application
driven.app.tags	driven.app.tags=cluster:prod,tps,dept:marketing	Comma-separated list of tags

Getting Started explained how these properties can be given on the agent command line. If that is not flexible enough for your use-case, the Driven Agent for Hive offers more options:

The properties can be set within a given HiveQL script via set-commands, an initialization file, or can be given on the command line. It is also possible to add them to the hive-site.xml file. With HiveServer2, you can also pass the properties as JDBC parameters. Basically any way you would normally send parameters to a Hive query is supported.

In addition to application-level metadata, the Driven Agent for Hive allows the user to set Unit of Work metadata.

Table 2. Properties for sending Unit of Work metadata to Driven
Name	Example	Explanation
driven.flow.name	driven.flow.name=tps-report	Name of the unit of work
driven.flow.description	driven.flow.description=Coversheet	Description of the unit of work

If no driven.flow.name property is set, the internal Hive Query Id is used.

Note	Setting the name or description will make it the name or description for all subsequent queries, so users should set it to a meaningful value as often as necessary.
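
As an illustration of the set-command approach, the following sketch writes a HiveQL script that sets the properties from Table 1 and Table 2 before running a query. The file location and the query are placeholders; the property values are the examples from the tables.

```shell
# Hypothetical location for the generated script.
HQL_FILE="${TMPDIR:-/tmp}/tps-report.hql"

# Set application and Unit of Work metadata before the query itself.
cat > "$HQL_FILE" <<'EOF'
set driven.app.name=tps-report;
set driven.app.version=1.1.5;
set driven.app.tags=cluster:prod,tps,dept:marketing;
set driven.flow.name=tps-report;
set driven.flow.description=Coversheet;
select a, b from bar;
EOF

# Run the script with the Hive client (left commented out; needs Hive):
#   hive -f "$HQL_FILE"
# The same properties could instead be passed on the command line, e.g.:
#   hive --hiveconf driven.app.name=tps-report -f "$HQL_FILE"
```

Because a name or description stays in effect for subsequent queries, a longer script would repeat the `set driven.flow.*` lines before each query that should appear under its own name.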

Using the Hive Agent Artifact (Unbundled Form)

For downloading the latest Driven Agent for Hive, follow the instructions in Getting Started.

Enable the agent by extending the HADOOP_OPTS environment variable before starting a hive fat client, an embedded Beeline client, or HiveServer2. Use the following command format:

export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-<version>.jar"

Note	You have to set the `HADOOP_OPTS` variable. Setting the `YARN_OPTS` variable, even on a YARN-based cluster, has no effect.

The agent must be installed and configured on the host where Hive queries are executed. In the case of the fat client, it is sufficient to set the environment variable in the shell where hive will be launched. The same applies to the newer Beeline client, when used without HiveServer2.

In case of a HiveServer2 deployment, the agent must be installed on the machine where the server is running. For the agent to work, the HADOOP_OPTS variable must be set in the environment where the server is running. Typically this involves modifying the startup script of HiveServer2. Some distributions ship with graphical cluster administration tools, with which you can customize a hive-env.sh script to administer the HiveServer2.
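
When the variable is set in a startup script that may be sourced more than once, it can help to extend `HADOOP_OPTS` without discarding existing options or adding the switch twice. The following is one possible sketch; the function name `add_driven_agent` and the JAR path are hypothetical.

```shell
# Hypothetical path; keep <version> as in the downloaded file name.
AGENT_JAR="/path/to/driven-agent-hive-<version>.jar"

# add_driven_agent: append the -javaagent switch to HADOOP_OPTS unless it is
# already present, keeping any options that were set before.
add_driven_agent() {
  case " ${HADOOP_OPTS:-} " in
    *" -javaagent:${AGENT_JAR} "*)
      ;;  # already configured, nothing to do
    *)
      HADOOP_OPTS="${HADOOP_OPTS:+${HADOOP_OPTS} }-javaagent:${AGENT_JAR}"
      ;;
  esac
  export HADOOP_OPTS
}

add_driven_agent
```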

Note	Each HiveServer2 instance appears as one long-running application in Driven from the time that the first query is executed on the server. When using Driven Agent for Hive, an application is defined as one JVM. As a result, queries that a HiveServer2 runs are displayed as processes of the same application in Driven.

Using the Hive Agent Bundle Artifact

The Hive agent bundle has the same functionality as the plain agent, but the bundle simplifies the installation and configuration of the agent in certain deployment scenarios. Users of Oozie should always use the Hive agent bundle. See Using Driven Agent with Apache Oozie for further information.

To download the latest Driven Agent for Hive, see the Getting Started documentation.

What Cluster Work Is Monitored

The Driven Agent for Hive monitors all queries that use resources (CPU and memory) on the cluster. Queries that do not use cluster resources in terms of computing power, even if they modify the state of the system, are currently not being tracked. Examples of queries that are not tracked are all DML statements and statements like select current_database() or use somedb.

Using Driven Agent with Apache Oozie

Apache Oozie is a popular workflow management solution in the Hadoop ecosystem. The workflow solution supports running a variety of different technologies. Oozie operates by executing computational tasks called actions. The actions are arranged in directed acyclic graphs (DAGs), which are referred to as workflows. The Driven Agent can be used to monitor the execution of HiveActions and MapReduceActions that are managed by Oozie.

Oozie uses a client-server architecture, but the Oozie server is not running any user code by itself. Instead, the server uses a LauncherMapper to drive each action in a given workflow.

The LauncherMapper is a single Map task, which is sent cluster-side and acts as the driving program for the action. Any node of the cluster can potentially be the machine that drives a given Hive query or runs a MapReduce job. Therefore, every machine in the cluster must have access to the Driven Agent for Hive and every machine must be able to communicate with the Driven Server. Your firewall rules should be set accordingly.

Note	Apache Oozie users should always install and configure the Hive agent bundle for Driven.

Driven Agent Bundle Configuration

Instead of installing the Driven Agent JAR files on every machine of the cluster, the Driven Agents can be installed in Oozie’s sharelib on HDFS:

Given a sharelib directory on HDFS of /user/oozie/share/lib/lib_20150721160609, the Driven Agent could be installed as follows:

> hadoop fs -mkdir /user/oozie/share/lib/lib_20150721160609/driven
> hadoop fs -copyFromLocal /path/to/driven-agent-[framework]-bundle-<version>.jar \
  /user/oozie/share/lib/lib_20150721160609/driven

Some distributions require a restart of the Oozie server after modifying the sharelib. Check the documentation of your distribution.
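
The timestamped sharelib directory differs between installations, so a script may need to look it up first. The following sketch picks the newest `lib_<timestamp>` entry from a directory listing. It assumes that the newest timestamped directory is the one your Oozie server uses, which you should verify for your distribution; the helper name `latest_sharelib` is hypothetical.

```shell
# latest_sharelib: read HDFS paths on stdin, print the newest lib_<timestamp> one.
# Assumption: directory names follow the lib_<timestamp> pattern shown above
# with equal-length timestamps, so a plain lexical sort orders them by time.
latest_sharelib() {
  grep -E '/lib_[0-9]+$' | sort | tail -n 1
}

# Example usage against a cluster (left commented out; needs a Hadoop client):
#   SHARELIB="$(hadoop fs -ls /user/oozie/share/lib | awk '{print $NF}' | latest_sharelib)"
#   hadoop fs -mkdir "$SHARELIB/driven"
```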

Now that the Driven Agent is available on HDFS, the agent must be configured on the global workflow or with single-action XML.

The following property sets the Java path for loading the agent. The JAR file name must match the one on HDFS.

<property>
    <name>oozie.launcher.mapred.child.java.opts</name>
    <value>-javaagent:$PWD/driven-agent-[framework]-bundle-<version>.jar</value>
</property>

The following property configures the Oozie Hive action to include JAR files from the hive and driven subdirectories of the currently active sharelib for HiveActions.

<property>
    <name>oozie.action.sharelib.for.hive</name>
    <value>hive,driven</value>
</property>

For MapReduceActions, the map-reduce directory may need to be created. Verify that the map-reduce directory exists in the sharelib. The configuration should look like this:

<property>
    <name>oozie.action.sharelib.for.map-reduce</name>
    <value>map-reduce,driven</value>
</property>

Finally, the following properties configure the Driven Server location and API key for the bundled agent to use. Depending on your deployment and needs, you can freely choose on which level to set these properties. Setting the properties on the workflow level enables the agent for all supported actions in that workflow. Setting the properties on the action level only enables them for that specific action.

<property>
    <name>cascading.management.document.service.hosts</name>
    <value>http://<hostname>:<port>/</value>
</property>
<property>
    <name>cascading.management.document.service.apikey</name>
    <value><apikey></value>
</property>

Advanced Installation

Advanced users may wish to script the Driven Agent installation, or use the Driven Agent with Amazon Elastic MapReduce.

Scripted Installation

For advanced users, the Driven Agent can be installed with the following script:

# to download the script
> wget http://files.concurrentinc.com/driven/2.1/driven-plugin/install-driven-plugin.sh

# for Hive installation
> bash install-driven-plugin.sh --hive --bundle

# for MapReduce installation
> bash install-driven-plugin.sh --mapreduce --bundle

Note	The `--bundle` switch forces only the bundled version of the agent to be downloaded and installed. Leaving off the switch downloads and installs an unbundled version of the agent along with the latest version of the Driven Plugin, which the Driven Agent requires.

Alternately, as a one-liner:

# for Hive installation
> export USE_AGENT_BUNDLE=true; export AGENT=hive; curl http://files.concurrentinc.com/driven/2.1/driven-plugin/install-driven-plugin.sh | sh

# for MapReduce installation
> export USE_AGENT_BUNDLE=true; export AGENT=mr; curl http://files.concurrentinc.com/driven/2.1/driven-plugin/install-driven-plugin.sh | sh

This script will create a .driven-plugin directory in the current user’s home directory, download the latest Driven Agent bundle JAR, and create a symbolic link, driven-agent-[framework]-bundle.jar, that references the latest downloaded version.

Re-running the script can be used to safely upgrade the agent.

Note	`driven-agent-[framework]-bundle.jar` is a unix symbolic link to the latest downloaded version of the agent jar file. This link is created or updated by the install script.

Amazon Elastic MapReduce

For Amazon Elastic MapReduce users, the install-driven-plugin.sh script, introduced above, doubles as a bootstrap action. The only addition is to also use the Amazon-provided configure-daemons (s3://elasticmapreduce/bootstrap-actions/configure-daemons) bootstrap action with the following arguments:

--client-opts=-javaagent:/home/hadoop/.driven-plugin/driven-agent-[framework]-bundle.jar

Replace [framework] with hive for Hive, and mr for MapReduce.

EMR 4.x

Amazon introduced a set of changes in EMR version 4.0 that directly affect how the Driven Agent is installed. One important change is that bootstrap actions can no longer modify the installation of Hadoop, since Hadoop is only deployed after all bootstrap actions have been executed. The new way of changing the Hadoop installation with user-defined settings is the application configuration feature.

Driven provides a set of configurations to be used with the different agents. On the command line, add the --configurations switch for your framework:

  --configurations http://files.concurrentinc.com/driven/2.1/hosted/driven-plugin/configurations-[framework].json