Driven Agent Guide

version 2.0.5

Getting Started

The Driven Agent is a collection of JVM libraries that enables monitoring of the following Hadoop applications in Driven:

  • Apache Hive

  • native MapReduce

The agent is compatible with the following scheduler:

  • Apache Oozie

The Driven Agent is a Java agent wrapper for the Driven Plugin. There is one agent JAR file for Hive, and another JAR file for MapReduce.

All agent JAR files are bundled with the Driven Plugin JAR.

Note
To monitor only Cascading applications with Driven, the Driven Agent is not necessary. However, an installation of the Driven Plugin JAR is required. See Driven documentation for details.

Downloading the Driven Agent

Select the agent library that is applicable to your framework.

From a terminal/shell, execute one of the following commands.

# latest Hive agent bundle
> wget -i http://files.concurrentinc.com/driven/2.0/driven-agent/latest-driven-agent-hive-bundle.txt

# latest MapReduce agent bundle
> wget -i http://files.concurrentinc.com/driven/2.0/driven-agent/latest-driven-agent-mr-bundle.txt
Note
The -i switch downloads the latest-...-bundle.txt file, and then downloads the link in the file, which is always the latest release for the current version of Driven.

Installing the Driven Agent

Note for Apache Oozie users: Follow the Using Driven Agent with Apache Oozie documentation instead of the following procedure.

The following steps assume that Hadoop applications are being launched from:

  • the command line, via bin/yarn jar or bin/hadoop jar

  • an interactive shell like Beeline when using Hive with a "thick" client

  • a long-running server like Hive Server, or an application server such as Tomcat, JBoss, or Spring

Note
Driven defines an application context as the JVM instance driving and orchestrating the client side of Hadoop applications. Each Hive query or MapReduce job appears as a single Unit of Work in that application. In a single application context, there can be thousands of queries. A new application instance begins only after the JVM shuts down and restarts.

Variables are used in many of the commands in the following sections:

  • [framework] stands for Hive (hive) or MapReduce (mr)

  • <version> stands for the current agent version

Agent Quick Start:

Step 1: Create a new directory named driven-agent in your home directory.

Step 2: Copy the downloaded installation JAR file into the driven-agent directory.

Step 3: Create a driven-agent.properties file with the appropriate settings for your environment. See the Configuring the Driven Agent section to properly configure the drivenHosts setting and, if an API key is required, the drivenAPIKey setting.

Tip
Creating a different driven-agent.properties file for each unique application enables the display of application-specific values (like name and tags) in Driven and lets you assign applications to specific teams via the Driven team API key.
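A minimal driven-agent.properties sketch might look like the following; every value below is a placeholder for your own environment, not a shipped default:

```properties
# driven-agent.properties -- minimal sketch; all values are placeholders
drivenHosts=https://driven.example.com:443
drivenAPIKey=YOUR_TEAM_API_KEY
appName=nightly-etl
appVersion=1.0.0
appTags=cluster:prod,dept:engineering
```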

Step 4: In the current console or within a bash script, use either export HADOOP_OPTS or export YARN_CLIENT_OPTS (depending on your environment) to pass the options in the following command:

export YARN_CLIENT_OPTS="-javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar=optionsFile=driven-agent.properties"

Step 5: Run your application.

After installing the agent and running your application, log in to the Driven Server to see your application’s performance information.

Tip
The URL to the current application will be printed in the logs.
Note
Putting the agent on the runtime CLASSPATH will have no effect. Be sure to place the -javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar switch on the JVM command line before the application jar.

Configuring the Driven Agent

The Driven Agent accepts various configuration options after the path to the Driven Agent JAR file.

java -javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar[=key1=value1;key2=value2,value3] <other java arguments>

Available agent options can be printed to the console by running the Driven Agent JAR with the following command:

java -jar /path/to/driven-agent-[framework]-bundle-<version>.jar

The agent also accepts a properties file via the optionsFile option. To generate a template file with defaults, run the following command (with a dash as the only argument):

java -jar /path/to/driven-agent-[framework]-bundle-<version>.jar - > driven-agent.properties

This creates a driven-agent.properties template in the current directory.

Tip
The file specified by optionsFile will be treated relative to the JVM current working directory, unless the path is absolute. If not found, the file will be relative to the Driven Agent JAR file location.
Note
Some of the following configuration options might not be available for all frameworks.

Agent-Specific Options

optionsFile

Specifies a file that provides option values for the Driven Agent. Values in the file take precedence over values given as agent arguments. The path is resolved relative to the JVM current working directory; if the file is not found there, it is resolved relative to the agent's JAR directory.

agentDisableLogging

Disables all Driven Agent and Driven Plugin logging.

agentDebugLogging

Enables debug logging in the Driven Agent and the Driven Plugin.

agentExitOnlyOnJobCompletion

Forces the JVM to remain active until the monitored jobs complete, fail, or are killed. When this option is enabled, the appCompletionTimeout option is not supported. The default is TRUE.

agentKillJobsOnExit

Kills all running jobs when the JVM exits. If System.exit is called while the client is detaching, the work is marked as STOPPED.

agentLogSystemExitCalls

Enables logging of the stack trace for any System.exit or Runtime.exit call. This option also installs a custom SecurityManager if no other SecurityManager has been installed.

agentLogSubmitStackTrace

Enables logging of the stack trace for submit() calls made to the cluster, which helps in diagnosing the root main class and function.

Plugin-Specific Options

drivenHosts

Specifies the server host names and ports where data is to be sent. Values should be entered in this format: host1:80,host2:8080. The http:// or https:// prefix may be placed before the host name.

Note
If you are using the Early Access Program (EAP) or the Hosted Trial, drivenHosts must be set to https://driven.cascading.io/ or https://trial.driven.io/, respectively.
drivenAPIKey

Specifies the API key that is associated with application executions.

Note
If you are using the EAP or the Hosted Trial, drivenAPIKey must be set in order to see your applications in Driven after logging in. This requires an account, which you can get on the Driven Trial Options website.
drivenArchiveDir

Indicates the local directory where copies of transmitted data are to be stored.

drivenDisabled

Disables the sending of data to the Driven Server.

drivenSuppressSliceData

Disables sending slice-level data and detailed low-level performance visualizations; overrides server settings. This option can reduce network traffic, load on any history servers, and indexing latency.

drivenContinuousSliceData

Enables frequent updates of slice-level data before slice completion (update on completion is the default); overrides server settings. This option can increase network traffic, load on any history server, and indexing latency.

Note
Some platforms do not support retrieving intermediate results at this level.
drivenSuppressJavaCommandData

Disables sending command-line argument data; overrides server settings. This option prevents sending sensitive information that might appear on the command line.

Application-Specific Options

appName

Names an application. The default name is the JAR file name without version information.

appVersion

Specifies the version of an application. The default version is parsed from the JAR file name.

appTags

Assigns tags that should be associated with the application, for example: cluster:prod,dept:engineering

appCompletionTimeout

Specifies timeout (in milliseconds) to wait to send all completed application details before shutdown.

appFailedOnAnyUoWFail

Indicates that if any Unit of Work fails, then the application is marked as FAILED. The default is to mark an app as FAILED only if the last Unit of Work fails.

appFailedOnAnyUoWPending

Indicates that if any Unit of Work is not started, then the application is marked as FAILED. The default is to mark an app as FAILED only if the last Unit of Work does not start.

Unit of Work-Specific Options

uoWSanitizeStatements

Enables query statement sanitization. This option is enabled by default.

uoWCaptureStatementTypes

Specifies the types of statements to capture. Valid values are DML, DDL, DCL, TCL, and UNKNOWN, in any combination, or ALL. The default is ALL.

Driven Agent for MapReduce

The agent for MapReduce enables Driven to perform real-time monitoring of any application written as one or more MapReduce jobs.

For instructions on downloading and installing the Driven Agent see sections on downloading and installing the agent. If you plan to use the agent with Apache Oozie, see the additional installation documentation in Using Driven Agent with Apache Oozie.

MapReduce Versions

The Driven Agent works for both the mapred.* and mapreduce.* APIs (sometimes known as the old and new APIs) on either Hadoop 1.x or 2.x releases.
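As a sketch, attaching the MapReduce agent bundle to a command-line launch might look like the following; the jar names, class, and paths are illustrative, not shipped defaults:

```shell
# Build the -javaagent switch for the MapReduce agent bundle.
# AGENT_JAR and the options-file path are example locations.
AGENT_JAR="$HOME/driven-agent/driven-agent-mr-bundle-2.0.5.jar"
export YARN_CLIENT_OPTS="-javaagent:${AGENT_JAR}=optionsFile=$HOME/driven-agent/driven-agent.properties"
echo "$YARN_CLIENT_OPTS"

# Then launch the job as usual, for example:
#   yarn jar wordcount.jar com.example.WordCount /input /output
```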

Driven Agent for Hive

The Driven Agent for Hive is a JVM-level agent library that enables monitoring of Apache Hive queries in Driven. The agent runs in parallel with the main execution of Hive and sends telemetry data to a remote Driven server. Any type of Hive deployment is supported, such as the fat client (hive), the newer Beeline with HiveServer2, or Apache Oozie workflows containing Hive queries. Furthermore, the agent can monitor any query made through JDBC or ODBC.

Hive Version Requirements

The Driven Agent for Hive can be used with any version of Hive from 0.14.0 onward. If you are using Hive with the Tez execution engine, you must use at least Tez 0.5.2, and you must also ensure that the YARN Application Timeline Server (ATS) is properly configured. Hive works without ATS, but the Driven Agent requires a functioning ATS to monitor all resource usage. Refer to the Tez project documentation for how to properly configure ATS.

Note
Hive on Apache Spark is currently not supported.

Metadata Support

The Driven Agent for Hive enables Driven to recognize application metadata, such as name, version number, or tags, which can be sent with the telemetry data from an application. The following table shows the application-level properties supported by the agent:

Table 1. Properties for sending app metadata to Driven

Name                 Example                                           Explanation
driven.app.name      driven.app.name=tps-report                        Name of the application
driven.app.version   driven.app.version=1.1.5                          Version of the application
driven.app.tags      driven.app.tags=cluster:prod,tps,dept:marketing   Comma-separated list of tags

If you do not name the application, the agent sets the name driven-agent-hive. If you are using HiveServer2, the default name is HiveServer2 [hostname].

The Getting Started section explains how properties can be given on the agent command line. If that is not flexible enough for your use case, the properties can also be set by any of the following methods:

  • HiveQL script: with set commands, an initialization file, or arguments on the command line.

  • hive-site.xml file

  • With HiveServer2, you can also pass the properties as JDBC connection parameters.

In short, any mechanism you would normally use to pass parameters to a Hive query is supported.
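For example, in a HiveQL script the metadata could be set before the queries it describes; the names, version, and tags below are illustrative:

```sql
-- Set application metadata for the queries that follow
-- (values here are examples, not defaults)
set driven.app.name=tps-report;
set driven.app.version=1.1.5;
set driven.app.tags=cluster:prod,dept:marketing;

-- queries issued after this point carry the metadata above
```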

In addition to application-level metadata, you can also pass Unit of Work metadata with the Driven Agent for Hive.

Table 2. Properties for sending Unit of Work metadata to Driven

Name                     Example                              Explanation
driven.flow.name         driven.flow.name=tps-report          Name of the unit of work
driven.flow.description  driven.flow.description=Coversheet   Description of the unit of work

If you do not set the driven.flow.name property, the internal Hive query ID is used.

Note
After you provide a name and description, subsequent queries inherit this metadata. Update the name and description as often as necessary so that each query carries meaningful information.

Using the Hive Agent Artifact (Unbundled Form)

For downloading the latest Driven Agent for Hive, follow the instructions in Getting Started.

Enable the agent by extending the HADOOP_OPTS environment variable before starting a hive fat client, an embedded Beeline client, or HiveServer2. Use the following command format:

export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-<version>.jar"
Note
You must set the HADOOP_OPTS variable. Setting the YARN_OPTS variable, even on a YARN-based cluster, has no effect.

The agent must be installed and configured on the host where Hive queries are executed. In the case of the fat client, it is sufficient to set the environment variable in the shell where hive will be launched. The same applies to the newer Beeline client, when used without HiveServer2.

For a HiveServer2 deployment, the agent must be installed on the machine where the server is running. For the agent to work, the HADOOP_OPTS variable must be set in the environment where the server is running. Typically this involves modifying the startup script of HiveServer2. Some distributions ship with graphical cluster administration tools, with which you can customize a hive-env.sh script to administer HiveServer2.
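For instance, a hive-env.sh customization for HiveServer2 might look like the following sketch; the /opt/driven install path, version, and options-file location are assumptions for illustration:

```shell
# hive-env.sh sketch: attach the agent before HiveServer2 starts.
# The install path and options-file location are examples.
export HADOOP_OPTS="$HADOOP_OPTS -javaagent:/opt/driven/driven-agent-hive-bundle-2.0.5.jar=optionsFile=/opt/driven/driven-agent.properties"
```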

Note
Each HiveServer2 instance appears as one long-running application in Driven from the time that the first query is executed on the server. When using Driven Agent for Hive, an application is defined as one JVM. As a result, queries that a HiveServer2 runs are displayed as processes of the same application in Driven.

Using the Hive Agent Bundle Artifact

The Hive agent bundle has the same functionality as the plain agent, but the bundle simplifies the installation and configuration of the agent in certain deployment scenarios. Users of Oozie should always use the Hive agent bundle. See Using Driven Agent with Apache Oozie for further information.

To download the latest Driven Agent for Hive, see the Getting Started documentation.

Query sanitization

The Driven Agent for Hive supports query sanitization. All INSERT queries are sent in redacted form (values removed) to the Driven Server. You can change this behavior by setting the uoWSanitizeStatements configuration property to false.

Choosing the Query Types to Monitor

By default, the Driven Agent for Hive monitors all SQL statements. This includes all DDL statements like CREATE TABLE or DROP TABLE, all DML statements like INSERT INTO, and even queries typically used in interactive sessions like SHOW TABLES. You can change this behavior by passing the statement types you want to monitor on the agent command line. The relevant parameter is uoWCaptureStatementTypes, which is a comma-separated list of statement types to capture. Allowed values are DML (Data Manipulation Language), DDL (Data Definition Language), DCL (Data Control Language), TCL (Transaction Control Language), UNKNOWN (used for unknown constructs), or ALL.
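As a sketch, restricting capture to DML and DDL statements on the agent command line could look like this; the agent path is a placeholder:

```shell
# Capture only DML and DDL statements; multiple values for one key are
# comma-separated per the key1=value1;key2=value2,value3 option syntax.
# The agent path is an example.
export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-bundle-2.0.5.jar=uoWCaptureStatementTypes=DML,DDL"
echo "$HADOOP_OPTS"
```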

The following table shows the mapping of Hive Operations to the categories introduced above.

Table 3. Hive operation to query type mapping

Hive Operation                    Query Type
ALTERDATABASE_OWNER               DDL
ALTERDATABASE                     DDL
ALTERINDEX_PROPS                  DDL
ALTERINDEX_REBUILD                DDL
ALTERPARTITION_BUCKETNUM          DDL
ALTERPARTITION_FILEFORMAT         DDL
ALTERPARTITION_LOCATION           DDL
ALTERPARTITION_MERGEFILES         DDL
ALTERPARTITION_PROTECTMODE        DDL
ALTERPARTITION_SERDEPROPERTIES    DDL
ALTERPARTITION_SERIALIZER         DDL
ALTERTABLE_ADDCOLS                DDL
ALTERTABLE_ADDPARTS               DDL
ALTERTABLE_ARCHIVE                DDL
ALTERTABLE_BUCKETNUM              DDL
ALTERTABLE_CLUSTER_SORT           DDL
ALTERTABLE_COMPACT                DDL
ALTERTABLE_DROPPARTS              DDL
ALTERTABLE_FILEFORMAT             DDL
ALTERTABLE_LOCATION               DDL
ALTERTABLE_MERGEFILES             DDL
ALTERTABLE_PARTCOLTYPE            DDL
ALTERTABLE_PROPERTIES             DDL
ALTERTABLE_PROTECTMODE            DDL
ALTERTABLE_RENAME                 DDL
ALTERTABLE_RENAMECOL              DDL
ALTERTABLE_RENAMEPART             DDL
ALTERTABLE_REPLACECOLS            DDL
ALTERTABLE_SERDEPROPERTIES        DDL
ALTERTABLE_SERIALIZER             DDL
ALTERTABLE_SKEWED                 DDL
ALTERTABLE_TOUCH                  DDL
ALTERTABLE_UNARCHIVE              DDL
ALTERTABLE_UPDATEPARTSTATS        DDL
ALTERTABLE_UPDATETABLESTATS       DDL
ALTERTBLPART_SKEWED_LOCATION      DDL
ALTERVIEW_AS                      DDL
ALTERVIEW_PROPERTIES              DDL
ALTERVIEW_RENAME                  DDL
ANALYZE_TABLE                     DDL
CREATEDATABASE                    DDL
CREATEFUNCTION                    DDL
CREATEINDEX                       DDL
CREATEMACRO                       DDL
CREATEROLE                        DCL
CREATETABLE_AS_SELECT             DML
CREATETABLE                       DDL
CREATEVIEW                        DDL
DESCDATABASE                      DDL
DESCFUNCTION                      DDL
DESCTABLE                         DDL
DROPDATABASE                      DDL
DROPFUNCTION                      DDL
DROPINDEX                         DDL
DROPMACRO                         DDL
DROPROLE                          DCL
DROPTABLE                         DDL
DROPVIEW_PROPERTIES               DDL
DROPVIEW                          DDL
EXPLAIN                           DDL
EXPORT                            DML
GRANT_PRIVILEGE                   DCL
GRANT_ROLE                        DCL
IMPORT                            DML
LOAD                              DML
LOCKDB                            DCL
LOCKTABLE                         DCL
MSCK                              DDL
QUERY                             DML
REVOKE_PRIVILEGE                  DCL
REVOKE_ROLE                       DCL
SHOW_COMPACTIONS                  DDL
SHOW_CREATETABLE                  DDL
SHOW_GRANT                        DCL
SHOW_ROLE_GRANT                   DDL
SHOW_ROLE_PRINCIPALS              DDL
SHOW_ROLES                        DDL
SHOW_TABLESTATUS                  DDL
SHOW_TBLPROPERTIES                DDL
SHOW_TRANSACTIONS                 DDL
SHOWCOLUMNS                       DDL
SHOWCONF                          DDL
SHOWDATABASES                     DDL
SHOWFUNCTIONS                     DDL
SHOWINDEXES                       DDL
SHOWLOCKS                         DDL
SHOWPARTITIONS                    DDL
SHOWTABLES                        DDL
SWITCHDATABASE                    DDL
TRUNCATETABLE                     DML
UNLOCKDB                          DCL
UNLOCKTABLE                       DCL

Using Driven Agent with Apache Oozie

Apache Oozie is a popular workflow management solution in the Hadoop ecosystem. Oozie operates by executing computational tasks, called actions, which can run a variety of technologies. The actions are arranged in directed acyclic graphs (DAGs), which are referred to as workflows. The Driven Agent can be used to monitor the execution of HiveActions and MapReduceActions that are managed by Oozie.

Oozie uses a client-server architecture, but the Oozie server does not run any user code itself. Instead, the server uses a LauncherMapper to drive each action in a given workflow.

The LauncherMapper is a single Map task, which is sent cluster-side and acts as the driving program for the action. Any node of the cluster can potentially be the machine that drives a given Hive query or runs a MapReduce job. Therefore, every machine in the cluster must have access to the Driven Agent for Hive and every machine must be able to communicate with the Driven Server. Your firewall rules should be set accordingly.

Note
Apache Oozie users should always install and configure the Hive agent bundle for Driven.

Driven Agent Bundle Configuration

Instead of installing the Driven Agent JAR files on every machine of the cluster, the Driven Agents can be installed in Oozie’s sharelib on HDFS:

Given a sharelib directory on HDFS of /user/oozie/share/lib/lib_20150721160609, the Driven Agent could be installed as follows:

> hadoop fs -mkdir /user/oozie/share/lib/lib_20150721160609/driven
> hadoop fs -copyFromLocal /path/to/driven-agent-<framework>-bundle-<version>.jar \
  /user/oozie/share/lib/lib_20150721160609/driven

Some distributions require a restart of the Oozie server after modifying the sharelib. Check the documentation of your distribution.

Now that the Driven Agent is available on HDFS, the agent must be configured in the workflow XML, either globally or per action.

The following property sets the Java path for loading the agent. The JAR file name must match the one on HDFS.

<property>
    <name>oozie.launcher.mapred.child.java.opts</name>
    <value>-javaagent:$PWD/driven-agent-<framework>-bundle-<version>.jar</value>
</property>

The following property configures Oozie Hive actions to include JAR files from the hive and driven subdirectories of the currently active sharelib.

<property>
    <name>oozie.action.sharelib.for.hive</name>
    <value>hive,driven</value>
</property>

For MapReduceActions, verify that the map-reduce directory exists in the sharelib; it may need to be created. The configuration should look like this:

<property>
    <name>oozie.action.sharelib.for.map-reduce</name>
    <value>map-reduce,driven</value>
</property>

Finally, the following properties configure the Driven Server location and API key for the bundled agent to use. Depending on your deployment and needs, you can freely choose on which level to set these properties. Setting the properties on the workflow level enables the agent for all supported actions in that workflow. Setting the properties on the action level only enables them for that specific action.

<property>
    <name>cascading.management.document.service.hosts</name>
    <value>http://<hostname>:<port>/</value>
</property>
<property>
    <name>cascading.management.document.service.apikey</name>
    <value><apikey></value>
</property>
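For example, at the workflow level the properties could be placed in the workflow's global configuration; the workflow name, host, and overall structure below are hypothetical:

```xml
<!-- Sketch of a workflow-level configuration; names and values are examples -->
<workflow-app name="tps-report-wf" xmlns="uri:oozie:workflow:0.5">
    <global>
        <configuration>
            <property>
                <name>cascading.management.document.service.hosts</name>
                <value>http://driven.example.com:8080/</value>
            </property>
            <property>
                <name>cascading.management.document.service.apikey</name>
                <value>YOUR_TEAM_API_KEY</value>
            </property>
        </configuration>
    </global>
    <!-- actions follow -->
</workflow-app>
```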

Advanced Installation

Advanced users may wish to script the Driven Agent installation, or use the Driven Agent with Amazon Elastic MapReduce.

Scripted Installation

For advanced users, the Driven Agent and the Driven Plugin can be installed with the following script:

# to download the script
> wget http://files.concurrentinc.com/driven/2.0/driven-plugin/install-driven-plugin.sh

# for Hive installation
> bash install-driven-plugin.sh --hive --bundle

# for MapReduce installation
> bash install-driven-plugin.sh --mapreduce --bundle
Note
The --bundle switch will force only the bundled version of the agent to be downloaded and installed. Leaving off the switch will download and install an un-bundled version of the agent and also the latest version of the Driven Plugin.

Alternately, as a one-liner:

# for Hive installation
> export USE_AGENT_BUNDLE=true; export AGENT=hive; curl http://files.concurrentinc.com/driven/2.0/driven-plugin/install-driven-plugin.sh | sh

# for MapReduce installation
> export USE_AGENT_BUNDLE=true; export AGENT=mr; curl http://files.concurrentinc.com/driven/2.0/driven-plugin/install-driven-plugin.sh | sh

This script will create a .driven-plugin directory in the current user’s home directory, download the latest Driven Agent bundle JAR, and create a symbolic link referencing the latest version of driven-agent-[framework]-bundle.jar.

Re-running the script can be used to safely upgrade the agent.

Note
driven-agent-[framework]-bundle.jar is a unix symbolic link to the latest downloaded version of the agent jar file. This link is created or updated by the install script.

Amazon Elastic MapReduce

For Amazon Elastic MapReduce users, the install-driven-plugin.sh script, introduced above, doubles as a bootstrap action. The only addition is to also use the Amazon-provided configure-daemons (s3://elasticmapreduce/bootstrap-actions/configure-daemons) bootstrap action with the following arguments:

--client-opts=-javaagent:/home/hadoop/.driven-plugin/driven-agent-[framework]-bundle.jar

Replace [framework] with hive for Hive, and mr for MapReduce.

EMR 4.x

Amazon introduced a set of changes in EMR version 4.0 that have a direct influence on how to install the Driven agent. One important change is that bootstrap actions can no longer modify the Hadoop installation, since Hadoop is deployed only after all bootstrap actions have executed. The new way of changing the Hadoop installation with user-defined settings is the application configuration feature.

Driven provides a set of configurations to be used with the different agents. On the command line, simply add the --configurations switch for your framework:

  --configurations http://files.concurrentinc.com/driven/2.0/hosted/driven-plugin/configurations-[framework].json