Driven Agent Guide

version 2.0.5

Getting Started

The Driven Agent is a collection of JVM libraries that enables monitoring of the following Hadoop applications in Driven:

  • Apache Hive

  • native MapReduce

The agent is compatible with the following scheduler:

  • Apache Oozie

The Driven Agent is a Java agent wrapper for the Driven Plugin. There is one agent JAR file for Hive, and another JAR file for MapReduce.

All agent JAR files are bundled with the Driven Plugin JAR.

Note
To monitor only Cascading applications with Driven, the Driven Agent is not necessary. However, an installation of the Driven Plugin JAR is required. See Driven documentation for details.

Downloading the Driven Agent

Select the agent library that is applicable to your framework.

From a terminal/shell, execute one of the following commands.

# latest Hive agent bundle
> wget -i http://files.concurrentinc.com/driven/2.0/driven-agent/latest-driven-agent-hive-bundle.txt

# latest MapReduce agent bundle
> wget -i http://files.concurrentinc.com/driven/2.0/driven-agent/latest-driven-agent-mr-bundle.txt
Note
The -i switch downloads the latest-...-bundle.txt file, and then downloads the link in the file, which is always the latest release for the current version of Driven.

Installing the Driven Agent

Note for Apache Oozie users: Follow the Using Driven Agent with Apache Oozie documentation instead of the following procedure.

The following steps assume that Hadoop applications are being launched from:

  • the command line, via bin/yarn jar or bin/hadoop jar

  • an interactive shell like Beeline when using Hive with a "thick" client

  • a long-running server like Hive Server, or an application server such as Tomcat, JBoss, or Spring

Note
Driven defines an application context as the JVM instance driving and orchestrating the client side of Hadoop applications. Each Hive query or MapReduce job appears as a single Unit of Work in that application. In a single application context, there can be thousands of queries. A new application instance begins only after the JVM shuts down and restarts.

Variables are used in many of the commands in the following sections:

  • [framework] stands for Hive (hive) or MapReduce (mr)

  • <version> stands for the current agent version

Agent Quick Start:

Step 1: Create a new directory named driven-agent in your home directory.

Step 2: Copy the downloaded installation JAR file into the driven-agent directory.

Step 3: Create a driven-agent.properties file with the appropriate settings for your environment. See the Configuring the Driven Agent section to properly configure the drivenHosts setting and, if an API key is required, the drivenAPIKey setting.

Tip
Creating a different driven-agent.properties file for each unique application enables the display of application-specific values (like name and tags) in Driven and lets you assign applications to specific teams via the Driven team API key.
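A minimal driven-agent.properties sketch might look like the following; every value below is a placeholder for your own environment, not a shipped default:

```properties
# driven-agent.properties -- minimal sketch; all values are placeholders
drivenHosts=https://driven.example.com:443
drivenAPIKey=YOUR_TEAM_API_KEY
appName=nightly-etl
appVersion=1.0.0
appTags=cluster:prod,dept:engineering
```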

Step 4: In the current console or within a bash script, use either export HADOOP_OPTS or export YARN_CLIENT_OPTS (depending on your environment) to pass the options in the following command:

export YARN_CLIENT_OPTS="-javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar=optionsFile=driven-agent.properties"

Step 5: Run your application.

After installing the agent and running your application, log in to the Driven Server to see your application’s performance information.

Tip
The URL to the current application will be printed in the logs.
Note
Putting the agent on the runtime CLASSPATH will have no effect. Be sure to place the -javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar switch on the JVM command line before the application jar.

Configuring the Driven Agent

The Driven Agent accepts various configuration options after the path to the Driven Agent JAR file.

java -javaagent:/path/to/driven-agent-[framework]-bundle-<version>.jar[=key1=value1;key2=value2,value3] <other java arguments>

Available agent options can be printed to the console by running the Driven Agent JAR with the following command:

java -jar /path/to/driven-agent-[framework]-bundle-<version>.jar

The agent also accepts a properties file via the optionsFile option. To generate a template file with defaults, run the following command (with a dash as the only argument):

java -jar /path/to/driven-agent-[framework]-bundle-<version>.jar - > driven-agent.properties

This creates a driven-agent.properties template in the current directory.

Tip
The file specified by optionsFile will be treated relative to the JVM current working directory, unless the path is absolute. If not found, the file will be relative to the Driven Agent JAR file location.
Note
Some of the following configuration options might not be available for all frameworks.

Agent-Specific Options

optionsFile

Specifies a file that provides option values for the Driven Agent. Values in the file take precedence over values given as agent arguments. The path is resolved relative to the JVM current working directory; if the file is not found there, it is resolved relative to the agent's JAR directory.

agentDisableLogging

Disables all Driven Agent and Driven Plugin logging.

agentDebugLogging

Enables debug logging in the Driven Agent and the Driven Plugin.

agentExitOnlyOnJobCompletion

Forces the JVM to remain active until the monitored jobs complete, fail, or are killed. When this option is enabled, the appCompletionTimeout option is not supported. The default is TRUE.

agentKillJobsOnExit

Kills all running jobs when the JVM exits. If System.exit is called while the client is detaching, the work is marked as STOPPED.

agentLogSystemExitCalls

Enables logging of the stack trace for any System.exit or Runtime.exit call. This option also installs a custom SecurityManager if no other SecurityManager has been installed.

agentLogSubmitStackTrace

Enables logging of the stack trace for submit() calls made to the cluster, which helps in diagnosing the root main class and function.

Plugin-Specific Options

drivenHosts

Specifies the server host names and ports where data is to be sent. Values should be entered in this format: host1:80,host2:8080. The http:// or https:// prefix may be placed before the host name.

Note
If you are using the Early Access Program (EAP) or the Hosted Trial, drivenHosts must be set to https://driven.cascading.io/ or https://trial.driven.io/, respectively.
drivenAPIKey

Specifies the API key that is associated with application executions.

Note
If you are using the EAP or the Hosted Trial, drivenAPIKey must be set in order to see your applications in Driven after logging in. This requires an account, which you can get on the Driven Trial Options website.
drivenArchiveDir

Indicates the local directory where copies of transmitted data are to be stored.

drivenDisabled

Disables the sending of data to the Driven Server.

drivenSuppressSliceData

Disables sending slice-level data and detailed low-level performance visualizations; overrides server settings. This option can reduce network traffic, load on any history servers, and indexing latency.

drivenContinuousSliceData

Enables frequent updates of slice-level data before slice completion (update on completion is the default); overrides server settings. This option can increase network traffic, load on any history server, and indexing latency.

Note
Some platforms do not support retrieving intermediate results at this level.
drivenSuppressJavaCommandData

Disables sending command-line argument data; overrides server settings. This option prevents sending sensitive information that might appear on the command line.

Application-Specific Options

appName

Names an application. The default name is the JAR file name without version information.

appVersion

Specifies the version of an application. The default version is parsed from the JAR file name.

appTags

Assigns tags that should be associated with the application, for example: cluster:prod,dept:engineering

appCompletionTimeout

Specifies timeout (in milliseconds) to wait to send all completed application details before shutdown.

appFailedOnAnyUoWFail

Indicates that if any Unit of Work fails, then the application is marked as FAILED. The default is to mark an app as FAILED only if the last Unit of Work fails.

appFailedOnAnyUoWPending

Indicates that if any Unit of Work is not started, then the application is marked as FAILED. The default is to mark an app as FAILED only if the last Unit of Work does not start.

Unit of Work-Specific Options

uoWSanitizeStatements

Enables query statement sanitization. This option is enabled by default.

uoWCaptureStatementTypes

Specifies the types of statements to capture. Valid values are DML, DDL, DCL, TCL, and UNKNOWN, in any combination, or ALL. The default is ALL.

Driven Agent for MapReduce

The agent for MapReduce enables Driven to perform real-time monitoring of any application written as one or more MapReduce jobs.

For instructions on downloading and installing the Driven Agent see sections on downloading and installing the agent. If you plan to use the agent with Apache Oozie, see the additional installation documentation in Using Driven Agent with Apache Oozie.

MapReduce Versions

The Driven Agent works for both the mapred.* and mapreduce.* APIs (sometimes known as the old and new APIs) on either Hadoop 1.x or 2.x releases.
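As a sketch, attaching the MapReduce agent bundle to a command-line launch might look like the following; the jar names, class, and paths are illustrative, not shipped defaults:

```shell
# Build the -javaagent switch for the MapReduce agent bundle.
# AGENT_JAR and the options-file path are example locations.
AGENT_JAR="$HOME/driven-agent/driven-agent-mr-bundle-2.0.5.jar"
export YARN_CLIENT_OPTS="-javaagent:${AGENT_JAR}=optionsFile=$HOME/driven-agent/driven-agent.properties"
echo "$YARN_CLIENT_OPTS"

# Then launch the job as usual, for example:
#   yarn jar wordcount.jar com.example.WordCount /input /output
```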

Driven Agent for Hive

The Driven Agent for Hive is a JVM-level agent library that enables monitoring of Apache Hive queries in Driven. The agent runs in parallel with the main execution of Hive and sends telemetry data to a remote Driven server. Any type of Hive deployment is supported, such as the fat client (hive), the newer Beeline with HiveServer2, or Apache Oozie workflows containing Hive queries. Furthermore, the agent can monitor any query made through JDBC or ODBC.

Hive Version Requirements

The Driven Agent for Hive can be used with any version of Hive from 0.14.0 onward. If you are using Hive with the Tez execution engine, you must use at least Tez 0.5.2, and you must also ensure that the YARN Application Timeline Server (ATS) is properly configured. Hive works without ATS, but the Driven Agent requires a functioning ATS to monitor all resource usage. Refer to the Tez project documentation for how to properly configure ATS.

Note
Hive on Apache Spark is currently not supported.

Metadata Support

The Driven Agent for Hive enables Driven to recognize application metadata, such as name, version number, or tags, which can be sent with the telemetry data from an application. The following table shows the application-level properties supported by the agent:

Table 1. Properties for sending app metadata to Driven

Name                 Example                                           Explanation
driven.app.name      driven.app.name=tps-report                        Name of the application
driven.app.version   driven.app.version=1.1.5                          Version of the application
driven.app.tags      driven.app.tags=cluster:prod,tps,dept:marketing   Comma-separated list of tags

If you do not name the application, the agent sets the name driven-agent-hive. If you are using HiveServer2, the default name is HiveServer2 [hostname].

The Getting Started section explains how properties can be given on the agent command line. If that is not flexible enough for your use case, the properties can also be set by any of the following methods:

  • HiveQL script: with set commands, an initialization file, or arguments on the command line.

  • hive-site.xml file

  • With HiveServer2, you can also pass the properties as JDBC connection parameters.

In short, any mechanism you would normally use to pass parameters to a Hive query is supported.
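For example, in a HiveQL script the metadata could be set before the queries it describes; the names, version, and tags below are illustrative:

```sql
-- Set application metadata for the queries that follow
-- (values here are examples, not defaults)
set driven.app.name=tps-report;
set driven.app.version=1.1.5;
set driven.app.tags=cluster:prod,dept:marketing;

-- queries issued after this point carry the metadata above
```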

In addition to application-level metadata, you can also pass Unit of Work metadata with the Driven Agent for Hive.

Table 2. Properties for sending Unit of Work metadata to Driven

Name                     Example                              Explanation
driven.flow.name         driven.flow.name=tps-report          Name of the unit of work
driven.flow.description  driven.flow.description=Coversheet   Description of the unit of work

If you do not set the driven.flow.name property, the internal Hive query ID is used.

Note
After you provide a name and description, subsequent queries inherit this metadata. Update the name and description as often as necessary so that each query carries meaningful information.

Using the Hive Agent Artifact (Unbundled Form)

For downloading the latest Driven Agent for Hive, follow the instructions in Getting Started.

Enable the agent by extending the HADOOP_OPTS environment variable before starting a hive fat client, an embedded Beeline client, or HiveServer2. Use the following command format:

export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-<version>.jar"
Note
You must set the HADOOP_OPTS variable. Setting the YARN_OPTS variable, even on a YARN-based cluster, has no effect.

The agent must be installed and configured on the host where Hive queries are executed. In the case of the fat client, it is sufficient to set the environment variable in the shell where hive will be launched. The same applies to the newer Beeline client, when used without HiveServer2.

For a HiveServer2 deployment, the agent must be installed on the machine where the server is running. For the agent to work, the HADOOP_OPTS variable must be set in the environment where the server is running. Typically this involves modifying the startup script of HiveServer2. Some distributions ship with graphical cluster administration tools, with which you can customize a hive-env.sh script to administer HiveServer2.
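For instance, a hive-env.sh customization for HiveServer2 might look like the following sketch; the /opt/driven install path, version, and options-file location are assumptions for illustration:

```shell
# hive-env.sh sketch: attach the agent before HiveServer2 starts.
# The install path and options-file location are examples.
export HADOOP_OPTS="$HADOOP_OPTS -javaagent:/opt/driven/driven-agent-hive-bundle-2.0.5.jar=optionsFile=/opt/driven/driven-agent.properties"
```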

Note
Each HiveServer2 instance appears as one long-running application in Driven from the time that the first query is executed on the server. When using Driven Agent for Hive, an application is defined as one JVM. As a result, queries that a HiveServer2 runs are displayed as processes of the same application in Driven.

Using the Hive Agent Bundle Artifact

The Hive agent bundle has the same functionality as the plain agent, but the bundle simplifies the installation and configuration of the agent in certain deployment scenarios. Users of Oozie should always use the Hive agent bundle. See Using Driven Agent with Apache Oozie for further information.

To download the latest Driven Agent for Hive, see the Getting Started documentation.

Query sanitization

The Driven Agent for Hive supports query sanitization. All INSERT queries are sent in redacted form (values removed) to the Driven Server. You can change this behavior by setting the uoWSanitizeStatements configuration property to false.

Choosing the Query Types to Monitor

By default, the Driven Agent for Hive monitors all SQL statements. This includes all DDL statements like CREATE TABLE or DROP TABLE, all DML statements like INSERT INTO, and even queries typically used in interactive sessions like SHOW TABLES. You can change this behavior by passing the statement types you want to monitor on the agent command line. The relevant parameter is uoWCaptureStatementTypes, which is a comma-separated list of statement types to capture. Allowed values are DML (Data Manipulation Language), DDL (Data Definition Language), DCL (Data Control Language), TCL (Transaction Control Language), UNKNOWN (used for unknown constructs), or ALL.
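As a sketch, restricting capture to DML and DDL statements on the agent command line could look like this; the agent path is a placeholder:

```shell
# Capture only DML and DDL statements; multiple values for one key are
# comma-separated per the key1=value1;key2=value2,value3 option syntax.
# The agent path is an example.
export HADOOP_OPTS="-javaagent:/path/to/driven-agent-hive-bundle-2.0.5.jar=uoWCaptureStatementTypes=DML,DDL"
echo "$HADOOP_OPTS"
```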

The following table shows the mapping of Hive Operations to the categories introduced above.

Table 3. Hive operation to query type mapping

Hive Operation                    Query Type
ALTERDATABASE_OWNER               DDL
ALTERDATABASE                     DDL
ALTERINDEX_PROPS                  DDL
ALTERINDEX_REBUILD                DDL
ALTERPARTITION_BUCKETNUM          DDL
ALTERPARTITION_FILEFORMAT         DDL
ALTERPARTITION_LOCATION           DDL
ALTERPARTITION_MERGEFILES         DDL
ALTERPARTITION_PROTECTMODE        DDL
ALTERPARTITION_SERDEPROPERTIES    DDL
ALTERPARTITION_SERIALIZER         DDL
ALTERTABLE_ADDCOLS                DDL
ALTERTABLE_ADDPARTS               DDL
ALTERTABLE_ARCHIVE                DDL
ALTERTABLE_BUCKETNUM              DDL
ALTERTABLE_CLUSTER_SORT           DDL
ALTERTABLE_COMPACT                DDL
ALTERTABLE_DROPPARTS              DDL
ALTERTABLE_FILEFORMAT             DDL
ALTERTABLE_LOCATION               DDL
ALTERTABLE_MERGEFILES             DDL
ALTERTABLE_PARTCOLTYPE            DDL
ALTERTABLE_PROPERTIES             DDL
ALTERTABLE_PROTECTMODE            DDL
ALTERTABLE_RENAME                 DDL
ALTERTABLE_RENAMECOL              DDL
ALTERTABLE_RENAMEPART             DDL
ALTERTABLE_REPLACECOLS            DDL
ALTERTABLE_SERDEPROPERTIES        DDL
ALTERTABLE_SERIALIZER             DDL
ALTERTABLE_SKEWED                 DDL
ALTERTABLE_TOUCH                  DDL
ALTERTABLE_UNARCHIVE              DDL
ALTERTABLE_UPDATEPARTSTATS        DDL
ALTERTABLE_UPDATETABLESTATS       DDL
ALTERTBLPART_SKEWED_LOCATION      DDL
ALTERVIEW_AS                      DDL
ALTERVIEW_PROPERTIES              DDL
ALTERVIEW_RENAME                  DDL
ANALYZE_TABLE                     DDL
CREATEDATABASE                    DDL
CREATEFUNCTION                    DDL
CREATEINDEX                       DDL
CREATEMACRO                       DDL
CREATEROLE                        DCL
CREATETABLE_AS_SELECT             DML
CREATETABLE                       DDL
CREATEVIEW                        DDL
DESCDATABASE                      DDL
DESCFUNCTION                      DDL
DESCTABLE                         DDL
DROPDATABASE                      DDL
DROPFUNCTION                      DDL
DROPINDEX                         DDL
DROPMACRO                         DDL
DROPROLE                          DCL
DROPTABLE                         DDL
DROPVIEW_PROPERTIES               DDL
DROPVIEW                          DDL
EXPLAIN                           DDL
EXPORT                            DML
GRANT_PRIVILEGE                   DCL
GRANT_ROLE                        DCL
IMPORT                            DML
LOAD                              DML
LOCKDB                            DCL
LOCKTABLE                         DCL
MSCK                              DDL
QUERY                             DML
REVOKE_PRIVILEGE                  DCL
REVOKE_ROLE                       DCL
SHOW_COMPACTIONS                  DDL
SHOW_CREATETABLE                  DDL
SHOW_GRANT                        DCL
SHOW_ROLE_GRANT                   DDL
SHOW_ROLE_PRINCIPALS              DDL
SHOW_ROLES                        DDL
SHOW_TABLESTATUS                  DDL
SHOW_TBLPROPERTIES                DDL
SHOW_TRANSACTIONS                 DDL
SHOWCOLUMNS                       DDL
SHOWCONF                          DDL
SHOWDATABASES                     DDL
SHOWFUNCTIONS                     DDL
SHOWINDEXES                       DDL
SHOWLOCKS                         DDL
SHOWPARTITIONS                    DDL
SHOWTABLES                        DDL
SWITCHDATABASE                    DDL
TRUNCATETABLE                     DML
UNLOCKDB                          DCL
UNLOCKTABLE                       DCL

Using Driven Agent with Apache Oozie

Apache Oozie is a popular workflow management solution in the Hadoop ecosystem. Oozie operates by executing computational tasks, called actions, which can run a variety of technologies. The actions are arranged in directed acyclic graphs (DAGs), which are referred to as workflows. The Driven Agent can be used to monitor the execution of HiveActions and MapReduceActions that are managed by Oozie.

Oozie uses a client-server architecture, but the Oozie server does not run any user code itself. Instead, the server uses a LauncherMapper to drive each action in a given workflow.

The LauncherMapper is a single Map task, which is sent cluster-side and acts as the driving program for the action. Any node of the cluster can potentially be the machine that drives a given Hive query or runs a MapReduce job. Therefore, every machine in the cluster must have access to the Driven Agent for Hive and every machine must be able to communicate with the Driven Server. Your firewall rules should be set accordingly.

Note
Apache Oozie users should always install and configure the Hive agent bundle for Driven.

Driven Agent Bundle Configuration

Instead of installing the Driven Agent JAR files on every machine of the cluster, the Driven Agents can be installed in Oozie’s sharelib on HDFS:

Given a sharelib directory on HDFS of /user/oozie/share/lib/lib_20150721160609, the Driven Agent could be installed as follows:

> hadoop fs -mkdir /user/oozie/share/lib/lib_20150721160609/driven
> hadoop fs -copyFromLocal /path/to/driven-agent-<framework>-bundle-<version>.jar \
  /user/oozie/share/lib/lib_20150721160609/driven

Some distributions require a restart of the Oozie server after modifying the sharelib. Check the documentation of your distribution.

Now that the Driven Agent is available on HDFS, the agent must be configured in the workflow XML, either globally or per action.

The following property sets the Java path for loading the agent. The JAR file name must match the one on HDFS.

<property>
    <name>oozie.launcher.mapred.child.java.opts</name>
    <value>-javaagent:$PWD/driven-agent-<framework>-bundle-<version>.jar</value>
</property>

The following property configures Oozie Hive actions to include JAR files from the hive and driven subdirectories of the currently active sharelib.

<property>
    <name>oozie.action.sharelib.for.hive</name>
    <value>hive,driven</value>
</property>

For MapReduceActions, verify that the map-reduce directory exists in the sharelib; it may need to be created. The configuration should look like this:

<property>
    <name>oozie.action.sharelib.for.map-reduce</name>
    <value>map-reduce,driven</value>
</property>

Finally, the following properties configure the Driven Server location and API key for the bundled agent to use. Depending on your deployment and needs, you can freely choose on which level to set these properties. Setting the properties on the workflow level enables the agent for all supported actions in that workflow. Setting the properties on the action level only enables them for that specific action.

<property>
    <name>cascading.management.document.service.hosts</name>
    <value>http://<hostname>:<port>/</value>
</property>
<property>
    <name>cascading.management.document.service.apikey</name>
    <value><apikey></value>
</property>
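For example, at the workflow level the properties could be placed in the workflow's global configuration; the workflow name, host, and overall structure below are hypothetical:

```xml
<!-- Sketch of a workflow-level configuration; names and values are examples -->
<workflow-app name="tps-report-wf" xmlns="uri:oozie:workflow:0.5">
    <global>
        <configuration>
            <property>
                <name>cascading.management.document.service.hosts</name>
                <value>http://driven.example.com:8080/</value>
            </property>
            <property>
                <name>cascading.management.document.service.apikey</name>
                <value>YOUR_TEAM_API_KEY</value>
            </property>
        </configuration>
    </global>
    <!-- actions follow -->
</workflow-app>
```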

Advanced Installation

Advanced users may wish to script the Driven Agent installation, or use the Driven Agent with Amazon Elastic MapReduce.

Scripted Installation

For advanced users, the Driven Agent and the Driven Plugin can be installed with the following script:

# to download the script
> wget http://files.concurrentinc.com/driven/2.0/driven-plugin/install-driven-plugin.sh

# for Hive installation
> bash install-driven-plugin.sh --hive --bundle

# for MapReduce installation
> bash install-driven-plugin.sh --mapreduce --bundle
Note
The --bundle switch will force only the bundled version of the agent to be downloaded and installed. Leaving off the switch will download and install an un-bundled version of the agent and also the latest version of the Driven Plugin.

Alternately, as a one-liner:

# for Hive installation
> export USE_AGENT_BUNDLE=true; export AGENT=hive; curl http://files.concurrentinc.com/driven/2.0/driven-plugin/install-driven-plugin.sh | sh

# for MapReduce installation
> export USE_AGENT_BUNDLE=true; export AGENT=mr; curl http://files.concurrentinc.com/driven/2.0/driven-plugin/install-driven-plugin.sh | sh

This script will create a .driven-plugin directory in the current user’s home directory, download the latest Driven Agent bundle JAR, and create a symbolic link referencing the latest version of driven-agent-[framework]-bundle.jar.

Re-running the script can be used to safely upgrade the agent.

Note
driven-agent-[framework]-bundle.jar is a unix symbolic link to the latest downloaded version of the agent jar file. This link is created or updated by the install script.

Amazon Elastic MapReduce

For Amazon Elastic MapReduce users, the install-driven-plugin.sh script, introduced above, doubles as a bootstrap action. The only addition is to also use the Amazon-provided configure-daemons (s3://elasticmapreduce/bootstrap-actions/configure-daemons) bootstrap action with the following arguments:

--client-opts=-javaagent:/home/hadoop/.driven-plugin/driven-agent-[framework]-bundle.jar

Replace [framework] with hive for Hive, and mr for MapReduce.

EMR 4.x

Amazon introduced a set of changes in EMR version 4.0 that have a direct influence on how to install the Driven agent. One important change is that bootstrap actions can no longer modify the Hadoop installation, since Hadoop is deployed only after all bootstrap actions have executed. The new way of changing the Hadoop installation with user-defined settings is the application configuration feature.

Driven provides a set of configurations to be used with the different agents. On the command line, simply add the --configurations switch for your framework:

  --configurations http://files.concurrentinc.com/driven/2.0/hosted/driven-plugin/configurations-[framework].json