Driven Agent Guide: Using Driven Agent with Apache Oozie

version 2.2.6

Using Driven Agent with Apache Oozie

Apache Oozie is a popular workflow manager in the Hadoop ecosystem that can run jobs built on a variety of technologies. Oozie executes computational tasks called actions, which are arranged in directed acyclic graphs (DAGs) called workflows. The Driven Agent can monitor the execution of HiveActions and MapReduceActions that Oozie manages.

Oozie uses a client-server architecture, but the Oozie server does not run any user code itself. Instead, the server uses a LauncherMapper to drive each action in a given workflow.

The LauncherMapper is a single Map task, which is sent cluster-side and acts as the driving program for the action. Any node of the cluster can potentially be the machine that drives a given Hive query or runs a MapReduce job. Therefore, every machine in the cluster must have access to the Driven Agent for Hive and every machine must be able to communicate with the Driven Server. Your firewall rules should be set accordingly.

Driven Agent Configuration

Instead of installing the Driven Agent JAR files on every machine of the cluster, you can install them once in Oozie’s sharelib on HDFS.

Given a sharelib directory on HDFS of /user/oozie/share/lib/lib_20150721160609, the Driven Agent could be installed as follows:

$ hadoop fs -mkdir /user/oozie/share/lib/lib_20150721160609/driven
$ hadoop fs -copyFromLocal /path/to/driven-agent-<framework>-<version>.jar \
  /user/oozie/share/lib/lib_20150721160609/driven

Some distributions require a restart of the Oozie server after modifying the sharelib. Check the documentation of your distribution.

Now that the Driven Agent is available on HDFS, the agent must be configured either globally for the workflow or in the XML of a single action.

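The following property passes a -javaagent option to the JVM of the launcher task so that the agent is loaded when the action starts. Here $PWD refers to the working directory of the launcher task, where the sharelib JAR files are made available. The JAR file name must match the one on HDFS.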

<property>
    <name>oozie.launcher.mapred.child.java.opts</name>
    <value>-javaagent:$PWD/driven-agent-<framework>-<version>.jar</value>
</property>

The following property makes the Oozie Hive action include the JAR files from the hive and driven subdirectories of the currently active sharelib.

<property>
    <name>oozie.action.sharelib.for.hive</name>
    <value>hive,driven</value>
</property>

For MapReduceActions, verify that a map-reduce directory exists in the sharelib and create it if necessary. The configuration should look like this:

<property>
    <name>oozie.action.sharelib.for.map-reduce</name>
    <value>map-reduce,driven</value>
</property>
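
To show where these settings live, here is a minimal sketch of a MapReduce action that carries the launcher and sharelib properties in its own configuration block. The action name, the transitions, and the ${jobTracker}, ${nameNode}, and ${drivenAgentJar} parameters are assumptions made for illustration (they would typically be defined in the job properties), and the mapper, reducer, input, and output settings a real job needs are left out.

```xml
<!-- Sketch only: action name, transitions and ${...} parameters are hypothetical -->
<action name="mr-with-driven">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Load the agent into the launcher JVM; the JAR name must match the one on HDFS -->
            <property>
                <name>oozie.launcher.mapred.child.java.opts</name>
                <value>-javaagent:$PWD/${drivenAgentJar}</value>
            </property>
            <!-- Pull in the driven sharelib directory alongside map-reduce -->
            <property>
                <name>oozie.action.sharelib.for.map-reduce</name>
                <value>map-reduce,driven</value>
            </property>
            <!-- mapper, reducer, input and output properties omitted -->
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

Because the properties sit inside this one action, other actions in the same workflow are unaffected.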

Finally, the following properties tell the agent where the Driven Server is located and which API key to use. Depending on your deployment, you can set them at either the workflow level or the action level. Setting them at the workflow level enables the agent for all supported actions in that workflow; setting them at the action level enables the agent only for that action.

<property>
    <name>driven.management.document.service.hosts</name>
    <value>http://<hostname>:<port>/</value>
</property>
<property>
    <name>driven.management.document.service.apikey</name>
    <value><apikey></value>
</property>
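
As an illustration of the workflow-level option, the sketch below places the Driven Server properties in a global section and keeps the Hive-specific sharelib setting on the action itself. The workflow name, action name, script name, schema versions, and the ${drivenHosts}, ${drivenApiKey}, ${jobTracker}, and ${nameNode} parameters are assumptions, not values from this guide. Whether global settings reach a given action type can depend on the Oozie version and action schema, so treat this as a starting point and check it against your distribution.

```xml
<!-- Sketch only: names, schema versions and ${...} parameters are hypothetical -->
<workflow-app name="driven-example" xmlns="uri:oozie:workflow:0.4">
    <global>
        <configuration>
            <!-- Workflow level: intended to apply to every supported action -->
            <property>
                <name>driven.management.document.service.hosts</name>
                <value>${drivenHosts}</value>
            </property>
            <property>
                <name>driven.management.document.service.apikey</name>
                <value>${drivenApiKey}</value>
            </property>
        </configuration>
    </global>
    <start to="hive-query"/>
    <action name="hive-query">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- Action level: applies to this Hive action only -->
                <property>
                    <name>oozie.action.sharelib.for.hive</name>
                    <value>hive,driven</value>
                </property>
            </configuration>
            <script>query.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Moving the two Driven Server properties from the global section into the action's configuration block would restrict monitoring to that single action.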