Driven Administrator Guide

version 1.2-eap-11

Planning a Driven Deployment

Understanding the Deployment Architecture

A Driven installation consists of two parts: a plugin installed on the host where Cascading applications run, and the Driven Server, which runs in Apache Tomcat or another servlet container. An Elasticsearch datastore provides the persistence layer for the Driven application.

Figure 1: Driven deployment architecture

The Driven Server is packaged as a web application archive (WAR) file in a stand-alone Tomcat distribution. The Driven application is designed for fault tolerance and scalability: it supports redundancy in the persistence layer and provides backup and restore features.

When your Cascading application runs, the underlying framework builds a rich state model to execute the flow on the Hadoop cluster. Building this model includes taking higher-level Cascading primitives (see the Cascading User Guide) and mapping them to constructs available on the underlying computational fabric (MapReduce, local in-memory mode, and so on) using a sophisticated pattern-matching rules engine.

The Driven Plugin collects this information and sends it to the Driven Server for visualization. In addition, the Driven Plugin collects metadata about each slice (a unit of work such as a mapper or a reducer) from the Hadoop NameNode. Driven analyzes and correlates the resulting statistics.

There are several implications as a result of this architecture:

Your Cascading application does not appear to have completed execution until all the computation is finished and the Driven Plugin has successfully transmitted all telemetry data to the Driven Server.
The additional latency that results from transmitting the telemetry data does not affect your application's runtime for data computation. In other words, your application's Service Level Agreements (SLAs), such as producing a data set within a certain time period, are not affected. The computation of data on the Hadoop cluster and the transmission of the telemetry data from your client plugin are decoupled.
The amount of data collected by the Cascading framework and the Driven Plugin depends on the complexity of your application, such as the number of operations and branches. In addition, the Driven Plugin collects information about each task or job that is part of your application. The larger your data sets, the more information is collected for analysis from the Hadoop NameNode. As a result, make sure that you provision adequate resources for your Hadoop NameNode to prevent a bottleneck in collecting the telemetry signals. Also, make sure that you provision adequate memory for your Cascading application running with the Driven Plugin.
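The first two implications can be pictured with a toy timing model. The sketch below is not Driven code: the class, the function name, and the durations are hypothetical, and it assumes for simplicity that any remaining telemetry is sent after computation ends. It only illustrates that the data set is ready when computation finishes, while the client process returns later.

```python
from dataclasses import dataclass


@dataclass
class RunTimeline:
    data_ready_at: float    # seconds after launch when the output data set is complete
    process_exit_at: float  # seconds after launch when the client process returns


def model_run(compute_seconds: float, telemetry_flush_seconds: float) -> RunTimeline:
    """Toy model: remaining telemetry is transmitted after computation ends."""
    data_ready_at = compute_seconds
    process_exit_at = compute_seconds + telemetry_flush_seconds
    return RunTimeline(data_ready_at, process_exit_at)


# Hypothetical numbers: a 10-minute computation and 15 seconds of telemetry to send.
timeline = model_run(compute_seconds=600.0, telemetry_flush_seconds=15.0)
print(timeline.data_ready_at)    # 600.0
print(timeline.process_exit_at)  # 615.0
```

In this model an SLA measured on data availability depends only on the computation time, whereas a scheduler that waits for the client process to exit also sees the transmission time.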

External Dependencies

Driven components depend on each other and on external software. Some of these interdependencies require particular software versions to function correctly. For example, the Driven 1.2 Server requires JDK 1.7 or later. In other cases, a specific component version may not be required to run in your environment but is required to obtain technical support.

Other interdependencies involve components outside the Driven installation, such as which versions of Cascading are certified with specific versions of Apache distributions. If you are running a single-node Driven environment from a Tomcat-bundled deployment file, you do not need to check the versions of Driven, Elasticsearch, and Tomcat. The bundled ZIP file is packaged with compatible versions of Tomcat, Elasticsearch, and the Driven WAR file.

Driven Compatibility Matrix

Table 1. Supported version interdependencies
Driven	Cascading	JDK	Elasticsearch	Tomcat
1.2 Server	not applicable	1.7+	1.4.3	7
1.2 Plugin	2.1+, 2.2+, 2.5+, 2.6+	1.6+	not applicable	not applicable
1.1 Server	not applicable	1.7+	1.1.1	7
1.1 Plugin	2.1+, 2.2+, 2.5+, 2.6+	1.6+	not applicable	not applicable
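For a non-bundled deployment, the server rows of Table 1 can be turned into a simple pre-installation check. The following is a minimal sketch, not a Driven tool: the data layout and the helper names are hypothetical, and it assumes that the Elasticsearch version must match the table exactly and that any Tomcat 7.x release satisfies the "7" entry.

```python
# Server rows copied from Table 1; the structure itself is illustrative only.
SERVER_REQUIREMENTS = {
    "1.2": {"min_jdk": (1, 7), "elasticsearch": "1.4.3", "tomcat": "7"},
    "1.1": {"min_jdk": (1, 7), "elasticsearch": "1.1.1", "tomcat": "7"},
}


def major_minor(version):
    """Turn '1.7' or '1.7.0_80' into a tuple such as (1, 7) for comparison."""
    return tuple(int(part) for part in version.split(".")[:2])


def check_server(driven, jdk, elasticsearch, tomcat):
    """Return a list of mismatches against Table 1; an empty list means compatible."""
    required = SERVER_REQUIREMENTS[driven]
    problems = []
    if major_minor(jdk) < required["min_jdk"]:
        problems.append(f"JDK {jdk} is below the 1.7 minimum")
    if elasticsearch != required["elasticsearch"]:
        # Assumption: the table lists an exact Elasticsearch version.
        problems.append(
            f"Elasticsearch {elasticsearch} differs from {required['elasticsearch']}"
        )
    if tomcat.split(".")[0] != required["tomcat"]:
        # Assumption: any 7.x release counts as Tomcat 7.
        problems.append(f"Tomcat {tomcat} is not a 7.x release")
    return problems


print(check_server("1.2", jdk="1.7", elasticsearch="1.4.3", tomcat="7"))  # []
```

Returning a list of problems rather than a single boolean lets one run report every mismatch at once.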

Cascading Compatibility Requirements

For matrices that detail the interdependencies that Cascading has with Apache and Java components, see the Compatibility page on the Cascading support website.

Selecting a Deployment Topology

How you arrange the Driven, Elasticsearch, and Tomcat components depends on your goals and on your system environment. There are two recommended deployment topologies for the Driven Server:

Deploying a Single-Node Environment: You can deploy an "out-of-the-box" single installation of Driven that includes an embedded Elasticsearch datastore for learning how to use the product or for small-scale usage.
Deploying a Clustered Production Environment: Deploy any number of Driven Server nodes backed by an external Elasticsearch cluster as the datastore. This topology is recommended for production environments where scalability, redundancy, and reliability are a priority.

Driven can also function when multiple machines host the Driven Server with embedded Elasticsearch datastores. The setup process for this environment is not documented.

Deploying a Single-Node Environment