Driven Administrator Guide: Planning a Driven Deployment

version 2.2.6

Planning a Driven Deployment

Understanding the Deployment Architecture

The Driven installation consists of a plugin installed on the host where your applications run and the Driven Server, which runs in Apache Tomcat or another type of servlet container. An Elasticsearch datastore provides the persistence layer for the Driven application.

Figure 1. Driven deployment architecture

The Driven Server is packaged as a web application archive (WAR) file in a stand-alone Tomcat distribution. The Driven application is designed for fault tolerance and scalability: it supports redundancy in the persistence layer and includes features for backup and restore.

When your application runs, the underlying Driven Agent or Cascading framework builds a rich state model to execute the units of work on the Hadoop cluster.

The Driven Plugin collects this information and sends it to the Driven Server for visualization. In addition, the Driven Plugin collects metadata about each slice (a unit of work such as a mapper or a reducer) from the Hadoop NameNode. Driven analyzes and correlates these statistics.

There are several implications as a result of this architecture:

  • Your application does not appear to have completed execution until all the computation is finished and the Driven Plugin has successfully transmitted all telemetry data to the Driven Server.

  • The additional latency that results from transmitting the telemetry data does not affect your application’s runtime for data computation. In other words, your application’s Service Level Agreements (SLAs), such as producing a data set within a certain time period, are not affected. The computation of data on the Hadoop cluster and the transmission of the telemetry data from your client plugin are decoupled.

  • The amount of data collected by the Driven Agent or Cascading and the Driven Plugin depends on the complexity of your application, such as its operations and branches. In addition, the Driven Plugin collects information about each task or job that is part of your application. The larger your data sets, the more information is collected for analysis from the Hadoop NameNode. As a result, make sure that you have provisioned adequate resources for your Hadoop NameNode to prevent a bottleneck in collecting the telemetry signals. Also, make sure that you provision adequate memory for your application running with the Driven Plugin.
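The first two implications can be illustrated with a minimal Python sketch. It is a conceptual model only and does not use any Driven API: `run_application`, `units_of_work`, and `send` are hypothetical names chosen for this example. The computation loop never waits on telemetry transmission, yet the function does not return until every telemetry record has been sent.

```python
import queue
import threading
import time


def run_application(units_of_work, send):
    """Run each unit of work and ship telemetry on a separate thread.

    `units_of_work` is a list of zero-argument callables and `send` is a
    callable that transmits one telemetry record. Both are hypothetical
    stand-ins for illustration, not Driven APIs.
    """
    telemetry = queue.Queue()

    def sender():
        # Drain the queue until the sentinel value (None) arrives.
        while True:
            record = telemetry.get()
            if record is None:
                break
            send(record)

    sender_thread = threading.Thread(target=sender)
    sender_thread.start()

    results = []
    for index, unit in enumerate(units_of_work):
        started = time.monotonic()
        results.append(unit())  # the computation never waits on send()
        telemetry.put({"unit": index, "seconds": time.monotonic() - started})
    computed_at = time.monotonic()

    # The data is ready at this point, but the application does not appear
    # complete until every telemetry record has been transmitted.
    telemetry.put(None)
    sender_thread.join()
    return results, time.monotonic() - computed_at
```

The second return value is the time spent waiting for telemetry after the computation finished. In this model, that wait lengthens the apparent run time of the application without delaying the availability of the computed data.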

External Dependencies

Driven components depend on each other and on external software. Some of these interdependencies require particular software versions to function correctly. For example, the Driven Server operates only with specific versions of the JDK. In other cases, a specific component version may not be required to run in your environment but is required to obtain technical support.

Other interdependencies involve components outside the Driven installation, such as which versions of Cascading are certified with specific versions of Apache distributions. If you are running a single-node Driven environment from a Tomcat-bundled deployment file, you do not need to check the versions of Driven, Elasticsearch, and Tomcat. The bundled ZIP file is packaged with compatible versions of Tomcat, Elasticsearch, and the Driven WAR file.

Driven Compatibility Matrix

Table 1. Supported version interdependencies
Driven        Cascading                        JDK    Elasticsearch    Tomcat
1.3+ Server   not applicable                   1.7+   1.5.2            7
1.3+ Plugin   2.1+, 2.2+, 2.5+, 2.6+, 3.0.2+   1.6+   not applicable   not applicable
1.2 Server    not applicable                   1.7+   1.4.3            7
1.2 Plugin    2.1+, 2.2+, 2.5+, 2.6+           1.6+   not applicable   not applicable
1.1 Server    not applicable                   1.7+   1.1.1            7
1.1 Plugin    2.1+, 2.2+, 2.5+, 2.6+           1.6+   not applicable   not applicable
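If you script environment checks before an installation, the rows of Table 1 can be encoded as a simple lookup. The following Python sketch is one possible way to do that. The names `COMPATIBILITY`, `requirements`, and `supports_cascading` are hypothetical and are not part of Driven. The values are copied from Table 1 as written and are matched as plain labels, not parsed as version ranges.

```python
# Rows transcribed from Table 1. None stands for "not applicable".
COMPATIBILITY = {
    ("1.3+", "Server"): {"cascading": None, "jdk": "1.7+",
                         "elasticsearch": "1.5.2", "tomcat": "7"},
    ("1.3+", "Plugin"): {"cascading": ("2.1+", "2.2+", "2.5+", "2.6+", "3.0.2+"),
                         "jdk": "1.6+", "elasticsearch": None, "tomcat": None},
    ("1.2", "Server"): {"cascading": None, "jdk": "1.7+",
                        "elasticsearch": "1.4.3", "tomcat": "7"},
    ("1.2", "Plugin"): {"cascading": ("2.1+", "2.2+", "2.5+", "2.6+"),
                        "jdk": "1.6+", "elasticsearch": None, "tomcat": None},
    ("1.1", "Server"): {"cascading": None, "jdk": "1.7+",
                        "elasticsearch": "1.1.1", "tomcat": "7"},
    ("1.1", "Plugin"): {"cascading": ("2.1+", "2.2+", "2.5+", "2.6+"),
                        "jdk": "1.6+", "elasticsearch": None, "tomcat": None},
}


def requirements(driven, component):
    """Return the Table 1 row for a Driven release and component."""
    try:
        return COMPATIBILITY[(driven, component)]
    except KeyError:
        raise ValueError(f"no Table 1 row for Driven {driven} {component}")


def supports_cascading(driven, cascading_line):
    """Check whether a Cascading release line is listed for a Driven Plugin."""
    return cascading_line in requirements(driven, "Plugin")["cascading"]
```

For example, `requirements("1.2", "Server")["elasticsearch"]` returns `"1.4.3"`, and `supports_cascading("1.2", "3.0.2+")` returns `False` because Cascading 3.0.2+ is listed only for the 1.3+ Plugin.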

Cascading Compatibility Requirements

For matrices that detail the interdependencies that Cascading has with Apache and Java components, see the Compatibility page on the Cascading support website.

Selecting a Deployment Topology

How you arrange the Driven, Elasticsearch, and Tomcat components in your deployment depends on your goals and on your system environment. There are two recommended deployment topologies for the Driven Server:

  • Deploying a Single-Node Environment: You can deploy an "out-of-the-box" single installation of Driven that includes an embedded Elasticsearch datastore for learning how to use the product or for small-scale usage.

  • Deploying a Clustered Production Environment: Deploy any number of Driven Server nodes backed by an external Elasticsearch cluster as the datastore. This topology is recommended for production environments where scalability, redundancy, and reliability are a priority.

Deploying a Single-Node Environment

A single-node deployment of Driven is sufficient for testing and becoming familiar with Driven. A single-node architecture can also function in low-scale production environments with a few users and small workloads. Figure 2 shows a single-node Driven Server instance in relation to the web server container and the embedded Elasticsearch datastore.

Figure 2. Single-node Driven deployment

The Apache Tomcat distribution in the Driven Server installation package includes a preinstalled WAR file. You can install a single-node Driven Server instance, web server, and Elasticsearch server from the Tomcat distribution. If you prefer, you can skip installing the packaged Tomcat distribution and instead deploy the WAR file to an existing web-server node on your system, which installs the Driven and Elasticsearch servers there.

Deploying a Clustered Production Environment

Figure 3 shows the architecture of a multinode Driven deployment running with an external Elasticsearch cluster.

Figure 3. Clustered production environment of a Driven deployment

This topology provides horizontal scalability, data replication, and resiliency. It is the recommended topology for production deployments to meet high-volume service-level agreements and scale Driven with increasing demand. Separating the data persistence layer from the web application layer provides additional stability and the ability to tune operations to the specific needs of each layer.

Setting up this topology requires additional expertise with running Elasticsearch because you must install and configure stand-alone Elasticsearch nodes. You cannot use the embedded Elasticsearch server included in the Tomcat distribution of Driven if you are deploying a clustered production environment. Elastic provides best practices and documentation for setting up a cluster, and this guide often refers to that documentation.

The Driven Server configuration section of this guide documents the properties allowing Driven to communicate with an external Elasticsearch cluster. See Installing and Configuring for further details.
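Before you point Driven Server nodes at an external cluster, it can help to confirm that the cluster is reachable and healthy. The following Python sketch queries the Elasticsearch cluster health endpoint (`_cluster/health`) over HTTP. It is an illustration rather than part of the Driven tooling: the function names and the `http://localhost:9200` default address are assumptions, so substitute the address of one of your own Elasticsearch nodes.

```python
import json
from urllib.request import urlopen


def summarize_health(body):
    """Reduce a _cluster/health JSON response to a few planning-relevant fields."""
    health = json.loads(body)
    return {
        "cluster": health["cluster_name"],
        "status": health["status"],
        "nodes": health["number_of_nodes"],
        # "green" means every primary and replica shard is allocated.
        "all_shards_allocated": health["status"] == "green",
    }


def fetch_health(base_url="http://localhost:9200", timeout=5):
    """Query a node's cluster health endpoint and summarize the response.

    The default base_url is an assumption for this sketch.
    """
    with urlopen(base_url + "/_cluster/health", timeout=timeout) as response:
        return summarize_health(response.read().decode("utf-8"))
```

A status of `yellow` indicates that all primary shards are allocated but some replicas are not, which reduces the data redundancy this topology is meant to provide.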