Driven Administrator Guide

version 2.0.5

Planning a Driven Deployment

Understanding the Deployment Architecture

The Driven installation consists of two parts: the Driven Plugin, installed on a host where big data applications run, and the Driven Server, which runs in Apache Tomcat or another servlet container. An Elasticsearch datastore provides the persistence layer for the Driven application.

Figure 1: Driven deployment architecture

The Driven Server is packaged as a web application resource (WAR) file within a stand-alone Tomcat distribution. The Driven application is architected for fault tolerance and scalability, with redundancy in the persistence layer and built-in backup and restore features.

When your application executes, the underlying framework builds a rich state model for executing units of work on the Hadoop cluster. For Cascading, this means taking higher-level primitives (see the Cascading User Guide) and mapping them to constructs available on the underlying computational fabric (MapReduce, local in-memory mode, and so on) using a sophisticated pattern-matching rules engine.
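
For example, the following minimal Cascading word-count assembly uses only higher-level primitives such as Each, GroupBy, and Every; the Cascading planner, not the developer, decides how they map onto MapReduce jobs. This is an illustrative sketch: the input and output paths are placeholders supplied on the command line.

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        // Source and sink taps; paths are supplied on the command line
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1]);

        // Higher-level primitives: split each line into words, group, count
        Pipe pipe = new Pipe("wordcount");
        pipe = new Each(pipe, new Fields("line"),
            new RegexSplitGenerator(new Fields("word"), "\\s+"));
        pipe = new GroupBy(pipe, new Fields("word"));
        pipe = new Every(pipe, new Count());

        // The planner maps this assembly onto one or more MapReduce jobs
        Flow flow = new HadoopFlowConnector()
            .connect("wordcount", source, sink, pipe);
        flow.complete();
      }
    }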

The Driven Plugin collects this information and sends it to the Driven Server for visualization. The Driven Plugin also gathers metadata about each slice (a unit of work such as a mapper or a reducer) from the Hadoop NameNode, producing statistics that Driven analyzes and correlates.

There are several implications as a result of this architecture:

  • Your application does not appear to have completed execution until all the computation is finished and the Driven Plugin has successfully transmitted all telemetry data to the Driven Server.

  • The additional latency from transmitting the telemetry data does not affect your application runtime for data computation. In other words, your application’s Service Level Agreements (SLAs), such as producing a data set within a certain time period, are not affected. The computation of data on the Hadoop cluster and the transmission of the telemetry data from your client plugin are decoupled (see the sketch after this list).

  • The amount of data collected by the application framework and the Driven Plugin depends on the complexity of your application, such as its operations and branches. In addition, the Driven Plugin collects information about each task or job that is part of your application. The larger your data sets, the more information is collected for analysis from the Hadoop NameNode. As a result, make sure that you have provisioned adequate resources for your Hadoop NameNode to prevent a bottleneck in collecting the telemetry signals, and that you have provisioned adequate memory for your application running with the Driven Plugin.
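
For illustration, the sketch below shows the decoupling pattern described above: computation threads enqueue telemetry events without blocking on the network, a background thread transmits them to the server, and the process exits only after the queue has been flushed. This is a conceptual sketch of the pattern only, not Driven’s actual implementation; the TelemetrySender class and its methods are hypothetical.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch: computation threads enqueue telemetry and move
    // on; a background thread transmits it, so data-processing SLAs are
    // unaffected by transmission latency.
    public class TelemetrySender implements Runnable {
      private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
      private volatile boolean shuttingDown = false;

      // Called from the computation path; never blocks on the network.
      public void record(String event) {
        queue.offer(event);
      }

      @Override
      public void run() {
        // Drain until shutdown is requested and the queue is empty.
        while (!shuttingDown || !queue.isEmpty()) {
          try {
            String event = queue.poll(100, TimeUnit.MILLISECONDS);
            if (event != null)
              transmit(event); // e.g., an HTTP POST to the Driven Server
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
          }
        }
      }

      // Placeholder for the actual network call.
      private void transmit(String event) {
        System.out.println("sending: " + event);
      }

      // The process waits here at exit, which is why the application does
      // not appear to have completed until all telemetry has been sent.
      public void shutdownAndFlush(Thread senderThread)
          throws InterruptedException {
        shuttingDown = true;
        senderThread.join();
      }
    }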
