Driven User Guide

version 2.0.5

Understanding the Unit of Work Details Page

The Unit of Work details page can answer many questions about an application run. Typical issues that it can help diagnose include:

  • How does the application decompose to MapReduce tasks?

  • Is there a particular cause for performance degradation: data skew, network storm, poor application logic, or inadequate cluster resource provisioning?

You know that you are viewing a Unit of Work details page when the Slice Performance dashboard appears under a directed acyclic graph (DAG).

Process Levels

Unit of Work is the platform-agnostic Driven term for an application process level, which Cascading calls a Flow. A Unit of Work is a runtime process that can be mapped to a business or development architecture need; the data it consumes and produces has some level of durability.

A Unit of Work consists of a hierarchy of smaller components. Each Driven-supported application framework uses its own terminology for these components, but for product usability, each component has a single name in Driven. The following list gives the Driven name for each component, a description, and the corresponding framework-specific terms.

Step

A Driven step represents a unit of platform-managed work.

Cascading term: FlowStep
Hadoop MapReduce term: job
Apache Tez term: DAG

Node

A node represents the complete Unit of Work that conceptually fits in a single JVM. The process is parallelized by running multiple instances, each handling a subset of the input data source.

The node can represent multiple data paths. In this case, one path is selected at runtime depending on which input data set is actually being processed.

Cascading term: FlowNode
Hadoop MapReduce terms: mapper and reducer
Apache Tez term: vertex

Slice

A slice is the smallest unit of parallelization. At runtime, a slice represents the actual JVM executing the application code and the data path or paths being executed in that JVM.

Cascading term: FlowSlice
Hadoop MapReduce and Apache Tez terms: task attempt
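The containment hierarchy described above (Unit of Work → Step → Node → Slice) can be sketched as a simple object model. This is an illustrative sketch only, not Driven's actual data model; all class names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the Driven process-level hierarchy:
// a Unit of Work contains Steps, a Step contains Nodes, and a
// Node runs in parallel as Slices (task attempts).
class Slice {
    final long durationMs;          // runtime of this task attempt
    Slice(long durationMs) { this.durationMs = durationMs; }
}

class Node {                        // e.g. a mapper or reducer (Tez: vertex)
    final List<Slice> slices = new ArrayList<>();
}

class Step {                        // e.g. a MapReduce job (Tez: DAG)
    final List<Node> nodes = new ArrayList<>();

    // Total number of parallel task attempts in this step
    int sliceCount() {
        return nodes.stream().mapToInt(n -> n.slices.size()).sum();
    }
}

class UnitOfWork {                  // e.g. a Cascading Flow
    final List<Step> steps = new ArrayList<>();
}
```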

The Cascading Query Planner

A key component of Cascading is a query planner that rapidly forecasts the workflow of a complex data application. When a Cascading application executes, the query planner compiles all the data-processing steps, analyzes dependencies of the steps, and develops a DAG for the application. The DAG is a dependency graph of the higher-level Cascading steps.

Other application frameworks that Driven supports, such as Apache Hive and native MapReduce, run operations that are similar to the work that the Cascading query planner undertakes. Driven compiles the information from these non-Cascading applications and also renders DAGs from the data. However, non-Cascading frameworks might not offer the same degree of DAG manipulation that can be programmed in the code.
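The planner's core output, a dependency graph of steps, behaves like any DAG: a step can run only after the steps it depends on have completed. The ordering idea can be sketched with Kahn's topological-sort algorithm; this is a framework-agnostic illustration, not Cascading's planner API, and all names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Kahn's algorithm: order steps so that every step appears after
// all of the steps it depends on.
class StepOrdering {
    // deps maps each step name to the list of steps it depends on
    static List<String> order(Map<String, List<String>> deps) {
        Map<String, Integer> pending = new HashMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            pending.put(e.getKey(), e.getValue().size());
            for (String d : e.getValue())
                dependents.computeIfAbsent(d, k -> new ArrayList<>()).add(e.getKey());
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.remove();
            result.add(step);
            for (String next : dependents.getOrDefault(step, List.of()))
                if (pending.merge(next, -1, Integer::sum) == 0) ready.add(next);
        }
        return result;  // a valid execution order if the graph is acyclic
    }
}
```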

Figure 1. Sample of a Unit of Work DAG

Figure 1 displays a DAG and detailed information for one of the nodes. The information in the gray shading represents Cascading annotations. See Using Annotations for more information.

Tip
Click the Details drop-down menu above the DAG to see additional data about the whole Unit of Work.

The Cascading query planner iterates through the DAG, breaking it into smaller and smaller graphs (expression graphs). This process continues until each graph component matches a pattern associated with a unit of work, such as a mapper or a reducer. This granular data is visualized in columns of the Slice Performance table (see Figure 2).

Because the DAG is decoupled from the units of work on the computation fabric, Cascading can run your application on many Big Data fabrics (such as MapReduce) without requiring a rewrite of, or changes to, your code.

Figure 2. Steps associated with their mappers and reducers, node DAGs, and non-normalized expression graphs

A step usually consists of multiple slices. An expression graph plots data from the slices that compose a step. As you hover over an expression graph, notice that each slice is highlighted and displays its specific duration.

By default, Driven visualizes each step by drawing an expression graph that is proportional to the range of slice values that compose the step. This helps you analyze the slice performance of an individual step.

If you click the Normalize checkbox, the graphs are redrawn so that they are proportional to the range of slice values for the entire column. This visualization aids in comparison of slice performance among different steps that report the same type of data.
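The difference between the two modes comes down to which range the slice values are scaled against: the step's own range by default, or the whole column's range when Normalize is checked. A rough sketch of that scaling follows; this is an illustration of the concept, not Driven's actual rendering code.

```java
// Scale slice values into [0, 1] against a given range, the way an
// expression graph is drawn: pass the step's own min/max for the
// default view, or the column-wide min/max for the Normalize view.
class GraphScaling {
    static double[] scale(double[] sliceValues, double min, double max) {
        double range = max - min;
        double[] scaled = new double[sliceValues.length];
        for (int i = 0; i < sliceValues.length; i++)
            scaled[i] = range == 0 ? 0 : (sliceValues[i] - min) / range;
        return scaled;
    }
}
```

For example, a step whose slices report values 10 and 20 fills its own graph (0 to 1) in the default view, but occupies only a small portion of the graph when normalized against a column whose values span 0 to 100.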

Click the Add counters button to control what slice-level execution data is represented as expression graphs in the table.

Figure 3. Selecting counters to add to or remove from the Slice Performance table

Understanding Bottlenecks in Your Application

If an application is not performing optimally, the problem might result from skewed data processing. You can check the Slice Performance table visualizations for skewed-data problems. In a MapReduce application, the data is divided and processed in equal-sized chunks. If certain slices take noticeably longer to finish a similar type of task on (presumably) similarly sized data, that anomaly could indicate application execution problems.
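One simple way to quantify the skew that the table visualizes is to compare the slowest slice against the median slice: if all slices process similarly sized data, their durations should cluster, so a large ratio flags an outlier. The sketch below is illustrative only; the 2x threshold and the helper name are assumptions, not a Driven metric.

```java
import java.util.Arrays;

// Flag a set of slices as skewed when the slowest slice takes much
// longer than the median slice. The threshold (e.g. 2.0) is an
// arbitrary illustrative choice, not a Driven default.
class SkewCheck {
    static boolean isSkewed(long[] sliceDurationsMs, double threshold) {
        long[] sorted = sliceDurationsMs.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];
        long slowest = sorted[sorted.length - 1];
        return slowest > median * threshold;
    }
}
```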

Figure 4. This example shows skewed data at the slice level for several counters

These skews often indicate that applications are processing a large number of small files, which usually means that you need to optimize the environment. In other cases, depending on the skew dimension, they could indicate a network issue, which can delay the shuffle-sort operations in MapReduce.

Viewing the Hadoop Dashboard

If there is a Hadoop dashboard for a step, the row for the step has a JobTracker hyperlink.

Figure 5. Link to Hadoop dashboard
