Driven User Guide: Topics for Cascading Applications

version 2.2.6

1. Overview of Monitored Applications: 1.1. Logging In

1.2. Status Views
2. Searches, Saved Views, and Accessing Relevant Data: 2.1. Starting a Search

2.2. Saving Search Queries as Views

2.3. My Teams Views

2.4. Case Examples of Searches, Views, and Teams

2.5. Customizing Searches

2.6. Periodic Views

2.7. Counter Data and Other Metrics in Tables

2.8. Sharing and Limiting Access to Application Information
3. Using the App Details Page: 3.1. Searching App Details

3.2. Viewing Application Details

3.3. Understanding the Directed Acyclic Graph

3.4. Viewing the Graph

3.5. Real-Time Visibility into Your Application

3.6. Details Table
4. Understanding the Unit of Work Details Page: 4.1. Viewing Unit-of-Work Details

4.2. The Directed Acyclic Graph

4.3. Step Table and Slice Histograms

4.4. Viewing the Hadoop Dashboard
5. Managing Applications with Tags: 5.1. Best Practice for Tags

5.2. Assigning Tags as Application-Level Properties

5.3. Assigning Tags in the Driven Plugin Configuration File

5.4. Using Tags in a Search Query

5.5. Finding Tagged Applications with the Column Chooser
6. Configuring Teams for Collaboration: 6.1. Creating and Managing Teams

6.2. Team Details

6.3. Associating an Application with a Team
7. Using Annotations: 7.1. Creating Custom Annotations

7.2. Data Visibility
8. Execute Hive Queries as Cascading HiveFlow: 8.1. Using HiveFlow

8.2. Driven for HiveFlow
9. Execute Cascading MapReduce Flows
10. User Profile: 10.1. User Actions

10.2. User Credentials

10.3. User Statistics

10.4. Invitations

10.5. Teams

Using Annotations

Annotations display metadata about flow and step nodes on the application details and slice performance views of Driven. Click on a node in the directed acyclic graph (DAG) to view annotations about the operation, such as processing details for a tap, filter, or function of a flow.

Figure 1. Sample annotation displaying the Function type within the wc flow

Creating Custom Annotations

You choose which types of metadata the annotations expose when you develop the Cascading application.

Use Cascading 2.6 or above to assign annotations to Cascading functions and taps.

Also, refer to the Cascading 3.0 core API reference for details about the cascading.management.annotation package.
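As a rough illustration of how annotation metadata attached to an operation can be read back at runtime, the sketch below uses plain Java reflection. The `Property`, `PropertyDescription`, and `Visibility` types declared here are local stand-ins that only mirror the names used in this guide. They are not the real `cascading.management.annotation` classes, and `ScrubFunction` and `AnnotationReader` are hypothetical names invented for the example. This is not how Driven itself collects annotations.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the visibility levels named in this guide (assumption, not the Cascading enum).
enum Visibility { PUBLIC, PROTECTED, PRIVATE }

// Stand-in annotation: names a piece of metadata and its visibility level.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.METHOD})
@interface Property {
  String name();
  Visibility visibility() default Visibility.PUBLIC;
}

// Stand-in annotation: human-readable description of the property.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.METHOD})
@interface PropertyDescription {
  String value();
}

// Hypothetical operation whose configuration is exposed as an annotated field.
class ScrubFunction {
  @Property(name = "scrubText", visibility = Visibility.PUBLIC)
  @PropertyDescription("Pattern used to scrub the input text")
  private final String pattern;

  ScrubFunction(String pattern) {
    this.pattern = pattern;
  }
}

final class AnnotationReader {

  // Collects "visibility | description | value" for every @Property field on the target.
  static Map<String, String> describe(Object target) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Field field : target.getClass().getDeclaredFields()) {
      Property property = field.getAnnotation(Property.class);
      if (property == null) {
        continue; // field carries no metadata annotation
      }
      PropertyDescription description = field.getAnnotation(PropertyDescription.class);
      String text = description == null ? "" : description.value();
      try {
        field.setAccessible(true);
        result.put(property.name(),
            property.visibility() + " | " + text + " | " + field.get(target));
      } catch (IllegalAccessException e) {
        throw new IllegalStateException("Cannot read field " + field.getName(), e);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    describe(new ScrubFunction("[^a-z]"))
        .forEach((name, details) -> System.out.println(name + ": " + details));
  }
}
```

In a real application you would import the annotation types from the Cascading library rather than declare them yourself. The stand-ins exist only so that the sketch compiles without Cascading on the classpath.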

Data Visibility

Driven renders all the application metadata associated with the annotations. However, for privacy or compliance reasons, you may want to restrict information about a certain Property to a subset of Driven users. Such access control matters most in a shared, multitenant cluster, where privacy or governance guidelines can require hiding some metadata attributes.

Driven applies the visibility rule to users based on their identity in Driven. In the following code example, the visibility rule is set to PUBLIC:

@Property(name = "scrubTextConvert", visibility = Visibility.PUBLIC)
@PropertyDescription("_my_property_description_")
...
@Property(name = "scrubText", visibility = Visibility.PUBLIC)
@PropertyDescription("_my_property_description_")
...

Driven maps the visibility levels listed in the table below to the state of the user session.

Table 1. Visibility Levels
Property (user session)	Public Access (Anonymous)	Protected Access (Login)	Private Access (Team)
PUBLIC	X	X	X
PROTECTED		X	X
PRIVATE			X

You can configure this mapping in the driven.properties file to implement your governance guidelines. The list below illustrates a typical use of the visibility levels.

With the default visibility mapping, the levels control viewing of information as follows:

PUBLIC	Allow metadata attributes to be observed anonymously, by default.
PROTECTED	Allow metadata attributes to be viewed by users who log in to Driven.
PRIVATE	Allow metadata attributes to be viewed by members of a Driven team. This level is also used when access is restricted by role, such as Driven admin or team leader.
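The default mapping in Table 1 can be read as a simple lookup from session state to the visibility levels that session may see. The sketch below models that lookup. All type and method names (`VisibilityLevel`, `SessionState`, `VisibilityRules`) are hypothetical and exist only for illustration. They are not part of Driven or Cascading, and the actual mapping is whatever your driven.properties file configures.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the default mapping in Table 1.
// None of these types are part of Driven or Cascading.
enum VisibilityLevel { PUBLIC, PROTECTED, PRIVATE }

// The three session states that head the columns of Table 1.
enum SessionState { ANONYMOUS, LOGGED_IN, TEAM_MEMBER }

final class VisibilityRules {

  // Returns the visibility levels a session may view under the default mapping.
  static Set<VisibilityLevel> visibleTo(SessionState session) {
    switch (session) {
      case ANONYMOUS:
        return EnumSet.of(VisibilityLevel.PUBLIC);
      case LOGGED_IN:
        return EnumSet.of(VisibilityLevel.PUBLIC, VisibilityLevel.PROTECTED);
      case TEAM_MEMBER:
        return EnumSet.allOf(VisibilityLevel.class);
      default:
        throw new IllegalArgumentException("Unknown session state: " + session);
    }
  }

  // True when a property at the given level is shown to the given session.
  static boolean canView(VisibilityLevel level, SessionState session) {
    return visibleTo(session).contains(level);
  }

  public static void main(String[] args) {
    for (SessionState session : SessionState.values()) {
      System.out.println(session + " can view " + visibleTo(session));
    }
  }
}
```

For example, `VisibilityRules.canView(VisibilityLevel.PROTECTED, SessionState.ANONYMOUS)` is false in this model, matching the empty cell in the PROTECTED row of the table.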

Execute Hive Queries as Cascading HiveFlow

The Cascading execution framework can run Hive queries. This lets your Hive applications benefit from the Cascading platform: dynamic management of all Hive objects, visibility into the end-to-end flow of the application, instrumentation, orchestration of your Hive modules for error recovery, and integration with major third-party systems such as Elasticsearch and Teradata.

Using HiveFlow

You can move your Hive Query Language (HQL) queries into production using an API from HiveFlow and the runtime monitoring capabilities of Driven.

HiveFlow is a simple Java wrapper that simplifies the chaining of multiple HQL statements into a single maintainable application. It transparently sends telemetry to Driven so that an HQL-based application can be managed and monitored in real time.

With HiveFlow, even applications based on multiple technologies, such as Hive, custom MapReduce, Cascading, and Scalding, can be chained together within the same application (an Apache Hadoop job JAR). The consolidation simplifies testing, deployment, maintenance, and monitoring.

Driven for HiveFlow

Figure 2. Sample usage of flow details page

Drill down to view an HQL statement. Click the small download icon to copy the statement to your clipboard.

Execute Cascading MapReduce Flows

You can run your existing MapReduce jobs through the Cascading class MapReduceFlow, which is a subclass of HadoopFlow.

MapReduceFlow lets Cascading execute custom MapReduce jobs.

After Driven receives data from application execution, you can see the directed acyclic graph (DAG) representation of the MapReduce job with flows and their dependencies.

Figure 3. Driven displays your MapReduce job DAG representation, which allows you to drill down to a specific flow

Note	Driven does not show any performance data in this example since the job was just launched.