
savanna-all team mailing list archive

Savanna version 0.3 - on demand Hadoop task execution


We want to initiate discussion about Elastic Data Processing (EDP) Savanna
component. This functionality is planned to be implemented in the next
development phase starting on July 15. The main questions to address:


   what kind of functionality should be implemented for EDP?

   what are the main components and their responsibilities?

   which existing tools like Hue or Oozie should be used?

To get the discussion started, we have prepared an overview of our thoughts
in the following document: https://wiki.openstack.org/wiki/Savanna/EDP. For
your convenience, you can find the text below. Your comments and suggestions
are welcome.

Key Features

Starting the job:


   Simple REST API and UI

   TODO: mockups

   Job can be entered through the UI/API or pulled from a VCS

   Configurable data source
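To make the "start job" features above concrete, here is a hedged sketch of
what the REST call's payload could look like. Every field name below is an
assumption for illustration only; the actual API has not been designed yet.

```python
import json

# Hypothetical payload for the proposed "start job" REST call.
# All field names are illustrative assumptions, not a defined API.
job_request = {
    "job": {
        "name": "word-count",
        "type": "jar",                        # or "pig", "hive", "oozie"
        "source": {
            "vcs_url": "git://example.org/jobs.git",  # pulled from a VCS
            "path": "wordcount.jar",
        },
        "data_source": {"type": "swift", "container": "input-data"},
        "execution": {"mode": "existing_cluster", "cluster_id": "42"},
    }
}

print(json.dumps(job_request, indent=2))
```

The same structure could back both the UI form and the raw REST API, so the
two entry points stay consistent.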

Job execution modes:


   Run the job on one of the existing clusters

      Expose information on cluster load

      Provide hints for optimizing data locality TODO: more details

   Create a new transient cluster for the job

Job structure:


   Individual job via jar file, Pig or Hive script

   Oozie workflow

      In the future, support importing EMR job flows

Job execution tracking and monitoring


   Are there existing components that can help visualize job execution? (Twitter

   Terminate job

   Auto-scaling functionality

Main EDP Components

Data Discovery Component

EDP can have several sources of data for processing. Data can be pulled
from Swift, GlusterFS, or a NoSQL database like Cassandra or HBase. To
provide unified access to this data, we'll introduce a component responsible
for discovering the data location and supplying the right configuration to
the Hadoop cluster. It should have a pluggable architecture.
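A minimal sketch of such a pluggable discovery component, assuming one
plugin per data-source URL scheme. The class names, registry, and Hadoop
property keys below are illustrative assumptions, not the actual Savanna
design:

```python
import abc

class DataSourcePlugin(abc.ABC):
    """Discovers where data lives and yields Hadoop configuration for it."""

    scheme = None  # URL scheme this plugin handles, e.g. "swift"

    @abc.abstractmethod
    def hadoop_config(self, url):
        """Return the config properties Hadoop needs to read this source."""

class SwiftPlugin(DataSourcePlugin):
    scheme = "swift"

    def hadoop_config(self, url):
        # Property names are illustrative, not exact Hadoop keys.
        return {"fs.swift.impl": "org.apache.hadoop.fs.swift.SwiftFileSystem",
                "mapred.input.dir": url}

# New sources (GlusterFS, Cassandra, HBase, ...) would register here.
PLUGINS = {cls.scheme: cls() for cls in (SwiftPlugin,)}

def discover(url):
    """Dispatch to the plugin registered for the URL's scheme."""
    scheme = url.split("://", 1)[0]
    return PLUGINS[scheme].hadoop_config(url)

print(discover("swift://demo/input"))
```

Adding support for a new backend then means adding one plugin class, with
no changes to the callers.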
Job Source

Users would like to execute different types of jobs: a jar file, Pig and
Hive scripts, Oozie job flows, etc. The job description and source code can
be supplied in different ways. Some users just want to paste a Hive script
and run it. Others want to save the script in Savanna's internal database
for later use. We also need to provide the ability to run a job from source
code stored in a VCS.
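The three supply paths above can be sketched as alternative fields on a job
record; the field names and helper below are hypothetical, chosen only to
show the distinction:

```python
# Each record shows one way a job body could be supplied.
# Field names are assumptions for illustration only.
inline_job = {"type": "hive", "script": "SELECT count(*) FROM logs;"}
stored_job = {"type": "hive", "saved_id": "daily-report"}   # Savanna internal DB
vcs_job    = {"type": "jar",  "vcs_url": "git://example.org/jobs.git"}

def resolve_source(job):
    """Tell where the job body comes from: inline, internal DB, or VCS."""
    for key in ("script", "saved_id", "vcs_url"):
        if key in job:
            return key
    raise ValueError("no job source given")

print([resolve_source(j) for j in (inline_job, stored_job, vcs_job)])
```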
Savanna Dispatcher Component

This component is responsible for provisioning a new cluster, scheduling
jobs on new or existing clusters, resizing clusters, and gathering
information from clusters about current jobs and utilization. It should
also provide information that helps make the right decision about where to
schedule a job: create a new cluster or use an existing one. Examples
include the current load on each cluster, its proximity to the data
location, etc.
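As a hedged sketch of how the dispatcher could turn load and data proximity
into a hint, consider the toy scoring below. The load threshold, the
"region" proxy for data locality, and all field names are assumptions, not
the proposed algorithm:

```python
# Illustrative placement hint; thresholds and fields are assumptions.
def placement_hint(clusters, data_region):
    """Suggest an existing cluster id, or None to create a transient one."""
    candidates = [c for c in clusters
                  if c["load"] < 0.8 and c["region"] == data_region]
    if not candidates:
        return None  # hint: create a new transient cluster for the job
    return min(candidates, key=lambda c: c["load"])["id"]

clusters = [
    {"id": "a", "load": 0.9, "region": "r1"},   # too loaded
    {"id": "b", "load": 0.3, "region": "r1"},   # near the data, lightly loaded
    {"id": "c", "load": 0.2, "region": "r2"},   # far from the data
]
print(placement_hint(clusters, "r1"))  # picks "b"
```

The real dispatcher would gather these inputs from the clusters themselves
rather than take them as literals.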
UI Component

Integration into the OpenStack Dashboard (Horizon). It should provide tools
for job creation, monitoring, etc.

Cloudera Hue already provides part of this functionality: submitting jobs
(jar file, Hive, Pig, Impala) and viewing job status and output.
Cluster Level Coordination Component

Exposes information about jobs on a specific cluster. Possibly this
component could be represented by the existing Hadoop ecosystem projects
Hue and Oozie.
User Workflow

- User selects or creates a job to run

- User chooses a data source of the appropriate type for the job

- Dispatcher provides hints to the user about the best way to schedule the
job (run it on an existing cluster or create a new one)

- User makes a decision based on the hint from the dispatcher

- Dispatcher (if needed) creates or resizes a cluster and schedules the job
to it

- Dispatcher periodically pulls the job status and shows it in the UI
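The last workflow step, the dispatcher's status polling, can be sketched as
a small loop. FakeJob below is a stand-in for a real cluster-side job
handle; the state names and method names are assumptions:

```python
# Minimal sketch of the dispatcher polling loop; names are illustrative.
class FakeJob:
    """Stand-in for a cluster-side job handle."""
    def __init__(self):
        self._states = ["PENDING", "RUNNING", "SUCCEEDED"]
    def status(self):
        return self._states[0]
    def poll(self):
        # Advance to the next state until a terminal one is reached.
        if len(self._states) > 1:
            self._states.pop(0)
        return self.status()

def track(job):
    """Pull job status until it reaches a terminal state, for the UI."""
    seen = [job.status()]
    while seen[-1] not in ("SUCCEEDED", "FAILED", "KILLED"):
        seen.append(job.poll())
    return seen

print(track(FakeJob()))  # ['PENDING', 'RUNNING', 'SUCCEEDED']
```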


Alexander Kuznetsov