
savanna-all team mailing list archive

Re: advanced hadoop configuration



It would be helpful if you could describe how the controller would use the data you mention (DN placement, HDFS location, etc.) while provisioning VMs.


On 5/11/13 10:09 AM, Jon Maron wrote:

On May 11, 2013, at 8:46 AM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx> wrote:

> I don't believe that openstack currently has rack awareness?
This is one of the main goals of Savanna. It is targeted for phase 2. Quote from the roadmap:
Hadoop cluster topology configuration parameters
- Data node placement control
- HDFS location
- Swift integration

While your approach targets all the Hadoop-related configs, it misses all the OpenStack-related configuration. An advanced Hadoop cluster will require advanced OpenStack configuration: Swift, Cinder, placement control, etc. We **have to** give the user control over both worlds: Hadoop and OpenStack. Giving control to the plugin means that the user will lose control over the OpenStack-related configuration.

I do not disagree. I just feel that we should strive to structure the configuration in such a way that the VM configuration element of the controller doesn't need to process Hadoop configuration and the Hadoop plugin doesn't need to comprehend VM-related configuration. We are striving for a design that allows each component of the Savanna system to process its configuration alone while having enough information about the system to make appropriate decisions. So I'd view the goal to be:

1) The controller assembles information, based on user input, that has both VM cluster and Hadoop cluster information.
2) The VM cluster configuration is passed to a VM provisioning component. The output of that invocation is a VM cluster spec with server instances that provide information about their characteristics.
3) The controller passes the Hadoop cluster configuration (either standard or advanced) and the VM cluster spec to the Hadoop plugin.
4) The plugin leverages the configuration it is provided, and the set of VMs it is made aware of via the VM cluster spec, to execute the appropriate package installations, configuration file edits, etc., to set up the Hadoop cluster on the given VMs.
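To make the intended data flow concrete, here is a minimal sketch in Python (the names split_configuration, provision_vms, and HadoopPlugin are illustrative stand-ins, not actual Savanna APIs):

```python
# Minimal sketch of the four-step flow above. All names here are
# illustrative stand-ins, not actual Savanna APIs.

def split_configuration(user_input):
    # Step 1: the controller separates the VM-cluster and Hadoop-cluster
    # configuration assembled from user input.
    return user_input["vm_config"], user_input["hadoop_config"]

def provision_vms(vm_config):
    # Step 2: stand-in for the VM provisioning component; returns a VM
    # cluster spec describing each instance's characteristics.
    return [{"host": "vm-%d" % i, "flavor": vm_config["flavor"]}
            for i in range(vm_config["count"])]

class HadoopPlugin:
    # Steps 3-4: the plugin receives the (possibly opaque, advanced)
    # Hadoop configuration plus the VM cluster spec, and would perform
    # package installs and config-file edits on the given VMs.
    def configure_cluster(self, hadoop_config, vm_cluster_spec):
        return {"hadoop_config": hadoop_config, "vms": vm_cluster_spec}

def create_cluster(user_input):
    vm_config, hadoop_config = split_configuration(user_input)
    vm_cluster_spec = provision_vms(vm_config)
    return HadoopPlugin().configure_cluster(hadoop_config, vm_cluster_spec)
```

Note that in this shape the provisioning step never sees Hadoop configuration, and the plugin never sees raw VM configuration, only the resulting spec.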

I think this allows for the cleanest separation of responsibilities and for the most effective and extensible design for savanna. I think we should follow this approach to drive the structures we come up with to designate the cluster and Hadoop configurations.


Hierarchical node/cluster templates (see https://blueprints.launchpad.net/savanna/+spec/hierarchical-templates) were designed specifically to support both Hadoop and OpenStack advanced configurations.

We don't object to the template approach. It'll probably cover a great deal of the scenarios we may encounter. However, we've been through enough similar efforts to realize that:

1) There are always edge cases that need the most flexible approach.
2) Users like to use existing assets (e.g. Ambari blueprints they've already assembled in a non-openstack/VM environment). They will resent or resist having to learn a new management mechanism on top of the one they already understand and implement.

If you think the current design misses something, or that something doesn't allow it to support the "Hadoop Blueprint Specification", let's discuss it. It was designed to support such configurations and it **has to support them**.



On Sat, May 11, 2013 at 1:17 AM, Jon Maron <jmaron@xxxxxxxxxxxxxxx> wrote:

    On May 10, 2013, at 4:35 PM, Ruslan Kamaldinov
    <rkamaldinov@xxxxxxxxxxxx> wrote:

    Hi John,

    If controller doesn't know anything about services which will
    run on VMs, then it will not be able to place them correctly.
    The whole cluster might end up on one physical machine (or rack).

    I don't believe that OpenStack currently has rack awareness?  In
    addition, the controller doesn't need actual service or Hadoop
    information to make a determination about which physical machine
    to utilize (I think that would actually be a mistake and could
    limit the controller's ability to extend to other potential
    scenarios).  Rather, if we deem it necessary we could create some
    additional VM-specific configuration it can utilize to
    appropriately provision the VMs, independent of the Hadoop
    configuration.  We think it'd be a mistake to expect the
    controller in general to interpret Hadoop-specific information
    (standard or advanced).  The controller is simply providing
    services and managing the cluster creation workflow.  There
    should be a clear VM provisioning element that reads the
    VM-specific configuration and provisions accordingly, and then
    the Hadoop configuration (standard or advanced), along with the
    VM specs, should be passed to the plugin to allow it to proceed
    with service/component installations.

    That's why we need to pass a more detailed config to the
    controller, so it would be able to place VMs in the correct
    places.  And we can't have this logic inside the plugin.

    I don't quite understand your concern.

    The controller is going to deal with the VM provisioning element
    and request it to create the VMs based on the information
    provided (number of VMs, flavors).  The VM information will then
    be relayed to the plugin within the vm_specs object.  Then,
    given a list of VMs and their characteristics, the plugin will
    be able to select the appropriate VMs on which to install the
    various Hadoop services, based on predicates available within
    the Hadoop cluster configuration in the advanced configuration
    file.  For example, for the name node the Hadoop configuration
    may include a minimum memory requirement.  The plugin will be
    able to iterate through the list of VMs and find one that has
    the appropriate amount of memory.  Once a VM is found that meets
    all the criteria listed for the given component, the
    installation can proceed.
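To illustrate the selection step, here is a sketch only (the function name, VM-spec fields, and predicate shape are assumptions for illustration, not actual plugin code):

```python
# Sketch of predicate-based host selection as described above. The
# function name and VM-spec fields are illustrative assumptions.

def select_vm(vm_specs, predicates):
    # Iterate through the VM specs and return the first VM that
    # satisfies every predicate for the given component.
    for vm in vm_specs:
        if all(pred(vm) for pred in predicates):
            return vm
    return None  # no VM meets the criteria

vms = [{"host": "vm-1", "memory_mb": 4096},
       {"host": "vm-2", "memory_mb": 16384}]

# e.g. the advanced configuration lists a minimum memory requirement
# for the name node:
namenode_predicates = [lambda vm: vm["memory_mb"] >= 8192]
chosen = select_vm(vms, namenode_predicates)  # selects vm-2
```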

    It was indicated to us that the plugin will have the
    responsibility of installing services, not the controller.

    Here is what we can do:
    Add a REST API call:

    The user will be able to use advanced configs, and Savanna will
    be able to process them, at least the part which is required to
    properly provision VMs.

    The advanced configuration is much richer than the standard
    approach's configuration items (it may include required packages,
    host-selection predicates, etc.).  One of the reasons it is
    proposed is specifically to handle cases that the Savanna
    standard configuration simply can't handle (it's not the only
    reason - we've already expressed all the other drivers).  I don't
    believe a conversion would work.


    On Sat, May 11, 2013 at 12:24 AM, Jon Maron
    <jmaron@xxxxxxxxxxxxxxx> wrote:

        John answered most of your questions below.  One more note:

        On May 10, 2013, at 4:06 PM, Ruslan Kamaldinov
        <rkamaldinov@xxxxxxxxxxxx> wrote:
        Jon, John,

        Could you please shed more light on how these Advanced
        configs could be processed by Savanna controller?

        There is an example stack configuration in "Hadoop
        Blueprint Specification". And there is a "cardinality"
        field - in our case it's the number of VMs per specific
        service, data-node for example.

        Let's imagine a user passed such a config to Savanna and
        defined two VM groups.

        What happens then? How will the Savanna controller be able
        to create VMs with service-specific properties? How will it
        be possible to use different data node placement options?

        How will Savanna be able to store cluster information in
        templates (see

        The hierarchical templates will still play a role in the
        "standard" processing.  But for advanced configuration the
        relevant configuration will be in the provider specific
        configuration.  In addition, the mechanisms to persist the
        provider configuration (advanced) will also exist.


        On Fri, May 10, 2013 at 6:56 PM, Jon Maron
        <jmaron@xxxxxxxxxxxxxxx> wrote:


            We have uploaded some mockups that illustrate the
            "Advanced" configuration mechanism we've been proposing:

            The advanced mechanism essentially leverages the
            existing APIs (configure_cluster(), create_cluster()),
            but the cluster description parameter passed to those
            methods includes the user-selected configuration file
            that is specific to the Hadoop provider rather than the
            standard list of configuration items.

            -- Jon

            On May 8, 2013, at 6:40 PM, John Speidel
            <jspeidel@xxxxxxxxxxxxxxx> wrote:

            Here are more details on the advanced Hadoop
            configuration that we discussed the other day.

            Savanna Advanced Hadoop Configuration

            In addition to the proposed “config items” based Hadoop
            configuration, it will be necessary to provide an
            advanced configuration mechanism in Savanna. This
            mechanism should allow for very fine-grained and
            extensive configuration of a Hadoop cluster provisioned
            by Savanna. It is expected that a user would likely use
            the simple node-group-based configuration for cases
            where little configuration is required, and use the
            advanced configuration where more control is desired.
            The advanced cluster configuration would be specific to
            a Hadoop plugin and its content opaque to the Savanna
            controller.

            For reference, here is a link to the Hadoop Blueprint
            proposed by the Ambari project.

                  Advanced Hadoop Configuration Use Cases

            - A user has an existing on-premise or non-virtualized
            cluster and wants to clone the cluster
            (topology/configuration, not data) in a virtualized
            environment using Savanna.

            In this case, the user will export a configuration for
            the existing cluster using provider/management-product
            specific tooling. This configuration can then be used
            to create a new cluster using Savanna.

            - A user wants to provision a new cluster in a
            virtualized environment using Savanna and needs very
            fine-grained control of the Hadoop cluster
            configuration. This could include configuration of
            host-level roles, configuration of a large number of
            properties across many optional services, and
            potentially even Hadoop stack configuration related to
            packages and repository locations.

                  Changes to UI Workflow

            To allow a user to specify an advanced configuration,
            some UI changes are necessary.

            The create cluster screen would need an “advanced
            Hadoop configuration” tab or button. In the initial
            implementation, the advanced configuration screen would
            allow a user to specify the location of a
            plugin-specific configuration file (select file
            dialog). This configuration file would contain all
            necessary Hadoop-related configuration. In future
            releases, we may want a link to provider-specific
            tooling, which could be used to create/edit provider
            configurations.

            The UI would still need to allow a user to specify VM
            details such as flavor, count, etc., but the user
            wouldn't specify node groups or configuration for the
            VMs. Instead, host/role mapping would be specified in
            the provider-specific configuration file.
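As a sketch, such a host/role mapping might look roughly like the following (loosely modeled on the Ambari blueprint idea; the field names here are assumptions, not a real schema), expressed as a Python structure:

```python
# Hypothetical provider-specific configuration fragment showing
# host/role mapping; all field names are illustrative assumptions.

provider_config = {
    "host_groups": [
        {"name": "master",
         "cardinality": 1,            # number of VMs for this group
         "components": ["NAMENODE", "RESOURCEMANAGER"]},
        {"name": "workers",
         "cardinality": 3,
         "components": ["DATANODE", "NODEMANAGER"]},
    ]
}

# total VMs implied by the mapping
total_vms = sum(g["cardinality"] for g in provider_config["host_groups"])
```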

                  Changes to Hadoop Plugin SPI

            The addition of “Advanced Hadoop Configuration” using
            plugin-specific configuration will result in small
            changes to the proposed Hadoop Plugin SPI.

            cluster_description: The cluster description object
            would need to be updated to contain an
            advanced_configuration field in addition to
            cluster_configs. In the case of a user providing an
            advanced configuration, it would be available in
            advanced_configuration and cluster_configs would be
            empty.
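A rough sketch of the updated object (only the advanced_configuration/cluster_configs fields come from this proposal; the class name and other details are assumptions):

```python
# Sketch of the updated cluster_description object. Only the
# advanced_configuration/cluster_configs fields come from the
# proposal; the rest is illustrative.

class ClusterDescription:
    def __init__(self, name, cluster_configs=None,
                 advanced_configuration=None):
        self.name = name
        # standard path: the list of "config items"
        self.cluster_configs = cluster_configs or {}
        # advanced path: opaque, provider-specific file contents
        self.advanced_configuration = advanced_configuration

    def is_advanced(self):
        # when the user provides an advanced configuration,
        # cluster_configs is empty and this field is set
        return self.advanced_configuration is not None

desc = ClusterDescription("cluster-1",
                          advanced_configuration="<provider file contents>")
```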


            Because the provider-specific configuration is opaque
            to Savanna, it might be necessary for the plugin to
            return some cluster topology information from this
            method for rendering purposes. The specifics of this
            information would be dependent on what cluster
            information is required by Savanna.


            Mailing list: https://launchpad.net/~savanna-all
            Post to     : savanna-all@xxxxxxxxxxxxxxxxxxx
            Unsubscribe : https://launchpad.net/~savanna-all
            More help   : https://help.launchpad.net/ListHelp

