
savanna-all team mailing list archive

Re: advanced hadoop configuration

 

Folks,

We've adjusted our specs to incorporate the case when a cluster is created
from a provider-specific configuration file. The "Cluster Lifecycle for
Config File Mode" section covers this case briefly.

The spec is attached. I will upload it to the wiki once they fix
authentication.

Thanks,

Dmitry


2013/5/14 John Speidel <jspeidel@xxxxxxxxxxxxxxx>

>  Thanks Ruslan, we are very happy that there is agreement to include
> advanced Hadoop configuration functionality in Savanna.
> Now we can focus on the technical aspects of implementing this
> functionality.
>
> I would like to take some time to clarify a possible misunderstanding in
> the advanced configuration approach previously described.
>
>  The plugin-specific advanced Hadoop configuration would not dictate any
> specific VM configuration.  In the advanced use case, the user would still
> need to specify VM-related information independent of the Hadoop
> configuration: VM count, flavor of the VMs, etc.  In the future, we may
> need to allow additional VM-related information to be provided, such as
> rack, physical host, etc., but this would not be part of the Hadoop
> configuration.  This information alone is used to provision all VMs; the
> module responsible for provisioning them would not need any Hadoop-related
> information from the advanced configuration.  The VMs would be provisioned
> with no knowledge of Hadoop services, roles, etc.; they would be Hadoop
> agnostic.  The VM provisioning module would simply interact with OpenStack
> (nova) to provision vanilla VMs based on a VM configuration containing
> instance count, VM images, flavors, and potentially rack and other
> topology-related information.  The VM configuration would also make it
> possible to specify the image/flavor for all VMs.
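>
>  For illustration only, here is a minimal sketch of what such a VM
> configuration might look like.  None of the field names, image names, or
> flavors below come from an existing Savanna API; they are hypothetical:
>
>     # Hypothetical VM configuration consumed by the VM provisioning module.
>     # Field names are illustrative, not part of any existing Savanna API.
>     vm_config = {
>         "instance_count": 5,
>         "default_image": "hadoop-base-image",  # e.g. image with local repos
>         "default_flavor": "m1.large",
>         # possible future topology hints, still Hadoop-agnostic
>         "topology": {"rack_aware": False},
>         # optional per-instance overrides, e.g. a larger VM for masters
>         "overrides": [{"name": "vm-0", "flavor": "m1.xlarge"}],
>     }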
>
>  At the completion of the VM provisioning step, no Hadoop-related
> configuration/provisioning has occurred on any VM.
>
>  After all of the VMs have been provisioned, the Hadoop plugin would be
> invoked with the advanced Hadoop configuration as well as the VM cluster
> information.  The VM cluster information would describe every VM
> provisioned in the previous step, including properties such as flavor,
> image, networking information, rack, physical host, etc.  The Hadoop
> configuration would specify all Hadoop services as well as rules for
> mapping services/roles onto the set of VMs that have already been
> provisioned.  These rules would rely on the properties provided in the VM
> information.  For example, a configuration could dictate that all master
> services run on a single VM with a minimum of 2G RAM and that all slave
> roles run on every other machine.  The role-mapping rules would be in
> simple query form, such as MEMORY > 2G and DISK_SPACE > X.  The rules
> would not dictate the number of hosts required for a given cluster.  Once
> the Hadoop provider determined which services/roles to place on each VM,
> it would be responsible for installing all Hadoop bits and for configuring
> and starting all services on each VM.  For HDP, this would be done using
> Ambari.
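>
>  To make the role-mapping idea concrete, here is a small hypothetical
> sketch (not HDP or Ambari code) of how a plugin might evaluate such rules
> against the properties of already-provisioned VMs.  The rule format and
> property names are assumptions made purely for illustration:
>
>     # Hypothetical role mapping: a rule is a list of
>     # (property, operator, value) predicates checked against a VM.
>     def matches(vm, rule):
>         ops = {">": lambda a, b: a > b, ">=": lambda a, b: a >= b}
>         return all(ops[op](vm.get(prop, 0), val)
>                    for prop, op, val in rule)
>
>     # "all master services on a single VM with at least 2G RAM"
>     master_rule = [("MEMORY_MB", ">=", 2048)]
>
>     vms = [{"name": "vm-0", "MEMORY_MB": 4096, "DISK_GB": 80},
>            {"name": "vm-1", "MEMORY_MB": 1024, "DISK_GB": 40},
>            {"name": "vm-2", "MEMORY_MB": 2048, "DISK_GB": 40}]
>
>     # first VM satisfying the rule hosts the masters; slaves go elsewhere
>     master = next(vm for vm in vms if matches(vm, master_rule))
>     slaves = [vm for vm in vms if vm is not master]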
>
>  The important thing I am trying to convey is that the VM configuration is
> distinct from the Hadoop configuration.  The VM plugin would provision
> vanilla VMs (possibly using a Hadoop plugin image with local repos
> installed), and the Hadoop plugin would map services/roles onto the
> provisioned VMs based on simple rules in the advanced configuration.
>
>  If you feel that there is still a compelling reason for the controller to
> need information from the advanced Hadoop configuration in order to
> provision VMs, please provide specific details.
>
> Thanks,
> -John
>
> On 5/13/13 1:10 PM, Ruslan Kamaldinov wrote:
>
>  Jon, John,
>
>  We are concerned that the proposed architecture will not allow the user
> to configure Hadoop and OpenStack at the same time. It allows configuring
> Hadoop, but not OpenStack: flavor, Swift, etc. It also doesn't allow the
> user to specify a flavor per node, which is what we usually do when we
> deploy Hadoop on real hardware.
>
>  We understand that advanced Hadoop configuration is an important feature
> for you, and we absolutely don't want to restrict it.
>
>  So, here is how this problem could be avoided (a rough sketch in code
> follows below):
> - The user passes the advanced Hadoop config to Savanna
> - Savanna passes this config to the plugin via
> plugin.convert_advanced_config()
> - The plugin returns a cluster template that Savanna understands. The
> template may contain the advanced config unmodified, e.g. as an inner JSON
> object inside the plugin-specific part of the template. It should also
> describe the number and types of nodes in the cluster
> - The user maps OpenStack-specific configuration onto this template: for
> example, disk mapping for HDFS, node placement, and the flavor of each
> node (or node group).
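>
>  Roughly, something like the following. convert_advanced_config() is the
> name used in this thread; the plugin class, the template fields, and the
> example config are assumptions made only to illustrate the flow:
>
>     # Hypothetical sketch of the proposed flow, not actual Savanna code.
>     class ExamplePlugin(object):
>         def convert_advanced_config(self, advanced_config):
>             # Derive node groups from the advanced config and return a
>             # template Savanna understands; the original config is kept
>             # unmodified as an inner, plugin-specific object.
>             groups = [{"name": "master", "count": 1, "flavor": None},
>                       {"name": "worker", "count": 4, "flavor": None}]
>             return {"node_groups": groups,
>                     "plugin_config": advanced_config}
>
>     template = ExamplePlugin().convert_advanced_config(
>         {"services": ["NAMENODE", "DATANODE", "JOBTRACKER"]})
>
>     # The user then maps OpenStack-specific settings onto the template,
>     # e.g. a flavor per node group, Swift/Cinder options, placement hints.
>     template["node_groups"][0]["flavor"] = "m1.xlarge"
>     template["node_groups"][1]["flavor"] = "m1.large"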
>
>  We also like this approach because it lets us reuse the same standard
> flow that has already been designed. What do you think?
>
>  I understand that the current blueprint for hierarchical templates is not
> complete and definitely needs updating. We are working on this. Once the
> document is updated, we hope that advanced Hadoop configuration will fit
> into the hierarchical templates architecture.
>
>
>  And we agree with your vision of the separation of responsibilities in
> Savanna:
> - Savanna core manages OpenStack
> - The plugin manages Hadoop and the Hadoop management tool
>
>
>  Thanks,
> Ruslan
>
> On Saturday, May 11, 2013 at 8:14 PM, Jon Maron wrote:
>
>   It may also be helpful to see a representative sample of a
> configuration you envision passing to the controller.
>
>  On May 11, 2013, at 11:59 AM, John Speidel <jspeidel@xxxxxxxxxxxxxxx>
> wrote:
>
>   Ruslan,
>
> It would be helpful if you could describe how the controller would use
> the data that you mention (DN placement, HDFS, etc.) while provisioning
> VMs.
>
> Thanks,
> John
>
> On 5/11/13 10:09 AM, Jon Maron wrote:
>
>
>  On May 11, 2013, at 8:46 AM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx>
> wrote:
>
> > I don't believe that openstack currently has rack awareness?
> This is one of the main goals of Savanna. It is targeted for phase 2.
> Quote from the roadmap:
> Hadoop cluster topology configuration parameters
> - Data node placement control
> - HDFS location
> - Swift integration
>
>
>  While your approach targets all the Hadoop-related configs, it misses
> all the OpenStack-related configuration. An advanced Hadoop cluster will
> require advanced OpenStack configuration: Swift, Cinder, placement control,
> etc.
> We **have to** give the user control over both worlds: Hadoop and
> OpenStack. Giving control to the plugin means that the user will lose
> control over the OpenStack-related configuration.
>
>
>  I do not disagree.  I just feel that we should strive to structure the
> configurations in such a way that the VM configuration element of the
> controller doesn't need to process Hadoop configuration and the Hadoop
> plugin doesn't need to comprehend VM-related configuration.  We are
> striving for a design that allows each component of the Savanna system to
> process its configuration alone while having enough information about the
> system to make appropriate decisions.  So I'd view the goal to be (sketched
> in code after the list below):
>
>  1)  The controller assembles, based on user input, both the VM cluster
> and Hadoop cluster information.
> 2)  The VM cluster configuration is passed to a VM provisioning component.
>  The output of that invocation is a VM cluster spec with server instances
> that provide information about their characteristics.
> 3)  The controller passes the Hadoop cluster configuration (either
> standard or advanced) and the VM cluster spec to the Hadoop plugin.
> 4)  The plugin leverages the configuration it is provided, and the set of
> VMs it is made aware of via the VM cluster spec, to execute the appropriate
> package installations, configuration file edits, etc., to set up the Hadoop
> cluster on the given VMs.
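>
>  In code form, the flow might look something like the schematic below.
> None of these function or field names exist in Savanna today; they are
> placeholders that only illustrate which component sees which configuration:
>
>     # Schematic of the four steps above; every function is a placeholder.
>     def parse_user_input(user_input):
>         # 1) Controller assembles VM-cluster and Hadoop-cluster info.
>         return user_input["vm_config"], user_input["hadoop_config"]
>
>     def provision_vms(vm_config):
>         # 2) The VM provisioning component creates vanilla VMs from the
>         #    VM config alone and returns a spec describing each instance.
>         return [{"id": i, "flavor": vm_config["flavor"]}
>                 for i in range(vm_config["count"])]
>
>     def deploy_hadoop(plugin, hadoop_config, vm_cluster_spec):
>         # 3) The controller hands the Hadoop config (standard or advanced)
>         #    and the VM cluster spec to the plugin.
>         # 4) The plugin maps roles to VMs, installs packages, edits config
>         #    files, and starts services on those VMs.
>         plugin.deploy(hadoop_config, vm_cluster_spec)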
>
>  I think this allows for the cleanest separation of responsibilities and
> for the most effective and extensible design for Savanna.  I think we
> should follow this approach to drive the structures we come up with to
> designate the cluster and Hadoop configurations.
>
>
>  Hierarchical node/cluster templates (see
> https://blueprints.launchpad.net/savanna/+spec/hierarchical-templates)
> were designed specifically to support both Hadoop and OpenStack advanced
> configurations.
>
>
>  We don't object to the template approach.  It'll probably cover a great
> deal of the scenarios we may encounter.  However, we've just been through
> enough similar efforts to realize that:
>
>  1)  There are always edge cases that need the most flexible approach
> 2)  Users like to use existing assets (e.g. Ambari blueprints they've
> already assembled in a non-OpenStack/VM environment).  They will resent or
> resist having to learn a new management mechanism on top of the one they
> already understand and implement.
>
>  If you think that the current design misses something, or that something
> in it doesn't allow the "Hadoop Blueprint Specification" to be supported,
> let's discuss it. It was designed to support such configurations and it
> **has to support them**.
>
>
>  Thanks,
>  Ruslan
>
>
>  On Sat, May 11, 2013 at 1:17 AM, Jon Maron <jmaron@xxxxxxxxxxxxxxx> wrote:
>
>
>  On May 10, 2013, at 4:35 PM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx>
> wrote:
>
>  Hi John,
>
>  If the controller doesn't know anything about the services that will run
> on the VMs, then it will not be able to place them correctly. The whole
> cluster might end up on one physical machine (or rack).
>
>
>  I don't believe that OpenStack currently has rack awareness?  In
> addition, the controller doesn't need actual service or Hadoop information
> to determine which physical machine to utilize (I think that would actually
> be a mistake and could limit the controller's ability to extend to other
> potential scenarios).  Rather, if we deem it necessary, we could create
> some additional VM-specific configuration it can utilize to appropriately
> provision the VMs, independent of the Hadoop configuration.  We think it'd
> be a mistake to expect the controller in general to interpret
> Hadoop-specific information (standard or advanced).  The controller simply
> provides services and manages the cluster creation workflow.  There should
> be a clear VM provisioning element that reads the VM-specific configuration
> and provisions accordingly; the Hadoop configuration (standard or
> advanced), along with the VM specs, should then be passed to the plugin,
> allowing it to proceed with service/component installations.
>
>
>  That's why we need to pass a more detailed config to the controller, so
> that it is able to place the VMs correctly. And we can't have this logic
> inside the plugin.
>
>
>  I don't quite understand your concern.
>
>  The controller is going to deal with the VM provisioning element and
> request it to create the VMs based on the information provided (number of
> VMs, flavors).   The VM information will then be relayed to the plugin
> within the vm_specs object.   Then, given a list of VMs and their
> characteristics, the plugin will be able to select the appropriate VMs on
> which to install the various Hadoop services, based on predicates available
> in the Hadoop cluster configuration within the advanced configuration file.
>  For example, for the name node the Hadoop configuration may include a
> minimum memory requirement.  The plugin will be able to iterate through the
> list of VMs and find one that has the appropriate amount of memory.  Once a
> VM is found that meets all the criteria listed for the given component, the
> installation can proceed.
>
>  It was indicated to us that the p
>

Attachment: InteroperabilitySpecs.pdf
Description: Adobe PDF document

