
savanna-all team mailing list archive

Re: advanced hadoop configuration

 

Some comments inline.

Rack awareness is being implemented in nova, and the scheduler-groups work is
well under way, some of it in the code review phase.

On Mon, May 13, 2013 at 3:01 PM, John Speidel <jspeidel@xxxxxxxxxxxxxxx> wrote:

>  Thanks, Ruslan. We are very happy that there is agreement to include
> advanced Hadoop configuration functionality in Savanna.
> Now we can focus on the technical aspects of implementing this
> functionality.
>
> I would like to take some time to clarify a possible misunderstanding in
> the advanced configuration approach previously described.
>
> The plugin-specific advanced Hadoop configuration would not dictate any
> specific VM configuration.  In the advanced use case, the user would still
> need to specify VM-related information independent of the Hadoop
> configuration: VM count, flavors, etc.  In the future we may need to allow
> additional VM-related information to be provided, such as rack, physical
> host, etc., but this would not live in the Hadoop configuration.  This
> information is used to provision all VMs.  The module responsible for
> provisioning the VMs would not need any Hadoop-related information from
> the advanced configuration; the VMs would be provisioned with no knowledge
> of Hadoop services, roles, etc.  They would be Hadoop agnostic.  The VM
> provisioning module would be responsible for interacting with OpenStack
> (nova) to provision vanilla VMs based on a VM configuration containing
> instance count, images, flavors, and potentially rack and other
> topology-related information.  The VM configuration would also make it
> possible to specify the image/flavor for all VMs.
>
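> As a purely illustrative sketch, such a VM configuration might look
> something like the following (all field names are hypothetical, not part
> of any actual Savanna structure):
>
>     vm_config = {
>         "instance_count": 5,
>         "default_image": "hadoop-base-image",  # hypothetical image name
>         "default_flavor": "m1.large",
>         # possible future topology hints, still Hadoop agnostic
>         "topology_hints": {
>             "racks": ["rack1", "rack2"],
>             "anti_affinity": True,
>         },
>     }
>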
> At the completion of the VM provisioning step, no Hadoop-related
> configuration or provisioning has occurred on any VM.
>
> After all of the VMs have been provisioned, the Hadoop plugin would be
> invoked and given the advanced Hadoop configuration as well as the VM
> cluster information.  The VM information would describe each VM provisioned
> in the previous step, including properties such as flavor, image,
> networking information, rack, physical host, etc.  The Hadoop configuration
> would specify all Hadoop services as well as rules for mapping
> services/roles onto the set of VMs that have already been provisioned.
> These rules would utilize the properties provided in the VM information.
>

This seems to defeat the whole purpose of Savanna. You can achieve all of
this with a simple Heat template to create the desired VMs and then proceed
as usual with Ambari for the Hadoop-specific configuration. Going further,
an extension in Ambari could simply call the Heat API, and you would have
Hadoop provisioned on OpenStack through Ambari. Let's not complicate things
and make the Savanna API provider-specific. The whole point of OpenStack is
to provide abstractions and make things easier for the user, not harder.
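
To sketch what I mean, assuming python-heatclient (the endpoint, token, and
template file below are placeholders, not a real deployment):

    from heatclient.client import Client

    # Placeholders; a real client would obtain these from keystone.
    heat = Client('1', endpoint='http://heat.example.com:8004/v1/TENANT_ID',
                  token='AUTH_TOKEN')

    # hadoop_cluster.yaml would declare the desired VMs (count, image,
    # flavor); Ambari would then layer the Hadoop configuration on top.
    with open('hadoop_cluster.yaml') as f:
        template = f.read()

    heat.stacks.create(stack_name='hadoop-cluster', template=template)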

Take nova, for example. When I create a server through nova I don't need to
know which hypervisor or volume backend is being used; all I need to know is
that I will get a Linux or Windows VM. I'm not necessarily disagreeing with
the advanced configuration options; it's just that the approach taken seems
very specifically targeted, and as far as possible I would like that not to
be the case.



> For example, a configuration could dictate that all master services run on
> a single VM with a minimum of 2G of RAM and that all slave roles run on
> every other machine.  These role-mapping rules would be in simple query
> form, such as MEMORY > 2G and DISK_SPACE > X.  The rules would not dictate
> the number of hosts required for a given cluster.  After the Hadoop
> provider determined which services/roles to place on each VM, it would be
> responsible for installing all Hadoop bits and configuring and starting
> all services on each VM.  For HDP, this would be done using Ambari.
>
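> To make the rule idea concrete, here is a minimal sketch of how such
> query-style predicates might be evaluated against VM properties (the rule
> encoding and property names are illustrative only):
>
>     import operator
>
>     OPS = {">": operator.gt, ">=": operator.ge, "=": operator.eq}
>
>     def matches(vm_props, rules):
>         # A VM satisfies a role only if every rule for that role holds.
>         return all(OPS[op](vm_props[prop], val) for prop, op, val in rules)
>
>     master_rules = [("MEMORY_MB", ">", 2048), ("DISK_GB", ">", 40)]
>     vm = {"MEMORY_MB": 4096, "DISK_GB": 80}
>     print(matches(vm, master_rules))  # True
>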
> The important thing I am trying to convey is that VM configuration is
> distinct from Hadoop configuration.  The VM plugin would provision vanilla
> VMs (possibly using a Hadoop plugin image with local repos installed), and
> the Hadoop plugin would map services/roles to the provisioned VMs based on
> simple rules in the advanced configuration.
>
>
This begs us to reconsider what Savanna is: from what you point out, all of
this can easily be achieved with Heat, so why would Savanna be needed at
that point? New features can always be added, but it's going to be difficult
to remove existing features, so before we go fully onboard with implementing
this, let's reconsider.

To make my point clear, I'm of the opinion that Savanna does node placement
and maps roles to templates, meaning Savanna, rather than the user (maybe in
a future iteration that's abstracted as well), decides what flavor to choose
for a namenode vs. a datanode (provided via templates) and provides
additional configuration to tune the cluster. Savanna creates the VMs and
provides placement and role information to the provider, rather than the
provider deciding which VM plays which role.
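
To illustrate the kind of template I have in mind (the names and structure
are hypothetical, not an agreed-upon format), Savanna would own something
like:

    # Savanna, not the provider, decides flavor and placement per role.
    cluster_template = {
        "node_groups": [
            {"role": "namenode", "flavor": "m1.xlarge", "count": 1},
            {"role": "datanode", "flavor": "m1.large", "count": 4,
             "placement": "spread-across-racks"},
        ],
        # Extra tuning passed to the provider alongside role assignments.
        "hadoop_config": {"dfs.replication": 3},
    }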



> If you feel that there is still a compelling reason that the controller
> would need information from the advanced Hadoop configuration to provision
> VMs, please provide specific details.
>
> Thanks,
> -John
>
> On 5/13/13 1:10 PM, Ruslan Kamaldinov wrote:
>
>  Jon, John,
>
>  We are concerned that the proposed architecture will not allow the user
> to configure Hadoop and OpenStack at the same time. It allows configuring
> Hadoop, but doesn't allow configuring OpenStack: flavor, Swift, etc. It
> also doesn't allow the user to specify a flavor per node, which is what we
> usually do when we deploy Hadoop on real hardware.
>
>  We understand that advanced Hadoop configuration is an important feature
> for you, and we absolutely don't want to restrict this feature.
>
>  So, here is how this problem could be avoided (see the sketch below):
> - User passes the advanced Hadoop config to Savanna
> - Savanna passes this config to the plugin via
> plugin.convert_advanced_config()
> - The plugin returns a template for the cluster which Savanna understands.
> The template might contain the advanced config unmodified; it can be just
> an inner JSON object inside the plugin-specific template. The template
> should also contain information about the number and types of nodes in the
> cluster
> - User maps OpenStack-specific configuration onto this template. For
> example, disk mapping for HDFS, node placement, flavor of each node (or of
> node groups).
>
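> A minimal sketch of what the plugin might return (the structure is
> illustrative; the point is only that the advanced config is embedded
> unmodified as an inner object):
>
>     template = {
>         "node_groups": [
>             {"name": "master", "count": 1},  # user assigns flavor later
>             {"name": "worker", "count": 10},
>         ],
>         # The plugin-specific advanced config, passed through unmodified.
>         "plugin_config": {"advanced_hadoop_config": "<user JSON as-is>"},
>     }
>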
>  We also like our approach because it lets us reuse the same standard
> flow which is already designed. What do you think?
>
>  I understand that the current blueprint for hierarchical templates is
> not complete and we definitely need to update it. We are working on this.
> Once the document is updated, we hope that advanced Hadoop configuration
> will fit into the hierarchical templates architecture.
>
>
>  And we agree with your vision of the separation of responsibilities in
> Savanna:
> - Savanna core manages OpenStack
> - Plugin manages Hadoop and the Hadoop management tool
>
>
>  Thanks,
> Ruslan
>
> On Saturday, May 11, 2013 at 8:14 PM, Jon Maron wrote:
>
>   It may also be helpful to see a representative sample of a
> configuration you envision passing to the controller.
>
>  On May 11, 2013, at 11:59 AM, John Speidel <jspeidel@xxxxxxxxxxxxxxx>
> wrote:
>
>   Ruslan,
>
> It would be helpful if you could describe how the controller would use
> the data that you mention (DN placement, HDFS, etc.) while provisioning
> VMs.
>
> Thanks,
> John
>
> On 5/11/13 10:09 AM, Jon Maron wrote:
>
>
>  On May 11, 2013, at 8:46 AM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx>
> wrote:
>
>  *
> > I don't believe that openstack currently has rack awareness?
> This is one of the main goals of Savanna. It is targeted for phase 2.
> Quote from the roadmap:
> Hadoop cluster topology configuration parameters
> - Data node placement control
> - HDFS location
> - Swift integration
>
>
>  While your approach targets all the Hadoop-related configs, it misses
> all the OpenStack-related configuration. An advanced Hadoop cluster will
> require advanced OpenStack configuration: Swift, Cinder, placement
> control, etc.
> We **have to** give the user control over both worlds: Hadoop and
> OpenStack. Giving control to the plugin means that the user will lose
> control over OpenStack-related configuration.
>
>
>  I do not disagree.  I just feel that we should strive to structure the
> configuration in such a way that the VM configuration element of the
> controller doesn't need to process Hadoop configuration and the Hadoop
> plugin doesn't need to comprehend VM-related configuration.  We are
> striving for a design that allows each component of the Savanna system to
> process its configuration alone while having enough information about the
> system to make appropriate decisions.  So I'd view the goal to be:
>
>  1)  The controller assembles information, based on user input, that has
> both VM cluster and Hadoop cluster information.
> 2)  The VM cluster configuration is passed to a VM provisioning component.
>  The output of that invocation is a VM cluster spec with server instances
> that provide information about their characteristics.
> 3)  The controller passes the Hadoop cluster configuration (either
> standard or advanced) and the VM cluster spec to the Hadoop plugin.
> 4)  The plugin leverages the configuration it is provided, and the set of
> VMs it is made aware of via the VM cluster spec, to execute the appropriate
> package installations, configuration file edits, etc. to set up the Hadoop
> cluster on the given VMs.  (A sketch of this flow follows below.)
>
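> A skeletal sketch of that flow (the function and field names here are
> invented for illustration, not an actual Savanna API):
>
>     def create_cluster(user_input, vm_provisioner, hadoop_plugin):
>         # 1) Split user input into VM-cluster and Hadoop-cluster config.
>         vm_config = user_input["vm_config"]
>         hadoop_config = user_input["hadoop_config"]
>
>         # 2) Provision vanilla VMs; the returned spec describes each
>         #    instance (flavor, image, networking, rack, ...).
>         vm_cluster_spec = vm_provisioner.provision(vm_config)
>
>         # 3)/4) Hand both pieces to the plugin, which maps roles to VMs
>         #    and performs the Hadoop install/configuration.
>         hadoop_plugin.setup_hadoop(hadoop_config, vm_cluster_spec)
>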
>  I think this allows for the cleanest separation of responsibilities and
> for the most effective and extensible design for Savanna.  I think we
> should follow this approach to drive the structures we come up with to
> designate the cluster and Hadoop configurations.
>
>
>  Hierarchical node/cluster templates (see
> https://blueprints.launchpad.net/savanna/+spec/hierarchical-templates)
> were designed specifically to support both Hadoop and OpenStack advanced
> configurations.
>
>
>  We don't object to the template approach.  It'll probably cover a great
> deal of the scenarios we may encounter.  However, we've just been through
> enough similar efforts to realize that:
>
>  1)  There are always edge cases that need the most flexible approach
> 2)  Users like to use existing assets (e.g. Ambari blueprints they've
> already assembled in a non-openstack/VM environment).  They will resent or
> resist having to learn a new management mechanism on top of the one they
> already understand and implement.
>
>  If you think that the current design misses something, or that something
> doesn't allow supporting the "Hadoop Blueprint Specification", let's
> discuss it. It was designed to support such configurations and it **has to
> support them**.
>
>
>  Thanks,
>  Ruslan
>
>
>  On Sat, May 11, 2013 at 1:17 AM, Jon Maron <jmaron@xxxxxxxxxxxxxxx> wrote:
>
>
>  On May 10, 2013, at 4:35 PM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx>
> wrote:
>
>  Hi John,
>
>  If the controller doesn't know anything about the services which will
> run on the VMs, then it will not be able to place them correctly. The
> whole cluster might end up on one physical machine (or rack).
>
>
>  I don't believe that OpenStack currently has rack awareness?  In
> addition, the controller doesn't need actual service or Hadoop information
> to make a determination about which physical machine to utilize (I think
> that would actually be a mistake and could limit the controller's ability
> to extend to other potential scenarios).  Rather, if we deem it necessary
> we could create some additional VM-specific configuration it can utilize
> to appropriately provision the VMs, independent of the Hadoop
> configuration.
>  We think it'd be a mistake to expect the controller in general to
> interpret Hadoop-specific information (standard or advanced).  The
> controller is simply providing services and managing the cluster creation
> workflow.  There should be a clear VM provisioning element that reads the
> VM-specific configuration and provisions accordingly, and then the Hadoop
> configuration (standard or advanced), along with the VM specs, should be
> passed to the plugin to allow it to proceed with service/component
> installations.
>
>
>  That's why we need to pass a more detailed config to the controller, so
> that it is able to place VMs in the correct locations. And we can't have
> this logic inside the plugin.
>
>
>  I don't quite understand your concern.
>
>  The controller is going to deal with the VM provisioning element and
> request it to create the VMs based on the information provided (number of
> VMs, flavors).   The VM information will then be relayed to the plugin
> within the vm_specs object.   Then, given a list of VMs and their
> characteristics, the plugin will be able to select the appropriate VMs on
> which to install the various Hadoop services, based on predicates available
> within the Hadoop cluster configuration in the advanced configuration file.
>  For example, for the name node the Hadoop configuration may include a
> minimum memory requirement.  The plugin will be able to iterate through
> the list of VMs and find one that has the appropriate amount of memory.
> Once a VM is found that meets all the criteria listed for the given
> component, the installation can proceed.
>
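>  A minimal sketch of that selection step (the property names are invented
> for illustration):
>
>     def pick_vm(vm_specs, min_memory_mb):
>         """Return the first provisioned VM meeting the memory criterion."""
>         for vm in vm_specs:
>             if vm["memory_mb"] >= min_memory_mb:
>                 return vm
>         raise LookupError("no VM satisfies the name node requirements")
>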
>  It was indicated to us that the p
>
>
> --
> Mailing list: https://launchpad.net/~savanna-all
> Post to     : savanna-all@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~savanna-all
> More help   : https://help.launchpad.net/ListHelp
>
