
savanna-all team mailing list archive

Re: advanced hadoop configuration

 

Jon, John,

We are concerned that the proposed architecture will not allow the user to configure Hadoop and OpenStack at the same time. It allows configuring Hadoop, but not OpenStack: flavor, Swift, etc. It also doesn't let the user specify a flavor per node, which is what we usually do when we deploy Hadoop on real hardware.

We understand that advanced Hadoop configuration is an important feature for you, and we absolutely don't want to restrict it.

So, here is how this problem could be avoided:
- The user passes the advanced Hadoop config to Savanna.
- Savanna passes this config to the plugin via plugin.convert_advanced_config().
- The plugin returns a template for the cluster which is understandable by Savanna. The template may contain that advanced config unmodified, e.g. as an inner JSON object inside the plugin-specific template. The template should also contain information about the number and types of nodes in the cluster.
- The user maps OpenStack-specific configuration onto this template: for example, disk mapping for HDFS, node placement, and the flavor of each node (or of node groups). A rough sketch of this flow follows right after this list.
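
To make this concrete, here is a minimal sketch of the flow. Only the plugin.convert_advanced_config() call comes from the steps above; every field and value below is made up for illustration and is not the actual Savanna API:

# Hypothetical sketch; all names here are illustrative, not real Savanna code.

# 1. Advanced Hadoop config supplied by the user, passed through as-is.
advanced_hadoop_config = {
    "hdfs-site": {"dfs.replication": 3},
    "mapred-site": {"mapred.tasktracker.map.tasks.maximum": 4},
}

# 2. The plugin converts it into a template Savanna understands: the advanced
#    config stays untouched as an inner JSON object, and the template also
#    describes the number and types of nodes in the cluster.
def convert_advanced_config(advanced_config):
    return {
        "hadoop_config": advanced_config,  # inner object, unmodified
        "node_groups": [
            {"name": "master", "processes": ["namenode", "jobtracker"], "count": 1},
            {"name": "worker", "processes": ["datanode", "tasktracker"], "count": 10},
        ],
    }

# 3. The user maps OpenStack-specific settings onto the template: flavor per
#    node group, Cinder volumes for HDFS, placement hints, etc.
template = convert_advanced_config(advanced_hadoop_config)
template["node_groups"][0]["flavor"] = "m1.large"
template["node_groups"][1].update({"flavor": "m1.medium",
                                   "volumes_per_node": 2,
                                   "anti_affinity": ["datanode"]})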

We also like our approach because we will be able to reuse the same standard flow which has already been designed. What do you think?

I understand that the current blueprint for hierarchical templates is not complete, and we definitely need to update it. We are working on this. Once the document is updated, we hope that advanced Hadoop configuration will fit into the hierarchical templates architecture.


And we agree with your vision of the separation of responsibilities in Savanna:
- Savanna core manages OpenStack
- The plugin manages Hadoop and the Hadoop management tool (a rough sketch of this split follows below)
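
As a rough illustration of that split, here is a minimal sketch. The class and method names below (apart from convert_advanced_config, mentioned earlier) are assumptions, not the actual Savanna code:

# Illustrative sketch only; names are assumptions, not real Savanna classes.
import abc


class HadoopPluginBase(abc.ABC):
    """Plugin side: knows Hadoop and the Hadoop management tool,
    never calls Nova/Cinder/Swift directly."""

    @abc.abstractmethod
    def convert_advanced_config(self, advanced_config):
        """Turn an advanced Hadoop config into a Savanna-readable template."""

    @abc.abstractmethod
    def configure_cluster(self, template, vm_specs):
        """Install and configure Hadoop on already-provisioned VMs."""


class SavannaCoreSketch:
    """Core side: talks to OpenStack (flavors, volumes, placement) and
    hands the resulting VM specs to the plugin."""

    def provision_vms(self, template):
        # Placeholder for the Nova/Cinder calls driven by the template's
        # OpenStack-specific fields (flavor, volumes, placement).
        return [{"id": "vm-%d-%d" % (g, i), "flavor": ng.get("flavor")}
                for g, ng in enumerate(template["node_groups"])
                for i in range(ng["count"])]

    def launch_cluster(self, template, plugin):
        vm_specs = self.provision_vms(template)       # OpenStack work
        plugin.configure_cluster(template, vm_specs)  # Hadoop work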


Thanks,
Ruslan



On Saturday, May 11, 2013 at 8:14 PM, Jon Maron wrote:

> It may also be helpful to see a representative sample of a configuration you envision passing to the controller.  
>  
> On May 11, 2013, at 11:59 AM, John Speidel <jspeidel@xxxxxxxxxxxxxxx> wrote:
>  
> > Ruslan,
> >  
> > It would be helpful if you could describe how the controller would use the data you mention (DN placement, HDFS, etc.) while provisioning VMs.
> >  
> > Thanks,
> > John
> >  
> > On 5/11/13 10:09 AM, Jon Maron wrote:
> > >  
> > > On May 11, 2013, at 8:46 AM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx> wrote:  
> > > > > I don't believe that OpenStack currently has rack awareness?    
> > > > This is one of the main goals of Savanna. It is targeted for phase 2. Quote from the roadmap:
> > > > Hadoop cluster topology configuration parameters
> > > > - Data node placement control
> > > > - HDFS location
> > > > - Swift integration  
> > > >  
> > > >  
> > > > While your approach targets all the Hadoop-related configs, it misses all the OpenStack-related configuration. An advanced Hadoop cluster will require advanced OpenStack configuration: Swift, Cinder, placement control, etc.  
> > > > We **have to** give the user control over both worlds: Hadoop and OpenStack. Giving control to the plugin means that the user will lose control over OpenStack-related configuration.
> > > >  
> > > >  
> > >  
> > >  
> > > I do not disagree.  I just feel that we should strive to structure the configuration in such a way that the VM configuration element of the controller doesn't need to process Hadoop configuration and the Hadoop plugin doesn't need to comprehend VM-related configuration.  We are striving for a design that allows each component of the Savanna system to process its configuration alone while having enough information about the system to make appropriate decisions.  So I'd view the goal to be:  
> > >  
> > > 1)  The controller assembles information, based on user input, that contains both VM cluster and Hadoop cluster information.  
> > > 2)  The VM cluster configuration is passed to a VM provisioning component.  The output of that invocation is a VM cluster spec with server instances that provide information about their characteristics.
> > > 3)  The controller passes the Hadoop cluster configuration (either standard or advanced) and the VM cluster spec to the Hadoop plugin.
> > > 4)  The plugin leverages the configuration it is provided, and the set of VMs it is made aware of via the VM cluster spec, to execute the appropriate package installations, configuration file edits, etc., to set up the Hadoop cluster on the given VMs. (A rough sketch of this flow follows below.)
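> > >  
> > > In pseudo-Python, purely to illustrate the division of labor (all names below are made up, not actual Savanna code):
> > >  
> > > # Hypothetical controller workflow; illustrative only.
> > > def create_cluster(user_input, vm_provisioner, hadoop_plugin):
> > >     # 1) Assemble VM cluster and Hadoop cluster information from user input.
> > >     vm_cluster_config = user_input["vm_cluster"]
> > >     hadoop_config = user_input["hadoop_cluster"]  # standard or advanced
> > >  
> > >     # 2) Provision VMs from the VM configuration alone; the result is a
> > >     #    VM cluster spec describing each server instance's characteristics.
> > >     vm_cluster_spec = vm_provisioner.provision(vm_cluster_config)
> > >  
> > >     # 3) + 4) Hand the Hadoop configuration and the VM cluster spec to the
> > >     #    plugin, which installs packages and edits config files on those
> > >     #    VMs without ever touching OpenStack APIs.
> > >     hadoop_plugin.install(hadoop_config, vm_cluster_spec)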
> > >  
> > > I think this allows for the cleanest separation of responsibilities and the most effective, extensible design for Savanna.  I think we should follow this approach to drive the structures we come up with to designate the cluster and Hadoop configurations.  
> > > >  
> > > >  
> > > > Hierarchical node/cluster templates (see https://blueprints.launchpad.net/savanna/+spec/hierarchical-templates) were designed specifically to support both Hadoop and OpenStack advanced configurations.  
> > >  
> > > We don't object to the template approach.  It'll probably cover a great deal of the scenarios we may encounter.  However, we've just been through enough similar efforts to realize that:  
> > >  
> > > 1)  There are always edge cases that need the most flexible approach  
> > > 2)  Users like to use existing assets (e.g. Ambari blueprints they've already assembled in a non-OpenStack/VM environment).  They will resent or resist having to learn a new management mechanism on top of the one they already understand and implement.
> > >  
> > > >  
> > > > If you think that the current design misses something, or that something prevents supporting the "Hadoop Blueprint Specification", let's discuss it. It was designed to support such configurations and it **has to** support them.  
> > > >  
> > > > Thanks,  
> > > > Ruslan
> > > >  
> > > >  
> > > > On Sat, May 11, 2013 at 1:17 AM, Jon Maron <jmaron@xxxxxxxxxxxxxxx> wrote:
> > > > >  
> > > > > On May 10, 2013, at 4:35 PM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx> wrote:  
> > > > > > Hi John,  
> > > > > >  
> > > > > > If the controller doesn't know anything about the services that will run on the VMs, then it will not be able to place them correctly. The whole cluster might end up on one physical machine (or rack).  
> > > > > >  
> > > > >  
> > > > > I don't believe that OpenStack currently has rack awareness?  In addition, the controller doesn't need actual service or Hadoop information to make a determination about which physical machine to utilize (I think that would actually be a mistake and could limit the controller's ability to extend to other potential scenarios).  Rather, if we deem it necessary we could create some additional VM-specific configuration it can utilize to appropriately provision the VMs, independent of the Hadoop configuration.  We think it'd be a mistake to expect the controller in general to interpret Hadoop-specific information (standard or advanced).  The controller is simply providing services and managing the cluster creation workflow.  There should be a clear VM provisioning element that reads the VM-specific configuration and provisions accordingly, and then the Hadoop configuration (standard or advanced), along with the VM specs, should be passed to the plugin to allow it to proceed with service/component installations.  
> > > > >  
> > > > > >  
> > > > > > That's why we need to pass a more detailed config to the controller, so it would be able to place the VMs correctly. And we can't have this logic inside the plugin.  
> > > > >  
> > > > > I don't quite understand your concern.  
> > > > >  
> > > > > The controller is going to deal with the VM provisioning element and request it to create the VMs based on the information provided (number of VMs, flavors).   The VM information will then be relayed to the plugin within the vm_specs object.   Then, given a list of VMs and their characteristics, the plugin will be able to select the appropriate VMs on which to install the various Hadoop services, based on predicates available within the Hadoop cluster configuration in the advanced configuration file.  For example, for the name node the Hadoop configuration may include a minimum memory requirement.  The plugin will be able to iterate through the list of VMs and find one that has the appropriate amount of memory.  Once a VM is found that meets all the criteria listed for the given component, the installation can proceed.  A rough sketch of that matching step follows below.  
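> > > > >  
> > > > > For illustration only (made-up names, not an actual implementation), the matching could look like:
> > > > >  
> > > > > # Hypothetical predicate-based VM selection; illustrative only.
> > > > > def select_vm(vm_specs, requirements):
> > > > >     """Return the first VM satisfying every requirement, e.g.
> > > > >     requirements = {"min_memory_mb": 8192} for the name node."""
> > > > >     for vm in vm_specs:
> > > > >         if vm["memory_mb"] >= requirements.get("min_memory_mb", 0):
> > > > >             return vm
> > > > >     raise LookupError("no VM meets the component's requirements")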
> > > > >  
> > > > > It was indicated to us that the p
>  
>  


