savanna-all team mailing list archive


Re: advanced hadoop configuration

To: Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx>
From: John Speidel <jspeidel@xxxxxxxxxxxxxxx>
Date: Mon, 13 May 2013 16:01:11 -0400
Cc: "savanna-all@xxxxxxxxxxxxxxxxxxx" <savanna-all@xxxxxxxxxxxxxxxxxxx>, Jon Maron <jonmaron@xxxxxxxxx>
In-reply-to: <C52FFE0359CD4D4D8E35C259E5B60B72@mirantis.com>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/20130328 Thunderbird/17.0.5

Thanks Ruslan, we are very happy that there is agreement to include advanced Hadoop configuration functionality in Savanna. Now we can focus on the technical aspects of implementing this functionality.

I would like to take some time to clarify a possible misunderstanding in the advanced configuration approach previously described.

The plugin specific advanced hadoop configuration would not dictate any specific VM configuration. In the advanced use case, the user would still need to specify VM related information independent of the hadoop configuration which would include VM count, flavor of VM's, etc. In the future, we may need to allow additional VM related information to be provided such as rack, physical host, etc., but this would not be in the hadoop configuration. This information is used to provision all VM's. The module responsible for provisioning the VM's would not need any Hadoop related information from the advanced configuration. The VM would be provisioned with no knowledge from the hadoop configuration. It would not know anything about hadoop services, roles, etc. The VM's would basically be Hadoop agnostic. Basically, the VM provisioning module would be responsible for interacting with OpenStack (nova) to provision vanilla VM's based on a VM configuration which would contain instance count, VM images, flavors of the VM's and potentially rack and other topology related information. In the VM configuration it would be possible to specify the image/flavor for all VM's.

At the completion of the VM provisioning step, no Hadoop related configuration/provisioning has occurred on any VM.

After all of the VM's have been provisioned, the Hadoop plugin would be invoked and the advanced Hadoop configuration as well as VM cluster information would be provided to the Hadoop plugin. The VM information would contain specific information about all VM's that have been provisioned in the previous step. For each VM that was provisioned, the VM configuration would contain VM properties that would describe the VM properties to the Hadoop plugin. This would include things such as VM flavor, image, networking information, rack, physical host, etc. The hadoop configuration would specify all hadoop services as well as rules for mapping services/roles to a set of physical VM's, which have already been provisioned. These rules would utilize the properties provided in the VM configuration. For example, a configuration could dictate that all master services would run on a single VM which had a minimum of 2G RAM and that all slave roles run on every other machine. These role mapping rules would be in simple query form such as MEMORY > 2G and DISK_SPACE > X. The rules would not dictate the number of necessary hosts for a given cluster. After the Hadoop provider determined which services/roles would be placed on each VM, the Hadoop provider would be responsible for installing all Hadoop bits, configuring and starting all services on each VM. For HDP, this would be done using Ambari.
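As a rough illustration of the role mapping rules described above, the sketch below evaluates simple queries such as MEMORY > 2G against the properties reported for each provisioned VM. Everything here is an assumption made for illustration: the property names, the normalisation of sizes to megabytes, and the helper names (parse_rule, vm_matches, select_vms) are not part of Savanna or of any plugin.

```python
import operator
import re

# Comparison operators accepted in a rule clause (assumed set).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "==": operator.eq}

# Sizes are normalised to megabytes; this unit handling is an assumption.
_UNITS = {"M": 1, "G": 1024}

_CLAUSE = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*(\d+)([MG]?)\s*$")


def parse_rule(rule):
    """Turn 'MEMORY > 2G and DISK_SPACE > 50G' into (name, op, value) clauses."""
    clauses = []
    for text in re.split(r"\s+and\s+", rule.strip(), flags=re.IGNORECASE):
        match = _CLAUSE.match(text)
        if match is None:
            raise ValueError("cannot parse clause: %r" % text)
        name, op, number, unit = match.groups()
        clauses.append((name, _OPS[op], int(number) * _UNITS.get(unit, 1)))
    return clauses


def vm_matches(vm_properties, rule):
    """True if the VM has every named property and satisfies every clause."""
    return all(name in vm_properties and op(vm_properties[name], value)
               for name, op, value in parse_rule(rule))


def select_vms(vms, rule):
    """Return the already-provisioned VMs that satisfy the rule."""
    return [vm for vm in vms if vm_matches(vm["properties"], rule)]
```

Note that the rule only filters VMs that already exist; consistent with the description above, it never determines how many hosts the cluster has.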

The important thing that I am trying to convey is that VM configuration is distinct from hadoop configuration. The VM plugin would provision vanilla VM's (may use hadoop plugin image with local repos installed) and the Hadoop plugin would map services/roles to the VM's that have been provisioned based on simple rules in the advanced configuration.

If you feel that there is still a compelling reason that the controller would need information from the advanced hadoop configuration to provision VM's, please provide specific details.


Thanks,
-John








On 5/13/13 1:10 PM, Ruslan Kamaldinov wrote:

Jon, John,
We are concerned that the proposed architecture will not allow the user to configure Hadoop and OpenStack at the same time. It allows configuring Hadoop, but not OpenStack: flavor, Swift, etc. It also doesn't allow the user to specify a flavor per node, which is what we usually do when we deploy Hadoop on real hardware.
We understand that advanced Hadoop configuration is an important feature for you. And we absolutely don’t want to restrict this feature.
So, here is how this problem could be avoided:
- User passes advanced Hadoop config to Savanna
- Savanna passes this config to the plugin: plugin.convert_advanced_config()
- Plugin returns a template for the cluster which is understandable by Savanna. The template might contain that advanced config unmodified. It can be just an inner JSON object inside the plugin-specific template. The template should also contain information about the number and types of nodes in the cluster
- User maps OpenStack-specific configuration to this template. For example, disk mapping for HDFS, node placement, flavor of each node (or of node groups).
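The flow in the list above could look roughly like the following sketch. Only the method name convert_advanced_config() comes from this message; the class name, the "host_groups" layout of the advanced config, the template keys, and the apply_openstack_settings helper are hypothetical and chosen purely for illustration.

```python
import copy


class ExamplePlugin(object):
    """Hypothetical plugin; only the method name is taken from the thread."""

    def convert_advanced_config(self, advanced_config):
        # Derive the number and types of nodes from the advanced config.
        # The "host_groups" key is an assumed layout, not a real format.
        node_groups = [
            {"name": group["name"], "count": group["count"]}
            for group in advanced_config.get("host_groups", [])
        ]
        return {
            "node_groups": node_groups,
            # The advanced config travels along unmodified as an inner object.
            "plugin_config": copy.deepcopy(advanced_config),
        }


def apply_openstack_settings(template, settings_by_group):
    """Attach OpenStack-specific settings (e.g. flavor) to each node group."""
    result = copy.deepcopy(template)
    for group in result["node_groups"]:
        group.update(settings_by_group.get(group["name"], {}))
    return result
```

In this sketch the user-supplied OpenStack settings are layered onto a copy of the template, so the plugin's advanced config stays untouched while Savanna still sees node counts and flavors.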
We also like our approach because we will be able to use the same standard flow which is already designed. What do you think?
I understand that the current blueprint for hierarchical templates is not complete and we definitely need to update it. We are working on this. Once the document is updated we hope that Advanced Hadoop configuration will fit into the hierarchical templates architecture.
And we agree with your vision on separation of responsibilities in Savanna:
- Savanna core manages OpenStack
- Plugin manages Hadoop and Hadoop management tool


Thanks,
Ruslan

On Saturday, May 11, 2013 at 8:14 PM, Jon Maron wrote:
It may also be helpful to see a representative sample of a configuration you envision passing to the controller.
On May 11, 2013, at 11:59 AM, John Speidel <jspeidel@xxxxxxxxxxxxxxx> wrote:
Ruslan,
It would be helpful if you could describe how the controller would use the data that you mention, DN placement, HDFS, etc., while provisioning vm's.
Thanks,
John

On 5/11/13 10:09 AM, Jon Maron wrote:
On May 11, 2013, at 8:46 AM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx> wrote:
*
> I don't believe that openstack currently has rack awareness?
This is one of the main goals of Savanna. It is targeted for phase 2. Quote from the roadmap:
/Hadoop cluster topology configuration parameters/
/- Data node placement control/
/- HDFS location/
/- Swift integration/
While your approach targets all the Hadoop related configs, it misses all the OpenStack related configurations. Advanced Hadoop cluster will require advanced OpenStack configuration: Swift, Cinder, placement control, etc. We **have to** give user control over both worlds: Hadoop and OpenStack. Giving control to the plugin means that user will lose control over OpenStack-related configuration.
*
I do not disagree. I just feel that we should strive to have the configuration structures in such a way that the VM configuration element of the controller doesn't need to process Hadoop configuration and the Hadoop plugin doesn't need to comprehend VM related configuration. We are striving for a design that allows each component of the savanna system to process their configuration alone while having enough information about the system to make appropriate decisions. So I'd view the goal to be:
1) controller assembles information, based on user input, that has both VM cluster and Hadoop cluster information.
2) The VM cluster configuration is passed to a VM provisioning component. The output of that invocation is a VM cluster spec with server instances that provide information about their characteristics.
3) The controller passes the Hadoop cluster configuration (either standard or advanced) and the VM cluster spec to the Hadoop plugin.
4) The plugin leverages the configuration it is provided, and the set of VMs it is made aware of via the VM cluster spec, to execute the appropriate package installations, configuration file edits, etc to setup the Hadoop cluster on the given VMs.
I think this allows for the cleanest separation of responsibilities and for the most effective and extensible design for savanna. I think we should follow this approach to drive the structures we come up with to designate the cluster and Hadoop configurations.
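The four steps above can be sketched as a small controller function. This is only an illustration under stated assumptions: FakeVMProvisioner makes no Nova calls, RecordingHadoopPlugin installs nothing, and the class names, method names, and dictionary keys are all hypothetical rather than Savanna interfaces.

```python
class FakeVMProvisioner(object):
    """Stand-in for the VM provisioning component; it makes no Nova calls."""

    def provision(self, vm_cluster_config):
        servers = []
        for index in range(vm_cluster_config["count"]):
            servers.append({
                "id": "vm-%d" % index,
                "flavor": vm_cluster_config["flavor"],
                "image": vm_cluster_config["image"],
            })
        # The returned dictionary plays the role of the "VM cluster spec".
        return {"servers": servers}


class RecordingHadoopPlugin(object):
    """Stand-in plugin that only records what it was asked to configure."""

    def __init__(self):
        self.calls = []

    def configure_cluster(self, hadoop_config, vm_cluster_spec):
        self.calls.append((hadoop_config, vm_cluster_spec))
        return [server["id"] for server in vm_cluster_spec["servers"]]


def create_cluster(user_input, provisioner, plugin):
    # 1) Assemble the VM cluster and Hadoop cluster halves from user input.
    vm_cluster_config = user_input["vm_cluster"]
    hadoop_config = user_input["hadoop_cluster"]
    # 2) Provision VMs using only the VM cluster configuration.
    vm_cluster_spec = provisioner.provision(vm_cluster_config)
    # 3) and 4) Pass the Hadoop config plus the VM cluster spec to the plugin,
    # which would install and configure Hadoop on the given VMs.
    return plugin.configure_cluster(hadoop_config, vm_cluster_spec)
```

The provisioner never sees the Hadoop configuration and the plugin never sees the raw VM configuration, which is the separation argued for above.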
*
Hierarchical node/cluster templates (see https://blueprints.launchpad.net/savanna/+spec/hierarchical-templates) were designed specifically to support both Hadoop and OpenStack advanced configurations.
*
We don't object to the template approach. It'll probably cover a great deal of the scenarios we may encounter. However, we've just been through enough similar efforts to realize that:
1) There are always edge cases that need the most flexible approach
2) Users like to use existing assets (e.g. Ambari blueprints they've already assembled in a non-openstack/VM environment). They will resent or resist having to learn a new management mechanism on top of the one they already understand and implement.
*
If you think that the current design misses something, or that something in it prevents support for the "Hadoop Blueprint Specification", let's discuss it. It was designed to support such configurations and it **has to support them**.
Thanks,
Ruslan

*
On Sat, May 11, 2013 at 1:17 AM, Jon Maron <jmaron@xxxxxxxxxxxxxxx> wrote:
On May 10, 2013, at 4:35 PM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx> wrote:
Hi John,
If controller doesn't know anything about services which will run on VMs, then it will not be able to place them correctly. The whole cluster might end up on one physical machine (or rack).
I don't believe that openstack currently has rack awareness? In addition, the controller doesn't need actual service or hadoop information to make a determination about which physical machine to utilize (I think that would actually be a mistake and could limit the controller's ability to extend to other potential scenarios). Rather, if we deem it necessary we could create some additional VM specific configuration it can utilize to appropriately provision the VMs, independent of the hadoop configuration. We think it'd be a mistake to expect the controller in general to interpret hadoop specific information (standard or advanced). The controller is simply providing services and managing the cluster creation workflow. There should be a clear VM provisioning element that reads the VM specific configuration and provisions accordingly, and then the hadoop configuration (standard or advanced), along with the vm specs, should be passed to the plugin and allow it to proceed with service/component installations.
That's why we need to pass more detailed config to the controller, so it would be able to place VMs in the correct place. And we can't have this logic inside the plugin.
I don't quite understand your concern.
The controller is going to deal with the VM provisioning element and request it to create the VMs based on the information provided (number of VMs, flavors). The VM information will then be relayed to the plugin within the vm_specs object. Then, given a list of VMs and their characteristics, the plugin will be able to select the appropriate VMs to install the various hadoop services based on predicates available within the hadoop cluster configuration within the advanced configuration file. For example, for the name node the hadoop configuration may include a min memory requirement. The plugin will be able to iterate thru the list of VMs and find one that has the appropriate amount of memory. Once a VM is found that meets all the criteria listed for the given component, the installation can proceed.
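A minimal sketch of that selection step, assuming vm_specs is a list of dictionaries and that each component lists its minimum requirements as key/value pairs. The field names (id, memory_mb) and the function name are hypothetical; the thread does not define the actual shape of vm_specs.

```python
def find_vm_for_component(vm_specs, requirements, taken=()):
    """Return the first unused VM meeting every minimum requirement, or None."""
    for vm in vm_specs:
        if vm["id"] in taken:
            continue  # this VM already hosts another component
        if all(vm.get(key, 0) >= minimum
               for key, minimum in requirements.items()):
            return vm
    return None
```

For instance, a name node entry with a minimum memory requirement would be passed as requirements, and the installation would proceed on whichever VM is returned.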
It was indicated to us that the p
--
Mailing list: https://launchpad.net/~savanna-all
Post to     : savanna-all@xxxxxxxxxxxxxxxxxxx
Unsubscribe : https://launchpad.net/~savanna-all
More help   : https://help.launchpad.net/ListHelp

Follow ups

Re: advanced hadoop configuration
From: Nirmal Ranganathan, 2013-05-16
Re: advanced hadoop configuration
From: Dmitry Mescheryakov, 2013-05-14

References

advanced hadoop configuration
From: John Speidel, 2013-05-08
Re: advanced hadoop configuration
From: Jon Maron, 2013-05-10
Re: advanced hadoop configuration
From: Ruslan Kamaldinov, 2013-05-10
Re: advanced hadoop configuration
From: Jon Maron, 2013-05-10
Re: advanced hadoop configuration
From: Ruslan Kamaldinov, 2013-05-10
Re: advanced hadoop configuration
From: Jon Maron, 2013-05-10
Re: advanced hadoop configuration
From: Ruslan Kamaldinov, 2013-05-11
Re: advanced hadoop configuration
From: John Speidel, 2013-05-11
Re: advanced hadoop configuration
From: Jon Maron, 2013-05-11
Re: advanced hadoop configuration
From: Ruslan Kamaldinov, 2013-05-13