
savanna-all team mailing list archive

Re: advanced hadoop configuration


Thanks Ruslan, we are very happy that there is agreement to include advanced Hadoop configuration functionality in Savanna. Now we can focus on the technical aspects of implementing this functionality.

I would like to take some time to clarify a possible misunderstanding in the advanced configuration approach previously described.

The plugin-specific advanced Hadoop configuration would not dictate any VM configuration. In the advanced use case, the user would still need to specify VM-related information (VM count, flavors, etc.) independently of the Hadoop configuration. In the future we may need to allow additional VM-related information such as rack, physical host, etc., but this would not be part of the Hadoop configuration either. This information is used to provision all VMs. The module responsible for provisioning the VMs would not need any Hadoop-related information from the advanced configuration; the VMs would be provisioned with no knowledge of Hadoop services, roles, etc. They would basically be Hadoop-agnostic. In short, the VM provisioning module would be responsible for interacting with OpenStack (nova) to provision vanilla VMs based on a VM configuration containing instance count, VM images, flavors, and potentially rack and other topology-related information. The VM configuration would make it possible to specify the image/flavor for all VMs.
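To make the separation concrete, here is a minimal sketch of what a Hadoop-agnostic VM configuration and provisioning step might look like. All names and fields here are illustrative assumptions, not Savanna's actual API; the point is only that nothing Hadoop-related appears.

```python
# Hypothetical VM configuration: counts, images, flavors, topology hints.
# No services, no roles, nothing Hadoop-specific.
vm_config = {
    "instance_count": 4,
    "default_image": "centos-6.4-base",
    "default_flavor": "m1.large",
    # Possible future topology hints; still Hadoop-agnostic.
    "placement_hints": {"rack_aware": True},
}

def provision_vms(config):
    """Return plain VM records; the provisioner never sees Hadoop config."""
    return [
        {
            "id": "vm-%d" % i,
            "image": config["default_image"],
            "flavor": config["default_flavor"],
        }
        for i in range(config["instance_count"])
    ]

vms = provision_vms(vm_config)
```

The provisioner's output is just a list of vanilla VM records that a later step can inspect.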

At the completion of the VM provisioning step, no Hadoop related configuration/provisioning has occurred on any VM.

After all of the VMs have been provisioned, the Hadoop plugin would be invoked and given the advanced Hadoop configuration as well as the VM cluster information. The VM information would contain specifics about every VM provisioned in the previous step: for each VM, properties such as flavor, image, networking information, rack, physical host, etc. would be described to the Hadoop plugin. The Hadoop configuration would specify all Hadoop services as well as rules for mapping services/roles onto the set of VMs that have already been provisioned. These rules would utilize the properties provided in the VM information. For example, a configuration could dictate that all master services run on a single VM with a minimum of 2 GB of RAM and that all slave roles run on every other machine. These role-mapping rules would be in simple query form, such as MEMORY > 2G and DISK_SPACE > X. The rules would not dictate the number of hosts required for a given cluster. After the Hadoop provider determined which services/roles to place on each VM, it would be responsible for installing all the Hadoop bits, then configuring and starting all services on each VM. For HDP, this would be done using Ambari.
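The "simple query form" rules described above could be evaluated roughly as follows. This is only an illustrative sketch (the predicate representation and property names are assumptions), showing masters mapped to VMs matching a memory rule and slaves to the remainder:

```python
# Evaluate role-mapping rules against already-provisioned VM properties.
# A rule is a list of (property, operator, value) predicates, e.g. the
# textual "MEMORY > 2G" becomes ("memory_mb", ">", 2048).
def matches(vm_props, rule):
    """True if the VM satisfies every predicate in the rule."""
    ops = {">": lambda a, b: a > b, ">=": lambda a, b: a >= b}
    return all(ops[op](vm_props[prop], value) for prop, op, value in rule)

vms = [
    {"name": "vm-1", "memory_mb": 4096, "disk_gb": 100},
    {"name": "vm-2", "memory_mb": 1024, "disk_gb": 40},
    {"name": "vm-3", "memory_mb": 2048, "disk_gb": 80},
]

# "All master services on a single VM with more than 2 GB RAM."
master_rule = [("memory_mb", ">", 2048)]
masters = [vm for vm in vms if matches(vm, master_rule)]

# "All slave roles run on every other machine."
slaves = [vm for vm in vms if vm not in masters]
```

Note the rule never states how many hosts the cluster needs; it only filters the VMs that already exist.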

The important thing that I am trying to convey is that the VM configuration is distinct from the Hadoop configuration. The VM plugin would provision vanilla VMs (possibly using a Hadoop plugin image with local repos installed), and the Hadoop plugin would map services/roles onto the provisioned VMs based on simple rules in the advanced configuration.

If you feel that there is still a compelling reason for the controller to need information from the advanced Hadoop configuration in order to provision VMs, please provide specific details.


On 5/13/13 1:10 PM, Ruslan Kamaldinov wrote:
Jon, John,

We are concerned that the proposed architecture will not allow the user to configure Hadoop and OpenStack at the same time. It allows configuring Hadoop, but it doesn't allow configuring OpenStack: flavor, Swift, etc. It also doesn't allow the user to specify a flavor per node, which is what we usually do when we deploy Hadoop on real hardware.

We understand that advanced Hadoop configuration is an important feature for you, and we absolutely don't want to restrict this feature.

So, here is how this problem could be avoided:
- User passes advanced Hadoop config to Savanna
- Savanna passes this config to Plugin plugin.convert_advanced_config()
- Plugin returns a template for the cluster which is understandable by Savanna. The template might contain that advanced config unmodified; it can be just an inner JSON object inside the plugin-specific template. The template should also contain information about the number and types of nodes in the cluster
- User maps OpenStack-specific configuration onto this template: for example, disk mapping for HDFS, node placement, flavor of each node (or of node groups)
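The flow above might look roughly like this. The function name `convert_advanced_config` comes from the proposal; everything else (field names, node-group structure) is an assumption for illustration, not the actual Savanna plugin API:

```python
# Sketch of the proposed flow: Savanna hands the advanced Hadoop config to
# the plugin, which returns a Savanna-readable template embedding that
# config unmodified as an inner object.
def convert_advanced_config(advanced_config):
    """Plugin-side conversion: produce a template Savanna understands."""
    return {
        # Node counts/types so Savanna knows what to provision.
        "node_groups": [
            {"name": "master", "count": 1},
            {"name": "worker", "count": 3},
        ],
        # The advanced config is passed through untouched for the plugin's
        # later use; Savanna never needs to interpret it.
        "plugin_config": advanced_config,
    }

advanced = {"services": ["NAMENODE", "DATANODE"], "rules": "MEMORY > 2048"}
template = convert_advanced_config(advanced)

# The user then layers OpenStack-specific settings onto the template,
# e.g. a flavor per node group.
template["node_groups"][0]["flavor"] = "m1.xlarge"
template["node_groups"][1]["flavor"] = "m1.large"
```

This keeps the advanced Hadoop config opaque to Savanna while still letting the user attach OpenStack-level choices to each node group.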

We also like this approach because it lets us reuse the same standard flow which is already designed. What do you think?

I understand that the current blueprint for hierarchical templates is not complete and we definitely need to update it. We are working on this. Once the document is updated, we hope that advanced Hadoop configuration will fit into the hierarchical templates architecture.

And we agree with your vision on separation of responsibilities in Savanna:
- Savanna core manages OpenStack
- Plugin manages Hadoop and Hadoop management tool


On Saturday, May 11, 2013 at 8:14 PM, Jon Maron wrote:

It may also be helpful to see a representative sample of a configuration you envision passing to the controller.

On May 11, 2013, at 11:59 AM, John Speidel <jspeidel@xxxxxxxxxxxxxxx> wrote:


It would be helpful if you could describe how the controller would use the data that you mention (DN placement, HDFS, etc.) while provisioning VMs.


On 5/11/13 10:09 AM, Jon Maron wrote:

On May 11, 2013, at 8:46 AM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx> wrote:

> I don't believe that openstack currently has rack awareness?
This is one of the main goals of Savanna. It is targeted for phase 2. Quote from the roadmap:
Hadoop cluster topology configuration parameters
- Data node placement control
- HDFS location
- Swift integration

While your approach targets all the Hadoop-related configs, it misses all the OpenStack-related configuration. An advanced Hadoop cluster will require advanced OpenStack configuration: Swift, Cinder, placement control, etc. We **have to** give the user control over both worlds: Hadoop and OpenStack. Giving control to the plugin means that the user will lose control over the OpenStack-related configuration.

I do not disagree. I just feel that we should strive to structure the configuration in such a way that the VM configuration element of the controller doesn't need to process Hadoop configuration and the Hadoop plugin doesn't need to comprehend VM-related configuration. We are striving for a design that allows each component of the Savanna system to process its configuration alone while having enough information about the system to make appropriate decisions. So I'd view the goal to be:

1) The controller assembles information, based on user input, that has both VM cluster and Hadoop cluster information.
2) The VM cluster configuration is passed to a VM provisioning component. The output of that invocation is a VM cluster spec with server instances that provide information about their characteristics.
3) The controller passes the Hadoop cluster configuration (either standard or advanced) and the VM cluster spec to the Hadoop plugin.
4) The plugin leverages the configuration it is provided, and the set of VMs it is made aware of via the VM cluster spec, to execute the appropriate package installations, configuration file edits, etc. to set up the Hadoop cluster on the given VMs.
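The four steps above can be sketched end to end as follows. This is a toy illustration under assumed names (`provision`, `HadoopPlugin`, `controller_flow` are all hypothetical), intended only to show which component touches which configuration:

```python
# Step 2: the VM provisioning component sees only VM configuration and
# produces a VM cluster spec describing the provisioned instances.
def provision(vm_config):
    return {"instances": [{"id": i, "flavor": vm_config["flavor"]}
                          for i in range(vm_config["count"])]}

# Steps 3-4: the plugin sees the Hadoop configuration plus the VM cluster
# spec, and configures Hadoop on the instances it is told about.
class HadoopPlugin:
    def configure_cluster(self, hadoop_config, vm_cluster_spec):
        # Install packages, edit config files, start services per VM.
        return ["configured %s" % inst["id"]
                for inst in vm_cluster_spec["instances"]]

# Step 1: the controller assembles both halves from user input and
# orchestrates the flow without interpreting either configuration itself.
def controller_flow(user_input):
    vm_spec = provision(user_input["vm_config"])
    plugin = HadoopPlugin()
    return plugin.configure_cluster(user_input["hadoop_config"], vm_spec)

result = controller_flow({"vm_config": {"count": 2, "flavor": "m1.large"},
                          "hadoop_config": {"services": ["HDFS"]}})
```

Note that `provision` never receives `hadoop_config`, and the plugin never drives OpenStack: each component processes only its own configuration.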

I think this allows for the cleanest separation of responsibilities and the most effective and extensible design for Savanna. I think we should follow this approach to drive the structures we come up with to designate the cluster and Hadoop configurations.


Hierarchical node/cluster templates (see https://blueprints.launchpad.net/savanna/+spec/hierarchical-templates) were designed specifically to support both Hadoop and OpenStack advanced configurations.

We don't object to the template approach. It'll probably cover a great deal of the scenarios we may encounter. However, we've been through enough similar efforts to realize that:

1) There are always edge cases that need the most flexible approach
2) Users like to use existing assets (e.g. Ambari blueprints they've already assembled in a non-OpenStack/VM environment). They will resent or resist having to learn a new management mechanism on top of the one they already understand and implement.

If you think that the current design misses something, that it doesn't allow supporting the "Hadoop Blueprint Specification", let's discuss it. It was designed to support such configurations and it **has to** support them.



On Sat, May 11, 2013 at 1:17 AM, Jon Maron <jmaron@xxxxxxxxxxxxxxx> wrote:

On May 10, 2013, at 4:35 PM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx> wrote:

Hi John,

If the controller doesn't know anything about the services which will run on the VMs, then it will not be able to place them correctly. The whole cluster might end up on one physical machine (or rack).

I don't believe that OpenStack currently has rack awareness? In addition, the controller doesn't need actual service or Hadoop information to determine which physical machine to utilize (I think that would actually be a mistake and could limit the controller's ability to extend to other potential scenarios). Rather, if we deem it necessary, we could create some additional VM-specific configuration it can utilize to appropriately provision the VMs, independent of the Hadoop configuration. We think it would be a mistake to expect the controller in general to interpret Hadoop-specific information (standard or advanced). The controller is simply providing services and managing the cluster-creation workflow. There should be a clear VM provisioning element that reads the VM-specific configuration and provisions accordingly; the Hadoop configuration (standard or advanced), along with the VM specs, should then be passed to the plugin, which proceeds with service/component installations.

That's why we need to pass a more detailed config to the controller, so it would be able to place VMs in the correct place. And we can't have this logic inside the plugin.

I don't quite understand your concern.

The controller is going to ask the VM provisioning element to create the VMs based on the information provided (number of VMs, flavors). The VM information will then be relayed to the plugin within the vm_specs object. Then, given a list of VMs and their characteristics, the plugin will be able to select the appropriate VMs on which to install the various Hadoop services, based on predicates available within the Hadoop cluster configuration in the advanced configuration file. For example, for the name node the Hadoop configuration may include a minimum memory requirement. The plugin will be able to iterate through the list of VMs and find one that has the appropriate amount of memory. Once a VM is found that meets all the criteria listed for the given component, the installation can proceed.
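The name node example above amounts to a simple search over the vm_specs list. A minimal sketch, with the function name, spec fields, and requirement representation all assumed for illustration:

```python
# Walk the vm_specs the controller relayed to the plugin and pick the
# first VM meeting the name node's minimum-memory requirement taken from
# the advanced configuration.
def select_vm(vm_specs, min_memory_mb):
    for vm in vm_specs:
        if vm["memory_mb"] >= min_memory_mb:
            return vm
    raise LookupError("no VM satisfies the component's requirements")

vm_specs = [
    {"id": "vm-a", "memory_mb": 1024},
    {"id": "vm-b", "memory_mb": 8192},
]

# e.g. the advanced config requires at least 4 GB for the name node.
namenode_vm = select_vm(vm_specs, min_memory_mb=4096)
```

Once a VM satisfying every listed criterion is found, installation of that component can proceed on it.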

It was indicated to us that the p
Mailing list: https://launchpad.net/~savanna-all
Post to: savanna-all@xxxxxxxxxxxxxxxxxxx
Unsubscribe: https://launchpad.net/~savanna-all
More help: https://help.launchpad.net/ListHelp
