
savanna-all team mailing list archive

Re: advanced hadoop configuration

 

It may also be helpful to see a representative sample of a configuration you envision passing to the controller. 
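
For instance, something roughly of this shape -- a purely hypothetical sketch on my part, where the key names and layout are guesses rather than any agreed Savanna format:

    # Hypothetical example only -- not an agreed format.
    cluster_request = {
        "vm_cluster": {                  # consumed by the VM provisioning component
            "node_count": 4,
            "flavor": "m1.large",
            "placement": {"anti_affinity": True},
        },
        "hadoop_cluster": {              # passed through, opaque, to the Hadoop plugin
            "name_node": {"min_memory_mb": 8192},
            "data_node": {"count": 3},
            "hdfs-site": {"dfs.replication": "3"},
        },
    }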

On May 11, 2013, at 11:59 AM, John Speidel <jspeidel@xxxxxxxxxxxxxxx> wrote:

> Ruslan,
> 
> It would be helpful if you could describe how the controller would use the data that you mention (DN placement, HDFS, etc.) while provisioning VMs.
> 
> Thanks,
> John
> 
> On 5/11/13 10:09 AM, Jon Maron wrote:
>> 
>> On May 11, 2013, at 8:46 AM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx> wrote:
>> 
>>> > I don't believe that openstack currently has rack awareness?
>>> 
>>> This is one of the main goals of Savanna. It is targeted for phase 2.
>>> Quote from the roadmap:
>>> 
>>> Hadoop cluster topology configuration parameters
>>> - Data node placement control
>>> - HDFS location
>>> - Swift integration
>>> 
>>> 
>>> While your approach targets all the Hadoop-related configs, it misses
>>> all the OpenStack-related configuration. An advanced Hadoop cluster will
>>> require advanced OpenStack configuration: Swift, Cinder, placement
>>> control, etc.
>>> 
>>> We **have to** give the user control over both worlds: Hadoop and
>>> OpenStack. Giving control to the plugin means that the user will lose
>>> control over the OpenStack-related configuration.
>> 
>> I do not disagree.  I just feel that we should strive to structure the configuration in such a way that the VM configuration element of the controller doesn't need to process Hadoop configuration and the Hadoop plugin doesn't need to comprehend VM-related configuration.  We are striving for a design that allows each component of the savanna system to process its own configuration while still having enough information about the system to make appropriate decisions.  So I'd view the goal to be:
>> 
>> 1)  The controller assembles information, based on user input, that covers both the VM cluster and the Hadoop cluster.
>> 2)  The VM cluster configuration is passed to a VM provisioning component.  The output of that invocation is a VM cluster spec with server instances that provide information about their characteristics.
>> 3)  The controller passes the Hadoop cluster configuration (either standard or advanced) and the VM cluster spec to the Hadoop plugin.
>> 4)  The plugin leverages the configuration it is provided, and the set of VMs it is made aware of via the VM cluster spec, to execute the appropriate package installations, configuration file edits, etc., to set up the Hadoop cluster on the given VMs.
>> 
>> I think this allows for the cleanest separation of responsibilities and for the most effective and extensible design for savanna.  I think we should follow this approach to drive the structures we come up with to designate the cluster and Hadoop configurations.
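>> 
>> To make that concrete, here's a minimal sketch of the flow in Python -- the names (VMSpec, provision, install, create_cluster) are placeholders of mine, not actual Savanna APIs:
>> 
>>     from dataclasses import dataclass
>> 
>>     @dataclass
>>     class VMSpec:
>>         hostname: str
>>         memory_mb: int
>>         vcpus: int
>> 
>>     def create_cluster(vm_config, hadoop_config, provisioner, plugin):
>>         # steps 1-2: the controller hands only the VM portion to the
>>         # provisioning component, which returns a list of VMSpec objects
>>         vm_specs = provisioner.provision(vm_config)
>>         # steps 3-4: the Hadoop portion plus the resulting VM specs go to
>>         # the plugin; it never interprets OpenStack-level configuration
>>         plugin.install(hadoop_config, vm_specs)
>>         return vm_specs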
>> 
>>> 
>>> 
>>> Hierarchical node/cluster templates (see
>>> https://blueprints.launchpad.net/savanna/+spec/hierarchical-templates)
>>> were designed specifically to support both Hadoop and OpenStack advanced
>>> configurations.
>> 
>> We don't object to the template approach.  It'll probably cover a great deal of the scenarios we may encounter.  However, we've just been through enough similar efforts to realize that:
>> 
>> 1)  There are always edge cases that need the most flexible approach
>> 2)  Users like to use existing assets (e.g. Ambari blueprints they've already assembled in a non-openstack/VM environment).  They will resent or resist having to learn a new management mechanism on top of the one they already understand and implement.
>> 
>>> 
>>> If you think that the current design misses something, that something in
>>> it prevents supporting the "Hadoop Blueprint Specification", let's
>>> discuss it. It was designed to support such configurations and it **has
>>> to support them**.
>>> 
>>> 
>>> Thanks,
>>> Ruslan
>>> 
>>> 
>>> On Sat, May 11, 2013 at 1:17 AM, Jon Maron <jmaron@xxxxxxxxxxxxxxx> wrote:
>>>> 
>>>> On May 10, 2013, at 4:35 PM, Ruslan Kamaldinov <rkamaldinov@xxxxxxxxxxxx> wrote:
>>>> 
>>>>> Hi John,
>>>>> 
>>>>> If the controller doesn't know anything about the services which will run on the VMs, then it will not be able to place them correctly. The whole cluster might end up on one physical machine (or rack).
>>>> 
>>>> I don't believe that openstack currently has rack awareness?  In addition, the controller doesn't need actual service or hadoop information to make a determination about which physical machine to utilize (I think that would actually be a mistake and could limit the controller's ability to extend to other potential scenarios).  Rather, if we deem it necessary we could create some additional VM-specific configuration it can utilize to appropriately provision the VMs, independent of the hadoop configuration.  We think it'd be a mistake to expect the controller in general to interpret hadoop-specific information (standard or advanced).  The controller is simply providing services and managing the cluster creation workflow.  There should be a clear VM provisioning element that reads the VM-specific configuration and provisions accordingly; the hadoop configuration (standard or advanced), along with the VM specs, should then be passed to the plugin, allowing it to proceed with service/component installations.
>>>> 
>>>>> 
>>>>> That's why we need to pass a more detailed config to the controller, so that it is able to place the VMs in the correct locations. And we can't have this logic inside the plugin.
>>>> 
>>>> I don't quite understand your concern.
>>>> 
>>>> The controller is going to deal with the VM provisioning element and request it to create the VMs based on the information provided (number of VMs, flavors).  The VM information will then be relayed to the plugin within the vm_specs object.  Then, given a list of VMs and their characteristics, the plugin will be able to select the appropriate VMs on which to install the various hadoop services, based on predicates available within the hadoop cluster configuration in the advanced configuration file.  For example, for the name node the hadoop configuration may include a minimum memory requirement.  The plugin will be able to iterate through the list of VMs and find one that has the appropriate amount of memory.  Once a VM is found that meets all the criteria listed for the given component, the installation can proceed.
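>>>> 
>>>> Roughly, the selection could look like this (a sketch with invented names -- select_vm and the requirements dict are illustrations, not an existing plugin API):
>>>> 
>>>>     def select_vm(vm_specs, requirements):
>>>>         """Return the first VM whose attributes meet every minimum requirement."""
>>>>         for vm in vm_specs:
>>>>             if all(getattr(vm, attr, 0) >= minimum
>>>>                    for attr, minimum in requirements.items()):
>>>>                 return vm
>>>>         raise LookupError("no VM satisfies %s" % (requirements,))
>>>> 
>>>>     # e.g. a minimum memory predicate for the name node, taken from the
>>>>     # advanced hadoop configuration:
>>>>     # name_node_vm = select_vm(vm_specs, {"memory_mb": 8192})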
>>>> 
>>>> It was indicated to us that the p
> -- 
> Mailing list: https://launchpad.net/~savanna-all
> Post to     : savanna-all@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~savanna-all
> More help   : https://help.launchpad.net/ListHelp
