openstack team mailing list archive

Re: [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)

 

 Hi Vish,

 I think I understand your idea.
One service entry with multiple bare-metal compute_node entries is registered at the start of bare-metal nova-compute.
'hypervisor_hostname' must be different for each bare-metal machine (such as 'bare-metal-0001.xxx.com', 'bare-metal-0002.xxx.com', etc.).
But their IP addresses must be the IP address of bare-metal nova-compute, so that an instance is cast
not to the bare-metal machine directly but to bare-metal nova-compute.
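
 The layout described above (one service entry shared by several compute_node entries, with casts going to the proxy) could be sketched roughly as follows. This is only an illustration: the `Service` and `ComputeNode` classes and the `cast_target` helper are hypothetical stand-ins, not nova's models, and the host names are sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    id: int
    host: str  # host name of the bare-metal nova-compute proxy


@dataclass(frozen=True)
class ComputeNode:
    id: int
    service_id: int           # many compute nodes may share one service
    hypervisor_hostname: str  # unique per bare-metal machine


def cast_target(node, services):
    """Return the host that should receive the cast for this node.

    The message goes to the proxy nova-compute, never to the
    bare-metal machine itself.
    """
    return services[node.service_id].host


proxy = Service(id=3, host="bespin101")
services = {proxy.id: proxy}
nodes = [
    ComputeNode(1, proxy.id, "bare-metal-0001.xxx.com"),
    ComputeNode(2, proxy.id, "bare-metal-0002.xxx.com"),
]
targets = {n.hypervisor_hostname: cast_target(n, services) for n in nodes}
```

 Every bare-metal machine keeps its own hypervisor_hostname, while all of them resolve to the same cast target.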

 One extension we need on the scheduler side is to use (host, hypervisor_hostname) instead of (host) only in host_manager.py.
'HostManager.service_state' is { <host> : { <service> : { cap k : v }}}.
It needs to be changed to { <host> : { <service> : { <hypervisor_hostname> : { cap k : v }}}}.

Most functions of HostState need to be changed to use the (host, hypervisor_hostname) pair to identify a compute node.
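
 A minimal sketch of the proposed three-level mapping, assuming capabilities are plain dictionaries. The `ServiceStates` class below is a hypothetical toy, not the real HostManager, and the capability values are made up.

```python
from collections import defaultdict


class ServiceStates:
    """Toy capability store keyed by host, service and hypervisor host name."""

    def __init__(self):
        # host -> service -> hypervisor_hostname -> capabilities
        self._states = defaultdict(lambda: defaultdict(dict))

    def update(self, service, host, hypervisor_hostname, capabilities):
        # Updates for different bare-metal nodes behind the same host
        # land in different slots, so they no longer overwrite each other.
        self._states[host][service][hypervisor_hostname] = dict(capabilities)

    def nodes(self, service):
        """Yield ((host, hypervisor_hostname), capabilities) pairs."""
        for host, services in self._states.items():
            for hv_name, caps in services.get(service, {}).items():
                yield (host, hv_name), caps


states = ServiceStates()
states.update("compute", "bespin101", "bare-metal-0001.xxx.com", {"vcpus": 8})
states.update("compute", "bespin101", "bare-metal-0002.xxx.com", {"vcpus": 16})
node_caps = dict(states.nodes("compute"))
```

 With a plain { <host> : { <service> : caps }} mapping, the second update would have replaced the first; here both nodes remain visible under the same host.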

 Are we on the same page, now?

 Thanks,
 David

----- Original Message -----
> Hi David,
> 
> I just checked out the code more extensively and I don't see why you
> need to create a new service entry for each compute_node entry. The
> code in host_manager to get all host states explicitly gets all
> compute_node entries. I don't see any reason why multiple compute_node
> entries can't share the same service. I don't see any place in the
> scheduler that is grabbing records by "service" instead of by "compute
> node", but if there is one that I missed, it should be fairly easy to
> change it.
> 
> The compute_node record is created in the compute/resource_tracker.py
> as of a recent commit, so I think the path forward would be to make
> sure that one of the records is created for each bare metal node by
> the bare metal compute, perhaps by having multiple resource_trackers.
> 
> Vish
> 
> On Aug 27, 2012, at 9:40 AM, David Kang <dkang@xxxxxxx> wrote:
> 
> >
> >  Vish,
> >
> >  I don't think I fully understand your statement.
> > Unless we use different hostnames, (hostname, hypervisor_hostname)
> > must be the same for all bare-metal nodes under a bare-metal nova-compute.
> >
> >  Could you elaborate on the following statement a little bit more?
> >
> >> You would just have to use a little more than hostname. Perhaps
> >> (hostname, hypervisor_hostname) could be used to update the entry?
> >>
> >
> >  Thanks,
> >  David
> >
> >
> >
> > ----- Original Message -----
> >> I would investigate changing the capabilities to key off of something
> >> other than hostname. It looks from the table structure like
> >> compute_nodes could have a many-to-one relationship with services.
> >> You would just have to use a little more than hostname. Perhaps
> >> (hostname, hypervisor_hostname) could be used to update the entry?
> >>
> >> Vish
> >>
> >> On Aug 24, 2012, at 11:23 AM, David Kang <dkang@xxxxxxx> wrote:
> >>
> >>>
> >>>  Vish,
> >>>
> >>>  I've tested your code and done some more testing.
> >>> There are a couple of problems.
> >>> 1. The host name must be unique. If not, repeated updates of new
> >>>    capabilities with the same host name simply overwrite each other.
> >>> 2. We cannot generate arbitrary host names on the fly.
> >>>    The scheduler (I tested the filter scheduler) gets host names from
> >>>    the db. So, if a host name is not in the 'services' table, it is not
> >>>    considered by the scheduler at all.
> >>>
> >>> So, to make your suggestions possible, nova-compute should register
> >>> N different host names in the 'services' table,
> >>> and N corresponding entries in the 'compute_nodes' table.
> >>> Here is an example:
> >>>
> >>> mysql> select id, host, binary, topic, report_count, disabled, availability_zone from services;
> >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> >>> | id | host        | binary         | topic     | report_count | disabled | availability_zone |
> >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> >>> |  1 | bespin101   | nova-scheduler | scheduler |        17145 |        0 | nova              |
> >>> |  2 | bespin101   | nova-network   | network   |        16819 |        0 | nova              |
> >>> |  3 | bespin101-0 | nova-compute   | compute   |        16405 |        0 | nova              |
> >>> |  4 | bespin101-1 | nova-compute   | compute   |            1 |        0 | nova              |
> >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> >>>
> >>> mysql> select id, service_id, hypervisor_hostname from compute_nodes;
> >>> +----+------------+------------------------+
> >>> | id | service_id | hypervisor_hostname    |
> >>> +----+------------+------------------------+
> >>> |  1 |          3 | bespin101.east.isi.edu |
> >>> |  2 |          4 | bespin101.east.isi.edu |
> >>> +----+------------+------------------------+
> >>>
> >>>  Then, the nova db (compute_nodes table) has entries for all bare-metal
> >>>  nodes.
> >>> What do you think of this approach?
> >>> Do you have any better approach?
> >>>
> >>>  Thanks,
> >>>  David
> >>>
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> To elaborate, something like the below. I'm not absolutely sure you
> >>>> need to be able to set service_name and host, but this gives you the
> >>>> option to do so if needed.
> >>>>
> >>>> diff --git a/nova/manager.py b/nova/manager.py
> >>>> index c6711aa..c0f4669 100644
> >>>> --- a/nova/manager.py
> >>>> +++ b/nova/manager.py
> >>>> @@ -217,6 +217,8 @@ class SchedulerDependentManager(Manager):
> >>>>
> >>>>      def update_service_capabilities(self, capabilities):
> >>>>          """Remember these capabilities to send on next periodic update."""
> >>>> +        if not isinstance(capabilities, list):
> >>>> +            capabilities = [capabilities]
> >>>>          self.last_capabilities = capabilities
> >>>>
> >>>>      @periodic_task
> >>>> @@ -224,5 +226,8 @@ class SchedulerDependentManager(Manager):
> >>>>          """Pass data back to the scheduler at a periodic interval."""
> >>>>          if self.last_capabilities:
> >>>>              LOG.debug(_('Notifying Schedulers of capabilities ...'))
> >>>> -            self.scheduler_rpcapi.update_service_capabilities(context,
> >>>> -                    self.service_name, self.host, self.last_capabilities)
> >>>> +            for capability_item in self.last_capabilities:
> >>>> +                name = capability_item.get('service_name', self.service_name)
> >>>> +                host = capability_item.get('host', self.host)
> >>>> +                self.scheduler_rpcapi.update_service_capabilities(context,
> >>>> +                        name, host, capability_item)
> >>>>
> >>>> On Aug 21, 2012, at 1:28 PM, David Kang <dkang@xxxxxxx> wrote:
> >>>>
> >>>>>
> >>>>>  Hi Vish,
> >>>>>
> >>>>>  We are trying to change our code according to your comment.
> >>>>> I want to ask a question.
> >>>>>
> >>>>>>>> a) modify driver.get_host_stats to be able to return a list of host
> >>>>>>>> stats instead of just one. Report the whole list back to the
> >>>>>>>> scheduler. We could modify the receiving end to accept a list as well
> >>>>>>>> or just make multiple calls to
> >>>>>>>> self.update_service_capabilities(capabilities)
> >>>>>
> >>>>>  Modifying driver.get_host_stats to return a list of host stats is
> >>>>>  easy.
> >>>>> Making multiple calls to
> >>>>> self.update_service_capabilities(capabilities) doesn't seem to work,
> >>>>> because 'capabilities' is overwritten each time.
> >>>>>
> >>>>>  Modifying the receiving end to accept a list seems to be easy.
> >>>>> However, 'capabilities' is assumed to be a dictionary by all other
> >>>>> scheduler routines, so it looks like we would have to change all of
> >>>>> them to handle 'capabilities' as a list of dictionaries.
> >>>>>
> >>>>>  If my understanding is correct, it would affect many parts of the
> >>>>>  scheduler.
> >>>>> Is that what you recommended?
> >>>>>
> >>>>>  Thanks,
> >>>>>  David
> >>>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> This was an immediate goal. The bare-metal nova-compute node could
> >>>>>> keep an internal database, but report capabilities through nova in
> >>>>>> the common way with the changes below. Then the scheduler wouldn't
> >>>>>> need access to the bare metal database at all.
> >>>>>>
> >>>>>> On Aug 15, 2012, at 4:23 PM, David Kang <dkang@xxxxxxx> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> Hi Vish,
> >>>>>>>
> >>>>>>> Is this discussion about a long-term goal or about this Folsom
> >>>>>>> release?
> >>>>>>>
> >>>>>>> We still believe that a bare-metal database is needed
> >>>>>>> because there is no automated way for bare-metal nodes to report
> >>>>>>> their capabilities to their bare-metal nova-compute node.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> David
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I am interested in finding a solution that enables bare-metal and
> >>>>>>>> virtualized requests to be serviced through the same scheduler where
> >>>>>>>> the compute_nodes table has a full view of schedulable resources. This
> >>>>>>>> would seem to simplify the end-to-end flow while opening up some
> >>>>>>>> additional use cases (e.g. dynamic allocation of a node from
> >>>>>>>> bare-metal to hypervisor and back).
> >>>>>>>>
> >>>>>>>> One approach would be to have a proxy running a single nova-compute
> >>>>>>>> daemon fronting the bare-metal nodes. That nova-compute daemon would
> >>>>>>>> report up many HostState objects (1 per bare-metal node) to become
> >>>>>>>> entries in the compute_nodes table and accessible through the
> >>>>>>>> scheduler HostManager object.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The HostState object would set cpu_info, vcpus, memory_mb and local_gb
> >>>>>>>> values to be used for scheduling with the hypervisor_host field
> >>>>>>>> holding the bare-metal machine address (e.g. for IPMI based commands)
> >>>>>>>> and hypervisor_type = NONE. The bare-metal Flavors are created with an
> >>>>>>>> extra_spec of hypervisor_type = NONE and the corresponding
> >>>>>>>> compute_capabilities_filter would reduce the available hosts to those
> >>>>>>>> bare_metal nodes. The scheduler would need to understand that
> >>>>>>>> hypervisor_type = NONE means you need an exact fit (or best-fit) host
> >>>>>>>> vs weighting them (perhaps through the multi-scheduler). The scheduler
> >>>>>>>> would cast out the message to the <topic>.<service-hostname> (code
> >>>>>>>> today uses the HostState hostname), with the compute driver having to
> >>>>>>>> understand if it must be serviced elsewhere (but does not break any
> >>>>>>>> existing implementations since it is 1 to 1).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Does this solution seem workable? Anything I missed?
> >>>>>>>>
> >>>>>>>> The bare metal driver already is proxying for the other nodes so it
> >>>>>>>> sounds like we need a couple of things to make this happen:
> >>>>>>>>
> >>>>>>>> a) modify driver.get_host_stats to be able to return a list of host
> >>>>>>>> stats instead of just one. Report the whole list back to the
> >>>>>>>> scheduler. We could modify the receiving end to accept a list as well
> >>>>>>>> or just make multiple calls to
> >>>>>>>> self.update_service_capabilities(capabilities)
> >>>>>>>>
> >>>>>>>> b) make a few minor changes to the scheduler to make sure filtering
> >>>>>>>> still works. Note the changes here may be very helpful:
> >>>>>>>>
> >>>>>>>> https://review.openstack.org/10327
> >>>>>>>>
> >>>>>>>> c) we have to make sure that instances launched on those nodes take up
> >>>>>>>> the entire host state somehow. We could probably do this by making
> >>>>>>>> sure that the instance_type ram, mb, gb etc. matches what the node
> >>>>>>>> has, but we may want a new boolean field "used" if those aren't
> >>>>>>>> sufficient.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> This approach seems pretty good. We could potentially get rid of the
> >>>>>>>> shared bare_metal_node table. I guess the only other concern is how
> >>>>>>>> you populate the capabilities that the bare metal nodes are reporting.
> >>>>>>>> I guess an api extension that rpcs to a baremetal node to add the
> >>>>>>>> node. Maybe someday this could be autogenerated by the bare metal host
> >>>>>>>> looking in its arp table for dhcp requests! :)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Vish
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> OpenStack-dev mailing list
> >>>>>>>> OpenStack-dev@xxxxxxxxxxxxxxxxxxx
> >>>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
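
The patch quoted earlier in the thread (wrap a single capabilities dictionary in a list, then send one update per item) can be tried out in isolation with a sketch like the one below. It is not nova code: `FakeSchedulerAPI` and `CapabilityReporter` are hypothetical stand-ins for the scheduler RPC API and SchedulerDependentManager, and the host names and capability values are sample data.

```python
class FakeSchedulerAPI:
    """Hypothetical stand-in for the scheduler RPC API; it only records calls."""

    def __init__(self):
        self.calls = []

    def update_service_capabilities(self, context, service_name, host,
                                    capabilities):
        self.calls.append((service_name, host, capabilities))


class CapabilityReporter:
    """Toy version of the manager logic shown in the quoted patch."""

    def __init__(self, service_name, host, scheduler_api):
        self.service_name = service_name
        self.host = host
        self.scheduler_api = scheduler_api
        self.last_capabilities = []

    def update_service_capabilities(self, capabilities):
        # Accept either one dict or a list of dicts.
        if not isinstance(capabilities, list):
            capabilities = [capabilities]
        self.last_capabilities = capabilities

    def publish_service_capabilities(self, context=None):
        # One update per item; an item may override service_name or host.
        for item in self.last_capabilities:
            name = item.get("service_name", self.service_name)
            host = item.get("host", self.host)
            self.scheduler_api.update_service_capabilities(
                context, name, host, item)


api = FakeSchedulerAPI()
reporter = CapabilityReporter("compute", "bespin101", api)
reporter.update_service_capabilities([
    {"host": "bespin101-0", "vcpus": 8},
    {"host": "bespin101-1", "vcpus": 16},
])
reporter.publish_service_capabilities()
hosts = [host for _, host, _ in api.calls]
```

Because each item can carry its own 'host', the receiving side still gets one dictionary per call, which is the shape the other scheduler routines are said to assume.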

