
openstack team mailing list archive

Re: [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)

 

David Kang <dkang@xxxxxxx> wrote on 08/27/2012 05:22:37 PM:

> From: David Kang <dkang@xxxxxxx>
> To: Michael J Fork/Rochester/IBM@IBMUS,
> Cc: "openstack@xxxxxxxxxxxxxxxxxxx (openstack@xxxxxxxxxxxxxxxxxxx)"
> <openstack@xxxxxxxxxxxxxxxxxxx>, openstack-bounces+mjfork=us.ibm.com
> <openstack-bounces+mjfork=us.ibm.com@xxxxxxxxxxxxxxxxxxx>, OpenStack
> Development Mailing List <openstack-dev@xxxxxxxxxxxxxxxxxxx>,
> Vishvananda Ishaya <vishvananda@xxxxxxxxx>
> Date: 08/27/2012 05:22 PM
> Subject: Re: [Openstack] [openstack-dev] Discussion about where to
> put database for bare-metal provisioning (review 10726)
>
>
>  Michael,
>
>  I think by "compute_node hostname" you mean the 'hypervisor_hostname'
> field in the 'compute_nodes' table.

Yes.  This value would be part of the payload of the message cast to the
proxy node so that it knows who the request was directed to.

> What do you mean by "service hostname"?
> I don't see such a field in the 'services' table in the database.
> Is it in some other table?
> Or do you suggest adding a 'service_hostname' field to the 'services' table?

The "host" field in the services table.  This value would be used as the
target of the rpc cast so that the proxy node would receive the message.
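
A minimal sketch of that cast change (illustrative only; the helper and names below are hypothetical, not actual nova code):

```python
# Hypothetical sketch: cast the run_instance message to
# <topic>.<service-host> instead of <topic>.<compute-node-hostname>,
# carrying the selected hypervisor_hostname in the payload so the
# bare-metal proxy knows which node the request was directed to.

def build_cast(topic, service_host, hypervisor_hostname, instance_id):
    """Return (routing_key, message) for the scheduler's cast."""
    # Today the scheduler keys the cast off host_state.host; the proposal
    # is to use the services table "host" value (the proxy) instead.
    routing_key = '%s.%s' % (topic, service_host)
    message = {
        'method': 'run_instance',
        'args': {
            'instance_id': instance_id,
            # Payload field telling the proxy which node was selected.
            'hypervisor_hostname': hypervisor_hostname,
        },
    }
    return routing_key, message

# For non-bare-metal deployments the service host equals the compute-node
# hostname, so the routing key is unchanged from today's behaviour.
key, msg = build_cast('compute', 'bespin101-0',
                      'bare-metal-0001.xxx.com', 'instance-0001')
```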

>
>  Thanks,
>  David
>
> ----- Original Message -----
> > openstack-bounces+mjfork=us.ibm.com@xxxxxxxxxxxxxxxxxxx wrote on
> > 08/27/2012 02:58:56 PM:
> >
> > > From: David Kang <dkang@xxxxxxx>
> > > To: Vishvananda Ishaya <vishvananda@xxxxxxxxx>,
> > > Cc: OpenStack Development Mailing List <openstack-
> > > dev@xxxxxxxxxxxxxxxxxxx>, "openstack@xxxxxxxxxxxxxxxxxxx
> > > (openstack@xxxxxxxxxxxxxxxxxxx)" <openstack@xxxxxxxxxxxxxxxxxxx>
> > > Date: 08/27/2012 03:06 PM
> > > Subject: Re: [Openstack] [openstack-dev] Discussion about where to
> > > put database for bare-metal provisioning (review 10726)
> > > Sent by: openstack-bounces+mjfork=us.ibm.com@xxxxxxxxxxxxxxxxxxx
> > >
> > >
> > > Hi Vish,
> > >
> > > I think I understand your idea.
> > > One service entry and multiple bare-metal compute_node entries are
> > > registered at the start of bare-metal nova-compute.
> > > 'hypervisor_hostname' must be different for each bare-metal machine,
> > > such as 'bare-metal-0001.xxx.com', 'bare-metal-0002.xxx.com', etc.
> > > But their IP addresses must be the IP address of the bare-metal
> > > nova-compute, so that an instance is cast not to the bare-metal
> > > machine directly but to the bare-metal nova-compute.
> >
> > I believe the change here is to cast out the message to the
> > <topic>.<service-hostname>. Existing code sends it to the compute_node
> > hostname (see line 202 of nova/scheduler/filter_scheduler.py,
> > specifically host=weighted_host.host_state.host). Changing that to
> > cast to the service hostname would send the message to the bare-metal
> > proxy node and should not have an effect on current deployments since
> > the service hostname and the host_state.host would always be equal.
> > This model will also let you keep the bare-metal compute node IP in
> > the compute node table.
> >
> > > One extension we need to make at the scheduler side is using (host,
> > > hypervisor_hostname) instead of (host) only in host_manager.py.
> > > 'HostManager.service_states' is { <host> : { <service> : { cap k : v }}}.
> > > It needs to be changed to
> > > { <host> : { <service> : { <hypervisor_hostname> : { cap k : v }}}}.
> > > Most functions of HostState need to be changed to use the (host,
> > > hypervisor_hostname) pair to identify a compute node.
> >
> > Would an alternative here be to change the top level "host" to be the
> > hypervisor_hostname and enforce uniqueness?
> >
> > > Are we on the same page, now?
> > >
> > > Thanks,
> > > David
> > >
> > > ----- Original Message -----
> > > > Hi David,
> > > >
> > > > I just checked out the code more extensively and I don't see why you
> > > > need to create a new service entry for each compute_node entry. The
> > > > code in host_manager to get all host states explicitly gets all
> > > > compute_node entries. I don't see any reason why multiple
> > > > compute_node entries can't share the same service. I don't see any
> > > > place in the scheduler that is grabbing records by "service" instead
> > > > of by "compute node", but if there is one that I missed, it should
> > > > be fairly easy to change it.
> > > >
> > > > The compute_node record is created in compute/resource_tracker.py
> > > > as of a recent commit, so I think the path forward would be to make
> > > > sure that one of the records is created for each bare metal node by
> > > > the bare metal compute, perhaps by having multiple resource_trackers.
> > > >
> > > > Vish
> > > >
> > > > On Aug 27, 2012, at 9:40 AM, David Kang <dkang@xxxxxxx> wrote:
> > > >
> > > > >
> > > > > Vish,
> > > > >
> > > > > I think I don't understand your statement fully.
> > > > > Unless we use different hostnames, (hostname, hypervisor_hostname)
> > > > > must be the same for all bare-metal nodes under a bare-metal
> > > > > nova-compute.
> > > > >
> > > > > Could you elaborate the following statement a little bit more?
> > > > >
> > > > >> You would just have to use a little more than hostname. Perhaps
> > > > >> (hostname, hypervisor_hostname) could be used to update the
> > > > >> entry?
> > > > >>
> > > > >
> > > > > Thanks,
> > > > > David
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > >> I would investigate changing the capabilities to key off of
> > > > >> something other than hostname. It looks from the table structure
> > > > >> like compute_nodes could have a many-to-one relationship with
> > > > >> services. You would just have to use a little more than hostname.
> > > > >> Perhaps (hostname, hypervisor_hostname) could be used to update
> > > > >> the entry?
> > > > >>
> > > > >> Vish
> > > > >>
> > > > >> On Aug 24, 2012, at 11:23 AM, David Kang <dkang@xxxxxxx> wrote:
> > > > >>
> > > > >>>
> > > > >>> Vish,
> > > > >>>
> > > > >>> I've tested your code and did more testing.
> > > > >>> There are a couple of problems.
> > > > >>> 1. The host name should be unique. If not, repeated updates of
> > > > >>> new capabilities with the same host name simply overwrite each
> > > > >>> other.
> > > > >>> 2. We cannot generate arbitrary host names on the fly.
> > > > >>> The scheduler (I tested the filter scheduler) gets host names
> > > > >>> from the db. So, if a host name is not in the 'services' table,
> > > > >>> it is not considered by the scheduler at all.
> > > > >>>
> > > > >>> So, to make your suggestions possible, nova-compute should
> > > > >>> register N different host names in the 'services' table,
> > > > >>> and N corresponding entries in the 'compute_nodes' table.
> > > > >>> Here is an example:
> > > > >>>
> > > > >>> mysql> select id, host, binary, topic, report_count, disabled,
> > > > >>> availability_zone from services;
> > > > >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> > > > >>> | id | host        | binary         | topic     | report_count | disabled | availability_zone |
> > > > >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> > > > >>> |  1 | bespin101   | nova-scheduler | scheduler |        17145 |        0 | nova              |
> > > > >>> |  2 | bespin101   | nova-network   | network   |        16819 |        0 | nova              |
> > > > >>> |  3 | bespin101-0 | nova-compute   | compute   |        16405 |        0 | nova              |
> > > > >>> |  4 | bespin101-1 | nova-compute   | compute   |            1 |        0 | nova              |
> > > > >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> > > > >>>
> > > > >>> mysql> select id, service_id, hypervisor_hostname from
> > > > >>> compute_nodes;
> > > > >>> +----+------------+------------------------+
> > > > >>> | id | service_id | hypervisor_hostname    |
> > > > >>> +----+------------+------------------------+
> > > > >>> |  1 |          3 | bespin101.east.isi.edu |
> > > > >>> |  2 |          4 | bespin101.east.isi.edu |
> > > > >>> +----+------------+------------------------+
> > > > >>>
> > > > >>> Then, the nova db (compute_nodes table) has entries for all
> > > > >>> bare-metal nodes.
> > > > >>> What do you think of this approach?
> > > > >>> Do you have a better approach?
> > > > >>>
> > > > >>> Thanks,
> > > > >>> David
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> ----- Original Message -----
> > > > >>>> To elaborate, something like the below. I'm not absolutely
> > > > >>>> sure you need to be able to set service_name and host, but this
> > > > >>>> gives you the option to do so if needed.
> > > > >>>>
> > > > >>>> diff --git a/nova/manager.py b/nova/manager.py
> > > > >>>> index c6711aa..c0f4669 100644
> > > > >>>> --- a/nova/manager.py
> > > > >>>> +++ b/nova/manager.py
> > > > >>>> @@ -217,6 +217,8 @@ class SchedulerDependentManager(Manager):
> > > > >>>>
> > > > >>>>      def update_service_capabilities(self, capabilities):
> > > > >>>>          """Remember these capabilities to send on next periodic update."""
> > > > >>>> +        if not isinstance(capabilities, list):
> > > > >>>> +            capabilities = [capabilities]
> > > > >>>>          self.last_capabilities = capabilities
> > > > >>>>
> > > > >>>>      @periodic_task
> > > > >>>> @@ -224,5 +226,8 @@ class SchedulerDependentManager(Manager):
> > > > >>>>          """Pass data back to the scheduler at a periodic interval."""
> > > > >>>>          if self.last_capabilities:
> > > > >>>>              LOG.debug(_('Notifying Schedulers of capabilities ...'))
> > > > >>>> -            self.scheduler_rpcapi.update_service_capabilities(context,
> > > > >>>> -                self.service_name, self.host, self.last_capabilities)
> > > > >>>> +            for capability_item in self.last_capabilities:
> > > > >>>> +                name = capability_item.get('service_name', self.service_name)
> > > > >>>> +                host = capability_item.get('host', self.host)
> > > > >>>> +                self.scheduler_rpcapi.update_service_capabilities(context,
> > > > >>>> +                    name, host, capability_item)
> > > > >>>>
> > > > >>>> On Aug 21, 2012, at 1:28 PM, David Kang <dkang@xxxxxxx>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>>
> > > > >>>>> Hi Vish,
> > > > >>>>>
> > > > >>>>> We are trying to change our code according to your comment.
> > > > >>>>> I want to ask a question.
> > > > >>>>>
> > > > >>>>>>>> a) modify driver.get_host_stats to be able to return a
> > > > >>>>>>>> list of host stats instead of just one. Report the whole
> > > > >>>>>>>> list back to the scheduler. We could modify the receiving
> > > > >>>>>>>> end to accept a list as well or just make multiple calls to
> > > > >>>>>>>> self.update_service_capabilities(capabilities)
> > > > >>>>>
> > > > >>>>> Modifying driver.get_host_stats to return a list of host
> > > > >>>>> stats is easy.
> > > > >>>>> Making multiple calls to
> > > > >>>>> self.update_service_capabilities(capabilities) doesn't seem to
> > > > >>>>> work, because 'capabilities' is overwritten each time.
> > > > >>>>>
> > > > >>>>> Modifying the receiving end to accept a list seems to be easy.
> > > > >>>>> However, since 'capabilities' is assumed to be a dictionary by
> > > > >>>>> all other scheduler routines, it looks like we would have to
> > > > >>>>> change all of them to handle 'capabilities' as a list of
> > > > >>>>> dictionaries.
> > > > >>>>>
> > > > >>>>> If my understanding is correct, it would affect many parts of
> > > > >>>>> the scheduler.
> > > > >>>>> Is that what you recommended?
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>> David
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> ----- Original Message -----
> > > > >>>>>> This was an immediate goal: the bare-metal nova-compute node
> > > > >>>>>> could keep an internal database, but report capabilities
> > > > >>>>>> through nova in the common way with the changes below. Then
> > > > >>>>>> the scheduler wouldn't need access to the bare metal database
> > > > >>>>>> at all.
> > > > >>>>>>
> > > > >>>>>> On Aug 15, 2012, at 4:23 PM, David Kang <dkang@xxxxxxx>
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Hi Vish,
> > > > >>>>>>>
> > > > >>>>>>> Is this discussion for long-term goal or for this Folsom
> > > > >>>>>>> release?
> > > > >>>>>>>
> > > > >>>>>>> We still believe that the bare-metal database is needed,
> > > > >>>>>>> because there is no automated way for bare-metal nodes to
> > > > >>>>>>> report their capabilities to their bare-metal nova-compute
> > > > >>>>>>> node.
> > > > >>>>>>>
> > > > >>>>>>> Thanks,
> > > > >>>>>>> David
> > > > >>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> I am interested in finding a solution that enables
> > > > >>>>>>>> bare-metal and virtualized requests to be serviced through
> > > > >>>>>>>> the same scheduler, where the compute_nodes table has a
> > > > >>>>>>>> full view of schedulable resources. This would seem to
> > > > >>>>>>>> simplify the end-to-end flow while opening up some
> > > > >>>>>>>> additional use cases (e.g. dynamic allocation of a node
> > > > >>>>>>>> from bare-metal to hypervisor and back).
> > > > >>>>>>>>
> > > > >>>>>>>> One approach would be to have a proxy running a single
> > > > >>>>>>>> nova-compute daemon fronting the bare-metal nodes. That
> > > > >>>>>>>> nova-compute daemon would report up many HostState objects
> > > > >>>>>>>> (1 per bare-metal node) to become entries in the
> > > > >>>>>>>> compute_nodes table, accessible through the scheduler
> > > > >>>>>>>> HostManager object.
> > > > >>>>>>>>
> > > > >>>>>>>> The HostState object would set cpu_info, vcpus, memory_mb
> > > > >>>>>>>> and local_gb values to be used for scheduling, with the
> > > > >>>>>>>> hypervisor_host field holding the bare-metal machine
> > > > >>>>>>>> address (e.g. for IPMI-based commands) and hypervisor_type
> > > > >>>>>>>> = NONE. The bare-metal Flavors are created with an
> > > > >>>>>>>> extra_spec of hypervisor_type = NONE, and the corresponding
> > > > >>>>>>>> compute_capabilities_filter would reduce the available
> > > > >>>>>>>> hosts to those bare_metal nodes. The scheduler would need
> > > > >>>>>>>> to understand that hypervisor_type = NONE means you need an
> > > > >>>>>>>> exact-fit (or best-fit) host vs weighting them (perhaps
> > > > >>>>>>>> through the multi-scheduler). The scheduler would cast out
> > > > >>>>>>>> the message to the <topic>.<service-hostname> (code today
> > > > >>>>>>>> uses the HostState hostname), with the compute driver
> > > > >>>>>>>> having to understand if it must be serviced elsewhere (but
> > > > >>>>>>>> this does not break any existing implementations since it
> > > > >>>>>>>> is 1 to 1).
> > > > >>>>>>>>
> > > > >>>>>>>> Does this solution seem workable? Anything I missed?
> > > > >>>>>>>>
> > > > >>>>>>>> The bare metal driver is already proxying for the other
> > > > >>>>>>>> nodes, so it sounds like we need a couple of things to make
> > > > >>>>>>>> this happen:
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> a) modify driver.get_host_stats to be able to return a
> > > > >>>>>>>> list of host stats instead of just one. Report the whole
> > > > >>>>>>>> list back to the scheduler. We could modify the receiving
> > > > >>>>>>>> end to accept a list as well or just make multiple calls to
> > > > >>>>>>>> self.update_service_capabilities(capabilities)
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> b) make a few minor changes to the scheduler to make sure
> > > > >>>>>>>> filtering still works. Note the changes here may be very
> > > > >>>>>>>> helpful:
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> https://review.openstack.org/10327
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> c) we have to make sure that instances launched on those
> > > > >>>>>>>> nodes take up the entire host state somehow. We could
> > > > >>>>>>>> probably do this by making sure that the instance_type ram,
> > > > >>>>>>>> mb, gb, etc. matches what the node has, but we may want a
> > > > >>>>>>>> new boolean field "used" if those aren't sufficient.
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> This approach seems pretty good. We could potentially get
> > > > >>>>>>>> rid of the shared bare_metal_node table. I guess the only
> > > > >>>>>>>> other concern is how you populate the capabilities that the
> > > > >>>>>>>> bare metal nodes are reporting. I guess an api extension
> > > > >>>>>>>> that rpcs to a baremetal node to add the node. Maybe
> > > > >>>>>>>> someday this could be autogenerated by the bare metal host
> > > > >>>>>>>> looking in its arp table for dhcp requests! :)
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> Vish
> > > > >>>>>>>>
> > > > >>>>>>>> _______________________________________________
> > > > >>>>>>>> OpenStack-dev mailing list
> > > > >>>>>>>> OpenStack-dev@xxxxxxxxxxxxxxxxxxx
> > > > >>>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> > >
> > > _______________________________________________
> > > Mailing list: https://launchpad.net/~openstack
> > > Post to : openstack@xxxxxxxxxxxxxxxxxxx
> > > Unsubscribe : https://launchpad.net/~openstack
> > > More help : https://help.launchpad.net/ListHelp
> > >
> >
> > Michael
> >
> > -------------------------------------------------
> > Michael Fork
> > Cloud Architect, Emerging Solutions
> > IBM Systems & Technology Group
>
Michael

-------------------------------------------------
Michael Fork
Cloud Architect, Emerging Solutions
IBM Systems & Technology Group
