openstack team mailing list archive
Message #16159
Re: [openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)
Michael,
I think by "compute_node hostname" you mean the 'hypervisor_hostname' field in the 'compute_nodes' table.
What do you mean by "service hostname"?
I don't see such a field in the 'services' table in the database.
Is it in some other table?
Or are you suggesting adding a 'service_hostname' field to the 'services' table?
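
To make the question concrete, here is a minimal sketch of one possible reading, assuming "service hostname" means the 'host' column of the 'services' row that a compute node points to through 'service_id'. The rows are made up for illustration (only the column names and the example node names come from this thread), and routing_key_for is a hypothetical helper, not nova code.

```python
# Hypothetical rows; only the column names come from the tables in this thread.
services = {
    3: {"host": "bespin101", "binary": "nova-compute", "topic": "compute"},
}
compute_nodes = [
    {"id": 1, "service_id": 3, "hypervisor_hostname": "bare-metal-0001.xxx.com"},
    {"id": 2, "service_id": 3, "hypervisor_hostname": "bare-metal-0002.xxx.com"},
]


def routing_key_for(node, services):
    """Build '<topic>.<service-hostname>' for a compute node (hypothetical helper)."""
    # Follow the many-to-one link from the compute node to its service row.
    service = services[node["service_id"]]
    return "%s.%s" % (service["topic"], service["host"])


# Both bare-metal nodes resolve to the same proxy nova-compute service.
keys = [routing_key_for(node, services) for node in compute_nodes]
```

Under this reading no new column would be needed, since the existing 'host' column already names the proxy node.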
Thanks,
David
----- Original Message -----
> openstack-bounces+mjfork=us.ibm.com@xxxxxxxxxxxxxxxxxxx wrote on
> 08/27/2012 02:58:56 PM:
>
> > From: David Kang <dkang@xxxxxxx>
> > To: Vishvananda Ishaya <vishvananda@xxxxxxxxx>
> > Cc: OpenStack Development Mailing List
> > <openstack-dev@xxxxxxxxxxxxxxxxxxx>, "openstack@xxxxxxxxxxxxxxxxxxx
> > (openstack@xxxxxxxxxxxxxxxxxxx)" <openstack@xxxxxxxxxxxxxxxxxxx>
> > Date: 08/27/2012 03:06 PM
> > Subject: Re: [Openstack] [openstack-dev] Discussion about where to
> > put database for bare-metal provisioning (review 10726)
> > Sent by: openstack-bounces+mjfork=us.ibm.com@xxxxxxxxxxxxxxxxxxx
> >
> >
> > Hi Vish,
> >
> > I think I understand your idea.
> > One service entry with multiple bare-metal compute_node entries is
> > registered at the start of bare-metal nova-compute.
> > 'hypervisor_hostname' must be different for each bare-metal machine
> > (such as 'bare-metal-0001.xxx.com', 'bare-metal-0002.xxx.com', etc.).
> > But their IP addresses must be the IP address of the bare-metal
> > nova-compute, so that an instance is cast not to the bare-metal
> > machine directly but to the bare-metal nova-compute.
>
> I believe the change here is to cast out the message to the
> <topic>.<service-hostname>. Existing code sends it to the compute_node
> hostname (see line 202 of nova/scheduler/filter_scheduler.py,
> specifically host=weighted_host.host_state.host). Changing that to
> cast to the service hostname would send the message to the bare-metal
> proxy node and should not have an effect on current deployments since
> the service hostname and the host_state.host would always be equal.
> This model will also let you keep the bare-metal compute node IP in
> the compute node table.
>
> > One extension we need to make on the scheduler side is to use (host,
> > hypervisor_hostname) instead of (host) only in host_manager.py.
> > 'HostManager.service_state' is { <host> : { <service> : { cap k : v }}}.
> > It needs to be changed to
> > { <host> : { <service> : { <hypervisor_name> : { cap k : v }}}}.
> > Most functions of HostState need to be changed to use the (host,
> > hypervisor_name) pair to identify a compute node.
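
The difference between the two layouts can be sketched with plain dictionaries. This is only an illustration under assumptions: update_flat and update_nested are hypothetical helpers rather than HostManager methods, and the capability values are invented.

```python
def update_flat(state, host, service, caps):
    """Current layout: {host: {service: caps}}."""
    state.setdefault(host, {})[service] = caps


def update_nested(state, host, service, hypervisor_name, caps):
    """Proposed layout: {host: {service: {hypervisor_name: caps}}}."""
    state.setdefault(host, {}).setdefault(service, {})[hypervisor_name] = caps


flat, nested = {}, {}
# Two bare-metal nodes reporting through the same nova-compute host.
reports = [
    ("bare-metal-0001.xxx.com", {"vcpus": 8}),
    ("bare-metal-0002.xxx.com", {"vcpus": 16}),
]
for name, caps in reports:
    update_flat(flat, "bespin101", "compute", caps)            # second report replaces the first
    update_nested(nested, "bespin101", "compute", name, caps)  # both reports are kept
```

With the flat layout only the last report survives, which is the overwrite problem described earlier in the thread; the extra level keeps one entry per bare-metal node.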
>
> Would an alternative here be to change the top level "host" to be the
> hypervisor_hostname and enforce uniqueness?
>
> > Are we on the same page, now?
> >
> > Thanks,
> > David
> >
> > ----- Original Message -----
> > > Hi David,
> > >
> > > > I just checked out the code more extensively and I don't see why you
> > > > need to create a new service entry for each compute_node entry. The
> > > > code in host_manager to get all host states explicitly gets all
> > > > compute_node entries. I don't see any reason why multiple compute_node
> > > > entries can't share the same service. I don't see any place in the
> > > > scheduler that is grabbing records by "service" instead of by
> > > > "compute node", but if there is one that I missed, it should be fairly
> > > > easy to change it.
> > > >
> > > > The compute_node record is created in the compute/resource_tracker.py
> > > > as of a recent commit, so I think the path forward would be to make
> > > > sure that one of the records is created for each bare metal node by
> > > > the bare metal compute, perhaps by having multiple resource_trackers.
> > >
> > > Vish
> > >
> > > On Aug 27, 2012, at 9:40 AM, David Kang <dkang@xxxxxxx> wrote:
> > >
> > > >
> > > > Vish,
> > > >
> > > > > I don't think I fully understand your statement.
> > > > > Unless we use different hostnames, (hostname, hypervisor_hostname)
> > > > > must be the same for all bare-metal nodes under a bare-metal
> > > > > nova-compute.
> > > > >
> > > > > Could you elaborate on the following statement a little more?
> > > > >
> > > > >> You would just have to use a little more than hostname. Perhaps
> > > > >> (hostname, hypervisor_hostname) could be used to update the entry?
> > > >>
> > > >
> > > > Thanks,
> > > > David
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > >> I would investigate changing the capabilities to key off of
> > > > >> something other than hostname. It looks from the table structure
> > > > >> like compute_nodes could have a many-to-one relationship with
> > > > >> services. You would just have to use a little more than hostname.
> > > > >> Perhaps (hostname, hypervisor_hostname) could be used to update the
> > > > >> entry?
> > > >>
> > > >> Vish
> > > >>
> > > >> On Aug 24, 2012, at 11:23 AM, David Kang <dkang@xxxxxxx> wrote:
> > > >>
> > > >>>
> > > >>> Vish,
> > > >>>
> > > > >>> I've tested your code and done some more testing.
> > > > >>> There are a couple of problems.
> > > > >>> 1. The host name must be unique. If it is not, repeated updates of
> > > > >>> new capabilities with the same host name simply overwrite each other.
> > > > >>> 2. We cannot generate arbitrary host names on the fly.
> > > > >>> The scheduler (I tested the filter scheduler) gets host names from
> > > > >>> the db. So, if a host name is not in the 'services' table, it is not
> > > > >>> considered by the scheduler at all.
> > > > >>>
> > > > >>> So, to make your suggestions possible, nova-compute should register
> > > > >>> N different host names in the 'services' table,
> > > > >>> and N corresponding entries in the 'compute_nodes' table.
> > > >>> Here is an example:
> > > >>>
> > > > >>> mysql> select id, host, binary, topic, report_count, disabled, availability_zone from services;
> > > > >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> > > > >>> | id | host        | binary         | topic     | report_count | disabled | availability_zone |
> > > > >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> > > > >>> |  1 | bespin101   | nova-scheduler | scheduler |        17145 |        0 | nova              |
> > > > >>> |  2 | bespin101   | nova-network   | network   |        16819 |        0 | nova              |
> > > > >>> |  3 | bespin101-0 | nova-compute   | compute   |        16405 |        0 | nova              |
> > > > >>> |  4 | bespin101-1 | nova-compute   | compute   |            1 |        0 | nova              |
> > > > >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
> > > > >>>
> > > > >>> mysql> select id, service_id, hypervisor_hostname from compute_nodes;
> > > > >>> +----+------------+------------------------+
> > > > >>> | id | service_id | hypervisor_hostname    |
> > > > >>> +----+------------+------------------------+
> > > > >>> |  1 |          3 | bespin101.east.isi.edu |
> > > > >>> |  2 |          4 | bespin101.east.isi.edu |
> > > > >>> +----+------------+------------------------+
> > > >>>
> > > > >>> Then the nova db ('compute_nodes' table) has entries for all
> > > > >>> bare-metal nodes.
> > > > >>> What do you think of this approach?
> > > > >>> Do you have a better approach?
> > > >>>
> > > >>> Thanks,
> > > >>> David
> > > >>>
> > > >>>
> > > >>>
> > > >>> ----- Original Message -----
> > > > >>>> To elaborate, something like the below. I'm not absolutely sure you
> > > > >>>> need to be able to set service_name and host, but this gives you the
> > > > >>>> option to do so if needed.
> > > >>>>
> > > > >>>> diff --git a/nova/manager.py b/nova/manager.py
> > > > >>>> index c6711aa..c0f4669 100644
> > > > >>>> --- a/nova/manager.py
> > > > >>>> +++ b/nova/manager.py
> > > > >>>> @@ -217,6 +217,8 @@ class SchedulerDependentManager(Manager):
> > > > >>>>  
> > > > >>>>      def update_service_capabilities(self, capabilities):
> > > > >>>>          """Remember these capabilities to send on next periodic update."""
> > > > >>>> +        if not isinstance(capabilities, list):
> > > > >>>> +            capabilities = [capabilities]
> > > > >>>>          self.last_capabilities = capabilities
> > > > >>>>  
> > > > >>>>      @periodic_task
> > > > >>>> @@ -224,5 +226,8 @@ class SchedulerDependentManager(Manager):
> > > > >>>>          """Pass data back to the scheduler at a periodic interval."""
> > > > >>>>          if self.last_capabilities:
> > > > >>>>              LOG.debug(_('Notifying Schedulers of capabilities ...'))
> > > > >>>> -            self.scheduler_rpcapi.update_service_capabilities(context,
> > > > >>>> -                    self.service_name, self.host, self.last_capabilities)
> > > > >>>> +            for capability_item in self.last_capabilities:
> > > > >>>> +                name = capability_item.get('service_name', self.service_name)
> > > > >>>> +                host = capability_item.get('host', self.host)
> > > > >>>> +                self.scheduler_rpcapi.update_service_capabilities(context,
> > > > >>>> +                        name, host, capability_item)
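
For anyone who wants to try the idea in the patch above outside nova, here is a self-contained sketch. FakeSchedulerAPI and CapabilityPublisher are stand-ins invented for this example (they are not nova classes), and the capability values are made up; only the normalize-to-list and per-item fallback logic mirrors the diff.

```python
class FakeSchedulerAPI:
    """Stand-in for the scheduler RPC API; it only records calls."""

    def __init__(self):
        self.calls = []

    def update_service_capabilities(self, context, service_name, host, capabilities):
        self.calls.append((service_name, host, capabilities))


class CapabilityPublisher:
    """Hypothetical, stripped-down analogue of SchedulerDependentManager."""

    def __init__(self, service_name, host, scheduler_rpcapi):
        self.service_name = service_name
        self.host = host
        self.scheduler_rpcapi = scheduler_rpcapi
        self.last_capabilities = None

    def update_service_capabilities(self, capabilities):
        # Accept either a single dict or a list of dicts.
        if not isinstance(capabilities, list):
            capabilities = [capabilities]
        self.last_capabilities = capabilities

    def publish_service_capabilities(self, context):
        # One call per item; each item may override service_name and host.
        for item in self.last_capabilities or []:
            name = item.get("service_name", self.service_name)
            host = item.get("host", self.host)
            self.scheduler_rpcapi.update_service_capabilities(context, name, host, item)


api = FakeSchedulerAPI()
publisher = CapabilityPublisher("compute", "bespin101", api)
publisher.update_service_capabilities([
    {"host": "bespin101-0", "vcpus": 8},
    {"host": "bespin101-1", "vcpus": 16},
])
publisher.publish_service_capabilities(context=None)
hosts = [host for _, host, _ in api.calls]
```

Each item is still sent as a single dictionary, so the receiving side would not need to handle lists in this variant.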
> > > >>>>
> > > >>>> On Aug 21, 2012, at 1:28 PM, David Kang <dkang@xxxxxxx>
> > > >>>> wrote:
> > > >>>>
> > > >>>>>
> > > >>>>> Hi Vish,
> > > >>>>>
> > > >>>>> We are trying to change our code according to your comment.
> > > >>>>> I want to ask a question.
> > > >>>>>
> > > > >>>>>>>> a) modify driver.get_host_stats to be able to return a list of
> > > > >>>>>>>> host stats instead of just one. Report the whole list back to
> > > > >>>>>>>> the scheduler. We could modify the receiving end to accept a
> > > > >>>>>>>> list as well or just make multiple calls to
> > > > >>>>>>>> self.update_service_capabilities(capabilities)
> > > > >>>>>
> > > > >>>>> Modifying driver.get_host_stats to return a list of host stats is
> > > > >>>>> easy.
> > > > >>>>> Making multiple calls to
> > > > >>>>> self.update_service_capabilities(capabilities) doesn't seem to
> > > > >>>>> work, because 'capabilities' is overwritten each time.
> > > > >>>>>
> > > > >>>>> Modifying the receiving end to accept a list seems to be easy.
> > > > >>>>> However, 'capabilities' is assumed to be a dictionary by all other
> > > > >>>>> scheduler routines, so it looks like we would have to change all of
> > > > >>>>> them to handle 'capabilities' as a list of dictionaries.
> > > > >>>>>
> > > > >>>>> If my understanding is correct, it would affect many parts of the
> > > > >>>>> scheduler.
> > > > >>>>> Is that what you recommended?
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> David
> > > >>>>>
> > > >>>>>
> > > >>>>> ----- Original Message -----
> > > > >>>>>> This was an immediate goal. The bare-metal nova-compute node could
> > > > >>>>>> keep an internal database, but report capabilities through nova in
> > > > >>>>>> the common way with the changes below. Then the scheduler wouldn't
> > > > >>>>>> need access to the bare metal database at all.
> > > >>>>>>
> > > >>>>>> On Aug 15, 2012, at 4:23 PM, David Kang <dkang@xxxxxxx>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> Hi Vish,
> > > >>>>>>>
> > > > >>>>>>> Is this discussion about a long-term goal or about this Folsom
> > > > >>>>>>> release?
> > > > >>>>>>>
> > > > >>>>>>> We still believe that a bare-metal database is needed,
> > > > >>>>>>> because there is no automated way for bare-metal nodes to report
> > > > >>>>>>> their capabilities to their bare-metal nova-compute node.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>> David
> > > >>>>>>>
> > > >>>>>>>>
> > > > >>>>>>>> I am interested in finding a solution that enables bare-metal and
> > > > >>>>>>>> virtualized requests to be serviced through the same scheduler
> > > > >>>>>>>> where the compute_nodes table has a full view of schedulable
> > > > >>>>>>>> resources. This would seem to simplify the end-to-end flow while
> > > > >>>>>>>> opening up some additional use cases (e.g. dynamic allocation of
> > > > >>>>>>>> a node from bare-metal to hypervisor and back).
> > > > >>>>>>>>
> > > > >>>>>>>> One approach would be to have a proxy running a single
> > > > >>>>>>>> nova-compute daemon fronting the bare-metal nodes. That
> > > > >>>>>>>> nova-compute daemon would report up many HostState objects (1 per
> > > > >>>>>>>> bare-metal node) to become entries in the compute_nodes table and
> > > > >>>>>>>> accessible through the scheduler HostManager object.
> > > > >>>>>>>>
> > > > >>>>>>>> The HostState object would set cpu_info, vcpus, memory_mb and
> > > > >>>>>>>> local_gb values to be used for scheduling with the
> > > > >>>>>>>> hypervisor_host field holding the bare-metal machine address
> > > > >>>>>>>> (e.g. for IPMI based commands) and hypervisor_type = NONE. The
> > > > >>>>>>>> bare-metal Flavors are created with an extra_spec of
> > > > >>>>>>>> hypervisor_type = NONE and the corresponding
> > > > >>>>>>>> compute_capabilities_filter would reduce the available hosts to
> > > > >>>>>>>> those bare_metal nodes. The scheduler would need to understand
> > > > >>>>>>>> that hypervisor_type = NONE means you need an exact fit (or
> > > > >>>>>>>> best-fit) host vs weighting them (perhaps through the
> > > > >>>>>>>> multi-scheduler). The scheduler would cast out the message to the
> > > > >>>>>>>> <topic>.<service-hostname> (code today uses the HostState
> > > > >>>>>>>> hostname), with the compute driver having to understand if it
> > > > >>>>>>>> must be serviced elsewhere (but does not break any existing
> > > > >>>>>>>> implementations since it is 1 to 1).
> > > > >>>>>>>>
> > > > >>>>>>>> Does this solution seem workable? Anything I missed?
> > > >>>>>>>>
> > > > >>>>>>>> The bare metal driver already is proxying for the other nodes, so
> > > > >>>>>>>> it sounds like we need a couple of things to make this happen:
> > > > >>>>>>>>
> > > > >>>>>>>> a) modify driver.get_host_stats to be able to return a list of
> > > > >>>>>>>> host stats instead of just one. Report the whole list back to
> > > > >>>>>>>> the scheduler. We could modify the receiving end to accept a
> > > > >>>>>>>> list as well or just make multiple calls to
> > > > >>>>>>>> self.update_service_capabilities(capabilities)
> > > > >>>>>>>>
> > > > >>>>>>>> b) make a few minor changes to the scheduler to make sure
> > > > >>>>>>>> filtering still works. Note the changes here may be very helpful:
> > > > >>>>>>>>
> > > > >>>>>>>> https://review.openstack.org/10327
> > > > >>>>>>>>
> > > > >>>>>>>> c) we have to make sure that instances launched on those nodes
> > > > >>>>>>>> take up the entire host state somehow. We could probably do this
> > > > >>>>>>>> by making sure that the instance_type ram, mb, gb etc. matches
> > > > >>>>>>>> what the node has, but we may want a new boolean field "used" if
> > > > >>>>>>>> those aren't sufficient.
> > > > >>>>>>>>
> > > > >>>>>>>> This approach seems pretty good. We could potentially get rid of
> > > > >>>>>>>> the shared bare_metal_node table. I guess the only other concern
> > > > >>>>>>>> is how you populate the capabilities that the bare metal nodes
> > > > >>>>>>>> are reporting. I guess it could be an API extension that RPCs to
> > > > >>>>>>>> a bare-metal node to add the node. Maybe someday this could be
> > > > >>>>>>>> autogenerated by the bare metal host looking in its arp table for
> > > > >>>>>>>> dhcp requests! :)
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Vish
> > > >>>>>>>>
> > > >>>>>>>> _______________________________________________
> > > >>>>>>>> OpenStack-dev mailing list
> > > >>>>>>>> OpenStack-dev@xxxxxxxxxxxxxxxxxxx
> > > >>>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> > > >>>>>>>
> >
> > _______________________________________________
> > Mailing list: https://launchpad.net/~openstack
> > Post to : openstack@xxxxxxxxxxxxxxxxxxx
> > Unsubscribe : https://launchpad.net/~openstack
> > More help : https://help.launchpad.net/ListHelp
> >
>
> Michael
>
> -------------------------------------------------
> Michael Fork
> Cloud Architect, Emerging Solutions
> IBM Systems & Technology Group