yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #94932
[Bug 2089388] [NEW] nova-scheduler error: SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC
Public bug reported:
Description
===========
If I try to schedule a VM immediately after restarting nova-scheduler service, I get this error on every second scheduling request:
keystoneauth1.exceptions.connection.SSLError: SSL exception connecting
to
https://api.ng1.os.ops.xx.xx:8778/allocation_candidates?limit=1000&member_of=in....
(Caused by SSLError(SSLError(1, '[SSL:
DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac
(_ssl.c:2578)')))
NOTE: Full trace in attachment.
Is most likely due to the fact that after nova-scheduler restart,
several worker processes based on the parent of SchedulerManager() are
created in nova/cmd/scheduler.py respectively in
../oslo_service/service.py and SchedulerManager(), in its __init__()
method, initializes the placement client by calling the
report.report_client_singleton() method from scheduler/client/report.py,
which should always return the same* instance of class
SchedulerReportClient(): which creates the actual client also within its
__init__() method self._client = self._create_client() line:230 approx.
This already opens a real socket on the placement service.
But this socket is inherited by descendants, so the whole situation after the restart (workers=2) is as follows:
root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
PID:1693940
nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
NOTE: Do you notice that everyone has the same socket open.
If scheduling occurs while this shared socket is open, the first scheduling will pass without error and the worker will open a new solo connection for it. The situation looks like this:
PID:1693940
nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
And finally the second attempt in the sequence fails and the open tcp connections on placement look like this:
root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
1693940
1693999
1694000
nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
>From this moment on, each process opens its own connection and everything runs fine. But I don't understand why the first request passes without error.
Steps to reproduce
==================
* placement service with keepalive set up (in my case 65s)
* more that 1 nova-scheduler worker (in my case worker==2)
* restart nova-scheduler
* and make two scheduling calls within the keealive time interval
* second call gives you SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC error
Expected result
===============
The parent process does not open any socket for placement, or closes it before the child forks
Environment
===========
* Ubuntu 22.04.5 LTS
* nova and nova-scheduler 29.0.1-0ubuntu1.4~cloud0
* I also checked the current code in the master repository and the situation is the same there
* python3-openssl 21.0.0-1
* openssl 3.0.2-0ubuntu1.18
Related patches
===========
https://review.opendev.org/c/x/tobiko/+/880152
https://review.opendev.org/c/openstack/freezer/+/456758
*) Actually, I don't know if the singleton works correctly and as
expected when not using thready but separate forked processes. Is that
OK?
** Affects: nova
Importance: Undecided
Status: New
** Attachment added: "decryption failed or bad record mac error"
https://bugs.launchpad.net/bugs/2089388/+attachment/5839439/+files/nova-scheduler-SSL_DECRYPTION_FAILED_OR_BAD_RECORD_MAC_full_log.txt
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2089388
Title:
nova-scheduler error: SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC
Status in OpenStack Compute (nova):
New
Bug description:
Description
===========
If I try to schedule a VM immediately after restarting nova-scheduler service, I get this error on every second scheduling request:
keystoneauth1.exceptions.connection.SSLError: SSL exception connecting
to
https://api.ng1.os.ops.xx.xx:8778/allocation_candidates?limit=1000&member_of=in....
(Caused by SSLError(SSLError(1, '[SSL:
DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record
mac (_ssl.c:2578)')))
NOTE: Full trace in attachment.
Is most likely due to the fact that after nova-scheduler restart,
several worker processes based on the parent of SchedulerManager() are
created in nova/cmd/scheduler.py respectively in
../oslo_service/service.py and SchedulerManager(), in its __init__()
method, initializes the placement client by calling the
report.report_client_singleton() method from
scheduler/client/report.py, which should always return the same*
instance of class SchedulerReportClient(): which creates the actual
client also within its __init__() method self._client =
self._create_client() line:230 approx. This already opens a real
socket on the placement service.
But this socket is inherited by descendants, so the whole situation after the restart (workers=2) is as follows:
root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
PID:1693940
nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
NOTE: Do you notice that everyone has the same socket open.
If scheduling occurs while this shared socket is open, the first scheduling will pass without error and the worker will open a new solo connection for it. The situation looks like this:
PID:1693940
nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
And finally the second attempt in the sequence fails and the open tcp connections on placement look like this:
root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
1693940
1693999
1694000
nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
From this moment on, each process opens its own connection and everything runs fine. But I don't understand why the first request passes without error.
Steps to reproduce
==================
* placement service with keepalive set up (in my case 65s)
* more that 1 nova-scheduler worker (in my case worker==2)
* restart nova-scheduler
* and make two scheduling calls within the keealive time interval
* second call gives you SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC error
Expected result
===============
The parent process does not open any socket for placement, or closes it before the child forks
Environment
===========
* Ubuntu 22.04.5 LTS
* nova and nova-scheduler 29.0.1-0ubuntu1.4~cloud0
* I also checked the current code in the master repository and the situation is the same there
* python3-openssl 21.0.0-1
* openssl 3.0.2-0ubuntu1.18
Related patches
===========
https://review.opendev.org/c/x/tobiko/+/880152
https://review.opendev.org/c/openstack/freezer/+/456758
*) Actually, I don't know if the singleton works correctly and as expected when not using thready but separate forked processes. Is that OK?
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2089388/+subscriptions