yahoo-eng-team team mailing list archive

Thread
Date
[Bug 2089388] [NEW] nova-scheduler error: SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Pavel Mracek <2089388@xxxxxxxxxxxxxxxxxx>
Date: Fri, 22 Nov 2024 14:03:09 -0000
Reply-to: Bug 2089388 <2089388@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Public bug reported:

Description
===========
If I try to schedule a VM immediately after restarting nova-scheduler service, I get this error on every second scheduling request:

keystoneauth1.exceptions.connection.SSLError: SSL exception connecting
to
https://api.ng1.os.ops.xx.xx:8778/allocation_candidates?limit=1000&member_of=in....
(Caused by SSLError(SSLError(1, '[SSL:
DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac
(_ssl.c:2578)')))

NOTE: Full trace in attachment.

Is most likely due to the fact that after nova-scheduler restart,
several worker processes based on the parent of SchedulerManager() are
created in nova/cmd/scheduler.py respectively in
../oslo_service/service.py and SchedulerManager(), in its __init__()
method, initializes the placement client by calling the
report.report_client_singleton() method from scheduler/client/report.py,
which should always return the same* instance of class
SchedulerReportClient(): which creates the actual client also within its
__init__() method self._client = self._create_client() line:230 approx.
This already opens a real socket on the placement service.

But this socket is inherited by descendants, so the whole situation after the restart (workers=2) is as follows:
root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
PID:1693940
nova-sche 1693940 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)

NOTE: Do you notice that everyone has the same socket open.

If scheduling occurs while this shared socket is open, the first scheduling will pass without error and the worker will open a new solo connection for it. The situation looks like this:
PID:1693940
nova-sche 1693940 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1693999
nova-sche 1693999 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
PID:1694000
nova-sche 1694000 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
nova-sche 1694000 nova   21u     IPv4         2361698804      0t0        TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)


And finally the second attempt in the sequence fails and the open tcp connections on placement look like this:
root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
1693940
1693999
1694000
nova-sche 1694000 nova   21u     IPv4         2361698804      0t0        TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)


>From this moment on, each process opens its own connection and everything runs fine. But I don't understand why the first request passes without error.


Steps to reproduce
==================
* placement service with keepalive set up (in my case 65s)
* more that 1 nova-scheduler worker (in my case worker==2)
* restart nova-scheduler
* and make two scheduling calls within the keealive time interval 
* second call gives you SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC error


Expected result
===============
The parent process does not open any socket for placement, or closes it before the child forks


Environment
===========
* Ubuntu 22.04.5 LTS
* nova and nova-scheduler 29.0.1-0ubuntu1.4~cloud0
* I also checked the current code in the master repository and the situation is the same there
* python3-openssl 21.0.0-1
* openssl 3.0.2-0ubuntu1.18


Related patches
===========
https://review.opendev.org/c/x/tobiko/+/880152
https://review.opendev.org/c/openstack/freezer/+/456758


*) Actually, I don't know if the singleton works correctly and as
expected when not using thready but separate forked processes. Is that
OK?

** Affects: nova
     Importance: Undecided
         Status: New

** Attachment added: "decryption failed or bad record mac error"
   https://bugs.launchpad.net/bugs/2089388/+attachment/5839439/+files/nova-scheduler-SSL_DECRYPTION_FAILED_OR_BAD_RECORD_MAC_full_log.txt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2089388

Title:
  nova-scheduler error: SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===========
  If I try to schedule a VM immediately after restarting nova-scheduler service, I get this error on every second scheduling request:

  keystoneauth1.exceptions.connection.SSLError: SSL exception connecting
  to
  https://api.ng1.os.ops.xx.xx:8778/allocation_candidates?limit=1000&member_of=in....
  (Caused by SSLError(SSLError(1, '[SSL:
  DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record
  mac (_ssl.c:2578)')))

  NOTE: Full trace in attachment.

  Is most likely due to the fact that after nova-scheduler restart,
  several worker processes based on the parent of SchedulerManager() are
  created in nova/cmd/scheduler.py respectively in
  ../oslo_service/service.py and SchedulerManager(), in its __init__()
  method, initializes the placement client by calling the
  report.report_client_singleton() method from
  scheduler/client/report.py, which should always return the same*
  instance of class SchedulerReportClient(): which creates the actual
  client also within its __init__() method self._client =
  self._create_client() line:230 approx. This already opens a real
  socket on the placement service.

  But this socket is inherited by descendants, so the whole situation after the restart (workers=2) is as follows:
  root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
  PID:1693940
  nova-sche 1693940 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
  PID:1693999
  nova-sche 1693999 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
  PID:1694000
  nova-sche 1694000 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)

  NOTE: Do you notice that everyone has the same socket open.

  If scheduling occurs while this shared socket is open, the first scheduling will pass without error and the worker will open a new solo connection for it. The situation looks like this:
  PID:1693940
  nova-sche 1693940 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
  PID:1693999
  nova-sche 1693999 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
  PID:1694000
  nova-sche 1694000 nova   11u     IPv4         2361702700      0t0        TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)
  nova-sche 1694000 nova   21u     IPv4         2361698804      0t0        TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)

  
  And finally the second attempt in the sequence fails and the open tcp connections on placement look like this:
  root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done
  1693940
  1693999
  1694000
  nova-sche 1694000 nova   21u     IPv4         2361698804      0t0        TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED)

  
  From this moment on, each process opens its own connection and everything runs fine. But I don't understand why the first request passes without error.

  
  Steps to reproduce
  ==================
  * placement service with keepalive set up (in my case 65s)
  * more that 1 nova-scheduler worker (in my case worker==2)
  * restart nova-scheduler
  * and make two scheduling calls within the keealive time interval 
  * second call gives you SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC error

  
  Expected result
  ===============
  The parent process does not open any socket for placement, or closes it before the child forks

  
  Environment
  ===========
  * Ubuntu 22.04.5 LTS
  * nova and nova-scheduler 29.0.1-0ubuntu1.4~cloud0
  * I also checked the current code in the master repository and the situation is the same there
  * python3-openssl 21.0.0-1
  * openssl 3.0.2-0ubuntu1.18

  
  Related patches
  ===========
  https://review.opendev.org/c/x/tobiko/+/880152
  https://review.opendev.org/c/openstack/freezer/+/456758


  
  *) Actually, I don't know if the singleton works correctly and as expected when not using thready but separate forked processes. Is that OK?

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2089388/+subscriptions