
yahoo-eng-team team mailing list archive

[Bug 2093334] [NEW] libvirt - I/O errors after modifying fs.aio-max-nr

 

Public bug reported:

Description
===========
After increasing the fs.aio-max-nr parameter to 1048576, we observe issues while instances are migrating or while attaching new volumes to instances. The default value was 65536 (in my case the limit had been reached, which is why I increased the parameter).
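
For reference, a minimal sketch (not part of the original report, assuming a standard Linux host) of how to compare the host's current AIO context usage against the limit; /proc/sys/fs/aio-nr and /proc/sys/fs/aio-max-nr are the standard kernel interfaces for these values:

# Minimal sketch: compare AIO contexts in use against the configured limit.
def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

in_use = read_int("/proc/sys/fs/aio-nr")
limit = read_int("/proc/sys/fs/aio-max-nr")
print(f"AIO contexts in use: {in_use} / {limit}")
# When the limit is reached, new io_setup() calls fail with EAGAIN, which
# QEMU/libvirt can surface as block-job or guest I/O errors.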

The symptoms vary; most of the errors are returned by libvirt. The errors
are tagged as qemuDomainBlockJobAbort:14400 in
/var/log/kolla/libvirt/libvirtd.log

Guest OS can return I/O errors too:

------
I/O error, dev vdb, sector 419430272 op (...)
Buffer I/O error on dev vdb, logical block 52428784, async page read
-----

Below are the logs while migrating an instance:
-------------------
2025-01-03 11:56:36.696+0000: 2052655: error : qemuDomainBlockJobAbort:14400 : invalid argument: disk vdb does not have an active block job
2025-01-03 11:56:36.750+0000: 2052655: error : qemuDomainBlockJobAbort:14400 : invalid argument: disk vdb does not have an active block job
2025-01-03 11:58:37.800+0000: 2052656: error : qemuDomainBlockJobAbort:14400 : invalid argument: disk vdc does not have an active block job
2025-01-03 11:58:37.853+0000: 2052656: error : qemuDomainBlockJobAbort:14400 : invalid argument: disk vdc does not have an active block job


Jan  7 16:31:00 comp-b16 nova-compute: 2025-01-07 16:31:00.947 7 
WARNING os_brick.initiator.connectors.base [req-a642bf63-afe1-4735-9484-6996f0c6a12a req-517426df-68df-462d-9b0b-09c1c3614010 2fa3eaeea47247778d2e5d9e622100bf 995bbd9fad3a4f71843859fef971ea2f - - default default] 
Service needs to call os_brick.setup() before connecting volumes, if it doesn't it will break on the next release: nova.exception.VolumeRebaseFailed: 
Volume rebase failed: invalid argument: disk vda does not have an active block job
------------

For testing purposes, I created 10 VMs on a compute node with the modified
fs.aio-max-nr. The guest OS (Ubuntu 22) on 4-5 of the VMs returned I/O
errors on the newly attached disks (the disks went read-only). To bring the
VMs back to a working state, I detached the previously attached volumes and
rebooted the instances.
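
As a side note, a small sketch (the virtio device names are just examples) that can be run inside the guest to see which disks the kernel has marked read-only, using the standard sysfs "ro" flag:

# Minimal sketch: list virtio block devices the guest kernel marked read-only.
import glob, os

for ro_path in sorted(glob.glob("/sys/block/vd*/ro")):
    dev = os.path.basename(os.path.dirname(ro_path))  # e.g. "vdb"
    with open(ro_path) as f:
        read_only = f.read().strip() == "1"
    print(f"/dev/{dev}: {'read-only' if read_only else 'read-write'}")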


I increased the limit without restarting the nova-libvirt container.


Steps to reproduce
==================
- reach the maximum value for fs.aio-max-nr (you can lower the limit to simulate this; see the sketch after this list)
- increase fs.aio-max-nr to 1048576 on compute B
- migrate instances from compute A to compute B, or attach a new disk to an existing instance on compute B
- check the libvirt logs and the guest OS
- reboot the instance and verify that it boots successfully
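
A rough sketch of the simulation step referenced above (run as root on compute B; the restore value matches the one used in this report): temporarily lowering fs.aio-max-nr to the current fs.aio-nr usage makes the next io_setup() call fail, which is the condition a volume attach or migration onto this host then hits:

# Rough sketch (root required): simulate an exhausted fs.aio-max-nr by
# lowering it to the current usage, then restore the increased value.
AIO_NR = "/proc/sys/fs/aio-nr"
AIO_MAX_NR = "/proc/sys/fs/aio-max-nr"

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def write_int(path, value):
    with open(path, "w") as f:
        f.write(str(value))

print("usage:", read_int(AIO_NR), "limit:", read_int(AIO_MAX_NR))
write_int(AIO_MAX_NR, read_int(AIO_NR))   # simulate a reached limit
# ... attach a volume to / migrate an instance onto this compute here ...
write_int(AIO_MAX_NR, 1048576)            # restore the increased limit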

Expected result
===============
The increased value does not affect existing VMs and volumes

Actual result
=============
Issues when attaching new disks and when migrating instances

Environment
===========
OpenStack 2023.2 Bobcat
Kolla-ansible
libvirtd (libvirt) 8.0.0


Maybe this parameter should be increased in a different place on the hypervisor or in Kolla? Dear team, could you please explain what is wrong?

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2093334

Title:
  libvirt - I/O errors after modifying fs.aio-max-nr

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2093334/+subscriptions