
yahoo-eng-team team mailing list archive

[Bug 1877491] [NEW] cc_grub_dpkg: determine idevs in a more robust manner with grub-mkdevicemap

 

Public bug reported:

Currently, we populate the debconf database variable grub-
pc/install_devices by checking to see if a device from a hardcoded
list [1] of device paths is present on the system:

- /dev/sda
- /dev/vda
- /dev/xvda
- /dev/sda1
- /dev/vda1
- /dev/xvda1

[1] https://github.com/canonical/cloud-init/blob/master/cloudinit/config/cc_grub_dpkg.py
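
For context, the existing check boils down to something like the
following (a sketch only, not the actual module code; the injectable
existence test is purely for illustration):

```python
import os

# The hardcoded candidate list described in this report.
GRUB_CANDIDATES = ["/dev/sda", "/dev/vda", "/dev/xvda",
                   "/dev/sda1", "/dev/vda1", "/dev/xvda1"]

def current_idevs(exists=os.path.exists):
    """Return the first candidate present on the system, falling back
    to a hardcoded /dev/sda when none of them exist."""
    for device in GRUB_CANDIDATES:
        if exists(device):
            return device
    return "/dev/sda"
```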

While this is a simple, elegant solution, the hardcoded list does not
match real-world conditions, where grub is often installed to a disk
that is not on the list.

The primary example is any cloud which uses NVMe storage, such as AWS c5
instances.

/dev/nvme0n1 is not on the above list, so in this case the module falls
back to a hardcoded /dev/sda value for grub-pc/install_devices.

The problem is that the grub postinstall script [2] checks whether the
device named in grub-pc/install_devices exists, and if it doesn't,
shows the user an interactive dpkg prompt where they must select the
disk to install grub to. See the screenshot [3].

[2] https://paste.ubuntu.com/p/5FChJxbk5K/
[3] https://launchpadlibrarian.net/478771797/Screenshot%20from%202020-04-14%2014-39-11.png

This breaks scripts that don't set DEBIAN_FRONTEND=noninteractive, as
they hang waiting for the user to input a choice.

I propose that we modify the cc_grub_dpkg module to be more robust at
selecting the correct disk grub is installed to.

Why not simply add an extra device path to the hardcoded list?

Let's take NVMe storage as an example again. On a c5d.large instance I
spun up just now, lsblk returns:

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0         7:0    0   18M  1 loop /snap/amazon-ssm-agent/1566
loop1         7:1    0 93.8M  1 loop /snap/core/8935
nvme0n1     259:0    0 46.6G  0 disk
nvme1n1     259:1    0    8G  0 disk
└─nvme1n1p1 259:2    0    8G  0 part /

We cannot hardcode /dev/nvme0n1, as NVMe device naming is not stable
across boots in the kernel: on some boots the 8G disk will be
/dev/nvme0n1, and on others it will be /dev/nvme1n1.

Instead, I propose a slightly more complex, but still well tested and
well defined method for determining the disk that grub is installed to.

The procedure closely follows how the postinst.in script [2] for grub2
determines what disks should be selected, and uses the exact same
commands, just run as subprocesses.

1) We check to see if /usr/sbin/grub-mkdevicemap exists. If it does,
grub has been installed. If not, we are in a container, and can exit
with empty values.

2) We execute "grub-mkdevicemap -n -m - | cut -f2" to get a list of
valid grub install targets.

3) We determine whether the system is EFI- or BIOS-based by checking
for the existence of /sys/firmware/efi. If BIOS, go to 4); if EFI, go
to 5).

4) If BIOS, we iterate over each drive in the list from 2), and use dd
to read the first 512 bytes of the disk (the MBR), searching for the
string "GRUB". The command used is "dd if="$device" bs=512 count=1 2>
/dev/null | grep -aq GRUB". We select the disk which contains the
"GRUB" string.

5) If EFI, we find the disk containing the EFI System Partition mounted
at /boot/efi, by parsing mount points with "findmnt -o SOURCE -n
/boot/efi". From there, we check whether we can simply drop the
partition number, by cross-referencing the list from 2). If not, we are
likely on bare metal, and need to translate the disk to a
/dev/disk/by-id value to match what grub-mkdevicemap generates.

Most values written to grub-pc/install_devices will be in
/dev/disk/by-id format, as produced by grub-mkdevicemap. This is robust
against unstable kernel device naming conventions.
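
Putting the five steps together, here is a minimal Python sketch of the
proposed logic (illustrative only: the function names, helper
structure, and partition-stripping rules are my assumptions, not the
actual cc_grub_dpkg code, and error handling is omitted):

```python
import os
import re
import subprocess

def parse_devicemap(output):
    """Pull device paths out of 'grub-mkdevicemap -n -m -' output,
    whose lines look like '(hd0)<TAB>/dev/disk/by-id/...'."""
    return [line.split("\t")[1] for line in output.splitlines() if "\t" in line]

def mbr_has_grub(first_sector):
    """True if the first 512 bytes of a disk contain the 'GRUB' string."""
    return b"GRUB" in first_sector[:512]

def strip_partition(source):
    """Drop a trailing partition number: /dev/nvme1n1p1 -> /dev/nvme1n1,
    /dev/sda1 -> /dev/sda."""
    if "nvme" in source:
        return re.sub(r"p\d+$", "", source)
    return re.sub(r"\d+$", "", source)

def fetch_idevs():
    # 1) No grub-mkdevicemap: grub is not installed, likely a container.
    if not os.path.exists("/usr/sbin/grub-mkdevicemap"):
        return ""
    # 2) Ask grub for its list of valid install targets.
    out = subprocess.check_output(["grub-mkdevicemap", "-n", "-m", "-"],
                                  text=True)
    devices = parse_devicemap(out)
    # 3) EFI systems expose /sys/firmware/efi.
    if not os.path.exists("/sys/firmware/efi"):
        # 4) BIOS: pick the disk whose MBR embeds the "GRUB" string.
        for device in devices:
            with open(device, "rb") as f:
                if mbr_has_grub(f.read(512)):
                    return device
        return ""
    # 5) EFI: find the disk backing /boot/efi and drop the partition.
    source = subprocess.check_output(
        ["findmnt", "-o", "SOURCE", "-n", "/boot/efi"], text=True).strip()
    disk = strip_partition(source)
    if disk in devices:
        return disk
    # Bare metal: translate to the /dev/disk/by-id name that
    # grub-mkdevicemap emits.
    for name in os.listdir("/dev/disk/by-id"):
        path = os.path.join("/dev/disk/by-id", name)
        if os.path.realpath(path) == disk and path in devices:
            return path
    return ""
```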

On Nitro, this returns:
/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0179fff411dd211f0

On Xen, this returns:
/dev/xvda

On a typical QEMU/KVM machine, this returns:
/dev/vda

On my personal desktop computer, this returns:
/dev/disk/by-id/ata-WDC_WD5000AAKX-00PWEA0_WD-WMAYP3497618

I believe this method is much more robust at detecting the correct grub
install disk than the previous hardcoded list. It is more complex, and I
accept that it will increase boot time by a few tenths of a second as it
runs these programs. cc_grub_dpkg only runs once, on instance creation,
so this shouldn't be a major problem.

I have tested this on AWS (both Xen and Nitro instance types), on KVM
with BIOS- and EFI-based instances, in LXC, and on bare metal with a
BIOS-based MAAS machine.

All give the correct results in my testing.

Due to the complexity of the code, I anticipate this will need a few
revisions to get right, so please let me know if something needs to be
changed.

TESTING:

You can fetch grub-pc/install_devices with:

$ echo get grub-pc/install_devices | sudo debconf-communicate grub-pc

Reset with:

$ echo reset grub-pc/install_devices | sudo debconf-communicate grub-pc

** Affects: cloud-init
     Importance: Undecided
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress


** Tags: sts

** Changed in: cloud-init
       Status: New => In Progress

** Changed in: cloud-init
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Tags added: sts

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1877491


To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1877491/+subscriptions

