From db82e667c7b52b8ff75d2cfc071d275a831bc915 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Zbigniew=20J=C4=99drzejewski-Szmek?=
Date: Thu, 9 Jun 2022 12:15:46 +0200
Subject: [PATCH] docs/BLS: move "boot counting" into the main spec

The boot-counting file-renaming entry-sorting part that the boot loader
implements is moved to the main document. The second document describes a
specific implementation that is provided through systemd units.

The sorting algorithm is extended to say that bad entries should be sorted
later. I also added a note that bad entries should be available for booting.

For some reason, the second document said that it applies only to EFI systems.
AFAIK there are no implementations for non-EFI, but the specification should
work just fine, if somebody were to implement it. So that part is dropped.

Fixes #23345.

Sadly, bootctl doesn't implement sorting of boot entries with counting :((((
But I'm leaving that for another PR.
---
 docs/AUTOMATIC_BOOT_ASSESSMENT.md | 95 +++++++++++++------------------
 docs/BOOT_LOADER_SPECIFICATION.md | 86 +++++++++++++++++++++++++---
 2 files changed, 118 insertions(+), 63 deletions(-)

diff --git a/docs/AUTOMATIC_BOOT_ASSESSMENT.md b/docs/AUTOMATIC_BOOT_ASSESSMENT.md
index 2e015eab371..59cae4754a3 100644
--- a/docs/AUTOMATIC_BOOT_ASSESSMENT.md
+++ b/docs/AUTOMATIC_BOOT_ASSESSMENT.md
@@ -8,14 +8,17 @@ SPDX-License-Identifier: LGPL-2.1-or-later
 # Automatic Boot Assessment
 
 systemd provides support for automatically reverting back to the previous
-version of the OS or kernel in case the system consistently fails to boot. This
-support is built into various of its components. When used together these
-components provide a complete solution on UEFI systems, built as add-on to the
-[Boot Loader Specification](BOOT_LOADER_SPECIFICATION.md).
-However, the different components may also be used independently, and in
-combination with other software, to implement similar schemes, for example with
-other boot loaders or for non-UEFI systems. Here's a brief overview of the
-complete set of components:
+version of the OS or kernel in case the system consistently fails to boot. The
+[Boot Loader Specification](BOOT_LOADER_SPECIFICATION.md#boot-counting)
+describes how to annotate boot loader entries with a counter that specifies how
+many attempts should be made to boot it. This document describes how systemd
+implements this scheme.
+
+The many different components involved in the implementation may be used
+independently and in combination with other software to for example support
+other boot loaders or take actions outside of the boot loader.
+
+Here's a brief overview of the complete set of components:
 
 * The
   [`systemd-boot(7)`](https://www.freedesktop.org/software/systemd/man/systemd-boot.html)
@@ -57,38 +60,36 @@ complete set of components:
 
 ## Details
 
-The boot counting data `systemd-boot` and `systemd-bless-boot.service`
-manage is stored in the name of the boot loader entries. If a boot loader entry
-file name contains `+` followed by one or two numbers (if two numbers, then
-those need to be separated by `-`) right before the `.conf` suffix, then boot
-counting is enabled for it. The first number is the "tries left" counter
-encoding how many attempts to boot this entry shall still be made. The second
-number is the "tries done" counter, encoding how many failed attempts to boot
-it have already been made. Each time a boot loader entry marked this way is
-booted the first counter is decreased by one, and the second one increased by
-one. (If the second counter is missing, then it is assumed to be equivalent to
-zero.) If the "tries left" counter is above zero the entry is still considered
-for booting (the entry's state is considered to be "indeterminate"), as soon as
-it reached zero the entry is not tried anymore (entry state "bad"). If the boot
-attempt completed successfully the entry's counters are removed from the name
-(entry state "good"), thus turning off boot counting for the future.
+As described in [Boot Loader Specification](BOOT_LOADER_SPECIFICATION.md#boot-counting),
+the boot counting data is stored in the file name of the boot loader entries as
+a plus (`+`), followed by a number, optionally followed by `-` and another
+number, right before the file name suffix (`.conf` or `.efi`).
+
+The first number is the "tries left" counter encoding how many attempts to boot
+this entry shall still be made. The second number is the "tries done" counter,
+encoding how many failed attempts to boot it have already been made. Each time
+a boot loader entry marked this way is booted the first counter is decremented,
+and the second one incremented. (If the second counter is missing, then it is
+assumed to be equivalent to zero.) If the boot attempt completed successfully
+the entry's counters are removed from the name (entry state "good"), thus
+turning off boot counting for the future.
 
 ## Walkthrough
 
 Here's an example walkthrough of how this all fits together.
 
-1. The user runs `echo 3 > /etc/kernel/tries` to enable boot counting.
+1. The user runs `echo 3 >/etc/kernel/tries` to enable boot counting.
 
 2. A new kernel is installed. `kernel-install` is used to generate a new boot
    loader entry file for it. Let's say the version string for the new kernel is
    `4.14.11-300.fc27.x86_64`, a new boot loader entry
    `/boot/loader/entries/4.14.11-300.fc27.x86_64+3.conf` is hence created.
 
-3. The system is booted for the first time after the new kernel is
+3. The system is booted for the first time after the new kernel has been
    installed. The boot loader now sees the `+3` counter in the entry file name. It
    hence renames the file to `4.14.11-300.fc27.x86_64+2-1.conf`
-   indicating that at this point one attempt has started and thus only one less
-   is left. After the rename completed the entry is booted as usual.
+   indicating that at this point one attempt has started.
+   After the rename completed, the entry is booted as usual.
 
 4. Let's say this attempt to boot fails. On the following boot the boot loader
    will hence see the `+2-1` tag in the name, and hence rename the entry file to
@@ -98,11 +99,11 @@ Here's an example walkthrough of how this all fits together.
    see the `+1-2` tag, and rename the file to
    `4.14.11-300.fc27.x86_64+0-3.conf` and boot it.
 
-6. If this boot also fails, on the next boot the boot loader will see the
-   tag `+0-3`, i.e. the counter reached zero. At this point the entry will be
-   considered "bad", and ordered to the beginning of the list of entries. The
-   next newest boot entry is now tried, i.e. the system automatically reverted
-   back to an earlier version.
+6. If this boot also fails, on the next boot the boot loader will see the tag
+   `+0-3`, i.e. the counter reached zero. At this point the entry will be
+   considered "bad", and ordered after all non-bad entries. The next newest
+   boot entry is now tried, i.e. the system automatically reverted to an
+   earlier version.
 
 The above describes the walkthrough when the selected boot entry continuously
 fails. Let's have a look at an alternative ending to this walkthrough. In this
@@ -143,7 +144,7 @@ scenario the first 4 steps are the same as above:
     renames it dropping the counter tag. Thus
     `4.14.11-300.fc27.x86_64+1-2.conf` is renamed to
     `4.14.11-300.fc27.x86_64.conf`. From this moment boot counting is turned
-    off.
+    off for this entry.
 
 12. On the following boot (and all subsequent boots after that) the entry is
     now seen with boot counting turned off, no further renaming takes place.
@@ -156,9 +157,9 @@ are a couple of recommendations.
 1. To support alternative boot loaders in place of `systemd-boot` two scenarios
    are recommended:
 
-   a. Boot loaders already implementing the Boot Loader Specification can simply
-      implement an equivalent file rename based logic, and thus integrate fully
-      with the rest of the stack.
+   a. Boot loaders already implementing the Boot Loader Specification can
+      simply implement the same rename logic, and thus integrate fully with
+      the rest of the stack.
 
    b. Boot loaders that want to implement boot counting and store the counters
       elsewhere can provide their own replacements for
@@ -181,27 +182,11 @@ are a couple of recommendations.
 
 ## FAQ
 
-1. *Why do you use file renames to store the counter? Why not a regular file?*
-   — Mainly two reasons: it's relatively likely that renames can be implemented
-   atomically even in simpler file systems, while writing to file contents has
-   a much bigger chance to be result in incomplete or corrupt data, as renaming
-   generally avoids allocating or releasing data blocks. Moreover it has the
-   benefit that the boot count metadata is directly attached to the boot loader
-   entry file, and thus the lifecycle of the metadata and the entry itself are
-   bound together. This means no additional clean-up needs to take place to
-   drop the boot loader counting information for an entry when it is removed.
-
-2. *Why not use EFI variables for storing the boot counter?* — The memory chips
-   used to back the persistent EFI variables are generally not of the highest
-   quality, hence shouldn't be written to more than necessary. This means we
-   can't really use it for changes made regularly during boot, but can use it
-   only for seldom made configuration changes.
-
-3. *I have a service which — when it fails — should immediately cause a
-   reboot. How does that fit in with the above?* — Well, that's orthogonal to
+1. *I have a service which — when it fails — should immediately cause a
+   reboot. How does that fit in with the above?* — That's orthogonal to
    the above, please use `FailureAction=` in the unit file for this.
 
-4. *Under some condition I want to mark the current boot loader entry as bad
+2. *Under some condition I want to mark the current boot loader entry as bad
    right-away, so that it never is tried again, how do I do that?* — You may
    invoke `/usr/lib/systemd/systemd-bless-boot bad` at any time to mark the
    current boot loader entry as "bad" right-away so that it isn't tried again
diff --git a/docs/BOOT_LOADER_SPECIFICATION.md b/docs/BOOT_LOADER_SPECIFICATION.md
index 805c1da0318..8a6a16c63f8 100644
--- a/docs/BOOT_LOADER_SPECIFICATION.md
+++ b/docs/BOOT_LOADER_SPECIFICATION.md
@@ -391,25 +391,77 @@ creating a partition and file system for it) and creates the `/loader/entries/`
 directory in it. It then installs an appropriate boot loader that can read these
 snippets. Finally, it installs one or more kernel packages.
 
+## Boot counting
+
+The main idea is that when boot entries are initially installed, they are
+marked as "indeterminate" and assigned a number of boot attempts. Each time the
+boot loader tries to boot an entry, it decreases this count by one. If the
+operating system considers the boot as successful, it removes the counter
+altogether and the entry becomes "good". Otherwise, once the assigned number of
+boots is exhausted, the entry is marked as "bad".
+
+Which boots are "successful" is determined by the operating system. systemd
+provides a generic mechanism that can be extended with arbitrary checks and
+actions, see [Automatic Boot Assessment](AUTOMATIC_BOOT_ASSESSMENT.md), but the
+boot counting mechanism described in this specification can also be used with
+other implementations.
+
+The boot counting data is stored in the name of the boot loader entry. A boot
+loader entry file name may contain a plus (`+`) followed by a number. This may
+optionally be followed by a minus (`-`) followed by a second number. The dot
+(`.`) and file name suffix (`conf` or `efi`) must immediately follow. Boot
+counting is enabled for entries which match this pattern.
+
+The first number is the "tries left" counter signifying how many attempts to boot
+this entry shall still be made. The second number is the "tries done" counter,
+showing how many failed attempts to boot it have already been made. Each time
+a boot loader entry marked this way is booted, the first counter is decremented,
+and the second one incremented. (If the second counter is missing,
+then it is assumed to be equivalent to zero.) If the "tries left" counter is
+above zero the entry is still considered "indeterminate". A boot entry with the
+"tries left" counter at zero is considered "bad".
+
+If the boot attempt completed successfully the entry's counters are removed
+from the name (entry state becomes "good"), thus turning off boot counting for
+this entry.
+
 ## Sorting
 
 The boot loader menu should generally show entries in some order meaningful to
 the user. The `title` key is free-form and not suitable to be used as the
 primary sorting key. Instead, the boot loader should use the following rules:
-if `sort-key` is set on both entries, use in order of priority,
-the `sort-key` (A-Z, increasing [alphanumerical order](#alphanumerical-order)),
-`machine-id` (A-Z, increasing alphanumerical order),
-and `version` keys (decreasing [version order](#version-order)).
-If `sort-key` is set on one entry, it sorts earlier.
-At the end, if necessary, when `sort-key` is not set or those fields are not
-set or are all equal, the boot loader should sort using the file name of the
-entry (decreasing version sort), with the suffix removed.
+
+1. Entries which are subject to boot counting and are marked as "bad", should
+   be sorted later than all other entries. Entries which are marked as
+   "indeterminate" or "good" (or were not subject to boot counting at all),
+   are thus sorted earlier.
+
+2. If `sort-key` is set on both entries, use in order of priority,
+   the `sort-key` (A-Z, increasing [alphanumerical order](#alphanumerical-order)),
+   `machine-id` (A-Z, increasing alphanumerical order),
+   and `version` keys (decreasing [version order](#version-order)).
+
+3. If `sort-key` is set on one entry, it sorts earlier.
+
+4. At the end, if necessary, when `sort-key` is not set or those fields are not
+   set or are all equal, the boot loader should sort using the file name of the
+   entry (decreasing version sort), with the suffix removed.
 
 **Note:** _This description assumes that the boot loader shows entries in a
 traditional menu, with newest and "best" entries at the top, thus entries with a
 higher version number are sorter *earlier*. The boot loader is free to use a
 different direction (or none at all) during display._
 
+**Note:** _The boot loader should allow booting "bad" entries, e.g. in case no
+other entries are left or they are unusable for other reasons. It may
+deemphasize or hide such entries by default._
+
+**Note:** _"Bad" boot entries have a suffix of "+0-`n`", where `n` is the
+number of failed boot attempts. Removal of the suffix is not necessary for
+comparisons described by the last point above. In the unlikely scenario that we
+have multiple such boot entries that differ only by the boot counting data, we
+would sort them by `n`._
+
 ### Alphanumerical order
 
 Free-form strings and machine IDs should be compared using a method equivalent
@@ -574,6 +626,24 @@ to have them in reverse order. But when multiple kernels are available for the
 same installation, we want to display the latest kernel with highest priority,
 i.e. earlier in the list.
 
+### Why do you use file renames to store the counter? Why not a regular file?
+
+Mainly two reasons: it's relatively likely that renames can be implemented
+atomically even in simpler file systems, as renaming generally avoids
+allocating or releasing data blocks. Writing to file contents has a much bigger
+chance to result in incomplete or corrupt data. Moreover renaming has the
+benefit that the boot count metadata is directly attached to the boot loader
+entry file, and thus the lifecycle of the metadata and the entry itself are
+bound together. This means no additional clean-up needs to take place to drop
+the boot loader counting information for an entry when it is removed.
+
+### Why not use EFI variables for storing the boot counter?
+
+The memory chips used to back the persistent EFI variables are generally not of
+the highest quality, hence shouldn't be written to more than necessary. This
+means we can't really use it for changes made regularly during boot, but should
+use it only for seldom-made configuration changes.
+
 ### Out of Focus
 
 There are a couple of items that are out of focus for this specification: