linux/arch/arm64/kernel/module-plts.c

379 lines
11 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0-only
/*
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
* Copyright (C) 2014-2017 Linaro Ltd. <ard.biesheuvel@linaro.org>
*/
#include <linux/elf.h>
arm64: implement ftrace with regs This patch implements FTRACE_WITH_REGS for arm64, which allows a traced function's arguments (and some other registers) to be captured into a struct pt_regs, allowing these to be inspected and/or modified. This is a building block for live-patching, where a function's arguments may be forwarded to another function. This is also necessary to enable ftrace and in-kernel pointer authentication at the same time, as it allows the LR value to be captured and adjusted prior to signing. Using GCC's -fpatchable-function-entry=N option, we can have the compiler insert a configurable number of NOPs between the function entry point and the usual prologue. This also ensures functions are AAPCS compliant (e.g. disabling inter-procedural register allocation). For example, with -fpatchable-function-entry=2, GCC 8.1.0 compiles the following: | unsigned long bar(void); | | unsigned long foo(void) | { | return bar() + 1; | } ... to: | <foo>: | nop | nop | stp x29, x30, [sp, #-16]! | mov x29, sp | bl 0 <bar> | add x0, x0, #0x1 | ldp x29, x30, [sp], #16 | ret This patch builds the kernel with -fpatchable-function-entry=2, prefixing each function with two NOPs. To trace a function, we replace these NOPs with a sequence that saves the LR into a GPR, then calls an ftrace entry assembly function which saves this and other relevant registers: | mov x9, x30 | bl <ftrace-entry> Since patchable functions are AAPCS compliant (and the kernel does not use x18 as a platform register), x9-x18 can be safely clobbered in the patched sequence and the ftrace entry code. There are now two ftrace entry functions, ftrace_regs_entry (which saves all GPRs), and ftrace_entry (which saves the bare minimum). A PLT is allocated for each within modules. Signed-off-by: Torsten Duwe <duwe@suse.de> [Mark: rework asm, comments, PLTs, initialization, commit message] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Torsten Duwe <duwe@suse.de> Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com> Tested-by: Torsten Duwe <duwe@suse.de> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Julien Thierry <jthierry@redhat.com> Cc: Will Deacon <will@kernel.org>
2019-02-08 15:10:19 +00:00
#include <linux/ftrace.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/sort.h>
static struct plt_entry __get_adrp_add_pair(u64 dst, u64 pc,
enum aarch64_insn_register reg)
{
u32 adrp, add;
adrp = aarch64_insn_gen_adr(pc, dst, reg, AARCH64_INSN_ADR_TYPE_ADRP);
add = aarch64_insn_gen_add_sub_imm(reg, reg, dst % SZ_4K,
AARCH64_INSN_VARIANT_64BIT,
AARCH64_INSN_ADSB_ADD);
return (struct plt_entry){ cpu_to_le32(adrp), cpu_to_le32(add) };
}
struct plt_entry get_plt_entry(u64 dst, void *pc)
{
struct plt_entry plt;
static u32 br;
if (!br)
br = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_16,
AARCH64_INSN_BRANCH_NOLINK);
plt = __get_adrp_add_pair(dst, (u64)pc, AARCH64_INSN_REG_16);
plt.br = cpu_to_le32(br);
return plt;
}
static bool plt_entries_equal(const struct plt_entry *a,
const struct plt_entry *b)
{
u64 p, q;
/*
* Check whether both entries refer to the same target:
* do the cheapest checks first.
* If the 'add' or 'br' opcodes are different, then the target
* cannot be the same.
*/
if (a->add != b->add || a->br != b->br)
return false;
p = ALIGN_DOWN((u64)a, SZ_4K);
q = ALIGN_DOWN((u64)b, SZ_4K);
/*
* If the 'adrp' opcodes are the same then we just need to check
* that they refer to the same 4k region.
*/
if (a->adrp == b->adrp && p == q)
return true;
return (p + aarch64_insn_adrp_get_offset(le32_to_cpu(a->adrp))) ==
(q + aarch64_insn_adrp_get_offset(le32_to_cpu(b->adrp)));
}
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
static bool in_init(const struct module *mod, void *loc)
{
return (u64)loc - (u64)mod->init_layout.base < mod->init_layout.size;
}
u64 module_emit_plt_entry(struct module *mod, Elf64_Shdr *sechdrs,
void *loc, const Elf64_Rela *rela,
Elf64_Sym *sym)
{
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
struct mod_plt_sec *pltsec = !in_init(mod, loc) ? &mod->arch.core :
&mod->arch.init;
struct plt_entry *plt = (struct plt_entry *)sechdrs[pltsec->plt_shndx].sh_addr;
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
int i = pltsec->plt_num_entries;
int j = i - 1;
u64 val = sym->st_value + rela->r_addend;
if (is_forbidden_offset_for_adrp(&plt[i].adrp))
i++;
plt[i] = get_plt_entry(val, &plt[i]);
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
/*
* Check if the entry we just created is a duplicate. Given that the
* relocations are sorted, this will be the last entry we allocated.
* (if one exists).
*/
if (j >= 0 && plt_entries_equal(plt + i, plt + j))
return (u64)&plt[j];
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
pltsec->plt_num_entries += i - j;
if (WARN_ON(pltsec->plt_num_entries > pltsec->plt_max_entries))
return 0;
return (u64)&plt[i];
}
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
#ifdef CONFIG_ARM64_ERRATUM_843419
u64 module_emit_veneer_for_adrp(struct module *mod, Elf64_Shdr *sechdrs,
void *loc, u64 val)
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
{
struct mod_plt_sec *pltsec = !in_init(mod, loc) ? &mod->arch.core :
&mod->arch.init;
struct plt_entry *plt = (struct plt_entry *)sechdrs[pltsec->plt_shndx].sh_addr;
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
int i = pltsec->plt_num_entries++;
u32 br;
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
int rd;
if (WARN_ON(pltsec->plt_num_entries > pltsec->plt_max_entries))
return 0;
if (is_forbidden_offset_for_adrp(&plt[i].adrp))
i = pltsec->plt_num_entries++;
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
/* get the destination register of the ADRP instruction */
rd = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RD,
le32_to_cpup((__le32 *)loc));
br = aarch64_insn_gen_branch_imm((u64)&plt[i].br, (u64)loc + 4,
AARCH64_INSN_BRANCH_NOLINK);
plt[i] = __get_adrp_add_pair(val, (u64)&plt[i], rd);
plt[i].br = cpu_to_le32(br);
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
return (u64)&plt[i];
}
#endif
#define cmp_3way(a, b) ((a) < (b) ? -1 : (a) > (b))
static int cmp_rela(const void *a, const void *b)
{
const Elf64_Rela *x = a, *y = b;
int i;
/* sort by type, symbol index and addend */
i = cmp_3way(ELF64_R_TYPE(x->r_info), ELF64_R_TYPE(y->r_info));
if (i == 0)
i = cmp_3way(ELF64_R_SYM(x->r_info), ELF64_R_SYM(y->r_info));
if (i == 0)
i = cmp_3way(x->r_addend, y->r_addend);
return i;
}
static bool duplicate_rel(const Elf64_Rela *rela, int num)
{
/*
* Entries are sorted by type, symbol index and addend. That means
* that, if a duplicate entry exists, it must be in the preceding
* slot.
*/
return num > 0 && cmp_rela(rela + num, rela + num - 1) == 0;
}
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
Elf64_Word dstidx, Elf_Shdr *dstsec)
{
unsigned int ret = 0;
Elf64_Sym *s;
int i;
for (i = 0; i < num; i++) {
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
u64 min_align;
switch (ELF64_R_TYPE(rela[i].r_info)) {
case R_AARCH64_JUMP26:
case R_AARCH64_CALL26:
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
break;
/*
* We only have to consider branch targets that resolve
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
* to symbols that are defined in a different section.
* This is not simply a heuristic, it is a fundamental
* limitation, since there is no guaranteed way to emit
* PLT entries sufficiently close to the branch if the
* section size exceeds the range of a branch
* instruction. So ignore relocations against defined
* symbols if they live in the same section as the
* relocation target.
*/
s = syms + ELF64_R_SYM(rela[i].r_info);
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
if (s->st_shndx == dstidx)
break;
/*
* Jump relocations with non-zero addends against
* undefined symbols are supported by the ELF spec, but
* do not occur in practice (e.g., 'jump n bytes past
* the entry point of undefined function symbol f').
* So we need to support them, but there is no need to
* take them into consideration when trying to optimize
* this code. So let's only check for duplicates when
* the addend is zero: this allows us to record the PLT
* entry address in the symbol table itself, rather than
* having to search the list for duplicates each time we
* emit one.
*/
if (rela[i].r_addend != 0 || !duplicate_rel(rela, i))
ret++;
break;
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
case R_AARCH64_ADR_PREL_PG_HI21_NC:
case R_AARCH64_ADR_PREL_PG_HI21:
if (!IS_ENABLED(CONFIG_ARM64_ERRATUM_843419) ||
!cpus_have_const_cap(ARM64_WORKAROUND_843419))
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
break;
/*
* Determine the minimal safe alignment for this ADRP
* instruction: the section alignment at which it is
* guaranteed not to appear at a vulnerable offset.
*
* This comes down to finding the least significant zero
* bit in bits [11:3] of the section offset, and
* increasing the section's alignment so that the
* resulting address of this instruction is guaranteed
* to equal the offset in that particular bit (as well
* as all less significant bits). This ensures that the
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
* address modulo 4 KB != 0xfff8 or 0xfffc (which would
* have all ones in bits [11:3])
*/
min_align = 2ULL << ffz(rela[i].r_offset | 0x7);
/*
* Allocate veneer space for each ADRP that may appear
* at a vulnerable offset nonetheless. At relocation
* time, some of these will remain unused since some
* ADRP instructions can be patched to ADR instructions
* instead.
*/
if (min_align > SZ_4K)
ret++;
else
dstsec->sh_addralign = max(dstsec->sh_addralign,
min_align);
break;
}
}
if (IS_ENABLED(CONFIG_ARM64_ERRATUM_843419) &&
cpus_have_const_cap(ARM64_WORKAROUND_843419))
/*
* Add some slack so we can skip PLT slots that may trigger
* the erratum due to the placement of the ADRP instruction.
*/
ret += DIV_ROUND_UP(ret, (SZ_4K / sizeof(struct plt_entry)));
return ret;
}
arm64/module: Optimize module load time by optimizing PLT counting When loading a module, module_frob_arch_sections() tries to figure out the number of PLTs that'll be needed to handle all the RELAs. While doing this, it tries to dedupe PLT allocations for multiple R_AARCH64_CALL26 relocations to the same symbol. It does the same for R_AARCH64_JUMP26 relocations. To make checks for duplicates easier/faster, it sorts the relocation list by type, symbol and addend. That way, to check for a duplicate relocation, it just needs to compare with the previous entry. However, sorting the entire relocation array is unnecessary and expensive (O(n log n)) because there are a lot of other relocation types that don't need deduping or can't be deduped. So this commit partitions the array into entries that need deduping and those that don't. And then sorts just the part that needs deduping. And when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled. This gives significant reduction in module load time for modules with large number of relocations with no measurable impact on modules with a small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE enabled, these were the results for a few downstream modules: Module Size (MB) wlan 14 video codec 3.8 drm 1.8 IPA 2.5 audio 1.2 gpu 1.8 Without this patch: Module Number of entries sorted Module load time (ms) wlan 243739 283 video codec 74029 138 drm 53837 67 IPA 42800 90 audio 21326 27 gpu 20967 32 Total time to load all these module: 637 ms With this patch: Module Number of entries sorted Module load time (ms) wlan 22454 61 video codec 10150 47 drm 13014 40 IPA 8097 63 audio 4606 16 gpu 6527 20 Total time to load all these modules: 247 Time saved during boot for just these 6 modules: 390 ms Signed-off-by: Saravana Kannan <saravanak@google.com> Acked-by: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Link: https://lore.kernel.org/r/20200623011803.91232-1-saravanak@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-06-23 01:18:02 +00:00
static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
Elf64_Word dstidx)
{
Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
if (s->st_shndx == dstidx)
return false;
return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
}
/* Group branch PLT relas at the front end of the array. */
static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
int numrels, Elf64_Word dstidx)
{
int i = 0, j = numrels - 1;
if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
return 0;
while (i < j) {
if (branch_rela_needs_plt(syms, &rela[i], dstidx))
i++;
else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
swap(rela[i], rela[j]);
else
j--;
}
return i;
}
int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
char *secstrings, struct module *mod)
{
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
unsigned long core_plts = 0;
unsigned long init_plts = 0;
Elf64_Sym *syms = NULL;
Elf_Shdr *pltsec, *tramp = NULL;
int i;
/*
* Find the empty .plt section so we can expand it to store the PLT
* entries. Record the symtab address as well.
*/
for (i = 0; i < ehdr->e_shnum; i++) {
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
if (!strcmp(secstrings + sechdrs[i].sh_name, ".plt"))
mod->arch.core.plt_shndx = i;
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
else if (!strcmp(secstrings + sechdrs[i].sh_name, ".init.plt"))
mod->arch.init.plt_shndx = i;
arm64/module: set trampoline section flags regardless of CONFIG_DYNAMIC_FTRACE In the arm64 module linker script, the section .text.ftrace_trampoline is specified unconditionally regardless of whether CONFIG_DYNAMIC_FTRACE is enabled (this is simply due to the limitation that module linker scripts are not preprocessed like the vmlinux one). Normally, for .plt and .text.ftrace_trampoline, the section flags present in the module binary wouldn't matter since module_frob_arch_sections() would assign them manually anyway. However, the arm64 module loader only sets the section flags for .text.ftrace_trampoline when CONFIG_DYNAMIC_FTRACE=y. That's only become problematic recently due to a recent change in binutils-2.35, where the .text.ftrace_trampoline section (along with the .plt section) is now marked writable and executable (WAX). We no longer allow writable and executable sections to be loaded due to commit 5c3a7db0c7ec ("module: Harden STRICT_MODULE_RWX"), so this is causing all modules linked with binutils-2.35 to be rejected under arm64. Drop the IS_ENABLED(CONFIG_DYNAMIC_FTRACE) check in module_frob_arch_sections() so that the section flags for .text.ftrace_trampoline get properly set to SHF_EXECINSTR|SHF_ALLOC, without SHF_WRITE. Signed-off-by: Jessica Yu <jeyu@kernel.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: http://lore.kernel.org/r/20200831094651.GA16385@linux-8ccs Link: https://lore.kernel.org/r/20200901160016.3646-1-jeyu@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-09-01 16:00:16 +00:00
else if (!strcmp(secstrings + sechdrs[i].sh_name,
".text.ftrace_trampoline"))
tramp = sechdrs + i;
else if (sechdrs[i].sh_type == SHT_SYMTAB)
syms = (Elf64_Sym *)sechdrs[i].sh_addr;
}
if (!mod->arch.core.plt_shndx || !mod->arch.init.plt_shndx) {
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
pr_err("%s: module PLT section(s) missing\n", mod->name);
return -ENOEXEC;
}
if (!syms) {
pr_err("%s: module symtab section missing\n", mod->name);
return -ENOEXEC;
}
for (i = 0; i < ehdr->e_shnum; i++) {
Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
arm64/module: Optimize module load time by optimizing PLT counting When loading a module, module_frob_arch_sections() tries to figure out the number of PLTs that'll be needed to handle all the RELAs. While doing this, it tries to dedupe PLT allocations for multiple R_AARCH64_CALL26 relocations to the same symbol. It does the same for R_AARCH64_JUMP26 relocations. To make checks for duplicates easier/faster, it sorts the relocation list by type, symbol and addend. That way, to check for a duplicate relocation, it just needs to compare with the previous entry. However, sorting the entire relocation array is unnecessary and expensive (O(n log n)) because there are a lot of other relocation types that don't need deduping or can't be deduped. So this commit partitions the array into entries that need deduping and those that don't. And then sorts just the part that needs deduping. And when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled. This gives significant reduction in module load time for modules with large number of relocations with no measurable impact on modules with a small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE enabled, these were the results for a few downstream modules: Module Size (MB) wlan 14 video codec 3.8 drm 1.8 IPA 2.5 audio 1.2 gpu 1.8 Without this patch: Module Number of entries sorted Module load time (ms) wlan 243739 283 video codec 74029 138 drm 53837 67 IPA 42800 90 audio 21326 27 gpu 20967 32 Total time to load all these module: 637 ms With this patch: Module Number of entries sorted Module load time (ms) wlan 22454 61 video codec 10150 47 drm 13014 40 IPA 8097 63 audio 4606 16 gpu 6527 20 Total time to load all these modules: 247 Time saved during boot for just these 6 modules: 390 ms Signed-off-by: Saravana Kannan <saravanak@google.com> Acked-by: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Link: https://lore.kernel.org/r/20200623011803.91232-1-saravanak@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-06-23 01:18:02 +00:00
int nents, numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;
if (sechdrs[i].sh_type != SHT_RELA)
continue;
/* ignore relocations that operate on non-exec sections */
if (!(dstsec->sh_flags & SHF_EXECINSTR))
continue;
arm64/module: Optimize module load time by optimizing PLT counting When loading a module, module_frob_arch_sections() tries to figure out the number of PLTs that'll be needed to handle all the RELAs. While doing this, it tries to dedupe PLT allocations for multiple R_AARCH64_CALL26 relocations to the same symbol. It does the same for R_AARCH64_JUMP26 relocations. To make checks for duplicates easier/faster, it sorts the relocation list by type, symbol and addend. That way, to check for a duplicate relocation, it just needs to compare with the previous entry. However, sorting the entire relocation array is unnecessary and expensive (O(n log n)) because there are a lot of other relocation types that don't need deduping or can't be deduped. So this commit partitions the array into entries that need deduping and those that don't. And then sorts just the part that needs deduping. And when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled. This gives significant reduction in module load time for modules with large number of relocations with no measurable impact on modules with a small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE enabled, these were the results for a few downstream modules: Module Size (MB) wlan 14 video codec 3.8 drm 1.8 IPA 2.5 audio 1.2 gpu 1.8 Without this patch: Module Number of entries sorted Module load time (ms) wlan 243739 283 video codec 74029 138 drm 53837 67 IPA 42800 90 audio 21326 27 gpu 20967 32 Total time to load all these module: 637 ms With this patch: Module Number of entries sorted Module load time (ms) wlan 22454 61 video codec 10150 47 drm 13014 40 IPA 8097 63 audio 4606 16 gpu 6527 20 Total time to load all these modules: 247 Time saved during boot for just these 6 modules: 390 ms Signed-off-by: Saravana Kannan <saravanak@google.com> Acked-by: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Link: https://lore.kernel.org/r/20200623011803.91232-1-saravanak@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-06-23 01:18:02 +00:00
/*
* sort branch relocations requiring a PLT by type, symbol index
* and addend
*/
nents = partition_branch_plt_relas(syms, rels, numrels,
sechdrs[i].sh_info);
if (nents)
sort(rels, nents, sizeof(Elf64_Rela), cmp_rela, NULL);
if (!str_has_prefix(secstrings + dstsec->sh_name, ".init"))
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
core_plts += count_plts(syms, rels, numrels,
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
sechdrs[i].sh_info, dstsec);
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
else
init_plts += count_plts(syms, rels, numrels,
arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419 Working around Cortex-A53 erratum #843419 involves special handling of ADRP instructions that end up in the last two instruction slots of a 4k page, or whose output register gets overwritten without having been read. (Note that the latter instruction sequence is never emitted by a properly functioning compiler, which is why it is disregarded by the handling of the same erratum in the bfd.ld linker which we rely on for the core kernel) Normally, this gets taken care of by the linker, which can spot such sequences at final link time, and insert a veneer if the ADRP ends up at a vulnerable offset. However, linux kernel modules are partially linked ELF objects, and so there is no 'final link time' other than the runtime loading of the module, at which time all the static relocations are resolved. For this reason, we have implemented the #843419 workaround for modules by avoiding ADRP instructions altogether, by using the large C model, and by passing -mpc-relative-literal-loads to recent versions of GCC that may emit adrp/ldr pairs to perform literal loads. However, this workaround forces us to keep literal data mixed with the instructions in the executable .text segment, and literal data may inadvertently turn into an exploitable speculative gadget depending on the relative offsets of arbitrary symbols. So let's reimplement this workaround in a way that allows us to switch back to the small C model, and to drop the -mpc-relative-literal-loads GCC switch, by patching affected ADRP instructions at runtime: - ADRP instructions that do not appear at 4k relative offset 0xff8 or 0xffc are ignored - ADRP instructions that are within 1 MB of their target symbol are converted into ADR instructions - remaining ADRP instructions are redirected via a veneer that performs the load using an unaffected movn/movk sequence. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [will: tidied up ADRP -> ADR instruction patching.] [will: use ULL suffix for 64-bit immediate] Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-03-06 17:15:33 +00:00
sechdrs[i].sh_info, dstsec);
}
pltsec = sechdrs + mod->arch.core.plt_shndx;
pltsec->sh_type = SHT_NOBITS;
pltsec->sh_flags = SHF_EXECINSTR | SHF_ALLOC;
pltsec->sh_addralign = L1_CACHE_BYTES;
pltsec->sh_size = (core_plts + 1) * sizeof(struct plt_entry);
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
mod->arch.core.plt_num_entries = 0;
mod->arch.core.plt_max_entries = core_plts;
pltsec = sechdrs + mod->arch.init.plt_shndx;
pltsec->sh_type = SHT_NOBITS;
pltsec->sh_flags = SHF_EXECINSTR | SHF_ALLOC;
pltsec->sh_addralign = L1_CACHE_BYTES;
pltsec->sh_size = (init_plts + 1) * sizeof(struct plt_entry);
arm64: module: split core and init PLT sections The arm64 module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This leads to relocation failures if the distance between those regions exceeds 128 MB. (In fact, this corner case is highly unlikely to occur on arm64, but the issue has been observed on ARM, whose module region is much smaller). So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section as the relocated branch instruction. Since there may now be two PLT entries associated with each entry in the symbol table, we can no longer hijack the symbol::st_size fields to record the addresses of PLT entries as we emit them for zero-addend relocations. So instead, perform an explicit comparison to check for duplicate entries. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2017-02-21 22:12:57 +00:00
mod->arch.init.plt_num_entries = 0;
mod->arch.init.plt_max_entries = init_plts;
if (tramp) {
tramp->sh_type = SHT_NOBITS;
tramp->sh_flags = SHF_EXECINSTR | SHF_ALLOC;
tramp->sh_addralign = __alignof__(struct plt_entry);
arm64: implement ftrace with regs This patch implements FTRACE_WITH_REGS for arm64, which allows a traced function's arguments (and some other registers) to be captured into a struct pt_regs, allowing these to be inspected and/or modified. This is a building block for live-patching, where a function's arguments may be forwarded to another function. This is also necessary to enable ftrace and in-kernel pointer authentication at the same time, as it allows the LR value to be captured and adjusted prior to signing. Using GCC's -fpatchable-function-entry=N option, we can have the compiler insert a configurable number of NOPs between the function entry point and the usual prologue. This also ensures functions are AAPCS compliant (e.g. disabling inter-procedural register allocation). For example, with -fpatchable-function-entry=2, GCC 8.1.0 compiles the following: | unsigned long bar(void); | | unsigned long foo(void) | { | return bar() + 1; | } ... to: | <foo>: | nop | nop | stp x29, x30, [sp, #-16]! | mov x29, sp | bl 0 <bar> | add x0, x0, #0x1 | ldp x29, x30, [sp], #16 | ret This patch builds the kernel with -fpatchable-function-entry=2, prefixing each function with two NOPs. To trace a function, we replace these NOPs with a sequence that saves the LR into a GPR, then calls an ftrace entry assembly function which saves this and other relevant registers: | mov x9, x30 | bl <ftrace-entry> Since patchable functions are AAPCS compliant (and the kernel does not use x18 as a platform register), x9-x18 can be safely clobbered in the patched sequence and the ftrace entry code. There are now two ftrace entry functions, ftrace_regs_entry (which saves all GPRs), and ftrace_entry (which saves the bare minimum). A PLT is allocated for each within modules. Signed-off-by: Torsten Duwe <duwe@suse.de> [Mark: rework asm, comments, PLTs, initialization, commit message] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Torsten Duwe <duwe@suse.de> Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com> Tested-by: Torsten Duwe <duwe@suse.de> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Julien Thierry <jthierry@redhat.com> Cc: Will Deacon <will@kernel.org>
2019-02-08 15:10:19 +00:00
tramp->sh_size = NR_FTRACE_PLTS * sizeof(struct plt_entry);
}
return 0;
}