linux/arch/arm
Eric Biggers 1862eb0073 crypto: arm/blake2b - add NEON-accelerated BLAKE2b
Add a NEON-accelerated implementation of BLAKE2b.

On Cortex-A7 (which these days is the most common ARM processor that
doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
SHA-256, and slightly faster than SHA-1.  It is also almost three times
as fast as the generic implementation of BLAKE2b:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.0
	sha1-neon            16.3
	blake2s-256-arm      18.8
	sha1-asm             20.8
	blake2s-256-generic  26.0
	sha256-neon          28.9
	sha256-asm           32.0
	blake2b-256-generic  38.9

This implementation isn't directly based on any other implementation,
but it borrows some ideas from previous NEON code I've written as well
as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
the other NEON implementations of BLAKE2b I'm aware of (the
implementation in the BLAKE2 official repository using intrinsics, and
Andrew Moon's implementation which can be found in SUPERCOP).  It does
only one block at a time, so it performs well on short messages too.

NEON-accelerated BLAKE2b is useful because there is interest in using
BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
these devices, the performance cost of upgrading to SHA-256 may be
unacceptable, whereas BLAKE2b-256 would actually improve performance.

Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
64-bit operations, and because BLAKE2s's block size is too small for
NEON to be helpful for it.  The best I've been able to do with BLAKE2s
on Cortex-A7 is 18.8 cpb with an optimized scalar implementation.

(I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
they're more complex as they require running multiple hashes at once.
Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
from the smaller number of rounds, not from the extra parallelism.)

For now this BLAKE2b implementation is only wired up to the shash API,
since there is no library API for BLAKE2b yet.  However, I've tried to
keep things consistent with BLAKE2s, e.g. by defining
blake2b_compress_arch() which is analogous to blake2s_compress_arch()
and could be exported for use by the library API later if needed.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:39 +11:00
boot ARM updates for 5.11: 2020-12-22 13:34:27 -08:00
common
configs ARM: SoC defconfigs for 5.11 2020-12-16 16:25:03 -08:00
crypto crypto: arm/blake2b - add NEON-accelerated BLAKE2b 2021-01-03 08:41:39 +11:00
include EFI updates collected by Ard Biesheuvel: 2020-12-24 12:40:07 -08:00
kernel A treewide cleanup of interrupt descriptor (ab)use with all sorts of racy 2020-12-24 13:50:23 -08:00
lib ARM: 9022/1: Change arch/arm/lib/mem*.S to use WEAK instead of .weak 2020-11-12 14:53:19 +00:00
mach-actions
mach-alpine
mach-artpec
mach-asm9260
mach-aspeed
mach-at91
mach-axxia
mach-bcm arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL 2020-12-15 12:13:42 -08:00
mach-berlin
mach-clps711x
mach-cns3xxx
mach-davinci arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL 2020-12-15 12:13:42 -08:00
mach-digicolor
mach-dove
mach-efm32
mach-ep93xx
mach-exynos ARM: SoC updates for 5.11 2020-12-16 16:22:36 -08:00
mach-footbridge
mach-gemini
mach-highbank arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL 2020-12-15 12:13:42 -08:00
mach-hisi
mach-imx ARM: SoC updates for 5.11 2020-12-16 16:22:36 -08:00
mach-integrator
mach-iop32x
mach-ixp4xx
mach-keystone ARM: SoC drivers for v5.11 2020-12-16 16:38:41 -08:00
mach-lpc18xx
mach-lpc32xx
mach-mediatek
mach-meson
mach-milbeaut
mach-mmp
mach-moxart
mach-mstar ARM: mstar: SMP support 2020-12-09 17:45:40 +01:00
mach-mv78xx0
mach-mvebu
mach-mxs ARM: mxs: Add serial number support for i.MX23, i.MX28 SoCs 2020-11-30 17:31:29 +08:00
mach-nomadik
mach-npcm
mach-nspire
mach-omap1 ARM: SoC drivers for v5.11 2020-12-16 16:38:41 -08:00
mach-omap2 ARM: SoC updates for OMAP GenPD 2020-12-16 16:53:54 -08:00
mach-orion5x
mach-oxnas
mach-picoxcell
mach-prima2
mach-pxa
mach-qcom
mach-rda
mach-realtek
mach-realview
mach-rockchip
mach-rpc ARM: rpc: use legacy_timer_tick 2020-10-30 21:57:05 +01:00
mach-s3c power supply and reset changes for the v5.11 series 2020-12-19 11:58:46 -08:00
mach-s5pv210 arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL 2020-12-15 12:13:42 -08:00
mach-sa1100 power: supply: collie_battery: Convert to GPIO descriptors 2020-11-30 02:18:49 +01:00
mach-shmobile ARM: shmobile: Stop using __raw_*() I/O accessors 2020-11-23 09:54:59 +01:00
mach-socfpga
mach-spear
mach-sti
mach-stm32
mach-sunxi ARM: sunxi: Add machine match for the Allwinner V3 SoC 2020-11-02 10:28:14 +01:00
mach-tango arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL 2020-12-15 12:13:42 -08:00
mach-tegra
mach-u300
mach-uniphier
mach-ux500
mach-versatile
mach-vexpress
mach-vt8500
mach-zx
mach-zynq
mm ARM updates for 5.11: 2020-12-22 13:34:27 -08:00
net
nwfpe
oprofile
plat-omap
plat-orion
plat-pxa
plat-versatile
probes
tools epoll: wire up syscall epoll_pwait2 2020-12-19 11:18:38 -08:00
vdso
vfp ARM: 9044/1: vfp: use undef hook for VFP support detection 2020-12-21 11:19:19 +00:00
xen
Kbuild
Kconfig ARM updates for 5.11: 2020-12-22 13:34:27 -08:00
Kconfig-nommu
Kconfig.assembler
Kconfig.debug ARM: remove ebsa110 platform 2020-10-30 21:57:03 +01:00
Makefile ARM updates for 5.11: 2020-12-22 13:34:27 -08:00