Commit graph

982080 commits

Author SHA1 Message Date
Ard Biesheuvel 9ad58b46f8 crypto: x86/serpent - drop dependency on glue helper
Replace the glue helper dependency with implementations of ECB and CBC
based on the new CPP macros, which avoid the need for indirect calls.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:29 +11:00
Ard Biesheuvel 407d409a81 crypto: x86/camellia - drop dependency on glue helper
Replace the glue helper dependency with implementations of ECB and CBC
based on the new CPP macros, which avoid the need for indirect calls.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:29 +11:00
Ard Biesheuvel 827ee47228 crypto: x86 - add some helper macros for ECB and CBC modes
The x86 glue helper module is starting to show its age:
- It relies heavily on function pointers to invoke asm helper functions that
  operate on fixed input sizes that are relatively small. This means the
  performance is severely impacted by retpolines.
- It goes to great lengths to amortize the cost of kernel_fpu_begin()/end()
  over as much work as possible, which is no longer necessary now that FPU
  save/restore is done lazily, and doing so may cause unbounded scheduling
  blackouts due to the fact that enabling the FPU in kernel mode disables
  preemption.
- The CBC mode decryption helper makes backward strides through the input, in
  order to avoid a single block size memcpy() between chunks. Consuming the
  input in this manner is highly likely to defeat any hardware prefetchers,
  so it is better to go through the data linearly, and perform the extra
  memcpy() where needed (which is turned into direct loads and stores by the
  compiler anyway). Note that benchmarks won't show this effect, given that
  the memory they use is always cache hot.
- It implements blockwise XOR in terms of le128 pointers, which imply an
  alignment that is not guaranteed by the API, violating the C standard.

GCC does not seem to be smart enough to elide the indirect calls when the
function pointers are passed as arguments to static inline helper routines
modeled after the existing ones. So instead, let's create some CPP macros
that encapsulate the core of the ECB and CBC processing, so we can wire
them up for existing users of the glue helper module, i.e., Camellia,
Serpent, Twofish and CAST6.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:29 +11:00
Ard Biesheuvel c0a64926c5 crypto: x86/blowfish - drop CTR mode implementation
Blowfish in counter mode is never used in the kernel, so there
is no point in keeping an accelerated implementation around.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:29 +11:00
Ard Biesheuvel 768db5fee3 crypto: x86/des - drop CTR mode implementation
DES or Triple DES in counter mode is never used in the kernel, so there
is no point in keeping an accelerated implementation around.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:28 +11:00
Ard Biesheuvel 89b7ba5c8b crypto: x86/glue-helper - drop CTR helper routines
The glue helper's CTR routines are no longer used, so drop them.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:28 +11:00
Ard Biesheuvel f43dcaf2c9 crypto: x86/twofish - drop CTR mode implementation
Twofish in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:28 +11:00
Ard Biesheuvel 7a6623cc68 crypto: x86/cast6 - drop CTR mode implementation
CAST6 in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:28 +11:00
Ard Biesheuvel e2d60e2f59 crypto: x86/cast5 - drop CTR mode implementation
CAST5 in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:28 +11:00
Ard Biesheuvel 2e9440ae6e crypto: x86/serpent - drop CTR mode implementation
Serpent in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:28 +11:00
Ard Biesheuvel a1f91ecf81 crypto: x86/camellia - drop CTR mode implementation
Camellia in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:28 +11:00
Ard Biesheuvel 31d49c448a crypto: x86/glue-helper - drop XTS helper routines
The glue helper's XTS routines are no longer used, so drop them.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:27 +11:00
Ard Biesheuvel da4df93a94 crypto: x86/twofish - switch to XTS template
Now that the XTS template can wrap accelerated ECB modes, it can be
used to implement Twofish in XTS mode as well, which turns out to
be at least as fast, and sometimes even faster

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:27 +11:00
Ard Biesheuvel 9ec0af8aa6 crypto: x86/serpent- switch to XTS template
Now that the XTS template can wrap accelerated ECB modes, it can be
used to implement Serpent in XTS mode as well, which turns out to
be at least as fast, and sometimes even faster

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:27 +11:00
Ard Biesheuvel 2cc0fedb81 crypto: x86/cast6 - switch to XTS template
Now that the XTS template can wrap accelerated ECB modes, it can be
used to implement CAST6 in XTS mode as well, which turns out to
be at least as fast, and sometimes even faster

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:27 +11:00
Ard Biesheuvel 55a7e88f01 crypto: x86/camellia - switch to XTS template
Now that the XTS template can wrap accelerated ECB modes, it can be
used to implement Camellia in XTS mode as well, which turns out to
be at least as fast, and sometimes even faster.

Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:27 +11:00
Bhaskar Chowdhury 4d6a5a4b1e crypto: marvell/cesa - Fix a spelling s/fautly/faultly/ in comment
s/fautly/faulty/p

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:27 +11:00
Kai Ye 34932a6033 crypto: hisilicon/sec - register SEC device to uacce
Register SEC device to uacce framework for user space.

Signed-off-by: Kai Ye <yekai13@huawei.com>
Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
Reviewed-by: Zaibo Xu <xuzaibo@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:27 +11:00
Kai Ye bedd04e4aa crypto: hisilicon/hpre - register HPRE device to uacce
Register HPRE device to uacce framework for user space.

Signed-off-by: Kai Ye <yekai13@huawei.com>
Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
Reviewed-by: Zaibo Xu <xuzaibo@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:26 +11:00
Kai Ye f8408d2b79 crypto: hisilicon - add ZIP device using mode parameter
Add 'uacce_mode' parameter for ZIP, which can be set as 0(default) or 1.
'0' means ZIP is only registered to kernel crypto, and '1' means it's
registered to both kernel crypto and UACCE.

Signed-off-by: Kai Ye <yekai13@huawei.com>
Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
Reviewed-by: Zaibo Xu <xuzaibo@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:26 +11:00
Kai Ye 0d61c3f144 crypto: hisilicon/qm - SVA bugfixed on Kunpeng920
Kunpeng920 SEC/HPRE/ZIP cannot support running user space SVA and kernel
Crypto at the same time. Therefore, the algorithms should not be registered
to Crypto as user space SVA is enabled.

Signed-off-by: Kai Ye <yekai13@huawei.com>
Reviewed-by: Zaibo Xu <xuzaibo@huawei.com>
Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:26 +11:00
Jiri Olsa f7f2b43eaf crypto: bcm - Rename struct device_private to bcm_device_private
Renaming 'struct device_private' to 'struct bcm_device_private',
because it clashes with 'struct device_private' from
'drivers/base/base.h'.

While it's not a functional problem, it's causing two distinct
type hierarchies in BTF data. It also breaks build with options:
  CONFIG_DEBUG_INFO_BTF=y
  CONFIG_CRYPTO_DEV_BCM_SPU=y

as reported by Qais Yousef [1].

[1] https://lore.kernel.org/lkml/20201229151352.6hzmjvu3qh6p2qgg@e107158-lin/

Fixes: 9d12ba86f8 ("crypto: brcm - Add Broadcom SPU driver")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:26 +11:00
Adam Guerin e48767c177 crypto: qat - reduce size of mapped region
Restrict size of field to what is required by the operation.

This issue was detected by smatch:

    drivers/crypto/qat/qat_common/qat_asym_algs.c:328 qat_dh_compute_value() error: dma_map_single_attrs() '&qat_req->in.dh.in.b' too small (8 vs 64)

Signed-off-by: Adam Guerin <adam.guerin@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:26 +11:00
Adam Guerin 80fccf18fe crypto: qat - change format string and cast ring size
Cast ADF_SIZE_TO_RING_SIZE_IN_BYTES() so it can return a 64 bit value.

This issue was detected by smatch:

    drivers/crypto/qat/qat_common/adf_transport_debug.c:65 adf_ring_show() warn: should '(1 << (ring->ring_size - 1)) << 7' be a 64 bit type?

Signed-off-by: Adam Guerin <adam.guerin@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:26 +11:00
Adam Guerin 1aaae055d4 crypto: qat - fix potential spectre issue
Sanitize ring_num value coming from configuration (and potentially
from user space) before it is used as index in the banks array.

This issue was detected by smatch:

    drivers/crypto/qat/qat_common/adf_transport.c:233 adf_create_ring() warn: potential spectre issue 'bank->rings' [r] (local cap)

Signed-off-by: Adam Guerin <adam.guerin@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:26 +11:00
Wojciech Ziemba 0db0d797ab crypto: qat - configure arbiter mapping based on engines enabled
The hardware specific function adf_get_arbiter_mapping() modifies
the static array thrd_to_arb_map to disable mappings for AEs
that are disabled. This static array is used for each device
of the same type. If the ae mask is not identical for all devices
of the same type then the arbiter mapping returned by
adf_get_arbiter_mapping() may be wrong.

This patch fixes this problem by ensuring the static arbiter
mapping is unchanged and the device arbiter mapping is re-calculated
each time based on the static mapping.

Signed-off-by: Wojciech Ziemba <wojciech.ziemba@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:26 +11:00
Ard Biesheuvel d6cbf4eaa4 crypto: aesni - replace function pointers with static branches
Replace the function pointers in the GCM implementation with static branches,
which are based on code patching, which occurs only at module load time.
This avoids the severe performance penalty caused by the use of retpolines.

In order to retain the ability to switch between different versions of the
implementation based on the input size on cores that support AVX and AVX2,
use static branches instead of static calls.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:25 +11:00
Ard Biesheuvel 83c83e6588 crypto: aesni - refactor scatterlist processing
Currently, the gcm(aes-ni) driver open codes the scatterlist handling
that is encapsulated by the skcipher walk API. So let's switch to that
instead.

Also, move the handling at the end of gcmaes_crypt_by_sg() that is
dependent on whether we are encrypting or decrypting into the callers,
which always do one or the other.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:25 +11:00
Ard Biesheuvel 2694e23ffd crypto: aesni - clean up mapping of associated data
The gcm(aes-ni) driver is only built for x86_64, which does not make
use of highmem. So testing for PageHighMem is pointless and can be
omitted.

While at it, replace GFP_ATOMIC with the appropriate runtime decided
value based on the context.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:25 +11:00
Ard Biesheuvel 30f2c18eb5 crypto: aesni - drop unused asm prototypes
Drop some prototypes that are declared but never called.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:25 +11:00
Ard Biesheuvel a13ed1d15b crypto: aesni - prevent misaligned buffers on the stack
The GCM mode driver uses 16 byte aligned buffers on the stack to pass
the IV to the asm helpers, but unfortunately, the x86 port does not
guarantee that the stack pointer is 16 byte aligned upon entry in the
first place. Since the compiler is not aware of this, it will not emit
the additional stack realignment sequence that is needed, and so the
alignment is not guaranteed to be more than 8 bytes.

So instead, allocate some padding on the stack, and realign the IV
pointer by hand.

Cc: <stable@vger.kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:25 +11:00
Marco Chiappero 4f1a02e75a crypto: qat - replace CRYPTO_AES with CRYPTO_LIB_AES in Kconfig
Use CRYPTO_LIB_AES in place of CRYPTO_AES in the dependences for the QAT
common code.

Fixes: c0e583ab20 ("crypto: qat - add CRYPTO_AES to Kconfig dependencies")
Reported-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Marco Chiappero <marco.chiappero@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:25 +11:00
Herbert Xu 81064c96d8 crypto: stm32 - Fix last sparse warning in stm32_cryp_check_ctr_counter
This patch changes the cast in stm32_cryp_check_ctr_counter from
u32 to __be32 to match the prototype of stm32_cryp_hw_write_iv
correctly.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14 17:10:25 +11:00
Herbert Xu 622aae879c crypto: vmx - Move extern declarations into header file
This patch moves the extern algorithm declarations into a header
file so that a number of compiler warnings are silenced.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-08 15:39:47 +11:00
Ard Biesheuvel 2481104fe9 crypto: x86/aes-ni-xts - rewrite and drop indirections via glue helper
The AES-NI driver implements XTS via the glue helper, which consumes
a struct with sets of function pointers which are invoked on chunks
of input data of the appropriate size, as annotated in the struct.

Let's get rid of this indirection, so that we can perform direct calls
to the assembler helpers. Instead, let's adopt the arm64 strategy, i.e.,
provide a helper which can consume inputs of any size, provided that the
penultimate, full block is passed via the last call if ciphertext stealing
needs to be applied.

This also allows us to enable the XTS mode for i386.

Tested-by: Eric Biggers <ebiggers@google.com> # x86_64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-08 15:39:47 +11:00
Ard Biesheuvel 86ad60a65f crypto: x86/aes-ni-xts - use direct calls to and 4-way stride
The XTS asm helper arrangement is a bit odd: the 8-way stride helper
consists of back-to-back calls to the 4-way core transforms, which
are called indirectly, based on a boolean that indicates whether we
are performing encryption or decryption.

Given how costly indirect calls are on x86, let's switch to direct
calls, and given how the 8-way stride doesn't really add anything
substantial, use a 4-way stride instead, and make the asm core
routine deal with any multiple of 4 blocks. Since 512 byte sectors
or 4 KB blocks are the typical quantities XTS operates on, increase
the stride exported to the glue helper to 512 bytes as well.

As a result, the number of indirect calls is reduced from 3 per 64 bytes
of in/output to 1 per 512 bytes of in/output, which produces a 65% speedup
when operating on 1 KB blocks (measured on a Intel(R) Core(TM) i7-8650U CPU)

Fixes: 9697fa39ef ("x86/retpoline/crypto: Convert crypto assembler indirect jumps")
Tested-by: Eric Biggers <ebiggers@google.com> # x86_64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-08 15:39:47 +11:00
Rob Herring fecff3b931 crypto: picoxcell - Remove PicoXcell driver
PicoXcell has had nothing but treewide cleanups for at least the last 8
years and no signs of activity. The most recent activity is a yocto vendor
kernel based on v3.0 in 2015.

Cc: Jamie Iles <jamie@jamieiles.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 09:03:36 +11:00
Eric Biggers 1862eb0073 crypto: arm/blake2b - add NEON-accelerated BLAKE2b
Add a NEON-accelerated implementation of BLAKE2b.

On Cortex-A7 (which these days is the most common ARM processor that
doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
SHA-256, and slightly faster than SHA-1.  It is also almost three times
as fast as the generic implementation of BLAKE2b:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.0
	sha1-neon            16.3
	blake2s-256-arm      18.8
	sha1-asm             20.8
	blake2s-256-generic  26.0
	sha256-neon	     28.9
	sha256-asm	     32.0
	blake2b-256-generic  38.9

This implementation isn't directly based on any other implementation,
but it borrows some ideas from previous NEON code I've written as well
as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
the other NEON implementations of BLAKE2b I'm aware of (the
implementation in the BLAKE2 official repository using intrinsics, and
Andrew Moon's implementation which can be found in SUPERCOP).  It does
only one block at a time, so it performs well on short messages too.

NEON-accelerated BLAKE2b is useful because there is interest in using
BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
these devices, the performance cost of upgrading to SHA-256 may be
unacceptable, whereas BLAKE2b-256 would actually improve performance.

Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
64-bit operations, and because BLAKE2s's block size is too small for
NEON to be helpful for it.  The best I've been able to do with BLAKE2s
on Cortex-A7 is 18.8 cpb with an optimized scalar implementation.

(I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
they're more complex as they require running multiple hashes at once.
Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
from the smaller number of rounds, not from the extra parallelism.)

For now this BLAKE2b implementation is only wired up to the shash API,
since there is no library API for BLAKE2b yet.  However, I've tried to
keep things consistent with BLAKE2s, e.g. by defining
blake2b_compress_arch() which is analogous to blake2s_compress_arch()
and could be exported for use by the library API later if needed.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:39 +11:00
Eric Biggers 0cdc438e6e crypto: blake2b - update file comment
The file comment for blake2b_generic.c makes it sound like it's the
reference implementation of BLAKE2b with only minor changes.  But it's
actually been changed a lot.  Update the comment to make this clearer.

Reviewed-by: David Sterba <dsterba@suse.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:39 +11:00
Eric Biggers 28dcca4cc0 crypto: blake2b - sync with blake2s implementation
Sync the BLAKE2b code with the BLAKE2s code as much as possible:

- Move a lot of code into new headers <crypto/blake2b.h> and
  <crypto/internal/blake2b.h>, and adjust it to be like the
  corresponding BLAKE2s code, i.e. like <crypto/blake2s.h> and
  <crypto/internal/blake2s.h>.

- Rename constants, e.g. BLAKE2B_*_DIGEST_SIZE => BLAKE2B_*_HASH_SIZE.

- Use a macro BLAKE2B_ALG() to define the shash_alg structs.

- Export blake2b_compress_generic() for use as a fallback.

This makes it much easier to add optimized implementations of BLAKE2b,
as optimized implementations can use the helper functions
crypto_blake2b_{setkey,init,update,final}() and
blake2b_compress_generic().  The ARM implementation will use these.

But this change is also helpful because it eliminates unnecessary
differences between the BLAKE2b and BLAKE2s code, so that the same
improvements can easily be made to both.  (The two algorithms are
basically identical, except for the word size and constants.)  It also
makes it straightforward to add a library API for BLAKE2b in the future
if/when it's needed.

This change does make the BLAKE2b code slightly more complicated than it
needs to be, as it doesn't actually provide a library API yet.  For
example, __blake2b_update() doesn't really need to exist yet; it could
just be inlined into crypto_blake2b_update().  But I believe this is
outweighed by the benefits of keeping the code in sync.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:39 +11:00
Eric Biggers a64bfe7ad4 wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM
When available, select the new implementation of BLAKE2s for 32-bit ARM.
This is faster than the generic C implementation.

Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:39 +11:00
Eric Biggers 5172d322d3 crypto: arm/blake2s - add ARM scalar optimized BLAKE2s
Add an ARM scalar optimized implementation of BLAKE2s.

NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too
small for NEON to help.  Each NEON instruction would depend on the
previous one, resulting in poor performance.

With scalar instructions, on the other hand, we can take advantage of
ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an
implementation get runs much faster than the C implementation.

Performance results on Cortex-A7 in cycles per byte using the shash API:

	4096-byte messages:
		blake2s-256-arm:     18.8
		blake2s-256-generic: 26.0

	500-byte messages:
		blake2s-256-arm:     20.3
		blake2s-256-generic: 27.9

	100-byte messages:
		blake2s-256-arm:     29.7
		blake2s-256-generic: 39.2

	32-byte messages:
		blake2s-256-arm:     50.6
		blake2s-256-generic: 66.2

Except on very short messages, this is still slower than the NEON
implementation of BLAKE2b which I've written; that is 14.0, 16.4, 25.8,
and 76.1 cpb on 4096, 500, 100, and 32-byte messages, respectively.
However, optimized BLAKE2s is useful for cases where BLAKE2s is used
instead of BLAKE2b, such as WireGuard.

This new implementation is added in the form of a new module
blake2s-arm.ko, which is analogous to blake2s-x86_64.ko in that it
provides blake2s_compress_arch() for use by the library API as well as
optionally register the algorithms with the shash API.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:39 +11:00
Eric Biggers bbda6e0f13 crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h>
Address the following checkpatch warning:

	WARNING: Use #include <linux/bug.h> instead of <asm/bug.h>

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:39 +11:00
Eric Biggers 8786841bc2 crypto: blake2s - adjust include guard naming
Use the full path in the include guards for the BLAKE2s headers to avoid
ambiguity and to match the convention for most files in include/crypto/.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:38 +11:00
Eric Biggers 7d87131fad crypto: blake2s - add comment for blake2s_state fields
The first three fields of 'struct blake2s_state' are used in assembly
code, which isn't immediately obvious, so add a comment to this effect.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:38 +11:00
Eric Biggers 42ad8cf821 crypto: blake2s - optimize blake2s initialization
If no key was provided, then don't waste time initializing the block
buffer, as its initial contents won't be used.

Also, make crypto_blake2s_init() and blake2s() call a single internal
function __blake2s_init() which treats the key as optional, rather than
conditionally calling blake2s_init() or blake2s_init_key().  This
reduces the compiled code size, as previously both blake2s_init() and
blake2s_init_key() were being inlined into these two callers, except
when the key size passed to blake2s() was a compile-time constant.

These optimizations aren't that significant for BLAKE2s.  However, the
equivalent optimizations will be more significant for BLAKE2b, as
everything is twice as big in BLAKE2b.  And it's good to keep things
consistent rather than making optimizations for BLAKE2b but not BLAKE2s.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:38 +11:00
Eric Biggers 8c4a93a127 crypto: blake2s - share the "shash" API boilerplate code
Add helper functions for shash implementations of BLAKE2s to
include/crypto/internal/blake2s.h, taking advantage of
__blake2s_update() and __blake2s_final() that were added by the previous
patch to share more code between the library and shash implementations.

crypto_blake2s_setkey() and crypto_blake2s_init() are usable as
shash_alg::setkey and shash_alg::init directly, while
crypto_blake2s_update() and crypto_blake2s_final() take an extra
'blake2s_compress_t' function pointer parameter.  This allows the
implementation of the compression function to be overridden, which is
the only part that optimized implementations really care about.

The new functions are inline functions (similar to those in sha1_base.h,
sha256_base.h, and sm3_base.h) because this avoids needing to add a new
module blake2s_helpers.ko, they aren't *too* long, and this avoids
indirect calls which are expensive these days.  Note that they can't go
in blake2s_generic.ko, as that would require selecting CRYPTO_BLAKE2S
from CRYPTO_BLAKE2S_X86, which would cause a recursive dependency.

Finally, use these new helper functions in the x86 implementation of
BLAKE2s.  (This part should be a separate patch, but unfortunately the
x86 implementation used the exact same function names like
"crypto_blake2s_update()", so it had to be updated at the same time.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:38 +11:00
Eric Biggers 057edc9c8b crypto: blake2s - move update and final logic to internal/blake2s.h
Move most of blake2s_update() and blake2s_final() into new inline
functions __blake2s_update() and __blake2s_final() in
include/crypto/internal/blake2s.h so that this logic can be shared by
the shash helper functions.  This will avoid duplicating this logic
between the library and shash implementations.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:38 +11:00
Eric Biggers df412e7efd crypto: blake2s - remove unneeded includes
It doesn't make sense for the generic implementation of BLAKE2s to
include <crypto/internal/simd.h> and <linux/jump_label.h>, as these are
things that would only be useful in an architecture-specific
implementation.  Remove these unnecessary includes.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:38 +11:00
Eric Biggers 1aa90f4cf0 crypto: x86/blake2s - define shash_alg structs using macros
The shash_alg structs for the four variants of BLAKE2s are identical
except for the algorithm name, driver name, and digest size.  So, avoid
code duplication by using a macro to define these structs.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03 08:41:38 +11:00