Commit graph

450 commits

Author SHA1 Message Date
Keith Randall 6b165577fe cmd/compile: remove memequal call from string compares in more cases
Add more rules to ensure that order doesn't matter.

Add memequal 0 rule.

Try to use a constant argument to memequal when one is available.

Fixes #59684

Change-Id: I36e85ffbd949396ed700ed6e8ec2bc3ae013f5d2
Reviewed-on: https://go-review.googlesource.com/c/go/+/485535
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-04-18 21:31:33 +00:00
ruinan 9be533a8ee cmd/compile: get more bounds info from logic operators in prove pass
Currently, the prove pass can get knowledge from some specific logic
operators only before the CFG is explored, which means that the bounds
information of the branch will be ignored.

This CL updates the facts table by the logic operators in every
branch. Combined with the branch information, this will be helpful for
BCE in some circumstances.

Fixes #57243

Change-Id: I0bd164f1b47804ccfc37879abe9788740b016fd5
Reviewed-on: https://go-review.googlesource.com/c/go/+/419555
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Eric Fang <eric.fang@arm.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2023-04-07 10:09:11 +00:00
erifan01 42f99b203d cmd/compile: optimize cmp to cmn under conditions < and >= on arm64
Under the right conditions we can optimize cmp comparisons to cmn
comparisons, such as:
func foo(a, b int) int {
  var c int
  if a + b < 0 {
  	c = 1
  }
  return c
}

Previously it's compiled as:
  ADD     R1, R0, R1
  CMP     $0, R1
  CSET    LT, R0
With this CL it's compiled as:
  CMN     R1, R0
  CSET    MI, R0
Here we need to pay attention to the overflow situation of a+b, the MI
flag means N==1, which doesn't honor the overflow flag V, its value
depends only on the sign of the result. So it has the same semantic of
the Go code, so it's correct.

Similarly, this CL also optimizes the case of >= comparison
using the PL conditional flag.

Change-Id: I47179faba5b30cca84ea69bafa2ad5241bf6dfba
Reviewed-on: https://go-review.googlesource.com/c/go/+/476116
Run-TryBot: Eric Fang <eric.fang@arm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-03-24 01:19:09 +00:00
erifan01 91a2e921dd cmd/compile: fix incorrect truncating when converting CMP to TST on arm64
CL 420434 optimized CMP into TST in some situations, but it has a bug,
these four rules are not correct:
(LessThan (CMPWconst [0] x:(ANDconst [c] y))) && x.Uses == 1 => (LessThan (TSTconst [c] y))
(LessEqual (CMPWconst [0] x:(ANDconst [c] y))) && x.Uses == 1 => (LessEqual (TSTconst [c] y))
(GreaterThan (CMPWconst [0] x:(ANDconst [c] y))) && x.Uses == 1 => (GreaterThan (TSTconst [c] y))
(GreaterEqual (CMPWconst [0] x:(ANDconst [c] y))) && x.Uses == 1 => (GreaterEqual (TSTconst [c] y))

But due to the existence of this rule
(LessThan (CMPWconst [0] x:(ANDconst [c] y))) && x.Uses == 1 =>
(LessThan (TSTWconst [int32(c)] y)), the above rules have never been
fired. This CL corrects them as:
(LessThan (CMPconst [0] x:(ANDconst [c] y))) && x.Uses == 1 => (LessThan (TSTconst [c] y))
(LessEqual (CMPconst [0] x:(ANDconst [c] y))) && x.Uses == 1 => (LessEqual (TSTconst [c] y))
(GreaterThan (CMPconst [0] x:(ANDconst [c] y))) && x.Uses == 1 => (GreaterThan (TSTconst [c] y))
(GreaterEqual (CMPconst [0] x:(ANDconst [c] y))) && x.Uses == 1 => (GreaterEqual (TSTconst [c] y))

Change-Id: I7d60bcc9a266ee58388baeaab9f493b57cf1ad55
Reviewed-on: https://go-review.googlesource.com/c/go/+/473617
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Eric Fang <eric.fang@arm.com>
2023-03-22 08:32:53 +00:00
Yi Yang da4687923b cmd/compile: add rewrite rules for arithmetic operations
Add the following common local transformations

(t + x) - (t + y) == x - y
(t + x) - (y + t) == x - y
(x + t) - (y + t) == x - y
(x + t) - (t + y) == x - y
(x - t) + (t + y) == x + y
(x - t) + (y + t) == x + y

The compiler itself matches such patterns many times. This also aligns with other popular compilers.

Fixes #59111

Change-Id: Ibdfdb414782f8fcaa20b84ac5d43d0d9ae2c7b60
GitHub-Last-Rev: 1aad82e62e
GitHub-Pull-Request: golang/go#59119
Reviewed-on: https://go-review.googlesource.com/c/go/+/477555
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
2023-03-20 15:42:09 +00:00
Wayne Zuo cedfcba3e8 cmd/compile: instrinsify TrailingZeros{8,32,64} for 386
This CL add support for instrinsifying the TrialingZeros{8,32,64}
functions for 386 architecture. We need handle the case when the input
is 0, which could lead to undefined output from the BSFL instruction.

Next CL will remove the assembly code in runtime/internal/sys package.

Change-Id: Ic168edf68e81bf69a536102100fdd3f56f0f4a1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/475735
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-03-14 08:10:32 +00:00
ruinan 4d180f71dc cmd/compile: omit redundant sign/unsign extension on arm64
On Arm64, all 32-bit instructions will ignore the upper 32 bits and
clear them to zero for the result. No need to do an unsign extend before
a 32 bit op.

This CL removes the redundant unsign extension only for the existing
32-bit opcodes, and also omits the sign extension when the upper bit of
the result can be predicted.

Fixes #42162

Change-Id: I61e6670bfb8982572430e67a4fa61134a3ea240a
CustomizedGitHooks: yes
Reviewed-on: https://go-review.googlesource.com/c/go/+/427454
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Eric Fang <eric.fang@arm.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Eric Fang <eric.fang@arm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-02-28 03:16:44 +00:00
Dmitri Shuralyov 7a0799b2c0 cmd/dist, test: convert test/run.go runner to a cmd/go test
As motivated on the issue, we want to move the functionality of the
run.go program to happen via a normal go test. Each .go test case in
the GOROOT/test directory gets a subtest, and cmd/go's support for
parallel test execution replaces run.go's own implementation thereof.

The goal of this change is to have fairly minimal and readable diff
while making an atomic changeover. The working directory is modified
during the test execution to be GOROOT/test as it was with run.go,
and most of the test struct and its run method are kept unchanged.
The next CL in the stack applies further simplifications and cleanups
that become viable.

There's no noticeable difference in test execution time: it takes around
60-80 seconds both before and after on my machine. Test caching, which
the previous runner lacked, can shorten the time significantly.

For #37486.
Fixes #56844.

Change-Id: I209619dc9d90e7529624e49c01efeadfbeb5c9ae
Reviewed-on: https://go-review.googlesource.com/c/go/+/463276
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-02-28 01:11:37 +00:00
Michael Munday 85d54a7667 cmd/compile: use zero constants in comparisons where possible
Some integer comparisons with 1 and -1 can be rewritten as comparisons
with 0. For example, x < 1 is equivalent to x <= 0. This is an
advantageous transformation on riscv64 because comparisons with zero
do not require a constant to be loaded into a register. Other
architectures will likely benefit too and the transformation is
relatively benign on architectures that do not benefit.

Change-Id: I2ce9821dd7605a660eb71d76e83a61f9bae1bf25
Reviewed-on: https://go-review.googlesource.com/c/go/+/350831
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Munday <mike.munday@lowrisc.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-02-27 21:38:30 +00:00
Lynn Boger ebe49f98c8 cmd/compile: inline constant sized memclrNoHeapPointers calls on PPC64
Update the function isInlinableMemclr for ppc64 and ppc64le
to enable inlining for the constant sized cases < 512.

Larger cases can use dcbz which performs better but requires
alignment checking so it is best to continue using memclrNoHeapPointers
for those cases.

Results on p10:

MemclrKnownSize1         2.07ns ± 0%     0.57ns ± 0%   -72.59%
MemclrKnownSize2         2.56ns ± 5%     0.57ns ± 0%   -77.82%
MemclrKnownSize4         5.15ns ± 0%     0.57ns ± 0%   -89.00%
MemclrKnownSize8         2.23ns ± 0%     0.57ns ± 0%   -74.57%
MemclrKnownSize16        2.23ns ± 0%     0.50ns ± 0%   -77.74%
MemclrKnownSize32        2.28ns ± 0%     0.56ns ± 0%   -75.28%
MemclrKnownSize64        2.49ns ± 0%     0.72ns ± 0%   -70.95%
MemclrKnownSize112       2.97ns ± 2%     1.14ns ± 0%   -61.72%
MemclrKnownSize128       4.64ns ± 6%     2.45ns ± 1%   -47.17%
MemclrKnownSize192       5.45ns ± 5%     2.79ns ± 0%   -48.87%
MemclrKnownSize248       4.51ns ± 0%     2.83ns ± 0%   -37.12%
MemclrKnownSize256       6.34ns ± 1%     3.58ns ± 0%   -43.53%
MemclrKnownSize512       3.64ns ± 0%     3.64ns ± 0%    -0.03%
MemclrKnownSize1024      4.73ns ± 0%     4.73ns ± 0%    +0.01%
MemclrKnownSize4096      17.1ns ± 0%     17.1ns ± 0%    +0.07%
MemclrKnownSize512KiB    2.12µs ± 0%     2.12µs ± 0%      ~     (all equal)

Change-Id: If1abf5749f4802c64523a41fe0058bd144d0ea46
Reviewed-on: https://go-review.googlesource.com/c/go/+/464340
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Jakub Ciolek <jakub@ciolek.dev>
Reviewed-by: Archana Ravindar <aravind5@in.ibm.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
2023-02-23 18:57:27 +00:00
Wayne Zuo d7ac5d1480 cmd/compile: intrinsify math/bits/ReverseBytes{32|64} for 386
The BSWAPL instruction is supported in i486 and newer.
https://github.com/golang/go/wiki/MinimumRequirements#386 says we
support "All Pentium MMX or later". The Pentium is also referred to as
i586, so that we are safe with these instructions.

Change-Id: I6dea1f9d864a45bb07c8f8f35a81cfe16cca216c
Reviewed-on: https://go-review.googlesource.com/c/go/+/465515
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2023-02-08 03:43:23 +00:00
Archana R a432d89137 cmd/compile: add rules to emit SETBC/R instructions on power10
This CL adds rules that replaces instances of ISEL that produce
a boolean result based on a condition register by SETBC/SETBCR
operations. On Power10 these are convereted to SETBC/SETBCR
instructions that use one register instead of 3 registers
conventionally used by ISEL and hence reduces register pressure.
On loops written specifically to exercise such instances of ISEL
extensively, a performance improvement of 2.5% is seen on Power10.
Also added verification tests to verify correct generation of
SETBC/SETBCR instructions on Power10.

Change-Id: Ib719897f09d893de40324440a43052dca026e8fa
Reviewed-on: https://go-review.googlesource.com/c/go/+/449795
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-02-06 12:49:53 +00:00
Archana R cd1fc87156 cmd/compile: intrinsify math/bits/ReverseBytes{16|32|64} for ppc64/power10
This change intrinsifies ReverseBytes{16|32|64} by generating the
corresponding new instructions in Power10: brh, brd and brw and
adds a verification test for the same.
On Power 9 and 8, the .go code performs optimally as it is.

Performance improvement seen on Power10:
ReverseBytes32  1.38ns ± 0%  1.18ns ± 0%  -14.2
ReverseBytes64  1.52ns ± 0%  1.11ns ± 0%  -26.87
ReverseBytes16  1.41ns ± 1%  1.18ns ± 0%  -16.47

Change-Id: I88f127f3ab9ba24a772becc21ad90acfba324b37
Reviewed-on: https://go-review.googlesource.com/c/go/+/446675
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2023-02-03 19:01:06 +00:00
Keith Randall 6224db9b4d cmd/compile: schedule values with no in-block uses later
When scheduling a block, deprioritize values whose results aren't used
until subsequent blocks.

For #58166, this has the effect of pushing the induction variable increment
to the end of the block, past all the other uses of the pre-incremented value.

Do this only with optimizations on. Debuggers have a preference for values
in source code order, which this CL can degrade.

Fixes #58166
Fixes #57976

Change-Id: I40d5885c661b142443c6d4702294c8abe8026c4f
Reviewed-on: https://go-review.googlesource.com/c/go/+/463751
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-02-01 18:41:07 +00:00
Jakub Ciolek 1fc585dc2f cmd/compile: inline known-size memclrNoHeapPointers calls
This patch rewrites known-size calls to memclrNoHeapPointers with an OpZero.
This significantly improves performance and lets some clears get DSE'd.

One of the cases where this applies is zeroing a known-size array, example:

	var x [256]int8

        ...

	for a := range x {
	    x[a] = 0
	}

Other cases can be found in the runtime itself where memclrNoHeapPointers is sometimes directly invoked with a constant.

It seems that for some sized-clears on some architectures (AMD64, maybe others) the memcrlNoHeapPointers is more performant than OpZero.
See the issue #56997 for more details.

Benches ARM (M1 Pro):

name                      old time/op     new time/op     delta
MemclrKnownSize1-10          2.03ns ± 0%     0.31ns ± 0%    -84.69%  (p=0.000 n=18+19)
MemclrKnownSize2-10          1.97ns ± 0%     0.31ns ± 0%    -84.19%  (p=0.000 n=12+19)
MemclrKnownSize4-10          2.02ns ± 0%     0.31ns ± 0%    -84.56%  (p=0.000 n=12+20)
MemclrKnownSize8-10          2.02ns ± 0%     0.31ns ± 0%    -84.59%  (p=0.000 n=14+19)
MemclrKnownSize16-10         2.15ns ± 0%     0.31ns ± 0%    -85.50%  (p=0.000 n=18+19)
MemclrKnownSize32-10         2.48ns ± 0%     0.31ns ± 0%    -87.48%  (p=0.000 n=20+19)
MemclrKnownSize64-10         1.93ns ± 0%     0.62ns ± 0%    -67.88%  (p=0.000 n=20+19)
MemclrKnownSize112-10        2.48ns ± 0%     1.80ns ± 0%    -27.74%  (p=0.000 n=19+20)
MemclrKnownSize128-10       10.0ns ±112%      2.0ns ± 0%    -79.76%  (p=0.000 n=18+17)
MemclrKnownSize192-10       27.4ns ±103%      2.6ns ± 0%    -90.38%  (p=0.000 n=16+19)
MemclrKnownSize248-10        9.67ns ±43%     3.26ns ± 0%    -66.29%  (p=0.000 n=19+19)
MemclrKnownSize256-10       85.4ns ±148%      3.3ns ± 0%    -96.18%  (p=0.000 n=20+20)
MemclrKnownSize512-10         223ns ±54%        6ns ± 0%    -97.42%  (p=0.000 n=18+20)
MemclrKnownSize1024-10        216ns ±26%       11ns ± 0%    -95.00%  (p=0.000 n=18+15)
MemclrKnownSize4096-10        265ns ± 2%       88ns ± 0%    -66.84%  (p=0.000 n=19+17)
MemclrKnownSize512KiB-10     9.91µs ± 1%    10.23µs ± 2%     +3.14%  (p=0.000 n=19+19)
[Geo mean]                   15.6ns           2.5ns         -83.62%

name                      old speed       new speed       delta
MemclrKnownSize1-10         493MB/s ± 0%   3216MB/s ± 0%   +553.04%  (p=0.000 n=18+19)
MemclrKnownSize2-10        1.02GB/s ± 0%   6.43GB/s ± 0%   +532.33%  (p=0.000 n=16+19)
MemclrKnownSize4-10        1.99GB/s ± 0%  12.86GB/s ± 0%   +547.67%  (p=0.000 n=18+20)
MemclrKnownSize8-10        3.96GB/s ± 0%  25.72GB/s ± 0%   +548.81%  (p=0.000 n=19+19)
MemclrKnownSize16-10       7.46GB/s ± 0%  51.43GB/s ± 0%   +589.72%  (p=0.000 n=20+19)
MemclrKnownSize32-10       12.9GB/s ± 0%  102.9GB/s ± 0%   +698.60%  (p=0.000 n=20+18)
MemclrKnownSize64-10       33.1GB/s ± 0%  103.0GB/s ± 0%   +211.34%  (p=0.000 n=19+19)
MemclrKnownSize112-10      45.1GB/s ± 0%   62.4GB/s ± 0%    +38.38%  (p=0.000 n=19+20)
MemclrKnownSize128-10     13.3GB/s ±107%   63.5GB/s ± 0%   +378.03%  (p=0.000 n=19+18)
MemclrKnownSize192-10     6.97GB/s ±139%  72.72GB/s ± 0%   +943.44%  (p=0.000 n=19+19)
MemclrKnownSize248-10      25.9GB/s ±46%   76.1GB/s ± 0%   +194.16%  (p=0.000 n=20+17)
MemclrKnownSize256-10     8.64GB/s ±196%  78.51GB/s ± 0%   +808.19%  (p=0.000 n=20+20)
MemclrKnownSize512-10      2.33GB/s ±86%  89.13GB/s ± 0%  +3719.50%  (p=0.000 n=17+20)
MemclrKnownSize1024-10     4.85GB/s ±32%  94.93GB/s ± 0%  +1856.74%  (p=0.000 n=18+19)
MemclrKnownSize4096-10     15.4GB/s ± 2%   46.6GB/s ± 0%   +201.55%  (p=0.000 n=19+18)
MemclrKnownSize512KiB-10   52.9GB/s ± 1%   51.3GB/s ± 2%     -3.04%  (p=0.000 n=19+19)
[Geo mean]                 7.54GB/s       42.86GB/s        +468.76%

Intel Alder Lake 12600k:

name                      old time/op    new time/op     delta
MemclrKnownSize1-16         0.59ns ± 3%     0.38ns ± 6%   -36.00%  (p=0.000 n=19+18)
MemclrKnownSize2-16         0.57ns ± 1%     0.19ns ± 5%   -66.27%  (p=0.000 n=19+19)
MemclrKnownSize4-16         0.66ns ± 2%     0.36ns ±21%   -45.12%  (p=0.000 n=19+20)
MemclrKnownSize8-16         0.74ns ± 1%     0.30ns ±26%   -59.81%  (p=0.000 n=18+20)
MemclrKnownSize16-16        1.00ns ± 7%     0.21ns ± 8%   -79.51%  (p=0.000 n=20+19)
MemclrKnownSize32-16        0.95ns ± 1%     0.40ns ± 1%   -57.61%  (p=0.000 n=20+18)
MemclrKnownSize64-16        1.20ns ± 2%     0.41ns ± 0%   -65.82%  (p=0.000 n=20+18)
MemclrKnownSize112-16       1.27ns ± 2%     1.03ns ± 0%   -19.35%  (p=0.000 n=20+18)
MemclrKnownSize128-16       1.34ns ± 2%     1.03ns ± 0%   -23.02%  (p=0.000 n=20+20)
MemclrKnownSize192-16       1.92ns ± 2%     1.44ns ± 0%   -24.89%  (p=0.000 n=20+16)
MemclrKnownSize248-16       2.77ns ± 1%     3.29ns ± 0%   +18.81%  (p=0.000 n=20+16)
MemclrKnownSize256-16       1.92ns ± 1%     1.86ns ± 0%    -3.49%  (p=0.000 n=19+15)
MemclrKnownSize512-16       2.81ns ± 2%     3.49ns ± 0%   +24.15%  (p=0.000 n=20+17)
MemclrKnownSize1024-16      4.02ns ± 1%     6.78ns ± 0%   +68.44%  (p=0.000 n=20+18)
MemclrKnownSize4096-16      17.2ns ± 2%     14.4ns ± 0%   -16.73%  (p=0.000 n=20+17)
MemclrKnownSize512KiB-16    6.71µs ± 1%     6.52µs ± 0%    -2.85%  (p=0.000 n=20+18)
[Geo mean]                  2.60ns          1.71ns        -34.06%

name                      old speed      new speed       delta
MemclrKnownSize1-16       1.71GB/s ± 3%   2.67GB/s ± 6%   +56.39%  (p=0.000 n=19+18)
MemclrKnownSize2-16       3.52GB/s ± 2%  10.43GB/s ± 6%  +196.04%  (p=0.000 n=20+20)
MemclrKnownSize4-16       6.06GB/s ± 1%  10.83GB/s ±11%   +78.63%  (p=0.000 n=19+18)
MemclrKnownSize8-16       10.7GB/s ± 1%   27.0GB/s ±21%  +151.49%  (p=0.000 n=18+20)
MemclrKnownSize16-16      16.0GB/s ± 8%   78.1GB/s ± 7%  +387.24%  (p=0.000 n=20+19)
MemclrKnownSize32-16      33.6GB/s ± 1%   79.4GB/s ± 1%  +135.89%  (p=0.000 n=20+18)
MemclrKnownSize64-16      53.3GB/s ± 2%  155.9GB/s ± 0%  +192.58%  (p=0.000 n=20+18)
MemclrKnownSize112-16     88.0GB/s ± 2%  109.1GB/s ± 0%   +23.97%  (p=0.000 n=20+18)
MemclrKnownSize128-16     95.3GB/s ± 2%  123.8GB/s ± 0%   +29.88%  (p=0.000 n=20+20)
MemclrKnownSize192-16      100GB/s ± 2%    133GB/s ± 0%   +33.12%  (p=0.000 n=20+17)
MemclrKnownSize248-16     89.7GB/s ± 1%   75.5GB/s ± 0%   -15.84%  (p=0.000 n=20+19)
MemclrKnownSize256-16      133GB/s ± 1%    138GB/s ± 0%    +3.61%  (p=0.000 n=19+14)
MemclrKnownSize512-16      182GB/s ± 2%    147GB/s ± 0%   -19.46%  (p=0.000 n=20+17)
MemclrKnownSize1024-16     254GB/s ± 1%    151GB/s ± 0%   -40.64%  (p=0.000 n=20+18)
MemclrKnownSize4096-16     237GB/s ± 2%    285GB/s ± 0%   +20.09%  (p=0.000 n=20+17)
MemclrKnownSize512KiB-16  78.2GB/s ± 1%   80.4GB/s ± 0%    +2.93%  (p=0.000 n=20+18)
[Geo mean]                42.1GB/s        63.8GB/s        +51.53%

compilecmp linux/amd64:

runtime
runtime.(*pallocData).allocAll 85 -> 45  (-47.06%)
runtime.(*pageAlloc).allocRange 942 -> 923  (-2.02%)
runtime.(*pageAlloc).free 798 -> 774  (-3.01%)
runtime.(*pageBits).clearAll 66 -> 20  (-69.70%)
runtime.startCheckmarks 255 -> 246  (-3.53%)
runtime.(*pallocData).freeAll 86 -> 46  (-46.51%)
runtime.(*pallocBits).freeAll 66 -> 20  (-69.70%)
runtime.(*consistentHeapStats).unsafeClear 66 -> 19  (-71.21%)
runtime.newproc1 965 -> 933  (-3.32%)

crypto/rc4
crypto/rc4.(*Cipher).Reset 78 -> 69  (-11.54%)

compress/bzip2
compress/bzip2.(*reader).readBlock 2973 -> 2941  (-1.08%)

image/jpeg
image/jpeg.(*decoder).processDHT 1179 -> 1166  (-1.10%)

index/suffixarray
index/suffixarray.bucketMax_8_32 394 -> 241  (-38.83%)
index/suffixarray.freq_8_32 317 -> 185  (-41.64%)
index/suffixarray.freq_8_64 317 -> 178  (-43.85%)
index/suffixarray.bucketMin_8_32 394 -> 243  (-38.32%)
index/suffixarray.bucketMin_8_64 398 -> 234  (-41.21%)
index/suffixarray.bucketMax_8_64 398 -> 234  (-41.21%)

compress/flate
compress/flate.(*huffmanBitWriter).generateCodegen 965 -> 838  (-13.16%)
compress/flate.(*compressor).reset 429 -> 409  (-4.66%)

cmd/vendor/golang.org/x/sys/unix
cmd/vendor/golang.org/x/sys/unix.(*FdSet).Zero 66 -> 60  (-9.09%)
cmd/vendor/golang.org/x/sys/unix.(*Ifreq).SetInet4Addr 211 -> 129  (-38.86%)
cmd/vendor/golang.org/x/sys/unix.(*Ifreq).SetUint32 98 -> 14  (-85.71%)
cmd/vendor/golang.org/x/sys/unix.(*Ifreq).clear 66 -> 11  (-83.33%)
cmd/vendor/golang.org/x/sys/unix.(*Ifreq).SetUint16 101 -> 15  (-85.15%)
cmd/vendor/golang.org/x/sys/unix.(*CPUSet).Zero 66 -> 60  (-9.09%)

internal/coverage/decodemeta
internal/coverage/decodemeta.(*CoverageMetaFileReader).rdUint64 325 -> 293  (-9.85%)

crypto/tls
crypto/tls.(*halfConn).setTrafficSecret 253 -> 247  (-2.37%)
crypto/tls.(*Conn).readRecordOrCCS 10315 -> 10283  (-0.31%)
crypto/tls.(*halfConn).changeCipherSpec 271 -> 261  (-3.69%)
crypto/tls.(*Conn).writeRecordLocked 1765 -> 1748  (-0.96%)

file                               before   after    Δ       %
runtime.s                          512467   512164   -303    -0.059%
crypto/rc4.s                       955      946      -9      -0.942%
compress/bzip2.s                   9586     9554     -32     -0.334%
image/jpeg.s                       32122    32109    -13     -0.040%
index/suffixarray.s                38547    37644    -903    -2.343%
compress/flate.s                   46668    46521    -147    -0.315%
cmd/vendor/golang.org/x/sys/unix.s 118620   118301   -319    -0.269%
internal/coverage/decodemeta.s     7224     7192     -32     -0.443%
crypto/tls.s                       288762   288697   -65     -0.023%
cmd/compile/internal/ssa.s         3639799  3640727  +928    +0.025%
total                              20790248 20789353 -895    -0.004%

src/runtime benchmarks (Linux Alder Lake 12600k):

name                                             old time/op    new time/op    delta
MakeChan/Byte-16                                   26.2ns ± 2%    25.6ns ± 3%   -2.05%  (p=0.003 n=9+10)
MakeChan/Int-16                                    33.9ns ± 2%    33.3ns ± 4%   -1.99%  (p=0.015 n=10+10)
MakeChan/Ptr-16                                    54.2ns ± 2%    53.7ns ± 1%   -0.90%  (p=0.016 n=10+9)
MakeChan/Struct/0-16                               23.8ns ± 3%    23.4ns ± 1%   -1.72%  (p=0.009 n=10+8)
MakeChan/Struct/32-16                              55.9ns ± 2%    53.9ns ± 1%   -3.48%  (p=0.000 n=10+10)
MakeChan/Struct/40-16                              63.5ns ± 1%    61.1ns ± 2%   -3.79%  (p=0.000 n=10+9)
ChanNonblocking-16                                 0.22ns ± 0%    0.22ns ± 0%   +0.40%  (p=0.011 n=9+8)
SelectUncontended-16                               4.63ns ± 1%    4.62ns ± 0%   -0.35%  (p=0.001 n=10+8)
SelectSyncContended-16                             1.58µs ± 2%    1.59µs ± 1%     ~     (p=0.540 n=10+10)
SelectAsyncContended-16                             290ns ± 0%     291ns ± 0%   +0.14%  (p=0.012 n=8+9)
SelectNonblock-16                                  0.95ns ± 1%    0.95ns ± 1%     ~     (p=0.546 n=9+9)
ChanUncontended-16                                  239ns ± 3%     242ns ± 6%     ~     (p=0.886 n=9+10)
ChanContended-16                                   17.7µs ± 1%    18.2µs ± 1%   +2.87%  (p=0.000 n=10+9)
ChanSync-16                                         109ns ± 2%     109ns ± 1%     ~     (p=0.342 n=10+10)
ChanSyncWork-16                                    6.55µs ± 1%    6.53µs ± 1%     ~     (p=0.101 n=10+10)
ChanProdCons0-16                                    502ns ± 1%     499ns ± 0%   -0.55%  (p=0.001 n=10+9)
ChanProdCons10-16                                   373ns ± 2%     377ns ± 1%     ~     (p=0.095 n=10+9)
ChanProdCons100-16                                  224ns ± 2%     223ns ± 3%     ~     (p=0.150 n=9+10)
ChanProdConsWork0-16                                491ns ± 1%     484ns ± 0%   -1.26%  (p=0.000 n=10+9)
ChanProdConsWork10-16                               451ns ± 2%     448ns ± 2%     ~     (p=0.210 n=8+10)
ChanProdConsWork100-16                              406ns ± 0%     407ns ± 1%     ~     (p=0.138 n=8+8)
SelectProdCons-16                                   509ns ± 0%     509ns ± 0%     ~     (p=0.917 n=9+9)
ReceiveDataFromClosedChan-16                       12.1ns ± 0%    12.1ns ± 0%     ~     (p=0.780 n=10+10)
ChanCreation-16                                    22.6ns ± 1%    22.4ns ± 0%   -0.72%  (p=0.001 n=10+8)
ChanSem-16                                          165ns ± 1%     166ns ± 1%   +0.72%  (p=0.002 n=10+10)
ChanPopular-16                                      500µs ± 2%     498µs ± 1%     ~     (p=0.218 n=10+10)
ChanClosed-16                                      0.29ns ± 0%    0.29ns ± 0%   +0.09%  (p=0.019 n=9+8)
CallClosure-16                                     1.28ns ± 0%    1.27ns ± 0%   -0.51%  (p=0.000 n=9+9)
CallClosure1-16                                    1.50ns ± 0%    1.50ns ± 0%     ~     (p=0.123 n=9+9)
CallClosure2-16                                    8.86ns ± 1%    8.86ns ± 3%     ~     (p=0.590 n=9+10)
CallClosure3-16                                    8.75ns ± 2%    8.69ns ± 2%     ~     (p=0.247 n=10+10)
CallClosure4-16                                    8.65ns ± 2%    8.56ns ± 2%     ~     (p=0.105 n=10+10)
Complex128DivNormal-16                             2.47ns ± 0%    2.47ns ± 0%     ~     (p=0.790 n=10+9)
Complex128DivNisNaN-16                             4.44ns ± 0%    4.43ns ± 0%     ~     (p=0.564 n=10+10)
Complex128DivDisNaN-16                             4.48ns ± 0%    4.48ns ± 0%     ~     (p=0.101 n=10+10)
Complex128DivNisInf-16                             2.58ns ± 0%    2.58ns ± 0%     ~     (p=0.808 n=10+10)
Complex128DivDisInf-16                             6.30ns ± 0%    6.31ns ± 0%     ~     (p=0.305 n=10+10)
SetTypePtr-16                                      0.73ns ± 1%    0.73ns ± 3%     ~     (p=0.644 n=10+10)
SetTypePtr8-16                                     4.12ns ± 0%    4.12ns ± 0%     ~     (p=0.127 n=10+10)
SetTypePtr16-16                                    4.13ns ± 1%    4.12ns ± 0%     ~     (p=0.109 n=10+10)
SetTypePtr32-16                                    4.12ns ± 0%    4.12ns ± 0%     ~     (p=0.203 n=9+10)
SetTypePtr64-16                                    4.12ns ± 0%    4.12ns ± 0%     ~     (p=0.696 n=10+10)
SetTypePtr126-16                                   6.91ns ± 0%    6.91ns ± 0%     ~     (p=0.469 n=10+10)
SetTypePtr128-16                                   6.66ns ± 0%    6.67ns ± 0%     ~     (p=0.246 n=9+10)
SetTypePtrSlice-16                                 54.1ns ± 1%    54.1ns ± 1%     ~     (p=0.509 n=9+10)
SetTypeNode1-16                                    4.13ns ± 1%    4.12ns ± 0%     ~     (p=0.342 n=10+10)
SetTypeNode1Slice-16                               10.1ns ± 1%    10.0ns ± 1%   -1.18%  (p=0.000 n=10+10)
SetTypeNode8-16                                    4.12ns ± 0%    4.12ns ± 0%     ~     (p=0.137 n=8+8)
SetTypeNode8Slice-16                               22.6ns ± 0%    22.6ns ± 0%     ~     (p=0.423 n=10+10)
SetTypeNode64-16                                   6.90ns ± 0%    6.91ns ± 0%     ~     (p=0.275 n=10+10)
SetTypeNode64Slice-16                               173ns ± 0%     173ns ± 0%     ~     (p=0.610 n=9+10)
SetTypeNode64Dead-16                               5.53ns ± 0%    5.52ns ± 0%     ~     (p=0.123 n=10+6)
SetTypeNode64DeadSlice-16                           150ns ± 0%     150ns ± 0%     ~     (p=0.398 n=10+10)
SetTypeNode124-16                                  6.90ns ± 0%    6.90ns ± 0%     ~     (p=0.779 n=10+10)
SetTypeNode124Slice-16                              222ns ± 5%     217ns ± 0%     ~     (p=0.302 n=10+10)
SetTypeNode126-16                                  6.66ns ± 0%    6.66ns ± 0%     ~     (p=0.324 n=10+9)
SetTypeNode126Slice-16                              218ns ± 0%     218ns ± 0%     ~     (p=0.119 n=9+10)
SetTypeNode128-16                                  9.76ns ± 0%    9.73ns ± 0%   -0.31%  (p=0.003 n=9+10)
SetTypeNode128Slice-16                              279ns ± 0%     278ns ± 0%     ~     (p=0.112 n=10+9)
SetTypeNode130-16                                  9.77ns ± 0%    9.73ns ± 0%   -0.33%  (p=0.002 n=10+10)
SetTypeNode130Slice-16                              284ns ± 0%     284ns ± 0%     ~     (p=0.668 n=10+10)
SetTypeNode1024-16                                 51.2ns ± 0%    51.6ns ± 1%     ~     (p=0.080 n=9+9)
SetTypeNode1024Slice-16                            1.83µs ± 0%    1.82µs ± 0%     ~     (p=0.115 n=10+10)
Allocation-16                                      4.64µs ± 1%    4.37µs ± 1%   -5.69%  (p=0.000 n=9+9)
ReadMemStats-16                                    5.62µs ± 2%    5.55µs ± 5%   -1.36%  (p=0.050 n=10+10)
WriteBarrier-16                                    4.95ns ± 3%    4.99ns ± 3%     ~     (p=0.255 n=10+10)
BulkWriteBarrier-16                                1.69ns ± 2%    1.63ns ± 4%   -3.77%  (p=0.001 n=10+10)
ScanStackNoLocals-16                               12.8ms ± 2%    12.9ms ± 1%   +0.72%  (p=0.019 n=10+10)
MSpanCountAlloc/bits=64-16                         1.65ns ± 0%    1.65ns ± 0%     ~     (p=0.124 n=10+10)
MSpanCountAlloc/bits=128-16                        2.08ns ± 1%    2.06ns ± 1%   -0.87%  (p=0.000 n=10+10)
MSpanCountAlloc/bits=256-16                        2.71ns ± 1%    2.69ns ± 1%   -0.74%  (p=0.001 n=10+9)
MSpanCountAlloc/bits=512-16                        4.15ns ± 0%    4.23ns ± 2%   +2.15%  (p=0.000 n=10+10)
MSpanCountAlloc/bits=1024-16                       7.89ns ± 1%    7.89ns ± 1%     ~     (p=0.867 n=10+10)
Hash5-16                                           1.93ns ± 1%    2.01ns ± 0%   +3.99%  (p=0.000 n=10+8)
Hash16-16                                          2.04ns ± 1%    2.21ns ± 1%   +8.61%  (p=0.000 n=10+10)
Hash64-16                                          2.67ns ± 0%    2.67ns ± 0%     ~     (p=0.154 n=9+9)
Hash1024-16                                        16.4ns ± 0%    16.4ns ± 0%   +0.17%  (p=0.020 n=9+10)
Hash65536-16                                        886ns ± 0%     885ns ± 0%     ~     (p=0.725 n=10+10)
AlignedLoad-16                                     0.96ns ± 2%    0.95ns ± 3%     ~     (p=0.123 n=10+10)
UnalignedLoad-16                                   0.95ns ± 2%    1.01ns ± 2%   +6.31%  (p=0.000 n=10+10)
EqEfaceConcrete-16                                 0.31ns ± 3%    0.33ns ± 5%   +8.10%  (p=0.000 n=10+10)
EqIfaceConcrete-16                                 0.31ns ±13%    0.28ns ± 2%   -9.23%  (p=0.001 n=10+10)
NeEfaceConcrete-16                                 0.29ns ± 1%    0.31ns ± 7%   +5.59%  (p=0.010 n=8+8)
NeIfaceConcrete-16                                 0.28ns ± 2%    0.29ns ± 1%   +4.49%  (p=0.000 n=9+8)
ConvT2EByteSized/bool-16                           0.53ns ± 1%    0.52ns ± 1%   -2.18%  (p=0.000 n=10+10)
ConvT2EByteSized/uint8-16                          0.53ns ± 1%    0.53ns ± 0%   +1.22%  (p=0.000 n=10+10)
ConvT2ESmall-16                                    1.13ns ± 0%    1.13ns ± 0%     ~     (p=0.774 n=9+9)
ConvT2EUintptr-16                                  1.03ns ± 0%    1.04ns ± 0%   +0.50%  (p=0.000 n=10+8)
ConvT2ELarge-16                                    14.4ns ± 2%    14.4ns ± 1%     ~     (p=0.726 n=10+10)
ConvT2ISmall-16                                    1.13ns ± 0%    1.13ns ± 0%     ~     (p=0.693 n=9+10)
ConvT2IUintptr-16                                  1.03ns ± 0%    1.03ns ± 0%   +0.44%  (p=0.000 n=10+10)
ConvT2ILarge-16                                    14.2ns ± 1%    14.4ns ± 1%   +0.85%  (p=0.007 n=9+10)
ConvI2E-16                                         0.54ns ± 1%    0.54ns ± 0%   -0.39%  (p=0.037 n=10+8)
ConvI2I-16                                         2.68ns ± 0%    2.70ns ± 1%   +0.73%  (p=0.000 n=9+10)
AssertE2T-16                                       0.28ns ± 1%    0.39ns ± 5%  +37.38%  (p=0.000 n=10+10)
AssertE2TLarge-16                                  0.42ns ± 2%    0.48ns ± 1%  +14.92%  (p=0.000 n=9+10)
AssertE2I-16                                       2.67ns ± 0%    2.67ns ± 0%     ~     (p=0.352 n=9+9)
AssertI2T-16                                       0.37ns ± 3%    0.34ns ± 1%   -6.16%  (p=0.000 n=10+10)
AssertI2I-16                                       2.67ns ± 0%    2.67ns ± 0%     ~     (p=0.286 n=10+10)
AssertI2E-16                                       0.54ns ± 1%    0.54ns ± 0%   -0.94%  (p=0.000 n=10+10)
AssertE2E-16                                       0.41ns ± 0%    0.41ns ± 1%     ~     (p=0.880 n=9+9)
AssertE2T2-16                                      0.41ns ± 1%    0.41ns ± 1%     ~     (p=0.725 n=10+10)
AssertE2T2Blank-16                                 0.24ns ± 5%    0.21ns ± 1%  -14.79%  (p=0.000 n=10+9)
AssertI2E2-16                                      0.69ns ± 0%    0.69ns ± 1%     ~     (p=0.541 n=10+10)
AssertI2E2Blank-16                                 0.26ns ± 9%    0.21ns ± 1%  -18.86%  (p=0.000 n=10+9)
AssertE2E2-16                                      0.53ns ± 1%    0.53ns ± 1%   +0.72%  (p=0.004 n=10+10)
AssertE2E2Blank-16                                 0.23ns ± 4%    0.21ns ± 1%   -8.42%  (p=0.000 n=10+10)
ConvT2Ezero/zero/16-16                             1.13ns ± 0%    1.14ns ± 1%     ~     (p=0.583 n=9+10)
ConvT2Ezero/zero/32-16                             1.13ns ± 0%    1.13ns ± 0%     ~     (p=0.417 n=10+10)
ConvT2Ezero/zero/64-16                             1.03ns ± 1%    1.03ns ± 0%     ~     (p=0.051 n=10+10)
ConvT2Ezero/zero/str-16                            1.03ns ± 0%    1.03ns ± 0%     ~     (p=0.132 n=10+10)
ConvT2Ezero/zero/slice-16                          1.14ns ± 0%    1.15ns ± 0%   +0.49%  (p=0.001 n=10+10)
ConvT2Ezero/zero/big-16                             123ns ± 1%     123ns ± 1%     ~     (p=0.171 n=10+10)
ConvT2Ezero/nonzero/str-16                         19.4ns ± 1%    19.3ns ± 3%     ~     (p=0.548 n=9+10)
ConvT2Ezero/nonzero/slice-16                       22.2ns ± 2%    22.0ns ± 2%     ~     (p=0.109 n=10+10)
ConvT2Ezero/nonzero/big-16                          123ns ± 1%     123ns ± 1%     ~     (p=0.446 n=10+8)
ConvT2Ezero/smallint/16-16                         1.13ns ± 0%    1.14ns ± 1%     ~     (p=0.362 n=10+10)
ConvT2Ezero/smallint/32-16                         1.13ns ± 0%    1.13ns ± 0%     ~     (p=0.907 n=10+9)
ConvT2Ezero/smallint/64-16                         1.04ns ± 0%    1.03ns ± 0%   -0.38%  (p=0.002 n=10+10)
ConvT2Ezero/largeint/16-16                         6.65ns ± 1%    6.65ns ± 2%     ~     (p=0.618 n=10+9)
ConvT2Ezero/largeint/32-16                         6.75ns ± 3%    6.63ns ± 2%   -1.77%  (p=0.015 n=10+10)
ConvT2Ezero/largeint/64-16                         9.19ns ± 1%    9.26ns ± 2%     ~     (p=0.123 n=10+10)
Malloc8-16                                         8.66ns ± 1%    8.89ns ± 2%   +2.74%  (p=0.000 n=10+10)
Malloc16-16                                        13.7ns ± 1%    13.8ns ± 1%   +0.71%  (p=0.022 n=10+8)
MallocTypeInfo8-16                                 11.7ns ± 3%    11.6ns ± 2%     ~     (p=0.469 n=10+10)
MallocTypeInfo16-16                                18.3ns ± 1%    18.2ns ± 2%     ~     (p=0.251 n=9+10)
MallocLargeStruct-16                                195ns ± 1%     198ns ± 1%   +1.65%  (p=0.000 n=9+10)
GoroutineSelect-16                                 1.10ms ± 1%    1.12ms ± 1%   +1.36%  (p=0.000 n=10+8)
GoroutineBlocking-16                                986µs ± 1%     998µs ± 1%   +1.23%  (p=0.002 n=10+10)
GoroutineForRange-16                                985µs ± 1%    1001µs ± 1%   +1.68%  (p=0.000 n=10+10)
GoroutineIdle-16                                    679µs ± 1%     691µs ± 0%   +1.74%  (p=0.000 n=10+9)
HashStringSpeed-16                                 5.33ns ± 5%    5.19ns ± 4%     ~     (p=0.113 n=9+9)
HashBytesSpeed-16                                  8.20ns ± 3%    8.24ns ± 1%     ~     (p=0.497 n=10+9)
HashInt32Speed-16                                  4.01ns ± 2%    3.90ns ± 4%   -2.63%  (p=0.011 n=9+10)
HashInt64Speed-16                                  3.94ns ± 4%    3.79ns ± 1%   -3.74%  (p=0.003 n=10+9)
HashStringArraySpeed-16                            12.5ns ± 4%    12.3ns ± 1%     ~     (p=0.055 n=10+10)
MegMap-16                                          3.72ns ± 1%    3.73ns ± 1%     ~     (p=0.484 n=9+10)
MegOneMap-16                                       2.28ns ± 1%    2.27ns ± 1%     ~     (p=0.287 n=10+10)
MegEqMap-16                                        22.0µs ± 3%    22.3µs ± 2%   +1.48%  (p=0.028 n=10+9)
MegEmptyMap-16                                     0.93ns ± 1%    0.92ns ± 1%   -0.52%  (p=0.030 n=10+10)
SmallStrMap-16                                     3.77ns ± 0%    3.77ns ± 0%     ~     (p=0.324 n=10+10)
MapStringKeysEight_16-16                           3.91ns ± 0%    3.91ns ± 0%     ~     (p=0.088 n=9+9)
MapStringKeysEight_32-16                           3.58ns ± 1%    3.50ns ± 0%   -2.11%  (p=0.000 n=10+10)
MapStringKeysEight_64-16                           3.58ns ± 1%    3.50ns ± 0%   -2.23%  (p=0.000 n=10+10)
MapStringKeysEight_1M-16                           3.57ns ± 1%    3.50ns ± 0%   -1.92%  (p=0.000 n=10+10)
IntMap-16                                          2.89ns ± 1%    2.89ns ± 0%     ~     (p=0.381 n=10+10)
MapFirst/1-16                                      1.60ns ± 1%    1.59ns ± 2%   -0.49%  (p=0.020 n=10+9)
MapFirst/2-16                                      1.61ns ± 0%    1.59ns ± 1%   -1.17%  (p=0.001 n=10+10)
MapFirst/3-16                                      1.61ns ± 1%    1.59ns ± 1%   -1.45%  (p=0.000 n=10+10)
MapFirst/4-16                                      1.60ns ± 1%    1.59ns ± 1%   -1.16%  (p=0.000 n=10+10)
MapFirst/5-16                                      1.60ns ± 1%    1.58ns ± 1%   -0.98%  (p=0.000 n=10+10)
MapFirst/6-16                                      1.60ns ± 1%    1.59ns ± 1%   -0.87%  (p=0.001 n=10+10)
MapFirst/7-16                                      1.60ns ± 1%    1.59ns ± 1%   -0.79%  (p=0.002 n=10+10)
MapFirst/8-16                                      1.60ns ± 1%    1.59ns ± 1%   -0.67%  (p=0.017 n=9+10)
MapFirst/9-16                                      2.83ns ± 0%    2.83ns ± 0%     ~     (p=0.492 n=10+10)
MapFirst/10-16                                     2.83ns ± 0%    2.84ns ± 0%   +0.24%  (p=0.017 n=10+10)
MapFirst/11-16                                     2.83ns ± 0%    2.83ns ± 0%     ~     (p=0.445 n=10+10)
MapFirst/12-16                                     2.83ns ± 0%    2.83ns ± 0%     ~     (p=0.564 n=10+10)
MapFirst/13-16                                     2.83ns ± 0%    2.84ns ± 0%     ~     (p=0.175 n=9+10)
MapFirst/14-16                                     2.83ns ± 0%    2.83ns ± 0%     ~     (p=0.322 n=10+9)
MapFirst/15-16                                     2.83ns ± 0%    2.84ns ± 1%     ~     (p=0.209 n=10+10)
MapFirst/16-16                                     2.83ns ± 1%    2.84ns ± 0%     ~     (p=0.238 n=10+10)
MapMid/1-16                                        1.64ns ± 0%    1.64ns ± 0%     ~     (p=0.453 n=10+9)
MapMid/2-16                                        1.86ns ± 1%    1.86ns ± 0%     ~     (p=0.764 n=10+9)
MapMid/3-16                                        1.86ns ± 0%    1.86ns ± 1%     ~     (p=1.000 n=10+10)
MapMid/4-16                                        2.06ns ± 0%    2.06ns ± 0%   -0.27%  (p=0.014 n=10+9)
MapMid/5-16                                        2.06ns ± 0%    2.06ns ± 0%     ~     (p=0.075 n=9+10)
MapMid/6-16                                        2.27ns ± 0%    2.27ns ± 1%     ~     (p=0.898 n=10+10)
MapMid/7-16                                        2.27ns ± 1%    2.26ns ± 0%   -0.23%  (p=0.049 n=10+10)
MapMid/8-16                                        2.47ns ± 0%    2.47ns ± 1%     ~     (p=0.840 n=10+10)
MapMid/9-16                                        4.21ns ± 7%    4.13ns ±19%     ~     (p=0.315 n=10+10)
MapMid/10-16                                       4.17ns ± 7%    4.31ns ± 5%   +3.37%  (p=0.021 n=10+9)
MapMid/11-16                                       4.18ns ± 7%    4.32ns ± 6%   +3.50%  (p=0.015 n=10+10)
MapMid/12-16                                       4.34ns ± 7%    4.30ns ± 5%     ~     (p=0.858 n=9+10)
MapMid/13-16                                       4.25ns ± 6%    4.28ns ± 6%     ~     (p=0.489 n=9+9)
MapMid/14-16                                       3.75ns ±23%    3.90ns ±16%     ~     (p=0.353 n=10+10)
MapMid/15-16                                       3.87ns ±25%    3.95ns ±26%     ~     (p=0.315 n=10+10)
MapMid/16-16                                       4.06ns ±19%    3.94ns ±16%     ~     (p=0.796 n=10+10)
MapLast/1-16                                       1.65ns ± 0%    1.65ns ± 0%     ~     (p=0.607 n=10+10)
MapLast/2-16                                       1.86ns ± 0%    1.86ns ± 0%   +0.26%  (p=0.029 n=10+10)
MapLast/3-16                                       2.06ns ± 1%    2.06ns ± 0%     ~     (p=0.689 n=8+9)
MapLast/4-16                                       2.27ns ± 1%    2.26ns ± 0%     ~     (p=0.148 n=10+9)
MapLast/5-16                                       2.47ns ± 0%    2.47ns ± 0%     ~     (p=0.385 n=9+10)
MapLast/6-16                                       2.67ns ± 0%    2.68ns ± 0%     ~     (p=0.202 n=9+10)
MapLast/7-16                                       2.88ns ± 0%    2.88ns ± 0%     ~     (p=0.751 n=10+10)
MapLast/8-16                                       3.08ns ± 0%    3.08ns ± 0%     ~     (p=0.826 n=10+9)
MapLast/9-16                                       4.31ns ± 6%    4.54ns ± 5%     ~     (p=0.070 n=9+8)
MapLast/10-16                                      4.25ns ± 5%    4.42ns ± 6%     ~     (p=0.321 n=9+8)
MapLast/11-16                                      4.59ns ±16%    5.42ns ±44%  +17.99%  (p=0.019 n=10+10)
MapLast/12-16                                      5.04ns ±19%    6.11ns ±28%  +21.35%  (p=0.005 n=9+10)
MapLast/13-16                                      6.00ns ±35%    5.76ns ± 3%     ~     (p=0.173 n=10+8)
MapLast/14-16                                      4.27ns ± 5%    4.53ns ± 6%   +6.14%  (p=0.007 n=10+10)
MapLast/15-16                                      4.41ns ± 1%    4.44ns ± 7%     ~     (p=0.515 n=8+10)
MapLast/16-16                                      4.18ns ± 6%    4.99ns ±18%  +19.48%  (p=0.000 n=10+10)
MapCycle-16                                        7.48ns ± 2%    7.46ns ± 1%     ~     (p=0.699 n=10+10)
RepeatedLookupStrMapKey32-16                       6.98ns ± 3%    6.73ns ± 2%   -3.63%  (p=0.000 n=10+10)
RepeatedLookupStrMapKey1M-16                       14.7µs ± 5%    14.7µs ± 4%     ~     (p=0.604 n=9+10)
MakeMap/[Byte]Byte-16                              58.5ns ± 1%    58.5ns ± 1%     ~     (p=0.780 n=10+9)
MakeMap/[Int]Int-16                                 113ns ± 0%     113ns ± 1%     ~     (p=0.100 n=8+10)
NewEmptyMap-16                                     2.47ns ± 0%    2.47ns ± 0%     ~     (p=0.638 n=10+10)
NewSmallMap-16                                     11.5ns ± 1%    11.6ns ± 0%   +1.18%  (p=0.000 n=10+10)
MapIter-16                                         42.2ns ± 0%    42.8ns ± 1%   +1.50%  (p=0.000 n=10+10)
MapIterEmpty-16                                    1.85ns ± 0%    1.85ns ± 0%     ~     (p=0.651 n=10+10)
SameLengthMap-16                                   1.85ns ± 1%    1.85ns ± 0%     ~     (p=0.247 n=10+10)
BigKeyMap-16                                       7.18ns ± 1%    7.42ns ± 4%   +3.33%  (p=0.004 n=10+10)
BigValMap-16                                       7.03ns ± 2%    7.19ns ± 1%   +2.33%  (p=0.000 n=10+9)
SmallKeyMap-16                                     5.32ns ± 1%    5.24ns ± 1%   -1.41%  (p=0.000 n=10+10)
MapPopulate/1-16                                   6.30ns ± 0%    6.41ns ± 1%   +1.81%  (p=0.000 n=8+10)
MapPopulate/10-16                                   239ns ± 2%     234ns ± 2%   -2.05%  (p=0.001 n=9+10)
MapPopulate/100-16                                 4.19µs ± 2%    4.22µs ± 2%     ~     (p=0.171 n=10+10)
MapPopulate/1000-16                                52.3µs ± 1%    52.5µs ± 1%     ~     (p=0.133 n=9+10)
MapPopulate/10000-16                                459µs ± 1%     466µs ± 2%   +1.45%  (p=0.005 n=10+10)
MapPopulate/100000-16                              4.22ms ± 2%    4.25ms ± 2%     ~     (p=0.393 n=10+10)
ComplexAlgMap-16                                   12.5ns ± 1%    12.4ns ± 1%   -0.95%  (p=0.022 n=10+10)
GoMapClear/Reflexive/1-16                          9.61ns ± 1%    9.58ns ± 0%   -0.27%  (p=0.027 n=10+10)
GoMapClear/Reflexive/10-16                         10.0ns ± 1%    10.0ns ± 1%     ~     (p=0.648 n=9+9)
GoMapClear/Reflexive/100-16                        31.4ns ± 0%    31.4ns ± 1%     ~     (p=0.305 n=9+10)
GoMapClear/Reflexive/1000-16                        147ns ± 0%     149ns ± 2%   +1.21%  (p=0.000 n=10+10)
GoMapClear/Reflexive/10000-16                      3.99µs ± 0%    4.00µs ± 0%   +0.21%  (p=0.018 n=9+10)
GoMapClear/NonReflexive/1-16                       41.4ns ± 2%    41.7ns ± 1%   +0.55%  (p=0.043 n=9+10)
GoMapClear/NonReflexive/10-16                      50.3ns ± 1%    50.9ns ± 1%   +1.16%  (p=0.000 n=10+10)
GoMapClear/NonReflexive/100-16                      125ns ± 0%     126ns ± 0%   +0.96%  (p=0.000 n=8+10)
GoMapClear/NonReflexive/1000-16                    1.08µs ± 0%    1.08µs ± 1%     ~     (p=0.097 n=10+10)
GoMapClear/NonReflexive/10000-16                   8.18µs ± 2%    8.10µs ± 0%   -0.91%  (p=0.019 n=10+8)
MapStringConversion/32/simple-16                   4.66ns ± 1%    4.69ns ± 3%     ~     (p=0.905 n=9+10)
MapStringConversion/32/struct-16                   4.65ns ± 3%    4.94ns ± 2%   +6.23%  (p=0.000 n=10+10)
MapStringConversion/32/array-16                    4.69ns ± 3%    4.72ns ± 3%     ~     (p=0.631 n=10+10)
MapStringConversion/64/simple-16                   4.14ns ± 0%    4.14ns ± 1%     ~     (p=0.342 n=10+10)
MapStringConversion/64/struct-16                   4.13ns ± 0%    4.13ns ± 0%     ~     (p=0.809 n=10+10)
MapStringConversion/64/array-16                    4.13ns ± 1%    4.13ns ± 1%     ~     (p=0.752 n=10+10)
MapInterfaceString-16                              7.90ns ±23%    8.51ns ±33%     ~     (p=0.604 n=9+10)
MapInterfacePtr-16                                 7.68ns ±29%    7.10ns ±36%     ~     (p=0.353 n=10+10)
NewEmptyMapHintLessThan8-16                        3.70ns ± 0%    3.70ns ± 0%     ~     (p=0.209 n=10+10)
NewEmptyMapHintGreaterThan8-16                      270ns ± 1%     272ns ± 1%   +0.71%  (p=0.005 n=10+9)
MapPop100-16                                       6.45µs ± 0%    6.50µs ± 1%   +0.77%  (p=0.000 n=10+10)
MapPop1000-16                                       114µs ± 1%     114µs ± 1%     ~     (p=0.190 n=10+10)
MapPop10000-16                                     2.28ms ± 2%    2.28ms ± 2%     ~     (p=0.912 n=10+10)
MapAssign/Int32/256-16                             4.75ns ± 2%    4.82ns ± 4%     ~     (p=0.101 n=10+10)
MapAssign/Int32/65536-16                           16.4ns ± 1%    16.7ns ± 0%   +1.44%  (p=0.000 n=10+9)
MapAssign/Int64/256-16                             4.79ns ± 5%    4.79ns ± 1%     ~     (p=0.616 n=10+8)
MapAssign/Int64/65536-16                           17.1ns ± 1%    16.8ns ± 0%   -1.28%  (p=0.000 n=10+8)
MapAssign/Str/256-16                               6.07ns ± 6%    6.24ns ± 2%   +2.84%  (p=0.035 n=10+9)
MapAssign/Str/65536-16                             21.4ns ± 0%    21.4ns ± 3%     ~     (p=0.300 n=7+10)
MapOperatorAssign/Int32/256-16                     4.82ns ± 3%    4.81ns ± 3%     ~     (p=0.684 n=10+10)
MapOperatorAssign/Int32/65536-16                   16.8ns ± 1%    16.5ns ± 1%   -1.68%  (p=0.000 n=9+10)
MapOperatorAssign/Int64/256-16                     4.74ns ± 1%    4.77ns ± 3%     ~     (p=0.563 n=10+9)
MapOperatorAssign/Int64/65536-16                   16.9ns ± 1%    17.2ns ± 1%   +1.88%  (p=0.000 n=10+10)
MapOperatorAssign/Str/256-16                       1.09µs ± 1%    1.10µs ± 2%     ~     (p=0.210 n=10+10)
MapOperatorAssign/Str/65536-16                      184ns ± 9%     184ns ± 8%     ~     (p=0.922 n=10+9)
MapAppendAssign/Int32/256-16                       13.8ns ±10%    14.4ns ±11%     ~     (p=0.190 n=10+10)
MapAppendAssign/Int32/65536-16                     28.9ns ± 5%    30.7ns ± 6%   +6.13%  (p=0.001 n=9+10)
MapAppendAssign/Int64/256-16                       14.5ns ±12%    13.8ns ± 8%   -5.02%  (p=0.037 n=10+10)
MapAppendAssign/Int64/65536-16                     30.9ns ± 1%    30.4ns ± 2%   -1.56%  (p=0.001 n=10+10)
MapAppendAssign/Str/256-16                         30.2ns ± 6%    30.0ns ±10%     ~     (p=0.645 n=10+10)
MapAppendAssign/Str/65536-16                       44.5ns ± 4%    46.8ns ± 3%   +5.17%  (p=0.001 n=8+9)
MapDelete/Int32/100-16                             18.7ns ± 0%    18.7ns ± 0%   -0.27%  (p=0.017 n=10+10)
MapDelete/Int32/1000-16                            17.6ns ± 1%    17.5ns ± 1%   -0.85%  (p=0.000 n=9+10)
MapDelete/Int32/10000-16                           18.7ns ± 0%    18.3ns ± 1%   -1.92%  (p=0.000 n=10+10)
MapDelete/Int64/100-16                             19.1ns ± 0%    19.2ns ± 0%   +0.68%  (p=0.000 n=10+9)
MapDelete/Int64/1000-16                            17.7ns ± 2%    18.3ns ± 1%   +3.00%  (p=0.000 n=10+10)
MapDelete/Int64/10000-16                           18.8ns ± 1%    19.2ns ± 0%   +2.01%  (p=0.000 n=10+9)
MapDelete/Str/100-16                               26.5ns ± 0%    26.4ns ± 1%   -0.73%  (p=0.000 n=10+10)
MapDelete/Str/1000-16                              23.5ns ± 2%    23.4ns ± 1%     ~     (p=0.425 n=10+10)
MapDelete/Str/10000-16                             25.1ns ± 0%    25.1ns ± 1%   +0.28%  (p=0.037 n=10+10)
MapDelete/Pointer/100-16                           20.6ns ± 1%    20.6ns ± 0%     ~     (p=0.117 n=10+10)
MapDelete/Pointer/1000-16                          19.2ns ± 1%    19.4ns ± 1%   +0.97%  (p=0.004 n=10+10)
MapDelete/Pointer/10000-16                         20.0ns ± 0%    20.1ns ± 1%   +0.52%  (p=0.022 n=10+10)
Memmove/0-16                                       0.21ns ± 2%    0.21ns ± 1%     ~     (p=0.671 n=10+10)
Memmove/1-16                                       0.93ns ± 0%    0.93ns ± 0%   +0.21%  (p=0.034 n=10+10)
Memmove/2-16                                       0.93ns ± 0%    0.93ns ± 0%     ~     (p=0.101 n=10+10)
Memmove/3-16                                       0.93ns ± 1%    0.93ns ± 1%   +0.49%  (p=0.004 n=10+10)
Memmove/4-16                                       1.03ns ± 0%    1.03ns ± 0%     ~     (p=0.260 n=10+10)
Memmove/5-16                                       1.13ns ± 0%    1.13ns ± 0%   +0.20%  (p=0.034 n=10+10)
Memmove/6-16                                       1.13ns ± 0%    1.13ns ± 1%     ~     (p=0.126 n=10+10)
Memmove/7-16                                       1.13ns ± 0%    1.13ns ± 1%   +0.22%  (p=0.028 n=10+10)
Memmove/8-16                                       1.13ns ± 0%    1.13ns ± 0%     ~     (p=0.545 n=9+10)
Memmove/9-16                                       1.25ns ± 0%    1.35ns ± 0%   +7.98%  (p=0.000 n=10+10)
Memmove/10-16                                      1.25ns ± 0%    1.35ns ± 0%   +7.96%  (p=0.000 n=9+9)
Memmove/11-16                                      1.25ns ± 0%    1.35ns ± 0%   +8.53%  (p=0.000 n=10+9)
Memmove/12-16                                      1.25ns ± 0%    1.35ns ± 1%   +8.24%  (p=0.000 n=10+10)
Memmove/13-16                                      1.25ns ± 0%    1.34ns ± 0%   +7.75%  (p=0.000 n=10+10)
Memmove/14-16                                      1.25ns ± 0%    1.35ns ± 1%   +8.28%  (p=0.000 n=10+9)
Memmove/15-16                                      1.25ns ± 0%    1.35ns ± 0%   +8.07%  (p=0.000 n=10+9)
Memmove/16-16                                      1.25ns ± 0%    1.35ns ± 1%   +8.35%  (p=0.000 n=9+10)
Memmove/32-16                                      1.34ns ± 0%    1.36ns ± 1%   +1.22%  (p=0.000 n=10+10)
Memmove/64-16                                      1.45ns ± 0%    1.64ns ± 0%  +13.07%  (p=0.000 n=10+9)
Memmove/128-16                                     1.86ns ± 0%    2.02ns ± 0%   +8.64%  (p=0.000 n=10+10)
Memmove/256-16                                     2.47ns ± 0%    2.49ns ± 1%   +1.14%  (p=0.000 n=10+10)
Memmove/512-16                                     3.96ns ± 1%    3.96ns ± 0%     ~     (p=0.182 n=10+10)
Memmove/1024-16                                    5.90ns ± 1%    5.87ns ± 1%     ~     (p=0.258 n=9+9)
Memmove/2048-16                                    9.62ns ± 1%    9.62ns ± 2%     ~     (p=0.963 n=8+9)
Memmove/4096-16                                    16.4ns ± 0%    17.1ns ± 4%   +4.19%  (p=0.003 n=8+9)
MemmoveOverlap/32-16                               1.62ns ± 1%    1.68ns ± 1%   +3.53%  (p=0.000 n=10+10)
MemmoveOverlap/64-16                               1.64ns ± 0%    1.65ns ± 0%   +0.29%  (p=0.002 n=9+9)
MemmoveOverlap/128-16                              2.06ns ± 0%    2.06ns ± 0%     ~     (p=0.070 n=10+10)
MemmoveOverlap/256-16                              2.67ns ± 0%    2.67ns ± 0%   +0.26%  (p=0.012 n=10+10)
MemmoveOverlap/512-16                              6.20ns ±18%    5.74ns ± 0%     ~     (p=0.645 n=10+8)
MemmoveOverlap/1024-16                             7.28ns ± 0%    7.30ns ± 0%   +0.28%  (p=0.006 n=8+10)
MemmoveOverlap/2048-16                             11.9ns ± 0%    12.0ns ± 1%   +0.37%  (p=0.014 n=9+9)
MemmoveOverlap/4096-16                             23.3ns ± 1%    23.1ns ± 1%   -0.84%  (p=0.000 n=8+10)
MemmoveUnalignedDst/0-16                           1.03ns ± 0%    1.03ns ± 0%   +0.19%  (p=0.007 n=10+10)
MemmoveUnalignedDst/1-16                           1.24ns ± 0%    1.25ns ± 1%   +0.52%  (p=0.022 n=10+10)
MemmoveUnalignedDst/2-16                           1.23ns ± 0%    1.23ns ± 0%     ~     (p=0.051 n=10+10)
MemmoveUnalignedDst/3-16                           1.23ns ± 0%    1.23ns ± 0%   +0.14%  (p=0.006 n=9+9)
MemmoveUnalignedDst/4-16                           1.23ns ± 0%    1.24ns ± 1%   +0.37%  (p=0.004 n=10+10)
MemmoveUnalignedDst/5-16                           1.35ns ± 0%    1.35ns ± 0%     ~     (p=0.075 n=10+10)
MemmoveUnalignedDst/6-16                           1.34ns ± 0%    1.34ns ± 0%     ~     (p=0.779 n=10+10)
MemmoveUnalignedDst/7-16                           1.34ns ± 0%    1.34ns ± 0%     ~     (p=1.000 n=10+10)
MemmoveUnalignedDst/8-16                           1.34ns ± 0%    1.35ns ± 1%   +0.39%  (p=0.024 n=10+10)
MemmoveUnalignedDst/9-16                           1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.849 n=10+10)
MemmoveUnalignedDst/10-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.255 n=10+10)
MemmoveUnalignedDst/11-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.304 n=10+10)
MemmoveUnalignedDst/12-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.672 n=10+10)
MemmoveUnalignedDst/13-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.435 n=10+10)
MemmoveUnalignedDst/14-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.340 n=10+10)
MemmoveUnalignedDst/15-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.911 n=10+9)
MemmoveUnalignedDst/16-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.074 n=10+10)
MemmoveUnalignedDst/32-16                          1.62ns ± 0%    1.63ns ± 0%     ~     (p=0.059 n=10+10)
MemmoveUnalignedDst/64-16                          1.65ns ± 0%    1.65ns ± 0%     ~     (p=0.234 n=10+10)
MemmoveUnalignedDst/128-16                         2.06ns ± 0%    2.06ns ± 0%     ~     (p=0.709 n=10+9)
MemmoveUnalignedDst/256-16                         3.69ns ± 0%    3.70ns ± 0%     ~     (p=0.144 n=10+10)
MemmoveUnalignedDst/512-16                         4.15ns ± 1%    4.14ns ± 0%     ~     (p=0.778 n=10+8)
MemmoveUnalignedDst/1024-16                        7.52ns ± 0%    7.53ns ± 1%     ~     (p=0.650 n=9+9)
MemmoveUnalignedDst/2048-16                        12.9ns ± 0%    12.9ns ± 1%     ~     (p=0.548 n=8+8)
MemmoveUnalignedDst/4096-16                        25.4ns ± 0%    25.4ns ± 0%     ~     (p=0.947 n=9+9)
MemmoveUnalignedDstOverlap/32-16                   4.08ns ± 0%    4.09ns ± 0%     ~     (p=0.360 n=10+10)
MemmoveUnalignedDstOverlap/64-16                   4.56ns ± 0%    4.56ns ± 0%     ~     (p=0.705 n=10+9)
MemmoveUnalignedDstOverlap/128-16                  4.67ns ± 0%    4.67ns ± 0%     ~     (p=0.397 n=10+10)
MemmoveUnalignedDstOverlap/256-16                  5.08ns ± 0%    5.08ns ± 0%     ~     (p=0.159 n=10+9)
MemmoveUnalignedDstOverlap/512-16                  8.45ns ± 5%    8.19ns ± 0%   -3.10%  (p=0.021 n=10+9)
MemmoveUnalignedDstOverlap/1024-16                 9.55ns ± 0%    9.56ns ± 0%     ~     (p=0.221 n=8+8)
MemmoveUnalignedDstOverlap/2048-16                 14.0ns ± 0%    14.0ns ± 1%     ~     (p=0.200 n=10+9)
MemmoveUnalignedDstOverlap/4096-16                 26.5ns ± 0%    26.5ns ± 0%     ~     (p=0.458 n=10+9)
MemmoveUnalignedSrc/0-16                           1.02ns ± 1%    0.99ns ± 1%   -2.67%  (p=0.000 n=10+9)
MemmoveUnalignedSrc/1-16                           1.13ns ± 0%    1.13ns ± 1%   -0.25%  (p=0.027 n=10+9)
MemmoveUnalignedSrc/2-16                           1.13ns ± 1%    1.13ns ± 0%   -0.28%  (p=0.012 n=10+9)
MemmoveUnalignedSrc/3-16                           1.24ns ± 1%    1.23ns ± 0%   -0.25%  (p=0.022 n=9+10)
MemmoveUnalignedSrc/4-16                           1.24ns ± 0%    1.23ns ± 1%     ~     (p=0.118 n=9+10)
MemmoveUnalignedSrc/5-16                           1.34ns ± 0%    1.34ns ± 1%     ~     (p=0.564 n=8+10)
MemmoveUnalignedSrc/6-16                           1.34ns ± 0%    1.34ns ± 0%   -0.39%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/7-16                           1.34ns ± 0%    1.34ns ± 0%     ~     (p=0.235 n=10+10)
MemmoveUnalignedSrc/8-16                           1.34ns ± 0%    1.34ns ± 0%   -0.37%  (p=0.002 n=10+9)
MemmoveUnalignedSrc/9-16                           1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.579 n=10+9)
MemmoveUnalignedSrc/10-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.534 n=10+9)
MemmoveUnalignedSrc/11-16                          1.44ns ± 0%    1.44ns ± 1%     ~     (p=0.415 n=10+10)
MemmoveUnalignedSrc/12-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.218 n=10+10)
MemmoveUnalignedSrc/13-16                          1.44ns ± 0%    1.44ns ± 1%     ~     (p=0.693 n=10+10)
MemmoveUnalignedSrc/14-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.901 n=10+10)
MemmoveUnalignedSrc/15-16                          1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.379 n=10+10)
MemmoveUnalignedSrc/16-16                          1.44ns ± 1%    1.44ns ± 0%     ~     (p=0.538 n=10+10)
MemmoveUnalignedSrc/32-16                          1.60ns ± 1%    1.60ns ± 0%     ~     (p=0.491 n=10+10)
MemmoveUnalignedSrc/64-16                          1.65ns ± 0%    1.65ns ± 0%     ~     (p=0.564 n=10+10)
MemmoveUnalignedSrc/128-16                         2.09ns ± 0%    2.09ns ± 0%     ~     (p=0.497 n=10+9)
MemmoveUnalignedSrc/256-16                         2.70ns ± 0%    2.78ns ± 1%   +2.81%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/512-16                         4.31ns ± 0%    4.30ns ± 0%   -0.26%  (p=0.031 n=8+9)
MemmoveUnalignedSrc/1024-16                        7.28ns ± 0%    7.21ns ± 1%   -1.05%  (p=0.000 n=8+10)
MemmoveUnalignedSrc/2048-16                        13.0ns ± 0%    13.0ns ± 0%     ~     (p=0.180 n=9+8)
MemmoveUnalignedSrc/4096-16                        25.4ns ± 0%    25.3ns ± 1%     ~     (p=0.054 n=10+10)
MemmoveUnalignedSrcOverlap/32-16                   4.04ns ± 0%    4.06ns ± 0%   +0.62%  (p=0.000 n=9+10)
MemmoveUnalignedSrcOverlap/64-16                   4.12ns ± 0%    4.12ns ± 0%     ~     (p=0.421 n=10+10)
MemmoveUnalignedSrcOverlap/128-16                  4.53ns ± 0%    4.52ns ± 0%     ~     (p=0.251 n=10+10)
MemmoveUnalignedSrcOverlap/256-16                  6.17ns ± 0%    6.15ns ± 0%   -0.35%  (p=0.000 n=10+9)
MemmoveUnalignedSrcOverlap/512-16                  7.43ns ± 0%    7.44ns ± 0%     ~     (p=0.524 n=9+8)
MemmoveUnalignedSrcOverlap/1024-16                 8.94ns ± 0%    8.94ns ± 0%     ~     (p=0.419 n=8+8)
MemmoveUnalignedSrcOverlap/2048-16                 13.2ns ± 0%    14.5ns ±21%     ~     (p=0.107 n=8+10)
MemmoveUnalignedSrcOverlap/4096-16                 25.6ns ± 0%    25.6ns ± 1%     ~     (p=0.650 n=9+9)
Memclr/5-16                                        0.86ns ± 1%    0.86ns ± 2%     ~     (p=0.531 n=9+9)
Memclr/16-16                                       1.04ns ± 0%    1.04ns ± 0%   +0.32%  (p=0.013 n=9+10)
Memclr/64-16                                       1.23ns ± 0%    1.26ns ± 0%   +2.28%  (p=0.000 n=10+10)
Memclr/256-16                                      2.27ns ± 0%    2.27ns ± 0%     ~     (p=0.127 n=10+10)
Memclr/4096-16                                     17.1ns ± 1%    17.3ns ± 0%   +0.88%  (p=0.000 n=10+10)
Memclr/65536-16                                     821ns ± 0%     822ns ± 0%     ~     (p=0.516 n=10+10)
Memclr/1M-16                                       14.1µs ± 1%    14.0µs ± 1%     ~     (p=0.516 n=10+10)
Memclr/4M-16                                       86.1µs ± 1%    85.9µs ± 0%     ~     (p=0.123 n=10+10)
Memclr/8M-16                                        174µs ± 2%     173µs ± 0%     ~     (p=0.408 n=10+8)
Memclr/16M-16                                       385µs ± 4%     387µs ± 0%     ~     (p=0.173 n=10+8)
Memclr/64M-16                                      2.18ms ± 0%    2.19ms ± 0%     ~     (p=0.113 n=10+9)
GoMemclr/5-16                                      0.82ns ± 0%    0.82ns ± 0%     ~     (p=0.346 n=9+10)
GoMemclr/16-16                                     1.02ns ± 0%    1.02ns ± 0%   +0.22%  (p=0.003 n=10+8)
GoMemclr/64-16                                     1.14ns ± 0%    1.14ns ± 0%     ~     (p=0.948 n=10+9)
GoMemclr/256-16                                    2.06ns ± 0%    2.06ns ± 0%     ~     (p=0.868 n=10+10)
MemclrRange/1K_2K-16                                457ns ± 0%     428ns ± 1%   -6.38%  (p=0.000 n=10+10)
MemclrRange/2K_8K-16                               1.46µs ± 0%    1.46µs ± 0%     ~     (p=0.700 n=10+10)
MemclrRange/4K_16K-16                              1.16µs ± 0%    1.16µs ± 0%     ~     (p=0.567 n=9+10)
MemclrRange/160K_228K-16                           20.7µs ± 0%    20.7µs ± 0%     ~     (p=0.160 n=10+10)
ClearFat7-16                                       0.38ns ± 5%    0.21ns ± 1%  -45.79%  (p=0.000 n=9+10)
ClearFat8-16                                       0.21ns ± 3%    0.12ns ± 2%  -44.16%  (p=0.000 n=8+9)
ClearFat11-16                                      0.35ns ± 3%    0.21ns ± 1%  -40.46%  (p=0.000 n=9+9)
ClearFat12-16                                      0.23ns ± 9%    0.21ns ± 1%  -10.23%  (p=0.000 n=10+9)
ClearFat13-16                                      0.22ns ± 6%    0.21ns ± 2%   -6.53%  (p=0.000 n=10+10)
ClearFat14-16                                      0.22ns ± 4%    0.21ns ± 1%   -5.97%  (p=0.000 n=10+10)
ClearFat15-16                                      0.22ns ± 4%    0.21ns ± 1%   -6.96%  (p=0.000 n=10+9)
ClearFat16-16                                      0.19ns ± 9%    0.12ns ± 6%  -34.89%  (p=0.000 n=9+10)
ClearFat24-16                                      0.23ns ± 6%    0.21ns ± 1%  -10.26%  (p=0.000 n=10+9)
ClearFat32-16                                      0.22ns ± 5%    0.21ns ± 2%   -5.31%  (p=0.000 n=10+10)
ClearFat40-16                                      0.34ns ± 4%    0.62ns ± 1%  +83.00%  (p=0.000 n=10+10)
ClearFat48-16                                      0.33ns ± 2%    0.41ns ± 0%  +26.71%  (p=0.000 n=10+10)
ClearFat56-16                                      0.41ns ± 1%    0.41ns ± 0%     ~     (p=0.838 n=10+10)
ClearFat64-16                                      0.41ns ± 0%    0.41ns ± 0%     ~     (p=0.178 n=10+8)
ClearFat72-16                                      0.82ns ± 0%    0.82ns ± 0%     ~     (p=0.669 n=10+10)
ClearFat128-16                                     1.04ns ± 0%    1.04ns ± 0%     ~     (p=0.679 n=10+10)
ClearFat256-16                                     1.86ns ± 0%    1.86ns ± 0%     ~     (p=0.066 n=9+10)
ClearFat512-16                                     3.50ns ± 0%    3.50ns ± 0%     ~     (p=0.626 n=10+10)
ClearFat1024-16                                    6.79ns ± 0%    6.79ns ± 0%     ~     (p=0.986 n=10+10)
ClearFat1032-16                                    13.6ns ± 0%    13.6ns ± 0%   +0.13%  (p=0.044 n=10+10)
ClearFat1040-16                                    10.3ns ± 0%    10.3ns ± 0%     ~     (p=0.175 n=10+9)
CopyFat7-16                                        0.37ns ±13%    0.25ns ± 1%  -31.74%  (p=0.000 n=10+9)
CopyFat8-16                                        0.17ns ± 1%    0.17ns ± 2%   +1.35%  (p=0.004 n=9+9)
CopyFat11-16                                       0.26ns ± 1%    0.30ns ± 3%  +12.58%  (p=0.000 n=9+10)
CopyFat12-16                                       0.28ns ± 2%    0.26ns ± 1%   -5.66%  (p=0.000 n=9+9)
CopyFat13-16                                       0.26ns ± 0%    0.28ns ± 4%   +7.35%  (p=0.000 n=8+10)
CopyFat14-16                                       0.29ns ± 6%    0.26ns ± 2%  -10.46%  (p=0.000 n=10+9)
CopyFat15-16                                       0.26ns ± 1%    0.30ns ± 6%  +14.12%  (p=0.000 n=8+10)
CopyFat16-16                                       0.21ns ± 1%    0.21ns ± 0%     ~     (p=0.426 n=8+8)
CopyFat24-16                                       0.29ns ± 3%    0.25ns ± 1%  -12.27%  (p=0.000 n=9+10)
CopyFat32-16                                       0.26ns ± 4%    0.29ns ± 4%  +11.71%  (p=0.000 n=10+10)
CopyFat64-16                                       0.46ns ± 8%    0.42ns ± 1%   -8.37%  (p=0.002 n=10+10)
CopyFat72-16                                       0.82ns ± 0%    0.82ns ± 0%     ~     (p=0.563 n=10+10)
CopyFat128-16                                      1.53ns ± 0%    1.54ns ± 0%   +0.62%  (p=0.000 n=10+10)
CopyFat256-16                                      2.68ns ± 0%    2.65ns ± 1%   -1.23%  (p=0.000 n=10+10)
CopyFat512-16                                      4.93ns ± 1%    5.19ns ± 3%   +5.16%  (p=0.000 n=9+9)
CopyFat520-16                                      6.99ns ± 0%    6.99ns ± 0%     ~     (p=0.539 n=10+10)
CopyFat1024-16                                     11.5ns ± 1%     9.8ns ± 1%  -14.98%  (p=0.000 n=9+10)
CopyFat1032-16                                     13.6ns ± 0%    13.6ns ± 0%     ~     (p=0.728 n=10+10)
CopyFat1040-16                                     11.0ns ± 0%    11.1ns ± 0%   +0.53%  (p=0.000 n=10+10)
Issue18740/2byte-16                                10.1µs ± 0%    10.1µs ± 0%     ~     (p=0.342 n=10+10)
Issue18740/4byte-16                                2.34µs ± 0%    2.35µs ± 0%   +0.30%  (p=0.002 n=10+8)
Issue18740/8byte-16                                1.28µs ± 0%    1.28µs ± 0%   +0.32%  (p=0.000 n=9+10)
Finalizer-16                                        345µs ± 1%     336µs ± 0%   -2.55%  (p=0.000 n=10+9)
FinalizerRun-16                                     450ns ± 3%     420ns ± 1%   -6.65%  (p=0.000 n=10+10)
PallocBitsSummarize/Unpacked00-16                  2.88ns ± 0%    2.88ns ± 0%     ~     (p=0.358 n=10+10)
PallocBitsSummarize/UnpackedFFFFFFFFFFFFFFFF-16    15.2ns ± 0%    15.2ns ± 0%     ~     (p=0.925 n=10+10)
PallocBitsSummarize/UnpackedAA-16                  16.4ns ± 0%    16.3ns ± 0%     ~     (p=0.113 n=9+9)
PallocBitsSummarize/UnpackedAAAAAAAAAAAAAAAA-16    16.5ns ± 0%    16.6ns ± 0%     ~     (p=0.238 n=10+10)
PallocBitsSummarize/Unpacked80000000AAAAAAAA-16    37.8ns ± 1%    36.4ns ± 0%   -3.70%  (p=0.000 n=10+9)
PallocBitsSummarize/UnpackedAAAAAAAA00000001-16    41.8ns ± 1%    39.9ns ± 0%   -4.68%  (p=0.000 n=9+10)
PallocBitsSummarize/UnpackedBBBBBBBBBBBBBBBB-16    18.3ns ± 0%    18.3ns ± 0%     ~     (p=0.781 n=10+10)
PallocBitsSummarize/Unpacked80000000BBBBBBBB-16    38.8ns ± 1%    38.1ns ± 0%   -1.78%  (p=0.000 n=9+10)
PallocBitsSummarize/UnpackedBBBBBBBB00000001-16    37.5ns ± 0%    36.1ns ± 1%   -3.88%  (p=0.000 n=8+10)
PallocBitsSummarize/UnpackedCCCCCCCCCCCCCCCC-16    21.8ns ± 0%    21.9ns ± 0%   +0.20%  (p=0.018 n=10+9)
PallocBitsSummarize/Unpacked4444444444444444-16    21.8ns ± 0%    21.9ns ± 0%   +0.20%  (p=0.029 n=10+9)
PallocBitsSummarize/Unpacked4040404040404040-16    26.5ns ± 0%    26.5ns ± 0%   -0.24%  (p=0.001 n=9+10)
PallocBitsSummarize/Unpacked4000400040004000-16    33.4ns ± 1%    31.3ns ± 0%   -6.20%  (p=0.000 n=9+10)
PallocBitsSummarize/Unpacked1000404044CCAAFF-16    36.4ns ± 1%    35.9ns ± 0%   -1.50%  (p=0.000 n=10+10)
FindBitRange64/Pattern00Size2-16                   0.34ns ± 1%    0.35ns ± 1%   +3.80%  (p=0.000 n=10+9)
FindBitRange64/Pattern00Size8-16                   0.70ns ± 1%    0.70ns ± 0%   -0.68%  (p=0.000 n=10+10)
FindBitRange64/Pattern00Size32-16                  0.70ns ± 1%    0.69ns ± 0%   -0.86%  (p=0.001 n=10+8)
FindBitRange64/PatternFFFFFFFFFFFFFFFFSize2-16     0.34ns ± 1%    0.35ns ± 1%   +4.45%  (p=0.000 n=9+8)
FindBitRange64/PatternFFFFFFFFFFFFFFFFSize8-16     1.54ns ± 0%    1.54ns ± 1%     ~     (p=0.914 n=9+9)
FindBitRange64/PatternFFFFFFFFFFFFFFFFSize32-16    2.78ns ± 0%    2.78ns ± 0%     ~     (p=0.295 n=9+10)
FindBitRange64/PatternAASize2-16                   0.34ns ± 2%    0.35ns ± 2%   +4.61%  (p=0.000 n=10+10)
FindBitRange64/PatternAASize8-16                   0.70ns ± 1%    0.70ns ± 1%   -0.82%  (p=0.005 n=10+10)
FindBitRange64/PatternAASize32-16                  0.70ns ± 1%    0.70ns ± 0%   -0.73%  (p=0.003 n=10+9)
FindBitRange64/PatternAAAAAAAAAAAAAAAASize2-16     0.34ns ± 2%    0.35ns ± 2%   +3.94%  (p=0.000 n=10+10)
FindBitRange64/PatternAAAAAAAAAAAAAAAASize8-16     0.70ns ± 1%    0.70ns ± 1%   -0.67%  (p=0.025 n=10+10)
FindBitRange64/PatternAAAAAAAAAAAAAAAASize32-16    0.70ns ± 1%    0.70ns ± 1%     ~     (p=0.118 n=9+10)
FindBitRange64/Pattern80000000AAAAAAAASize2-16     0.34ns ± 1%    0.35ns ± 2%   +3.72%  (p=0.000 n=10+9)
FindBitRange64/Pattern80000000AAAAAAAASize8-16     0.70ns ± 1%    0.70ns ± 0%     ~     (p=0.102 n=10+10)
FindBitRange64/Pattern80000000AAAAAAAASize32-16    0.70ns ± 1%    0.70ns ± 1%   -0.55%  (p=0.011 n=10+10)
FindBitRange64/PatternAAAAAAAA00000001Size2-16     0.34ns ± 2%    0.35ns ± 1%   +3.83%  (p=0.000 n=10+9)
FindBitRange64/PatternAAAAAAAA00000001Size8-16     0.70ns ± 1%    0.70ns ± 1%     ~     (p=0.065 n=10+10)
FindBitRange64/PatternAAAAAAAA00000001Size32-16    0.70ns ± 1%    0.70ns ± 1%   -0.95%  (p=0.002 n=10+10)
FindBitRange64/PatternBBBBBBBBBBBBBBBBSize2-16     0.34ns ± 0%    0.35ns ± 1%   +4.12%  (p=0.000 n=8+10)
FindBitRange64/PatternBBBBBBBBBBBBBBBBSize8-16     1.24ns ± 0%    1.23ns ± 0%   -0.30%  (p=0.002 n=10+9)
FindBitRange64/PatternBBBBBBBBBBBBBBBBSize32-16    1.24ns ± 0%    1.24ns ± 0%   -0.17%  (p=0.023 n=9+10)
FindBitRange64/Pattern80000000BBBBBBBBSize2-16     0.34ns ± 1%    0.35ns ± 2%   +4.82%  (p=0.000 n=9+10)
FindBitRange64/Pattern80000000BBBBBBBBSize8-16     1.24ns ± 1%    1.24ns ± 0%     ~     (p=0.063 n=10+10)
FindBitRange64/Pattern80000000BBBBBBBBSize32-16    1.24ns ± 0%    1.24ns ± 0%     ~     (p=0.164 n=9+10)
FindBitRange64/PatternBBBBBBBB00000001Size2-16     0.34ns ± 1%    0.35ns ± 1%   +4.38%  (p=0.000 n=8+10)
FindBitRange64/PatternBBBBBBBB00000001Size8-16     1.24ns ± 1%    1.24ns ± 0%     ~     (p=0.052 n=10+10)
FindBitRange64/PatternBBBBBBBB00000001Size32-16    1.24ns ± 0%    1.23ns ± 0%   -0.40%  (p=0.000 n=10+10)
FindBitRange64/PatternCCCCCCCCCCCCCCCCSize2-16     0.34ns ± 0%    0.35ns ± 2%   +3.96%  (p=0.000 n=9+10)
FindBitRange64/PatternCCCCCCCCCCCCCCCCSize8-16     1.24ns ± 0%    1.23ns ± 0%   -0.30%  (p=0.000 n=10+9)
FindBitRange64/PatternCCCCCCCCCCCCCCCCSize32-16    1.24ns ± 0%    1.24ns ± 1%     ~     (p=0.284 n=10+10)
FindBitRange64/Pattern4444444444444444Size2-16     0.34ns ± 1%    0.35ns ± 1%   +3.91%  (p=0.000 n=9+9)
FindBitRange64/Pattern4444444444444444Size8-16     0.70ns ± 1%    0.70ns ± 1%     ~     (p=0.617 n=10+10)
FindBitRange64/Pattern4444444444444444Size32-16    0.70ns ± 1%    0.70ns ± 1%   -0.60%  (p=0.006 n=10+10)
FindBitRange64/Pattern4040404040404040Size2-16     0.34ns ± 2%    0.35ns ± 2%   +3.67%  (p=0.000 n=10+10)
FindBitRange64/Pattern4040404040404040Size8-16     0.70ns ± 2%    0.70ns ± 1%   -0.87%  (p=0.014 n=10+10)
FindBitRange64/Pattern4040404040404040Size32-16    0.70ns ± 1%    0.70ns ± 1%     ~     (p=0.256 n=10+10)
FindBitRange64/Pattern4000400040004000Size2-16     0.34ns ± 2%    0.35ns ± 3%   +4.71%  (p=0.000 n=10+10)
FindBitRange64/Pattern4000400040004000Size8-16     0.70ns ± 1%    0.70ns ± 1%     ~     (p=0.393 n=10+10)
FindBitRange64/Pattern4000400040004000Size32-16    0.70ns ± 1%    0.70ns ± 1%   -0.86%  (p=0.014 n=10+10)
NetpollBreak-16                                    1.49µs ± 1%    1.50µs ± 3%     ~     (p=0.181 n=8+10)
Syscall-16                                         3.68ns ± 1%    3.66ns ± 2%     ~     (p=0.148 n=10+10)
SyscallWork-16                                     5.15ns ± 1%    5.13ns ± 0%     ~     (p=0.188 n=10+9)
SyscallExcess-16                                   3.89ns ± 2%    3.83ns ± 1%   -1.52%  (p=0.001 n=10+10)
SyscallExcessWork-16                               5.34ns ± 1%    5.31ns ± 0%   -0.64%  (p=0.000 n=10+9)
PingPongHog-16                                      397ns ± 7%     394ns ±11%     ~     (p=0.912 n=10+10)
StackGrowth-16                                     67.9ns ± 0%    68.8ns ± 0%   +1.28%  (p=0.000 n=10+8)
StackGrowthDeep-16                                 7.70µs ± 1%    8.48µs ± 2%  +10.06%  (p=0.000 n=9+10)
CreateGoroutines-16                                 124ns ± 1%     124ns ± 1%     ~     (p=0.254 n=10+10)
CreateGoroutinesParallel-16                        25.7ns ± 1%    27.6ns ± 2%   +7.51%  (p=0.000 n=10+10)
CreateGoroutinesCapture-16                          823ns ± 1%     821ns ± 2%     ~     (p=0.699 n=10+10)
CreateGoroutinesSingle-16                           175ns ± 3%     172ns ± 3%   -1.90%  (p=0.011 n=10+10)
ClosureCall-16                                     0.11ns ± 7%    0.12ns ± 3%     ~     (p=0.842 n=9+10)
WakeupParallelSpinning/0s-16                       11.4µs ± 0%    11.4µs ± 0%     ~     (p=0.325 n=9+10)
WakeupParallelSpinning/1µs-16                      15.4µs ± 0%    15.4µs ± 1%     ~     (p=0.955 n=10+10)
WakeupParallelSpinning/2µs-16                      18.7µs ± 2%    18.9µs ± 2%     ~     (p=0.052 n=10+10)
WakeupParallelSpinning/5µs-16                      30.7µs ± 0%    30.7µs ± 0%   -0.03%  (p=0.003 n=10+10)
WakeupParallelSpinning/10µs-16                     48.8µs ± 0%    48.8µs ± 0%     ~     (p=0.670 n=10+10)
WakeupParallelSpinning/20µs-16                     90.8µs ± 0%    90.8µs ± 0%   -0.02%  (p=0.004 n=10+10)
WakeupParallelSpinning/50µs-16                      211µs ± 0%     211µs ± 0%     ~     (p=0.194 n=10+10)
WakeupParallelSpinning/100µs-16                     323µs ± 0%     323µs ± 0%     ~     (p=1.000 n=10+9)
WakeupParallelSyscall/0s-16                         118µs ± 0%     118µs ± 0%     ~     (p=0.447 n=10+9)
WakeupParallelSyscall/1µs-16                        119µs ± 2%     119µs ± 1%     ~     (p=0.604 n=10+9)
WakeupParallelSyscall/2µs-16                        120µs ± 1%     121µs ± 3%     ~     (p=0.263 n=8+10)
WakeupParallelSyscall/5µs-16                        126µs ± 2%     126µs ± 2%     ~     (p=0.510 n=10+9)
WakeupParallelSyscall/10µs-16                       136µs ± 1%     137µs ± 1%     ~     (p=0.095 n=9+10)
WakeupParallelSyscall/20µs-16                       156µs ± 2%     157µs ± 3%     ~     (p=0.604 n=10+9)
WakeupParallelSyscall/50µs-16                       221µs ± 1%     220µs ± 1%     ~     (p=0.063 n=10+10)
WakeupParallelSyscall/100µs-16                      326µs ± 0%     325µs ± 0%   -0.26%  (p=0.003 n=9+10)
Matmult-16                                         0.67ns ± 2%    0.66ns ± 2%     ~     (p=0.256 n=10+10)
Fastrand-16                                        0.08ns ±11%    0.08ns ±13%     ~     (p=0.661 n=9+10)
Fastrand64-16                                      0.08ns ±11%    0.08ns ± 6%     ~     (p=0.631 n=10+10)
FastrandHashiter-16                                1.76ns ± 1%    1.76ns ± 1%     ~     (p=0.854 n=8+8)
Fastrandn/2-16                                     0.86ns ± 1%    0.86ns ± 1%   +1.09%  (p=0.000 n=10+9)
Fastrandn/3-16                                     0.85ns ± 1%    0.86ns ± 1%   +1.23%  (p=0.001 n=10+10)
Fastrandn/4-16                                     0.85ns ± 1%    0.87ns ± 2%   +1.60%  (p=0.000 n=10+10)
Fastrandn/5-16                                     0.85ns ± 1%    0.86ns ± 1%   +1.05%  (p=0.000 n=10+10)
IfaceCmp100-16                                     46.6ns ± 0%    46.1ns ± 0%   -1.18%  (p=0.000 n=10+10)
IfaceCmpNil100-16                                  26.8ns ± 0%    26.8ns ± 0%     ~     (p=0.777 n=10+8)
EfaceCmpDiff-16                                     132ns ± 0%     130ns ± 0%   -0.95%  (p=0.000 n=10+9)
EfaceCmpDiffIndirect-16                             209ns ± 0%     211ns ± 0%   +1.14%  (p=0.000 n=10+9)
Defer-16                                           3.40ns ± 1%    3.04ns ± 0%  -10.67%  (p=0.000 n=10+10)
Defer10-16                                         29.4ns ± 2%    27.2ns ± 3%   -7.26%  (p=0.000 n=10+10)
DeferMany-16                                        110ns ± 6%     113ns ± 2%   +3.45%  (p=0.017 n=9+9)
PanicRecover-16                                    67.6ns ± 0%    67.7ns ± 2%     ~     (p=0.436 n=9+9)
GoroutineProfile/small-nil/idle-16                 3.90µs ± 4%    3.86µs ± 2%     ~     (p=0.305 n=10+9)
GoroutineProfile/small-nil/loaded-16               4.82µs ± 6%    4.82µs ± 4%     ~     (p=0.905 n=10+9)
GoroutineProfile/small/idle-16                      103µs ± 3%     102µs ± 3%     ~     (p=0.113 n=9+9)
GoroutineProfile/small/loaded-16                    432µs ± 5%     440µs ±13%     ~     (p=0.604 n=9+10)
GoroutineProfile/large-nil/idle-16                 3.86µs ± 3%    3.82µs ± 3%     ~     (p=0.210 n=10+10)
GoroutineProfile/large-nil/loaded-16               4.90µs ± 2%    4.90µs ± 5%     ~     (p=0.780 n=10+9)
GoroutineProfile/large/idle-16                     2.58ms ± 1%    2.52ms ± 1%   -2.38%  (p=0.000 n=10+10)
GoroutineProfile/large/loaded-16                   8.62ms ± 9%    8.90ms ±11%     ~     (p=0.400 n=9+10)
GoroutineProfile/sparse-nil/idle-16                3.85µs ± 4%    3.81µs ± 3%     ~     (p=0.470 n=10+10)
GoroutineProfile/sparse-nil/loaded-16              4.82µs ± 4%    4.69µs ± 5%     ~     (p=0.052 n=10+10)
GoroutineProfile/sparse/idle-16                     102µs ± 4%     102µs ± 2%     ~     (p=0.497 n=10+9)
GoroutineProfile/sparse/loaded-16                   438µs ± 7%     437µs ± 6%     ~     (p=0.796 n=10+10)
RWMutexUncontended-16                              6.79ns ± 0%    6.78ns ± 0%     ~     (p=0.228 n=10+8)
RWMutexWrite100-16                                 85.4ns ± 0%    87.1ns ± 0%   +2.00%  (p=0.000 n=10+8)
RWMutexWrite10-16                                   168ns ±25%     152ns ±11%     ~     (p=0.063 n=10+10)
RWMutexWorkWrite100-16                              106ns ± 0%     106ns ± 3%     ~     (p=0.136 n=10+10)
RWMutexWorkWrite10-16                               567ns ± 3%     571ns ± 1%     ~     (p=0.326 n=10+9)
SemTable/OneAddrCollision/n=1000-16                15.9µs ± 1%    16.0µs ± 1%   +0.50%  (p=0.031 n=9+9)
SemTable/ManyAddrCollision/n=1000-16               56.2µs ± 1%    56.8µs ± 1%   +1.06%  (p=0.000 n=10+10)
SemTable/OneAddrCollision/n=2000-16                32.6µs ± 2%    32.9µs ± 4%     ~     (p=0.156 n=10+9)
SemTable/ManyAddrCollision/n=2000-16                118µs ± 0%     119µs ± 0%   +0.75%  (p=0.000 n=9+10)
SemTable/OneAddrCollision/n=4000-16                65.3µs ± 1%    65.6µs ± 3%     ~     (p=0.497 n=9+10)
SemTable/ManyAddrCollision/n=4000-16                245µs ± 0%     248µs ± 2%   +1.36%  (p=0.000 n=9+10)
SemTable/OneAddrCollision/n=8000-16                 131µs ± 1%     130µs ± 1%   -1.01%  (p=0.002 n=9+10)
SemTable/ManyAddrCollision/n=8000-16                503µs ± 1%     508µs ± 0%   +0.97%  (p=0.000 n=10+10)
MakeSliceCopy/mallocmove/Byte-16                   67.6ns ± 1%    64.1ns ± 2%   -5.20%  (p=0.000 n=10+10)
MakeSliceCopy/mallocmove/Int-16                    65.0ns ± 7%    61.7ns ± 4%   -5.08%  (p=0.009 n=10+10)
MakeSliceCopy/mallocmove/Ptr-16                    88.1ns ± 1%    79.9ns ± 1%   -9.29%  (p=0.000 n=10+10)
MakeSliceCopy/makecopy/Byte-16                     65.2ns ± 6%    63.4ns ± 0%     ~     (p=0.500 n=10+8)
MakeSliceCopy/makecopy/Int-16                      63.2ns ± 1%    64.1ns ± 1%   +1.34%  (p=0.001 n=9+9)
MakeSliceCopy/makecopy/Ptr-16                      88.1ns ± 1%    80.1ns ± 1%   -9.09%  (p=0.000 n=10+10)
MakeSliceCopy/nilappend/Byte-16                    69.8ns ± 1%    65.7ns ± 3%   -5.80%  (p=0.000 n=10+10)
MakeSliceCopy/nilappend/Int-16                     69.6ns ± 2%    67.2ns ± 1%   -3.50%  (p=0.000 n=10+9)
MakeSliceCopy/nilappend/Ptr-16                     91.5ns ± 1%    83.8ns ± 1%   -8.42%  (p=0.000 n=9+10)
MakeSlice/Byte-16                                  6.64ns ± 3%    6.58ns ± 2%     ~     (p=0.393 n=10+10)
MakeSlice/Int16-16                                 8.60ns ± 1%    8.38ns ± 3%   -2.48%  (p=0.001 n=9+10)
MakeSlice/Int-16                                   17.7ns ± 3%    16.9ns ± 1%   -4.67%  (p=0.000 n=10+9)
MakeSlice/Ptr-16                                   24.0ns ± 3%    23.3ns ± 2%   -3.25%  (p=0.000 n=10+9)
MakeSlice/Struct/24-16                             34.1ns ± 1%    32.0ns ± 1%   -6.11%  (p=0.000 n=10+10)
MakeSlice/Struct/32-16                             39.1ns ± 4%    38.2ns ± 1%     ~     (p=0.829 n=10+8)
MakeSlice/Struct/40-16                             47.0ns ± 5%    43.0ns ± 2%   -8.55%  (p=0.000 n=10+9)
GrowSlice/Byte-16                                  15.3ns ± 3%    15.0ns ± 2%   -1.75%  (p=0.005 n=9+9)
GrowSlice/Int16-16                                 18.9ns ± 2%    18.4ns ± 2%   -2.71%  (p=0.000 n=10+9)
GrowSlice/Int-16                                   33.9ns ± 1%    32.2ns ± 1%   -4.89%  (p=0.000 n=10+9)
GrowSlice/Ptr-16                                   45.3ns ± 2%    43.5ns ± 1%   -4.12%  (p=0.000 n=10+10)
GrowSlice/Struct/24-16                             61.9ns ± 2%    60.0ns ± 4%   -3.10%  (p=0.002 n=10+10)
GrowSlice/Struct/32-16                             79.9ns ± 2%    72.3ns ± 3%   -9.58%  (p=0.000 n=8+10)
GrowSlice/Struct/40-16                             97.1ns ± 7%    88.8ns ± 5%   -8.49%  (p=0.000 n=10+10)
ExtendSlice/IntSlice-16                            21.1ns ± 2%    20.3ns ± 2%   -3.71%  (p=0.000 n=10+10)
ExtendSlice/PointerSlice-16                        26.8ns ± 2%    26.3ns ± 2%   -1.86%  (p=0.004 n=10+10)
ExtendSlice/NoGrow-16                              1.23ns ± 0%    1.30ns ± 1%   +5.03%  (p=0.000 n=10+10)
Append-16                                          4.58ns ± 1%    4.53ns ± 0%   -1.11%  (p=0.000 n=10+10)
AppendGrowByte-16                                  1.46ms ± 8%    1.42ms ± 7%   -3.24%  (p=0.035 n=10+10)
AppendGrowString-16                                27.8ms ± 4%    27.2ms ± 5%     ~     (p=0.052 n=10+10)
AppendSlice/1Bytes-16                              1.03ns ± 1%    1.04ns ± 1%     ~     (p=0.303 n=10+10)
AppendSlice/4Bytes-16                              1.04ns ± 0%    1.05ns ± 0%   +0.79%  (p=0.000 n=9+10)
AppendSlice/7Bytes-16                              1.23ns ± 1%    1.24ns ± 0%   +0.45%  (p=0.001 n=10+10)
AppendSlice/8Bytes-16                              1.24ns ± 0%    1.24ns ± 0%     ~     (p=0.183 n=10+10)
AppendSlice/15Bytes-16                             1.37ns ± 1%    1.43ns ± 1%   +3.88%  (p=0.000 n=10+10)
AppendSlice/16Bytes-16                             1.37ns ± 1%    1.42ns ± 1%   +3.63%  (p=0.000 n=9+10)
AppendSlice/32Bytes-16                             1.44ns ± 0%    1.47ns ± 1%   +1.83%  (p=0.000 n=10+10)
AppendSliceLarge/1024Bytes-16                       257ns ± 2%     234ns ± 1%   -8.96%  (p=0.000 n=8+9)
AppendSliceLarge/4096Bytes-16                       871ns ± 6%     812ns ± 1%   -6.80%  (p=0.000 n=10+10)
AppendSliceLarge/16384Bytes-16                     3.15µs ± 6%    3.04µs ± 5%     ~     (p=0.052 n=10+10)
AppendSliceLarge/65536Bytes-16                     10.7µs ± 7%    10.8µs ± 2%     ~     (p=0.278 n=10+9)
AppendSliceLarge/262144Bytes-16                    42.9µs ± 1%    39.6µs ± 5%   -7.75%  (p=0.000 n=9+10)
AppendSliceLarge/1048576Bytes-16                    147µs ± 4%     144µs ± 4%   -2.21%  (p=0.035 n=10+10)
AppendStr/1Bytes-16                                1.20ns ± 0%    1.20ns ± 0%     ~     (p=0.755 n=10+10)
AppendStr/4Bytes-16                                1.13ns ± 0%    1.14ns ± 1%   +1.20%  (p=0.000 n=10+10)
AppendStr/8Bytes-16                                1.24ns ± 0%    1.25ns ± 0%   +0.93%  (p=0.000 n=10+10)
AppendStr/16Bytes-16                               1.40ns ± 0%    1.42ns ± 0%   +2.10%  (p=0.000 n=9+10)
AppendStr/32Bytes-16                               1.44ns ± 0%    1.45ns ± 0%   +0.99%  (p=0.000 n=10+10)
AppendSpecialCase-16                               8.64ns ± 1%    8.89ns ± 2%   +2.90%  (p=0.000 n=10+10)
Copy/1Byte-16                                      1.24ns ± 1%    1.24ns ± 0%   -0.28%  (p=0.000 n=10+6)
Copy/1String-16                                    1.24ns ± 0%    1.23ns ± 0%     ~     (p=0.160 n=10+10)
Copy/2Byte-16                                      1.24ns ± 0%    1.24ns ± 0%     ~     (p=0.115 n=10+10)
Copy/2String-16                                    1.24ns ± 0%    1.24ns ± 1%     ~     (p=0.954 n=10+10)
Copy/4Byte-16                                      1.24ns ± 0%    1.24ns ± 0%   -0.44%  (p=0.001 n=10+10)
Copy/4String-16                                    1.23ns ± 0%    1.23ns ± 0%     ~     (p=0.081 n=10+10)
Copy/8Byte-16                                      1.37ns ± 0%    1.34ns ± 0%   -1.79%  (p=0.000 n=9+9)
Copy/8String-16                                    1.34ns ± 0%    1.34ns ± 0%   -0.58%  (p=0.000 n=9+10)
Copy/12Byte-16                                     1.44ns ± 0%    1.44ns ± 0%     ~     (p=0.149 n=9+9)
Copy/12String-16                                   1.44ns ± 0%    1.45ns ± 0%     ~     (p=0.124 n=9+9)
Copy/16Byte-16                                     1.44ns ± 0%    1.44ns ± 0%   -0.19%  (p=0.004 n=10+9)
Copy/16String-16                                   1.44ns ± 0%    1.45ns ± 0%   +0.30%  (p=0.008 n=10+10)
Copy/32Byte-16                                     1.63ns ± 1%    1.62ns ± 1%   -0.72%  (p=0.002 n=10+10)
Copy/32String-16                                   1.60ns ± 1%    1.64ns ± 0%   +2.23%  (p=0.000 n=10+10)
Copy/128Byte-16                                    2.06ns ± 0%    2.06ns ± 0%     ~     (p=0.757 n=9+10)
Copy/128String-16                                  2.07ns ± 0%    2.07ns ± 0%   +0.36%  (p=0.004 n=10+10)
Copy/1024Byte-16                                   6.07ns ± 2%    6.00ns ± 1%   -1.20%  (p=0.000 n=9+10)
Copy/1024String-16                                 6.05ns ± 0%    5.95ns ± 1%   -1.54%  (p=0.000 n=10+9)
AppendInPlace/NoGrow/Byte-16                        288ns ± 1%     284ns ± 1%   -1.58%  (p=0.000 n=10+10)
AppendInPlace/NoGrow/1Ptr-16                        844ns ± 1%     809ns ± 3%   -4.13%  (p=0.000 n=9+10)
AppendInPlace/NoGrow/2Ptr-16                       1.47µs ± 1%    1.46µs ± 1%     ~     (p=0.388 n=9+10)
AppendInPlace/NoGrow/3Ptr-16                       1.87µs ± 7%    1.91µs ± 1%     ~     (p=0.166 n=10+8)
AppendInPlace/NoGrow/4Ptr-16                       2.66µs ± 1%    2.67µs ± 3%     ~     (p=0.968 n=9+10)
AppendInPlace/Grow/Byte-16                          126ns ± 2%     121ns ± 2%   -4.06%  (p=0.000 n=10+10)
AppendInPlace/Grow/1Ptr-16                          132ns ± 2%     127ns ± 2%   -4.28%  (p=0.000 n=10+9)
AppendInPlace/Grow/2Ptr-16                          196ns ± 2%     188ns ± 1%   -4.20%  (p=0.000 n=10+8)
AppendInPlace/Grow/3Ptr-16                          264ns ± 1%     260ns ± 1%   -1.51%  (p=0.000 n=9+10)
AppendInPlace/Grow/4Ptr-16                          297ns ± 2%     294ns ± 2%     ~     (p=0.085 n=10+10)
StackCopyPtr-16                                    36.4ms ± 2%    36.7ms ± 2%     ~     (p=0.481 n=10+10)
StackCopy-16                                       33.9ms ± 3%    32.6ms ± 1%   -3.87%  (p=0.000 n=10+8)
StackCopyNoCache-16                                1.00ms ± 5%    1.01ms ± 5%     ~     (p=0.143 n=10+10)
StackCopyWithStkobj-16                             11.0ms ± 3%    10.9ms ± 4%     ~     (p=0.579 n=10+10)
Issue18138-16                                      49.2µs ± 5%    49.0µs ± 4%     ~     (p=1.000 n=10+9)
CompareStringEqual-16                              1.39ns ± 1%    1.45ns ± 2%   +3.80%  (p=0.000 n=8+10)
CompareStringIdentical-16                          0.55ns ± 1%    0.55ns ± 0%   +0.42%  (p=0.007 n=10+10)
CompareStringSameLength-16                         1.03ns ± 0%    1.03ns ± 0%     ~     (p=0.430 n=9+10)
CompareStringDifferentLength-16                    0.11ns ± 2%    0.11ns ± 3%     ~     (p=0.139 n=9+10)
CompareStringBigUnaligned-16                       23.9µs ± 1%    24.0µs ± 1%     ~     (p=0.370 n=9+8)
CompareStringBig-16                                22.0µs ± 3%    22.2µs ± 3%     ~     (p=0.243 n=9+10)
ConcatStringAndBytes-16                            10.7ns ± 1%    10.0ns ± 2%   -6.33%  (p=0.000 n=10+10)
SliceByteToString/1-16                             1.34ns ± 0%    1.34ns ± 0%     ~     (p=0.057 n=10+10)
SliceByteToString/2-16                             6.67ns ± 2%    6.60ns ± 3%     ~     (p=0.101 n=10+10)
SliceByteToString/4-16                             7.76ns ± 2%    7.56ns ± 3%   -2.59%  (p=0.001 n=10+10)
SliceByteToString/8-16                             9.81ns ± 4%    9.57ns ± 2%   -2.48%  (p=0.005 n=10+10)
SliceByteToString/16-16                            14.0ns ± 3%    13.7ns ± 2%   -2.31%  (p=0.009 n=10+10)
SliceByteToString/32-16                            17.3ns ± 1%    16.7ns ± 2%   -3.41%  (p=0.000 n=10+10)
SliceByteToString/64-16                            25.1ns ± 1%    24.1ns ± 2%   -3.93%  (p=0.000 n=9+10)
SliceByteToString/128-16                           38.6ns ± 1%    36.5ns ± 1%   -5.60%  (p=0.000 n=10+10)
RuneCount/lenruneslice/ASCII-16                    4.12ns ± 0%    4.11ns ± 0%     ~     (p=0.382 n=10+10)
RuneCount/lenruneslice/Japanese-16                 25.4ns ± 2%    25.6ns ± 2%     ~     (p=0.138 n=9+10)
RuneCount/lenruneslice/MixedLength-16              17.1ns ± 0%    17.2ns ± 0%   +0.59%  (p=0.000 n=9+9)
RuneCount/rangeloop/ASCII-16                       3.30ns ± 1%    3.29ns ± 0%     ~     (p=0.267 n=10+10)
RuneCount/rangeloop/Japanese-16                    20.1ns ± 1%    24.9ns ± 1%  +24.31%  (p=0.000 n=9+9)
RuneCount/rangeloop/MixedLength-16                 16.5ns ± 1%    16.7ns ± 1%   +1.34%  (p=0.000 n=10+10)
RuneCount/utf8.RuneCountInString/ASCII-16          5.71ns ± 1%    5.73ns ± 2%     ~     (p=0.579 n=10+10)
RuneCount/utf8.RuneCountInString/Japanese-16       22.0ns ± 6%    18.4ns ± 3%  -16.41%  (p=0.000 n=9+10)
RuneCount/utf8.RuneCountInString/MixedLength-16    15.0ns ± 1%    14.9ns ± 1%   -1.01%  (p=0.004 n=9+10)
RuneIterate/range/ASCII-16                         2.69ns ± 1%    2.72ns ± 0%   +0.94%  (p=0.026 n=10+9)
RuneIterate/range/Japanese-16                      24.5ns ± 2%    25.3ns ± 2%   +3.23%  (p=0.000 n=9+10)
RuneIterate/range/MixedLength-16                   17.0ns ± 1%    17.1ns ± 1%   +0.85%  (p=0.000 n=10+10)
RuneIterate/range1/ASCII-16                        2.70ns ± 1%    2.72ns ± 0%     ~     (p=0.058 n=9+9)
RuneIterate/range1/Japanese-16                     24.1ns ± 2%    25.2ns ± 3%   +4.30%  (p=0.000 n=10+10)
RuneIterate/range1/MixedLength-16                  16.9ns ± 1%    17.7ns ± 0%   +5.04%  (p=0.000 n=10+8)
RuneIterate/range2/ASCII-16                        2.84ns ± 8%    2.72ns ± 1%   -4.28%  (p=0.003 n=10+9)
RuneIterate/range2/Japanese-16                     22.7ns ± 4%    25.2ns ± 3%  +10.97%  (p=0.000 n=10+10)
RuneIterate/range2/MixedLength-16                  17.0ns ± 1%    17.2ns ± 0%   +0.95%  (p=0.000 n=10+10)
ArrayEqual-16                                      0.40ns ± 5%    0.35ns ± 2%  -11.83%  (p=0.000 n=10+10)
Func/Name-16                                       8.05ns ± 1%    8.09ns ± 1%   +0.40%  (p=0.025 n=8+10)
Func/Entry-16                                      1.73ns ± 1%    1.66ns ± 1%   -3.93%  (p=0.000 n=10+10)
Func/FileLine-16                                   27.5ns ± 2%    26.0ns ± 0%   -5.50%  (p=0.000 n=10+10)
[Geo mean]                                         16.7ns         15.7ns        -6.08%

name                                             old speed      new speed      delta
SetTypePtr-16                                    11.0GB/s ± 1%  11.0GB/s ± 3%     ~     (p=0.684 n=10+10)
SetTypePtr8-16                                   15.5GB/s ± 0%  15.5GB/s ± 0%     ~     (p=0.123 n=10+10)
SetTypePtr16-16                                  31.0GB/s ± 1%  31.1GB/s ± 0%     ~     (p=0.123 n=10+10)
SetTypePtr32-16                                  62.1GB/s ± 0%  62.2GB/s ± 0%     ~     (p=0.123 n=10+10)
SetTypePtr64-16                                   124GB/s ± 0%   124GB/s ± 0%     ~     (p=0.684 n=10+10)
SetTypePtr126-16                                  146GB/s ± 0%   146GB/s ± 0%     ~     (p=0.481 n=10+10)
SetTypePtr128-16                                  154GB/s ± 0%   154GB/s ± 0%     ~     (p=0.243 n=9+10)
SetTypePtrSlice-16                                151GB/s ± 1%   151GB/s ± 1%     ~     (p=0.497 n=9+10)
SetTypeNode1-16                                  5.82GB/s ± 1%  5.82GB/s ± 0%     ~     (p=0.353 n=10+10)
SetTypeNode1Slice-16                             76.1GB/s ± 1%  77.0GB/s ± 1%   +1.19%  (p=0.000 n=10+10)
SetTypeNode8-16                                  19.4GB/s ± 0%  19.4GB/s ± 0%     ~     (p=0.130 n=8+8)
SetTypeNode8Slice-16                              113GB/s ± 0%   113GB/s ± 0%     ~     (p=0.604 n=10+9)
SetTypeNode64-16                                 76.5GB/s ± 0%  76.5GB/s ± 0%     ~     (p=0.190 n=10+10)
SetTypeNode64Slice-16                            97.8GB/s ± 0%  97.7GB/s ± 0%     ~     (p=0.549 n=9+10)
SetTypeNode64Dead-16                             95.5GB/s ± 0%  95.7GB/s ± 0%     ~     (p=0.118 n=10+6)
SetTypeNode64DeadSlice-16                         112GB/s ± 0%   112GB/s ± 0%     ~     (p=0.353 n=10+10)
SetTypeNode124-16                                 146GB/s ± 0%   146GB/s ± 0%     ~     (p=0.853 n=10+10)
SetTypeNode124Slice-16                            146GB/s ± 5%   149GB/s ± 0%     ~     (p=0.315 n=10+10)
SetTypeNode126-16                                 154GB/s ± 0%   154GB/s ± 0%     ~     (p=0.356 n=10+9)
SetTypeNode126Slice-16                            150GB/s ± 0%   150GB/s ± 0%     ~     (p=0.095 n=9+10)
SetTypeNode128-16                                 107GB/s ± 0%   107GB/s ± 0%   +0.31%  (p=0.003 n=9+10)
SetTypeNode128Slice-16                            119GB/s ± 0%   120GB/s ± 0%     ~     (p=0.156 n=10+9)
SetTypeNode130-16                                 108GB/s ± 0%   108GB/s ± 0%   +0.33%  (p=0.002 n=10+10)
SetTypeNode130Slice-16                            119GB/s ± 0%   119GB/s ± 0%     ~     (p=0.739 n=10+10)
SetTypeNode1024-16                                160GB/s ± 0%   159GB/s ± 1%     ~     (p=0.113 n=9+9)
SetTypeNode1024Slice-16                           144GB/s ± 0%   144GB/s ± 0%     ~     (p=0.063 n=10+10)
Hash5-16                                         2.59GB/s ± 1%  2.49GB/s ± 0%   -3.90%  (p=0.000 n=10+9)
Hash16-16                                        7.85GB/s ± 1%  7.23GB/s ± 1%   -7.92%  (p=0.000 n=10+10)
Hash64-16                                        24.0GB/s ± 0%  23.9GB/s ± 0%     ~     (p=0.190 n=9+9)
Hash1024-16                                      62.4GB/s ± 0%  62.3GB/s ± 0%   -0.16%  (p=0.017 n=9+10)
Hash65536-16                                     74.0GB/s ± 0%  74.0GB/s ± 0%     ~     (p=0.796 n=10+10)
Memmove/1-16                                     1.08GB/s ± 0%  1.08GB/s ± 0%   -0.21%  (p=0.035 n=10+10)
Memmove/2-16                                     2.16GB/s ± 0%  2.15GB/s ± 0%     ~     (p=0.105 n=10+10)
Memmove/3-16                                     3.24GB/s ± 1%  3.22GB/s ± 1%   -0.49%  (p=0.004 n=10+10)
Memmove/4-16                                     3.89GB/s ± 0%  3.89GB/s ± 0%     ~     (p=0.218 n=10+10)
Memmove/5-16                                     4.42GB/s ± 0%  4.42GB/s ± 0%     ~     (p=0.075 n=10+10)
Memmove/6-16                                     5.31GB/s ± 0%  5.29GB/s ± 1%     ~     (p=0.218 n=10+10)
Memmove/7-16                                     6.19GB/s ± 0%  6.18GB/s ± 0%   -0.15%  (p=0.035 n=10+9)
Memmove/8-16                                     7.07GB/s ± 0%  7.07GB/s ± 0%     ~     (p=0.684 n=10+10)
Memmove/9-16                                     7.22GB/s ± 0%  6.68GB/s ± 0%   -7.37%  (p=0.000 n=10+10)
Memmove/10-16                                    8.02GB/s ± 0%  7.43GB/s ± 0%   -7.38%  (p=0.000 n=9+9)
Memmove/11-16                                    8.83GB/s ± 0%  8.13GB/s ± 0%   -7.87%  (p=0.000 n=10+9)
Memmove/12-16                                    9.62GB/s ± 0%  8.89GB/s ± 1%   -7.61%  (p=0.000 n=10+10)
Memmove/13-16                                    10.4GB/s ± 0%   9.7GB/s ± 0%   -7.20%  (p=0.000 n=10+10)
Memmove/14-16                                    11.2GB/s ± 0%  10.4GB/s ± 1%   -7.64%  (p=0.000 n=10+9)
Memmove/15-16                                    12.0GB/s ± 0%  11.1GB/s ± 0%   -7.46%  (p=0.000 n=10+9)
Memmove/16-16                                    12.8GB/s ± 0%  11.8GB/s ± 1%   -7.67%  (p=0.000 n=10+10)
Memmove/32-16                                    23.8GB/s ± 0%  23.5GB/s ± 1%   -1.20%  (p=0.000 n=10+10)
Memmove/64-16                                    44.2GB/s ± 0%  39.1GB/s ± 0%  -11.56%  (p=0.000 n=10+9)
Memmove/128-16                                   68.7GB/s ± 0%  63.2GB/s ± 0%   -7.95%  (p=0.000 n=10+10)
Memmove/256-16                                    104GB/s ± 0%   103GB/s ± 0%   -1.13%  (p=0.000 n=10+10)
Memmove/512-16                                    129GB/s ± 1%   129GB/s ± 0%     ~     (p=0.165 n=10+10)
Memmove/1024-16                                   174GB/s ± 1%   174GB/s ± 1%     ~     (p=0.258 n=9+9)
Memmove/2048-16                                   213GB/s ± 1%   213GB/s ± 2%     ~     (p=0.963 n=8+9)
Memmove/4096-16                                   250GB/s ± 1%   240GB/s ± 4%   -3.83%  (p=0.006 n=9+9)
MemmoveOverlap/32-16                             19.8GB/s ± 1%  19.1GB/s ± 1%   -3.40%  (p=0.000 n=10+10)
MemmoveOverlap/64-16                             39.0GB/s ± 0%  38.8GB/s ± 0%   -0.28%  (p=0.001 n=9+9)
MemmoveOverlap/128-16                            62.2GB/s ± 0%  62.1GB/s ± 0%     ~     (p=0.063 n=10+10)
MemmoveOverlap/256-16                            96.0GB/s ± 0%  95.8GB/s ± 0%   -0.26%  (p=0.009 n=10+10)
MemmoveOverlap/512-16                            83.6GB/s ±16%  89.2GB/s ± 0%     ~     (p=0.696 n=10+8)
MemmoveOverlap/1024-16                            141GB/s ± 0%   140GB/s ± 0%   -0.28%  (p=0.006 n=8+10)
MemmoveOverlap/2048-16                            172GB/s ± 0%   171GB/s ± 1%   -0.38%  (p=0.008 n=9+9)
MemmoveOverlap/4096-16                            176GB/s ± 1%   177GB/s ± 1%   +0.84%  (p=0.001 n=8+10)
MemmoveUnalignedDst/1-16                          806MB/s ± 0%   802MB/s ± 1%   -0.52%  (p=0.023 n=10+10)
MemmoveUnalignedDst/2-16                         1.62GB/s ± 0%  1.62GB/s ± 0%   -0.11%  (p=0.041 n=10+10)
MemmoveUnalignedDst/3-16                         2.43GB/s ± 0%  2.43GB/s ± 0%   -0.14%  (p=0.006 n=9+9)
MemmoveUnalignedDst/4-16                         3.24GB/s ± 0%  3.23GB/s ± 1%   -0.36%  (p=0.007 n=10+10)
MemmoveUnalignedDst/5-16                         3.71GB/s ± 0%  3.71GB/s ± 0%     ~     (p=0.063 n=10+10)
MemmoveUnalignedDst/6-16                         4.48GB/s ± 0%  4.47GB/s ± 0%     ~     (p=0.912 n=10+10)
MemmoveUnalignedDst/7-16                         5.22GB/s ± 0%  5.22GB/s ± 0%     ~     (p=1.000 n=10+10)
MemmoveUnalignedDst/8-16                         5.95GB/s ± 0%  5.93GB/s ± 1%   -0.40%  (p=0.023 n=10+10)
MemmoveUnalignedDst/9-16                         6.24GB/s ± 0%  6.24GB/s ± 0%     ~     (p=0.912 n=10+10)
MemmoveUnalignedDst/10-16                        6.94GB/s ± 0%  6.94GB/s ± 0%     ~     (p=0.353 n=10+10)
MemmoveUnalignedDst/11-16                        7.64GB/s ± 0%  7.63GB/s ± 0%     ~     (p=0.393 n=10+10)
MemmoveUnalignedDst/12-16                        8.33GB/s ± 0%  8.33GB/s ± 0%     ~     (p=0.971 n=10+10)
MemmoveUnalignedDst/13-16                        9.02GB/s ± 0%  9.01GB/s ± 0%     ~     (p=0.436 n=10+10)
MemmoveUnalignedDst/14-16                        9.71GB/s ± 0%  9.71GB/s ± 0%     ~     (p=0.280 n=10+10)
MemmoveUnalignedDst/15-16                        10.4GB/s ± 0%  10.4GB/s ± 1%     ~     (p=0.853 n=10+10)
MemmoveUnalignedDst/16-16                        11.1GB/s ± 0%  11.1GB/s ± 0%     ~     (p=0.089 n=10+10)
MemmoveUnalignedDst/32-16                        19.7GB/s ± 1%  19.6GB/s ± 0%     ~     (p=0.075 n=10+10)
MemmoveUnalignedDst/64-16                        38.9GB/s ± 0%  38.8GB/s ± 0%     ~     (p=0.218 n=10+10)
MemmoveUnalignedDst/128-16                       62.1GB/s ± 0%  62.1GB/s ± 0%     ~     (p=0.549 n=10+9)
MemmoveUnalignedDst/256-16                       69.4GB/s ± 0%  69.3GB/s ± 0%     ~     (p=0.105 n=10+10)
MemmoveUnalignedDst/512-16                        124GB/s ± 1%   124GB/s ± 0%     ~     (p=0.762 n=10+8)
MemmoveUnalignedDst/1024-16                       136GB/s ± 0%   136GB/s ± 1%     ~     (p=0.666 n=9+9)
MemmoveUnalignedDst/2048-16                       159GB/s ± 0%   159GB/s ± 0%     ~     (p=0.574 n=8+8)
MemmoveUnalignedDst/4096-16                       161GB/s ± 0%   161GB/s ± 0%     ~     (p=1.000 n=9+9)
MemmoveUnalignedDstOverlap/32-16                 7.84GB/s ± 0%  7.83GB/s ± 0%     ~     (p=0.353 n=10+10)
MemmoveUnalignedDstOverlap/64-16                 14.0GB/s ± 0%  14.0GB/s ± 0%     ~     (p=0.661 n=10+9)
MemmoveUnalignedDstOverlap/128-16                27.4GB/s ± 0%  27.4GB/s ± 0%     ~     (p=0.353 n=10+10)
MemmoveUnalignedDstOverlap/256-16                50.4GB/s ± 0%  50.4GB/s ± 0%     ~     (p=0.156 n=10+9)
MemmoveUnalignedDstOverlap/512-16                60.7GB/s ± 4%  62.5GB/s ± 0%   +3.07%  (p=0.022 n=10+9)
MemmoveUnalignedDstOverlap/1024-16                107GB/s ± 0%   107GB/s ± 0%     ~     (p=0.234 n=8+8)
MemmoveUnalignedDstOverlap/2048-16                146GB/s ± 0%   146GB/s ± 1%     ~     (p=0.182 n=10+9)
MemmoveUnalignedDstOverlap/4096-16                155GB/s ± 0%   155GB/s ± 0%     ~     (p=0.400 n=10+9)
MemmoveUnalignedSrc/1-16                          882MB/s ± 0%   884MB/s ± 1%   +0.24%  (p=0.033 n=10+9)
MemmoveUnalignedSrc/2-16                         1.76GB/s ± 1%  1.77GB/s ± 0%   +0.27%  (p=0.028 n=10+9)
MemmoveUnalignedSrc/3-16                         2.43GB/s ± 0%  2.43GB/s ± 0%   +0.26%  (p=0.027 n=9+10)
MemmoveUnalignedSrc/4-16                         3.24GB/s ± 0%  3.24GB/s ± 1%     ~     (p=0.079 n=9+10)
MemmoveUnalignedSrc/5-16                         3.73GB/s ± 0%  3.73GB/s ± 1%     ~     (p=0.829 n=8+10)
MemmoveUnalignedSrc/6-16                         4.47GB/s ± 0%  4.49GB/s ± 0%   +0.39%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/7-16                         5.22GB/s ± 0%  5.23GB/s ± 0%     ~     (p=0.280 n=10+10)
MemmoveUnalignedSrc/8-16                         5.95GB/s ± 0%  5.98GB/s ± 0%   +0.39%  (p=0.001 n=10+9)
MemmoveUnalignedSrc/9-16                         6.24GB/s ± 0%  6.25GB/s ± 0%     ~     (p=0.549 n=10+9)
MemmoveUnalignedSrc/10-16                        6.93GB/s ± 0%  6.94GB/s ± 0%     ~     (p=0.604 n=10+9)
MemmoveUnalignedSrc/11-16                        7.63GB/s ± 0%  7.63GB/s ± 1%     ~     (p=0.353 n=10+10)
MemmoveUnalignedSrc/12-16                        8.32GB/s ± 0%  8.32GB/s ± 0%     ~     (p=0.218 n=10+10)
MemmoveUnalignedSrc/13-16                        9.02GB/s ± 0%  9.00GB/s ± 1%     ~     (p=0.684 n=10+10)
MemmoveUnalignedSrc/14-16                        9.71GB/s ± 0%  9.71GB/s ± 0%     ~     (p=0.739 n=10+10)
MemmoveUnalignedSrc/15-16                        10.4GB/s ± 0%  10.4GB/s ± 0%     ~     (p=0.353 n=10+10)
MemmoveUnalignedSrc/16-16                        11.1GB/s ± 1%  11.1GB/s ± 0%     ~     (p=0.579 n=10+10)
MemmoveUnalignedSrc/32-16                        20.0GB/s ± 1%  20.0GB/s ± 0%     ~     (p=0.631 n=10+10)
MemmoveUnalignedSrc/64-16                        38.8GB/s ± 0%  38.8GB/s ± 0%     ~     (p=0.579 n=10+10)
MemmoveUnalignedSrc/128-16                       61.2GB/s ± 0%  61.2GB/s ± 0%     ~     (p=0.780 n=10+9)
MemmoveUnalignedSrc/256-16                       94.8GB/s ± 0%  92.2GB/s ± 1%   -2.73%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/512-16                        119GB/s ± 0%   119GB/s ± 0%   +0.26%  (p=0.027 n=8+9)
MemmoveUnalignedSrc/1024-16                       141GB/s ± 0%   142GB/s ± 1%   +1.07%  (p=0.000 n=8+10)
MemmoveUnalignedSrc/2048-16                       157GB/s ± 0%   157GB/s ± 0%     ~     (p=0.167 n=9+8)
MemmoveUnalignedSrc/4096-16                       161GB/s ± 0%   162GB/s ± 1%     ~     (p=0.063 n=10+10)
MemmoveUnalignedSrcOverlap/32-16                 7.93GB/s ± 0%  7.88GB/s ± 0%   -0.63%  (p=0.000 n=9+10)
MemmoveUnalignedSrcOverlap/64-16                 15.5GB/s ± 0%  15.5GB/s ± 0%     ~     (p=0.529 n=10+10)
MemmoveUnalignedSrcOverlap/128-16                28.3GB/s ± 0%  28.3GB/s ± 0%     ~     (p=0.218 n=10+10)
MemmoveUnalignedSrcOverlap/256-16                41.5GB/s ± 0%  41.6GB/s ± 0%   +0.35%  (p=0.000 n=10+9)
MemmoveUnalignedSrcOverlap/512-16                68.9GB/s ± 0%  68.8GB/s ± 0%     ~     (p=0.541 n=9+8)
MemmoveUnalignedSrcOverlap/1024-16                115GB/s ± 0%   115GB/s ± 0%     ~     (p=0.382 n=8+8)
MemmoveUnalignedSrcOverlap/2048-16                155GB/s ± 0%   144GB/s ±18%     ~     (p=0.101 n=8+10)
MemmoveUnalignedSrcOverlap/4096-16                160GB/s ± 0%   160GB/s ± 1%     ~     (p=0.605 n=9+9)
Memclr/5-16                                      5.81GB/s ± 1%  5.80GB/s ± 2%     ~     (p=0.546 n=9+9)
Memclr/16-16                                     15.5GB/s ± 0%  15.4GB/s ± 0%   -0.32%  (p=0.008 n=9+10)
Memclr/64-16                                     51.8GB/s ± 0%  50.7GB/s ± 0%   -2.22%  (p=0.000 n=10+10)
Memclr/256-16                                     113GB/s ± 0%   113GB/s ± 0%     ~     (p=0.143 n=10+10)
Memclr/4096-16                                    239GB/s ± 1%   237GB/s ± 0%   -0.87%  (p=0.000 n=10+10)
Memclr/65536-16                                  79.8GB/s ± 0%  79.7GB/s ± 0%     ~     (p=0.529 n=10+10)
Memclr/1M-16                                     74.6GB/s ± 1%  74.7GB/s ± 1%     ~     (p=0.529 n=10+10)
Memclr/4M-16                                     48.7GB/s ± 1%  48.8GB/s ± 0%     ~     (p=0.123 n=10+10)
Memclr/8M-16                                     48.2GB/s ± 2%  48.6GB/s ± 0%     ~     (p=0.408 n=10+8)
Memclr/16M-16                                    43.6GB/s ± 4%  43.3GB/s ± 0%     ~     (p=0.173 n=10+8)
Memclr/64M-16                                    30.7GB/s ± 0%  30.7GB/s ± 0%     ~     (p=0.113 n=10+9)
GoMemclr/5-16                                    6.07GB/s ± 0%  6.08GB/s ± 0%     ~     (p=0.367 n=9+10)
GoMemclr/16-16                                   15.6GB/s ± 0%  15.6GB/s ± 0%   -0.22%  (p=0.004 n=9+9)
GoMemclr/64-16                                   56.1GB/s ± 0%  56.1GB/s ± 0%     ~     (p=0.968 n=10+9)
GoMemclr/256-16                                   125GB/s ± 0%   124GB/s ± 0%     ~     (p=0.912 n=10+10)
MemclrRange/1K_2K-16                              210GB/s ± 0%   224GB/s ± 1%   +6.81%  (p=0.000 n=10+10)
MemclrRange/2K_8K-16                              228GB/s ± 0%   228GB/s ± 0%     ~     (p=0.684 n=10+10)
MemclrRange/4K_16K-16                             279GB/s ± 0%   279GB/s ± 0%     ~     (p=0.780 n=9+10)
MemclrRange/160K_228K-16                         80.3GB/s ± 0%  80.2GB/s ± 0%     ~     (p=0.165 n=10+10)
Copy/1Byte-16                                     808MB/s ± 1%   810MB/s ± 0%   +0.28%  (p=0.000 n=10+8)
Copy/1String-16                                   810MB/s ± 0%   811MB/s ± 0%     ~     (p=0.105 n=10+10)
Copy/2Byte-16                                    1.62GB/s ± 0%  1.62GB/s ± 0%     ~     (p=0.182 n=10+9)
Copy/2String-16                                  1.62GB/s ± 1%  1.62GB/s ± 1%     ~     (p=1.000 n=10+10)
Copy/4Byte-16                                    3.22GB/s ± 0%  3.24GB/s ± 0%   +0.46%  (p=0.000 n=10+10)
Copy/4String-16                                  3.24GB/s ± 0%  3.24GB/s ± 0%     ~     (p=0.075 n=10+10)
Copy/8Byte-16                                    5.86GB/s ± 0%  5.96GB/s ± 0%   +1.82%  (p=0.000 n=9+9)
Copy/8String-16                                  5.95GB/s ± 0%  5.99GB/s ± 0%   +0.59%  (p=0.000 n=9+10)
Copy/12Byte-16                                   8.32GB/s ± 0%  8.32GB/s ± 0%     ~     (p=0.190 n=9+9)
Copy/12String-16                                 8.31GB/s ± 0%  8.29GB/s ± 0%     ~     (p=0.068 n=9+10)
Copy/16Byte-16                                   11.1GB/s ± 0%  11.1GB/s ± 0%   +0.18%  (p=0.003 n=10+9)
Copy/16String-16                                 11.1GB/s ± 0%  11.1GB/s ± 0%   -0.31%  (p=0.009 n=10+10)
Copy/32Byte-16                                   19.6GB/s ± 1%  19.8GB/s ± 1%   +0.72%  (p=0.002 n=10+10)
Copy/32String-16                                 20.0GB/s ± 0%  19.5GB/s ± 0%   -2.19%  (p=0.000 n=10+10)
Copy/128Byte-16                                  62.2GB/s ± 0%  62.1GB/s ± 0%     ~     (p=0.661 n=9+10)
Copy/128String-16                                61.9GB/s ± 0%  61.7GB/s ± 0%   -0.35%  (p=0.005 n=10+10)
Copy/1024Byte-16                                  169GB/s ± 2%   171GB/s ± 1%   +1.21%  (p=0.000 n=9+10)
Copy/1024String-16                                169GB/s ± 0%   172GB/s ± 1%   +1.57%  (p=0.000 n=10+9)
CompareStringBigUnaligned-16                     43.8GB/s ± 1%  43.7GB/s ± 1%     ~     (p=0.370 n=9+8)
CompareStringBig-16                              47.6GB/s ± 3%  47.3GB/s ± 3%     ~     (p=0.243 n=9+10)
[Geo mean]                                       25.3GB/s       28.0GB/s       +10.66%

name                                             old p50-ns     new p50-ns     delta
ReadMemStatsLatency-16                              98.4k ±37%    112.7k ±64%     ~     (p=0.436 n=10+10)
ReadMetricsLatency-16                               1.69k ± 3%     1.70k ± 2%     ~     (p=0.646 n=9+10)
GoroutineProfile/small-nil/idle-16                  3.75k ± 4%     3.72k ± 1%     ~     (p=0.447 n=10+9)
GoroutineProfile/small-nil/loaded-16                4.33k ± 3%     4.31k ± 5%     ~     (p=0.931 n=9+9)
GoroutineProfile/small/idle-16                       102k ± 3%      101k ± 4%     ~     (p=0.113 n=9+9)
GoroutineProfile/small/loaded-16                     214k ± 3%      215k ± 6%     ~     (p=0.842 n=9+10)
GoroutineProfile/large-nil/idle-16                  3.70k ± 2%     3.65k ± 2%     ~     (p=0.075 n=10+10)
GoroutineProfile/large-nil/loaded-16                4.36k ± 3%     4.31k ±10%     ~     (p=0.631 n=10+10)
GoroutineProfile/large/idle-16                      2.56M ± 1%     2.51M ± 1%   -2.28%  (p=0.000 n=10+10)
GoroutineProfile/large/loaded-16                    6.77M ± 4%     6.85M ±19%     ~     (p=0.536 n=7+10)
GoroutineProfile/sparse-nil/idle-16                 3.66k ± 1%     3.64k ± 2%     ~     (p=0.136 n=9+9)
GoroutineProfile/sparse-nil/loaded-16               4.25k ± 5%     4.15k ± 4%     ~     (p=0.190 n=10+10)
GoroutineProfile/sparse/idle-16                      102k ± 4%      101k ± 3%     ~     (p=0.447 n=10+9)
GoroutineProfile/sparse/loaded-16                    216k ± 4%      218k ± 4%     ~     (p=0.549 n=9+10)
[Geo mean]                                          35.8k          35.9k        +0.35%

name                                             old p90-ns     new p90-ns     delta
ReadMemStatsLatency-16                              983k ±310%      200k ±34%  -79.62%  (p=0.034 n=10+8)
ReadMetricsLatency-16                               4.01k ±35%     3.75k ±17%     ~     (p=0.315 n=10+10)
GoroutineProfile/small-nil/idle-16                  4.21k ± 4%     4.27k ± 8%     ~     (p=0.968 n=9+10)
GoroutineProfile/small-nil/loaded-16                5.58k ± 8%     5.35k ±12%     ~     (p=0.190 n=10+10)
GoroutineProfile/small/idle-16                       108k ± 6%      107k ± 7%     ~     (p=0.497 n=9+10)
GoroutineProfile/small/loaded-16                     450k ± 5%      432k ± 3%   -3.92%  (p=0.002 n=9+9)
GoroutineProfile/large-nil/idle-16                  4.13k ± 7%     4.04k ± 2%     ~     (p=0.181 n=10+8)
GoroutineProfile/large-nil/loaded-16                5.76k ± 4%     5.67k ± 5%     ~     (p=0.190 n=10+10)
GoroutineProfile/large/idle-16                      2.63M ± 2%     2.58M ± 1%   -1.97%  (p=0.000 n=10+10)
GoroutineProfile/large/loaded-16                    16.9M ± 4%     17.0M ± 6%     ~     (p=0.661 n=9+10)
GoroutineProfile/sparse-nil/idle-16                 4.21k ±10%     4.07k ± 7%     ~     (p=0.128 n=10+10)
GoroutineProfile/sparse-nil/loaded-16               5.55k ± 8%     5.38k ± 6%     ~     (p=0.089 n=10+10)
GoroutineProfile/sparse/idle-16                      106k ± 4%      106k ± 3%     ~     (p=0.661 n=10+9)
GoroutineProfile/sparse/loaded-16                    454k ± 6%      441k ± 6%   -2.86%  (p=0.043 n=10+10)
[Geo mean]                                          58.4k          51.0k       -12.61%

name                                             old p99-ns     new p99-ns     delta
ReadMemStatsLatency-16                              983k ±310%      200k ±34%  -79.62%  (p=0.034 n=10+8)
ReadMetricsLatency-16                               26.6k ±22%     26.6k ±17%     ~     (p=0.971 n=10+10)
GoroutineProfile/small-nil/idle-16                  5.27k ±12%     5.19k ±12%     ~     (p=0.579 n=10+10)
GoroutineProfile/small-nil/loaded-16                7.28k ± 3%     7.04k ± 9%     ~     (p=0.113 n=9+10)
GoroutineProfile/small/idle-16                       114k ± 6%      113k ± 6%     ~     (p=0.604 n=9+10)
GoroutineProfile/small/loaded-16                    4.84M ±70%    6.06M ±131%     ~     (p=0.842 n=9+10)
GoroutineProfile/large-nil/idle-16                  5.38k ±19%     5.26k ±13%     ~     (p=0.912 n=10+10)
GoroutineProfile/large-nil/loaded-16                7.38k ± 3%     7.23k ± 4%     ~     (p=0.143 n=10+10)
GoroutineProfile/large/idle-16                      2.79M ± 5%     2.72M ± 3%     ~     (p=0.089 n=10+10)
GoroutineProfile/large/loaded-16                    24.0M ±24%     24.9M ±29%     ~     (p=0.684 n=10+10)
GoroutineProfile/sparse-nil/idle-16                 5.32k ±17%     5.49k ±17%     ~     (p=0.684 n=10+10)
GoroutineProfile/sparse-nil/loaded-16               7.25k ± 4%     6.97k ± 5%   -3.90%  (p=0.005 n=9+10)
GoroutineProfile/sparse/idle-16                      113k ± 5%      112k ± 6%     ~     (p=0.631 n=10+10)
GoroutineProfile/sparse/loaded-16                   4.00M ±66%     4.26M ±65%     ~     (p=0.489 n=9+9)
[Geo mean]                                           107k            97k        -9.55%

name                                             old alloc/op   new alloc/op   delta
NewEmptyMap-16                                      0.00B          0.00B          ~     (all equal)
NewSmallMap-16                                      0.00B          0.00B          ~     (all equal)
MapPopulate/1-16                                    0.00B          0.00B          ~     (all equal)
MapPopulate/10-16                                    179B ± 0%      179B ± 0%     ~     (all equal)
MapPopulate/100-16                                 3.35kB ± 0%    3.35kB ± 0%     ~     (p=0.294 n=10+8)
MapPopulate/1000-16                                53.3kB ± 0%    53.3kB ± 0%     ~     (p=1.000 n=8+10)
MapPopulate/10000-16                                428kB ± 0%     428kB ± 0%     ~     (p=0.469 n=10+10)
MapPopulate/100000-16                              3.62MB ± 0%    3.62MB ± 0%     ~     (p=0.888 n=9+10)
MapStringConversion/32/simple-16                    0.00B          0.00B          ~     (all equal)
MapStringConversion/32/struct-16                    0.00B          0.00B          ~     (all equal)
MapStringConversion/32/array-16                     0.00B          0.00B          ~     (all equal)
MapStringConversion/64/simple-16                    0.00B          0.00B          ~     (all equal)
MapStringConversion/64/struct-16                    0.00B          0.00B          ~     (all equal)
MapStringConversion/64/array-16                     0.00B          0.00B          ~     (all equal)
NewEmptyMapHintLessThan8-16                         0.00B          0.00B          ~     (all equal)
NewEmptyMapHintGreaterThan8-16                     1.15kB ± 0%    1.15kB ± 0%     ~     (all equal)
MapAppendAssign/Int32/256-16                        41.7B ±15%     44.3B ±12%     ~     (p=0.106 n=10+10)
MapAppendAssign/Int32/65536-16                      22.6B ± 6%     23.5B ± 6%   +4.19%  (p=0.025 n=9+10)
MapAppendAssign/Int64/256-16                        43.5B ±10%     42.9B ± 7%     ~     (p=0.757 n=10+10)
MapAppendAssign/Int64/65536-16                      24.7B ± 7%     21.8B ± 6%  -11.74%  (p=0.000 n=10+10)
MapAppendAssign/Str/256-16                          87.6B ±10%     89.2B ± 9%     ~     (p=0.379 n=10+10)
MapAppendAssign/Str/65536-16                        45.1B ±14%     47.2B ± 8%     ~     (p=0.150 n=10+9)
CreateGoroutinesCapture-16                           144B ± 0%      144B ± 0%     ~     (all equal)
[Geo mean]                                           769B           770B        +0.21%

name                                             old allocs/op  new allocs/op  delta
NewEmptyMap-16                                       0.00           0.00          ~     (all equal)
NewSmallMap-16                                       0.00           0.00          ~     (all equal)
MapPopulate/1-16                                     0.00           0.00          ~     (all equal)
MapPopulate/10-16                                    1.00 ± 0%      1.00 ± 0%     ~     (all equal)
MapPopulate/100-16                                   17.0 ± 0%      17.0 ± 0%     ~     (all equal)
MapPopulate/1000-16                                  73.0 ± 0%      73.0 ± 0%     ~     (all equal)
MapPopulate/10000-16                                  320 ± 0%       320 ± 0%     ~     (p=1.000 n=10+10)
MapPopulate/100000-16                               4.00k ± 0%     4.00k ± 0%     ~     (p=0.753 n=10+10)
MapStringConversion/32/simple-16                     0.00           0.00          ~     (all equal)
MapStringConversion/32/struct-16                     0.00           0.00          ~     (all equal)
MapStringConversion/32/array-16                      0.00           0.00          ~     (all equal)
MapStringConversion/64/simple-16                     0.00           0.00          ~     (all equal)
MapStringConversion/64/struct-16                     0.00           0.00          ~     (all equal)
MapStringConversion/64/array-16                      0.00           0.00          ~     (all equal)
NewEmptyMapHintLessThan8-16                          0.00           0.00          ~     (all equal)
NewEmptyMapHintGreaterThan8-16                       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
MapAppendAssign/Int32/256-16                         0.00           0.00          ~     (all equal)
MapAppendAssign/Int32/65536-16                       0.00           0.00          ~     (all equal)
MapAppendAssign/Int64/256-16                         0.00           0.00          ~     (all equal)
MapAppendAssign/Int64/65536-16                       0.00           0.00          ~     (all equal)
MapAppendAssign/Str/256-16                           0.00           0.00          ~     (all equal)
MapAppendAssign/Str/65536-16                         0.00           0.00          ~     (all equal)
CreateGoroutinesCapture-16                           5.00 ± 0%      5.00 ± 0%     ~     (all equal)
[Geo mean]                                           26.0           26.0        +0.00%

Change-Id: I5fb03e93df8b380e04795afbdcd1c94aeeecacc6
Reviewed-on: https://go-review.googlesource.com/c/go/+/454255
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: Jakub Ciolek <jakub@ciolek.dev>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-01-31 18:11:24 +00:00
Paul E. Murphy 1540531746 test/codegen: merge identical ppc64 and ppc64le tests
Manually consolidate the remaining ppc64/ppc64le test which
are not so trivial to automatically merge.

The remaining ppc64le tests are limited to cases where load/stores are
merged (this only happens on ppc64le) and the race detector (only
supported on ppc64le).

Change-Id: I1f9c0f3d3ddbb7fbbd8c81fbbd6537394fba63ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/463217
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2023-01-27 19:03:02 +00:00
Paul E. Murphy 0301c6c351 test/codegen: combine trivial PPC64 tests into ppc64x
Use a small python script to consolidate duplicate
ppc64/ppc64le tests into a single ppc64x codegen test.

This makes small assumption that anytime two tests with
for different arch/variant combos exists, those tests
can be combined into a single ppc64x test.

E.x:

  // ppc64le: foo
  // ppc64le/power9: foo
into
  // ppc64x: foo

or

  // ppc64: foo
  // ppc64le: foo
into
  // ppc64x: foo

import glob
import re
files = glob.glob("codegen/*.go")
for file in files:
    with open(file) as f:
        text = [l for l in f]
    i = 0
    while i < len(text):
        first = re.match("\s*// ?ppc64(le)?(/power[89])?:(.*)", text[i])
        if first:
            j = i+1
            while j < len(text):
                second = re.match("\s*// ?ppc64(le)?(/power[89])?:(.*)", text[j])
                if not second:
                    break
                if (not first.group(2) or first.group(2) == second.group(2)) and first.group(3) == second.group(3):
                    text[i] = re.sub(" ?ppc64(le|x)?"," ppc64x",text[i])
                    text=text[:j] + (text[j+1:])
                else:
                    j += 1
        i+=1
    with open(file, 'w') as f:
        f.write("".join(text))

Change-Id: Ic6b009b54eacaadc5a23db9c5a3bf7331b595821
Reviewed-on: https://go-review.googlesource.com/c/go/+/463220
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-01-27 18:24:12 +00:00
Paul E. Murphy a37672bb7b test/codegen: accept ppc64x as alias for ppc64le and ppc64 arches
This helps simplify the noise when adding ppc codegen tests. ppc64x
is used in other places to indicate something which runs on either
endian.

This helps cleanup existing codegen tests which are mostly
identical between endian variants.

condmove tests are converted as an example.

Change-Id: I2b2d98a9a1859015f62db38d62d9d5d7593435b4
Reviewed-on: https://go-review.googlesource.com/c/go/+/462895
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Paul Murphy <murp@ibm.com>
2023-01-24 22:55:18 +00:00
David Chase e22bd2348c internal/abi,runtime: refactor map constants into one place
Previously TryBot-tested with bucket bits = 4.
Also tested locally with bucket bits = 5.
This makes it much easier to change the size of map
buckets, and hopefully provides pointers to all the
code that in some way depends on details of map layout.

Change-Id: I9f6669d1eadd02f182d0bc3f959dc5f385fa1683
Reviewed-on: https://go-review.googlesource.com/c/go/+/462115
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Austin Clements <austin@google.com>
2023-01-23 15:51:32 +00:00
Jorropo 5c67ebbb31 cmd/compile: AMD64v3 remove unnecessary TEST comparision in isPowerOfTwo
With GOAMD64=V3 the canonical isPowerOfTwo function:
  func isPowerOfTwo(x uintptr) bool {
    return x&(x-1) == 0
  }

Used to compile to:
  temp := BLSR(x) // x&(x-1)
  flags = TEST(temp, temp)
  return flags.zf

However the blsr instruction already set ZF according to the result.
So we can remove the TEST instruction if we are just checking ZF.
Such as in multiple pieces of code around memory allocations.

This make the code smaller and faster.

Change-Id: Ia12d5a73aa3cb49188c0b647b1eff7b56c5a7b58
Reviewed-on: https://go-review.googlesource.com/c/go/+/448255
Run-TryBot: Jakub Ciolek <jakub@ciolek.dev>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2023-01-20 04:58:59 +00:00
Jorropo fc814056aa cmd/compile: rewrite empty makeslice to zerobase pointer
make\(\[\][a-zA-Z0-9]+, 0\) is seen 52 times in the go source.
And at least 391 times on internet:
https://grep.app/search?q=make%5C%28%5C%5B%5C%5D%5Ba-zA-Z0-9%5D%2B%2C%200%5C%29&regexp=true
This used to compile to calling runtime.makeslice.
However we can copy what we do for []T{}, just use a zerobase pointer.

On my machine this is 10x faster (from 3ns to 0.3ns).
Note that an empty loop also runs in 0.3ns,
so this really is free when you count superscallar execution.

Change-Id: I1cfe7e69f5a7a4dabbc71912ce6a4f8a2d4a7f3c
Reviewed-on: https://go-review.googlesource.com/c/go/+/454036
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Jakub Ciolek <jakub@ciolek.dev>
2023-01-20 04:57:35 +00:00
Keith Randall f959fb3872 cmd/compile: add anchored version of SP
The SPanchored opcode is identical to SP, except that it takes a memory
argument so that it (and more importantly, anything that uses it)
must be scheduled at or after that memory argument.

This opcode ensures that a LEAQ of a variable gets scheduled after the
corresponding VARDEF for that variable.

This may lead to less CSE of LEAQ operations. The effect is very small.
The go binary is only 80 bytes bigger after this CL. Usually LEAQs get
folded into load/store operations, so the effect is only for pointerful
types, large enough to need a duffzero, and have their address passed
somewhere. Even then, usually the CSEd LEAQs will be un-CSEd because
the two uses are on different sides of a function call and the LEAQ
ends up being rematerialized at the second use anyway.

Change-Id: Ib893562cd05369b91dd563b48fb83f5250950293
Reviewed-on: https://go-review.googlesource.com/c/go/+/452916
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Martin Möhrmann <martin@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2023-01-19 22:43:12 +00:00
Keith Randall 1eb0465fa5 cmd/compile: turn off jump tables when spectre retpolines are on
Fixes #57097

Change-Id: I6ab659abbca1ae0ac8710674d39aec116fab0baa
Reviewed-on: https://go-review.googlesource.com/c/go/+/455336
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
2022-12-06 05:12:12 +00:00
Paul E. Murphy dc6b7c86df cmd/compile: merge zero constant ISEL in PPC64 lateLower pass
Add a new SSA opcode ISELZ, similar to ISELB to represent a select
of value or 0. Then, merge candidate ISEL opcodes inside the late
lower pass.

This avoids complicating rules within the the lower pass.

Change-Id: I3b14c94b763863aadc834b0e910a85870c131313
Reviewed-on: https://go-review.googlesource.com/c/go/+/442596
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Joedian Reid <joedian@golang.org>
2022-11-14 19:44:47 +00:00
Wayne Zuo 268f4629df cmd/compile: enable brachelim pass on loong64
Change-Id: I4fd1c307901c265ab9865bf8a74460ddc15e5d14
Reviewed-on: https://go-review.googlesource.com/c/go/+/416735
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: xiaodong liu <teaofmoli@gmail.com>
Auto-Submit: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
2022-11-09 06:10:55 +00:00
Paul E. Murphy 390abbbbf1 codegen: check for PPC64 ISEL in condmove tests
ISEL is roughly equivalent to CMOV on PPC64. Verify ISEL generation
in all reasonable cases.

Note "ISEL test x y z" is the same as "ISEL !test y x z". test is
always one of LT (0), GT (1), EQ (2), SO (3). Sometimes x and y are
swapped if GE/LE/NE is desired.

Change-Id: Ie1bf029224064e004d855099731fe5e8d05aa990
Reviewed-on: https://go-review.googlesource.com/c/go/+/445215
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Than McIntosh <thanm@google.com>
2022-11-07 15:19:20 +00:00
Paul E. Murphy d031e9e07a cmd/compile/internal/ssa: re-adjust CarryChainTail scheduling priority
This needs to be as low as possible while not breaking priority
assumptions of other scores to correctly schedule carry chains.

Prior to the arm64 changes, it was set below ReadTuple. At the time,
this prevented the MulHiLo implementation on PPC64 from occluding
the scheduling of a full carry chain.

Memory scores can also prevent better scheduling, as can be observed
with crypto/internal/edwards25519/field.feMulGeneric.

Fixes #56497

Change-Id: Ia4b54e6dffcce584faf46b1b8d7cea18a3913887
Reviewed-on: https://go-review.googlesource.com/c/go/+/447435
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2022-11-03 19:59:19 +00:00
Keith Randall 9ce27feaeb cmd/compile: add rule for post-decomposed growslice optimization
The recently added rule only works before decomposing slices.
Add a rule that works after decomposing slices.

The reason we need the latter is because although the length may
be a constant, it can be hidden inside a slice that is not constant
(its pointer or capacity might be changing). By applying this
optimization after decomposing slices, we can find more cases
where it applies.

Fixes #56440

Change-Id: I0094e59eee3065ab4d210defdda8227a6e897420
Reviewed-on: https://go-review.googlesource.com/c/go/+/446277
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2022-10-31 21:40:49 +00:00
Keith Randall 0156b797e6 cmd/compile: recognize when the result of append has a constant length
Fixes a performance regression due to CL 418554.

Fixes #56440

Change-Id: I6ff152e9b83084756363f49ee6b0844a7a284880
Reviewed-on: https://go-review.googlesource.com/c/go/+/445875
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-10-27 17:09:50 +00:00
Wayne Zuo 90a3527427 cmd/compile: intrinsify Sub64 on loong64
This is a follow up of CL 420095  on loong64.

file                                    before    after     Δ       %
compile/internal/ssa.a                  35649482  35653274  +3792   +0.011%
compile/internal/ssagen.a               4099858   4098728   -1130   -0.028%
ecdh.a                                  227896    226896    -1000   -0.439%
internal/nistec/fiat.a                  1212254   1128184   -84070  -6.935%
tls.a                                   3256800   3256802   +2      +0.000%
big.a                                   1708518   1702496   -6022   -0.352%
bits.a                                  106762    105734    -1028   -0.963%
math.a                                  578762    577288    -1474   -0.255%
netip.a                                 555922    555610    -312    -0.056%
net.a                                   3286528   3286530   +2      +0.000%
golang.org/x/crypto/internal/poly1305.a 109546    107686    -1860   -1.698%
total                                   260392768 260299668 -93100  -0.036%

Change-Id: Ieffca705aae5666501f284502d986ca179dde494
Reviewed-on: https://go-review.googlesource.com/c/go/+/428557
Reviewed-by: Carlos Amedee <carlos@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
2022-10-07 18:16:26 +00:00
Wayne Zuo 97760ed651 cmd/compile: intrinsify Add64 on loong64
This is a follow up of CL 420094  on loong64.

Reduce go toolchain size slightly on linux/loong64.

compilecmp HEAD~1 -> HEAD
HEAD~1 (8a32354219): internal/trace: use strings.Builder
HEAD (1767784ac3): cmd/compile: intrinsify Add64 on loong64
platform: linux/loong64

file      before    after     Δ       %
addr2line 3882616   3882536   -80     -0.002%
api       5528866   5528450   -416    -0.008%
asm       5133780   5133796   +16     +0.000%
cgo       4668787   4668491   -296    -0.006%
compile   25163409  25164729  +1320   +0.005%
cover     4658055   4658007   -48     -0.001%
dist      3437783   3437727   -56     -0.002%
doc       3883069   3883205   +136    +0.004%
fix       3383254   3383070   -184    -0.005%
link      6747559   6747023   -536    -0.008%
nm        3793923   3793939   +16     +0.000%
objdump   4256628   4256812   +184    +0.004%
pack      2356328   2356144   -184    -0.008%
pprof     14233370  14131910  -101460 -0.713%
test2json 2638668   2638476   -192    -0.007%
trace     13392065  13360781  -31284  -0.234%
vet       7456388   7455588   -800    -0.011%
total     132498256 132364392 -133864 -0.101%

file                                    before    after     Δ       %
compile/internal/ssa.a                  35644590  35649482  +4892   +0.014%
compile/internal/ssagen.a               4101250   4099858   -1392   -0.034%
internal/edwards25519/field.a           226064    201718    -24346  -10.770%
internal/nistec/fiat.a                  1689922   1212254   -477668 -28.266%
tls.a                                   3256798   3256800   +2      +0.000%
big.a                                   1718552   1708518   -10034  -0.584%
bits.a                                  107786    106762    -1024   -0.950%
cmplx.a                                 169434    168214    -1220   -0.720%
math.a                                  581302    578762    -2540   -0.437%
netip.a                                 556096    555922    -174    -0.031%
net.a                                   3286526   3286528   +2      +0.000%
runtime.a                               8644786   8644510   -276    -0.003%
strconv.a                               519098    518374    -724    -0.139%
golang.org/x/crypto/internal/poly1305.a 115398    109546    -5852   -5.071%
total                                   260913122 260392768 -520354 -0.199%

Change-Id: I75b2bb7761fa5a0d0d032d4ebe3582d092ea77be
Reviewed-on: https://go-review.googlesource.com/c/go/+/428556
Reviewed-by: Carlos Amedee <carlos@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-10-07 18:16:10 +00:00
Wayne Zuo af668c689c cmd/compile: fold constant shift with extension on riscv64
For example:

  movb a0, a0
  srai $1, a0, a0

the assembler will expand to:

  slli $56, a0, a0
  srai $56, a0, a0
  srai $1, a0, a0

this CL optimize to:

  slli $56, a0, a0
  srai $57, a0, a0

Remove 270+ instructions from Go binary on linux/riscv64.

Change-Id: I375e19f9d3bd54f2781791d8cbe5970191297dc8
Reviewed-on: https://go-review.googlesource.com/c/go/+/428496
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-10-06 05:21:04 +00:00
eric fang ddc7d2a80c cmd/compile: add late lower pass for last rules to run
Usually optimization rules have corresponding priorities, some need to
be run first, some run next, and some run last, which produces the best
code. But currently our optimization rules have no priority, this CL
adds a late lower pass that runs those rules that need to be run at last,
such as split unreasonable constant folding. This pass can be seen as
the second round of the lower pass.

For example:
func foo(a, b uint64) uint64 {
        d := a+0x1234568
        d1 := b+0x1234568
        return d&d1
}
The code generated by the master branch:
	0x0004 00004        ADD     $19088744, R0, R2 // movz+movk+add
	0x0010 00016        ADD     $19088744, R1, R1 // movz+movk+add
	0x001c 00028        AND     R1, R2, R0

This is because the current constant folding optimization rules do not
take into account the range of constants, causing the constant to be
loaded repeatedly. This CL splits these unreasonable constants folding
in the late lower pass. With this CL the generated code:
	0x0004 00004        MOVD    $19088744, R2 // movz+movk
	0x000c 00012        ADD     R0, R2, R3
	0x0010 00016        ADD     R1, R2, R1
	0x0014 00020        AND     R1, R3, R0

This CL also adds constant folding optimization for ADDS instruction.

In addition, in order not to introduce the codegen regression, an
optimization rule is added to change the addition of a negative number
into a subtraction of a positive number.

go1 benchmarks:
name                     old time/op    new time/op    delta
BinaryTree17-8              1.22s ± 1%     1.24s ± 0%  +1.56%  (p=0.008 n=5+5)
Fannkuch11-8                1.54s ± 0%     1.53s ± 0%  -0.69%  (p=0.016 n=4+5)
FmtFprintfEmpty-8          14.1ns ± 0%    14.1ns ± 0%    ~     (p=0.079 n=4+5)
FmtFprintfString-8         26.0ns ± 0%    26.1ns ± 0%  +0.23%  (p=0.008 n=5+5)
FmtFprintfInt-8            32.3ns ± 0%    32.9ns ± 1%  +1.72%  (p=0.008 n=5+5)
FmtFprintfIntInt-8         54.5ns ± 0%    55.5ns ± 0%  +1.83%  (p=0.008 n=5+5)
FmtFprintfPrefixedInt-8    61.5ns ± 0%    62.0ns ± 0%  +0.93%  (p=0.008 n=5+5)
FmtFprintfFloat-8          72.0ns ± 0%    73.6ns ± 0%  +2.24%  (p=0.008 n=5+5)
FmtManyArgs-8               221ns ± 0%     224ns ± 0%  +1.22%  (p=0.008 n=5+5)
GobDecode-8                1.91ms ± 0%    1.93ms ± 0%  +0.98%  (p=0.008 n=5+5)
GobEncode-8                1.40ms ± 1%    1.39ms ± 0%  -0.79%  (p=0.032 n=5+5)
Gzip-8                      115ms ± 0%     117ms ± 1%  +1.17%  (p=0.008 n=5+5)
Gunzip-8                   19.4ms ± 1%    19.3ms ± 0%  -0.71%  (p=0.016 n=5+4)
HTTPClientServer-8         27.0µs ± 0%    27.3µs ± 0%  +0.80%  (p=0.008 n=5+5)
JSONEncode-8               3.36ms ± 1%    3.33ms ± 0%    ~     (p=0.056 n=5+5)
JSONDecode-8               17.5ms ± 2%    17.8ms ± 0%  +1.71%  (p=0.016 n=5+4)
Mandelbrot200-8            2.29ms ± 0%    2.29ms ± 0%    ~     (p=0.151 n=5+5)
GoParse-8                  1.35ms ± 1%    1.36ms ± 1%    ~     (p=0.056 n=5+5)
RegexpMatchEasy0_32-8      24.5ns ± 0%    24.5ns ± 0%    ~     (p=0.444 n=4+5)
RegexpMatchEasy0_1K-8       131ns ±11%     118ns ± 6%    ~     (p=0.056 n=5+5)
RegexpMatchEasy1_32-8      22.9ns ± 0%    22.9ns ± 0%    ~     (p=0.905 n=4+5)
RegexpMatchEasy1_1K-8       126ns ± 0%     127ns ± 0%    ~     (p=0.063 n=4+5)
RegexpMatchMedium_32-8      486ns ± 5%     483ns ± 0%    ~     (p=0.381 n=5+4)
RegexpMatchMedium_1K-8     15.4µs ± 1%    15.5µs ± 0%    ~     (p=0.151 n=5+5)
RegexpMatchHard_32-8        687ns ± 0%     686ns ± 0%    ~     (p=0.103 n=5+5)
RegexpMatchHard_1K-8       20.7µs ± 0%    20.7µs ± 1%    ~     (p=0.151 n=5+5)
Revcomp-8                   175ms ± 2%     176ms ± 3%    ~     (p=1.000 n=5+5)
Template-8                 20.4ms ± 6%    20.1ms ± 2%    ~     (p=0.151 n=5+5)
TimeParse-8                 112ns ± 0%     113ns ± 0%  +0.97%  (p=0.016 n=5+4)
TimeFormat-8                156ns ± 0%     145ns ± 0%  -7.14%  (p=0.029 n=4+4)

Change-Id: I3ced26e89041f873ac989586514ccc5ee09f13da
Reviewed-on: https://go-review.googlesource.com/c/go/+/425134
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Eric Fang <eric.fang@arm.com>
2022-10-05 02:40:56 +00:00
Keith Randall 6485e8f503 cmd/compile: use stricter rule for possible partial overlap
Partial overlaps can only happen for strict sub-pieces of larger arrays.
That's a much stronger condition than the current optimization rules.

Update #54467

Change-Id: I11e539b71099e50175f37ee78fddf69283f83ee5
Reviewed-on: https://go-review.googlesource.com/c/go/+/433056
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
2022-09-27 20:09:33 +00:00
eric fang cf53990b18 cmd/compile: Add some CMP and CMN optimization rules on arm64
This CL adds some optimizaion rules:
1, Converts CMP to CMN, or vice versa, when comparing with a negative
number.
2, For equal and not equal comparisons, CMP can be converted to CMN in
some cases. In theory we could do the same optimization for LT, LE, GT
and GE, but need to account for overflow, this CL doesn't handle them.

There are no noticeable performance changes.

Change-Id: Ia49266c019ab7908ebc9510c2f02e121b1607869
Reviewed-on: https://go-review.googlesource.com/c/go/+/429795
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Eric Fang <eric.fang@arm.com>
2022-09-20 01:14:40 +00:00
Joel Sing a7bcc94719 cmd/compile: resolve known outcomes for SLTI/SLTIU on riscv64
When SLTI/SLTIU is used with ANDI/ORI, it may be possible to determine the
outcome based on the values of the immediates. Resolve these cases.

Improves code generation for various shift operations.

While here, sort tests by architecture to improve readability and ease
future maintenance.

Change-Id: I87e71e016a0e396a928e7d6389a2df61583dfd8d
Reviewed-on: https://go-review.googlesource.com/c/go/+/428217
Reviewed-by: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Jenny Rakoczy <jenny@golang.org>
Reviewed-by: Jenny Rakoczy <jenny@golang.org>
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Jenny Rakoczy <jenny@golang.org>
2022-09-17 17:17:52 +00:00
ruinan 454a058ffc cmd/compile: Add shiftIsBounded check for logic shifts of arm64
This CL adds shiftIsBounded checks for the Lsh* and Rsh* rules in arm64.
There is no need to check the shift value again with CMP + CSEL when the
shift value is valid.

Change-Id: I54620de64f02a1b5a11089add237248ae2de01b4
Reviewed-on: https://go-review.googlesource.com/c/go/+/417714
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-09-07 20:10:13 +00:00
Keith Randall 5b1fbfba1c cmd/compile: rewrite >>c<<c to &^(1<<c-1)
Fixes #54496

Change-Id: I3c2ed8cd55836d5b07c8cdec00d3b584885aca79
Reviewed-on: https://go-review.googlesource.com/c/go/+/424856
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Martin Möhrmann <martin@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Martin Möhrmann <martin@golang.org>
2022-09-02 18:51:37 +00:00
Matthew Dempsky 34f0029a85 cmd/compile/internal/noder: allow OCONVNOP for identical iface conversions
In go.dev/cl/421821, I included a hack to force OCONVNOP back to
OCONVIFACE for conversions involving shape types and non-empty
interfaces. The comment correctly noted that this was only needed for
conversions between non-identical types, but the code was conservative
and applied to even conversions between identical types.

This CL adds an extra bool to record whether the conversion is between
identical types, so we can keep OCONVNOP instead of forcing back to
OCONVIFACE. This has a small improvement to generated code, because we
no longer need a convI2I call (as demonstrated by codegen/ifaces.go).

But more usefully, this is relevant to pruning unnecessary itab slots
in runtime dictionaries (next CL).

Change-Id: I94f89e961cd26629b925037fea58d283140766ff
Reviewed-on: https://go-review.googlesource.com/c/go/+/427678
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-09-02 18:26:02 +00:00
Derek Parker 6605686e3b cmd/compile: new inline heuristic for struct compares
This CL changes the heuristic used to determine whether we can inline a
struct equality check or if we must generate a function and call that
function for equality.

The old method was to count struct fields, but this can lead to poor
in lining decisions. We should really be determining the cost of the
equality check and use that to determine if we should inline or generate
a function.

The new benchmark provided in this CL returns the following when compared
against tip:

```
name         old time/op  new time/op  delta
EqStruct-32  2.46ns ± 4%  0.25ns ±10%  -89.72%  (p=0.000 n=39+39)
```

Fixes #38494

Change-Id: Ie06b80a2b2a03a3fd0978bcaf7715f9afb66e0ab
GitHub-Last-Rev: e9a18d9389
GitHub-Pull-Request: golang/go#53326
Reviewed-on: https://go-review.googlesource.com/c/go/+/411674
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-09-02 17:48:22 +00:00
ruinan 121344ac33 cmd/compile: optimize RotateLeft8/16 on arm64
This CL optimizes RotateLeft8/16 on arm64.

For 16 bits, we form a 32 bits register by duplicating two 16 bits
registers, then use RORW instruction to do the rotate shift.

For 8 bits, we just use LSR and LSL instead of RORW because the code is
simpler.

Benchmark          Old          ThisCL       delta
RotateLeft8-46     2.16 ns/op   1.73 ns/op   -19.70%
RotateLeft16-46    2.16 ns/op   1.54 ns/op   -28.53%

Change-Id: I09cde4383d12e31876a57f8cdfd3bb4f324fadb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/420976
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
2022-09-02 17:46:31 +00:00
ruinan 54c7bc9cff cmd/compile: optimize shift ops on arm64 when the shift value is v&63
For the following code case:

  var x uint64
  x >> (shift & 63)

We can directly genereta `x >> shift` on arm64, since the hardware will
only use the bottom 6 bits of the shift amount.

Benchmark               old time/op  new time/op    delta
ShiftArithmeticRight-8  0.40ns       0.31ns        -21.7%

Change-Id: Id58c8a5b2f6dd5c30c3876f4a36e11b4d81e2dc9
Reviewed-on: https://go-review.googlesource.com/c/go/+/425294
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-09-02 17:44:29 +00:00
Keith Randall 33a7e5a4b4 cmd/compile: combine multiple rotate instructions
Rotating by c, then by d, is the same as rotating by c+d.

Change-Id: I36df82261460ff80f7c6d39bcdf0e840cef1c91a
Reviewed-on: https://go-review.googlesource.com/c/go/+/424894
Reviewed-by: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Ruinan Sun <Ruinan.Sun@arm.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-08-31 22:10:52 +00:00
Keith Randall 69aed4712d cmd/compile: use better splitting condition for string binary search
Currently we use a full cmpstring to do the comparison for each
split in the binary search for a string switch.

Instead, split by comparing a single byte of the input string with a
constant. That will give us a much faster split (although it might be
not quite as good a split).

Fixes #53333

R=go1.20

Change-Id: I28c7209342314f367071e4aa1f2beb6ec9ff7123
Reviewed-on: https://go-review.googlesource.com/c/go/+/414894
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-08-31 22:08:26 +00:00
Wayne Zuo da6556968f cmd/compile: simplify bounded shift on riscv64
The prove pass will mark some shifts bounded, and then we can use that
information to generate better code on riscv64.

Change-Id: Ia22f43d0598453c9417adac7017db28d7240948b
Reviewed-on: https://go-review.googlesource.com/c/go/+/422616
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-08-31 20:21:00 +00:00
Wayne Zuo e8f0340fa4 cmd/compile: intrinsify RotateLeft{32,64} on loong64
Benchmark on crypto/sha256 (provided by Xiaodong Liu):
name               old time/op    new time/op    delta
Hash8Bytes/New       1.19µs ± 0%    0.97µs ± 0%  -18.75%  (p=0.000 n=9+9)
Hash8Bytes/Sum224    1.21µs ± 0%    0.97µs ± 0%  -20.04%  (p=0.000 n=9+10)
Hash8Bytes/Sum256    1.21µs ± 0%    0.98µs ± 0%  -19.16%  (p=0.000 n=10+7)
Hash1K/New           15.9µs ± 0%    12.4µs ± 0%  -22.10%  (p=0.000 n=10+10)
Hash1K/Sum224        15.9µs ± 0%    12.4µs ± 0%  -22.18%  (p=0.000 n=8+10)
Hash1K/Sum256        15.9µs ± 0%    12.4µs ± 0%  -22.15%  (p=0.000 n=10+9)
Hash8K/New            119µs ± 0%      92µs ± 0%  -22.40%  (p=0.000 n=10+9)
Hash8K/Sum224         119µs ± 0%      92µs ± 0%  -22.41%  (p=0.000 n=9+10)
Hash8K/Sum256         119µs ± 0%      92µs ± 0%  -22.40%  (p=0.000 n=9+9)

name               old speed      new speed      delta
Hash8Bytes/New     6.70MB/s ± 0%  8.25MB/s ± 0%  +23.13%  (p=0.000 n=10+10)
Hash8Bytes/Sum224  6.60MB/s ± 0%  8.26MB/s ± 0%  +25.06%  (p=0.000 n=10+10)
Hash8Bytes/Sum256  6.59MB/s ± 0%  8.15MB/s ± 0%  +23.67%  (p=0.000 n=10+7)
Hash1K/New         64.3MB/s ± 0%  82.5MB/s ± 0%  +28.36%  (p=0.000 n=10+10)
Hash1K/Sum224      64.3MB/s ± 0%  82.6MB/s ± 0%  +28.51%  (p=0.000 n=10+10)
Hash1K/Sum256      64.3MB/s ± 0%  82.6MB/s ± 0%  +28.46%  (p=0.000 n=9+9)
Hash8K/New         69.0MB/s ± 0%  89.0MB/s ± 0%  +28.87%  (p=0.000 n=10+8)
Hash8K/Sum224      69.0MB/s ± 0%  89.0MB/s ± 0%  +28.88%  (p=0.000 n=9+10)
Hash8K/Sum256      69.0MB/s ± 0%  88.9MB/s ± 0%  +28.87%  (p=0.000 n=8+9)

Benchmark on crypto/sha512 (provided by Xiaodong Liu):
name               old time/op    new time/op     delta
Hash8Bytes/New       1.55µs ± 0%     1.31µs ± 0%  -15.67%  (p=0.000 n=10+10)
Hash8Bytes/Sum384    1.59µs ± 0%     1.35µs ± 0%  -14.97%  (p=0.000 n=10+10)
Hash8Bytes/Sum512    1.62µs ± 0%     1.39µs ± 0%  -14.02%  (p=0.000 n=10+10)
Hash1K/New           10.7µs ± 0%      8.6µs ± 0%  -19.60%  (p=0.000 n=8+8)
Hash1K/Sum384        10.8µs ± 0%      8.7µs ± 0%  -19.40%  (p=0.000 n=9+9)
Hash1K/Sum512        10.8µs ± 0%      8.7µs ± 0%  -19.35%  (p=0.000 n=9+10)
Hash8K/New           74.6µs ± 0%     59.6µs ± 0%  -20.08%  (p=0.000 n=10+9)
Hash8K/Sum384        74.7µs ± 0%     59.7µs ± 0%  -20.04%  (p=0.000 n=9+8)
Hash8K/Sum512        74.7µs ± 0%     59.7µs ± 0%  -20.01%  (p=0.000 n=10+10)

name               old speed      new speed       delta
Hash8Bytes/New     5.16MB/s ± 0%   6.12MB/s ± 0%  +18.60%  (p=0.000 n=10+8)
Hash8Bytes/Sum384  5.02MB/s ± 0%   5.90MB/s ± 0%  +17.56%  (p=0.000 n=10+10)
Hash8Bytes/Sum512  4.94MB/s ± 0%   5.74MB/s ± 0%  +16.29%  (p=0.000 n=10+9)
Hash1K/New         95.4MB/s ± 0%  118.6MB/s ± 0%  +24.38%  (p=0.000 n=10+10)
Hash1K/Sum384      95.0MB/s ± 0%  117.9MB/s ± 0%  +24.06%  (p=0.000 n=8+9)
Hash1K/Sum512      94.8MB/s ± 0%  117.5MB/s ± 0%  +23.99%  (p=0.000 n=8+9)
Hash8K/New          110MB/s ± 0%    137MB/s ± 0%  +25.11%  (p=0.000 n=9+6)
Hash8K/Sum384       110MB/s ± 0%    137MB/s ± 0%  +25.07%  (p=0.000 n=9+8)
Hash8K/Sum512       110MB/s ± 0%    137MB/s ± 0%  +25.01%  (p=0.000 n=10+10)

Change-Id: I28ccfce634659305a336c8e0a3f8589f7361d661
Reviewed-on: https://go-review.googlesource.com/c/go/+/422317
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: David Chase <drchase@google.com>
2022-08-30 03:21:44 +00:00
Wayne Zuo a6219737e3 cmd/compile: intrinsify Sub64 on riscv64
After this CL, the performance difference in crypto/elliptic
benchmarks on linux/riscv64 are:

name                 old time/op    new time/op    delta
ScalarBaseMult/P256    1.64ms ± 1%    1.60ms ± 1%   -2.36%  (p=0.008 n=5+5)
ScalarBaseMult/P224    1.53ms ± 1%    1.47ms ± 2%   -4.24%  (p=0.008 n=5+5)
ScalarBaseMult/P384    5.12ms ± 2%    5.03ms ± 2%     ~     (p=0.095 n=5+5)
ScalarBaseMult/P521    22.3ms ± 2%    13.8ms ± 1%  -37.89%  (p=0.008 n=5+5)
ScalarMult/P256        4.49ms ± 2%    4.26ms ± 2%   -5.13%  (p=0.008 n=5+5)
ScalarMult/P224        4.33ms ± 1%    4.09ms ± 1%   -5.59%  (p=0.008 n=5+5)
ScalarMult/P384        16.3ms ± 1%    15.5ms ± 2%   -4.78%  (p=0.008 n=5+5)
ScalarMult/P521         101ms ± 0%      47ms ± 2%  -53.36%  (p=0.008 n=5+5)

Change-Id: I31cf0506e27f9d85f576af1813630a19c20dda8a
Reviewed-on: https://go-review.googlesource.com/c/go/+/420095
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-27 05:43:59 +00:00
Wayne Zuo 969f48a3a2 cmd/compile: intrinsify Add64 on riscv64
According to RISCV instruction set manual v2.2 Sec 2.4, we can
implement overflowing check for unsigned addition cheaply using
SLTU instructions.

After this CL, the performance difference in crypto/elliptic
benchmarks on linux/riscv64 are:

name                 old time/op    new time/op    delta
ScalarBaseMult/P256    1.93ms ± 1%    1.64ms ± 1%  -14.96%  (p=0.008 n=5+5)
ScalarBaseMult/P224    1.80ms ± 2%    1.53ms ± 1%  -14.89%  (p=0.008 n=5+5)
ScalarBaseMult/P384    6.15ms ± 2%    5.12ms ± 2%  -16.73%  (p=0.008 n=5+5)
ScalarBaseMult/P521    25.9ms ± 1%    22.3ms ± 2%  -13.78%  (p=0.008 n=5+5)
ScalarMult/P256        5.59ms ± 1%    4.49ms ± 2%  -19.79%  (p=0.008 n=5+5)
ScalarMult/P224        5.42ms ± 1%    4.33ms ± 1%  -20.01%  (p=0.008 n=5+5)
ScalarMult/P384        19.9ms ± 2%    16.3ms ± 1%  -18.15%  (p=0.008 n=5+5)
ScalarMult/P521        97.3ms ± 1%   100.7ms ± 0%   +3.48%  (p=0.008 n=5+5)

Change-Id: Ic4c82ced4b072a4a6575343fa9f29dd09b0cabc4
Reviewed-on: https://go-review.googlesource.com/c/go/+/420094
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-27 05:43:32 +00:00
Wayne Zuo b60432df14 cmd/compile: deadcode for LoweredMuluhilo on riscv64
This is a follow up of CL 425101 on RISCV64.

According to RISCV Volume 1, Unprivileged Spec v. 20191213 Chapter 7.1:
If both the high and low bits of the same product are required, then the
recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2
(source register specifiers must be in same order and rdh cannot be the
same as rs1 or rs2). Microarchitectures can then fuse these into a single
multiply operation instead of performing two separate multiplies.

So we should not split Muluhilo to separate instructions.

Updates #54607

Change-Id: If47461f3aaaf00e27cd583a9990e144fb8bcdb17
Reviewed-on: https://go-review.googlesource.com/c/go/+/425203
Auto-Submit: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-08-24 18:08:33 +00:00
Keith Randall 60ad3c48f5 cmd/compile: move SSA rotate instruction detection to arch-independent rules
Detect rotate instructions while still in architecture-independent form.
It's easier to do here, and we don't need to repeat it in each
architecture file.

Change-Id: I9396954b3f3b3bfb96c160d064a02002309935bb
Reviewed-on: https://go-review.googlesource.com/c/go/+/421195
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Eric Fang <eric.fang@arm.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Joedian Reid <joedian@golang.org>
Reviewed-by: Ruinan Sun <Ruinan.Sun@arm.com>
Run-TryBot: Keith Randall <khr@golang.org>
2022-08-23 21:24:14 +00:00
eric fang 9f0f87c806 cmd/internal/obj/arm64: remove the transition from $0 to ZR
Previously we convert $0 to the ZR register for some reasons, which causes
two problems:
1. Confusion, the special case of the ZR register needs to be considered
when dealing with constants. For encoding, some places we encode ZR, and
some places we encode $0, although we have converted $0 to ZR.
2. Unexpected instruction format. All instructions that support ZR register
operands can be replaced by $0.

This patch removes this conversion. Note that this patch may cause previously
unintendedly supported instruction formats to no longer be supported.

Change-Id: I3d8d2c06711b7614a38191397da7776417f1861c
Reviewed-on: https://go-review.googlesource.com/c/go/+/404316
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Eric Fang <eric.fang@arm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-23 06:11:32 +00:00
eric fang 0a52d80666 cmd/compile/internal/ssa: optimize memory moving on arm64
This CL optimizes memory moving with LDP and STP on arm64.

Benchmarks:
name              old time/op  new time/op  delta
ClearFat7-160     1.08ns ± 0%  0.95ns ± 0%  -11.41%  (p=0.008 n=5+5)
ClearFat8-160     0.84ns ± 0%  0.84ns ± 0%   -0.95%  (p=0.008 n=5+5)
ClearFat11-160    1.08ns ± 0%  0.95ns ± 0%  -11.46%  (p=0.008 n=5+5)
ClearFat12-160    0.95ns ± 0%  0.95ns ± 0%     ~     (p=0.063 n=4+5)
ClearFat13-160    1.08ns ± 0%  0.95ns ± 0%  -11.45%  (p=0.008 n=5+5)
ClearFat14-160    1.08ns ± 0%  0.95ns ± 0%  -11.47%  (p=0.008 n=5+5)
ClearFat15-160    1.24ns ± 0%  0.95ns ± 0%  -22.98%  (p=0.029 n=4+4)
ClearFat16-160    0.84ns ± 0%  0.83ns ± 0%   -0.11%  (p=0.008 n=5+5)
ClearFat24-160    2.15ns ± 0%  2.15ns ± 0%     ~     (all equal)
ClearFat32-160    2.86ns ± 0%  2.86ns ± 0%     ~     (p=0.333 n=5+4)
ClearFat40-160    2.15ns ± 0%  2.15ns ± 0%     ~     (all equal)
ClearFat48-160    3.32ns ± 1%  3.31ns ± 1%     ~     (p=0.690 n=5+5)
ClearFat56-160    2.15ns ± 0%  2.15ns ± 0%     ~     (all equal)
ClearFat64-160    3.25ns ± 1%  3.26ns ± 1%     ~     (p=0.841 n=5+5)
ClearFat72-160    2.22ns ± 0%  2.22ns ± 0%     ~     (p=0.444 n=5+5)
ClearFat128-160   4.03ns ± 0%  4.04ns ± 0%   +0.32%  (p=0.008 n=5+5)
ClearFat256-160   6.44ns ± 0%  6.44ns ± 0%   +0.08%  (p=0.016 n=4+5)
ClearFat512-160   12.2ns ± 0%  12.2ns ± 0%   +0.13%  (p=0.008 n=5+5)
ClearFat1024-160  24.3ns ± 0%  24.3ns ± 0%     ~     (p=0.167 n=5+5)
ClearFat1032-160  24.5ns ± 0%  24.5ns ± 0%     ~     (p=0.238 n=4+5)
ClearFat1040-160  29.2ns ± 0%  29.3ns ± 0%   +0.34%  (p=0.008 n=5+5)
CopyFat7-160      1.43ns ± 0%  1.07ns ± 0%  -24.97%  (p=0.008 n=5+5)
CopyFat8-160      0.89ns ± 0%  0.89ns ± 0%     ~     (p=0.238 n=5+5)
CopyFat11-160     1.43ns ± 0%  1.07ns ± 0%  -24.97%  (p=0.008 n=5+5)
CopyFat12-160     1.07ns ± 0%  1.07ns ± 0%     ~     (p=0.238 n=5+4)
CopyFat13-160     1.43ns ± 0%  1.07ns ± 0%     ~     (p=0.079 n=4+5)
CopyFat14-160     1.43ns ± 0%  1.07ns ± 0%  -24.95%  (p=0.008 n=5+5)
CopyFat15-160     1.79ns ± 0%  1.07ns ± 0%     ~     (p=0.079 n=4+5)
CopyFat16-160     1.07ns ± 0%  1.07ns ± 0%     ~     (p=0.444 n=5+5)
CopyFat24-160     1.84ns ± 2%  1.67ns ± 0%   -9.28%  (p=0.008 n=5+5)
CopyFat32-160     3.22ns ± 0%  2.92ns ± 0%   -9.40%  (p=0.008 n=5+5)
CopyFat64-160     3.64ns ± 0%  3.57ns ± 0%   -1.96%  (p=0.008 n=5+5)
CopyFat72-160     3.56ns ± 0%  3.11ns ± 0%  -12.89%  (p=0.008 n=5+5)
CopyFat128-160    5.06ns ± 0%  5.06ns ± 0%   +0.04%  (p=0.048 n=5+5)
CopyFat256-160    9.13ns ± 0%  9.13ns ± 0%     ~     (p=0.659 n=5+5)
CopyFat512-160    17.4ns ± 0%  17.4ns ± 0%     ~     (p=0.167 n=5+5)
CopyFat520-160    17.2ns ± 0%  17.3ns ± 0%   +0.37%  (p=0.008 n=5+5)
CopyFat1024-160   34.1ns ± 0%  34.0ns ± 0%     ~     (p=0.127 n=5+5)
CopyFat1032-160   80.9ns ± 0%  34.2ns ± 0%  -57.74%  (p=0.008 n=5+5)
CopyFat1040-160   94.4ns ± 0%  41.7ns ± 0%  -55.78%  (p=0.016 n=5+4)

Change-Id: I14186f9f82b0ecf8b6c02191dc5da566b9a21e6c
Reviewed-on: https://go-review.googlesource.com/c/go/+/421654
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Eric Fang <eric.fang@arm.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-23 03:09:07 +00:00
Cherry Mui 8bf9e01473 cmd/compile: split Muluhilo op on ARM64
On ARM64 we use two separate instructions to compute the hi and lo
results of a 64x64->128 multiplication. Lower to two separate ops
so if only one result is needed we can deadcode the other.

Fixes #54607.

Change-Id: Ib023e77eb2b2b0bcf467b45471cb8a294bce6f90
Reviewed-on: https://go-review.googlesource.com/c/go/+/425101
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2022-08-22 21:29:31 +00:00
Keith Randall 908499adec cmd/compile: stop using VARKILL
With the introduction of stack objects, VARKILL information is
no longer needed.

With stack objects, an object is dead when there are no more static
references to it, and the stack scanner can't find any live pointers
to it. VARKILL information isn't used to establish live ranges for
address-taken variables any more. In effect, the last static reference
*is* the VARKILL, and there's an additional dynamic liveness check
during stack scanning.

Next CL will actually rip out the VARKILL opcodes.

Change-Id: I030a2ab867445cf4e0e69397911f8a2e2f0ed07b
Reviewed-on: https://go-review.googlesource.com/c/go/+/419234
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
2022-08-18 17:36:38 +00:00
Archana R d09c6ac417 test/codegen: updated multiple tests to verify on ppc64,ppc64le
Updated multiple tests in test/codegen: math.go, mathbits.go, shift.go
and slices.go to verify on ppc64/ppc64le as well

Change-Id: Id88dd41569b7097819fb4d451b615f69cf7f7a94
Reviewed-on: https://go-review.googlesource.com/c/go/+/412115
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-08-17 13:56:55 +00:00
Wayne Zuo 09932f95f5 cmd/compile: combine more constant stores on amd64
Fixes #53324

Change-Id: I06149d860f858b082235e9d80bf0ea494679b386
Reviewed-on: https://go-review.googlesource.com/c/go/+/411614
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-08-15 16:19:21 +00:00
eric fang efe5929dbd cmd/compile/internal/ssa: optimize ARM64 code with TST
For signed comparisons, the following four optimization rules hold:

(CMPconst [0] z:(AND x y)) && z.Uses == 1 => (TST x y)
(CMPWconst [0] z:(AND x y)) && z.Uses == 1 => (TSTW x y)
(CMPconst [0] x:(ANDconst [c] y)) && x.Uses == 1 => (TSTconst [c] y)
(CMPWconst [0] x:(ANDconst [c] y)) && x.Uses == 1 => (TSTWconst [int32(c)] y)

But currently they only apply to jump instructions, not to conditional
instructions within a block, such as cset, csel, etc. This CL extends
the above rules into blocks so that conditional instructions can also be
optimized.

name                       old time/op  new time/op  delta
DivisiblePow2constI64-160  1.04ns ± 0%  0.86ns ± 0%  -17.30%  (p=0.008 n=5+5)
DivisiblePow2constI32-160  1.04ns ± 0%  0.87ns ± 0%  -16.16%  (p=0.016 n=4+5)
DivisiblePow2constI16-160  1.04ns ± 0%  0.87ns ± 0%  -16.03%  (p=0.008 n=5+5)
DivisiblePow2constI8-160   1.04ns ± 0%  0.86ns ± 0%  -17.15%  (p=0.008 n=5+5)

Change-Id: I6bc34bff30862210e8dd001e0340b8fe502fe3de
Reviewed-on: https://go-review.googlesource.com/c/go/+/420434
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Eric Fang <eric.fang@arm.com>
2022-08-10 02:13:53 +00:00
Cuong Manh Le 0f8dffd0aa all: use ":" for compiler generated symbols
As it can't appear in user package paths.

There is a hack for handling "go:buildid" and "type:*" on windows/386.

Previously, windows/386 requires underscore prefix on external symbols,
but that's only applied for SHOSTOBJ/SUNDEFEXT or cgo export symbols.
"go.buildid" is STEXT, "type.*" is STYPE, thus they are not prefixed
with underscore.

In external linking mode, the external linker can't resolve them as
external symbols. But we are lucky that they have "." in their name,
so the external linker see them as Forwarder RVA exports. See:

 - https://docs.microsoft.com/en-us/windows/win32/debug/pe-format#export-address-table
 - https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=ld/pe-dll.c;h=e7b82ba6ffadf74dc1b9ee71dc13d48336941e51;hb=HEAD#l972)

This CL changes "." to ":" in symbols name, so theses symbols can not be
found by external linker anymore. So a hacky way is adding the
underscore prefix for these 2 symbols. I don't have enough knowledge to
verify whether adding the underscore for all STEXT/STYPE symbols are
fine, even if it could be, that would be done in future CL.

Fixes #37762

Change-Id: I92eaaf24c0820926a36e0530fdb07b07af1fcc35
Reviewed-on: https://go-review.googlesource.com/c/go/+/317917
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-09 11:28:56 +00:00
Lynn Boger c1bfefe9d1 cmd/compile: fix confusion with ANDCCconst in PPC64 rules
Currently there is a an ANDconst and an ANDCCconst op in PPC64,
which is confusing since they map onto the same instruction.
One of these ops sets the result of the AND operation, and the
other sets the flag (condition register).

This converts ANDCCconst into an op with the 2 expected results:
the integer result of the AND and the flag setting. The ANDconst
op has been removed.

Note that in the PPC64 ISA the only variation of the 'and immediate'
is the one that sets the condition bit, which probably led to the
original (confusing) implementation.

This also adds a few rules to improve the use of ANDCCconst with
ISELB and some testcases to verify those improvements.

Change-Id: I523703fa4da2098eb995dc3ba744d36fa28e41d4
Reviewed-on: https://go-review.googlesource.com/c/go/+/422015
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
2022-08-08 20:15:55 +00:00
Keith Randall c2a9c55823 cmd/compile: optimize unsafe.Slice generated code
We don't need a multiply when the element type is size 0 or 1.

The panic functions don't return, so we don't need any post-call
code (register restores, etc.).

Change-Id: I0dcea5df56d29d7be26554ddca966b3903c672e5
Reviewed-on: https://go-review.googlesource.com/c/go/+/419754
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
2022-08-08 17:36:47 +00:00
cuiweixie e7307034cc cmd/compile: store combine on amd64
Fixes #54120

Change-Id: I6915b6e8d459d9becfdef4fdcba95ee4dea6af05
GitHub-Last-Rev: 03f19942c7
GitHub-Pull-Request: golang/go#54126
Reviewed-on: https://go-review.googlesource.com/c/go/+/420115
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
2022-08-08 16:40:58 +00:00
Matthew Dempsky ab8d7dd75e cmd/compile: set LocalPkg.Path to -p flag
Since CL 391014, cmd/compile now requires the -p flag to be set the
build system. This CL changes it to initialize LocalPkg.Path to the
provided path, rather than relying on writing out `"".` into object
files and expecting cmd/link to substitute them.

However, this actually involved a rather long tail of fixes. Many have
already been submitted, but a few notable ones that have to land
simultaneously with changing LocalPkg:

1. When compiling package runtime, there are really two "runtime"
packages: types.LocalPkg (the source package itself) and
ir.Pkgs.Runtime (the compiler's internal representation, for synthetic
references). Previously, these ended up creating separate link
symbols (`"".xxx` and `runtime.xxx`, respectively), but now they both
end up as `runtime.xxx`, which causes lsym collisions (notably
inittask and funcsyms).

2. test/codegen tests need to be updated to expect symbols to be named
`command-line-arguments.xxx` rather than `"".foo`.

3. The issue20014 test case is sensitive to the sort order of field
tracking symbols. In particular, the local package now sorts to its
natural place in the list, rather than to the front.

Thanks to David Chase for helping track down all of the fixes needed
for this CL.

Updates #51734.

Change-Id: Iba3041cf7ad967d18c6e17922fa06ba11798b565
Reviewed-on: https://go-review.googlesource.com/c/go/+/393715
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
2022-05-16 18:19:47 +00:00
Cherry Mui 540f8c2b50 cmd/compile: use jump table on ARM64
Following CL 357330, use jump tables on ARM64.

name                         old time/op  new time/op  delta
Switch8Predictable-4         3.41ns ± 0%  3.21ns ± 0%     ~     (p=0.079 n=4+5)
Switch8Unpredictable-4       12.0ns ± 0%   9.5ns ± 0%  -21.17%  (p=0.000 n=5+4)
Switch32Predictable-4        3.06ns ± 0%  2.82ns ± 0%   -7.78%  (p=0.008 n=5+5)
Switch32Unpredictable-4      13.3ns ± 0%   9.5ns ± 0%  -28.87%  (p=0.016 n=4+5)
SwitchStringPredictable-4    3.71ns ± 0%  3.21ns ± 0%  -13.43%  (p=0.000 n=5+4)
SwitchStringUnpredictable-4  14.8ns ± 0%  15.1ns ± 0%   +2.37%  (p=0.008 n=5+5)

Change-Id: Ia0b85df7ca9273cf70c05eb957225c6e61822fa6
Reviewed-on: https://go-review.googlesource.com/c/go/+/403979
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
2022-05-13 19:51:03 +00:00
Paul E. Murphy 0c43878baa cmd/compile: lower Add64/Sub64 into ssa on PPC64
math/bits.Add64 and math/bits.Sub64 now lower and optimize
directly in SSA form.

The optimization of carry chains focuses around eliding
XER<->GPR transfers of the CA bit when used exclusively as an
input to a single carry operations, or when the CA value is
known.

This also adds support for handling XER spills in the assembler
which could happen if carry chains contain inter-dependencies
on each other (which seems very unlikely with practical usage),
or a clobber happens (SRAW/SRAD/SUBFC operations clobber CA).

With PPC64 Add64/Sub64 lowering into SSA and this patch, the net
performance difference in crypto/elliptic benchmarks on P9/ppc64le
are:

name                                old time/op    new time/op    delta
ScalarBaseMult/P256                   46.3µs ± 0%    46.9µs ± 0%   +1.34%
ScalarBaseMult/P224                    356µs ± 0%     209µs ± 0%  -41.14%
ScalarBaseMult/P384                   1.20ms ± 0%    0.57ms ± 0%  -52.14%
ScalarBaseMult/P521                   3.38ms ± 0%    1.44ms ± 0%  -57.27%
ScalarMult/P256                        199µs ± 0%     199µs ± 0%   -0.17%
ScalarMult/P224                        357µs ± 0%     212µs ± 0%  -40.56%
ScalarMult/P384                       1.20ms ± 0%    0.58ms ± 0%  -51.86%
ScalarMult/P521                       3.37ms ± 0%    1.44ms ± 0%  -57.32%
MarshalUnmarshal/P256/Uncompressed    2.59µs ± 0%    2.52µs ± 0%   -2.63%
MarshalUnmarshal/P256/Compressed      2.58µs ± 0%    2.52µs ± 0%   -2.06%
MarshalUnmarshal/P224/Uncompressed    1.54µs ± 0%    1.40µs ± 0%   -9.42%
MarshalUnmarshal/P224/Compressed      1.54µs ± 0%    1.39µs ± 0%   -9.87%
MarshalUnmarshal/P384/Uncompressed    2.40µs ± 0%    1.80µs ± 0%  -24.93%
MarshalUnmarshal/P384/Compressed      2.35µs ± 0%    1.81µs ± 0%  -23.03%
MarshalUnmarshal/P521/Uncompressed    3.79µs ± 0%    2.58µs ± 0%  -31.81%
MarshalUnmarshal/P521/Compressed      3.80µs ± 0%    2.60µs ± 0%  -31.67%

Note, P256 uses an asm implementation, thus, little variation is expected.

Change-Id: I88a24f6bf0f4f285c649e40243b1ab69cc452b71
Reviewed-on: https://go-review.googlesource.com/c/go/+/346870
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-05-10 20:03:53 +00:00
Jorropo e1e056fa6a cmd/compile: fold constants found by prove
It is hit ~70k times building go.
This make the go binary, 0.04% smaller.
I didn't included benchmarks because this is just constant foldings
and is hard to mesure objectively.

For example, this enable rewriting things like:
  if x == 20 {
    return x + 30 + z
  }

Into:
  if x == 20 {
    return 50 + z
  }

It's not just fixing programer's code,
the ssa generator generate code like this sometimes.

Change-Id: I0861f342b27f7227b5f1c34d8267fa0057b1bbbc
GitHub-Last-Rev: 4c2f9b5216
GitHub-Pull-Request: golang/go#52669
Reviewed-on: https://go-review.googlesource.com/c/go/+/403735
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2022-05-04 20:30:17 +00:00
Paul E. Murphy c570f0eda2 cmd/compile: combine OR + NOT into ORN on PPC64
This shows up in a few crypto functions, and other
assorted places.

Change-Id: I5a7f4c25ddd4a6499dc295ef693b9fe43d2448ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/404057
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
2022-05-04 18:49:50 +00:00
Cuong Manh Le 0668e3cb1a cmd/compile: support pointers to arrays in arrayClear
Fixes #52635

Change-Id: I85f182931e30292983ef86c55a0ab6e01282395c
Reviewed-on: https://go-review.googlesource.com/c/go/+/403337
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-05-03 05:42:48 +00:00
Keith Randall c4b2288755 cmd/compile: add jump table codegen test
Change-Id: Ic67f676f5ebe146166a0d3c1d78a802881320e49
Reviewed-on: https://go-review.googlesource.com/c/go/+/400375
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2022-04-14 21:16:29 +00:00
Wayne Zuo 66f03f79da cmd/compile: add SHLX&SHRX without load
Change-Id: I79eb5e7d6bcb23f26d3a100e915efff6dae70391
Reviewed-on: https://go-review.googlesource.com/c/go/+/399061
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-04-13 17:48:36 +00:00
Wayne Zuo 517781b391 cmd/compile: add SARXQload and SARXLload
Change-Id: I4e8dc5009a5b8af37d71b62f1322f94806d3e9d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/399056
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-04-13 17:48:12 +00:00
Wayne Zuo d6320f1a58 cmd/compile: add SARX instruction for GOAMD64>=3
name                    old time/op  new time/op  delta
ShiftArithmeticRight-8  0.68ns ± 5%  0.30ns ± 6%  -56.14%  (p=0.000 n=10+10)

Change-Id: I052a0d7b9e6526d526276444e588b0cc288beff4
Reviewed-on: https://go-review.googlesource.com/c/go/+/399055
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-04-12 12:59:27 +00:00
Wayne Zuo 32de2b0d1c cmd/compile: add MOVBE index load/store
Fixes #51724

Change-Id: I94e650a7482dc4c479d597f0162a6a89d779708d
Reviewed-on: https://go-review.googlesource.com/c/go/+/395474
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Cherry Mui <cherryyz@google.com>
2022-04-11 15:41:56 +00:00
Wayne Zuo 615d3c3040 test: adjust load and store test
In the load tests, we only want to test the assembly produced by
the load operations. If we use the global variable sink, it will produce
one load operation and one store operation(assign to sink).

For example:

func load_be64(b []byte) uint64 {
	sink64 = binary.BigEndian.Uint64(b)
}

If we compile this function with GOAMD64=v3, it may produce MOVBEQload
and MOVQstore or MOVQload and MOVBEQstore, but we only want MOVBEQload.
Discovered when developing CL 395474.

Same for the store tests.

Change-Id: I65c3c742f1eff657c3a0d2dd103f51140ae8079e
Reviewed-on: https://go-review.googlesource.com/c/go/+/397875
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Cherry Mui <cherryyz@google.com>
2022-04-11 15:41:04 +00:00
Wayne Zuo 7fbabe8d57 cmd/compile: use shlx&shrx instruction for GOAMD64>=v3
The SHRX/SHLX instruction can take any general register as the shift count operand, and can read source from memory. This CL introduces some operators to combine load and shift to one instruction.

For #47120

Change-Id: I13b48f53c7d30067a72eb2c8382242045dead36a
Reviewed-on: https://go-review.googlesource.com/c/go/+/385174
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Cherry Mui <cherryyz@google.com>
2022-04-04 20:07:23 +00:00
Wayne Zuo a92ca51507 cmd/compile: use LZCNT instruction for GOAMD64>=3
LZCNT is similar to BSR, but BSR(x) is undefined when x == 0, so using
LZCNT can avoid a special case for zero input. Except that case,
LZCNTQ(x) == 63-BSRQ(x) and LZCNTL(x) == 31-BSRL(x).

And according to https://www.agner.org/optimize/instruction_tables.pdf,
LZCNT instructions are much faster than BSR on AMD CPU.

name              old time/op  new time/op  delta
LeadingZeros-8    0.91ns ± 1%  0.80ns ± 7%  -11.68%  (p=0.000 n=9+9)
LeadingZeros8-8   0.98ns ±15%  0.91ns ± 1%   -7.34%  (p=0.000 n=9+9)
LeadingZeros16-8  0.94ns ± 3%  0.92ns ± 2%   -2.36%  (p=0.001 n=10+10)
LeadingZeros32-8  0.89ns ± 1%  0.78ns ± 2%  -12.49%  (p=0.000 n=10+10)
LeadingZeros64-8  0.92ns ± 1%  0.78ns ± 1%  -14.48%  (p=0.000 n=10+10)

Change-Id: I125147fe3d6994a4cfe558432780408e9a27557a
Reviewed-on: https://go-review.googlesource.com/c/go/+/396794
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Emmanuel Odeke <emmanuel@orijtech.com>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-04-04 04:01:17 +00:00
Wayne Zuo ba6df85c7c cmd/compile: add MOVBEWstore support for GOAMD64>=3
This CL add MOVBE support for 16-bit version, but MOVBEWload is
excluded because it does not satisfy zero extented.

For #51724

Change-Id: I3fadf20bcbb9b423f6355e6a1e340107e8e621ac
Reviewed-on: https://go-review.googlesource.com/c/go/+/396617
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Trust: Emmanuel Odeke <emmanuel@orijtech.com>
2022-04-03 23:48:16 +00:00
fanzha02 8ab42a945a cmd/compile: merge ANDconst and UBFX into UBFX on arm64
Add a new rewrite rule to merge ANDconst and UBFX into
UBFX.

Add test cases.

Change-Id: I24d6442d0c956d7ce092c3a3858d4a3a41771670
Reviewed-on: https://go-review.googlesource.com/c/go/+/377054
Trust: Fannie Zhang <Fannie.Zhang@arm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-03-25 01:24:44 +00:00
Keith Randall aaa3d39f27 cmd/compile: include all entries in map literal hint size
Currently we only include static entries in the hint for sizing
the map when allocating a map for a map literal. Change that to
include all entries.

This will be an overallocation if the dynamic entries in the map have
equal keys, but equal keys in map literals are rare, and at worst we
waste a bit of space.

Fixes #43020

Change-Id: I232f82f15316bdf4ea6d657d25a0b094b77884ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/383634
Run-TryBot: Keith Randall <khr@golang.org>
Trust: Keith Randall <khr@golang.org>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-03-01 23:20:30 +00:00
Archana R 4056934e48 test/codegen: updated arithmetic tests to verify on ppc64,ppc64le
Updated multiple tests in test/codegen/arithmetic.go to verify
on ppc64/ppc64le as well

Change-Id: I79ca9f87017ea31147a4ba16f5d42ba0fcae64e1
Reviewed-on: https://go-review.googlesource.com/c/go/+/358546
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-11-01 13:12:37 +00:00
Archana R 8b9c0d1a79 test/codegen: updated comparison test to verify on ppc64,ppc64le
Updated test/codegen/comparison.go to verify memequal is inlined
as implemented in CL 328291.

Change-Id: If7824aed37ee1f8640e54fda0f9b7610582ba316
Reviewed-on: https://go-review.googlesource.com/c/go/+/357289
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-10-21 13:05:07 +00:00
wdvxdr fe7df4c4d0 cmd/compile: use MOVBE instruction for GOAMD64>=v3
In CL 354670, I copied some existing rules for convenience but forgot
to update the last rule which broke `GOAMD64=v3 ./make.bat`

Revive CL 354670

Change-Id: Ic1e2047c603f0122482a4b293ce1ef74d806c019
Reviewed-on: https://go-review.googlesource.com/c/go/+/356810
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Daniel Martí <mvdan@mvdan.cc>
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
TryBot-Result: Go Bot <gobot@golang.org>
2021-10-19 16:18:56 +00:00
Daniel Martí b0351bfd7d Revert "cmd/compile: use MOVBE instruction for GOAMD64>=v3"
This reverts CL 354670.

Reason for revert: broke make.bash with GOAMD64=v3.

Fixes #49061.

Change-Id: I7f2ed99b7c10100c4e0c1462ea91c4c9d8c609b2
Reviewed-on: https://go-review.googlesource.com/c/go/+/356790
Trust: Daniel Martí <mvdan@mvdan.cc>
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Koichi Shiraishi <zchee.io@gmail.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
2021-10-19 09:49:38 +00:00
Lynn Boger 33b3260c1e cmd/compile/internal/ssagen: set BitLen32 as intrinsic on PPC64
It was noticed through some other investigation that BitLen32
was not generating the best code and found that it wasn't recognized
as an intrinsic. This corrects that and enables the test for PPC64.

Change-Id: Iab496a8830c8552f507b7292649b1b660f3848b5
Reviewed-on: https://go-review.googlesource.com/c/go/+/355872
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-10-18 21:33:08 +00:00
wdvxdr 3e5cc4d6f6 cmd/compile: use MOVBE instruction for GOAMD64>=v3
encoding/binary benchmark on my laptop:
name                      old time/op    new time/op     delta
ReadSlice1000Int32s-8       4.42µs ± 5%     4.20µs ± 1%   -4.94%  (p=0.046 n=9+8)
ReadStruct-8                 359ns ± 8%      368ns ± 5%   +2.35%  (p=0.041 n=9+10)
WriteStruct-8                349ns ± 1%      357ns ± 1%   +2.15%  (p=0.000 n=8+10)
ReadInts-8                   235ns ± 1%      233ns ± 1%   -1.01%  (p=0.005 n=10+10)
WriteInts-8                  265ns ± 1%      274ns ± 1%   +3.45%  (p=0.000 n=10+10)
WriteSlice1000Int32s-8      4.61µs ± 5%     4.59µs ± 5%     ~     (p=0.986 n=10+10)
PutUint16-8                 0.56ns ± 4%     0.57ns ± 4%     ~     (p=0.101 n=10+10)
PutUint32-8                 0.83ns ± 2%     0.56ns ± 6%  -32.91%  (p=0.000 n=10+10)
PutUint64-8                 0.81ns ± 3%     0.62ns ± 4%  -23.82%  (p=0.000 n=10+10)
LittleEndianPutUint16-8     0.55ns ± 4%     0.55ns ± 3%     ~     (p=0.926 n=10+10)
LittleEndianPutUint32-8     0.41ns ± 4%     0.42ns ± 3%     ~     (p=0.148 n=10+9)
LittleEndianPutUint64-8     0.55ns ± 2%     0.56ns ± 6%     ~     (p=0.897 n=10+10)
ReadFloats-8                60.4ns ± 4%     59.0ns ± 1%   -2.25%  (p=0.007 n=10+10)
WriteFloats-8               72.3ns ± 2%     71.5ns ± 7%     ~     (p=0.089 n=10+10)
ReadSlice1000Float32s-8     4.21µs ± 3%     4.18µs ± 2%     ~     (p=0.197 n=10+10)
WriteSlice1000Float32s-8    4.61µs ± 2%     4.68µs ± 7%     ~     (p=1.000 n=8+10)
ReadSlice1000Uint8s-8        250ns ± 4%      247ns ± 4%     ~     (p=0.324 n=10+10)
WriteSlice1000Uint8s-8       227ns ± 5%      229ns ± 2%     ~     (p=0.193 n=10+7)
PutUvarint32-8              15.3ns ± 2%     15.4ns ± 4%     ~     (p=0.782 n=10+10)
PutUvarint64-8              38.5ns ± 1%     38.6ns ± 5%     ~     (p=0.396 n=8+10)

name                      old speed      new speed       delta
ReadSlice1000Int32s-8      890MB/s ±17%    953MB/s ± 1%   +7.00%  (p=0.027 n=10+8)
ReadStruct-8               209MB/s ± 8%    204MB/s ± 5%   -2.42%  (p=0.043 n=9+10)
WriteStruct-8              214MB/s ± 3%    210MB/s ± 1%   -1.75%  (p=0.003 n=9+10)
ReadInts-8                 127MB/s ± 1%    129MB/s ± 1%   +1.01%  (p=0.006 n=10+10)
WriteInts-8                113MB/s ± 1%    109MB/s ± 1%   -3.34%  (p=0.000 n=10+10)
WriteSlice1000Int32s-8     868MB/s ± 5%    872MB/s ± 5%     ~     (p=1.000 n=10+10)
PutUint16-8               3.55GB/s ± 4%   3.50GB/s ± 4%     ~     (p=0.093 n=10+10)
PutUint32-8               4.83GB/s ± 2%   7.21GB/s ± 6%  +49.16%  (p=0.000 n=10+10)
PutUint64-8               9.89GB/s ± 3%  12.99GB/s ± 4%  +31.26%  (p=0.000 n=10+10)
LittleEndianPutUint16-8   3.65GB/s ± 4%   3.65GB/s ± 4%     ~     (p=0.912 n=10+10)
LittleEndianPutUint32-8   9.74GB/s ± 3%   9.63GB/s ± 3%     ~     (p=0.222 n=9+9)
LittleEndianPutUint64-8   14.4GB/s ± 2%   14.3GB/s ± 5%     ~     (p=0.912 n=10+10)
ReadFloats-8               199MB/s ± 4%    203MB/s ± 1%   +2.27%  (p=0.007 n=10+10)
WriteFloats-8              166MB/s ± 2%    168MB/s ± 7%     ~     (p=0.089 n=10+10)
ReadSlice1000Float32s-8    949MB/s ± 3%    958MB/s ± 2%     ~     (p=0.218 n=10+10)
WriteSlice1000Float32s-8   867MB/s ± 2%    857MB/s ± 6%     ~     (p=1.000 n=8+10)
ReadSlice1000Uint8s-8     4.00GB/s ± 4%   4.06GB/s ± 4%     ~     (p=0.353 n=10+10)
WriteSlice1000Uint8s-8    4.40GB/s ± 4%   4.36GB/s ± 2%     ~     (p=0.193 n=10+7)
PutUvarint32-8             262MB/s ± 2%    260MB/s ± 4%     ~     (p=0.739 n=10+10)
PutUvarint64-8             208MB/s ± 1%    207MB/s ± 5%     ~     (p=0.408 n=8+10)

Updates #45453

Change-Id: Ifda0d48d54665cef45d46d3aad974062633142c4
Reviewed-on: https://go-review.googlesource.com/c/go/+/354670
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Matthew Dempsky <mdempsky@google.com>
2021-10-18 16:01:55 +00:00
Jake Ciolek 732f6fa9d5 cmd/compile: use ANDL for small immediates
We can rewrite ANDQ with an immediate fitting in 32bit with an ANDL, which is shorter to encode.

Looking at Go binary itself, before the change there was:

ANDL: 2337
ANDQ: 4476

After the change:

ANDL: 3790
ANDQ: 3024

So we got rid of 1452 ANDQs

This makes the Linux x86_64 binary 0.03% smaller.

There seems to be an impact on performance.

Intel Cascade Lake benchmarks (with perflock):

name                     old time/op    new time/op    delta
BinaryTree17-8              1.91s ± 1%     1.89s ± 1%  -1.22%  (p=0.000 n=21+18)
Fannkuch11-8                2.34s ± 0%     2.34s ± 0%    ~     (p=0.052 n=20+20)
FmtFprintfEmpty-8          27.7ns ± 1%    27.4ns ± 3%    ~     (p=0.497 n=21+21)
FmtFprintfString-8         53.2ns ± 0%    51.5ns ± 0%  -3.21%  (p=0.000 n=20+19)
FmtFprintfInt-8            57.3ns ± 0%    55.7ns ± 0%  -2.89%  (p=0.000 n=19+19)
FmtFprintfIntInt-8         92.3ns ± 0%    88.4ns ± 1%  -4.23%  (p=0.000 n=20+21)
FmtFprintfPrefixedInt-8     103ns ± 0%     103ns ± 0%  +0.23%  (p=0.000 n=20+21)
FmtFprintfFloat-8           147ns ± 0%     148ns ± 0%  +0.75%  (p=0.000 n=20+21)
FmtManyArgs-8               384ns ± 0%     381ns ± 0%  -0.63%  (p=0.000 n=21+21)
GobDecode-8                3.86ms ± 1%    3.88ms ± 1%  +0.52%  (p=0.000 n=20+21)
GobEncode-8                2.77ms ± 1%    2.77ms ± 0%    ~     (p=0.078 n=21+21)
Gzip-8                      168ms ± 1%     168ms ± 0%  +0.24%  (p=0.000 n=20+20)
Gunzip-8                   25.1ms ± 0%    24.3ms ± 0%  -3.03%  (p=0.000 n=21+21)
HTTPClientServer-8         61.4µs ± 8%    59.1µs ±10%    ~     (p=0.088 n=20+21)
JSONEncode-8               6.86ms ± 0%    6.70ms ± 0%  -2.29%  (p=0.000 n=20+19)
JSONDecode-8               30.8ms ± 1%    30.6ms ± 1%  -0.82%  (p=0.000 n=20+20)
Mandelbrot200-8            3.85ms ± 0%    3.85ms ± 0%    ~     (p=0.191 n=16+17)
GoParse-8                  2.61ms ± 2%    2.60ms ± 1%    ~     (p=0.561 n=21+20)
RegexpMatchEasy0_32-8      48.5ns ± 2%    45.9ns ± 3%  -5.26%  (p=0.000 n=20+21)
RegexpMatchEasy0_1K-8       139ns ± 0%     139ns ± 0%  +0.27%  (p=0.000 n=18+20)
RegexpMatchEasy1_32-8      41.3ns ± 0%    42.1ns ± 4%  +1.95%  (p=0.000 n=17+21)
RegexpMatchEasy1_1K-8       216ns ± 2%     216ns ± 0%  +0.17%  (p=0.020 n=21+19)
RegexpMatchMedium_32-8      790ns ± 7%     803ns ± 8%    ~     (p=0.178 n=21+21)
RegexpMatchMedium_1K-8     23.5µs ± 5%    23.7µs ± 5%    ~     (p=0.421 n=21+21)
RegexpMatchHard_32-8       1.09µs ± 1%    1.09µs ± 1%  -0.53%  (p=0.000 n=19+18)
RegexpMatchHard_1K-8       33.0µs ± 0%    33.0µs ± 0%    ~     (p=0.610 n=21+20)
Revcomp-8                   348ms ± 0%     353ms ± 0%  +1.38%  (p=0.000 n=17+18)
Template-8                 42.0ms ± 1%    41.9ms ± 1%  -0.30%  (p=0.049 n=20+20)
TimeParse-8                 185ns ± 0%     185ns ± 0%    ~     (p=0.387 n=20+18)
TimeFormat-8                237ns ± 1%     241ns ± 1%  +1.57%  (p=0.000 n=21+21)
[Geo mean]                 35.4µs         35.2µs       -0.66%

name                     old speed      new speed      delta
GobDecode-8               199MB/s ± 1%   198MB/s ± 1%  -0.52%  (p=0.000 n=20+21)
GobEncode-8               277MB/s ± 1%   277MB/s ± 0%    ~     (p=0.075 n=21+21)
Gzip-8                    116MB/s ± 1%   115MB/s ± 0%  -0.25%  (p=0.000 n=20+20)
Gunzip-8                  773MB/s ± 0%   797MB/s ± 0%  +3.12%  (p=0.000 n=21+21)
JSONEncode-8              283MB/s ± 0%   290MB/s ± 0%  +2.35%  (p=0.000 n=20+19)
JSONDecode-8             63.0MB/s ± 1%  63.5MB/s ± 1%  +0.82%  (p=0.000 n=20+20)
GoParse-8                22.2MB/s ± 2%  22.3MB/s ± 1%    ~     (p=0.539 n=21+20)
RegexpMatchEasy0_32-8     660MB/s ± 2%   697MB/s ± 3%  +5.57%  (p=0.000 n=20+21)
RegexpMatchEasy0_1K-8    7.36GB/s ± 0%  7.34GB/s ± 0%  -0.26%  (p=0.000 n=18+20)
RegexpMatchEasy1_32-8     775MB/s ± 0%   761MB/s ± 4%  -1.88%  (p=0.000 n=17+21)
RegexpMatchEasy1_1K-8    4.74GB/s ± 2%  4.74GB/s ± 0%  -0.18%  (p=0.020 n=21+19)
RegexpMatchMedium_32-8   40.6MB/s ± 7%  39.9MB/s ± 9%    ~     (p=0.191 n=21+21)
RegexpMatchMedium_1K-8   43.7MB/s ± 5%  43.2MB/s ± 5%    ~     (p=0.435 n=21+21)
RegexpMatchHard_32-8     29.3MB/s ± 1%  29.4MB/s ± 1%  +0.53%  (p=0.000 n=19+18)
RegexpMatchHard_1K-8     31.0MB/s ± 0%  31.0MB/s ± 0%    ~     (p=0.572 n=21+20)
Revcomp-8                 730MB/s ± 0%   720MB/s ± 0%  -1.36%  (p=0.000 n=17+18)
Template-8               46.2MB/s ± 1%  46.3MB/s ± 1%  +0.30%  (p=0.041 n=20+20)
[Geo mean]                204MB/s        205MB/s       +0.30%

Change-Id: Iac75d0ec184a515ce0e65e19559d5fe2e9840514
Reviewed-on: https://go-review.googlesource.com/c/go/+/354970
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
2021-10-12 22:02:39 +00:00
Alejandro García Montoro e1c294a56d cmd/compile: eliminate successive swaps
The code generated when storing eight bytes loaded from memory in big
endian introduced two successive byte swaps that did not actually
modified the data.

The new rules match this specific pattern both for amd64 and for arm64,
eliminating the double swap.

Fixes #41684

Change-Id: Icb6dc20b68e4393cef4fe6a07b33aba0d18c3ff3
Reviewed-on: https://go-review.googlesource.com/c/go/+/320073
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Daniel Martí <mvdan@mvdan.cc>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
2021-10-09 01:04:29 +00:00
Ruslan Andreev 810b08b8ec cmd/compile: inline memequal(x, const, sz) for small sizes
This CL adds late expanded memequal(x, const, sz) inlining for 2, 4, 8
bytes size. This PoC is using the same method as CL 248404.
This optimization fires about 100 times in Go compiler (1675 occurrences
reduced to 1574, so -6%).
Also, added unit-tests to codegen/comparisions.go file.

Updates #37275

Change-Id: Ia52808d573cb706d1da8166c5746ede26f46c5da
Reviewed-on: https://go-review.googlesource.com/c/go/+/328291
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Trust: David Chase <drchase@google.com>
2021-10-06 13:47:50 +00:00
nimelehin 097a82f54d cmd/compile: don't emit unnecessary amd64 extension checks
In case of amd64 the compiler issues checks if extensions are
available on a platform. With GOAMD64 microarchitecture levels
provided, some of the checks could be eliminated.

Change-Id: If15c178bcae273b2ce7d3673415cb8849292e087
Reviewed-on: https://go-review.googlesource.com/c/go/+/352010
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
2021-10-05 18:32:12 +00:00
wdvxdr 060cd73ab9 cmd/compile: use TZCNT instruction for GOAMD64>=v3
on my Intel CoffeeLake CPU:
name               old time/op  new time/op  delta
TrailingZeros-8    0.68ns ± 1%  0.64ns ± 1%  -6.26%  (p=0.000 n=10+10)
TrailingZeros8-8   0.70ns ± 1%  0.70ns ± 1%    ~     (p=0.697 n=10+10)
TrailingZeros16-8  0.70ns ± 1%  0.70ns ± 1%  +0.57%  (p=0.043 n=10+10)
TrailingZeros32-8  0.66ns ± 1%  0.64ns ± 1%  -3.35%  (p=0.000 n=10+10)
TrailingZeros64-8  0.68ns ± 1%  0.64ns ± 1%  -5.84%  (p=0.000 n=9+10)

Updates #45453

Change-Id: I228ff2d51df24b1306136f061432f8a12bb1d6fd
Reviewed-on: https://go-review.googlesource.com/c/go/+/353249
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2021-10-05 16:06:49 +00:00
Lynn Boger b8990ec932 test: update test/codegen/noextend.go to work with either ABI on ppc64x
This updates the codegen tests in noextend.go so they are not
dependent on the ABI.

Change-Id: I8433bea9dc78830c143290a7e0cf901b2397d38a
Reviewed-on: https://go-review.googlesource.com/c/go/+/353070
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-09-29 17:30:11 +00:00
Archana R b35c668072 cmd/compile: add PPC64-specific inlining for runtime.memmove
Add rule to PPC64.rules to inline runtime.memmove in more cases, as is
done for other target architectures
Updated tests in codegen/copy.go to verify changes are done on
ppc64/ppc64le

Updates #41662

Change-Id: Id937ce21f9b4f4047b3e66dfa3c960128ee16a2a
Reviewed-on: https://go-review.googlesource.com/c/go/+/352054
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
2021-09-29 16:55:51 +00:00
Joel Sing fe8347b61a cmd/compile: optimise immediate operands with constants on riscv64
Instructions with immediates can be precomputed when operating on a
constant - do so for SLTI/SLTIU, SLLI/SRLI/SRAI, NEG/NEGW, ANDI, ORI
and ADDI. Additionally, optimise ANDI and ORI when the immediate is
all ones or all zeroes.

In particular, the RISCV64 logical left and right shift rules
(Lsh*x*/Rsh*Ux*) produce sequences that check if the shift amount
exceeds 64 and if so returns zero. When the shift amount is a
constant we can precompute and eliminate the filter entirely.

Likewise the arithmetic right shift rules produce sequences that
check if the shift amount exceeds 64 and if so, ensures that the
lower six bits of the shift are all ones. When the shift amount
is a constant we can precompute the shift value.

Arithmetic right shift sequences like:

   117fc:       00100513                li      a0,1
   11800:       04053593                sltiu   a1,a0,64
   11804:       fff58593                addi    a1,a1,-1
   11808:       0015e593                ori     a1,a1,1
   1180c:       40b45433                sra     s0,s0,a1

Are now a single srai instruction:

   117fc:       40145413                srai    s0,s0,0x1

Likewise for logical left shift (and logical right shift):

   1d560:       01100413                li      s0,17
   1d564:       04043413                sltiu   s0,s0,64
   1d568:       40800433                neg     s0,s0
   1d56c:       01131493                slli    s1,t1,0x11
   1d570:       0084f433                and     s0,s1,s0

Which are now a single slli (or srli) instruction:

   1d120:       01131413                slli    s0,t1,0x11

This removes more than 30,000 instructions from the Go binary and
should improve performance in a variety of areas - of note
runtime.makemap_small drops from 48 to 36 instructions. Similar
gains exist in at least other parts of runtime and math/bits.

Change-Id: I33f6f3d1fd36d9ff1bda706997162bfe4bb859b6
Reviewed-on: https://go-review.googlesource.com/c/go/+/350689
Trust: Joel Sing <joel@sing.id.au>
Reviewed-by: Michael Munday <mike.munday@lowrisc.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-09-24 10:51:48 +00:00
Joel Sing d413908320 test/codegen: add shift tests for RISCV64
Add tests for shift by constant, masked shifts and bounded shifts. While here,
sort tests by architecture and keep order of tests consistent (lsh, rshU, rsh).

Change-Id: I512d64196f34df9cb2884e8c0f6adcf9dd88b0fc
Reviewed-on: https://go-review.googlesource.com/c/go/+/351289
Trust: Joel Sing <joel@sing.id.au>
Reviewed-by: Michael Munday <mike.munday@lowrisc.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
2021-09-24 07:22:13 +00:00
Matthew Dempsky 04572fa29b cmd/compile: use BMI1 instructions for GOAMD64=v3 and higher
BMI1 includes four instructions (ANDN, BLSI, BLSMSK, BLSR) that are
easy to peephole optimize, and which GCC always seems to favor using
when available and applicable.

Updates #45453.

Change-Id: I0274184057058f5c579e5bc3ea9c414396d3cf46
Reviewed-on: https://go-review.googlesource.com/c/go/+/351130
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
Trust: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2021-09-22 00:15:27 +00:00
Keith Randall 6f35430faa cmd/compile: allow rotates to be merged with logical ops on arm64
Fixes #48002

Change-Id: Ie3a157d55b291f5ac2ef4845e6ce4fefd84fc642
Reviewed-on: https://go-review.googlesource.com/c/go/+/350912
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-09-20 16:25:36 +00:00
Keith Randall 315dbd10c9 cmd/compile: fold double negate on arm64
Fixes #48467

Change-Id: I52305dbf561ee3eee6c1f053e555a3a6ec1ab892
Reviewed-on: https://go-review.googlesource.com/c/go/+/350910
Trust: Keith Randall <khr@golang.org>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-09-19 23:50:45 +00:00
Keith Randall 83b36ffb10 cmd/compile: implement constant rotates on arm64
Explicit constant rotates work, but constant arguments to
bits.RotateLeft* needed the additional rule.

Fixes #48465

Change-Id: Ia7544f21d0e7587b6b6506f72421459cd769aea6
Reviewed-on: https://go-review.googlesource.com/c/go/+/350909
Trust: Keith Randall <khr@golang.org>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-09-19 23:50:23 +00:00
Michael Munday c69f5c0d76 cmd/compile: add support for Abs and Copysign intrinsics on riscv64
Also, add the FABSS and FABSD pseudo instructions to the assembler.
The compiler could use FSGNJX[SD] directly but there doesn't seem
to be much advantage to doing so and the pseudo instructions are
easier to understand.

Change-Id: Ie8825b8aa8773c69cc4f07a32ef04abf4061d80d
Reviewed-on: https://go-review.googlesource.com/c/go/+/348989
Trust: Michael Munday <mike.munday@lowrisc.org>
Run-TryBot: Michael Munday <mike.munday@lowrisc.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
2021-09-10 10:45:59 +00:00
fanzha02 2091bd3f26 cmd/compile: simiplify arm64 bitfield optimizations
In some rewrite rules for arm64 bitfield optimizations, the
bitfield lsb value and the bitfield width value are related
to datasize, some of them use datasize directly to check the
bitfield lsb value is valid, to get the bitfiled width value,
but some of them call isARM64BFMask() and arm64BFWidth()
functions. In order to be consistent, this patch changes them
all to use datasize.

Besides, this patch sorts the codegen test cases.

Run the "toolstash-check -all" command and find one inconsistent code
is as the following.

new:    src/math/fma.go:104      BEQ	247
master: src/math/fma.go:104      BEQ	248

The above inconsistence is due to this patch changing the range of the
field lsb value in "UBFIZ" optimization rules from "lc+(32|16|8)<64" to
"lc<64", so that the following code is generated as "UBFIZ". The logical
of changed code is still correct.

The code of src/math/fma.go:160:
  const uvinf = 0x7FF0000000000000
  func FMA(a, b uint32) float64 {
  	ps := a+b
  	return Float64frombits(uint64(ps)<<63 | uvinf)
  }

The new assembly code:
  TEXT    "".FMA(SB), LEAF|NOFRAME|ABIInternal, $0-16
  MOVWU   "".a(FP), R0
  MOVWU   "".b+4(FP), R1
  ADD     R1, R0, R0
  UBFIZ   $63, R0, $1, R0
  ORR     $9218868437227405312, R0, R0
  MOVD    R0, "".~r2+8(FP)
  RET     (R30)

The master assembly code:
  TEXT    "".FMA(SB), LEAF|NOFRAME|ABIInternal, $0-16
  MOVWU   "".a(FP), R0
  MOVWU   "".b+4(FP), R1
  ADD     R1, R0, R0
  MOVWU   R0, R0
  LSL     $63, R0, R0
  ORR     $9218868437227405312, R0, R0
  MOVD    R0, "".~r2+8(FP)
  RET     (R30)

Change-Id: I9061104adfdfd3384d0525327ae1e5c8b0df5c35
Reviewed-on: https://go-review.googlesource.com/c/go/+/265038
Trust: fannie zhang <Fannie.Zhang@arm.com>
Run-TryBot: fannie zhang <Fannie.Zhang@arm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-09-10 02:44:36 +00:00
Michael Munday 8214257347 test/codegen: fix package name for test case
The codegen tests are currently skipped (see #48247). The test
added in CL 346050 did not compile because it was in the main
package but did not contain a main function. Changing the package
to 'codegen' fixes the issue.

Updates #48247.

Change-Id: I0a0eaca8e6a7d7b335606d2c76a204ac0c12e6d5
Reviewed-on: https://go-review.googlesource.com/c/go/+/348392
Trust: Michael Munday <mike.munday@lowrisc.org>
Run-TryBot: Michael Munday <mike.munday@lowrisc.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
2021-09-08 14:51:40 +00:00