Enable SHA-NI optimizations for SHA-256.

While our CI machines don't have these instructions, Intel SDE covers
them. Benchmarks on an AMD EPYC machine (VM on Google Compute Engine):

Before:
Did 13619000 SHA-256 (16 bytes) operations in 3000147us (72.6 MB/sec)
Did 3728000 SHA-256 (256 bytes) operations in 3000566us (318.1 MB/sec)
Did 920000 SHA-256 (1350 bytes) operations in 3002829us (413.6 MB/sec)
Did 161000 SHA-256 (8192 bytes) operations in 3017473us (437.1 MB/sec)
Did 81000 SHA-256 (16384 bytes) operations in 3029284us (438.1 MB/sec)

After:
Did 25442000 SHA-256 (16 bytes) operations in 3000010us (135.7 MB/sec) [+86.8%]
Did 10706000 SHA-256 (256 bytes) operations in 3000171us (913.5 MB/sec) [+187.2%]
Did 3119000 SHA-256 (1350 bytes) operations in 3000470us (1403.3 MB/sec) [+239.3%]
Did 572000 SHA-256 (8192 bytes) operations in 3001226us (1561.3 MB/sec) [+257.2%]
Did 289000 SHA-256 (16384 bytes) operations in 3006936us (1574.7 MB/sec) [+259.4%]
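
(The MB/sec figures above are simply ops * message size / elapsed time,
e.g. 25442000 * 16 bytes / 3.000010s is about 135.7 MB/s with MB = 10^6
bytes.)

For reference, the SHA-NI path is gated on CPUID.(EAX=7,ECX=0):EBX bit
29, which is the `1<<29` test visible in the diff below. A rough
standalone C sketch of that check, using the GCC/Clang
__get_cpuid_count helper (BoringSSL itself reads the cached
OPENSSL_ia32cap_P capability vector rather than issuing CPUID on every
call):

  /* Standalone sketch of the SHA extensions check; not BoringSSL's
   * actual dispatch code, which reads the cached OPENSSL_ia32cap_P
   * vector instead of issuing CPUID each time. */
  #include <cpuid.h>
  #include <stdio.h>

  static int has_sha_extensions(void) {
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 7, sub-leaf 0: EBX bit 29 reports the SHA extensions. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
      return 0;  /* CPUID leaf 7 not available */
    }
    return (ebx >> 29) & 1;
  }

  int main(void) {
    printf("SHA-NI %s\n",
           has_sha_extensions() ? "supported" : "not supported");
    return 0;
  }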
Although the unwind tests don't currently run in CI, I ran them manually
on the same VM. They pass after adding the missing .cfi_startproc and
.cfi_endproc directives.

Change-Id: I45b91819e7dcc31e63813843129afa146d0c9d47
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/51546
Reviewed-by: Adam Langley <agl@google.com>
diff --git a/crypto/fipsmodule/sha/asm/sha512-x86_64.pl b/crypto/fipsmodule/sha/asm/sha512-x86_64.pl
index 61f67cb..2abd065 100755
--- a/crypto/fipsmodule/sha/asm/sha512-x86_64.pl
+++ b/crypto/fipsmodule/sha/asm/sha512-x86_64.pl
@@ -126,15 +126,12 @@
# versions, but BoringSSL is intended to be used with pre-generated perlasm
# output, so this isn't useful anyway.
#
-# TODO(davidben): Enable AVX2 code after testing by setting $avx to 2. Is it
-# necessary to disable AVX2 code when SHA Extensions code is disabled? Upstream
-# did not tie them together until after $shaext was added.
+# This file also has an AVX2 implementation, controlled by setting $avx to 2.
+# For now, we intentionally disable it. While it gives a 13-16% perf boost, the
+# CFI annotations are wrong. It allocates stack in a loop and should be
+# rewritten to avoid this.
$avx = 1;
-
-# TODO(davidben): Consider enabling the Intel SHA Extensions code once it's
-# been tested.
-$shaext=0; ### set to zero if compiling for 1.0.1
-$avx=1 if (!$shaext && $avx);
+$shaext = 1;

open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
*STDOUT=*OUT;
@@ -275,7 +272,7 @@
___
$code.=<<___ if ($SZ==4 && $shaext);
test \$`1<<29`,%r11d # check for SHA
- jnz _shaext_shortcut
+ jnz .Lshaext_shortcut
___
# XOP codepath removed.
$code.=<<___ if ($avx>1);
@@ -559,7 +556,8 @@
.type sha256_block_data_order_shaext,\@function,3
.align 64
sha256_block_data_order_shaext:
-_shaext_shortcut:
+.cfi_startproc
+.Lshaext_shortcut:
___
$code.=<<___ if ($win64);
lea `-8-5*16`(%rsp),%rsp
@@ -703,6 +701,7 @@
___
$code.=<<___;
ret
+.cfi_endproc
.size sha256_block_data_order_shaext,.-sha256_block_data_order_shaext
___
}}}
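
No changes are needed on the caller side to pick this up: the runtime
CPU check at the top of sha256_block_data_order jumps to the SHA-NI
routine when the bit is set, so anything going through the public API
benefits automatically. A minimal usage sketch in C (assumes linking
against BoringSSL's libcrypto):

  /* Usage sketch: hash through the public one-shot API; the SHA-NI
   * code path is selected internally by the runtime CPU feature check. */
  #include <openssl/sha.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
    static const uint8_t kMsg[] = "hello world";
    uint8_t digest[SHA256_DIGEST_LENGTH];
    SHA256(kMsg, sizeof(kMsg) - 1, digest);
    for (size_t i = 0; i < SHA256_DIGEST_LENGTH; i++) {
      printf("%02x", digest[i]);
    }
    printf("\n");
    return 0;
  }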