Enable SHA-NI optimizations for SHA-256.

While our CI machines don't have these instructions, Intel SDE covers
them.

Benchmarks on an AMD EPYC machine (VM on Google Compute Engine):

Before:
Did 13619000 SHA-256 (16 bytes) operations in 3000147us (72.6 MB/sec)
Did 3728000 SHA-256 (256 bytes) operations in 3000566us (318.1 MB/sec)
Did 920000 SHA-256 (1350 bytes) operations in 3002829us (413.6 MB/sec)
Did 161000 SHA-256 (8192 bytes) operations in 3017473us (437.1 MB/sec)
Did 81000 SHA-256 (16384 bytes) operations in 3029284us (438.1 MB/sec)

After:
Did 25442000 SHA-256 (16 bytes) operations in 3000010us (135.7 MB/sec) [+86.8%]
Did 10706000 SHA-256 (256 bytes) operations in 3000171us (913.5 MB/sec) [+187.2%]
Did 3119000 SHA-256 (1350 bytes) operations in 3000470us (1403.3 MB/sec) [+239.3%]
Did 572000 SHA-256 (8192 bytes) operations in 3001226us (1561.3 MB/sec) [+257.2%]
Did 289000 SHA-256 (16384 bytes) operations in 3006936us (1574.7 MB/sec) [+259.4%]

Although we don't currently have unwind tests in CI, I ran the unwind
tests manually on the same VM. They pass, after adding in the missing
.cfi_startproc and .cfi_endproc lines.

Change-Id: I45b91819e7dcc31e63813843129afa146d0c9d47
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/51546
Reviewed-by: Adam Langley <agl@google.com>
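The MB/sec figures and bracketed percentages in the benchmark listing can be re-derived from the raw operation counts and timings. The following is a minimal Python sketch, not part of the change itself; the helper names `mb_per_sec` and `speedup_percent` are hypothetical, and it assumes "MB" means 10^6 bytes and that the percentages were computed from unrounded throughput.

```python
# (operations, bytes per operation, elapsed microseconds), copied from the
# "Before" and "After" listings in the commit message.
BEFORE = [
    (13619000, 16, 3000147),
    (3728000, 256, 3000566),
    (920000, 1350, 3002829),
    (161000, 8192, 3017473),
    (81000, 16384, 3029284),
]
AFTER = [
    (25442000, 16, 3000010),
    (10706000, 256, 3000171),
    (3119000, 1350, 3000470),
    (572000, 8192, 3001226),
    (289000, 16384, 3006936),
]


def mb_per_sec(ops, size, elapsed_us):
    # Bytes per microsecond is numerically equal to 10^6 bytes per second.
    return ops * size / elapsed_us


def speedup_percent(before, after):
    # Relative throughput gain of `after` over `before`, in percent.
    return (mb_per_sec(*after) / mb_per_sec(*before) - 1.0) * 100.0


for b, a in zip(BEFORE, AFTER):
    print(f"{b[1]:>5} bytes: {mb_per_sec(*b):7.1f} -> {mb_per_sec(*a):7.1f} MB/sec "
          f"[{speedup_percent(b, a):+.1f}%]")
```

Under those assumptions the computed values agree with the listing, for example 72.6 MB/sec before and +86.8% after for 16-byte inputs. The gain grows with input size, which is consistent with fixed per-call overhead dominating the short inputs.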
diff --git a/crypto/fipsmodule/sha/asm/sha512-x86_64.pl b/crypto/fipsmodule/sha/asm/sha512-x86_64.pl
index 61f67cb..2abd065 100755
--- a/crypto/fipsmodule/sha/asm/sha512-x86_64.pl
+++ b/crypto/fipsmodule/sha/asm/sha512-x86_64.pl
@@ -126,15 +126,12 @@
 # versions, but BoringSSL is intended to be used with pre-generated perlasm
 # output, so this isn't useful anyway.
 #
-# TODO(davidben): Enable AVX2 code after testing by setting $avx to 2. Is it
-# necessary to disable AVX2 code when SHA Extensions code is disabled? Upstream
-# did not tie them together until after $shaext was added.
+# This file also has an AVX2 implementation, controlled by setting $avx to 2.
+# For now, we intentionally disable it. While it gives a 13-16% perf boost, the
+# CFI annotations are wrong. It allocates stack in a loop and should be
+# rewritten to avoid this.
 $avx = 1;
-
-# TODO(davidben): Consider enabling the Intel SHA Extensions code once it's
-# been tested.
-$shaext=0; ### set to zero if compiling for 1.0.1
-$avx=1 if (!$shaext && $avx);
+$shaext = 1;
 
 open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
 *STDOUT=*OUT;
@@ -275,7 +272,7 @@
 ___
 $code.=<<___ if ($SZ==4 && $shaext);
 	test	\$`1<<29`,%r11d		# check for SHA
-	jnz	_shaext_shortcut
+	jnz	.Lshaext_shortcut
 ___
 # XOP codepath removed.
 $code.=<<___ if ($avx>1);
@@ -559,7 +556,8 @@
 .type	sha256_block_data_order_shaext,\@function,3
 .align	64
 sha256_block_data_order_shaext:
-_shaext_shortcut:
+.cfi_startproc
+.Lshaext_shortcut:
 ___
 $code.=<<___ if ($win64);
 	lea	`-8-5*16`(%rsp),%rsp
@@ -703,6 +701,7 @@
 ___
 $code.=<<___;
 	ret
+.cfi_endproc
 .size	sha256_block_data_order_shaext,.-sha256_block_data_order_shaext
 ___
 }}}
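The runtime dispatch in the second hunk (`test $(1<<29), %r11d` followed by `jnz .Lshaext_shortcut`) amounts to a single-bit feature check. Below is a rough Python sketch of that logic, for illustration only. It assumes that `%r11d` holds the EBX word of CPUID leaf 7, where bit 29 reports the SHA extensions; the hunk itself does not show how the register is loaded. The names `SHA_FEATURE_BIT` and `pick_sha256_impl` are hypothetical.

```python
# Same mask as the `1<<29` constant in the diff. In CPUID leaf 7 (EBX),
# bit 29 indicates support for the SHA extensions.
SHA_FEATURE_BIT = 1 << 29


def pick_sha256_impl(cap_word: int) -> str:
    """Choose a code path from a 32-bit capability word.

    `cap_word` is a hypothetical stand-in for the value in %r11d.
    """
    if cap_word & SHA_FEATURE_BIT:
        # Corresponds to taking the `jnz .Lshaext_shortcut` branch.
        return "shaext"
    # Otherwise execution falls through to the remaining code paths.
    return "generic"
```

For example, `pick_sha256_impl(1 << 29)` selects the SHA extensions path, while a word with every other bit set but bit 29 clear does not.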