Enable SHA-NI optimizations for SHA-256. While our CI machines don't have these instructions, Intel SDE covers them. Benchmarks on an AMD EPYC machine (VM on Google Compute Engine): Before: Did 13619000 SHA-256 (16 bytes) operations in 3000147us (72.6 MB/sec) Did 3728000 SHA-256 (256 bytes) operations in 3000566us (318.1 MB/sec) Did 920000 SHA-256 (1350 bytes) operations in 3002829us (413.6 MB/sec) Did 161000 SHA-256 (8192 bytes) operations in 3017473us (437.1 MB/sec) Did 81000 SHA-256 (16384 bytes) operations in 3029284us (438.1 MB/sec) After: Did 25442000 SHA-256 (16 bytes) operations in 3000010us (135.7 MB/sec) [+86.8%] Did 10706000 SHA-256 (256 bytes) operations in 3000171us (913.5 MB/sec) [+187.2%] Did 3119000 SHA-256 (1350 bytes) operations in 3000470us (1403.3 MB/sec) [+239.3%] Did 572000 SHA-256 (8192 bytes) operations in 3001226us (1561.3 MB/sec) [+257.2%] Did 289000 SHA-256 (16384 bytes) operations in 3006936us (1574.7 MB/sec) [+259.4%] Although we don't currently have unwind tests in CI, I ran the unwind tests manually on the same VM. They pass, after adding in the missing .cfi_startproc and .cfi_endproc lines. Change-Id: I45b91819e7dcc31e63813843129afa146d0c9d47 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/51546 Reviewed-by: Adam Langley <agl@google.com>

diff --git a/crypto/fipsmodule/sha/asm/sha512-x86_64.pl b/crypto/fipsmodule/sha/asm/sha512-x86_64.pl
index 61f67cb..2abd065 100755
--- a/crypto/fipsmodule/sha/asm/sha512-x86_64.pl
+++ b/crypto/fipsmodule/sha/asm/sha512-x86_64.pl

@@ -126,15 +126,12 @@
 # versions, but BoringSSL is intended to be used with pre-generated perlasm
 # output, so this isn't useful anyway.
 #
-# TODO(davidben): Enable AVX2 code after testing by setting $avx to 2. Is it
-# necessary to disable AVX2 code when SHA Extensions code is disabled? Upstream
-# did not tie them together until after $shaext was added.
+# This file also has an AVX2 implementation, controlled by setting $avx to 2.
+# For now, we intentionally disable it. While it gives a 13-16% perf boost, the
+# CFI annotations are wrong. It allocates stack in a loop and should be
+# rewritten to avoid this.
 $avx = 1;
-
-# TODO(davidben): Consider enabling the Intel SHA Extensions code once it's
-# been tested.
-$shaext=0;	### set to zero if compiling for 1.0.1
-$avx=1		if (!$shaext && $avx);
+$shaext = 1;
 
 open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
 *STDOUT=*OUT;
@@ -275,7 +272,7 @@
 ___
 $code.=<<___ if ($SZ==4 && $shaext);
 	test	\$`1<<29`,%r11d		# check for SHA
-	jnz	_shaext_shortcut
+	jnz	.Lshaext_shortcut
 ___
     # XOP codepath removed.
 $code.=<<___ if ($avx>1);
@@ -559,7 +556,8 @@
 .type	sha256_block_data_order_shaext,\@function,3
 .align	64
 sha256_block_data_order_shaext:
-_shaext_shortcut:
+.cfi_startproc
+.Lshaext_shortcut:
 ___
 $code.=<<___ if ($win64);
 	lea	`-8-5*16`(%rsp),%rsp
@@ -703,6 +701,7 @@
 ___
 $code.=<<___;
 	ret
+.cfi_endproc
 .size	sha256_block_data_order_shaext,.-sha256_block_data_order_shaext
 ___
 }}}