Enable AVX code for SHA-*.

SHA-1, SHA-256, and SHA-512 get a 12-26%, 17-23%, and 33-37% throughput
improvement, respectively, on x86-64. SHA-1 and SHA-256 get an 8-20% and 14-17%
improvement on x86. (x86 does not have AVX code for SHA-512.) This costs us 12k
of binary size on x86-64 and 8k of binary size on x86.

$ bssl speed SHA- (x86-64, before)
Did 4811000 SHA-1 (16 bytes) operations in 1000013us (4810937.5 ops/sec): 77.0 MB/s
Did 1414000 SHA-1 (256 bytes) operations in 1000253us (1413642.3 ops/sec): 361.9 MB/s
Did 56000 SHA-1 (8192 bytes) operations in 1002640us (55852.5 ops/sec): 457.5 MB/s
Did 2536000 SHA-256 (16 bytes) operations in 1000140us (2535645.0 ops/sec): 40.6 MB/s
Did 603000 SHA-256 (256 bytes) operations in 1001613us (602028.9 ops/sec): 154.1 MB/s
Did 25000 SHA-256 (8192 bytes) operations in 1010132us (24749.2 ops/sec): 202.7 MB/s
Did 1767000 SHA-512 (16 bytes) operations in 1000477us (1766157.5 ops/sec): 28.3 MB/s
Did 638000 SHA-512 (256 bytes) operations in 1000933us (637405.3 ops/sec): 163.2 MB/s
Did 32000 SHA-512 (8192 bytes) operations in 1025646us (31199.8 ops/sec): 255.6 MB/s

$ bssl speed SHA- (x86-64, after)
Did 5438000 SHA-1 (16 bytes) operations in 1000060us (5437673.7 ops/sec): 87.0 MB/s
Did 1590000 SHA-1 (256 bytes) operations in 1000181us (1589712.3 ops/sec): 407.0 MB/s
Did 71000 SHA-1 (8192 bytes) operations in 1007958us (70439.4 ops/sec): 577.0 MB/s
Did 2955000 SHA-256 (16 bytes) operations in 1000251us (2954258.5 ops/sec): 47.3 MB/s
Did 740000 SHA-256 (256 bytes) operations in 1000628us (739535.6 ops/sec): 189.3 MB/s
Did 31000 SHA-256 (8192 bytes) operations in 1019619us (30403.5 ops/sec): 249.1 MB/s
Did 2348000 SHA-512 (16 bytes) operations in 1000285us (2347331.0 ops/sec): 37.6 MB/s
Did 878000 SHA-512 (256 bytes) operations in 1001064us (877066.8 ops/sec): 224.5 MB/s
Did 43000 SHA-512 (8192 bytes) operations in 1002485us (42893.4 ops/sec): 351.4 MB/s

$ bssl speed SHA- (x86, before, SHA-512 omitted because x86 has no AVX code for it)
Did 4319000 SHA-1 (16 bytes) operations in 1000066us (4318715.0 ops/sec): 69.1 MB/s
Did 1306000 SHA-1 (256 bytes) operations in 1000437us (1305429.5 ops/sec): 334.2 MB/s
Did 58000 SHA-1 (8192 bytes) operations in 1014807us (57153.7 ops/sec): 468.2 MB/s
Did 2291000 SHA-256 (16 bytes) operations in 1000343us (2290214.5 ops/sec): 36.6 MB/s
Did 594000 SHA-256 (256 bytes) operations in 1000684us (593594.0 ops/sec): 152.0 MB/s
Did 25000 SHA-256 (8192 bytes) operations in 1030688us (24255.6 ops/sec): 198.7 MB/s

$ bssl speed SHA- (x86, after, SHA-512 omitted because x86 has no AVX code for it)
Did 4673000 SHA-1 (16 bytes) operations in 1000063us (4672705.6 ops/sec): 74.8 MB/s
Did 1484000 SHA-1 (256 bytes) operations in 1000453us (1483328.1 ops/sec): 379.7 MB/s
Did 69000 SHA-1 (8192 bytes) operations in 1008305us (68431.7 ops/sec): 560.6 MB/s
Did 2684000 SHA-256 (16 bytes) operations in 1000196us (2683474.0 ops/sec): 42.9 MB/s
Did 679000 SHA-256 (256 bytes) operations in 1000525us (678643.7 ops/sec): 173.7 MB/s
Did 29000 SHA-256 (8192 bytes) operations in 1033251us (28066.8 ops/sec): 229.9 MB/s
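
The improvement ranges quoted at the top can be re-derived from the MB/s
columns of the x86-64 runs above. The following is only an illustrative sketch
in Python, not part of the change; the helper name improvement_pct and the
x86_64 table are made up for this example, and the figures are copied by hand
from the benchmark output.

```python
def improvement_pct(before_mbps, after_mbps):
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after_mbps - before_mbps) / before_mbps * 100.0


# MB/s for 16-, 256-, and 8192-byte inputs, copied from the x86-64
# "before" and "after" runs in this message (hypothetical table name).
x86_64 = {
    "SHA-1": ([77.0, 361.9, 457.5], [87.0, 407.0, 577.0]),
    "SHA-256": ([40.6, 154.1, 202.7], [47.3, 189.3, 249.1]),
    "SHA-512": ([28.3, 163.2, 255.6], [37.6, 224.5, 351.4]),
}

for name, (before, after) in x86_64.items():
    gains = [improvement_pct(b, a) for b, a in zip(before, after)]
    # Report the smallest and largest gain across the three input sizes.
    print(f"{name}: {min(gains):.1f}% to {max(gains):.1f}%")
```

Substituting the x86 figures gives the x86 ranges in the same way. The results
agree with the quoted ranges to within rounding.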

Change-Id: I952a3b4fc4c52ebb50690da3b8c97770e8342e98
Reviewed-on: https://boringssl-review.googlesource.com/6470
Reviewed-by: Adam Langley <agl@google.com>
diff --git a/crypto/sha/asm/sha1-586.pl b/crypto/sha/asm/sha1-586.pl
index 09fd3fc..3514273 100644
--- a/crypto/sha/asm/sha1-586.pl
+++ b/crypto/sha/asm/sha1-586.pl
@@ -121,9 +121,7 @@
 # In upstream, this is controlled by shelling out to the compiler to check
 # versions, but BoringSSL is intended to be used with pre-generated perlasm
 # output, so this isn't useful anyway.
-#
-# TODO(davidben): Enable this after testing. $ymm goes up to 1.
-$ymm = 0;
+$ymm = 1;
 
 $ymm = 0 unless ($xmm);
 
diff --git a/crypto/sha/asm/sha1-x86_64.pl b/crypto/sha/asm/sha1-x86_64.pl
index 59b1607..4895f92 100644
--- a/crypto/sha/asm/sha1-x86_64.pl
+++ b/crypto/sha/asm/sha1-x86_64.pl
@@ -96,8 +96,10 @@
 # versions, but BoringSSL is intended to be used with pre-generated perlasm
 # output, so this isn't useful anyway.
 #
-# TODO(davidben): Enable this after testing. $avx goes up to 2.
-$avx = 0;
+# TODO(davidben): Enable AVX2 code after testing by setting $avx to 2. Is it
+# necessary to disable AVX2 code when SHA Extensions code is disabled? Upstream
+# did not tie them together until after $shaext was added.
+$avx = 1;
 
 # TODO(davidben): Consider enabling the Intel SHA Extensions code once it's
 # been tested.
diff --git a/crypto/sha/asm/sha256-586.pl b/crypto/sha/asm/sha256-586.pl
index 1866d5a..fa8f264 100644
--- a/crypto/sha/asm/sha256-586.pl
+++ b/crypto/sha/asm/sha256-586.pl
@@ -72,8 +72,8 @@
 # versions, but BoringSSL is intended to be used with pre-generated perlasm
 # output, so this isn't useful anyway.
 #
-# TODO(davidben): Enable this after testing. $avx goes up to 2.
-$avx = 0;
+# TODO(davidben): Enable AVX2 code after testing by setting $avx to 2.
+$avx = 1;
 
 $avx = 0 unless ($xmm);
 
diff --git a/crypto/sha/asm/sha512-x86_64.pl b/crypto/sha/asm/sha512-x86_64.pl
index 9a0d0c4..2bc33c6 100644
--- a/crypto/sha/asm/sha512-x86_64.pl
+++ b/crypto/sha/asm/sha512-x86_64.pl
@@ -113,8 +113,10 @@
 # versions, but BoringSSL is intended to be used with pre-generated perlasm
 # output, so this isn't useful anyway.
 #
-# TODO(davidben): Enable this after testing. $avx goes up to 2.
-$avx = 0;
+# TODO(davidben): Enable AVX2 code after testing by setting $avx to 2. Is it
+# necessary to disable AVX2 code when SHA Extensions code is disabled? Upstream
+# did not tie them together until after $shaext was added.
+$avx = 1;
 
 # TODO(davidben): Consider enabling the Intel SHA Extensions code once it's
 # been tested.