Add prefetch to sha1_block_data_order_shaext

Similar idea to https://boringssl-review.googlesource.com/c/boringssl/+/55466

In the common case, results are close to the current state,
e.g. tool speed goes from
Did 74000 SHA-1 (16384 bytes) operations in 1004094us (73698.3 ops/sec): 1207.5 MB/s
to
Did 75000 SHA-1 (16384 bytes) operations in 1004022us (74699.6 ops/sec): 1223.9 MB/s

But on AMD with prefetchers disabled and a data size large enough
to force cache misses, this gives a ~3x improvement:
name              old time/op  new time/op  delta
BM_SHA1Hash/2      141ns ± 1%   143ns ± 2%     ~     (p=0.421 n=5+5)
BM_SHA1Hash/4      143ns ± 2%   143ns ± 3%     ~     (p=0.841 n=5+5)
BM_SHA1Hash/8      141ns ± 1%   141ns ± 2%     ~     (p=1.000 n=5+5)
BM_SHA1Hash/16     141ns ± 1%   141ns ± 1%     ~     (p=0.841 n=5+5)
BM_SHA1Hash/32     143ns ± 2%   143ns ± 1%     ~     (p=0.690 n=5+5)
BM_SHA1Hash/64     178ns ± 1%   179ns ± 1%     ~     (p=0.151 n=5+5)
BM_SHA1Hash/512    454ns ± 1%   454ns ± 1%     ~     (p=0.841 n=5+5)
BM_SHA1Hash/4k    2.66µs ± 1%  2.65µs ± 1%     ~     (p=1.000 n=5+5)
BM_SHA1Hash/32k   20.3µs ± 1%  20.3µs ± 2%     ~     (p=1.000 n=5+5)
BM_SHA1Hash/256k   162µs ± 1%   161µs ± 1%     ~     (p=0.548 n=5+5)
BM_SHA1Hash/1M     644µs ± 1%   645µs ± 1%     ~     (p=0.841 n=5+5)
BM_SHA1Hash/2M    1.29ms ± 1%  1.29ms ± 2%     ~     (p=0.690 n=5+5)
BM_SHA1Hash/4M    2.58ms ± 1%  2.58ms ± 1%     ~     (p=0.841 n=5+5)
BM_SHA1Hash/8M    5.14ms ± 0%  5.15ms ± 1%     ~     (p=0.286 n=4+5)
BM_SHA1Hash/16M   11.4ms ± 3%  10.3ms ± 1%   -9.04%  (p=0.016 n=4+5)
BM_SHA1Hash/128M   249ms ± 0%    83ms ± 1%  -66.73%  (p=0.008 n=5+5)
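
For illustration only, the same idea can be sketched in C: while processing
one 64-byte block, issue a prefetch hint for the data 512 bytes ahead. This is
not the code in this change (which is the single prefetcht0 in the perlasm
below). toy_process_block and toy_hash_blocks are hypothetical names, the
per-block work is a stand-in rather than SHA-1, and __builtin_prefetch assumes
GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64         /* SHA-1 block size in bytes */
#define PREFETCH_DISTANCE 512 /* same offset as the prefetcht0 in this change */

/* Hypothetical stand-in for the per-block compression work (not SHA-1). */
static uint32_t toy_process_block(uint32_t state, const uint8_t *block) {
  for (size_t i = 0; i < BLOCK_SIZE; i++) {
    state = state * 31u + block[i];
  }
  return state;
}

/* Processes num_blocks consecutive 64-byte blocks starting at in. */
uint32_t toy_hash_blocks(const uint8_t *in, size_t num_blocks) {
  uint32_t state = 0;
  while (num_blocks--) {
    /* Read prefetch with the highest temporal locality. The address is
     * computed as an integer so it is never dereferenced by C code; the
     * prefetch itself is only a hint and does not fault near the end of
     * the buffer. */
    __builtin_prefetch((const void *)((uintptr_t)in + PREFETCH_DISTANCE), 0, 3);
    state = toy_process_block(state, in);
    in += BLOCK_SIZE;
  }
  return state;
}
```

The prefetch does not change the result, only when the data reaches the
cache, which fits the table above: no change at small sizes and a gain only
once the input no longer fits in cache.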

Change-Id: I7cae746b6d8a705d6bf2d5c5df6a2dca6d44791a
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/57826
Commit-Queue: Adam Langley <agl@google.com>
Reviewed-by: Adam Langley <agl@google.com>
diff --git a/crypto/fipsmodule/sha/asm/sha1-x86_64.pl b/crypto/fipsmodule/sha/asm/sha1-x86_64.pl
index 6ee7887..d9afacb 100755
--- a/crypto/fipsmodule/sha/asm/sha1-x86_64.pl
+++ b/crypto/fipsmodule/sha/asm/sha1-x86_64.pl
@@ -389,6 +389,7 @@
 	lea		0x40($inp),%r8		# next input block
 	paddd		@MSG[0],$E
 	cmovne		%r8,$inp
+	prefetcht0	512($inp)
 	movdqa		$ABCD,$ABCD_SAVE	# offload $ABCD
 ___
 for($i=0;$i<20-4;$i+=2) {