Add prefetch to aes_hw_ctr32_encrypt_blocks

Similar idea to https://boringssl-review.googlesource.com/c/boringssl/+/55466

Results are close to the current state, with small improvements at
larger sizes. Columns are benchmark, time before, time after, and delta.

AMD (Rome):
BM_Encrypt/64/0          344ns ± 3%   343ns ± 1%    ~     (p=0.728 n=20+19)
BM_Encrypt/64/1          394ns ± 2%   394ns ± 3%    ~     (p=0.919 n=18+20)
BM_Encrypt/64/8          391ns ± 1%   390ns ± 2%    ~     (p=0.165 n=17+19)
BM_Encrypt/64/64         342ns ± 3%   341ns ± 2%    ~     (p=0.686 n=19+19)
BM_Encrypt/64/97         393ns ± 1%   394ns ± 3%    ~     (p=0.639 n=17+19)
BM_Encrypt/512/0         437ns ± 2%   437ns ± 1%    ~     (p=0.819 n=20+19)
BM_Encrypt/512/1         566ns ± 1%   551ns ± 3%  -2.65%  (p=0.000 n=18+18)
BM_Encrypt/512/8         563ns ± 2%   555ns ± 4%  -1.48%  (p=0.003 n=18+20)
BM_Encrypt/512/64        434ns ± 3%   439ns ± 3%  +1.03%  (p=0.008 n=19+20)
BM_Encrypt/512/97        565ns ± 2%   555ns ± 4%  -1.88%  (p=0.001 n=18+20)
BM_Encrypt/4k/0         1.03µs ± 2%  0.99µs ± 2%  -4.29%  (p=0.000 n=20+20)
BM_Encrypt/4k/1         1.18µs ± 3%  1.11µs ± 3%  -5.66%  (p=0.000 n=20+20)
BM_Encrypt/4k/8         1.17µs ± 3%  1.11µs ± 2%  -5.51%  (p=0.000 n=20+20)
BM_Encrypt/4k/64        1.03µs ± 1%  0.99µs ± 1%  -4.08%  (p=0.000 n=19+19)
BM_Encrypt/4k/97        1.17µs ± 3%  1.11µs ± 2%  -5.65%  (p=0.000 n=20+19)
BM_Encrypt/32k/0        5.26µs ± 1%  5.19µs ± 2%  -1.29%  (p=0.000 n=19+20)
BM_Encrypt/32k/1        5.49µs ± 2%  5.38µs ± 1%  -2.01%  (p=0.000 n=20+20)
BM_Encrypt/32k/8        5.45µs ± 2%  5.34µs ± 1%  -2.12%  (p=0.000 n=20+19)
BM_Encrypt/32k/64       5.28µs ± 1%  5.19µs ± 1%  -1.66%  (p=0.000 n=19+20)
BM_Encrypt/32k/97       5.49µs ± 1%  5.38µs ± 1%  -2.02%  (p=0.000 n=20+17)
BM_Encrypt/256k/0       38.9µs ± 1%  38.5µs ± 2%  -1.09%  (p=0.000 n=20+20)
BM_Encrypt/256k/1       40.3µs ± 2%  39.6µs ± 1%  -1.74%  (p=0.000 n=20+20)
BM_Encrypt/256k/8       39.7µs ± 2%  39.0µs ± 1%  -1.82%  (p=0.000 n=19+18)
BM_Encrypt/256k/64      38.9µs ± 1%  38.4µs ± 1%  -1.35%  (p=0.000 n=20+18)
BM_Encrypt/256k/97      40.1µs ± 1%  39.6µs ± 1%  -1.32%  (p=0.000 n=20+20)
BM_Encrypt/1M/0          154µs ± 1%   153µs ± 1%  -0.62%  (p=0.001 n=17+18)
BM_Encrypt/1M/1          160µs ± 2%   158µs ± 1%  -1.44%  (p=0.000 n=19+20)
BM_Encrypt/1M/8          158µs ± 1%   155µs ± 1%  -1.62%  (p=0.000 n=20+19)
BM_Encrypt/1M/64         155µs ± 2%   153µs ± 1%  -1.48%  (p=0.000 n=20+20)
BM_Encrypt/1M/97         160µs ± 1%   158µs ± 2%  -1.46%  (p=0.000 n=20+20)
BM_EncryptCord/1/0       310ns ± 3%   307ns ± 4%    ~     (p=0.101 n=19+20)

Intel (Skylake):

BM_Encrypt/64/0          326ns ± 5%   325ns ± 4%    ~     (p=0.817 n=16+17)
BM_Encrypt/64/1          368ns ± 2%   387ns ±13%    ~     (p=0.845 n=17+20)
BM_Encrypt/64/8          385ns ±14%   365ns ± 3%  -5.12%  (p=0.013 n=20+18)
BM_Encrypt/64/64         325ns ± 4%   325ns ± 6%    ~     (p=0.621 n=18+16)
BM_Encrypt/64/97         367ns ± 3%   366ns ± 3%    ~     (p=0.963 n=18+18)
BM_Encrypt/512/0         504ns ± 4%   456ns ± 3%  -9.52%  (p=0.000 n=17+20)
BM_Encrypt/512/1         568ns ± 2%   528ns ± 4%  -7.09%  (p=0.000 n=15+17)
BM_Encrypt/512/8         580ns ± 3%   541ns ± 4%  -6.66%  (p=0.000 n=20+17)
BM_Encrypt/512/64        500ns ± 3%   454ns ± 4%  -9.26%  (p=0.000 n=17+17)
BM_Encrypt/512/97        564ns ± 2%   526ns ± 4%  -6.82%  (p=0.000 n=18+17)
BM_Encrypt/4k/0         1.26µs ± 2%  1.23µs ± 5%  -2.77%  (p=0.000 n=19+18)
BM_Encrypt/4k/1         1.33µs ± 2%  1.28µs ± 3%  -4.34%  (p=0.000 n=18+18)
BM_Encrypt/4k/8         1.35µs ± 3%  1.29µs ± 3%  -4.31%  (p=0.000 n=19+17)
BM_Encrypt/4k/64        1.27µs ± 3%  1.23µs ± 4%  -3.32%  (p=0.000 n=18+18)
BM_Encrypt/4k/97        1.34µs ± 3%  1.29µs ± 3%  -3.98%  (p=0.000 n=18+16)
BM_Encrypt/32k/0        8.24µs ± 4%  7.99µs ± 5%  -3.00%  (p=0.001 n=17+16)
BM_Encrypt/32k/1        8.23µs ± 2%  7.99µs ± 5%  -2.95%  (p=0.000 n=17+16)
BM_Encrypt/32k/8        8.64µs ±15%  8.05µs ± 5%  -6.92%  (p=0.000 n=20+18)
BM_Encrypt/32k/64       8.14µs ± 3%  7.96µs ± 3%  -2.23%  (p=0.000 n=18+17)
BM_Encrypt/32k/97       8.72µs ±14%  8.01µs ± 4%  -8.20%  (p=0.000 n=20+17)
BM_Encrypt/256k/0       63.2µs ± 4%  61.7µs ± 3%  -2.35%  (p=0.003 n=19+18)
BM_Encrypt/256k/1       63.5µs ± 4%  61.8µs ± 3%  -2.75%  (p=0.000 n=17+19)
BM_Encrypt/256k/8       63.6µs ± 9%  61.0µs ± 1%  -4.08%  (p=0.000 n=18+16)
BM_Encrypt/256k/64      63.1µs ± 3%  61.5µs ± 5%  -2.60%  (p=0.001 n=18+16)
BM_Encrypt/256k/97      65.6µs ±16%  61.6µs ± 4%  -6.09%  (p=0.000 n=19+17)
BM_Encrypt/1M/0          253µs ± 5%   246µs ± 5%  -2.88%  (p=0.001 n=19+19)
BM_Encrypt/1M/1          253µs ± 6%   244µs ± 1%  -3.71%  (p=0.000 n=16+17)
BM_Encrypt/1M/8          254µs ± 5%   244µs ± 3%  -4.15%  (p=0.000 n=18+18)
BM_Encrypt/1M/64         253µs ± 4%   245µs ± 4%  -3.10%  (p=0.000 n=19+19)
BM_Encrypt/1M/97         267µs ±14%   246µs ± 4%  -8.13%  (p=0.000 n=20+18)

But on AMD with prefetchers disabled and a data size large enough to
force cache misses, this gives a >2x improvement:
BM_Encrypt/64/0          342ns ± 1%   336ns ± 1%   -1.63%  (p=0.000 n=19+19)
BM_Encrypt/64/1          485ns ± 2%   484ns ± 2%     ~     (p=0.396 n=19+20)
BM_Encrypt/64/8          490ns ± 1%   488ns ± 2%     ~     (p=0.098 n=18+19)
BM_Encrypt/64/64         340ns ± 2%   335ns ± 1%   -1.50%  (p=0.000 n=19+19)
BM_Encrypt/64/97         483ns ± 1%   483ns ± 1%     ~     (p=0.912 n=16+20)
BM_Encrypt/512/0         566ns ± 3%   521ns ± 2%   -7.99%  (p=0.000 n=18+20)
BM_Encrypt/512/1         744ns ± 2%   667ns ± 1%  -10.31%  (p=0.000 n=20+20)
BM_Encrypt/512/8         745ns ± 1%   666ns ± 1%  -10.53%  (p=0.000 n=18+20)
BM_Encrypt/512/64        566ns ± 3%   520ns ± 2%   -8.05%  (p=0.000 n=17+19)
BM_Encrypt/512/97        740ns ± 1%   666ns ± 1%   -9.92%  (p=0.000 n=18+19)
BM_Encrypt/4k/0         2.50µs ± 1%  1.35µs ± 1%  -45.82%  (p=0.000 n=19+19)
BM_Encrypt/4k/1         2.65µs ± 3%  1.50µs ± 1%  -43.50%  (p=0.000 n=19+19)
BM_Encrypt/4k/8         2.66µs ± 1%  1.49µs ± 1%  -43.71%  (p=0.000 n=19+19)
BM_Encrypt/4k/64        2.47µs ± 4%  1.36µs ± 1%  -45.05%  (p=0.000 n=20+20)
BM_Encrypt/4k/97        2.66µs ± 1%  1.50µs ± 2%  -43.54%  (p=0.000 n=18+19)
BM_Encrypt/32k/0        18.0µs ± 1%   8.0µs ± 1%  -55.38%  (p=0.000 n=18+19)
BM_Encrypt/32k/1        18.2µs ± 1%   8.2µs ± 1%  -54.91%  (p=0.000 n=14+20)
BM_Encrypt/32k/8        18.2µs ± 1%   8.2µs ± 1%  -54.93%  (p=0.000 n=19+18)
BM_Encrypt/32k/64       18.0µs ± 1%   8.0µs ± 1%  -55.35%  (p=0.000 n=16+20)
BM_Encrypt/32k/97       18.1µs ± 3%   8.2µs ± 1%  -54.84%  (p=0.000 n=20+19)
BM_Encrypt/256k/0        148µs ± 1%    63µs ± 1%  -57.59%  (p=0.000 n=18+19)
BM_Encrypt/256k/1        150µs ± 1%    63µs ± 1%  -57.78%  (p=0.000 n=16+20)
BM_Encrypt/256k/8        147µs ± 5%    63µs ± 1%  -56.95%  (p=0.000 n=20+20)
BM_Encrypt/256k/64       148µs ± 2%    63µs ± 1%  -57.40%  (p=0.000 n=18+20)
BM_Encrypt/256k/97       146µs ± 4%    63µs ± 1%  -56.82%  (p=0.000 n=20+19)
BM_Encrypt/1M/0          595µs ± 1%   254µs ± 1%  -57.33%  (p=0.000 n=19+20)
BM_Encrypt/1M/1          590µs ± 4%   255µs ± 1%  -56.78%  (p=0.000 n=20+20)
BM_Encrypt/1M/8          593µs ± 2%   254µs ± 1%  -57.10%  (p=0.000 n=18+19)
BM_Encrypt/1M/64         595µs ± 1%   254µs ± 1%  -57.34%  (p=0.000 n=16+19)
BM_Encrypt/1M/97         589µs ± 4%   255µs ± 1%  -56.74%  (p=0.000 n=20+20)
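
For readers less familiar with software prefetch, the idea can be
sketched in C. This is only an illustration, not the BoringSSL code:
xor_blocks_with_prefetch is a hypothetical stand-in for the 8-block
AES-CTR loop, and __builtin_prefetch is a GCC/Clang builtin, assumed
here to lower to prefetcht0 on x86-64 at locality 3. The look-ahead
offsets mirror the 0x1c0 and 0x200 used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed constants for this sketch: 8 blocks of 16 bytes per iteration,
 * 64-byte cache lines, and the same look-ahead offset as the patch. */
enum { kIterBytes = 128, kCacheLine = 64, kPrefetchAhead = 0x1c0 };

/* Hypothetical stand-in for the bulk loop: XORs every byte with `mask`,
 * consuming 128 bytes per iteration and hinting two input cache lines
 * ahead of the current position. */
static void xor_blocks_with_prefetch(const uint8_t *in, uint8_t *out,
                                     size_t len, uint8_t mask) {
  assert(in != NULL && out != NULL);
  size_t i = 0;
  for (; i + kIterBytes <= len; i += kIterBytes) {
    /* Integer arithmetic avoids forming an out-of-bounds pointer; the
     * prefetch itself is only a hint and does not fault. */
    uintptr_t ahead = (uintptr_t)in + i + kPrefetchAhead;
    __builtin_prefetch((const void *)ahead, 0, 3);
    __builtin_prefetch((const void *)(ahead + kCacheLine), 0, 3);
    for (size_t j = 0; j < kIterBytes; j++) {
      out[i + j] = in[i + j] ^ mask;
    }
  }
  /* Tail shorter than one full iteration, no prefetch needed. */
  for (; i < len; i++) {
    out[i] = in[i] ^ mask;
  }
}
```

Two hints are issued per iteration because one 128-byte iteration spans
two 64-byte lines, so a single hint would cover only half of the input
consumed by a later iteration.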

Change-Id: I13c783ad261093009b2aa5ff56ce569f45ed3300
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60027
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: David Benjamin <davidben@google.com>
diff --git a/crypto/fipsmodule/aes/asm/aesni-x86_64.pl b/crypto/fipsmodule/aes/asm/aesni-x86_64.pl
index 9a90946..215611f 100644
--- a/crypto/fipsmodule/aes/asm/aesni-x86_64.pl
+++ b/crypto/fipsmodule/aes/asm/aesni-x86_64.pl
@@ -1524,6 +1524,8 @@
 	pxor		$rndkey0,$in3
 	movdqu		0x50($inp),$in5
 	pxor		$rndkey0,$in4
+	prefetcht0	0x1c0($inp)	# Each iteration processes 128 bytes (8*16), so prefetching
+	prefetcht0	0x200($inp)	# one iteration means prefetching two 64-byte cache lines
 	pxor		$rndkey0,$in5
 	aesenc		$rndkey1,$inout0
 	aesenc		$rndkey1,$inout1