Manually unroll pi and rho steps in Keccak

We have effectively been relying on Clang (on x86_64) to unroll this loop
for us, but other compiler/architecture combinations don't unroll it as
readily. I originally did this to play with the 32-bit bit interleaving
trick, which requires the unrolled form, but it turns out to be a
significant win on its own.
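
As a rough illustration (a standalone sketch, not the code in this
change), the table-driven loop and the unrolled macro sequence should
produce identical state. The helper names rotl64, pi_rho_loop,
pi_rho_unrolled and pi_rho_variants_agree are hypothetical; rotl64 is a
stand-in for CRYPTO_rotl_u64 and assumes 0 < shift < 64, which holds for
every rotation used here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for CRYPTO_rotl_u64. Only valid for 0 < shift < 64.
static uint64_t rotl64(uint64_t v, int shift) {
  return (v << shift) | (v >> (64 - shift));
}

// Table-driven form, as it was before this change.
static void pi_rho_loop(uint64_t state[25]) {
  static const uint8_t kIndexes[24] = {10, 7,  11, 17, 18, 3,  5,  16,
                                       8,  21, 24, 4,  15, 23, 19, 13,
                                       12, 2,  20, 14, 22, 9,  6,  1};
  static const uint8_t kRotations[24] = {1,  3,  6,  10, 15, 21, 28, 36,
                                         45, 55, 2,  14, 27, 41, 56, 8,
                                         25, 43, 62, 18, 39, 61, 20, 44};
  uint64_t prev_value = state[1];
  for (int i = 0; i < 24; i++) {
    const uint64_t value = rotl64(prev_value, kRotations[i]);
    const size_t index = kIndexes[i];
    prev_value = state[index];
    state[index] = value;
  }
}

// Unrolled form: every index and rotation is a compile-time constant.
static void pi_rho_unrolled(uint64_t state[25]) {
  uint64_t prev_value = state[1];
#define PI_RHO_STEP(index, rotation)                     \
  do {                                                   \
    const uint64_t value = rotl64(prev_value, rotation); \
    prev_value = state[index];                           \
    state[index] = value;                                \
  } while (0)

  PI_RHO_STEP(10, 1);
  PI_RHO_STEP(7, 3);
  PI_RHO_STEP(11, 6);
  PI_RHO_STEP(17, 10);
  PI_RHO_STEP(18, 15);
  PI_RHO_STEP(3, 21);
  PI_RHO_STEP(5, 28);
  PI_RHO_STEP(16, 36);
  PI_RHO_STEP(8, 45);
  PI_RHO_STEP(21, 55);
  PI_RHO_STEP(24, 2);
  PI_RHO_STEP(4, 14);
  PI_RHO_STEP(15, 27);
  PI_RHO_STEP(23, 41);
  PI_RHO_STEP(19, 56);
  PI_RHO_STEP(13, 8);
  PI_RHO_STEP(12, 25);
  PI_RHO_STEP(2, 43);
  PI_RHO_STEP(20, 62);
  PI_RHO_STEP(14, 18);
  PI_RHO_STEP(22, 39);
  PI_RHO_STEP(9, 61);
  PI_RHO_STEP(6, 20);
  PI_RHO_STEP(1, 44);

#undef PI_RHO_STEP
}

// Returns 1 if both forms agree on an arbitrary sample state, else 0.
static int pi_rho_variants_agree(void) {
  uint64_t a[25], b[25];
  for (int i = 0; i < 25; i++) {
    a[i] = 0x0123456789abcdefULL * (uint64_t)(i + 1);
    b[i] = a[i];
  }
  pi_rho_loop(a);
  pi_rho_unrolled(b);
  return memcmp(a, b, sizeof(a)) == 0;
}
```

With constant indices and rotations, the compiler no longer needs to
load from the two lookup tables or prove the loop is worth unrolling.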

Clang (NDK), aarch32, Pixel 5A
Before:
Did 4836 Kyber generate + decap operations in 2036618us (2374.5 ops/sec)
Did 6237 Kyber parse + encap operations in 2055784us (3033.9 ops/sec)
After:
Did 8610 Kyber generate + decap operations in 2051048us (4197.9 ops/sec) [+76.8%]
Did 12138 Kyber parse + encap operations in 2042363us (5943.1 ops/sec) [+95.9%]

Clang (NDK), aarch64, Pixel 5A
Before:
Did 16720 Kyber generate + decap operations in 2011039us (8314.1 ops/sec)
Did 30000 Kyber parse + encap operations in 2023170us (14828.2 ops/sec)
After:
Did 17080 Kyber generate + decap operations in 2005310us (8517.4 ops/sec) [+2.4%]
Did 31000 Kyber parse + encap operations in 2059104us (15055.1 ops/sec) [+1.5%]

GCC, x86_64
Before:
Did 14535 Kyber generate + decap operations in 2015051us (7213.2 ops/sec)
Did 21000 Kyber parse + encap operations in 2017842us (10407.2 ops/sec)
After:
Did 19900 Kyber generate + decap operations in 2016747us (9867.4 ops/sec) [+36.8%]
Did 34000 Kyber parse + encap operations in 2059643us (16507.7 ops/sec) [+58.6%]

Clang, x86_64
Before:
Did 19584 Kyber generate + decap operations in 2006839us (9758.6 ops/sec)
Did 34000 Kyber parse + encap operations in 2050513us (16581.2 ops/sec)
After:
Did 19928 Kyber generate + decap operations in 2020249us (9864.1 ops/sec) [+1.1%]
Did 34000 Kyber parse + encap operations in 2052970us (16561.4 ops/sec) [-0.1%]

Change-Id: Iee9315667c1d2044785faa9370815a3c7555c259
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/63992
Auto-Submit: David Benjamin <davidben@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
Commit-Queue: Adam Langley <agl@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
diff --git a/crypto/keccak/keccak.c b/crypto/keccak/keccak.c
index 7ab8edc..15939ce 100644
--- a/crypto/keccak/keccak.c
+++ b/crypto/keccak/keccak.c
@@ -56,19 +56,40 @@
     // and the sequence will repeat. All that remains is to handle the element
     // at (0, 0), but the rotation for that element is zero, and it goes to (0,
     // 0), so we can ignore it.
-    static const uint8_t kIndexes[24] = {10, 7,  11, 17, 18, 3,  5,  16,
-                                         8,  21, 24, 4,  15, 23, 19, 13,
-                                         12, 2,  20, 14, 22, 9,  6,  1};
-    static const uint8_t kRotations[24] = {1,  3,  6,  10, 15, 21, 28, 36,
-                                           45, 55, 2,  14, 27, 41, 56, 8,
-                                           25, 43, 62, 18, 39, 61, 20, 44};
     uint64_t prev_value = state[1];
-    for (int i = 0; i < 24; i++) {
-      const uint64_t value = CRYPTO_rotl_u64(prev_value, kRotations[i]);
-      const size_t index = kIndexes[i];
-      prev_value = state[index];
-      state[index] = value;
-    }
+#define PI_RHO_STEP(index, rotation)                              \
+  do {                                                            \
+    const uint64_t value = CRYPTO_rotl_u64(prev_value, rotation); \
+    prev_value = state[index];                                    \
+    state[index] = value;                                         \
+  } while (0)
+
+    PI_RHO_STEP(10, 1);
+    PI_RHO_STEP(7, 3);
+    PI_RHO_STEP(11, 6);
+    PI_RHO_STEP(17, 10);
+    PI_RHO_STEP(18, 15);
+    PI_RHO_STEP(3, 21);
+    PI_RHO_STEP(5, 28);
+    PI_RHO_STEP(16, 36);
+    PI_RHO_STEP(8, 45);
+    PI_RHO_STEP(21, 55);
+    PI_RHO_STEP(24, 2);
+    PI_RHO_STEP(4, 14);
+    PI_RHO_STEP(15, 27);
+    PI_RHO_STEP(23, 41);
+    PI_RHO_STEP(19, 56);
+    PI_RHO_STEP(13, 8);
+    PI_RHO_STEP(12, 25);
+    PI_RHO_STEP(2, 43);
+    PI_RHO_STEP(20, 62);
+    PI_RHO_STEP(14, 18);
+    PI_RHO_STEP(22, 39);
+    PI_RHO_STEP(9, 61);
+    PI_RHO_STEP(6, 20);
+    PI_RHO_STEP(1, 44);
+
+#undef PI_RHO_STEP
 
     // χ step
     for (int y = 0; y < 5; y++) {