Manually unroll pi and rho steps in Keccak

We've been effectively relying on Clang (on x86_64) to unroll this loop
for us. But other compiler/architecture combinations don't unroll it as
readily. I originally did this to play with the 32-bit bit interleaving
trick, which requires this, but actually it's a significant win on its
own.

Clang (NDK), aarch32, Pixel 5A
Before:
Did 4836 Kyber generate + decap operations in 2036618us (2374.5 ops/sec)
Did 6237 Kyber parse + encap operations in 2055784us (3033.9 ops/sec)
After:
Did 8610 Kyber generate + decap operations in 2051048us (4197.9 ops/sec) [+76.8%]
Did 12138 Kyber parse + encap operations in 2042363us (5943.1 ops/sec) [+95.9%]

Clang (NDK), aarch64, Pixel 5A
Before:
Did 16720 Kyber generate + decap operations in 2011039us (8314.1 ops/sec)
Did 30000 Kyber parse + encap operations in 2023170us (14828.2 ops/sec)
After:
Did 17080 Kyber generate + decap operations in 2005310us (8517.4 ops/sec) [+2.4%]
Did 31000 Kyber parse + encap operations in 2059104us (15055.1 ops/sec) [+1.5%]

GCC, x86_64
Before:
Did 14535 Kyber generate + decap operations in 2015051us (7213.2 ops/sec)
Did 21000 Kyber parse + encap operations in 2017842us (10407.2 ops/sec)
After:
Did 19900 Kyber generate + decap operations in 2016747us (9867.4 ops/sec) [+36.8%]
Did 34000 Kyber parse + encap operations in 2059643us (16507.7 ops/sec) [+58.6%]

Clang, x86_64
Before:
Did 19584 Kyber generate + decap operations in 2006839us (9758.6 ops/sec)
Did 34000 Kyber parse + encap operations in 2050513us (16581.2 ops/sec)
After:
Did 19928 Kyber generate + decap operations in 2020249us (9864.1 ops/sec) [+1.1%]
Did 34000 Kyber parse + encap operations in 2052970us (16561.4 ops/sec) [-0.1%]

Change-Id: Iee9315667c1d2044785faa9370815a3c7555c259
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/63992
Auto-Submit: David Benjamin <davidben@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
Commit-Queue: Adam Langley <agl@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
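
The transformation described in this message can be sketched outside the tree as follows. This is a minimal, standalone illustration and not the code in crypto/keccak/keccak.c: `rotl64`, `pi_rho_loop`, `pi_rho_unrolled`, `pi_rho_variants_agree`, and `pi_rho_self_check` are hypothetical names introduced here, and `rotl64` merely stands in for `CRYPTO_rotl_u64`. The index and rotation constants are copied from the diff. The sketch runs the table-driven loop and the unrolled macro sequence on the same input and compares the resulting states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for CRYPTO_rotl_u64. Only valid for 0 < shift < 64,
// which covers every rotation amount used below.
static uint64_t rotl64(uint64_t value, int shift) {
  return (value << shift) | (value >> (64 - shift));
}

// Table-driven form, mirroring the loop that the change removes.
void pi_rho_loop(uint64_t state[25]) {
  static const uint8_t kIndexes[24] = {10, 7,  11, 17, 18, 3,  5,  16,
                                       8,  21, 24, 4,  15, 23, 19, 13,
                                       12, 2,  20, 14, 22, 9,  6,  1};
  static const uint8_t kRotations[24] = {1,  3,  6,  10, 15, 21, 28, 36,
                                         45, 55, 2,  14, 27, 41, 56, 8,
                                         25, 43, 62, 18, 39, 61, 20, 44};
  uint64_t prev_value = state[1];
  for (int i = 0; i < 24; i++) {
    const uint64_t value = rotl64(prev_value, kRotations[i]);
    const size_t index = kIndexes[i];
    prev_value = state[index];
    state[index] = value;
  }
}

// Unrolled form: every index and rotation is a compile-time constant, so no
// table lookups or loop counter remain for the compiler to optimize away.
void pi_rho_unrolled(uint64_t state[25]) {
  uint64_t prev_value = state[1];
#define PI_RHO_STEP(index, rotation)                     \
  do {                                                   \
    const uint64_t value = rotl64(prev_value, rotation); \
    prev_value = state[index];                           \
    state[index] = value;                                \
  } while (0)

  PI_RHO_STEP(10, 1);  PI_RHO_STEP(7, 3);   PI_RHO_STEP(11, 6);
  PI_RHO_STEP(17, 10); PI_RHO_STEP(18, 15); PI_RHO_STEP(3, 21);
  PI_RHO_STEP(5, 28);  PI_RHO_STEP(16, 36); PI_RHO_STEP(8, 45);
  PI_RHO_STEP(21, 55); PI_RHO_STEP(24, 2);  PI_RHO_STEP(4, 14);
  PI_RHO_STEP(15, 27); PI_RHO_STEP(23, 41); PI_RHO_STEP(19, 56);
  PI_RHO_STEP(13, 8);  PI_RHO_STEP(12, 25); PI_RHO_STEP(2, 43);
  PI_RHO_STEP(20, 62); PI_RHO_STEP(14, 18); PI_RHO_STEP(22, 39);
  PI_RHO_STEP(9, 61);  PI_RHO_STEP(6, 20);  PI_RHO_STEP(1, 44);

#undef PI_RHO_STEP
}

// Returns 1 if both variants leave an arbitrary test state identical.
int pi_rho_variants_agree(void) {
  uint64_t a[25], b[25];
  for (int i = 0; i < 25; i++) {
    // Arbitrary distinct lane values; unsigned multiplication wraps mod 2^64.
    a[i] = b[i] = 0x0123456789abcdefULL * (uint64_t)(i + 1);
  }
  pi_rho_loop(a);
  pi_rho_unrolled(b);
  return memcmp(a, b, sizeof(a)) == 0;
}

// Convenience wrapper that aborts if the two variants ever disagree.
void pi_rho_self_check(void) { assert(pi_rho_variants_agree()); }
```

Calling `pi_rho_self_check()` from a test harness is enough to confirm the two forms are interchangeable. It says nothing about speed, which depends on the compiler and target as the benchmark figures above show.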
diff --git a/crypto/keccak/keccak.c b/crypto/keccak/keccak.c
index 7ab8edc..15939ce 100644
--- a/crypto/keccak/keccak.c
+++ b/crypto/keccak/keccak.c
@@ -56,19 +56,40 @@
     // and the sequence will repeat. All that remains is to handle the element
     // at (0, 0), but the rotation for that element is zero, and it goes to (0,
     // 0), so we can ignore it.
-    static const uint8_t kIndexes[24] = {10, 7, 11, 17, 18, 3, 5, 16,
-                                         8, 21, 24, 4, 15, 23, 19, 13,
-                                         12, 2, 20, 14, 22, 9, 6, 1};
-    static const uint8_t kRotations[24] = {1, 3, 6, 10, 15, 21, 28, 36,
-                                           45, 55, 2, 14, 27, 41, 56, 8,
-                                           25, 43, 62, 18, 39, 61, 20, 44};
     uint64_t prev_value = state[1];
-    for (int i = 0; i < 24; i++) {
-      const uint64_t value = CRYPTO_rotl_u64(prev_value, kRotations[i]);
-      const size_t index = kIndexes[i];
-      prev_value = state[index];
-      state[index] = value;
-    }
+#define PI_RHO_STEP(index, rotation)                              \
+  do {                                                            \
+    const uint64_t value = CRYPTO_rotl_u64(prev_value, rotation); \
+    prev_value = state[index];                                    \
+    state[index] = value;                                         \
+  } while (0)
+
+    PI_RHO_STEP(10, 1);
+    PI_RHO_STEP(7, 3);
+    PI_RHO_STEP(11, 6);
+    PI_RHO_STEP(17, 10);
+    PI_RHO_STEP(18, 15);
+    PI_RHO_STEP(3, 21);
+    PI_RHO_STEP(5, 28);
+    PI_RHO_STEP(16, 36);
+    PI_RHO_STEP(8, 45);
+    PI_RHO_STEP(21, 55);
+    PI_RHO_STEP(24, 2);
+    PI_RHO_STEP(4, 14);
+    PI_RHO_STEP(15, 27);
+    PI_RHO_STEP(23, 41);
+    PI_RHO_STEP(19, 56);
+    PI_RHO_STEP(13, 8);
+    PI_RHO_STEP(12, 25);
+    PI_RHO_STEP(2, 43);
+    PI_RHO_STEP(20, 62);
+    PI_RHO_STEP(14, 18);
+    PI_RHO_STEP(22, 39);
+    PI_RHO_STEP(9, 61);
+    PI_RHO_STEP(6, 20);
+    PI_RHO_STEP(1, 44);
+
+#undef PI_RHO_STEP
 
     // χ step
     for (int y = 0; y < 5; y++) {