Squeeze a block at a time when computing the matrix in Kyber We were currently squeezing 3 bytes at a time. And because kyber.c and keccak.c are separate files, we've got an optimization barrier, at least in non-LTO builds, at a very inconvenient place. Instead, extract the full 168 bytes (SHAKE-128 rate) at a time. This is conveniently a multiple of three, so we don't need to worry about partial coefficients. We're still doing a copy due to the Keccak abstraction, but it should extend cleanly to either LTO or a different abstraction to read from the state directly. Even without that, it's a significant perf win. Benchmarks on an M1: Before: Did 87390 Kyber generate + decap operations in 4001590us (21838.8 ops/sec) Did 119000 Kyber parse + encap operations in 4009460us (29679.8 ops/sec) After: Did 96747 Kyber generate + decap operations in 4003687us (24164.5 ops/sec) [+10.6%] Did 152000 Kyber parse + encap operations in 4018360us (37826.4 ops/sec) [+27.4%] Change-Id: Iada393edf0c634b7410a34374fb90242a392a9d3 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/59265 Commit-Queue: Adam Langley <agl@google.com> Reviewed-by: Adam Langley <agl@google.com> Auto-Submit: David Benjamin <davidben@google.com> Commit-Queue: David Benjamin <davidben@google.com>

commit: d5f3a9e82fc6735eff5733f51b892d776f4a84eb [log] [tgz]
author: David Benjamin <davidben@google.com> Tue Apr 25 18:01:14 2023 -0400
committer: Boringssl LUCI CQ <boringssl-scoped@luci-project-accounts.iam.gserviceaccount.com> Tue Apr 25 23:07:53 2023 +0000
tree: ac9e839e0bc970005b65547cc0a63749edcb8685
parent: 4b6d950d8921d6dd5365de0797fcc97302b9561b [diff]
diff --git a/crypto/kyber/kyber.c b/crypto/kyber/kyber.c
index f2ac30b..010d9d4 100644
--- a/crypto/kyber/kyber.c
+++ b/crypto/kyber/kyber.c

@@ -283,16 +283,23 @@
 // operates on public inputs.
 static void scalar_from_keccak_vartime(scalar *out,
                                        struct BORINGSSL_keccak_st *keccak_ctx) {
-  uint8_t bytes[3];
-  for (int i = 0; i < DEGREE;) {
-    BORINGSSL_keccak_squeeze(keccak_ctx, bytes, sizeof(bytes));
-    uint16_t d1 = bytes[0] + 256 * (bytes[1] % 16);
-    uint16_t d2 = bytes[1] / 16 + 16 * bytes[2];
-    if (d1 < kPrime) {
-      out->c[i++] = d1;
-    }
-    if (d2 < kPrime && i < DEGREE) {
-      out->c[i++] = d2;
+  assert(keccak_ctx->offset == 0);
+  assert(keccak_ctx->rate_bytes == 168);
+  static_assert(168 % 3 == 0, "block and coefficient boundaries do not align");
+
+  int done = 0;
+  while (done < DEGREE) {
+    uint8_t block[168];
+    BORINGSSL_keccak_squeeze(keccak_ctx, block, sizeof(block));
+    for (size_t i = 0; i < sizeof(block) && done < DEGREE; i += 3) {
+      uint16_t d1 = block[i] + 256 * (block[i + 1] % 16);
+      uint16_t d2 = block[i + 1] / 16 + 16 * block[i + 2];
+      if (d1 < kPrime) {
+        out->c[done++] = d1;
+      }
+      if (d2 < kPrime && done < DEGREE) {
+        out->c[done++] = d2;
+      }
     }
   }
 }
commit	d5f3a9e82fc6735eff5733f51b892d776f4a84eb	[log] [tgz]
author	David Benjamin <davidben@google.com>	Tue Apr 25 18:01:14 2023 -0400
committer	Boringssl LUCI CQ <boringssl-scoped@luci-project-accounts.iam.gserviceaccount.com>	Tue Apr 25 23:07:53 2023 +0000
tree	ac9e839e0bc970005b65547cc0a63749edcb8685
parent	4b6d950d8921d6dd5365de0797fcc97302b9561b [diff]