Don't prematurely run keccak_f in squeeze

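When a squeeze ended exactly on a multiple of the rate in bytes (e.g. in
the Kyber XOF), we ran the Keccak permutation as soon as the block was
exhausted, even if no further output was ever requested. Defer the
permutation to the start of the next squeeze iteration instead, so it
only runs when more output is actually needed.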
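
As a rough illustration of the difference, here is a minimal sketch that
is not the BoringSSL code. All names in it (toy_sponge, toy_permute,
squeeze_eager, squeeze_lazy, RATE_BYTES) are hypothetical, and
toy_permute is a stand-in that only mixes the bytes and counts calls; it
is not keccak_f. The state is assumed to be ready for output at
initialisation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Toy rate for illustration only; the real rate depends on the function.
#define RATE_BYTES 8

struct toy_sponge {
  uint8_t state[RATE_BYTES];
  size_t squeeze_offset;
  unsigned permutations;  // Number of toy_permute calls so far.
};

// Hypothetical stand-in for keccak_f: deterministically changes the state
// and counts how often it was called. It has no cryptographic properties.
static void toy_permute(struct toy_sponge *ctx) {
  for (size_t i = 0; i < RATE_BYTES; i++) {
    ctx->state[i] = (uint8_t)(ctx->state[i] * 5 + i + 1);
  }
  ctx->permutations++;
}

void toy_init(struct toy_sponge *ctx) { memset(ctx, 0, sizeof(*ctx)); }

// Old shape: permute as soon as the current block has been fully consumed,
// whether or not the caller will ever ask for more output.
void squeeze_eager(struct toy_sponge *ctx, uint8_t *out, size_t out_len) {
  while (out_len) {
    assert(ctx->squeeze_offset < RATE_BYTES);
    size_t remaining = RATE_BYTES - ctx->squeeze_offset;
    size_t todo = out_len < remaining ? out_len : remaining;
    memcpy(out, &ctx->state[ctx->squeeze_offset], todo);
    out += todo;
    out_len -= todo;
    ctx->squeeze_offset += todo;
    if (ctx->squeeze_offset == RATE_BYTES) {
      toy_permute(ctx);
      ctx->squeeze_offset = 0;
    }
  }
}

// New shape: permute only when output is requested and the current block
// is already exhausted.
void squeeze_lazy(struct toy_sponge *ctx, uint8_t *out, size_t out_len) {
  while (out_len) {
    assert(ctx->squeeze_offset <= RATE_BYTES);
    if (ctx->squeeze_offset == RATE_BYTES) {
      toy_permute(ctx);
      ctx->squeeze_offset = 0;
    }
    size_t remaining = RATE_BYTES - ctx->squeeze_offset;
    size_t todo = out_len < remaining ? out_len : remaining;
    memcpy(out, &ctx->state[ctx->squeeze_offset], todo);
    out += todo;
    out_len -= todo;
    ctx->squeeze_offset += todo;
  }
}
```

In this sketch, squeezing 2 * RATE_BYTES bytes from a freshly
initialised toy_sponge costs two toy_permute calls with squeeze_eager
and one with squeeze_lazy, while the output bytes are identical. If the
caller later asks for more, squeeze_lazy performs the deferred call then.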

Before:
Did 18900 Kyber generate + decap operations in 2001506us (9442.9 ops/sec)
Did 32000 Kyber parse + encap operations in 2041500us (15674.7 ops/sec)

After:
Did 19796 Kyber generate + decap operations in 2017501us (9812.1 ops/sec) [+3.9%]
Did 34000 Kyber parse + encap operations in 2032085us (16731.6 ops/sec) [+6.7%]

Change-Id: I69787536508c4eadcc37a2f752c3678c60906c38
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/64007
Reviewed-by: Adam Langley <agl@google.com>
Auto-Submit: David Benjamin <davidben@google.com>
Commit-Queue: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
diff --git a/crypto/keccak/keccak.c b/crypto/keccak/keccak.c
index e482404..7ab8edc 100644
--- a/crypto/keccak/keccak.c
+++ b/crypto/keccak/keccak.c
@@ -240,6 +240,11 @@
   // because we require |uint8_t| to be a character type.
   const uint8_t *state_bytes = (const uint8_t *)ctx->state;
   while (out_len) {
+    if (ctx->squeeze_offset == ctx->rate_bytes) {
+      keccak_f(ctx->state);
+      ctx->squeeze_offset = 0;
+    }
+
     size_t remaining = ctx->rate_bytes - ctx->squeeze_offset;
     size_t todo = out_len;
     if (todo > remaining) {
@@ -249,9 +254,5 @@
     out += todo;
     out_len -= todo;
     ctx->squeeze_offset += todo;
-    if (ctx->squeeze_offset == ctx->rate_bytes) {
-      keccak_f(ctx->state);
-      ctx->squeeze_offset = 0;
-    }
   }
 }