Remove cacheline striping in copy_from_prebuf.

The standard computation model for constant-time code is that memory
access patterns must be independent of secret data.
BN_mod_exp_mont_consttime was previously written to a slightly weaker
model: only cacheline access patterns must be independent of secret
data. It assumed accesses within a cacheline were indistinguishable.

The CacheBleed attack (https://eprint.iacr.org/2016/224.pdf) showed this
assumption was false. Cache lines may be divided into cache banks, and
the researchers were able to measure cache bank contention pre-Haswell.
For Haswell, the researchers note "But, as Haswell does show timing
variations that depend on low address bits [19], it may be vulnerable to
similar attacks."

OpenSSL's fix to CacheBleed was not to adopt the standard constant-time
computation model. Rather, it now assumes accesses within a 16-byte
cache bank are indistinguishable, at least in the C copy_from_prebuf
path. These weaker models failed before with CacheBleed, so avoiding
such assumptions seems prudent. (The [19] citation above notes a false
dependence between memory addresses with a distance of 4k, which may be
what the paper was referring to.) Moreover, the C path is largely unused
on x86_64 (which uses mont5 asm), so it is especially questionable for
the generic C code to make assumptions based on x86_64.

Just walk the entire table in the C implementation. Doing so as-is comes
with a performance hit, but the striped memory layout is, at that point,
useless. We regain the performance loss (and then some) by using a more
natural layout. Benchmarks below.

This CL does not touch the mont5 assembly; I haven't figured out what
it's doing yet.

Pixel 3, aarch64:
Before:
Did 3146 RSA 2048 signing operations in 10009070us (314.3 ops/sec)
Did 447 RSA 4096 signing operations in 10026666us (44.6 ops/sec)
After:
Did 3210 RSA 2048 signing operations in 10010712us (320.7 ops/sec)
Did 456 RSA 4096 signing operations in 10063543us (45.3 ops/sec)

Pixel 3, armv7:
Before:
Did 2688 RSA 2048 signing operations in 10002266us (268.7 ops/sec)
Did 459 RSA 4096 signing operations in 10004785us (45.9 ops/sec)
After:
Did 2709 RSA 2048 signing operations in 10001299us (270.9 ops/sec)
Did 459 RSA 4096 signing operations in 10063737us (45.6 ops/sec)

x86_64 Broadwell, mont5 assembly disabled:
(This configuration is not actually shipped anywhere, but seemed a
useful data point.)
Before:
Did 14274 RSA 2048 signing operations in 10009130us (1426.1 ops/sec)
Did 2448 RSA 4096 signing operations in 10046921us (243.7 ops/sec)
After:
Did 14706 RSA 2048 signing operations in 10037908us (1465.0 ops/sec)
Did 2538 RSA 4096 signing operations in 10059986us (252.3 ops/sec)

Change-Id: If41da911d4281433856a86c6c8eadf99cd33e2d8
Reviewed-on: https://boringssl-review.googlesource.com/c/33268
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
diff --git a/crypto/fipsmodule/bn/exponentiation.c b/crypto/fipsmodule/bn/exponentiation.c
index fa326a7..1b9680f 100644
--- a/crypto/fipsmodule/bn/exponentiation.c
+++ b/crypto/fipsmodule/bn/exponentiation.c
@@ -850,67 +850,25 @@
   bn_mod_exp_mont_small(r, a, num, p_minus_two, num, mont);
 }
 
-
-// |BN_mod_exp_mont_consttime| stores the precomputed powers in a specific
-// layout so that accessing any of these table values shows the same access
-// pattern as far as cache lines are concerned. The following functions are
-// used to transfer a BIGNUM from/to that table.
-
 static void copy_to_prebuf(const BIGNUM *b, int top, BN_ULONG *table, int idx,
                            int window) {
-  int i, j;
-  const int width = 1 << window;
-  if (top > b->width) {
-    top = b->width;  // This works because |table| is explicitly zeroed.
-  }
-
-  for (i = 0, j = idx; i < top; i++, j += width)  {
-    table[j] = b->d[i];
-  }
+  int ret = bn_copy_words(table + idx * top, top, b);
+  assert(ret);  // |b| is guaranteed to fit.
+  (void)ret;
 }
 
-static int copy_from_prebuf(BIGNUM *b, int top, const BN_ULONG *buf, int idx,
+static int copy_from_prebuf(BIGNUM *b, int top, const BN_ULONG *table, int idx,
                             int window) {
-  int i, j;
-  const int width = 1 << window;
-  volatile const BN_ULONG *table = (volatile const BN_ULONG *)buf;
-
   if (!bn_wexpand(b, top)) {
     return 0;
   }
 
-  if (window <= 3) {
-    for (i = 0; i < top; i++, table += width) {
-      BN_ULONG acc = 0;
-
-      for (j = 0; j < width; j++) {
-        acc |= table[j] & ((BN_ULONG)0 - (constant_time_eq_int(j, idx) & 1));
-      }
-
-      b->d[i] = acc;
-    }
-  } else {
-    int xstride = 1 << (window - 2);
-    BN_ULONG y0, y1, y2, y3;
-
-    i = idx >> (window - 2);  // equivalent of idx / xstride
-    idx &= xstride - 1;       // equivalent of idx % xstride
-
-    y0 = (BN_ULONG)0 - (constant_time_eq_int(i, 0) & 1);
-    y1 = (BN_ULONG)0 - (constant_time_eq_int(i, 1) & 1);
-    y2 = (BN_ULONG)0 - (constant_time_eq_int(i, 2) & 1);
-    y3 = (BN_ULONG)0 - (constant_time_eq_int(i, 3) & 1);
-
-    for (i = 0; i < top; i++, table += width) {
-      BN_ULONG acc = 0;
-
-      for (j = 0; j < xstride; j++) {
-        acc |= ((table[j + 0 * xstride] & y0) | (table[j + 1 * xstride] & y1) |
-                (table[j + 2 * xstride] & y2) | (table[j + 3 * xstride] & y3)) &
-               ((BN_ULONG)0 - (constant_time_eq_int(j, idx) & 1));
-      }
-
-      b->d[i] = acc;
+  OPENSSL_memset(b->d, 0, sizeof(BN_ULONG) * top);
+  const int width = 1 << window;
+  for (int i = 0; i < width; i++, table += top) {
+    BN_ULONG mask = constant_time_eq_int(i, idx);
+    for (int j = 0; j < top; j++) {
+      b->d[j] |= table[j] & mask;
     }
   }
 
@@ -953,9 +911,8 @@
    (MOD_EXP_CTIME_MIN_CACHE_LINE_WIDTH - \
     (((size_t)(x_)) & (MOD_EXP_CTIME_MIN_CACHE_LINE_MASK))))
 
-// This variant of BN_mod_exp_mont() uses fixed windows and the special
-// precomputation memory layout to limit data-dependency to a minimum
-// to protect secret exponents (cf. the hyper-threading timing attacks
+// This variant of |BN_mod_exp_mont| uses fixed windows and fixed memory access
+// patterns to protect secret exponents (cf. the hyper-threading timing attacks
 // pointed out by Colin Percival,
 // http://www.daemonology.net/hyperthreading-considered-harmful/)
 int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,