Remove cacheline striping in copy_from_prebuf.
The standard computation model for constant-time code is that memory
access patterns must be independent of secret data.
BN_mod_exp_mont_consttime was previously written to a slightly weaker
model: only cacheline access patterns must be independent of secret
data. It assumed accesses within a cacheline were indistinguishable.
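The "cacheline striping" in the subject is how that model was implemented:
word i of table entry idx was stored at table[i * width + idx], with
width = 1 << window, so that accessing any entry shows the same access
pattern as far as cachelines are concerned. Two hypothetical helpers
contrast it with the natural layout this CL switches to:

  #include <stddef.h>

  // Striped layout (old): entries are interleaved word by word, so an
  // entry is identified only by its offset within shared cachelines.
  static size_t striped_index(size_t i, size_t idx, size_t width) {
    return i * width + idx;  // word i of entry idx
  }

  // Natural layout (new): each entry is a contiguous run of |top| words.
  static size_t natural_index(size_t i, size_t idx, size_t top) {
    return idx * top + i;  // word i of entry idx
  }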
The CacheBleed attack (https://eprint.iacr.org/2016/224.pdf) showed this
assumption was false. Cache lines may be divided into cache banks, and
the researchers were able to measure cache bank contention pre-Haswell.
For Haswell, the researchers note "But, as Haswell does show timing
variations that depend on low address bits [19], it may be vulnerable to
similar attacks."
OpenSSL's fix to CacheBleed was not to adopt the standard constant-time
computation model. Rather, it now assumes accesses within a 16-byte
cache bank are indistinguishable, at least in the C copy_from_prebuf
path. These weaker models failed before with CacheBleed, so avoiding
such assumptions seems prudent. (The [19] citation above notes a false
dependence between memory addresses with a distance of 4k, which may be
what the paper was referring to.) Moreover, the C path is largely unused
on x86_64 (which uses mont5 asm), so it is especially questionable for
the generic C code to make assumptions based on x86_64.
Just walk the entire table in the C implementation. Doing so as-is comes
with a performance hit, but the striped memory layout is, at that point,
useless. We regain the lost performance (and then some) by using a more
natural layout. A standalone sketch of the new selection loop follows;
benchmarks are further below.
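A minimal sketch, assuming 64-bit words, of the fixed-access-pattern
select that the new copy_from_prebuf performs. ct_eq and ct_select are
hypothetical stand-ins for BoringSSL's constant_time_eq_int and the new
loop in the diff below:

  #include <stddef.h>
  #include <stdint.h>

  static uint64_t ct_eq(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;                // zero iff a == b
    return ((x | (0 - x)) >> 63) - 1;  // all-ones iff a == b, else zero
  }

  // Entry k occupies table[k * top] .. table[k * top + top - 1]. Every
  // word of every entry is read; masking keeps only entry |idx|, so the
  // memory access pattern is independent of the secret index.
  static void ct_select(uint64_t *out, const uint64_t *table, size_t width,
                        size_t top, size_t idx) {
    for (size_t j = 0; j < top; j++) {
      out[j] = 0;
    }
    for (size_t i = 0; i < width; i++, table += top) {
      uint64_t mask = ct_eq(i, idx);
      for (size_t j = 0; j < top; j++) {
        out[j] |= table[j] & mask;
      }
    }
  }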
This CL does not touch the mont5 assembly; I haven't figured out what
it's doing yet.
Pixel 3, aarch64:
Before:
Did 3146 RSA 2048 signing operations in 10009070us (314.3 ops/sec)
Did 447 RSA 4096 signing operations in 10026666us (44.6 ops/sec)
After:
Did 3210 RSA 2048 signing operations in 10010712us (320.7 ops/sec)
Did 456 RSA 4096 signing operations in 10063543us (45.3 ops/sec)
Pixel 3, armv7:
Before:
Did 2688 RSA 2048 signing operations in 10002266us (268.7 ops/sec)
Did 459 RSA 4096 signing operations in 10004785us (45.9 ops/sec)
After:
Did 2709 RSA 2048 signing operations in 10001299us (270.9 ops/sec)
Did 459 RSA 4096 signing operations in 10063737us (45.6 ops/sec)
x86_64 Broadwell, mont5 assembly disabled:
(This configuration is not actually shipped anywhere, but seemed a
useful data point.)
Before:
Did 14274 RSA 2048 signing operations in 10009130us (1426.1 ops/sec)
Did 2448 RSA 4096 signing operations in 10046921us (243.7 ops/sec)
After:
Did 14706 RSA 2048 signing operations in 10037908us (1465.0 ops/sec)
Did 2538 RSA 4096 signing operations in 10059986us (252.3 ops/sec)
Change-Id: If41da911d4281433856a86c6c8eadf99cd33e2d8
Reviewed-on: https://boringssl-review.googlesource.com/c/33268
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
diff --git a/crypto/fipsmodule/bn/exponentiation.c b/crypto/fipsmodule/bn/exponentiation.c
index fa326a7..1b9680f 100644
--- a/crypto/fipsmodule/bn/exponentiation.c
+++ b/crypto/fipsmodule/bn/exponentiation.c
@@ -850,67 +850,25 @@
bn_mod_exp_mont_small(r, a, num, p_minus_two, num, mont);
}
-
-// |BN_mod_exp_mont_consttime| stores the precomputed powers in a specific
-// layout so that accessing any of these table values shows the same access
-// pattern as far as cache lines are concerned. The following functions are
-// used to transfer a BIGNUM from/to that table.
-
static void copy_to_prebuf(const BIGNUM *b, int top, BN_ULONG *table, int idx,
int window) {
- int i, j;
- const int width = 1 << window;
- if (top > b->width) {
- top = b->width; // This works because |table| is explicitly zeroed.
- }
-
- for (i = 0, j = idx; i < top; i++, j += width) {
- table[j] = b->d[i];
- }
+ int ret = bn_copy_words(table + idx * top, top, b);
+ assert(ret); // |b| is guaranteed to fit.
+ (void)ret;
}
-static int copy_from_prebuf(BIGNUM *b, int top, const BN_ULONG *buf, int idx,
+static int copy_from_prebuf(BIGNUM *b, int top, const BN_ULONG *table, int idx,
int window) {
- int i, j;
- const int width = 1 << window;
- volatile const BN_ULONG *table = (volatile const BN_ULONG *)buf;
-
if (!bn_wexpand(b, top)) {
return 0;
}
- if (window <= 3) {
- for (i = 0; i < top; i++, table += width) {
- BN_ULONG acc = 0;
-
- for (j = 0; j < width; j++) {
- acc |= table[j] & ((BN_ULONG)0 - (constant_time_eq_int(j, idx) & 1));
- }
-
- b->d[i] = acc;
- }
- } else {
- int xstride = 1 << (window - 2);
- BN_ULONG y0, y1, y2, y3;
-
- i = idx >> (window - 2); // equivalent of idx / xstride
- idx &= xstride - 1; // equivalent of idx % xstride
-
- y0 = (BN_ULONG)0 - (constant_time_eq_int(i, 0) & 1);
- y1 = (BN_ULONG)0 - (constant_time_eq_int(i, 1) & 1);
- y2 = (BN_ULONG)0 - (constant_time_eq_int(i, 2) & 1);
- y3 = (BN_ULONG)0 - (constant_time_eq_int(i, 3) & 1);
-
- for (i = 0; i < top; i++, table += width) {
- BN_ULONG acc = 0;
-
- for (j = 0; j < xstride; j++) {
- acc |= ((table[j + 0 * xstride] & y0) | (table[j + 1 * xstride] & y1) |
- (table[j + 2 * xstride] & y2) | (table[j + 3 * xstride] & y3)) &
- ((BN_ULONG)0 - (constant_time_eq_int(j, idx) & 1));
- }
-
- b->d[i] = acc;
+ OPENSSL_memset(b->d, 0, sizeof(BN_ULONG) * top);
+ const int width = 1 << window;
+ for (int i = 0; i < width; i++, table += top) {
+ BN_ULONG mask = constant_time_eq_int(i, idx);
+ for (int j = 0; j < top; j++) {
+ b->d[j] |= table[j] & mask;
}
}
@@ -953,9 +911,8 @@
(MOD_EXP_CTIME_MIN_CACHE_LINE_WIDTH - \
(((size_t)(x_)) & (MOD_EXP_CTIME_MIN_CACHE_LINE_MASK))))
-// This variant of BN_mod_exp_mont() uses fixed windows and the special
-// precomputation memory layout to limit data-dependency to a minimum
-// to protect secret exponents (cf. the hyper-threading timing attacks
+// This variant of |BN_mod_exp_mont| uses fixed windows and fixed memory access
+// patterns to protect secret exponents (cf. the hyper-threading timing attacks
// pointed out by Colin Percival,
// http://www.daemonology.net/hyperthreading-considered-harmful/)
int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,