Fix allocation size in BN_mod_exp_mont_consttime.

powerbuf's layout is:
- num_powers values mod m, stored transposed
- one value mod m, tmp
- one value mod m, am
- (mont5-only) an extra copy of m

powerbuf_len broadly computed this, but where tmp + am would be
sizeof(BN_ULONG) * top * 2, it used
sizeof(BN_ULONG) * max(top * 2, num_powers).

(In the actual code it's written as a ternary op and some
multiplications are factored out.)

That is, it allocated enough room for tmp + am OR an extra row in the
num_powers table, as if each entry were top + 1 words long instead of
top, with the space overlapping. This expression dates to upstream's
https://github.com/openssl/openssl/commit/361512da0d900ba276096cbd152e304d402aca65,
though the exact layout has shifted over the years as mont5 evolved.
(Originally, it only contained one extra value mod m.)

At the time, this was necessary because bn_mul_mont_gather5 actually
overreads the table by one row! Although it only uses top * 32 words, it
requires the table to have (top + 1) * 32 words. This is because the
computation was scheduled so that the .Louter4x loop would read and mask
off the next iteration's value while incorporating the previous
iteration:

There were masked reads from $bp into XMM registers at the start of the
loop:
https://github.com/openssl/openssl/blob/361512da0d900ba276096cbd152e304d402aca65/crypto/bn/asm/x86_64-mont5.pl#L541

The XMM logic is interleaved throughout and does not move to a
general-purpose register, $m0, until much later. $m is not read again
until after the jump.
https://github.com/openssl/openssl/blob/361512da0d900ba276096cbd152e304d402aca65/crypto/bn/asm/x86_64-mont5.pl#L700

Meanwhile, the loop is already reading $m0 at the start of the
iteration.
https://github.com/openssl/openssl/blob/361512da0d900ba276096cbd152e304d402aca65/crypto/bn/asm/x86_64-mont5.pl#L551

The whole thing is bootstrapped by similar code just above it:
https://github.com/openssl/openssl/blob/361512da0d900ba276096cbd152e304d402aca65/crypto/bn/asm/x86_64-mont5.pl#L531

In the final iteration, we read one extra row into $m0 but never use it.
That is the overread.

I also confirmed this by rewinding our x86_64-mont5.pl to this state,
hacking things up until it built, and then hacking up
BN_mod_exp_mont_consttime to place the table in its own allocation, with
no extra slop using C11 aligned_alloc. This was so valgrind could
accurately instrument the bounds. When I did that, valgrind was clean if
I allocated (top + 1) * num_powers, but flagged an out-of-bounds read at
top * num_powers.

This no longer applies. After
https://github.com/openssl/openssl/commit/25d14c6c29b53907bf614b9964d43cd98401a7fc,
bn_mul_mont_gather5's scheduling is far less complicated. .Louter4x now
begins with a masked read, setting up $m0, and then it incorporates $m0
into the product. The same valgrind strategy confirmed this. Thus, I
don't believe this extra row is needed and we can allocate the buffer
straightforwardly.

Change-Id: I6c1ee8d5ebdb66eb4e5fec63d2140814c13ae146
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/55231
Reviewed-by: Bob Beck <bbe@google.com>
Commit-Queue: Bob Beck <bbe@google.com>
diff --git a/crypto/fipsmodule/bn/exponentiation.c b/crypto/fipsmodule/bn/exponentiation.c
index 2903ede..afa8e71 100644
--- a/crypto/fipsmodule/bn/exponentiation.c
+++ b/crypto/fipsmodule/bn/exponentiation.c
@@ -961,9 +961,7 @@
   // Allocate a buffer large enough to hold all of the pre-computed
   // powers of |am|, |am| itself, and |tmp|.
   int num_powers = 1 << window;
-  powerbuf_len +=
-      sizeof(m->d[0]) *
-      (top * num_powers + ((2 * top) > num_powers ? (2 * top) : num_powers));
+  powerbuf_len += sizeof(m->d[0]) * top * (num_powers + 2);
 
 #if defined(OPENSSL_BN_ASM_MONT5)
   if (powerbuf_len <= sizeof(storage)) {