Add an extra reduction step to the end of RSAZ.

RSAZ has a very similar bug to mont5 from
https://boringssl-review.googlesource.com/c/boringssl/+/52825 and may
return the modulus when it should return zero. As in that CL, there is
no security impact on our cryptographic primitives.

RSAZ is described in the paper "Software Implementation of Modular
Exponentiation, Using Advanced Vector Instructions Architectures".

The bug comes from RSAZ's use of "NRMM" or "Non Reduced Montgomery
Multiplication". This is like normal Montgomery multiplication, but
skips the final subtraction altogether (whereas mont5's AMM still
subtracts, but replaces MM's tigher bound with just the carry bit). This
would normally not be stable, but RSAZ picks a larger R > 4M, and
maintains looser bounds for modular arithmetic, a < 2M.

Lemma 1 from the paper proves that NRMM(a, b) preserves this 2M bound.
It also claims NRMM(a, 1) < M. That is, conversion out of Montgomery
form with NRMM is fully reduced. This second claim is wrong. The proof
shows that NRMM(a, 1) < 1/2 + M, which only implies NRMM(a, 1) <= M, not
NRMM(a, 1) < M. RSAZ relies on this to produce a reduced output (see
Figure 7 in the paper).

Thus, like mont5 with AMM, RSAZ may return the modulus when it should
return zero. Fix this by adding a bn_reduce_once_in_place call at the
end of the operation.

Change-Id: If28bc49ae8dfbfb43bea02af5ea10c4209a1c6e6
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52827
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
diff --git a/crypto/fipsmodule/bn/bn_tests.txt b/crypto/fipsmodule/bn/bn_tests.txt
index 4bc1fa2..9a1a5db 100644
--- a/crypto/fipsmodule/bn/bn_tests.txt
+++ b/crypto/fipsmodule/bn/bn_tests.txt
@@ -10625,12 +10625,10 @@
 M = 8f42c9e9e351ba9b32ab0cf69da43f4acf7028d19cff6e5059ea0e3fcc97c97f36a31470044737d4c0c933ac441ecb29e32c81401523afdac7de9c3fd8493c97
 
 # 1024-bit
-# TODO(davidben): This test breaks the RSAZ implementation. Fix it and enable
-# this test.
-# ModExp = 00
-# A = 800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002f
-# E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
-# M = 9da8dc26fdf4d2e49833b240ee552beb7a6e251caa91bfb5d6cafaf8ed9461877fda8f6ac299036d35806bc1ae7872e54eaac1ec6bee6d02c6621a9cf8883b3abc33c49b3e601203e0e86ef8f0562412cc689ee2670704583909ca6d7774c9f9f9f4d77d37fedef9cb51d207cb629ec02fa03b526fd6594bfa8f2da71238a0b7
+ModExp = 00
+A = 800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002f
+E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
+M = 9da8dc26fdf4d2e49833b240ee552beb7a6e251caa91bfb5d6cafaf8ed9461877fda8f6ac299036d35806bc1ae7872e54eaac1ec6bee6d02c6621a9cf8883b3abc33c49b3e601203e0e86ef8f0562412cc689ee2670704583909ca6d7774c9f9f9f4d77d37fedef9cb51d207cb629ec02fa03b526fd6594bfa8f2da71238a0b7
 
 # 1025-bit
 ModExp = 00
diff --git a/crypto/fipsmodule/bn/rsaz_exp.c b/crypto/fipsmodule/bn/rsaz_exp.c
index 7e15aaf..074f05d 100644
--- a/crypto/fipsmodule/bn/rsaz_exp.c
+++ b/crypto/fipsmodule/bn/rsaz_exp.c
@@ -219,6 +219,8 @@
   rsaz_1024_mul_avx2(result, result, one, m, k0);
 
   rsaz_1024_red2norm_avx2(result_norm, result);
+  BN_ULONG scratch[16];
+  bn_reduce_once_in_place(result_norm, /*carry=*/0, m_norm, scratch, 16);
 
   OPENSSL_cleanse(storage, MOD_EXP_CTIME_STORAGE_LEN * sizeof(BN_ULONG));
 }
diff --git a/crypto/fipsmodule/bn/rsaz_exp.h b/crypto/fipsmodule/bn/rsaz_exp.h
index 104bb7a..bc7a439 100644
--- a/crypto/fipsmodule/bn/rsaz_exp.h
+++ b/crypto/fipsmodule/bn/rsaz_exp.h
@@ -90,7 +90,11 @@
                             int i);
 
 // rsaz_1024_red2norm_avx2 converts |red| from RSAZ to |BIGNUM| representation
-// and writes the result to |norm|.
+// and writes the result to |norm|. The result will be <= the modulus.
+//
+// WARNING: The result of this operation may not be fully reduced. |norm| may be
+// the modulus instead of zero. This function should be followed by a call to
+// |bn_reduce_once|.
 void rsaz_1024_red2norm_avx2(BN_ULONG norm[16], const BN_ULONG red[40]);