Remove p > q normalization in RSA keys

RSA CRT is tiny bit messier when p < q.
https://boringssl-review.googlesource.com/25263 solved this by
normalizing to p > q. The cost was we sometimes had to compute a new
iqmp.

Modular inversion is expensive. We did it only once per key, but it's
still a performance cliff in per-key costs. When later work moves
freeze_private_key into RSA private key parsing, it will be a
performance cliff in the private key parser.

Instead, just handle p < q in the CRT function. The only difference is
needing one extra reduction before the modular subtraction. Even using
the fully general mod_montgomery function (as opposed to checking p < q,
or using bn_reduce_once when num_bits(p) == num_bits(q)) was not
measurable.

In doing so, I noticed we didn't actually have tests that exercise the
reduction step. I added one to evp_tests.txt, but it is only meaningful
when blinding is disabled. (Another cost of blinding.) When blinding is
enabled, the answers mod p and q are randomized and we hit this case
with about 1.8% probability. See comment in evp_test.txt.

I kept the optimization where we store iqmp in Montgomery form, not
because the optimization matters, but because we need to store a
corrected, fixed-width version of the value anyway, so we may as well
store it in a more convenient form.

M1 Max
Before:
Did 9048 RSA 2048 signing operations in 5033403us (1797.6 ops/sec)
Did 1500 RSA 4096 signing operations in 5009288us (299.4 ops/sec)
After:
Did 9116 RSA 2048 signing operations in 5053802us (1803.8 ops/sec) [+0.3%]
Did 1500 RSA 4096 signing operations in 5008283us (299.5 ops/sec) [+0.0%]

Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
Before:
Did 9282 RSA 2048 signing operations in 5019395us (1849.2 ops/sec)
Did 1302 RSA 4096 signing operations in 5055011us (257.6 ops/sec)
After:
Did 9240 RSA 2048 signing operations in 5024845us (1838.9 ops/sec) [-0.6%]
Did 1302 RSA 4096 signing operations in 5046157us (258.0 ops/sec) [+0.2%]

Bug: 316
Change-Id: Icb90c7d5f5188f9b69a6d7bcc63db13d92ec26d5
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60705
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
diff --git a/crypto/evp/evp_test.cc b/crypto/evp/evp_test.cc
index 9c4f14d..fafd50b 100644
--- a/crypto/evp/evp_test.cc
+++ b/crypto/evp/evp_test.cc
@@ -125,7 +125,7 @@
   return EVP_PKEY_NONE;
 }
 
-static int GetRSAPadding(FileTest *t, int *out, const std::string &name) {
+static bool GetRSAPadding(FileTest *t, int *out, const std::string &name) {
   if (name == "PKCS1") {
     *out = RSA_PKCS1_PADDING;
     return true;
@@ -138,6 +138,10 @@
     *out = RSA_PKCS1_OAEP_PADDING;
     return true;
   }
+  if (name == "None") {
+    *out = RSA_NO_PADDING;
+    return true;
+  }
   ADD_FAILURE() << "Unknown RSA padding mode: " << name;
   return false;
 }
diff --git a/crypto/evp/evp_tests.txt b/crypto/evp/evp_tests.txt
index 238a602..cbce1b0 100644
--- a/crypto/evp/evp_tests.txt
+++ b/crypto/evp/evp_tests.txt
@@ -655,6 +655,16 @@
 Input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 Output = 07fa4e3de9c002c41c952dc292ef5a814c4c17dc1a6cf958c4c971e8089676d6661b442270ef9295c41e5385c9628aa1bdee2cc2558b8473ba212f2ba04b9ff2264c19187b9506b1d0a1cc2751844cc8dedf555d62ce81bc0e70bfe83d0184ee964593af91b9b327c0fb272c799148cd8737d412cbf36c2ad25fd66977bf805f
 
+# The input is 1 (mod p) and q - 1 (mod q). If the CRT implementation does not
+# account for p < q, the subtraction step will break. However, this test only
+# works when RSA blinding is disabled. When blinding is enabled, these values
+# are randomized and, instead, either this or the above test will hit this case
+# if repeated enough. (It hits it with probability (q-p)^2 / 2pq, or 1.8% here.)
+Encrypt = RSA-Swapped
+RSAPadding = None
+Input = 7ab490e1f5e718ff23a9e738f9867559e3a6ea72332e3bcc1b6f58585218ed815a865dab75e3f44ea550bdef815101d6039251a513fd99188c1916abb94b15ef72de793f6472a342b33a125cfc96a4fc89d140e337f5fe6ff937ed3f34b6b09f5227f598e12faddf4423254acbee748197bed26954502c7484c65e279fed7fed
+CheckDecrypt
+
 # Though we will never generate such a key, test that RSA keys where p and q are
 # different sizes work properly.
 PrivateKey = RSA-PrimeMismatch
diff --git a/crypto/fipsmodule/rsa/internal.h b/crypto/fipsmodule/rsa/internal.h
index c6bf60a..5a993a2 100644
--- a/crypto/fipsmodule/rsa/internal.h
+++ b/crypto/fipsmodule/rsa/internal.h
@@ -103,10 +103,8 @@
   // the |freeze_private_key| implementation.
   BIGNUM *d_fixed, *dmp1_fixed, *dmq1_fixed;
 
-  // inv_small_mod_large_mont is q^-1 mod p in Montgomery form, using |mont_p|,
-  // if |p| >= |q|. Otherwise, it is p^-1 mod q in Montgomery form, using
-  // |mont_q|.
-  BIGNUM *inv_small_mod_large_mont;
+  // iqmp_mont is q^-1 mod p in Montgomery form, using |mont_p|.
+  BIGNUM *iqmp_mont;
 
   // num_blindings contains the size of the |blindings| and |blindings_inuse|
   // arrays. This member and the |blindings_inuse| array are protected by
diff --git a/crypto/fipsmodule/rsa/rsa_impl.c b/crypto/fipsmodule/rsa/rsa_impl.c
index 1206397..6cdc290 100644
--- a/crypto/fipsmodule/rsa/rsa_impl.c
+++ b/crypto/fipsmodule/rsa/rsa_impl.c
@@ -243,37 +243,24 @@
       }
 
       // CRT components are only publicly bounded by their corresponding
-      // moduli's bit lengths. |rsa->iqmp| is unused outside of this one-time
-      // setup, so we do not compute a fixed-width version of it.
+      // moduli's bit lengths.
       if (!ensure_fixed_copy(&rsa->dmp1_fixed, rsa->dmp1, p_fixed->width) ||
           !ensure_fixed_copy(&rsa->dmq1_fixed, rsa->dmq1, q_fixed->width)) {
         goto err;
       }
 
-      // Compute |inv_small_mod_large_mont|. Note that it is always modulo the
-      // larger prime, independent of what is stored in |rsa->iqmp|.
-      if (rsa->inv_small_mod_large_mont == NULL) {
-        BIGNUM *inv_small_mod_large_mont = BN_new();
-        int ok;
-        if (BN_cmp(rsa->p, rsa->q) < 0) {
-          ok = inv_small_mod_large_mont != NULL &&
-               bn_mod_inverse_secret_prime(inv_small_mod_large_mont, rsa->p,
-                                           rsa->q, ctx, rsa->mont_q) &&
-               BN_to_montgomery(inv_small_mod_large_mont,
-                                inv_small_mod_large_mont, rsa->mont_q, ctx);
-        } else {
-          ok = inv_small_mod_large_mont != NULL &&
-               BN_to_montgomery(inv_small_mod_large_mont, rsa->iqmp,
-                                rsa->mont_p, ctx);
-        }
-        if (!ok) {
-          BN_free(inv_small_mod_large_mont);
+      // Compute |iqmp_mont|, which is |iqmp| in Montgomery form and with the
+      // correct bit width.
+      if (rsa->iqmp_mont == NULL) {
+        BIGNUM *iqmp_mont = BN_new();
+        if (iqmp_mont == NULL ||
+            !BN_to_montgomery(iqmp_mont, rsa->iqmp, rsa->mont_p, ctx)) {
+          BN_free(iqmp_mont);
           goto err;
         }
-        rsa->inv_small_mod_large_mont = inv_small_mod_large_mont;
-        CONSTTIME_SECRET(
-            rsa->inv_small_mod_large_mont->d,
-            sizeof(BN_ULONG) * rsa->inv_small_mod_large_mont->width);
+        rsa->iqmp_mont = iqmp_mont;
+        CONSTTIME_SECRET(rsa->iqmp_mont->d,
+                         sizeof(BN_ULONG) * rsa->iqmp_mont->width);
       }
     }
   }
@@ -302,8 +289,8 @@
   rsa->dmp1_fixed = NULL;
   BN_free(rsa->dmq1_fixed);
   rsa->dmq1_fixed = NULL;
-  BN_free(rsa->inv_small_mod_large_mont);
-  rsa->inv_small_mod_large_mont = NULL;
+  BN_free(rsa->iqmp_mont);
+  rsa->iqmp_mont = NULL;
 
   for (size_t i = 0; i < rsa->num_blindings; i++) {
     BN_BLINDING_free(rsa->blindings[i]);
@@ -798,42 +785,37 @@
     goto err;
   }
 
-  // Implementing RSA with CRT in constant-time is sensitive to which prime is
-  // larger. Canonicalize fields so that |p| is the larger prime.
-  const BIGNUM *dmp1 = rsa->dmp1_fixed, *dmq1 = rsa->dmq1_fixed;
-  const BN_MONT_CTX *mont_p = rsa->mont_p, *mont_q = rsa->mont_q;
-  if (BN_cmp(rsa->p, rsa->q) < 0) {
-    mont_p = rsa->mont_q;
-    mont_q = rsa->mont_p;
-    dmp1 = rsa->dmq1_fixed;
-    dmq1 = rsa->dmp1_fixed;
-  }
-
   // Use the minimal-width versions of |n|, |p|, and |q|. Either works, but if
   // someone gives us non-minimal values, these will be slightly more efficient
   // on the non-Montgomery operations.
   const BIGNUM *n = &rsa->mont_n->N;
-  const BIGNUM *p = &mont_p->N;
-  const BIGNUM *q = &mont_q->N;
+  const BIGNUM *p = &rsa->mont_p->N;
+  const BIGNUM *q = &rsa->mont_q->N;
 
   // This is a pre-condition for |mod_montgomery|. It was already checked by the
   // caller.
   assert(BN_ucmp(I, n) < 0);
 
   if (// |m1| is the result modulo |q|.
-      !mod_montgomery(r1, I, q, mont_q, p, ctx) ||
-      !BN_mod_exp_mont_consttime(m1, r1, dmq1, q, ctx, mont_q) ||
+      !mod_montgomery(r1, I, q, rsa->mont_q, p, ctx) ||
+      !BN_mod_exp_mont_consttime(m1, r1, rsa->dmq1_fixed, q, ctx,
+                                 rsa->mont_q) ||
       // |r0| is the result modulo |p|.
-      !mod_montgomery(r1, I, p, mont_p, q, ctx) ||
-      !BN_mod_exp_mont_consttime(r0, r1, dmp1, p, ctx, mont_p) ||
-      // Compute r0 = r0 - m1 mod p. |p| is the larger prime, so |m1| is already
-      // fully reduced mod |p|.
-      !bn_mod_sub_consttime(r0, r0, m1, p, ctx) ||
+      !mod_montgomery(r1, I, p, rsa->mont_p, q, ctx) ||
+      !BN_mod_exp_mont_consttime(r0, r1, rsa->dmp1_fixed, p, ctx,
+                                 rsa->mont_p) ||
+      // Compute r0 = r0 - m1 mod p. |m1| is reduced mod |q|, not |p|, so we
+      // just run |mod_montgomery| again for simplicity. This could be more
+      // efficient with more cases: if |p > q|, |m1| is already reduced. If
+      // |p < q| but they have the same bit width, |bn_reduce_once| suffices.
+      // However, compared to over 2048 Montgomery multiplications above, this
+      // difference is not measurable.
+      !mod_montgomery(r1, m1, p, rsa->mont_p, q, ctx) ||
+      !bn_mod_sub_consttime(r0, r0, r1, p, ctx) ||
       // r0 = r0 * iqmp mod p. We use Montgomery multiplication to compute this
-      // in constant time. |inv_small_mod_large_mont| is in Montgomery form and
-      // r0 is not, so the result is taken out of Montgomery form.
-      !BN_mod_mul_montgomery(r0, r0, rsa->inv_small_mod_large_mont, mont_p,
-                             ctx) ||
+      // in constant time. |iqmp_mont| is in Montgomery form and r0 is not, so
+      // the result is taken out of Montgomery form.
+      !BN_mod_mul_montgomery(r0, r0, rsa->iqmp_mont, rsa->mont_p, ctx) ||
       // r0 = r0 * q + m1 gives the final result. Reducing modulo q gives m1, so
       // it is correct mod p. Reducing modulo p gives (r0-m1)*iqmp*q + m1 = r0,
       // so it is correct mod q. Finally, the result is bounded by [m1, n + m1),
@@ -1308,8 +1290,7 @@
   replace_bignum(&rsa->d_fixed, &tmp->d_fixed);
   replace_bignum(&rsa->dmp1_fixed, &tmp->dmp1_fixed);
   replace_bignum(&rsa->dmq1_fixed, &tmp->dmq1_fixed);
-  replace_bignum(&rsa->inv_small_mod_large_mont,
-                 &tmp->inv_small_mod_large_mont);
+  replace_bignum(&rsa->iqmp_mont, &tmp->iqmp_mont);
   rsa->private_key_frozen = tmp->private_key_frozen;
   ret = 1;
 
diff --git a/crypto/rsa_extra/rsa_test.cc b/crypto/rsa_extra/rsa_test.cc
index 87e0396..e332cdf 100644
--- a/crypto/rsa_extra/rsa_test.cc
+++ b/crypto/rsa_extra/rsa_test.cc
@@ -983,7 +983,7 @@
   EXPECT_FALSE(rsa->d_fixed);
   EXPECT_FALSE(rsa->dmp1_fixed);
   EXPECT_FALSE(rsa->dmq1_fixed);
-  EXPECT_FALSE(rsa->inv_small_mod_large_mont);
+  EXPECT_FALSE(rsa->iqmp_mont);
   EXPECT_FALSE(rsa->private_key_frozen);
 
   // Failed key generations leave the previous contents alone.