Remove read locks from PRNG steady state

We don't take write locks in the PRNG, steady state, but we do take some
read locks: computing fork generation, reading the fork-unsafe buffering
flag, and a FIPS-only artifact of some global state clearing mess. That
last one is completely useless, but it's a consequence of FIPS's
understanding of process exit being comically inconsistent with reality.

Taking read locks is, in principle, parallel, but the cacheline write
causes some contention, even in newer glibcs with faster read locks. Fix
these:

- Use atomic reads to check the fork generation (see the sketch after
  this list). We only need to lock when we observe a fork.

- Replace the fork-unsafe buffering flag with an atomic altogether.

- Split state_clear_all_lock into a per-rand_thread_state lock. We still
  need a read lock, but a completely uncontended one until process exit.
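
As a rough sketch, the first change amounts to the double-checked
pattern below, written here with standard C11 atomics and a pthread
mutex instead of the CRYPTO_atomic_u32 and CRYPTO_STATIC_MUTEX wrappers
the real code uses. The names wipe_flag, fork_generation, and
get_fork_generation are illustrative, and the mmap/MADV_WIPEONFORK
setup is omitted.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

// wipe_flag stands in for the word in the MADV_WIPEONFORK page, which
// the kernel zeroes in a forked child. fork_generation counts observed
// forks.
static _Atomic uint32_t wipe_flag = 1;
static uint64_t fork_generation = 1;
static pthread_mutex_t fork_lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t get_fork_generation(void) {
  // Fast path: if the flag is still set, no fork has happened since the
  // last update, so fork_generation is fixed for this address space and
  // may be read without a lock.
  if (atomic_load(&wipe_flag) != 0) {
    return fork_generation;
  }

  // Slow path: the kernel cleared the flag, so we are in a forked
  // child. Several threads may observe the cleared flag concurrently,
  // so take a lock and re-check before bumping the generation.
  pthread_mutex_lock(&fork_lock);
  uint64_t generation = fork_generation;
  if (atomic_load(&wipe_flag) == 0) {
    generation++;
    if (generation == 0) {
      generation = 1;  // Zero means fork detection isn't supported.
    }
    // Publish the new generation before re-setting the flag; other
    // threads read the generation without the lock once they see the
    // flag set.
    fork_generation = generation;
    atomic_store(&wipe_flag, 1);
  }
  pthread_mutex_unlock(&fork_lock);
  return generation;
}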

With many threads, these changes give a significant perf boost.

x86_64, non-FIPS, Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 30 threads:
Before:
Did 45131875 RNG (16 bytes) operations in 300039649us (150419.7 ops/sec): 2.4 MB/s
Did 44089000 RNG (32 bytes) operations in 300053237us (146937.3 ops/sec): 4.7 MB/s
Did 43328000 RNG (256 bytes) operations in 300058423us (144398.5 ops/sec): 37.0 MB/s
Did 45857000 RNG (1350 bytes) operations in 300095943us (152807.8 ops/sec): 206.3 MB/s
Did 43249000 RNG (8192 bytes) operations in 300102698us (144114.0 ops/sec): 1180.6 MB/s
After:
Did 296204000 RNG (16 bytes) operations in 300009524us (987315.3 ops/sec): 15.8 MB/s
Did 311347000 RNG (32 bytes) operations in 300014396us (1037773.5 ops/sec): 33.2 MB/s
Did 295104000 RNG (256 bytes) operations in 300012657us (983638.5 ops/sec): 251.8 MB/s
Did 255721000 RNG (1350 bytes) operations in 300016481us (852356.5 ops/sec): 1150.7 MB/s
Did 103339000 RNG (8192 bytes) operations in 300040059us (344417.3 ops/sec): 2821.5 MB/s

(Smaller PRNG draws are more impacted because they spend less time in the
DRBG. But they're also more likely because you rarely need to pull 8K of
data out at once.)

x86_64, FIPS, Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 30 threads:
Before:
Did 29060000 RNG (16 bytes) operations in 300081190us (96840.5 ops/sec): 1.5 MB/s
Did 31882000 RNG (32 bytes) operations in 300118031us (106231.5 ops/sec): 3.4 MB/s
Did 30925000 RNG (256 bytes) operations in 300113646us (103044.3 ops/sec): 26.4 MB/s
Did 31969000 RNG (1350 bytes) operations in 300096688us (106529.0 ops/sec): 143.8 MB/s
Did 33434000 RNG (8192 bytes) operations in 300093240us (111412.0 ops/sec): 912.7 MB/s
After:
Did 299013000 RNG (16 bytes) operations in 300012167us (996669.6 ops/sec): 15.9 MB/s
Did 289788000 RNG (32 bytes) operations in 300014611us (965913.0 ops/sec): 30.9 MB/s
Did 298699000 RNG (256 bytes) operations in 300013443us (995618.7 ops/sec): 254.9 MB/s
Did 247061000 RNG (1350 bytes) operations in 300018215us (823486.7 ops/sec): 1111.7 MB/s
Did 100479000 RNG (8192 bytes) operations in 300037708us (334887.9 ops/sec): 2743.4 MB/s

On an M1 Pro, it's mostly a wash by default (fewer threads because this
chip has fewer cores).

aarch64, M1 Pro, 8 threads:
Before:
Did 23218000 RNG (16 bytes) operations in 80009131us (290191.9 ops/sec): 4.6 MB/s
Did 23021000 RNG (256 bytes) operations in 80007544us (287735.4 ops/sec): 73.7 MB/s
Did 22853000 RNG (1350 bytes) operations in 80013184us (285615.4 ops/sec): 385.6 MB/s
Did 25407000 RNG (8192 bytes) operations in 80008371us (317554.3 ops/sec): 2601.4 MB/s
Did 22128000 RNG (16384 bytes) operations in 80013269us (276554.1 ops/sec): 4531.1 MB/s
After:
Did 23303000 RNG (16 bytes) operations in 80011433us (291245.9 ops/sec): 4.7 MB/s
Did 23072000 RNG (256 bytes) operations in 80008755us (288368.4 ops/sec): 73.8 MB/s
Did 22807000 RNG (1350 bytes) operations in 80013355us (285039.9 ops/sec): 384.8 MB/s
Did 23759000 RNG (8192 bytes) operations in 80010212us (296949.6 ops/sec): 2432.6 MB/s
Did 23193000 RNG (16384 bytes) operations in 80011537us (289870.7 ops/sec): 4749.2 MB/s

This is likely because, without RDRAND or MADV_WIPEONFORK, we draw from
the OS on every call. We're probably bottlenecked by getentropy, whether
by its internal synchronization or by syscall overhead. With
fork-unsafe buffering enabled, this change shows even more significant
wins on the M1 Pro.

aarch64, fork-unsafe buffering, M1 Pro, 8 threads:
Before:
Did 25727000 RNG (16 bytes) operations in 80010579us (321545.0 ops/sec): 5.1 MB/s
Did 25776000 RNG (32 bytes) operations in 80008587us (322165.4 ops/sec): 10.3 MB/s
Did 25780000 RNG (256 bytes) operations in 80006127us (322225.3 ops/sec): 82.5 MB/s
Did 33171250 RNG (1350 bytes) operations in 80002532us (414627.5 ops/sec): 559.7 MB/s
Did 54784000 RNG (8192 bytes) operations in 80005706us (684751.2 ops/sec): 5609.5 MB/s
After:
Did 573826000 RNG (16 bytes) operations in 80000668us (7172765.1 ops/sec): 114.8 MB/s
Did 571329000 RNG (32 bytes) operations in 80000423us (7141574.7 ops/sec): 228.5 MB/s
Did 435043750 RNG (256 bytes) operations in 80000214us (5438032.3 ops/sec): 1392.1 MB/s
Did 229536000 RNG (1350 bytes) operations in 80001888us (2869132.3 ops/sec): 3873.3 MB/s
Did 57253000 RNG (8192 bytes) operations in 80004974us (715618.0 ops/sec): 5862.3 MB/s

Note that, on hardware with RDRAND, the read lock in
rand_fork_unsafe_buffering_enabled() doesn't do much. But without
RDRAND, we hit that on every RAND_bytes call. More importantly, the
subsequent CL will fix a bug that will require us to hit it more
frequently.

I've removed the volatile on g_fork_detect_addr because I don't think we
needed it, and dropping it avoids having to reason about the interaction
between volatile and atomics. The pointer is passed into madvise, so the
compiler knows the pointer escapes. For the removal to be invalid, the
compiler would need to go out of its way to model madvise as not
remembering the pointer, which would be an incorrect model for
MADV_WIPEONFORK.
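
For context, MADV_WIPEONFORK (Linux 4.14 and later) behaves roughly as
in the standalone sketch below. This is only an illustration of the
kernel behavior the fork detector relies on, not code from this change.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  // Map an anonymous page and ask the kernel to zero it in forked
  // children.
  size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
  uint32_t *flag = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  if (flag == MAP_FAILED ||
      madvise(flag, page_size, MADV_WIPEONFORK) != 0) {
    perror("mmap/madvise");  // Fails on kernels older than 4.14.
    return 1;
  }

  *flag = 1;
  pid_t pid = fork();
  if (pid == 0) {
    // The kernel wiped the page on fork, so the child reads zero.
    printf("child sees flag = %u\n", *flag);
    _exit(0);
  }
  waitpid(pid, NULL, 0);
  // The parent's copy is untouched and still reads one.
  printf("parent sees flag = %u\n", *flag);
  return 0;
}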

Bug: 570
Cq-Include-Trybots: luci.boringssl.try:linux_clang_rel_tsan
Change-Id: Ie6977acd1b8e7639aaa419cf6f4f5f0645bde9d1
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/59849
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
diff --git a/crypto/fipsmodule/rand/fork_detect.c b/crypto/fipsmodule/rand/fork_detect.c
index 58b0687..9e46223 100644
--- a/crypto/fipsmodule/rand/fork_detect.c
+++ b/crypto/fipsmodule/rand/fork_detect.c
@@ -38,7 +38,7 @@
 
 DEFINE_STATIC_ONCE(g_fork_detect_once);
 DEFINE_STATIC_MUTEX(g_fork_detect_lock);
-DEFINE_BSS_GET(volatile char *, g_fork_detect_addr);
+DEFINE_BSS_GET(CRYPTO_atomic_u32 *, g_fork_detect_addr);
 DEFINE_BSS_GET(uint64_t, g_fork_generation);
 DEFINE_BSS_GET(int, g_force_madv_wipeonfork);
 DEFINE_BSS_GET(int, g_force_madv_wipeonfork_enabled);
@@ -70,7 +70,7 @@
     return;
   }
 
-  *((volatile char *) addr) = 1;
+  CRYPTO_atomic_store_u32(addr, 1);
   *g_fork_detect_addr_bss_get() = addr;
   *g_fork_generation_bss_get() = 1;
 }
@@ -83,16 +83,12 @@
   // is initialised atomically, even if multiple threads enter this function
   // concurrently.
   //
-  // In the limit, the kernel may clear WIPEONFORK pages while a multi-threaded
-  // process is running. (For example, because a VM was cloned.) Therefore a
-  // lock is used below to synchronise the potentially multiple threads that may
-  // concurrently observe the cleared flag.
+  // Additionally, while the kernel will only clear WIPEONFORK at a point when a
+  // child process is single-threaded, the child may become multi-threaded
+  // before it observes this. Therefore, we must synchronize the logic below.
 
   CRYPTO_once(g_fork_detect_once_bss_get(), init_fork_detect);
-  // This pointer is |volatile| because the value pointed to may be changed by
-  // external forces (i.e. the kernel wiping the page) thus the compiler must
-  // not assume that it has exclusive access to it.
-  volatile char *const flag_ptr = *g_fork_detect_addr_bss_get();
+  CRYPTO_atomic_u32 *const flag_ptr = *g_fork_detect_addr_bss_get();
   if (flag_ptr == NULL) {
     // Our kernel is too old to support |MADV_WIPEONFORK| or
     // |g_force_madv_wipeonfork| is set.
@@ -105,28 +101,34 @@
     return 0;
   }
 
-  struct CRYPTO_STATIC_MUTEX *const lock = g_fork_detect_lock_bss_get();
+  // In the common case, try to observe the flag without taking a lock. This
+  // avoids cacheline contention in the PRNG.
   uint64_t *const generation_ptr = g_fork_generation_bss_get();
-
-  CRYPTO_STATIC_MUTEX_lock_read(lock);
-  uint64_t current_generation = *generation_ptr;
-  if (*flag_ptr) {
-    CRYPTO_STATIC_MUTEX_unlock_read(lock);
-    return current_generation;
+  if (CRYPTO_atomic_load_u32(flag_ptr) != 0) {
+    // If we observe a non-zero flag, it is safe to read |generation_ptr|
+    // without a lock. The flag and generation number are fixed for this copy of
+    // the address space.
+    return *generation_ptr;
   }
 
-  CRYPTO_STATIC_MUTEX_unlock_read(lock);
+  // The flag was zero. The generation number must be incremented, but other
+  // threads may have concurrently observed the zero, so take a lock before
+  // incrementing.
+  struct CRYPTO_STATIC_MUTEX *const lock = g_fork_detect_lock_bss_get();
   CRYPTO_STATIC_MUTEX_lock_write(lock);
-  current_generation = *generation_ptr;
-  if (*flag_ptr == 0) {
+  uint64_t current_generation = *generation_ptr;
+  if (CRYPTO_atomic_load_u32(flag_ptr) == 0) {
     // A fork has occurred.
-    *flag_ptr = 1;
-
     current_generation++;
     if (current_generation == 0) {
+      // Zero means fork detection isn't supported, so skip that value.
       current_generation = 1;
     }
+
+    // We must update |generation_ptr| before |flag_ptr|. Other threads may
+    // observe |flag_ptr| without taking a lock.
     *generation_ptr = current_generation;
+    CRYPTO_atomic_store_u32(flag_ptr, 1);
   }
   CRYPTO_STATIC_MUTEX_unlock_write(lock);
 
diff --git a/crypto/fipsmodule/rand/rand.c b/crypto/fipsmodule/rand/rand.c
index 0ead182..bf6b046 100644
--- a/crypto/fipsmodule/rand/rand.c
+++ b/crypto/fipsmodule/rand/rand.c
@@ -72,6 +72,10 @@
   // next and prev form a NULL-terminated, double-linked list of all states in
   // a process.
   struct rand_thread_state *next, *prev;
+  // clear_drbg_lock synchronizes between uses of |drbg| and
+  // |rand_thread_state_clear_all| clearing it. This lock should be uncontended
+  // in the common case, except on shutdown.
+  CRYPTO_MUTEX clear_drbg_lock;
 #endif
 };
 
@@ -82,18 +86,19 @@
 // called when the whole process is exiting.
 DEFINE_BSS_GET(struct rand_thread_state *, thread_states_list);
 DEFINE_STATIC_MUTEX(thread_states_list_lock);
-DEFINE_STATIC_MUTEX(state_clear_all_lock);
 
 static void rand_thread_state_clear_all(void) __attribute__((destructor));
 static void rand_thread_state_clear_all(void) {
   CRYPTO_STATIC_MUTEX_lock_write(thread_states_list_lock_bss_get());
-  CRYPTO_STATIC_MUTEX_lock_write(state_clear_all_lock_bss_get());
   for (struct rand_thread_state *cur = *thread_states_list_bss_get();
        cur != NULL; cur = cur->next) {
+    CRYPTO_MUTEX_lock_write(&cur->clear_drbg_lock);
     CTR_DRBG_clear(&cur->drbg);
   }
   // The locks are deliberately left locked so that any threads that are still
-  // running will hang if they try to call |RAND_bytes|.
+  // running will hang if they try to call |RAND_bytes|. It also ensures
+  // |rand_thread_state_free| cannot free any thread state while we've taken the
+  // lock.
 }
 #endif
 
@@ -385,6 +390,7 @@
     state->fork_generation = fork_generation;
 
 #if defined(BORINGSSL_FIPS)
+    CRYPTO_MUTEX_init(&state->clear_drbg_lock);
     if (state != &stack_state) {
       CRYPTO_STATIC_MUTEX_lock_write(thread_states_list_lock_bss_get());
       struct rand_thread_state **states_list = thread_states_list_bss_get();
@@ -410,7 +416,7 @@
     // Take a read lock around accesses to |state->drbg|. This is needed to
     // avoid returning bad entropy if we race with
     // |rand_thread_state_clear_all|.
-    CRYPTO_STATIC_MUTEX_lock_read(state_clear_all_lock_bss_get());
+    CRYPTO_MUTEX_lock_read(&state->clear_drbg_lock);
 #endif
     if (!CTR_DRBG_reseed(&state->drbg, seed, reseed_additional_data,
                          reseed_additional_data_len)) {
@@ -420,7 +426,7 @@
     state->fork_generation = fork_generation;
   } else {
 #if defined(BORINGSSL_FIPS)
-    CRYPTO_STATIC_MUTEX_lock_read(state_clear_all_lock_bss_get());
+    CRYPTO_MUTEX_lock_read(&state->clear_drbg_lock);
 #endif
   }
 
@@ -449,7 +455,7 @@
   }
 
 #if defined(BORINGSSL_FIPS)
-  CRYPTO_STATIC_MUTEX_unlock_read(state_clear_all_lock_bss_get());
+  CRYPTO_MUTEX_unlock_read(&state->clear_drbg_lock);
 #endif
 }
 
diff --git a/crypto/internal.h b/crypto/internal.h
index 00f0582..9edfd0e 100644
--- a/crypto/internal.h
+++ b/crypto/internal.h
@@ -594,6 +594,11 @@
   return atomic_compare_exchange_weak(val, expected, desired);
 }
 
+OPENSSL_INLINE void CRYPTO_atomic_store_u32(CRYPTO_atomic_u32 *val,
+                                            uint32_t desired) {
+  atomic_store(val, desired);
+}
+
 #elif defined(OPENSSL_WINDOWS_ATOMIC)
 
 typedef LONG CRYPTO_atomic_u32;
@@ -626,6 +631,11 @@
   return actual == expected;
 }
 
+OPENSSL_INLINE void CRYPTO_atomic_store_u32(volatile CRYPTO_atomic_u32 *val,
+                                            uint32_t desired) {
+  InterlockedExchange(val, (LONG)desired);
+}
+
 #elif !defined(OPENSSL_THREADS)
 
 typedef uint32_t CRYPTO_atomic_u32;
@@ -644,6 +654,11 @@
   return 1;
 }
 
+OPENSSL_INLINE void CRYPTO_atomic_store_u32(CRYPTO_atomic_u32 *val,
+                                            uint32_t desired) {
+  *val = desired;
+}
+
 #else
 
 // Require some atomics implementation. Contact BoringSSL maintainers if you
diff --git a/crypto/rand_extra/forkunsafe.c b/crypto/rand_extra/forkunsafe.c
index 0f1ecec..356afdd 100644
--- a/crypto/rand_extra/forkunsafe.c
+++ b/crypto/rand_extra/forkunsafe.c
@@ -17,13 +17,12 @@
 #include <stdlib.h>
 
 #include "../fipsmodule/rand/internal.h"
+#include "../internal.h"
 
 
-// g_buffering_enabled is true if fork-unsafe buffering has been enabled.
-static int g_buffering_enabled = 0;
-
-// g_lock protects |g_buffering_enabled|.
-static struct CRYPTO_STATIC_MUTEX g_lock = CRYPTO_STATIC_MUTEX_INIT;
+// g_buffering_enabled is one if fork-unsafe buffering has been enabled and zero
+// otherwise.
+static CRYPTO_atomic_u32 g_buffering_enabled = 0;
 
 #if !defined(OPENSSL_WINDOWS)
 void RAND_enable_fork_unsafe_buffering(int fd) {
@@ -32,15 +31,10 @@
     abort();
   }
 
-  CRYPTO_STATIC_MUTEX_lock_write(&g_lock);
-  g_buffering_enabled = 1;
-  CRYPTO_STATIC_MUTEX_unlock_write(&g_lock);
+  CRYPTO_atomic_store_u32(&g_buffering_enabled, 1);
 }
 #endif
 
 int rand_fork_unsafe_buffering_enabled(void) {
-  CRYPTO_STATIC_MUTEX_lock_read(&g_lock);
-  const int ret = g_buffering_enabled;
-  CRYPTO_STATIC_MUTEX_unlock_read(&g_lock);
-  return ret;
+  return CRYPTO_atomic_load_u32(&g_buffering_enabled) != 0;
 }