Optimize constant_time_conditional_memxor for GCC

Fixed: 42290529

While packed tables still don't give us a speedup on GCC, they are now
within a few percent of the original. With the updated Curve25519 code,
there is a net speedup (+9%) for arbitrary point multiplication.

A loop of 64-bit XORs did not reach performance parity, so vector
extensions it is.
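
For readers unfamiliar with GCC vector extensions, the approach can be
sketched roughly as below. This is a stand-alone illustration, not the
code in this CL: the name cond_memxor_sketch and the uint64_t mask type
are hypothetical, the mask is broadcast with memset rather than the
subtraction used in the patch, and the value barrier is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// 32-byte vector of bytes. aligned(1) permits unaligned access and
// may_alias permits reading and writing arbitrary buffers through it.
typedef uint8_t v32u8 __attribute__((vector_size(32), aligned(1), may_alias));

// Hypothetical sketch: XORs |src| into |dst| if |mask| is all ones and
// leaves |dst| unchanged if |mask| is zero. The buffers must not overlap.
static void cond_memxor_sketch(void *dst, const void *src, size_t n,
                               uint64_t mask) {
  uint8_t *out = (uint8_t *)dst;
  const uint8_t *in = (const uint8_t *)src;

  // Largest multiple of 32 that is <= n.
  size_t n_vec = n & ~(size_t)31;

  // Broadcast the low byte of the mask (0x00 or 0xff) to all 32 lanes.
  v32u8 masks;
  memset(&masks, (uint8_t)mask, sizeof(masks));

  // Process 32 bytes per iteration.
  for (size_t i = 0; i < n_vec; i += 32) {
    *(v32u8 *)&out[i] ^= masks & *(const v32u8 *)&in[i];
  }

  // Byte-wise tail for the remaining n % 32 bytes.
  for (size_t i = n_vec; i < n; i++) {
    out[i] ^= (uint8_t)mask & in[i];
  }
}
```

The compiler is free to lower the 32-byte vector to whatever the target
supports, so the sketch does not require AVX to build.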

/tmp/m/orig-binary-from-before-packed-tables/gcc
Did 57000 Ed25519 key generation operations in 1000871us (56950.4 ops/sec)
Did 56000 Ed25519 signing operations in 1010050us (55442.8 ops/sec)
Did 20000 Ed25519 verify operations in 1023534us (19540.1 ops/sec)
Did 59000 Curve25519 base-point multiplication operations in 1001460us (58914.0 ops/sec)
Did 23000 Curve25519 arbitrary point multiplication operations in 1025589us (22426.1 ops/sec)

/tmp/m/this-cl/gcc
Did 56000 Ed25519 key generation operations in 1014283us (55211.4 ops/sec) [-3.1%]
Did 54000 Ed25519 signing operations in 1000577us (53968.9 ops/sec) [-2.7%]
Did 21000 Ed25519 verify operations in 1047461us (20048.5 ops/sec) [+2.6%]
Did 57000 Curve25519 base-point multiplication operations in 1002519us (56856.8 ops/sec) [-3.5%]
Did 25000 Curve25519 arbitrary point multiplication operations in 1019719us (24516.6 ops/sec) [+9.3%]

/tmp/m/this-cl/clang
Did 78000 Ed25519 key generation operations in 1003047us (77763.1 ops/sec) [+36.5%]
Did 75000 Ed25519 signing operations in 1002741us (74795.0 ops/sec) [+34.9%]
Did 20000 Ed25519 verify operations in 1008417us (19833.1 ops/sec) [+1.5%]
Did 82000 Curve25519 base-point multiplication operations in 1005159us (81579.1 ops/sec) [+38.5%]
Did 30000 Curve25519 arbitrary point multiplication operations in 1006715us (29799.9 ops/sec) [+32.9%]

/tmp/m/this-cl/clang
Did 79000 Ed25519 key generation operations in 1000245us (78980.6 ops/sec) [+38.7%]
Did 76000 Ed25519 signing operations in 1000323us (75975.5 ops/sec) [+37.0%]
Did 20000 Ed25519 verify operations in 1031862us (19382.4 ops/sec) [-0.8%]
Did 83000 Curve25519 base-point multiplication operations in 1001875us (82844.7 ops/sec) [+40.6%]
Did 30000 Curve25519 arbitrary point multiplication operations in 1007550us (29775.2 ops/sec) [+32.8%]

Change-Id: I9dcaa25c5fac863cd90e31caa42f5d80b63238d6
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/69247
Auto-Submit: Andres Erbsen <andreser@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: David Benjamin <davidben@google.com>
diff --git a/crypto/internal.h b/crypto/internal.h
index f93c2e5..6d5736d 100644
--- a/crypto/internal.h
+++ b/crypto/internal.h
@@ -567,11 +567,22 @@
 // |mask| is 0xff..ff and does nothing if |mask| is 0. The |n|-byte memory
 // ranges at |dst| and |src| must not overlap, as when calling |memcpy|.
 static inline void constant_time_conditional_memxor(void *dst, const void *src,
-                                                    const size_t n,
+                                                    size_t n,
                                                     const crypto_word_t mask) {
   assert(!buffers_alias(dst, n, src, n));
   uint8_t *out = (uint8_t *)dst;
   const uint8_t *in = (const uint8_t *)src;
+#if defined(__GNUC__) && !defined(__clang__)
+  // gcc 13.2.0 doesn't automatically vectorize this loop regardless of barrier
+  typedef uint8_t v32u8 __attribute__((vector_size(32), aligned(1), may_alias));
+  size_t n_vec = n&~(size_t)31;
+  v32u8 masks = ((uint8_t)mask-(v32u8){}); // broadcast
+  for (size_t i = 0; i < n_vec; i += 32) {
+    *(v32u8*)&out[i] ^= masks & *(v32u8*)&in[i];
+  }
+  out += n_vec;
+  n -= n_vec;
+#endif
   for (size_t i = 0; i < n; i++) {
     out[i] ^= value_barrier_w(mask) & in[i];
   }