Use a mix of bsaes and vpaes for CTR on NEON.

tl;dr: AES is now constant-time on 32-bit ARM with NEON. Combined with
all the past work, we now have constant-time AES and GHASH on ARM and
x86 chips, 32-bit and 64-bit, provided NEON (required by Chrome on
Android, aside from https://crbug.com/341598) or SSSE3 (almost all
Chrome on Windows users) is available!

CTR-like bsaes modes are harder to resolve than CBC decryption. They use
both bulk (ctr128_f) and one-off (block128_f) operations. We currently
use ctr128_f of bsaes and block128_f of aes_nohw (not constant-time),
which hits 22.0 MB/s on my test chip.

Implement a vpaes/bsaes hybrid to get the best of both worlds. The key
is kept in vpaes form and, when the input is large enough, we convert
the key to bsaes on-demand. This retains bsaes performance, but with no
variable-time gaps.
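
As a rough illustration of the dispatch described above, here is a minimal
sketch in C. All names (hybrid_key, ctr_backend, block_backend,
convert_to_bsaes) are hypothetical and are not the functions in this change.
The 8-block threshold is an assumption based on the bsaes batch size
mentioned below. The stubs only record which path would be taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the BoringSSL source. */
enum backend { BACKEND_VPAES, BACKEND_BSAES };

/* Assumed threshold: bsaes works in 8-block batches, so anything smaller
 * is treated as "not large enough" to pay for key conversion. */
#define BULK_MIN_BLOCKS 8

struct hybrid_key {
  int vpaes_ready;       /* key schedule is stored in vpaes form */
  int bsaes_conversions; /* counts on-demand conversions, for illustration */
};

/* Stand-in for converting a vpaes key schedule to bsaes form. */
static void convert_to_bsaes(struct hybrid_key *key) {
  assert(key->vpaes_ready);
  key->bsaes_conversions++;
}

/* Choose the backend for a bulk CTR call over |blocks| 16-byte blocks. */
static enum backend ctr_backend(struct hybrid_key *key, size_t blocks) {
  if (blocks < BULK_MIN_BLOCKS) {
    return BACKEND_VPAES; /* small input: use the stored vpaes key as-is */
  }
  convert_to_bsaes(key); /* large input: convert the key on demand */
  return BACKEND_BSAES;
}

/* One-off single-block operations always stay on vpaes. */
static enum backend block_backend(const struct hybrid_key *key) {
  assert(key->vpaes_ready);
  return BACKEND_VPAES;
}
```

Either path is constant-time in this model. The size check only decides
whether the conversion cost is worth paying.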

Alternatives considered:

- Convert to bsaes form immediately and only use bsaes. This makes the
  one-off block128_f calls very expensive. One 8-block batch of
  bsaes_ctr32_encrypt_blocks costs as much as 5.76 vpaes_encrypt calls.

- Do the above, but fold the one-off calls into bsaes batches because
  GCM is parallelizable. This is a mess with the current internal
  structure and doesn't apply to, e.g., CCM.

- Drop bsaes in favor of vpaes. However, even with
  vpaes_ctr32_encrypt_blocks, vpaes only reaches 15.5 MB/s. The hybrid
  is a 40% win over that on an important platform.

- Try to narrow the gap, as we did for x86_64, with a "2x" optimization.
  I attempted this here but the register pressure was tricky. (x86_64
  was already tight and NEON can't address memory in vtbl.) Even when I
  ignored this (which gives wrong answers), the gap was still 20-25%.
  Perf here is slower overall (20 MB/s for old ARM vs 120-140 MB/s for
  old x86_64), so that gap is scarier.
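
The cost figure in the first alternative (one 8-block bsaes batch costing
about 5.76 vpaes_encrypt calls) implies a break-even point for short runs
of blocks. The sketch below works that arithmetic out in C. The helper name
tail_prefers_bsaes is hypothetical, and the model assumes a partial batch
costs the same as a full one, which is an assumption rather than something
measured here.

```c
#include <assert.h>
#include <stddef.h>

/* Cost of one 8-block bsaes batch in units of one vpaes_encrypt call,
 * scaled by 100 to stay in integers (5.76 from the figure above). */
#define BSAES_BATCH_COST_X100 576

/* Returns 1 if a group of |remainder| blocks (0..8) is cheaper as one
 * bsaes batch than as |remainder| single vpaes_encrypt calls, under the
 * assumption that a partial batch costs as much as a full batch. */
static int tail_prefers_bsaes(size_t remainder) {
  assert(remainder <= 8);
  return remainder * 100 > BSAES_BATCH_COST_X100;
}
```

Under this model, five or fewer blocks are cheaper with vpaes and six or
more are cheaper as a bsaes batch, which is why a single block pushed
through bsaes is so expensive.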

I retained vpaes_ctr32_encrypt_blocks because it's fairly compact (only
84 bytes assembled), though it's less important in the bsaes hybrid.

Cortex-A53 (Raspberry Pi 3 Model B+)
Before:
Did 267000 AES-128-GCM (16 bytes) seal operations in 2004871us (133175.7 ops/sec): 2.1 MB/s
Did 135000 AES-128-GCM (256 bytes) seal operations in 2013825us (67036.6 ops/sec): 17.2 MB/s
Did 31000 AES-128-GCM (1350 bytes) seal operations in 2059039us (15055.6 ops/sec): 20.3 MB/s
Did 5565 AES-128-GCM (8192 bytes) seal operations in 2073607us (2683.7 ops/sec): 22.0 MB/s
Did 2709 AES-128-GCM (16384 bytes) seal operations in 2020264us (1340.9 ops/sec): 22.0 MB/s
Did 209000 AES-256-GCM (16 bytes) seal operations in 2005654us (104205.4 ops/sec): 1.7 MB/s
Did 109000 AES-256-GCM (256 bytes) seal operations in 2011293us (54194.0 ops/sec): 13.9 MB/s
Did 25000 AES-256-GCM (1350 bytes) seal operations in 2082385us (12005.5 ops/sec): 16.2 MB/s
Did 4452 AES-256-GCM (8192 bytes) seal operations in 2080729us (2139.6 ops/sec): 17.5 MB/s
Did 2226 AES-256-GCM (16384 bytes) seal operations in 2079819us (1070.3 ops/sec): 17.5 MB/s

After:
Did 542000 AES-128-GCM (16 bytes) seal operations in 2003408us (270539.0 ops/sec): 4.3 MB/s [+104.8%]
Did 124000 AES-128-GCM (256 bytes) seal operations in 2012579us (61612.5 ops/sec): 15.8 MB/s [-8.1%]
Did 30000 AES-128-GCM (1350 bytes) seal operations in 2020636us (14846.8 ops/sec): 20.0 MB/s [-1.5%]
Did 5502 AES-128-GCM (8192 bytes) seal operations in 2068807us (2659.5 ops/sec): 21.8 MB/s [-0.9%]
Did 2772 AES-128-GCM (16384 bytes) seal operations in 2085176us (1329.4 ops/sec): 21.8 MB/s [-0.9%]
Did 459000 AES-256-GCM (16 bytes) seal operations in 2003587us (229089.1 ops/sec): 3.7 MB/s [+117.6%]
Did 100000 AES-256-GCM (256 bytes) seal operations in 2018311us (49546.4 ops/sec): 12.7 MB/s [-8.6%]
Did 24000 AES-256-GCM (1350 bytes) seal operations in 2026975us (11840.3 ops/sec): 16.0 MB/s [-1.2%]
Did 4410 AES-256-GCM (8192 bytes) seal operations in 2079581us (2120.6 ops/sec): 17.4 MB/s [-0.6%]
Did 2226 AES-256-GCM (16384 bytes) seal operations in 2099318us (1060.3 ops/sec): 17.4 MB/s [-0.6%]

Bug: 256
Change-Id: Ib74ab7e63974d3ddae8ce5fc35c9b44e73dce305
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/37429
Reviewed-by: Adam Langley <agl@google.com>
3 files changed