AES-GCM: optimize ARMv8 kernel and add EOR3 support

Originally written by Jamison Collins <jdcollin@google.com> in
cl/828730455. The assembly parts of this CL were reviewed there.
go/aes-gcm-eor3-benchmarks has some (internal) benchmarks. I've also
added some on the hardware I had on hand here.

Jamison's notes:

This change optimizes the existing ARMv8 AES-GCM assembly by reducing
NEON register pressure and introduces a new kernel variant leveraging
the EOR3 instruction. This CL improves performance by up to 12% on
Conan, 28% on Athena, and 41% on Bondi Beach.

Core Assembly Optimizations: Significant instruction rescheduling was
implemented in aesv8-gcm-armv8.pl to reduce false dependencies and
better utilize execution ports. To address NEON bottlenecks, operations
were shifted away from NEON registers where possible.

Counter Management: A key optimization is the reworked counter
management. Previously, counters were incremented, reversed, and moved
from GPR to NEON registers within the hot loop for every block. The new
approach precalculates these values for the subsequent iteration and
stores them on the stack. Inside the loop, they are pulled in via NEON
loads. Because the values pass through memory, the counters must be
calculated one iteration ahead to avoid expensive store-to-load (LD-ST)
forwarding.
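
As a rough illustration of the counter scheme described above, here is
a minimal C++ model. It is a sketch, not the assembly: the names
PrecomputeCounters, RunLoop, Block, and kBlocksPerIter are hypothetical,
the unroll factor of 4 is an assumption, and a plain array stands in for
the stack slots that the kernel reads with NEON loads.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical names; the real kernel is hand-written assembly.
constexpr size_t kBlocksPerIter = 4;  // assumed unroll factor
using Block = std::array<uint8_t, 16>;

// Writes the counter blocks for ctr, ctr+1, ... into `out`. Each block is
// the 12-byte IV followed by the 32-bit counter in big-endian byte order.
inline void PrecomputeCounters(const uint8_t iv[12], uint32_t ctr,
                               Block *out) {
  assert(iv != nullptr && out != nullptr);
  for (size_t i = 0; i < kBlocksPerIter; i++) {
    uint32_t c = ctr + static_cast<uint32_t>(i);
    std::memcpy(out[i].data(), iv, 12);
    out[i][12] = static_cast<uint8_t>(c >> 24);
    out[i][13] = static_cast<uint8_t>(c >> 16);
    out[i][14] = static_cast<uint8_t>(c >> 8);
    out[i][15] = static_cast<uint8_t>(c);
  }
}

// Models the loop shape: iteration n loads blocks that were staged during
// iteration n-1, then stages the blocks for iteration n+1.
template <typename Consume>
void RunLoop(const uint8_t iv[12], uint32_t ctr, size_t iters,
             Consume consume) {
  Block staged[kBlocksPerIter];  // stands in for the stack slots
  PrecomputeCounters(iv, ctr, staged);
  for (size_t n = 0; n < iters; n++) {
    Block current[kBlocksPerIter];  // stands in for the NEON loads
    std::memcpy(current, staged, sizeof(staged));
    // Stage the next iteration's counters. In this simple model the final
    // pass stages one set of blocks that is never consumed.
    ctr += static_cast<uint32_t>(kBlocksPerIter);
    PrecomputeCounters(iv, ctr, staged);
    consume(current);
  }
}
```

The point of the structure is that the stores for iteration n+1 are
issued a full iteration before the matching loads, so the loads do not
have to wait on a store that is still in flight.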

EOR3 Support: Support for the EOR3 instruction (part of the SHA3
extension) has been piped through the library to further optimize
AES-GCM on capable hardware.

Detection: SHA3 capability detection was added for Linux (via hwcap),
Apple (via sysctl), and Fuchsia, and via system register reads for
baremetal/FreeBSD.
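
For reference, the two bit tests behind the hwcap and system register
paths can be sketched as below. The function names are hypothetical and
are not the ones in armv8_feature_parsing.h. The bit positions are the
ones I understand to be architectural: HWCAP_SHA3 is bit 17 of AT_HWCAP
on arm64 Linux, and the SHA3 field is bits [35:32] of ID_AA64ISAR0_EL1.

```cpp
#include <cassert>
#include <cstdint>

// Bit 17 of AT_HWCAP on arm64 Linux (HWCAP_SHA3 in <asm/hwcap.h>).
constexpr unsigned long kHwcapSha3 = 1UL << 17;

// Hypothetical helper: true if the hwcap word advertises SHA3 (and so EOR3).
inline bool HwcapHasSha3(unsigned long hwcap) {
  return (hwcap & kHwcapSha3) != 0;
}

// Hypothetical helper: the SHA3 field of ID_AA64ISAR0_EL1 is bits [35:32].
// A nonzero value means the SHA3 instructions, including EOR3, exist.
inline bool Isar0HasSha3(uint64_t id_aa64isar0) {
  return ((id_aa64isar0 >> 32) & 0xf) != 0;
}

// Usage with synthetic inputs. On Linux the hwcap word would come from
// getauxval(AT_HWCAP); on baremetal the register value would come from an
// MRS read of ID_AA64ISAR0_EL1.
inline void DetectionExample() {
  assert(HwcapHasSha3(1UL << 17));
  assert(!HwcapHasSha3(1UL << 3));  // AES only, no SHA3
  assert(Isar0HasSha3(uint64_t{1} << 32));
  assert(!Isar0HasSha3(0));
}
```

Keeping the parsing as pure functions of an integer means it can be
unit-tested on any host, independent of how the value was obtained.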

Dispatch: Added gcm_arm64_aes_eor3 to the gcm_impl_t enum.
CRYPTO_gcm128_init_aes_key now selects this implementation when
gcm_sha3_capable() is true, enabling the optimized _eor3 encrypt/decrypt
functions.
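
The selection logic amounts to something like the following simplified
model. Only gcm_impl_t, gcm_arm64_aes_eor3, and the idea of a SHA3
capability check come from this change. CpuFeatures and SelectGcmImpl
are hypothetical stand-ins, and the other two enum members are assumed
names for the fallback and the existing ARMv8 kernel.

```cpp
#include <cassert>

// Simplified model of the dispatch enum. The real enum has more members.
enum gcm_impl_t {
  gcm_separate,        // assumed name: generic fallback path
  gcm_arm64_aes,       // assumed name: existing AES+PMULL kernel
  gcm_arm64_aes_eor3,  // added by this change
};

// Hypothetical capability summary standing in for the real CPU queries.
struct CpuFeatures {
  bool aes_pmull;  // base AES and PMULL extensions
  bool sha3;       // SHA3 extension, which provides EOR3
};

// Hypothetical stand-in for the choice made in CRYPTO_gcm128_init_aes_key.
inline gcm_impl_t SelectGcmImpl(const CpuFeatures &cpu) {
  if (!cpu.aes_pmull) {
    return gcm_separate;
  }
  return cpu.sha3 ? gcm_arm64_aes_eor3 : gcm_arm64_aes;
}

// Usage: a core with SHA3 gets the EOR3 kernel, one without keeps the
// existing kernel.
inline void DispatchExample() {
  CpuFeatures with_sha3{true, true};
  CpuFeatures without_sha3{true, false};
  assert(SelectGcmImpl(with_sha3) == gcm_arm64_aes_eor3);
  assert(SelectGcmImpl(without_sha3) == gcm_arm64_aes);
}
```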

Kernel: The Perl script now generates a second kernel variant utilizing
EOR3 to merge XOR operations during GHASH accumulation and final text
rendering. Both versions are stored under different names in
aesv8-gcm-armv8-linux.S.
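
To make the EOR3 saving concrete: EOR3 computes a three-way XOR in one
instruction, so folding two values into an accumulator needs one
operation instead of two dependent EORs. The sketch below models that
with a plain struct in place of a NEON register. Vec128 and the function
names are hypothetical; on real hardware the equivalent would be the
EOR3 instruction itself, which I believe ACLE exposes through the
veor3q_* intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a 128-bit NEON register.
struct Vec128 {
  uint64_t lo;
  uint64_t hi;
};

// Models a two-input EOR.
inline Vec128 Eor(Vec128 a, Vec128 b) {
  return {a.lo ^ b.lo, a.hi ^ b.hi};
}

// Models EOR3: a ^ b ^ c as a single operation.
inline Vec128 Eor3(Vec128 a, Vec128 b, Vec128 c) {
  return {a.lo ^ b.lo ^ c.lo, a.hi ^ b.hi ^ c.hi};
}

// Baseline accumulation: two EORs, the second depending on the first.
inline Vec128 AccumulateBaseline(Vec128 acc, Vec128 p0, Vec128 p1) {
  return Eor(Eor(acc, p0), p1);
}

// EOR3 accumulation: the same result from one operation.
inline Vec128 AccumulateEor3(Vec128 acc, Vec128 p0, Vec128 p1) {
  return Eor3(acc, p0, p1);
}

// Usage: both forms agree on arbitrary inputs.
inline void Eor3Example() {
  Vec128 acc{0x0123456789abcdefULL, 0xfedcba9876543210ULL};
  Vec128 p0{0x1111111111111111ULL, 0x2222222222222222ULL};
  Vec128 p1{0xf0f0f0f0f0f0f0f0ULL, 0x0f0f0f0f0f0f0f0fULL};
  Vec128 a = AccumulateBaseline(acc, p0, p1);
  Vec128 b = AccumulateEor3(acc, p0, p1);
  assert(a.lo == b.lo && a.hi == b.hi);
}
```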

Changes I made when extracting this to the upstream repo:

- Rebased to main and reran pregenerate to pick up new symbols to prefix

- Pulled the arch_extension machinery into its own CL

- Went ahead and added Windows feature dispatch, since MS has added
  those constants now

- Switched armv8_feature_parsing.h to C++ inline functions, to avoid
  theoretical ODR issues with static symbols in headers. Since these are
  now plain C++, I gave them C++-style names.

- Added armv8_feature_parsing.h to the build.

Benchmarks:

Apple M1 Pro (has EOR3):

Benchmark                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:16                   +0.0519         +0.0519            42            44            42            44
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:256                  -0.0356         -0.0297            76            74            76            73
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:1350                 -0.1067         -0.1064           255           228           255           228
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:8192                 -0.1451         -0.1458          1253          1071          1249          1067
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:16384                -0.1427         -0.1446          2458          2108          2453          2099
BM_SpeedAEAD/open_aes_128_gcm/InputSize:16                   +0.0697         +0.0725            45            48            45            48
BM_SpeedAEAD/open_aes_128_gcm/InputSize:256                  -0.0254         -0.0301            79            77            79            77
BM_SpeedAEAD/open_aes_128_gcm/InputSize:1350                 -0.1406         -0.1458           266           229           265           226
BM_SpeedAEAD/open_aes_128_gcm/InputSize:8192                 -0.2051         -0.2061          1296          1030          1295          1028
BM_SpeedAEAD/open_aes_128_gcm/InputSize:16384                -0.2122         -0.2114          2547          2007          2541          2004

Pixel 5A (does not have EOR3, just the base AES and PMULL extensions):

Benchmark                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:16                   +0.0128         +0.0125           116           118           116           118
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:256                  -0.0420         -0.0422           213           204           213           204
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:1350                 -0.0767         -0.0781           740           683           739           681
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:8192                 -0.1032         -0.1032          3829          3434          3822          3427
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:16384                -0.1077         -0.1081          7568          6753          7553          6737
BM_SpeedAEAD/open_aes_128_gcm/InputSize:16                   +0.0000         -0.0002           122           122           122           122
BM_SpeedAEAD/open_aes_128_gcm/InputSize:256                  +0.0125         +0.0122           211           214           211           213
BM_SpeedAEAD/open_aes_128_gcm/InputSize:1350                 -0.0161         -0.0165           697           685           695           684
BM_SpeedAEAD/open_aes_128_gcm/InputSize:8192                 -0.0183         -0.0183          3546          3482          3539          3475
BM_SpeedAEAD/open_aes_128_gcm/InputSize:16384                -0.0184         -0.0176          6985          6857          6968          6845

Change-Id: Ied74d3493f174f6e8aeaa300816b39d72f2be042
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/87988
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Lily Chen <chlily@google.com>
25 files changed
tree: d076228be0d9c182418efe1871b80633c7bc80d0
README.md

BoringSSL

BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.

Although BoringSSL is an open source project, it is not intended for general use, as OpenSSL is. We don't recommend that third parties depend upon it. Doing so is likely to be frustrating because there are no guarantees of API or ABI stability.

Programs ship their own copies of BoringSSL when they use it and we update everything as needed when deciding to make API changes. This allows us to mostly avoid compromises in the name of compatibility. It works for us, but it may not work for you.

BoringSSL arose because Google used OpenSSL for many years in various ways and, over time, built up a large number of patches that were maintained while tracking upstream OpenSSL. As Google's product portfolio became more complex, more copies of OpenSSL sprang up and the effort involved in maintaining all these patches in multiple places was growing steadily.

Currently BoringSSL is the SSL library in Chrome/Chromium, Android (but it's not part of the NDK) and a number of other apps/programs.

Project links:

To file a security issue, use the Chromium process and mention in the report this is for BoringSSL. You can ignore the parts of the process that are specific to Chromium/Chrome.

There are other files in this directory which might be helpful: