AES-GCM: optimize ARMv8 kernel and add EOR3 support

Originally written by Jamison Collins <jdcollin@google.com> in cl/828730455. The assembly parts of this CL were reviewed there. go/aes-gcm-eor3-benchmarks has some (internal) benchmarks. I've also added some on the hardware I had on hand here.

Jamison's notes:

This change optimizes the existing ARMv8 AES-GCM assembly by reducing NEON register pressure and introduces a new kernel variant leveraging the EOR3 instruction. This CL improves performance by up to 12% on Conan, 28% on Athena, and 41% on Bondi Beach.

Core Assembly Optimizations: Significant instruction rescheduling was implemented in aesv8-gcm-armv8.pl to reduce false dependencies and better utilize execution ports. To address NEON bottlenecks, operations were shifted away from NEON registers where possible.

Counter Management: A key optimization is the reworked counter management. Previously, counters were incremented, reversed, and moved from GPR to NEON registers within the hot loop for every block. The new approach precalculates these values for the subsequent iteration and stores them on the stack. Inside the loop, they are pulled in via NEON loads. As values are passed via memory it's necessary to calculate the counters one iteration ahead to avoid expensive LD-ST forwarding.

EOR3 Support: Support for the EOR3 instruction (part of the SHA3 extension) has been piped through the library to further optimize AES-GCM on capable hardware.

Detection: SHA3 capability detection was added to Linux (via hwcap), Apple (via sysctl), Fuchsia, and system register reading for baremetal/FreeBSD.

Dispatch: Added gcm_arm64_aes_eor3 to the gcm_impl_t enum. CRYPTO_gcm128_init_aes_key now selects this implementation when gcm_sha3_capable() is true, enabling the optimized _eor3 encrypt/decrypt functions.

Kernel: The Perl script now generates a second kernel variant utilizing EOR3 to merge XOR operations during GHASH accumulation and final text rendering. Both versions are stored under different names in aesv8-gcm-armv8-linux.S.

Changes I made when extracting this to the upstream repo:
- Rebased to main and reran pregenerate to pick up new symbols to prefix
- Pulled the arch_extension machinery into its own CL
- Went ahead and added Windows feature dispatch, since MS has added those constants now
- Switched armv8_feature_parsing.h to C++ inline functions, to avoid potential theoretical ODR issues with static symbols in headers. Since these are now plain C++, I made them C++-named.
- Added armv8_feature_parsing.h to the build.

Benchmarks:

Apple M1 Pro (has EOR3):

Benchmark                                           Time      CPU  Time Old  Time New  CPU Old  CPU New
-------------------------------------------------------------------------------------------------------
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:16       +0.0519  +0.0519        42        44       42       44
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:256      -0.0356  -0.0297        76        74       76       73
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:1350     -0.1067  -0.1064       255       228      255      228
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:8192     -0.1451  -0.1458      1253      1071     1249     1067
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:16384    -0.1427  -0.1446      2458      2108     2453     2099
BM_SpeedAEAD/open_aes_128_gcm/InputSize:16       +0.0697  +0.0725        45        48       45       48
BM_SpeedAEAD/open_aes_128_gcm/InputSize:256      -0.0254  -0.0301        79        77       79       77
BM_SpeedAEAD/open_aes_128_gcm/InputSize:1350     -0.1406  -0.1458       266       229      265      226
BM_SpeedAEAD/open_aes_128_gcm/InputSize:8192     -0.2051  -0.2061      1296      1030     1295     1028
BM_SpeedAEAD/open_aes_128_gcm/InputSize:16384    -0.2122  -0.2114      2547      2007     2541     2004

Pixel 5A (does not have EOR3, just the base AES and PMULL extensions):

Benchmark                                           Time      CPU  Time Old  Time New  CPU Old  CPU New
-------------------------------------------------------------------------------------------------------
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:16       +0.0128  +0.0125       116       118      116      118
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:256      -0.0420  -0.0422       213       204      213      204
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:1350     -0.0767  -0.0781       740       683      739      681
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:8192     -0.1032  -0.1032      3829      3434     3822     3427
BM_SpeedAEAD/seal_aes_128_gcm/InputSize:16384    -0.1077  -0.1081      7568      6753     7553     6737
BM_SpeedAEAD/open_aes_128_gcm/InputSize:16       +0.0000  -0.0002       122       122      122      122
BM_SpeedAEAD/open_aes_128_gcm/InputSize:256      +0.0125  +0.0122       211       214      211      213
BM_SpeedAEAD/open_aes_128_gcm/InputSize:1350     -0.0161  -0.0165       697       685      695      684
BM_SpeedAEAD/open_aes_128_gcm/InputSize:8192     -0.0183  -0.0183      3546      3482     3539     3475
BM_SpeedAEAD/open_aes_128_gcm/InputSize:16384    -0.0184  -0.0176      6985      6857     6968     6845

Change-Id: Ied74d3493f174f6e8aeaa300816b39d72f2be042
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/87988
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Lily Chen <chlily@google.com>

BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.
Although BoringSSL is an open source project, it is not intended for general use, as OpenSSL is. We don't recommend that third parties depend upon it. Doing so is likely to be frustrating because there are no guarantees of API or ABI stability.
Programs ship their own copies of BoringSSL when they use it and we update everything as needed when deciding to make API changes. This allows us to mostly avoid compromises in the name of compatibility. It works for us, but it may not work for you.
BoringSSL arose because Google used OpenSSL for many years in various ways and, over time, built up a large number of patches that were maintained while tracking upstream OpenSSL. As Google's product portfolio became more complex, more copies of OpenSSL sprang up, and the effort involved in maintaining all these patches in multiple places grew steadily.
Currently BoringSSL is the SSL library in Chrome/Chromium, Android (but it's not part of the NDK) and a number of other apps/programs.
Project links:
To file a security issue, use the Chromium process and mention in the report that it is for BoringSSL. You can ignore the parts of the process that are specific to Chromium/Chrome.
There are other files in this directory which might be helpful: