Enable SHA-512 ARM acceleration when available.
This imports the changes to sha512-armv8.pl from
upstream's af0fcf7b4668218b24d9250b95e0b96939ccb4d1.
Tweaks needed:
- Add an explicit .text because we put .LK$BITS in .rodata for XOM
- .LK$BITS and the code are in separate sections, so use adrp/add instead
  of plain adr
- Where glibc needs feature flags to *enable* pthread_rwlock, Apple
interprets _XOPEN_SOURCE as a request to *disable* Apple extensions.
Tighten the condition on the _XOPEN_SOURCE check.
Added support for macOS and Linux, tested manually on an ARM Mac and a
VM, respectively. Fuchsia and Windows do not currently have APIs to
expose this bit, so I've left in TODOs. Benchmarks from an Apple M1 Max:
Before:
Did 4647000 SHA-512 (16 bytes) operations in 1000103us (74.3 MB/sec)
Did 1614000 SHA-512 (256 bytes) operations in 1000379us (413.0 MB/sec)
Did 439000 SHA-512 (1350 bytes) operations in 1001694us (591.6 MB/sec)
Did 76000 SHA-512 (8192 bytes) operations in 1011821us (615.3 MB/sec)
Did 39000 SHA-512 (16384 bytes) operations in 1024311us (623.8 MB/sec)
After:
Did 10369000 SHA-512 (16 bytes) operations in 1000088us (165.9 MB/sec) [+123.1%]
Did 3650000 SHA-512 (256 bytes) operations in 1000079us (934.3 MB/sec) [+126.2%]
Did 1029000 SHA-512 (1350 bytes) operations in 1000521us (1388.4 MB/sec) [+134.7%]
Did 175000 SHA-512 (8192 bytes) operations in 1001874us (1430.9 MB/sec) [+132.5%]
Did 89000 SHA-512 (16384 bytes) operations in 1010314us (1443.3 MB/sec) [+131.4%]
(This doesn't seem to change the overall SHA-256 vs SHA-512 performance
question on ARM, when hashing perf matters. SHA-256 on the same chip
gets up to 2454.6 MB/s.)
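
The bracketed percentages in the "After" numbers can be reproduced from
the operation counts and elapsed times. A minimal sketch in C, where
speedup_percent is a hypothetical helper and not part of the benchmark
tool:

```c
#include <assert.h>

// Hypothetical helper, not part of BoringSSL's speed tool. Returns the
// percentage change in throughput (operations per microsecond) between two
// benchmark lines.
double speedup_percent(double before_ops, double before_us,
                       double after_ops, double after_us) {
  assert(before_ops > 0 && before_us > 0 && after_us > 0);
  double before_rate = before_ops / before_us;
  double after_rate = after_ops / after_us;
  return (after_rate / before_rate - 1.0) * 100.0;
}
```

For the 16-byte row, speedup_percent(4647000, 1000103, 10369000, 1000088)
comes out at roughly 123.1, in line with the log.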
For now, we'll have build coverage everywhere and test coverage in
Chromium, which runs this code on macOS CI. We should request a macOS
ARM64 bot for our standalone CI. Longer term, we need a QEMU-based
builder to test various features. QEMU seems to have pretty good
coverage of all this, which will at least give us Linux.
I haven't added an OPENSSL_STATIC_ARMCAP_SHA512 for now. Instead, we
just look at the standard __ARM_FEATURE_SHA512 define. Strangely, the
corresponding -march tag is not sha512. Neither GCC nor Clang has
-march=armv8-a+sha512. Instead, -march=armv8-a+sha3 implies both
__ARM_FEATURE_SHA3 and __ARM_FEATURE_SHA512! Yet everything else seems
to describe the SHA512 extension as separate from SHA3.
https://developer.arm.com/architectures/system-architectures/software-standards/acle
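
To see which of these macros a given toolchain and -march setting
actually defines, a small probe along the following lines can help. The
function name and bit values are illustrative only and are not
BoringSSL's.

```c
// Illustrative bit values only; these are not BoringSSL's ARMV8_* constants.
#define PROBE_SHA3 (1u << 0)
#define PROBE_SHA512 (1u << 1)

// Hypothetical probe: reports which ACLE feature macros the compiler defined
// for this translation unit.
unsigned probe_static_features(void) {
  unsigned caps = 0;
#if defined(__ARM_FEATURE_SHA3)
  caps |= PROBE_SHA3;
#endif
#if defined(__ARM_FEATURE_SHA512)
  caps |= PROBE_SHA512;
#endif
  return caps;
}
```

Building this with -march=armv8-a+sha3 and inspecting the return value
would be one way to confirm the behavior described above. When the
compiler defines neither macro, it returns 0.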
Update-Note: Consumers with a different build setup may need to
limit -D_XOPEN_SOURCE=700 to Linux or non-Apple platforms. Otherwise,
<sys/types.h> won't define some typedefs needed by <sys/sysctl.h>. If you
see a build error about u_char, etc., being undefined in some system
header, that is probably the cause.
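
For consumers that set the feature macro in source rather than in build
flags, the same restriction might look like the sketch below. It assumes
a POSIX target with <pthread.h>, and rwlock_type_size is a hypothetical
function used only to show that pthread_rwlock_t is visible.

```c
// Request the XSI interfaces only where they are needed. On Apple platforms,
// defining _XOPEN_SOURCE hides extensions such as u_char that
// <sys/sysctl.h> relies on, so leave it unset there.
#if !defined(__APPLE__) && !defined(_WIN32) && !defined(_XOPEN_SOURCE)
#define _XOPEN_SOURCE 700
#endif

#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>

// Hypothetical check that pthread_rwlock_t is declared with these settings.
size_t rwlock_type_size(void) {
  return sizeof(pthread_rwlock_t);
}
```

The macro has to be defined before the first system header is included,
which is why it sits at the top of the file.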
Change-Id: Ia213d3796b84c71b7966bb68e0aec92e5d7d26f0
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/50807
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
diff --git a/BUILDING.md b/BUILDING.md
index 08f004c..64b1520 100644
--- a/BUILDING.md
+++ b/BUILDING.md
@@ -163,22 +163,17 @@
don't have steps for assembling the assembly language source files, so they
currently cannot be used to build BoringSSL.
-## Embedded ARM
+## ARM CPU Capabilities
-ARM, unlike Intel, does not have an instruction that allows applications to
-discover the capabilities of the processor. Instead, the capability information
-has to be provided by the operating system somehow.
+ARM, unlike Intel, does not have a userspace instruction that allows
+applications to discover the capabilities of the processor. Instead, the
+capability information has to be provided by a combination of compile-time
+information and the operating system.
-By default, on Linux-based systems, BoringSSL will try to use `getauxval` and
-`/proc` to discover the capabilities. But some environments don't support that
-sort of thing and, for them, it's possible to configure the CPU capabilities at
-compile time.
-
-On iOS or builds which define `OPENSSL_STATIC_ARMCAP`, features will be
-determined based on the `__ARM_NEON__` and `__ARM_FEATURE_CRYPTO` preprocessor
-symbols reported by the compiler. These values are usually controlled by the
-`-march` flag. You can also define any of the following to enable the
-corresponding ARM feature.
+BoringSSL determines capabilities at compile-time based on `__ARM_NEON__`,
+`__ARM_FEATURE_CRYPTO`, and other preprocessor symbols reported by the compiler.
+These values are usually controlled by the `-march` flag. You can also define
+any of the following to enable the corresponding ARM feature.
* `OPENSSL_STATIC_ARMCAP_NEON`
* `OPENSSL_STATIC_ARMCAP_AES`
@@ -186,8 +181,16 @@
* `OPENSSL_STATIC_ARMCAP_SHA256`
* `OPENSSL_STATIC_ARMCAP_PMULL`
-Note that if a feature is enabled in this way, but not actually supported at
-run-time, BoringSSL will likely crash.
+The resulting binary will assume all such features are always present. This can
+reduce code size, by allowing the compiler to omit fallbacks. However, if the
+feature is not actually supported at runtime, BoringSSL will likely crash.
+
+BoringSSL will additionally query the operating system at runtime for additional
+features, e.g. with `getauxval` on Linux. This allows a single binary to use
+newer instructions when present, but still function on CPUs without them. But
+some environments don't support runtime queries. If building for those, define
+`OPENSSL_STATIC_ARMCAP` to limit BoringSSL to compile-time capabilities. If not
+defined, the target operating system must be known to BoringSSL.
## Binary Size
diff --git a/CMakeLists.txt b/CMakeLists.txt
index f3fc7bc..6c70b55 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -257,8 +257,10 @@
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -std=c11")
endif()
-# pthread_rwlock_t requires a feature flag.
-if(NOT WIN32)
+# pthread_rwlock_t on Linux requires a feature flag. However, it should not be
+# set on Apple platforms, where it instead disables APIs we use. See compat(5)
+# and sys/cdefs.h.
+if(NOT WIN32 AND NOT APPLE)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -D_XOPEN_SOURCE=700")
endif()
diff --git a/crypto/CMakeLists.txt b/crypto/CMakeLists.txt
index d73ce1e..31ccfc1 100644
--- a/crypto/CMakeLists.txt
+++ b/crypto/CMakeLists.txt
@@ -263,6 +263,7 @@
cipher_extra/tls_cbc.c
cmac/cmac.c
conf/conf.c
+ cpu-aarch64-apple.c
cpu-aarch64-fuchsia.c
cpu-aarch64-linux.c
cpu-aarch64-win.c
diff --git a/crypto/cpu-aarch64-apple.c b/crypto/cpu-aarch64-apple.c
new file mode 100644
index 0000000..56012d6
--- /dev/null
+++ b/crypto/cpu-aarch64-apple.c
@@ -0,0 +1,73 @@
+/* Copyright (c) 2021, Google Inc.
+ *
+ * Permission to use, copy, modify, and/or distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
+ * SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
+ * OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
+ * CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. */
+
+#include <openssl/cpu.h>
+
+#if defined(OPENSSL_AARCH64) && defined(OPENSSL_APPLE) && \
+ !defined(OPENSSL_STATIC_ARMCAP)
+
+#include <sys/sysctl.h>
+#include <sys/types.h>
+
+#include <openssl/arm_arch.h>
+
+#include "internal.h"
+
+
+extern uint32_t OPENSSL_armcap_P;
+
+static int has_hw_feature(const char *name) {
+ int value;
+ size_t len = sizeof(value);
+ if (sysctlbyname(name, &value, &len, NULL, 0) != 0) {
+ return 0;
+ }
+ if (len != sizeof(int)) {
+ // This should not happen. All the values queried should be integer-valued.
+ assert(0);
+ return 0;
+ }
+
+ // Per sys/sysctl.h:
+ //
+ // Selectors that return errors are not support on the system. Supported
+ // features will return 1 if they are recommended or 0 if they are supported
+ // but are not expected to help performance. Future versions of these
+ // selectors may return larger values as necessary so it is best to test for
+ // non zero.
+ return value != 0;
+}
+
+void OPENSSL_cpuid_setup(void) {
+ // Apple ARM64 platforms have NEON and cryptography extensions available
+ // statically, so we do not need to query them. In particular, there sometimes
+ // are no sysctls corresponding to such features. See below.
+#if !defined(__ARM_NEON) || !defined(__ARM_FEATURE_CRYPTO)
+#error "NEON and crypto extensions should be statically available."
+#endif
+ OPENSSL_armcap_P =
+ ARMV7_NEON | ARMV8_AES | ARMV8_PMULL | ARMV8_SHA1 | ARMV8_SHA256;
+
+ // macOS has sysctls named both like "hw.optional.arm.FEAT_SHA512" and like
+ // "hw.optional.armv8_2_sha512". There does not appear to be documentation on
+ // which to use. The "armv8_2_sha512" style omits statically-available
+ // features, while the "FEAT_SHA512" style includes them. However, the
+ // "FEAT_SHA512" style was added in macOS 12, so we use the older style for
+ // better compatibility and handle static features above.
+ if (has_hw_feature("hw.optional.armv8_2_sha512")) {
+ OPENSSL_armcap_P |= ARMV8_SHA512;
+ }
+}
+
+#endif // OPENSSL_AARCH64 && OPENSSL_APPLE && !OPENSSL_STATIC_ARMCAP
diff --git a/crypto/cpu-aarch64-fuchsia.c b/crypto/cpu-aarch64-fuchsia.c
index 98303a0..5c6d115 100644
--- a/crypto/cpu-aarch64-fuchsia.c
+++ b/crypto/cpu-aarch64-fuchsia.c
@@ -50,6 +50,9 @@
if (hwcap & ZX_ARM64_FEATURE_ISA_SHA2) {
OPENSSL_armcap_P |= ARMV8_SHA256;
}
+ // As of writing, Fuchsia does not have a flag for ARMv8.2 SHA-512
+ // extensions. When it does, add it here. See
+ // https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=90759.
}
-#endif // OPENSSL_AARCH64 && !OPENSSL_STATIC_ARMCAP
+#endif // OPENSSL_AARCH64 && OPENSSL_FUCHSIA && !OPENSSL_STATIC_ARMCAP
diff --git a/crypto/cpu-aarch64-linux.c b/crypto/cpu-aarch64-linux.c
index 0184dd4..6ae870a 100644
--- a/crypto/cpu-aarch64-linux.c
+++ b/crypto/cpu-aarch64-linux.c
@@ -36,6 +36,7 @@
static const unsigned long kPMULL = 1 << 4;
static const unsigned long kSHA1 = 1 << 5;
static const unsigned long kSHA256 = 1 << 6;
+ static const unsigned long kSHA512 = 1 << 21;
if ((hwcap & kNEON) == 0) {
// Matching OpenSSL, if NEON is missing, don't report other features
@@ -57,6 +58,9 @@
if (hwcap & kSHA256) {
OPENSSL_armcap_P |= ARMV8_SHA256;
}
+ if (hwcap & kSHA512) {
+ OPENSSL_armcap_P |= ARMV8_SHA512;
+ }
}
-#endif // OPENSSL_AARCH64 && !OPENSSL_STATIC_ARMCAP
+#endif // OPENSSL_AARCH64 && OPENSSL_LINUX && !OPENSSL_STATIC_ARMCAP
diff --git a/crypto/cpu-aarch64-win.c b/crypto/cpu-aarch64-win.c
index ee7f8e0..3d0014e 100644
--- a/crypto/cpu-aarch64-win.c
+++ b/crypto/cpu-aarch64-win.c
@@ -36,6 +36,8 @@
OPENSSL_armcap_P |= ARMV8_SHA1;
OPENSSL_armcap_P |= ARMV8_SHA256;
}
+ // As of writing, Windows does not have a |PF_*| value for ARMv8.2 SHA-512
+ // extensions. When it does, add it here.
}
-#endif
+#endif // OPENSSL_AARCH64 && OPENSSL_WINDOWS && !OPENSSL_STATIC_ARMCAP
diff --git a/crypto/crypto.c b/crypto/crypto.c
index 6886aa4..b78b122 100644
--- a/crypto/crypto.c
+++ b/crypto/crypto.c
@@ -105,6 +105,9 @@
#if defined(OPENSSL_STATIC_ARMCAP_PMULL) || defined(__ARM_FEATURE_CRYPTO)
ARMV8_PMULL |
#endif
+#if defined(__ARM_FEATURE_SHA512)
+ ARMV8_SHA512 |
+#endif
0;
#else
diff --git a/crypto/fipsmodule/sha/asm/sha512-armv8.pl b/crypto/fipsmodule/sha/asm/sha512-armv8.pl
index e961312..8cb312f 100644
--- a/crypto/fipsmodule/sha/asm/sha512-armv8.pl
+++ b/crypto/fipsmodule/sha/asm/sha512-armv8.pl
@@ -185,8 +185,6 @@
.type $func,%function
.align 6
$func:
-___
-$code.=<<___ if ($SZ==4);
AARCH64_VALID_CALL_TARGET
#ifndef __KERNEL__
#if __has_feature(hwaddress_sanitizer) && __clang_major__ >= 10
@@ -195,11 +193,17 @@
adrp x16,:pg_hi21:OPENSSL_armcap_P
#endif
ldr w16,[x16,:lo12:OPENSSL_armcap_P]
+___
+$code.=<<___ if ($SZ==4);
tst w16,#ARMV8_SHA256
b.ne .Lv8_entry
-#endif
+___
+$code.=<<___ if ($SZ==8);
+ tst w16,#ARMV8_SHA512
+ b.ne .Lv8_entry
___
$code.=<<___;
+#endif
AARCH64_SIGN_LINK_REGISTER
stp x29,x30,[sp,#-128]!
add x29,sp,#0
@@ -425,6 +429,110 @@
___
}
+if ($SZ==8) {
+my $Ktbl="x3";
+
+my @H = map("v$_.16b",(0..4));
+my ($fg,$de,$m9_10)=map("v$_.16b",(5..7));
+my @MSG=map("v$_.16b",(16..23));
+my ($W0,$W1)=("v24.2d","v25.2d");
+my ($AB,$CD,$EF,$GH)=map("v$_.16b",(26..29));
+
+$code.=<<___;
+.text
+#ifndef __KERNEL__
+.type sha512_block_armv8,%function
+.align 6
+sha512_block_armv8:
+.Lv8_entry:
+ stp x29,x30,[sp,#-16]!
+ add x29,sp,#0
+
+ ld1 {@MSG[0]-@MSG[3]},[$inp],#64 // load input
+ ld1 {@MSG[4]-@MSG[7]},[$inp],#64
+
+ ld1.64 {@H[0]-@H[3]},[$ctx] // load context
+ adrp $Ktbl,:pg_hi21:.LK512
+ add $Ktbl,$Ktbl,:lo12:.LK512
+
+ rev64 @MSG[0],@MSG[0]
+ rev64 @MSG[1],@MSG[1]
+ rev64 @MSG[2],@MSG[2]
+ rev64 @MSG[3],@MSG[3]
+ rev64 @MSG[4],@MSG[4]
+ rev64 @MSG[5],@MSG[5]
+ rev64 @MSG[6],@MSG[6]
+ rev64 @MSG[7],@MSG[7]
+ b .Loop_hw
+
+.align 4
+.Loop_hw:
+ ld1.64 {$W0},[$Ktbl],#16
+ subs $num,$num,#1
+ sub x4,$inp,#128
+ orr $AB,@H[0],@H[0] // offload
+ orr $CD,@H[1],@H[1]
+ orr $EF,@H[2],@H[2]
+ orr $GH,@H[3],@H[3]
+ csel $inp,$inp,x4,ne // conditional rewind
+___
+for($i=0;$i<32;$i++) {
+$code.=<<___;
+ add.i64 $W0,$W0,@MSG[0]
+ ld1.64 {$W1},[$Ktbl],#16
+ ext $W0,$W0,$W0,#8
+ ext $fg,@H[2],@H[3],#8
+ ext $de,@H[1],@H[2],#8
+ add.i64 @H[3],@H[3],$W0 // "T1 + H + K512[i]"
+ sha512su0 @MSG[0],@MSG[1]
+ ext $m9_10,@MSG[4],@MSG[5],#8
+ sha512h @H[3],$fg,$de
+ sha512su1 @MSG[0],@MSG[7],$m9_10
+ add.i64 @H[4],@H[1],@H[3] // "D + T1"
+ sha512h2 @H[3],$H[1],@H[0]
+___
+ ($W0,$W1)=($W1,$W0); push(@MSG,shift(@MSG));
+ @H = (@H[3],@H[0],@H[4],@H[2],@H[1]);
+}
+for(;$i<40;$i++) {
+$code.=<<___ if ($i<39);
+ ld1.64 {$W1},[$Ktbl],#16
+___
+$code.=<<___ if ($i==39);
+ sub $Ktbl,$Ktbl,#$rounds*$SZ // rewind
+___
+$code.=<<___;
+ add.i64 $W0,$W0,@MSG[0]
+ ld1 {@MSG[0]},[$inp],#16 // load next input
+ ext $W0,$W0,$W0,#8
+ ext $fg,@H[2],@H[3],#8
+ ext $de,@H[1],@H[2],#8
+ add.i64 @H[3],@H[3],$W0 // "T1 + H + K512[i]"
+ sha512h @H[3],$fg,$de
+ rev64 @MSG[0],@MSG[0]
+ add.i64 @H[4],@H[1],@H[3] // "D + T1"
+ sha512h2 @H[3],$H[1],@H[0]
+___
+ ($W0,$W1)=($W1,$W0); push(@MSG,shift(@MSG));
+ @H = (@H[3],@H[0],@H[4],@H[2],@H[1]);
+}
+$code.=<<___;
+ add.i64 @H[0],@H[0],$AB // accumulate
+ add.i64 @H[1],@H[1],$CD
+ add.i64 @H[2],@H[2],$EF
+ add.i64 @H[3],@H[3],$GH
+
+ cbnz $num,.Loop_hw
+
+ st1.64 {@H[0]-@H[3]},[$ctx] // store context
+
+ ldr x29,[sp],#16
+ ret
+.size sha512_block_armv8,.-sha512_block_armv8
+#endif
+___
+}
+
{ my %opcode = (
"sha256h" => 0x5e004000, "sha256h2" => 0x5e005000,
"sha256su0" => 0x5e282800, "sha256su1" => 0x5e006000 );
@@ -440,6 +548,21 @@
}
}
+{ my %opcode = (
+ "sha512h" => 0xce608000, "sha512h2" => 0xce608400,
+ "sha512su0" => 0xcec08000, "sha512su1" => 0xce608800 );
+
+ sub unsha512 {
+ my ($mnemonic,$arg)=@_;
+
+ $arg =~ m/[qv]([0-9]+)[^,]*,\s*[qv]([0-9]+)[^,]*(?:,\s*[qv]([0-9]+))?/o
+ &&
+ sprintf ".inst\t0x%08x\t//%s %s",
+ $opcode{$mnemonic}|$1|($2<<5)|($3<<16),
+ $mnemonic,$arg;
+ }
+}
+
open SELF,$0;
while(<SELF>) {
next if (/^#!/);
@@ -452,12 +575,15 @@
s/\`([^\`]*)\`/eval($1)/ge;
+ s/\b(sha512\w+)\s+([qv].*)/unsha512($1,$2)/ge or
s/\b(sha256\w+)\s+([qv].*)/unsha256($1,$2)/ge;
s/\bq([0-9]+)\b/v$1.16b/g; # old->new registers
s/\.[ui]?8(\s)/$1/;
+ s/\.\w?64\b// and s/\.16b/\.2d/g or
s/\.\w?32\b// and s/\.16b/\.4s/g;
+ m/\bext\b/ and s/\.2d/\.16b/g or
m/(ld|st)1[^\[]+\[0\]/ and s/\.4s/\.s/g;
print $_,"\n";
diff --git a/include/openssl/arm_arch.h b/include/openssl/arm_arch.h
index 81dc796..13f5b4a 100644
--- a/include/openssl/arm_arch.h
+++ b/include/openssl/arm_arch.h
@@ -117,6 +117,9 @@
// ARMV8_PMULL indicates support for carryless multiplication.
#define ARMV8_PMULL (1 << 5)
+// ARMV8_SHA512 indicates support for hardware SHA-512 instructions.
+#define ARMV8_SHA512 (1 << 6)
+
#if defined(__ASSEMBLER__)
// Support macros for
diff --git a/include/openssl/cpu.h b/include/openssl/cpu.h
index 91cf95e..e71fbec 100644
--- a/include/openssl/cpu.h
+++ b/include/openssl/cpu.h
@@ -105,8 +105,9 @@
#if defined(OPENSSL_ARM) || defined(OPENSSL_AARCH64)
-#if defined(OPENSSL_APPLE)
-// iOS builds use the static ARM configuration.
+#if defined(OPENSSL_APPLE) && defined(OPENSSL_ARM)
+// We do not detect any features at runtime for Apple's 32-bit ARM platforms. On
+// 64-bit ARM, we detect some post-ARMv8.0 features.
#define OPENSSL_STATIC_ARMCAP
#endif
diff --git a/util/BUILD.toplevel b/util/BUILD.toplevel
index 65e0cdc..462a24f 100644
--- a/util/BUILD.toplevel
+++ b/util/BUILD.toplevel
@@ -89,9 +89,6 @@
# ensure that binaries can be built with non-executable stack.
"-Wa,--noexecstack",
- # This is needed on Linux systems (at least) to get rwlock in pthread.
- "-D_XOPEN_SOURCE=700",
-
# This list of warnings should match those in the top-level CMakeLists.txt.
"-Wall",
"-Werror",
@@ -108,10 +105,17 @@
# "-DOPENSSL_C11_ATOMIC",
]
+linux_copts = posix_copts + [
+ # This is needed on Linux systems (at least) to get rwlock in pthread, but
+ # it should not be set on Apple platforms, where it instead disables APIs
+ # we use. See compat(5) and sys/cdefs.h.
+ "-D_XOPEN_SOURCE=700",
+]
+
boringssl_copts = select({
- ":linux_aarch64": posix_copts,
- ":linux_ppc64le": posix_copts,
- ":linux_x86_64": posix_copts,
+ ":linux_aarch64": linux_copts,
+ ":linux_ppc64le": linux_copts,
+ ":linux_x86_64": linux_copts,
":mac_x86_64": posix_copts,
":windows_x86_64": [
"-DWIN32_LEAN_AND_MEAN",