Enable upstream's ChaCha20 assembly for x86 and ARM (32- and 64-bit).
This removes chacha_vec_arm.S and chacha_vec.c in favor of unifying on
upstream's code. Upstream's code is faster, and this cuts down on the number of
distinct codepaths. Our old scheme also didn't give vectorized code on
Windows or aarch64.
BoringSSL-specific modifications made to the assembly:
- As usual, the shelling out to $CC is replaced with hardcoding $avx. I've
tested up to the AVX2 codepath, so all of it is enabled.
- I've removed the AMD XOP code as I have not tested it.
- As usual, the ARM files need the arm_arch.h include tweaked.
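The public CRYPTO_chacha_20 entry point is unchanged; as a reference for the
benchmarks below, a minimal caller looks like the following sketch (the key,
nonce, and buffer contents are illustrative placeholders, not part of this
change):

  #include <stdint.h>
  #include <openssl/chacha.h>

  int main(void) {
    /* Placeholder key, nonce, and plaintext. */
    static const uint8_t key[32] = {0};
    static const uint8_t nonce[12] = {0};
    static const uint8_t in[1350] = {0};
    uint8_t out[sizeof(in)];

    /* Same API as before; the new assembly is selected internally. */
    CRYPTO_chacha_20(out, in, sizeof(in), key, nonce, 0 /* counter */);
    return 0;
  }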
Speed numbers follow. We can hope for further wins on these benchmarks after
importing the Poly1305 assembly.
x86
---
Old:
Did 1422000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000433us (1421384.5 ops/sec): 22.7 MB/s
Did 123000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1003803us (122534.0 ops/sec): 165.4 MB/s
Did 22000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1000282us (21993.8 ops/sec): 180.2 MB/s
Did 1428000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000214us (1427694.5 ops/sec): 22.8 MB/s
Did 124000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1006332us (123219.8 ops/sec): 166.3 MB/s
Did 22000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1020771us (21552.3 ops/sec): 176.6 MB/s
New:
Did 1520000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000567us (1519138.6 ops/sec): 24.3 MB/s
Did 152000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1004216us (151361.9 ops/sec): 204.3 MB/s
Did 31000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1009085us (30720.9 ops/sec): 251.7 MB/s
Did 1797000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000141us (1796746.7 ops/sec): 28.7 MB/s
Did 171000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1003204us (170453.9 ops/sec): 230.1 MB/s
Did 31000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1005349us (30835.1 ops/sec): 252.6 MB/s
x86_64, no AVX2
---
Old:
Did 1782000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000204us (1781636.5 ops/sec): 28.5 MB/s
Did 317000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001579us (316500.2 ops/sec): 427.3 MB/s
Did 62000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1012146us (61256.0 ops/sec): 501.8 MB/s
Did 1778000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000220us (1777608.9 ops/sec): 28.4 MB/s
Did 315000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1002886us (314093.5 ops/sec): 424.0 MB/s
Did 71000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1014606us (69977.9 ops/sec): 573.3 MB/s
New:
Did 1866000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000019us (1865964.5 ops/sec): 29.9 MB/s
Did 399000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001017us (398594.6 ops/sec): 538.1 MB/s
Did 84000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1005645us (83528.5 ops/sec): 684.3 MB/s
Did 1881000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000325us (1880388.9 ops/sec): 30.1 MB/s
Did 404000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1000004us (403998.4 ops/sec): 545.4 MB/s
Did 85000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1010048us (84154.4 ops/sec): 689.4 MB/s
x86_64, AVX2
---
Old:
Did 2375000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000282us (2374330.4 ops/sec): 38.0 MB/s
Did 448000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001865us (447166.0 ops/sec): 603.7 MB/s
Did 88000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1005217us (87543.3 ops/sec): 717.2 MB/s
Did 2409000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000188us (2408547.2 ops/sec): 38.5 MB/s
Did 446000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1001003us (445553.1 ops/sec): 601.5 MB/s
Did 90000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1006722us (89399.1 ops/sec): 732.4 MB/s
New:
Did 2622000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000266us (2621302.7 ops/sec): 41.9 MB/s
Did 794000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1000783us (793378.8 ops/sec): 1071.1 MB/s
Did 173000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1000176us (172969.6 ops/sec): 1417.0 MB/s
Did 2623000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000330us (2622134.7 ops/sec): 42.0 MB/s
Did 783000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1000531us (782584.4 ops/sec): 1056.5 MB/s
Did 174000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1000840us (173854.0 ops/sec): 1424.2 MB/s
arm, Nexus 4
---
Old:
Did 388550 ChaCha20-Poly1305 (16 bytes) seal operations in 1000580us (388324.8 ops/sec): 6.2 MB/s
Did 90000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1003816us (89657.9 ops/sec): 121.0 MB/s
Did 19000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1045750us (18168.8 ops/sec): 148.8 MB/s
Did 398500 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000305us (398378.5 ops/sec): 6.4 MB/s
Did 90500 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1000305us (90472.4 ops/sec): 122.1 MB/s
Did 19000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1043278us (18211.8 ops/sec): 149.2 MB/s
New:
Did 424788 ChaCha20-Poly1305 (16 bytes) seal operations in 1000641us (424515.9 ops/sec): 6.8 MB/s
Did 115000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001526us (114824.8 ops/sec): 155.0 MB/s
Did 27000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1033023us (26136.9 ops/sec): 214.1 MB/s
Did 447750 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000549us (447504.3 ops/sec): 7.2 MB/s
Did 117500 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1001923us (117274.5 ops/sec): 158.3 MB/s
Did 27000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1025118us (26338.4 ops/sec): 215.8 MB/s
aarch64, Nexus 6P
(Note we didn't have aarch64 assembly before at all, and still don't have it
for Poly1305. Hopefully once that's added this will be faster than the arm
numbers...)
---
Old:
Did 145040 ChaCha20-Poly1305 (16 bytes) seal operations in 1003065us (144596.8 ops/sec): 2.3 MB/s
Did 14000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1042605us (13427.9 ops/sec): 18.1 MB/s
Did 2618 ChaCha20-Poly1305 (8192 bytes) seal operations in 1093241us (2394.7 ops/sec): 19.6 MB/s
Did 148000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000709us (147895.1 ops/sec): 2.4 MB/s
Did 14000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1047294us (13367.8 ops/sec): 18.0 MB/s
Did 2607 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1090745us (2390.1 ops/sec): 19.6 MB/s
New:
Did 358000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000769us (357724.9 ops/sec): 5.7 MB/s
Did 45000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1021267us (44062.9 ops/sec): 59.5 MB/s
Did 8591 ChaCha20-Poly1305 (8192 bytes) seal operations in 1047136us (8204.3 ops/sec): 67.2 MB/s
Did 343000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000489us (342832.4 ops/sec): 5.5 MB/s
Did 44000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1008326us (43636.7 ops/sec): 58.9 MB/s
Did 8866 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1083341us (8183.9 ops/sec): 67.0 MB/s
Change-Id: I629fe195d072f2c99e8f947578fad6d70823c4c8
Reviewed-on: https://boringssl-review.googlesource.com/7202
Reviewed-by: Adam Langley <agl@google.com>
diff --git a/BUILDING.md b/BUILDING.md
index d40df9e..e111310 100644
--- a/BUILDING.md
+++ b/BUILDING.md
@@ -31,16 +31,6 @@
* [Go](https://golang.org/dl/) is required. If not found by CMake, the go
executable may be configured explicitly by setting `GO_EXECUTABLE`.
- * If you change crypto/chacha/chacha\_vec.c, you will need the
- arm-linux-gnueabihf-gcc compiler:
-
- ```
- wget https://releases.linaro.org/14.11/components/toolchain/binaries/arm-linux-gnueabihf/gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf.tar.xz && \
- echo bc4ca2ced084d2dc12424815a4442e19cb1422db87068830305d90075feb1a3b gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf.tar.xz | sha256sum -c && \
- tar xf gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf.tar.xz && \
- sudo mv gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf /opt/
- ```
-
## Building
Using Ninja (note the 'N' is capitalized in the cmake invocation):
diff --git a/crypto/chacha/CMakeLists.txt b/crypto/chacha/CMakeLists.txt
index 266e869..f9ab024 100644
--- a/crypto/chacha/CMakeLists.txt
+++ b/crypto/chacha/CMakeLists.txt
@@ -4,7 +4,31 @@
set(
CHACHA_ARCH_SOURCES
- chacha_vec_arm.S
+ chacha-armv4.${ASM_EXT}
+ )
+endif()
+
+if (${ARCH} STREQUAL "aarch64")
+ set(
+ CHACHA_ARCH_SOURCES
+
+ chacha-armv8.${ASM_EXT}
+ )
+endif()
+
+if (${ARCH} STREQUAL "x86")
+ set(
+ CHACHA_ARCH_SOURCES
+
+ chacha-x86.${ASM_EXT}
+ )
+endif()
+
+if (${ARCH} STREQUAL "x86_64")
+ set(
+ CHACHA_ARCH_SOURCES
+
+ chacha-x86_64.${ASM_EXT}
)
endif()
@@ -13,8 +37,12 @@
OBJECT
- chacha_generic.c
- chacha_vec.c
+ chacha.c
${CHACHA_ARCH_SOURCES}
)
+
+perlasm(chacha-armv4.${ASM_EXT} asm/chacha-armv4.pl)
+perlasm(chacha-armv8.${ASM_EXT} asm/chacha-armv8.pl)
+perlasm(chacha-x86.${ASM_EXT} asm/chacha-x86.pl)
+perlasm(chacha-x86_64.${ASM_EXT} asm/chacha-x86_64.pl)
\ No newline at end of file
diff --git a/crypto/chacha/asm/chacha-armv4.pl b/crypto/chacha/asm/chacha-armv4.pl
index 4545101..b190445 100755
--- a/crypto/chacha/asm/chacha-armv4.pl
+++ b/crypto/chacha/asm/chacha-armv4.pl
@@ -162,7 +162,7 @@
}
$code.=<<___;
-#include "arm_arch.h"
+#include <openssl/arm_arch.h>
.text
#if defined(__thumb2__)
diff --git a/crypto/chacha/asm/chacha-armv8.pl b/crypto/chacha/asm/chacha-armv8.pl
index 6ddb31f..e6fa144 100755
--- a/crypto/chacha/asm/chacha-armv8.pl
+++ b/crypto/chacha/asm/chacha-armv8.pl
@@ -111,7 +111,7 @@
}
$code.=<<___;
-#include "arm_arch.h"
+#include <openssl/arm_arch.h>
.text
diff --git a/crypto/chacha/asm/chacha-x86.pl b/crypto/chacha/asm/chacha-x86.pl
index 850c917..edce43d 100755
--- a/crypto/chacha/asm/chacha-x86.pl
+++ b/crypto/chacha/asm/chacha-x86.pl
@@ -26,6 +26,8 @@
# Bulldozer 13.4/+50% 4.38(*)
#
# (*) Bulldozer actually executes 4xXOP code path that delivers 3.55;
+#
+# Modified from upstream OpenSSL to remove the XOP code.
$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
push(@INC,"${dir}","${dir}../../perlasm");
@@ -36,22 +38,7 @@
$xmm=$ymm=0;
for (@ARGV) { $xmm=1 if (/-DOPENSSL_IA32_SSE2/); }
-$ymm=1 if ($xmm &&
- `$ENV{CC} -Wa,-v -c -o /dev/null -x assembler /dev/null 2>&1`
- =~ /GNU assembler version ([2-9]\.[0-9]+)/ &&
- $1>=2.19); # first version supporting AVX
-
-$ymm=1 if ($xmm && !$ymm && $ARGV[0] eq "win32n" &&
- `nasm -v 2>&1` =~ /NASM version ([2-9]\.[0-9]+)/ &&
- $1>=2.03); # first version supporting AVX
-
-$ymm=1 if ($xmm && !$ymm && $ARGV[0] eq "win32" &&
- `ml 2>&1` =~ /Version ([0-9]+)\./ &&
- $1>=10); # first version supporting AVX
-
-$ymm=1 if ($xmm && !$ymm &&
- `$ENV{CC} -v 2>&1` =~ /(^clang version|based on LLVM) ([3-9]\.[0-9]+)/ &&
- $2>=3.0); # first version supporting AVX
+$ymm=$xmm;
$a="eax";
($b,$b_)=("ebx","ebp");
@@ -118,7 +105,6 @@
}
&static_label("ssse3_shortcut");
-&static_label("xop_shortcut");
&static_label("ssse3_data");
&static_label("pic_point");
@@ -434,9 +420,6 @@
&function_begin("ChaCha20_ssse3");
&set_label("ssse3_shortcut");
- &test (&DWP(4,"ebp"),1<<11); # test XOP bit
- &jnz (&label("xop_shortcut"));
-
&mov ($out,&wparam(0));
&mov ($inp,&wparam(1));
&mov ($len,&wparam(2));
@@ -767,366 +750,4 @@
}
&asciz ("ChaCha20 for x86, CRYPTOGAMS by <appro\@openssl.org>");
-if ($xmm) {
-my ($xa,$xa_,$xb,$xb_,$xc,$xc_,$xd,$xd_)=map("xmm$_",(0..7));
-my ($out,$inp,$len)=("edi","esi","ecx");
-
-sub QUARTERROUND_XOP {
-my ($ai,$bi,$ci,$di,$i)=@_;
-my ($an,$bn,$cn,$dn)=map(($_&~3)+(($_+1)&3),($ai,$bi,$ci,$di)); # next
-my ($ap,$bp,$cp,$dp)=map(($_&~3)+(($_-1)&3),($ai,$bi,$ci,$di)); # previous
-
- # a b c d
- #
- # 0 4 8 12 < even round
- # 1 5 9 13
- # 2 6 10 14
- # 3 7 11 15
- # 0 5 10 15 < odd round
- # 1 6 11 12
- # 2 7 8 13
- # 3 4 9 14
-
- if ($i==0) {
- my $j=4;
- ($ap,$bp,$cp,$dp)=map(($_&~3)+(($_-$j--)&3),($ap,$bp,$cp,$dp));
- } elsif ($i==3) {
- my $j=0;
- ($an,$bn,$cn,$dn)=map(($_&~3)+(($_+$j++)&3),($an,$bn,$cn,$dn));
- } elsif ($i==4) {
- my $j=4;
- ($ap,$bp,$cp,$dp)=map(($_&~3)+(($_+$j--)&3),($ap,$bp,$cp,$dp));
- } elsif ($i==7) {
- my $j=0;
- ($an,$bn,$cn,$dn)=map(($_&~3)+(($_-$j++)&3),($an,$bn,$cn,$dn));
- }
-
- #&vpaddd ($xa,$xa,$xb); # see elsewhere
- #&vpxor ($xd,$xd,$xa); # see elsewhere
- &vmovdqa (&QWP(16*$cp-128,"ebx"),$xc_) if ($ai>0 && $ai<3);
- &vprotd ($xd,$xd,16);
- &vmovdqa (&QWP(16*$bp-128,"ebx"),$xb_) if ($i!=0);
- &vpaddd ($xc,$xc,$xd);
- &vmovdqa ($xc_,&QWP(16*$cn-128,"ebx")) if ($ai>0 && $ai<3);
- &vpxor ($xb,$i!=0?$xb:$xb_,$xc);
- &vmovdqa ($xa_,&QWP(16*$an-128,"ebx"));
- &vprotd ($xb,$xb,12);
- &vmovdqa ($xb_,&QWP(16*$bn-128,"ebx")) if ($i<7);
- &vpaddd ($xa,$xa,$xb);
- &vmovdqa ($xd_,&QWP(16*$dn-128,"ebx")) if ($di!=$dn);
- &vpxor ($xd,$xd,$xa);
- &vpaddd ($xa_,$xa_,$xb_) if ($i<7); # elsewhere
- &vprotd ($xd,$xd,8);
- &vmovdqa (&QWP(16*$ai-128,"ebx"),$xa);
- &vpaddd ($xc,$xc,$xd);
- &vmovdqa (&QWP(16*$di-128,"ebx"),$xd) if ($di!=$dn);
- &vpxor ($xb,$xb,$xc);
- &vpxor ($xd_,$di==$dn?$xd:$xd_,$xa_) if ($i<7); # elsewhere
- &vprotd ($xb,$xb,7);
-
- ($xa,$xa_)=($xa_,$xa);
- ($xb,$xb_)=($xb_,$xb);
- ($xc,$xc_)=($xc_,$xc);
- ($xd,$xd_)=($xd_,$xd);
-}
-
-&function_begin("ChaCha20_xop");
-&set_label("xop_shortcut");
- &mov ($out,&wparam(0));
- &mov ($inp,&wparam(1));
- &mov ($len,&wparam(2));
- &mov ("edx",&wparam(3)); # key
- &mov ("ebx",&wparam(4)); # counter and nonce
- &vzeroupper ();
-
- &mov ("ebp","esp");
- &stack_push (131);
- &and ("esp",-64);
- &mov (&DWP(512,"esp"),"ebp");
-
- &lea ("eax",&DWP(&label("ssse3_data")."-".
- &label("pic_point"),"eax"));
- &vmovdqu ("xmm3",&QWP(0,"ebx")); # counter and nonce
-
- &cmp ($len,64*4);
- &jb (&label("1x"));
-
- &mov (&DWP(512+4,"esp"),"edx"); # offload pointers
- &mov (&DWP(512+8,"esp"),"ebx");
- &sub ($len,64*4); # bias len
- &lea ("ebp",&DWP(256+128,"esp")); # size optimization
-
- &vmovdqu ("xmm7",&QWP(0,"edx")); # key
- &vpshufd ("xmm0","xmm3",0x00);
- &vpshufd ("xmm1","xmm3",0x55);
- &vpshufd ("xmm2","xmm3",0xaa);
- &vpshufd ("xmm3","xmm3",0xff);
- &vpaddd ("xmm0","xmm0",&QWP(16*3,"eax")); # fix counters
- &vpshufd ("xmm4","xmm7",0x00);
- &vpshufd ("xmm5","xmm7",0x55);
- &vpsubd ("xmm0","xmm0",&QWP(16*4,"eax"));
- &vpshufd ("xmm6","xmm7",0xaa);
- &vpshufd ("xmm7","xmm7",0xff);
- &vmovdqa (&QWP(16*12-128,"ebp"),"xmm0");
- &vmovdqa (&QWP(16*13-128,"ebp"),"xmm1");
- &vmovdqa (&QWP(16*14-128,"ebp"),"xmm2");
- &vmovdqa (&QWP(16*15-128,"ebp"),"xmm3");
- &vmovdqu ("xmm3",&QWP(16,"edx")); # key
- &vmovdqa (&QWP(16*4-128,"ebp"),"xmm4");
- &vmovdqa (&QWP(16*5-128,"ebp"),"xmm5");
- &vmovdqa (&QWP(16*6-128,"ebp"),"xmm6");
- &vmovdqa (&QWP(16*7-128,"ebp"),"xmm7");
- &vmovdqa ("xmm7",&QWP(16*2,"eax")); # sigma
- &lea ("ebx",&DWP(128,"esp")); # size optimization
-
- &vpshufd ("xmm0","xmm3",0x00);
- &vpshufd ("xmm1","xmm3",0x55);
- &vpshufd ("xmm2","xmm3",0xaa);
- &vpshufd ("xmm3","xmm3",0xff);
- &vpshufd ("xmm4","xmm7",0x00);
- &vpshufd ("xmm5","xmm7",0x55);
- &vpshufd ("xmm6","xmm7",0xaa);
- &vpshufd ("xmm7","xmm7",0xff);
- &vmovdqa (&QWP(16*8-128,"ebp"),"xmm0");
- &vmovdqa (&QWP(16*9-128,"ebp"),"xmm1");
- &vmovdqa (&QWP(16*10-128,"ebp"),"xmm2");
- &vmovdqa (&QWP(16*11-128,"ebp"),"xmm3");
- &vmovdqa (&QWP(16*0-128,"ebp"),"xmm4");
- &vmovdqa (&QWP(16*1-128,"ebp"),"xmm5");
- &vmovdqa (&QWP(16*2-128,"ebp"),"xmm6");
- &vmovdqa (&QWP(16*3-128,"ebp"),"xmm7");
-
- &lea ($inp,&DWP(128,$inp)); # size optimization
- &lea ($out,&DWP(128,$out)); # size optimization
- &jmp (&label("outer_loop"));
-
-&set_label("outer_loop",32);
- #&vmovdqa ("xmm0",&QWP(16*0-128,"ebp")); # copy key material
- &vmovdqa ("xmm1",&QWP(16*1-128,"ebp"));
- &vmovdqa ("xmm2",&QWP(16*2-128,"ebp"));
- &vmovdqa ("xmm3",&QWP(16*3-128,"ebp"));
- #&vmovdqa ("xmm4",&QWP(16*4-128,"ebp"));
- &vmovdqa ("xmm5",&QWP(16*5-128,"ebp"));
- &vmovdqa ("xmm6",&QWP(16*6-128,"ebp"));
- &vmovdqa ("xmm7",&QWP(16*7-128,"ebp"));
- #&vmovdqa (&QWP(16*0-128,"ebx"),"xmm0");
- &vmovdqa (&QWP(16*1-128,"ebx"),"xmm1");
- &vmovdqa (&QWP(16*2-128,"ebx"),"xmm2");
- &vmovdqa (&QWP(16*3-128,"ebx"),"xmm3");
- #&vmovdqa (&QWP(16*4-128,"ebx"),"xmm4");
- &vmovdqa (&QWP(16*5-128,"ebx"),"xmm5");
- &vmovdqa (&QWP(16*6-128,"ebx"),"xmm6");
- &vmovdqa (&QWP(16*7-128,"ebx"),"xmm7");
- #&vmovdqa ("xmm0",&QWP(16*8-128,"ebp"));
- #&vmovdqa ("xmm1",&QWP(16*9-128,"ebp"));
- &vmovdqa ("xmm2",&QWP(16*10-128,"ebp"));
- &vmovdqa ("xmm3",&QWP(16*11-128,"ebp"));
- &vmovdqa ("xmm4",&QWP(16*12-128,"ebp"));
- &vmovdqa ("xmm5",&QWP(16*13-128,"ebp"));
- &vmovdqa ("xmm6",&QWP(16*14-128,"ebp"));
- &vmovdqa ("xmm7",&QWP(16*15-128,"ebp"));
- &vpaddd ("xmm4","xmm4",&QWP(16*4,"eax")); # counter value
- #&vmovdqa (&QWP(16*8-128,"ebx"),"xmm0");
- #&vmovdqa (&QWP(16*9-128,"ebx"),"xmm1");
- &vmovdqa (&QWP(16*10-128,"ebx"),"xmm2");
- &vmovdqa (&QWP(16*11-128,"ebx"),"xmm3");
- &vmovdqa (&QWP(16*12-128,"ebx"),"xmm4");
- &vmovdqa (&QWP(16*13-128,"ebx"),"xmm5");
- &vmovdqa (&QWP(16*14-128,"ebx"),"xmm6");
- &vmovdqa (&QWP(16*15-128,"ebx"),"xmm7");
- &vmovdqa (&QWP(16*12-128,"ebp"),"xmm4"); # save counter value
-
- &vmovdqa ($xa, &QWP(16*0-128,"ebp"));
- &vmovdqa ($xd, "xmm4");
- &vmovdqa ($xb_,&QWP(16*4-128,"ebp"));
- &vmovdqa ($xc, &QWP(16*8-128,"ebp"));
- &vmovdqa ($xc_,&QWP(16*9-128,"ebp"));
-
- &mov ("edx",10); # loop counter
- &nop ();
-
-&set_label("loop",32);
- &vpaddd ($xa,$xa,$xb_); # elsewhere
- &vpxor ($xd,$xd,$xa); # elsewhere
- &QUARTERROUND_XOP(0, 4, 8, 12, 0);
- &QUARTERROUND_XOP(1, 5, 9, 13, 1);
- &QUARTERROUND_XOP(2, 6,10, 14, 2);
- &QUARTERROUND_XOP(3, 7,11, 15, 3);
- &QUARTERROUND_XOP(0, 5,10, 15, 4);
- &QUARTERROUND_XOP(1, 6,11, 12, 5);
- &QUARTERROUND_XOP(2, 7, 8, 13, 6);
- &QUARTERROUND_XOP(3, 4, 9, 14, 7);
- &dec ("edx");
- &jnz (&label("loop"));
-
- &vmovdqa (&QWP(16*4-128,"ebx"),$xb_);
- &vmovdqa (&QWP(16*8-128,"ebx"),$xc);
- &vmovdqa (&QWP(16*9-128,"ebx"),$xc_);
- &vmovdqa (&QWP(16*12-128,"ebx"),$xd);
- &vmovdqa (&QWP(16*14-128,"ebx"),$xd_);
-
- my ($xa0,$xa1,$xa2,$xa3,$xt0,$xt1,$xt2,$xt3)=map("xmm$_",(0..7));
-
- #&vmovdqa ($xa0,&QWP(16*0-128,"ebx")); # it's there
- &vmovdqa ($xa1,&QWP(16*1-128,"ebx"));
- &vmovdqa ($xa2,&QWP(16*2-128,"ebx"));
- &vmovdqa ($xa3,&QWP(16*3-128,"ebx"));
-
- for($i=0;$i<256;$i+=64) {
- &vpaddd ($xa0,$xa0,&QWP($i+16*0-128,"ebp")); # accumulate key material
- &vpaddd ($xa1,$xa1,&QWP($i+16*1-128,"ebp"));
- &vpaddd ($xa2,$xa2,&QWP($i+16*2-128,"ebp"));
- &vpaddd ($xa3,$xa3,&QWP($i+16*3-128,"ebp"));
-
- &vpunpckldq ($xt2,$xa0,$xa1); # "de-interlace" data
- &vpunpckldq ($xt3,$xa2,$xa3);
- &vpunpckhdq ($xa0,$xa0,$xa1);
- &vpunpckhdq ($xa2,$xa2,$xa3);
- &vpunpcklqdq ($xa1,$xt2,$xt3); # "a0"
- &vpunpckhqdq ($xt2,$xt2,$xt3); # "a1"
- &vpunpcklqdq ($xt3,$xa0,$xa2); # "a2"
- &vpunpckhqdq ($xa3,$xa0,$xa2); # "a3"
-
- &vpxor ($xt0,$xa1,&QWP(64*0-128,$inp));
- &vpxor ($xt1,$xt2,&QWP(64*1-128,$inp));
- &vpxor ($xt2,$xt3,&QWP(64*2-128,$inp));
- &vpxor ($xt3,$xa3,&QWP(64*3-128,$inp));
- &lea ($inp,&QWP($i<192?16:(64*4-16*3),$inp));
- &vmovdqa ($xa0,&QWP($i+16*4-128,"ebx")) if ($i<192);
- &vmovdqa ($xa1,&QWP($i+16*5-128,"ebx")) if ($i<192);
- &vmovdqa ($xa2,&QWP($i+16*6-128,"ebx")) if ($i<192);
- &vmovdqa ($xa3,&QWP($i+16*7-128,"ebx")) if ($i<192);
- &vmovdqu (&QWP(64*0-128,$out),$xt0); # store output
- &vmovdqu (&QWP(64*1-128,$out),$xt1);
- &vmovdqu (&QWP(64*2-128,$out),$xt2);
- &vmovdqu (&QWP(64*3-128,$out),$xt3);
- &lea ($out,&QWP($i<192?16:(64*4-16*3),$out));
- }
- &sub ($len,64*4);
- &jnc (&label("outer_loop"));
-
- &add ($len,64*4);
- &jz (&label("done"));
-
- &mov ("ebx",&DWP(512+8,"esp")); # restore pointers
- &lea ($inp,&DWP(-128,$inp));
- &mov ("edx",&DWP(512+4,"esp"));
- &lea ($out,&DWP(-128,$out));
-
- &vmovd ("xmm2",&DWP(16*12-128,"ebp")); # counter value
- &vmovdqu ("xmm3",&QWP(0,"ebx"));
- &vpaddd ("xmm2","xmm2",&QWP(16*6,"eax"));# +four
- &vpand ("xmm3","xmm3",&QWP(16*7,"eax"));
- &vpor ("xmm3","xmm3","xmm2"); # counter value
-{
-my ($a,$b,$c,$d,$t,$t1,$rot16,$rot24)=map("xmm$_",(0..7));
-
-sub XOPROUND {
- &vpaddd ($a,$a,$b);
- &vpxor ($d,$d,$a);
- &vprotd ($d,$d,16);
-
- &vpaddd ($c,$c,$d);
- &vpxor ($b,$b,$c);
- &vprotd ($b,$b,12);
-
- &vpaddd ($a,$a,$b);
- &vpxor ($d,$d,$a);
- &vprotd ($d,$d,8);
-
- &vpaddd ($c,$c,$d);
- &vpxor ($b,$b,$c);
- &vprotd ($b,$b,7);
-}
-
-&set_label("1x");
- &vmovdqa ($a,&QWP(16*2,"eax")); # sigma
- &vmovdqu ($b,&QWP(0,"edx"));
- &vmovdqu ($c,&QWP(16,"edx"));
- #&vmovdqu ($d,&QWP(0,"ebx")); # already loaded
- &vmovdqa ($rot16,&QWP(0,"eax"));
- &vmovdqa ($rot24,&QWP(16,"eax"));
- &mov (&DWP(16*3,"esp"),"ebp");
-
- &vmovdqa (&QWP(16*0,"esp"),$a);
- &vmovdqa (&QWP(16*1,"esp"),$b);
- &vmovdqa (&QWP(16*2,"esp"),$c);
- &vmovdqa (&QWP(16*3,"esp"),$d);
- &mov ("edx",10);
- &jmp (&label("loop1x"));
-
-&set_label("outer1x",16);
- &vmovdqa ($d,&QWP(16*5,"eax")); # one
- &vmovdqa ($a,&QWP(16*0,"esp"));
- &vmovdqa ($b,&QWP(16*1,"esp"));
- &vmovdqa ($c,&QWP(16*2,"esp"));
- &vpaddd ($d,$d,&QWP(16*3,"esp"));
- &mov ("edx",10);
- &vmovdqa (&QWP(16*3,"esp"),$d);
- &jmp (&label("loop1x"));
-
-&set_label("loop1x",16);
- &XOPROUND();
- &vpshufd ($c,$c,0b01001110);
- &vpshufd ($b,$b,0b00111001);
- &vpshufd ($d,$d,0b10010011);
-
- &XOPROUND();
- &vpshufd ($c,$c,0b01001110);
- &vpshufd ($b,$b,0b10010011);
- &vpshufd ($d,$d,0b00111001);
-
- &dec ("edx");
- &jnz (&label("loop1x"));
-
- &vpaddd ($a,$a,&QWP(16*0,"esp"));
- &vpaddd ($b,$b,&QWP(16*1,"esp"));
- &vpaddd ($c,$c,&QWP(16*2,"esp"));
- &vpaddd ($d,$d,&QWP(16*3,"esp"));
-
- &cmp ($len,64);
- &jb (&label("tail"));
-
- &vpxor ($a,$a,&QWP(16*0,$inp)); # xor with input
- &vpxor ($b,$b,&QWP(16*1,$inp));
- &vpxor ($c,$c,&QWP(16*2,$inp));
- &vpxor ($d,$d,&QWP(16*3,$inp));
- &lea ($inp,&DWP(16*4,$inp)); # inp+=64
-
- &vmovdqu (&QWP(16*0,$out),$a); # write output
- &vmovdqu (&QWP(16*1,$out),$b);
- &vmovdqu (&QWP(16*2,$out),$c);
- &vmovdqu (&QWP(16*3,$out),$d);
- &lea ($out,&DWP(16*4,$out)); # inp+=64
-
- &sub ($len,64);
- &jnz (&label("outer1x"));
-
- &jmp (&label("done"));
-
-&set_label("tail");
- &vmovdqa (&QWP(16*0,"esp"),$a);
- &vmovdqa (&QWP(16*1,"esp"),$b);
- &vmovdqa (&QWP(16*2,"esp"),$c);
- &vmovdqa (&QWP(16*3,"esp"),$d);
-
- &xor ("eax","eax");
- &xor ("edx","edx");
- &xor ("ebp","ebp");
-
-&set_label("tail_loop");
- &movb ("al",&BP(0,"esp","ebp"));
- &movb ("dl",&BP(0,$inp,"ebp"));
- &lea ("ebp",&DWP(1,"ebp"));
- &xor ("al","dl");
- &movb (&BP(-1,$out,"ebp"),"al");
- &dec ($len);
- &jnz (&label("tail_loop"));
-}
-&set_label("done");
- &vzeroupper ();
- &mov ("esp",&DWP(512,"esp"));
-&function_end("ChaCha20_xop");
-}
-
&asm_finish();
diff --git a/crypto/chacha/asm/chacha-x86_64.pl b/crypto/chacha/asm/chacha-x86_64.pl
index 107fc70..55b726d 100755
--- a/crypto/chacha/asm/chacha-x86_64.pl
+++ b/crypto/chacha/asm/chacha-x86_64.pl
@@ -36,6 +36,8 @@
# limitations, SSE2 can do better, but gain is considered too
# low to justify the [maintenance] effort;
# (iv) Bulldozer actually executes 4xXOP code path that delivers 2.20;
+#
+# Modified from upstream OpenSSL to remove the XOP code.
$flavour = shift;
$output = shift;
@@ -48,24 +50,7 @@
( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
die "can't locate x86_64-xlate.pl";
-if (`$ENV{CC} -Wa,-v -c -o /dev/null -x assembler /dev/null 2>&1`
- =~ /GNU assembler version ([2-9]\.[0-9]+)/) {
- $avx = ($1>=2.19) + ($1>=2.22);
-}
-
-if (!$avx && $win64 && ($flavour =~ /nasm/ || $ENV{ASM} =~ /nasm/) &&
- `nasm -v 2>&1` =~ /NASM version ([2-9]\.[0-9]+)/) {
- $avx = ($1>=2.09) + ($1>=2.10);
-}
-
-if (!$avx && $win64 && ($flavour =~ /masm/ || $ENV{ASM} =~ /ml64/) &&
- `ml64 2>&1` =~ /Version ([0-9]+)\./) {
- $avx = ($1>=10) + ($1>=11);
-}
-
-if (!$avx && `$ENV{CC} -v 2>&1` =~ /((?:^clang|LLVM) version|.*based on LLVM) ([3-9]\.[0-9]+)/) {
- $avx = ($2>=3.0) + ($2>3.0);
-}
+$avx = 2;
open OUT,"| \"$^X\" $xlate $flavour $output";
*STDOUT=*OUT;
@@ -419,10 +404,6 @@
ChaCha20_ssse3:
.LChaCha20_ssse3:
___
-$code.=<<___ if ($avx);
- test \$`1<<(43-32)`,%r10d
- jnz .LChaCha20_4xop # XOP is fastest even if we use 1/4
-___
$code.=<<___;
cmp \$128,$len # we might throw away some data,
ja .LChaCha20_4x # but overall it won't be slower
@@ -1134,457 +1115,6 @@
}
########################################################################
-# XOP code path that handles all lengths.
-if ($avx) {
-# There is some "anomaly" observed depending on instructions' size or
-# alignment. If you look closely at below code you'll notice that
-# sometimes argument order varies. The order affects instruction
-# encoding by making it larger, and such fiddling gives 5% performance
-# improvement. This is on FX-4100...
-
-my ($xb0,$xb1,$xb2,$xb3, $xd0,$xd1,$xd2,$xd3,
- $xa0,$xa1,$xa2,$xa3, $xt0,$xt1,$xt2,$xt3)=map("%xmm$_",(0..15));
-my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3,
- $xt0,$xt1,$xt2,$xt3, $xd0,$xd1,$xd2,$xd3);
-
-sub XOP_lane_ROUND {
-my ($a0,$b0,$c0,$d0)=@_;
-my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0));
-my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1));
-my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2));
-my @x=map("\"$_\"",@xx);
-
- (
- "&vpaddd (@x[$a0],@x[$a0],@x[$b0])", # Q1
- "&vpaddd (@x[$a1],@x[$a1],@x[$b1])", # Q2
- "&vpaddd (@x[$a2],@x[$a2],@x[$b2])", # Q3
- "&vpaddd (@x[$a3],@x[$a3],@x[$b3])", # Q4
- "&vpxor (@x[$d0],@x[$a0],@x[$d0])",
- "&vpxor (@x[$d1],@x[$a1],@x[$d1])",
- "&vpxor (@x[$d2],@x[$a2],@x[$d2])",
- "&vpxor (@x[$d3],@x[$a3],@x[$d3])",
- "&vprotd (@x[$d0],@x[$d0],16)",
- "&vprotd (@x[$d1],@x[$d1],16)",
- "&vprotd (@x[$d2],@x[$d2],16)",
- "&vprotd (@x[$d3],@x[$d3],16)",
-
- "&vpaddd (@x[$c0],@x[$c0],@x[$d0])",
- "&vpaddd (@x[$c1],@x[$c1],@x[$d1])",
- "&vpaddd (@x[$c2],@x[$c2],@x[$d2])",
- "&vpaddd (@x[$c3],@x[$c3],@x[$d3])",
- "&vpxor (@x[$b0],@x[$c0],@x[$b0])",
- "&vpxor (@x[$b1],@x[$c1],@x[$b1])",
- "&vpxor (@x[$b2],@x[$b2],@x[$c2])", # flip
- "&vpxor (@x[$b3],@x[$b3],@x[$c3])", # flip
- "&vprotd (@x[$b0],@x[$b0],12)",
- "&vprotd (@x[$b1],@x[$b1],12)",
- "&vprotd (@x[$b2],@x[$b2],12)",
- "&vprotd (@x[$b3],@x[$b3],12)",
-
- "&vpaddd (@x[$a0],@x[$b0],@x[$a0])", # flip
- "&vpaddd (@x[$a1],@x[$b1],@x[$a1])", # flip
- "&vpaddd (@x[$a2],@x[$a2],@x[$b2])",
- "&vpaddd (@x[$a3],@x[$a3],@x[$b3])",
- "&vpxor (@x[$d0],@x[$a0],@x[$d0])",
- "&vpxor (@x[$d1],@x[$a1],@x[$d1])",
- "&vpxor (@x[$d2],@x[$a2],@x[$d2])",
- "&vpxor (@x[$d3],@x[$a3],@x[$d3])",
- "&vprotd (@x[$d0],@x[$d0],8)",
- "&vprotd (@x[$d1],@x[$d1],8)",
- "&vprotd (@x[$d2],@x[$d2],8)",
- "&vprotd (@x[$d3],@x[$d3],8)",
-
- "&vpaddd (@x[$c0],@x[$c0],@x[$d0])",
- "&vpaddd (@x[$c1],@x[$c1],@x[$d1])",
- "&vpaddd (@x[$c2],@x[$c2],@x[$d2])",
- "&vpaddd (@x[$c3],@x[$c3],@x[$d3])",
- "&vpxor (@x[$b0],@x[$c0],@x[$b0])",
- "&vpxor (@x[$b1],@x[$c1],@x[$b1])",
- "&vpxor (@x[$b2],@x[$b2],@x[$c2])", # flip
- "&vpxor (@x[$b3],@x[$b3],@x[$c3])", # flip
- "&vprotd (@x[$b0],@x[$b0],7)",
- "&vprotd (@x[$b1],@x[$b1],7)",
- "&vprotd (@x[$b2],@x[$b2],7)",
- "&vprotd (@x[$b3],@x[$b3],7)"
- );
-}
-
-my $xframe = $win64 ? 0xa0 : 0;
-
-$code.=<<___;
-.type ChaCha20_4xop,\@function,5
-.align 32
-ChaCha20_4xop:
-.LChaCha20_4xop:
- lea -0x78(%rsp),%r11
- sub \$0x148+$xframe,%rsp
-___
- ################ stack layout
- # +0x00 SIMD equivalent of @x[8-12]
- # ...
- # +0x40 constant copy of key[0-2] smashed by lanes
- # ...
- # +0x100 SIMD counters (with nonce smashed by lanes)
- # ...
- # +0x140
-$code.=<<___ if ($win64);
- movaps %xmm6,-0x30(%r11)
- movaps %xmm7,-0x20(%r11)
- movaps %xmm8,-0x10(%r11)
- movaps %xmm9,0x00(%r11)
- movaps %xmm10,0x10(%r11)
- movaps %xmm11,0x20(%r11)
- movaps %xmm12,0x30(%r11)
- movaps %xmm13,0x40(%r11)
- movaps %xmm14,0x50(%r11)
- movaps %xmm15,0x60(%r11)
-___
-$code.=<<___;
- vzeroupper
-
- vmovdqa .Lsigma(%rip),$xa3 # key[0]
- vmovdqu ($key),$xb3 # key[1]
- vmovdqu 16($key),$xt3 # key[2]
- vmovdqu ($counter),$xd3 # key[3]
- lea 0x100(%rsp),%rcx # size optimization
-
- vpshufd \$0x00,$xa3,$xa0 # smash key by lanes...
- vpshufd \$0x55,$xa3,$xa1
- vmovdqa $xa0,0x40(%rsp) # ... and offload
- vpshufd \$0xaa,$xa3,$xa2
- vmovdqa $xa1,0x50(%rsp)
- vpshufd \$0xff,$xa3,$xa3
- vmovdqa $xa2,0x60(%rsp)
- vmovdqa $xa3,0x70(%rsp)
-
- vpshufd \$0x00,$xb3,$xb0
- vpshufd \$0x55,$xb3,$xb1
- vmovdqa $xb0,0x80-0x100(%rcx)
- vpshufd \$0xaa,$xb3,$xb2
- vmovdqa $xb1,0x90-0x100(%rcx)
- vpshufd \$0xff,$xb3,$xb3
- vmovdqa $xb2,0xa0-0x100(%rcx)
- vmovdqa $xb3,0xb0-0x100(%rcx)
-
- vpshufd \$0x00,$xt3,$xt0 # "$xc0"
- vpshufd \$0x55,$xt3,$xt1 # "$xc1"
- vmovdqa $xt0,0xc0-0x100(%rcx)
- vpshufd \$0xaa,$xt3,$xt2 # "$xc2"
- vmovdqa $xt1,0xd0-0x100(%rcx)
- vpshufd \$0xff,$xt3,$xt3 # "$xc3"
- vmovdqa $xt2,0xe0-0x100(%rcx)
- vmovdqa $xt3,0xf0-0x100(%rcx)
-
- vpshufd \$0x00,$xd3,$xd0
- vpshufd \$0x55,$xd3,$xd1
- vpaddd .Linc(%rip),$xd0,$xd0 # don't save counters yet
- vpshufd \$0xaa,$xd3,$xd2
- vmovdqa $xd1,0x110-0x100(%rcx)
- vpshufd \$0xff,$xd3,$xd3
- vmovdqa $xd2,0x120-0x100(%rcx)
- vmovdqa $xd3,0x130-0x100(%rcx)
-
- jmp .Loop_enter4xop
-
-.align 32
-.Loop_outer4xop:
- vmovdqa 0x40(%rsp),$xa0 # re-load smashed key
- vmovdqa 0x50(%rsp),$xa1
- vmovdqa 0x60(%rsp),$xa2
- vmovdqa 0x70(%rsp),$xa3
- vmovdqa 0x80-0x100(%rcx),$xb0
- vmovdqa 0x90-0x100(%rcx),$xb1
- vmovdqa 0xa0-0x100(%rcx),$xb2
- vmovdqa 0xb0-0x100(%rcx),$xb3
- vmovdqa 0xc0-0x100(%rcx),$xt0 # "$xc0"
- vmovdqa 0xd0-0x100(%rcx),$xt1 # "$xc1"
- vmovdqa 0xe0-0x100(%rcx),$xt2 # "$xc2"
- vmovdqa 0xf0-0x100(%rcx),$xt3 # "$xc3"
- vmovdqa 0x100-0x100(%rcx),$xd0
- vmovdqa 0x110-0x100(%rcx),$xd1
- vmovdqa 0x120-0x100(%rcx),$xd2
- vmovdqa 0x130-0x100(%rcx),$xd3
- vpaddd .Lfour(%rip),$xd0,$xd0 # next SIMD counters
-
-.Loop_enter4xop:
- mov \$10,%eax
- vmovdqa $xd0,0x100-0x100(%rcx) # save SIMD counters
- jmp .Loop4xop
-
-.align 32
-.Loop4xop:
-___
- foreach (&XOP_lane_ROUND(0, 4, 8,12)) { eval; }
- foreach (&XOP_lane_ROUND(0, 5,10,15)) { eval; }
-$code.=<<___;
- dec %eax
- jnz .Loop4xop
-
- vpaddd 0x40(%rsp),$xa0,$xa0 # accumulate key material
- vpaddd 0x50(%rsp),$xa1,$xa1
- vpaddd 0x60(%rsp),$xa2,$xa2
- vpaddd 0x70(%rsp),$xa3,$xa3
-
- vmovdqa $xt2,0x20(%rsp) # offload $xc2,3
- vmovdqa $xt3,0x30(%rsp)
-
- vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data
- vpunpckldq $xa3,$xa2,$xt3
- vpunpckhdq $xa1,$xa0,$xa0
- vpunpckhdq $xa3,$xa2,$xa2
- vpunpcklqdq $xt3,$xt2,$xa1 # "a0"
- vpunpckhqdq $xt3,$xt2,$xt2 # "a1"
- vpunpcklqdq $xa2,$xa0,$xa3 # "a2"
- vpunpckhqdq $xa2,$xa0,$xa0 # "a3"
-___
- ($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2);
-$code.=<<___;
- vpaddd 0x80-0x100(%rcx),$xb0,$xb0
- vpaddd 0x90-0x100(%rcx),$xb1,$xb1
- vpaddd 0xa0-0x100(%rcx),$xb2,$xb2
- vpaddd 0xb0-0x100(%rcx),$xb3,$xb3
-
- vmovdqa $xa0,0x00(%rsp) # offload $xa0,1
- vmovdqa $xa1,0x10(%rsp)
- vmovdqa 0x20(%rsp),$xa0 # "xc2"
- vmovdqa 0x30(%rsp),$xa1 # "xc3"
-
- vpunpckldq $xb1,$xb0,$xt2
- vpunpckldq $xb3,$xb2,$xt3
- vpunpckhdq $xb1,$xb0,$xb0
- vpunpckhdq $xb3,$xb2,$xb2
- vpunpcklqdq $xt3,$xt2,$xb1 # "b0"
- vpunpckhqdq $xt3,$xt2,$xt2 # "b1"
- vpunpcklqdq $xb2,$xb0,$xb3 # "b2"
- vpunpckhqdq $xb2,$xb0,$xb0 # "b3"
-___
- ($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2);
- my ($xc0,$xc1,$xc2,$xc3)=($xt0,$xt1,$xa0,$xa1);
-$code.=<<___;
- vpaddd 0xc0-0x100(%rcx),$xc0,$xc0
- vpaddd 0xd0-0x100(%rcx),$xc1,$xc1
- vpaddd 0xe0-0x100(%rcx),$xc2,$xc2
- vpaddd 0xf0-0x100(%rcx),$xc3,$xc3
-
- vpunpckldq $xc1,$xc0,$xt2
- vpunpckldq $xc3,$xc2,$xt3
- vpunpckhdq $xc1,$xc0,$xc0
- vpunpckhdq $xc3,$xc2,$xc2
- vpunpcklqdq $xt3,$xt2,$xc1 # "c0"
- vpunpckhqdq $xt3,$xt2,$xt2 # "c1"
- vpunpcklqdq $xc2,$xc0,$xc3 # "c2"
- vpunpckhqdq $xc2,$xc0,$xc0 # "c3"
-___
- ($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2);
-$code.=<<___;
- vpaddd 0x100-0x100(%rcx),$xd0,$xd0
- vpaddd 0x110-0x100(%rcx),$xd1,$xd1
- vpaddd 0x120-0x100(%rcx),$xd2,$xd2
- vpaddd 0x130-0x100(%rcx),$xd3,$xd3
-
- vpunpckldq $xd1,$xd0,$xt2
- vpunpckldq $xd3,$xd2,$xt3
- vpunpckhdq $xd1,$xd0,$xd0
- vpunpckhdq $xd3,$xd2,$xd2
- vpunpcklqdq $xt3,$xt2,$xd1 # "d0"
- vpunpckhqdq $xt3,$xt2,$xt2 # "d1"
- vpunpcklqdq $xd2,$xd0,$xd3 # "d2"
- vpunpckhqdq $xd2,$xd0,$xd0 # "d3"
-___
- ($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2);
- ($xa0,$xa1)=($xt2,$xt3);
-$code.=<<___;
- vmovdqa 0x00(%rsp),$xa0 # restore $xa0,1
- vmovdqa 0x10(%rsp),$xa1
-
- cmp \$64*4,$len
- jb .Ltail4xop
-
- vpxor 0x00($inp),$xa0,$xa0 # xor with input
- vpxor 0x10($inp),$xb0,$xb0
- vpxor 0x20($inp),$xc0,$xc0
- vpxor 0x30($inp),$xd0,$xd0
- vpxor 0x40($inp),$xa1,$xa1
- vpxor 0x50($inp),$xb1,$xb1
- vpxor 0x60($inp),$xc1,$xc1
- vpxor 0x70($inp),$xd1,$xd1
- lea 0x80($inp),$inp # size optimization
- vpxor 0x00($inp),$xa2,$xa2
- vpxor 0x10($inp),$xb2,$xb2
- vpxor 0x20($inp),$xc2,$xc2
- vpxor 0x30($inp),$xd2,$xd2
- vpxor 0x40($inp),$xa3,$xa3
- vpxor 0x50($inp),$xb3,$xb3
- vpxor 0x60($inp),$xc3,$xc3
- vpxor 0x70($inp),$xd3,$xd3
- lea 0x80($inp),$inp # inp+=64*4
-
- vmovdqu $xa0,0x00($out)
- vmovdqu $xb0,0x10($out)
- vmovdqu $xc0,0x20($out)
- vmovdqu $xd0,0x30($out)
- vmovdqu $xa1,0x40($out)
- vmovdqu $xb1,0x50($out)
- vmovdqu $xc1,0x60($out)
- vmovdqu $xd1,0x70($out)
- lea 0x80($out),$out # size optimization
- vmovdqu $xa2,0x00($out)
- vmovdqu $xb2,0x10($out)
- vmovdqu $xc2,0x20($out)
- vmovdqu $xd2,0x30($out)
- vmovdqu $xa3,0x40($out)
- vmovdqu $xb3,0x50($out)
- vmovdqu $xc3,0x60($out)
- vmovdqu $xd3,0x70($out)
- lea 0x80($out),$out # out+=64*4
-
- sub \$64*4,$len
- jnz .Loop_outer4xop
-
- jmp .Ldone4xop
-
-.align 32
-.Ltail4xop:
- cmp \$192,$len
- jae .L192_or_more4xop
- cmp \$128,$len
- jae .L128_or_more4xop
- cmp \$64,$len
- jae .L64_or_more4xop
-
- xor %r10,%r10
- vmovdqa $xa0,0x00(%rsp)
- vmovdqa $xb0,0x10(%rsp)
- vmovdqa $xc0,0x20(%rsp)
- vmovdqa $xd0,0x30(%rsp)
- jmp .Loop_tail4xop
-
-.align 32
-.L64_or_more4xop:
- vpxor 0x00($inp),$xa0,$xa0 # xor with input
- vpxor 0x10($inp),$xb0,$xb0
- vpxor 0x20($inp),$xc0,$xc0
- vpxor 0x30($inp),$xd0,$xd0
- vmovdqu $xa0,0x00($out)
- vmovdqu $xb0,0x10($out)
- vmovdqu $xc0,0x20($out)
- vmovdqu $xd0,0x30($out)
- je .Ldone4xop
-
- lea 0x40($inp),$inp # inp+=64*1
- vmovdqa $xa1,0x00(%rsp)
- xor %r10,%r10
- vmovdqa $xb1,0x10(%rsp)
- lea 0x40($out),$out # out+=64*1
- vmovdqa $xc1,0x20(%rsp)
- sub \$64,$len # len-=64*1
- vmovdqa $xd1,0x30(%rsp)
- jmp .Loop_tail4xop
-
-.align 32
-.L128_or_more4xop:
- vpxor 0x00($inp),$xa0,$xa0 # xor with input
- vpxor 0x10($inp),$xb0,$xb0
- vpxor 0x20($inp),$xc0,$xc0
- vpxor 0x30($inp),$xd0,$xd0
- vpxor 0x40($inp),$xa1,$xa1
- vpxor 0x50($inp),$xb1,$xb1
- vpxor 0x60($inp),$xc1,$xc1
- vpxor 0x70($inp),$xd1,$xd1
-
- vmovdqu $xa0,0x00($out)
- vmovdqu $xb0,0x10($out)
- vmovdqu $xc0,0x20($out)
- vmovdqu $xd0,0x30($out)
- vmovdqu $xa1,0x40($out)
- vmovdqu $xb1,0x50($out)
- vmovdqu $xc1,0x60($out)
- vmovdqu $xd1,0x70($out)
- je .Ldone4xop
-
- lea 0x80($inp),$inp # inp+=64*2
- vmovdqa $xa2,0x00(%rsp)
- xor %r10,%r10
- vmovdqa $xb2,0x10(%rsp)
- lea 0x80($out),$out # out+=64*2
- vmovdqa $xc2,0x20(%rsp)
- sub \$128,$len # len-=64*2
- vmovdqa $xd2,0x30(%rsp)
- jmp .Loop_tail4xop
-
-.align 32
-.L192_or_more4xop:
- vpxor 0x00($inp),$xa0,$xa0 # xor with input
- vpxor 0x10($inp),$xb0,$xb0
- vpxor 0x20($inp),$xc0,$xc0
- vpxor 0x30($inp),$xd0,$xd0
- vpxor 0x40($inp),$xa1,$xa1
- vpxor 0x50($inp),$xb1,$xb1
- vpxor 0x60($inp),$xc1,$xc1
- vpxor 0x70($inp),$xd1,$xd1
- lea 0x80($inp),$inp # size optimization
- vpxor 0x00($inp),$xa2,$xa2
- vpxor 0x10($inp),$xb2,$xb2
- vpxor 0x20($inp),$xc2,$xc2
- vpxor 0x30($inp),$xd2,$xd2
-
- vmovdqu $xa0,0x00($out)
- vmovdqu $xb0,0x10($out)
- vmovdqu $xc0,0x20($out)
- vmovdqu $xd0,0x30($out)
- vmovdqu $xa1,0x40($out)
- vmovdqu $xb1,0x50($out)
- vmovdqu $xc1,0x60($out)
- vmovdqu $xd1,0x70($out)
- lea 0x80($out),$out # size optimization
- vmovdqu $xa2,0x00($out)
- vmovdqu $xb2,0x10($out)
- vmovdqu $xc2,0x20($out)
- vmovdqu $xd2,0x30($out)
- je .Ldone4xop
-
- lea 0x40($inp),$inp # inp+=64*3
- vmovdqa $xa2,0x00(%rsp)
- xor %r10,%r10
- vmovdqa $xb2,0x10(%rsp)
- lea 0x40($out),$out # out+=64*3
- vmovdqa $xc2,0x20(%rsp)
- sub \$192,$len # len-=64*3
- vmovdqa $xd2,0x30(%rsp)
-
-.Loop_tail4xop:
- movzb ($inp,%r10),%eax
- movzb (%rsp,%r10),%ecx
- lea 1(%r10),%r10
- xor %ecx,%eax
- mov %al,-1($out,%r10)
- dec $len
- jnz .Loop_tail4xop
-
-.Ldone4xop:
- vzeroupper
-___
-$code.=<<___ if ($win64);
- lea 0x140+0x30(%rsp),%r11
- movaps -0x30(%r11),%xmm6
- movaps -0x20(%r11),%xmm7
- movaps -0x10(%r11),%xmm8
- movaps 0x00(%r11),%xmm9
- movaps 0x10(%r11),%xmm10
- movaps 0x20(%r11),%xmm11
- movaps 0x30(%r11),%xmm12
- movaps 0x40(%r11),%xmm13
- movaps 0x50(%r11),%xmm14
- movaps 0x60(%r11),%xmm15
-___
-$code.=<<___;
- add \$0x148+$xframe,%rsp
- ret
-.size ChaCha20_4xop,.-ChaCha20_4xop
-___
-}
-
-########################################################################
# AVX2 code path
if ($avx>1) {
my ($xb0,$xb1,$xb2,$xb3, $xd0,$xd1,$xd2,$xd3,
diff --git a/crypto/chacha/chacha_generic.c b/crypto/chacha/chacha.c
similarity index 73%
rename from crypto/chacha/chacha_generic.c
rename to crypto/chacha/chacha.c
index f262033..afe1b2a 100644
--- a/crypto/chacha/chacha_generic.c
+++ b/crypto/chacha/chacha.c
@@ -21,7 +21,49 @@
#include <openssl/cpu.h>
-#if defined(OPENSSL_WINDOWS) || (!defined(OPENSSL_X86_64) && !defined(OPENSSL_X86)) || !defined(__SSE2__)
+#define U8TO32_LITTLE(p) \
+ (((uint32_t)((p)[0])) | ((uint32_t)((p)[1]) << 8) | \
+ ((uint32_t)((p)[2]) << 16) | ((uint32_t)((p)[3]) << 24))
+
+#if !defined(OPENSSL_NO_ASM) && \
+ (defined(OPENSSL_X86) || defined(OPENSSL_X86_64) || \
+ defined(OPENSSL_ARM) || defined(OPENSSL_AARCH64))
+
+/* ChaCha20_ctr32 is defined in asm/chacha-*.pl. */
+void ChaCha20_ctr32(uint8_t *out, const uint8_t *in, size_t in_len,
+ const uint32_t key[8], const uint32_t counter[4]);
+
+void CRYPTO_chacha_20(uint8_t *out, const uint8_t *in, size_t in_len,
+ const uint8_t key[32], const uint8_t nonce[12],
+ uint32_t counter) {
+ uint32_t counter_nonce[4];
+ counter_nonce[0] = counter;
+ counter_nonce[1] = U8TO32_LITTLE(nonce + 0);
+ counter_nonce[2] = U8TO32_LITTLE(nonce + 4);
+ counter_nonce[3] = U8TO32_LITTLE(nonce + 8);
+
+ const uint32_t *key_ptr = (const uint32_t *)key;
+#if !defined(OPENSSL_X86) && !defined(OPENSSL_X86_64)
+ /* The assembly expects the key to be four-byte aligned. */
+ uint32_t key_u32[8];
+ if ((((uintptr_t)key) & 3) != 0) {
+ key_u32[0] = U8TO32_LITTLE(key + 0);
+ key_u32[1] = U8TO32_LITTLE(key + 4);
+ key_u32[2] = U8TO32_LITTLE(key + 8);
+ key_u32[3] = U8TO32_LITTLE(key + 12);
+ key_u32[4] = U8TO32_LITTLE(key + 16);
+ key_u32[5] = U8TO32_LITTLE(key + 20);
+ key_u32[6] = U8TO32_LITTLE(key + 24);
+ key_u32[7] = U8TO32_LITTLE(key + 28);
+
+ key_ptr = key_u32;
+ }
+#endif
+
+ ChaCha20_ctr32(out, in, in_len, key_ptr, counter_nonce);
+}
+
+#else
/* sigma contains the ChaCha constants, which happen to be an ASCII string. */
static const uint8_t sigma[16] = { 'e', 'x', 'p', 'a', 'n', 'd', ' ', '3',
@@ -40,10 +82,6 @@
(p)[3] = (v >> 24) & 0xff; \
}
-#define U8TO32_LITTLE(p) \
- (((uint32_t)((p)[0])) | ((uint32_t)((p)[1]) << 8) | \
- ((uint32_t)((p)[2]) << 16) | ((uint32_t)((p)[3]) << 24))
-
/* QUARTERROUND updates a, b, c, d with a ChaCha "quarter" round. */
#define QUARTERROUND(a,b,c,d) \
x[a] = PLUS(x[a],x[b]); x[d] = ROTATE(XOR(x[d],x[a]),16); \
@@ -51,13 +89,6 @@
x[a] = PLUS(x[a],x[b]); x[d] = ROTATE(XOR(x[d],x[a]), 8); \
x[c] = PLUS(x[c],x[d]); x[b] = ROTATE(XOR(x[b],x[c]), 7);
-#if defined(OPENSSL_ARM) && !defined(OPENSSL_NO_ASM)
-/* Defined in chacha_vec.c */
-void CRYPTO_chacha_20_neon(uint8_t *out, const uint8_t *in, size_t in_len,
- const uint8_t key[32], const uint8_t nonce[12],
- uint32_t counter);
-#endif
-
/* chacha_core performs 20 rounds of ChaCha on the input words in
* |input| and writes the 64 output bytes to |output|. */
static void chacha_core(uint8_t output[64], const uint32_t input[16]) {
@@ -91,13 +122,6 @@
uint8_t buf[64];
size_t todo, i;
-#if defined(OPENSSL_ARM) && !defined(OPENSSL_NO_ASM)
- if (CRYPTO_is_NEON_capable()) {
- CRYPTO_chacha_20_neon(out, in, in_len, key, nonce, counter);
- return;
- }
-#endif
-
input[0] = U8TO32_LITTLE(sigma + 0);
input[1] = U8TO32_LITTLE(sigma + 4);
input[2] = U8TO32_LITTLE(sigma + 8);
@@ -137,4 +161,4 @@
}
}
-#endif /* OPENSSL_WINDOWS || !OPENSSL_X86_64 && !OPENSSL_X86 || !__SSE2__ */
+#endif
diff --git a/crypto/chacha/chacha_vec.c b/crypto/chacha/chacha_vec.c
deleted file mode 100644
index 90d6200..0000000
--- a/crypto/chacha/chacha_vec.c
+++ /dev/null
@@ -1,328 +0,0 @@
-/* Copyright (c) 2014, Google Inc.
- *
- * Permission to use, copy, modify, and/or distribute this software for any
- * purpose with or without fee is hereby granted, provided that the above
- * copyright notice and this permission notice appear in all copies.
- *
- * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
- * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
- * SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
- * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
- * OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
- * CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. */
-
-/* ====================================================================
- *
- * When updating this file, also update chacha_vec_arm.S
- *
- * ==================================================================== */
-
-
-/* This implementation is by Ted Krovetz and was submitted to SUPERCOP and
- * marked as public domain. It was been altered to allow for non-aligned inputs
- * and to allow the block counter to be passed in specifically. */
-
-#include <openssl/chacha.h>
-
-#include "../internal.h"
-
-
-#if defined(ASM_GEN) || \
- !defined(OPENSSL_WINDOWS) && \
- (defined(OPENSSL_X86_64) || defined(OPENSSL_X86)) && defined(__SSE2__)
-
-#define CHACHA_RNDS 20 /* 8 (high speed), 20 (conservative), 12 (middle) */
-
-/* Architecture-neutral way to specify 16-byte vector of ints */
-typedef unsigned vec __attribute__((vector_size(16)));
-
-/* This implementation is designed for Neon, SSE and AltiVec machines. The
- * following specify how to do certain vector operations efficiently on
- * each architecture, using intrinsics.
- * This implementation supports parallel processing of multiple blocks,
- * including potentially using general-purpose registers. */
-#if __ARM_NEON__
-#include <string.h>
-#include <arm_neon.h>
-#define GPR_TOO 1
-#define VBPI 2
-#define ONE (vec) vsetq_lane_u32(1, vdupq_n_u32(0), 0)
-#define LOAD_ALIGNED(m) (vec)(*((vec *)(m)))
-#define LOAD(m) ({ \
- memcpy(alignment_buffer, m, 16); \
- LOAD_ALIGNED(alignment_buffer); \
- })
-#define STORE(m, r) ({ \
- (*((vec *)(alignment_buffer))) = (r); \
- memcpy(m, alignment_buffer, 16); \
- })
-#define ROTV1(x) (vec) vextq_u32((uint32x4_t)x, (uint32x4_t)x, 1)
-#define ROTV2(x) (vec) vextq_u32((uint32x4_t)x, (uint32x4_t)x, 2)
-#define ROTV3(x) (vec) vextq_u32((uint32x4_t)x, (uint32x4_t)x, 3)
-#define ROTW16(x) (vec) vrev32q_u16((uint16x8_t)x)
-#if __clang__
-#define ROTW7(x) (x << ((vec) {7, 7, 7, 7})) ^ (x >> ((vec) {25, 25, 25, 25}))
-#define ROTW8(x) (x << ((vec) {8, 8, 8, 8})) ^ (x >> ((vec) {24, 24, 24, 24}))
-#define ROTW12(x) \
- (x << ((vec) {12, 12, 12, 12})) ^ (x >> ((vec) {20, 20, 20, 20}))
-#else
-#define ROTW7(x) \
- (vec) vsriq_n_u32(vshlq_n_u32((uint32x4_t)x, 7), (uint32x4_t)x, 25)
-#define ROTW8(x) \
- (vec) vsriq_n_u32(vshlq_n_u32((uint32x4_t)x, 8), (uint32x4_t)x, 24)
-#define ROTW12(x) \
- (vec) vsriq_n_u32(vshlq_n_u32((uint32x4_t)x, 12), (uint32x4_t)x, 20)
-#endif
-#elif __SSE2__
-#include <emmintrin.h>
-#define GPR_TOO 0
-#if __clang__
-#define VBPI 4
-#else
-#define VBPI 3
-#endif
-#define ONE (vec) _mm_set_epi32(0, 0, 0, 1)
-#define LOAD(m) (vec) _mm_loadu_si128((const __m128i *)(m))
-#define LOAD_ALIGNED(m) (vec) _mm_load_si128((const __m128i *)(m))
-#define STORE(m, r) _mm_storeu_si128((__m128i *)(m), (__m128i)(r))
-#define ROTV1(x) (vec) _mm_shuffle_epi32((__m128i)x, _MM_SHUFFLE(0, 3, 2, 1))
-#define ROTV2(x) (vec) _mm_shuffle_epi32((__m128i)x, _MM_SHUFFLE(1, 0, 3, 2))
-#define ROTV3(x) (vec) _mm_shuffle_epi32((__m128i)x, _MM_SHUFFLE(2, 1, 0, 3))
-#define ROTW7(x) \
- (vec)(_mm_slli_epi32((__m128i)x, 7) ^ _mm_srli_epi32((__m128i)x, 25))
-#define ROTW12(x) \
- (vec)(_mm_slli_epi32((__m128i)x, 12) ^ _mm_srli_epi32((__m128i)x, 20))
-#if __SSSE3__
-#include <tmmintrin.h>
-#define ROTW8(x) \
- (vec) _mm_shuffle_epi8((__m128i)x, _mm_set_epi8(14, 13, 12, 15, 10, 9, 8, \
- 11, 6, 5, 4, 7, 2, 1, 0, 3))
-#define ROTW16(x) \
- (vec) _mm_shuffle_epi8((__m128i)x, _mm_set_epi8(13, 12, 15, 14, 9, 8, 11, \
- 10, 5, 4, 7, 6, 1, 0, 3, 2))
-#else
-#define ROTW8(x) \
- (vec)(_mm_slli_epi32((__m128i)x, 8) ^ _mm_srli_epi32((__m128i)x, 24))
-#define ROTW16(x) \
- (vec)(_mm_slli_epi32((__m128i)x, 16) ^ _mm_srli_epi32((__m128i)x, 16))
-#endif
-#else
-#error-- Implementation supports only machines with neon or SSE2
-#endif
-
-#ifndef REVV_BE
-#define REVV_BE(x) (x)
-#endif
-
-#ifndef REVW_BE
-#define REVW_BE(x) (x)
-#endif
-
-#define BPI (VBPI + GPR_TOO) /* Blocks computed per loop iteration */
-
-#define DQROUND_VECTORS(a,b,c,d) \
- a += b; d ^= a; d = ROTW16(d); \
- c += d; b ^= c; b = ROTW12(b); \
- a += b; d ^= a; d = ROTW8(d); \
- c += d; b ^= c; b = ROTW7(b); \
- b = ROTV1(b); c = ROTV2(c); d = ROTV3(d); \
- a += b; d ^= a; d = ROTW16(d); \
- c += d; b ^= c; b = ROTW12(b); \
- a += b; d ^= a; d = ROTW8(d); \
- c += d; b ^= c; b = ROTW7(b); \
- b = ROTV3(b); c = ROTV2(c); d = ROTV1(d);
-
-#define QROUND_WORDS(a,b,c,d) \
- a = a+b; d ^= a; d = d<<16 | d>>16; \
- c = c+d; b ^= c; b = b<<12 | b>>20; \
- a = a+b; d ^= a; d = d<< 8 | d>>24; \
- c = c+d; b ^= c; b = b<< 7 | b>>25;
-
-#define WRITE_XOR(in, op, d, v0, v1, v2, v3) \
- STORE(op + d + 0, LOAD(in + d + 0) ^ REVV_BE(v0)); \
- STORE(op + d + 4, LOAD(in + d + 4) ^ REVV_BE(v1)); \
- STORE(op + d + 8, LOAD(in + d + 8) ^ REVV_BE(v2)); \
- STORE(op + d +12, LOAD(in + d +12) ^ REVV_BE(v3));
-
-#if __ARM_NEON__
-/* For ARM, we can't depend on NEON support, so this function is compiled with
- * a different name, along with the generic code, and can be enabled at
- * run-time. */
-void CRYPTO_chacha_20_neon(
-#else
-void CRYPTO_chacha_20(
-#endif
- uint8_t *out,
- const uint8_t *in,
- size_t inlen,
- const uint8_t key[32],
- const uint8_t nonce[12],
- uint32_t counter)
- {
- unsigned iters, i;
- unsigned *op = (unsigned *)out;
- const unsigned *ip = (const unsigned *)in;
- const unsigned *kp = (const unsigned *)key;
-#if defined(__ARM_NEON__)
- uint32_t np[3];
- alignas(16) uint8_t alignment_buffer[16];
-#endif
- vec s0, s1, s2, s3;
- alignas(16) unsigned chacha_const[] =
- {0x61707865,0x3320646E,0x79622D32,0x6B206574};
-#if defined(__ARM_NEON__)
- memcpy(np, nonce, 12);
-#endif
- s0 = LOAD_ALIGNED(chacha_const);
- s1 = LOAD(&((const vec*)kp)[0]);
- s2 = LOAD(&((const vec*)kp)[1]);
- s3 = (vec){
- counter,
- ((const uint32_t*)nonce)[0],
- ((const uint32_t*)nonce)[1],
- ((const uint32_t*)nonce)[2]
- };
-
- for (iters = 0; iters < inlen/(BPI*64); iters++)
- {
-#if GPR_TOO
- register unsigned x0, x1, x2, x3, x4, x5, x6, x7, x8,
- x9, x10, x11, x12, x13, x14, x15;
-#endif
-#if VBPI > 2
- vec v8,v9,v10,v11;
-#endif
-#if VBPI > 3
- vec v12,v13,v14,v15;
-#endif
-
- vec v0,v1,v2,v3,v4,v5,v6,v7;
- v4 = v0 = s0; v5 = v1 = s1; v6 = v2 = s2; v3 = s3;
- v7 = v3 + ONE;
-#if VBPI > 2
- v8 = v4; v9 = v5; v10 = v6;
- v11 = v7 + ONE;
-#endif
-#if VBPI > 3
- v12 = v8; v13 = v9; v14 = v10;
- v15 = v11 + ONE;
-#endif
-#if GPR_TOO
- x0 = chacha_const[0]; x1 = chacha_const[1];
- x2 = chacha_const[2]; x3 = chacha_const[3];
- x4 = kp[0]; x5 = kp[1]; x6 = kp[2]; x7 = kp[3];
- x8 = kp[4]; x9 = kp[5]; x10 = kp[6]; x11 = kp[7];
- x12 = counter+BPI*iters+(BPI-1); x13 = np[0];
- x14 = np[1]; x15 = np[2];
-#endif
- for (i = CHACHA_RNDS/2; i; i--)
- {
- DQROUND_VECTORS(v0,v1,v2,v3)
- DQROUND_VECTORS(v4,v5,v6,v7)
-#if VBPI > 2
- DQROUND_VECTORS(v8,v9,v10,v11)
-#endif
-#if VBPI > 3
- DQROUND_VECTORS(v12,v13,v14,v15)
-#endif
-#if GPR_TOO
- QROUND_WORDS( x0, x4, x8,x12)
- QROUND_WORDS( x1, x5, x9,x13)
- QROUND_WORDS( x2, x6,x10,x14)
- QROUND_WORDS( x3, x7,x11,x15)
- QROUND_WORDS( x0, x5,x10,x15)
- QROUND_WORDS( x1, x6,x11,x12)
- QROUND_WORDS( x2, x7, x8,x13)
- QROUND_WORDS( x3, x4, x9,x14)
-#endif
- }
-
- WRITE_XOR(ip, op, 0, v0+s0, v1+s1, v2+s2, v3+s3)
- s3 += ONE;
- WRITE_XOR(ip, op, 16, v4+s0, v5+s1, v6+s2, v7+s3)
- s3 += ONE;
-#if VBPI > 2
- WRITE_XOR(ip, op, 32, v8+s0, v9+s1, v10+s2, v11+s3)
- s3 += ONE;
-#endif
-#if VBPI > 3
- WRITE_XOR(ip, op, 48, v12+s0, v13+s1, v14+s2, v15+s3)
- s3 += ONE;
-#endif
- ip += VBPI*16;
- op += VBPI*16;
-#if GPR_TOO
- op[0] = REVW_BE(REVW_BE(ip[0]) ^ (x0 + chacha_const[0]));
- op[1] = REVW_BE(REVW_BE(ip[1]) ^ (x1 + chacha_const[1]));
- op[2] = REVW_BE(REVW_BE(ip[2]) ^ (x2 + chacha_const[2]));
- op[3] = REVW_BE(REVW_BE(ip[3]) ^ (x3 + chacha_const[3]));
- op[4] = REVW_BE(REVW_BE(ip[4]) ^ (x4 + kp[0]));
- op[5] = REVW_BE(REVW_BE(ip[5]) ^ (x5 + kp[1]));
- op[6] = REVW_BE(REVW_BE(ip[6]) ^ (x6 + kp[2]));
- op[7] = REVW_BE(REVW_BE(ip[7]) ^ (x7 + kp[3]));
- op[8] = REVW_BE(REVW_BE(ip[8]) ^ (x8 + kp[4]));
- op[9] = REVW_BE(REVW_BE(ip[9]) ^ (x9 + kp[5]));
- op[10] = REVW_BE(REVW_BE(ip[10]) ^ (x10 + kp[6]));
- op[11] = REVW_BE(REVW_BE(ip[11]) ^ (x11 + kp[7]));
- op[12] = REVW_BE(REVW_BE(ip[12]) ^ (x12 + counter+BPI*iters+(BPI-1)));
- op[13] = REVW_BE(REVW_BE(ip[13]) ^ (x13 + np[0]));
- op[14] = REVW_BE(REVW_BE(ip[14]) ^ (x14 + np[1]));
- op[15] = REVW_BE(REVW_BE(ip[15]) ^ (x15 + np[2]));
- s3 += ONE;
- ip += 16;
- op += 16;
-#endif
- }
-
- for (iters = inlen%(BPI*64)/64; iters != 0; iters--)
- {
- vec v0 = s0, v1 = s1, v2 = s2, v3 = s3;
- for (i = CHACHA_RNDS/2; i; i--)
- {
- DQROUND_VECTORS(v0,v1,v2,v3);
- }
- WRITE_XOR(ip, op, 0, v0+s0, v1+s1, v2+s2, v3+s3)
- s3 += ONE;
- ip += 16;
- op += 16;
- }
-
- inlen = inlen % 64;
- if (inlen)
- {
- alignas(16) vec buf[4];
- vec v0,v1,v2,v3;
- v0 = s0; v1 = s1; v2 = s2; v3 = s3;
- for (i = CHACHA_RNDS/2; i; i--)
- {
- DQROUND_VECTORS(v0,v1,v2,v3);
- }
-
- if (inlen >= 16)
- {
- STORE(op + 0, LOAD(ip + 0) ^ REVV_BE(v0 + s0));
- if (inlen >= 32)
- {
- STORE(op + 4, LOAD(ip + 4) ^ REVV_BE(v1 + s1));
- if (inlen >= 48)
- {
- STORE(op + 8, LOAD(ip + 8) ^
- REVV_BE(v2 + s2));
- buf[3] = REVV_BE(v3 + s3);
- }
- else
- buf[2] = REVV_BE(v2 + s2);
- }
- else
- buf[1] = REVV_BE(v1 + s1);
- }
- else
- buf[0] = REVV_BE(v0 + s0);
-
- for (i=inlen & ~15; i<inlen; i++)
- ((char *)op)[i] = ((const char *)ip)[i] ^ ((const char *)buf)[i];
- }
- }
-
-#endif /* ASM_GEN || !OPENSSL_WINDOWS && (OPENSSL_X86_64 || OPENSSL_X86) && SSE2 */
diff --git a/crypto/chacha/chacha_vec_arm.S b/crypto/chacha/chacha_vec_arm.S
deleted file mode 100644
index f18c867..0000000
--- a/crypto/chacha/chacha_vec_arm.S
+++ /dev/null
@@ -1,1447 +0,0 @@
-# Copyright (c) 2014, Google Inc.
-#
-# Permission to use, copy, modify, and/or distribute this software for any
-# purpose with or without fee is hereby granted, provided that the above
-# copyright notice and this permission notice appear in all copies.
-#
-# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
-# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
-# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
-# SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
-# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
-# OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
-# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
-
-# This file contains a pre-compiled version of chacha_vec.c for ARM. This is
-# needed to support switching on NEON code at runtime. If the whole of OpenSSL
-# were to be compiled with the needed flags to build chacha_vec.c, then it
-# wouldn't be possible to run on non-NEON systems.
-#
-# This file was generated by chacha_vec_arm_generate.go using the following
-# compiler command:
-#
-# /opt/gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf/bin/arm-linux-gnueabihf-gcc -O3 -mcpu=cortex-a8 -mfpu=neon -fpic -DASM_GEN -I ../../include -S chacha_vec.c -o -
-
-#if !defined(OPENSSL_NO_ASM)
-#if defined(__arm__)
-
- .syntax unified
- .cpu cortex-a8
- .eabi_attribute 27, 3
-
-# EABI attribute 28 sets whether VFP register arguments were used to build this
-# file. If object files are inconsistent on this point, the linker will refuse
-# to link them. Thus we report whatever the compiler expects since we don't use
-# VFP arguments.
-
-#if defined(__ARM_PCS_VFP)
- .eabi_attribute 28, 1
-#else
- .eabi_attribute 28, 0
-#endif
-
- .fpu neon
- .eabi_attribute 20, 1
- .eabi_attribute 21, 1
- .eabi_attribute 23, 3
- .eabi_attribute 24, 1
- .eabi_attribute 25, 1
- .eabi_attribute 26, 2
- .eabi_attribute 30, 2
- .eabi_attribute 34, 1
- .eabi_attribute 18, 4
- .thumb
- .file "chacha_vec.c"
- .text
- .align 2
- .global CRYPTO_chacha_20_neon
- .hidden CRYPTO_chacha_20_neon
- .thumb
- .thumb_func
- .type CRYPTO_chacha_20_neon, %function
-CRYPTO_chacha_20_neon:
- @ args = 8, pretend = 0, frame = 160
- @ frame_needed = 1, uses_anonymous_args = 0
- push {r4, r5, r6, r7, r8, r9, r10, fp, lr}
- mov r9, r3
- vpush.64 {d8, d9, d10, d11, d12, d13, d14, d15}
- mov r10, r2
- ldr r4, .L91+16
- mov fp, r1
- mov r8, r9
-.LPIC16:
- add r4, pc
- sub sp, sp, #164
- add r7, sp, #0
- sub sp, sp, #112
- add lr, r7, #148
- str r0, [r7, #80]
- str r1, [r7, #12]
- str r2, [r7, #8]
- ldmia r4, {r0, r1, r2, r3}
- add r4, sp, #15
- bic r4, r4, #15
- ldr r6, [r7, #264]
- str r4, [r7, #88]
- mov r5, r4
- adds r4, r4, #64
- add ip, r5, #80
- str r9, [r7, #56]
- stmia r4, {r0, r1, r2, r3}
- movw r4, #43691
- ldr r0, [r6] @ unaligned
- movt r4, 43690
- ldr r1, [r6, #4] @ unaligned
- ldr r2, [r6, #8] @ unaligned
- ldr r3, [r9, #12] @ unaligned
- str ip, [r7, #84]
- stmia lr!, {r0, r1, r2}
- mov lr, ip
- ldr r1, [r9, #4] @ unaligned
- ldr r2, [r9, #8] @ unaligned
- ldr r0, [r9] @ unaligned
- vldr d24, [r5, #64]
- vldr d25, [r5, #72]
- umull r4, r5, r10, r4
- stmia ip!, {r0, r1, r2, r3}
- ldr r0, [r8, #16]! @ unaligned
- ldr r2, [r7, #88]
- ldr r4, [r7, #268]
- ldr r1, [r8, #4] @ unaligned
- vldr d26, [r2, #80]
- vldr d27, [r2, #88]
- ldr r3, [r8, #12] @ unaligned
- ldr r2, [r8, #8] @ unaligned
- stmia lr!, {r0, r1, r2, r3}
- ldr r3, [r6]
- ldr r1, [r6, #4]
- ldr r6, [r6, #8]
- str r3, [r7, #68]
- str r3, [r7, #132]
- lsrs r3, r5, #7
- str r6, [r7, #140]
- str r6, [r7, #60]
- ldr r6, [r7, #88]
- str r4, [r7, #128]
- str r1, [r7, #136]
- str r1, [r7, #64]
- vldr d28, [r6, #80]
- vldr d29, [r6, #88]
- vldr d22, [r7, #128]
- vldr d23, [r7, #136]
- beq .L26
- mov r5, r6
- lsls r2, r3, #8
- sub r3, r2, r3, lsl #6
- ldr r2, [r5, #68]
- ldr r6, [r6, #64]
- vldr d0, .L91
- vldr d1, .L91+8
- str r2, [r7, #48]
- ldr r2, [r5, #72]
- str r3, [r7, #4]
- str r6, [r7, #52]
- str r2, [r7, #44]
- adds r2, r4, #2
- str r2, [r7, #72]
- ldr r2, [r5, #76]
- str fp, [r7, #76]
- str r2, [r7, #40]
- ldr r2, [r7, #80]
- adds r3, r2, r3
- str r3, [r7, #16]
-.L4:
- ldr r5, [r7, #56]
- add r8, r7, #40
- ldr r4, [r7, #68]
- vadd.i32 q3, q11, q0
- ldmia r8, {r8, r9, r10, fp}
- mov r1, r5
- ldr r2, [r5, #4]
- vmov q8, q14 @ v4si
- ldr r3, [r5]
- vmov q1, q13 @ v4si
- ldr r6, [r1, #28]
- vmov q9, q12 @ v4si
- mov r0, r2
- ldr r2, [r5, #8]
- str r4, [r7, #112]
- movs r1, #10
- ldr r4, [r7, #72]
- vmov q2, q11 @ v4si
- ldr lr, [r5, #20]
- vmov q15, q14 @ v4si
- str r3, [r7, #108]
- vmov q5, q13 @ v4si
- str r2, [r7, #116]
- vmov q10, q12 @ v4si
- ldr r2, [r5, #12]
- ldr ip, [r5, #16]
- ldr r3, [r7, #64]
- ldr r5, [r5, #24]
- str r6, [r7, #120]
- str r1, [r7, #92]
- ldr r6, [r7, #60]
- str r4, [r7, #100]
- ldr r1, [r7, #116]
- ldr r4, [r7, #108]
- str r8, [r7, #96]
- mov r8, r10
- str lr, [r7, #104]
- mov r10, r9
- mov lr, r3
- mov r9, r5
- str r6, [r7, #124]
- b .L92
-.L93:
- .align 3
-.L91:
- .word 1
- .word 0
- .word 0
- .word 0
- .word .LANCHOR0-(.LPIC16+4)
-.L92:
-.L3:
- vadd.i32 q9, q9, q1
- add r3, r8, r0
- vadd.i32 q10, q10, q5
- add r5, fp, r4
- veor q3, q3, q9
- mov r6, r3
- veor q2, q2, q10
- ldr r3, [r7, #96]
- str r5, [r7, #116]
- add r10, r10, r1
- vrev32.16 q3, q3
- str r6, [r7, #108]
- vadd.i32 q8, q8, q3
- vrev32.16 q2, q2
- vadd.i32 q15, q15, q2
- mov fp, r3
- ldr r3, [r7, #100]
- veor q4, q8, q1
- veor q6, q15, q5
- add fp, fp, r2
- eors r3, r3, r5
- mov r5, r6
- ldr r6, [r7, #112]
- vshl.i32 q1, q4, #12
- vshl.i32 q5, q6, #12
- ror r3, r3, #16
- eors r6, r6, r5
- eor lr, lr, r10
- vsri.32 q1, q4, #20
- mov r5, r6
- ldr r6, [r7, #124]
- vsri.32 q5, q6, #20
- str r3, [r7, #124]
- eor r6, r6, fp
- ror r5, r5, #16
- vadd.i32 q9, q9, q1
- ror lr, lr, #16
- ror r3, r6, #16
- ldr r6, [r7, #124]
- vadd.i32 q10, q10, q5
- add r9, r9, lr
- veor q4, q9, q3
- add ip, ip, r6
- ldr r6, [r7, #104]
- veor q6, q10, q2
- eor r4, ip, r4
- str r3, [r7, #104]
- vshl.i32 q3, q4, #8
- eor r1, r9, r1
- mov r8, r6
- ldr r6, [r7, #120]
- vshl.i32 q2, q6, #8
- ror r4, r4, #20
- add r6, r6, r3
- vsri.32 q3, q4, #24
- str r6, [r7, #100]
- eors r2, r2, r6
- ldr r6, [r7, #116]
- vsri.32 q2, q6, #24
- add r8, r8, r5
- ror r2, r2, #20
- adds r6, r4, r6
- vadd.i32 q4, q8, q3
- eor r0, r8, r0
- vadd.i32 q15, q15, q2
- mov r3, r6
- ldr r6, [r7, #108]
- veor q6, q4, q1
- ror r0, r0, #20
- str r3, [r7, #112]
- veor q5, q15, q5
- adds r6, r0, r6
- str r6, [r7, #120]
- mov r6, r3
- ldr r3, [r7, #124]
- vshl.i32 q8, q6, #7
- add fp, fp, r2
- eors r3, r3, r6
- ldr r6, [r7, #120]
- vshl.i32 q1, q5, #7
- ror r1, r1, #20
- eors r5, r5, r6
- vsri.32 q8, q6, #25
- ldr r6, [r7, #104]
- ror r3, r3, #24
- ror r5, r5, #24
- vsri.32 q1, q5, #25
- str r5, [r7, #116]
- eor r6, fp, r6
- ldr r5, [r7, #116]
- add r10, r10, r1
- add ip, r3, ip
- vext.32 q8, q8, q8, #1
- str ip, [r7, #124]
- add ip, r5, r8
- ldr r5, [r7, #100]
- eor lr, r10, lr
- ror r6, r6, #24
- vext.32 q1, q1, q1, #1
- add r8, r6, r5
- vadd.i32 q9, q9, q8
- ldr r5, [r7, #124]
- vext.32 q3, q3, q3, #3
- vadd.i32 q10, q10, q1
- ror lr, lr, #24
- eor r0, ip, r0
- vext.32 q2, q2, q2, #3
- add r9, r9, lr
- eors r4, r4, r5
- veor q3, q9, q3
- ldr r5, [r7, #112]
- eor r1, r9, r1
- ror r0, r0, #25
- veor q2, q10, q2
- adds r5, r0, r5
- vext.32 q4, q4, q4, #2
- str r5, [r7, #112]
- ldr r5, [r7, #120]
- ror r1, r1, #25
- vrev32.16 q3, q3
- eor r2, r8, r2
- vext.32 q15, q15, q15, #2
- adds r5, r1, r5
- vadd.i32 q4, q4, q3
- ror r4, r4, #25
- vrev32.16 q2, q2
- str r5, [r7, #100]
- vadd.i32 q15, q15, q2
- eors r3, r3, r5
- ldr r5, [r7, #112]
- add fp, fp, r4
- veor q8, q4, q8
- ror r2, r2, #25
- veor q1, q15, q1
- eor lr, fp, lr
- eors r6, r6, r5
- ror r3, r3, #16
- ldr r5, [r7, #116]
- add r10, r10, r2
- str r3, [r7, #120]
- ror lr, lr, #16
- ldr r3, [r7, #120]
- eor r5, r10, r5
- vshl.i32 q5, q8, #12
- add ip, lr, ip
- vshl.i32 q6, q1, #12
- str ip, [r7, #104]
- add ip, r3, r8
- str ip, [r7, #116]
- ldr r3, [r7, #124]
- ror r5, r5, #16
- vsri.32 q5, q8, #20
- ror r6, r6, #16
- add ip, r5, r3
- ldr r3, [r7, #104]
- vsri.32 q6, q1, #20
- add r9, r9, r6
- eor r2, ip, r2
- eors r4, r4, r3
- ldr r3, [r7, #116]
- eor r0, r9, r0
- vadd.i32 q9, q9, q5
- ror r4, r4, #20
- eors r1, r1, r3
- vadd.i32 q10, q10, q6
- ror r3, r2, #20
- str r3, [r7, #108]
- ldr r3, [r7, #112]
- veor q3, q9, q3
- ror r0, r0, #20
- add r8, r4, fp
- veor q2, q10, q2
- add fp, r0, r3
- ldr r3, [r7, #100]
- ror r1, r1, #20
- mov r2, r8
- vshl.i32 q8, q3, #8
- str r8, [r7, #96]
- add r8, r1, r3
- ldr r3, [r7, #108]
- vmov q1, q6 @ v4si
- vshl.i32 q6, q2, #8
- eor r6, fp, r6
- add r10, r10, r3
- ldr r3, [r7, #120]
- vsri.32 q8, q3, #24
- eor lr, r2, lr
- eor r3, r8, r3
- ror r2, r6, #24
- vsri.32 q6, q2, #24
- eor r5, r10, r5
- str r2, [r7, #124]
- ror r2, r3, #24
- ldr r3, [r7, #104]
- vmov q3, q8 @ v4si
- vadd.i32 q15, q15, q6
- ror lr, lr, #24
- vadd.i32 q8, q4, q8
- ror r6, r5, #24
- add r5, lr, r3
- ldr r3, [r7, #124]
- veor q4, q8, q5
- add ip, ip, r6
- vmov q2, q6 @ v4si
- add r9, r9, r3
- veor q6, q15, q1
- ldr r3, [r7, #116]
- vshl.i32 q1, q4, #7
- str r2, [r7, #100]
- add r3, r3, r2
- str r3, [r7, #120]
- vshl.i32 q5, q6, #7
- eors r1, r1, r3
- ldr r3, [r7, #108]
- vsri.32 q1, q4, #25
- eors r4, r4, r5
- eor r0, r9, r0
- eor r2, ip, r3
- vsri.32 q5, q6, #25
- ldr r3, [r7, #92]
- ror r4, r4, #25
- str r6, [r7, #112]
- ror r0, r0, #25
- subs r3, r3, #1
- str r5, [r7, #104]
- ror r1, r1, #25
- ror r2, r2, #25
- vext.32 q15, q15, q15, #2
- str r3, [r7, #92]
- vext.32 q2, q2, q2, #1
- vext.32 q8, q8, q8, #2
- vext.32 q3, q3, q3, #1
- vext.32 q5, q5, q5, #3
- vext.32 q1, q1, q1, #3
- bne .L3
- ldr r3, [r7, #84]
- vadd.i32 q4, q12, q10
- str r9, [r7, #92]
- mov r9, r10
- mov r10, r8
- ldr r8, [r7, #96]
- str lr, [r7, #96]
- mov lr, r5
- ldr r5, [r7, #52]
- vadd.i32 q5, q13, q5
- ldr r6, [r7, #76]
- vadd.i32 q15, q14, q15
- add fp, fp, r5
- ldr r5, [r7, #48]
- str r3, [r7, #104]
- vadd.i32 q7, q14, q8
- ldr r3, [r6, #12] @ unaligned
- add r10, r10, r5
- str r0, [r7, #36]
- vadd.i32 q2, q11, q2
- ldr r0, [r6] @ unaligned
- vadd.i32 q6, q12, q9
- ldr r5, [r7, #104]
- vadd.i32 q1, q13, q1
- str r1, [r7, #116]
- vadd.i32 q11, q11, q0
- ldr r1, [r6, #4] @ unaligned
- str r2, [r7, #32]
- vadd.i32 q3, q11, q3
- ldr r2, [r6, #8] @ unaligned
- vadd.i32 q11, q11, q0
- str r4, [r7, #108]
- ldr r4, [r7, #100]
- vadd.i32 q11, q11, q0
- stmia r5!, {r0, r1, r2, r3}
- ldr r2, [r7, #88]
- ldr r3, [r7, #44]
- ldr r5, [r7, #84]
- vldr d20, [r2, #80]
- vldr d21, [r2, #88]
- add r3, r9, r3
- str r3, [r7, #104]
- veor q10, q10, q4
- ldr r3, [r7, #40]
- add r3, r8, r3
- str r3, [r7, #100]
- ldr r3, [r7, #72]
- vstr d20, [r2, #80]
- vstr d21, [r2, #88]
- adds r1, r4, r3
- str r1, [r7, #28]
- ldmia r5!, {r0, r1, r2, r3}
- ldr r4, [r7, #68]
- ldr r5, [r7, #112]
- ldr r8, [r7, #84]
- add r5, r5, r4
- ldr r4, [r7, #96]
- str r5, [r7, #24]
- ldr r5, [r7, #64]
- add r4, r4, r5
- ldr r5, [r7, #60]
- str r4, [r7, #96]
- ldr r4, [r7, #124]
- add r4, r4, r5
- str r4, [r7, #20]
- ldr r4, [r7, #80]
- mov r5, r8
- str r0, [r4] @ unaligned
- mov r0, r4
- str r1, [r4, #4] @ unaligned
- mov r4, r8
- str r2, [r0, #8] @ unaligned
- mov r8, r0
- str r3, [r0, #12] @ unaligned
- mov r9, r4
- ldr r0, [r6, #16]! @ unaligned
- ldr r3, [r6, #12] @ unaligned
- ldr r1, [r6, #4] @ unaligned
- ldr r2, [r6, #8] @ unaligned
- ldr r6, [r7, #76]
- stmia r5!, {r0, r1, r2, r3}
- mov r5, r8
- ldr r3, [r7, #88]
- vldr d20, [r3, #80]
- vldr d21, [r3, #88]
- veor q10, q10, q5
- vstr d20, [r3, #80]
- vstr d21, [r3, #88]
- ldmia r4!, {r0, r1, r2, r3}
- mov r4, r9
- str r0, [r8, #16] @ unaligned
- str r1, [r8, #20] @ unaligned
- str r2, [r8, #24] @ unaligned
- str r3, [r8, #28] @ unaligned
- mov r8, r5
- ldr r0, [r6, #32]! @ unaligned
- mov r5, r9
- ldr r1, [r6, #4] @ unaligned
- ldr r2, [r6, #8] @ unaligned
- ldr r3, [r6, #12] @ unaligned
- ldr r6, [r7, #76]
- stmia r5!, {r0, r1, r2, r3}
- mov r5, r8
- ldr r1, [r7, #88]
- vldr d16, [r1, #80]
- vldr d17, [r1, #88]
- veor q15, q8, q15
- vstr d30, [r1, #80]
- vstr d31, [r1, #88]
- ldmia r4!, {r0, r1, r2, r3}
- mov r4, r9
- str r0, [r8, #32] @ unaligned
- str r1, [r8, #36] @ unaligned
- str r2, [r8, #40] @ unaligned
- str r3, [r8, #44] @ unaligned
- mov r8, r5
- ldr r0, [r6, #48]! @ unaligned
- ldr r1, [r6, #4] @ unaligned
- ldr r2, [r6, #8] @ unaligned
- ldr r3, [r6, #12] @ unaligned
- ldr r6, [r7, #76]
- stmia r4!, {r0, r1, r2, r3}
- mov r4, r9
- ldr r1, [r7, #88]
- str r9, [r7, #112]
- vldr d18, [r1, #80]
- vldr d19, [r1, #88]
- veor q9, q9, q2
- vstr d18, [r1, #80]
- vstr d19, [r1, #88]
- ldmia r9!, {r0, r1, r2, r3}
- str r0, [r5, #48] @ unaligned
- str r1, [r5, #52] @ unaligned
- str r2, [r5, #56] @ unaligned
- str r3, [r5, #60] @ unaligned
- ldr r0, [r6, #64]! @ unaligned
- ldr r1, [r6, #4] @ unaligned
- ldr r2, [r6, #8] @ unaligned
- ldr r3, [r6, #12] @ unaligned
- ldr r6, [r7, #76]
- mov r9, r6
- mov r6, r4
- stmia r6!, {r0, r1, r2, r3}
- mov r6, r4
- ldr r1, [r7, #88]
- vldr d18, [r1, #80]
- vldr d19, [r1, #88]
- veor q9, q9, q6
- vstr d18, [r1, #80]
- vstr d19, [r1, #88]
- ldmia r4!, {r0, r1, r2, r3}
- mov r4, r6
- str r3, [r5, #76] @ unaligned
- mov r3, r9
- str r2, [r5, #72] @ unaligned
- str r0, [r5, #64] @ unaligned
- str r1, [r5, #68] @ unaligned
- mov r5, r4
- ldr r0, [r3, #80]! @ unaligned
- mov r9, r3
- ldr r1, [r9, #4] @ unaligned
- ldr r2, [r9, #8] @ unaligned
- ldr r3, [r9, #12] @ unaligned
- mov r9, r4
- ldr r6, [r7, #76]
- str r9, [r7, #124]
- stmia r5!, {r0, r1, r2, r3}
- mov r5, r8
- ldr r1, [r7, #88]
- vldr d18, [r1, #80]
- vldr d19, [r1, #88]
- veor q1, q9, q1
- vstr d2, [r1, #80]
- vstr d3, [r1, #88]
- ldmia r4!, {r0, r1, r2, r3}
- mov r4, r9
- str r0, [r8, #80] @ unaligned
- str r1, [r8, #84] @ unaligned
- str r2, [r8, #88] @ unaligned
- str r3, [r8, #92] @ unaligned
- ldr r0, [r6, #96]! @ unaligned
- ldr r3, [r6, #12] @ unaligned
- ldr r1, [r6, #4] @ unaligned
- ldr r2, [r6, #8] @ unaligned
- ldr r6, [r7, #76]
- stmia r4!, {r0, r1, r2, r3}
- mov r4, r9
- ldr r3, [r7, #88]
- vldr d16, [r3, #80]
- vldr d17, [r3, #88]
- veor q8, q8, q7
- vstr d16, [r3, #80]
- vstr d17, [r3, #88]
- ldmia r9!, {r0, r1, r2, r3}
- str r0, [r5, #96] @ unaligned
- str r1, [r5, #100] @ unaligned
- str r2, [r5, #104] @ unaligned
- str r3, [r5, #108] @ unaligned
- ldr r0, [r6, #112]! @ unaligned
- ldr r1, [r6, #4] @ unaligned
- ldr r2, [r6, #8] @ unaligned
- ldr r3, [r6, #12] @ unaligned
- mov r6, r4
- stmia r6!, {r0, r1, r2, r3}
- mov r6, r5
- ldr r3, [r7, #88]
- vldr d16, [r3, #80]
- vldr d17, [r3, #88]
- veor q8, q8, q3
- vstr d16, [r3, #80]
- vstr d17, [r3, #88]
- ldmia r4!, {r0, r1, r2, r3}
- mov r4, r5
- mov r8, r4
- str r2, [r5, #120] @ unaligned
- ldr r2, [r7, #76]
- str r0, [r5, #112] @ unaligned
- str r1, [r5, #116] @ unaligned
- str r3, [r5, #124] @ unaligned
- ldr r3, [r2, #128]
- ldr r1, [r7, #104]
- eor r3, fp, r3
- str r3, [r5, #128]
- ldr r3, [r2, #132]
- mov r5, r2
- eor r3, r10, r3
- str r3, [r6, #132]
- ldr r3, [r2, #136]
- mov r6, r5
- eors r1, r1, r3
- str r1, [r8, #136]
- ldr r1, [r7, #56]
- ldr r3, [r2, #140]
- ldr r2, [r7, #100]
- ldr r0, [r7, #108]
- eors r3, r3, r2
- str r3, [r4, #140]
- ldr r3, [r1]
- ldr r2, [r5, #144]
- mov r8, r0
- add r8, r8, r3
- mov r5, r6
- mov r3, r8
- eors r2, r2, r3
- str r2, [r4, #144]
- ldr r3, [r6, #148]
- ldr r2, [r1, #4]
- ldr r6, [r7, #36]
- add r6, r6, r2
- eors r3, r3, r6
- mov r6, r1
- str r3, [r4, #148]
- ldr r2, [r1, #8]
- ldr r1, [r7, #116]
- ldr r3, [r5, #152]
- mov r8, r1
- add r8, r8, r2
- ldr r1, [r7, #32]
- mov r2, r8
- eors r3, r3, r2
- str r3, [r4, #152]
- mov r8, r4
- ldr r2, [r6, #12]
- ldr r3, [r5, #156]
- add r1, r1, r2
- eors r3, r3, r1
- str r3, [r4, #156]
- ldr r2, [r6, #16]
- mov r1, r4
- ldr r3, [r5, #160]
- mov r4, r5
- add ip, ip, r2
- mov r5, r6
- eor r3, ip, r3
- str r3, [r1, #160]
- ldr r2, [r6, #20]
- ldr r3, [r4, #164]
- add lr, lr, r2
- ldr r2, [r7, #92]
- eor r3, lr, r3
- str r3, [r1, #164]
- ldr r6, [r5, #24]
- mov lr, r4
- ldr r3, [r4, #168]
- add r2, r2, r6
- ldr r6, [r7, #120]
- eors r3, r3, r2
- str r3, [r1, #168]
- ldr r5, [r5, #28]
- ldr r3, [r4, #172]
- add r6, r6, r5
- eors r3, r3, r6
- str r3, [r1, #172]
- ldr r4, [r4, #176]
- ldr r0, [r7, #28]
- ldr r5, [r7, #24]
- eors r4, r4, r0
- str r4, [r8, #176]
- ldr r0, [lr, #180]
- ldr r2, [r7, #96]
- eors r0, r0, r5
- str r0, [r8, #180]
- ldr r1, [lr, #184]
- ldr r4, [r7, #20]
- eors r1, r1, r2
- str r1, [r8, #184]
- ldr r2, [lr, #188]
- add r1, lr, #192
- ldr r3, [r7, #72]
- eors r2, r2, r4
- str r2, [r8, #188]
- ldr r2, [r7, #16]
- adds r3, r3, #3
- str r3, [r7, #72]
- mov r3, r8
- adds r3, r3, #192
- str r1, [r7, #76]
- cmp r2, r3
- str r3, [r7, #80]
- bne .L4
- ldr r3, [r7, #12]
- ldr r2, [r7, #4]
- add r3, r3, r2
- str r3, [r7, #12]
-.L2:
- ldr r1, [r7, #8]
- movw r2, #43691
- movt r2, 43690
- umull r2, r3, r1, r2
- lsr fp, r3, #7
- lsl r3, fp, #8
- sub fp, r3, fp, lsl #6
- rsb fp, fp, r1
- lsrs fp, fp, #6
- beq .L6
- ldr r5, [r7, #12]
- ldr r4, [r7, #16]
- ldr r6, [r7, #88]
- ldr lr, [r7, #84]
- vldr d30, .L94
- vldr d31, .L94+8
- str fp, [r7, #120]
- str fp, [r7, #124]
-.L8:
- vmov q2, q11 @ v4si
- movs r3, #10
- vmov q8, q14 @ v4si
- vmov q9, q13 @ v4si
- vmov q10, q12 @ v4si
-.L7:
- vadd.i32 q10, q10, q9
- subs r3, r3, #1
- veor q3, q2, q10
- vrev32.16 q3, q3
- vadd.i32 q8, q8, q3
- veor q9, q8, q9
- vshl.i32 q2, q9, #12
- vsri.32 q2, q9, #20
- vadd.i32 q10, q10, q2
- veor q3, q10, q3
- vshl.i32 q9, q3, #8
- vsri.32 q9, q3, #24
- vadd.i32 q8, q8, q9
- vext.32 q9, q9, q9, #3
- veor q2, q8, q2
- vext.32 q8, q8, q8, #2
- vshl.i32 q3, q2, #7
- vsri.32 q3, q2, #25
- vext.32 q3, q3, q3, #1
- vadd.i32 q10, q10, q3
- veor q9, q10, q9
- vrev32.16 q9, q9
- vadd.i32 q8, q8, q9
- veor q3, q8, q3
- vshl.i32 q2, q3, #12
- vsri.32 q2, q3, #20
- vadd.i32 q10, q10, q2
- vmov q3, q2 @ v4si
- veor q9, q10, q9
- vshl.i32 q2, q9, #8
- vsri.32 q2, q9, #24
- vadd.i32 q8, q8, q2
- vext.32 q2, q2, q2, #1
- veor q3, q8, q3
- vext.32 q8, q8, q8, #2
- vshl.i32 q9, q3, #7
- vsri.32 q9, q3, #25
- vext.32 q9, q9, q9, #3
- bne .L7
- ldr r0, [r5] @ unaligned
- vadd.i32 q1, q12, q10
- ldr r1, [r5, #4] @ unaligned
- mov ip, lr
- ldr r2, [r5, #8] @ unaligned
- mov r9, lr
- ldr r3, [r5, #12] @ unaligned
- mov r10, r5
- vadd.i32 q9, q13, q9
- mov r8, lr
- vadd.i32 q8, q14, q8
- stmia ip!, {r0, r1, r2, r3}
- mov ip, lr
- vldr d20, [r6, #80]
- vldr d21, [r6, #88]
- vadd.i32 q3, q11, q2
- veor q10, q10, q1
- vadd.i32 q11, q11, q15
- vstr d20, [r6, #80]
- vstr d21, [r6, #88]
- ldmia r9!, {r0, r1, r2, r3}
- mov r9, r5
- str r0, [r4] @ unaligned
- str r1, [r4, #4] @ unaligned
- str r2, [r4, #8] @ unaligned
- str r3, [r4, #12] @ unaligned
- ldr r0, [r10, #16]! @ unaligned
- ldr r1, [r10, #4] @ unaligned
- ldr r2, [r10, #8] @ unaligned
- ldr r3, [r10, #12] @ unaligned
- add r10, r4, #48
- adds r4, r4, #64
- stmia r8!, {r0, r1, r2, r3}
- mov r8, lr
- vldr d20, [r6, #80]
- vldr d21, [r6, #88]
- veor q10, q10, q9
- vstr d20, [r6, #80]
- vstr d21, [r6, #88]
- ldmia ip!, {r0, r1, r2, r3}
- mov ip, lr
- str r0, [r4, #-48] @ unaligned
- str r1, [r4, #-44] @ unaligned
- str r2, [r4, #-40] @ unaligned
- str r3, [r4, #-36] @ unaligned
- ldr r0, [r9, #32]! @ unaligned
- ldr r1, [r9, #4] @ unaligned
- ldr r2, [r9, #8] @ unaligned
- ldr r3, [r9, #12] @ unaligned
- mov r9, r5
- adds r5, r5, #64
- stmia r8!, {r0, r1, r2, r3}
- mov r8, lr
- vldr d18, [r6, #80]
- vldr d19, [r6, #88]
- veor q9, q9, q8
- vstr d18, [r6, #80]
- vstr d19, [r6, #88]
- ldmia ip!, {r0, r1, r2, r3}
- mov ip, lr
- str r0, [r4, #-32] @ unaligned
- str r1, [r4, #-28] @ unaligned
- str r2, [r4, #-24] @ unaligned
- str r3, [r4, #-20] @ unaligned
- ldr r0, [r9, #48]! @ unaligned
- ldr r1, [r9, #4] @ unaligned
- ldr r2, [r9, #8] @ unaligned
- ldr r3, [r9, #12] @ unaligned
- stmia r8!, {r0, r1, r2, r3}
- vldr d16, [r6, #80]
- vldr d17, [r6, #88]
- veor q8, q8, q3
- vstr d16, [r6, #80]
- vstr d17, [r6, #88]
- ldmia ip!, {r0, r1, r2, r3}
- str r0, [r4, #-16] @ unaligned
- str r1, [r4, #-12] @ unaligned
- str r3, [r10, #12] @ unaligned
- ldr r3, [r7, #124]
- str r2, [r10, #8] @ unaligned
- cmp r3, #1
- beq .L87
- movs r3, #1
- str r3, [r7, #124]
- b .L8
-.L95:
- .align 3
-.L94:
- .word 1
- .word 0
- .word 0
- .word 0
-.L87:
- ldr fp, [r7, #120]
- ldr r3, [r7, #12]
- lsl fp, fp, #6
- add r3, r3, fp
- str r3, [r7, #12]
- ldr r3, [r7, #16]
- add r3, r3, fp
- str r3, [r7, #16]
-.L6:
- ldr r3, [r7, #8]
- ands r9, r3, #63
- beq .L1
- vmov q3, q11 @ v4si
- movs r3, #10
- vmov q8, q14 @ v4si
- mov r5, r9
- vmov q15, q13 @ v4si
- vmov q10, q12 @ v4si
-.L10:
- vadd.i32 q10, q10, q15
- subs r3, r3, #1
- veor q9, q3, q10
- vrev32.16 q9, q9
- vadd.i32 q8, q8, q9
- veor q15, q8, q15
- vshl.i32 q3, q15, #12
- vsri.32 q3, q15, #20
- vadd.i32 q10, q10, q3
- veor q15, q10, q9
- vshl.i32 q9, q15, #8
- vsri.32 q9, q15, #24
- vadd.i32 q8, q8, q9
- vext.32 q9, q9, q9, #3
- veor q3, q8, q3
- vext.32 q8, q8, q8, #2
- vshl.i32 q15, q3, #7
- vsri.32 q15, q3, #25
- vext.32 q15, q15, q15, #1
- vadd.i32 q10, q10, q15
- veor q9, q10, q9
- vrev32.16 q9, q9
- vadd.i32 q8, q8, q9
- veor q15, q8, q15
- vshl.i32 q3, q15, #12
- vsri.32 q3, q15, #20
- vadd.i32 q10, q10, q3
- vmov q15, q3 @ v4si
- veor q9, q10, q9
- vshl.i32 q3, q9, #8
- vsri.32 q3, q9, #24
- vadd.i32 q8, q8, q3
- vext.32 q3, q3, q3, #1
- veor q9, q8, q15
- vext.32 q8, q8, q8, #2
- vshl.i32 q15, q9, #7
- vsri.32 q15, q9, #25
- vext.32 q15, q15, q15, #3
- bne .L10
- cmp r5, #15
- mov r9, r5
- bhi .L88
- vadd.i32 q12, q12, q10
- ldr r3, [r7, #88]
- vst1.64 {d24-d25}, [r3:128]
-.L14:
- ldr r3, [r7, #8]
- and r2, r3, #48
- cmp r9, r2
- bls .L1
- ldr r6, [r7, #16]
- add r3, r2, #16
- ldr r1, [r7, #12]
- rsb ip, r2, r9
- adds r0, r1, r2
- mov r4, r6
- add r1, r1, r3
- add r4, r4, r2
- add r3, r3, r6
- cmp r0, r3
- it cc
- cmpcc r4, r1
- ite cs
- movcs r3, #1
- movcc r3, #0
- cmp ip, #18
- ite ls
- movls r3, #0
- andhi r3, r3, #1
- cmp r3, #0
- beq .L16
- and r1, r0, #7
- mov r3, r2
- negs r1, r1
- and r1, r1, #15
- cmp r1, ip
- it cs
- movcs r1, ip
- cmp r1, #0
- beq .L17
- ldr r5, [r7, #88]
- cmp r1, #1
- ldrb r0, [r0] @ zero_extendqisi2
- add r3, r2, #1
- ldrb lr, [r5, r2] @ zero_extendqisi2
- mov r6, r5
- eor r0, lr, r0
- strb r0, [r4]
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #2
- ldrb r4, [r5, r3] @ zero_extendqisi2
- ldr r5, [r7, #16]
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #2
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #3
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #3
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #4
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #4
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #5
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #5
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #6
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #6
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #7
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #7
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #8
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #8
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #9
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #9
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #10
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #10
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #11
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #11
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #12
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #12
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #13
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #13
- beq .L17
- ldr r0, [r7, #12]
- cmp r1, #15
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #14
- bne .L17
- ldr r0, [r7, #12]
- ldrb r4, [r6, r3] @ zero_extendqisi2
- ldrb r0, [r0, r3] @ zero_extendqisi2
- eors r0, r0, r4
- strb r0, [r5, r3]
- add r3, r2, #15
-.L17:
- rsb r4, r1, ip
- add r0, ip, #-1
- sub r6, r4, #16
- subs r0, r0, r1
- cmp r0, #14
- lsr r6, r6, #4
- add r6, r6, #1
- lsl lr, r6, #4
- bls .L19
- add r2, r2, r1
- ldr r1, [r7, #12]
- ldr r5, [r7, #16]
- cmp r6, #1
- add r0, r1, r2
- ldr r1, [r7, #88]
- add r1, r1, r2
- vld1.64 {d18-d19}, [r0:64]
- add r2, r2, r5
- vld1.8 {q8}, [r1]
- veor q8, q8, q9
- vst1.8 {q8}, [r2]
- beq .L20
- add r8, r1, #16
- add ip, r2, #16
- vldr d18, [r0, #16]
- vldr d19, [r0, #24]
- cmp r6, #2
- vld1.8 {q8}, [r8]
- veor q8, q8, q9
- vst1.8 {q8}, [ip]
- beq .L20
- add r8, r1, #32
- add ip, r2, #32
- vldr d18, [r0, #32]
- vldr d19, [r0, #40]
- cmp r6, #3
- vld1.8 {q8}, [r8]
- veor q8, q8, q9
- vst1.8 {q8}, [ip]
- beq .L20
- adds r1, r1, #48
- adds r2, r2, #48
- vldr d18, [r0, #48]
- vldr d19, [r0, #56]
- vld1.8 {q8}, [r1]
- veor q8, q8, q9
- vst1.8 {q8}, [r2]
-.L20:
- cmp lr, r4
- add r3, r3, lr
- beq .L1
-.L19:
- ldr r4, [r7, #88]
- adds r2, r3, #1
- ldr r1, [r7, #12]
- cmp r2, r9
- ldr r5, [r7, #16]
- ldrb r0, [r4, r3] @ zero_extendqisi2
- ldrb r1, [r1, r3] @ zero_extendqisi2
- eor r1, r1, r0
- strb r1, [r5, r3]
- bcs .L1
- ldr r0, [r7, #12]
- adds r1, r3, #2
- mov r6, r4
- cmp r9, r1
- ldrb r4, [r4, r2] @ zero_extendqisi2
- ldrb r0, [r0, r2] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r2]
- bls .L1
- ldr r0, [r7, #12]
- adds r2, r3, #3
- ldrb r4, [r6, r1] @ zero_extendqisi2
- cmp r9, r2
- ldrb r0, [r0, r1] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r1]
- bls .L1
- ldr r0, [r7, #12]
- adds r1, r3, #4
- ldrb r4, [r6, r2] @ zero_extendqisi2
- cmp r9, r1
- ldrb r0, [r0, r2] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r2]
- bls .L1
- ldr r0, [r7, #12]
- adds r2, r3, #5
- ldrb r4, [r6, r1] @ zero_extendqisi2
- cmp r9, r2
- ldrb r0, [r0, r1] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r1]
- bls .L1
- ldr r0, [r7, #12]
- adds r1, r3, #6
- ldrb r4, [r6, r2] @ zero_extendqisi2
- cmp r9, r1
- ldrb r0, [r0, r2] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r2]
- bls .L1
- ldr r0, [r7, #12]
- adds r2, r3, #7
- ldrb r4, [r6, r1] @ zero_extendqisi2
- cmp r9, r2
- ldrb r0, [r0, r1] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r1]
- bls .L1
- ldr r0, [r7, #12]
- add r1, r3, #8
- ldrb r4, [r6, r2] @ zero_extendqisi2
- cmp r9, r1
- ldrb r0, [r0, r2] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r2]
- bls .L1
- ldr r0, [r7, #12]
- add r2, r3, #9
- ldrb r4, [r6, r1] @ zero_extendqisi2
- cmp r9, r2
- ldrb r0, [r0, r1] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r1]
- bls .L1
- ldr r0, [r7, #12]
- add r1, r3, #10
- ldrb r4, [r6, r2] @ zero_extendqisi2
- cmp r9, r1
- ldrb r0, [r0, r2] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r2]
- bls .L1
- ldr r0, [r7, #12]
- add r2, r3, #11
- ldrb r4, [r6, r1] @ zero_extendqisi2
- cmp r9, r2
- ldrb r0, [r0, r1] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r1]
- bls .L1
- ldr r0, [r7, #12]
- add r1, r3, #12
- ldrb r4, [r6, r2] @ zero_extendqisi2
- cmp r9, r1
- ldrb r0, [r0, r2] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r2]
- bls .L1
- ldr r0, [r7, #12]
- add r2, r3, #13
- ldrb r4, [r6, r1] @ zero_extendqisi2
- cmp r9, r2
- ldrb r0, [r0, r1] @ zero_extendqisi2
- eor r0, r0, r4
- strb r0, [r5, r1]
- bls .L1
- ldr r1, [r7, #12]
- adds r3, r3, #14
- ldrb r0, [r6, r2] @ zero_extendqisi2
- cmp r9, r3
- ldrb r1, [r1, r2] @ zero_extendqisi2
- eor r1, r1, r0
- strb r1, [r5, r2]
- bls .L1
- ldr r2, [r7, #88]
- ldrb r1, [r2, r3] @ zero_extendqisi2
- ldr r2, [r7, #12]
- ldrb r2, [r2, r3] @ zero_extendqisi2
- eors r2, r2, r1
- ldr r1, [r7, #16]
- strb r2, [r1, r3]
-.L1:
- adds r7, r7, #164
- mov sp, r7
- @ sp needed
- vldm sp!, {d8-d15}
- pop {r4, r5, r6, r7, r8, r9, r10, fp, pc}
-.L88:
- ldr r5, [r7, #12]
- vadd.i32 q12, q12, q10
- ldr r4, [r7, #84]
- cmp r9, #31
- ldr r0, [r5] @ unaligned
- ldr r1, [r5, #4] @ unaligned
- mov r6, r4
- ldr r2, [r5, #8] @ unaligned
- ldr r3, [r5, #12] @ unaligned
- stmia r6!, {r0, r1, r2, r3}
- ldr r2, [r7, #88]
- ldr r6, [r7, #16]
- vldr d18, [r2, #80]
- vldr d19, [r2, #88]
- veor q9, q9, q12
- vstr d18, [r2, #80]
- vstr d19, [r2, #88]
- ldmia r4!, {r0, r1, r2, r3}
- str r1, [r6, #4] @ unaligned
- mov r1, r6
- str r0, [r6] @ unaligned
- str r2, [r6, #8] @ unaligned
- str r3, [r6, #12] @ unaligned
- bhi .L89
- vadd.i32 q13, q13, q15
- ldr r3, [r7, #88]
- vstr d26, [r3, #16]
- vstr d27, [r3, #24]
- b .L14
-.L16:
- subs r3, r2, #1
- ldr r2, [r7, #12]
- add r2, r2, r9
- mov r5, r2
- ldr r2, [r7, #88]
- add r2, r2, r3
- mov r3, r2
-.L24:
- ldrb r1, [r0], #1 @ zero_extendqisi2
- ldrb r2, [r3, #1]! @ zero_extendqisi2
- cmp r0, r5
- eor r2, r2, r1
- strb r2, [r4], #1
- bne .L24
- adds r7, r7, #164
- mov sp, r7
- @ sp needed
- vldm sp!, {d8-d15}
- pop {r4, r5, r6, r7, r8, r9, r10, fp, pc}
-.L26:
- ldr r3, [r7, #80]
- str r3, [r7, #16]
- b .L2
-.L89:
- mov r3, r5
- ldr r4, [r7, #84]
- ldr r0, [r3, #16]! @ unaligned
- add lr, r1, #16
- mov r5, r1
- vadd.i32 q13, q13, q15
- mov r6, r4
- cmp r9, #47
- ldr r1, [r3, #4] @ unaligned
- ldr r2, [r3, #8] @ unaligned
- ldr r3, [r3, #12] @ unaligned
- stmia r6!, {r0, r1, r2, r3}
- ldr r2, [r7, #88]
- vldr d18, [r2, #80]
- vldr d19, [r2, #88]
- veor q13, q9, q13
- vstr d26, [r2, #80]
- vstr d27, [r2, #88]
- ldmia r4!, {r0, r1, r2, r3}
- str r0, [r5, #16] @ unaligned
- str r1, [lr, #4] @ unaligned
- str r2, [lr, #8] @ unaligned
- str r3, [lr, #12] @ unaligned
- bhi .L90
- vadd.i32 q8, q14, q8
- ldr r3, [r7, #88]
- vstr d16, [r3, #32]
- vstr d17, [r3, #40]
- b .L14
-.L90:
- ldr r3, [r7, #12]
- add lr, r5, #32
- ldr r4, [r7, #84]
- vadd.i32 q8, q14, q8
- ldr r5, [r7, #88]
- vadd.i32 q11, q11, q3
- ldr r0, [r3, #32]! @ unaligned
- mov r6, r4
- vstr d22, [r5, #48]
- vstr d23, [r5, #56]
- ldr r1, [r3, #4] @ unaligned
- ldr r2, [r3, #8] @ unaligned
- ldr r3, [r3, #12] @ unaligned
- stmia r4!, {r0, r1, r2, r3}
- vldr d18, [r5, #80]
- vldr d19, [r5, #88]
- veor q9, q9, q8
- ldr r4, [r7, #16]
- vstr d18, [r5, #80]
- vstr d19, [r5, #88]
- ldmia r6!, {r0, r1, r2, r3}
- str r0, [r4, #32] @ unaligned
- str r1, [lr, #4] @ unaligned
- str r2, [lr, #8] @ unaligned
- str r3, [lr, #12] @ unaligned
- b .L14
- .size CRYPTO_chacha_20_neon, .-CRYPTO_chacha_20_neon
- .section .rodata
- .align 2
-.LANCHOR0 = . + 0
-.LC0:
- .word 1634760805
- .word 857760878
- .word 2036477234
- .word 1797285236
- .ident "GCC: (Linaro GCC 2014.11) 4.9.3 20141031 (prerelease)"
- .section .note.GNU-stack,"",%progbits
-
-#endif /* __arm__ */
-#endif /* !OPENSSL_NO_ASM */
diff --git a/crypto/chacha/chacha_vec_arm_generate.go b/crypto/chacha/chacha_vec_arm_generate.go
deleted file mode 100644
index 82aa847..0000000
--- a/crypto/chacha/chacha_vec_arm_generate.go
+++ /dev/null
@@ -1,153 +0,0 @@
-// Copyright (c) 2014, Google Inc.
-//
-// Permission to use, copy, modify, and/or distribute this software for any
-// purpose with or without fee is hereby granted, provided that the above
-// copyright notice and this permission notice appear in all copies.
-//
-// THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
-// WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
-// MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
-// SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
-// WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
-// OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
-// CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
-
-// This package generates chacha_vec_arm.S from chacha_vec.c. Install the
-// arm-linux-gnueabihf-gcc compiler as described in BUILDING.md. Then:
-// `(cd crypto/chacha && go run chacha_vec_arm_generate.go)`.
-
-package main
-
-import (
- "bufio"
- "bytes"
- "os"
- "os/exec"
- "strings"
-)
-
-const defaultCompiler = "/opt/gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf/bin/arm-linux-gnueabihf-gcc"
-
-func main() {
- compiler := defaultCompiler
- if len(os.Args) > 1 {
- compiler = os.Args[1]
- }
-
- args := []string{
- "-O3",
- "-mcpu=cortex-a8",
- "-mfpu=neon",
- "-fpic",
- "-DASM_GEN",
- "-I", "../../include",
- "-S", "chacha_vec.c",
- "-o", "-",
- }
-
- output, err := os.OpenFile("chacha_vec_arm.S", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0644)
- if err != nil {
- panic(err)
- }
- defer output.Close()
-
- output.WriteString(preamble)
- output.WriteString(compiler)
- output.WriteString(" ")
- output.WriteString(strings.Join(args, " "))
- output.WriteString("\n\n#if !defined(OPENSSL_NO_ASM)\n")
- output.WriteString("#if defined(__arm__)\n\n")
-
- cmd := exec.Command(compiler, args...)
- cmd.Stderr = os.Stderr
- asm, err := cmd.StdoutPipe()
- if err != nil {
- panic(err)
- }
- if err := cmd.Start(); err != nil {
- panic(err)
- }
-
- attr28 := []byte(".eabi_attribute 28,")
- globalDirective := []byte(".global\t")
- newLine := []byte("\n")
- attr28Handled := false
-
- scanner := bufio.NewScanner(asm)
- for scanner.Scan() {
- line := scanner.Bytes()
-
- if bytes.Contains(line, attr28) {
- output.WriteString(attr28Block)
- attr28Handled = true
- continue
- }
-
- output.Write(line)
- output.Write(newLine)
-
- if i := bytes.Index(line, globalDirective); i >= 0 {
- output.Write(line[:i])
- output.WriteString(".hidden\t")
- output.Write(line[i+len(globalDirective):])
- output.Write(newLine)
- }
- }
-
- if err := scanner.Err(); err != nil {
- panic(err)
- }
-
- if !attr28Handled {
- panic("EABI attribute 28 not seen in processing")
- }
-
- if err := cmd.Wait(); err != nil {
- panic(err)
- }
-
- output.WriteString(trailer)
-}
-
-const preamble = `# Copyright (c) 2014, Google Inc.
-#
-# Permission to use, copy, modify, and/or distribute this software for any
-# purpose with or without fee is hereby granted, provided that the above
-# copyright notice and this permission notice appear in all copies.
-#
-# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
-# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
-# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
-# SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
-# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
-# OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
-# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
-
-# This file contains a pre-compiled version of chacha_vec.c for ARM. This is
-# needed to support switching on NEON code at runtime. If the whole of OpenSSL
-# were to be compiled with the needed flags to build chacha_vec.c, then it
-# wouldn't be possible to run on non-NEON systems.
-#
-# This file was generated by chacha_vec_arm_generate.go using the following
-# compiler command:
-#
-# `
-
-const attr28Block = `
-# EABI attribute 28 sets whether VFP register arguments were used to build this
-# file. If object files are inconsistent on this point, the linker will refuse
-# to link them. Thus we report whatever the compiler expects since we don't use
-# VFP arguments.
-
-#if defined(__ARM_PCS_VFP)
- .eabi_attribute 28, 1
-#else
- .eabi_attribute 28, 0
-#endif
-
-`
-
-const trailer = `
-#endif /* __arm__ */
-#endif /* !OPENSSL_NO_ASM */
-`
diff --git a/util/generate_build_files.py b/util/generate_build_files.py
index acc693a..3220122 100644
--- a/util/generate_build_files.py
+++ b/util/generate_build_files.py
@@ -39,7 +39,6 @@
# perlasm system.
NON_PERL_FILES = {
('linux', 'arm'): [
- 'src/crypto/chacha/chacha_vec_arm.S',
'src/crypto/cpu-arm-asm.S',
'src/crypto/curve25519/asm/x25519-asm-arm.S',
'src/crypto/poly1305/poly1305_arm_asm.S',