Add a constant-time fallback GHASH implementation.

We have several variable-time, table-based GHASH implementations, called
"4bit" in the code: a fallback implementation in C and assembly
implementations for x86, x86_64, and armv4. These are used if assembly is
disabled or if the hardware lacks NEON or SSSE3.

Note these benchmarks all ran on hardware several generations beyond what
would actually run this code, so they are a bit artificial.

Implement a constant-time GHASH based on the notes in
https://bearssl.org/constanttime.html#ghash-for-gcm, as well as the
reduction algorithm described in
https://crypto.stanford.edu/RealWorldCrypto/slides/gueron.pdf.

This new implementation is actually faster than the fallback C code for
both 32-bit and 64-bit. It is slower than the assembly implementations,
particularly for 32-bit. I've left 32-bit x86 alone but replaced the
x86_64 and armv4 ones.  The perf hit on x86_64 is smaller and affects a
small percentage of 64-bit Chrome on Windows users. ARM chips without
NEON are rare (Chrome for Android requires it), so that one is replaced too.

The answer for 32-bit x86 is unclear. A larger share of 32-bit Chrome on
Windows users lack SSSE3, and the perf hit is dramatic. gcm_gmult_4bit_mmx
uses SSE2, so perhaps we can close the gap with an SSE2 version of this
strategy, or perhaps we can decide that fixing the timing leaks is worth
this perf hit.

32-bit x86 with OPENSSL_NO_ASM
Before: (4bit C)
Did 1136000 AES-128-GCM (16 bytes) seal operations in 1000762us (1135135.0 ops/sec): 18.2 MB/s
Did 190000 AES-128-GCM (256 bytes) seal operations in 1003533us (189331.1 ops/sec): 48.5 MB/s
Did 40000 AES-128-GCM (1350 bytes) seal operations in 1022114us (39134.6 ops/sec): 52.8 MB/s
Did 7282 AES-128-GCM (8192 bytes) seal operations in 1117575us (6515.9 ops/sec): 53.4 MB/s
Did 3663 AES-128-GCM (16384 bytes) seal operations in 1098538us (3334.4 ops/sec): 54.6 MB/s
After:
Did 1503000 AES-128-GCM (16 bytes) seal operations in 1000054us (1502918.8 ops/sec): 24.0 MB/s
Did 252000 AES-128-GCM (256 bytes) seal operations in 1001173us (251704.8 ops/sec): 64.4 MB/s
Did 53000 AES-128-GCM (1350 bytes) seal operations in 1016983us (52114.9 ops/sec): 70.4 MB/s
Did 9317 AES-128-GCM (8192 bytes) seal operations in 1056367us (8819.9 ops/sec): 72.3 MB/s
Did 4356 AES-128-GCM (16384 bytes) seal operations in 1000445us (4354.1 ops/sec): 71.3 MB/s

64-bit x86 with OPENSSL_NO_ASM
Before: (4bit C)
Did 2976000 AES-128-GCM (16 bytes) seal operations in 1000258us (2975232.4 ops/sec): 47.6 MB/s
Did 510000 AES-128-GCM (256 bytes) seal operations in 1000295us (509849.6 ops/sec): 130.5 MB/s
Did 106000 AES-128-GCM (1350 bytes) seal operations in 1001573us (105833.5 ops/sec): 142.9 MB/s
Did 18000 AES-128-GCM (8192 bytes) seal operations in 1003895us (17930.2 ops/sec): 146.9 MB/s
Did 9000 AES-128-GCM (16384 bytes) seal operations in 1003352us (8969.9 ops/sec): 147.0 MB/s
After:
Did 2972000 AES-128-GCM (16 bytes) seal operations in 1000178us (2971471.1 ops/sec): 47.5 MB/s
Did 515000 AES-128-GCM (256 bytes) seal operations in 1001850us (514049.0 ops/sec): 131.6 MB/s
Did 108000 AES-128-GCM (1350 bytes) seal operations in 1004941us (107469.0 ops/sec): 145.1 MB/s
Did 19000 AES-128-GCM (8192 bytes) seal operations in 1034966us (18358.1 ops/sec): 150.4 MB/s
Did 9250 AES-128-GCM (16384 bytes) seal operations in 1005269us (9201.5 ops/sec): 150.8 MB/s

32-bit ARM without NEON
Before: (4bit armv4 asm)
Did 952000 AES-128-GCM (16 bytes) seal operations in 1001009us (951040.4 ops/sec): 15.2 MB/s
Did 152000 AES-128-GCM (256 bytes) seal operations in 1005576us (151157.1 ops/sec): 38.7 MB/s
Did 32000 AES-128-GCM (1350 bytes) seal operations in 1024522us (31234.1 ops/sec): 42.2 MB/s
Did 5290 AES-128-GCM (8192 bytes) seal operations in 1005335us (5261.9 ops/sec): 43.1 MB/s
Did 2650 AES-128-GCM (16384 bytes) seal operations in 1004396us (2638.4 ops/sec): 43.2 MB/s
After:
Did 540000 AES-128-GCM (16 bytes) seal operations in 1000009us (539995.1 ops/sec): 8.6 MB/s
Did 90000 AES-128-GCM (256 bytes) seal operations in 1000028us (89997.5 ops/sec): 23.0 MB/s
Did 19000 AES-128-GCM (1350 bytes) seal operations in 1022041us (18590.3 ops/sec): 25.1 MB/s
Did 3150 AES-128-GCM (8192 bytes) seal operations in 1003199us (3140.0 ops/sec): 25.7 MB/s
Did 1694 AES-128-GCM (16384 bytes) seal operations in 1076156us (1574.1 ops/sec): 25.8 MB/s
(Note fallback AES is dampening the perf hit.)

64-bit x86 with OPENSSL_ia32cap=0
Before: (4bit x86_64 asm)
Did 2615000 AES-128-GCM (16 bytes) seal operations in 1000220us (2614424.8 ops/sec): 41.8 MB/s
Did 431000 AES-128-GCM (256 bytes) seal operations in 1001250us (430461.9 ops/sec): 110.2 MB/s
Did 89000 AES-128-GCM (1350 bytes) seal operations in 1002209us (88803.8 ops/sec): 119.9 MB/s
Did 16000 AES-128-GCM (8192 bytes) seal operations in 1064535us (15030.0 ops/sec): 123.1 MB/s
Did 8261 AES-128-GCM (16384 bytes) seal operations in 1096787us (7532.0 ops/sec): 123.4 MB/s
After:
Did 2355000 AES-128-GCM (16 bytes) seal operations in 1000096us (2354773.9 ops/sec): 37.7 MB/s
Did 373000 AES-128-GCM (256 bytes) seal operations in 1000981us (372634.4 ops/sec): 95.4 MB/s
Did 77000 AES-128-GCM (1350 bytes) seal operations in 1003557us (76727.1 ops/sec): 103.6 MB/s
Did 13000 AES-128-GCM (8192 bytes) seal operations in 1003058us (12960.4 ops/sec): 106.2 MB/s
Did 7139 AES-128-GCM (16384 bytes) seal operations in 1099576us (6492.5 ops/sec): 106.4 MB/s
(Note fallback AES is dampening the perf hit. Pairing with AESNI to roughly
isolate GHASH shows a 40% hit.)

For comparison, this is what removing gcm_gmult_4bit_mmx would do.
32-bit x86 with OPENSSL_ia32cap=0
Before:
Did 2014000 AES-128-GCM (16 bytes) seal operations in 1000026us (2013947.6 ops/sec): 32.2 MB/s
Did 367000 AES-128-GCM (256 bytes) seal operations in 1000097us (366964.4 ops/sec): 93.9 MB/s
Did 77000 AES-128-GCM (1350 bytes) seal operations in 1002135us (76836.0 ops/sec): 103.7 MB/s
Did 13000 AES-128-GCM (8192 bytes) seal operations in 1011394us (12853.5 ops/sec): 105.3 MB/s
Did 7227 AES-128-GCM (16384 bytes) seal operations in 1099409us (6573.5 ops/sec): 107.7 MB/s
If gcm_gmult_4bit_mmx were replaced:
Did 1350000 AES-128-GCM (16 bytes) seal operations in 1000128us (1349827.2 ops/sec): 21.6 MB/s
Did 219000 AES-128-GCM (256 bytes) seal operations in 1000090us (218980.3 ops/sec): 56.1 MB/s
Did 46000 AES-128-GCM (1350 bytes) seal operations in 1017365us (45214.8 ops/sec): 61.0 MB/s
Did 8393 AES-128-GCM (8192 bytes) seal operations in 1115579us (7523.4 ops/sec): 61.6 MB/s
Did 3840 AES-128-GCM (16384 bytes) seal operations in 1001928us (3832.6 ops/sec): 62.8 MB/s
(Note fallback AES is dampening the perf hit. Pairing with AESNI to roughly
isolate GHASH shows a 73% hit. gcm_gmult_4bit_mmx is almost 4x as fast.)

Change-Id: Ib28c981e92e200b17fb9ddc89aef695ac6733a43
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/38724
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
diff --git a/crypto/fipsmodule/bcm.c b/crypto/fipsmodule/bcm.c
index 91b9170..7485f6c 100644
--- a/crypto/fipsmodule/bcm.c
+++ b/crypto/fipsmodule/bcm.c
@@ -83,6 +83,7 @@
 #include "modes/cfb.c"
 #include "modes/ctr.c"
 #include "modes/gcm.c"
+#include "modes/gcm_nohw.c"
 #include "modes/ofb.c"
 #include "modes/polyval.c"
 #include "rand/ctrdrbg.c"
diff --git a/crypto/fipsmodule/modes/asm/ghash-armv4.pl b/crypto/fipsmodule/modes/asm/ghash-armv4.pl
index cdb6fb4..daf52e8 100644
--- a/crypto/fipsmodule/modes/asm/ghash-armv4.pl
+++ b/crypto/fipsmodule/modes/asm/ghash-armv4.pl
@@ -78,6 +78,9 @@
 # *native* byte order on current platform. See gcm128.c for working
 # example...
 
+# This file was patched in BoringSSL to remove the variable-time 4-bit
+# implementation.
+
 $flavour = shift;
 if ($flavour=~/\w[\w\-]*\.\w+$/) { $output=$flavour; undef $flavour; }
 else { while (($output=shift) && ($output!~/\w[\w\-]*\.\w+$/)) {} }
@@ -100,47 +103,6 @@
 $inp="r2";
 $len="r3";
 
-$Zll="r4";	# variables
-$Zlh="r5";
-$Zhl="r6";
-$Zhh="r7";
-$Tll="r8";
-$Tlh="r9";
-$Thl="r10";
-$Thh="r11";
-$nlo="r12";
-################# r13 is stack pointer
-$nhi="r14";
-################# r15 is program counter
-
-$rem_4bit=$inp;	# used in gcm_gmult_4bit
-$cnt=$len;
-
-sub Zsmash() {
-  my $i=12;
-  my @args=@_;
-  for ($Zll,$Zlh,$Zhl,$Zhh) {
-    $code.=<<___;
-#if __ARM_ARCH__>=7 && defined(__ARMEL__)
-	rev	$_,$_
-	str	$_,[$Xi,#$i]
-#elif defined(__ARMEB__)
-	str	$_,[$Xi,#$i]
-#else
-	mov	$Tlh,$_,lsr#8
-	strb	$_,[$Xi,#$i+3]
-	mov	$Thl,$_,lsr#16
-	strb	$Tlh,[$Xi,#$i+2]
-	mov	$Thh,$_,lsr#24
-	strb	$Thl,[$Xi,#$i+1]
-	strb	$Thh,[$Xi,#$i]
-#endif
-___
-    $code.="\t".shift(@args)."\n";
-    $i-=4;
-  }
-}
-
 $code=<<___;
 #include <openssl/arm_arch.h>
 
@@ -160,226 +122,6 @@
 #else
 .code	32
 #endif
-
-.type	rem_4bit,%object
-.align	5
-rem_4bit:
-.short	0x0000,0x1C20,0x3840,0x2460
-.short	0x7080,0x6CA0,0x48C0,0x54E0
-.short	0xE100,0xFD20,0xD940,0xC560
-.short	0x9180,0x8DA0,0xA9C0,0xB5E0
-.size	rem_4bit,.-rem_4bit
-
-.type	rem_4bit_get,%function
-rem_4bit_get:
-#if defined(__thumb2__)
-	adr	$rem_4bit,rem_4bit
-#else
-	sub	$rem_4bit,pc,#8+32	@ &rem_4bit
-#endif
-	b	.Lrem_4bit_got
-	nop
-	nop
-.size	rem_4bit_get,.-rem_4bit_get
-
-.global	gcm_ghash_4bit
-.type	gcm_ghash_4bit,%function
-.align	4
-gcm_ghash_4bit:
-#if defined(__thumb2__)
-	adr	r12,rem_4bit
-#else
-	sub	r12,pc,#8+48		@ &rem_4bit
-#endif
-	add	$len,$inp,$len		@ $len to point at the end
-	stmdb	sp!,{r3-r11,lr}		@ save $len/end too
-
-	ldmia	r12,{r4-r11}		@ copy rem_4bit ...
-	stmdb	sp!,{r4-r11}		@ ... to stack
-
-	ldrb	$nlo,[$inp,#15]
-	ldrb	$nhi,[$Xi,#15]
-.Louter:
-	eor	$nlo,$nlo,$nhi
-	and	$nhi,$nlo,#0xf0
-	and	$nlo,$nlo,#0x0f
-	mov	$cnt,#14
-
-	add	$Zhh,$Htbl,$nlo,lsl#4
-	ldmia	$Zhh,{$Zll-$Zhh}	@ load Htbl[nlo]
-	add	$Thh,$Htbl,$nhi
-	ldrb	$nlo,[$inp,#14]
-
-	and	$nhi,$Zll,#0xf		@ rem
-	ldmia	$Thh,{$Tll-$Thh}	@ load Htbl[nhi]
-	add	$nhi,$nhi,$nhi
-	eor	$Zll,$Tll,$Zll,lsr#4
-	ldrh	$Tll,[sp,$nhi]		@ rem_4bit[rem]
-	eor	$Zll,$Zll,$Zlh,lsl#28
-	ldrb	$nhi,[$Xi,#14]
-	eor	$Zlh,$Tlh,$Zlh,lsr#4
-	eor	$Zlh,$Zlh,$Zhl,lsl#28
-	eor	$Zhl,$Thl,$Zhl,lsr#4
-	eor	$Zhl,$Zhl,$Zhh,lsl#28
-	eor	$Zhh,$Thh,$Zhh,lsr#4
-	eor	$nlo,$nlo,$nhi
-	and	$nhi,$nlo,#0xf0
-	and	$nlo,$nlo,#0x0f
-	eor	$Zhh,$Zhh,$Tll,lsl#16
-
-.Linner:
-	add	$Thh,$Htbl,$nlo,lsl#4
-	and	$nlo,$Zll,#0xf		@ rem
-	subs	$cnt,$cnt,#1
-	add	$nlo,$nlo,$nlo
-	ldmia	$Thh,{$Tll-$Thh}	@ load Htbl[nlo]
-	eor	$Zll,$Tll,$Zll,lsr#4
-	eor	$Zll,$Zll,$Zlh,lsl#28
-	eor	$Zlh,$Tlh,$Zlh,lsr#4
-	eor	$Zlh,$Zlh,$Zhl,lsl#28
-	ldrh	$Tll,[sp,$nlo]		@ rem_4bit[rem]
-	eor	$Zhl,$Thl,$Zhl,lsr#4
-#ifdef	__thumb2__
-	it	pl
-#endif
-	ldrplb	$nlo,[$inp,$cnt]
-	eor	$Zhl,$Zhl,$Zhh,lsl#28
-	eor	$Zhh,$Thh,$Zhh,lsr#4
-
-	add	$Thh,$Htbl,$nhi
-	and	$nhi,$Zll,#0xf		@ rem
-	eor	$Zhh,$Zhh,$Tll,lsl#16	@ ^= rem_4bit[rem]
-	add	$nhi,$nhi,$nhi
-	ldmia	$Thh,{$Tll-$Thh}	@ load Htbl[nhi]
-	eor	$Zll,$Tll,$Zll,lsr#4
-#ifdef	__thumb2__
-	it	pl
-#endif
-	ldrplb	$Tll,[$Xi,$cnt]
-	eor	$Zll,$Zll,$Zlh,lsl#28
-	eor	$Zlh,$Tlh,$Zlh,lsr#4
-	ldrh	$Tlh,[sp,$nhi]
-	eor	$Zlh,$Zlh,$Zhl,lsl#28
-	eor	$Zhl,$Thl,$Zhl,lsr#4
-	eor	$Zhl,$Zhl,$Zhh,lsl#28
-#ifdef	__thumb2__
-	it	pl
-#endif
-	eorpl	$nlo,$nlo,$Tll
-	eor	$Zhh,$Thh,$Zhh,lsr#4
-#ifdef	__thumb2__
-	itt	pl
-#endif
-	andpl	$nhi,$nlo,#0xf0
-	andpl	$nlo,$nlo,#0x0f
-	eor	$Zhh,$Zhh,$Tlh,lsl#16	@ ^= rem_4bit[rem]
-	bpl	.Linner
-
-	ldr	$len,[sp,#32]		@ re-load $len/end
-	add	$inp,$inp,#16
-	mov	$nhi,$Zll
-___
-	&Zsmash("cmp\t$inp,$len","\n".
-				 "#ifdef __thumb2__\n".
-				 "	it	ne\n".
-				 "#endif\n".
-				 "	ldrneb	$nlo,[$inp,#15]");
-$code.=<<___;
-	bne	.Louter
-
-	add	sp,sp,#36
-#if __ARM_ARCH__>=5
-	ldmia	sp!,{r4-r11,pc}
-#else
-	ldmia	sp!,{r4-r11,lr}
-	tst	lr,#1
-	moveq	pc,lr			@ be binary compatible with V4, yet
-	bx	lr			@ interoperable with Thumb ISA:-)
-#endif
-.size	gcm_ghash_4bit,.-gcm_ghash_4bit
-
-.global	gcm_gmult_4bit
-.type	gcm_gmult_4bit,%function
-gcm_gmult_4bit:
-	stmdb	sp!,{r4-r11,lr}
-	ldrb	$nlo,[$Xi,#15]
-	b	rem_4bit_get
-.Lrem_4bit_got:
-	and	$nhi,$nlo,#0xf0
-	and	$nlo,$nlo,#0x0f
-	mov	$cnt,#14
-
-	add	$Zhh,$Htbl,$nlo,lsl#4
-	ldmia	$Zhh,{$Zll-$Zhh}	@ load Htbl[nlo]
-	ldrb	$nlo,[$Xi,#14]
-
-	add	$Thh,$Htbl,$nhi
-	and	$nhi,$Zll,#0xf		@ rem
-	ldmia	$Thh,{$Tll-$Thh}	@ load Htbl[nhi]
-	add	$nhi,$nhi,$nhi
-	eor	$Zll,$Tll,$Zll,lsr#4
-	ldrh	$Tll,[$rem_4bit,$nhi]	@ rem_4bit[rem]
-	eor	$Zll,$Zll,$Zlh,lsl#28
-	eor	$Zlh,$Tlh,$Zlh,lsr#4
-	eor	$Zlh,$Zlh,$Zhl,lsl#28
-	eor	$Zhl,$Thl,$Zhl,lsr#4
-	eor	$Zhl,$Zhl,$Zhh,lsl#28
-	eor	$Zhh,$Thh,$Zhh,lsr#4
-	and	$nhi,$nlo,#0xf0
-	eor	$Zhh,$Zhh,$Tll,lsl#16
-	and	$nlo,$nlo,#0x0f
-
-.Loop:
-	add	$Thh,$Htbl,$nlo,lsl#4
-	and	$nlo,$Zll,#0xf		@ rem
-	subs	$cnt,$cnt,#1
-	add	$nlo,$nlo,$nlo
-	ldmia	$Thh,{$Tll-$Thh}	@ load Htbl[nlo]
-	eor	$Zll,$Tll,$Zll,lsr#4
-	eor	$Zll,$Zll,$Zlh,lsl#28
-	eor	$Zlh,$Tlh,$Zlh,lsr#4
-	eor	$Zlh,$Zlh,$Zhl,lsl#28
-	ldrh	$Tll,[$rem_4bit,$nlo]	@ rem_4bit[rem]
-	eor	$Zhl,$Thl,$Zhl,lsr#4
-#ifdef	__thumb2__
-	it	pl
-#endif
-	ldrplb	$nlo,[$Xi,$cnt]
-	eor	$Zhl,$Zhl,$Zhh,lsl#28
-	eor	$Zhh,$Thh,$Zhh,lsr#4
-
-	add	$Thh,$Htbl,$nhi
-	and	$nhi,$Zll,#0xf		@ rem
-	eor	$Zhh,$Zhh,$Tll,lsl#16	@ ^= rem_4bit[rem]
-	add	$nhi,$nhi,$nhi
-	ldmia	$Thh,{$Tll-$Thh}	@ load Htbl[nhi]
-	eor	$Zll,$Tll,$Zll,lsr#4
-	eor	$Zll,$Zll,$Zlh,lsl#28
-	eor	$Zlh,$Tlh,$Zlh,lsr#4
-	ldrh	$Tll,[$rem_4bit,$nhi]	@ rem_4bit[rem]
-	eor	$Zlh,$Zlh,$Zhl,lsl#28
-	eor	$Zhl,$Thl,$Zhl,lsr#4
-	eor	$Zhl,$Zhl,$Zhh,lsl#28
-	eor	$Zhh,$Thh,$Zhh,lsr#4
-#ifdef	__thumb2__
-	itt	pl
-#endif
-	andpl	$nhi,$nlo,#0xf0
-	andpl	$nlo,$nlo,#0x0f
-	eor	$Zhh,$Zhh,$Tll,lsl#16	@ ^= rem_4bit[rem]
-	bpl	.Loop
-___
-	&Zsmash();
-$code.=<<___;
-#if __ARM_ARCH__>=5
-	ldmia	sp!,{r4-r11,pc}
-#else
-	ldmia	sp!,{r4-r11,lr}
-	tst	lr,#1
-	moveq	pc,lr			@ be binary compatible with V4, yet
-	bx	lr			@ interoperable with Thumb ISA:-)
-#endif
-.size	gcm_gmult_4bit,.-gcm_gmult_4bit
 ___
 {
 my ($Xl,$Xm,$Xh,$IN)=map("q$_",(0..3));
diff --git a/crypto/fipsmodule/modes/asm/ghash-x86_64.pl b/crypto/fipsmodule/modes/asm/ghash-x86_64.pl
index 5c4122f..a1c9220 100644
--- a/crypto/fipsmodule/modes/asm/ghash-x86_64.pl
+++ b/crypto/fipsmodule/modes/asm/ghash-x86_64.pl
@@ -90,6 +90,9 @@
 #
 # [1] http://rt.openssl.org/Ticket/Display.html?id=2900&user=guest&pass=guest
 
+# This file was patched in BoringSSL to remove the variable-time 4-bit
+# implementation.
+
 $flavour = shift;
 $output  = shift;
 if ($flavour =~ /\./) { $output = $flavour; undef $flavour; }
@@ -125,10 +128,6 @@
 $Xi="%rdi";
 $Htbl="%rsi";
 
-# per-function register layout
-$cnt="%rcx";
-$rem="%rdx";
-
 sub LB() { my $r=shift; $r =~ s/%[er]([a-d])x/%\1l/	or
 			$r =~ s/%[er]([sd]i)/%\1l/	or
 			$r =~ s/%[er](bp)/%\1l/		or
@@ -141,302 +140,15 @@
     $code .= "\t$opcode\t".join(',',$arg,reverse @_)."\n";
 }
 
-{ my $N;
-  sub loop() {
-  my $inp = shift;
-
-	$N++;
-$code.=<<___;
-	xor	$nlo,$nlo
-	xor	$nhi,$nhi
-	mov	`&LB("$Zlo")`,`&LB("$nlo")`
-	mov	`&LB("$Zlo")`,`&LB("$nhi")`
-	shl	\$4,`&LB("$nlo")`
-	mov	\$14,$cnt
-	mov	8($Htbl,$nlo),$Zlo
-	mov	($Htbl,$nlo),$Zhi
-	and	\$0xf0,`&LB("$nhi")`
-	mov	$Zlo,$rem
-	jmp	.Loop$N
-
-.align	16
-.Loop$N:
-	shr	\$4,$Zlo
-	and	\$0xf,$rem
-	mov	$Zhi,$tmp
-	mov	($inp,$cnt),`&LB("$nlo")`
-	shr	\$4,$Zhi
-	xor	8($Htbl,$nhi),$Zlo
-	shl	\$60,$tmp
-	xor	($Htbl,$nhi),$Zhi
-	mov	`&LB("$nlo")`,`&LB("$nhi")`
-	xor	($rem_4bit,$rem,8),$Zhi
-	mov	$Zlo,$rem
-	shl	\$4,`&LB("$nlo")`
-	xor	$tmp,$Zlo
-	dec	$cnt
-	js	.Lbreak$N
-
-	shr	\$4,$Zlo
-	and	\$0xf,$rem
-	mov	$Zhi,$tmp
-	shr	\$4,$Zhi
-	xor	8($Htbl,$nlo),$Zlo
-	shl	\$60,$tmp
-	xor	($Htbl,$nlo),$Zhi
-	and	\$0xf0,`&LB("$nhi")`
-	xor	($rem_4bit,$rem,8),$Zhi
-	mov	$Zlo,$rem
-	xor	$tmp,$Zlo
-	jmp	.Loop$N
-
-.align	16
-.Lbreak$N:
-	shr	\$4,$Zlo
-	and	\$0xf,$rem
-	mov	$Zhi,$tmp
-	shr	\$4,$Zhi
-	xor	8($Htbl,$nlo),$Zlo
-	shl	\$60,$tmp
-	xor	($Htbl,$nlo),$Zhi
-	and	\$0xf0,`&LB("$nhi")`
-	xor	($rem_4bit,$rem,8),$Zhi
-	mov	$Zlo,$rem
-	xor	$tmp,$Zlo
-
-	shr	\$4,$Zlo
-	and	\$0xf,$rem
-	mov	$Zhi,$tmp
-	shr	\$4,$Zhi
-	xor	8($Htbl,$nhi),$Zlo
-	shl	\$60,$tmp
-	xor	($Htbl,$nhi),$Zhi
-	xor	$tmp,$Zlo
-	xor	($rem_4bit,$rem,8),$Zhi
-
-	bswap	$Zlo
-	bswap	$Zhi
-___
-}}
-
 $code=<<___;
 .text
 .extern	OPENSSL_ia32cap_P
-
-.globl	gcm_gmult_4bit
-.type	gcm_gmult_4bit,\@function,2
-.align	16
-gcm_gmult_4bit:
-.cfi_startproc
-	push	%rbx
-.cfi_push	%rbx
-	push	%rbp		# %rbp and others are pushed exclusively in
-.cfi_push	%rbp
-	push	%r12		# order to reuse Win64 exception handler...
-.cfi_push	%r12
-	push	%r13
-.cfi_push	%r13
-	push	%r14
-.cfi_push	%r14
-	push	%r15
-.cfi_push	%r15
-	sub	\$280,%rsp
-.cfi_adjust_cfa_offset	280
-.Lgmult_prologue:
-
-	movzb	15($Xi),$Zlo
-	lea	.Lrem_4bit(%rip),$rem_4bit
-___
-	&loop	($Xi);
-$code.=<<___;
-	mov	$Zlo,8($Xi)
-	mov	$Zhi,($Xi)
-
-	lea	280+48(%rsp),%rsi
-.cfi_def_cfa	%rsi,8
-	mov	-8(%rsi),%rbx
-.cfi_restore	%rbx
-	lea	(%rsi),%rsp
-.cfi_def_cfa_register	%rsp
-.Lgmult_epilogue:
-	ret
-.cfi_endproc
-.size	gcm_gmult_4bit,.-gcm_gmult_4bit
 ___
 
 # per-function register layout
 $inp="%rdx";
 $len="%rcx";
-$rem_8bit=$rem_4bit;
 
-$code.=<<___;
-.globl	gcm_ghash_4bit
-.type	gcm_ghash_4bit,\@function,4
-.align	16
-gcm_ghash_4bit:
-.cfi_startproc
-	push	%rbx
-.cfi_push	%rbx
-	push	%rbp
-.cfi_push	%rbp
-	push	%r12
-.cfi_push	%r12
-	push	%r13
-.cfi_push	%r13
-	push	%r14
-.cfi_push	%r14
-	push	%r15
-.cfi_push	%r15
-	sub	\$280,%rsp
-.cfi_adjust_cfa_offset	280
-.Lghash_prologue:
-	mov	$inp,%r14		# reassign couple of args
-	mov	$len,%r15
-___
-{ my $inp="%r14";
-  my $dat="%edx";
-  my $len="%r15";
-  my @nhi=("%ebx","%ecx");
-  my @rem=("%r12","%r13");
-  my $Hshr4="%rbp";
-
-	&sub	($Htbl,-128);		# size optimization
-	&lea	($Hshr4,"16+128(%rsp)");
-	{ my @lo =($nlo,$nhi);
-          my @hi =($Zlo,$Zhi);
-
-	  &xor	($dat,$dat);
-	  for ($i=0,$j=-2;$i<18;$i++,$j++) {
-	    &mov	("$j(%rsp)",&LB($dat))		if ($i>1);
-	    &or		($lo[0],$tmp)			if ($i>1);
-	    &mov	(&LB($dat),&LB($lo[1]))		if ($i>0 && $i<17);
-	    &shr	($lo[1],4)			if ($i>0 && $i<17);
-	    &mov	($tmp,$hi[1])			if ($i>0 && $i<17);
-	    &shr	($hi[1],4)			if ($i>0 && $i<17);
-	    &mov	("8*$j($Hshr4)",$hi[0])		if ($i>1);
-	    &mov	($hi[0],"16*$i+0-128($Htbl)")	if ($i<16);
-	    &shl	(&LB($dat),4)			if ($i>0 && $i<17);
-	    &mov	("8*$j-128($Hshr4)",$lo[0])	if ($i>1);
-	    &mov	($lo[0],"16*$i+8-128($Htbl)")	if ($i<16);
-	    &shl	($tmp,60)			if ($i>0 && $i<17);
-
-	    push	(@lo,shift(@lo));
-	    push	(@hi,shift(@hi));
-	  }
-	}
-	&add	($Htbl,-128);
-	&mov	($Zlo,"8($Xi)");
-	&mov	($Zhi,"0($Xi)");
-	&add	($len,$inp);		# pointer to the end of data
-	&lea	($rem_8bit,".Lrem_8bit(%rip)");
-	&jmp	(".Louter_loop");
-
-$code.=".align	16\n.Louter_loop:\n";
-	&xor	($Zhi,"($inp)");
-	&mov	("%rdx","8($inp)");
-	&lea	($inp,"16($inp)");
-	&xor	("%rdx",$Zlo);
-	&mov	("($Xi)",$Zhi);
-	&mov	("8($Xi)","%rdx");
-	&shr	("%rdx",32);
-
-	&xor	($nlo,$nlo);
-	&rol	($dat,8);
-	&mov	(&LB($nlo),&LB($dat));
-	&movz	($nhi[0],&LB($dat));
-	&shl	(&LB($nlo),4);
-	&shr	($nhi[0],4);
-
-	for ($j=11,$i=0;$i<15;$i++) {
-	    &rol	($dat,8);
-	    &xor	($Zlo,"8($Htbl,$nlo)")			if ($i>0);
-	    &xor	($Zhi,"($Htbl,$nlo)")			if ($i>0);
-	    &mov	($Zlo,"8($Htbl,$nlo)")			if ($i==0);
-	    &mov	($Zhi,"($Htbl,$nlo)")			if ($i==0);
-
-	    &mov	(&LB($nlo),&LB($dat));
-	    &xor	($Zlo,$tmp)				if ($i>0);
-	    &movzw	($rem[1],"($rem_8bit,$rem[1],2)")	if ($i>0);
-
-	    &movz	($nhi[1],&LB($dat));
-	    &shl	(&LB($nlo),4);
-	    &movzb	($rem[0],"(%rsp,$nhi[0])");
-
-	    &shr	($nhi[1],4)				if ($i<14);
-	    &and	($nhi[1],0xf0)				if ($i==14);
-	    &shl	($rem[1],48)				if ($i>0);
-	    &xor	($rem[0],$Zlo);
-
-	    &mov	($tmp,$Zhi);
-	    &xor	($Zhi,$rem[1])				if ($i>0);
-	    &shr	($Zlo,8);
-
-	    &movz	($rem[0],&LB($rem[0]));
-	    &mov	($dat,"$j($Xi)")			if (--$j%4==0);
-	    &shr	($Zhi,8);
-
-	    &xor	($Zlo,"-128($Hshr4,$nhi[0],8)");
-	    &shl	($tmp,56);
-	    &xor	($Zhi,"($Hshr4,$nhi[0],8)");
-
-	    unshift	(@nhi,pop(@nhi));		# "rotate" registers
-	    unshift	(@rem,pop(@rem));
-	}
-	&movzw	($rem[1],"($rem_8bit,$rem[1],2)");
-	&xor	($Zlo,"8($Htbl,$nlo)");
-	&xor	($Zhi,"($Htbl,$nlo)");
-
-	&shl	($rem[1],48);
-	&xor	($Zlo,$tmp);
-
-	&xor	($Zhi,$rem[1]);
-	&movz	($rem[0],&LB($Zlo));
-	&shr	($Zlo,4);
-
-	&mov	($tmp,$Zhi);
-	&shl	(&LB($rem[0]),4);
-	&shr	($Zhi,4);
-
-	&xor	($Zlo,"8($Htbl,$nhi[0])");
-	&movzw	($rem[0],"($rem_8bit,$rem[0],2)");
-	&shl	($tmp,60);
-
-	&xor	($Zhi,"($Htbl,$nhi[0])");
-	&xor	($Zlo,$tmp);
-	&shl	($rem[0],48);
-
-	&bswap	($Zlo);
-	&xor	($Zhi,$rem[0]);
-
-	&bswap	($Zhi);
-	&cmp	($inp,$len);
-	&jb	(".Louter_loop");
-}
-$code.=<<___;
-	mov	$Zlo,8($Xi)
-	mov	$Zhi,($Xi)
-
-	lea	280+48(%rsp),%rsi
-.cfi_def_cfa	%rsi,8
-	mov	-48(%rsi),%r15
-.cfi_restore	%r15
-	mov	-40(%rsi),%r14
-.cfi_restore	%r14
-	mov	-32(%rsi),%r13
-.cfi_restore	%r13
-	mov	-24(%rsi),%r12
-.cfi_restore	%r12
-	mov	-16(%rsi),%rbp
-.cfi_restore	%rbp
-	mov	-8(%rsi),%rbx
-.cfi_restore	%rbx
-	lea	0(%rsi),%rsp
-.cfi_def_cfa_register	%rsp
-.Lghash_epilogue:
-	ret
-.cfi_endproc
-.size	gcm_ghash_4bit,.-gcm_ghash_4bit
-___
 
 ######################################################################
 # PCLMULQDQ version.
@@ -1599,158 +1311,15 @@
 .L7_mask_poly:
 	.long	7,0,`0xE1<<1`,0
 .align	64
-.type	.Lrem_4bit,\@object
-.Lrem_4bit:
-	.long	0,`0x0000<<16`,0,`0x1C20<<16`,0,`0x3840<<16`,0,`0x2460<<16`
-	.long	0,`0x7080<<16`,0,`0x6CA0<<16`,0,`0x48C0<<16`,0,`0x54E0<<16`
-	.long	0,`0xE100<<16`,0,`0xFD20<<16`,0,`0xD940<<16`,0,`0xC560<<16`
-	.long	0,`0x9180<<16`,0,`0x8DA0<<16`,0,`0xA9C0<<16`,0,`0xB5E0<<16`
-.type	.Lrem_8bit,\@object
-.Lrem_8bit:
-	.value	0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E
-	.value	0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E
-	.value	0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E
-	.value	0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E
-	.value	0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E
-	.value	0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E
-	.value	0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E
-	.value	0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E
-	.value	0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE
-	.value	0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE
-	.value	0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE
-	.value	0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE
-	.value	0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E
-	.value	0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E
-	.value	0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE
-	.value	0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE
-	.value	0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E
-	.value	0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E
-	.value	0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E
-	.value	0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E
-	.value	0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E
-	.value	0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E
-	.value	0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E
-	.value	0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E
-	.value	0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE
-	.value	0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE
-	.value	0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE
-	.value	0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE
-	.value	0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E
-	.value	0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E
-	.value	0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE
-	.value	0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE
 
 .asciz	"GHASH for x86_64, CRYPTOGAMS by <appro\@openssl.org>"
 .align	64
 ___
 
-# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
-#		CONTEXT *context,DISPATCHER_CONTEXT *disp)
 if ($win64) {
-$rec="%rcx";
-$frame="%rdx";
-$context="%r8";
-$disp="%r9";
-
 $code.=<<___;
-.extern	__imp_RtlVirtualUnwind
-.type	se_handler,\@abi-omnipotent
-.align	16
-se_handler:
-	push	%rsi
-	push	%rdi
-	push	%rbx
-	push	%rbp
-	push	%r12
-	push	%r13
-	push	%r14
-	push	%r15
-	pushfq
-	sub	\$64,%rsp
-
-	mov	120($context),%rax	# pull context->Rax
-	mov	248($context),%rbx	# pull context->Rip
-
-	mov	8($disp),%rsi		# disp->ImageBase
-	mov	56($disp),%r11		# disp->HandlerData
-
-	mov	0(%r11),%r10d		# HandlerData[0]
-	lea	(%rsi,%r10),%r10	# prologue label
-	cmp	%r10,%rbx		# context->Rip<prologue label
-	jb	.Lin_prologue
-
-	mov	152($context),%rax	# pull context->Rsp
-
-	mov	4(%r11),%r10d		# HandlerData[1]
-	lea	(%rsi,%r10),%r10	# epilogue label
-	cmp	%r10,%rbx		# context->Rip>=epilogue label
-	jae	.Lin_prologue
-
-	lea	48+280(%rax),%rax	# adjust "rsp"
-
-	mov	-8(%rax),%rbx
-	mov	-16(%rax),%rbp
-	mov	-24(%rax),%r12
-	mov	-32(%rax),%r13
-	mov	-40(%rax),%r14
-	mov	-48(%rax),%r15
-	mov	%rbx,144($context)	# restore context->Rbx
-	mov	%rbp,160($context)	# restore context->Rbp
-	mov	%r12,216($context)	# restore context->R12
-	mov	%r13,224($context)	# restore context->R13
-	mov	%r14,232($context)	# restore context->R14
-	mov	%r15,240($context)	# restore context->R15
-
-.Lin_prologue:
-	mov	8(%rax),%rdi
-	mov	16(%rax),%rsi
-	mov	%rax,152($context)	# restore context->Rsp
-	mov	%rsi,168($context)	# restore context->Rsi
-	mov	%rdi,176($context)	# restore context->Rdi
-
-	mov	40($disp),%rdi		# disp->ContextRecord
-	mov	$context,%rsi		# context
-	mov	\$`1232/8`,%ecx		# sizeof(CONTEXT)
-	.long	0xa548f3fc		# cld; rep movsq
-
-	mov	$disp,%rsi
-	xor	%rcx,%rcx		# arg1, UNW_FLAG_NHANDLER
-	mov	8(%rsi),%rdx		# arg2, disp->ImageBase
-	mov	0(%rsi),%r8		# arg3, disp->ControlPc
-	mov	16(%rsi),%r9		# arg4, disp->FunctionEntry
-	mov	40(%rsi),%r10		# disp->ContextRecord
-	lea	56(%rsi),%r11		# &disp->HandlerData
-	lea	24(%rsi),%r12		# &disp->EstablisherFrame
-	mov	%r10,32(%rsp)		# arg5
-	mov	%r11,40(%rsp)		# arg6
-	mov	%r12,48(%rsp)		# arg7
-	mov	%rcx,56(%rsp)		# arg8, (NULL)
-	call	*__imp_RtlVirtualUnwind(%rip)
-
-	mov	\$1,%eax		# ExceptionContinueSearch
-	add	\$64,%rsp
-	popfq
-	pop	%r15
-	pop	%r14
-	pop	%r13
-	pop	%r12
-	pop	%rbp
-	pop	%rbx
-	pop	%rdi
-	pop	%rsi
-	ret
-.size	se_handler,.-se_handler
-
 .section	.pdata
 .align	4
-	.rva	.LSEH_begin_gcm_gmult_4bit
-	.rva	.LSEH_end_gcm_gmult_4bit
-	.rva	.LSEH_info_gcm_gmult_4bit
-
-	.rva	.LSEH_begin_gcm_ghash_4bit
-	.rva	.LSEH_end_gcm_ghash_4bit
-	.rva	.LSEH_info_gcm_ghash_4bit
-
 	.rva	.LSEH_begin_gcm_init_clmul
 	.rva	.LSEH_end_gcm_init_clmul
 	.rva	.LSEH_info_gcm_init_clmul
@@ -1771,14 +1340,6 @@
 $code.=<<___;
 .section	.xdata
 .align	8
-.LSEH_info_gcm_gmult_4bit:
-	.byte	9,0,0,0
-	.rva	se_handler
-	.rva	.Lgmult_prologue,.Lgmult_epilogue	# HandlerData
-.LSEH_info_gcm_ghash_4bit:
-	.byte	9,0,0,0
-	.rva	se_handler
-	.rva	.Lghash_prologue,.Lghash_epilogue	# HandlerData
 .LSEH_info_gcm_init_clmul:
 	.byte	0x01,0x08,0x03,0x00
 	.byte	0x08,0x68,0x00,0x00	#movaps	0x00(rsp),xmm6
diff --git a/crypto/fipsmodule/modes/gcm.c b/crypto/fipsmodule/modes/gcm.c
index 3860ebe..51e7b73 100644
--- a/crypto/fipsmodule/modes/gcm.c
+++ b/crypto/fipsmodule/modes/gcm.c
@@ -58,7 +58,6 @@
 #include "../../internal.h"
 
 
-#define PACK(s) ((size_t)(s) << (sizeof(size_t) * 8 - 16))
 #define REDUCE1BIT(V)                                                 \
   do {                                                                \
     if (sizeof(size_t) == 8) {                                        \
@@ -104,138 +103,11 @@
   Htable[13].hi = V.hi ^ Htable[5].hi, Htable[13].lo = V.lo ^ Htable[5].lo;
   Htable[14].hi = V.hi ^ Htable[6].hi, Htable[14].lo = V.lo ^ Htable[6].lo;
   Htable[15].hi = V.hi ^ Htable[7].hi, Htable[15].lo = V.lo ^ Htable[7].lo;
-
-#if defined(GHASH_ASM) && defined(OPENSSL_ARM)
-  for (int j = 0; j < 16; ++j) {
-    V = Htable[j];
-    Htable[j].hi = V.lo;
-    Htable[j].lo = V.hi;
-  }
-#endif
 }
 
-#if !defined(GHASH_ASM) || defined(OPENSSL_AARCH64) || defined(OPENSSL_PPC64LE)
-static const size_t rem_4bit[16] = {
-    PACK(0x0000), PACK(0x1C20), PACK(0x3840), PACK(0x2460),
-    PACK(0x7080), PACK(0x6CA0), PACK(0x48C0), PACK(0x54E0),
-    PACK(0xE100), PACK(0xFD20), PACK(0xD940), PACK(0xC560),
-    PACK(0x9180), PACK(0x8DA0), PACK(0xA9C0), PACK(0xB5E0)};
-
-void gcm_gmult_4bit(uint64_t Xi[2], const u128 Htable[16]) {
-  u128 Z;
-  int cnt = 15;
-  size_t rem, nlo, nhi;
-
-  nlo = ((const uint8_t *)Xi)[15];
-  nhi = nlo >> 4;
-  nlo &= 0xf;
-
-  Z.hi = Htable[nlo].hi;
-  Z.lo = Htable[nlo].lo;
-
-  while (1) {
-    rem = (size_t)Z.lo & 0xf;
-    Z.lo = (Z.hi << 60) | (Z.lo >> 4);
-    Z.hi = (Z.hi >> 4);
-    if (sizeof(size_t) == 8) {
-      Z.hi ^= rem_4bit[rem];
-    } else {
-      Z.hi ^= (uint64_t)rem_4bit[rem] << 32;
-    }
-
-    Z.hi ^= Htable[nhi].hi;
-    Z.lo ^= Htable[nhi].lo;
-
-    if (--cnt < 0) {
-      break;
-    }
-
-    nlo = ((const uint8_t *)Xi)[cnt];
-    nhi = nlo >> 4;
-    nlo &= 0xf;
-
-    rem = (size_t)Z.lo & 0xf;
-    Z.lo = (Z.hi << 60) | (Z.lo >> 4);
-    Z.hi = (Z.hi >> 4);
-    if (sizeof(size_t) == 8) {
-      Z.hi ^= rem_4bit[rem];
-    } else {
-      Z.hi ^= (uint64_t)rem_4bit[rem] << 32;
-    }
-
-    Z.hi ^= Htable[nlo].hi;
-    Z.lo ^= Htable[nlo].lo;
-  }
-
-  Xi[0] = CRYPTO_bswap8(Z.hi);
-  Xi[1] = CRYPTO_bswap8(Z.lo);
-}
-
-// Streamed gcm_mult_4bit, see CRYPTO_gcm128_[en|de]crypt for
-// details... Compiler-generated code doesn't seem to give any
-// performance improvement, at least not on x86[_64]. It's here
-// mostly as reference and a placeholder for possible future
-// non-trivial optimization[s]...
-void gcm_ghash_4bit(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
-                    size_t len) {
-  u128 Z;
-  int cnt;
-  size_t rem, nlo, nhi;
-
-  do {
-    cnt = 15;
-    nlo = ((const uint8_t *)Xi)[15];
-    nlo ^= inp[15];
-    nhi = nlo >> 4;
-    nlo &= 0xf;
-
-    Z.hi = Htable[nlo].hi;
-    Z.lo = Htable[nlo].lo;
-
-    while (1) {
-      rem = (size_t)Z.lo & 0xf;
-      Z.lo = (Z.hi << 60) | (Z.lo >> 4);
-      Z.hi = (Z.hi >> 4);
-      if (sizeof(size_t) == 8) {
-        Z.hi ^= rem_4bit[rem];
-      } else {
-        Z.hi ^= (uint64_t)rem_4bit[rem] << 32;
-      }
-
-      Z.hi ^= Htable[nhi].hi;
-      Z.lo ^= Htable[nhi].lo;
-
-      if (--cnt < 0) {
-        break;
-      }
-
-      nlo = ((const uint8_t *)Xi)[cnt];
-      nlo ^= inp[cnt];
-      nhi = nlo >> 4;
-      nlo &= 0xf;
-
-      rem = (size_t)Z.lo & 0xf;
-      Z.lo = (Z.hi << 60) | (Z.lo >> 4);
-      Z.hi = (Z.hi >> 4);
-      if (sizeof(size_t) == 8) {
-        Z.hi ^= rem_4bit[rem];
-      } else {
-        Z.hi ^= (uint64_t)rem_4bit[rem] << 32;
-      }
-
-      Z.hi ^= Htable[nlo].hi;
-      Z.lo ^= Htable[nlo].lo;
-    }
-
-    Xi[0] = CRYPTO_bswap8(Z.hi);
-    Xi[1] = CRYPTO_bswap8(Z.lo);
-  } while (inp += 16, len -= 16);
-}
-#endif   // !GHASH_ASM || AARCH64 || PPC64LE
-
-#define GCM_MUL(ctx, Xi) gcm_gmult_4bit((ctx)->Xi.u, (ctx)->gcm_key.Htable)
+#define GCM_MUL(ctx, Xi) gcm_gmult_nohw((ctx)->Xi.u, (ctx)->gcm_key.Htable)
 #define GHASH(ctx, in, len) \
-  gcm_ghash_4bit((ctx)->Xi.u, (ctx)->gcm_key.Htable, in, len)
+  gcm_ghash_nohw((ctx)->Xi.u, (ctx)->gcm_key.Htable, in, len)
 // GHASH_CHUNK is "stride parameter" missioned to mitigate cache
 // trashing effect. In other words idea is to hash data while it's
 // still in L1 cache after encryption pass...
@@ -268,13 +140,13 @@
 }
 #endif  // GHASH_ASM_X86_64 || GHASH_ASM_X86
 
-#ifdef GCM_FUNCREF_4BIT
+#ifdef GCM_FUNCREF
 #undef GCM_MUL
 #define GCM_MUL(ctx, Xi) (*gcm_gmult_p)((ctx)->Xi.u, (ctx)->gcm_key.Htable)
 #undef GHASH
 #define GHASH(ctx, in, len) \
   (*gcm_ghash_p)((ctx)->Xi.u, (ctx)->gcm_key.Htable, in, len)
-#endif  // GCM_FUNCREF_4BIT
+#endif  // GCM_FUNCREF
 
 void CRYPTO_ghash_init(gmult_func *out_mult, ghash_func *out_hash,
                        u128 *out_key, u128 out_table[16], int *out_is_avx,
@@ -350,13 +222,18 @@
   }
 #endif
 
-  gcm_init_4bit(out_table, H.u);
 #if defined(GHASH_ASM_X86)
+  // TODO(davidben): This implementation is not constant-time, but it is
+  // dramatically faster than |gcm_gmult_nohw|. See if we can get a
+  // constant-time SSE2 implementation to close this gap, or decide we don't
+  // care.
+  gcm_init_4bit(out_table, H.u);
   *out_mult = gcm_gmult_4bit_mmx;
   *out_hash = gcm_ghash_4bit_mmx;
 #else
-  *out_mult = gcm_gmult_4bit;
-  *out_hash = gcm_ghash_4bit;
+  gcm_init_nohw(out_table, H.u);
+  *out_mult = gcm_gmult_nohw;
+  *out_hash = gcm_ghash_nohw;
 #endif
 }
 
@@ -378,7 +255,7 @@
 
 void CRYPTO_gcm128_setiv(GCM128_CONTEXT *ctx, const AES_KEY *key,
                          const uint8_t *iv, size_t len) {
-#ifdef GCM_FUNCREF_4BIT
+#ifdef GCM_FUNCREF
   void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
       ctx->gcm_key.gmult;
 #endif
@@ -427,14 +304,12 @@
 }
 
 int CRYPTO_gcm128_aad(GCM128_CONTEXT *ctx, const uint8_t *aad, size_t len) {
-#ifdef GCM_FUNCREF_4BIT
+#ifdef GCM_FUNCREF
   void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
       ctx->gcm_key.gmult;
-#ifdef GHASH
   void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
                       size_t len) = ctx->gcm_key.ghash;
 #endif
-#endif
 
   if (ctx->len.u[1]) {
     return 0;
@@ -484,7 +359,7 @@
 int CRYPTO_gcm128_encrypt(GCM128_CONTEXT *ctx, const AES_KEY *key,
                           const uint8_t *in, uint8_t *out, size_t len) {
   block128_f block = ctx->gcm_key.block;
-#ifdef GCM_FUNCREF_4BIT
+#ifdef GCM_FUNCREF
   void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
       ctx->gcm_key.gmult;
   void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -572,7 +447,7 @@
                           const unsigned char *in, unsigned char *out,
                           size_t len) {
   block128_f block = ctx->gcm_key.block;
-#ifdef GCM_FUNCREF_4BIT
+#ifdef GCM_FUNCREF
   void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
       ctx->gcm_key.gmult;
   void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -663,7 +538,7 @@
 int CRYPTO_gcm128_encrypt_ctr32(GCM128_CONTEXT *ctx, const AES_KEY *key,
                                 const uint8_t *in, uint8_t *out, size_t len,
                                 ctr128_f stream) {
-#ifdef GCM_FUNCREF_4BIT
+#ifdef GCM_FUNCREF
   void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
       ctx->gcm_key.gmult;
   void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -749,7 +624,7 @@
 int CRYPTO_gcm128_decrypt_ctr32(GCM128_CONTEXT *ctx, const AES_KEY *key,
                                 const uint8_t *in, uint8_t *out, size_t len,
                                 ctr128_f stream) {
-#ifdef GCM_FUNCREF_4BIT
+#ifdef GCM_FUNCREF
   void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
       ctx->gcm_key.gmult;
   void (*gcm_ghash_p)(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -837,7 +712,7 @@
 }
 
 int CRYPTO_gcm128_finish(GCM128_CONTEXT *ctx, const uint8_t *tag, size_t len) {
-#ifdef GCM_FUNCREF_4BIT
+#ifdef GCM_FUNCREF
   void (*gcm_gmult_p)(uint64_t Xi[2], const u128 Htable[16]) =
       ctx->gcm_key.gmult;
 #endif
@@ -868,7 +743,7 @@
 
 #if defined(OPENSSL_X86) || defined(OPENSSL_X86_64)
 int crypto_gcm_clmul_enabled(void) {
-#ifdef GHASH_ASM
+#if defined(GHASH_ASM_X86) || defined(GHASH_ASM_X86_64)
   const uint32_t *ia32cap = OPENSSL_ia32cap_get();
   return (ia32cap[0] & (1 << 24)) &&  // check FXSR bit
          (ia32cap[1] & (1 << 1));     // check PCLMULQDQ bit
diff --git a/crypto/fipsmodule/modes/gcm_nohw.c b/crypto/fipsmodule/modes/gcm_nohw.c
new file mode 100644
index 0000000..2a5051a
--- /dev/null
+++ b/crypto/fipsmodule/modes/gcm_nohw.c
@@ -0,0 +1,233 @@
+/* Copyright (c) 2019, Google Inc.
+ *
+ * Permission to use, copy, modify, and/or distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
+ * SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
+ * OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
+ * CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. */
+
+#include <openssl/base.h>
+
+#include "../../internal.h"
+#include "internal.h"
+
+
+// This file contains a constant-time implementation of GHASH based on the notes
+// in https://bearssl.org/constanttime.html#ghash-for-gcm and the reduction
+// algorithm described in
+// https://crypto.stanford.edu/RealWorldCrypto/slides/gueron.pdf.
+//
+// Unlike the BearSSL notes, we use uint128_t in the 64-bit implementation. Our
+// primary compilers (clang, clang-cl, and gcc) all support it. MSVC will run
+// the 32-bit implementation, but we can use its intrinsics if necessary.
+
+#if defined(BORINGSSL_HAS_UINT128)
+
+static void gcm_mul64_nohw(uint64_t *out_lo, uint64_t *out_hi, uint64_t a,
+                           uint64_t b) {
+  // One term every four bits means the largest term is 64/4 = 16, which barely
+  // overflows into the next term. Using one term every five bits would cost 25
+  // multiplications instead of 16. It is faster to mask off the bottom four
+  // bits of |a|, giving a largest term of 60/4 = 15, and apply the bottom bits
+  // separately.
+  uint64_t a0 = a & UINT64_C(0x1111111111111110);
+  uint64_t a1 = a & UINT64_C(0x2222222222222220);
+  uint64_t a2 = a & UINT64_C(0x4444444444444440);
+  uint64_t a3 = a & UINT64_C(0x8888888888888880);
+
+  uint64_t b0 = b & UINT64_C(0x1111111111111111);
+  uint64_t b1 = b & UINT64_C(0x2222222222222222);
+  uint64_t b2 = b & UINT64_C(0x4444444444444444);
+  uint64_t b3 = b & UINT64_C(0x8888888888888888);
+
+  uint128_t c0 = (a0 * (uint128_t)b0) ^ (a1 * (uint128_t)b3) ^
+                 (a2 * (uint128_t)b2) ^ (a3 * (uint128_t)b1);
+  uint128_t c1 = (a0 * (uint128_t)b1) ^ (a1 * (uint128_t)b0) ^
+                 (a2 * (uint128_t)b3) ^ (a3 * (uint128_t)b2);
+  uint128_t c2 = (a0 * (uint128_t)b2) ^ (a1 * (uint128_t)b1) ^
+                 (a2 * (uint128_t)b0) ^ (a3 * (uint128_t)b3);
+  uint128_t c3 = (a0 * (uint128_t)b3) ^ (a1 * (uint128_t)b2) ^
+                 (a2 * (uint128_t)b1) ^ (a3 * (uint128_t)b0);
+
+  // Multiply the bottom four bits of |a| with |b|.
+  uint64_t a0_mask = UINT64_C(0) - (a & 1);
+  uint64_t a1_mask = UINT64_C(0) - ((a >> 1) & 1);
+  uint64_t a2_mask = UINT64_C(0) - ((a >> 2) & 1);
+  uint64_t a3_mask = UINT64_C(0) - ((a >> 3) & 1);
+  uint128_t extra = (a0_mask & b) ^ ((uint128_t)(a1_mask & b) << 1) ^
+                    ((uint128_t)(a2_mask & b) << 2) ^
+                    ((uint128_t)(a3_mask & b) << 3);
+
+  *out_lo = (((uint64_t)c0) & UINT64_C(0x1111111111111111)) ^
+            (((uint64_t)c1) & UINT64_C(0x2222222222222222)) ^
+            (((uint64_t)c2) & UINT64_C(0x4444444444444444)) ^
+            (((uint64_t)c3) & UINT64_C(0x8888888888888888)) ^ ((uint64_t)extra);
+  *out_hi = (((uint64_t)(c0 >> 64)) & UINT64_C(0x1111111111111111)) ^
+            (((uint64_t)(c1 >> 64)) & UINT64_C(0x2222222222222222)) ^
+            (((uint64_t)(c2 >> 64)) & UINT64_C(0x4444444444444444)) ^
+            (((uint64_t)(c3 >> 64)) & UINT64_C(0x8888888888888888)) ^
+            ((uint64_t)(extra >> 64));
+}
+
+#else  // !BORINGSSL_HAS_UINT128
+
+static uint64_t gcm_mul32_nohw(uint32_t a, uint32_t b) {
+  // One term every four bits means the largest term is 32/4 = 8, which does not
+  // overflow into the next term.
+  uint32_t a0 = a & 0x11111111;
+  uint32_t a1 = a & 0x22222222;
+  uint32_t a2 = a & 0x44444444;
+  uint32_t a3 = a & 0x88888888;
+
+  uint32_t b0 = b & 0x11111111;
+  uint32_t b1 = b & 0x22222222;
+  uint32_t b2 = b & 0x44444444;
+  uint32_t b3 = b & 0x88888888;
+
+  uint64_t c0 = (a0 * (uint64_t)b0) ^ (a1 * (uint64_t)b3) ^
+                (a2 * (uint64_t)b2) ^ (a3 * (uint64_t)b1);
+  uint64_t c1 = (a0 * (uint64_t)b1) ^ (a1 * (uint64_t)b0) ^
+                (a2 * (uint64_t)b3) ^ (a3 * (uint64_t)b2);
+  uint64_t c2 = (a0 * (uint64_t)b2) ^ (a1 * (uint64_t)b1) ^
+                (a2 * (uint64_t)b0) ^ (a3 * (uint64_t)b3);
+  uint64_t c3 = (a0 * (uint64_t)b3) ^ (a1 * (uint64_t)b2) ^
+                (a2 * (uint64_t)b1) ^ (a3 * (uint64_t)b0);
+
+  return (c0 & UINT64_C(0x1111111111111111)) |
+         (c1 & UINT64_C(0x2222222222222222)) |
+         (c2 & UINT64_C(0x4444444444444444)) |
+         (c3 & UINT64_C(0x8888888888888888));
+}
+
+static void gcm_mul64_nohw(uint64_t *out_lo, uint64_t *out_hi, uint64_t a,
+                           uint64_t b) {
+  uint32_t a0 = a & 0xffffffff;
+  uint32_t a1 = a >> 32;
+  uint32_t b0 = b & 0xffffffff;
+  uint32_t b1 = b >> 32;
+  // Karatsuba multiplication.
+  uint64_t lo = gcm_mul32_nohw(a0, b0);
+  uint64_t hi = gcm_mul32_nohw(a1, b1);
+  uint64_t mid = gcm_mul32_nohw(a0 ^ a1, b0 ^ b1) ^ lo ^ hi;
+  *out_lo = lo ^ (mid << 32);
+  *out_hi = hi ^ (mid >> 32);
+}
+
+#endif  // BORINGSSL_HAS_UINT128
+
+void gcm_init_nohw(u128 Htable[16], const uint64_t Xi[2]) {
+  // We implement GHASH in terms of POLYVAL, as described in RFC8452. This
+  // avoids a shift by 1 in the multiplication, needed to account for bit
+  // reversal losing a bit after multiplication, that is,
+  // rev128(X) * rev128(Y) = rev255(X*Y).
+  //
+  // Per Appendix A, we run mulX_POLYVAL. Note this is the same transformation
+  // applied by |gcm_init_clmul|, etc. Note |Xi| has already been byteswapped.
+  //
+  // See also slide 16 of
+  // https://crypto.stanford.edu/RealWorldCrypto/slides/gueron.pdf
+  Htable[0].lo = Xi[1];
+  Htable[0].hi = Xi[0];
+
+  uint64_t carry = Htable[0].hi >> 63;
+  carry = 0u - carry;
+
+  Htable[0].hi <<= 1;
+  Htable[0].hi |= Htable[0].lo >> 63;
+  Htable[0].lo <<= 1;
+
+  // The irreducible polynomial is 1 + x^121 + x^126 + x^127 + x^128, so we
+  // conditionally add 0xc200...0001.
+  Htable[0].lo ^= carry & 1;
+  Htable[0].hi ^= carry & UINT64_C(0xc200000000000000);
+
+  // This implementation does not use the rest of |Htable|.
+}
+
+static void gcm_polyval_nohw(uint64_t Xi[2], const u128 *H) {
+  // Karatsuba multiplication. The product of |Xi| and |H| is stored in |r0|
+  // through |r3|. Note there is no byte or bit reversal because we are
+  // evaluating POLYVAL.
+  uint64_t r0, r1;
+  gcm_mul64_nohw(&r0, &r1, Xi[0], H->lo);
+  uint64_t r2, r3;
+  gcm_mul64_nohw(&r2, &r3, Xi[1], H->hi);
+  uint64_t mid0, mid1;
+  gcm_mul64_nohw(&mid0, &mid1, Xi[0] ^ Xi[1], H->hi ^ H->lo);
+  mid0 ^= r0 ^ r2;
+  mid1 ^= r1 ^ r3;
+  r2 ^= mid1;
+  r1 ^= mid0;
+
+  // Now we multiply our 256-bit result by x^-128 and reduce. |r2| and
+  // |r3| shift into position and we must multiply |r0| and |r1| by x^-128. We
+  // have:
+  //
+  //       1 = x^121 + x^126 + x^127 + x^128
+  //  x^-128 = x^-7 + x^-2 + x^-1 + 1
+  //
+  // This is the GHASH reduction step, but with bits flowing in reverse.
+
+  // The x^-7, x^-2, and x^-1 terms shift bits past x^0, which would require
+  // another reduction step. Instead, we gather the excess bits, incorporate
+  // them into |r0| and |r1| and reduce once. See slides 17-19
+  // of https://crypto.stanford.edu/RealWorldCrypto/slides/gueron.pdf.
+  r1 ^= (r0 << 63) ^ (r0 << 62) ^ (r0 << 57);
+
+  // 1
+  r2 ^= r0;
+  r3 ^= r1;
+
+  // x^-1
+  r2 ^= r0 >> 1;
+  r2 ^= r1 << 63;
+  r3 ^= r1 >> 1;
+
+  // x^-2
+  r2 ^= r0 >> 2;
+  r2 ^= r1 << 62;
+  r3 ^= r1 >> 2;
+
+  // x^-7
+  r2 ^= r0 >> 7;
+  r2 ^= r1 << 57;
+  r3 ^= r1 >> 7;
+
+  Xi[0] = r2;
+  Xi[1] = r3;
+}
+
+void gcm_gmult_nohw(uint64_t Xi[2], const u128 Htable[16]) {
+  uint64_t swapped[2];
+  swapped[0] = CRYPTO_bswap8(Xi[1]);
+  swapped[1] = CRYPTO_bswap8(Xi[0]);
+  gcm_polyval_nohw(swapped, &Htable[0]);
+  Xi[0] = CRYPTO_bswap8(swapped[1]);
+  Xi[1] = CRYPTO_bswap8(swapped[0]);
+}
+
+void gcm_ghash_nohw(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
+                    size_t len) {
+  uint64_t swapped[2];
+  swapped[0] = CRYPTO_bswap8(Xi[1]);
+  swapped[1] = CRYPTO_bswap8(Xi[0]);
+
+  while (len >= 16) {
+    uint64_t block[2];
+    OPENSSL_memcpy(block, inp, 16);
+    swapped[0] ^= CRYPTO_bswap8(block[1]);
+    swapped[1] ^= CRYPTO_bswap8(block[0]);
+    gcm_polyval_nohw(swapped, &Htable[0]);
+    inp += 16;
+    len -= 16;
+  }
+
+  Xi[0] = CRYPTO_bswap8(swapped[1]);
+  Xi[1] = CRYPTO_bswap8(swapped[0]);
+}
diff --git a/crypto/fipsmodule/modes/gcm_test.cc b/crypto/fipsmodule/modes/gcm_test.cc
index b2e805c..a25c17a 100644
--- a/crypto/fipsmodule/modes/gcm_test.cc
+++ b/crypto/fipsmodule/modes/gcm_test.cc
@@ -119,7 +119,7 @@
             CRYPTO_bswap8(UINT64_C(0x0102030405060708)));
 }
 
-#if defined(SUPPORTS_ABI_TEST) && defined(GHASH_ASM)
+#if defined(SUPPORTS_ABI_TEST) && !defined(OPENSSL_NO_ASM)
 TEST(GCMTest, ABI) {
   static const uint64_t kH[2] = {
       UINT64_C(0x66e94bd4ef8a2c3b),
@@ -135,17 +135,12 @@
   };
 
   alignas(16) u128 Htable[16];
-  CHECK_ABI(gcm_init_4bit, Htable, kH);
 #if defined(GHASH_ASM_X86)
+  CHECK_ABI(gcm_init_4bit, Htable, kH);
   CHECK_ABI(gcm_gmult_4bit_mmx, X, Htable);
   for (size_t blocks : kBlockCounts) {
     CHECK_ABI(gcm_ghash_4bit_mmx, X, Htable, buf, 16 * blocks);
   }
-#else
-  CHECK_ABI(gcm_gmult_4bit, X, Htable);
-  for (size_t blocks : kBlockCounts) {
-    CHECK_ABI(gcm_ghash_4bit, X, Htable, buf, 16 * blocks);
-  }
 #endif  // GHASH_ASM_X86
 
 #if defined(GHASH_ASM_X86) || defined(GHASH_ASM_X86_64)
@@ -222,4 +217,4 @@
   }
 #endif  // GHASH_ASM_ARM
 }
-#endif  // SUPPORTS_ABI_TEST && GHASH_ASM
+#endif  // SUPPORTS_ABI_TEST && !OPENSSL_NO_ASM
diff --git a/crypto/fipsmodule/modes/internal.h b/crypto/fipsmodule/modes/internal.h
index 0971a90..a3135ab 100644
--- a/crypto/fipsmodule/modes/internal.h
+++ b/crypto/fipsmodule/modes/internal.h
@@ -261,22 +261,17 @@
 
 // GCM assembly.
 
-#if !defined(OPENSSL_NO_ASM) &&                         \
-    (defined(OPENSSL_X86) || defined(OPENSSL_X86_64) || \
-     defined(OPENSSL_ARM) || defined(OPENSSL_AARCH64) || \
-     defined(OPENSSL_PPC64LE))
-#define GHASH_ASM
-#endif
-
 void gcm_init_4bit(u128 Htable[16], const uint64_t H[2]);
-void gcm_gmult_4bit(uint64_t Xi[2], const u128 Htable[16]);
-void gcm_ghash_4bit(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
+
+void gcm_init_nohw(u128 Htable[16], const uint64_t H[2]);
+void gcm_gmult_nohw(uint64_t Xi[2], const u128 Htable[16]);
+void gcm_ghash_nohw(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
                     size_t len);
 
-#if defined(GHASH_ASM)
+#if !defined(OPENSSL_NO_ASM)
 
 #if defined(OPENSSL_X86) || defined(OPENSSL_X86_64)
-#define GCM_FUNCREF_4BIT
+#define GCM_FUNCREF
 void gcm_init_clmul(u128 Htable[16], const uint64_t Xi[2]);
 void gcm_gmult_clmul(uint64_t Xi[2], const u128 Htable[16]);
 void gcm_ghash_clmul(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
@@ -316,7 +311,7 @@
 
 #elif defined(OPENSSL_ARM) || defined(OPENSSL_AARCH64)
 #define GHASH_ASM_ARM
-#define GCM_FUNCREF_4BIT
+#define GCM_FUNCREF
 
 OPENSSL_INLINE int gcm_pmull_capable(void) {
   return CRYPTO_is_ARMv8_PMULL_capable();
@@ -336,13 +331,13 @@
 
 #elif defined(OPENSSL_PPC64LE)
 #define GHASH_ASM_PPC64LE
-#define GCM_FUNCREF_4BIT
+#define GCM_FUNCREF
 void gcm_init_p8(u128 Htable[16], const uint64_t Xi[2]);
 void gcm_gmult_p8(uint64_t Xi[2], const u128 Htable[16]);
 void gcm_ghash_p8(uint64_t Xi[2], const u128 Htable[16], const uint8_t *inp,
                   size_t len);
 #endif
-#endif  // GHASH_ASM
+#endif  // OPENSSL_NO_ASM
 
 
 // CBC.