Build 32-bit assembly with SSE2 enabled.

This affects bignum and sha. Also now that we're passing the SSE2 flag, revert
the change to ghash-x86.pl which unconditionally sets $sse2, just to minimize
upstream divergence. Chromium assumes SSE2 support, so relying on it is okay.
See https://crbug.com/349320.

Note: this change needs to be mirrored in Chromium to take.

bssl speed numbers:

SSE2:
Did 552 RSA 2048 signing operations in 3007814us (183.5 ops/sec)
Did 19003 RSA 2048 verify operations in 3070779us (6188.3 ops/sec)
Did 72 RSA 4096 signing operations in 3055885us (23.6 ops/sec)
Did 4650 RSA 4096 verify operations in 3024926us (1537.2 ops/sec)

Without SSE2:
Did 350 RSA 2048 signing operations in 3042021us (115.1 ops/sec)
Did 11760 RSA 2048 verify operations in 3003197us (3915.8 ops/sec)
Did 46 RSA 4096 signing operations in 3042692us (15.1 ops/sec)
Did 3400 RSA 4096 verify operations in 3083035us (1102.8 ops/sec)

SSE2:
Did 16407000 SHA-1 (16 bytes) operations in 3000141us (5468743.0 ops/sec): 87.5 MB/s
Did 4367000 SHA-1 (256 bytes) operations in 3000436us (1455455.1 ops/sec): 372.6 MB/s
Did 185000 SHA-1 (8192 bytes) operations in 3002666us (61611.9 ops/sec): 504.7 MB/s
Did 9444000 SHA-256 (16 bytes) operations in 3000052us (3147945.4 ops/sec): 50.4 MB/s
Did 2283000 SHA-256 (256 bytes) operations in 3000457us (760884.1 ops/sec): 194.8 MB/s
Did 89000 SHA-256 (8192 bytes) operations in 3016024us (29509.0 ops/sec): 241.7 MB/s
Did 5550000 SHA-512 (16 bytes) operations in 3000350us (1849784.2 ops/sec): 29.6 MB/s
Did 1820000 SHA-512 (256 bytes) operations in 3001039us (606456.6 ops/sec): 155.3 MB/s
Did 93000 SHA-512 (8192 bytes) operations in 3007874us (30918.8 ops/sec): 253.3 MB/s

Without SSE2:
Did 10573000 SHA-1 (16 bytes) operations in 3000261us (3524026.7 ops/sec): 56.4 MB/s
Did 2937000 SHA-1 (256 bytes) operations in 3000621us (978797.4 ops/sec): 250.6 MB/s
Did 123000 SHA-1 (8192 bytes) operations in 3033202us (40551.2 ops/sec): 332.2 MB/s
Did 5846000 SHA-256 (16 bytes) operations in 3000294us (1948475.7 ops/sec): 31.2 MB/s
Did 1377000 SHA-256 (256 bytes) operations in 3000335us (458948.8 ops/sec): 117.5 MB/s
Did 54000 SHA-256 (8192 bytes) operations in 3027962us (17833.8 ops/sec): 146.1 MB/s
Did 2075000 SHA-512 (16 bytes) operations in 3000967us (691443.8 ops/sec): 11.1 MB/s
Did 638000 SHA-512 (256 bytes) operations in 3000576us (212625.8 ops/sec): 54.4 MB/s
Did 30000 SHA-512 (8192 bytes) operations in 3042797us (9859.3 ops/sec): 80.8 MB/s

BUG=430237

Change-Id: I47d1c1ffcd71afe4f4a192272f8cb92af9505ee1
Reviewed-on: https://boringssl-review.googlesource.com/4130
Reviewed-by: Adam Langley <agl@google.com>
diff --git a/crypto/CMakeLists.txt b/crypto/CMakeLists.txt
index bbd4f20..c454664 100644
--- a/crypto/CMakeLists.txt
+++ b/crypto/CMakeLists.txt
@@ -2,7 +2,7 @@
 
 if(APPLE)
   if (${ARCH} STREQUAL "x86")
-    set(PERLASM_FLAGS "-fPIC")
+    set(PERLASM_FLAGS "-fPIC -DOPENSSL_IA32_SSE2")
   endif()
   set(PERLASM_STYLE macosx)
   set(ASM_EXT S)
@@ -13,7 +13,7 @@
     # in order to decide whether to generate 32- or 64-bit asm.
     set(PERLASM_STYLE linux64)
   elseif (${ARCH} STREQUAL "x86")
-    set(PERLASM_FLAGS "-fPIC")
+    set(PERLASM_FLAGS "-fPIC -DOPENSSL_IA32_SSE2")
     set(PERLASM_STYLE elf)
   else()
     set(PERLASM_STYLE elf)
@@ -27,6 +27,7 @@
   else()
     message("Using win32n")
     set(PERLASM_STYLE win32n)
+    set(PERLASM_FLAGS "-DOPENSSL_IA32_SSE2")
   endif()
 
   # On Windows, we use the NASM output, specifically built with Yasm.
diff --git a/crypto/modes/asm/ghash-x86.pl b/crypto/modes/asm/ghash-x86.pl
index eb6d55e..23a5527 100644
--- a/crypto/modes/asm/ghash-x86.pl
+++ b/crypto/modes/asm/ghash-x86.pl
@@ -131,8 +131,8 @@
 
 &asm_init($ARGV[0],"ghash-x86.pl",$x86only = $ARGV[$#ARGV] eq "386");
 
-$sse2=1;
-#for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); }
+$sse2=0;
+for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); }
 
 ($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx");
 $inp  = "edi";