Document and test stance on non-canonical base64

From RFC 4648:

    3.5.  Canonical Encoding

    The padding step in base 64 and base 32 encoding can, if improperly
    implemented, lead to non-significant alterations of the encoded data.
    For example, if the input is only one octet for a base 64 encoding,
    then all six bits of the first symbol are used, but only the first
    two bits of the next symbol are used.  These pad bits MUST be set to
    zero by conforming encoders, which is described in the descriptions
    on padding below.  If this property do not hold, there is no
    canonical representation of base-encoded data, and multiple
    base-encoded strings can be decoded to the same binary data.  If this
    property (and others discussed in this document) holds, a canonical
    encoding is guaranteed.

    In some environments, the alteration is critical and therefore
    decoders MAY chose to reject an encoding if the pad bits have not
    been set to zero.  The specification referring to this may mandate a
    specific behaviour.

OpenSSL's decoder has always accepted non-canonical encodings and it
still appears to be the prevalent practice in 2024. In particular Go's
encoding/base64 package requires you to opt into strict mode (which
encoding/pem does not use). Also, Bouncy Castle and NSS accept such
encodings.

So add a comment to the code that this is a deliberate, if perhaps
begrudging, choice and encode this in regress with a few test cases
that are more obviously of a degenerate nature than the current
non-canonical forms.

Also, group the test vectors straight from RFC 4648 section 10 together.

Change-Id: Ibcc22b7feed86fe1cb0fd51a1d61ec0c60dc8672
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/68247
Auto-Submit: Theo Buehler <theorbuehler@gmail.com>
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
Reviewed-by: David Benjamin <davidben@google.com>

commit: 1a9edc3e3b1024af4f6dc1ed6bb391510cb494ba
author: Theo Buehler <theorbuehler@gmail.com> Mon May 06 06:47:08 2024 +0200
committer: Boringssl LUCI CQ <boringssl-scoped@luci-project-accounts.iam.gserviceaccount.com> Wed May 15 20:19:46 2024 +0000
tree: 0e27e2b9c665eac06629a3d69af5675d32ba27c1
parent: b8912d713cb82a748bbe63f28f28b17632c70964
diff --git a/crypto/base64/base64.c b/crypto/base64/base64.c
index 666f832..26ad974 100644
--- a/crypto/base64/base64.c
+++ b/crypto/base64/base64.c

@@ -307,6 +307,10 @@
                                    (in[2] == '=') << 1 |
                                    (in[3] == '=');
 
+  // In presence of padding, the lowest bits of v are unused. Canonical encoding
+  // (RFC 4648, section 3.5) requires that these bits all be set to zero. Common
+  // PEM parsers accept noncanonical base64, adding to the malleability of the
+  // format. This decoder follows OpenSSL's and Go's PEM parsers and accepts it.
   switch (padding_pattern) {
     case 0:
       // The common case of no padding.

diff --git a/crypto/base64/base64_test.cc b/crypto/base64/base64_test.cc
index 6484dc6..f246605 100644
--- a/crypto/base64/base64_test.cc
+++ b/crypto/base64/base64_test.cc

@@ -45,8 +45,8 @@
   const char *encoded;
 };
 
-// Test vectors from RFC 4648.
 static const Base64TestVector kTestVectors[] = {
+    // Test vectors from RFC 4648, section 10.
     {canonical, "", ""},
     {canonical, "f", "Zg==\n"},
     {canonical, "fo", "Zm8=\n"},
@@ -54,12 +54,31 @@
     {canonical, "foob", "Zm9vYg==\n"},
     {canonical, "fooba", "Zm9vYmE=\n"},
     {canonical, "foobar", "Zm9vYmFy\n"},
-    {valid, "foobar", "Zm9vYmFy\n\n"},
-    {valid, "foobar", " Zm9vYmFy\n\n"},
-    {valid, "foobar", " Z m 9 v Y m F y\n\n"},
+
     {invalid, "", "Zm9vYmFy=\n"},
     {invalid, "", "Zm9vYmFy==\n"},
     {invalid, "", "Zm9vYmFy===\n"},
+
+    // valid non-canonical encodings due to arbitrary whitespace
+    {valid, "foobar", "Zm9vYmFy\n\n"},
+    {valid, "foobar", " Zm9vYmFy\n\n"},
+    {valid, "foobar", " Z m 9 v Y m F y\n\n"},
+    {valid, "foobar", "Zm9vYmFy\r\n"},
+
+    // The following "valid" encodings are arguably invalid, but they are
+    // commonly accepted by parsers, in particular by OpenSSL.
+    {valid, "v", "dv==\n"},
+    {canonical, "w", "dw==\n"},
+    {valid, "w", "dx==\n"},
+    {valid, "w", "d+==\n"},
+    {valid, "w", "d/==\n"},
+    {invalid, "", "d===\n"},
+    {canonical, "w`", "d2A=\n"},
+    {valid, "w`", "d2B=\n"},
+    {valid, "w`", "d2C=\n"},
+    {valid, "w`", "d2D=\n"},
+    {canonical, "wa", "d2E=\n"},
+
     {invalid, "", "Z"},
     {invalid, "", "Z\n"},
     {invalid, "", "ab!c"},
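
To make the pad-bit issue from RFC 4648, section 3.5 concrete, here is a minimal C sketch. It is not BoringSSL's decoder: `b64_value`, `decode_quantum`, and the `strict` flag are hypothetical names introduced for illustration only. The sketch decodes a single 4-character group and, when `strict` is nonzero, rejects groups whose unused pad bits are not zero. With `strict` set to 0 it behaves like the lenient stance this change documents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: maps one character of the standard base64 alphabet to
// its 6-bit value, or returns -1 for anything else (including '=').
static int b64_value(char c) {
  static const char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  const char *p = (c != '\0') ? strchr(kAlphabet, c) : NULL;
  return (p != NULL) ? (int)(p - kAlphabet) : -1;
}

// Hypothetical helper: decodes one 4-character group |in| into |out| and
// returns the number of bytes written (1 to 3), or -1 if the group is
// malformed. If |strict| is nonzero, a group whose pad bits are not all zero
// is rejected, which is the optional behaviour RFC 4648, section 3.5 allows.
static int decode_quantum(const char in[4], uint8_t out[3], int strict) {
  assert(in != NULL && out != NULL);

  int a = b64_value(in[0]);
  int b = b64_value(in[1]);
  if (a < 0 || b < 0) {
    return -1;  // Padding may only appear in the last two positions.
  }

  if (in[2] == '=') {
    if (in[3] != '=') {
      return -1;
    }
    // One output byte: the low four bits of |b| are pad bits.
    if (strict && (b & 0x0f) != 0) {
      return -1;
    }
    out[0] = (uint8_t)((a << 2) | (b >> 4));
    return 1;
  }

  int c = b64_value(in[2]);
  if (c < 0) {
    return -1;
  }

  if (in[3] == '=') {
    // Two output bytes: the low two bits of |c| are pad bits.
    if (strict && (c & 0x03) != 0) {
      return -1;
    }
    out[0] = (uint8_t)((a << 2) | (b >> 4));
    out[1] = (uint8_t)(((b & 0x0f) << 4) | (c >> 2));
    return 2;
  }

  int d = b64_value(in[3]);
  if (d < 0) {
    return -1;
  }
  // No padding: all 24 bits are significant.
  out[0] = (uint8_t)((a << 2) | (b >> 4));
  out[1] = (uint8_t)(((b & 0x0f) << 4) | (c >> 2));
  out[2] = (uint8_t)(((c & 0x03) << 6) | d);
  return 3;
}
```

With `strict` set to 0, `"dw=="` and `"dx=="` both decode to the single byte `'w'`, which mirrors the `canonical` and `valid` entries for `"w"` in the test vectors above. With `strict` nonzero, `"dw=="` still decodes but `"dx=="` is rejected, because the low four bits of `x` are not zero.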