gh-101178: Add Ascii85, base85, and Z85 support to binascii #102753

kangtastic · 2023-03-16T11:38:01Z

Synopsis

Add Ascii85 and base85 encoder and decoder functions implemented in C to binascii and use them to greatly improve the performance and reduce the memory usage of the existing Ascii85, base85, and Z85 codec functions in base64.

No API or documentation changes are necessary with respect to any functions in base64, and all existing unit tests for those functions continue to pass without modification.

Resolves: gh-101178

Discussion

The base85-related functions in base64 are now wrappers for the new functions in binascii, as envisioned in the docs:

The binascii module contains a number of methods to convert between binary and various ASCII-encoded binary representations. Normally, you will not use these functions directly but use wrapper modules like uu or base64 instead. The binascii module contains low-level functions written in C for greater speed that are used by the higher-level modules.
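The wrapper relationship described in that passage can be seen with the long-established Base64 functions. This is only a minimal sketch of the existing pattern; it does not call any of the new base-85 functions proposed in this PR.

```python
import base64
import binascii

data = b"binascii does the heavy lifting"

# base64.b64encode() delegates to the C function binascii.b2a_base64().
# The C function appends a newline by default, so the comparison below
# asks it not to.
high_level = base64.b64encode(data)
low_level = binascii.b2a_base64(data, newline=False)

print(high_level == low_level)  # True
```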

Splitting Ascii85 out from base85 and Z85 was warranted in my testing, despite the code duplication, because of the various performance-murdering special cases in Ascii85.

Comments and questions are welcome.

Benchmarks

Updated April 20, 2025.

# bench_b85.py

# Note: EXTREMELY SLOW on unmodified mainline CPython
#       when tracing malloc on the base-85 functions.

import base64
import sys
import timeit
import tracemalloc

funcs = [(base64.b64encode, base64.b64decode),  # sanity check/comparison
         (base64.a85encode, base64.a85decode),
         (base64.b85encode, base64.b85decode),
         (base64.z85encode, base64.z85decode)]

def mb(n):
    return f"{n / 1024 / 1024:.3f} MB"

def stats(func, data, t, m):
    name, n, bps = func.__qualname__, len(data), len(data) / t
    print(f"{name} : {n} b in {t:.3f} s ({mb(bps)}/s) using {mb(m)}")

if __name__ == "__main__":
    data = b"a" * int(sys.argv[1]) * 1024 * 1024
    for fenc, fdec in funcs:
        tracemalloc.start()
        enc = fenc(data)
        menc = tracemalloc.get_traced_memory()[1] - len(enc)
        tracemalloc.stop()
        tenc = timeit.timeit("fenc(data)", number=1, globals=globals())
        stats(fenc, data, tenc, menc)

        tracemalloc.start()
        dec = fdec(enc)  # decode (not re-encode) while tracing memory
        mdec = tracemalloc.get_traced_memory()[1] - len(dec)
        tracemalloc.stop()
        tdec = timeit.timeit("fdec(enc)", number=1, globals=globals())
        stats(fdec, enc, tdec, mdec)

# Python 3.14.0a7+ commit 78cfee6f09
# ./configure --enable-optimizations --with-lto

# With this PR
$ time ./python bench_b85.py 64
b64encode : 67108864 b in 0.084 s (763.340 MB/s) using 42.667 MB
b64decode : 89478488 b in 0.230 s (371.074 MB/s) using 56.889 MB
a85encode : 67108864 b in 0.190 s (336.115 MB/s) using 0.000 MB
a85decode : 83886080 b in 0.216 s (370.605 MB/s) using 0.000 MB
b85encode : 67108864 b in 0.072 s (887.955 MB/s) using 0.000 MB
b85decode : 83886080 b in 0.175 s (457.224 MB/s) using 0.000 MB
z85encode : 67108864 b in 0.072 s (891.721 MB/s) using 0.000 MB
z85decode : 83886080 b in 0.174 s (460.582 MB/s) using 0.000 MB

real    0m2.231s
user    0m2.064s
sys     0m0.156s

# Unmodified
$ time ./python bench_b85.py 64
b64encode : 67108864 b in 0.082 s (781.718 MB/s) using 42.667 MB
b64decode : 89478488 b in 0.237 s (360.686 MB/s) using 56.889 MB
a85encode : 67108864 b in 7.492 s (8.543 MB/s) using 2664.406 MB
a85decode : 83886080 b in 14.264 s (5.609 MB/s) using 3332.254 MB
b85encode : 67108864 b in 7.181 s (8.912 MB/s) using 2664.404 MB
b85decode : 83886080 b in 8.486 s (9.427 MB/s) using 3332.254 MB
z85encode : 67108864 b in 7.343 s (8.715 MB/s) using 2664.102 MB
z85decode : 83886080 b in 8.778 s (9.113 MB/s) using 3332.254 MB

real    9m2.346s
user    8m47.248s
sys     0m12.460s

The old pure-Python implementation is two orders of magnitude slower and needs temporary memory of more than 40 times the input size (about 2664 MB to encode 64 MB).

ghost · 2023-03-16T11:38:28Z

All commit authors signed the Contributor License Agreement.

kangtastic · 2024-03-19T08:58:57Z

It's a year later, and Z85 support has been added to base64 in the meantime. So while bringing this PR up to date with main, I added Z85 support to it as well.

For reference, this is the benchmark run that led me to do so.

# After merging main but before adding Z85 support to this PR
(cpython-b85) $ python bench_b85.py 64
b64encode : 67108864 b in 0.121 s (527.435 MB/s) using 42.667 MB
b64decode : 89478488 b in 0.309 s (276.188 MB/s) using 56.889 MB
a85encode : 67108864 b in 0.297 s (215.150 MB/s) using 0.000 MB
a85decode : 83886080 b in 0.205 s (390.751 MB/s) using 0.000 MB
b85encode : 67108864 b in 0.106 s (604.359 MB/s) using 0.000 MB
b85decode : 83886080 b in 0.204 s (393.040 MB/s) using 0.000 MB
z85encode : 67108864 b in 0.204 s (313.610 MB/s) using 80.000 MB
z85decode : 83886080 b in 0.300 s (266.670 MB/s) using 100.000 MB

The existing Z85 implementation translates from the standard base85 alphabet to Z85 after the fact and within Python, so it was already benefiting from this PR but with substantial performance and memory usage overhead. That overhead is now gone.
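The "translate after the fact" approach can be sketched as follows. This is an illustration under stated assumptions, not the actual code in `base64`: the helper name `z85encode_via_translate` is hypothetical, and the two alphabets are written out by hand from the base85 and Z85 definitions.

```python
import base64

# Standard base85 alphabet (as used by base64.b85encode) and the Z85 alphabet.
B85_ALPHABET = (b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                b"abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~")
Z85_ALPHABET = (b"0123456789abcdefghijklmnopqrstuvwxyz"
                b"ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#")

# Byte-for-byte mapping from one alphabet to the other.
_TO_Z85 = bytes.maketrans(B85_ALPHABET, Z85_ALPHABET)


def z85encode_via_translate(data):
    """Hypothetical helper: encode as base85, then remap the alphabet.

    translate() returns a new bytes object, so a second full-size copy of
    the encoded output exists for a moment. That extra copy is the kind of
    overhead the benchmark above attributes to Z85.
    """
    return base64.b85encode(data).translate(_TO_Z85)


# Test vector from the Z85 specification.
print(z85encode_via_translate(bytes.fromhex("864FD26FB559F75B")))  # b'HelloWorld'
```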

python-cla-bot · 2025-04-18T09:43:15Z

All commit authors signed the Contributor License Agreement.

Add Ascii85, base85, and Z85 encoders and decoders to `binascii`, replacing the existing pure Python implementations in `base64`.

No API or documentation changes are necessary with respect to `base64.a85encode()`, `b85encode()`, etc., and all existing unit tests for those functions continue to pass without modification.

Note that attempting to decode Ascii85 or base85 data of length 1 mod 5 (after accounting for Ascii85 quirks) now produces an error, as no encoder would emit such data. This should be the only significant externally visible difference compared to the old implementation.

Resolves: pythongh-101178
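The claim that no encoder emits data of length 1 mod 5 follows from the grouping rule: a final group of 1 to 3 input bytes becomes 2 to 4 output characters, never 1. A small check with the existing `base64.b85encode()` (the helper name `encoded_length_remainders` is only illustrative):

```python
import base64


def encoded_length_remainders(max_len=64):
    """Collect len(encoded) % 5 for every input length below max_len."""
    return {len(base64.b85encode(b"x" * n)) % 5 for n in range(max_len)}


remainders = encoded_length_remainders()
print(sorted(remainders))  # [0, 2, 3, 4]
```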

kangtastic · 2025-04-21T05:18:40Z

PR has been rebased onto main at 78cfee6 with squashing.

sergey-miryanov · 2025-04-21T08:58:43Z

> Note that attempting to decode Ascii85, base85, or Z85 data of length 1 mod 5 now produces an error, as no encoder would emit such data. This should be the only significant externally visible difference compared to the old implementations.

I believe you have to document this change.

kangtastic · 2025-04-21T10:27:58Z

> > Note that attempting to decode Ascii85, base85, or Z85 data of length 1 mod 5 now produces an error, as no encoder would emit such data. This should be the only significant externally visible difference compared to the old implementations.
>
> I believe you have to document this change.

Fair point, I could do that.

In case anyone argues for keeping the old behavior (silently ignoring length 1 mod 5), I won't do it just yet.

Lib/base64.py

If we were strictly following PEP-0399, _base64 would be a C module for accelerated functions in base64. Due to historical reasons, those should actually go in binascii instead. We still want to preserve the existing Python code in base64. Parting out facilities for accessing the C functions into a module named _base64 shouldn't risk a naming conflict and will simplify testing.

This is done differently to PEP-0399 to minimize the number of changed lines.

As we're now keeping the existing Python base 85 functions, the C implementations should behave exactly the same, down to exception type and wording. It is also no longer an error to try to decode data of length 1 mod 5.

kangtastic · 2025-04-27T03:51:24Z

The PR has been updated to preserve the existing base 85 Python functions in base64 and modify the new base 85 C functions in binascii to closely match their behavior. Notably, trying to decode data of length 1 mod 5 is no longer an error.


kangtastic · 2025-04-29T03:26:09Z

PR was accidentally closed due to misclicking on mobile. There should be a confirmation dialog or something 😅

serhiy-storchaka · 2025-12-26T09:40:19Z

Thank you for your PR, @kangtastic. This looks very interesting. The speed was increased by 100 times; this is impressive. Initially we did not bother to implement these codecs in C, because they are non-standard, and the expected use was just for a few kilobytes. But it seems people use them for larger data.

I resolved conflicts and fixed tests and support of features added since the last update. I am planning to finish the review and land that code in Python.

I managed to speed up Ascii85 encoding by 3 times; it can be almost as fast as Base85. If you don't mind, I'll apply this optimization here.

As for the general organization of the code, I think that the introduction of the _base64 module is redundant. I understand why you did this, for tests, but the binascii module is a non-optional dependency for base64. If we need to make Base85 support in binascii optional, we should simply test in base64.py whether binascii.a2b_ascii85 etc. exist. I am not sure that it is worth the hassle. I asked others for their opinion: https://discuss.python.org/t/accelerator-for-ascii85-base85/105415
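The "test whether the function exists" idea might look roughly like the sketch below. It is an assumption-laden illustration: `a85encode_sketch` is a hypothetical name, and the call signature of `binascii.b2a_ascii85` (a single positional argument) is assumed rather than taken from the PR.

```python
import base64
import binascii

# Look the accelerator up once; None means this build of binascii lacks it.
_b2a_ascii85 = getattr(binascii, "b2a_ascii85", None)


def a85encode_sketch(data):
    """Hypothetical wrapper: use the C function if present, else pure Python."""
    if _b2a_ascii85 is not None:
        # Assumed signature: data as the only positional argument.
        return _b2a_ascii85(data)
    return base64.a85encode(data)


encoded = a85encode_sketch(b"hello world")
print(base64.a85decode(encoded))
```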

Also, instead of the boolean z85 parameter, it is better to add separate functions. It could be even better to generalize the functions to work with an arbitrary alphabet, but it is not easy for decoders. If we are going this way, we have to generate the decoding table in Python, or implement caching in C.
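Generating a decoding table in Python for an arbitrary alphabet could be sketched as below. The function name `make_decode_table` and the use of `0xFF` as the "invalid character" marker are assumptions made for illustration, not part of the PR.

```python
INVALID = 0xFF  # assumed sentinel for bytes outside the alphabet


def make_decode_table(alphabet):
    """Build a 256-entry table mapping each byte to its base-85 digit."""
    if len(alphabet) != 85 or len(set(alphabet)) != 85:
        raise ValueError("alphabet must contain 85 distinct bytes")
    table = bytearray([INVALID] * 256)
    for digit, char in enumerate(alphabet):
        table[char] = digit
    return bytes(table)


B85_ALPHABET = (b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                b"abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~")
table = make_decode_table(B85_ALPHABET)
print(table[ord("0")], table[ord("~")], table[ord(" ")])  # 0 84 255
```

Because the table depends only on the alphabet, building it once per alphabet and reusing it is what the caching remark refers to.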

kangtastic · 2025-12-26T13:58:27Z

> Thank you for your PR, @kangtastic. This looks very interesting. The speed was increased by 100 times; this is impressive. Initially we did not bother to implement these codecs in C, because they are non-standard, and the expected use was just for a few kilobytes. But it seems people use them for larger data.

Thanks for reviewing, @serhiy-storchaka.

> I resolved conflicts and fixed tests and support of features added since the last update. I am planning to finish the review and land that code in Python.
>
> I managed to speed up Ascii85 encoding by 3 times; it can be almost as fast as Base85. If you don't mind, I'll apply this optimization here.

No objections from me. Interested to see how you did it.

> As for the general organization of the code, I think that the introduction of the _base64 module is redundant. I understand why you did this, for tests, but the binascii module is a non-optional dependency for base64. If we need to make Base85 support in binascii optional, we should simply test in base64.py whether binascii.a2b_ascii85 etc. exist. I am not sure that it is worth the hassle. I asked others for their opinion: https://discuss.python.org/t/accelerator-for-ascii85-base85/105415

Yes, I originally removed the pure-Python base-85-related function implementations in base64 for the same reasons you mentioned in your discussion thread. I added them back in and created _base64 after @AA-Turner pointed out in the previous review that PEP-0399 requires keeping existing pure-Python implementations alongside new C accelerator modules. If the discussion thread leads to permission to remove the Python codepath, I'd be happy with that.

> Also, instead of the boolean z85 parameter, it is better to add separate functions. It could be even better to generalize the functions to work with an arbitrary alphabet, but it is not easy for decoders. If we are going this way, we have to generate the decoding table in Python, or implement caching in C.

Regarding generalization for arbitrary alphabets, I didn't think it was necessary given that only Ascii85, Base85, and Z85 are in common use, and Ascii85 is different enough that there are really only two alphabets to be generalized.

How about separate functions for Base85 and Z85 only on the Python API side? I'm inclined to keep them combined on the C side to avoid code duplication (with wrapper functions for the z85 parameter).

picnixz

A few comments. I'd appreciate, as it's new code, that PEP-7 is followed.

picnixz · 2025-12-26T15:08:56Z

Doc/library/binascii.rst

+   of no more than the specified width separated by the ASCII newline
+   character.
+
+   If *pad* is true, the input is padded to a multiple of 4 before encoding.


Suggested change

-   If *pad* is true, the input is padded to a multiple of 4 before encoding.
+   If *pad* is true, the input is padded to a multiple of 4 before encoding.
+
+   .. versionadded:: next

picnixz · 2025-12-26T15:09:03Z

Doc/library/binascii.rst

+   Valid base85 data contains characters from the base85 alphabet in groups
+   of five (except for the final group, which may have from two to five
+   characters). Each group encodes 32 bits of binary data in the range from
+   ``0`` to ``2 ** 32 - 1``, inclusive.


Suggested change

-   ``0`` to ``2 ** 32 - 1``, inclusive.
+   ``0`` to ``2 ** 32 - 1``, inclusive.
+
+   .. versionadded:: next
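The group structure described in that documentation hunk (four bytes read as one 32-bit number and written as five base-85 digits) can be illustrated with a short sketch. `encode_group` is a hypothetical helper written for this example and checked against the existing `base64.b85encode()`.

```python
import base64

B85_ALPHABET = (b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                b"abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~")


def encode_group(chunk):
    """Encode exactly four bytes as five base85 characters."""
    if len(chunk) != 4:
        raise ValueError("a full group is exactly 4 bytes")
    value = int.from_bytes(chunk, "big")  # 0 <= value <= 2**32 - 1
    digits = bytearray(5)
    # Fill from the least significant digit backwards.
    for position in range(4, -1, -1):
        value, remainder = divmod(value, 85)
        digits[position] = B85_ALPHABET[remainder]
    return bytes(digits)


print(encode_group(b"abcd") == base64.b85encode(b"abcd"))  # True
```

Five digits are enough because 85 ** 5 = 4437053125 exceeds 2 ** 32 = 4294967296.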

picnixz · 2025-12-26T15:09:13Z

Doc/library/binascii.rst

+   *ignore* is an optional bytes-like object that specifies characters to
+   ignore in the input.
+
+   Invalid Ascii85 data will raise :exc:`binascii.Error`.


Suggested change

-   Invalid Ascii85 data will raise :exc:`binascii.Error`.
+   Invalid Ascii85 data will raise :exc:`binascii.Error`.
+
+   .. versionadded:: next

picnixz · 2025-12-26T15:09:21Z

Doc/library/binascii.rst

+   If *newline* is true, a newline char is appended to the result.
+
+   If *z85* is true, the Z85 alphabet is used for conversion.
+   See `Z85 specification <https://rfc.zeromq.org/spec/32/>`_ for more information.


Suggested change

-   See `Z85 specification <https://rfc.zeromq.org/spec/32/>`_ for more information.
+   See `Z85 specification <https://rfc.zeromq.org/spec/32/>`_ for more information.
+
+   .. versionadded:: next

picnixz · 2025-12-26T15:11:31Z

Lib/_base64.py

+
+
+def _bytes_from_encode_data(b):
+    return b if isinstance(b, bytes_types) else memoryview(b).tobytes()


memoryview(b).tobytes() may raise a generic TypeError here, whereas _bytes_from_decode_data catches it and raises a more informative TypeError. Should we do the same here?

picnixz · 2025-12-26T15:12:05Z

Lib/_base64.py

+
+# Functions in binascii raise binascii.Error instead of ValueError.
+
+def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False):


Suggested change

- def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False):
+ def a85encode(b, /, *, foldspaces=False, wrapcol=0, pad=False, adobe=False):
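For readers unfamiliar with the `/` marker in that suggestion: it makes every parameter before it positional-only. A minimal sketch with a hypothetical function `encode_sketch` (not the PR's `a85encode`):

```python
def encode_sketch(b, /, *, pad=False):
    """Hypothetical stand-in: b is positional-only, pad is keyword-only."""
    return (bytes(b), pad)


# The data argument must be passed by position.
print(encode_sketch(b"abc", pad=True))

# Passing it by name is rejected with TypeError.
try:
    encode_sketch(b=b"abc")
except TypeError as exc:
    print("rejected:", exc)
```

With the marker in place, the parameter name `b` is no longer part of the public calling convention.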

picnixz · 2025-12-26T15:12:10Z

Lib/_base64.py

+        raise ValueError(e) from None
+
+
+def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v'):


Suggested change

- def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v'):
+ def a85decode(b, /, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v'):

picnixz · 2025-12-26T15:20:11Z

Modules/binascii.c

+                state = get_binascii_state(module);
+                if (state != NULL) {
+                    PyErr_SetString(state->Error, "Ascii85 overflow");
+                }


This can be NULL only if:

- the state is really missing (mod->md_state == NULL), or
- 'module' is not a module.

I don't know when the state could be missing (maybe at interpreter finalization?).

picnixz · 2025-12-26T15:20:45Z

Modules/binascii.c

+
+    return PyBytesWriter_FinishWithPointer(writer, bin_data);
+
+error_end:


I'd suggest naming this error instead of error_end.

picnixz · 2025-12-26T15:26:52Z

> How about separate functions for Base85 and Z85 only on the Python API side? I'm inclined to keep them combined on the C side to avoid code duplication (with wrapper functions for the z85 parameter).

It may be better to separate them for PGO, though I have no idea whether there will be an impact or not (this needs to be measured). If you're worried about parts of the code being duplicated, it can be refactored into macros (or into smaller functions). However, we should avoid a user interface that exposes different flavors through boolean switches (we usually try to avoid this, as illustrated by filter/itertools.filterfalse and, recently, fnmatch.filter/fnmatch.filterfalse, where the API was explicitly designed to avoid switches).
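The "separate public functions over a shared core" layout can be sketched in pure Python. All names here (`_encode85`, `b85encode_sketch`, `z85encode_sketch`) are hypothetical, and this is not the C code under review; it only shows the API shape being discussed.

```python
import base64

B85_ALPHABET = (b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                b"abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~")
Z85_ALPHABET = (b"0123456789abcdefghijklmnopqrstuvwxyz"
                b"ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#")


def _encode85(data, alphabet):
    """Shared core: the alphabet is the only thing that varies."""
    padding = -len(data) % 4
    padded = bytes(data) + b"\0" * padding
    out = bytearray()
    for start in range(0, len(padded), 4):
        value = int.from_bytes(padded[start:start + 4], "big")
        group = bytearray(5)
        for position in range(4, -1, -1):
            value, remainder = divmod(value, 85)
            group[position] = alphabet[remainder]
        out += group
    if padding:
        # Drop the characters that only encode the zero padding.
        del out[-padding:]
    return bytes(out)


def b85encode_sketch(data, /):
    return _encode85(data, B85_ALPHABET)


def z85encode_sketch(data, /):
    return _encode85(data, Z85_ALPHABET)


print(b85encode_sketch(b"hello") == base64.b85encode(b"hello"))  # True
print(z85encode_sketch(bytes.fromhex("864FD26FB559F75B")))       # b'HelloWorld'
```

Neither public function takes a flavor switch; the choice of alphabet is made by choosing the function.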

bedevere-bot added the awaiting review label Mar 16, 2023

kangtastic changed the title ~~Add Ascii85 and base85 support to binascii~~ gh-101178: Add Ascii85 and base85 support to binascii Mar 16, 2023

bedevere-bot mentioned this pull request Mar 16, 2023

base64.b85encode uses significant amount of RAM #101178

Open

arhadthedev added the stdlib Standard Library Python modules in the Lib/ directory label Mar 23, 2023

kangtastic force-pushed the gh-101178-rework-base85 branch from 71f1955 to 7b4aba1 Compare March 19, 2024 09:27

kangtastic force-pushed the gh-101178-rework-base85 branch from 7b4aba1 to 05ae5ad Compare April 21, 2025 05:16

kangtastic changed the title ~~gh-101178: Add Ascii85 and base85 support to binascii~~ gh-101178: Add Ascii85. base85, and Z85 support to binascii Apr 21, 2025

kangtastic changed the title ~~gh-101178: Add Ascii85. base85, and Z85 support to binascii~~ gh-101178: Add Ascii85, base85, and Z85 support to binascii Apr 21, 2025

AA-Turner reviewed Apr 24, 2025

Lib/base64.py (resolved)

kangtastic added 5 commits April 26, 2025 06:37

Restore base64.py

aa06c5d

Test both Python and C codepaths in base64

6c0e4a3

This is done differently to PEP-0399 to minimize the number of changed lines.

Match behavior between Python and C base 85 functions

ce4773c

As we're now keeping the existing Python base 85 functions, the C implementations should behave exactly the same, down to exception type and wording. It is also no longer an error to try to decode data of length 1 mod 5.

Add Z85 tests to binascii

4072e3b

Update generated files

bc9217f

AA-Turner reviewed Apr 27, 2025

Lib/_base64.py (outdated, resolved)

AA-Turner reviewed Apr 27, 2025

Lib/base64.py (outdated, resolved)

AA-Turner reviewed Apr 27, 2025

Lib/test/test_base64.py (outdated, resolved)

kangtastic added 4 commits April 27, 2025 19:55

Avoid importing functools

2c40ba0

Importing update_wrapper() from functools to copy attributes is expensive. Do it ourselves for only the most relevant ones.

Avoid circular import in _base64

fd9eaf7

This requires some code duplication, but oh well.

Do not use a decorator for changing exception type

4746d18

Using a decorator complicates function signature introspection.

Test Python and C codepaths in base64 using mixins

d075593

Do we really need to test the legacy API twice?

kangtastic closed this Apr 29, 2025

kangtastic reopened this Apr 29, 2025

AA-Turner reviewed Apr 29, 2025

Lib/base64.py (outdated, resolved)

Remove leading underscore from functions in private module

6d65fec

serhiy-storchaka self-requested a review December 24, 2025 18:40

serhiy-storchaka added 5 commits December 24, 2025 21:05

Merge branch 'main' into pythongh-101178-rework-base85

a241356

Use more modern C API.

0df9a40

Fix tests.

60fbd7c

Merge branch 'main' into pythongh-101178-rework-base85

a070887

Fix new tests.

167e83e

picnixz reviewed Dec 26, 2025

Optimize binascii.b2a_ascii85().

01df442


