Using Python textwrap.shorten for string but with bytes width

2024/10/18 15:18:27

I'd like to shorten a string using textwrap.shorten or a function like it. The string can potentially have non-ASCII characters. What's special here is that the maximal width applies to the encoded bytes of the string, not to its characters. This problem is motivated by the fact that several database column definitions and some message buses have a byte-based max length.

For example:

>>> import textwrap
>>> s = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'

# Available function that I tried:
>>> textwrap.shorten(s, width=27)
'☺ Ilsa, le méchant ☺ [...]'
>>> len(_.encode())
31  # I want ⩽27

# Desired function:
>>> shorten_to_bytes_width(s, width=27)
'☺ Ilsa, le méchant [...]'
>>> len(_.encode())
27  # I want and get ⩽27

It's okay for the implementation to require a width greater than or equal to the length of the whitespace-stripped placeholder [...], i.e. 5.

The text should not be shortened any more than necessary. Some buggy implementations can use optimizations which on occasion result in excessive shortening.

Using textwrap.wrap with bytes count is a similar question but it's different enough from this one since it is about textwrap.wrap, not textwrap.shorten. Only the latter function uses a placeholder ([...]) which makes this question sufficiently unique.

Caution: Do not rely on any of the answers here for shortening a JSON encoded string in a fixed number of bytes. For that case, substitute text.encode() with json.dumps(text).
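To illustrate why the caution matters, here is a small sketch comparing the UTF-8 length of a string with the length of its JSON encoding. The sample text is taken from the example above; the variable names are only for illustration. By default json.dumps escapes non-ASCII characters and adds surrounding quotes, so the two lengths differ.

```python
import json

text = '☺ Ilsa, le méchant'

# Plain UTF-8: "☺" takes 3 bytes and "é" takes 2 bytes.
utf8_len = len(text.encode())

# Default json.dumps escapes non-ASCII characters as \uXXXX and adds quotes.
json_len = len(json.dumps(text))

# With ensure_ascii=False only the two quotes are added to the UTF-8 length.
json_raw_len = len(json.dumps(text, ensure_ascii=False).encode())

print(utf8_len, json_len, json_raw_len)  # 21 30 23
```

A width check based on text.encode() alone would therefore underestimate the size of the JSON form.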

Answer

In theory it's enough to encode your string, then check if it fits in the "width" constraint. If it does, then the string can be simply returned. Otherwise you can take the first "width" bytes from the encoded string (minus the bytes needed for the placeholder). To make sure it works like textwrap.shorten one also needs to find the last whitespace in the remaining bytes and return everything before the whitespace + the placeholder. If there is no whitespace only the placeholder needs to be returned.
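One detail of this approach deserves a quick sketch: slicing the encoded bytes can cut a multi-byte character in half, but an ASCII space never occurs inside a multi-byte UTF-8 sequence, so everything before the last space still decodes cleanly. The slice position of 15 is chosen here purely for illustration.

```python
encoded = '☺ Ilsa, le méchant ☺'.encode()

# 15 bytes: "☺ Ilsa, le m" plus only the first of the two bytes of "é".
cut = encoded[:15]

try:
    cut.decode()
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# Everything before the last space consists of complete characters only.
head = cut.rsplit(b' ', 1)[0].decode()

print(decodes_cleanly, head)  # False ☺ Ilsa, le
```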

Since you mentioned that the result really must be constrained by byte count, the function raises an exception if the placeholder itself is too large. A placeholder that wouldn't fit into the byte-constrained container or data structure simply doesn't make sense, and rejecting it avoids a lot of edge cases that could result in inconsistent "maximum byte size" and "placeholder byte size".

The code could look like this:

def shorten_rsplit(string: str, maximum_bytes: int, normalize_spaces: bool = False, placeholder: str = "[...]") -> str:
    # Make sure the placeholder satisfies the byte length requirement
    encoded_placeholder = placeholder.encode().strip()
    if maximum_bytes < len(encoded_placeholder):
        raise ValueError('placeholder too large for max width')

    # Get the UTF-8 bytes that represent the string and (optionally) normalize the spaces.
    if normalize_spaces:
        string = " ".join(string.split())
    encoded_string = string.encode()

    # If the input string is empty simply return an empty string.
    if not encoded_string:
        return ''

    # In case we don't need to shorten anything simply return
    if len(encoded_string) <= maximum_bytes:
        return string

    # We need to shorten the string, so we need to add the placeholder
    substring = encoded_string[:maximum_bytes - len(encoded_placeholder)]
    splitted = substring.rsplit(b' ', 1)  # Split at last space-character
    if len(splitted) == 2:
        return b" ".join([splitted[0], encoded_placeholder]).decode()
    else:
        # No space found: return only the (stripped) placeholder that was passed in
        return encoded_placeholder.decode()

And a simple test case:

t = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'

for i in range(5, 50):
    shortened = shorten_rsplit(t, i)
    byte_length = len(shortened.encode())
    print(byte_length <= i, i, byte_length, shortened)

Which prints:

True 5 5 [...]
True 6 5 [...]
True 7 5 [...]
True 8 5 [...]
True 9 9 ☺ [...]
True 10 9 ☺ [...]
True 11 9 ☺ [...]
True 12 9 ☺ [...]
True 13 9 ☺ [...]
True 14 9 ☺ [...]
True 15 15 ☺ Ilsa, [...]
True 16 15 ☺ Ilsa, [...]
True 17 15 ☺ Ilsa, [...]
True 18 18 ☺ Ilsa, le [...]
True 19 18 ☺ Ilsa, le [...]
True 20 18 ☺ Ilsa, le [...]
True 21 18 ☺ Ilsa, le [...]
True 22 18 ☺ Ilsa, le [...]
True 23 18 ☺ Ilsa, le [...]
True 24 18 ☺ Ilsa, le [...]
True 25 18 ☺ Ilsa, le [...]
True 26 18 ☺ Ilsa, le [...]
True 27 27 ☺ Ilsa, le méchant [...]
True 28 27 ☺ Ilsa, le méchant [...]
True 29 27 ☺ Ilsa, le méchant [...]
True 30 27 ☺ Ilsa, le méchant [...]
True 31 31 ☺ Ilsa, le méchant ☺ [...]
True 32 31 ☺ Ilsa, le méchant ☺ [...]
True 33 31 ☺ Ilsa, le méchant ☺ [...]
True 34 31 ☺ Ilsa, le méchant ☺ [...]
True 35 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 36 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 37 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 38 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 39 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 40 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 41 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 42 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 43 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 44 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 45 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 46 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 47 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 48 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 49 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
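As a cross-check that the listing above never shortens more than necessary, one could compare it against a brute-force reference. The following is only a sketch under the assumption that words are separated by single spaces; the name shorten_bytes_reference is hypothetical and not part of the answer. It tries every word prefix, longest first, and is far too slow for real use.

```python
def shorten_bytes_reference(text: str, width: int, placeholder: str = '[...]') -> str:
    """Brute-force reference (hypothetical helper): longest word prefix that fits."""
    words = text.split()
    normalized = ' '.join(words)
    if len(normalized.encode()) <= width:
        return normalized
    if len(placeholder.encode()) > width:
        raise ValueError('placeholder too large for max width')
    # Drop one trailing word at a time until the candidate fits.
    for count in range(len(words) - 1, 0, -1):
        candidate = ' '.join(words[:count] + [placeholder])
        if len(candidate.encode()) <= width:
            return candidate
    return placeholder


t = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'
for width in (5, 9, 26, 27, 41):
    print(width, shorten_bytes_reference(t, width))
```

For the widths tried here, the reference agrees with the rows printed above.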

The function also has an argument for normalizing the spaces. That could be helpful in case you have different kinds of whitespace (newlines, etc.) or multiple sequential spaces, although it will be a bit slower.
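As a small illustration of what the normalization step does, here is the same " ".join(string.split()) expression applied on its own to a made-up input containing a newline, a tab, and repeated spaces:

```python
raw = '☺ Ilsa,\n le\tméchant   ☺'

# str.split() without arguments splits on any run of whitespace,
# so joining with single spaces collapses newlines, tabs and repeats.
normalized = " ".join(raw.split())

print(normalized)  # ☺ Ilsa, le méchant ☺
```

Without this step, only the plain space character counts as a split point, because the function splits the encoded bytes on b' '.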

Performance

I did a quick test using simple_benchmark (a library I wrote) to check that it's actually faster than the other approaches.

For the benchmark I create a string containing random Unicode characters where, on average, about one out of 8 characters is a space. I also use half the length of the string as the byte width to shorten to. Neither choice has a special reason, but they could bias the benchmarks, which is why I mention them.

The functions used in the benchmark:

def shorten_rsplit(string: str, maximum_bytes: int, normalize_spaces: bool = False, placeholder: str = "[...]") -> str:
    encoded_placeholder = placeholder.encode().strip()
    if maximum_bytes < len(encoded_placeholder):
        raise ValueError('placeholder too large for max width')

    if normalize_spaces:
        string = " ".join(string.split())
    encoded_string = string.encode()

    if not encoded_string:
        return ''

    if len(encoded_string) <= maximum_bytes:
        return string

    substring = encoded_string[:maximum_bytes - len(encoded_placeholder)]
    splitted = substring.rsplit(b' ', 1)  # Split at last space-character
    if len(splitted) == 2:
        return b" ".join([splitted[0], encoded_placeholder]).decode()
    else:
        return encoded_placeholder.decode()

import textwrap

_MIN_WIDTH = 5

def shorten_to_bytes_width(text: str, width: int) -> str:
    width = max(_MIN_WIDTH, width)
    text = textwrap.shorten(text, width)
    while len(text.encode()) > width:
        text = textwrap.shorten(text, len(text) - 1)
    assert len(text.encode()) <= width
    return text

def naive(text: str, width: int) -> str:
    width = max(_MIN_WIDTH, width)
    text = textwrap.shorten(text, width)
    if len(text.encode()) <= width:
        return text

    current_width = _MIN_WIDTH
    index = 0
    slice_index = 0
    endings = ' '
    while True:
        new_width = current_width + len(text[index].encode())
        if new_width > width:
            break
        if text[index] in endings:
            slice_index = index
        index += 1
        current_width = new_width
    if slice_index:
        slice_index += 1  # to include found space
    text = text[:slice_index] + '[...]'
    assert len(text.encode()) <= width
    return text

MAX_BYTES_PER_CHAR = 4

def bytes_to_char_length(input, bytes, start=0, max_length=None):
    if bytes <= 0 or (max_length is not None and max_length <= 0):
        return 0
    if max_length is None:
        max_length = min(bytes, len(input) - start)
    bytes_too_much = len(input[start:start + max_length].encode()) - bytes
    if bytes_too_much <= 0:
        return max_length
    min_length = max(max_length - bytes_too_much, bytes // MAX_BYTES_PER_CHAR)
    max_length -= (bytes_too_much + MAX_BYTES_PER_CHAR - 1) // MAX_BYTES_PER_CHAR
    new_start = start + min_length
    bytes_left = bytes - len(input[start:new_start].encode())
    return min_length + bytes_to_char_length(input, bytes_left, new_start, max_length - min_length)

def shorten_to_bytes(input, bytes, placeholder=' [...]', start=0):
    if len(input[start:start + bytes + 1].encode()) <= bytes:
        return input
    bytes -= len(placeholder.encode())
    max_chars = bytes_to_char_length(input, bytes, start)
    if max_chars <= 0:
        return placeholder.strip() if bytes >= 0 else ''
    w = input.rfind(' ', start, start + max_chars + 1)
    if w > 0:
        return input[start:w] + placeholder
    else:
        return input[start:start + max_chars] + placeholder

# Benchmark

from simple_benchmark import benchmark, MultiArgument
import random

def get_random_unicode(length):  # https://stackoverflow.com/a/21666621/5393381
    get_char = chr
    include_ranges = [
        (0x0021, 0x0021), (0x0023, 0x0026), (0x0028, 0x007E), (0x00A1, 0x00AC),
        (0x00AE, 0x00FF), (0x0100, 0x017F), (0x0180, 0x024F), (0x2C60, 0x2C7F),
        (0x16A0, 0x16F0), (0x0370, 0x0377), (0x037A, 0x037E), (0x0384, 0x038A),
        (0x038C, 0x038C)
    ]
    alphabet = [
        get_char(code_point) for current_range in include_ranges
        for code_point in range(current_range[0], current_range[1] + 1)
    ]
    # Add more whitespaces
    for _ in range(len(alphabet) // 8):
        alphabet.append(' ')
    return ''.join(random.choice(alphabet) for i in range(length))

r = benchmark(
    [shorten_rsplit, shorten_to_bytes, shorten_to_bytes_width, naive, bytes_to_char_length],
    {2**exponent: MultiArgument([get_random_unicode(2**exponent), 2**exponent // 2]) for exponent in range(4, 15)},
    "string length"
)

I also did a second benchmark excluding the shorten_to_bytes_width function so I could benchmark even longer strings:

r = benchmark(
    [shorten_rsplit, shorten_to_bytes, naive],
    {2**exponent: MultiArgument([get_random_unicode(2**exponent), 2**exponent // 2]) for exponent in range(4, 20)},
    "string length"
)
