Python string splitlines() removes certain Unicode control characters

2024/10/13 19:25:35

I noticed that Python's standard string method splitlines() also splits on, and so removes, some Unicode control characters that I need to keep. Example:

>>> s1 = u'asdf \n fdsa \x1d asdf'
>>> s1.splitlines()
[u'asdf ', u' fdsa ', u' asdf']

Notice how the "\x1d" character quietly disappears.

It doesn't happen if the string s1 is still a Python bytestring though (without the "u" prefix):

>>> s2 = 'asdf \n fdsa \x1d asdf'
>>> s2.splitlines()
['asdf ', ' fdsa \x1d asdf']

I can't find any information about this in the reference https://docs.python.org/2.7/library/stdtypes.html#str.splitlines.

Why does this happen? Which characters other than "\x1d" (or unichr(29)) are affected?

I'm using Python 2.7.3 on Ubuntu 12.04 LTS.

Answer

This is indeed under-documented; I had to dig through the source code somewhat to find it.

The unicodetype_db.h file defines linebreaks as:

case 0x000A:
case 0x000B:
case 0x000C:
case 0x000D:
case 0x001C:
case 0x001D:
case 0x001E:
case 0x0085:
case 0x2028:
case 0x2029:

These are generated from the Unicode database; any codepoint listed in the Unicode standard with the Line_Break property set to BK, CR, LF or NL or with bidirectional category set to B (paragraph break) is considered a line break.
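If you would rather check the set empirically than read the header, a brute-force scan works. This is a minimal sketch assuming Python 3, where every str is Unicode and chr() replaces unichr(); the helper name splitlines_breaks is made up for illustration.

```python
import sys

def splitlines_breaks():
    """Return every code point that str.splitlines() treats as a line break."""
    found = []
    for cp in range(sys.maxunicode + 1):
        # A break between 'a' and 'b' yields two pieces instead of one.
        if len(('a' + chr(cp) + 'b').splitlines()) == 2:
            found.append(cp)
    return found

breaks = splitlines_breaks()
print(['U+%04X' % cp for cp in breaks])
```

On a current CPython 3 this should report the same ten code points as the case list above.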

In the UnicodeData.txt file, version 6 of the standard lists U+001D with the paragraph separator category:

001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;

(The 5th column is the bidirectional category.)
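The same field can be read from Python through the unicodedata module. A small sketch, again assuming Python 3; the module ships a newer Unicode version than 6, but as far as I know the class of these control characters has not changed. U+001F is included only for contrast, since it is a segment separator (S) rather than a paragraph separator (B).

```python
import unicodedata

# The three information separators that splitlines() breaks on,
# plus U+001F, which it does not break on.
code_points = (0x1C, 0x1D, 0x1E, 0x1F)

bidi_classes = {cp: unicodedata.bidirectional(chr(cp)) for cp in code_points}

for cp in code_points:
    print('U+%04X %s' % (cp, bidi_classes[cp]))
```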

You could use a regular expression if you want to limit what characters to split on:

import re
# \n-\r covers U+000A through U+000D
linebreaks = re.compile(ur'[\n-\r\x85\u2028\u2029]')
linebreaks.split(yourtext)

This splits your text on the same set of line breaks except for U+001C, U+001D and U+001E, the three information separator control characters (file, group and record separator).
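For Python 3, a rough equivalent might look like the sketch below. It departs from the plain character class in two ways, both of them my own additions: "\r\n" is matched first so a Windows line ending counts as a single break, and a trailing empty string is dropped so the result matches what splitlines() returns for text ending in a newline. The function name split_keeping_separators is hypothetical.

```python
import re

# Same set as str.splitlines() minus U+001C, U+001D and U+001E.
# "\r\n" is listed first so it is consumed as one break, not two.
LINEBREAKS = re.compile('\r\n|[\n-\r\x85\u2028\u2029]')

def split_keeping_separators(text):
    """Split on line breaks but leave U+001C..U+001E inside the lines."""
    parts = LINEBREAKS.split(text)
    # str.splitlines() does not report a trailing empty line; mimic that.
    if parts and parts[-1] == '':
        parts.pop()
    return parts

lines = split_keeping_separators('asdf \n fdsa \x1d asdf')
print(lines)
```

With the string from the question, the "\x1d" stays inside the second element instead of causing a third one.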

