CPython stores unicode strings internally as either UTF-16 or UTF-32, depending on a compile-time option. On UTF-16 ("narrow") builds, string slicing, iteration, and len operate on code units rather than code points, so characters outside the BMP behave strangely.
For example, on CPython 2.6 with sys.maxunicode == 65535:
>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."
Does this mean that unicode operations aren't guaranteed to work on code points above sys.maxunicode? If I want to work with characters outside the BMP, do I have to use a UTF-32 build, or write my own portable unicode operations?
I came across this problem in How to iterate over Unicode characters in Python 3?
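For context, one portable workaround is to join UTF-16 surrogate pairs manually while iterating, so the same loop yields whole code points on both narrow and wide builds. This is only a sketch, and the code_points helper name is my own, not from any standard library:

```python
def code_points(s):
    """Yield integer code points from a unicode string, combining
    UTF-16 surrogate pairs (as seen on narrow builds) into one value."""
    i = 0
    while i < len(s):
        c = ord(s[i])
        # High surrogate followed by a low surrogate: combine them.
        if 0xD800 <= c <= 0xDBFF and i + 1 < len(s):
            c2 = ord(s[i + 1])
            if 0xDC00 <= c2 <= 0xDFFF:
                yield 0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00)
                i += 2
                continue
        yield c
        i += 1
```

On a narrow build, iterating over u'\U0001D49E' this way yields the single code point 0x1D49E instead of the two surrogate code units u'\ud835' and u'\udc9e'; on a wide build the string is already one character, and the helper passes it through unchanged.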