CPython stores unicode strings internally as either UTF-16 or UTF-32, depending on a compile-time option. On UTF-16 ("narrow") builds, string slicing, iteration, and len operate on code units rather than code points, so characters outside the BMP behave strangely.
For example, on CPython 2.6 with sys.maxunicode == 65535:
>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."
Does this mean that unicode operations aren't guaranteed to work on code points above sys.maxunicode? If I want to work with characters outside the BMP, do I have to use a UTF-32 build, or write my own portable unicode operations?
I came across this problem in How to iterate over Unicode characters in Python 3?
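For context, one portable workaround is to join UTF-16 surrogate pairs manually while iterating, so the same loop yields whole code points on both narrow and wide builds. This is only a sketch, and the code_points helper name is my own, not from any standard library:

```python
def code_points(s):
    """Yield integer code points from a unicode string, combining
    UTF-16 surrogate pairs (as seen on narrow builds) into one value."""
    i = 0
    while i < len(s):
        c = ord(s[i])
        # High surrogate followed by a low surrogate: combine them.
        if 0xD800 <= c <= 0xDBFF and i + 1 < len(s):
            c2 = ord(s[i + 1])
            if 0xDC00 <= c2 <= 0xDFFF:
                yield 0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00)
                i += 2
                continue
        yield c
        i += 1
```

On a narrow build, iterating over u'\U0001D49E' this way yields the single code point 0x1D49E instead of the two surrogate code units u'\ud835' and u'\udc9e'; on a wide build the string is already one character, and the helper passes it through unchanged.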