A Python regex that matches the regional indicator character class

2024/10/5 21:23:28

I am using Python 2.7.10 on a Mac. Emoji flags are represented by a pair of Regional Indicator Symbols. I would like to write a Python regex that inserts spaces between the flags in a string of emoji flags.

  • For example, this string is two Brazilian flags:

    • u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"

    • which will render like this: 🇧🇷🇧🇷

I'd like to insert a space after each pair of regional indicator symbols. Something like this:

re.sub(re.compile(u"([\U0001F1E6-\U0001F1FF][\U0001F1E6-\U0001F1FF])"),r"\1 ", u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7")

...which would result in:

u"\U0001F1E7\U0001F1F7 \U0001F1E7\U0001F1F7 "

...but that code gives me an error:

sre_constants.error: bad character range

A hint (I think) at what's going wrong is the following, which shows that \U0001F1E7 is turning into two "characters" in the regex:

re.search(re.compile(u"([\U0001F1E7])"),u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7").group(0)

This results in:

u'\ud83c'

Sadly, my understanding of Unicode is too weak for me to make further progress.

Answer

I believe you're using Python 2.7 on Windows or Mac, which comes as a narrow build that stores Unicode strings as 16-bit code units. Linux/glibc builds are usually wide (32-bit, full Unicode), and Python 3.3 and later handle full Unicode on all platforms.
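If you want to confirm which kind of build you have, a quick check along these lines should do it. This is only a sketch: the names NARROW_BUILD and indicator_length are mine, not anything from the question.

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF otherwise.
NARROW_BUILD = sys.maxunicode == 0xFFFF

# One regional indicator symbol: length 2 on a narrow build
# (a surrogate pair), length 1 when full code points are stored.
indicator_length = len(u"\U0001F1E7")

print("narrow build: %s, len: %d" % (NARROW_BUILD, indicator_length))
```

A length of 2 here is the same effect you observed when the search returned only u'\ud83c'.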

What you see is a single code point being split into a surrogate pair. Unfortunately, it also means that you cannot simply use a single character class for this task. However, it is still possible. The UTF-16 representation of U+1F1E6 (🇦) is \uD83C\uDDE6, and that of U+1F1FF (🇿) is \uD83C\uDDFF.
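The surrogate values above can be checked with the standard UTF-16 arithmetic. The helper name to_surrogate_pair is just something I made up for illustration:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

first = to_surrogate_pair(0x1F1E6)  # regional indicator A
last = to_surrogate_pair(0x1F1FF)   # regional indicator Z

print("%04X %04X" % first)  # D83C DDE6
print("%04X %04X" % last)   # D83C DDFF
```

Both ends of the range share the same high surrogate, \uD83C, which is what makes the workaround below possible: only the low surrogate needs a character class.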

I don't have access to such a Python build myself, but you could try

\uD83C[\uDDE6-\uDDFF]

as a replacement for each single [\U0001F1E6-\U0001F1FF], so your whole regex would be

(\uD83C[\uDDE6-\uDDFF]\uD83C[\uDDE6-\uDDFF])

The original character class doesn't work because, on a narrow build, it tries to make a range from the second half of the first surrogate pair (\uDDE6) to the first half of the second surrogate pair (\uD83C). This fails because the start of the range is greater than the end.
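To see what a narrow build makes of the original character class, you can list its UTF-16 code units. This is a sketch that emulates the narrow view by encoding to UTF-16, so it should run on any build; utf16_code_units is a hypothetical helper name.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text, as a narrow build would store them."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

units = utf16_code_units(u"[\U0001F1E6-\U0001F1FF]")
print(" ".join("%04X" % u for u in units))
# 005B D83C DDE6 002D D83C DDFF 005D

# The code units on either side of the '-' are what the regex engine
# treats as the bounds of the range.
range_start, range_end = units[2], units[4]
print(range_start > range_end)  # True: DDE6 > D83C, hence "bad character range"
```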

However, this regular expression still wouldn't work on Linux. There you need to use the original, because Linux builds use wide Unicode by default.
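If the same code has to run on both kinds of build, one option is to pick the pattern at runtime based on sys.maxunicode. Again, I could only try this on a build with full Unicode, so treat the narrow branch as untested. The names REGIONAL_INDICATOR, FLAG and space_out_flags are my own.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Full code points are available: the original class works.
    REGIONAL_INDICATOR = u"[\U0001F1E6-\U0001F1FF]"
else:
    # Narrow build (assumption: untested here): fixed high surrogate
    # followed by a class over the low surrogate.
    REGIONAL_INDICATOR = u"\uD83C[\uDDE6-\uDDFF]"

# A flag is two consecutive regional indicator symbols.
FLAG = re.compile(u"(" + REGIONAL_INDICATOR + REGIONAL_INDICATOR + u")")

def space_out_flags(text):
    """Insert a space after every pair of regional indicator symbols."""
    return FLAG.sub(u"\\1 ", text)

two_flags = u"\U0001F1E7\U0001F1F7\U0001F1E7\U0001F1F7"
spaced = space_out_flags(two_flags)
```

With the two Brazilian flags from the question, spaced comes out as u"\U0001F1E7\U0001F1F7 \U0001F1E7\U0001F1F7 ", including the trailing space shown in the expected result.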


Alternatively, upgrade your Python to 3.5 or later.

https://en.xdnf.cn/q/70444.html
