Unicode, regular expressions and PyPy

2024/9/30 1:26:07

I wrote a program to add (limited) Unicode support to Python regexes. It works fine on CPython 2.5.2 but not on PyPy (1.8.0, implementing Python 2.7.2; originally tested on 1.5.0-alpha0, implementing Python 2.7.1), both running on Windows XP (Edit: as seen in the comments, @dbaupp could run it fine on Linux). I have no idea why, but I suspect it has something to do with my uses of the u" and ur" string prefixes. The full source is here, and the relevant bits are:

# -*- coding:utf-8 -*-
import re

# Regexps to match characters in the BMP according to their Unicode category.
# Extracted from Unicode specification, version 5.0.0, source:
# http://unicode.org/versions/Unicode5.0.0/
unicode_categories = {
    ur'Pi':ur'[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
    ur'Sk':ur'[\u005e\u0060\u00a8\u00af\u00b4\u00b8\u02c2-\u02c5\u02d2-\u02df\u02...',
    ur'Sm':ur'[\u002b\u003c-\u003e\u007c\u007e\u00ac\u00b1\u00d7\u00f7\u03f6\u204...',
    ...
    ur'Pf':ur'[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
    ur'Me':ur'[\u0488\u0489\u06de\u20dd-\u20e0\u20e2-\u20e4]',
    ur'Mc':ur'[\u0903\u093e-\u0940\u0949-\u094c\u0982\u0983\u09be-\u09c0\u09c7\u0...',
}

def hack_regexp(regexp_string):
    for (k,v) in unicode_categories.items():
        regexp_string = regexp_string.replace((ur'\p{%s}' % k),v)
    return regexp_string

def regex(regexp_string,flags=0):
    """Shortcut for re.compile that also translates and add the UNICODE flag

    Example usage:
    >>> from unicode_hack import regex
    >>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
    >>> print result.group(0)
    áÇñ
    >>>
    """
    return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)

(on PyPy there is no match in the "Example usage", so result is None)

Reiterating, the program works fine (on CPython): the Unicode data seems correct, the replace works as intended, the usage example runs ok (both via doctest and directly typing it in the command line). The source file encoding is also correct, and the coding directive in the header seems to be recognized by Python.
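For comparison, here is a minimal sketch of the same substitution idea with the character classes generated from the standard unicodedata module instead of pasted in by hand. It assumes Python 3 (the code above is Python 2), covers only the BMP like the original table, and uses whatever Unicode version the interpreter ships rather than 5.0.0. The helper name category_class and the constant BMP_LIMIT are hypothetical, and this is not the code from the linked source.

```python
import re
import unicodedata
from functools import lru_cache

BMP_LIMIT = 0x10000  # assumption: BMP only, as in the original table


@lru_cache(maxsize=None)
def category_class(name):
    """Return a character class covering every BMP code point whose
    general category starts with `name` (so 'L' covers Lu, Ll, Lt, ...)."""
    ranges = []
    for cp in range(BMP_LIMIT):
        if unicodedata.category(chr(cp)).startswith(name):
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp        # extend the current run
            else:
                ranges.append([cp, cp])   # start a new run
    if not ranges:
        raise ValueError("unknown Unicode category: %r" % name)
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append("\\u%04x" % lo)
        else:
            parts.append("\\u%04x-\\u%04x" % (lo, hi))
    return "[" + "".join(parts) + "]"


def hack_regexp(regexp_string):
    """Replace each \\p{Xx} token with the generated character class."""
    return re.sub(r"\\p\{(\w+)\}",
                  lambda m: category_class(m.group(1)),
                  regexp_string)


def regex(regexp_string, flags=0):
    """Shortcut for re.compile that translates \\p{..} tokens first."""
    return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)
```

With this sketch, regex(r'^\p{Ll}\p{L}*').match('áÇñ123') should match the leading 'áÇñ', as in the docstring example above. Because the classes are written with \uXXXX escapes that re itself interprets, the result does not depend on how the interpreter handles ur'' literals.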

Any ideas about what PyPy does differently that is breaking my code? Many things came to mind (unrecognized coding header, different encodings in the command line, different interpretations of r and u), but as far as my tests go, CPython and PyPy seem to behave identically, so I'm clueless about what to try next.

Answer

Why aren’t you simply using Matthew Barnett’s super-recommended regex module instead?

It works on both Python 3 and legacy Python 2, is a drop-in replacement for re, handles all the Unicode stuff you could want, and a whole lot more.
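As a rough illustration of that suggestion, the sketch below runs the question's pattern through the third-party regex package, which understands \p{...} property classes directly. It assumes Python 3 and that regex has been installed from PyPI; the helper name leading_word is hypothetical, and the guarded import is only there so the snippet still loads where the package is missing.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:  # keep the sketch importable where regex is absent
    regex = None


def leading_word(text):
    """Return the leading run of letters that starts with a lowercase
    letter, or None if the text does not start with one."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    match = regex.match(r'^\p{Ll}\p{L}*', text)
    return match.group(0) if match else None


if regex is not None:
    print(leading_word('áÇñ123'))
```

No hand-written category table or string substitution is needed here, since the property lookup is handled inside the module.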

https://en.xdnf.cn/q/71138.html
