I wrote a program to add (limited) Unicode support to Python regexes. It works fine on CPython 2.5.2 but not on PyPy (1.8.0, implementing Python 2.7.2; originally tried on 1.5.0-alpha0, implementing Python 2.7.1), both running on Windows XP (Edit: as seen in the comments, @dbaupp could run it fine on Linux). I have no idea why, but I suspect it has something to do with my uses of u" and ur". The full source is here, and the relevant bits are:

    # -*- coding:utf-8 -*-
    import re

    # Regexps to match characters in the BMP according to their Unicode category.
    # Extracted from Unicode specification, version 5.0.0, source:
    # http://unicode.org/versions/Unicode5.0.0/
    unicode_categories = {
        ur'Pi':ur'[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
        ur'Sk':ur'[\u005e\u0060\u00a8\u00af\u00b4\u00b8\u02c2-\u02c5\u02d2-\u02df\u02...',
        ur'Sm':ur'[\u002b\u003c-\u003e\u007c\u007e\u00ac\u00b1\u00d7\u00f7\u03f6\u204...',
        ...
        ur'Pf':ur'[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
        ur'Me':ur'[\u0488\u0489\u06de\u20dd-\u20e0\u20e2-\u20e4]',
        ur'Mc':ur'[\u0903\u093e-\u0940\u0949-\u094c\u0982\u0983\u09be-\u09c0\u09c7\u0...',
    }

    def hack_regexp(regexp_string):
        for (k,v) in unicode_categories.items():
            regexp_string = regexp_string.replace((ur'\p{%s}' % k),v)
        return regexp_string

    def regex(regexp_string,flags=0):
        """Shortcut for re.compile that also translates and add the UNICODE flag

        Example usage:
        >>> from unicode_hack import regex
        >>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
        >>> print result.group(0)
        áÇñ
        >>> 
        """
        return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)

(on PyPy there is no match in the "Example usage", so result is None)
To reiterate, the program works fine on CPython: the Unicode data seems correct, the replace works as intended, and the usage example runs OK (both via doctest and when typed directly at the command line). The source file encoding is also correct, and the coding directive in the header seems to be recognized by Python.
Any ideas about what PyPy does differently that is breaking my code? Several possibilities came to mind (an unrecognized coding header, different encodings in the command line, different interpretations of the r and u prefixes), but as far as my tests go, CPython and PyPy seem to behave identically, so I'm clueless about what to try next.
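
For anyone trying to reproduce this, here is a minimal sketch of the kind of check that could be run under both interpreters and then diffed. It only dumps the code points each interpreter actually sees, which should help separate a source-encoding or console-encoding problem from a regex problem. The helper names code_points and report are hypothetical (they are not part of my program), and the script deliberately avoids ur"" literals so that it stays neutral about how those are parsed.

```python
# -*- coding: utf-8 -*-
# Diagnostic sketch: run the same file on CPython and PyPy and compare the output.
# code_points() and report() are illustrative helper names, not part of unicode_hack.
from __future__ import print_function
import sys


def code_points(text):
    """Return each element of `text` as a hex string, e.g. ['0xe1', '0x62']."""
    return [hex(ord(ch)) for ch in text]


def report(label, text):
    """Print the type, length and code points of `text` on one line."""
    print(label, type(text).__name__, len(text), code_points(text))


if __name__ == "__main__":
    print(sys.version)
    # Narrow (0xffff) versus wide (0x10ffff) builds can differ between interpreters.
    print("maxunicode:", hex(sys.maxunicode))

    typed = u"áÇñ123"                      # depends on the coding header being honoured
    escaped = u"\u00e1\u00c7\u00f1123"     # independent of the source file encoding
    report("typed:  ", typed)
    report("escaped:", escaped)
    print("typed == escaped:", typed == escaped)
```

If the two interpreters print the same code points for both strings, the source encoding and the u prefix are probably not the culprit, and the next thing to compare would be the output of hack_regexp itself, dumped the same way.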