text="\xe2\x80\x94"
print re.sub(r'(\\(?<=\\)x[a-z0-9]{2})+',"replacement_text",text)
output is —
how can I handle the hex decimal characters in this situation?
text="\xe2\x80\x94"
print re.sub(r'(\\(?<=\\)x[a-z0-9]{2})+',"replacement_text",text)
output is —
how can I handle the hex decimal characters in this situation?
Your input doesn't have backslashes. It has 3 bytes, the UTF-8 encoding for the U+2014 EM DASH character:
>>> text = "\xe2\x80\x94"
>>> len(text)
3
>>> text[0]
'\xe2'
>>> text.decode('utf8')
u'\u2014'
>>> print text.decode('utf8')
—
You either need to match those UTF-8 bytes directly, or decode from UTF-8 to unicode
and match the codepoint. The latter is preferable; always try to deal with text as Unicode to simplify how many characters you have to transform at a time.
Also note that Python's repr()
output (which is used impliciltly when echoing in the interactive interpreter or when printing lists, dicts or other containers) uses \xhh
escape sequences to represent any non-printable character. For UTF-8 strings, that includes anything outside the ASCII range. You could just replace anything outside that range with:
re.sub(r'[\x80-\xff]+', "replacement_text", text)
Take into account that this'll match multiple UTF-8-encoded characters in a row, and replace these together as a group!