Regex for accent insensitive replacement in python

2024/10/10 0:20:18

In Python 3, I'd like to be able to use re.sub() in an "accent-insensitive" way, as we can do with the re.I flag for case-insensitive substitution.

Could be something like a re.IGNOREACCENTS flag:

original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë。"
accent_regex = r'a café'
re.sub(accent_regex, 'X', original_text, flags=re.IGNOREACCENTS)

This would lead to "¿It's 80°C, I'm drinking X in X with Chloë。" (note that there's still an accent on "Chloë"), whereas real Python without such a flag gives "¿It's 80°C, I'm drinking X in a cafe with Chloë。".

I don't think such a flag exists. So what would be the best way to do this? Using re.finditer and unidecode on both original_text and accent_regex, and then replacing by splitting the string? Or replacing each character in accent_regex with a character class of its accented variants, for instance: r'[cç][aàâ]f[éèêë]'?

Answer

unidecode is often mentioned for removing accents in Python, but it does more than that: for example, it converts '°' to 'deg', which might not be the desired output.

unicodedata seems to have enough functionality to remove accents.
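As a small illustration of why this works (a sketch added for clarity, using only the standard library): NFD normalization splits an accented letter into its base letter plus a combining mark of category 'Mn', while '°' has no decomposition and is left alone.

```python
import unicodedata

# NFD splits 'é' into the base letter 'e' and a combining acute accent (U+0301).
decomposed = unicodedata.normalize('NFD', 'é')
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# Combining marks are in category 'Mn' (Mark, nonspacing); base letters are not.
print([unicodedata.category(c) for c in decomposed])
# ['Ll', 'Mn']

# '°' has no canonical decomposition, so NFD leaves it unchanged.
print(unicodedata.normalize('NFD', '°') == '°')
# True
```

Dropping every 'Mn' character after decomposition therefore removes the accents and nothing else.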

With any pattern

This method should work with any pattern and any text.

You can temporarily remove the accents from both the text and regex pattern. The match information from re.finditer() (start and end indices) can be used to modify the original, accented text.

Note that the matches must be processed in reverse order, so that each replacement does not shift the indices of the matches still to be processed.

import re
import unicodedata

original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."

accented_pattern = r'a café|François Déporte'

def remove_accents(s):
    return ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))

print(remove_accents('äöüßéèiìììíàáç'))
# aoußeeiiiiiaac

pattern = re.compile(remove_accents(accented_pattern))

modified_text = original_text
matches = list(re.finditer(pattern, remove_accents(original_text)))

for match in matches[::-1]:
    modified_text = modified_text[:match.start()] + 'X' + modified_text[match.end():]

print(modified_text)
# I'm drinking a 80° café in X with Chloë, X and X.
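The loop above relies on the accent-free text having exactly the same length as the original, which holds when every accented letter is a single precomposed code point. A minimal sketch of how this could be wrapped in a reusable function follows; the name sub_ignore_accents and the length check are my own assumptions, not part of the re module, and the replacement is treated as a literal string.

```python
import re
import unicodedata


def remove_accents(s):
    """Drop combining marks (category 'Mn') after NFD decomposition."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def sub_ignore_accents(pattern, repl, text, flags=0):
    """Hypothetical helper: replace accent-insensitive matches in text.

    repl is inserted literally (backreferences are not expanded).
    """
    # Compose the text first so each accented letter is one code point.
    text = unicodedata.normalize('NFC', text)
    stripped = remove_accents(text)
    if len(stripped) != len(text):
        # Some character changed length when stripped, so match indices
        # would not line up with the original text.
        raise ValueError('text cannot be aligned with its accent-free form')
    regex = re.compile(remove_accents(pattern), flags)
    # Walk the matches right to left so earlier indices stay valid.
    for match in reversed(list(regex.finditer(stripped))):
        text = text[:match.start()] + repl + text[match.end():]
    return text


result = sub_ignore_accents(r'a café', 'X', "I'm drinking a café in a cafe with Chloë.")
print(result)
# I'm drinking X in X with Chloë.
```

Normalizing to NFC first means text that arrives in decomposed form (for example 'e' followed by U+0301) is handled the same way as precomposed text. The returned string is always in NFC form.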

If pattern is a word or a set of words

You could:

  • remove the accents from your pattern words and save them in a set for fast lookup
  • look for every word in your text with \w+
  • remove the accents from the word:
    • If it matches, replace by X
    • If it doesn't match, leave the word untouched

import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe with Chloë."

def remove_accents(string):
    return unidecode(string)

accented_words = ['café', 'français']

# Accent-free versions of the words to replace, stored in a set for fast lookup
words_to_remove = set(remove_accents(word) for word in accented_words)

def remove_words(matchobj):
    word = matchobj.group(0)
    if remove_accents(word) in words_to_remove:
        return 'X'
    else:
        return word

# Raw string, so that \w is passed to the regex engine unchanged
print(re.sub(r'\w+', remove_words, original_text))
# I'm drinking a X in a X with Chloë.
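If you would rather avoid the unidecode dependency here as well, the same word-based approach can be sketched with the unicodedata-based remove_accents() from the first method. This is a variant under that assumption, not the code above; replace_word is an illustrative name.

```python
import re
import unicodedata


def remove_accents(s):
    """Drop combining marks (category 'Mn') after NFD decomposition."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


accented_words = ['café', 'français']
# Accent-free versions of the target words, in a set for fast lookup.
words_to_remove = {remove_accents(word) for word in accented_words}


def replace_word(matchobj):
    """Return 'X' for a target word (ignoring accents), else the word itself."""
    word = matchobj.group(0)
    return 'X' if remove_accents(word) in words_to_remove else word


original_text = "I'm drinking a 80° café in a cafe with Chloë."
print(re.sub(r'\w+', replace_word, original_text))
# I'm drinking a 80° X in a X with Chloë.
```

Because the replacement function returns either 'X' or the untouched original word, no index bookkeeping is needed, and characters such as '°' are never altered.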