How to replace accents in a column of a pandas dataframe

2024/9/16 23:25:43

I have a dataframe dataSwiss which contains information about Swiss municipalities. I want to replace the accented letters with their unaccented equivalents.

This is what I am doing:

dataSwiss['Municipality'] = dataSwiss['Municipality'].str.encode('utf-8')
dataSwiss['Municipality'] = dataSwiss['Municipality'].str.replace(u"é", "e")

but I get the following error:

----> 2 dataSwiss['Municipality'] = dataSwiss['Municipality'].str.replace(u"é", "e")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

The data looks like this:

dataSwiss.Municipality
0               Zürich
1               Zürich
2               Zürich
3               Zürich
4               Zürich
5               Zürich
6               Zürich
7               Zürich

I found a solution:

s = dataSwiss['Municipality']
res = s.str.decode('utf-8')
res = res.str.replace(u"é", "e")
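The error message is consistent with byte strings being mixed with text: after `.str.encode('utf-8')` the column holds bytes, and decoding those bytes as ASCII fails on the first non-ASCII byte. The sketch below reproduces that failure explicitly with plain Python 3 strings. That the original traceback came from an implicit ASCII decode (as Python 2 does when mixing `str` and `unicode`) is an assumption, since the question does not state the Python version.

```python
# Plain-Python illustration; no pandas needed.
text = "Z\u00fcrich"             # "Zürich" with a precomposed ü (U+00FC)
raw = text.encode("utf-8")       # bytes: b'Z\xc3\xbcrich'

try:
    raw.decode("ascii")          # the byte 0xc3 at position 1 is not ASCII
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# Decoding with the correct codec gives text back, and replace works on text.
decoded = raw.decode("utf-8")
replaced = decoded.replace("\u00fc", "u")
print(replaced)
```

Decoding with `utf-8` first, as in the snippet above, puts the column back into text form, so the later `str.replace` no longer needs any implicit conversion.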

Answer

Here is one way: normalize the strings to NFKD, encode them to ASCII bytes while ignoring characters that cannot be encoded, and then decode the bytes back to text.

import pandas as pd

s = pd.Series(['hello', 'héllo', 'Zürich', 'Zurich'])
res = s.str.normalize('NFKD')\
       .str.encode('ascii', errors='ignore')\
       .str.decode('utf-8')
print(res)
0     hello
1     hello
2    Zurich
3    Zurich
dtype: object
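For the DataFrame in the question, the same chain can be wrapped in a small helper and assigned back to the column. This is only a sketch: `strip_accents` is a hypothetical name, and the sample rows stand in for the real `dataSwiss` data. Note that characters with no ASCII decomposition (for example `ß` or `ø`) are dropped entirely by `errors='ignore'`.

```python
import pandas as pd

def strip_accents(column: pd.Series) -> pd.Series:
    """Hypothetical helper: decompose accented letters, then drop non-ASCII."""
    return (
        column.str.normalize("NFKD")
        .str.encode("ascii", errors="ignore")
        .str.decode("utf-8")
    )

# Stand-in for the real data; only the column name comes from the question.
dataSwiss = pd.DataFrame({"Municipality": ["Zürich", "Genève", "Neuchâtel"]})
dataSwiss["Municipality"] = strip_accents(dataSwiss["Municipality"])
print(dataSwiss["Municipality"].tolist())
```

Unlike a single `str.replace(u"é", "e")`, this handles every accented letter in one pass, so no per-character replacement list is needed.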

pd.Series.str.normalize uses the unicodedata module. As per the docs:

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents.
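To see what that decomposition does to a single character, the standard-library `unicodedata` module can be called directly. A minimal sketch: under NFKD, the precomposed `é` becomes a plain `e` followed by a combining acute accent, which is why the later ASCII encoding with `errors='ignore'` leaves only the base letter.

```python
import unicodedata

composed = "\u00e9"                                   # "é" as one code point
decomposed = unicodedata.normalize("NFKD", composed)  # "e" followed by U+0301

# Name each code point in the decomposed string.
names = [unicodedata.name(ch) for ch in decomposed]
print(names)

# Dropping the non-ASCII combining mark leaves the base letter.
ascii_only = decomposed.encode("ascii", errors="ignore").decode("ascii")
print(ascii_only)
```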
