I have a CSV file which I can read in, and it all works fine except for the specific German (and possibly other) special characters. I've used chardet to determine that the encoding is MacRoman:
```python
import chardet

def detect_encoding(file_path):
    # Read the raw bytes and let chardet guess the encoding
    with open(file_path, 'rb') as f:
        rawdata = f.read()
    result = chardet.detect(rawdata)
    return result['encoding']
```
Now if I print any lines containing special German characters, I get completely different characters, for example:
- ‰ instead of ä
- fl instead of ß
- ÷ instead of Ö
- ¸ instead of ü, and so on (see the check after this list)
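Interestingly, these are exactly the characters you get when ISO-8859-1 (latin-1) bytes are decoded as MacRoman. A minimal check, assuming (my guess, not confirmed) that the file's bytes are really latin-1:

```python
# Guess: the file's bytes are actually latin-1, but are being decoded as MacRoman.
for ch in 'äßÖü':
    print(ch, '->', ch.encode('latin-1').decode('mac_roman'))
# ä -> ‰
# ß -> fl
# Ö -> ÷
# ü -> ¸
```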
```python
import csv

def read_csv_file_line_by_line(file_path, target_name):
    encoding = detect_encoding(file_path)
    print(encoding)
    with open(file_path, mode='r', encoding=encoding) as file:
        csv_reader = csv.reader(file, delimiter=';')
        counter = -1
        for row in csv_reader:
            print(row)
```
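To see what is actually stored in the file, it may help to dump the raw bytes of an affected line. A diagnostic sketch (0xe4 is the byte named in the UnicodeDecodeError further down, and it is 'ä' in latin-1):

```python
# Diagnostic sketch: print the raw bytes of the first line containing 0xe4
# (0xe4 = 'ä' in latin-1, the byte from the UnicodeDecodeError below).
with open(file_path, 'rb') as f:
    for raw_line in f:
        if b'\xe4' in raw_line:
            print(raw_line)
            break
```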
As it is a huge file and I'm not sure I can find all the special cases, I would like to convert everything programmatically instead of picking out the cases and replacing them one by one. What I've tried so far:
```python
row = [field.encode('mac_roman', 'replace').decode('mac_roman') for field in row]
```

-> it doesn't do anything; the "weird" characters stay weird.

```python
row = [field.encode('mac_roman', 'replace').decode('utf-8') for field in row]
```

->

```
row = [field.encode('mac_roman').decode('utf-8') for field in row]
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 12: invalid continuation byte
```
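Two notes on why these behave this way: the first attempt is a no-op by construction, since encoding and then decoding with the same single-byte codec just round-trips the string unchanged; and the byte 0xe4 that trips up UTF-8 is 'ä' in latin-1. So if the latin-1 guess from above is right, re-encoding to MacRoman to recover the original bytes and then decoding those bytes as latin-1 should repair already-read rows (an untested sketch):

```python
# .encode('mac_roman') recovers the file's original bytes (MacRoman maps
# every byte 0x00-0xFF to a distinct character, so the round trip is
# lossless); .decode('latin-1') then interprets them as latin-1 -- my
# guess for the file's real encoding.
row = [field.encode('mac_roman').decode('latin-1') for field in row]
```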
I was unsure whether it is really MacRoman, so I also tried iso-8859-1 and iso-8859-15, but all in vain:
```python
row = [field.encode('iso-8859-1').decode('utf-8') for field in row]
```

```
decoded_string = original_string.encode('iso-8859-1').decode('utf-8')
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2030' in position 5: ordinal not in range(256)
```

```python
row = [field.encode('iso-8859-15').decode('utf-8') for field in row]
```

```
UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 5: character maps to <undefined>
encoding with 'iso-8859-15' codec failed
```
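For what it's worth, '\u2030' in both of those errors is '‰' (the per-mille sign), i.e. one of the already mis-decoded characters, which simply does not exist in latin-1 or iso-8859-15, so the encode() step can never succeed on the garbled strings. If the latin-1 guess holds, the cleanest route would presumably be to ignore chardet's MacRoman result and force the encoding when opening the file (untested sketch):

```python
import csv

# Untested sketch: decode the raw bytes directly as latin-1 instead of
# trusting the detected MacRoman encoding; 'cp1252' matches latin-1 in
# the umlaut range, so it should behave the same here.
with open(file_path, mode='r', encoding='latin-1') as file:
    for row in csv.reader(file, delimiter=';'):
        print(row)
```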