Question 1

I have a CSV file which I can read in, and it all works fine except for the specific German (and possibly other) characters. I've used chardet to determine that the encoding is Mac Roman:

import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        rawdata = f.read()
    result = chardet.detect(rawdata)
    return result['encoding']

Now if I print any lines containing special German characters, I get totally different things, like:

‰ instead of ä
ﬂ instead of ß
÷ instead of Ö
¸ instead of ü etc.

import csv

def read_csv_file_line_by_line(file_path, target_name):
    encoding = detect_encoding(file_path)
    print(encoding)
    with open(file_path, mode='r', encoding=encoding) as file:
        csv_reader = csv.reader(file, delimiter=';')
        counter = -1
        for row in csv_reader:
            print(row)

As it is a huge file and I'm not sure that I can find all the special cases, I would like to convert everything at once instead of picking out individual cases and replacing them one by one. What I've tried so far:

row = [field.encode('mac_roman', 'replace').decode('mac_roman') for field in row]

This doesn't do anything; the "weird" characters stay weird.

row = [field.encode('mac_roman', 'replace').decode('utf-8') for field in row]

This fails with:

row = [field.encode('mac_roman').decode('utf-8') for field in row]
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 12: invalid continuation byte

I was unsure whether it is really Mac Roman, so I tried iso-8859-1 and iso-8859-15, but all in vain:

row = [field.encode('iso-8859-15').decode('utf-8') for field in row]

decoded_string = original_string.encode('iso-8859-1').decode('utf-8')
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2030' in position 5: ordinal not in range(256)

row = [field.encode('iso-8859-15').decode('utf-8') for field in row]

UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 5: character maps to <undefined>
encoding with 'iso-8859-15' codec failed

Question 2

It is easy to convert it with pure Python.

with open("input.csv", "r", encoding="mac_roman") as inp, \
        open("output.csv", "w", encoding="utf-8") as outp:
    for line in inp:
        outp.write(line)
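If it turns out the source encoding is something other than Mac Roman, the same loop can be wrapped in a small function so that the encoding is easy to swap. This is only a sketch: `convert_to_utf8` and its parameter names are made up for illustration, and `newline=""` is my own addition so that the file's existing line endings pass through unchanged.

```python
def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file to UTF-8, line by line.

    Hypothetical helper for illustration only. newline="" switches off
    newline translation on both sides, so CRLF/LF endings are copied as-is.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as inp, \
            open(dst_path, "w", encoding="utf-8", newline="") as outp:
        for line in inp:
            outp.write(line)
```

For example, `convert_to_utf8("input.csv", "output.csv", "latin-1")` would be the call to try if the file is really Latin-1. A wrong guess does not necessarily raise an error, so check a few lines with known special characters afterwards.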

If you prefer, there are tools like iconv and recode which do this on the command line.

If you need help figuring out the actual encoding, I have a page where you can look up individual character codes at https://tripleee.github.io/8bit. Based on the discussion under your own answer, though, I guess the real encoding was Latin-1 all along. Maybe you used something which was designed to display results in Mac Roman to view the file?
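The Latin-1 guess can be checked against the symptoms in the question. The sketch below decodes four byte values both ways: 0xe4 comes from the UnicodeDecodeError message, and the other three are my assumption about which bytes sit behind ß, Ö and ü. The `repair` function is a hypothetical helper that undoes the wrong decoding for a string already read as Mac Roman; it only makes sense if the data really was Latin-1 (or a close relative such as cp1252, which agrees with Latin-1 for these characters).

```python
# Byte values: 0xe4 is from the traceback; the rest are assumed from the symptoms.
samples = bytes([0xE4, 0xDF, 0xD6, 0xFC])

for byte in samples:
    raw = bytes([byte])
    as_mac = raw.decode("mac_roman")    # what was displayed
    as_latin1 = raw.decode("latin-1")   # what was expected
    print(f"0x{byte:02x}  mac_roman={as_mac!r}  latin-1={as_latin1!r}")


def repair(text):
    """Undo a wrong Mac Roman decode of Latin-1 bytes (hypothetical helper)."""
    # Encoding with the wrong codec recovers the original bytes,
    # which are then decoded with the codec that should have been used.
    return text.encode("mac_roman").decode("latin-1")
```

Each byte shows up as the "weird" character under Mac Roman and as the expected German character under Latin-1, which matches the list in the question. In practice it is simpler to open the file with `encoding="latin-1"` in the first place than to repair strings after the fact.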

If you need help with producing a hex dump of the file, this is easy to do in Python; see e.g. "Pythonic way to hex dump files", but here's my quick and dirty attempt.

import sys

def chunk(filehandle):
    "Read filehandle; yield each chunk of up to 16 bytes and its offset"
    offset = 0
    for data in iter(lambda: filehandle.read(16), b""):
        yield data, offset
        offset += 16

with open(sys.argv[1], "rb") as handle:
    for data, offset in chunk(handle):
        # Hex column: two digits per byte, space-separated
        hexed = " ".join("%02x" % c for c in data)
        # Text column: printable ASCII as-is, everything else (including DEL) as "."
        s = "".join(chr(c) if 32 <= c < 127 else "." for c in data)
        print("%06x %-47s %s" % (offset, hexed, s))
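For a huge file, a full dump may be more than you want to read. As a rough alternative, assuming you only care which byte values above 0x7f actually occur, you can count them and look up just that short list. `high_byte_counts` and `chunk_size` are hypothetical names for this sketch.

```python
from collections import Counter


def high_byte_counts(path, chunk_size=65536):
    """Count how often each non-ASCII byte value (>= 0x80) occurs in a file."""
    counts = Counter()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so the whole file never sits in memory.
        for data in iter(lambda: handle.read(chunk_size), b""):
            counts.update(b for b in data if b >= 0x80)
    return counts
```

Something like `for value, n in sorted(high_byte_counts("input.csv").items()): print("%02x %d" % (value, n))` then gives you one line per distinct byte value.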

Converting German characters (like ä, ß, etc.) from Mac Roman to UTF (or similar)?
