Converting German characters (like ä, ß, etc.) from Mac Roman to UTF (or similar)?

2024/10/6 10:03:46

I have a CSV file which I can read in, and it all works fine except for the specific German (and possibly other) characters. I've used chardet to determine that the encoding is Mac Roman:

    import chardet

    def detect_encoding(file_path):
        with open(file_path, 'rb') as f:
            rawdata = f.read()
        result = chardet.detect(rawdata)
        return result['encoding']

Now if I print any lines containing special German characters, I get totally different things, like:

  • ‰ instead of ä
  • ﬂ instead of ß
  • ÷ instead of Ö
  • ¸ instead of ü, etc.

    import csv

    def read_csv_file_line_by_line(file_path, target_name):
        encoding = detect_encoding(file_path)
        print(encoding)
        with open(file_path, mode='r', encoding=encoding) as file:
            csv_reader = csv.reader(file, delimiter=';')
            counter = -1
            for row in csv_reader:
                print(row)

As it is a huge file and I'm not sure that I can find all the special cases, I would like to convert everything at once instead of picking out each case and replacing the characters one by one. What I've tried so far:

  • row = [field.encode('mac_roman', 'replace').decode('mac_roman') for field in row] -> it doesn't do anything; the "weird" characters stay weird
  • row = [field.encode('mac_roman', 'replace').decode('utf-8') for field in row] ->

        row = [field.encode('mac_roman').decode('utf-8') for field in row]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 12: invalid continuation byte

I was unsure whether it is really Mac Roman, so I tried iso-8859-1 and iso-8859-15, but all in vain:

  • row = [field.encode('iso-8859-1').decode('utf-8') for field in row]

        decoded_string = original_string.encode('iso-8859-1').decode('utf-8')
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        UnicodeEncodeError: 'latin-1' codec can't encode character '\u2030' in position 5: ordinal not in range(256)

  • row = [field.encode('iso-8859-15').decode('utf-8') for field in row]

        UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 5: character maps to <undefined>
        encoding with 'iso-8859-15' codec failed

Answer

It is easy to convert it with pure Python.

    with open("input.csv", "r", encoding="mac_roman") as inp, \
            open("output.csv", "w", encoding="utf-8") as outp:
        for line in inp:
            outp.write(line)

If you prefer, there are tools like iconv and recode which do this on the command line.

If you need help figuring out the actual encoding, I have a page where you can look up individual character codes at https://tripleee.github.io/8bit. Based on the discussion under your own answer, though, I guess the real encoding was Latin-1 all along. Maybe you viewed the file with something which was designed to display its contents as Mac Roman?
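One way to test that guess is to take the characters reported in the question and reinterpret their bytes as Latin-1. This is only a sketch: the helper name `reinterpret` is made up for illustration, and it assumes the text was decoded as Mac Roman while the bytes on disk were really Latin-1.

```python
def reinterpret(text, wrong="mac_roman", right="latin-1"):
    """Undo a wrong decode: recover the original bytes, then decode them again."""
    return text.encode(wrong).decode(right)

# The four garbled characters from the question: per mille sign,
# fl ligature, division sign, cedilla.
for seen in ("\u2030", "\ufb02", "\u00f7", "\u00b8"):
    print("%r -> %r" % (seen, reinterpret(seen)))
```

All four come back as the expected German letters (ä, ß, Ö, ü), which is consistent with a Latin-1 file. If that holds for the whole file, opening it with encoding="latin-1" in the first place is simpler than repairing each field afterwards.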

If you need help with producing a hex dump of the file, this is easy to do in Python; see e.g. "Pythonic way to hex dump files", but here's my quick and dirty attempt.

    import sys

    def chunk(filehandle):
        "Read filehandle, yield each 16-byte chunk max and its offset"
        offset = 0
        for data in iter(lambda: filehandle.read(16), b""):
            yield data, offset
            offset += 16

    with open(sys.argv[1], "rb") as handle:
        for data, offset in chunk(handle):
            # Hex column: two digits per byte, separated by spaces
            hexbytes = " ".join("%02x" % c for c in data)
            # Text column: printable ASCII as-is, everything else (including DEL) as "."
            s = "".join(chr(c) if 32 <= c < 127 else "." for c in data)
            print("%06x %-47s %s" % (offset, hexbytes, s))
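If the dump is too long to read by eye, a tally of the non-ASCII bytes can narrow things down. The following is a rough sketch, not a real detector: the function name `high_byte_report`, the list of candidate encodings, and the sample data are all made up for illustration, and it only makes sense for single-byte encodings (a UTF-8 file would show up as pairs of high bytes).

```python
from collections import Counter

def high_byte_report(data, encodings=("latin-1", "mac_roman", "cp1252")):
    """List each non-ASCII byte with its count and how each candidate encoding reads it."""
    counts = Counter(b for b in data if b >= 0x80)
    report = []
    for byte, count in counts.most_common():
        # errors="replace" covers bytes that an encoding leaves undefined
        decoded = {enc: bytes([byte]).decode(enc, errors="replace")
                   for enc in encodings}
        report.append((byte, count, decoded))
    return report

# Made-up sample; for a real file, pass in open(path, "rb").read() instead
sample = "Größe;Straße;Äpfel\n".encode("latin-1")
for byte, count, decoded in high_byte_report(sample):
    print("0x%02x x%d %s" % (byte, count, decoded))
```

The encoding whose column consistently shows plausible letters for the most frequent bytes is the likely candidate.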