How can I check a Python unicode string to see that it *actually* is proper Unicode?

2024/10/3 17:10:39

So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Apparently it's all kinds of messed up, as it gets decoded properly but when I try to save it in postgres I get:

DatabaseError: invalid byte sequence for encoding "UTF8": 0xedbdbf

The database clams up after that and refuses to do anything without a rollback, which will be a bit hard to issue (long story). Is there a way for me to check if this will happen before it hits the database? source.encode("utf-8") works without a hitch, so I'm not sure what's going on...

Answer

There is a bug in Python 2.x that is only fixed in Python 3.x. In fact, the same bug is present in OS X's iconv (but not in the glibc one).

Here's what's happening:

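Python 2.x does not reject UTF-8-encoded surrogates [1] as invalid, and that is what your byte sequence is: 0xED 0xBD 0xBF is the three-byte encoding of the surrogate code point U+DF7F, which is not allowed in well-formed UTF-8.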

In principle, a decode/encode round trip should be all that's needed to validate the data:

foo.decode('utf8').encode('utf8')

But because of that bug, the round trip doesn't catch encoded surrogates in Python 2.x: they decode and re-encode without any error.
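One way to check before the data reaches the database is to scan the decoded text for code points in the surrogate range (U+D800 to U+DFFF). The following is only a sketch: the names find_surrogates and strip_surrogates are made up for illustration, and it assumes Python 3 or a wide (UCS-4) Python 2 build. On a narrow Python 2 build, valid characters outside the BMP are stored as surrogate pairs, so this check would flag them as well.

```python
import re

# Matches any single code point in the surrogate range U+D800..U+DFFF.
_SURROGATE_RE = re.compile(u'[\ud800-\udfff]')


def find_surrogates(text):
    """Return a list of (index, code point) for each surrogate found in text."""
    return [(m.start(), ord(m.group())) for m in _SURROGATE_RE.finditer(text)]


def strip_surrogates(text, replacement=u'\ufffd'):
    """Replace each surrogate with U+FFFD (or another replacement of your choice)."""
    return _SURROGATE_RE.sub(replacement, text)
```

If find_surrogates(source) returns a non-empty list, the text will produce bytes that PostgreSQL rejects. You could either refuse the input at that point or pass it through strip_surrogates before saving.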

Try this in python 2.x and then in 3.x:

b'\xed\xbd\xbf'.decode('utf8')

It will throw an error (correctly) in the latter, while 2.x silently returns a string containing the surrogate. The bug is not being fixed in the 2.x branch. See [2] and [3] for more info.
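The difference can be seen from Python 3 alone, as a rough illustration. The helper name is_strict_utf8 is hypothetical. The 'surrogatepass' error handler is used here only to imitate the lenient decoding that Python 2.x performs by default.

```python
def is_strict_utf8(raw):
    """Return True if raw (bytes) is well-formed UTF-8 under Python 3's strict decoder."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


bad = b'\xed\xbd\xbf'

# Python 3 rejects the encoded surrogate outright.
strict_result = is_strict_utf8(bad)

# 'surrogatepass' lets the surrogate through, similar to what Python 2.x does.
leaked = bad.decode('utf-8', 'surrogatepass')

# Re-encoding the leaked surrogate with the strict codec fails in Python 3.
try:
    leaked.encode('utf-8')
    reencode_ok = True
except UnicodeEncodeError:
    reencode_ok = False
```

Here strict_result and reencode_ok are both False, and leaked is the single code point U+DF7F.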

[1] https://www.rfc-editor.org/rfc/rfc3629#section-4

[2] http://bugs.python.org/issue9133

[3] http://bugs.python.org/issue8271#msg102209
