I utf-8 encode characters like \u2013 before inserting them into SQLite.
When I pull them out with a SELECT, they are back in their unencoded form, so I need to re-encode them if I want to do anything with them. In this case, I want to write the rows to a CSV.Before writing the rows to CSV, I want to first add hyperlink to any row whose value starts with 'http'. Some values will be ints, dates etc, so I do the folliowing conditional expression - list comprehension combo:
row = ['=HYPERLINK("%s")' % cell if 'http' in str(cell) else cell for cell in row].
The str()
operation then results in the well-known:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' inposition 15: ordinal not in range(128) error.
What I need then is to perform the .encode('utf-8')
encoding again, but only on those elements in the lists that are strings to begin with. The following won't work (since not all elements are strings):
['=HYPERLINK("%s")' % cell if 'http' in str(cell).encode('utf8') else cell.encode('utf8') for cell in row]
TLDR: How do I expand /modify the list comprehension to only encode an element if it's a string?
In general, work in terms of unicode as long as possible, and encoded unicode to
bytes (i.e. str
s) only when necessary, such as writing output to a network
socket or file.
Do not mix str
s with unicode
-- although this is permitted in Python2,
it causes Python2 to implicitly convert str
to unicode
or vice versa as necessary using the ascii
codec. If the implicit encoding or decoding fails, then you get a UnicodeEncodingError or UnicodedDecodingError, respectively, such as the one you are seeing.
Since cell
is unicode, use u'=HYPERLINK("{}")'.format(cell)
or u'=HYPERLINK("%s")' % cell
instead of '=HYPERLINK("%s")' % cell
. (Note that you may want to url-encode cell
in case cell
contains a double quote).
row = [u'=HYPERLINK("{}")'.format(cell) if isinstance(cell, unicode) and cell.startswith(u'http') else cell for cell in row]
Later, when/if you need to convert row
to str
s, you could use
row = [cell.encode('utf-8') if isinstance(cell, unicode) else str(cell) for cell in row]
Alternatively, convert everything in row
to str
s first:
row = [cell.encode('utf-8') if isinstance(cell, unicode) else str(cell) for cell in row]
and then you could use
row = ['=HYPERLINK("{}")'.format(cell) if cell.startswith('http') else cell for cell in row]
Similarly, since row
contains cell
s which are unicode, perform the test
if u'http' in cell
using the unicode
u'http'
instead of the str
'http'
, or better yet,
if isinstance(cell, unicode) and cell.startswith(u'http')
Although no error arises if you keep 'http'
here (since the ascii
codec can decode bytes in the 0-127 range), it is a good practice to use u'http'
anyway since conforms to the rule never mix str
and unicode
, and supports mental clarity.