In Python:
>>> "\xc4\xe3".decode("gbk").encode("utf-8")
'\xe4\xbd\xa0'
>>> "\xc4\xe3".decode("gbk")
u'\u4f60'
From this we can draw two conclusions:
1. \xc4\xe3 in GBK = \xe4\xbd\xa0 in UTF-8
2. \xc4\xe3 in GBK = \x4f\x60 in Unicode (or, say, in UCS-2)
In R:
> iconv("\xc4\xe3",from="gbk",to="utf-8",toRaw=TRUE)
[[1]]
[1] e4 bd a0
> iconv("\xc4\xe3",from="gbk",to="unicode",toRaw=TRUE)
[[1]]
[1] ff fe 60 4f
Conclusion 1 holds: Python and R give the same result.
Conclusion 2 is the puzzle:
what on earth is \xc4\xe3 in GBK equal to in Unicode?
In Python it is u'\u4f60'; in R it is ff fe 60 4f.
Are they equal? Which one is correct? Are they both correct?
In Python, the \uxxxx notation refers to Unicode codepoints, not to any encoding of those codepoints.
UCS-2, UTF-16, UTF-8 are all encodings capable of capturing those codepoints in bytes suitable for storage in files, for transferring across a network, etc.
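To make the codepoint/encoding distinction concrete, here is a minimal sketch that encodes the single codepoint U+4F60 in several encodings. It uses Python 3 syntax (the session in the question is Python 2), and the names `ch`, `codepoint` and `encoded` are only illustrative:

```python
# Python 3: a str holds Unicode codepoints, bytes hold an encoding of them.
ch = "\u4f60"          # one codepoint, U+4F60
codepoint = ord(ch)    # the abstract number, independent of any encoding

# Encode the same codepoint with several encodings.
encoded = {
    name: ch.encode(name)
    for name in ("gbk", "utf-8", "utf-16-le", "utf-16-be")
}

print(hex(codepoint))
for name, data in encoded.items():
    # Each encoding yields a different byte sequence for the same codepoint.
    print(name, data.hex())
```

Every byte sequence printed is a different serialization of the same codepoint; none of them is "the Unicode" form.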
The R output for the \u4f60 codepoint is UTF-16 with a leading Byte Order Mark, or BOM. The BOM indicates which byte order was chosen: the bytes ff fe mean little-endian, so the 60 4f that follows is U+4F60 with its low byte first. Python includes the BOM too when you encode to UTF-16:
>>> u'\u4f60'.encode('utf16')
'\xff\xfe`O'
The big-endian equivalent is 0xFEFF (the bytes fe ff). You can explicitly encode to utf-16be or utf-16le in Python to avoid the BOM being included, because you've made an explicit choice of byte order:
>>> u'\u4f60'.encode('utf-16be')
'O`'
>>> u'\u4f60'.encode('utf-16le')
'`O'
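To confirm that the R and Python results describe the same character, one can decode the bytes R printed and compare them with what Python decoded from GBK. This is a small sketch in Python 3 syntax (the question's session is Python 2); the variable names are illustrative:

```python
import codecs

# The four bytes R printed for the "unicode" conversion.
r_bytes = bytes([0xFF, 0xFE, 0x60, 0x4F])

# What Python gets when decoding the original GBK bytes.
python_text = b"\xc4\xe3".decode("gbk")

# The first two bytes are the little-endian UTF-16 BOM.
has_le_bom = r_bytes.startswith(codecs.BOM_UTF16_LE)

# The generic "utf-16" codec reads the BOM, picks the byte order, and drops it.
r_text = r_bytes.decode("utf-16")

same = (r_text == python_text)
print(has_le_bom, hex(ord(r_text)), same)
```

Both sides decode to the single codepoint U+4F60, so both answers are correct: Python shows the codepoint, while R shows one particular byte encoding of it.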
You really should read the Joel Spolsky Unicode article, as well as the Python Unicode HOWTO, to appreciate more fully the difference between Unicode and encodings.