I'm looking into new languages, kind of craving for one where I no longer need to worry about charset problems amongst inordinate amounts of other niggles I have with PHP for a new project.
I tend to find Java too verbose and messy, and my not wanting to touch Windows with a 6-foot pole tends to rule out .Net. That leaves essentially everything else -- except PHP, C and C++ (the latter two of which I know get messy with unicode stuff irrespective of the ICU library).
I've short listed a few languages to date, namely Ruby (loved the mixins), Python, Lisp and Javascript (node.js). However, I'm coming with highly inconsistent information on unicode support and I'm dreading (lack of time...) to learn each and every one of them to the point where I can safely break it to rule it out.
In so far as I understood, Python 3 seems to have it. As does Ruby 1.9. Lisp not necessarily. Javascript presumably.
There's arguably more than unicode support to a language, but in my experience it tends to become a major drawback when dealing with locale.
I also realize the question is somewhat subjective. (Please don't close it on that grounds: I'm actually linking to several SO threads which I found unsatisfying.) But... as a user of any of these languages, how well do they support unicode in practice?
Python's unicode support did not really change in 3.x. The unicode support in Python has been pretty much the same since Python 2.x, which introduced the separate unicode
type and the encoding handling. What Python 3.x changes is that unicode becomes the only string type (and is renamed to str
), whereas 2.x has bytestrings (str
, "..."
) and unicode strings (unicode
, u"..."
) that often but not always don't quite mix. (Allowing them to mix was an attempt to make transitioning from bytestrings to unicode easier, but it turned out a mistake.) All in all, Python's unicode support is quite good, mistakes in Python 2.x notwithstanding. There's unicode literals with numeric and named escapes, source-encoding declarations for non-ASCII characters in unicode literals, automatic encoding/decoding through the codecs
module, unicode support in many libraries (like the regular expression and DB-API modules) and a builtin unicode database.
That said, you still need to know about encodings in order to handle text correctly. Your program will receive bytes in some encoding (be it from files, from environment variables or through other input) and they will need to be interpreted in that encoding. If you don't know the encoding (and can't determine it from the data, like in HTML or XML) you can really only process the data as bytes. If you do know the encoding, Python does allow you to deal with it mostly transparently.