Question 1

E.g., for the character "a", I want to get a string (list of chars) like "aàáâãäåāăą" (not sure if that example list is complete...) (basically all unicode chars with names "Latin Small Letter A with *").

Is there a generic way to get this?

I'm asking for Python, but if the answer is more generic, this is also fine, although I would appreciate a Python code snippet in any case. Python >=3.5 is fine. But I guess you need to have access to the Unicode database, e.g. the Python module unicodedata, which I would prefer over other external data sources.

I could imagine some solution like this:

def get_variations(char):
    import unicodedata
    name = unicodedata.name(char)
    chars = char
    for variation in ["WITH CEDILLA", "WITH MACRON", ...]:
        try:
            chars += unicodedata.lookup("%s %s" % (name, variation))
        except KeyError:
            pass
    return chars

Question 2

To start, get a collection of the Unicode combining diacritical characters; they're contiguous, so this is pretty easy, e.g.:

# Unicode combining diacritical marks run from 768 to 879, inclusive
combining_chars = ''.join(map(chr, range(768, 880)))
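As a quick sanity check on that range, the sketch below rebuilds the same string using hex bounds (768 is U+0300, 879 is U+036F), confirms that every code point in it is a nonspacing mark, and prints the names of the first few. It only uses unicodedata.category and unicodedata.name from the standard library.

```python
import unicodedata

# Same range as above: U+0300 (768) through U+036F (879), inclusive.
combining_chars = ''.join(map(chr, range(0x300, 0x370)))

# Every code point in this block has general category "Mn" (nonspacing mark).
assert all(unicodedata.category(c) == 'Mn' for c in combining_chars)

# Show the first few marks with their code points and official names.
for c in combining_chars[:5]:
    print('U+%04X %s' % (ord(c), unicodedata.name(c)))
```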

Now define a function that attempts to compose each one with a base ASCII character; when the composed normal form is length 1 (meaning the ASCII + combining became a single Unicode ordinal), save it:

import unicodedata

def get_unicode_variations(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # We could just loop over map(chr, range(768, 880)) without caching
    # in combining_chars, but that increases runtime ~20%
    for combiner in combining_chars:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        if len(normalized) == 1:
            variations.append(normalized)
    return ''.join(variations)
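One thing to watch with this approach: a few marks in that range are canonical equivalents of others (U+0340 normalizes to U+0300, for instance), so the same composed character can come back more than once. The sketch below is a hypothetical variant, get_unique_variations (not part of the answer above), that runs the same NFKC check but drops repeats while keeping first-seen order. It relies on dict ordering, which is only guaranteed from Python 3.7 on; on 3.5 you would need collections.OrderedDict instead.

```python
import unicodedata

combining_chars = ''.join(map(chr, range(768, 880)))

def get_unique_variations(letter):
    """Hypothetical variant of get_unicode_variations that drops duplicates."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    composed = (unicodedata.normalize('NFKC', letter + c) for c in combining_chars)
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    # (guaranteed from Python 3.7 on).
    return ''.join(dict.fromkeys(s for s in composed if len(s) == 1))

print(get_unique_variations('a'))
```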

This has the advantage of not trying to manually perform string lookups in the unicodedata DB, and not needing to hardcode all possible descriptions of the combining characters. Anything that composes to a single character gets included; runtime for the check on my machine comes in under 50 µs, so if you're not doing this too often, the cost is reasonable (you could decorate with functools.lru_cache if you intend to call it repeatedly with the same arguments and want to avoid recomputing it every time).
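For the caching suggestion, a minimal sketch of what the decoration could look like. The name cached_variations is made up for illustration, and the body is a condensed form of the same NFKC check without the length validation; cache_info() is the standard way to see whether a call was served from the cache.

```python
import functools
import unicodedata

combining_chars = ''.join(map(chr, range(768, 880)))

@functools.lru_cache(maxsize=None)
def cached_variations(letter):
    # Same single-character NFKC check as above, memoized per letter.
    return ''.join(
        n for n in (unicodedata.normalize('NFKC', letter + c) for c in combining_chars)
        if len(n) == 1
    )

cached_variations('e')  # first call: computed
cached_variations('e')  # second call: returned from the cache
print(cached_variations.cache_info())
```

maxsize=None means the cache never evicts anything, which should be fine here since the set of distinct single-character arguments is bounded.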

If you want to get everything built out of one of these characters, a more exhaustive search can find it, but it'll take longer (functools.lru_cache would be nigh mandatory unless it's only ever called once per argument):

import functools
import sys
import unicodedata

@functools.lru_cache(maxsize=None)
def get_unicode_variations_exhaustive(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # sys.maxunicode is itself a valid code point, so the range has to include it
    for testlet in map(chr, range(sys.maxunicode + 1)):
        if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter:
            variations.append(testlet)
    return ''.join(variations)

This looks for any character that decomposes into a form that includes the target letter. It does mean that searching the first time takes roughly a third of a second, and the result includes stuff that isn't really just a modified version of the character (e.g. the result for 'L' will include ℡, which isn't really a "modified 'L'"), but it's as exhaustive as you can get.
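If the extra hits are a problem, one option is to go back to the name-based idea from the question and keep only characters whose Unicode name is the base character's name followed by " WITH ". The sketch below is one way that could look; get_named_variations is a hypothetical helper, not something from unicodedata. Unlike the decomposition search, it may also pick up letters that merely share the name pattern without decomposing to the base letter.

```python
import sys
import unicodedata

def get_named_variations(char):
    """Hypothetical helper: characters named '<NAME OF char> WITH ...'."""
    prefix = unicodedata.name(char) + ' WITH '
    return ''.join(
        c for c in map(chr, range(sys.maxunicode + 1))
        # name(c, '') returns '' for unnamed code points instead of raising
        if unicodedata.name(c, '').startswith(prefix)
    )

print(get_named_variations('a'))
```

Like the exhaustive search, this walks every code point, so wrapping it in functools.lru_cache is worth considering if it is called repeatedly.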

get all unicode variations of a latin character
