get all unicode variations of a latin character

2024/10/18 15:36:19

E.g., for the character "a", I want to get a string (list of chars) like "aàáâãäåāăą" (not sure if that example list is complete...) (basically all unicode chars with names "Latin Small Letter A with *").

Is there a generic way to get this?

I'm asking for Python, but if the answer is more generic, this is also fine, although I would appreciate a Python code snippet in any case. Python >=3.5 is fine. But I guess you need to have access to the Unicode database, e.g. the Python module unicodedata, which I would prefer over other external data sources.

I could imagine some solution like this:

def get_variations(char):
    import unicodedata
    name = unicodedata.name(char)
    chars = char
    for variation in ["WITH CEDILLA", "WITH MACRON", ...]:
        try:
            chars += unicodedata.lookup("%s %s" % (name, variation))
        except KeyError:
            pass
    return chars

Answer

To start, get a collection of the Unicode combining diacritical characters; they're contiguous, so this is pretty easy, e.g.:

# Unicode combining diacritical marks run from 768 to 879, inclusive
combining_chars = ''.join(map(chr, range(768, 880)))
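
To make "composing" concrete, here is a minimal sketch (not part of the original answer) that tries a single combining mark, U+0301, on two base letters. It assumes only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

acute = chr(0x301)  # COMBINING ACUTE ACCENT, inside the 768..879 range

# 'a' + acute has a precomposed form, so normalization yields one code point
composed = unicodedata.normalize('NFKC', 'a' + acute)
print(len(composed), unicodedata.name(composed))

# 'q' + acute has no precomposed form, so both code points survive
uncomposed = unicodedata.normalize('NFKC', 'q' + acute)
print(len(uncomposed))
```

A length of 1 after normalization is the signal that a precomposed variation of the base letter exists.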

Now define a function that attempts to compose each one with a base ASCII character; when the composed normal form has length 1 (meaning the ASCII character plus the combining mark became a single Unicode code point), save it:

import unicodedata

def get_unicode_variations(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # We could just loop over map(chr, range(768, 880)) without caching
    # in combining_chars, but that increases runtime ~20%
    for combiner in combining_chars:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        if len(normalized) == 1:
            variations.append(normalized)
    return ''.join(variations)

This has the advantage of not trying to manually perform string lookups in the unicodedata DB, and not needing to hardcode all possible descriptions of the combining characters. Anything that composes to a single character gets included; runtime for the check on my machine comes in under 50 µs, so if you're not doing this too often, the cost is reasonable (you could decorate with functools.lru_cache if you intend to call it repeatedly with the same arguments and want to avoid recomputing it every time).
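
As a usage sketch for the approach above: a few marks in the 768 to 879 range (for instance U+0340 and U+0341) are canonically equivalent to other marks in the same range, so the same composed letter can appear more than once in the result. The variant below is an assumption-laden illustration, not the original function: get_unique_variations and COMBINING_CHARS are hypothetical names, and the de-duplication relies on dict preserving insertion order, which is only guaranteed from Python 3.7 on.

```python
import unicodedata

# Combining Diacritical Marks block, U+0300..U+036F (768..879 inclusive)
COMBINING_CHARS = ''.join(map(chr, range(0x300, 0x370)))

def get_unique_variations(letter):
    """Hypothetical variant of the composing approach that drops duplicates."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    composed = (unicodedata.normalize('NFKC', letter + c) for c in COMBINING_CHARS)
    # dict.fromkeys keeps first-seen order while discarding repeats
    return ''.join(dict.fromkeys(ch for ch in composed if len(ch) == 1))

# Print each variation with its code point and official Unicode name
for ch in get_unique_variations('a'):
    print(ch, 'U+%04X' % ord(ch), unicodedata.name(ch))
```

Printing the names makes it easy to check the result against the "LATIN SMALL LETTER A WITH ..." pattern from the question.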

If you want to get everything built out of one of these characters, a more exhaustive search can find it, but it'll take longer (functools.lru_cache would be nigh mandatory unless it's only ever called once per argument):

import functools
import sys
import unicodedata

@functools.lru_cache(maxsize=None)
def get_unicode_variations_exhaustive(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    for testlet in map(chr, range(sys.maxunicode)):
        if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter:
            variations.append(testlet)
    return ''.join(variations)

This looks for any character that decomposes into a form that includes the target letter. It does mean that the first search for a given letter takes roughly a third of a second, and the result includes stuff that isn't really just a modified version of the character (e.g. the result for 'L' will include compatibility characters whose decomposition merely contains an L, which aren't really a "modified 'L'"), but it's as exhaustive as you can get.
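
If the extra matches are a problem, one possible alternative is to filter by character name, which is closer to the "Latin Small Letter A with *" wording in the question. This is only a sketch under that assumption: get_named_variations is a hypothetical helper, not part of the answer above, and it scans every code point, so caching it may be worthwhile.

```python
import sys
import unicodedata

def get_named_variations(letter):
    """Hypothetical helper: characters named '<NAME OF letter> WITH ...'."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    # e.g. 'LATIN SMALL LETTER A WITH ' for letter == 'a'
    prefix = unicodedata.name(letter) + ' WITH '
    return ''.join(
        ch for ch in map(chr, range(sys.maxunicode + 1))
        # name(ch, '') returns '' for unnamed code points instead of raising
        if unicodedata.name(ch, '').startswith(prefix)
    )

print(get_named_variations('a'))
```

The trade-off is that this depends on naming conventions rather than decompositions, so it may also pick up letters whose names match but which do not decompose to the base letter at all.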
