Get the most relevant word (spell check) from enchant suggest() in Python

2024/9/18 18:26:18

I want to get the most relevant word from enchant's suggest(). Is there a better way to do that? My function does not feel efficient when checking a large set of words, in the range of 100k or more.

Problem with enchant suggest():

>>> import enchant
>>> d = enchant.Dict("en_US")  # assumed dictionary; the original did not show how d was created
>>> d.suggest("prfomnc")
['prominence', 'performance', 'preform', 'Provence', 'preferment', 'proforma']

My function to get the appropriate word from a set of suggested words:

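import difflib
import enchant

d = enchant.Dict("en_US")  # assumed dictionary; d was not defined in the original
word = "prfomnc"
scores, best = {}, 0  # renamed to avoid shadowing the built-ins dict and max
for candidate in set(d.suggest(word)):
    ratio = difflib.SequenceMatcher(None, word, candidate).ratio()
    scores[ratio] = candidate  # note: candidates with equal ratios overwrite each other
    if ratio > best:
        best = ratio
print(scores[best])

Result: performance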

Update:

If several suggestions share the same difflib ratio() value (that is, I get duplicate keys), I use a multi-value dictionary, as explained here: http://code.activestate.com/recipes/440502-a-dictionary-with-multiple-values-for-each-key/
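A lighter alternative to the multi-value dictionary recipe might be to group candidates by score with collections.defaultdict. This is only a minimal sketch: the suggestion list is hard-coded from the output shown above so that it runs without Enchant, and group_by_ratio is a hypothetical helper name.

```python
import difflib
from collections import defaultdict

def group_by_ratio(word, candidates):
    """Map each similarity ratio to the list of candidates that scored it."""
    groups = defaultdict(list)
    for candidate in candidates:
        ratio = difflib.SequenceMatcher(None, word, candidate).ratio()
        groups[ratio].append(candidate)  # ties end up in the same list
    return groups

# Suggestions copied from the question; in real use they would come from d.suggest(word).
suggestions = ['prominence', 'performance', 'preform',
               'Provence', 'preferment', 'proforma']

groups = group_by_ratio("prfomnc", suggestions)
best_score = max(groups)
print(groups[best_score])  # every candidate that shares the top ratio
```

How to break a tie (first in Enchant's order, shortest word, and so on) is left open here.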

Answer

No magic bullet, I'm afraid... a few suggestions however.

I'm guessing that most of the time is spent in difflib's SequenceMatcher().ratio() call. This wouldn't be surprising, since the method uses a variation of the Ratcliff-Obershelp algorithm, which is relatively expensive CPU-wise (but the metric it produces is rather "on the mark" for locating close matches, which is probably why you like it).

To be sure, you should profile this logic and confirm that SequenceMatcher() is indeed the hot spot. Enchant's suggest() may also be a bit slow, but there is little we could do code-wise to improve that (configuration-wise there may be a few options, e.g. doing away with the personal dictionary to save the double look-up and merge).
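One possible way to run that check is with the standard library's cProfile. The sketch below makes some assumptions: fake_suggest is a hypothetical stand-in for d.suggest() so the snippet runs without Enchant installed, and best_match is an illustrative name for the selection loop. Pass the real d.suggest in to measure Enchant's share of the time as well.

```python
import cProfile
import difflib
import io
import pstats

def fake_suggest(word):
    # Hypothetical stand-in for d.suggest(word); pass the real Enchant call instead.
    return ['prominence', 'performance', 'preform',
            'Provence', 'preferment', 'proforma']

def best_match(word, suggest=fake_suggest):
    """Return the suggestion with the highest SequenceMatcher ratio."""
    best, best_ratio = None, 0.0
    for candidate in set(suggest(word)):
        ratio = difflib.SequenceMatcher(None, word, candidate).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):  # repeat so the timings are not dominated by noise
    best_match("prfomnc")
profiler.disable()

# Print the ten most expensive calls by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

If the difflib entries dominate the cumulative column, the suggestions that follow are worth trying; if suggest() dominates, they will not help much.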

Assuming that SequenceMatcher() is indeed the culprit, and assuming that you wish to stick with the Ratcliff-Obershelp similarity metric as the way to select the best match, you could do [some of] the following:

  • only compute the SequenceMatcher ratio for the top (?) 5 items from Enchant.
    After all, Enchant.suggest() returns its suggestions in order, with its best guesses first. Although that order is based on different heuristics, it has value too: the chances of finding a high-ranking match probably diminish as we move down the list. And even though we may end up ignoring a few high-ranking matches, testing only the top few Enchant suggestions in effect combines the "wisdom" of Enchant's heuristics with that of the Ratcliff-Obershelp metric.
  • stop computing the SequenceMatcher ratio after a certain threshold has been reached.
    The idea is similar to the preceding one: avoid calling SequenceMatcher once the odds of finding a better match are getting small (and once we have a decent, if not the best, choice in hand).
  • filter out some of the words from Enchant with your own logic.
    The idea is to have a relatively quick, inexpensive test which tells us that a given word is unlikely to score well on the SequenceMatcher ratio. For example, exclude words which do not share at least, say, (length of the user's string minus two) characters with it.
    BTW, you may be able to use some of the SequenceMatcher object's [quicker] functions to get data for the filtering heuristics.
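  • use SequenceMatcher's quick_ratio() function instead, at least in some cases.
    It returns an upper bound on ratio() relatively quickly, so it can rule a candidate out without the full computation.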
  • only keep the best match, in a string, rather than using a dictionary.
    Apparently only the top choice matters, so except for testing purposes you may not need the [relatively small] overhead of the dictionary.
  • consider writing your own Ratcliff-Obershelp (or similar) method, with various early exits for when the prospect of beating the current max ratio is small. BEWARE: difflib is itself written in pure Python, so the gain would not come from a faster core loop; your interest in doing this lies only in the early exits, and it may be hard to do better than difflib's well-tested code overall.
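Several of the suggestions above (top-N only, early stop on a threshold, cheap pre-filtering with the quicker ratio functions, and keeping only the best match) can be combined in a few lines. This is a sketch under stated assumptions, not a tuned implementation: pick_best, top_n and good_enough are hypothetical names and values, and the suggestion list is hard-coded from the question so the snippet runs without Enchant. It relies on the documented property that real_quick_ratio() and quick_ratio() return upper bounds on ratio().

```python
import difflib

def pick_best(word, suggestions, top_n=5, good_enough=0.9):
    """Return (best_candidate, ratio) among the first top_n suggestions.

    top_n and good_enough are illustrative defaults, not recommended values.
    """
    best, best_ratio = None, 0.0
    for candidate in suggestions[:top_n]:      # trust Enchant's own ordering
        matcher = difflib.SequenceMatcher(None, word, candidate)
        # Cheap upper bounds first: if even the bound cannot beat the
        # current best, the expensive ratio() call is skipped entirely.
        if matcher.real_quick_ratio() <= best_ratio:
            continue
        if matcher.quick_ratio() <= best_ratio:
            continue
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio  # keep only the best match
            if best_ratio >= good_enough:        # decent match in hand: stop
                break
    return best, best_ratio

# Suggestions copied from the question; normally this would be d.suggest(word).
suggestions = ['prominence', 'performance', 'preform',
               'Provence', 'preferment', 'proforma']
best, score = pick_best("prfomnc", suggestions)
print(best, round(score, 3))
```

Because the two quick functions only ever over-estimate, the pre-filter never discards a candidate that would have beaten the current best; the top_n cut-off and the good_enough threshold, on the other hand, do trade accuracy for speed.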

HTH, good luck ;-)
