How to do fuzzy string search without a heavy database?

2024/10/12 19:15:44

I have a mapping of catalog numbers to product names:

35  cozy comforter
35  warm blanket
67  pillow

and need a search that can match misspelled queries that mix words from several names, like "warm cmfrter".

We have code that compares strings with difflib, but checking a query against all 18,000 names probably won't scale.
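For reference, a brute-force difflib search of the kind described above might look like the sketch below. The `NAMES` list and the `brute_force_search` helper are hypothetical and only illustrate why the cost grows with the catalog: every query is compared against every name.

```python
import difflib

# Hypothetical sample data taken from the mapping above.
NAMES = ["cozy comforter", "warm blanket", "pillow"]

def brute_force_search(query, names, limit=3):
    """Score the query against every name; cost is linear in len(names)."""
    query = query.lower()
    scored = [
        (difflib.SequenceMatcher(None, query, name.lower()).ratio(), name)
        for name in names
    ]
    # Highest similarity ratio first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]

best = brute_force_search("pilow", NAMES)
```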

I achieved something similar with Lucene, but since PyLucene only wraps the Java library, that would complicate deployment to end users.

SQLite doesn't usually have full-text or scoring compiled in.

The Xapian bindings feel like C++ and have some learning curve.

Whoosh is not yet well-documented but includes an abusable spell-checker.

What else is there?

Answer

Apparently the only way to make fuzzy comparisons fast is to do fewer of them ;)

Instead of writing another n-gram search or improving the one in Whoosh, we now keep a word index, retrieve all entries that have at least one (correctly spelled) word in common with the query, and use difflib to rank those. It works well enough in this case.
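A minimal sketch of that approach, assuming a small in-memory catalog, might look like this. The names `CATALOG`, `build_word_index`, and `search` are made up for illustration and are not the actual code; only `difflib.SequenceMatcher` comes from the standard library.

```python
import difflib
from collections import defaultdict

# Hypothetical catalog in the (number, name) shape from the question.
CATALOG = [
    (35, "cozy comforter"),
    (35, "warm blanket"),
    (67, "pillow"),
]

def build_word_index(catalog):
    """Map each lowercase word to the positions of the entries containing it."""
    index = defaultdict(set)
    for pos, (_number, name) in enumerate(catalog):
        for word in name.lower().split():
            index[word].add(pos)
    return index

def search(query, catalog, index, limit=5):
    """Return up to `limit` (score, number, name) tuples, best match first."""
    query = query.lower()
    # Candidate retrieval: entries sharing at least one exact word with the query.
    candidates = set()
    for word in query.split():
        candidates |= index.get(word, set())
    # Ranking: difflib only runs on the (usually small) candidate set.
    scored = []
    for pos in candidates:
        number, name = catalog[pos]
        score = difflib.SequenceMatcher(None, query, name.lower()).ratio()
        scored.append((score, number, name))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]

INDEX = build_word_index(CATALOG)
results = search("warm cmfrter", CATALOG, INDEX)
```

Here "warm cmfrter" reaches catalog number 35 through the correctly spelled word "warm". The trade-off is that a query with no correctly spelled word, such as "pilow", retrieves no candidates at all.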

