Generate misspelled words (typos)

2024/10/8 6:19:52

I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.

Let's say I have a document containing the text:

{"text": "The quick brown fox jumps over the lazy dog"}

I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".

In other words, I want to add noise to strings to generate misspelled words (typos).

What would be a way of automatically generating words with typos for evaluating fuzzy search?

Answer

I would write a small program that randomly alters letters in your words. You can adapt it to the specific requirements of your case, but the general idea goes like this.

Say you have a phrase

phrase = "The quick brown fox jumps over the lazy dog"

Then define a probability for a word to change (say 10%)

p = 0.1

Then loop over the words of your phrase and, for each one, draw a number from a uniform distribution on [0, 1). If the draw is at or below your threshold, replace one randomly chosen letter in the word:

import random
import string

new_phrase = []
words = phrase.split(' ')
for word in words:
    outcome = random.random()
    if outcome <= p:
        # Pick one position and replace its character with a random ASCII letter
        ix = random.choice(range(len(word)))
        new_word = ''.join([word[w] if w != ix else random.choice(string.ascii_letters) for w in range(len(word))])
        new_phrase.append(new_word)
    else:
        # Leave the word unchanged
        new_phrase.append(word)

new_phrase = ' '.join([w for w in new_phrase])

In my case I got the following result:

print(new_phrase)
'The quick brown fWx jumps ovey the lazy dog'
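
If substitution alone is too narrow for your tests, the same idea can be extended to other common typo types. The following is only a sketch, not part of the original answer: the function names `add_typo` and `add_noise`, the choice of four edit operations (substitute, delete, insert, transpose), and the `seed` parameter are all assumptions made for illustration. Seeding the generator makes the noisy queries reproducible, which helps when comparing recall between runs.

```python
import random
import string


def add_typo(word, rng):
    """Apply one random single-character edit to word (hypothetical helper)."""
    if len(word) < 2:
        return word  # too short to edit without emptying or replacing it entirely
    op = rng.choice(["substitute", "delete", "insert", "transpose"])
    i = rng.randrange(len(word))
    letter = rng.choice(string.ascii_lowercase)
    if op == "substitute":
        # May pick the same letter again, leaving the word unchanged
        return word[:i] + letter + word[i + 1:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        # Inserts before position i, so never after the last character
        return word[:i] + letter + word[i:]
    # transpose: swap the characters at positions j and j + 1
    j = min(i, len(word) - 2)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def add_noise(phrase, p=0.1, seed=None):
    """Give each space-separated word a typo with probability p."""
    rng = random.Random(seed)  # a fixed seed gives a reproducible result
    return " ".join(
        add_typo(word, rng) if rng.random() < p else word
        for word in phrase.split(" ")
    )


phrase = "The quick brown fox jumps over the lazy dog"
print(add_noise(phrase, p=0.3, seed=42))
```

Each altered word is one edit away from the original, so you can feed the noisy phrases to your fuzzy search and count how often the original document comes back.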
