Question 1

I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.

Let's say I have a document containing the text:

{"text": "The quick brown fox jumps over the lazy dog"}

I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".

In other words, I want to add noise to strings to generate misspelled words (typos).

What would be a way of automatically generating words with typos for evaluating fuzzy search?

Answer

I would write a small program that randomly alters letters in your words. You can elaborate on it for the specific requirements of your case, but the general idea goes like this.

Say you have a phrase:

phrase = "The quick brown fox jumps over the lazy dog"

Then define the probability that a given word is changed (say 10%):

p = 0.1

Then loop over the words of your phrase and, for each one, draw a sample from a uniform distribution on [0, 1). If the sample is at or below your threshold, replace one randomly chosen position in the word with a random ASCII letter:

import string
import random

new_phrase = []
words = phrase.split(' ')
for word in words:
    outcome = random.random()
    if outcome <= p:
        # pick one position and replace the letter there with a random one
        ix = random.choice(range(len(word)))
        new_word = ''.join([word[w] if w != ix else random.choice(string.ascii_letters) for w in range(len(word))])
        new_phrase.append(new_word)
    else:
        new_phrase.append(word)

new_phrase = ' '.join(new_phrase)

In my case I got the following result:

print(new_phrase)
The quick brown fWx jumps ovey the lazy dog
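
If you need typos closer to the examples in the question ("hazy drog" contains an inserted letter, not only a substituted one), the same idea can be extended with more edit types. The following is only a sketch under my own assumptions: the function names add_typo and add_noise, the four edit operations, and the seed parameter are illustrative choices and not part of the code above.

```python
import random
import string


def add_typo(word, rng):
    """Apply one random edit (substitute, insert, delete or transpose) to word."""
    if not word:
        return word  # nothing to edit, e.g. after consecutive spaces
    ops = ["substitute", "insert"]
    if len(word) > 1:
        # deleting or transposing needs at least two characters
        ops += ["delete", "transpose"]
    op = rng.choice(ops)
    letters = string.ascii_lowercase
    if op == "substitute":
        i = rng.randrange(len(word))
        # choose a replacement that differs from the original character
        replacement = rng.choice([c for c in letters if c != word[i]])
        return word[:i] + replacement + word[i + 1:]
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(letters) + word[i:]
    if op == "delete":
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    # transpose: swap two adjacent characters
    # (this leaves the word unchanged if the two characters are equal)
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def add_noise(phrase, p=0.1, seed=None):
    """Return phrase with each word altered by one typo with probability p."""
    rng = random.Random(seed)  # a fixed seed makes the test queries reproducible
    return " ".join(
        add_typo(w, rng) if rng.random() < p else w
        for w in phrase.split(" ")
    )


phrase = "The quick brown fox jumps over the lazy dog"
print(add_noise(phrase, p=0.3, seed=42))
```

Passing a fixed seed lets you regenerate the same noisy queries every time, so recall numbers stay comparable between runs of the evaluation.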
