I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.
Let's say I have a document containing the text:
{"text": "The quick brown fox jumps over the lazy dog"}
I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".
In other words, I want to add noise to strings to generate misspelled words (typos).
What would be a way of automatically generating words with typos for evaluating fuzzy search?
I would write a small program that randomly alters letters in your words. You can adapt it to the specific requirements of your case, but the general idea goes like this.
Say you have a phrase
phrase = "The quick brown fox jumps over the lazy dog"
Then define a probability for a word to change (say 10%)
p = 0.1
Then loop over the words of your phrase and, for each one, draw a sample from a uniform distribution on [0, 1). If the sample is at or below your threshold, replace one randomly chosen letter of the word with a random letter:
import string
import random

new_phrase = []
words = phrase.split(' ')
for word in words:
    outcome = random.random()
    if outcome <= p:
        ix = random.choice(range(len(word)))
        new_word = ''.join([word[w] if w != ix else random.choice(string.ascii_letters) for w in range(len(word))])
        new_phrase.append(new_word)
    else:
        new_phrase.append(word)

new_phrase = ' '.join([w for w in new_phrase])
In my case, I got the following result:
print(new_phrase)
The quick brown fWx jumps ovey the lazy dog
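If you want to reuse this on many queries, the same idea can be wrapped in a function. The following is only a sketch under my own assumptions: the name add_typos, the seed parameter, and the choice of four edit types (substitute, delete, insert, swap adjacent letters) are illustrative and not taken from any library. Some edits can be no-ops, for example substituting a letter with itself or swapping in a one-letter word.

```python
import random
import string


def add_typos(phrase, p=0.1, seed=None):
    """Return a copy of `phrase` where each word gets one random edit with probability p.

    Hypothetical helper: the name, parameters and edit types are illustrative assumptions.
    """
    rng = random.Random(seed)  # local generator, reproducible when a seed is given
    noisy_words = []
    for word in phrase.split(' '):
        # Keep empty tokens (from repeated spaces) and words not selected for a typo
        if not word or rng.random() >= p:
            noisy_words.append(word)
            continue
        ix = rng.randrange(len(word))                # position of the edit
        letter = rng.choice(string.ascii_lowercase)  # letter used by substitute/insert
        op = rng.choice(['substitute', 'delete', 'insert', 'swap'])
        if op == 'substitute':
            word = word[:ix] + letter + word[ix + 1:]
        elif op == 'delete':
            word = word[:ix] + word[ix + 1:]
        elif op == 'insert':
            word = word[:ix] + letter + word[ix:]
        elif len(word) > 1:
            # Swap two adjacent letters; clamp ix so that ix + 1 stays in range
            ix = min(ix, len(word) - 2)
            word = word[:ix] + word[ix + 1] + word[ix] + word[ix + 2:]
        noisy_words.append(word)
    return ' '.join(noisy_words)


if __name__ == '__main__':
    phrase = "The quick brown fox jumps over the lazy dog"
    print(add_typos(phrase, p=0.3, seed=42))
```

Passing a fixed seed gives the same noisy queries on every run, which helps when you compare recall between versions of your matcher. Raising p produces harder test queries.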