I want to remove data that is suspicious (i.e. has too many repeated or successive digits), like 1232131321313, 1122334455667, 1234567890123, etc. I am very tolerant of false negatives, but want to avoid false positives (bad bar codes) as much as possible.
Answer
If you're worried about repeated and successive digits, you can take np.diff of the digits and compare the result against a triangular distribution using a Kolmogorov-Smirnov test. For a random number, the difference between successive digits should approximately follow a triangular distribution between -10 and 10, with its maximum at 0 (the actual differences are integers from -9 to 9, so the continuous triangle is an approximation):
import numpy as np
import scipy.stats as stat
t = stat.triang(.5, loc = -10, scale = 20) # triangle on [-10, 10], peak at 0
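As a sanity check on the triangular shape, a minimal sketch can enumerate all 100 ordered pairs of digits and count each difference. It uses only the standard library, and the function name `digit_difference_counts` is made up for illustration.

```python
from collections import Counter
from itertools import product

def digit_difference_counts():
    """Count b - a over all 100 ordered pairs of digits (a, b)."""
    return Counter(b - a for a, b in product(range(10), repeat=2))

counts = digit_difference_counts()
for d in range(-9, 10):
    # Each difference d occurs 10 - |d| times: a discrete triangle peaking at 0.
    print(f"{d:+d}: {counts[d]:2d} {'#' * counts[d]}")
```

The counts rise linearly from 1 at -9 to 10 at 0 and fall back to 1 at 9, which is the discrete shape that the continuous triangle on [-10, 10] approximates.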
Turning the bar codes into an array:
a = np.array(list(map(list, map(str, a))), dtype = int) # however you get `a` out of your dataframe
then build a mask with
np.array([stat.kstest(i, t.cdf).pvalue > .5 for i in np.diff(a, axis = 1)])
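Putting the steps so far together, here is a minimal end-to-end sketch run on the three suspicious codes from the question. The helper name `keep_mask` and its `threshold` parameter are hypothetical, and the input is assumed to be a list of equal-length digit strings rather than a dataframe column.

```python
import numpy as np
import scipy.stats as stat

# Continuous triangle on [-10, 10] peaking at 0, as in the answer.
t = stat.triang(.5, loc=-10, scale=20)

def keep_mask(codes, threshold=.5):
    """Return True for codes whose digit differences look random enough to keep."""
    # One row per code, one column per digit (all codes must have equal length).
    digits = np.array([[int(ch) for ch in str(c)] for c in codes])
    diffs = np.diff(digits, axis=1)
    # KS test of each row of differences against the triangular CDF.
    return np.array([stat.kstest(row, t.cdf).pvalue > threshold for row in diffs])

suspicious = ["1232131321313", "1122334455667", "1234567890123"]
mask = keep_mask(suspicious)
print(mask)  # all three fall below the threshold, so they are flagged for removal
```

A `True` entry means "keep", so the mask can index the original data directly.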
testing:
np.array([stat.kstest(j, t.cdf).pvalue > .5 for j in np.diff(np.random.randint(0, 10, (1000, 13)), axis = 1)]).sum()
Out: 720
You'll have about a 30% false negative rate (in the run above, 280 of the 1000 random codes were rejected), but a p-value threshold of .5 should pretty much guarantee that the values you keep don't have too many successive or repeated digits. If you want to be really sure you've eliminated anything suspicious, you may also want to KS test the actual digits against stat.uniform(scale = 10) (to eliminate 1213141516171 and similar).
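The extra check on the raw digits could look like the following sketch. The helper name `digits_look_uniform` and its `threshold` parameter are hypothetical, and because digits are discrete while the uniform distribution is continuous, the test is only approximate.

```python
import numpy as np
import scipy.stats as stat

# Continuous uniform on [0, 10], as suggested in the answer.
u = stat.uniform(scale=10)

def digits_look_uniform(code, threshold=.5):
    """Return True if the digits of `code` pass a KS test against the uniform CDF."""
    digits = np.array([int(ch) for ch in str(code)])
    return bool(stat.kstest(digits, u.cdf).pvalue > threshold)

# More than half of the digits are 1, so this code fails the check.
print(digits_look_uniform("1213141516171"))
```

Combining this result with the difference-based mask using a logical AND would keep only codes that pass both checks, at the cost of rejecting more legitimate codes.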