Question 1

Here's some bar code data from a pandas DataFrame:

737318  Sikat Botol Pigeon          4902508045506   75170
737379  Natur Manual Breast Pump    8850851860016   75170
738753  Sunlight                    1232131321313   75261
739287  Bodymist bodyshop           1122334455667   75296
739677  Bodymist ale                1234567890123   75367

I want to remove data that is suspicious (i.e. has too many repeated or successive digits), like 1232131321313, 1122334455667, 1234567890123, etc. I am very tolerant of false negatives (good bar codes that get removed), but want to avoid false positives (bad bar codes that are kept) as much as possible.

Answer

If you're worried about repeated and successive digits, you can take np.diff of the digits and compare the result against a triangular distribution using a Kolmogorov-Smirnov test. The difference between successive digits of a random number ranges from -9 to 9 and is most likely to be 0, so it is well approximated by a triangular distribution between -10 and 10 with its maximum at 0:

import numpy as np
import scipy.stats as stat
t = stat.triang(.5, loc = -10, scale = 20)  # peak at loc + .5 * scale = 0
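
As a quick sanity check on the triangular claim, here is a minimal sketch that enumerates all 100 ordered pairs of digits and compares the exact frequency of each difference with the density of `t` at that integer. The variable names (`exact_pmf`, `approx`, `matches`) are illustrative and not part of the original answer.

```python
import numpy as np
import scipy.stats as stat

t = stat.triang(.5, loc=-10, scale=20)

digits = np.arange(10)
# Every possible difference between two independent, uniformly drawn digits
diffs = (digits[:, None] - digits[None, :]).ravel()
values, counts = np.unique(diffs, return_counts=True)
exact_pmf = counts / counts.sum()  # P(diff = d) = (10 - |d|) / 100

# Density of the continuous triangular distribution at the same integers
approx = t.pdf(values)
matches = np.allclose(exact_pmf, approx)
```

The two agree at every integer from -9 to 9, which is why the continuous triangular distribution is a reasonable stand-in here. The KS test itself assumes continuous data, so the p-values on integer differences are only approximate.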

Turning the bar codes into an array:

a = np.array(list(map(list, map(str, a))), dtype = int)  # however you get `a` out of your dataframe

then build a mask with

mask = np.array([stat.kstest(i, t.cdf).pvalue > .5 for i in np.diff(a, axis = 1)])
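
Putting those steps together on the sample rows from the question might look like the sketch below. The column names (`name`, `barcode`) and the helper `diff_pvalue` are assumptions made for illustration, and the bar codes are assumed to be stored as strings so that leading zeros survive.

```python
import numpy as np
import pandas as pd
import scipy.stats as stat

t = stat.triang(.5, loc=-10, scale=20)

# Sample rows from the question; column names are hypothetical
df = pd.DataFrame({
    "name": ["Sikat Botol Pigeon", "Natur Manual Breast Pump",
             "Sunlight", "Bodymist bodyshop", "Bodymist ale"],
    "barcode": ["4902508045506", "8850851860016",
                "1232131321313", "1122334455667", "1234567890123"],
})

def diff_pvalue(code):
    """KS p-value of the successive-digit differences against t."""
    digits = np.array([int(ch) for ch in code])
    return stat.kstest(np.diff(digits), t.cdf).pvalue

df["pvalue"] = df["barcode"].map(diff_pvalue)
kept = df[df["pvalue"] > .5]  # rows that pass the strict threshold
```

Keeping the p-value as a column, rather than only the boolean mask, makes it easy to inspect borderline codes or try a different threshold later.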

testing:

np.array([stat.kstest(j, t.cdf).pvalue > .5 for j in np.diff(np.random.randint(0, 10, (1000, 13)), axis = 1)]).sum()
Out: 720

You'll have about a 30% false negative rate (random codes that get rejected), but a p-value threshold of .5 should pretty much guarantee that the values you keep don't have too many successive or repeated digits. If you want to be really sure you've eliminated anything suspicious, you may also want to KS test the actual digits against stat.uniform(scale = 10) (to eliminate 1213141516171 and similar).
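
A minimal sketch of that second check, using 1213141516171 as the example. The helper name `looks_random` and its `threshold` parameter are hypothetical; both tests treat integer data as continuous, so the p-values are approximate.

```python
import numpy as np
import scipy.stats as stat

t = stat.triang(.5, loc=-10, scale=20)  # expected shape of the digit differences
u = stat.uniform(scale=10)              # expected shape of the digits themselves

def looks_random(code, threshold=.5):
    """True only if both the differences and the digits pass the KS test."""
    digits = np.array([int(ch) for ch in code])
    diff_ok = stat.kstest(np.diff(digits), t.cdf).pvalue > threshold
    digit_ok = stat.kstest(digits, u.cdf).pvalue > threshold
    return bool(diff_ok and digit_ok)

code = "1213141516171"
digits = np.array([int(ch) for ch in code])
diff_p = stat.kstest(np.diff(digits), t.cdf).pvalue  # differences look triangular
digit_p = stat.kstest(digits, u.cdf).pvalue          # but the digit 1 is overrepresented
```

For this code the difference test alone passes, because the differences are spread evenly around 0, while the digit test rejects it because more than half of the digits are 1.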

How to eliminate suspicious barcode (like 123456) data [closed]
