How to eliminate suspicious barcode (like 123456) data [closed]

2024/10/12 22:25:42

Here's some barcode data from a pandas DataFrame:

737318  Sikat Botol Pigeon          4902508045506   75170
737379  Natur Manual Breast Pump    8850851860016   75170
738753  Sunlight                    1232131321313   75261
739287  Bodymist bodyshop           1122334455667   75296
739677  Bodymist ale                1234567890123   75367

I want to remove rows whose barcode is suspicious (i.e. has too many repeated or successive digits), like 1232131321313, 1122334455667, 1234567890123, etc. I am very tolerant of false negatives (valid barcodes that get discarded), but want to avoid false positives (bad barcodes that are kept) as much as possible.

Answer

If you're worried about repeated and successive digits, you can take np.diff of the digits and compare the result against a triangular distribution using a Kolmogorov-Smirnov test. For a random number, the difference between successive digits should roughly follow a triangular distribution between -10 and 10 with its maximum at 0 (strictly, the differences are integers from -9 to 9, so the continuous triangle is an approximation):

import numpy as np
import scipy.stats as stat
# triangular distribution on [-10, 10] with its mode at 0
t = stat.triang(.5, loc = -10, scale = 20)
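To see where the triangle comes from, one can enumerate all 100 ordered pairs of digits and count each difference. This is only an illustrative sketch (the names `counts` and `expected` are not from the answer). It shows that a difference of k occurs 10 - |k| times, so the exact distribution is a discrete triangle on -9..9, which the continuous `triang` on [-10, 10] approximates.

```python
from collections import Counter
from itertools import product

# Count second-minus-first differences over every ordered pair of digits 0-9.
counts = Counter(b - a for a, b in product(range(10), repeat=2))

# Each difference k in -9..9 should occur exactly 10 - |k| times.
expected = {k: 10 - abs(k) for k in range(-9, 10)}

print(dict(counts) == expected)          # True
print(counts[0], counts[1], counts[9])   # 10 9 1
```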

Turning the bar codes into an array:

a = np.array(list(map(list, map(str, a))), dtype = int)  # however you get `a` out of your dataframe

then build a mask with

mask = np.array([stat.kstest(i, t.cdf).pvalue > .5 for i in np.diff(a, axis = 1)])
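A minimal end-to-end sketch of the steps so far, applied to the five sample rows from the question. The column names (`id`, `name`, `barcode`, `code`) are assumptions, since the question does not show them, and the barcodes are held as strings here so that any leading zeros and the fixed 13-digit width are preserved.

```python
import numpy as np
import pandas as pd
import scipy.stats as stat

# Sample rows from the question; the column names are hypothetical.
df = pd.DataFrame({
    "id": [737318, 737379, 738753, 739287, 739677],
    "name": ["Sikat Botol Pigeon", "Natur Manual Breast Pump", "Sunlight",
             "Bodymist bodyshop", "Bodymist ale"],
    "barcode": ["4902508045506", "8850851860016", "1232131321313",
                "1122334455667", "1234567890123"],
    "code": [75170, 75170, 75261, 75296, 75367],
})

# Triangular reference distribution on [-10, 10] with its mode at 0.
t = stat.triang(.5, loc=-10, scale=20)

# One row of 13 digits per barcode (all barcodes must have equal length).
a = np.array([list(s) for s in df["barcode"]], dtype=int)

# Differences between successive digits: one row of 12 values per barcode.
diffs = np.diff(a, axis=1)

# KS p-value per barcode; keep a row only if its p-value exceeds .5.
pvalues = np.array([stat.kstest(row, t.cdf).pvalue for row in diffs])
mask = pvalues > .5

clean = df[mask]
print(clean[["name", "barcode"]])
```

On this sample, the three suspicious barcodes named in the question should all be rejected, since their p-values come out below .5.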

testing:

np.array([stat.kstest(j, t.cdf).pvalue > .5 for j in np.diff(np.random.randint(0, 10, (1000, 13)), axis = 1)]).sum()
Out: 720

You'll have about a 30% false negative rate (roughly 280 of the 1000 random codes above are rejected), but a p-value threshold of .5 should pretty much guarantee that the values you keep don't have too many successive or repeated digits. If you want to be really sure you've eliminated anything suspicious, you may also want to KS test the actual digits against stat.uniform(scale = 10) (to eliminate 1213141516171 and similar).
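The second check suggested above could be sketched as follows. The helper names `diff_ok` and `digits_ok` are hypothetical, and the .5 threshold for the digit test is an assumption carried over from the difference test; the answer does not specify one.

```python
import numpy as np
import scipy.stats as stat

t = stat.triang(.5, loc=-10, scale=20)   # reference for digit differences
u = stat.uniform(scale=10)               # reference for the digits themselves

def diff_ok(barcode, threshold=.5):
    """KS test of successive-digit differences against the triangle."""
    digits = np.array(list(barcode), dtype=int)
    return stat.kstest(np.diff(digits), t.cdf).pvalue > threshold

def digits_ok(barcode, threshold=.5):
    """KS test of the raw digits against a uniform distribution on [0, 10]."""
    digits = np.array(list(barcode), dtype=int)
    return stat.kstest(digits, u.cdf).pvalue > threshold

code = "1213141516171"
print(diff_ok(code))    # differences are evenly spread, so this check passes
print(digits_ok(code))  # seven of the thirteen digits are 1, so this fails
```

A barcode would then be kept only when both checks pass, at the cost of rejecting more valid codes.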
