Using PIL to detect a scan of a blank page

2024/9/16 22:59:03

So I often run huge double-sided scan jobs on an unintelligent Canon multifunction, which leaves me with a huge folder of JPEGs. Am I insane to consider using PIL to analyze a folder of images to detect scans of blank pages and flag them for deletion?

Leaving the folder-crawling and flagging parts out, I imagine this would look something like:

  • Check whether the image is greyscale, since that can't be taken for granted.
  • If so, detect the dominant range of shades (background colour).
  • If not, detect the dominant range of shades, restricting to light greys.
  • Determine what percentage of the entire image is composed of said shades.
  • Try to find a threshold that adequately detects pages with type or writing or imagery.
  • Perhaps test fragments of the image at a time to increase accuracy of threshold.
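
A minimal sketch of the histogram part of the steps above, not a tested solution for real scans. It works on a 256-bin greyscale histogram, which is what Pillow's Image.histogram() returns for a mode "L" image. The names light_fraction, looks_blank and histogram_for, the window of 32 grey levels, and the 0.99 threshold are all assumptions made for illustration; the threshold in particular would need tuning against your own scans.

```python
def light_fraction(histogram, window=32, min_level=128):
    """Fraction of pixels within `window` grey levels of the dominant light shade.

    `histogram` is a list of 256 pixel counts (index 0 = black, 255 = white).
    The dominant shade is only searched for at or above `min_level`, which
    restricts it to light greys (assumed to be the paper background).
    """
    if len(histogram) != 256:
        raise ValueError("expected a 256-bin greyscale histogram")
    total = sum(histogram)
    if total == 0:
        raise ValueError("empty histogram")
    # Most common shade among the light greys.
    peak = max(range(min_level, 256), key=lambda g: histogram[g])
    lo = max(0, peak - window)
    hi = min(255, peak + window)
    return sum(histogram[lo:hi + 1]) / total


def looks_blank(histogram, window=32, threshold=0.99):
    """Flag a page as blank if nearly all pixels are background-coloured.

    The threshold of 0.99 is a guess, not a measured value.
    """
    return light_fraction(histogram, window) >= threshold


def histogram_for(path):
    """Load an image with Pillow and return its greyscale histogram."""
    from PIL import Image  # imported here so the functions above need no Pillow

    with Image.open(path) as im:
        # convert("L") handles both colour and greyscale JPEGs uniformly.
        return im.convert("L").histogram()
```

Usage would be something like looks_blank(histogram_for("scan_0001.jpg")), where the filename is hypothetical. Converting everything to "L" first sidesteps the greyscale check; testing fragments of the page would mean cropping tiles with Image.crop() and applying the same functions to each tile.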

I know this is sort of an edge case, but can anyone with PIL experience lend some pointers?

Answer

Here is an alternative solution, using mahotas and milk.

  1. Start by creating two directories, positives/ and negatives/, and manually sort a few examples into each.
  2. I will assume that the rest of the data is in an unlabeled/ directory.
  3. Compute features for all of the images in positives/ and negatives/.
  4. Learn a classifier.
  5. Use that classifier on the unlabeled images.

In the code below I used jug so that you can run it on multiple processors, but the code also works if you remove every line that mentions TaskGenerator.

from glob import glob
import mahotas
import mahotas.features
import milk
from jug import TaskGenerator


@TaskGenerator
def features_for(imname):
    # Haralick texture features, averaged over all directions
    img = mahotas.imread(imname)
    return mahotas.features.haralick(img).mean(0)


@TaskGenerator
def learn_model(features, labels):
    learner = milk.defaultclassifier()
    return learner.train(features, labels)


@TaskGenerator
def classify(model, features):
    return model.apply(features)


positives = glob('positives/*.jpg')
negatives = glob('negatives/*.jpg')
unlabeled = glob('unlabeled/*.jpg')

# A list rather than map(), which is a lazy iterator on Python 3
features = [features_for(im) for im in negatives + positives]
labels = [0] * len(negatives) + [1] * len(positives)

model = learn_model(features, labels)

labeled = [classify(model, features_for(u)) for u in unlabeled]

This uses texture features, which is probably good enough, but you can play with other features in mahotas.features if you'd like (or try mahotas.surf, but that gets more complicated). In general, I have found it hard to do classification with the sort of hard thresholds you are looking for unless the scanning is very controlled.
