Question 1

I'm messing around with file lookups in Python on a large hard disk. I've been looking at os.walk and glob. I usually use os.walk, as I find it much neater and it seems to be quicker (for directories of typical size).

Has anyone got experience with both and could say which is more efficient? As I say, glob seems to be slower, but it lets you use wildcards etc., whereas with walk you have to filter the results yourself. Here is an example of looking up core dumps.

import os
import re

# Match file names containing "core." followed by optional digits
core = re.compile(r"core\.\d*")
for root, dirs, files in os.walk("/path/to/dir/"):
    for file in files:
        if core.search(file):
            path = os.path.join(root, file)
            print("Deleting: " + path)
            os.remove(path)

Or

import os
from glob import iglob

for file in iglob("/path/to/dir/core.*"):
    print("Deleting: " + file)
    os.remove(file)
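One thing to keep in mind when comparing the two snippets above: os.walk descends into subdirectories, while the iglob pattern only matches entries directly inside /path/to/dir/. Below is a minimal sketch of a like-for-like comparison, assuming Python 3.5+ (for recursive globbing). The helper names find_with_walk and find_with_glob are hypothetical, the "core.*" wildcard mirrors the glob example rather than the regex, and the code only lists matches instead of deleting them.

```python
import fnmatch
import glob
import os


def find_with_walk(top, pattern="core.*"):
    """Yield paths under `top` whose file name matches `pattern` (recursive)."""
    for root, dirs, files in os.walk(top):
        # fnmatch.filter applies shell-style wildcards to the plain file names
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)


def find_with_glob(top, pattern="core.*"):
    """Yield matching files via a recursive glob (assumes Python 3.5+)."""
    # "**" with recursive=True matches zero or more directory levels
    for path in glob.iglob(os.path.join(top, "**", pattern), recursive=True):
        if os.path.isfile(path):
            yield path
```

The two helpers should agree on ordinary directory trees, but they can differ at the edges: glob wildcards skip names starting with a dot by default, and the two approaches treat symbolic links differently, so it is worth checking both on your own data before timing them.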

Question 2

I ran a benchmark on a small cache of web pages spread across 1000 directories. The task was to count the total number of files in those directories. The output is:

os.listdir: 0.7268s, 1326786 files found
os.walk: 3.6592s, 1326787 files found
glob.glob: 2.0133s, 1326786 files found

As you can see, os.listdir is the quickest of the three, and glob.glob is still quicker than os.walk for this task.

The source:

import os, time, glob

# os.listdir: count entries in each of the 1000 numbered directories
n, t = 0, time.time()
for i in range(1000):
    n += len(os.listdir("./%d" % i))
t = time.time() - t
print("os.listdir: %.4fs, %d files found" % (t, n))

# os.walk: count every file found while walking from the current directory
n, t = 0, time.time()
for root, dirs, files in os.walk("./"):
    for file in files:
        n += 1
t = time.time() - t
print("os.walk: %.4fs, %d files found" % (t, n))

# glob.glob: count wildcard matches in each numbered directory
n, t = 0, time.time()
for i in range(1000):
    n += len(glob.glob("./%d/*" % i))
t = time.time() - t
print("glob.glob: %.4fs, %d files found" % (t, n))

Quicker to os.walk or glob?
