Extract strings from a binary file in Python

2024/9/20 0:19:36

I have a project where I am given a file and I need to extract the strings from it. Think of the "strings" command in Linux, except that I'm doing this in Python. An additional constraint is that the file is given to me as a stream (e.g. a string), so the obvious answer of using one of the subprocess functions to run strings isn't an option either.

I wrote this code:

    def isStringChar(ch):
        if ord(ch) >= ord('a') and ord(ch) <= ord('z'): return True
        if ord(ch) >= ord('A') and ord(ch) <= ord('Z'): return True
        if ord(ch) >= ord('0') and ord(ch) <= ord('9'): return True
        if ch in ['/', '-', ':', '.', ',', '_', '$', '%', '\'', '(', ')', '[', ']', '<', '>', ' ']: return True

        # default out
        return False

    def process(stream):
        dwStreamLen = len(stream)
        if dwStreamLen < 4: return None

        dwIndex = 0
        strString = ''
        for ch in stream:
            if isStringChar(ch) == False:
                if len(strString) > 4:
                    #print strString
                    strString = ''
            else:
                strString += ch

This technically works but is far too slow. For instance, I was able to use the strings command on a 500 MB executable and it produced 300k worth of strings in less than 1 second. I ran the same file through the above code and it took 16 minutes.

Is there a library out there that will let me do this without the overhead of Python's interpreter?

Thanks!

Answer

This is about as fast as David Wolever's answer, and it uses re, Python's regular expression library. The short story of optimisation in Python is that the less Python code you execute, the faster it runs. A library function that loops is often implemented in C and will be faster than any loop you can write by hand. The same goes for a membership test such as char in set(), which is faster than comparing character codes yourself. Python is the opposite of C in that respect.

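    import re
    import sys

    # Bytes allowed inside a string, written as the body of a regex character class.
    chars = rb"A-Za-z0-9/\-:.,_$%'()[\]<> "
    shortest_run = 4

    regexp = b'[%s]{%d,}' % (chars, shortest_run)
    pattern = re.compile(regexp)

    def process(stream):
        # Read everything at once and let the regex engine do the scanning.
        data = stream.read()
        return pattern.findall(data)

    if __name__ == "__main__":
        # Python 3: read raw bytes from stdin and decode each ASCII match to print it.
        for found_str in process(sys.stdin.buffer):
            print(found_str.decode("ascii"))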

Working in 4k chunks would be clever, but it is a bit trickier on edge cases with re (for example, when the first two characters of a string fall at the end of one 4k block and the next two at the start of the following block).
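
One possible way to handle that edge case is to hold back whatever run of string characters sits at the very end of each block and prepend it to the next block before matching. The following is only a sketch under some assumptions: it targets Python 3, it expects a binary stream (one whose read() returns bytes), and the names ALLOWED, MIN_RUN, iter_strings and chunk_size are made up for illustration rather than taken from the answer above.

```python
import re
import string

# Hypothetical names for this sketch; the character set mirrors the answer above.
ALLOWED = (string.ascii_letters + string.digits + "/-:.,_$%'()[]<> ").encode("ascii")
MIN_RUN = 4
RUN = re.compile(b"[" + re.escape(ALLOWED) + b"]{%d,}" % MIN_RUN)

def iter_strings(stream, chunk_size=4096):
    """Yield runs of allowed bytes from a binary stream, one block at a time."""
    carry = b""  # trailing run held back from the previous block
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = carry + chunk
        # Everything after tail_start is a run that may continue in the next block.
        tail_start = len(data.rstrip(ALLOWED))
        # Only match in the part of the buffer whose runs are known to be complete.
        for match in RUN.finditer(data, 0, tail_start):
            yield match.group()
        carry = data[tail_start:]
    # End of stream: the held-back run is now complete.
    if len(carry) >= MIN_RUN:
        yield carry
```

Called as iter_strings(sys.stdin.buffer), or on a file opened with mode "rb", this should produce the same matches as running the pattern over the whole input at once, while keeping only one block plus the current unfinished run in memory. A very long run of string characters will still accumulate in carry until it ends.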
