I have a project where I am given a file and need to extract the printable strings from it. Basically, think of the "strings" command in Linux, but I'm doing this in Python. The additional constraint is that the file is handed to me as an in-memory stream (e.g., a string), so the obvious answer of using one of the subprocess functions to run strings isn't an option either.
I wrote this code:
    def isStringChar(ch):
        if ord(ch) >= ord('a') and ord(ch) <= ord('z'): return True
        if ord(ch) >= ord('A') and ord(ch) <= ord('Z'): return True
        if ord(ch) >= ord('0') and ord(ch) <= ord('9'): return True
        if ch in ['/', '-', ':', '.', ',', '_', '$', '%', '\'', '(', ')', '[', ']', '<', '>', ' ']: return True
        # default out
        return False

    def process(stream):
        dwStreamLen = len(stream)
        if dwStreamLen < 4:
            return None
        strString = ''
        for ch in stream:
            if isStringChar(ch) == False:
                if len(strString) > 4:
                    # print(strString)
                    pass
                strString = ''  # reset on every non-string byte, not only on long runs
            else:
                strString += ch
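One common micro-optimization for this kind of per-character test (a sketch, not from the original post; the `_ACCEPTED` name is mine) is to precompute the accepted characters once, so each check is a single set lookup instead of several `ord()` comparisons:

```python
import string

# Hypothetical variant of isStringChar: build the accepted-character set
# once at import time, matching the characters tested in the code above.
_ACCEPTED = frozenset(string.ascii_letters + string.digits + "/-:.,_$%'()[]<> ")

def isStringChar(ch):
    # One hash lookup per character instead of up to four range checks.
    return ch in _ACCEPTED
```

This helps, but it still leaves a Python-level loop over every byte, so the gain is modest.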
This technically works, but it is WAY too slow. For comparison, the strings command processed a 500 MB executable and produced 300 KB worth of strings in under a second; the code above took 16 minutes on the same file.
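For reference, here is a regex-based sketch of the same extraction (my own rewrite, not the code above): the character class mirrors the set accepted by isStringChar, the data is assumed to be bytes, and the minimum run length of 4 is an assumption borrowed from the strings default (the `> 4` check above effectively required 5). Pushing the scan into the re engine avoids the per-character Python loop entirely.

```python
import re

# Character class mirroring isStringChar above: letters, digits, and the
# same punctuation set. {4,} keeps only runs of 4 or more characters.
STRING_RE = re.compile(rb"[A-Za-z0-9/\-:.,_$%'()\[\]<> ]{4,}")

def process(stream):
    # stream is assumed to be bytes; returns the extracted strings as str.
    return [match.decode('ascii') for match in STRING_RE.findall(stream)]

print(process(b'\x00hello world\xff\x02ab\x03testing'))
# → ['hello world', 'testing']
```

The matching loop runs in C inside the re module, which is typically orders of magnitude faster than iterating character by character in Python.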
Is there a library out there that will let me do this without the overhead of a pure-Python character loop?
Thanks!