Question 1

I am writing a Python script to index a large set of Windows installers into a DB.

I would like to know how to read the metadata (Company, Product Name, Version, etc.) from EXE, MSI and ZIP files using Python running on Linux.

Software

I am using Python 2.6.5 on Ubuntu 10.04 64-bit with Django 1.2.1.

Found so far:

Windows command-line utilities that can extract EXE metadata (like filever from SysUtils), and other individual command-line tools that only work on Windows. I've tried running these through Wine, but they have problems, and it hasn't been worth the effort to track down the libraries and frameworks they depend on and install them in Wine/CrossOver.

Win32 modules for Python, which can do some of this but won't run on Linux (right?)

Secondary question:

Obviously, changing a file's metadata changes the file's MD5 hash. Is there a general method of hashing a file independent of its metadata, besides locating the metadata and reading it in (for example, something like skipping the first 1024 bytes)?
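To make the "skip some bytes" idea concrete, here is a minimal sketch that hashes a file while ignoring caller-supplied byte ranges. The function name `md5_excluding` and the `skip_ranges` parameter are hypothetical, and the sketch assumes the metadata offsets are already known; finding them is format-specific and is not solved here.

```python
import hashlib


def _feed(handle, digest, count, chunk_size):
    """Hash `count` bytes from handle, or everything to EOF if count is None."""
    remaining = count
    while remaining is None or remaining > 0:
        size = chunk_size if remaining is None else min(chunk_size, remaining)
        chunk = handle.read(size)
        if not chunk:
            break  # reached end of file early
        digest.update(chunk)
        if remaining is not None:
            remaining -= len(chunk)


def md5_excluding(path, skip_ranges, chunk_size=65536):
    """Return the MD5 hex digest of a file, ignoring the given byte ranges.

    skip_ranges is a list of half-open (start, end) offsets. These are an
    assumption supplied by the caller, e.g. where the metadata is known to sit.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        position = 0
        for start, end in sorted(skip_ranges):
            start = max(start, position)  # tolerate overlapping ranges
            _feed(handle, digest, start - position, chunk_size)
            position = max(position, end)
            handle.seek(position)  # jump over the excluded bytes
        _feed(handle, digest, None, chunk_size)  # hash the rest of the file
    return digest.hexdigest()
```

With this, two files that differ only inside the skipped ranges produce the same digest, e.g. `md5_excluding("setup.exe", [(0, 1024)])` for the "first 1024 bytes" case.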

Answer

Take a look at the Hachoir library: http://bitbucket.org/haypo/hachoir/wiki/Home and at hachoir-metadata: http://pypi.python.org/pypi/hachoir-metadata/1.3.3, an example program that uses the Hachoir binary file manipulation library to parse metadata.

The library can handle these formats:

Archives: bzip2, gzip, zip, tar
Audio: MPEG audio ("MP3"), WAV, Sun/NeXT audio, Ogg/Vorbis (OGG), MIDI, AIFF, AIFC, Real audio (RA)
Image: BMP, CUR, EMF, ICO, GIF, JPEG, PCX, PNG, TGA, TIFF, WMF, XCF
Misc: Torrent
Program: EXE
Video: ASF format (WMV video), AVI, Matroska (MKV), Quicktime (MOV), Ogg/Theora, Real media (RM)

Additionally, Hachoir can do some file manipulation operations, which I assume include some primitive metadata manipulation.
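For the ZIP case specifically, Python's standard `zipfile` module may already be enough, without Hachoir. The sketch below collects the archive comment and per-entry details; the helper name `zip_metadata` and the shape of the returned dictionary are illustrative choices, not part of any library. Note that a ZIP archive carries no Company or Product Name fields, so this covers only what the format itself stores.

```python
import zipfile


def zip_metadata(path):
    """Collect the archive comment and basic per-entry metadata from a ZIP."""
    archive = zipfile.ZipFile(path)
    try:
        entries = []
        for info in archive.infolist():
            entries.append({
                "name": info.filename,
                "size": info.file_size,                 # uncompressed bytes
                "compressed_size": info.compress_size,
                "modified": info.date_time,             # (Y, M, D, h, m, s)
                "crc": info.CRC,                        # CRC-32 of the data
            })
        return {"comment": archive.comment, "entries": entries}
    finally:
        archive.close()  # explicit close also works on older Pythons
```

The per-entry CRC-32 is stored by the ZIP format itself, so it could serve as a cheap content fingerprint for each member when indexing.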

