Reading .doc file in Python using antiword in Windows (also .docx)

2024/10/6 6:46:21

I tried reading a .doc file like this:

with open('file.doc', errors='ignore') as f:
    text = f.read()

It did read the file, but the result was full of junk that I can't strip out, because I don't know where it starts or where it ends.
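The junk appears because a legacy .doc file is a binary container, not plain text, so opening it in text mode returns its raw structure along with the content. A minimal sketch of telling the two Word formats apart by their leading bytes follows. The function name `sniff_word_format` is made up for illustration, and the check is only a hint: the same signatures are shared by other OLE2 files (such as .xls) and by any ZIP archive.

```python
# Leading bytes that identify the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                        # .docx (ZIP archive)


def sniff_word_format(path):
    """Return 'doc', 'docx', or 'unknown' based on the file's first bytes.

    Hypothetical helper: it only inspects the container signature and does
    not prove that the file is really a Word document.
    """
    with open(path, "rb") as f:  # binary mode, so no decoding is attempted
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "doc"
    if head.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

A check like this could be used instead of the file extension when deciding which reader to call.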

I also tried installing the textract module, which claims to read many file formats, but I ran into a lot of dependency issues while installing it on Windows.

So I did this with the antiword command-line utility instead; my answer is below.

Answer

You can use the antiword command-line utility to do this. I know many of you will have tried it already, but I still wanted to share.

  • Download antiword from here
  • Extract the antiword folder to C:\ and add the path C:\antiword to your PATH environment variable.
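Before running anything, it may help to confirm that the PATH change took effect. A small sketch using the standard library's `shutil.which` is shown here; the wrapper name `find_antiword` is an illustrative choice, not part of antiword. A new terminal or IDE session is usually needed before a changed PATH becomes visible.

```python
import shutil


def find_antiword(name="antiword"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_antiword()
    if location is None:
        print("antiword not found; check that C:\\antiword is on PATH")
    else:
        print("antiword found at", location)
```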

Here is a sample of how to use it, handling docx and doc files:

import os
import docx2txt

def get_doc_text(filepath, file):
    """Return the text of a .docx or .doc file located in filepath."""
    text = ''
    if file.endswith('.docx'):
        # use the full path, not just the bare file name
        text = docx2txt.process(filepath + file)
    elif file.endswith('.doc'):
        # antiword writes plain text to stdout; redirect it into a temporary
        # file named like the .doc plus an 'x' (it is not a real .docx)
        doc_file = filepath + file
        docx_file = filepath + file + 'x'
        if not os.path.exists(docx_file):
            # quote the paths so names containing spaces survive the shell
            os.system('antiword "' + doc_file + '" > "' + docx_file + '"')
            with open(docx_file) as f:
                text = f.read()
            os.remove(docx_file)  # the temporary file was only needed for reading
        else:
            # a different file with the same name and a .docx extension
            # already exists, so skip this .doc rather than overwrite it
            print('Info: a .docx file with the same name as the .doc exists, so we cannot read it')
    return text

Now call this function:

filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)

This could be a good alternative way to read .doc files in Python on Windows.
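A variation on the same idea is to capture antiword's output directly instead of redirecting it to a temporary file. The sketch below uses `subprocess.run` from the standard library. The function name `antiword_text` and its `exe` parameter are assumptions added for illustration, and decoding the output as UTF-8 is also an assumption, since the actual encoding depends on how antiword is configured.

```python
import subprocess


def antiword_text(doc_path, exe="antiword"):
    """Run antiword on doc_path and return its standard output as text.

    exe is a hypothetical parameter so that a different executable name or
    a full path can be supplied.
    """
    result = subprocess.run(
        [exe, doc_path],      # list form: no shell, so spaces in paths are safe
        capture_output=True,  # collect stdout and stderr (Python 3.7+)
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    # Assumed encoding: undecodable bytes are replaced rather than raising.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `antiword_text("D:\\input\\file.doc")` would then return the text without creating or deleting any extra file, which also sidesteps the name clash with an existing .docx.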

Hope it helps, Thanks.

