Python: Extract text from a Word file at a URL

2024/9/22 7:33:08

Given a URL that points to a file, in this case a Word document, I want to read the contents of the document. I have seen several examples of how to extract text from local documents but not from a URL. Would it work the same way for an HTTP address as for an FTP one?

from urllib.request import urlopen

url = 'ftp://path/to/file.docx'
txt = urlopen(url).read()

The value of txt is:

b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00\xdd\xfc\x957f\x01\x00\x00 \x05\x00\x00\x13\x00\x08\x02[Content_Types].xml \xa2\x04\x02(\xa0\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 ...'

I tried to decode it:

txt.decode("utf-8", "ignore")

but this returns PK ... followed by other strange characters.

Saving the document to disk and then processing it is not an option.

What am I doing wrong?

Answer

By using requests and docx2txt it's very simple:

import requests
import docx2txt
from io import BytesIO

url = "http://url.to.file/sample.docx"
docx = BytesIO(requests.get(url).content)

# extract text
text = docx2txt.process(docx)
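
As for why decoding fails: a .docx file is a ZIP archive of XML parts, and the leading PK bytes are the ZIP signature, so the bytes have to be unzipped rather than decoded as text. As far as I know, requests does not handle ftp:// URLs out of the box, while urlopen does. Below is a minimal standard-library sketch that works on the raw bytes in memory without saving anything to disk. The helper names docx_bytes_to_text and docx_url_to_text are made up for illustration, and the sketch only reads paragraph text from word/document.xml (no headers, footnotes or special table handling).

```python
import zipfile
import xml.etree.ElementTree as ET
from io import BytesIO
from urllib.request import urlopen

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data):
    """Return the paragraph text of a .docx given its raw bytes (hypothetical helper)."""
    # Open the ZIP archive directly from memory; nothing is written to disk
    with zipfile.ZipFile(BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join the text runs (w:t elements) found inside each paragraph (w:p)
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def docx_url_to_text(url):
    """Download a .docx with urlopen and extract its text (hypothetical helper)."""
    with urlopen(url) as response:
        return docx_bytes_to_text(response.read())
```

With the question's own code, this would be called as docx_bytes_to_text(txt), or as docx_url_to_text(url) to download and extract in one step.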