Given a URL pointing to a file, in this case a Word document, I want to read the contents of the document. I have seen several examples of how to extract text from local documents, but not from a URL. Would it work the same way for an HTTP address as for an FTP one?
from urllib.request import urlopen
url = 'ftp://path/to/file.docx'
txt = urlopen(url).read()
The value of txt is:
b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00\xdd\xfc\x957f\x01\x00\x00 \x05\x00\x00\x13\x00\x08\x02[Content_Types].xml \xa2\x04\x02(\xa0\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 ...'
I tried to decode it:
txt.decode("utf-8", "ignore")
but this just returns PK ... followed by other strange characters.
Saving the document to disk and then processing it is not an option.
What am I doing wrong?
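
For context on the output above: the leading PK and the [Content_Types].xml entry are the signature of a ZIP archive, which is what a .docx file is, so decoding the raw bytes as UTF-8 cannot produce the document text. Below is a minimal sketch of reading the text entirely in memory with the standard library, assuming the body lives in word/document.xml as it does in ordinary Word files. The helper names docx_bytes_to_text and docx_url_to_text are made up for illustration, and the sketch only targets simple documents (paragraph text, no headers, footnotes, or text boxes).

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data: bytes) -> str:
    """Return the paragraph text of a .docx held in memory as bytes."""
    # A .docx is a ZIP archive; BytesIO lets zipfile read it without touching disk.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The visible text of a paragraph is spread over its w:t elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def docx_url_to_text(url: str) -> str:
    """Hypothetical helper: download the document and extract its text in memory."""
    # urlopen handles http://, https:// and ftp:// URLs through the same call.
    with urlopen(url) as response:
        return docx_bytes_to_text(response.read())
```

With the code from the question, this would be called as docx_bytes_to_text(txt), or docx_url_to_text(url) to do both steps at once. If installing a third-party package is acceptable, python-docx's Document constructor also accepts a file-like object, so Document(io.BytesIO(txt)) should work without saving anything to disk.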