Equivalent of wget in Python to download website and resources

2024/4/14 9:18:27

The same thing was asked 2.5 years ago in "Downloading a web page and all of its resource files in Python", but that question never got an answer, and the 'please see related topic' it points to isn't really asking the same thing.

I want to download everything on a page to make it possible to view it just from the files.

The command

wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows

does exactly what I need. However, we want to be able to tie it in with other tooling that must be portable, so it needs to be in Python.

I've been looking at Beautiful Soup, Scrapy, and various spiders posted around the place, but these all seem to deal with getting data/links in clever but specific ways. Using them to do what I want looks like it would take a lot of work just to find all of the resources, when I'm sure there must be an easy way.

Thanks very much.


You should be using an appropriate tool for the job at hand.

If you want to spider a site and save the pages to disk, Python probably isn't the best choice for that. Open source projects get features when someone needs them, and because wget does its job so well, nobody has bothered to write a Python library to replace it.
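To give a sense of what a pure-Python replacement would have to do, here is a minimal sketch of just the first step: finding the resources a page refers to. It uses only the standard library. The names `REQUISITE_ATTRS`, `RequisiteCollector`, and `find_requisites` are made up for this illustration, and the tag list is an assumption, not wget's actual rule set. It does not download anything, rewrite links, or look inside CSS files, which is where most of the remaining work would be.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute pairs that commonly point at page requisites.
# Illustrative assumption only; wget's own rules cover more cases.
REQUISITE_ATTRS = {
    "img": "src",
    "script": "src",
    "link": "href",  # stylesheets and icons, but also other <link> targets
}


class RequisiteCollector(HTMLParser):
    """Collect absolute URLs referenced by <img>, <script>, and <link> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = REQUISITE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def find_requisites(html, base_url):
    """Return the resource URLs found in an HTML string, in document order."""
    parser = RequisiteCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

Calling `find_requisites(page_source, page_url)` gives you a list of URLs you would still have to fetch, save under sensible file names, and then patch back into the HTML.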

Considering wget runs on pretty much any platform that has a Python interpreter, is there a reason you can't use wget?
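If wget is acceptable as a dependency, one option is to drive it from Python with `subprocess`. This is a rough sketch, not a tested deployment recipe: the helper names `build_wget_command` and `mirror_page` and the `dest_dir` parameter are invented for the example, and it assumes a wget binary is installed and on PATH. The flags are the ones from the question.

```python
import shutil
import subprocess


def build_wget_command(url, domain, wget_path="wget"):
    """Return the argument list for the wget call quoted in the question."""
    return [
        wget_path,
        "--page-requisites",
        f"--domains={domain}",
        "--no-parent",
        "--html-extension",
        "--convert-links",
        "--restrict-file-names=windows",
        url,
    ]


def mirror_page(url, domain, dest_dir="."):
    """Run wget in dest_dir; raise if wget is missing or exits non-zero."""
    wget_path = shutil.which("wget")
    if wget_path is None:
        raise RuntimeError("wget was not found on PATH")
    # Passing a list (no shell=True) avoids shell quoting problems.
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(
        build_wget_command(url, domain, wget_path),
        cwd=dest_dir,
        check=True,
    )
```

Something like `mirror_page("http://example.com/page.html", "example.com", dest_dir="mirror")` would then save the page and its requisites under `mirror/`, provided that directory already exists. You may prefer to drop `check=True` and inspect the return code yourself, since wget can report a non-zero status when a single resource fails even though the rest of the page was saved.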

