python: find html tags and replace their attributes [duplicate]

2024/11/13 7:53:27

I need to do the following:

  1. take an HTML document
  2. find every occurrence of the 'img' tag
  3. read each tag's 'src' attribute
  4. pass the URL that was found to a processing step
  5. change the 'src' attribute to the new value
  6. do all of this with Python 2.7

P.S. I've heard about lxml and BeautifulSoup. Which would you recommend for this problem? Or would it be better to use regexes, or something else entirely?

Answer
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2
soup = BeautifulSoup(html_string)
# Find every <img> tag and overwrite its 'src' attribute.
for img in soup.findAll('img'):
    img['src'] = 'New src'
html_string = str(soup)

I don't particularly like BeautifulSoup, but it does the job here. Try not to over-engineer your solution if you don't have to; this is one of the simpler ways to solve the general problem.

That said, building for the future is equally important, but all six of your requirements boil down to one: "I want to change the 'src' of all images to X".
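
For steps 3 to 5 of the question (read each 'src', run it through some processing, write the result back), a minimal sketch might look like the following. It assumes the newer beautifulsoup4 package (imported as bs4) rather than the BeautifulSoup 3 import used above; process_url, rewrite_img_srcs, and the cdn.example.com host are hypothetical names made up for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed


def process_url(url):
    # Hypothetical stand-in for the "processing" step in the question;
    # here it simply points the image at an assumed CDN host.
    return "https://cdn.example.com/" + url.lstrip("/")


def rewrite_img_srcs(html_string, transform):
    # "html.parser" is the parser from the standard library, so no
    # additional parser package is required.
    soup = BeautifulSoup(html_string, "html.parser")
    # src=True skips <img> tags that have no src attribute at all.
    for img in soup.find_all("img", src=True):
        img["src"] = transform(img["src"])
    return str(soup)


html_string = '<p><img src="/a.png" alt="A"><img alt="no source"></p>'
new_html = rewrite_img_srcs(html_string, process_url)
print(new_html)
```

Passing the processing step in as a function keeps the tag-walking code unchanged whatever the processing turns out to be. Note that the serialized output may differ cosmetically from the input (for example in how empty tags are closed), which is usually harmless but worth checking if the markup must be preserved byte for byte.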

