Get only HTML head Element with a Script or Tool

2024/10/13 19:19:32

I am trying to get a large amount of status information that is encoded in web pages, mainly inside the <head></head> element.

I know I can use wget, curl or Python to get the whole page. But I don't want to put unnecessary stress on the servers (the pages themselves are rather large and complicated).

Is there any method of getting only the head-element?

I assume proxy servers do something like this, besides checking the HTTP headers.

Just for clarification: I am not looking for the HTTP headers, only for the HTML <head> element.

Answer

It is not possible to load only the data between the <head> tags because the server would have to parse the requested page before sending it.

A possible solution is to read the response a few bytes at a time and stop as soon as a </head> tag is found.

The following reads n bytes from the source and checks if the string </head> is included. If so, the bytes are converted to string and trimmed such that the result contains the tags <head> and </head> as well as the data between them. Otherwise it continues to read n bytes until </head> is found.

import urllib.request


def get_head_tag_data(url, n=512):
    """Read n bytes at a time from the source until '</head>' is included.

    Trim the result to '<head> ... </head>' and return it as a string.
    """
    # open resource
    with urllib.request.urlopen(url) as site:
        # read n bytes at a time until `data` includes "</head>"
        data = b''
        while True:
            buff = site.read(n)
            if buff == b'':
                # end of the response reached without finding the tag
                raise AttributeError('No head tag found.')
            data += buff
            # search `data`, not `buff`: the tag may be split across two reads
            if b'</head>' in data:
                break
    print('{} bytes read'.format(len(data)))
    # decode to string (str(data) would return the "b'...'" representation)
    data = data.decode('utf-8', errors='replace')
    # detect tag positions
    start_tag = data.find('<head>')
    end_tag = data.find('</head>') + 7
    return data[start_tag:end_tag]


tag_data = get_head_tag_data('https://stackoverflow.com', n=256)

Note that this function does not check for every possible error, for example a <head> tag that carries attributes or is written in upper case.
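As a rough sketch of how those cases could be handled, the variant below matches the tags case-insensitively and allows attributes on the opening tag. It is not part of the original answer: the names read_head, HEAD_OPEN, HEAD_CLOSE and the limit parameter are made up for illustration, and it works on any object with a read(n) method so it can be tried without a network connection.

```python
import io
import re

# Opening <head> tag with optional attributes, any letter case.
# The tag name must be followed by whitespace or '>', so <header> does not match.
HEAD_OPEN = re.compile(rb'<head(\s[^>]*)?>', re.IGNORECASE)
HEAD_CLOSE = re.compile(rb'</head\s*>', re.IGNORECASE)


def read_head(stream, n=512, limit=1_000_000):
    """Read `stream` n bytes at a time and return the <head> element as bytes.

    Returns None if no complete <head> element is found before the stream
    ends or before roughly `limit` bytes have been read.
    """
    data = b''
    while len(data) < limit:
        chunk = stream.read(n)
        if not chunk:
            break  # end of stream
        data += chunk
        # Search the accumulated data so a tag split across reads is still found.
        close = HEAD_CLOSE.search(data)
        if close:
            start = HEAD_OPEN.search(data)
            if start and start.start() < close.start():
                return data[start.start():close.end()]
            return None
    return None


# Offline demonstration with an in-memory page and a very small chunk size.
page = (b'<!DOCTYPE html><html><HEAD lang="en"><title>t</title></head><body>'
        + b'x' * 1000 + b'</body></html>')
stream = io.BytesIO(page)
head = read_head(stream, n=5)
```

The object returned by urllib.request.urlopen(url) also has a read(n) method, so it should be usable as the stream here. Because of network buffering, the server may still have sent somewhat more than the bytes actually read.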
