I am trying to get a large amount of status information that is encoded in websites, mainly inside the <head></head> element.
I know I can use wget, curl, or Python to get the whole page, but I don't want to put too much unnecessary stress on the servers (the pages themselves are rather large/complicated).
Is there any method of getting only the head-element?
I assume proxy servers do something like this besides checking the HTTP headers.
Just for clarification: I am not looking for HTTP headers, only for the HTML <head> element.
It is not possible to load only the data between the <head>
tags because the server would have to parse the requested page before sending it.
A possible workaround is to read a few bytes at a time until a </head> tag is found.
The following reads n
bytes from the source and checks if the string </head>
is included. If so, the bytes are converted to a string
and trimmed so that the result contains the tags <head>
and </head>
as well as the data between them. Otherwise it continues to read n
bytes until </head>
is found.
```python
import urllib.request


def get_head_tag_data(url, n=512):
    """Read n bytes at a time from the source until '</head>' is found.
    Trim the result to '<head> ... </head>' and return it as a string."""
    # open resource
    with urllib.request.urlopen(url) as site:
        # read n bytes until `data` includes "</head>"
        data = b''
        i = 1
        while True:
            buff = site.read(n)
            data += buff
            # search the accumulated data, not only the last chunk,
            # so a tag split across two reads is still found
            if b'</head>' in data:
                break
            elif buff == b'':
                raise ValueError('No head-tag found.')
            i += 1
        print('{} bytes read'.format(n * i))
    # decode to a string (str(data) would keep the b'...' repr)
    data = data.decode('utf-8', errors='replace')
    # detect tag positions
    start_tag = data.find('<head>')
    end_tag = data.find('</head>') + len('</head>')
    return data[start_tag:end_tag]


tag_data = get_head_tag_data('https://stackoverflow.com', n=256)
```
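Depending on the server, a variant that avoids the read loop is an HTTP Range request. This is only a sketch: the 2048-byte cutoff is an arbitrary guess at how far into the page `</head>` appears, and many dynamically generated pages ignore the Range header and send the full body anyway (a 206 response status means the range was honoured, a 200 means it was not).

```python
import urllib.request

# Assumption: the server honours HTTP Range requests. Ask for only
# the first 2048 bytes of the page instead of reading in a loop.
req = urllib.request.Request(
    'https://stackoverflow.com',
    headers={'Range': 'bytes=0-2047'},
)
print(req.get_header('Range'))  # bytes=0-2047
# with urllib.request.urlopen(req) as site:
#     partial = site.read()
#     print(site.status, len(partial))  # 206 -> range honoured
```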
Note that get_head_tag_data does only minimal error checking; for example, it will not find a <head> tag that carries attributes (such as <head lang="en">), because it searches for the literal string <head>.
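Once the trimmed string is in hand, the status information still has to be pulled out of the markup. Rather than more string searching, the standard library's html.parser can do this and also copes with attributes on the tags. A minimal sketch (the HeadMetaParser name and the sample markup are made up for illustration):

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect the <title> text and all <meta> attributes that
    appear inside the <head> element."""

    def __init__(self):
        super().__init__()
        self._in_head = False
        self._in_title = False
        self.title = ''
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag == 'head':
            self._in_head = True
        elif self._in_head and tag == 'meta':
            self.metas.append(dict(attrs))
        elif self._in_head and tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'head':
            self._in_head = False
        elif tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


parser = HeadMetaParser()
parser.feed("<html><head><title>Demo</title>"
            "<meta name='status' content='ok'></head>"
            "<body>ignored</body></html>")
print(parser.title)  # Demo
print(parser.metas)  # [{'name': 'status', 'content': 'ok'}]
```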