My input (the variable is a string):
<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>
My expected output:
{
'href': 'https://wikipedia.org/',
'rel': 'nofollow ugc',
'text': 'wiki',
}
How can I do this in Python without using the BeautifulSoup library?
Please show how to do it with the lxml library.
Solution with lxml (but without bs!):
from lxml import etree
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
print(root.attrib)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc'}
But there's no text attribute in that output.
You can extract it by using the text property:
print(root.text)
>>> wiki
In conclusion:
from lxml import etree
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
dict_ = {}
dict_.update(root.attrib)
dict_.update({'text': root.text})
print(dict_)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
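The steps above can be wrapped in a small reusable function. This is only a sketch: the name anchor_to_dict is hypothetical, and the fallback to the standard library's xml.etree.ElementTree (which exposes the same fromstring, attrib and text used here) is an assumption added so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, whose ElementTree API
    # offers the same fromstring/attrib/text used here.
    import xml.etree.ElementTree as etree


def anchor_to_dict(markup):
    """Hypothetical helper: parse one well-formed element into a dict."""
    root = etree.fromstring(markup)
    result = dict(root.attrib)   # copy the attributes into a plain dict
    result['text'] = root.text   # add the element's text under 'text'
    return result


print(anchor_to_dict('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'))
```

Note that etree.fromstring expects well-formed XML. For looser real-world HTML, lxml.html.fromstring is probably a better fit.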
EDIT
-------Parsing [X]HTML with regex is discouraged!-------
Solution with regex:
import re
pattern_text = r"[>](\w+)[<]"
pattern_href = r'href="(\w\S+)"'
pattern_rel = r'rel="([A-Za-z ]+)"'  # [A-z] would also match [ \ ] ^ _ `
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
dict_ = {
    'href': re.search(pattern_href, xml).group(1),
    'rel': re.search(pattern_rel, xml).group(1),
    'text': re.search(pattern_text, xml).group(1),
}
print(dict_)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
This works as long as the input is a string.
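To illustrate why the regex route is fragile, here is a minimal sketch using the same text pattern on a slightly different input. The input string is hypothetical: it is the same tag, but the link text contains a space.

```python
import re

pattern_text = r"[>](\w+)[<]"  # same pattern as in the answer above

# Hypothetical input: identical tag, but the link text contains a space.
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki page</a>'

match = re.search(pattern_text, xml)
print(match)  # None, because \w+ cannot span the space
```

Calling .group(1) on that result would raise an AttributeError, whereas the lxml version returns the full text regardless of spaces.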