These are the rules:
- The top-level HTML tags will be one of the following: <p>, <ol> or <ul>
- The content of any tag from step 1 will contain only the following tags: <em>, <strong> or <span style="text-decoration:underline">
- Map the step 2 tags into the following: <strong> will be the item {"bold": True} in the JSON, <em> will be {"italics": True} and <span style="text-decoration:underline"> will be {"decoration": "underline"}
- Any text found will be {"text": "this is the text"} in the JSON
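The mapping in steps 3 and 4 can be sketched as a small pure function. This is only an illustration: `style_for` and `merge_styles` are hypothetical names that do not appear in the question, and the sketch assumes the span's `style` attribute may be written with or without spaces and a trailing semicolon.

```python
def style_for(tag_name, style_attr=""):
    """Return the style flags contributed by one inline tag (hypothetical helper)."""
    if tag_name == "strong":
        return {"bold": True}
    if tag_name == "em":
        return {"italics": True}
    if tag_name == "span":
        # Accept both "text-decoration:underline" and "text-decoration: underline;"
        for declaration in style_attr.split(";"):
            prop, _, value = declaration.partition(":")
            if prop.strip() == "text-decoration" and value.strip() == "underline":
                return {"decoration": "underline"}
    return {}  # any other tag contributes no style


def merge_styles(tag_specs):
    """Combine the flags of nested tags, given as (name, style_attr) pairs."""
    merged = {}
    for name, style_attr in tag_specs:
        merged.update(style_for(name, style_attr))
    return merged


nested = [("span", "text-decoration: underline;"), ("em", ""), ("strong", "")]
print(merge_styles(nested))
# {'decoration': 'underline', 'italics': True, 'bold': True}
```

A text run nested inside several tags then becomes `{"text": ...}` updated with the merged flags of every tag that encloses it.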
Let’s say I have the HTML below. By using this:
soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]
Which produces this Array:
[<p>The name is not mine it is for the people<span style="text-decoration: underline;"><em><strong>stephen</strong></em></span><em><strong> how can</strong>name </em><strong>good</strong> <em>his name <span style="text-decoration: underline;">moneuet</span>please </em><span style="text-decoration: underline;"><strong>forever</strong></span><em>tomorrow<strong>USA</strong></em></p>,<p>2</p>,<p><strong>moment</strong><em>Africa</em> <em>China</em> <span style="text-decoration: underline;">home</span> <em>thomas</em> <strong>nothing</strong></p>,<ol><li>first item</li><li><em><span style="text-decoration: underline;"><strong>second item</strong></span></em></li></ol>
]
Applying the rules above, this should be the result:
First Array element would be processed into this JSON:
{"text": ["The name is not mine it is for the people",{"text": "stephen", "decoration": "underline", "bold": True, "italics": True}, {"text": "how can", "bold": True, "italics": True},{"text": "name", "italics": True},{"text": "good", "bold": True},{"text": "his name", "italics": True},{"text": "moneuet", "decoration": "underline"},{"text": "please ", "italics": True},{"text": "forever", "decoration": "underline", "bold":True},{"text": "tomorrow", "italics": True},{"text": "USA", "bold": True, "italics": True}]
}
Second Array element would be processed into this JSON:
{"text": ["2"] }
Third Array element would be processed into this JSON:
{"text": [{"text": "moment", "bold": True},{"text": "Africa", "italics": True},{"text": "China", "italics": True},{"text": "home", "decoration": "underline"},{"text": "thomas", "italics": True},{"text": "nothing", "bold": True}]
}
The fourth Array element would be processed into this JSON:
{"ol": ["first item", {"text": "second item", "decoration": "underline", "italics": True, "bold": True}]
}
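One possible way to get from the HTML to the results above is to keep a stack of the styles that are currently open and emit one run per piece of text. The sketch below is an assumption-laden illustration, not a BeautifulSoup solution: it swaps in the standard library's `html.parser.HTMLParser` to stay dependency-free, `RunCollector` and `parse_blocks` are hypothetical names, and it assumes surrounding whitespace is stripped from every run and that unstyled text stays a bare string, as in most of the expected output. The same stack idea should carry over to a recursive walk over a BeautifulSoup tag's children.

```python
from html.parser import HTMLParser

INLINE_STYLES = {"strong": {"bold": True}, "em": {"italics": True}}
BLOCK_KEYS = {"p": "text", "ol": "ol", "ul": "ul"}  # output key per top-level tag


class RunCollector(HTMLParser):
    """Builds one dict per top-level <p>/<ol>/<ul> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.results = []  # one entry per top-level block
        self.stack = []    # style dicts of the currently open inline tags
        self.key = None    # "text", "ol" or "ul" while inside a block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_KEYS:
            self.key = BLOCK_KEYS[tag]
            self.results.append({self.key: []})
        elif tag in INLINE_STYLES:
            self.stack.append(INLINE_STYLES[tag])
        elif tag == "span":
            style = (dict(attrs).get("style") or "").replace(" ", "")
            underlined = "text-decoration:underline" in style
            self.stack.append({"decoration": "underline"} if underlined else {})

    def handle_endtag(self, tag):
        if tag in BLOCK_KEYS:
            self.key = None
        elif (tag in INLINE_STYLES or tag == "span") and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if self.key is None or not text:
            return  # skip whitespace-only data and text outside a block
        if self.stack:
            run = {"text": text}
            for style in self.stack:
                run.update(style)  # inherit every enclosing style
        else:
            run = text  # unstyled text stays a bare string
        self.results[-1][self.key].append(run)


def parse_blocks(html):
    parser = RunCollector()
    parser.feed(html)
    parser.close()
    return parser.results


print(parse_blocks("<p>2</p><ol><li>first item</li><li><em><strong>second item</strong></em></li></ol>"))
# [{'text': ['2']}, {'ol': ['first item', {'text': 'second item', 'italics': True, 'bold': True}]}]
```

Because each text node is emitted with a snapshot of the whole stack, deeply nested combinations such as `<span><em><strong>` need no special handling.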
This is my attempt so far. I am able to drill down, but how to process the arrayOfTextAndStyles array is the issue:
from bs4 import BeautifulSoup as Soup

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]
for foundTag in allTags:
    foundTagStyles = [tag for tag in foundTag.find_all(recursive=True)]
    if len(foundTagStyles) > 0:
        if str(foundTag.name) == "p":
            # Direct text of the <p> itself, then one entry per nested tag
            arrayOfTextAndStyles = [
                {"tag": foundTag.name, "text": foundTag.find_all(text=True, recursive=False)}
            ] + [
                {"tag": tag.name, "text": tag.find_all(text=True, recursive=False)}
                for tag in foundTag.find_all()
            ]
        elif str(foundTag.name) == "ol":
            pass  # not handled yet
        elif str(foundTag.name) == "ul":
            pass  # not handled yet