Question 1

These are the rules:

1. The HTML tags will start with any of the following: <p>, <ol> or <ul>.
2. When any of the step 1 tags is found, its content will contain only the following tags: <em>, <strong> or <span style="text-decoration: underline;">.
3. Map the step 2 tags as follows: <strong> becomes {"bold": True} in the JSON, <em> becomes {"italics": True} and <span style="text-decoration: underline;"> becomes {"decoration": "underline"}.
4. Any text found becomes {"text": "this is the text"} in the JSON.
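
As a rough illustration of rules 3 and 4, the mapping can be written as two small helpers. This is only a sketch: the names style_for and text_run are hypothetical and not part of the question, and the underline check is a simple substring test rather than a real CSS parser.

```python
def style_for(tag_name, style_attr=""):
    """Return the style fragment for one inline tag (rule 3). Hypothetical helper."""
    if tag_name == "strong":
        return {"bold": True}
    if tag_name == "em":
        return {"italics": True}
    # Assumption: any style attribute mentioning "underline" counts as underlined.
    if tag_name == "span" and "underline" in style_attr:
        return {"decoration": "underline"}
    raise ValueError(f"unsupported tag: <{tag_name}>")


def text_run(text, *styles):
    """Build one {"text": ...} entry (rule 4) and merge in any style fragments."""
    run = {"text": text}
    for style in styles:
        run.update(style)
    return run


# "stephen" sits inside <span style="text-decoration: underline;"><em><strong>
run = text_run(
    "stephen",
    style_for("span", "text-decoration: underline;"),
    style_for("em"),
    style_for("strong"),
)
print(run)
```

Raising ValueError for anything outside the three allowed tags keeps rule 2 explicit instead of silently dropping unknown markup.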

Let's say I have some HTML and parse it like this:

from bs4 import BeautifulSoup as Soup
soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]

Which produces this Array:

[<p>The name is not mine it is for the people<span style="text-decoration: underline;"><em><strong>stephen</strong></em></span><em><strong> how can</strong>name </em><strong>good</strong> <em>his name <span style="text-decoration: underline;">moneuet</span>please </em><span style="text-decoration: underline;"><strong>forever</strong></span><em>tomorrow<strong>USA</strong></em></p>,
 <p>2</p>,
 <p><strong>moment</strong><em>Africa</em> <em>China</em> <span style="text-decoration: underline;">home</span> <em>thomas</em> <strong>nothing</strong></p>,
 <ol><li>first item</li><li><em><span style="text-decoration: underline;"><strong>second item</strong></span></em></li></ol>]

By applying the rules above, this will be the result:

First Array element would be processed into this JSON:

{"text": [
    "The name is not mine it is for the people",
    {"text": "stephen", "decoration": "underline", "bold": True, "italics": True},
    {"text": "how can", "bold": True, "italics": True},
    {"text": "name", "italics": True},
    {"text": "good", "bold": True},
    {"text": "his name", "italics": True},
    {"text": "moneuet", "decoration": "underline"},
    {"text": "please ", "italics": True},
    {"text": "forever", "decoration": "underline", "bold": True},
    {"text": "tomorrow", "italics": True},
    {"text": "USA", "bold": True, "italics": True}
]}

Second Array element would be processed into this JSON:

{"text": ["2"] }

Third Array element would be processed into this JSON:

{"text": [
    {"text": "moment", "bold": True},
    {"text": "Africa", "italics": True},
    {"text": "China", "italics": True},
    {"text": "home", "decoration": "underline"},
    {"text": "thomas", "italics": True},
    {"text": "nothing", "bold": True}
]}

The fourth Array element would be processed into this JSON:

{"ol": [
    "first item",
    {"text": "second item", "decoration": "underline", "italics": True, "bold": True}
]}

This is my attempt so far. I am able to drill down, but how to process the arrayOfTextAndStyles array is the issue:

from bs4 import BeautifulSoup as Soup

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]
for foundTag in allTags:
    foundTagStyles = [tag for tag in foundTag.find_all(recursive=True)]
    if len(foundTagStyles) > 0:
        if str(foundTag.name) == "p":
            # direct text of the <p> itself, then the direct text of every nested tag
            arrayOfTextAndStyles = (
                [{"tag": foundTag.name, "text": foundTag.find_all(text=True, recursive=False)}]
                + [{"tag": tag.name, "text": tag.find_all(text=True, recursive=False)}
                   for tag in foundTag.find_all()]
            )
        elif str(foundTag.name) == "ol":
            pass  # not implemented yet
        elif str(foundTag.name) == "ul":
            pass  # not implemented yet

Answer

I'd use a function to parse each element, not use one huge loop. Select on p and ol tags, and raise an exception in your parsing to flag anything that doesn't match your specific rules:

from bs4 import NavigableString

def parse(elem):
    if elem.name == 'ol':
        result = []
        for li in elem.find_all('li'):
            if len(li) > 1:
                result.append([parse_text(sub) for sub in li])
            else:
                result.append(parse_text(next(iter(li))))
        return {'ol': result}
    return {'text': [parse_text(sub) for sub in elem]}

def parse_text(elem):
    if isinstance(elem, NavigableString):
        return {'text': elem}

    result = {}
    if elem.name == 'em':
        result['italics'] = True
    elif elem.name == 'strong':
        result['bold'] = True
    elif elem.name == 'span':
        try:
            # rudimentary parse into a dictionary
            styles = dict(
                s.replace(' ', '').split(':')
                for s in elem.get('style', '').split(';')
                if s.strip()
            )
        except ValueError:
            raise ValueError('Invalid structure')
        if 'underline' not in styles.get('text-decoration', ''):
            raise ValueError('Invalid structure')
        result['decoration'] = 'underline'
    else:
        raise ValueError('Invalid structure')

    if len(elem) > 1:
        result['text'] = [parse_text(sub) for sub in elem]
    else:
        result.update(parse_text(next(iter(elem))))
    return result

You then parse your document:

for candidate in soup.select('ol,p'):
    try:
        result = parse(candidate)
    except ValueError:
        # invalid structure, ignore
        continue
    print(result)

Using pprint, this results in:

{'text': [{'text': 'The name is not mine it is for the people'},
          {'bold': True,
           'decoration': 'underline',
           'italics': True,
           'text': 'stephen'},
          {'italics': True,
           'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]},
          {'bold': True, 'text': 'good'},
          {'text': ' '},
          {'italics': True,
           'text': [{'text': 'his name '},
                    {'decoration': 'underline', 'text': 'moneuet'},
                    {'text': 'please '}]},
          {'bold': True, 'decoration': 'underline', 'text': 'forever'},
          {'italics': True,
           'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}]}
{'text': [{'text': '2'}]}
{'text': [{'bold': True, 'text': 'moment'},
          {'italics': True, 'text': 'Africa'},
          {'text': ' '},
          {'italics': True, 'text': 'China'},
          {'text': ' '},
          {'decoration': 'underline', 'text': 'home'},
          {'text': ' '},
          {'italics': True, 'text': 'thomas'},
          {'text': ' '},
          {'bold': True, 'text': 'nothing'}]}
{'ol': [{'text': 'first item'},
        {'bold': True,
         'decoration': 'underline',
         'italics': True,
         'text': 'second item'}]}
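
The dictionaries printed above are Python objects, so they show True rather than JSON's true. If actual JSON text is needed, the standard json module should be enough. A minimal sketch, with the last result copied in by hand so the snippet runs on its own:

```python
import json

# Copied from the last line of output above; in real use this would be
# the value returned by parse().
result = {
    "ol": [
        {"text": "first item"},
        {"bold": True, "decoration": "underline", "italics": True, "text": "second item"},
    ]
}

encoded = json.dumps(result)  # Python True is serialized as JSON true
print(encoded)
```

NavigableString is a str subclass, so json.dumps should also accept the values produced by parse_text() directly.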

Note that the text nodes are now nested; this lets you consistently re-create the same structure, with correct whitespace and nested text decorations.
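
If the flat list of runs from the question is what you need, the nested structure can be flattened afterwards by pushing each parent's styles down to its text leaves. The following is a sketch only: flatten is a hypothetical helper name, and it assumes every node is a dictionary with a 'text' key, as produced by parse_text() above.

```python
def flatten(node, inherited=None):
    """Yield flat {'text': str, ...styles} runs from a nested node (hypothetical helper)."""
    # Start from the parent's styles, then add this node's own style keys.
    styles = dict(inherited or {})
    styles.update({key: value for key, value in node.items() if key != "text"})

    content = node["text"]
    if isinstance(content, list):
        # Nested data: recurse, passing the accumulated styles down.
        for child in content:
            yield from flatten(child, styles)
    else:
        # A leaf: emit one run carrying every inherited style.
        yield {"text": str(content), **styles}


nested = {
    "italics": True,
    "text": [{"bold": True, "text": " how can"}, {"text": "name "}],
}
runs = list(flatten(nested))
print(runs)
```

Unlike the expected output in the question, this keeps the original whitespace (' how can', 'name ') and the whitespace-only runs, so stripping or dropping those would be an extra step.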

The structure is also reasonably consistent; a 'text' key will either point at a single string, or a list of dictionaries. Such a list will never mix types. You could improve on this still: have 'text' only ever point to a string, and use a different key to signify nested data, such as 'contains' or 'nested' or similar, so that each dictionary uses just one or the other. All that'd require is changing the 'text' keys in the len(elem) > 1 case and in the parse() function.
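
One way to try out that variation without touching the parser is to convert an existing result afterwards. This is a sketch under assumptions: use_nested_key is a hypothetical helper, 'nested' is just one possible key name, and it only handles dictionaries with a 'text' key, not the list stored under 'ol'.

```python
def use_nested_key(node):
    """Return a copy where 'text' is always a string and lists live under 'nested'."""
    # Keep every style key as it is.
    out = {key: value for key, value in node.items() if key != "text"}
    content = node["text"]
    if isinstance(content, list):
        # Nested data moves to its own key, converted recursively.
        out["nested"] = [use_nested_key(child) for child in content]
    else:
        out["text"] = content
    return out


sample = {
    "italics": True,
    "text": [{"text": "tomorrow"}, {"bold": True, "text": "USA"}],
}
converted = use_nested_key(sample)
print(converted)
```

A consumer can then test for 'nested' in a dictionary instead of checking the type of the 'text' value.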

Python Convert HTML into JSON using Soup
