Python: Convert HTML into JSON using Beautiful Soup

2024/10/13 2:15:28

These are the rules:

  1. The top-level HTML tags will be any of the following: <p>, <ol> or <ul>.
  2. The content of any tag from rule 1 will contain only the following tags: <em>, <strong> or <span style="text-decoration:underline">.
  3. Map the tags from rule 2 as follows: <strong> becomes {"bold": True} in the JSON, <em> becomes {"italics": True}, and <span style="text-decoration:underline"> becomes {"decoration": "underline"}.
  4. Any text found becomes {"text": "this is the text"} in the JSON.
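The mapping in rule 3 can be sketched as a small lookup helper. This is only a minimal sketch and not part of the original question: the name `style_for` is hypothetical, and it takes the tag name and the raw `style` attribute as plain strings, so it runs without any HTML parser.

```python
def style_for(tag_name, style_attr=""):
    """Return the JSON style flags for one inline tag (hypothetical helper)."""
    if tag_name == "strong":
        return {"bold": True}
    if tag_name == "em":
        return {"italics": True}
    if tag_name == "span":
        # Turn "text-decoration: underline;" into {"text-decoration": "underline"}.
        declarations = {}
        for declaration in style_attr.split(";"):
            if not declaration.strip():
                continue
            prop, _, value = declaration.partition(":")
            declarations[prop.strip().lower()] = value.strip().lower()
        if "underline" in declarations.get("text-decoration", ""):
            return {"decoration": "underline"}
    # Anything else (including a span without underline) breaks rule 2.
    raise ValueError(f"unsupported tag: <{tag_name}>")


print(style_for("strong"))                               # {'bold': True}
print(style_for("span", "text-decoration: underline;"))  # {'decoration': 'underline'}
```

Raising an exception for anything outside the rules makes it easy to spot input that does not fit the expected structure.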

Let's say I have the HTML below. By using this:

from bs4 import BeautifulSoup as Soup

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]

Which produces this Array:

[
<p>The name is not mine it is for the people<span style="text-decoration: underline;"><em><strong>stephen</strong></em></span><em><strong> how can</strong>name </em><strong>good</strong> <em>his name <span style="text-decoration: underline;">moneuet</span>please </em><span style="text-decoration: underline;"><strong>forever</strong></span><em>tomorrow<strong>USA</strong></em></p>,
<p>2</p>,
<p><strong>moment</strong><em>Africa</em> <em>China</em> <span style="text-decoration: underline;">home</span> <em>thomas</em> <strong>nothing</strong></p>,
<ol><li>first item</li><li><em><span style="text-decoration: underline;"><strong>second item</strong></span></em></li></ol>
]

By Applying the rules above, this will be the result:

First Array element would be processed into this JSON:

{"text": [
    "The name is not mine it is for the people",
    {"text": "stephen", "decoration": "underline", "bold": True, "italics": True},
    {"text": "how can", "bold": True, "italics": True},
    {"text": "name", "italics": True},
    {"text": "good", "bold": True},
    {"text": "his name", "italics": True},
    {"text": "moneuet", "decoration": "underline"},
    {"text": "please ", "italics": True},
    {"text": "forever", "decoration": "underline", "bold": True},
    {"text": "tomorrow", "italics": True},
    {"text": "USA", "bold": True, "italics": True}
]}

Second Array element would be processed into this JSON:

{"text": ["2"] }

Third Array element would be processed into this JSON:

{"text": [
    {"text": "moment", "bold": True},
    {"text": "Africa", "italics": True},
    {"text": "China", "italics": True},
    {"text": "home", "decoration": "underline"},
    {"text": "thomas", "italics": True},
    {"text": "nothing", "bold": True}
]}

The fourth Array element would be processed into this JSON:

{"ol": [
    "first item",
    {"text": "second item", "decoration": "underline", "italics": True, "bold": True}
]}

This is my attempt so far. I am able to drill down, but I don't know how to process the arrayOfTextAndStyles list:

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]
for foundTag in allTags:
    foundTagStyles = [tag for tag in foundTag.find_all(recursive=True)]
    if len(foundTagStyles) > 0:
        if str(foundTag.name) == "p":
            # One entry for the <p> itself, then one entry per nested tag.
            arrayOfTextAndStyles = [
                {"tag": foundTag.name, "text": foundTag.find_all(text=True, recursive=False)}
            ] + [
                {"tag": tag.name, "text": foundTag.find_all(text=True, recursive=False)}
                for tag in foundTag.find_all()
            ]
        elif str(foundTag.name) == "ol":
            pass  # not handled yet
        elif str(foundTag.name) == "ul":
            pass  # not handled yet

Answer

I'd use a function to parse each element, not use one huge loop. Select on p and ol tags, and raise an exception in your parsing to flag anything that doesn't match your specific rules:

from bs4 import NavigableString

def parse(elem):
    if elem.name == 'ol':
        result = []
        for li in elem.find_all('li'):
            if len(li) > 1:
                result.append([parse_text(sub) for sub in li])
            else:
                result.append(parse_text(next(iter(li))))
        return {'ol': result}
    return {'text': [parse_text(sub) for sub in elem]}

def parse_text(elem):
    if isinstance(elem, NavigableString):
        return {'text': elem}

    result = {}
    if elem.name == 'em':
        result['italics'] = True
    elif elem.name == 'strong':
        result['bold'] = True
    elif elem.name == 'span':
        try:
            # rudimentary parse into a dictionary
            styles = dict(
                s.replace(' ', '').split(':')
                for s in elem.get('style', '').split(';')
                if s.strip()
            )
        except ValueError:
            raise ValueError('Invalid structure')
        if 'underline' not in styles.get('text-decoration', ''):
            raise ValueError('Invalid structure')
        result['decoration'] = 'underline'
    else:
        raise ValueError('Invalid structure')

    if len(elem) > 1:
        result['text'] = [parse_text(sub) for sub in elem]
    else:
        result.update(parse_text(next(iter(elem))))
    return result

You then parse your document:

for candidate in soup.select('ol,p'):
    try:
        result = parse(candidate)
    except ValueError:
        # invalid structure, ignore
        continue
    print(result)

Using pprint, this results in:

{'text': [{'text': 'The name is not mine it is for the people'},
          {'bold': True,
           'decoration': 'underline',
           'italics': True,
           'text': 'stephen'},
          {'italics': True,
           'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]},
          {'bold': True, 'text': 'good'},
          {'text': ' '},
          {'italics': True,
           'text': [{'text': 'his name '},
                    {'decoration': 'underline', 'text': 'moneuet'},
                    {'text': 'please '}]},
          {'bold': True, 'decoration': 'underline', 'text': 'forever'},
          {'italics': True,
           'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}]}
{'text': [{'text': '2'}]}
{'text': [{'bold': True, 'text': 'moment'},
          {'italics': True, 'text': 'Africa'},
          {'text': ' '},
          {'italics': True, 'text': 'China'},
          {'text': ' '},
          {'decoration': 'underline', 'text': 'home'},
          {'text': ' '},
          {'italics': True, 'text': 'thomas'},
          {'text': ' '},
          {'bold': True, 'text': 'nothing'}]}
{'ol': [{'text': 'first item'},
        {'bold': True,
         'decoration': 'underline',
         'italics': True,
         'text': 'second item'}]}
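The dictionaries above are Python objects, not JSON text yet. As a minimal sketch, assuming you want a JSON string, the standard-library `json` module can serialize one of them; the sample dictionary is copied by hand from the `<ol>` result, and Python's `True` is written as JSON `true`.

```python
import json

# Sample copied from the parsed <ol> element in the output above.
result = {
    "ol": [
        {"text": "first item"},
        {"bold": True, "decoration": "underline", "italics": True, "text": "second item"},
    ]
}

# sort_keys keeps the key order stable between runs.
encoded = json.dumps(result, sort_keys=True)
print(encoded)
# {"ol": [{"text": "first item"}, {"bold": true, "decoration": "underline", "italics": true, "text": "second item"}]}
```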

Note that the text nodes are now nested; this lets you consistently re-create the same structure, with correct whitespace and nested text decorations.
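If you would rather have the flat list of styled runs that the question asked for, the nested output can be flattened afterwards. The following is a sketch only: `flatten` is a hypothetical helper, not part of the code above, and it assumes every key other than 'text' is a style flag that applies to everything nested inside that node.

```python
def flatten(node, inherited=None):
    """Yield flat {'text': str, ...style flags} dicts from a nested node."""
    styles = dict(inherited or {})
    # Every key except 'text' is treated as a style that descendants inherit.
    styles.update({key: value for key, value in node.items() if key != "text"})
    content = node["text"]
    if isinstance(content, str):
        yield {"text": content, **styles}
    else:
        for child in content:
            yield from flatten(child, styles)


# One nested node taken from the first paragraph of the output above.
nested = {
    "italics": True,
    "text": [
        {"text": "his name "},
        {"decoration": "underline", "text": "moneuet"},
        {"text": "please "},
    ],
}
flat = list(flatten(nested))
print(flat)
```

Here 'moneuet' ends up with both italics and underline, because the underlined span sits inside the `<em>` element in the source HTML.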

The structure is also reasonably consistent: a 'text' key points at either a single string or a list of dictionaries, and such a list never mixes types. You could still improve on this: have 'text' only ever point to a string, and use a different key, such as 'contains' or 'nested', to signify nested data, then use just one or the other. All that would require is changing the 'text' keys in the len(elem) > 1 case and in the parse() function.
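As an alternative to editing the parser, the same idea can be sketched as a post-processing step on the dictionaries it already produces. `split_keys` and the default key name 'nested' are hypothetical choices for illustration, and the sketch assumes it is given a paragraph result or an inline node (not the 'ol' wrapper).

```python
def split_keys(node, nested_key="nested"):
    """Return a copy where 'text' is always a string and lists move to nested_key."""
    converted = {}
    for key, value in node.items():
        if key == "text" and not isinstance(value, str):
            # A list of child nodes: store it under the separate key instead.
            converted[nested_key] = [split_keys(child, nested_key) for child in value]
        else:
            converted[key] = value
    return converted


paragraph = {
    "text": [
        {"bold": True, "text": "moment"},
        {"italics": True, "text": [{"text": "tomorrow"}, {"bold": True, "text": "USA"}]},
    ]
}
print(split_keys(paragraph))
```

With this shape, a consumer can check for 'nested' to decide whether to recurse, instead of inspecting the type of the 'text' value.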

https://en.xdnf.cn/q/118134.html
