To simplify my problem: I have a data source in JSON, and I read all of its lines to load the information into a database. It seemed easy at first, but the problem is that my JSON is not written correctly.
So I wrote some code to read all my JSON lines, but it doesn't work on every line, for example on lines that contain a "bio" (biography) field.
Let me show you:
{"name": "Nazamiu0304 Rau0304majiu0304", "personal_name": "Nazamiu0304 Rau0304majiu0304", "last_modified": {"type": "/type/datetime", "value": "2008-08-20T18:00:41.270799"}, "key": "/authors/OL1001461A", "type": {"key": "/type/author"}, "revision": 2}
As you can see, there are fields such as name, personal_name, and so on.
Sometimes there is additional information:
{"bio": {"type": "/type/text", "value": "> "Eversley, William Pinder, B.C.L. Queen's Coll., Oxon, M.A., a member of the South-eastern circuit, reporter for Law Times in Queen's Bench division, a student of the Inner Temple 14 April, 1874 (then aged 23), called to the bar 25 April, 1877 (eldest son of William Eversley, Esq., of London); born u2060, 1851. rn> rn> 7, King's Bench Walk, Temple, E.C." rn> ...[in Foster's _Men at the Bar_][1]rnrnrn rnrn[1]: https://en.wikisource.org/wiki/Men-at-the-Bar/Eversley,_William_Pinder "Men at the Bar""}, "name": "William Pinder Eversley", "created": {"type": "/type/datetime", "value": "2008-04-01T03:28:50.625462"}, "death_date": "1918", "photos": [6897255, 6897254], "last_modified": {"type": "/type/datetime", "value": "2018-07-31T15:39:07.982159"}, "latest_revision": 6, "key": "/authors/OL1003081A", "birth_date": "1851", "personal_name": "William Pinder Eversley", "type": {"key": "/type/author"}, "revision": 6}{"name": "Valerie Meyer", "personal_name": "Valerie Meyer", "last_modified": {"type": "/type/datetime", "value": "2008-08-20T18:22:33.63997"}, "key": "/authors/OL1004062A", "type": {"key": "/type/author"}, "revision": 2}
You can see that I have a lot of problems with the "bio" element because it is not written correctly at all: the quotes inside the value are not escaped, so they are not interpreted correctly, and neither is ">". So I wrote some code to fix the structure of "bio" so that I can use it.
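To make the failure concrete, here is a minimal sketch with a made-up record (it is not taken from my data, and the helper name `try_parse` is only for illustration). It shows that `json.loads` rejects a value whose inner quotes are not escaped and accepts the same value once they are:

```python
import json

# Made-up record: the quotes around Hello inside the value are not escaped.
broken = '{"bio": {"type": "/type/text", "value": "> "Hello""}, "name": "X"}'
# Same record with the inner quotes escaped as \" in the JSON text.
fixed = '{"bio": {"type": "/type/text", "value": "> \\"Hello\\""}, "name": "X"}'


def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(try_parse(broken))                  # None
print(try_parse(fixed)["bio"]["value"])   # > "Hello"
```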
Here is my code to fix the structure of bio:
import re
import json
import pprint

bio_regex = re.compile(r"""
    ("bio":\s*{)          # bio field start
    (.*?)                 # content
    (},)                  # bio field end
    (?=\s*(?:"\w+"|}))    # followed by another one or the json end
    """, flags=re.VERBOSE | re.DOTALL)

value_regex = re.compile(r"""
    ("value":\s*")        # value field start
    (.*?)                 # content
    ("\s*\Z)              # value field end + end of string
    """, flags=re.VERBOSE | re.DOTALL)


def normalize_value(mo):
    start, content, end = mo.group(1, 2, 3)
    content = content.replace('"', '\\"')
    return start + content + end


def normalize_bio(mo):
    start, content, end = mo.group(1, 2, 3)
    content = value_regex.sub(normalize_value, content)
    return start + content + end


messy_json = """
{ "bio":{ "type":"/type/text","value":"> "Eversley, William Pinder, B.C.L. Queen's Coll., Oxon, M.A., a member of the South-eastern circuit, reporter for Law Times in Queen's Bench division, a student of the Inner Temple 14 April, 1874 (then aged 23), called to the bar 25 April, 1877 (eldest son of William Eversley, Esq., of London); born u2060, 1851. rn> rn> 7, King's Bench Walk, Temple, E.C." rn> ...[in Foster's Men at the Bar][1]rnrnrn rnrn[1]: https://en.wikisource.org/wiki/Men-at-the-Bar/Eversley,_William_Pinder "Men at the Bar""},"name":"William Pinder Eversley","created":{ "type":"/type/datetime","value":"2008-04-01T03:28:50.625462"},"death_date":"1918","photos":[ 6897255,6897254],"last_modified":{ "type":"/type/datetime","value":"2018-07-31T15:39:07.982159"},"latest_revision":6,"key":"/authors/OL1003081A","birth_date":"1851","personal_name":"William Pinder Eversley","type":{ "key":"/type/author"},"revision":6
}"""

result = bio_regex.sub(normalize_bio, messy_json)
obj = json.loads(result)
pprint.pprint(obj)
Here is the result:
{'bio': {'type': '/type/text',
         'value': '> "Eversley, William Pinder, B.C.L. Queen\'s Coll., Oxon, M.A., a member of the '
                  "South-eastern circuit, reporter for Law Times in Queen's Bench division, a student of "
                  'the Inner Temple 14 April, 1874 (then aged 23), called to the bar 25 April, 1877 (eldest '
                  "son of William Eversley, Esq., of London); born u2060, 1851. rn> rn> 7, King's Bench "
                  'Walk, Temple, E.C." rn> ...[in Foster\'s Men at the Bar][1]rnrnrn rnrn[1]: '
                  'https://en.wikisource.org/wiki/Men-at-the-Bar/Eversley,_William_Pinder "Men at the Bar"'},
 'birth_date': '1851',
 'created': {'type': '/type/datetime', 'value': '2008-04-01T03:28:50.625462'},
 'death_date': '1918',
 'key': '/authors/OL1003081A',
 'last_modified': {'type': '/type/datetime', 'value': '2018-07-31T15:39:07.982159'},
 'latest_revision': 6,
 'name': 'William Pinder Eversley',
 'personal_name': 'William Pinder Eversley',
 'photos': [6897255, 6897254],
 'revision': 6,
 'type': {'key': '/type/author'}}
The problem is that this script works when I paste one entire line into my code, but I would like to fix all 1,000,000 lines that have a bio, and I can't do that one by one. I tried many things with a loop to process the lines one at a time, but I always get an error. How can I upgrade my code so that a loop handles every line of the database that contains a bio, instead of a single hard-coded line?
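A minimal sketch of the kind of loop described above, assuming the file holds one JSON object per line and reusing the two patterns from the code above. The names `parse_line` and `parse_lines` and the sample lines are hypothetical and made up for illustration. Each line is first tried as-is, and the bio repair is applied only when plain parsing fails, so lines that are already valid are left untouched:

```python
import json
import re

# Same two patterns as in the code above, repeated so the sketch runs on its own.
bio_regex = re.compile(r"""
    ("bio":\s*{)          # bio field start
    (.*?)                 # content
    (},)                  # bio field end
    (?=\s*(?:"\w+"|}))    # followed by another key or the end of the object
    """, flags=re.VERBOSE | re.DOTALL)

value_regex = re.compile(r"""
    ("value":\s*")        # value field start
    (.*?)                 # content
    ("\s*\Z)              # closing quote + end of string
    """, flags=re.VERBOSE | re.DOTALL)


def normalize_value(mo):
    start, content, end = mo.group(1, 2, 3)
    return start + content.replace('"', '\\"') + end


def normalize_bio(mo):
    start, content, end = mo.group(1, 2, 3)
    return start + value_regex.sub(normalize_value, content) + end


def parse_line(line):
    """Parse one line; repair the bio field only if plain parsing fails."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        pass
    try:
        return json.loads(bio_regex.sub(normalize_bio, line))
    except json.JSONDecodeError:
        return None  # still broken, the caller decides what to do


def parse_lines(lines):
    """Return (records, failed_line_numbers) for any iterable of text lines."""
    records, failed = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = parse_line(line)
        if record is None:
            failed.append(number)
        else:
            records.append(record)
    return records, failed


# Made-up sample lines, not taken from the real dump.
sample = [
    '{"name": "A", "revision": 2}',
    '{"bio": {"type": "/type/text", "value": "> "Hello" said "Bob""}, "name": "B"}',
    'not json at all',
]
records, failed = parse_lines(sample)
print(len(records), failed)          # 2 [3]
print(records[1]["bio"]["value"])    # > "Hello" said "Bob"
```

Because a file object yields its lines one at a time, an open file can be passed directly, for example `parse_lines(open("authors.txt", encoding="utf-8"))` with a hypothetical file name, without hard-coding any line in the script. Lines that still fail are reported by number instead of stopping the loop. That would include a line where two objects are glued together, or one where "bio" is the last key, since the pattern expects `},` after it.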
Thanks in advance, and thanks for reading!