How to deal with large json files (flattening it to tsv) [closed]
2024/11/15 6:21:21
I am working with a large JSON file, specifically the Persona-Chat dataset (download here).
Each entry in Persona-Chat is a dict with two keys, personality and
utterances, and the dataset is a list of entries.
personality: list of strings containing the personality of the agent
utterances: list of dictionaries, each of which has two keys whose values are lists of strings:
candidates: [next_utterance_candidate_1, ..., next_utterance_candidate_19]
The last candidate is the ground truth response observed in the conversational data
history: [dialog_turn_0, ..., dialog_turn_N], where N is an odd number since the other user starts every conversation.
What I am trying to achieve is to flatten it and convert it to tsv in the following format:
col_index, string (where string is the personality, candidates and history)
But whenever I try to load it and convert it to a DataFrame:
import pandas as pd
df = pd.read_json(r'path')
display(df)
I get the following error:
ValueError: arrays must all be same length
Any help is appreciated, whether articles, other libraries/frameworks, or approaches. Even breadcrumbs!
Edit:
I am feeding it to another API which requires TSV, so I am looking for a way to concatenate the fields while preserving the structure, so that it can be rebuilt afterwards.
Answer
To fully flatten that file, you'd need something like the generator below, which yields one tuple per phrase:
import json

def read_personachat_file(name="personachat_self_original.json"):
    with open(name, "r") as f:
        data = json.load(f)
    for entry_type, chats in data.items():
        for chat_id, chat in enumerate(chats):
            personality = "|".join(chat["personality"])
            for utt_id, utt in enumerate(chat["utterances"]):
                for key in ("candidates", "history"):
                    for phrase_id, phrase in enumerate(utt[key]):
                        yield (entry_type, chat_id, personality, utt_id, key, phrase_id, phrase)

for entry in read_personachat_file():
    print(entry)
The output will be something like
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 7, 'my sister will be my mom , she wants me to get married')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 8, 'hi , how are ya ?')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 9, 'sounds good . i am just sitting here with my dog . i love animals .')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 10, "sure i'll go with you but i am baking a pizza right now , my favorite . come eat .")
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 11, 'where do you work then soccer person ?')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 12, 'it is so pretty in the fall and winter , my favorite time to go')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 13, 'i to travel and meet new people')
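Since the goal is a TSV file that can later be restructured, here is a minimal sketch of writing such tuples with the standard csv module and regrouping them afterwards. The column names, the write_tsv and regroup helpers, and the sample rows are all made up for illustration. In practice you would pass read_personachat_file() and an open file instead of the sample list and the in-memory buffer. Note that csv quotes any field containing a tab or a quote character, which the consuming API may or may not accept.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; they are not part of the dataset itself.
COLUMNS = ("entry_type", "chat_id", "personality", "utt_id", "key", "phrase_id", "phrase")


def write_tsv(rows, out):
    """Write an iterable of 7-tuples to a text stream as tab-separated values."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)


def regroup(tsv_in):
    """Rebuild {(entry_type, chat_id, utt_id): {"candidates": [...], "history": [...]}}."""
    grouped = defaultdict(lambda: {"candidates": [], "history": []})
    for row in csv.DictReader(tsv_in, delimiter="\t"):
        group_key = (row["entry_type"], int(row["chat_id"]), int(row["utt_id"]))
        # Rows were written in phrase_id order, so appending preserves that order.
        grouped[group_key][row["key"]].append(row["phrase"])
    return dict(grouped)


# Tiny made-up sample standing in for the generator's output.
sample_rows = [
    ("train", 0, "i like red .|i drive a red car .", 0, "candidates", 0, "hi , how are ya ?"),
    ("train", 0, "i like red .|i drive a red car .", 0, "candidates", 1, "i love animals ."),
    ("train", 0, "i like red .|i drive a red car .", 0, "history", 0, "hello there"),
]

buffer = io.StringIO()
write_tsv(sample_rows, buffer)
buffer.seek(0)
restored = regroup(buffer)
```

The personality string is repeated on every row, so it can be recovered per chat by splitting on "|", assuming no personality sentence contains that character.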