How to deal with large JSON files (flattening them to TSV) [closed]

2024/10/5 14:52:58

I am working with a large JSON file, specifically the Persona-Chat dataset.

Each entry in Persona-Chat is a dict with two keys, personality and utterances, and the dataset is a list of entries.

personality: list of strings containing the personality of the agent
utterances: list of dictionaries, each of which has two keys whose values are lists of strings:
    candidates: [next_utterance_candidate_1, ..., next_utterance_candidate_19]. The last candidate is the ground-truth response observed in the conversational data.
    history: [dialog_turn_0, ... dialog_turn N], where N is an odd number since the other user starts every conversation.

https://towardsdatascience.com/how-to-train-your-chatbot-with-simple-transformers-da25160859f4

What I am trying to achieve is to flatten it and convert it to TSV in the following format:

 col_index, string (where string is the personality, candidates, and history)

But whenever I try to load it and convert it to a DataFrame:

import pandas as pd
df = pd.read_json(r'path')
display(df)

I get the following error:

ValueError: arrays must all be same length

Any help is appreciated, whether articles, other libraries/frameworks, or different approaches. Even breadcrumbs!

Edit: I am feeding it to another API, which requires TSV. I am thinking of a way to concatenate the fields while preserving the structure, so that I can restructure it again later.

Answer

To fully flatten that file, you'd need something like

import json

def read_personachat_file(name="personachat_self_original.json"):
    with open(name, "r") as f:
        data = json.load(f)
    for entry_type, chats in data.items():
        for chat_id, chat in enumerate(chats):
            personality = "|".join(chat["personality"])
            for utt_id, utt in enumerate(chat["utterances"]):
                for key in ("candidates", "history"):
                    for phrase_id, phrase in enumerate(utt[key]):
                        yield (entry_type, chat_id, personality, utt_id, key, phrase_id, phrase)

for entry in read_personachat_file():
    print(entry)

The output will be something like

('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 7, 'my sister will be my mom , she wants me to get married')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 8, 'hi , how are ya ?')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 9, 'sounds good . i am just sitting here with my dog . i love animals .')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 10, "sure i'll go with you but i am baking a pizza right now , my favorite . come eat .")
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 11, 'where do you work then soccer person ?')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 12, 'it is so pretty in the fall and winter , my favorite time to go')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 13, 'i to travel and meet new people')

(whether or not that's useful for you).
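Since the goal in the question is a TSV file, here is a minimal sketch of how the flattened tuples could be written out with the standard-library csv module. It assumes the same top-level layout the generator above relies on (a dict mapping an entry type such as 'train' to a list of chats). The names flatten_chats, write_tsv and COLUMNS, the column names themselves, and the tiny sample data are made up for illustration and are not part of the original answer or the dataset.

```python
import csv
import io


def flatten_chats(data):
    """Yield one flat tuple per phrase, mirroring the generator in the answer."""
    for entry_type, chats in data.items():
        for chat_id, chat in enumerate(chats):
            personality = "|".join(chat["personality"])
            for utt_id, utt in enumerate(chat["utterances"]):
                for key in ("candidates", "history"):
                    for phrase_id, phrase in enumerate(utt[key]):
                        yield (entry_type, chat_id, personality,
                               utt_id, key, phrase_id, phrase)


# Hypothetical column names; the answer does not name the tuple fields.
COLUMNS = ("entry_type", "chat_id", "personality",
           "utt_id", "key", "phrase_id", "phrase")


def write_tsv(rows, out):
    """Write a header line plus one tab-separated line per row to `out`."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# Made-up sample in the assumed layout, standing in for the real file.
sample = {
    "train": [
        {
            "personality": ["i like red .", "i have a dog ."],
            "utterances": [
                {
                    "candidates": ["hello there", "hi , how are you ?"],
                    "history": ["hi !"],
                },
            ],
        }
    ]
}

buffer = io.StringIO()
write_tsv(flatten_chats(sample), buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

For the real file, data would come from json.load as in the answer, and out would be a file opened with open("out.tsv", "w", newline="", encoding="utf-8") (the output file name is only an example). Because every row carries entry_type, chat_id, utt_id, key and phrase_id, the rows can later be grouped on those columns to rebuild the nested structure.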
