Question 1

I am working with a large JSON file, specifically the Persona-Chat dataset (download here).

Each entry in Persona-Chat is a dict with two keys, personality and utterances, and the dataset is a list of entries.
personality: list of strings containing the personality of the agent
utterances: list of dictionaries, each of which has two keys whose values are lists of strings:
    candidates: [next_utterance_candidate_1, ..., next_utterance_candidate_19]
    The last candidate is the ground truth response observed in the conversational data.
    history: [dialog_turn_0, ..., dialog_turn_N], where N is an odd number since the other user starts every conversation.
https://towardsdatascience.com/how-to-train-your-chatbot-with-simple-transformers-da25160859f4

What I am trying to achieve is to flatten it and convert it to TSV in the following format:

 col_index, string (where string is the personality, candidates, and history)

But whenever I try to load it and convert it to a DataFrame:

import pandas as pd
df = pd.read_json(r'path')
display(df)

I get the following error:

ValueError: arrays must all be same length

Any help is appreciated, whether articles, other libraries/frameworks, or approaches. Even breadcrumbs!

Edit: I am feeding it to another API which requires TSV, so I am thinking of a way to concatenate the fields while preserving the structure, so that I can restructure it again afterwards.

Answer

To fully flatten that file, you'd need something like

import json


def read_personachat_file(name="personachat_self_original.json"):
    with open(name, "r") as f:
        data = json.load(f)
    for entry_type, chats in data.items():
        for chat_id, chat in enumerate(chats):
            personality = "|".join(chat["personality"])
            for utt_id, utt in enumerate(chat["utterances"]):
                for key in ("candidates", "history"):
                    for phrase_id, phrase in enumerate(utt[key]):
                        yield (entry_type, chat_id, personality, utt_id, key, phrase_id, phrase)


for entry in read_personachat_file():
    print(entry)

The output will be something like

('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 7, 'my sister will be my mom , she wants me to get married')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 8, 'hi , how are ya ?')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 9, 'sounds good . i am just sitting here with my dog . i love animals .')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 10, "sure i'll go with you but i am baking a pizza right now , my favorite . come eat .")
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 11, 'where do you work then soccer person ?')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 12, 'it is so pretty in the fall and winter , my favorite time to go')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 13, 'i to travel and meet new people')

(whether or not that's useful for you).
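
Since the goal is a TSV file, tuples like the ones above can be handed to the standard library's csv module with a tab delimiter. The following is only a minimal sketch under a few assumptions: the column names in COLUMNS and the helper names flatten_personachat and write_tsv are made up for illustration, and the tiny sample dict is invented data that just mimics the shape of the real file.

```python
import csv
import io


def flatten_personachat(data):
    """Yield one flat row per phrase from an already-loaded Persona-Chat dict."""
    for entry_type, chats in data.items():
        for chat_id, chat in enumerate(chats):
            personality = "|".join(chat["personality"])
            for utt_id, utt in enumerate(chat["utterances"]):
                for key in ("candidates", "history"):
                    for phrase_id, phrase in enumerate(utt[key]):
                        yield (entry_type, chat_id, personality,
                               utt_id, key, phrase_id, phrase)


# Hypothetical header names; pick whatever the consuming API expects.
COLUMNS = ("entry_type", "chat_id", "personality",
           "utt_id", "key", "phrase_id", "phrase")


def write_tsv(rows, out):
    """Write rows to a file-like object as tab-separated values."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# Invented sample with the same nesting as the real file (not real data).
sample = {
    "train": [
        {
            "personality": ["i like red .", "i drive a red car ."],
            "utterances": [
                {"candidates": ["hi , how are ya ?", "i am fine ."],
                 "history": ["hello !"]},
            ],
        }
    ]
}

buffer = io.StringIO()
write_tsv(flatten_personachat(sample), buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

For the real file you would load it with json.load and pass a file opened with open("out.tsv", "w", newline="", encoding="utf-8") instead of the StringIO buffer. Because every row carries chat_id, utt_id, key and phrase_id, the nested structure can in principle be rebuilt later by grouping on those columns.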

How to deal with large json files (flattening it to tsv) [closed]
