pandas to_json() redundant backslashes

2024/9/20 23:21:54

I have a '.csv' file containing data about movies and I'm trying to reformat it as a JSON file to use it in MongoDB. So I loaded that csv file to a pandas DataFrame and then used to_json method to write it back. here is how one row in DataFrame looks like:

In [43]: result.iloc[0]
Out[43]: 
title                                                      Avatar
release_date                                                 2009
cast            [{"cast_id": 242, "character": "Jake Sully", "...
crew            [{"credit_id": "52fe48009251416c750aca23", "de...
Name: 0, dtype: object

but when pandas writes it back, it becomes like this:

{   "title":"Avatar","release_date":"2009","cast":"[{\"cast_id\": 242, \"character\": \"Jake Sully\", \"credit_id\": \"5602a8a7c3a3685532001c9a\", \"gender\": 2,...]","crew":"[{\"credit_id\": \"52fe48009251416c750aca23\", \"department\": \"Editing\", \"gender\": 0, \"id\": 1721,...]"
}

As you can see, 'cast' ans 'crew' are lists and they have tons of redundant backslashes. These backslashes appear in MongoDB collections and make it impossible to extract data from these two fields.

How can I solve this problem other than replacing \" with "?

P.S.1: this is how I save the DataFrame as JSON:

result.to_json('result.json', orient='records', lines=True)

UPDATE 1: Apparently pandas is doing just fine and the problem is caused by the original csv files. here is how they look like:

movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3},...]"

I tried to replace "" with " (and I really wanted to avoid this hack):

sed -i 's/\"\"/\"/g'

And of course it caused problems in some lines of data when reading it as csv again:

ParserError: Error tokenizing data. C error: Expected 1501 fields in line 4, saw 1513

So we can conclude it's not safe to do such blind replacement. Any idea?

P.S.2: I'm using kaggle's 5000 movie dataset: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset

Answer

I had the same issue : the solution is in 3 steps

1- Data-frame form csv or in my case from xlsx:

 excel_df= pd.read_excel(dataset ,sheet_name=my_sheet_name)

2- convert to json (if you have date in your data)

json_str = excel_df.to_json(orient='records' ,date_format='iso')

3-The most important thing : json.loads **** this is it !

parsed = json.loads(json_str)

4- (facultative) you can write or send the json file : for example : write locally

with open(out, 'w') as json_file:json_file.write(json.dumps({"data": parsed}, indent=4 ))

more info : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html

https://en.xdnf.cn/q/72292.html

Related Q&A

How can I get the old zip() in Python3?

I migrated from Python 2.7 to Python 3.3 and zip() does not work as expected anymore. Indeed, I read in the doc that it now returns an iterator instead of a list.So, how I am supposed to deal with this…

How can I use tensorflow metric function within keras models?

using python 3.5.2 tensorflow rc 1.1Im trying to use a tensorflow metric function in keras. the required function interface seems to be the same, but calling:import pandas import numpy import tensorflo…

Pandas return the next Sunday for every row

In Pandas for Python, I have a data set that has a column of datetimes in it. I need to create a new column that has the date of the following Sunday for each row. Ive tried various methods trying to u…

Where is `_softmax_cross_entropy_with_logits` defined in tensorflow?

I am trying to see how softmax_cross_entropy_with_logits_v2() is implemented. It calls _softmax_cross_entropy_with_logits(). But I dont see where the latter is defined. Does anybody know how to locate …

Python: Counting frequency of pairs of elements in a list of lists

Actually, I have a dataset about a "meeting". For example, A,B,C have a meeting, then the list would be [A,B,C]. Like this, each list would contain a list of members who participated in the …

How to create a pandas dataframe where columns are filled with random strings?

I want to create a Pandas dataframe with 2 columns and x number rows that contain random strings. I have found code to generate a pandas dataframe with random ints and a random stringer generator. I st…

Unable to make my script process locally created server response in the right way

Ive used a script to run selenium locally so that I can make use of the response (derived from selenium) within my spider.This is the web service where selenium runs locally:from flask import Flask, re…

using variable in a url in python

Sorry for this very basic question. I am new to Python and trying to write a script which can print the URL links. The IP addresses are stored in a file named list.txt. How should I use the variable in…

Create dynamic updated graph with Python

I need to write a script in Python that will take dynamically changed data, the source of data is not matter here, and display graph on the screen. I know how to use matplotlib, but the problem with m…

Converting a nested dictionary to a list

I know there are many dict to list questions on here but I cant quite find the information I need for my situation so Im asking a new quetion.Some background: Im using a hierarchical package for my mod…