pyspark: Save schemaRDD as json file

2024/10/4 21:21:45

I am looking for a way to export data from Apache Spark to various other tools in JSON format. I presume there must be a really straightforward way to do it.

Example: I have the following JSON file 'jfile.json':

{"key":value_a1, "key2":value_b1},
{"key":value_a2, "key2":value_b2},
{...}

where each line of the file is a JSON object. These kind of files can be easily read into PySpark with

jsonRDD = jsonFile('jfile.json')

and then look like (by calling jsonRDD.collect()):

[Row(key=value_a1, key2=value_b1),Row(key=value_a2, key2=value_b2)]

Now I want to save these kind of files back to a pure JSON file.

I found this entry on the Spark User list:

http://apache-spark-user-list.1001560.n3.nabble.com/Updating-exising-JSON-files-td12211.html

that claimed using

RDD.saveAsTextFile(jsonRDD) 

After doing this, the text file looks like

Row(key=value_a1, key2=value_b1)
Row(key=value_a2, key2=value_b2)

, i.e., the jsonRDD has just been plainly written to the file. I would have expected a kind of an "automagic" conversion back to JSON format after reading the Spark User List entry. My goal is to have a file that looks like 'jfile.json' mentioned in the beginning.

Am I missing a really obvious easy way to do this?

I read http://spark.apache.org/docs/latest/programming-guide.html, searched google, the user list and stack overflow for answers, but almost all answers deal with reading and parsing JSON into Spark. I even bought the book 'Learning Spark', but the examples there (p. 71) just lead to the same output file as above.

Can anybody help me out here? I feel like I am missing just a small link in here

Cheers and thanks in advance!

Answer

You can use the method toJson() , it allows you to convert a SchemaRDD into a MappedRDD of JSON documents.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=tojson#pyspark.sql.SchemaRDD.toJSON

https://en.xdnf.cn/q/70567.html

Related Q&A

How to start a background thread when django server is up?

TL;DR In my django project, where do I put my "start-a-thread" code to make a thread run as soon as the django server is up?First of all, Happy New Year Everyone! This is a question from a n…

Can I use SQLAlchemy relationships in ORM event callbacks? Always get None

I have a User model that resembles the following:class User(db.Model):id = db.Column(db.BigInteger, primary_key=True)account_id = db.Column(db.BigInteger, db.ForeignKey(account.id))account = db.relatio…

Elegant way to transform a list of dict into a dict of dicts

I have a list of dictionaries like in this example:listofdict = [{name: Foo, two: Baz, one: Bar}, {name: FooFoo, two: BazBaz, one: BarBar}]I know that name exists in each dictionary (as well as the oth…

sharpen image to detect edges/lines in a stamped X object on paper

Im using python & opencv. My goal is to detect "X" shaped pieces in an image taken with a raspberry pi camera. The project is that we have pre-printed tic-tac-toe boards, and must image t…

How to change the color of lines within a subplot?

My goal is to create a time series plot for each column in my data with their corresponding rolling mean. Id like the color of the lines across subplots to be different. For example, for gym and rollin…

Cythons calculations are incorrect

I implemented the Madhava–Leibniz series to calculate pi in Python, and then in Cython to improve the speed. The Python version:from __future__ import division pi = 0 l = 1 x = True while True:if x:pi…

Python: NLTK and TextBlob in french

Im using NLTK and TextBlob to find nouns and noun phrases in a text:from textblob import TextBlob import nltkblob = TextBlob(text) print(blob.noun_phrases) tokenized = nltk.word_tokenize(text) nouns =…

How can I run a script as part of a Travis CI build?

As part of a Python package I have a script myscript.py at the root of my project and setup(scripts=[myscript.py], ...) in my setup.py.Is there an entry I can provide to my .travis.yml that will run my…

Writing nested schema to BigQuery from Dataflow (Python)

I have a Dataflow job to write to BigQuery. It works well for non-nested schema, however fails for the nested schema.Here is my Dataflow pipeline:pipeline_options = PipelineOptions()p = beam.Pipeline(o…

Python decorators on class members fail when decorator mechanism is a class

When creating decorators for use on class methods, Im having trouble when the decorator mechanism is a class rather than a function/closure. When the class form is used, my decorator doesnt get treated…