I am looking for a way to export data from Apache Spark to various other tools in JSON format. I presume there must be a really straightforward way to do it.
Example: I have the following JSON file 'jfile.json':
{"key":value_a1, "key2":value_b1},
{"key":value_a2, "key2":value_b2},
{...}
where each line of the file is a single JSON object (there are no commas between the lines). Files of this kind can easily be read into PySpark with
jsonRDD = sqlContext.jsonFile('jfile.json')
and the result then looks like this (output of jsonRDD.collect()):
[Row(key=value_a1, key2=value_b1), Row(key=value_a2, key2=value_b2)]
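To make the starting point explicit, the line-delimited layout of 'jfile.json' can be illustrated in plain Python (the values are placeholders, and this is only meant to show the format, not Spark itself):

```python
import json

# Each line of the file is a standalone JSON object (one record per line);
# the file as a whole is NOT a single JSON array.
lines = [
    '{"key": "value_a1", "key2": "value_b1"}',
    '{"key": "value_a2", "key2": "value_b2"}',
]

# Parsing line by line recovers one dict per record,
# which is the structure Spark's Rows correspond to.
records = [json.loads(line) for line in lines]
print(records[0]["key"])   # value_a1
print(records[1]["key2"])  # value_b2
```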
Now I want to save this data back to a pure JSON file of the same form.
I found this entry on the Spark User list:
http://apache-spark-user-list.1001560.n3.nabble.com/Updating-exising-JSON-files-td12211.html
which suggests using
jsonRDD.saveAsTextFile('output_dir')
After doing this, the text file looks like
Row(key=value_a1, key2=value_b1)
Row(key=value_a2, key2=value_b2)
, i.e., the string representation of each Row has simply been written to the file verbatim. After reading the Spark user list entry, I would have expected some kind of "automagic" conversion back to JSON format. My goal is a file that looks like 'jfile.json' from the beginning of this question.
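In plain Python, the conversion I am after would be something like the following sketch (the dicts are placeholders standing in for the collected Rows; with real Row objects one would presumably have to convert each Row to a dict first, e.g. via something like row.asDict(), which I have not verified):

```python
import json

# Placeholder records standing in for the collected Rows.
records = [
    {"key": "value_a1", "key2": "value_b1"},
    {"key": "value_a2", "key2": "value_b2"},
]

# Serialize each record back to one JSON object per line,
# which reproduces the original layout of 'jfile.json'.
json_lines = [json.dumps(r, sort_keys=True) for r in records]
print("\n".join(json_lines))
```

Inside Spark, I imagine the same idea would be a map over the RDD that turns each record into a JSON string before calling saveAsTextFile, but I have not found a documented way to do that.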
Am I missing a really obvious easy way to do this?
I read http://spark.apache.org/docs/latest/programming-guide.html, and searched Google, the user list, and Stack Overflow for answers, but almost all results deal with reading and parsing JSON into Spark, not with writing it back out. I even bought the book 'Learning Spark', but the example there (p. 71) leads to the same output file as above.
Can anybody help me out here? I feel like I am just missing a small link.
Cheers and thanks in advance!