Reading pretty-printed JSON files in Apache Spark

2024/11/14 22:19:45

I have a lot of JSON files in my S3 bucket and I want to be able to read and query them. The problem is that they are pretty-printed. Each JSON file holds just one massive dictionary, but it isn't on a single line. As per this thread, each dictionary in a JSON file should sit on one line, which is a limitation of Apache Spark. My files aren't structured that way.

My JSON schema looks like this -

{"dataset": [{"key1": [{"range": "range1", "value": 0.0}, {"range": "range2", "value": 0.23}]}, {..}, {..}],"last_refreshed_time": "2016/09/08 15:05:31"
}

Here are my questions -

  1. Can I avoid converting these files to match the schema required by Apache Spark (one dictionary per line in a file) and still be able to read them?

  2. If not, what's the best way to do it in Python? I have a bunch of these files for each day in the bucket. The bucket is partitioned by day.

  3. Is there any other tool better suited to querying these files than Apache Spark? I'm on the AWS stack, so I can try out any suggested tool with a Zeppelin notebook.

Answer

You could use sc.wholeTextFiles(). Here is a related post.
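A minimal sketch of that approach, assuming an active SparkContext sc and SparkSession spark; the bucket path is illustrative. wholeTextFiles returns (path, content) pairs, so line breaks inside each document no longer matter:

import json

# Each element is (file_path, whole_file_contents).
raw = sc.wholeTextFiles("s3a://my-bucket/day=2016-09-08/*.json")  # hypothetical path

# Parse each file, pull out the "dataset" list, and re-serialize each
# entry as a single-line JSON string that spark.read.json understands.
entries = (raw
           .map(lambda kv: json.loads(kv[1]))
           .flatMap(lambda doc: doc["dataset"])
           .map(json.dumps))

df = spark.read.json(entries)
df.printSchema()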

Alternatively, you could reformat your JSON using a simple function and load the generated file.

import json

def reformat_json(input_path, output_path):
    """Rewrite a pretty-printed JSON file as one JSON document per line."""
    with open(input_path, 'r') as handle:
        jarr = json.load(handle)
    # Assumes the top level is a JSON array; for the schema above you
    # would iterate over jarr["dataset"] instead.
    with open(output_path, 'w') as out:
        for entry in jarr:
            out.write(json.dumps(entry) + "\n")
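Hypothetical usage, assuming the reformatted output is written somewhere Spark can read (the file names are illustrative):

reformat_json("2016-09-08-pretty.json", "2016-09-08-lines.json")
df = spark.read.json("2016-09-08-lines.json")  # now one record per line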

Related Q&A

Visualize TFLite graph and get intermediate values of a particular node?

I was wondering if there is a way to know the list of inputs and outputs for a particular node in tflite? I know that I can get input/outputs details, but this does not allow me to reconstruct the com…

Why do I get a pymongo.cursor.Cursor when trying to query my mongodb db via pymongo?

I have consumed a bunch of tweets in a mongodb database. I would like to query these tweets using pymongo. For example, I would like to query for screen_name. However, when I try to do this, python doe…

using dropbox as a server for my django app

I dont know if at all i make any sense, but this popped up in my mind. Can we use the 2gb free hosting of dropbox to put our django app over there and do some hacks to run our app?

Proper overloading of json encoding and decoding with Flask

I am trying to add some overloading to the Flask JSON encoder/decoder to add datetime encoding/decoding but only succeeded through a hack.from flask import Flask, flash, url_for, redirect, render_templ…

How to check a specific type of tuple or list?

Suppose, var = (x, 3)How to check if a variable is a tuple with only two elements, first being a type str and the other a type int in python? Can we do this using only one check? I want to avoid this…

Cannot import name BlockBlobService

I got the following error:from azure.storage.blob import BlockBlobService ImportError: cannot import name BlockBlobServicewhen trying to run my python project using command prompt. (The code seems to…

Legend outside the plot in Python - matplotlib

Im trying to place a rather extensive legend outside my plot in matplotlib. The legend has quite a few entries, and each entry can be quite long (but I dont know exactly how long).Obviously, thats quit…

Filter items that only occurs once in a very large list

I have a large list(over 1,000,000 items), which contains english words:tokens = ["today", "good", "computer", "people", "good", ... ]Id like to get al…

Get Data JSON in Flask

Even following many example here & there, i cant get my API work in POST Method. Here the code about it :from flask import Flask, jsonify, request@app.route(/api/v1/lists, methods=[POST]) def add_e…

Commands working on windows command line but not in Git Bash terminal

I am trying to run certain commands in Git Bash but they continue to hang and not display anything. When I run them in the Windows command prompt they work.For example, in my windows command prompt the…