Reading pretty-printed JSON files in Apache Spark

2024/11/14 22:19:45

I have a lot of JSON files in my S3 bucket and I want to be able to read and query them. The problem is that they are pretty-printed. Each JSON file holds just one massive dictionary, but it isn't on a single line. As per this thread, each dictionary in a JSON file should sit on one line, which is a limitation of Apache Spark. My files aren't structured that way.

My JSON schema looks like this -

{"dataset": [{"key1": [{"range": "range1", "value": 0.0}, {"range": "range2", "value": 0.23}]}, {..}, {..}],"last_refreshed_time": "2016/09/08 15:05:31"
}

Here are my questions -

  1. Can I avoid converting these files to match the schema required by Apache Spark (one dictionary per line in a file) and still be able to read them?

  2. If not, what's the best way to do it in Python? I have a bunch of these files for each day in the bucket. The bucket is partitioned by day.

  3. Is there any other tool better suited to querying these files than Apache Spark? I'm on the AWS stack, so I can try out any suggested tool with a Zeppelin notebook.

Answer

You could use sc.wholeTextFiles(). Here is a related post.
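A minimal sketch of that approach, assuming an active SparkContext sc and SparkSession spark; the bucket path is illustrative. wholeTextFiles returns (path, content) pairs, so line breaks inside each document no longer matter:

import json

# Each element is (file_path, whole_file_contents).
raw = sc.wholeTextFiles("s3a://my-bucket/day=2016-09-08/*.json")  # hypothetical path

# Parse each file, pull out the "dataset" list, and re-serialize each
# entry as a single-line JSON string that spark.read.json understands.
entries = (raw
           .map(lambda kv: json.loads(kv[1]))
           .flatMap(lambda doc: doc["dataset"])
           .map(json.dumps))

df = spark.read.json(entries)
df.printSchema()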

Alternatively, you could reformat your JSON using a simple function and load the generated file.

import json

def reformat_json(input_path, output_path):
    """Rewrite a pretty-printed JSON file as one JSON document per line."""
    with open(input_path, 'r') as handle:
        jarr = json.load(handle)
    # Assumes the top level is a JSON array; for the schema above you
    # would iterate over jarr["dataset"] instead.
    with open(output_path, 'w') as out:
        for entry in jarr:
            out.write(json.dumps(entry) + "\n")
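Hypothetical usage, assuming the reformatted output is written somewhere Spark can read (the file names are illustrative):

reformat_json("2016-09-08-pretty.json", "2016-09-08-lines.json")
df = spark.read.json("2016-09-08-lines.json")  # now one record per line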

Related Q&A

Visualize TFLite graph and get intermediate values of a particular node?

I was wondering if there is a way to know the list of inputs and outputs for a particular node in tflite? I know that I can get input/outputs details, but this does not allow me to reconstruct the com…

Why do I get a pymongo.cursor.Cursor when trying to query my mongodb db via pymongo?

I have consumed a bunch of tweets in a mongodb database. I would like to query these tweets using pymongo. For example, I would like to query for screen_name. However, when I try to do this, python doe…

using dropbox as a server for my django app

I dont know if at all i make any sense, but this popped up in my mind. Can we use the 2gb free hosting of dropbox to put our django app over there and do some hacks to run our app?

Proper overloading of json encoding and decoding with Flask

I am trying to add some overloading to the Flask JSON encoder/decoder to add datetime encoding/decoding but only succeeded through a hack.from flask import Flask, flash, url_for, redirect, render_templ…

How to check a specific type of tuple or list?

Suppose, var = (x, 3)How to check if a variable is a tuple with only two elements, first being a type str and the other a type int in python? Can we do this using only one check? I want to avoid this…

Cannot import name BlockBlobService

I got the following error:from azure.storage.blob import BlockBlobService ImportError: cannot import name BlockBlobServicewhen trying to run my python project using command prompt. (The code seems to…

Legend outside the plot in Python - matplotlib

Im trying to place a rather extensive legend outside my plot in matplotlib. The legend has quite a few entries, and each entry can be quite long (but I dont know exactly how long).Obviously, thats quit…

Filter items that only occurs once in a very large list

I have a large list(over 1,000,000 items), which contains english words:tokens = ["today", "good", "computer", "people", "good", ... ]Id like to get al…

Get Data JSON in Flask

Even following many example here & there, i cant get my API work in POST Method. Here the code about it :from flask import Flask, jsonify, request@app.route(/api/v1/lists, methods=[POST]) def add_e…

Commands working on windows command line but not in Git Bash terminal

I am trying to run certain commands in Git Bash but they continue to hang and not display anything. When I run them in the Windows command prompt they work.For example, in my windows command prompt the…