Question 1

Let's say that we have the following JSON file. For the sake of the example it's emulated by a string. The string is the input and a Tree object should be the output. I'll be using the graphical notation of a tree to present the output.

I've found the following classes to handle tree concept in Python:

class TreeNode(object):def __init__(self, data):self.data = dataself.children = []def add_child(self, obj):self.children.append(obj)def __str__(self, level=0):ret = "\t"*level+repr(self.data)+"\n"for child in self.children:ret += child.__str__(level+1)return retdef __repr__(self):return '<tree node representation>'class Tree:def __init__(self):self.root = TreeNode('ROOT')def __str__(self):return self.root.__str__()

The input file can be of different complexity:

Simple case

Input:

json_file = '{"item1": "end1", "item2": "end2"}'

Output:

"ROOT"item1end1item2end2

Embedded case

Input:

json_file = {"item1": "end1", "item2": {"item3": "end3"}}

Output:

"ROOT"item1end1item2item3end3

Array case

Input:

json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }

Output:

"ROOT"nameJohnitems1item_namelettuceprice2.65unitsno2   item_nameketchupprice1.51unitslitres

Please note that each item in an array is described with an integer (starting at 1).

So far I've managed to come up with the following function that solves the problem for the simple case. In terms of the embedded case I know that I must use recursion but so far I get UnboundLocalError: local variable 'tree' referenced before assignment.

def create_tree_from_JSON(json, parent=None):if not parent:tree = Tree()node_0 = TreeNode("ROOT")tree.root = node_0parent = node_0else:parent = parentfor key in json:if isinstance(json[key], dict):head = TreeNode(key)create_tree_from_JSON(json[key], head)else:node = TreeNode(key)node.add_child(TreeNode(json[key]))parent.add_child(node)return tree

Problem's background

You may wonder why would I need to change a JSON object into a tree. As you may know PostgreSQL provides a way to handle JSON fields in the database. Given a JSON object I can get the value of any field by using -> and ->> notation. Here and here more about the subject. I will be creating new tables based on the fields' names and values. Unfortunately the JSON objects vary to such an extent that I cannot write the .sql code manually - I must find a way to do it automatically.

Let's assume that I want to create a table based on the embedded case. I need to get the following .sql code:

select content_json ->> 'item1' as end1,content_json -> 'item_2' ->> 'item_3' as end3
from table_with_json

Substitute content_json for "ROOT" and you can see that each line in SQL code is simply a depth-first traversal from "ROOT" to a leaf (move from the last node to leaf is always annotated with ->>).

EDIT: In order to make the question more clear I'm adding the target .sql query for the array case. I would like there to be as many queries as there are elements in the array:

selectcontent_json ->> 'name' as name,content_json -> 'items' -> 1 -> 'item_name' as item_name,content_json -> 'items' -> 1 -> 'price' as price,content_json -> 'items' -> 1 -> 'units' as units
from table_with_jsonselectcontent_json ->> 'name' as name,content_json -> 'items' -> 2 ->> 'item_name' as item_name,content_json -> 'items' -> 2 ->> 'price' as price,content_json -> 'items' -> 2 ->> 'units' as units
from table_with_json

Solution so far (07.05.2019)

I'm testing the current solution for the moment:

from collections import OrderedDictdef treeify(data) -> dict:if isinstance(data, dict):  # already have keys, just recursereturn OrderedDict((key, treeify(children)) for key, children in data.items())elif isinstance(data, list):  # make keys from indicesreturn OrderedDict((idx, treeify(children)) for idx, children in enumerate(data, start=1))else:  # leave node, no recursionreturn datadef format_query(tree, stack=('content_json',)) -> str:if isinstance(tree, dict):  # build stack of keysfor key, child in tree.items():yield from format_query(child, stack + (key,))else:  # print complete stack, discarding leaf data in tree*keys, field = stackpath = ' -> '.join(str(key) if isinstance(key, int) else "'%s'" % keyfor key in keys)yield path + " ->> '%s' as %s" % (field, field)def create_select_query(lines_list):query = "select\n"for line_number in range(len(lines_list)):if "_class" in lines_list[line_number]:# ignore '_class' fieldscontinuequery += "\t" + lines_list[line_number]if line_number == len(lines_list)-1:query += "\n"else:query += ",\n"query += "from table_with_json"return query

I'm currently working on a JSON like this:

stack_nested_example = {"_class":"value_to_be_ignored","first_key":{"second_key":{"user_id":"123456","company_id":"9876","question":{"subject":"some_subject","case_type":"urgent","from_date":{"year":2011,"month":11,"day":11},"to_date":{"year":2012,"month":12,"day":12}},"third_key":[{"role":"driver","weather":"great"},{"role":"father","weather":"rainy"}]}}}

In the output I get the only constant element is the order of lines treated with array logic. Order of other lines differs. The output I would like to get is the one that takes into account order of the keys:

select'content_json' -> 'first_key' -> 'second_key' ->> 'user_id' as user_id,'content_json' -> 'first_key' -> 'second_key' ->> 'company_id' as company_id,'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'subject' as subject,'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'case_type' as case_type,'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'year' as year,'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'month' as month,'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'day' as day,'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'year' as year,'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'month' as month,'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'day' as day,'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'role' as role,'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'weather' as weather,'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'role' as role,'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'weather' as weather
from table_with_json

Question 2

In your create_tree_from_JSON you never pass on the tree during recursion. Yet you try to return it.

def create_tree_from_JSON(json, parent=None):if not parent:tree = Tree()  # tree is only created for root node...else:parent = parent  # tree is not created here...return tree  # tree is always returned

Either pass on the tree during recursion, or separate the root step from the others:

def create_tree_from_JSON(json):  # root casetree = Tree()node_0 = TreeNode("ROOT")tree.root = node_0parent = node_0_walk_tree(json, parent)def _walk_tree(json, parent):  # recursive casefor key in json:if isinstance(json[key], dict):head = TreeNode(key)_walk_tree(json[key], head)else:node = TreeNode(key)node.add_child(TreeNode(json[key]))parent.add_child(node)

Note that what you are doing can be solved much easier using plain dicts. Your class is effectively just wrapping a custom interface around dict to begin with.

def treeify(data) -> dict:if isinstance(data, dict):  # already have keys, just recursereturn {key: treeify(children) for key, children in data.items()}elif isinstance(data, list):  # make keys from indicesreturn {idx: treeify(children) for idx, children in enumerate(data, start=1)}else:  # leave node, no recursionreturn data

You can feed any decoded json data to this.

>>> treeify(json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })
{'name': 'John', 'items': {1: {'item_name': 'lettuce', 'price': 2.65, 'units': 'no'}, 2: {'item_name': 'ketchup', 'price': 1.51, 'units': 'litres'}}}

To get the desired pretty-printed output, you can walk this structure with a stack of current keys. A generator is appropriate to create each query line on the fly:

def format_query(tree, stack=('content_json',)) -> str:if isinstance(tree, dict):  # build stack of keysfor key, child in tree.items():yield from format_query(child, stack + (key,))else:  # print complete stack, discarding leaf data in tree*keys, field = stackpath = ' -> '.join(str(key) if isinstance(key, int) else "'%s'" % keyfor key in keys)yield path + " ->> '%s' as %s" % (field, field)

Given your second example, this allows you to get a list of query lines:

>>> list(format_query(treeify({ "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })))
["'content_json' ->> 'name' as name","'content_json' -> 'items' -> 1 ->> 'item_name' as item_name","'content_json' -> 'items' -> 1 ->> 'price' as price","'content_json' -> 'items' -> 1 ->> 'units' as units","'content_json' -> 'items' -> 2 ->> 'item_name' as item_name","'content_json' -> 'items' -> 2 ->> 'price' as price","'content_json' -> 'items' -> 2 ->> 'units' as units"]

Python create tree from a JSON file

Simple case

Embedded case

Array case

Problem's background

Solution so far (07.05.2019)

Related Q&A

disable `functools.lru_cache` from inside function

How to clear tf.flags?

Stochastic Optimization in Python

Pandas convert yearly to monthly

Firebase database data to R

Django 1.8 Migrations - NoneType object has no attribute _meta

Manage dependencies of git submodules with poetry

Create Boxplot Grouped By Column

How can I configure gunicorn to use a consistent error log format?

Implementing seq2seq with beam search