Python create tree from a JSON file

2024/10/4 1:15:27

Let's say that we have the following JSON file. For the sake of the example it's emulated by a string. The string is the input and a Tree object should be the output. I'll be using the graphical notation of a tree to present the output.

I've found the following classes to handle the tree concept in Python:

class TreeNode(object):
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, obj):
        self.children.append(obj)

    def __str__(self, level=0):
        ret = "\t"*level+repr(self.data)+"\n"
        for child in self.children:
            ret += child.__str__(level+1)
        return ret

    def __repr__(self):
        return '<tree node representation>'


class Tree:
    def __init__(self):
        self.root = TreeNode('ROOT')

    def __str__(self):
        return self.root.__str__()

The input file can be of different complexity:

Simple case

Input:

json_file = '{"item1": "end1", "item2": "end2"}'

Output:

"ROOT"
    item1
        end1
    item2
        end2

Embedded case

Input:

json_file = {"item1": "end1", "item2": {"item3": "end3"}}

Output:

"ROOT"
    item1
        end1
    item2
        item3
            end3

Array case

Input:

json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }

Output:

"ROOT"
    name
        John
    items
        1
            item_name
                lettuce
            price
                2.65
            units
                no
        2
            item_name
                ketchup
            price
                1.51
            units
                litres

Please note that each item in an array is described with an integer (starting at 1).

So far I've managed to come up with the following function, which solves the problem for the simple case. For the embedded case I know that I must use recursion, but so far I get UnboundLocalError: local variable 'tree' referenced before assignment.

def create_tree_from_JSON(json, parent=None):
    if not parent:
        tree = Tree()
        node_0 = TreeNode("ROOT")
        tree.root = node_0
        parent = node_0
    else:
        parent = parent
    for key in json:
        if isinstance(json[key], dict):
            head = TreeNode(key)
            create_tree_from_JSON(json[key], head)
        else:
            node = TreeNode(key)
            node.add_child(TreeNode(json[key]))
            parent.add_child(node)
    return tree

Problem's background

You may wonder why I would need to change a JSON object into a tree. As you may know, PostgreSQL provides a way to handle JSON fields in the database. Given a JSON object, I can get the value of any field by using the -> and ->> notation. I will be creating new tables based on the fields' names and values. Unfortunately, the JSON objects vary to such an extent that I cannot write the .sql code manually; I must find a way to do it automatically.

Let's assume that I want to create a table based on the embedded case. I need to get the following .sql code:

select
    content_json ->> 'item1' as end1,
    content_json -> 'item2' ->> 'item3' as end3
from table_with_json

Substitute content_json for "ROOT" and you can see that each line in the SQL code is simply a depth-first traversal from "ROOT" to a leaf (the move from the last node to the leaf is always annotated with ->>).

EDIT: In order to make the question more clear I'm adding the target .sql query for the array case. I would like there to be as many queries as there are elements in the array:

select
    content_json ->> 'name' as name,
    content_json -> 'items' -> 1 ->> 'item_name' as item_name,
    content_json -> 'items' -> 1 ->> 'price' as price,
    content_json -> 'items' -> 1 ->> 'units' as units
from table_with_json

select
    content_json ->> 'name' as name,
    content_json -> 'items' -> 2 ->> 'item_name' as item_name,
    content_json -> 'items' -> 2 ->> 'price' as price,
    content_json -> 'items' -> 2 ->> 'units' as units
from table_with_json

Solution so far (07.05.2019)

I'm testing the current solution for the moment:

from collections import OrderedDict


def treeify(data) -> dict:
    if isinstance(data, dict):  # already have keys, just recurse
        return OrderedDict((key, treeify(children)) for key, children in data.items())
    elif isinstance(data, list):  # make keys from indices
        return OrderedDict((idx, treeify(children)) for idx, children in enumerate(data, start=1))
    else:  # leaf node, no recursion
        return data


def format_query(tree, stack=('content_json',)) -> str:
    if isinstance(tree, dict):  # build stack of keys
        for key, child in tree.items():
            yield from format_query(child, stack + (key,))
    else:  # print complete stack, discarding leaf data in tree
        *keys, field = stack
        path = ' -> '.join(
            str(key) if isinstance(key, int) else "'%s'" % key
            for key in keys
        )
        yield path + " ->> '%s' as %s" % (field, field)


def create_select_query(lines_list):
    query = "select\n"
    for line_number in range(len(lines_list)):
        if "_class" in lines_list[line_number]:
            # ignore '_class' fields
            continue
        query += "\t" + lines_list[line_number]
        if line_number == len(lines_list)-1:
            query += "\n"
        else:
            query += ",\n"
    query += "from table_with_json"
    return query

I'm currently working on a JSON like this:

stack_nested_example = {
    "_class": "value_to_be_ignored",
    "first_key": {
        "second_key": {
            "user_id": "123456",
            "company_id": "9876",
            "question": {
                "subject": "some_subject",
                "case_type": "urgent",
                "from_date": {"year": 2011, "month": 11, "day": 11},
                "to_date": {"year": 2012, "month": 12, "day": 12}
            },
            "third_key": [
                {"role": "driver", "weather": "great"},
                {"role": "father", "weather": "rainy"}
            ]
        }
    }
}

In the output I get, the only constant element is the order of the lines produced by the array logic; the order of the other lines varies. The output I would like to get is the one that preserves the order of the keys:

select
    'content_json' -> 'first_key' -> 'second_key' ->> 'user_id' as user_id,
    'content_json' -> 'first_key' -> 'second_key' ->> 'company_id' as company_id,
    'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'subject' as subject,
    'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'case_type' as case_type,
    'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'year' as year,
    'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'month' as month,
    'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'day' as day,
    'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'year' as year,
    'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'month' as month,
    'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'day' as day,
    'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'role' as role,
    'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'weather' as weather,
    'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'role' as role,
    'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'weather' as weather
from table_with_json
Answer

In your create_tree_from_JSON you never pass on the tree during recursion. Yet you try to return it.

def create_tree_from_JSON(json, parent=None):
    if not parent:
        tree = Tree()  # tree is only created for root node
        ...
    else:
        parent = parent  # tree is not created here
        ...
    return tree  # tree is always returned
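The failure can be reproduced in isolation. The following is a minimal sketch with a hypothetical function `build` (not part of your code) that binds a local name in only one branch and then returns it unconditionally, which is the same pattern as in the outline of create_tree_from_JSON:

```python
def build(depth=0):
    """Hypothetical stand-in for create_tree_from_JSON."""
    if depth == 0:
        tree = ["ROOT"]  # bound only in the top-level call
    if depth < 1:
        build(depth + 1)  # the nested call never binds `tree`...
    return tree  # ...so this line fails inside the nested call


try:
    build()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__
    print(error_name, "-", exc)
```

The exception is raised in the nested call and propagates up through the top-level call, which is why even the first recursive step aborts the whole function.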

Either pass on the tree during recursion, or separate the root step from the others:

def create_tree_from_JSON(json):  # root case
    tree = Tree()
    node_0 = TreeNode("ROOT")
    tree.root = node_0
    parent = node_0
    _walk_tree(json, parent)
    return tree  # hand the finished tree back to the caller


def _walk_tree(json, parent):  # recursive case
    for key in json:
        if isinstance(json[key], dict):
            head = TreeNode(key)
            _walk_tree(json[key], head)
            parent.add_child(head)  # attach the nested subtree
        else:
            node = TreeNode(key)
            node.add_child(TreeNode(json[key]))
            parent.add_child(node)
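The walker above only descends into dicts, so the array case from the question is not covered by it. A possible extension, sketched under assumptions: `walk_with_lists` is a hypothetical name, the `TreeNode` here is a trimmed copy of the question's class so the snippet runs on its own, and list items are numbered from 1 as the question requests.

```python
class TreeNode:
    """Trimmed copy of the question's TreeNode, for a self-contained sketch."""

    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, obj):
        self.children.append(obj)

    def __str__(self, level=0):
        ret = "\t" * level + repr(self.data) + "\n"
        for child in self.children:
            ret += child.__str__(level + 1)
        return ret


def walk_with_lists(value, parent):
    """Hypothetical variant of _walk_tree that also descends into lists."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value, start=1)  # 1-based, as in the question
    else:
        parent.add_child(TreeNode(value))  # leaf value ends the branch
        return
    for key, child in items:
        node = TreeNode(key)
        parent.add_child(node)
        walk_with_lists(child, node)


root = TreeNode("ROOT")
walk_with_lists({"name": "John", "items": [{"price": 2.65}, {"price": 1.51}]}, root)
print(root)
```

Treating dict keys and list indices uniformly keeps the recursion to a single function; the only special case left is the leaf.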

Note that what you are doing can be solved much more easily using plain dicts. Your class is effectively just wrapping a custom interface around dict to begin with.

def treeify(data) -> dict:
    if isinstance(data, dict):  # already have keys, just recurse
        return {key: treeify(children) for key, children in data.items()}
    elif isinstance(data, list):  # make keys from indices
        return {idx: treeify(children) for idx, children in enumerate(data, start=1)}
    else:  # leaf node, no recursion
        return data

You can feed any decoded JSON data to this.

>>> treeify({ "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })
{'name': 'John', 'items': {1: {'item_name': 'lettuce', 'price': 2.65, 'units': 'no'}, 2: {'item_name': 'ketchup', 'price': 1.51, 'units': 'litres'}}}
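If the input really is a JSON string, as in the question's simple case, it has to be decoded first. A minimal sketch, assuming the standard-library `json` module; `treeify` is repeated here only so the snippet runs on its own. On Python 3.7 and later, plain dicts keep insertion order, so the keys should come out in the order they appear in the document.

```python
import json


def treeify(data):
    """Same logic as the treeify shown above, repeated for self-containment."""
    if isinstance(data, dict):
        return {key: treeify(child) for key, child in data.items()}
    elif isinstance(data, list):
        return {idx: treeify(child) for idx, child in enumerate(data, start=1)}
    else:
        return data


json_file = '{"item1": "end1", "item2": {"item3": "end3"}}'
decoded = json.loads(json_file)  # str -> nested dicts/lists/scalars
tree = treeify(decoded)
print(tree)
```

For a file on disk, `json.load` on an open file object plays the same role as `json.loads` does for a string.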

To get the desired pretty-printed output, you can walk this structure with a stack of current keys. A generator is appropriate to create each query line on the fly:

def format_query(tree, stack=('content_json',)) -> str:
    if isinstance(tree, dict):  # build stack of keys
        for key, child in tree.items():
            yield from format_query(child, stack + (key,))
    else:  # print complete stack, discarding leaf data in tree
        *keys, field = stack
        path = ' -> '.join(
            str(key) if isinstance(key, int) else "'%s'" % key
            for key in keys
        )
        yield path + " ->> '%s' as %s" % (field, field)

Given your array example, this allows you to get a list of query lines:

>>> list(format_query(treeify({ "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })))
["'content_json' ->> 'name' as name",
 "'content_json' -> 'items' -> 1 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 1 ->> 'price' as price",
 "'content_json' -> 'items' -> 1 ->> 'units' as units",
 "'content_json' -> 'items' -> 2 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 2 ->> 'price' as price",
 "'content_json' -> 'items' -> 2 ->> 'units' as units"]
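The generated lines still need to be joined into one statement. A rough sketch, assuming a hypothetical helper `build_select` (the table name `table_with_json` and the `_class` filter are taken from the question). Filtering first and then using `str.join` avoids a dangling comma when the last line happens to be one of the skipped ones:

```python
def build_select(lines, table="table_with_json", skip="_class"):
    """Hypothetical helper: join query lines into one select statement."""
    kept = [line for line in lines if skip not in line]  # drop ignored fields
    return "select\n\t" + ",\n\t".join(kept) + "\nfrom " + table


lines = [
    "'content_json' ->> 'name' as name",
    "'content_json' ->> '_class' as _class",
]
query = build_select(lines)
print(query)
```

Since format_query is a generator, its result can be passed straight to such a helper without building an intermediate list by hand.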