Suppose you have a large range of consecutive integers in memory, each of which belongs to exactly one category. Two operations must be O(log n): moving a range from one category to another, and finding the category counts for a given range.
I'm pretty sure the second operation is solved trivially given a correct implementation for the first operation.
Each integer begins in a category, so I started with a set of balanced BSTs. Moving a subtree from one BST to another (eg, moving a range to a different category) has a runtime equivalent to merging two BSTs, which is O(n1 * n2)[1].
This is too slow (in python, and C is not an option), and I couldn't figure out a way to take advantage of my data's inherent structure to create an efficient BST merge operation.
I'm now looking at AVL, red-black, and interval trees, binary heaps, and treaps. Comparing their properties is overwhelming. Which structure should I use?
Edit (problem clarification): I am flexible on how I store these values and create my data structures. I am inflexible about how I receive my input, which comes from another application, and looks like the following: CATEGORY(cat3, I, J)
. My current solution creates a tree with a node for each integer in the range. This is too slow for the size of my dataset, so I'm happy to re-architect if given a better method.
Any given request can move any possible range of integers into any category. In other words, ranges are overlapping in the sense of CATEGORY(cat1, 1, 10)
followed by CATEGORY(cat3, 5, 15)
, but non-overlapping in the sense that every integer will be in exactly one category at any given time.
As far as I understood the question you have a range [A, B] and queries of the form -
- Assign a particular range to a category
E.g.
R1 R2 C1
R3 R4 C2
- Query a range for the total number of items in particular categories.
E.g. find count of categories in R1 R4
A simple implementation using dictionaries as given above would not work as I describe by this example -
suppose we have a range [1000, 5000]
and we make assignment as follows -
1 2 C1
2 3 C2
3 4 C3
......
4999 5000 C4999
Now we make the following assignment
1 5000 C5555
This will make updation/changes/deletion to previously assigned ranges of the ranges O(N) where N is maximum size of range (B - A)
D['category'] = set(of_all_the_ranges_you_have_in_category)
In this case deletion of previous ranges from corresponding sets in categories C1,C2...C4999 will be needed for last assignment ( 1 5000 C5555 )
OR
1 : { "stop" : 5, "category" : "C1"},6 : { "stop" : 19, "category" : "C23"},
Here updation of category for each starting value (1,2,3,4...,4999) will be required for last assignment ( 1 5000 C5555 )
A better option that will make updation of ranges in O(lg n) would be segment trees
(http://en.wikipedia.org/wiki/Segment_tree )
For the above example the segment tree will look something like this
1000:5000|---------------------| |1000:3000 3001:5000| |---------------- --------------------| | | |1000:2000 2001:3000 3001:4000 4001:5000
.................................................................
............................................................... and so on
The leaf nodes will have ranges (1:2, 2:3,...)
You can assign a value "category" to each node and given an interval traverse the tree dividing the interval appropriately ( e.g for 2500 to 4500 divide in 2500:3000 and 3001:4500 and proceed in both directions until nodes with matching ranges are found).
Now an interesting thing is update the children of the nodes when you need them. For example do not proceed updating the children immediately when doing assignments like 1 5000 C5555. This thing is called lazy propagation and you can learn more about it here (http://www.spoj.pl/forum/viewtopic.php?f=27&t=8296 ).
Now for query part. If the number of categories are very small, you can maintain a frequency table at each node and update the ranges when required and lazy propagate when needed otherwise, you will have to traverse the whole tree from leaf to node, counting and complexity will become O(n).
I think a better solution to querying may exist. It's not coming to my mind.
UPDATE
Lets take a small example.
Range [1,8]
Categories allowed {C1, C2}
1:81:4 5:81:2 3:4 5:6 7:81:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8
Each node will have 3 fields [category, category_counts[], children_update_required = false]
1 5 C1
Query will get divided and nodes 1:4's category will be set to C1 and children_update_required will be set to true, its children will not be updated now (remember update only when required or lazy propagation). Also node 5:5's category will be set to C1
3 4 C2
Query will propagate along the tree towards 3:4 (and in the process to reach 3:4, 1:2 and 3:4's categories will be updated to C1, 1:4's children_update_required will be set to false, 1:2's and 3:4's children_update_required will be set to true) and now will update the category of 3:4 to C2 according to current requirement. Next it will set children_update required of 3:4 to be true for future updation of its children (which is already set in this case..no harm done).