I need to merge the list and have a function that can be implemented, but when the number of merges is very slow and unbearable, I wonder if there is a more efficient way
Consolidation conditions:Sub-lists contain identical numbers to each other Thank you
Simple Association:
[7,8,9] = [7,8]+[8,9] #The same number 8
Cascade contains:
[1,2,3] = [1,2,3]+[3,4] #The same number 3
[3,4,5,6] = [3,4],[4,5,6] #The same number 4
[1,2,3,4,5,6] = [1,2,3]+[3,4,5,6] #The same number 3
Function:
a = [ [1,2,3],[4,5,6],[3,4],[7,8],[8,9],[6,12,13] ]
b = len(a)
for i in range(b):for j in range(b):x = list(set(a[i]+a[j]))y = len(a[j])+len(a[i])if i == j or a[i] == 0 or a[j] == 0:breakelif len(x) < y:a[i] = xa[j] = [0]print a
print [i for i in a if i!= [0]]
result:
[[8, 9, 7], [1, 2, 3, 4, 5, 6, 10, 11]]
Above is an example where each sub-list in the actual calculation has a length of only 2,
a = [[1,3],[5,6],[3,4],[7,8],[8,9],[12,13]]
I want to miss out more data, here is a simulation data.
a = np.random.rand(150,150)>0.99
a[np.tril_indices(a.shape[1], -1)] = 0
a[np.diag_indices(a.shape[1])] = 0
a = [list(x) for x in np.c_[np.where(a)]]consolidate(a)
I think your algorithm is close to optimal, except that the inner loop can be shortened because the intersection operation is symmetric, i.e. if you check that (A, B)
intersect, there is no need to check for (B, A)
.
This way you would go from O(n²)
to O(n * (n / 2))
.
However, I would rewrite the piece of code more cleanly and I would also avoid modifying the input.
Note also, that since set
s do not guarantee ordering, it is a good idea to do some sorting before getting to list.
Here is my proposed code (EDITED to reduce the number of castings and sortings):
def consolidate(items):items = [set(item.copy()) for item in items]for i, x in enumerate(items):for j, y in enumerate(items[i + 1:]):if x & y:items[i + j + 1] = x | yitems[i] = Nonereturn [sorted(x) for x in items if x]
Encapsulating your code in a function, I would get:
def consolidate_orig(a):a = [x.copy() for x in a]b = len(a)for i in range(b):for j in range(b):x = list(set(a[i]+a[j]))y = len(a[j])+len(a[i])if i == j or a[i] == 0 or a[j] == 0:breakelif len(x) < y:a[i] = xa[j] = [0]return [i for i in a if i!= [0]]
This would allow us to do some clean micro-benchmarking (for completeness I have included also @zipa's merge()
):
EDIT:
@zipa's code is not properly encapsulated, here is an equivalent version with proper encapsulation:
def merge(iterable, base=None):if base is None:base = iterablemerged = set([tuple(set(i).union(*[j for j in base if set(i).intersection(j)])) for i in iterable])if merged == iterable:return mergedelse:return merge(merged, base)
and updated timings:
in_list = [[1,2,3], [4,5,6], [3,4], [7,8], [8,9], [6,12,13]]
%timeit consolidate_orig(in_list)
# 17.9 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit consolidate(in_list)
# 6.15 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit merge(in_list)
# 53.6 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)in_list = [[1, 3], [5, 6], [3, 4], [7, 8], [8, 9], [12, 13]]
%timeit consolidate_orig(in_list)
# 16.1 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit consolidate(in_list)
# 5.87 µs ± 71.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit merge(in_list)
# 27 µs ± 701 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Showing that, at least for this input, the proposed solution is consistently faster.
Since it is not too straightforward to generate large meaningful inputs, I'll leave to you to check that this is more efficient then your approach for the larger inputs you have in mind.
EDIT
With larger, but probably meaningless inputs, the timings are still favorable for the proposed version:
in_list = [[1,2,3], [4,5,6], [3,4], [7,8], [8,9], [6,12,13]] * 300
%timeit consolidate_orig(in_list)
# 1.04 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit consolidate(in_list)
# 724 ms ± 7.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit merge(in_list)
# 1.04 s ± 7.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)in_list = [[1, 3], [5, 6], [3, 4], [7, 8], [8, 9], [12, 13]] * 300
%timeit consolidate_orig(in_list)
# 1.03 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit consolidate(in_list)
# 354 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit merge(in_list)
# 967 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)