Efficiently count all the combinations of numbers having a sum close to 0

2024/10/11 4:24:01

I have the following pandas dataframe df:

column1 column2 list_numbers          sublist_column
x        y      [10,-6,1,-4]             
a        b      [1,3,7,-2]               
p        q      [6,2,-3,-3.2]             

The "sublist_column" should contain the numbers from the column "list_numbers" that add up to 0 (with a tolerance of 0.5). I have written the following code:

def return_list(original_lst, target_sum, tolerance):
    memo = dict()
    sublist = []
    for i, x in enumerate(original_lst):
        if memo_func(original_lst, i + 1, target_sum - x, memo, tolerance) > 0:
            sublist.append(x)
            target_sum -= x
    return sublist

def memo_func(original_lst, i, target_sum, memo, tolerance):
    if i >= len(original_lst):
        if target_sum <= tolerance and target_sum >= -tolerance:
            return 1
        else:
            return 0
    if (i, target_sum) not in memo:
        c = memo_func(original_lst, i + 1, target_sum, memo, tolerance)
        c += memo_func(original_lst, i + 1, target_sum - original_lst[i], memo, tolerance)
        memo[(i, target_sum)] = c
    return memo[(i, target_sum)]

Then I am applying the "return_list" function to the "list_numbers" column to populate "sublist_column":

target_sum = 0
tolerance = 0.5
df['sublist_column'] = df['list_numbers'].apply(lambda x: return_list(x, 0, tolerance))

The following will be the resulting dataframe:

column1 column2 list_numbers          sublist_column
x        y      [10,-6,1,-4]             [10,-6,-4]
a        b      [1,3,7,-2]               []
p        q      [6,2,-3,-3.2]            [6,-3,-3.2]  #sum is -0.2(within the tolerance)

This is giving me the correct result, but it is very slow (it takes 2 hrs to run in the Spyder IDE), as my dataframe has roughly 50,000 rows and the length of some of the lists in the "list_numbers" column is more than 15. The running time is particularly affected when the number of elements in a list is greater than 15, e.g. the following list takes almost 15 minutes to process:

[-1572.35,-76.16,-261.1,-7732.0,-1634.0,-52082.42,-3974.15,
-801.65,-30192.79,-671.98,-73.06,-47.72,57.96,-511.18,-391.87,-4145.0,-1008.61,
-17.53,-17.53,-1471.08,-119.26,-2269.7,-2709,-182939.59,-19.48,-516,-6875.75,-138770.16,-71.11,-295.84,-348.09,-3460.71,-704.01,-678,-632.15,-21478.76]

How can I significantly improve my running time?

Answer

Step 1: using Numba

Based on the comments, it appears that memo_func is the main bottleneck. You can use Numba to speed up its execution. Numba compiles the Python code to native code thanks to a just-in-time (JIT) compiler. The JIT is able to perform tail-call optimizations, and native function calls are significantly faster than those of CPython. Here is an example:

import numba as nb
import numpy as np

@nb.njit('(float64[::1], int64, float64, float64)')
def memo_func(original_arr, i, target_sum, tolerance):
    if i >= len(original_arr):
        if -tolerance <= target_sum <= tolerance:
            return 1
        return 0
    c = memo_func(original_arr, i + 1, target_sum, tolerance)
    c += memo_func(original_arr, i + 1, target_sum - original_arr[i], tolerance)
    return c

@nb.njit('(float64[::1], float64, float64)')
def return_list(original_arr, target_sum, tolerance):
    sublist = []
    for i, x in enumerate(original_arr):
        if memo_func(original_arr, np.int64(i + 1), target_sum - x, tolerance) > 0:
            sublist.append(x)
            target_sum -= x
    return sublist

Using memoization does not seem to speed things up, and it is a bit cumbersome to implement in Numba. In fact, there are much better ways to improve the algorithm.

Note that you need to convert the lists to NumPy arrays before calling the functions:

lst = [-850.85,-856.05,-734.09,5549.63,77.59,-39.73,23.63,13.93,-6455.54,-417.07,176.72,-570.41,3621.89,-233.47,-471.54,-30.33,-941.49,-1014.6,1614.5]
result = return_list(np.array(lst, np.float64), 0, tolerance)
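Applied to the dataframe from the question, the call could look like this (a sketch; list(...) is only there to turn the Numba list returned by return_list back into a plain Python list):

import numpy as np

tolerance = 0.5
df['sublist_column'] = df['list_numbers'].apply(
    lambda x: list(return_list(np.array(x, np.float64), 0.0, tolerance)))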

Step 2: tail call optimization

Making many function calls to compute the right part of the input list is not efficient. The JIT is able to reduce the number of calls, but it cannot completely remove them. You can unroll the calls when the depth of the remaining tail calls is small. For example, when there are 6 items left to compute, you can use the following code:

if n-i == 6:
    c = 0
    s0 = target_sum
    v0, v1, v2, v3, v4, v5 = original_arr[i:]
    for s1 in (s0, s0 - v0):
        for s2 in (s1, s1 - v1):
            for s3 in (s2, s2 - v2):
                for s4 in (s3, s3 - v3):
                    for s5 in (s4, s4 - v4):
                        for s6 in (s5, s5 - v5):
                            c += np.int64(-tolerance <= s6 <= tolerance)
    return c

This is pretty ugly but far more efficient, since the JIT is able to unroll all the loops and produce very fast code. Still, this is not enough for large lists.
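For context, this branch is meant to replace the deepest recursive calls of the Step 1 memo_func; a sketch of the resulting function (same signature as before) could look like this:

@nb.njit('(float64[::1], int64, float64, float64)')
def memo_func(original_arr, i, target_sum, tolerance):
    n = original_arr.size
    # Unrolled tail: the last 6 items are enumerated with plain nested loops
    if n - i == 6:
        c = 0
        s0 = target_sum
        v0, v1, v2, v3, v4, v5 = original_arr[i:]
        for s1 in (s0, s0 - v0):
            for s2 in (s1, s1 - v1):
                for s3 in (s2, s2 - v2):
                    for s4 in (s3, s3 - v3):
                        for s5 in (s4, s4 - v4):
                            for s6 in (s5, s5 - v5):
                                c += np.int64(-tolerance <= s6 <= tolerance)
        return c
    # Trivial case: no item left
    if i >= n:
        return np.int64(-tolerance <= target_sum <= tolerance)
    # Recursion on the remaining items
    c = memo_func(original_arr, i + 1, target_sum, tolerance)
    c += memo_func(original_arr, i + 1, target_sum - original_arr[i], tolerance)
    return c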


Step 3: better algorithm

For large input lists, the problem is the exponential complexity of the algorithm. This problem really looks like a relaxed variant of subset-sum, which is known to be NP-complete. This class of problem is known to be very hard to solve. The best practical exact algorithms known so far for NP-complete problems are exponential. Put shortly, this means that for any sufficiently large input, there is no known algorithm capable of finding an exact solution in a reasonable time (e.g. less than the lifetime of a human).
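To give an order of magnitude (a quick back-of-the-envelope computation, not taken from the original answer), the naive recursion explores every subset of the list:

n = 36          # length of the big list from the question
print(2 ** n)   # 68719476736 subsets the naive recursion has to consider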

That being said, there are heuristics and strategies to improve the complexity of the current algorithm. One efficient approach is to use a meet-in-the-middle algorithm. Applied to your use-case, the idea is to generate a large set of partial target sums, sort them, and then use a binary search to find the number of matching values. This is possible here because -tolerance <= target_sum <= tolerance, with target_sum = partial_sum1 - partial_sum2, is equivalent to -tolerance + partial_sum2 <= partial_sum1 <= tolerance + partial_sum2.
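To make the idea concrete before the full Numba code, here is a minimal pure-Python sketch of the meet-in-the-middle principle (illustrative only; the function name is made up and this is not the implementation used below):

from bisect import bisect_left
from itertools import combinations

def any_subset_sum_within(values, target, tolerance):
    # Split the input in two halves
    half = len(values) // 2
    left, right = values[:half], values[half:]

    # Enumerate and sort the 2**len(right) subset sums of the right half
    right_sums = sorted(sum(c)
                        for r in range(len(right) + 1)
                        for c in combinations(right, r))

    # For each subset sum of the left half, binary-search the right half for a
    # complement landing inside [target - tolerance, target + tolerance]
    for r in range(len(left) + 1):
        for c in combinations(left, r):
            s = sum(c)
            lo = bisect_left(right_sums, target - tolerance - s)
            if lo < len(right_sums) and right_sums[lo] <= target + tolerance - s:
                return True
    return False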

The resulting code is unfortunately quite big and not trivial, but this is certainly the price to pay for trying to efficiently solve a complex problem like this one. Here it is:

# Generate all the target sums based on in_arr and put the result in out_sum
@nb.njit('(float64[::1], float64[::1], float64)', cache=True)
def gen_all_comb(in_arr, out_sum, target_sum):
    assert in_arr.size >= 6
    if in_arr.size == 6:
        assert out_sum.size == 64
        v0, v1, v2, v3, v4, v5 = in_arr
        s0 = target_sum
        cur = 0
        for s1 in (s0, s0 - v0):
            for s2 in (s1, s1 - v1):
                for s3 in (s2, s2 - v2):
                    for s4 in (s3, s3 - v3):
                        for s5 in (s4, s4 - v4):
                            for s6 in (s5, s5 - v5):
                                out_sum[cur] = s6
                                cur += 1
    else:
        assert out_sum.size % 2 == 0
        mid = out_sum.size // 2
        gen_all_comb(in_arr[1:], out_sum[:mid], target_sum)
        gen_all_comb(in_arr[1:], out_sum[mid:], target_sum - in_arr[0])

# Find the number of items in sorted_arr where:
# lower_bound <= item <= upper_bound
@nb.njit('(float64[::1], float64, float64)', cache=True)
def count_between(sorted_arr, lower_bound, upper_bound):
    assert lower_bound <= upper_bound
    lo_pos = np.searchsorted(sorted_arr, lower_bound, side='left')
    hi_pos = np.searchsorted(sorted_arr, upper_bound, side='right')
    return hi_pos - lo_pos

# Count all the target sums in:
# -tolerance <= all_target_sums(in_arr,sorted_target_sums)-s0 <= tolerance
@nb.njit('(float64[::1], float64[::1], float64, float64)', cache=True)
def multi_search(in_arr, sorted_target_sums, tolerance, s0):
    assert in_arr.size >= 6
    if in_arr.size == 6:
        v0, v1, v2, v3, v4, v5 = in_arr
        c = 0
        for s1 in (s0, s0 + v0):
            for s2 in (s1, s1 + v1):
                for s3 in (s2, s2 + v2):
                    for s4 in (s3, s3 + v3):
                        for s5 in (s4, s4 + v4):
                            for s6 in (s5, s5 + v5):
                                lo = -tolerance + s6
                                hi = tolerance + s6
                                c += count_between(sorted_target_sums, lo, hi)
        return c
    else:
        c = multi_search(in_arr[1:], sorted_target_sums, tolerance, s0)
        c += multi_search(in_arr[1:], sorted_target_sums, tolerance, s0 + in_arr[0])
        return c

@nb.njit('(float64[::1], int64, float64, float64)', cache=True)
def memo_func(original_arr, i, target_sum, tolerance):
    n = original_arr.size
    remaining = n - i
    tail_size = min(max(remaining//2, 7), 16)
    # Tail call: for very small list (trivial case)
    if remaining <= 0:
        return np.int64(-tolerance <= target_sum <= tolerance)
    # Tail call: for big lists (better algorithm)
    elif remaining >= tail_size*2:
        partial_sums = np.empty(2**tail_size, dtype=np.float64)
        gen_all_comb(original_arr[-tail_size:], partial_sums, target_sum)
        partial_sums.sort()
        return multi_search(original_arr[-remaining:-tail_size], partial_sums, tolerance, 0.0)
    # Tail call: for medium-sized list (unrolling)
    elif remaining == 6:
        c = 0
        s0 = target_sum
        v0, v1, v2, v3, v4, v5 = original_arr[i:]
        for s1 in (s0, s0 - v0):
            for s2 in (s1, s1 - v1):
                for s3 in (s2, s2 - v2):
                    for s4 in (s3, s3 - v3):
                        for s5 in (s4, s4 - v4):
                            for s6 in (s5, s5 - v5):
                                c += np.int64(-tolerance <= s6 <= tolerance)
        return c
    # Recursion
    c = memo_func(original_arr, i + 1, target_sum, tolerance)
    c += memo_func(original_arr, i + 1, target_sum - original_arr[i], tolerance)
    return c

@nb.njit('(float64[::1], float64, float64)', cache=True)
def return_list(original_arr, target_sum, tolerance):
    sublist = []
    for i, x in enumerate(original_arr):
        if memo_func(original_arr, np.int64(i + 1), target_sum - x, tolerance) > 0:
            sublist.append(x)
            target_sum -= x
    return sublist

Note that the code takes a few seconds to compile since it is quite big. The cache should help avoid recompiling it every time.
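A usage sketch, reusing the small list shown in Step 1 (compilation happens when the decorated functions are defined since the signatures are explicit; cache=True stores the compiled code on disk for later runs):

import numpy as np

small_lst = [-850.85, -856.05, -734.09, 5549.63, 77.59, -39.73, 23.63, 13.93,
             -6455.54, -417.07, 176.72, -570.41, 3621.89, -233.47, -471.54,
             -30.33, -941.49, -1014.6, 1614.5]

# Subsequent calls run at native speed (no recompilation needed)
result = return_list(np.array(small_lst, np.float64), 0.0, 0.5)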


Step 4: even better algorithm

The previous code counts the number of matching values (the value stored in c). This is not needed since we just want to know whether at least one value exists (i.e. memo_func(...) > 0). As a result, we can return a boolean indicating whether a value has been found and optimize the algorithm to return True as soon as a solution is found. Big parts of the exploration tree can be skipped with this method (which is particularly efficient when there are many possible solutions, as on random arrays).

Another optimization is to perform only one binary search (instead of two) and to check beforehand whether the searched values can even fall within the min-max range of the sorted array (so as to skip this trivial case before applying the expensive binary search). This is possible because of the previous optimization.

A final optimization is to discard parts of the exploration tree early when the values generated by multi_search are so small/big that we can be sure there is no need to perform a binary search. This can be done by computing a pessimistic over-approximation of the searched values. This is especially useful in pathological cases that have almost no solutions.

Here is the final implementation:

# Generate all the target sums based on in_arr and put the result in out_sum
@nb.njit('(float64[::1], float64[::1], float64)', cache=True)
def gen_all_comb(in_arr, out_sum, target_sum):
    assert in_arr.size >= 6
    if in_arr.size == 6:
        assert out_sum.size == 64
        v0, v1, v2, v3, v4, v5 = in_arr
        s0 = target_sum
        cur = 0
        for s1 in (s0, s0 - v0):
            for s2 in (s1, s1 - v1):
                for s3 in (s2, s2 - v2):
                    for s4 in (s3, s3 - v3):
                        for s5 in (s4, s4 - v4):
                            for s6 in (s5, s5 - v5):
                                out_sum[cur] = s6
                                cur += 1
    else:
        assert out_sum.size % 2 == 0
        mid = out_sum.size // 2
        gen_all_comb(in_arr[1:], out_sum[:mid], target_sum)
        gen_all_comb(in_arr[1:], out_sum[mid:], target_sum - in_arr[0])

# Check whether there is an item in sorted_arr where:
# lower_bound <= item <= upper_bound
@nb.njit('(float64[::1], float64, float64)', cache=True)
def has_items_between(sorted_arr, lower_bound, upper_bound):
    if upper_bound < sorted_arr[0] or sorted_arr[sorted_arr.size-1] < lower_bound:
        return False
    lo_pos = np.searchsorted(sorted_arr, lower_bound, side='left')
    return lo_pos < sorted_arr.size and sorted_arr[lo_pos] <= upper_bound

# Check whether any target sum is in:
# -tolerance <= all_target_sums(in_arr,sorted_target_sums)-s0 <= tolerance
@nb.njit('(float64[::1], float64[::1], float64, float64)', cache=True)
def multi_search(in_arr, sorted_target_sums, tolerance, s0):
    assert in_arr.size >= 6
    if in_arr.size == 6:
        v0, v1, v2, v3, v4, v5 = in_arr
        x3, x4, x5 = min(v3, 0), min(v4, 0), min(v5, 0)
        y3, y4, y5 = max(v3, 0), max(v4, 0), max(v5, 0)
        mini = sorted_target_sums[0]
        maxi = sorted_target_sums[sorted_target_sums.size-1]
        for s1 in (s0, s0 + v0):
            for s2 in (s1, s1 + v1):
                for s3 in (s2, s2 + v2):
                    # Prune the exploration tree early if a
                    # matching range cannot be found.
                    lo = s3 + (x3 + x4 + x5 - tolerance)
                    hi = s3 + (y3 + y4 + y5 + tolerance)
                    if hi < mini or maxi < lo:
                        continue
                    for s4 in (s3, s3 + v3):
                        for s5 in (s4, s4 + v4):
                            for s6 in (s5, s5 + v5):
                                lo = -tolerance + s6
                                hi = tolerance + s6
                                if has_items_between(sorted_target_sums, lo, hi):
                                    return True
        return False
    return (multi_search(in_arr[1:], sorted_target_sums, tolerance, s0)
            or multi_search(in_arr[1:], sorted_target_sums, tolerance, s0 + in_arr[0]))

@nb.njit('(float64[::1], int64, float64, float64)', cache=True)
def memo_func(original_arr, i, target_sum, tolerance):
    n = original_arr.size
    remaining = n - i
    tail_size = min(max(remaining//2, 7), 13)
    # Tail call: for very small list (trivial case)
    if remaining <= 0:
        return -tolerance <= target_sum <= tolerance
    # Tail call: for big lists (better algorithm)
    elif remaining >= tail_size*2:
        partial_sums = np.empty(2**tail_size, dtype=np.float64)
        gen_all_comb(original_arr[-tail_size:], partial_sums, target_sum)
        partial_sums.sort()
        return multi_search(original_arr[-remaining:-tail_size], partial_sums, tolerance, 0.0)
    # Tail call: for medium-sized list (unrolling)
    elif remaining == 6:
        s0 = target_sum
        v0, v1, v2, v3, v4, v5 = original_arr[i:]
        for s1 in (s0, s0 - v0):
            for s2 in (s1, s1 - v1):
                for s3 in (s2, s2 - v2):
                    for s4 in (s3, s3 - v3):
                        for s5 in (s4, s4 - v4):
                            for s6 in (s5, s5 - v5):
                                if -tolerance <= s6 <= tolerance:
                                    return True
        return False
    # Recursion
    return (memo_func(original_arr, i + 1, target_sum, tolerance)
            or memo_func(original_arr, i + 1, target_sum - original_arr[i], tolerance))

@nb.njit('(float64[::1], float64, float64)', cache=True)
def return_list(original_arr, target_sum, tolerance):
    sublist = []
    for i, x in enumerate(original_arr):
        if memo_func(original_arr, np.int64(i + 1), target_sum - x, tolerance):
            sublist.append(x)
            target_sum -= x
    return sublist

This final implementation is meant to efficiently handle pathological cases (the ones where there are only a few non-trivial solutions, or even none, like the big provided input list). However, it can be tuned to compute the cases where there are many solutions (like large, uniformly-distributed random arrays) faster, at the expense of significantly slower execution on the pathological cases. This trade-off can be set by changing the variable tail_size (smaller values are better for cases with more solutions).


Benchmark

Here are the tested inputs:

target_sum = 0
tolerance = 0.5
small_lst = [-850.85,-856.05,-734.09,5549.63,77.59,-39.73,23.63,13.93,-6455.54,-417.07,176.72,-570.41,3621.89,-233.47,-471.54,-30.33,-941.49,-1014.6,1614.5]
big_lst = [-1572.35,-76.16,-261.1,-7732.0,-1634.0,-52082.42,-3974.15,-801.65,-30192.79,-671.98,-73.06,-47.72,57.96,-511.18,-391.87,-4145.0,-1008.61,-17.53,-17.53,-1471.08,-119.26,-2269.7,-2709,-182939.59,-19.48,-516,-6875.75,-138770.16,-71.11,-295.84,-348.09,-3460.71,-704.01,-678,-632.15,-21478.76]
random_lst = [-86145.13, -34783.33, 50912.99, -87823.73, 37537.52, -22796.4, 53530.74, 65477.91, -50725.36, -52609.35, 92769.95, 83630.42, 30436.95, -24347.08, -58197.95, 77504.44, 83958.08, -85095.73, -61347.26, -14250.65, 2012.91, 83969.32, -69356.41, 29659.23, 94736.29, 2237.82, -17784.34, 23079.36, 8059.84, 26751.26, 98427.46, -88735.07, -28936.62, 21868.77, 5713.05, -74346.18]

The uniformly-distributed random list has a very large number of solutions while the provided big list has none. The tuned final implementation sets tail_size to min(max(remaining//2, 7), 13) so as to compute the random list much faster at the expense of a significantly slower execution on the big list.
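For reference, timings of this kind can be reproduced with a small harness like the one below (a sketch; the exact measurement setup used for the numbers that follow is not specified here):

import timeit
import numpy as np

arr = np.array(big_lst, np.float64)
return_list(arr, 0.0, tolerance)   # make sure the compiled code is loaded

runs = 10
elapsed = timeit.timeit(lambda: return_list(arr, 0.0, tolerance), number=runs)
print(f"{elapsed / runs * 1000:.3f} ms per call")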

Here is the timing with the small list on my machine:

Naive python algorithm:               173.45 ms
Naive algorithm using Numba:            7.21 ms
Tail call optimization + Numba:         0.33 ms
KellyBundy's implementation:            0.19 ms
Efficient algorithm + optim + Numba:    0.10 ms
Final implementation (tuned):           0.05 ms
Final implementation (default):         0.05 ms

Here is the timing with the large random list on my machine (easy case):

Efficient algorithm + optim + Numba:    209.61 ms
Final implementation (default):           4.11 ms
KellyBundy's implementation:              1.15 ms
Final implementation (tuned):             0.85 ms

Other algorithms are not shown here because they are too slow (see below).

Here is the timing with the big list on my machine (challenging case):

Naive python algorithm:               >20000 s    [estimation & out of memory]
Naive algorithm using Numba:            ~900 s    [estimation]
Tail call optimization + Numba:           42.61 s
KellyBundy's implementation:               0.671 s
Final implementation (tuned):              0.078 s
Efficient algorithm + optim + Numba:       0.051 s
Final implementation (default):            0.013 s

Thus, the final implementation is up to ~3500 times faster on the small input and more than 1_500_000 times faster on the large input! It also uses far less RAM, so it can actually be executed on a cheap PC.

It is worth noting that the execution time can be reduced even further by using multiple threads so as to reach a speed-up of >5_000_000, though it may be slower on small inputs and it will make the code a bit more complex.
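For instance, one possible way to do that (a sketch, not the code benchmarked above; it assumes the @nb.njit decorators are additionally given nogil=True so the compiled code releases the GIL) is to spread the rows of the dataframe over a thread pool:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def process_row(lst, tolerance=0.5):
    # Runs the compiled return_list on one row; with nogil=True the
    # threads can execute the native code in parallel.
    return list(return_list(np.array(lst, np.float64), 0.0, tolerance))

with ThreadPoolExecutor() as pool:
    df['sublist_column'] = list(pool.map(process_row, df['list_numbers']))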

