Question 1

I'm trying to efficiently compute a running sum, with exponential decay, of each column of a Pandas DataFrame. The DataFrame contains a daily score for each country in the world. The DataFrame looks like this:

                AF        UK        US
2014-07-01  0.998042  0.595720  0.524698
2014-07-02  0.380649  0.838436  0.355149
2014-07-03  0.306240  0.274755  0.964524
2014-07-04  0.396721  0.836027  0.225848
2014-07-05  0.151291  0.677794  0.603548
2014-07-06  0.558846  0.050535  0.551785
2014-07-07  0.463514  0.552748  0.265537
2014-07-08  0.240282  0.278825  0.116432
2014-07-09  0.309446  0.096573  0.246021
2014-07-10  0.800977  0.583496  0.713893

I'm not sure how to calculate the rolling sum (with decay) without iterating through the dataframe, since I need to know yesterday's score to calculate today's score. But to calculate yesterday's score, I need to know the day before yesterday's score, etc. This is the code that I've been using, but I'd like a more efficient way to go about it.

for j, val in df.iteritems():for i, row in enumerate(val):df[j].iloc[i] = row + val[i-1]*np.exp(-0.05)

Question 2

You can use the fact that when exponentials multiply their exponents add:

eg:

N(2) = N(2) + N(1) * exp(-0.05)
N(3) = N(3) + (N(2) + N(1) * exp(-0.05))*exp(-0.05)
N(3) = N(3) + N(2)*exp(-0.05) + N(1)*exp(-0.1)
N(4) = ...and so on

This can then be vectorized using numpy:

dataset = pd.DataFrame(np.random.rand(1000,3), columns=["A", "B","C"])weightspace = np.exp(np.linspace(len(dataset), 0, num=len(dataset))*-0.05)
def rollingsum(array):weights = weightspace[0-len(array):]# Convolve the array and the weights to obtain the resulta = np.dot(array, weights).sum()return aa = pd.expanding_apply(dataset, rollingsum)

pd.expanding_apply applies the rollingsum function backwards to each row, calling it len(dataset) times. np.linspace generates a dataset of size len(dataset) and calculates how many times each row is multiplied by exp(-0.05) for the current row.

Because it is vectorized, it should be fast:

%timeit a = pd.expanding_apply(dataset, rollingsum)
10 loops, best of 3: 25.5 ms per loop

This compares with (note I'm using python 3 and had to make a change to the behaviour on the first row...):

def multipleApply(df):for j, val in df.iteritems():for i, row in enumerate(val):if i == 0:continuedf[j].iloc[i] = row + val[i-1]*np.exp(-0.05)

This comes out as:

In[68]: %timeit multipleApply(dataset)
1 loops, best of 3: 414 ms per loop

Exponential Decay on Python Pandas DataFrame

Related Q&A

TensorFlow - why doesnt this sofmax regression learn anything?

Extended example to understand CUDA, Numba, Cupy, etc

Python 2 newline tokens in tokenize module

Prevent encoding errors in Python

How do I get the operating system name in a friendly manner using Python 2.5?

Extend dataclass repr programmatically

find least common denominator for list of fractions in python

How to configure uwsgi to encode logging as json except app output

Testing aiohttp client with unittest.mock.patch

GridsearchCV: cant pickle function error when trying to pass lambda in parameter