Question 1

I want to calculate the Spearman and/or Pearson Correlation between two columns of a DataFrame, using a rolling window.

I have tried df['corr'] = df['col1'].rolling(P).corr(df['col2'])
(P is the window size)

but i don't seem to be able to define the method. (Adding method='spearman' as argument produces error:

File "main.py", line 29, in __init__
_df['corr'] = g['col1'].rolling(P).corr(g['col2'], method = corr_function)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1287, in corr
**kwargs)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1054, in corr
_get_corr, pairwise=bool(pairwise))
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1866, in _flex_binary_moment
return f(X, Y)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1051, in _get_corr
return a.cov(b, **kwargs) / (a.std(**kwargs) * b.std(**kwargs))
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1280, in cov
ddof=ddof, **kwargs)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1020, in cov
_get_cov, pairwise=bool(pairwise))
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1866, in _flex_binary_moment
return f(X, Y)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1015, in _get_cov
center=self.center).count(**kwargs)
TypeError: count() got an unexpected keyword argument 'method'

To be fair, i wasn't expecting this to work, since reading the documentation, there is no mention that rolling.corr supports methods...

Any suggestions on how to do this, taking into account that the dataframe is quite big (>10M rows)?

Question 2

rolling.corr does Pearson, so you can use it for that. For Spearman, use something like this:

import pandas as pd
from numpy.lib.stride_tricks import as_strided
from numpy.lib import pad
import numpy as np
def rolling_spearman(seqa, seqb, window):stridea = seqa.strides[0]ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])strideb = seqa.strides[0]ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides =[strideb, strideb])ar = pd.DataFrame(ssa)br = pd.DataFrame(ssb)ar = ar.rank(1)br = br.rank(1)corrs = ar.corrwith(br, 1)return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)

E.g.:

In [144]: df = pd.DataFrame(np.random.randint(0,1000,size=(10,2)), columns = list('ab'))
In [145]: df['corr'] = rolling_spearman(df.a, df.b, 4)
In [146]: df
Out[146]: a    b  corr
0  429  922   NaN
1  618  382   NaN
2  476  517   NaN
3  267  536  -0.8
4  582  844  -0.4
5  254  895  -0.4
6  583  974   0.4
7  687  298  -0.4
8  697  447  -0.6
9  383   35   0.4

Explanation: numpy.lib.stride_tricks.as_strided is a hacky method that in this case gives us a view of the sequences that looks like a 2d array with the rolling window sections of the sequence we're looking at.

From then on, it's simple. Spearman correlation is equivalent to transforming the sequences to ranks, and taking the Pearson correlation coefficient. Pandas has, helpfully, got fast implementations to do this row-wise on DataFrames. Then at the end we pad the start of the resulting Series with NaN values (so you can add it as a column to your dataframe or whatever).

(Personal note: I spent so long trying to figure out how to do this efficiently with numpy and scipy before I realised everything you need is in pandas already...!).

To show the speed advantage of this method over just looping over the sliding windows, I made a little file called srsmall.py containing:

import pandas as pd
from numpy.lib.stride_tricks import as_strided
import scipy.stats
from numpy.lib import pad
import numpy as npdef rolling_spearman_slow(seqa, seqb, window):stridea = seqa.strides[0]ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])strideb = seqa.strides[0]ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides =[strideb, strideb])corrs = [scipy.stats.spearmanr(a, b)[0] for (a,b) in zip(ssa, ssb)]return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)def rolling_spearman_quick(seqa, seqb, window):stridea = seqa.strides[0]ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])strideb = seqa.strides[0]ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides =[strideb, strideb])ar = pd.DataFrame(ssa)br = pd.DataFrame(ssb)ar = ar.rank(1)br = br.rank(1)corrs = ar.corrwith(br, 1)return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)

And compare the performance:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: from srsmall import rolling_spearman_slow as slow
In [4]: from srsmall import rolling_spearman_quick as quick
In [5]: for i in range(6):...:     df = pd.DataFrame(np.random.randint(0,1000,size=(10*(10**i),2)), columns=list('ab'))...:     print len(df), " rows"...:     print "quick: ",...:     %timeit quick(df.a, df.b, 4)...:     print "slow: ",...:     %timeit slow(df.a, df.b, 4)...:     
10  rows
quick: 100 loops, best of 3: 3.52 ms per loop
slow: 100 loops, best of 3: 3.2 ms per loop
100  rows
quick: 100 loops, best of 3: 3.53 ms per loop
slow: 10 loops, best of 3: 42 ms per loop
1000  rows
quick: 100 loops, best of 3: 3.82 ms per loop
slow: 1 loop, best of 3: 430 ms per loop
10000  rows
quick: 100 loops, best of 3: 7.47 ms per loop
slow: 1 loop, best of 3: 4.33 s per loop
100000  rows
quick: 10 loops, best of 3: 50.2 ms per loop
slow: 1 loop, best of 3: 43.4 s per loop
1000000  rows
quick: 1 loop, best of 3: 404 ms per loop
slow:

On a million rows (on my machine), the quick (pandas) version runs in less than half a second. Not shown above but 10 million took 8.43 seconds. The slow one is still running, but assuming the linear growth continues it should take around 7 minutes for 1M and over an hour for 10M.

Pandas Rolling window Spearman correlation

Related Q&A

Python string splitlines() removes certain Unicode control characters

Get only HTML head Element with a Script or Tool

Is it possible to restore corrupted “interned” bytes-objects

Wildcard namespaces in lxml

WordNet - What does n and the number represent?

How to change the values of a column based on two conditions in Python

logging module for python reports incorrect timezone under cygwin

Set ordering of Apps and models in Django admin dashboard

python database / sql programming - where to start

How to install Python 3.5 on Raspbian Jessie