Pandas Rolling window Spearman correlation

2024/10/13 19:21:15

I want to calculate the Spearman and/or Pearson Correlation between two columns of a DataFrame, using a rolling window.

I have tried df['corr'] = df['col1'].rolling(P).corr(df['col2'])
(P is the window size)

but I don't seem to be able to choose the method. Adding method='spearman' as an argument produces this error:

File "main.py", line 29, in __init__
_df['corr'] = g['col1'].rolling(P).corr(g['col2'], method = corr_function)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1287, in corr
**kwargs)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1054, in corr
_get_corr, pairwise=bool(pairwise))
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1866, in _flex_binary_moment
return f(X, Y)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1051, in _get_corr
return a.cov(b, **kwargs) / (a.std(**kwargs) * b.std(**kwargs))
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1280, in cov
ddof=ddof, **kwargs)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1020, in cov
_get_cov, pairwise=bool(pairwise))
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1866, in _flex_binary_moment
return f(X, Y)
File "~\Python36\lib\site-packages\pandas\core\window.py", line 1015, in _get_cov
center=self.center).count(**kwargs)
TypeError: count() got an unexpected keyword argument 'method'

To be fair, I wasn't expecting this to work, since the documentation makes no mention of rolling.corr supporting a method argument...

Any suggestions on how to do this, taking into account that the dataframe is quite big (>10M rows)?

Answer

rolling.corr does Pearson, so you can use it for that. For Spearman, use something like this:

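import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided

def rolling_spearman(seqa, seqb, window):
    # Convert to NumPy arrays so that .strides is available
    seqa = np.asarray(seqa)
    seqb = np.asarray(seqb)
    # Build a 2D view in which each row is one rolling window
    stridea = seqa.strides[0]
    ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])
    strideb = seqb.strides[0]
    ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides=[strideb, strideb])
    ar = pd.DataFrame(ssa)
    br = pd.DataFrame(ssb)
    # Rank within each window (row), then take the row-wise Pearson correlation
    ar = ar.rank(axis=1)
    br = br.rank(axis=1)
    corrs = ar.corrwith(br, axis=1)
    # Pad the first window - 1 positions with NaN so the result lines up with the input
    return np.pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)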

E.g.:

In [144]: df = pd.DataFrame(np.random.randint(0,1000,size=(10,2)), columns = list('ab'))
In [145]: df['corr'] = rolling_spearman(df.a, df.b, 4)
In [146]: df
Out[146]:
     a    b  corr
0  429  922   NaN
1  618  382   NaN
2  476  517   NaN
3  267  536  -0.8
4  582  844  -0.4
5  254  895  -0.4
6  583  974   0.4
7  687  298  -0.4
8  697  447  -0.6
9  383   35   0.4
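
For the Pearson case, the built-in rolling.corr already does the job. Here is a minimal sketch that reuses the a and b values from the example above; the column name pearson and the np.corrcoef cross-check are illustrative additions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same a/b values as in the example output above
df = pd.DataFrame({
    "a": [429, 618, 476, 267, 582, 254, 583, 687, 697, 383],
    "b": [922, 382, 517, 536, 844, 895, 974, 298, 447, 35],
})

window = 4
# Rolling Pearson correlation; the first window - 1 rows are NaN
df["pearson"] = df["a"].rolling(window).corr(df["b"])

# Cross-check the last window against a direct NumPy computation
expected_last = np.corrcoef(df["a"].iloc[-window:], df["b"].iloc[-window:])[0, 1]
print(df)
print(expected_last)
```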

Explanation: numpy.lib.stride_tricks.as_strided is a hacky method that in this case gives us a view of each sequence that looks like a 2D array whose rows are the successive rolling windows, without copying the data.
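
To make the windowed view concrete, here is a small sketch on the first six a values from the example. The comparison with sliding_window_view (available in NumPy 1.20 and later) is an addition of mine, not something the original answer uses; it builds the same view with bounds checking.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

seq = np.array([429, 618, 476, 267, 582, 254])
window = 4

# Moving one row down and moving one column right both advance by one element,
# so both strides equal the element stride of the 1D input.
stride = seq.strides[0]
windows = as_strided(seq, shape=(len(seq) - window + 1, window), strides=(stride, stride))
print(windows)
# [[429 618 476 267]
#  [618 476 267 582]
#  [476 267 582 254]]

# A bounds-checked way to get the same view (NumPy >= 1.20)
safe = sliding_window_view(seq, window)
```

Because as_strided performs no bounds checking, a wrong shape or stride reads arbitrary memory, which is why it is usually described as hacky.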

From then on, it's simple. Spearman correlation is equivalent to transforming the sequences to ranks, and taking the Pearson correlation coefficient. Pandas has, helpfully, got fast implementations to do this row-wise on DataFrames. Then at the end we pad the start of the resulting Series with NaN values (so you can add it as a column to your dataframe or whatever).
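
The rank-then-Pearson equivalence can be checked on the first window of the example (rows 0 to 3), where the output above reports -0.8. This is only a sanity check under the assumption that scipy is installed; the variable names are illustrative.

```python
import pandas as pd
from scipy.stats import spearmanr

# First rolling window from the example (rows 0..3)
a = pd.Series([429, 618, 476, 267])
b = pd.Series([922, 382, 517, 536])

# Pearson correlation of the ranks
rank_then_pearson = a.rank().corr(b.rank())

# Direct Spearman correlation, two ways
direct = spearmanr(a, b)[0]
builtin = a.corr(b, method="spearman")

print(rank_then_pearson, direct, builtin)  # each is -0.8 up to floating-point rounding
```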

(Personal note: I spent so long trying to figure out how to do this efficiently with numpy and scipy before I realised everything you need is in pandas already...!).

To show the speed advantage of this method over just looping over the sliding windows, I made a little file called srsmall.py containing:

import numpy as np
import pandas as pd
import scipy.stats
from numpy.lib.stride_tricks import as_strided

def rolling_spearman_slow(seqa, seqb, window):
    # Convert to NumPy arrays so that .strides is available
    seqa = np.asarray(seqa)
    seqb = np.asarray(seqb)
    stridea = seqa.strides[0]
    ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])
    strideb = seqb.strides[0]
    ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides=[strideb, strideb])
    # One scipy call per window: this Python-level loop is the bottleneck
    corrs = [scipy.stats.spearmanr(a, b)[0] for (a, b) in zip(ssa, ssb)]
    return np.pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)

def rolling_spearman_quick(seqa, seqb, window):
    seqa = np.asarray(seqa)
    seqb = np.asarray(seqb)
    stridea = seqa.strides[0]
    ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])
    strideb = seqb.strides[0]
    ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides=[strideb, strideb])
    ar = pd.DataFrame(ssa)
    br = pd.DataFrame(ssb)
    # Rank within each window, then correlate row-wise in pandas
    ar = ar.rank(axis=1)
    br = br.rank(axis=1)
    corrs = ar.corrwith(br, axis=1)
    return np.pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)

And compare the performance:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: from srsmall import rolling_spearman_slow as slow
In [4]: from srsmall import rolling_spearman_quick as quick
In [5]: for i in range(6):
   ...:     df = pd.DataFrame(np.random.randint(0,1000,size=(10*(10**i),2)), columns=list('ab'))
   ...:     print len(df), " rows"
   ...:     print "quick: ",
   ...:     %timeit quick(df.a, df.b, 4)
   ...:     print "slow: ",
   ...:     %timeit slow(df.a, df.b, 4)
   ...:
10  rows
quick: 100 loops, best of 3: 3.52 ms per loop
slow: 100 loops, best of 3: 3.2 ms per loop
100  rows
quick: 100 loops, best of 3: 3.53 ms per loop
slow: 10 loops, best of 3: 42 ms per loop
1000  rows
quick: 100 loops, best of 3: 3.82 ms per loop
slow: 1 loop, best of 3: 430 ms per loop
10000  rows
quick: 100 loops, best of 3: 7.47 ms per loop
slow: 1 loop, best of 3: 4.33 s per loop
100000  rows
quick: 10 loops, best of 3: 50.2 ms per loop
slow: 1 loop, best of 3: 43.4 s per loop
1000000  rows
quick: 1 loop, best of 3: 404 ms per loop
slow:

On a million rows (on my machine), the quick (pandas) version runs in less than half a second. Not shown above, but 10 million rows took 8.43 seconds. The slow one is still running, but assuming the linear growth continues, it should take around 7 minutes for 1M rows and over an hour for 10M.

https://en.xdnf.cn/q/69501.html
