Pandas rolling std yields inconsistent results and differs from values.std

2024/10/12 0:25:07

Using pandas v1.0.1 and numpy 1.18.1, I want to calculate the rolling mean and std with different window sizes on a time series. In the data I am working with, the values can be constant over several consecutive points, so that, depending on the window size, the rolling mean equals every value in the window and the corresponding std is expected to be 0.

However, I see a different behavior using the same df depending on the window size.

MWE:

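import pandas as pd

values = [1234.0, 4567.0, 6800.0, 6810.0, 6821.0, 6820.0, 6820.0, 6820.0, 6820.0, 6820.0, 6820.0]

for window in [3, 5]:
    df = pd.DataFrame(values, columns=['values'])
    # Roll over the 'values' column only, so the helper columns are never rolled themselves
    df.loc[:, 'mean'] = df['values'].rolling(window, min_periods=1).mean()
    df.loc[:, 'std'] = df['values'].rolling(window, min_periods=1).std(ddof=0)
    df.info()  # info() prints its summary itself and returns None
    print(f'window: {window}')
    print(df)
    # Population std (ddof=0) of the last `window` values, computed directly with NumPy
    print('non-rolling result:', df['values'].iloc[len(df.index)-window:].values.std())
    print('')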

Output:

window: 3
    values         mean          std
0   1234.0  1234.000000     0.000000
1   4567.0  2900.500000  1666.500000
2   6800.0  4200.333333  2287.053757
3   6810.0  6059.000000  1055.011216
4   6821.0  6810.333333     8.576454
5   6820.0  6817.000000     4.966555
6   6820.0  6820.333333     0.471405
7   6820.0  6820.000000     0.000000
8   6820.0  6820.000000     0.000000
9   6820.0  6820.000000     0.000000
10  6820.0  6820.000000     0.000000
non-rolling result: 0.0

window: 5
    values         mean          std
0   1234.0  1234.000000     0.000000
1   4567.0  2900.500000  1666.500000
2   6800.0  4200.333333  2287.053757
3   6810.0  4852.750000  2280.329732
4   6821.0  5246.400000  2186.267193
5   6820.0  6363.600000   898.332366
6   6820.0  6814.200000     8.158431
7   6820.0  6818.200000     4.118252
8   6820.0  6820.200000     0.400000
9   6820.0  6820.000000     0.000021
10  6820.0  6820.000000     0.000021
non-rolling result: 0.0

As expected, the std is 0 for idx 7,8,9,10 using a window size of 3. For a window size of 5, I would expect idx 9 and 10 to yield 0. However, the result is different from 0.

If I calculate the std 'manually' for the last window of each window size (using idxs 8,9,10 and 6,7,8,9,10, respectively), I get the expected result of 0 for both cases.
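For reference, the manual check described above amounts to a two-pass population standard deviation (ddof=0): compute the mean of the window first, then average the squared deviations. A minimal pure-Python sketch follows; `population_std` is a hypothetical helper name used only for illustration, not a pandas or NumPy function.

```python
import math

values = [1234.0, 4567.0, 6800.0, 6810.0, 6821.0, 6820.0,
          6820.0, 6820.0, 6820.0, 6820.0, 6820.0]


def population_std(window_values):
    """Two-pass population std (ddof=0): mean first, then squared deviations."""
    n = len(window_values)
    mean = sum(window_values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in window_values) / n)


for window in (3, 5):
    last_window = values[-window:]  # idxs 8..10 for window 3, idxs 6..10 for window 5
    print(f"window {window}: {population_std(last_window)}")
```

Because every value in the last window is identical, the mean equals that value exactly, each deviation is exactly 0, and so is the result.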

Does anybody have an idea what could be the issue here? Any numerical caveats?

Answer

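It seems that the implementation of std() in pandas' rolling window favors performance over numerical accuracy, presumably because it updates running statistics as the window slides instead of recomputing each window from scratch. However, you can apply NumPy's standard deviation to each window instead: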

df.loc[:, 'std'] = df['values'].rolling(window, min_periods=1).apply(np.std, raw=True)

Result:

    values          std
0   1234.0     0.000000
1   4567.0  1666.500000
2   6800.0  2287.053757
3   6810.0  2280.329732
4   6821.0  2186.267193
5   6820.0   898.332366
6   6820.0     8.158431
7   6820.0     4.118252
8   6820.0     0.400000
9   6820.0     0.000000
10  6820.0     0.000000

Now the std at idx 9 and 10 is 0, matching the manual calculation.
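One plausible source of a residue like the 0.000021 in the window-5 output is a streaming update: after the window has passed over large deviations (here in the thousands), adding and removing values from a running sum of squared deviations can leave a small rounding error behind, and the square root magnifies it. The sketch below is an illustration under that assumption, not pandas's actual code. `streaming_rolling_std` and `two_pass_std` are hypothetical names, and whether a residue appears and how large it is depends on the exact order of operations.

```python
import math

values = [1234.0, 4567.0, 6800.0, 6810.0, 6821.0, 6820.0,
          6820.0, 6820.0, 6820.0, 6820.0, 6820.0]


def two_pass_std(xs):
    """Population std (ddof=0) recomputed from scratch for one window."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / n)


def streaming_rolling_std(xs, window):
    """Rolling population std via add/remove updates of a running mean and a
    running sum of squared deviations (m2). Illustrative only."""
    n, mean, m2 = 0, 0.0, 0.0
    out = []
    for i, x in enumerate(xs):
        # Add the newest value (Welford-style update).
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # Remove the value that just slid out of the window.
        if i >= window:
            old = xs[i - window]
            n -= 1
            delta = old - mean
            mean -= delta / n
            m2 -= delta * (old - mean)
        # Rounding can push m2 slightly below zero, so clamp before the sqrt.
        out.append(math.sqrt(max(m2, 0.0) / n))
    return out


window = 5
streaming = streaming_rolling_std(values, window)
exact = [two_pass_std(values[max(0, i - window + 1): i + 1])
         for i in range(len(values))]

for i, (s, e) in enumerate(zip(streaming, exact)):
    print(f"{i:2d}  streaming={s:.6f}  two-pass={e:.6f}")
```

The two-pass column is exactly 0 for idx 9 and 10, while the streaming column is only guaranteed to be close to it. If speed matters more than exactness, comparing the rolling std against a small tolerance rather than against 0 is a common alternative to recomputing every window.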

