Pandas: How to select a column in rolling window

2024/9/20 19:29:53

I have a dataframe (with columns 'a', 'b', 'c') on which I am doing a rolling-window.

I want to filter the rolling window using one of the columns (say 'a') inside the apply function, like below:

df.rolling(len(s),min_periods=0).apply(lambda x: x[[x['a']>10][0] if len(x[[x['a']>10]]) >=0 else np.nan)

The intention of the line above is to select the first row in the rolling window whose 'a' column has a value greater than 10. If there is no such row, it should return nan.

But this does not work, and I get the following error:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

This means that I am not allowed to access the individual columns at all by this syntax. Is there any other way of doing this kind of thing?

Answer

Your error stems from assuming that what is passed to the function inside apply is a dataframe; it is actually an ndarray, not a dataframe.

Pandas dataframe apply works on each column/series of the dataframe, so any function passed to apply is called once per column/series, like an internal lambda. In the case of a windowed dataframe, apply takes each column/series inside each window and passes it to the function as a one-dimensional ndarray (in recent pandas versions this requires raw=True; with raw=False the function receives a Series instead), and the function has to return a single scalar per series per window. Knowing this saves a lot of pain.
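To make this concrete, here is a minimal sketch that records what the function receives. The small frame and the helper name probe are made up for illustration, and raw=True is passed explicitly so that each window arrives as an ndarray.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the OP's data)
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

seen = []  # records the type and shape of every window passed in

def probe(window):
    # `window` holds the values of ONE column for ONE window
    seen.append((type(window).__name__, window.shape))
    return window.sum()  # the function must reduce the window to a scalar

out = df.rolling(2).apply(probe, raw=True)

print(seen[0])  # type name and shape of the first window received
print(out)      # one scalar per column per window
```

Because each call only sees one column, an expression such as x['a'] has nothing to select from inside the function.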

So in your case you cannot use apply directly, unless you have a more complex function that remembers the first value of the series a for each window.

For the OP's case, suppose a column of the window, say a, has to meet a condition, say > 10:

  1. For the case where a in the first row of a window meets the condition, this is the same as searching the dataframe with df[df['a']>10].

  2. For other conditions, such as a in the second row of a window being > 10, checking the entire dataframe also works, except for the first window of the dataframe.

The following example demonstrates another way to reach a solution.

import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,20,size=(20, 4)), columns=list('abcd'))

df looks like

    a   b   c   d
0   13  2   2   6
1   17  19  10  1
2   0   17  15  9
3   0   14  0   15
4   19  14  4   0
5   16  4   17  3
6   2   7   2   15
7   16  7   9   3
8   6   1   2   1
9   12  8   3   10
10  5   0   11  2
11  10  13  18  4
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
15  1   11  5   18
16  17  12  18  17
17  1   19  12  9
18  16  17  3   3
19  11  7   9   2

Now, to select a window if the second row inside the rolling window of a meets the condition a > 10, as in the OP's question:

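roll_window = 5   # number of rows in each window
search_index = 1  # position inside the window to test (second row)
df_roll = df['a'].rolling(roll_window)
# raw=True passes each window as an ndarray, so x[search_index] is positional
df_y = df_roll.apply(lambda x: x[search_index] if x[search_index] > 10 else np.nan, raw=True).dropna()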

The line above returns all values of a for which the second row of a window is greater than 10. Note that the values are right for the example dataframe above, but the index labels come from how the rolling window is aligned: by default each result is labelled with the last row of its window, not with the row that matched.

4     17.0
7     19.0
8     16.0
10    16.0
12    12.0
15    15.0
16    13.0
17    14.0
19    17.0

To get the right index location and the entire row in the original dataframe:

df.loc[df_y.index + search_index - roll_window + 1]

returns

    a   b   c   d
1   17  19  10  1
4   19  14  4   0
5   16  4   17  3
7   16  7   9   3
9   12  8   3   10
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
16  17  12  18  17
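As a cross-check, the same rows can be found without rolling at all by filtering the dataframe directly and leaving out the rows that can never sit at the chosen position of a full window. This is only a sketch: column a is copied by hand from the table above, and the names candidates, direct and via_rolling are made up for illustration.

```python
import numpy as np
import pandas as pd

# Column 'a' copied from the example dataframe above
a = [13, 17, 0, 0, 19, 16, 2, 16, 6, 12, 5, 10, 15, 13, 14, 1, 17, 1, 16, 11]
df = pd.DataFrame({"a": a})

roll_window = 5
search_index = 1
n = len(df)

# Only these rows can be at position `search_index` of a full window
candidates = df.iloc[search_index : n - roll_window + search_index + 1]
direct = candidates[candidates["a"] > 10]

# Rolling-based result, for comparison
rolled = df["a"].rolling(roll_window).apply(
    lambda x: x[search_index] if x[search_index] > 10 else np.nan, raw=True
).dropna()
via_rolling = df.loc[rolled.index + search_index - roll_window + 1]

print(direct.index.tolist())
print(via_rolling.index.tolist())
```

Both approaches select the same index labels for this data; the direct filter avoids calling a Python function once per window.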

One could also use np.array(df), build rolling slices corresponding to the rolling window, and filter the array using those slices.
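One possible way to do this, sketched here for the OP's original goal (the first row in each window with a > 10), is NumPy's sliding_window_view. It assumes NumPy 1.20 or newer and looks only at full windows; column a is copied by hand from the table above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Column 'a' copied from the example dataframe above
a = [13, 17, 0, 0, 19, 16, 2, 16, 6, 12, 5, 10, 15, 13, 14, 1, 17, 1, 16, 11]
df = pd.DataFrame({"a": a})
roll_window = 5

# One row per full window: shape (len(df) - roll_window + 1, roll_window)
windows = sliding_window_view(df["a"].to_numpy(), roll_window)

mask = windows > 10
has_match = mask.any(axis=1)        # False where no value in the window is > 10
first_offset = mask.argmax(axis=1)  # offset of the first True (0 if there is none)

# Position in df of the first matching row of each window
first_pos = np.arange(len(windows)) + first_offset

# First matching value per window, nan where the window has no match
first_val = np.where(
    has_match, windows[np.arange(len(windows)), first_offset], np.nan
)

# Full rows of df for the windows that do have a match
first_rows = df.iloc[first_pos[has_match]]
print(first_rows)
```

Window i here starts at row i, so its result belongs to label i + roll_window - 1 if it is to be lined up with the output of rolling.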

