Pandas dataframe : Operation per batch of rows

2024/9/21 1:36:19

I have a pandas DataFrame df for which I want to compute some statistics per batch of rows.

For example, let's say that I have a batch_size = 200000.

For each batch of batch_size rows I would like to have the number of unique values for a column ID of my DataFrame.

How can I do something like that ?

Here is an example of what I want :

print(df)>>
+-------+
|     ID|
+-------+
|      1|
|      1|
|      2|
|      2|
|      2|
|      3|
|      3|
|      3|
|      3|
+-------+batch_size = 3my_new_function(df,batch_size)>>
For batch 1 (0 to 2) :
2 unique values 
1 appears 2 times
2 appears 1 timeFor batch 2 (3 to 5) : 
2 unique values 
2 appears 2 times
3 appears 1 timeFor batch 3 (6 to 8) 
1 unique values 
3 appears 3 times

Note : The output can of course be a simple DataFrame

Answer

See this post for the splitting process, then you could do this to get number of unique 'ID'

df = pd.DataFrame({'ID' : [1, 1, 2, 2, 2, 3, 3, 3, 3]})
batch_size = 3
result = []
for batch_number, batch_df in df.groupby(np.arange(len(df)) // batch_size):result.append(batch_df['ID'].nunique())
pd.DataFrame(result)

edit: go with user3426270's answer, I didn't notice it when I answered

https://en.xdnf.cn/q/72281.html

Related Q&A

Combining grid/pack Tkinter

I know there have been many questions on grid and pack in the past but I just dont understand how to combine the two as Im having difficulties expanding my table in both directions (row/column).Buttons…

Mocking assert_called_with in Python

Im having some trouble understanding why the following code does not pass:test.pyimport mock import unittestfrom foo import Fooclass TestFoo(unittest.TestCase):@mock.patch(foo.Bar)def test_foo_add(self…

Python select() behavior is strange

Im having some trouble understanding the behavior of select.select. Please consider the following Python program:def str_to_hex(s):def dig(n):if n > 9:return chr(65-10+n)else:return chr(48+n)r = wh…

Moving items up and down in a QListWidget?

In a QListWidget I have a set of entries. Now I want to allow the user to sort (reorder) these entries through two buttons (Up/Down).Heres part of my code:def __init__(self):QtGui.QMainWindow.__init__(…

How to resize a tiff image with multiple channels?

I have a tiff image of size 21 X 513 X 513 where (513, 513) is the height and width of the image containing 21 channels. How can I resize this image to 21 X 500 X 375?I am trying to use PILLOW to do …

Losing elements in python code while creating a dictionary from a list?

I have some headache with this python code.print "length:", len(pub) # length: 420pub_dict = dict((p.key, p) for p in pub)print "dict:", len(pub_dict) # length: 163If I understand t…

How to set Font size or Label size to fit all device

How to set kivy font size or label size so that it will fit all phone-device screen? (fit by means does not overlap with the boundary, rectangle, or even the screen)I know the Buttons and some other W…

Using tweepy to access Twitters Streaming API

Im currently having trouble getting example code for using tweepy to access Twitters Streaming API to run correctly (err...or at least how I expect it to run). Im using a recent clone of tweepy from Gi…

Jupyter notebook, how to run multiple cells simultaneously?

I defined a python function which run a bash script. Lets say the function is: calc(x,y,z). If I run this function in python with some variables,>>> calc(1,2,3)It generates a C code which simu…

How to make a slice of DataFrame and fillna in specific slice using Python Pandas?

The problem: let us take Titanic dataset from Kaggle. I have dataframe with columns "Pclass", "Sex" and "Age". I need to fill NaN in column "Age" with a median f…