Iteration order with pandas groupby on a pre-sorted DataFrame

2024/10/14 11:16:35

The Situation

I'm classifying the rows in a DataFrame using a classifier chosen by the value in a particular column. My goal is to append the results to one of two new columns, depending on certain conditions. The code, as it stands, looks something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [list with classifier ids],  # Only 3 ids, One word strings
                   'B': [List of text to be classified],  # Millions of unique rows, lines of text around 5-25 words long
                   'C': [List of the old classes]})  # Hundreds of possible classes, four digit integers stored as strings
df.sort_values('A', inplace=True)
new_col1, new_col2 = [], []
for name, group in df.groupby('A', sort=False):
    classifier = classy_dict[name]
    vectors = vectorize(group.B.values)
    preds = classifier.predict(vectors)
    scores = classifier.decision_function(vectors)
    for tup in zip(preds, scores, group.C.values):
        if tup[2] == tup[0]:
            # Prediction agrees with the old class: keep the old class in E
            new_col1.append(np.nan)
            new_col2.append(tup[2])
        else:
            # Prediction disagrees: store the five highest-scoring classes in D
            new_col1.append(str(classifier.classes_[tup[1].argsort()[-5:]]))
            new_col2.append(np.nan)
df['D'] = new_col1
df['E'] = new_col2

The Issue

I am concerned that groupby will not iterate in the top-down, order-of-appearance manner that I expect. The iteration order when sort=False is not covered in the docs.

My Expectations

All I'm looking for here is some affirmation that groupby('col', sort=False) does iterate in the top-down order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.

Here is the code I used to test my theory on sort=False iteration order:

from numpy.random import randint
import pandas as pd
from string import ascii_lowercase as lowers

df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
                   'B': randint(10, size=100)})
print(df.A.unique())  # unique values in order of appearance per the docs
for name, group in df.groupby('A', sort=False):
    print(name)

Edit: The above code makes it appear as though it acts in the manner that I expect, but I would like some more undeniable proof, if it is available.
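
The printed comparison above can be tightened into an assertion that runs over many random frames. This is only an empirical check, not a proof, and the seed, the number of trials, and the variable names below are arbitrary choices made for this sketch:

```python
import numpy as np
import pandas as pd
from string import ascii_lowercase as lowers

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
trials = 200                    # arbitrary number of random frames

for _ in range(trials):
    df = pd.DataFrame({'A': [lowers[i] for i in rng.integers(3, size=100)],
                       'B': rng.integers(10, size=100)})
    # Group keys in the order the groupby object yields them
    keys = [name for name, _ in df.groupby('A', sort=False)]
    # Unique values in order of first appearance
    first_seen = list(df['A'].unique())
    assert keys == first_seen, (keys, first_seen)

print('iteration order matched first-appearance order in', trials, 'trials')
```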

Answer

Yes, when you pass sort=False the order of first appearance is preserved. The groupby source code is a little opaque, but there is one function groupby.ngroup which fully answers this question, as it directly tells you the order in which iteration occurs.

def ngroup(self, ascending=True):
    """
    Number each group from 0 to the number of groups - 1.

    This is the enumerative complement of cumcount.  Note that the
    numbers given to the groups match the order in which the groups
    would be seen when iterating over the groupby object, not the
    order they are first observed.
    """

Data from @coldspeed

df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()

Output:

    col  sort=False  sort=True
0   16           0          7
1    1           1          0
2   10           2          5
3   20           3          8
4    3           4          2
5   13           5          6
6    2           6          1
7    5           7          3
8    7           8          4
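
For anyone who wants to rerun this, here is a self-contained sketch. The values in col are copied from the table above; the two key lists are an extra added here to show the iteration order directly, and their names are made up for this example:

```python
import pandas as pd

# Values copied from the 'col' column of the output above
df = pd.DataFrame({'col': [16, 1, 10, 20, 3, 13, 2, 5, 7]})

df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()

# Group keys in the order each groupby object yields them
keys_unsorted = [key for key, _ in df.groupby('col', sort=False)]
keys_sorted = [key for key, _ in df.groupby('col', sort=True)]

print(df)
print(keys_unsorted)  # same order as the rows of 'col'
print(keys_sorted)    # ascending order
```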

When sort=False, you iterate over the groups in order of first appearance. When sort=True, the group keys are sorted first, and iteration then follows that sorted order.
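
As for a "better way": one option is to make the result independent of iteration order altogether by writing each group's predictions back through the group's index instead of appending to lists. The sketch below is a simplification under stated assumptions: fake_predictions is a hypothetical stand-in for classy_dict[name].predict(...), column D holds the disagreeing prediction rather than the top five classes, and the DataFrame index is assumed to be unique.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'y', 'x', 'z', 'y'],                # classifier ids, deliberately unsorted
    'B': ['t0', 't1', 't2', 't3', 't4'],           # text to classify (unused by the stand-in)
    'C': ['1000', '2000', '3000', '1000', '9999'], # old classes
})

# Hypothetical stand-in for a real classifier: one fixed prediction per id
fake_predictions = {'x': '1000', 'y': '2000', 'z': '3000'}

pieces = []
for name, group in df.groupby('A', sort=False):
    preds = [fake_predictions[name]] * len(group)
    # Keep the group's index so every prediction stays tied to its row
    pieces.append(pd.Series(preds, index=group.index))

# Column assignment aligns on the index, so group order does not matter
df['pred'] = pd.concat(pieces)

agree = df['pred'] == df['C']
df['E'] = df['C'].where(agree)      # old class where the prediction agrees
df['D'] = df['pred'].where(~agree)  # prediction where it disagrees

print(df)
```

With this pattern, the earlier sort_values call is no longer needed for correctness, because the rows are matched by index label rather than by position.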

https://en.xdnf.cn/q/69417.html
