Question 1

The Situation

I'm classifying the rows in a DataFrame, choosing the classifier for each row based on the value in a particular column. My goal is to append each result to one of two new columns, depending on whether the prediction matches the old class. The code, as it stands, looks something like this:

df = pd.DataFrame({'A': [list with classifier ids],  # Only 3 ids, One word strings
                   'B': [List of text to be classified],  # Millions of unique rows, lines of text around 5-25 words long
                   'C': [List of the old classes]})  # Hundreds of possible classes, four digit integers stored as strings
df.sort_values('A', inplace=True)
new_col1, new_col2 = [], []
for name, group in df.groupby('A', sort=False):
    classifier = classy_dict[name]
    vectors = vectorize(group.B.values)
    preds = classifier.predict(vectors)
    scores = classifier.decision_function(vectors)
    for tup in zip(preds, scores, group.C.values):
        if tup[2] == tup[0]:
            new_col1.append(np.nan)
            new_col2.append(tup[2])
        else:
            new_col1.append(str(classifier.classes_[tup[1].argsort()[-5:]]))
            new_col2.append(np.nan)
df['D'] = new_col1
df['E'] = new_col2

The Issue

I am concerned that groupby will not iterate in the top-down, order-of-appearance manner that I expect. The iteration order when sort=False is not covered in the docs.

My Expectations

All I'm looking for here is some affirmation that groupby('col', sort=False) does iterate in the top-down order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.

Here is the code I used to test my theory on sort=False iteration order:

from numpy.random import randint
import pandas as pd
from string import ascii_lowercase as lowers

df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
                   'B': randint(10, size=100)})

print(df.A.unique())  # unique values in order of appearance per the docs
for name, group in df.groupby('A', sort=False):
    print(name)

Edit: The above code suggests that groupby behaves the way I expect, but I would like more conclusive proof, if it is available.

Answer

Yes, when you pass sort=False the order of first appearance is preserved. The groupby source code is a little opaque, but one method, groupby.ngroup, settles the question: it numbers the groups in the order in which iteration visits them. Its docstring reads:

def ngroup(self, ascending=True):
    """
    Number each group from 0 to the number of groups - 1.

    This is the enumerative complement of cumcount.  Note that the
    numbers given to the groups match the order in which the groups
    would be seen when iterating over the groupby object, not the
    order they are first observed.
    """

Data from @coldspeed

df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()

Output:

    col  sort=False  sort=True
0   16           0          7
1    1           1          0
2   10           2          5
3   20           3          8
4    3           4          2
5   13           5          6
6    2           6          1
7    5           7          3
8    7           8          4
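
To check this against the iteration itself rather than only against ngroup, here is a minimal sketch that rebuilds the table from the col values shown above and collects the group keys in the order the groupby object yields them. The variable names order_unsorted and order_sorted are mine, chosen for illustration.

```python
import pandas as pd

# Values copied from the "col" column of the table above.
df = pd.DataFrame({'col': [16, 1, 10, 20, 3, 13, 2, 5, 7]})

df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()

# Group keys in the order in which iterating the groupby object yields them.
order_unsorted = [int(name) for name, _ in df.groupby('col', sort=False)]
order_sorted = [int(name) for name, _ in df.groupby('col', sort=True)]

print(df)
print(order_unsorted)  # compare with df['col'].unique()
print(order_sorted)    # compare with sorted(df['col'])
```

If the keys are compared with df['col'].unique() and sorted(df['col']), the two orders can be confirmed directly on any test data.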

With sort=False, iteration follows the order in which each group first appears. With sort=True, the group keys are sorted first and iteration then follows the sorted order.
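
As for a better way to make it work: one option is to not depend on iteration order at all and let pandas align the results on the index. The sketch below is only an illustration under stated assumptions: the toy data is made up, fake_predict is a hypothetical stand-in for classifier.predict (the real classy_dict and vectorize are not shown in the question), and the DataFrame index is assumed to be unique.

```python
import pandas as pd

# Toy stand-in for the question's data; the values are made up.
df = pd.DataFrame({'A': ['x', 'z', 'y', 'x', 'z', 'y'],
                   'C': ['1001', '1002', '1003', '1004', '1005', '1006']})


def fake_predict(group):
    """Hypothetical stand-in for classifier.predict.

    Rows with an even index label keep their old class; the rest get '9999'.
    """
    return [c if i % 2 == 0 else '9999'
            for i, c in zip(group.index, group['C'])]


def add_columns(frame, sort):
    """Fill D and E per group, aligning on the index instead of on position."""
    out = frame.copy()
    parts_d, parts_e = [], []
    for _, group in out.groupby('A', sort=sort):
        preds = pd.Series(fake_predict(group), index=group.index)
        same = preds == group['C']
        parts_e.append(group['C'].where(same))   # old class where it matches
        parts_d.append(preds.where(~same))       # new prediction where it differs
    # Assigning a Series aligns on the index, so the order in which the
    # groups were visited does not matter (assumes a unique index).
    out['D'] = pd.concat(parts_d)
    out['E'] = pd.concat(parts_e)
    return out


result = add_columns(df, sort=False)
print(result)
```

Because each piece carries its original index labels, calling add_columns with sort=True should give the same D and E columns as sort=False.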

Iteration order with pandas groupby on a pre-sorted DataFrame
