The Situation
I'm classifying the rows of a DataFrame, choosing the classifier for each row based on the value in a particular column. My goal is to write each result to one of two new columns, depending on whether the prediction matches the row's old class. The code, as it stands, looks something like this:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [list with classifier ids],  # Only 3 ids, one-word strings
                       'B': [List of text to be classified],  # Millions of unique rows, lines of text around 5-25 words long
                       'C': [List of the old classes]})  # Hundreds of possible classes, four-digit integers stored as strings
    df.sort_values('A', inplace=True)
    new_col1, new_col2 = [], []
    for name, group in df.groupby('A', sort=False):
        classifier = classy_dict[name]
        vectors = vectorize(group.B.values)
        preds = classifier.predict(vectors)
        scores = classifier.decision_function(vectors)
        for tup in zip(preds, scores, group.C.values):
            if tup[2] == tup[0]:
                # Prediction matches the old class: keep the old class in E
                new_col1.append(np.nan)
                new_col2.append(tup[2])
            else:
                # Mismatch: record the five highest-scoring classes in D
                new_col1.append(str(classifier.classes_[tup[1].argsort()[-5:]]))
                new_col2.append(np.nan)
    df['D'] = new_col1
    df['E'] = new_col2
The Issue
I am concerned that groupby will not iterate in a top-down, order-of-appearance manner as I expect. Iteration order when sort=False is not covered in the docs.
My Expectations
All I'm looking for here is some affirmation that groupby('col', sort=False) does iterate in the top-down, order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.
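One direction I have considered for removing the dependence on iteration order is to attach each group's results to that group's index and let pandas align them, rather than relying on list position. The following is only a minimal sketch under assumptions: toy_predict is a hypothetical stand-in for the real classifiers, the data is made up, the index is assumed to be unique, and column D holds just the predicted class instead of the top five.

```python
import numpy as np
import pandas as pd

def toy_predict(texts):
    # Hypothetical stand-in for classifier.predict: the "class" is the text length.
    return np.array([str(len(t)) for t in texts], dtype=object)

df = pd.DataFrame({
    "A": ["x", "y", "x", "y"],        # classifier ids
    "B": ["ab", "abc", "a", "abcd"],  # text to classify
    "C": ["2", "0", "9", "4"],        # old classes, stored as strings
})

parts = []
for name, group in df.groupby("A", sort=False):
    # Keep the group's index on the predictions so alignment is by label.
    preds = pd.Series(toy_predict(group["B"]), index=group.index)
    match = preds == group["C"]
    parts.append(pd.DataFrame({
        "D": preds.where(~match),      # new prediction where it differs
        "E": group["C"].where(match),  # old class where it was confirmed
    }))

# join aligns on the index, so the order in which groups were visited is irrelevant.
result = df.join(pd.concat(parts))
print(result)
```

Because the join is label-based, this variant should give the same result whether or not the frame is sorted by A first.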
Here is the code I used to test my theory on sort=False iteration order:
    from numpy.random import randint
    import pandas as pd
    from string import ascii_lowercase as lowers

    df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
                       'B': randint(10, size=100)})
    print(df.A.unique())  # unique values in order of appearance per the docs
    for name, group in df.groupby('A', sort=False):
        print(name)
Edit: The above code makes it appear as though it acts in the manner that I expect, but I would like some more undeniable proof, if it is available.
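A slightly stronger check, still empirical evidence rather than proof, is to compare the group keys against unique() programmatically over many random frames instead of eyeballing one printout. This is a sketch; the helper name groups_follow_first_appearance, the seed, and the trial count are my own arbitrary choices.

```python
import numpy as np
import pandas as pd

def groups_follow_first_appearance(frame, col):
    # Keys in the order groupby yields them with sort=False.
    seen = [name for name, _ in frame.groupby(col, sort=False)]
    # unique() returns values in order of first appearance.
    return seen == list(frame[col].unique())

rng = np.random.default_rng(0)
results = []
for _ in range(200):
    trial = pd.DataFrame({
        "A": rng.choice(list("abc"), size=100),
        "B": rng.integers(10, size=100),
    })
    results.append(groups_follow_first_appearance(trial, "A"))

print(all(results))
```

A passing run only shows that the behavior holds for these inputs on the installed pandas version; it does not replace a documented guarantee.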