I'm experienced with Pandas but stumbled upon a problem that I can't seem to figure out.
I have a large dataset ((40,000, 16)) and I am trying to group it by a specific column ("group_name" for this matter) and then for each group apply the following rules so it'd merge into one row per group:
- x1, x2, x3 are the "important" columns, if one row has less nulls than the others, take it. (see example with row D)
- If there are conflicts in any column, it's arbitrary and we can pick whatever.
- Combine the nulls on the important fields (x1, x2, x3), see example with row A.
Here is an example with 6 rows that should turn into 4 groups (aka 4 rows).
So far I have
groups = df.groupby['group_name']
I tried many other solutions such as summing each group, applying a transformation, aggregating by each 'important' column, merging on each 'important' column and more.
Each solution brought it's own problems so I'm offering this question here without limiting people to a certain way.
Also, I spent nearly two days combining different solutions from other questions but none has seem to work. Perhaps I've missed something.
- Please note that since this is a large dataset, I'd very much like to avoid using
for
loop on each group since efficiency is something to consider here.
I hope I explained everything properly, please let me know if something is unclear.
Code to re-create the dataframe (thanks to @Henry Ecker from the first answer):
df = pd.DataFrame({'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],'z1': ['value1', 'different_value', 'value1','value1', 'value99', 'value999'],'z2': ['value2'] * 4 + ['value100', 'value1000'],'z3': ['value3'] * 4 + ['value101', 'value101'],'zN': ['valueN'] * 5 + ['valueN200'],'x1': ['a', None, None, 'abc', 'xx', None],'x2': [None, 'b', None, 'def', 'yy', None],'x3': [None, None, None, None, 'zz', 'ff']
})
Try with groupby aggregate
'first' to get the first (valid) value from every column for each group_name
:
new_df = df.groupby('group_name', as_index=False).agg('first')
new_df
:
group_name z1 z2 z3 zN x1 x2 x3
0 A value1 value2 value3 valueN a b None
1 B value1 value2 value3 valueN None None None
2 C value1 value2 value3 valueN abc def None
3 D value99 value100 value101 valueN xx yy zz
*Note if those are string 'null'
mask
them out first + fillna
to put them back:
new_df = (df.mask(df.eq('null')).groupby('group_name', as_index=False).agg('first').fillna('null')
)
new_df
:
group_name z1 z2 z3 zN x1 x2 x3
0 A value1 value2 value3 valueN a b null
1 B value1 value2 value3 valueN null null null
2 C value1 value2 value3 valueN abc def null
3 D value99 value100 value101 valueN xx yy zz
DataFrame used:
group_name z1 z2 z3 zN x1 x2 x3
0 A value1 value2 value3 valueN a None None
1 A different_value value2 value3 valueN None b None
2 B value1 value2 value3 valueN None None None
3 C value1 value2 value3 valueN abc def None
4 D value99 value100 value101 valueN xx yy zz
5 D value999 value1000 value101 valueN200 None None ff
df = pd.DataFrame({'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],'z1': ['value1', 'different_value', 'value1','value1', 'value99', 'value999'],'z2': ['value2'] * 4 + ['value100', 'value1000'],'z3': ['value3'] * 4 + ['value101', 'value101'],'zN': ['valueN'] * 5 + ['valueN200'],'x1': ['a', None, None, 'abc', 'xx', None],'x2': [None, 'b', None, 'def', 'yy', None],'x3': [None, None, None, None, 'zz', 'ff']
})
DataFrame with string 'null'
used:
group_name z1 z2 z3 zN x1 x2 x3
0 A value1 value2 value3 valueN a null null
1 A different_value value2 value3 valueN null b null
2 B value1 value2 value3 valueN null null null
3 C value1 value2 value3 valueN abc def null
4 D value99 value100 value101 valueN xx yy zz
5 D value999 value1000 value101 valueN200 null null ff
df = pd.DataFrame({'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],'z1': ['value1', 'different_value', 'value1','value1', 'value99', 'value999'],'z2': ['value2'] * 4 + ['value100', 'value1000'],'z3': ['value3'] * 4 + ['value101', 'value101'],'zN': ['valueN'] * 5 + ['valueN200'],'x1': ['a', 'null', 'null', 'abc', 'xx', 'null'],'x2': ['null', 'b', 'null', 'def', 'yy', 'null'],'x3': ['null', 'null', 'null', 'null', 'zz', 'ff']
})