How to group near-duplicate values in a pandas dataframe?

2024/10/14 9:27:39

If there are duplicate values in a DataFrame, pandas already provides functions to replace or drop them. In many experimental datasets, on the other hand, one might have 'near' duplicates.

How can one replace these near-duplicate values with, e.g., their mean?

The example data looks as follows:

import pandas as pd
df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5], 'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})

I tried to hack together something to bin near duplicates, but it uses for loops and feels like a hack against pandas:

def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    used_x = []  # list of values already grouped
    group_index = 0
    for search_value in df[colname_to_cluster]:
        if search_value in used_x:
            # value is already in a group, skip to next
            continue
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1
    return df.groupby('cluster_group').mean()

This does the grouping and averaging:

print(cluster_near_values(df, 'x', 0.1))
                      x     y
cluster_group
0.0            1.000000  1.00
1.0            2.005000  2.10
2.0            3.000000  3.00
3.0            4.016667  4.17
4.0            5.000000  5.50

Is there a better way to achieve this?


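Here's an example where you want to group items to one digit of precision. You can modify this as needed, including for binning values with a threshold over 1. Note that the grouping key used here, ceil(10 * x) // 10, makes each bin one unit wide: values with k - 0.1 < x <= k + 0.9 share the key k (up to floating-point rounding at the edges).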

import numpy as np

df.groupby(np.ceil(df['x'] * 10) // 10).mean()
            x     y
1.0  1.000000  1.00
2.0  2.005000  2.10
3.0  3.000000  3.00
4.0  4.016667  4.17
5.0  5.000000  5.50
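
Fixed bins like the one above can split two close values that happen to straddle a bin edge. One possible alternative, shown here only as a sketch and not as the approach from the answer above, is to sort the column and start a new group wherever the gap to the previous value exceeds a tolerance. The names `group_near_duplicates`, `tol`, and `cluster` are made up for this illustration; the pandas calls used are `sort_values`, `diff`, `gt`, `cumsum`, and `groupby`.

```python
import pandas as pd


def group_near_duplicates(df, col, tol=0.1):
    """Average rows whose sorted `col` values are linked by gaps of at most `tol`.

    Hypothetical helper for illustration; it does not modify `df`.
    """
    ordered = df.sort_values(col)
    # The first diff is NaN, which compares as False, so the first row opens
    # cluster 0. Every gap larger than tol bumps the running cluster counter.
    cluster_id = ordered[col].diff().gt(tol).cumsum().rename("cluster")
    return ordered.groupby(cluster_id).mean()


df = pd.DataFrame({
    "x": [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
    "y": [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5],
})
result = group_near_duplicates(df, "x", tol=0.1)
print(result)
```

On the example data this should produce the same five groups as the loop in the question. Because groups are formed by chaining neighbours, a long run of values that are each within `tol` of the next ends up in one group even if its first and last members are far apart, so this only fits data where near duplicates form tight, well-separated clumps.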

