If there are exact duplicate values in a DataFrame, pandas already provides functions to replace or drop them. In many experimental datasets, however, one often has 'near' duplicates instead.
How can one replace these near-duplicate values with, e.g., their mean?
The example data looks as follows:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})
I tried to hack together something that bins near duplicates, but it relies on for loops and feels like working against pandas:
def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    used_x = []  # list of values already grouped
    group_index = 0
    for search_value in df[colname_to_cluster]:
        if search_value in used_x:
            # value is already in a group, skip to next
            continue
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1
    return df.groupby('cluster_group').mean()
Which does the grouping and averaging:
print(cluster_near_values(df, 'x', 0.1))

                      x     y
cluster_group
0.0            1.000000  1.00
1.0            2.005000  2.10
2.0            3.000000  3.00
3.0            4.016667  4.17
4.0            5.000000  5.50
Is there a better way to achieve this?
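For reference, here is one vectorized sketch I considered (not necessarily the best way, hence the question): sort by the clustering column, then start a new group wherever the gap between consecutive values exceeds `bin_size`. Note this chains adjacent values together (single-linkage style), which is slightly different from binning around each first-seen value, though it gives the same result on this example data.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})

def cluster_near_values(df, col, bin_size=0.1):
    # Sort so near-duplicates become adjacent rows.
    s = df.sort_values(col)
    # diff() gives the gap to the previous sorted value; a gap larger
    # than bin_size marks the start of a new cluster, and cumsum()
    # turns those markers into running group labels.
    groups = s[col].diff().gt(bin_size).cumsum()
    return s.groupby(groups).mean()

result = cluster_near_values(df, 'x', bin_size=0.1)
print(result)
```

This avoids the explicit loop entirely, but whether chaining adjacent values is acceptable depends on the data.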