How to group near-duplicate values in a pandas dataframe?

2024/10/14 9:27:39

If a DataFrame contains exact duplicates, pandas already provides functions to replace or drop them. Many experimental datasets, on the other hand, contain 'near' duplicates.

How can one replace these near-duplicate values with, e.g., their mean?

The example data looks as follows:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5], 'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})

I tried to hack something together to bin near duplicates, but it uses for loops and feels like a hack against pandas:

def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    used_x = []  # list of values already grouped
    group_index = 0
    for search_value in df[colname_to_cluster]:
        if search_value in used_x:
            # value is already in a group, skip to next
            continue
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1
    return df.groupby('cluster_group').mean()

Which does the grouping and averaging:

print(cluster_near_values(df, 'x', 0.1))
                      x     y
cluster_group
0.0            1.000000  1.00
1.0            2.005000  2.10
2.0            3.000000  3.00
3.0            4.016667  4.17
4.0            5.000000  5.50

Is there a better way to achieve this?

Answer

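Here's an example that groups items to one digit of precision. You can modify it as needed, including for binning values with a threshold greater than 1. As written, the key np.ceil(x * 10) // 10 assigns every x in the interval (k - 0.1, k + 0.9] to the integer k, so change the constants to change where the bins fall.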

import numpy as np

df.groupby(np.ceil(df['x'] * 10) // 10).mean()
            x     y
x
1.0  1.000000  1.00
2.0  2.005000  2.10
3.0  3.000000  3.00
4.0  4.016667  4.17
5.0  5.000000  5.50
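
Fixed bins like the one above can split two close values that happen to fall on either side of a bin edge. As an alternative (not the approach in the answer above), here is a minimal sketch that sorts the column and starts a new group whenever the gap to the previous value exceeds a threshold. The function name cluster_by_gap and the parameter max_gap are made up for illustration, and the threshold of 0.15 is an assumption chosen to sit clear of the 0.1 spacing in the example data. Note that this chains neighbours: a long run of closely spaced values ends up in a single group even if its ends are far apart, which differs from the bin_size logic in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
    'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5],
})


def cluster_by_gap(df, col, max_gap=0.1):
    """Group rows whose sorted `col` values are within `max_gap` of their neighbour.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Sort so that near-duplicates become adjacent rows.
    ordered = df.sort_values(col)
    # True wherever the step from the previous value is larger than max_gap.
    # The first diff is NaN, which compares as False, so the first row opens group 0.
    starts_new_group = ordered[col].diff() > max_gap
    # A running count of the True flags yields one integer label per group.
    group_id = starts_new_group.cumsum().rename('cluster_group')
    return ordered.groupby(group_id).mean()


result = cluster_by_gap(df, 'x', max_gap=0.15)
print(result)
```

Because the labels come from diff and cumsum, no Python-level loop is needed, and the input DataFrame is not modified.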