Question 1

I have a question similar to the question asked here: simple way of fusing a few close points. I want to replace points that are located close to each other with the average of their coordinates. The closeness in cells is specified by the user (I am talking about euclidean distance).

In my case I have a lot of points (about 1-million). This method is working, but is time consuming as it uses a double for loop.

Is there a faster way to detect and fuse close points in a numpy 2d array?

To be complete I added an example:

points=array([[  382.49056159,   640.1731949 ],[  496.44669161,   655.8583119 ],[ 1255.64762859,   672.99699399],[ 1070.16520917,   688.33538171],[  318.89390168,   718.05989421],[  259.7106383 ,   822.2       ],[  141.52574427,    28.68594436],[ 1061.13573287,    28.7094536 ],[  820.57417943,    84.27702407],[  806.71416007,   108.50307828]])

A scatterplot of the points is visible below. The red circle indicates the points located close to each other (in this case a distance of 27.91 between the last two points in the array). So if the user would specify a minimum distance of 30 these points should be fused.

enter image description here

In the output of the fuse function the last to points are fused. This will look like:

#output
array([[  382.49056159,   640.1731949 ],[  496.44669161,   655.8583119 ],[ 1255.64762859,   672.99699399],[ 1070.16520917,   688.33538171],[  318.89390168,   718.05989421],[  259.7106383 ,   822.2       ],[  141.52574427,    28.68594436],[ 1061.13573287,    28.7094536 ],[  813.64416975,    96.390051175]])

Question 2

If you have a large number of points then it may be faster to build a k-D tree using scipy.spatial.KDTree, then query it for pairs of points that are closer than some threshold:

import numpy as np
from scipy.spatial import KDTreetree = KDTree(points)
rows_to_fuse = tree.query_pairs(r=30)    print(repr(rows_to_fuse))
# {(8, 9)}print(repr(points[list(rows_to_fuse)]))
# array([[ 820.57417943,   84.27702407],
#        [ 806.71416007,  108.50307828]])

The major advantage of this approach is that you don't need to compute the distance between every pair of points in your dataset.

Fast fuse of close points in a numpy-2d (vectorized)

Related Q&A

I use to_gbq on pandas for updating Google BigQuery and get GenericGBQException

Something wrong with Keras code Q-learning OpenAI gym FrozenLake

How to generate month names as list in Python? [duplicate]

Getting ERROR: Double requirement given: setuptools error in zappa

PySpark - Create DataFrame from Numpy Matrix

RunTimeError during one hot encoding

Is there a Mercurial or Git version control plugin for PyScripter? [closed]

How to make a color map with many unique colors in seaborn

Swap column values based on a condition in pandas

How to improve performance on a lambda function on a massive dataframe