Question 1

I have a large set of features that looks like this:

id1 28273 20866 29961 27190 31790 19714 8643 14482 5384 ....  upto 1000
id2 12343 45634 29961 27130 33790 14714 7633 15483 4484 ....  
id3 ..... ..... ..... ..... ..... ..... .... ..... .... .... .   .   .
...
id200000 .... .... ... ..  .  .  .  .

I want to compute for each id euclidean distance and sort them to find the 5-nearest points. Because my dataset is very large. what is the best way to do it.

Question 2

scikit-learn has nearest neighbor search. Example:

Load your data into a NumPy array.

>>> import numpy as np
>>> X = np.array([[28273, 20866, 29961, 27190, 31790, 19714, 8643, 14482, 5384, ...],[12343, 45634, 29961, 27130, 33790, 14714, 7633, 15483, 4484, ...], ...])

(Just two points shown.)

Fit a NearestNeighbors object.

>>> from sklearn.neighbors import NearestNeighbors
>>> knn = NearestNeighbors(n_neighbors=5)
>>> knn.fit(X)
NearestNeighbors(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,radius=1.0, warn_on_equidistant=True)

p=2 means Euclidean (L2) distance. p=1 would mean Manhattan (L1) distance.

Perform queries. To get the neighbors of X[0], your first data point:
```
>>> knn.kneighbors(X[0], return_distance=False)
array([[0, 1]])
```
So, the nearest neighbors of X[0] are X[0] itself and X[1] (of course).

Make sure you set n_neighbors=6 because every point in your set is going to be its own nearest neighbor.

Disclaimer: I'm involved in scikit-learn development, so this is not unbiased advice.

searching for k nearest points

Related Q&A

Why does del (x) with parentheses around the variable name work?

How to concisely represent if/else to specify CSS classes in Django templates

LabelEncoder: How to keep a dictionary that shows original and converted variable

How to find hidden files inside image files (Jpg/Gif/Png) [closed]

How to open a simple image using streams in Pillow-Python

SyntaxError: Non-UTF-8 code starting with \x82 [duplicate]

How to identify the CPU core ID of a process on Python multiprocessing?

Finding highest values in each row in a data frame for python

Using pytest_addoptions in a non-root conftest.py

How to count distinct values in a combination of columns while grouping by in pandas?