Find closest line to each point on big dataset, possibly using shapely and rtree

2024/9/20 16:56:14

I have a simplified map of a city that has streets in it as linestrings and addresses as points. I need to find closest path from each point to any street line. I have a working script that does this, but it runs in polynomial time as it has nested for loop. For 150 000 lines (shapely LineString) and 10 000 points (shapely Point), it takes 10 hours to finish on 8 GB Ram computer.

The function looks like this (sorry for not making it entirely reproducible):

import pandas as pd
import shapely
from shapely import Point, LineStringdef connect_nodes_to_closest_edges(edges_df , nodes_df,edges_geom,nodes_geom):"""Finds closest line to points and returns 2 dataframes:edges_dfnodes_df"""for i in range(len(nodes_df)):point = nodes_df.loc[i,nodes_geom]shortest_distance = 100000for j in range(len(edges_df)):line = edges_df.loc[j,edges_geom]if line.distance(point) < shortest_distance:shortest_distance = line.distance(point)closest_street_index = jclosest_line = line...

Then I save the results in a table as a new column that adds the shortest path from point to line as a new column.

Is there a way to make it faster with some addition to the function?

If I could for example filter out lines for every point that are 50m away or so, it would help to speed every iteration?

Is there a way to make this faster using rtree package? I was able to find an answer that makes the script for finding intersection of polygons faster, but I can't seem to make it working for closest point to line.

Faster way of polygon intersection with shapely

https://pypi.python.org/pypi/Rtree/

Sorry if this was already answered, but I didn't find an answer here nor on gis.stackexchange

thank you for advice!

Answer

Here you have a solution using rtree library. The idea is to build boxes that contain the segments in the diagonal, and use that boxes to build the rtree. This will be the most time-expensive operation. Later, you query the rtree with a box centered in the point. You get several hits that you need to check for the minimum, but the number of hits will be (hopefuly) orders of magnitud lower than checking against all the segments.

In the solutions dict you will get, for each point, the line id, the nearest segment, the nearest point (a point of the segment), and the distance to the point.

There are some comments in the code to help you. Take into account you can serialize the rtree for later use. In fact, I would recomend to build the rtree, save it, and then use it. Because the exceptions for the adjustments of the constants MIN_SIZE and INFTY will probably raise, and you would not want to lose all the computation you did building the rtree.

A too small MIN_SIZE will mean you could have errors in the solutions because if the box around the point does not intersect a segment, it could intersect a box of a segment that is not the nearest segment (it is easy to think a case).

A too big MIN_SIZE would mean to have too much false positives, that in the extreme case would make the code to try with all the segments, and you will be in the same position as before, or worst, because you are now building an rtree you don't really use.

If the data is real data from a city, I imagine you know that any address will be intersecting a segment with a distance smaller than a few blocks. That will make the search practically logaritmic.

One more comment. I'm assuming that there are not segments that are too large. Since we are using the segments as the diagonals of the boxes in the rtree, if you have some large segments in a line, this would mean a huge box will be assigned to that segment, and all addresses boxes would intersect it. To avoid this, you can always increase artificially the resolution of the LineStrins by adding more intermediate points.

import math
from rtree import index
from shapely.geometry import Polygon, LineStringINFTY = 1000000
MIN_SIZE = .8
# MIN_SIZE should be a vaule such that if you build a box centered in each 
# point with edges of size 2*MIN_SIZE, you know a priori that at least one 
# segment is intersected with the box. Otherwise, you could get an inexact 
# solution, there is an exception checking this, though.def distance(a, b):return math.sqrt( (a[0]-b[0])**2 + (a[1]-b[1])**2 ) def get_distance(apoint, segment):a = apointb, c = segment# t = <a-b, c-b>/|c-b|**2# because p(a) = t*(c-b)+b is the ortogonal projection of vector a # over the rectline that includes the points b and c. t = (a[0]-b[0])*(c[0]-b[0]) + (a[1]-b[1])*(c[1]-b[1])t = t / ( (c[0]-b[0])**2 + (c[1]-b[1])**2 )# Only if t 0 <= t <= 1 the projection is in the interior of # segment b-c, and it is the point that minimize the distance # (by pitagoras theorem).if 0 < t < 1:pcoords = (t*(c[0]-b[0])+b[0], t*(c[1]-b[1])+b[1])dmin = distance(a, pcoords)return pcoords, dminelif t <= 0:return b, distance(a, b)elif 1 <= t:return c, distance(a, c)def get_rtree(lines):def generate_items():sindx = 0for lid, l in lines:for i in xrange(len(l)-1):a, b = l[i]c, d = l[i+1]segment = ((a,b), (c,d))box = (min(a, c), min(b,d), max(a, c), max(b,d)) #box = left, bottom, right, topyield (sindx, box, (lid, segment))sindx += 1return index.Index(generate_items())def get_solution(idx, points):result = {}for p in points:pbox = (p[0]-MIN_SIZE, p[1]-MIN_SIZE, p[0]+MIN_SIZE, p[1]+MIN_SIZE)hits = idx.intersection(pbox, objects='raw')    d = INFTYs = Nonefor h in hits: nearest_p, new_d = get_distance(p, h[1])if d >= new_d:d = new_ds = (h[0], h[1], nearest_p, new_d)result[p] = sprint s#some checking you could remove after you adjust the constantsif s == None:raise Exception("It seems INFTY is not big enough.")pboxpol = ( (pbox[0], pbox[1]), (pbox[2], pbox[1]), (pbox[2], pbox[3]), (pbox[0], pbox[3]) )if not Polygon(pboxpol).intersects(LineString(s[1])):  msg = "It seems MIN_SIZE is not big enough. "msg += "You could get inexact solutions if remove this exception."raise Exception(msg)return result

I tested the functions with this example.

xcoords = [i*10.0/float(1000) for i in xrange(1000)]
l1 = [(x, math.sin(x)) for x in xcoords]
l2 = [(x, math.cos(x)) for x in xcoords]
points = [(i*10.0/float(50), 0.8) for i in xrange(50)]lines = [('l1', l1), ('l2', l2)]idx = get_rtree(lines)solutions = get_solution(idx, points)

And got: Result of test

https://en.xdnf.cn/q/72146.html

Related Q&A

Reading pretty print json files in Apache Spark

I have a lot of json files in my S3 bucket and I want to be able to read them and query those files. The problem is they are pretty printed. One json file has just one massive dictionary but its not in…

Visualize TFLite graph and get intermediate values of a particular node?

I was wondering if there is a way to know the list of inputs and outputs for a particular node in tflite? I know that I can get input/outputs details, but this does not allow me to reconstruct the com…

Why do I get a pymongo.cursor.Cursor when trying to query my mongodb db via pymongo?

I have consumed a bunch of tweets in a mongodb database. I would like to query these tweets using pymongo. For example, I would like to query for screen_name. However, when I try to do this, python doe…

using dropbox as a server for my django app

I dont know if at all i make any sense, but this popped up in my mind. Can we use the 2gb free hosting of dropbox to put our django app over there and do some hacks to run our app?

Proper overloading of json encoding and decoding with Flask

I am trying to add some overloading to the Flask JSON encoder/decoder to add datetime encoding/decoding but only succeeded through a hack.from flask import Flask, flash, url_for, redirect, render_templ…

How to check a specific type of tuple or list?

Suppose, var = (x, 3)How to check if a variable is a tuple with only two elements, first being a type str and the other a type int in python? Can we do this using only one check? I want to avoid this…

Cannot import name BlockBlobService

I got the following error:from azure.storage.blob import BlockBlobService ImportError: cannot import name BlockBlobServicewhen trying to run my python project using command prompt. (The code seems to…

Legend outside the plot in Python - matplotlib

Im trying to place a rather extensive legend outside my plot in matplotlib. The legend has quite a few entries, and each entry can be quite long (but I dont know exactly how long).Obviously, thats quit…

Filter items that only occurs once in a very large list

I have a large list(over 1,000,000 items), which contains english words:tokens = ["today", "good", "computer", "people", "good", ... ]Id like to get al…

Get Data JSON in Flask

Even following many example here & there, i cant get my API work in POST Method. Here the code about it :from flask import Flask, jsonify, request@app.route(/api/v1/lists, methods=[POST]) def add_e…