Extract edges and communities from a list of nodes

2024/9/20 7:53:07

I have a dataset with more than 50k nodes, and I am trying to extract possible edges and communities from it. I tried graph tools such as gephi, cytoscape, socnet, nodexl and so on to visualize and identify the edges and communities, but the node list is too large for those tools. Hence I am trying to write a script to extract the edges and communities. The other columns are the connection start datetime and end datetime, with GPS locations.




I am trying to implement an undirected weighted / unweighted graph.


Use Pandas to get the data into a pairwise node listing, where each row represents an edge, based on your edge criteria. Then migrate into a networkx object for graph analysis.

The criteria for two nodes sharing an edge include:

  1. Same location. I'm assuming this means same gps1 AND same gps2.
  2. "Near same start and end time". This is a little ambiguous. For the purposes of this answer I've reduced this criterion to "start time in the same 5-second interval". It shouldn't be too hard to extend the groupby approach I've taken here if you want to apply additional temporal conditions on edges.

Since we want to manipulate data based on timestamps, convert start and end to datetime dtype:

import pandas as pd

df["start"] = pd.to_datetime(df["start"], unit="s")
df["end"] = pd.to_datetime(df["end"], unit="s")

df.start.describe()

count                      35
unique                     11
top       2004-01-05 00:00:13
freq                        8
first     2004-01-05 00:00:01
last      2004-01-05 00:00:26
Name: start, dtype: object

df.head()

             ID               start                 end    gps1    gps2
0   0022d9064bc 2004-01-05 00:00:01 2004-01-05 00:00:03  819251  440006
1  00022d9064bc 2004-01-05 00:00:03 2004-01-05 00:00:10  819213  439954
2  00904b4557d3 2004-01-05 00:00:03 2004-01-05 00:18:40  817526  439458
3  00022de73863 2004-01-05 00:00:04 2004-01-05 01:16:50  817558  439525
4  00904b14b494 2004-01-05 00:00:04 2004-01-05 00:30:25  817558  439525

The sample observations happen within a few seconds of each other, so we'll set the grouping frequency to be only a few seconds:

near = "5s" 

Now groupby location and start time to find connected nodes:

edges = (
    df.groupby(
        ["gps1", "gps2", pd.Grouper(key="start", freq=near, closed="right", label="right")],
        as_index=False,
    )
    .agg({"ID": ",".join, "start": "min", "end": "max"})
    .reset_index()
    .rename(columns={"index": "edge", "start": "start_min", "end": "end_max"})
)
edges.ID = edges.ID.str.split(",")


   edge    gps1    gps2                                                 ID  \
0     0  817526  439458                                     [00904b4557d3]   
1     1  817558  439525  [00022de73863, 00904b14b494, 00904b14b494, 009...   
2     2  817558  439525         [00022de73863, 00904b14b494, 00904b312d9e]   
3     3  817721  439564  [00022d176cf3, 000c30d8d2e8, 00904b243bc4, 009...   
4     4  817735  439757                       [003065d2d8b6, 00904b0c7856]   

            start_min             end_max  
0 2004-01-05 00:00:03 2004-01-05 00:18:40  
1 2004-01-05 00:00:04 2004-01-05 01:16:50  
2 2004-01-05 00:00:25 2004-01-05 00:01:19  
3 2004-01-05 00:00:13 2004-01-05 00:02:42  
4 2004-01-05 00:00:17 2004-01-05 01:52:40 
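If you also want the end time to count toward the edge criterion, one possible extension is to bucket both timestamps and group on both buckets. The sketch below is only an illustration: the toy rows are made up, the start_bin and end_bin column names are my own, and it uses dt.floor as a simpler stand-in for the pd.Grouper binning above, so the bin edges will not match exactly.

```python
import pandas as pd

# Toy data shaped like the sample above (values are invented for illustration).
df = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "start": pd.to_datetime([
        "2004-01-05 00:00:01", "2004-01-05 00:00:03",
        "2004-01-05 00:00:04", "2004-01-05 00:00:04",
    ]),
    "end": pd.to_datetime([
        "2004-01-05 00:10:00", "2004-01-05 00:10:02",
        "2004-01-05 00:30:00", "2004-01-05 00:10:04",
    ]),
    "gps1": [1, 1, 1, 1],
    "gps2": [2, 2, 2, 2],
})

near = "5s"

# Hypothetical helper columns: round each timestamp down to its 5-second bucket.
df["start_bin"] = df["start"].dt.floor(near)
df["end_bin"] = df["end"].dt.floor(near)

# Rows that share location, start bucket and end bucket form one edge group.
groups = (
    df.groupby(["gps1", "gps2", "start_bin", "end_bin"])["ID"]
    .agg(list)
    .reset_index()
)
print(groups[["start_bin", "end_bin", "ID"]])
```

Keep in mind that fixed buckets can separate two events that are only a second apart if they fall on opposite sides of a bucket boundary, so this is a rough notion of "near".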

Each row now represents a unique edge category. ID is a list of the nodes that all share that edge. It's a bit tricky to get this list into a new structure of node pairs; I've resorted to some old-fashioned nested for-loops. There's likely some Pandas-fu that can improve efficiency here:

Note: In the case of a singleton node, I've assigned a None value to its pair. If you don't want to track singletons, just ignore the if not len(combos): ... logic.

from itertools import combinations

pairs = []
idx = 0
for e in edges.edge.values:
    # nodes and attributes belonging to this edge category
    nodes = edges.loc[edges.edge == e, "ID"].values[0]
    attrs = edges.loc[edges.edge == e, ["gps1", "gps2", "start_min", "end_max"]]
    combos = list(combinations(nodes, 2))
    if not len(combos):
        # singleton node: no partner, so pair it with None
        pair = [e, nodes[0], None]
        pair.extend(attrs.values[0])
        pairs.append(pair)
        idx += 1
    else:
        for combo in combos:
            pair = [e, combo[0], combo[1]]
            pair.extend(attrs.values[0])
            pairs.append(pair)
            idx += 1
cols = ["edge", "nodeA", "nodeB", "gps1", "gps2", "start_min", "end_max"]
pairs_df = pd.DataFrame(pairs, columns=cols)


   edge         nodeA         nodeB    gps1    gps2           start_min  \
0     0  00904b4557d3          None  817526  439458 2004-01-05 00:00:03   
1     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
2     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
3     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
4     1  00904b14b494  00904b14b494  817558  439525 2004-01-05 00:00:04   

              end_max  
0 2004-01-05 00:18:40  
1 2004-01-05 01:16:50  
2 2004-01-05 01:16:50  
3 2004-01-05 01:16:50  
4 2004-01-05 01:16:50      
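On 50k nodes the repeated edges.loc lookups in the loop can get slow. A possible alternative, sketched here with invented toy data, walks the rows once with itertuples. It also deduplicates each ID list first, which is my own addition: it avoids rows like the one above where a node is paired with itself, but it also discards repeat counts, so skip that step if repeats matter to you.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for the `edges` frame (IDs and edge numbers are made up).
edges = pd.DataFrame({
    "edge": [0, 1],
    "gps1": [817526, 817558],
    "gps2": [439458, 439525],
    "ID": [["n1"], ["n2", "n3", "n3", "n4"]],
})

rows = []
for row in edges.itertuples(index=False):
    # Deduplicate so a repeated ID never forms a pair with itself.
    nodes = sorted(set(row.ID))
    if len(nodes) == 1:
        # Singleton: keep the node, with no partner.
        rows.append((row.edge, nodes[0], None, row.gps1, row.gps2))
    for a, b in combinations(nodes, 2):
        rows.append((row.edge, a, b, row.gps1, row.gps2))

pairs_df = pd.DataFrame(rows, columns=["edge", "nodeA", "nodeB", "gps1", "gps2"])
print(pairs_df)
```

Sorting the node list means every pair comes out in a consistent order, which helps if you later count how often the same pair appears.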

Now the data can be fit to a networkx object:

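import networkx as nx

# networkx 2.x renamed from_pandas_dataframe to from_pandas_edgelist
# note: newer networkx releases may reject None as a node, so singleton rows
# (nodeB is None) may need to be dropped before this call
g = nx.from_pandas_edgelist(pairs_df, "nodeA", "nodeB", edge_attr=True)

# access edge attributes by node pairing:
test_A = "00022de73863"
test_B = "00904b14b494"
g[test_A][test_B]["start_min"]
# output:
Timestamp('2004-01-05 00:00:25')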
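Since the question mentions weighted graphs, here is one way the pair table could be turned into a weighted undirected graph. This is a sketch under my own assumptions: the weight is simply the number of times a pair co-occurs, the toy pairs are invented, and each pair is assumed to appear in a consistent (nodeA, nodeB) order. Singletons are added as isolated nodes rather than as edges to None.

```python
import networkx as nx
import pandas as pd

# Toy pair table (made-up IDs); "n1" is a singleton with no partner.
pairs_df = pd.DataFrame({
    "nodeA": ["n1", "n2", "n2", "n2", "n3"],
    "nodeB": [None, "n3", "n3", "n4", "n4"],
})

# Keep only real pairs, then count how often each pair occurs.
linked = pairs_df.dropna(subset=["nodeB"])
weights = linked.groupby(["nodeA", "nodeB"]).size().reset_index(name="weight")

# Build the undirected graph with the co-occurrence count as edge weight.
g = nx.from_pandas_edgelist(weights, "nodeA", "nodeB", edge_attr="weight")

# Add singletons as isolated nodes so they are not lost.
g.add_nodes_from(pairs_df.loc[pairs_df["nodeB"].isna(), "nodeA"])

print(g.number_of_nodes(), g.number_of_edges())
print(g["n2"]["n3"]["weight"])
```

For an unweighted graph you can ignore the weight attribute, or drop duplicate pairs before building the graph.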

For community detection, there are several options. Consider the networkx community algorithms, as well as the community module, which builds off of native networkx functionality.

I read your question as mainly concerned with manipulating your data into a format suitable for network analysis. As this answer is lengthy enough already, I'll leave it to you to pursue community detection strategies - several methods can be used out-of-the-box with the modules I've linked to here.
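As a starting point, one of the community algorithms that ships with networkx is greedy_modularity_communities. The sketch below runs it on a small invented graph (two tightly connected triangles joined by one weak link); it is only meant to show the call pattern, not to suggest this is the best algorithm for your data.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two triangles with heavy edges, joined by one light edge.
g = nx.Graph()
g.add_weighted_edges_from([
    ("a", "b", 3), ("b", "c", 3), ("a", "c", 3),
    ("d", "e", 3), ("e", "f", 3), ("d", "f", 3),
    ("c", "d", 1),
])

# Each community is returned as a frozenset of nodes.
communities = greedy_modularity_communities(g, weight="weight")

# Map each node to the index of its community for easy lookup.
membership = {node: i for i, comm in enumerate(communities) for node in comm}

for i, comm in enumerate(communities):
    print(i, sorted(comm))
```

The membership dict can be merged back into a pandas frame if you want to report a community label per node ID.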


