Extract edges and communities from a list of nodes

2024/9/20 7:53:07

I have a dataset with more than 50k nodes, and I am trying to extract possible edges and communities from it. I tried graph tools such as Gephi, Cytoscape, socnet, NodeXL and so on to visualize the data and identify the edges and communities, but the node list is too large for those tools. Hence I am trying to write a script to extract the edges and communities. The other columns are the connection start datetime and end datetime, with GPS locations.

Input:

Id,starttime,endtime,gps1,gps2

0022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00904b14b494,1073260804,1073265163,817558,439525
00904b14b494,1073260804,1073263786,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d1406df,1073260807,1073260878,820428,438735
00022d623dfe,1073260810,1073276346,819251,440006
00022d7317d7,1073260810,1073276155,819251,440006
00022d9064bc,1073260810,1073272525,819251,440006
00022d9064bc,1073260810,1073260999,819251,440006
00022d9064bc,1073260810,1073260857,819251,440006
0030650c9eda,1073260811,1073260813,820356,439224
00022d0e0cec,1073260813,1073262843,820187,439271
00022d176cf3,1073260813,1073260962,817721,439564
000c30d8d2e8,1073260813,1073260902,817721,439564
00904b243bc4,1073260813,1073260962,817721,439564
00904b2fc34d,1073260813,1073260962,817721,439564
00904b52b839,1073260813,1073260962,817721,439564
00904b9a5a51,1073260813,1073260962,817721,439564
00904ba8b682,1073260813,1073260962,817721,439564
00022d3be9cd,1073260815,1073261114,819269,439403
00022d80381f,1073260815,1073261114,819269,439403
00022dc1b09c,1073260815,1073261114,819269,439403
00022d36a6df,1073260817,1073260836,820761,438607
00022d36a6df,1073260817,1073260845,820761,438607
003065d2d8b6,1073260817,1073267560,817735,439757
00904b0c7856,1073260817,1073265149,817735,439757
00022de73863,1073260825,1073260879,817558,439525
00904b14b494,1073260825,1073260879,817558,439525
00904b312d9e,1073260825,1073260879,817558,439525
00022d15b1c7,1073260826,1073260966,820353,439280
00022dcbe817,1073260826,1073260966,820353,439280

I am trying to implement an undirected weighted / unweighted graph.

Answer

Use Pandas to get the data into a pairwise node listing, where each row represents an edge, based on your edge criteria. Then migrate into a networkx object for graph analysis.

The criteria for two nodes sharing an edge include:

  1. Same location. Assuming this means same gps1 AND gps2.
  2. "Near same start and end time". This is a little ambiguous. For the purposes of this answer I've reduced this criterion to "start time in the same 5-second interval". It shouldn't be too hard to extend the groupby approach I've taken here if you want to apply additional temporal conditions on edges.
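To make the second criterion concrete, here is a minimal sketch of how a handful of start times from the sample fall into 5-second intervals. It uses `dt.ceil` as a simplified stand-in for the grouping done later in this answer, so treat it as an illustration of the idea and not as the exact grouping logic.

```python
import pandas as pd

# A few start times from the sample input (seconds since the epoch).
starts = pd.Series(
    pd.to_datetime(
        [1073260801, 1073260803, 1073260804, 1073260807, 1073260810],
        unit="s",
    )
)

# Round each timestamp up to the end of its 5-second interval, so that
# 00:00:01, 00:00:03 and 00:00:04 share one label and 00:00:07 and
# 00:00:10 share the next one.
buckets = starts.dt.ceil("5s")

print(pd.DataFrame({"start": starts, "bucket": buckets}))
```

Two rows are edge candidates under this criterion only if they land in the same bucket and also share gps1 and gps2.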

Since we want to manipulate data based on timestamps, convert start and end to datetime dtype:

import pandas as pd

# df is assumed to hold the input data with columns ID, start, end, gps1, gps2
df.start = pd.to_datetime(df.start, unit="s")
df.end = pd.to_datetime(df.end, unit="s")

df.start.describe()

count                      35
unique                     11
top       2004-01-05 00:00:13
freq                        8
first     2004-01-05 00:00:01
last      2004-01-05 00:00:26
Name: start, dtype: object

df.head()

             ID               start                 end    gps1    gps2
0   0022d9064bc 2004-01-05 00:00:01 2004-01-05 00:00:03  819251  440006
1  00022d9064bc 2004-01-05 00:00:03 2004-01-05 00:00:10  819213  439954
2  00904b4557d3 2004-01-05 00:00:03 2004-01-05 00:18:40  817526  439458
3  00022de73863 2004-01-05 00:00:04 2004-01-05 01:16:50  817558  439525
4  00904b14b494 2004-01-05 00:00:04 2004-01-05 00:30:25  817558  439525
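The snippets in this answer work on a DataFrame named df with columns ID, start, end, gps1 and gps2, but the loading step is not shown. A minimal sketch of one way to get there, assuming the input is a CSV with the header from the question (Id, starttime, endtime, gps1, gps2); the rename and the choice to read the ID as text are assumptions:

```python
import io
import pandas as pd

# Three rows copied from the sample input; a real run would pass a file path
# to read_csv instead of this in-memory buffer.
raw = io.StringIO(
    "Id,starttime,endtime,gps1,gps2\n"
    "00022de73863,1073260804,1073265410,817558,439525\n"
    "00904b14b494,1073260804,1073262625,817558,439525\n"
    "00022d1406df,1073260807,1073260809,820428,438735\n"
)

# Read the ID as text so that leading zeros are never lost.
df = pd.read_csv(raw, dtype={"Id": str})

# Rename to the column names used in the rest of the answer.
df = df.rename(columns={"Id": "ID", "starttime": "start", "endtime": "end"})

# Epoch seconds -> datetime64.
df["start"] = pd.to_datetime(df["start"], unit="s")
df["end"] = pd.to_datetime(df["end"], unit="s")
```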

The sample observations happen within a few seconds of each other, so we'll set the grouping frequency to be only a few seconds:

near = "5s" 

Now groupby location and start time to find connected nodes:

edges = (
    df.groupby(
        ["gps1", "gps2", pd.Grouper(key="start", freq=near, closed="right", label="right")],
        as_index=False,
    )
    .agg({"ID": ",".join, "start": "min", "end": "max"})
    .reset_index()
    .rename(columns={"index": "edge", "start": "start_min", "end": "end_max"})
)
edges.ID = edges.ID.str.split(",")

edges.head():

   edge    gps1    gps2                                                 ID  \
0     0  817526  439458                                     [00904b4557d3]   
1     1  817558  439525  [00022de73863, 00904b14b494, 00904b14b494, 009...   
2     2  817558  439525         [00022de73863, 00904b14b494, 00904b312d9e]   
3     3  817721  439564  [00022d176cf3, 000c30d8d2e8, 00904b243bc4, 009...   
4     4  817735  439757                       [003065d2d8b6, 00904b0c7856]   

            start_min             end_max  
0 2004-01-05 00:00:03 2004-01-05 00:18:40  
1 2004-01-05 00:00:04 2004-01-05 01:16:50  
2 2004-01-05 00:00:25 2004-01-05 00:01:19  
3 2004-01-05 00:00:13 2004-01-05 00:02:42  
4 2004-01-05 00:00:17 2004-01-05 01:52:40 

Each row now represents a unique edge category. ID is a list of the nodes that all share that edge. It's a bit tricky to get this list into a new structure of node-pairs; I've resorted to some old-fashioned nested for-loops. There's likely some Pandas-fu that can improve efficiency here:

Note: In the case of a singleton node, I've assigned a None value to its pair. If you don't want to track singletons, just ignore the if not len(combos): ... logic.

from itertools import combinations

pairs = []
idx = 0
for e in edges.edge.values:
    # nodes and attributes belonging to this edge category
    nodes = edges.loc[edges.edge==e, "ID"].values[0]
    attrs = edges.loc[edges.edge==e, ["gps1","gps2","start_min","end_max"]]
    combos = list(combinations(nodes, 2))
    if not len(combos):
        # singleton node: no partner, so pair it with None
        pair = [e, nodes[0], None]
        pair.extend(attrs.values[0])
        pairs.append(pair)
        idx += 1
    else:
        # one row per unordered node pair
        for combo in combos:
            pair = [e, combo[0], combo[1]]
            pair.extend(attrs.values[0])
            pairs.append(pair)
            idx += 1
cols = ["edge","nodeA","nodeB","gps1","gps2","start_min","end_max"]
pairs_df = pd.DataFrame(pairs, columns=cols)

pairs_df.head():

   edge         nodeA         nodeB    gps1    gps2           start_min  \
0     0  00904b4557d3          None  817526  439458 2004-01-05 00:00:03   
1     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
2     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
3     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
4     1  00904b14b494  00904b14b494  817558  439525 2004-01-05 00:00:04   

              end_max  
0 2004-01-05 00:18:40  
1 2004-01-05 01:16:50  
2 2004-01-05 01:16:50  
3 2004-01-05 01:16:50  
4 2004-01-05 01:16:50      
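As for the "Pandas-fu" mentioned above, one possible alternative to the nested loops is to build the pair list per row and let `DataFrame.explode` (pandas 0.25 or later) fan it out. This is only a sketch on a toy edges table; the node names and the helper `to_pairs` are hypothetical, and it has not been benchmarked against the loop version.

```python
from itertools import combinations
import pandas as pd

# Toy stand-in for the edges table: one singleton group, one group of three.
edges = pd.DataFrame({
    "edge": [0, 1],
    "ID": [["a"], ["b", "c", "d"]],
    "gps1": [817526, 817558],
    "gps2": [439458, 439525],
})

def to_pairs(nodes):
    """Return all unordered node pairs; a singleton becomes (node, None)."""
    combos = list(combinations(nodes, 2))
    return combos if combos else [(nodes[0], None)]

# One row per pair: explode the per-row list of tuples.
exploded = (
    edges.assign(pair=edges["ID"].map(to_pairs))
         .explode("pair")
         .reset_index(drop=True)
)

# Split each (nodeA, nodeB) tuple into two columns.
pair_cols = pd.DataFrame(
    exploded["pair"].tolist(), columns=["nodeA", "nodeB"], index=exploded.index
)
pairs_df = pd.concat([exploded.drop(columns=["ID", "pair"]), pair_cols], axis=1)
```

The remaining columns (here edge, gps1, gps2) are carried along automatically, so any start_min and end_max columns in the real table would follow each pair in the same way.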

Now the data can be fit to a networkx object:

import networkx as nx

# from_pandas_dataframe was renamed to from_pandas_edgelist in networkx 2.0
g = nx.from_pandas_edgelist(pairs_df, "nodeA", "nodeB", edge_attr=True)

# access edge attributes by node pairing:
test_A = "00022de73863"
test_B = "00904b14b494"
g[test_A][test_B]["start_min"]
# output:
Timestamp('2004-01-05 00:00:25')
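Note that a plain networkx Graph keeps a single edge per node pair, so when the same pair appears in several rows the later row's attributes replace the earlier ones. If you want the weighted variant from the question, one option (an assumption on my part, not the only sensible weighting) is to count how many times each pair co-occurs and store that count as the edge weight. A minimal sketch on toy data, using `from_pandas_edgelist` (the function name in networkx 2.0 and later):

```python
import networkx as nx
import pandas as pd

# Toy pair table: a-b co-occur twice, b-c once, d is a singleton.
pairs_df = pd.DataFrame({
    "nodeA": ["a", "b", "b", "d"],
    "nodeB": ["b", "a", "c", None],
})

# Drop singleton rows, then sort each pair so (a, b) and (b, a) collapse.
valid = pairs_df.dropna(subset=["nodeA", "nodeB"])
ordered = pd.DataFrame(
    [sorted(p) for p in zip(valid["nodeA"], valid["nodeB"])],
    columns=["nodeA", "nodeB"],
)

# Weight = number of rows in which the pair appears.
weights = ordered.groupby(["nodeA", "nodeB"]).size().reset_index(name="weight")

g = nx.from_pandas_edgelist(weights, "nodeA", "nodeB", edge_attr="weight")

# Keep singletons in the graph as isolated nodes.
g.add_nodes_from(pairs_df["nodeA"])
```

For the unweighted variant, the same graph can be used and the weight attribute simply ignored.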

For community detection, there are several options. Consider the networkx community algorithms, as well as the community module, which builds off of native networkx functionality.

I read your question as mainly concerned with manipulating your data into a format suitable for network analysis. As this answer is lengthy enough already, I'll leave it to you to pursue community detection strategies - several methods can be used out-of-the-box with the modules I've linked to here.
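As a starting point, here is a minimal sketch of one of the built-in networkx options, `greedy_modularity_communities`. The toy graph (two triangles joined by a single edge) and its node names are made up for illustration; on the real data you would pass the graph built from the pairs table, and other algorithms may suit it better.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two tightly connected triangles joined by one bridge edge.
g = nx.Graph()
g.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # first triangle
    ("x", "y"), ("y", "z"), ("x", "z"),   # second triangle
    ("c", "x"),                           # bridge between them
])

# Each community is returned as a frozenset of node labels.
communities = greedy_modularity_communities(g)

# Map every node to the index of its community for later lookups.
membership = {node: i for i, nodes in enumerate(communities) for node in nodes}

for i, nodes in enumerate(communities):
    print(i, sorted(nodes))
```

The membership dict can then be joined back onto the original table by ID if you want to inspect communities alongside the GPS and time columns.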
