Let's say I have an array like:
import numpy as np

arr = np.array([[1, 20, 5], [1, 20, 8], [3, 10, 4], [2, 30, 6], [3, 10, 5]])
and I would like to build a dictionary mapping each unique value in the first column to the sum of the third column over the rows where it appears, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique values in the first column.
Approach 1: Naively, I can call np.unique() on the first column, then iterate over the unique values, combining np.where() and np.sum() in a one-line dict comprehension. This would be reasonably fast for a small number of unique values, but at 100k of them I incur a lot of wasted page fetches, since np.where() makes 100k full I/O passes over the entire array.
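A minimal sketch of what I mean, on the toy array above (this is the pattern, not a tuned implementation):

keys = np.unique(arr[:, 0])
# One full boolean scan of arr per unique key: fine for a few keys,
# but 100k scans of a billion-row array at scale.
sums = {int(k): int(arr[np.where(arr[:, 0] == k)[0], 2].sum()) for k in keys}
# sums == {1: 13, 2: 6, 3: 9}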
Approach 2: I could instead make a single I/O pass over the array, accumulating the last column into a dict keyed on the first column (hashing column 1 at every row will probably still be cheaper than the excessive page fetches), but I lose the advantage of numpy's C inner-loop vectorization here.
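Concretely, the pure Python version I'd like to avoid looks something like this (a throwaway sketch):

from collections import defaultdict

sums = defaultdict(int)
for row in arr:                       # single pass, but a Python-level loop
    sums[int(row[0])] += int(row[2])  # hash column 1, accumulate column 3
# dict(sums) == {1: 13, 2: 6, 3: 9}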
Is there a fast way to implement Approach 2 without resorting to a pure Python loop?