Fastest way to extract dictionary of sums in numpy in 1 I/O pass

2024/10/10 3:27:16

Let's say I have an array like:

arr = np.array([[1,20,5],[1,20,8],[3,10,4],[2,30,6],[3,10,5]])

and I would like to build a dictionary that maps each distinct value in the first column to the sum of the third column over the rows that have that value, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique values in the first column.

Approach 1: Naively, I can call np.unique() and then, for each unique value, combine np.where() and np.sum() inside a one-line dict comprehension. This would be reasonably fast with a small number of unique values, but with 100k unique values it makes 100k I/O passes over the entire array and wastes a lot of page fetches.
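For reference, Approach 1 might look roughly like the sketch below. The name naive is only illustrative, and the small example array stands in for the real one.

```python
import numpy as np

arr = np.array([[1, 20, 5], [1, 20, 8], [3, 10, 4], [2, 30, 6], [3, 10, 5]])

# One full scan of column 0 per unique key: acceptable for a handful of keys,
# but 100k keys means 100k passes over the whole array.
naive = {
    int(k): int(np.sum(arr[np.where(arr[:, 0] == k)[0], 2]))
    for k in np.unique(arr[:, 0])
}
print(naive)  # {1: 13, 2: 6, 3: 9}
```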

Approach 2: I could instead make a single I/O pass over the array, hashing the column-1 value of each row and adding that row's last-column value to the key's running total (hashing at each row will probably be cheaper than the excessive page fetches), but then I lose the advantage of numpy's vectorized C inner loop.
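A minimal sketch of Approach 2 as a plain Python loop, shown only as the baseline to beat. The names totals and result are illustrative and do not come from the question.

```python
from collections import defaultdict

import numpy as np

arr = np.array([[1, 20, 5], [1, 20, 8], [3, 10, 4], [2, 30, 6], [3, 10, 5]])

# Single pass over the rows: hash the key, add the value to its running total.
# Every iteration runs in the Python interpreter, so this is slow for 1e9 rows.
totals = defaultdict(int)
for key, value in zip(arr[:, 0], arr[:, 2]):
    totals[int(key)] += int(value)

result = dict(totals)
```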

Is there a fast way to implement Approach 2 without resorting to a pure Python loop?

Answer

numpy approach:

u = np.unique(arr[:, 0])
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)
dict(np.stack([u, s]).T)

{1: 13, 2: 6, 3: 9}
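Note that the broadcast comparison above materializes an intermediate of shape (number of rows, number of unique values), which is not practical at 1 billion rows by 100k keys. As a possible alternative that is not part of the original answer, the sketch below uses np.unique with return_inverse=True to map each row to a group index and np.bincount with weights to accumulate the sums in compiled code. The names keys, inverse, sums and result are illustrative.

```python
import numpy as np

arr = np.array([[1, 20, 5], [1, 20, 8], [3, 10, 4], [2, 30, 6], [3, 10, 5]])

# keys: sorted unique values of column 0
# inverse: for each row, the index of its key within `keys`
keys, inverse = np.unique(arr[:, 0], return_inverse=True)

# Weighted bincount adds column 2 into one bucket per key.
# With weights the result is float64, so integer sums are exact only below 2**53.
sums = np.bincount(inverse, weights=arr[:, 2])

result = {int(k): int(s) for k, s in zip(keys, sums)}
print(result)  # {1: 13, 2: 6, 3: 9}
```

np.unique sorts the key column, so this is not strictly a single pass, but memory stays linear in the number of rows. If the keys are already small non-negative integers, np.bincount(arr[:, 0], weights=arr[:, 2]) can skip the np.unique step, at the cost of an output array as long as the largest key plus one.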

pandas approach:

import pandas as pd
import numpy as np

pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()

{1: 13, 2: 6, 3: 9}
