pandas: Group by splitting string value in all rows (a column) and aggregation function

2024/9/26 2:17:25

If i have dataset like this:

id   person_name                       salary
0    [alexander, william, smith]       45000
1    [smith, robert, gates]            65000
2    [bob, alexander]                  56000
3    [robert, william]                 80000
4    [alexander, gates]                70000

If we sum that salary column then we will get 316000

I really want to know how much person who named 'alexander, smith, etc' (in distinct) makes in salary if we sum all of the salaries from its splitting name in this dataset (that contains same string value).

output:

group               sum_salary
alexander           171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william             125000 #sum from id 0 + 3
smith               110000 #sum from id 0 + 1
robert              145000 #sum from id 1 + 3
gates               135000 #sum from id 1 + 4
bob                  56000 #sum from id 2

as we see the sum of sum_salary columns is not the same as the initial dataset. all because the function requires double counting.

I thought it seems familiar like string count, but what makes me confuse is the way we use aggregation function. I've tried creating a new list of distinct value in person_name columns, then stuck comes.

Any help is appreciated, Thank you very much

Answer

Solutions working with lists in column person_name:

#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')print (type(df.loc[0, 'person_name']))
<class 'list'>

First idea is use defaultdict for store sumed values in loop:

from collections import defaultdictd = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):for x in p:d[x] += int(s)print (d)
defaultdict(<class 'int'>, {'alexander': 171000, 'william': 125000, 'smith': 110000, 'robert': 145000, 'gates': 135000, 'bob': 56000})

And then:

df1 = pd.DataFrame({'group':list(d.keys()),'sum_salary':list(d.values())})
print (df1)group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000

Another solution with repeating values by length of lists and aggregate sum:

from itertools import chaindf1 = pd.DataFrame({'group' : list(chain.from_iterable(df['person_name'].tolist())), 'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000
https://en.xdnf.cn/q/71508.html

Related Q&A

Seaborn Title Position

The position of my graph title is terrible on this jointplot. Ive tried moving the loc = left, right, and center but it doesnt move from the position its in. Ive also tried something like ax.title.set_…

Expand/collapse ttk Treeview branch

I would like to know the command for collapsing and expanding a branch in ttk.Treeview.Here is a minimalistic example code:#! coding=utf-8 import tkinter as tk from tkinter import ttkroot = tk.Tk() tre…

Uploading images to s3 with meta = image/jpeg - python/boto3

How do I go about setting ContentType on images that I upload to AWS S3 using boto3 to content-type:image/jpeg?Currently, I upload images to S3 using buto3/python 2.7 using the following command:s3.up…

How to use win environment variable pathlib to save files?

Im trying to use win environment variable like %userprofile%\desktop with pathlib to safe files in different users PC.But Im not able to make it work, it keep saving in on the running script dir.import…

Difference between starting firestore emulator through `firebase` and `gcloud`?

What is the difference between starting the firestore emulator through: firebase emulators:start --only firestoreand: gcloud beta emulators firestore startBoth options allow my python app to achieve co…

PyInstaller icon option doesnt work on Mac

I ran the following command on my mac and created an .app file.pyinstaller --icon icon.icns --noconsole -n testApp main.pyHowever, the generated .app file does not show the icon.icon.icns is specified …

Error Installing scikit-learn

When trying to install scikit-learn, I get the following error:Exception:Traceback (most recent call last):File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2…

Issues downloading Graphlab dependencies get_dependencies()

I am having trouble when I try to download the dependencies needed to run graphlab. I do import graphlab I get the following:ACTION REQUIRED: Dependencies libstdc++-6.dll and libgcc_s_seh-1.dll not fou…

Django Tastypie slow POST response

Im trying to implement a Tastypie Resource that allows GET & POST operations following a per user-permission policy, the model is pretty simple (similar to the Note model in Tastypie documentation)…

Extract a region of a PDF page by coordinates

I am looking for a tool to extract a given rectangular region (by coordinates) of a 1-page PDF file and produce a 1-page PDF file with the specified region:# in.pdf is a 1-page pdf file extract file.pd…