Pandas report top-n in group and pivot

2024/9/26 2:12:05

I am trying to summarise a DataFrame by grouping along a single dimension d1 and reporting summary statistics for each element of d1. In particular, I am interested in the top n (index and values) for a number of metrics. What I would like to produce is one row for each element of d1.

Say I have two dimensions, d1 and d2, and four metrics, m1, m2, m3 and m4.

1) What is the suggested way of grouping by d1 and finding the top n d2 and metric values for each of the metrics m1 to m4?

In Wes's book Python for Data Analysis he suggests (page 35):

def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]
grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)

Is that still the recommended way? (I am only interested in, say, the top 5 d2 out of thousands, and for multiple metrics.)

2) The next problem is that I want to pivot the top 5, i.e. so that I have one row for each element of d1.

So the resulting DataFrame should look like this for dimensions d1, d2 and metric m1: index d1, with columns for the top 5 values of d2 and the corresponding values of m1:

d1 d2-1 d2-2 d2-3 d2-4 d2-5 m1-1 m1-2 m1-3 m1-4 m1-5

....

So to pivot I have to create the ranking along d2 (i.e. 1 to 5; this is my columns field). This would be easy if I always had 5 entries, but occasionally there are fewer than 5 elements of d2 for a given value of d1.

So could someone suggest how to add a ranking to the grouping, so that I have the correct column index to perform the pivot?

Answer

I don't have any toy data to use or expected results to compare to, but I think you want the following:

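N = 1000
names = my_fake_data_loader()  # placeholder: load your own DataFrame here
grouped = names.groupby(['year', 'sex'])
# sort_values(by=...) is used here because newer pandas no longer accepts sort_index(by=...)
top = grouped.apply(lambda g: g.sort_values(by='births', ascending=False).head(N))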

That will give you the first 1000 rows of each group, sorted by births in descending order.
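For the second part of the question (ranking within each group and pivoting to one row per d1), something along these lines might work. This is only a sketch under assumptions: the helper name top_n_wide, the temporary "rank" column and the toy data are made up for illustration, and ties are broken in whatever order the sort leaves them. It numbers the rows within each group with cumcount, so groups with fewer than n rows simply end up with NaN in the missing slots.

```python
import pandas as pd

def top_n_wide(df, group_col, item_col, metric, n=5):
    """One row per group, with the top-n items and their metric values as columns."""
    # Largest metric values first within each group, then keep at most n rows per group.
    top = (
        df.sort_values([group_col, metric], ascending=[True, False])
          .groupby(group_col)
          .head(n)
          .copy()
    )
    # cumcount numbers the rows 0, 1, 2, ... within each group; +1 gives ranks 1..n.
    # "rank" is a hypothetical helper column; rename it if df already has one.
    top["rank"] = top.groupby(group_col).cumcount() + 1
    # Move the rank level into the columns; short groups get NaN for missing ranks.
    wide = top.set_index([group_col, "rank"])[[item_col, metric]].unstack("rank")
    # Flatten the (name, rank) column pairs into labels such as "d2-1" and "m1-1".
    wide.columns = [f"{name}-{rank}" for name, rank in wide.columns]
    return wide

# Toy data (invented): group "b" has only one row, so it has fewer than n entries.
df = pd.DataFrame({
    "d1": ["a", "a", "a", "b"],
    "d2": ["x", "y", "z", "x"],
    "m1": [1.0, 3.0, 2.0, 5.0],
    "m2": [9.0, 8.0, 7.0, 6.0],
})

wide = top_n_wide(df, "d1", "d2", "m1", n=2)

# For several metrics, one option is to build one block per metric and
# concatenate them side by side, keyed by the metric name.
by_metric = pd.concat(
    {m: top_n_wide(df, "d1", "d2", m, n=2) for m in ["m1", "m2"]},
    axis=1,
)
```

Note that if no group has n rows, the result will have fewer than n rank columns; reindexing the columns afterwards would be one way to force a fixed layout.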

