Group by column in pandas dataframe and average arrays

2024/10/5 11:14:34

I have a movie dataframe with movie names, their respective genres, and vector representations (numpy arrays).

ID  Year    Title                         Genre             Word Vector
1   2003.0  Dinosaur Planet               Documentary       [-0.55423898, -0.72544044, 0.33189204, -0.1720...
2   2004.0  Isle of Man TT 2004 Review    Sports & Fitness  [-0.373265237, -1.07549703, -0.469254494, -0.4...
3   1997.0  Character                     Foreign           [-1.57682264, -0.91265768, 2.43038678, -0.2114...
4   1994.0  Paula Abdul's Get Up & Dance  Sports & Fitness  [0.3096168, -0.57186663, 0.39008939, 0.2868615...
5   2004.0  The Rise and Fall of ECW      Sports & Fitness  [0.17175879, -2.38005066, -0.45771399, 1.32608...

I'd like to group by genre and get each genre's average vector representation (the component-wise average of the movie vectors in that genre).


I first tried:

movie_df.groupby(['Genre']).mean()

But the built-in mean function isn't able to take the mean of numpy arrays.

I tried creating my own function to do so and then apply it to each group, but I'm not sure this is using apply correctly:

def vector_average(group):
    series_to_array = np.array(group.tolist())
    return np.mean(series_to_array, axis=0)

movie_df.groupby(['Genre']).apply(vector_average)

Any pointers would be appreciated!

Answer

If I understand correctly, to get the component-wise averages you can simply apply np.mean to the 'Word Vector' SeriesGroupBy explicitly.

df.groupby('Genre')['Word Vector'].apply(np.mean)

Demo

>>> df = pd.DataFrame({'Title': list('ABCDEFGHIJ'),
...                    'Genre': list('ABCBBDCDED'),
...                    'Word Vector': [np.random.randint(0, 10, 10) for _ in range(len('ABCDEFGHIJ'))]})
>>> df
  Genre Title                     Word Vector
0     A     A  [3, 6, 8, 0, 4, 8, 1, 4, 0, 1]
1     B     B  [5, 4, 4, 4, 8, 7, 4, 3, 7, 2]
2     C     C  [1, 7, 6, 7, 3, 3, 8, 1, 8, 1]
3     B     D  [0, 4, 6, 7, 1, 5, 5, 0, 6, 7]
4     B     E  [8, 2, 1, 4, 1, 2, 0, 4, 9, 1]
5     D     F  [7, 9, 7, 8, 8, 7, 2, 9, 1, 3]
6     C     G  [0, 7, 1, 9, 6, 2, 1, 0, 3, 7]
7     D     H  [4, 7, 9, 4, 1, 5, 0, 3, 0, 6]
8     E     I  [5, 1, 5, 1, 8, 1, 1, 4, 5, 6]
9     D     J  [7, 9, 0, 1, 8, 3, 8, 8, 1, 0]
>>> df.groupby('Genre')['Word Vector'].apply(np.mean)
Genre
A    [3.0, 6.0, 8.0, 0.0, 4.0, 8.0, 1.0, 4.0, 0.0, ...
B    [4.33333333333, 3.33333333333, 3.66666666667, ...
C    [0.5, 7.0, 3.5, 8.0, 4.5, 2.5, 4.5, 0.5, 5.5, ...
D    [6.0, 8.33333333333, 5.33333333333, 4.33333333...
E    [5.0, 1.0, 5.0, 1.0, 8.0, 1.0, 1.0, 4.0, 5.0, ...
Name: Word Vector, dtype: object
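
As a more explicit alternative, here is a minimal sketch that stacks each genre's vectors into a 2-D array and averages over the rows. It assumes every vector has the same length. The toy data and the names genre_means and mean_table are made up for illustration. Note that the function receives only the 'Word Vector' Series for each group; in the question's attempt, apply on the full groupby passes the whole group DataFrame, which has no tolist method.

```python
import numpy as np
import pandas as pd


def vector_average(vectors: pd.Series) -> np.ndarray:
    """Component-wise mean of a Series whose elements are equal-length arrays."""
    # Stack into shape (n_movies, n_dims), then average down the rows.
    return np.stack(vectors.tolist()).mean(axis=0)


# Hypothetical toy data standing in for the movie dataframe.
df = pd.DataFrame({
    "Genre": ["A", "B", "B", "C"],
    "Word Vector": [
        np.array([1.0, 2.0, 3.0]),
        np.array([0.0, 4.0, 8.0]),
        np.array([2.0, 0.0, 4.0]),
        np.array([5.0, 5.0, 5.0]),
    ],
})

# Iterating a SeriesGroupBy yields (group key, Series) pairs.
genre_means = {
    genre: vector_average(vectors)
    for genre, vectors in df.groupby("Genre")["Word Vector"]
}

# Optional: one row per genre, one numeric column per vector component.
mean_table = pd.DataFrame.from_dict(genre_means, orient="index")
print(mean_table)
```

Building a plain numeric table this way avoids keeping arrays inside an object-dtype column, which can make later numeric operations easier.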