Row-wise unions in pandas groupby

2024/10/11 6:29:17

I have a large data frame that looks like so (and is copy-pasteable with df=pd.read_clipboard(sep='\s\s+'):

    user_nm    month    unique_ips  shifted_ips     halves  quarters    mo_pairs100118231   2   set([142.136])  set([])         h1  q1  p1100118231   3   set([142.136])  set([142.136])  h1  q1  p2100118231   6   set([108.0])    set([142.136])  h1  q2  p3100118231   7   set([108.0])    set([108.0])    h2  q3  p4100118231   8   set([142.136])  set([108.0])    h2  q3  p4100118231   9   set([142.136])  set([142.136])  h2  q3  p5100118231   10  set([142.136])  set([142.136])  h2  q4  p5100118231   11  set([142.136])  set([142.136])  h2  q4  p6100406016   3   set([50.192])   set([])         h1  q1  p2100406016   7   set([50.192])   set([50.192])   h2  q3  p4

for each user, I want to group by halves (or quarters, or mo_pairs) and get the union of unique_ips and shifted_ips.

I can groupby the fields like so:

In [265]: a=df.groupby(['user_nm','halves'])In [266]: a.head()
Out[266]: user_nm month   unique_ips  shifted_ips halves  quarters    mo_pairs
user_nm halves                              
100118231   h1  0   100118231   2   set([142.136])  set([]) h1  q1  p11   100118231   3   set([142.136])  set([142.136])  h1  q1  p22   100118231   6   set([108.0])    set([142.136])  h1  q2  p3h2  3   100118231   7   set([108.0])    set([108.0])    h2  q3  p44   100118231   8   set([142.136])  set([108.0])    h2  q3  p45   100118231   9   set([142.136])  set([142.136])  h2  q3  p56   100118231   10  set([142.136])  set([142.136])  h2  q4  p57   100118231   11  set([142.136])  set([142.136])  h2  q4  p6
100406016   h1  8   100406016   3   set([50.192])   set([]) h1  q1  p2h2  9   100406016   7   set([50.192])   set([50.192])   h2  q3  p4

However, when I attempt to union these rows, I get an error:

In [267]: a.apply(lambda x: x[2] & x[3], axis=1)
TypeError: <lambda>() got an unexpected keyword argument 'axis'

Ideally, I would like something like this:

                  unique_ips    shifted_ips
user_nm   halves        
100118231   h1  set([142.136, 108.0])   set([142.136])
100118231   h2  set([142.136,108.0])    set([142.136,108.0])
100406016   h1  set([50.192])           set([])
100406016   h2  set([50.192])           set([50.192])

I've also tried set_index, but that does not group the dataframe appropriately

b=df.set_index(['user_nm','halves'])

This seems like a relatively simple task, what am I missing?

Answer

The short answer to this is you need to use aggregate method while reducing groupyby object Pandas GroupBy Aggregate.

Now the following snippet should solve your problem

Properly handling set while reading : the elements were coming out as str and not set

df.unique_ips = df.unique_ips.apply(eval)
df.shifted_ips = df.shifted_ips.apply(eval)

Doing the groupby

grouped = df.groupby(['user_nm', 'halves'])
my_lambda = lambda x: reduce(set.union, x)
output = grouped.aggregate({'unique_ips': my_lambda,'shifted_ips': my_lambda})

The result being:

                             unique_ips            shifted_ips
user_nm   halves                                              
100118231 h1      set([142.136, 108.0])         set([142.136])h2      set([142.136, 108.0])  set([142.136, 108.0])
100406016 h1              set([50.192])                set([])h2              set([50.192])          set([50.192])
https://en.xdnf.cn/q/69804.html

Related Q&A

Add seaborn.palplot axes to existing figure for visualisation of different color palettes

Adding seaborn figures to subplots is usually done by passing ax when creating the figure. For instance:sns.kdeplot(x, y, cmap=cmap, shade=True, cut=5, ax=ax)This method, however, doesnt apply to seabo…

Running Django with Run can not find LESS CSS

I have a Django project that uses buildout. When running or debugging the application it runs fine by using my buildout script. I also use django-compressor to compress and compile my LESS files. I ins…

OpenCV + python -- grab frames from a video file

I cant seem to capture frames from a file using OpenCV -- Ive compiled from source on Ubuntu with all the necessary prereqs according to: http://opencv.willowgarage.com/wiki/InstallGuide%20%3A%20Debia…

python 3.1 - Creating normal distribution

I have scipy and numpy, Python v3.1I need to create a 1D array of length 3million, using random numbers between (and including) 100-60,000. It has to fit a normal distribution. Using a = numpy.random.…

Faster way to iterate all keys and values in redis db

I have a db with about 350,000 keys. Currently my code just loops through all keys and gets its value from the db.However this takes almost 2 minutes to do, which seems really slow, redis-benchmark gav…

How to store a floating point number as text without losing precision?

Like the question says. Converting to / from the (truncated) string representations can affect their precision. But storing them in other formats like pickle makes them unreadable (yes, I want this too…

Integer in python/pandas becomes BLOB (binary) in sqlite

Storing an integer in sqlite results in BLOBs (binary values) instead of INTEGER in sqlite. The problem is the INT in the "Baujahr" column. The table is created. CREATE TABLE "Objekt&quo…

Calling Scrapy Spider from Django

I have a project with a django and scrapy folder in the same workspace:my_project/django_project/django_project/settings.pyapp1/app2/manage.py...scrapy_project/scrapy_project/settings.pyscrapy.cfg...Iv…

Python Threading: Multiple While True loops

Do you guys have any recommendations on what python modules to use for the following application: I would like to create a daemon which runs 2 threads, both with while True: loops. Any examples would b…

Visual Studio Code - input function in Python

I am trying out Visual Studio Code, to learn Python.I am writing a starter piece of code to just take an input from the user, say:S = input("Whats your name? ")When I try to run this (Mac: C…