How to count distinct values in a combination of columns while grouping by in pandas?

2024/10/13 6:19:33

I have a pandas data frame. I want to group it by using one combination of columns and count distinct values of another combination of columns.

For example I have the following data frame:

   a   b    c     d      e
0  1  10  100  1000  10000
1  1  10  100  1000  20000
2  1  20  100  1000  20000
3  1  20  100  2000  20000

I can group it by columns a and b and count distinct values in the column d:

df.groupby(['a','b'])['d'].nunique().reset_index()

As a result I get:

   a   b  d
0  1  10  1
1  1  20  2

However, I would like to count distinct values in a combination of columns. For example if I use c and d, then in the first group I have only one unique combination ((100, 1000)) while in the second group I have two distinct combinations: (100, 1000) and (100, 2000).

The following naive "generalization" does not work:

df.groupby(['a','b'])[['c','d']].nunique().reset_index()

because nunique() is not applicable to data frames.

Answer

You can create combination of values converting to string to new column e and then use SeriesGroupBy.nunique:

df['e'] = df.c.astype(str) + df.d.astype(str)
df = df.groupby(['a','b'])['e'].nunique().reset_index()
print (df)a   b  e
0  1  10  1
1  1  20  2

You can also use Series without creating new column:

df =(df.c.astype(str)+df.d.astype(str)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)a   b  f
0  1  10  1
1  1  20  2

Another posible solution is create tuples:

df=(df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)a   b  f
0  1  10  1
1  1  20  2

Another numpy solution by this answer:

def f(x):a = x.valuesc = len(np.unique(np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1]))), return_counts=True)[1])return cprint (df.groupby(['a','b'])[['c','d']].apply(f))

Timings:

#[1000000 rows x 5 columns]
np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(30, size=(N,5)))
df.columns = list('abcde')
print (df)  In [354]: %timeit (df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum()))
1 loop, best of 3: 663 ms per loopIn [355]: %timeit (df.groupby(['a','b'])[['c','d']].apply(f))
1 loop, best of 3: 387 ms per loopIn [356]: %timeit (df.groupby(['a', 'b', 'c', 'd']).size().groupby(level=['a', 'b']).size())
1 loop, best of 3: 441 ms per loopIn [357]: %timeit ((df.c.astype(str)+df.d.astype(str)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 4.95 s per loopIn [358]: %timeit ((df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 17.6 s per loop
https://en.xdnf.cn/q/69563.html

Related Q&A

Set Environmental Variables in Python with Popen

I want to set an environmental variable in linux terminal through a python script. I seem to be able to set environmental variables when using os.environ[BLASTDB] = /path/to/directory .However I was in…

Python - pandas - Append Series into Blank DataFrame

Say I have two pandas Series in python:import pandas as pd h = pd.Series([g,4,2,1,1]) g = pd.Series([1,6,5,4,"abc"])I can create a DataFrame with just h and then append g to it:df = pd.DataFr…

How to retrieve values from a function run in parallel processes?

The Multiprocessing module is quite confusing for python beginners specially for those who have just migrated from MATLAB and are made lazy with its parallel computing toolbox. I have the following fun…

SignalR Alternative for Python

What would be an alternative for SignalR in Python world?To be precise, I am using tornado with python 2.7.6 on Windows 8; and I found sockjs-tornado (Python noob; sorry for any inconveniences). But s…

Python Variable Scope and Classes

In Python, if I define a variable:my_var = (1,2,3)and try to access it in __init__ function of a class:class MyClass:def __init__(self):print my_varI can access it and print my_var without stating (glo…

How to check if valid excel file in python xlrd library

Is there any way with xlrd library to check if the file you use is a valid excel file? I know theres other libraries to check headers of files and I could use file extension check. But for the sake of…

ValueError: Invalid file path or buffer object type: class tkinter.StringVar

Here is a simplified version of some code that I have. In the first frame, the user selects a csv file using tk.filedialog and it is meant to be plotted on the same frame on the canvas. There is also a…

Is there a way to reopen a socket?

I create many "short-term" sockets in some code that look like that :nb=1000 for i in range(nb):sck = socket.socket(socket.AF_INET, socket.SOCK_STREAM)sck.connect((adr, prt)sck.send(question …

How django handles simultaneous requests with concurrency over global variables?

I have a django instance hosted via apache/mod_wsgi. I use pre_save and post_save signals to store the values before and after save for later comparisons. For that I use global variables to store the p…

Why cant I string.print()?

My understanding of the print() in both Python and Ruby (and other languages) is that it is a method on a string (or other types). Because it is so commonly used the syntax:print "hi"works.S…