Grouping and comparing groups using pandas

2024/10/6 22:24:16

I have data that looks like:

Identifier  Category1 Category2 Category3 Category4 Category5
1000           foo      bat       678         a.x       ld
1000           foo      bat       78          l.o       op
1000           coo      cat       678         p.o       kt
1001           coo      sat       89          a.x       hd
1001           foo      bat       78          l.o       op
1002           foo      bat       678         a.x       ld
1002           foo      bat       78          l.o       op
1002           coo      cat       678         p.o       kt

What I am trying to do is compare 1000 to 1001, to 1002, and so on. The output I want the code to give is: 1000 is the same as 1002. The approach I wanted to use was:

  1. First group all the identifier items into separate dataframes (maybe?). For example, df1 would be all rows pertaining to identifier 1000 and df2 would be all rows pertaining to identifier 1002. (**Please note that I want the code to do this itself, as there are millions of rows, as opposed to me writing code to manually compare identifiers**). I have tried the groupby feature of pandas. It handles the grouping well, but I do not know how to compare the resulting groups.
  2. Compare each of the groups/sub-data frames.

One method I was thinking of was reading each row of a particular identifier into an array/vector and comparing arrays/vectors using a comparison metric (Manhattan distance, cosine similarity etc).

Any help is appreciated, I am very new to Python. Thanks in advance!

Answer

You could do something like the following:

import pandas as pd

input_file = pd.read_csv("input.csv")
columns = ['Category1', 'Category2', 'Category3', 'Category4', 'Category5']

duplicate_entries = {}
for group in input_file.groupby('Identifier'):
    # transforming to tuples so that it can be used as keys on a dict
    lines = [tuple(y) for y in group[1].loc[:, columns].values.tolist()]
    key = tuple(lines)
    if key not in duplicate_entries:
        duplicate_entries[key] = []
    duplicate_entries[key].append(group[0])

The values of duplicate_entries will then hold the lists of Identifiers whose rows are identical:

list(duplicate_entries.values())
> [[1000, 1002], [1001]]
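
For anyone who wants to try this without an input.csv, here is a minimal self-contained sketch that builds the sample data from the question in memory. Two things in it are my own assumptions rather than part of the answer above: the helper name `group_identical_identifiers` is purely illustrative, and each group's rows are sorted before the key is built so that two identifiers with the same rows in a different order still match. Drop the `sorted()` call if row order matters for your data.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Identifier": [1000, 1000, 1000, 1001, 1001, 1002, 1002, 1002],
    "Category1": ["foo", "foo", "coo", "coo", "foo", "foo", "foo", "coo"],
    "Category2": ["bat", "bat", "cat", "sat", "bat", "bat", "bat", "cat"],
    "Category3": [678, 78, 678, 89, 78, 678, 78, 678],
    "Category4": ["a.x", "l.o", "p.o", "a.x", "l.o", "a.x", "l.o", "p.o"],
    "Category5": ["ld", "op", "kt", "hd", "op", "ld", "op", "kt"],
})


def group_identical_identifiers(frame, id_col="Identifier"):
    """Return lists of identifiers whose rows are identical (hypothetical helper)."""
    value_cols = [c for c in frame.columns if c != id_col]
    groups = {}
    for identifier, rows in frame.groupby(id_col):
        # Plain tuples are hashable, so a tuple of row tuples can be a dict key.
        # sorted() is an assumption: it makes the key independent of row order.
        key = tuple(sorted(rows[value_cols].itertuples(index=False, name=None)))
        groups.setdefault(key, []).append(identifier)
    return list(groups.values())


result = group_identical_identifiers(df)
print(result)
```

On the sample data this puts 1000 and 1002 in one list and 1001 in a list by itself. Note that sorting assumes the values in each column are mutually comparable (for example, no mix of strings and missing values in the same column).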

EDIT:

To get only the entries that have duplicates, you could have something like:

all_dup = [dup for dup in duplicate_entries.values() if len(dup) > 1]

To explain the indices (sorry I didn't explain this earlier): iterating over the result of df.groupby yields tuples whose first entry is the group key (here, the 'Identifier' value) and whose second entry is a DataFrame holding that group's rows. So group[1] gives the rows to compare, and group[0] gives the 'Identifier' for that group. Since the duplicate_entries dict should list identifiers, group[0] is what gets appended.
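
As a small illustration of that point, the sketch below unpacks each item from the groupby iteration into two names instead of indexing with [0] and [1]. The tiny DataFrame and the variable names `identifier`, `rows`, and `seen` are made up for the example.

```python
import pandas as pd

# Tiny made-up frame, just to show what groupby iteration yields
df = pd.DataFrame({
    "Identifier": [1000, 1000, 1001],
    "Category1": ["foo", "coo", "coo"],
})

seen = []
for identifier, rows in df.groupby("Identifier"):
    # identifier corresponds to group[0] (the group key);
    # rows corresponds to group[1] (a DataFrame with that group's rows)
    seen.append((identifier, type(rows).__name__, len(rows)))

print(seen)
```

Each entry in `seen` records the key, the type name of the second element ("DataFrame"), and how many rows that group has.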
