Question 1

I have data that looks like:

Identifier  Category1 Category2 Category3 Category4 Category5
1000           foo      bat       678         a.x       ld
1000           foo      bat       78          l.o       op
1000           coo      cat       678         p.o       kt
1001           coo      sat       89          a.x       hd
1001           foo      bat       78          l.o       op
1002           foo      bat       678         a.x       ld
1002           foo      bat       78          l.o       op
1002           coo      cat       678         p.o       kt

What I am trying to do is compare 1000 to 1001, to 1002, and so on. The output I want the code to give is: 1000 is the same as 1002. The approach I wanted to use was:

1. First group all the identifier items into separate dataframes (maybe?). For example, df1 would be all rows pertaining to identifier 1000 and df2 would be all rows pertaining to identifier 1002. (Please note that I want the code to do this itself, as there are millions of rows, as opposed to me writing code to manually compare identifiers.) I have tried using the groupby feature of pandas; it does the grouping well, but I do not know how to compare the resulting groups.
2. Compare each of the groups/sub-dataframes.

One method I was thinking of was reading all rows of a particular identifier into an array/vector and comparing the arrays/vectors using a comparison metric (Manhattan distance, cosine similarity, etc.).

Any help is appreciated, I am very new to Python. Thanks in advance!

Question 2

You could do something like the following:

import pandas as pd

input_file = pd.read_csv("input.csv")
columns = ['Category1', 'Category2', 'Category3', 'Category4', 'Category5']
duplicate_entries = {}

for group in input_file.groupby('Identifier'):
    # transforming to tuples so that it can be used as keys on a dict
    lines = [tuple(y) for y in group[1].loc[:, columns].values.tolist()]
    key = tuple(lines)
    if key not in duplicate_entries:
        duplicate_entries[key] = []
    duplicate_entries[key].append(group[0])

Then the values of duplicate_entries hold the lists of Identifiers whose rows are identical:

list(duplicate_entries.values())
> [[1000, 1002], [1001]]
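For anyone who wants to try this without an input.csv, here is a minimal sketch that builds the sample data from the question inline. The function name group_identical is made up for illustration, and sorting the rows before building the key is an added assumption (it makes two identifiers match even if their rows appear in a different order); drop the sorted() call if row order should matter.

```python
import pandas as pd

# Sample rows copied from the question, built inline so no CSV file is needed.
data = [
    (1000, "foo", "bat", 678, "a.x", "ld"),
    (1000, "foo", "bat", 78, "l.o", "op"),
    (1000, "coo", "cat", 678, "p.o", "kt"),
    (1001, "coo", "sat", 89, "a.x", "hd"),
    (1001, "foo", "bat", 78, "l.o", "op"),
    (1002, "foo", "bat", 678, "a.x", "ld"),
    (1002, "foo", "bat", 78, "l.o", "op"),
    (1002, "coo", "cat", 678, "p.o", "kt"),
]
columns = ["Category1", "Category2", "Category3", "Category4", "Category5"]
df = pd.DataFrame(data, columns=["Identifier"] + columns)


def group_identical(frame, key_column, value_columns):
    """Return lists of key values whose rows (in value_columns) are identical.

    Hypothetical helper, not part of pandas.
    """
    groups = {}
    for identifier, rows in frame.groupby(key_column):
        # Assumption: row order inside a group is irrelevant, so sort the rows.
        row_tuples = rows[value_columns].itertuples(index=False, name=None)
        key = tuple(sorted(row_tuples))
        groups.setdefault(key, []).append(identifier)
    return list(groups.values())


result = group_identical(df, "Identifier", columns)
print(result)  # 1000 and 1002 end up in the same list, 1001 on its own
```

With millions of rows the dict of tuples can get large; hashing each key (for example with hash(key)) instead of storing the full tuple would reduce memory, at the cost of a small collision risk.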

EDIT:

To get only the entries that have duplicates, you could have something like:

all_dup = [ids for ids in duplicate_entries.values() if len(ids) > 1]

Explaining the indices (sorry I didn't explain it before): iterating over the df.groupby result yields tuples whose first entry is the key of the group (in this case an 'Identifier' value) and whose second entry is a DataFrame holding the rows of that group. So group[1] gives the rows to compare, and group[0] gives the 'Identifier' for that group. Because the duplicate_entries dict should store the identifiers, we append group[0] to it.
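To make the indexing concrete, here is a small sketch on a made-up three-row frame (the data and the sizes variable are only for illustration). It shows the group[0] / group[1] access next to the equivalent tuple unpacking, which tends to read better.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for this example.
df = pd.DataFrame({
    "Identifier": [1000, 1000, 1001],
    "Category1": ["foo", "coo", "coo"],
})

# Index-based access, as in the answer above.
for group in df.groupby("Identifier"):
    identifier = group[0]  # the group key
    rows = group[1]        # DataFrame with the rows of that group
    print(identifier, type(rows).__name__, len(rows))

# Same iteration with tuple unpacking.
sizes = {identifier: len(rows) for identifier, rows in df.groupby("Identifier")}
```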

Grouping and comparing groups using pandas
