How to find duplicates in pandas dataframe

2024/10/7 8:26:47


Suppose I have the following series in pandas:

>>> p
0     0.0
1     0.0
2     0.0
3     0.3
4     0.3
5     0.3
6     0.3
7     0.3
8     1.0
9     1.0
10    1.0
11    0.2
12    0.2
13    0.3
14    0.3
15    0.3

I need to identify each sequence of consecutive duplicates - its first and last index. Using the above example, I need to identify the first sequence of 0.3 (from index 3 to 7) independently from the last sequence of 0.3 (from index 13 to 15).

Using Series.duplicated is insufficient because:

* Using keep='first' marks the first occurrence of each value as False, but index 13 is still marked True because it is not the first appearance of 0.3.

* The same goes for keep='last'.

* keep=False just marks all of the entries as True.
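
To make the three points above concrete, here is a minimal sketch that rebuilds the series and checks what each keep option returns. The variable names dup_first, dup_last and dup_all are just illustrative.

```python
import pandas as pd

# Rebuild the example series: runs of 0.0, 0.3, 1.0, 0.2 and 0.3 again
p = pd.Series([0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3)

dup_first = p.duplicated(keep='first')  # False only at the first occurrence of each value
dup_last = p.duplicated(keep='last')    # False only at the last occurrence of each value
dup_all = p.duplicated(keep=False)      # True for every value that occurs more than once

# Index 3 is the first 0.3 overall, so it is not flagged...
print(dup_first[3])
# ...but index 13, which starts the second run of 0.3, is flagged as a duplicate
print(dup_first[13])
# Index 7 ends the first run of 0.3, yet keep='last' still flags it
print(dup_last[7])
# Every value in this series repeats somewhere, so keep=False flags everything
print(dup_all.all())
```

None of the three masks knows where one run ends and the next begins, which is the information I actually need.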

Thank you!

Answer

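I believe you need a trick: compare each value with the shifted (previous) value for inequality using ne, take the cumsum so that each consecutive run gets its own label, and finally use drop_duplicates. This assumes the values are in column a of a DataFrame df: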

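# label each run of consecutive equal values with 1, 2, 3, ...
s = df['a'].ne(df['a'].shift()).cumsum()
# index of the first and last row of each run
a = s.drop_duplicates().index
b = s.drop_duplicates(keep='last').index

# stored under a new name so the original df is not overwritten
out = pd.DataFrame({'first':a, 'last':b})
print(out)
   first  last
0      0     2
1      3     7
2      8    10
3     11    12
4     13    15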
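
Since the question starts from a Series rather than a DataFrame, here is a self-contained sketch of the same idea applied directly to p, with each intermediate step named. The names starts, run_id and runs are illustrative only.

```python
import pandas as pd

# Rebuild the series from the question
p = pd.Series([0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3)

# True wherever a value differs from the one before it, i.e. at each run start.
# The first row is compared against NaN, so it also counts as a start.
starts = p.ne(p.shift())

# A running total of the run starts gives every run its own label: 1, 2, 3, ...
run_id = starts.cumsum()

# The first and last row carrying each label are the boundaries of that run
first = run_id.drop_duplicates(keep='first').index
last = run_id.drop_duplicates(keep='last').index

runs = pd.DataFrame({'first': first, 'last': last})
print(runs)
```

One caveat: NaN never compares equal to NaN, so a stretch of consecutive NaN values would be split into one run per row with this comparison.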

If you also want the duplicated value in a new column, change the solution slightly and use duplicated:

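s = df['a'].ne(df['a'].shift()).cumsum()
# value at the first row of each run (keeps the original index)
a = df.loc[~s.duplicated(), 'a']
b = s.drop_duplicates(keep='last')

# stored under a new name so the original df is not overwritten
out = pd.DataFrame({'first':a.index, 'last':b.index, 'val':a})
print(out)
    first  last  val
0       0     2  0.0
3       3     7  0.3
8       8    10  1.0
11     11    12  0.2
13     13    15  0.3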
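
As an alternative sketch (not part of the solution above), the same table can probably be built in one step by grouping on the run labels and using named aggregation. This assumes pandas 0.25 or newer and an unnamed index, so that reset_index() produces a column called index. The names run_id and runs are illustrative.

```python
import pandas as pd

# Example data in column a, as assumed by the answer
df = pd.DataFrame({'a': [0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3})

# Label each run of consecutive equal values
run_id = df['a'].ne(df['a'].shift()).cumsum()

runs = (
    df.reset_index()                  # expose the row labels as a column named 'index'
      .groupby(run_id.to_numpy())     # group rows by their run label
      .agg(first=('index', 'first'),  # first row label in the run
           last=('index', 'last'),    # last row label in the run
           val=('a', 'first'))        # the repeated value
)
print(runs)
```

Here the resulting rows are indexed by the run label (1 to 5) rather than by the first row of each run.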

If you need the run label as a new column:

df['count'] = df['a'].ne(df['a'].shift()).cumsum()
print(df)
      a  count
0   0.0      1
1   0.0      1
2   0.0      1
3   0.3      2
4   0.3      2
5   0.3      2
6   0.3      2
7   0.3      2
8   1.0      3
9   1.0      3
10  1.0      3
11  0.2      4
12  0.2      4
13  0.3      5
14  0.3      5
15  0.3      5
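
Once the label column exists, it can be used as an ordinary groupby key. A small sketch, assuming you want the length of each run and then only the rows in long runs. The column name run_len and the threshold of 5 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3})
df['count'] = df['a'].ne(df['a'].shift()).cumsum()

# Attach to every row the length of the run it belongs to
df['run_len'] = df.groupby('count')['a'].transform('size')

# Keep only the rows that sit in a run of at least 5 consecutive equal values
long_runs = df[df['run_len'] >= 5]
print(long_runs)
```

With the example data only the first run of 0.3 (rows 3 to 7) is long enough to be kept.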