Compare 2 consecutive rows and assign increasing value if different (using Pandas)

2024/10/13 4:19:42

I have a dataframe df_in like so:

import pandas as pd
dic_in = {'A':['aa','aa','bb','cc','cc','cc','cc','dd','dd','dd','ee'],'B':['200','200','200','400','400','500','700','700','900','900','200'],'C':['da','cs','fr','fs','se','at','yu','j5','31','ds','sz']}
df_in = pd.DataFrame(dic_in)

I would like to compare columns A and B across consecutive rows. If two consecutive rows of df_in[['A','B']] are equal, they are assigned the same value; every time two consecutive rows differ, the value to assign increases by 1. For example, the first row gets 1; if the second row equals the first, it also gets 1; if the third row differs from the second, it gets 2, and so on.

The result should look like this:

     A    B   C  value
0   aa  200  da      1
1   aa  200  cs      1
2   bb  200  fr      2
3   cc  400  fs      3
4   cc  400  se      3
5   cc  500  at      4
6   cc  700  yu      5
7   dd  700  j5      6
8   dd  900  31      7
9   dd  900  ds      7
10  ee  200  sz      8

Can you suggest a smart way to achieve this?

Answer

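Use shift to compare each row with the previous one, and any(axis=1) to flag with True every row where A or B changes. The first row is always flagged, because shift fills its missing predecessor with NaN, so the count starts at 1. Then take the cumulative sum of the flags with cumsum to get the increasing value: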

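# True where A or B differs from the previous row (the first row is always True)
df_in['value'] = (df_in[['A', 'B']] != df_in[['A', 'B']].shift()).any(axis=1)
# Running total of the True flags gives the increasing value
df_in['value'] = df_in['value'].cumsum()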
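
To see why this works, it can help to look at the intermediate results. The sketch below rebuilds the frame from the question and keeps each step in its own variable; the names keys, prev, changed and value are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Same data as in the question
df_in = pd.DataFrame({
    'A': ['aa', 'aa', 'bb', 'cc', 'cc', 'cc', 'cc', 'dd', 'dd', 'dd', 'ee'],
    'B': ['200', '200', '200', '400', '400', '500', '700', '700', '900', '900', '200'],
    'C': ['da', 'cs', 'fr', 'fs', 'se', 'at', 'yu', 'j5', '31', 'ds', 'sz'],
})

keys = df_in[['A', 'B']]
prev = keys.shift()                    # row i holds the values of row i-1; row 0 is NaN
changed = (keys != prev).any(axis=1)   # True if A or B differs from the previous row
value = changed.cumsum()               # each True bumps the running count by 1

print(changed.tolist())
# [True, False, True, True, False, True, True, True, True, False, True]
print(value.tolist())
# [1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8]
```

Because True counts as 1 and False as 0, the cumulative sum only increases on rows where something changed.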

Alternatively, condensing it to one line:

df_in['value'] = (df_in[['A', 'B']] != df_in[['A', 'B']].shift()).any(axis=1).cumsum()

The resulting output:

     A    B   C  value
0   aa  200  da      1
1   aa  200  cs      1
2   bb  200  fr      2
3   cc  400  fs      3
4   cc  400  se      3
5   cc  500  at      4
6   cc  700  yu      5
7   dd  700  j5      6
8   dd  900  31      7
9   dd  900  ds      7
10  ee  200  sz      8
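
One caveat, which the question's data does not trigger: NaN never compares equal to itself, so if A or B contained missing values, two consecutive rows that are both missing in the same column would be counted as different. A possible workaround, sketched here on a small hypothetical frame (df_nan and the variable names are assumptions, not part of the original answer), is to treat positions where both the current and the previous value are missing as equal.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in column B
df_nan = pd.DataFrame({
    'A': ['aa', 'aa', 'aa', 'bb'],
    'B': ['200', np.nan, np.nan, np.nan],
})

keys = df_nan[['A', 'B']]
prev = keys.shift()

# Original approach: rows 1 and 2 are split apart because NaN != NaN is True
naive = (keys != prev).any(axis=1).cumsum()

# Variant: ignore differences where both the current and previous value are missing
both_missing = keys.isna() & prev.isna()
changed = ((keys != prev) & ~both_missing).any(axis=1)
changed.iloc[0] = True   # the first row always starts a new run
robust = changed.cumsum()

print(naive.tolist())    # [1, 2, 3, 4]
print(robust.tolist())   # [1, 2, 2, 3]
```

Whether two missing values should count as "equal" depends on the data, so this is a design choice rather than a fix.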
