How to replace a dataframe column based on a priority order?

2024/10/6 11:28:45

I have a dataframe with a column df["Annotations"] as follows:

missense_variant&splice_region_variant
stop_gained&splice_region_variant
splice_acceptor_variant&coding_sequence_variant&intron_variant
splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&5_prime_UTR_variant&intron_variant
missense_variant&NMD_transcript_variant
frameshift_variant&splice_region_variant
splice_acceptor_variant&intron_variant
splice_acceptor_variant&coding_sequence_variant
stop_lost&3_prime_UTR_variant
missense_variant
splice_region_variant

I want to replace this column, or add a new one, according to a priority order. The priority is given as:

Type                 Rank
frameshift_variant      1
stop_gained             2
splice_region_variant   3
splice_acceptor_variant 4
splice_donor_variant    5
missense_variant        6
coding_sequence_variant 7

I want to replace df['Annotations'], or add a new column df['Anno_prio'], so that it contains:

splice_region_variant
stop_gained
splice_acceptor_variant
splice_acceptor_variant
missense_variant
frameshift_variant
splice_acceptor_variant
splice_acceptor_variant
stop_lost
missense_variant
splice_region_variant

What I tried was a separate replacement for each combination of terms:

df['Annotations']=df['Annotations'].str.replace('missense_variant&splice_region_variant','splice_region_variant')

Is there another way to do this using pandas?

Answer

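Process:

  1. Split each value on "&" and use pandas.Series.explode to turn each element of the resulting list into its own row.
  2. Use Series.map to convert each Type to its Rank.
  3. Sort by Rank, then drop_duplicates on the original index so that only the best-ranked (lowest Rank) term remains for each row.
  4. For rows in which no term has a Rank, fillna with the first Type in Annotations.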

anno_map = df_rank.set_index('Type')['Rank']
obj_anno_split = df['Annotations'].str.split('&')
df_anno_map = obj_anno_split.explode().reset_index()
# create a new column 'rank' using map
df_anno_map['rank'] = df_anno_map['Annotations'].map(anno_map)
# keep the best-ranked term for every index, via sort and drop_duplicates
df_anno_map = (df_anno_map.dropna().sort_values('rank')
               .drop_duplicates('index', keep='first')
               .set_index('index').sort_index())
# assign Anno_prio; pandas aligns the values on the index
df['Anno_prio'] = df_anno_map['Annotations']
# fill the remaining gaps with the first item of the split
df['Anno_prio'] = df['Anno_prio'].combine_first(obj_anno_split.str[0])
# print(df_anno_map)
# print(df)
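
The snippet above assumes that df and df_rank already exist; the answer does not show how they are built. A minimal setup sketch, assuming the priority table is held in a DataFrame named df_rank with columns Type and Rank (the names follow the question's table, but the construction itself is an assumption):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_gained&splice_region_variant",
    "splice_acceptor_variant&coding_sequence_variant&intron_variant",
    "splice_donor_variant&splice_acceptor_variant&coding_sequence_variant"
    "&5_prime_UTR_variant&intron_variant",
    "missense_variant&NMD_transcript_variant",
    "frameshift_variant&splice_region_variant",
    "splice_acceptor_variant&intron_variant",
    "splice_acceptor_variant&coding_sequence_variant",
    "stop_lost&3_prime_UTR_variant",
    "missense_variant",
    "splice_region_variant",
]})

# Priority table from the question (assumed to be a two-column DataFrame).
df_rank = pd.DataFrame({
    "Type": [
        "frameshift_variant", "stop_gained", "splice_region_variant",
        "splice_acceptor_variant", "splice_donor_variant",
        "missense_variant", "coding_sequence_variant",
    ],
    "Rank": [1, 2, 3, 4, 5, 6, 7],
})

# Series indexed by Type, as used by the answer's first line.
anno_map = df_rank.set_index("Type")["Rank"]
```

Terms that are missing from df_rank (for example stop_lost) map to NaN, which is why the answer needs the final combine_first step.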

result:

print(df_anno_map)

                   Annotations  rank
index
0        splice_region_variant   3.0
1                  stop_gained   2.0
2      splice_acceptor_variant   4.0
3      splice_acceptor_variant   4.0
4             missense_variant   6.0
5           frameshift_variant   1.0
6      splice_acceptor_variant   4.0
7      splice_acceptor_variant   4.0
9             missense_variant   6.0
10       splice_region_variant   3.0

print(df)

                                          Annotations                Anno_prio
0              missense_variant&splice_region_variant    splice_region_variant
1                   stop_gained&splice_region_variant              stop_gained
2   splice_acceptor_variant&coding_sequence_varian...  splice_acceptor_variant
3   splice_donor_variant&splice_acceptor_variant&c...  splice_acceptor_variant
4             missense_variant&NMD_transcript_variant         missense_variant
5            frameshift_variant&splice_region_variant       frameshift_variant
6              splice_acceptor_variant&intron_variant  splice_acceptor_variant
7     splice_acceptor_variant&coding_sequence_variant  splice_acceptor_variant
8                       stop_lost&3_prime_UTR_variant                stop_lost
9                                    missense_variant         missense_variant
10                              splice_region_variant    splice_region_variant
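
As an alternative sketch (not part of the original answer), the same column can be built with a plain Python helper applied to each row. The names RANK and pick_priority are hypothetical, and the fallback to the first term for unranked rows mirrors step 4 of the process; this is likely simpler to read but slower than the explode-based version on very large frames.

```python
import pandas as pd

# Priority table from the question as a dict: lower rank = higher priority.
RANK = {
    "frameshift_variant": 1,
    "stop_gained": 2,
    "splice_region_variant": 3,
    "splice_acceptor_variant": 4,
    "splice_donor_variant": 5,
    "missense_variant": 6,
    "coding_sequence_variant": 7,
}


def pick_priority(annotation: str) -> str:
    """Return the best-ranked term, or the first term if none is ranked."""
    terms = annotation.split("&")
    ranked = [t for t in terms if t in RANK]
    if not ranked:
        return terms[0]
    return min(ranked, key=RANK.get)


df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_gained&splice_region_variant",
    "splice_acceptor_variant&coding_sequence_variant&intron_variant",
    "splice_donor_variant&splice_acceptor_variant&coding_sequence_variant"
    "&5_prime_UTR_variant&intron_variant",
    "missense_variant&NMD_transcript_variant",
    "frameshift_variant&splice_region_variant",
    "splice_acceptor_variant&intron_variant",
    "splice_acceptor_variant&coding_sequence_variant",
    "stop_lost&3_prime_UTR_variant",
    "missense_variant",
    "splice_region_variant",
]})

# Apply the helper element-wise to build the new column.
df["Anno_prio"] = df["Annotations"].map(pick_priority)
print(df["Anno_prio"].tolist())
```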
