I have a dataframe with a column df["Annotations"] that looks like this:
missense_variant&splice_region_variant
stop_gained&splice_region_variant
splice_acceptor_variant&coding_sequence_variant&intron_variant
splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&5_prime_UTR_variant&intron_variant
missense_variant&NMD_transcript_variant
frameshift_variant&splice_region_variant
splice_acceptor_variant&intron_variant
splice_acceptor_variant&coding_sequence_variant
stop_lost&3_prime_UTR_variant
missense_variant
splice_region_variant
I want to replace this column, or add a new one, with the highest-priority term from each row. The priority is given as:
Type Rank
frameshift_variant 1
stop_gained 2
splice_region_variant 3
splice_acceptor_variant 4
splice_donor_variant 5
missense_variant 6
coding_sequence_variant 7
I want to replace df['Annotations'], or add a new column df['Anno_prio'], so that it reads:
splice_region_variant
stop_gained
splice_acceptor_variant
splice_acceptor_variant
missense_variant
frameshift_variant
splice_acceptor_variant
splice_acceptor_variant
stop_lost
missense_variant
splice_region_variant
What I tried was a separate replacement for each combination of terms:
df['Annotations'] = df['Annotations'].str.replace('missense_variant&splice_region_variant', 'splice_region_variant')
Is there another way to do this using pandas?
Process:
- Split by "&" and use pandas.Series.explode to transform each element of the list-like into a row.
- Use a mapping Series to convert each Type to its Rank.
- Sort by Rank, then drop_duplicates on the original index so that only the best-ranked term per row remains.
- fillna with the first Type in Annotations for rows where no term has a Rank.
# Series mapping Type -> Rank; df_rank holds the priority table (columns Type, Rank)
anno_map = df_rank.set_index('Type')['Rank']
obj_anno_split = df['Annotations'].str.split('&')
df_anno_map = obj_anno_split.explode().reset_index()
# create a new column rank using map
df_anno_map['rank'] = df_anno_map['Annotations'].map(anno_map)
# keep the best rank for every index, via sort and drop_duplicates
df_anno_map = (df_anno_map.dropna().sort_values('rank')
               .drop_duplicates('index', keep='first')
               .set_index('index').sort_index())
# assign Anno_prio; pandas aligns the values on the index
df['Anno_prio'] = df_anno_map['Annotations']
# fill the remaining gaps with the split's first item
df['Anno_prio'] = df['Anno_prio'].combine_first(obj_anno_split.str[0])
# print(df_anno_map)
# print(df)
result:
print(df_anno_map)
Annotations rank
index
0 splice_region_variant 3.0
1 stop_gained 2.0
2 splice_acceptor_variant 4.0
3 splice_acceptor_variant 4.0
4 missense_variant 6.0
5 frameshift_variant 1.0
6 splice_acceptor_variant 4.0
7 splice_acceptor_variant 4.0
9 missense_variant 6.0
10 splice_region_variant 3.0

print(df)
Annotations Anno_prio
0 missense_variant&splice_region_variant splice_region_variant
1 stop_gained&splice_region_variant stop_gained
2 splice_acceptor_variant&coding_sequence_varian... splice_acceptor_variant
3 splice_donor_variant&splice_acceptor_variant&c... splice_acceptor_variant
4 missense_variant&NMD_transcript_variant missense_variant
5 frameshift_variant&splice_region_variant frameshift_variant
6 splice_acceptor_variant&intron_variant splice_acceptor_variant
7 splice_acceptor_variant&coding_sequence_variant splice_acceptor_variant
8 stop_lost&3_prime_UTR_variant stop_lost
9 missense_variant missense_variant
10 splice_region_variant splice_region_variant
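For a quick check of the expected output, here is a self-contained sketch that rebuilds the sample data from the question and applies the same rule row by row instead of via explode. It is an alternative, not the approach above: the `rank` dict and the helper name `top_priority` are made up for illustration, and `df_rank` is built only to show the shape the snippet above assumes (columns Type and Rank).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_gained&splice_region_variant",
    "splice_acceptor_variant&coding_sequence_variant&intron_variant",
    "splice_donor_variant&splice_acceptor_variant&coding_sequence_variant"
    "&5_prime_UTR_variant&intron_variant",
    "missense_variant&NMD_transcript_variant",
    "frameshift_variant&splice_region_variant",
    "splice_acceptor_variant&intron_variant",
    "splice_acceptor_variant&coding_sequence_variant",
    "stop_lost&3_prime_UTR_variant",
    "missense_variant",
    "splice_region_variant",
]})

# Priority table from the question as a plain dict (lower rank wins).
# The name `rank` is an assumption made for this sketch.
rank = {
    "frameshift_variant": 1,
    "stop_gained": 2,
    "splice_region_variant": 3,
    "splice_acceptor_variant": 4,
    "splice_donor_variant": 5,
    "missense_variant": 6,
    "coding_sequence_variant": 7,
}

# The same table as a DataFrame, in the shape the explode-based code expects.
df_rank = pd.DataFrame(list(rank.items()), columns=["Type", "Rank"])


def top_priority(annotation: str) -> str:
    """Return the best-ranked term, or the first term if none is ranked."""
    terms = annotation.split("&")
    ranked = [t for t in terms if t in rank]
    return min(ranked, key=rank.get) if ranked else terms[0]


# Apply the helper to every row (hypothetical alternative to explode + map).
df["Anno_prio"] = df["Annotations"].map(top_priority)
print(df["Anno_prio"].tolist())
```

The row-wise version is easier to read and keeps the fallback to the first term in one place, but it runs a Python function per row, so the explode-based version may scale better on large frames.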