I have the following data set:
column1HL111
PG3939HL11
HL339PG
RC--HL--PG
I am attempting to write a function that does the following:
- Loop through each row of column1
- Pull only the alphabet and put into an array
- If the array has "HL" in it, remove it from the array UNLESS HL is the only word in the array.
- Take the first word in the array and output results.
So for the above example, my array (step2) would look like this:
[HL]
[PG,HL]
[HL,PG]
[RC,HL,PG]
and my desired final output (step4) would look like this:
desired_columnHL
PG
PG
RC
I have the code for step 2, and it seems to work fine
df['array_column'] = (df.column1.str.extractall('([A-Z]+)').unstack().values.tolist())
But I don't know how to get from here to my final output (step4).
You may achieve what you need by replacing all non-letters first, then extracting pairs of letters and then applying some custom logic to extract the necessary value from the array:
>>> df['array_column'].str.replace('[^A-Z]+', '').str.findall('([A-Z]{2})').apply(lambda d: [''] if len(d) == 0 else d).apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0])
0 HL
1 PG
2 PG
3 RC
Name: array_column, dtype: object
>>>
Details
.replace('[^A-Z]+', '')
- remove all chars other the uppercase letters
.str.findall('([A-Z]{2})')
- extract pairs of letters
.apply(lambda d: [''] if len(d) == 0 else d)
will add an empty item if there is no regex match in the previous step
.apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0])
- custom logic: if the list length is 1 and it is equal to HL
, keep it, else remove all HL
and get the first element