Separate keywords and @ mentions from dataset

2024/10/12 16:26:22

I have a huge set of data with several columns and about 10k rows in more than 100 CSV files. For now I am concerned with only one column, which holds messages, and I want to extract two parameters from each message. I searched extensively and found two solutions that seem close, but not close enough to solve the question here.

Input: the column is named "Text", and every message is a separate row in the CSV.

"Let's Bounce!😉  #[message_1]Loving the energy & Microphonic Mayhem while…" #[message_2]RT @IVijayboi: #[message_3]   @Bdutt@sardesairajdeep@rahulkanwal@abhisarsharma@ppbajpayi@Abpnewd@Ndtv@Aajtak#Jihadimedia@Ibn7 happy #PresstitutesDay"RT @RakeshKhatri23: MY LIFE #[message_4]WITHOUT YOU ISLIKE FLOWERS WITHOUT FRAGRANCE 💞💞~True Love~"Me & my baby ðŸ¶â¤ï¸ðŸ‘­ @ Home Sweet Home  #[message_5]

The input is a CSV file with several other columns, but I am interested only in this one. I want to separate the @name and #keyword from the input into new columns, like this:

Expected output:

text, mentions, keywords
[message], NAN, NAN
[message], NAN, NAN
[message], @IVijayboi @Bdutt @sardesairajdeep @rahulkanwal @abhisarsharma @ppbajpayi @Abpnewd @Ndtv @Aajtak @Ibn7, #Jihadimedia #PresstitutesDay

As we can see, the first and second messages contain no @ or #, so their column values are NAN, while the third message contains 10 @ mentions and 2 # keywords.

In simple words: how do I separate the @ mentioned names and # keywords from each message into separate columns?

Answer

I suspect you want to use a regular expression. I don't know the exact format that your @ mentions and # keywords are allowed to take, but I would guess that something of the form @([a-zA-Z0-9]+)[^a-zA-Z0-9] would work.

#!/usr/bin/env python3
import re

test_string = """Text
"Let's Bounce!😉
Loving the energy & Microphonic Mayhem while…"
RT @IVijayboi: etc etc"""

# Capture the alphanumeric name after "@"; the trailing class requires
# one non-alphanumeric character to follow the name.
mention_match = re.compile('@([a-zA-Z0-9]+)[^a-zA-Z0-9]')
for match in mention_match.finditer(test_string):
    print(match.group(1))

# Same idea for "#" keywords.
hashtag_match = re.compile('#([a-zA-Z0-9]+)[^a-zA-Z0-9]')
for match in hashtag_match.finditer(test_string):
    print(match.group(1))
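
One caveat with the pattern above, given the sample data in the question: the trailing [^a-zA-Z0-9] consumes the character after each name. When mentions run together with no separator (as in @Bdutt@sardesairajdeep@rahulkanwal), that character is the next "@", so every second mention is skipped, and a name at the very end of the string is missed as well. The sketch below compares the original pattern with a variant that matches only the name. The variant, and the choice to allow underscores, are assumptions about what counts as a valid name, not something the question specifies.

```python
import re

# Fragment based on the question's sample: mentions with no separator.
text = "@Bdutt@sardesairajdeep@rahulkanwal happy #PresstitutesDay"

# Pattern from the answer: the trailing class consumes the next "@".
consuming = re.compile(r'@([a-zA-Z0-9]+)[^a-zA-Z0-9]')
# Assumed alternative: match the name only and consume nothing after it.
non_consuming = re.compile(r'@([a-zA-Z0-9_]+)')

skipped = consuming.findall(text)
complete = non_consuming.findall(text)
print(skipped)   # misses "sardesairajdeep"
print(complete)  # all three names

# The same non-consuming form also catches a keyword at the end of the string.
hashtags = re.findall(r'#([a-zA-Z0-9_]+)', text)
print(hashtags)
```

Note that a pattern this loose will also pick up the domain part of an email address such as user@example.com, so it may need tightening if the messages contain those.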

Hopefully that gives you enough to get started with.
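
To get from individual matches to the extra columns the question asks for, something along these lines might work. It is a minimal sketch using only the standard library csv module. The helper names extract_row and add_columns are hypothetical, the sample rows are made up from fragments of the question, and "NAN" is used as the empty placeholder only because the expected output shows it. The \w class is an assumption too: it accepts letters, digits and underscores, including non-ASCII letters.

```python
import csv
import io
import re

MENTION = re.compile(r'@\w+')  # keeps the leading "@", as in the expected output
HASHTAG = re.compile(r'#\w+')  # keeps the leading "#"


def extract_row(message):
    """Return (mentions, keywords) as space-joined strings, or "NAN" if none."""
    mentions = MENTION.findall(message)
    keywords = HASHTAG.findall(message)
    return (" ".join(mentions) or "NAN", " ".join(keywords) or "NAN")


def add_columns(in_file, out_file, column="Text"):
    """Copy a CSV, appending "mentions" and "keywords" columns."""
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames) + ["mentions", "keywords"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # A missing cell is treated as an empty message.
        row["mentions"], row["keywords"] = extract_row(row[column] or "")
        writer.writerow(row)


# In-memory stand-in for one CSV file (hypothetical sample data).
sample = io.StringIO(
    'Text\n'
    '"Let\'s Bounce!"\n'
    '"RT @IVijayboi: @Bdutt@Ndtv#Jihadimedia@Ibn7 happy #PresstitutesDay"\n'
)
result = io.StringIO()
add_columns(sample, result)

rows = list(csv.DictReader(io.StringIO(result.getvalue())))
print(rows[1]["mentions"])
print(rows[1]["keywords"])
```

For real files, the same function can be called with handles opened via open(path, newline="", encoding="utf-8"), once per file in a loop over the 100 or so CSVs.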
