Remove Special Chars from a TSV file using Regex

2024/10/5 15:07:49

I have a file called "X.tsv". Before I export its contents to sub-files in Python, I want to use regex to remove special characters (including double spaces) while keeping periods, single spaces, tabs, "/" and "-".

I want to implement it in the following code.

import pandas as pd
import csv
from itertools import chain, combinations

df = pd.read_table('xa.tsv')

def all_subsets(ss):
    # Every combination of the items in ss, from the empty set up to the full set
    return chain(*map(lambda x: combinations(ss, x), range(0, len(ss) + 1)))

# Exclude extra columns
cols = [x for x in df.columns
        if not x == 'acm_classification'
        if not x == 'publicationId'
        if not x == 'publisher'
        if not x == 'publication_link'
        if not x == 'source']

subsets = all_subsets(cols)

for subset in subsets:
    if len(subset) > 0:
        # Each sub-file holds the chosen columns plus acm_classification
        df1 = df[list(subset) + ['acm_classification']]
        df1.to_csv('_'.join(subset) + '.csv', index=False)

Answer

You could use pandas read_csv() to load the TSV file, specifying the columns you want to keep and \t as the delimiter:

import pandas as pd
import re

def normalise(text):
    text = re.sub('[{}]'.format(re.escape('",$!@#$%^&*()')), ' ', text)  # Replace special characters with a space
    text = re.sub(r'\s+', ' ', text)  # Convert multiple whitespace into a single space
    return text.strip()  # Strip last, so a trailing special character does not leave a space behind

fieldnames = ['title', 'abstract', 'keywords', 'general_terms', 'acm_classification']
df = pd.read_csv('xa.tsv', delimiter='\t', usecols=fieldnames, dtype='object', na_filter=False)
df = df.applymap(normalise)  # On pandas 2.1 and later, df.map(normalise) is the non-deprecated equivalent
print(df)

You can then use df.applymap() to apply a function to each cell to format it as you need. In this example the function trims leading and trailing spaces, replaces each character in your list of special characters with a space, and collapses runs of whitespace into a single space.
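Since the question asks to keep periods, "/", "-", tabs and single spaces, an allow-list pattern is one alternative to listing every character to drop. The following is only a minimal sketch: it assumes that letters, digits and underscores should also be kept, and normalise_keep is a hypothetical name, not part of the answer above.

```python
import re

# Anything that is not a word character, whitespace, '.', '/' or '-' counts as special.
# Assumption: \w (letters, digits, underscore; Unicode-aware in Python 3) is what should be kept.
SPECIAL = re.compile(r'[^\w\s./-]')

def normalise_keep(text):
    text = SPECIAL.sub(' ', text)       # replace each special character with a space
    text = re.sub(r' {2,}', ' ', text)  # collapse runs of spaces only, leaving tabs alone
    return text.strip(' ')              # trim leading and trailing spaces

print(normalise_keep('  Hello,  world! (v1.2) a/b-c  '))  # Hello world v1.2 a/b-c
```

It can be passed to df.applymap() in the same way as normalise.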

The resulting dataframe could then be further processed using your all_subsets() function before saving.
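As a rough sketch of that last step, the subset generation and export could look like the following. The names non_empty_subsets and export_subsets are hypothetical, and export_subsets assumes df is the cleaned pandas DataFrame and that acm_classification should be included in every output file, as in the question's code.

```python
from itertools import chain, combinations

def non_empty_subsets(cols):
    """Return an iterator over every non-empty combination of cols, shortest first."""
    return chain.from_iterable(combinations(cols, n) for n in range(1, len(cols) + 1))

def export_subsets(df, cols, key='acm_classification'):
    """Write one CSV per subset; df is assumed to be the cleaned pandas DataFrame."""
    for subset in non_empty_subsets(cols):
        df[list(subset) + [key]].to_csv('_'.join(subset) + '.csv', index=False)

# Preview the file names that three columns would produce (2**3 - 1 = 7 files).
cols = ['title', 'abstract', 'keywords']
print(['_'.join(s) + '.csv' for s in non_empty_subsets(cols)])
```

Starting the range at 1 skips the empty subset, so the len(subset) > 0 check is no longer needed. The number of files grows as 2**n - 1, so it may be worth previewing the names before writing anything.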

