Question 1

I have a file called "X.tsv". I want to remove special characters (including double spaces, but excluding ".", single spaces, tabs, "/" and "-") using regex before I export the data to sub-files in Python.

I want to implement it in the following code.

import pandas as pd
import csv
from itertools import chain, combinations

df = pd.read_table('xa.tsv')

def all_subsets(ss):
    # Every combination of the items in ss, from size 0 up to len(ss)
    return chain(*map(lambda x: combinations(ss, x), range(0, len(ss) + 1)))

# Exclude extra cols
excluded = ['acm_classification', 'publicationId', 'publisher', 'publication_link', 'source']
cols = [x for x in df.columns if x not in excluded]

subsets = all_subsets(cols)
for subset in subsets:
    if len(subset) > 0:
        # Keep 'acm_classification' alongside every subset of the other columns
        df1 = df[list(subset) + ['acm_classification']]
        df1.to_csv('_'.join(subset) + '.csv', index=False)

Answer

You could use read_csv() to load the TSV file, passing \t as the delimiter and listing the columns you want to keep in usecols:

import pandas as pd
import re

def normalise(text):
    # Replace each listed special character with a space
    text = re.sub('[{}]'.format(re.escape('",$!@#%^&*()')), ' ', text)
    # Convert multiple whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Strip last, so a special character at either end leaves no stray space
    return text.strip()

fieldnames = ['title', 'abstract', 'keywords', 'general_terms', 'acm_classification']
df = pd.read_csv('xa.tsv', delimiter='\t', usecols=fieldnames, dtype='object', na_filter=False)
# On pandas 2.1 and later, DataFrame.map is the preferred name for applymap
df = df.applymap(normalise)
print(df)

df.applymap() applies a function to every cell, so each value can be formatted as you need. In this example normalise() replaces each character in your list of special characters with a space, collapses runs of whitespace into a single space, and trims leading and trailing spaces.
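The question describes the rule as a list of characters to keep rather than a list to remove, so a whitelist pattern may be a closer fit. The following is only a sketch under assumptions: the name normalise_keep is hypothetical, and it assumes letters, digits and underscores (\w) should survive along with ".", "/", "-", spaces and tabs.

```python
import re

# Anything that is NOT a word character, '.', '/', '-', a space or a tab
# (assumed whitelist, based on the characters listed in the question).
_DISALLOWED = re.compile(r"[^\w./\- \t]")
# Two or more consecutive spaces (tabs are left untouched)
_MULTI_SPACE = re.compile(r" {2,}")


def normalise_keep(text):
    """Hypothetical whitelist variant of normalise()."""
    text = _DISALLOWED.sub(" ", text)    # replace disallowed characters with a space
    text = _MULTI_SPACE.sub(" ", text)   # collapse double spaces into one
    return text.strip(" ")               # trim spaces left at either end


cleaned = normalise_keep('  "GPU" & CPU:  a/b-c v1.0\tend!! ')
print(repr(cleaned))
```

It could be passed to df.applymap() in place of normalise. Note that \w also matches non-ASCII letters, which may or may not be what you want.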

The resulting dataframe could then be further processed using your all_subsets() function before saving.
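As a rough sketch of that last step, the cleaned dataframe could be split into one CSV per column subset along these lines. The helper export_subsets, its parameters and the tiny demo dataframe are hypothetical; only all_subsets() and the 'acm_classification' column come from the question.

```python
import os
import tempfile
from itertools import chain, combinations

import pandas as pd


def all_subsets(ss):
    """Every combination of the items in ss, from size 0 up to len(ss)."""
    return chain.from_iterable(combinations(ss, r) for r in range(len(ss) + 1))


def export_subsets(df, always_keep, out_dir):
    """Write one CSV per non-empty subset of the columns other than always_keep."""
    cols = [c for c in df.columns if c != always_keep]
    written = []
    for subset in all_subsets(cols):
        if not subset:
            continue  # skip the empty combination
        path = os.path.join(out_dir, "_".join(subset) + ".csv")
        df[list(subset) + [always_keep]].to_csv(path, index=False)
        written.append(path)
    return written


# Tiny stand-in for the cleaned dataframe (made-up values, for illustration only)
demo = pd.DataFrame({
    "title": ["a title"],
    "abstract": ["an abstract"],
    "acm_classification": ["H.3"],
})
out_dir = tempfile.mkdtemp()
paths = export_subsets(demo, "acm_classification", out_dir)
print([os.path.basename(p) for p in paths])
```

The number of files grows as 2^n - 1 for n columns, so it may be worth keeping the column list short.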

Remove Special Chars from a TSV file using Regex
