Question 1

I need to build a matrix from a list of text files, each containing the frequency distribution of a set of expressions. I created a list of all the text files (lof) in a directory and used it to build a matrix (thanks to gboffy). Each filename in that list has the structure CompanyName-SerialNumber_IssueDate_IFRS.txt (example: GoldmanSachs-123456_31.12.2014_IFRS.txt). Each file's content follows the same structure as well:

CompanyABC-123456_31.12.2012_IFRS.txt

CompanyABC-123456_31.12.2012
financial statement:4
corporate-taxes:8
assets:2
available-for-sale property:0
auditors:213

Company123-789102_31.12.2012_IFRS.txt

Company123-789102_31.12.2012
financial statement:15
corporate-taxes:3
assets:8
available-for-sale property:2
auditors:23

My desired output from this should be a single matrix file written to txt with one line for each company file consisting of (CompanyName,Serial Number,IssueDate,Frequency1,Frequency2,...,FrequencyN):

'CompanyABC','123456','31.12.2012','4','8','2','0','213' \n
'Company123','789102','31.12.2012','15','3','8','2','23' \n
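
To make the mapping from one file to one output line concrete, here is a minimal sketch. The helper names parse_report and format_row are hypothetical, and the sketch assumes the first non-empty line is always the CompanyName-SerialNumber_IssueDate header and every other line is an expression:frequency pair.

```python
def parse_report(text):
    """Turn one report's text into [name, serial, date, freq1, ..., freqN]."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header, entries = lines[0], lines[1:]
    # Header is assumed to look like CompanyName-SerialNumber_IssueDate
    name, rest = header.split('-', 1)
    serial, date = rest.split('_', 1)
    # rpartition takes the text after the LAST ':' so an expression
    # that itself contains ':' does not shift the frequency
    freqs = [entry.rpartition(':')[2].strip() for entry in entries]
    return [name, serial, date] + freqs


def format_row(fields):
    """Quote every field and join with commas, as in the desired output."""
    return ','.join("'" + field + "'" for field in fields)


sample = """Company123-789102_31.12.2012
financial statement:15
corporate-taxes:3
assets:8
available-for-sale property:2
auditors:23
"""
print(format_row(parse_report(sample)))
```

Splitting the header only on the first '-' keeps hyphens elsewhere in the line from breaking the parse, though a company name that itself contains a hyphen would still be split incorrectly.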

Here is my code so far:

    import os

    def list_textfiles(directory, min_file_size):
        # Creates a list of all files stored in DIRECTORY ending on '.txt' with minimum file size
        textfiles = []
        for root, dirs, files in os.walk(directory):
            for name in files:
                filename = os.path.join(root, name)
                if os.stat(filename).st_size > min_file_size:
                    textfiles.append(filename)
        return textfiles

    directory = 'C:/CompanyFiles'
    minimum_size = 30000
    lof = list_textfiles(directory, minimum_size)

    res = []
    for f in lof:
        res += [[entry.split(':')[1] for entry in cdata] for cdata in [data.splitlines() for data in open(f).read().split('\n\n')]]

    with open('C:/CompanyFiles/Matrix.txt', 'wt') as outfile:
        outfile.write(str(res))

How can I modify my code to achieve the output as stated above?

Question 2

This should do the trick:

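    import os

    outFile = 'C:/CompanyFiles/Matrix.txt'
    folder = 'C:/CompanyFiles'

    with open(outFile, 'w') as wfp:
        for f in os.listdir(folder):
            # Skip anything that is not a company file, including Matrix.txt itself
            if not f.endswith('_IFRS.txt'):
                continue
            with open(os.path.join(folder, f), 'r') as rfp:
                tmp = [line.rstrip() for line in rfp if line.strip()]
            # First line: CompanyName-SerialNumber_IssueDate
            arr = tmp[0].split('-')
            arr = [arr[0]] + arr[1].split('_')
            # Remaining lines: expression:frequency
            arr += [t.split(':')[1].strip() for t in tmp[1:]]
            wfp.write(','.join(["'" + e + "'" for e in arr]) + '\n')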

Note: I haven't tested it thoroughly
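
A possible variant, sketched under a few assumptions: it takes the three leading fields from the filename rather than from the first line of the file, and it lets the standard csv module do the quoting. The function names fields_from_filename, frequencies and write_matrix are hypothetical, and every file is assumed to follow the CompanyName-SerialNumber_IssueDate_IFRS.txt naming described in the question.

```python
import csv
import os


def fields_from_filename(path):
    """Split CompanyName-SerialNumber_IssueDate_IFRS.txt into its three fields."""
    stem = os.path.splitext(os.path.basename(path))[0]
    name, rest = stem.split('-', 1)
    serial, date = rest.split('_')[:2]
    return [name, serial, date]


def frequencies(path):
    """Return the value after the last ':' on every 'expression:count' line."""
    with open(path) as fp:
        return [line.rpartition(':')[2].strip() for line in fp if ':' in line]


def write_matrix(paths, out_path):
    """Write one fully quoted, comma-separated row per company file."""
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out, quotechar="'", quoting=csv.QUOTE_ALL,
                            lineterminator='\n')
        for path in sorted(paths):
            # Ignore files that do not follow the naming scheme,
            # such as the output file itself
            if not path.endswith('_IFRS.txt'):
                continue
            writer.writerow(fields_from_filename(path) + frequencies(path))
```

With the lof list built by list_textfiles in the question, this would be called as write_matrix(lof, 'C:/CompanyFiles/Matrix.txt'). Because it accepts any list of paths, the os.walk traversal and the minimum file size filter from the question stay in effect.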

Comma separated Matrix from txt files - continued
