Comma separated Matrix from txt files - continued

2024/9/20 10:45:40

I need to build a matrix from a list of text files, each containing the frequency distribution of a set of expressions. I created a list of all the text files (lof) in a directory and used it to build a matrix (thanks to gboffy). Each filename in that list has the structure CompanyName-SerialNumber_IssueDate_IFRS.txt (example: GoldmanSachs-123456_31.12.2014_IFRS.txt). Each file's content is structured in exactly the same way, too:

CompanyABC-123456_31.12.2012_IFRS.txt

CompanyABC-123456_31.12.2012
financial statement:4
corporate-taxes:8
assets:2
available-for-sale property:0
auditors:213

Company123-789102_31.12.2012_IFRS.txt

Company123-789102_31.12.2012
financial statement:15
corporate-taxes:3
assets:8
available-for-sale property:2
auditors:23

My desired output is a single matrix written to a txt file, with one line per company file consisting of (CompanyName,SerialNumber,IssueDate,Frequency1,Frequency2,...,FrequencyN):

'CompanyABC','123456','31.12.2012','4','8','2','0','213' \n
'Company123','789102','31.12.2012','15','3','8','2','23' \n

Here is my code so far:

    import os

    def list_textfiles(directory, min_file_size):
        # Creates a list of all files stored in DIRECTORY ending on '.txt' with minimum file size
        textfiles = []
        for root, dirs, files in os.walk(directory):
            for name in files:
                filename = os.path.join(root, name)
                if os.stat(filename).st_size > min_file_size:
                    textfiles.append(filename)
        return textfiles

    directory = 'C:/CompanyFiles'
    minimum_size = 30000
    lof = list_textfiles(directory, minimum_size)

    res = []
    for f in lof:
        res += [[entry.split(':')[1] for entry in cdata] for cdata in [data.splitlines() for data in open(f).read().split('\n\n')]]

    with open('C:/CompanyFiles/Matrix.txt', 'wt') as outfile:
        outfile.write(str(res))

How can I modify my code to achieve the output as stated above?

Answer

This should do the trick:

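    import os

    outFile = 'C:/CompanyFiles/Matrix.txt'
    folder = 'C:/CompanyFiles'

    with open(outFile, 'w') as wfp:
        for f in os.listdir(folder):
            # Skip anything that is not a company file, including Matrix.txt itself
            if not f.endswith('_IFRS.txt'):
                continue
            # Read the file, dropping trailing whitespace and blank lines
            with open(os.path.join(folder, f), 'r') as rfp:
                tmp = [line.rstrip() for line in rfp if line.strip()]
            # Header line: CompanyName-SerialNumber_IssueDate
            arr = tmp[0].split('-')
            arr = [arr[0]] + arr[1].split('_')
            # Remaining lines: expression:frequency
            arr += [t.split(':')[1].strip() for t in tmp[1:]]
            wfp.write(','.join(["'" + e + "'" for e in arr]) + '\n')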
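To make the splitting steps in the code above concrete, here is a small trace on the first sample file from the question. The `sample_lines` list is a hypothetical stand-in for the stripped lines that reading the file would yield, and it assumes the header line contains no space in the company name.

```python
# Hypothetical stand-in for the stripped lines of one company file.
sample_lines = [
    "CompanyABC-123456_31.12.2012",
    "financial statement:4",
    "corporate-taxes:8",
    "assets:2",
    "available-for-sale property:0",
    "auditors:213",
]

# Split the header on the first '-' into company name and the remainder.
name, rest = sample_lines[0].split("-", 1)
# Split the remainder on '_' into serial number and issue date.
fields = [name] + rest.split("_")
# Keep only the number after the colon from every remaining line.
fields += [line.split(":")[1].strip() for line in sample_lines[1:]]

# Wrap each field in single quotes and join with commas.
row = ",".join("'" + f + "'" for f in fields)
print(row)  # 'CompanyABC','123456','31.12.2012','4','8','2','0','213'
```

Only the header line is split on '-', so hyphens inside expression names such as corporate-taxes do not interfere.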

Note: I haven't tested it thoroughly
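Since the snippet above is untested, one way to make it easier to check is to separate the parsing from the file handling. The following is only a sketch under a few assumptions: the function names `parse_company_text` and `write_matrix` and the `suffix` parameter are made up for illustration, every company file ends in _IFRS.txt, and the first non-blank line of each file is the CompanyName-SerialNumber_IssueDate header.

```python
import os


def parse_company_text(text):
    """Turn one file's text into [name, serial, date, freq1, freq2, ...]."""
    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Header: CompanyName-SerialNumber_IssueDate
    name, rest = lines[0].split("-", 1)
    fields = [name] + rest.split("_")
    # rsplit takes the part after the last ':' in case a label contains one.
    fields += [ln.rsplit(":", 1)[1].strip() for ln in lines[1:]]
    return fields


def write_matrix(folder, out_path, suffix="_IFRS.txt"):
    """Write one quoted, comma-separated line per company file in folder."""
    # Filtering on the suffix also skips the output file if it lives in folder.
    names = sorted(n for n in os.listdir(folder) if n.endswith(suffix))
    with open(out_path, "w") as out:
        for n in names:
            with open(os.path.join(folder, n)) as fh:
                fields = parse_company_text(fh.read())
            out.write(",".join("'" + f + "'" for f in fields) + "\n")
    return len(names)
```

With the paths from the question, the call would be `write_matrix('C:/CompanyFiles', 'C:/CompanyFiles/Matrix.txt')`. Because `parse_company_text` takes a plain string, it can be tried on a pasted sample without touching the disk.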

https://en.xdnf.cn/q/119720.html
