Python - Transpose columns to rows within data operation and before writing to file

2024/10/10 16:22:16

I have developed a public, open-source App for Splunk (Nmon performance monitor for Unix and Linux Systems, see https://apps.splunk.com/app/1753/).

A central piece of the App is an old Perl script (recycled, modified and updated) automatically launched by the App to convert the Nmon data (which is a kind of custom csv): it reads from stdin and writes out properly formatted csv files, one per section (a section is a performance monitor).

I now want to fully rewrite this script in Python, and a first beta version is almost done... BUT I am facing difficulties transposing the data, and I'm afraid I won't be able to solve this myself.

This is why I am kindly asking for help today.

Here is the difficulty in detail:

Nmon generates performance monitors for various sections (cpu, memory, disks...). For many of them there is no big difficulty beyond extracting the right timestamp and so on. But every section that has a notion of "device" (such as DISKBUSY in the provided example, which represents the percentage of time disks were busy) has to be transformed and transposed to be exploitable later.
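To illustrate the relationship between these records, here is a hedged sketch (the two-line excerpt and device names are hypothetical, not taken from a real nmon file) of how a ZZZZ timestamp line can be matched to the DISKBUSY snapshot that carries the same Txxxx id:

```python
import re

# Hypothetical minimal excerpt: ZZZZ lines carry the snapshot timestamp,
# DISKBUSY lines carry per-device values keyed by the same Txxxx id.
lines = [
    "DISKBUSY,Disk %Busy,sda,sda1",
    "ZZZZ,T0001,11:10:44,26-JUL-2014",
    "DISKBUSY,T0001,4.4,0.0",
]

timestamps = {}  # Txxxx id -> "date time"
rows = []

for line in lines:
    zzzz = re.match(r'^ZZZZ,(T\d+),(.+),(.+)', line)
    if zzzz:
        snapshot, time_, date = zzzz.groups()
        timestamps[snapshot] = date + ' ' + time_
    perf = re.match(r'^DISKBUSY,(T\d+),(.+)', line)
    if perf:
        snapshot, values = perf.groups()
        # Emit the data line with the resolved timestamp prepended
        rows.append(timestamps[snapshot] + ',' + values)

print(rows)  # ['26-JUL-2014 11:10:44,4.4,0.0']
```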

Currently, I am able to generate the data as follows:

Example:

time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0

The goal is to transpose the data so that the header becomes "time,device,value", for example:

time,device,value
26-JUL-2014 11:10:44,sda,4.4
26-JUL-2014 11:10:44,sda1,0.0
26-JUL-2014 11:10:44,sda2,0.0

And so on.
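A minimal sketch of this wide-to-long transposition (shown here in Python 3 syntax for brevity; under Python 2, as the script requires, `io.StringIO` would be `StringIO.StringIO` and `print` is a statement):

```python
import csv
import io

# Sample wide data: a shortened version of the example above
wide = """time,sda,sda1,sda2
26-JUL-2014 11:10:44,4.4,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0
"""

rows = []
for record in csv.DictReader(io.StringIO(wide)):
    time_ = record.pop('time')               # keep the timestamp aside
    for device, value in sorted(record.items()):
        rows.append((time_, device, value))  # one row per (time, device) pair

print(rows[0])  # ('26-JUL-2014 11:10:44', 'sda', '4.4')
```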

One month ago, I opened a question for almost the same need (for another app and not exactly the same data, but the same need to transpose columns to rows):

Python - CSV time oriented Transposing large number of columns to rows

I received a very good answer that perfectly did the trick, but I am unable to recycle that piece of code into this new context. One difference is that I want to include the data transposition within the code itself, so that the script works only in memory and avoids dealing with multiple temporary files.

Here is the piece of code:

Note: requires Python 2.x

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################

dynamic_section = ["DISKBUSY"]

for section in dynamic_section:

    # Set output file
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open output for writing
    with open(currsection_output, "w") as currsection:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match(myregex, line)

            if find_section:

                # csv header

                # Replace some symbols
                line = re.sub("%", '_PCT', line)
                line = re.sub(" ", '_', line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search(myregex, line)

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match(r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        currsection.write('time' + ',' + header + '\n')

                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'

                # For Nmon V10 and more
                timestamp_match = re.match(r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)

                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less
                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match(r'^ZZZZ\,(.+)\,(.+)\n', line)

                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match(myregex, line)

                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n')

        # End for

    # Open output for reading and show number of lines we extracted
    with open(currsection_output, "r") as currsection:
        num_lines = sum(1 for line in currsection)
        print (section + " section: Wrote", num_lines, "lines")

# End for

The line:

                currsection.write('time' + ',' + header + '\n')

will contain the header

And the line:

            currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n')

contains the data line by line

Note: the final data (header and body) should in the end also contain other information; to simplify things I removed it from the code above.

For static sections, which do not require the data transposition, the same lines are:

                    currsection.write('type' + ',' + 'serialnum' + ',' + 'hostname' + ',' + 'time' + ',' + header + '\n')

And:

                currsection.write(section + ',' + SN + ',' + HOSTNAME + ',' + ZZZZ_timestamp + ',' + perfdata + '\n')

The main goal would be to transpose the data right after it has been extracted and before writing it out.

Also, performance and minimal system resource usage (such as working with temporary files instead of memory) are a requirement, to avoid generating too high a cpu load on the systems that run the script periodically.
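One way to reconcile the two constraints (low memory footprint, no stray files left on disk) is Python's tempfile module. A hedged sketch with inline sample data, using a spooled temporary file that stays in memory until it grows past a threshold:

```python
import csv
import tempfile

# Sketch: stage the extracted wide csv in a spooled temporary file (kept in
# memory below max_size, spilled to disk above it), then rewind and stream
# it through the transpose step, so the whole dataset never has to sit in
# a Python list.
with tempfile.SpooledTemporaryFile(mode='w+', max_size=1024 * 1024) as tempf:
    tempf.write('time,sda,sda1\n')
    tempf.write('26-JUL-2014 11:10:44,4.4,0.0\n')
    tempf.seek(0)  # rewind before reading the staged csv back

    long_rows = []
    for record in csv.DictReader(tempf):
        time_ = record.pop('time')
        for device, value in sorted(record.items()):
            long_rows.append([time_, device, value])

print(long_rows)
```

The file is deleted automatically when the `with` block exits, so nothing is left behind on disk.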

Could anyone help me achieve this? I have searched again and again; I am pretty sure there are multiple ways to achieve it (zip, map, dictionary, list, split...), but I have failed so far...

Please be indulgent, this is my first real Python script :-)

Thank you very much for any help !

More details:

  • testing nmon file

A small testing nmon file can be retrieved here: http://pastebin.com/xHLRbBU0

  • Current complete script

The current complete script can be retrieved here: http://pastebin.com/QEnXj6Yh

To test the script, it is required to:

  • export the SPLUNK_HOME variable to anything relevant for you, e.g.:

    mkdir /tmp/nmon2csv

--> place the script and the nmon file here, and make the script executable

export SPLUNK_HOME=/tmp/nmon2csv
mkdir -p etc/apps/nmon

And finally:

cat test.nmon | ./nmon2csv.py

Data will be generated in /tmp/nmon2csv/etc/apps/nmon/var/*

Update: working code using the csv module:

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################

dynamic_section = ["DISKBUSY", "DISKBSIZE", "DISKREAD", "DISKWRITE", "DISKXFER", "DISKRIO", "DISKWRIO", "IOADAPT", "NETERROR", "NET", "NETPACKET", "JFSFILE", "JFSINODE"]

for section in dynamic_section:

    # Set output file (will be opened after transpose)
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open Temp
    with TemporaryFile() as tempf:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match(myregex, line)

            if find_section:

                # csv header

                # Replace some symbols
                line = re.sub("%", '_PCT', line)
                line = re.sub(" ", '_', line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search(myregex, line)

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match(r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        tempf.write('time' + ',' + header + '\n')

                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'

                # For Nmon V10 and more
                timestamp_match = re.match(r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)

                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less
                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match(r'^ZZZZ\,(.+)\,(.+)\n', line)

                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match(myregex, line)

                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    tempf.write(ZZZZ_timestamp + ',' + perfdata + '\n')

        # Open final for writing
        with open(currsection_output, "w") as currsection:

            # Rewind temp
            tempf.seek(0)

            writer = csv.writer(currsection)
            writer.writerow(['type', 'serialnum', 'hostname', 'time', 'device', 'value'])

            for d in csv.DictReader(tempf):
                time = d.pop('time')
                for device, value in sorted(d.items()):
                    row = [section, SN, HOSTNAME, time, device, value]
                    writer.writerow(row)

    # End for

    # Open output for reading and show number of lines we extracted
    with open(currsection_output, "r") as currsection:
        num_lines = sum(1 for line in currsection)
        print (section + " section: Wrote", num_lines, "lines")

# End for
Answer

The goal is to transpose the data such as we will have in the header "time,device,value"

This rough transposition logic looks like this:

text = '''time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
'''

import csv

for d in csv.DictReader(text.splitlines()):
    time = d.pop('time')
    for device, value in sorted(d.items()):
        print time, device, value

Putting it all together into a complete script looks something like this:

import csv

with open('transposed.csv', 'wb') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['time', 'device', 'value'])
    with open('data.csv', 'rb') as sourcefile:
        for d in csv.DictReader(sourcefile):
            time = d.pop('time')
            for device, value in sorted(d.items()):
                row = [time, device, value]
                writer.writerow(row)