Reading direct access binary file format in Python

2024/9/24 11:31:38

Background:

A binary file is read on a Linux machine using the following Fortran code:

        parameter(nx=720, ny=360, nday=365)
c dimension tmax(nx,ny,nday),nmax(nx,ny,nday)dimension tmin(nx,ny,nday),nmin(nx,ny,nday)
c open(10,&file='FILE',&access='direct',recl=nx*ny*4)
cdo k=1,ndayread(10,rec=(k-1)*4+1)((tmax(i,j,k),i=1,nx),j=1,ny) read(10,rec=(k-1)*4+2)((nmax(i,j,k),i=1,nx),j=1,ny) read(10,rec=(k-1)*4+3)((tmin(i,j,k),i=1,nx),j=1,ny) read(10,rec=(k-1)*4+4)((nmin(i,j,k),i=1,nx),j=1,ny) end do

File Details:

options  little_endian
title global daily analysis (grid box mean, the grid shown is the center of the grid box)
undef -999.0
xdef 720 linear    0.25 0.50
ydef 360  linear -89.75 0.50
zdef 1 linear 1 1
tdef 365 linear 01jan2015 1dy
vars 4
tmax     1  00 daily maximum temperature (C)
nmax     1  00 number of reports for maximum temperature (C)
tmin     1  00 daily minimum temperature (C)
nmin     1  00 number of reports for minimum temperature (C)
ENDVARS

Attempts at a solution:

I am trying to parse this into an array in python using the following code (purposely leaving out two attributes):

with gzip.open("/FILE.gz", "rb") as infile:data = numpy.frombuffer(infile.read(), dtype=numpy.dtype('<f4'), count = -1)while x <= len(data) / 4:tmax.append(data[(x-1)*4])tmin.append(data[(x-1)*4 + 2])x += 1data_full = zip(tmax, tmin)

When testing some records, the data does not seem to line up with some sample records from the file when using Fortran. I have also tried dtype=numpy.float32 as well with no success. It seems as though I am reading the file in correctly in terms of number of observations though. I was also using struct before I learned the file was created with Fortran. That was not working

There are similar questions out here, some of which have answers that I have tried adapting with no luck.

UPDATE

I am a little bit closer after trying out this code:

#Define numpy variables and empty arrays
nx = 720 #number of lon
ny = 360 #number of lat
nday = 0 #iterate up to 364 (or 365 for leap year)   
tmax = numpy.empty([0], dtype='<f', order='F')
tmin = numpy.empty([0], dtype='<f', order='F')#Parse the data into numpy arrays, shifting records as the date increments
while nday < 365:tmax = numpy.append(tmax, data[(nx*ny)*nday:(nx*ny)*(nday + 1)].reshape((nx,ny), order='F'))tmin = numpy.append(tmin, data[(nx*ny)*(nday + 2):(nx*ny)*(nday + 3)].reshape((nx,ny), order='F'))nday += 1  

I get the correct data for the first day, but for the second day I get all zeros, the third day the max is lower than the min, and so on.

Answer

While the exact format of Fortran binary files is compiler dependent, in all cases I'm aware of direct access files (files opened with access='direct' as in this question) do not have any record markers between records. Each record is of a fixed size, as given by the recl= specifier in the OPEN statement. That is, the record N starts at offset (N - 1) * RECL bytes in the file.

One portability gotcha is that the unit of the recl= is in terms of file storage units. For most compilers, the file storage unit specifies the size in 8-bit octets (as recommended in recent versions of the Fortran standard), but for the Intel Fortran compiler, recl= is in units of 32 bits; there is a commandline option -assume byterecl which can be used to make Intel Fortran match most other compilers.

So in the example given here and assuming a 8-bit file storage unit, your recl would be 1036800 bytes.

Further, looking at the code, it seems to assume the arrays are of a 4-byte type (e.g. integer or single precision real). So if it's single precision real, and the file has been created in little endian, then the numpy dtype <f4 that you have used seems to be the correct choice.

Now, getting back to the Intel Fortran compiler gotcha, if the file has been created by ifort without -assume byterecl then the data you want will be in the first quarter of each record, with the rest being padding (all zeros or maybe even random data?). Then you'll have to do some extra gymnastics to extract the correct data in python and not the padding. It should be easy to check this by checking the size of the file, is it nx * ny * 4 * nday *4 or is it nx * ny * 4 * nday * 4 * 4 bytes?

https://en.xdnf.cn/q/71706.html

Related Q&A

Fabric/Python: AttributeError: NoneType object has no attribute partition

Have the following function in fabric for adding user accounts.~/scripts #fab -lPython source codeAvailable commands:OS_TYPEadduser_createcmd Create command line for adding useradduser_getinfo Prom…

Python object @property

Im trying to create a point class which defines a property called "coordinate". However, its not behaving like Id expect and I cant figure out why. class Point:def __init__(self, coord=None…

Tkinter - Inserting text into canvas windows

I have a Tkinter canvas populated with text and canvas windows, or widgets, created using the create_text and create_window methods. The widgets I place on the canvas are text widgets, and I want to in…

How to run nosetests without showing of my matplotlibs graph?

I try to run my test without any messages displaying from my main program. I only want verbose messages from nosetests to display.For example: nosetests -v --nologcaptureAll of my printout messages fro…

Variable name to string in Python

I have written some code in Python and I would like to save some of the variables into files having the name of the variable. The following code:variableName1 = 3.14 variableName2 = 1.09 save_variable_…

In Python, how to print FULL ISO 8601 timestamp, including current timezone

I need to print the FULL local date/time in ISO 8601 format, including the local timezone info, eg:2007-04-05T12:30:00.0000-02:00I can use datetime.isoformat() to print it, if I have the right tzinfo o…

TextVariable not working

I am trying to get the Text out of an Entry widget in Tkinter. It works with Entry1.get(), but it does not work using textvariableWhat am I doing wrong ? from Tkinter import * master = Tk() v = String…

Is there a Python module to get next runtime from a crontab-style time definition?

Im writing a dashboard application and I need a way to figure out how long an item is "valid", i.e. when should it have been superseded by a new value (its possible to have an error such that…

Django Generic Relations error: cannot resolve keyword content_object into field

Im using Djangos Generic Relations to define Vote model for Question and Answer models. Here is my vote model:models.pyclass Vote(models.Model):user_voted = models.ForeignKey(MyUser)is_upvote = models.…

How to benchmark unit tests in Python without adding any code

I have a Python project with a bunch of tests that have already been implemented, and Id like to begin benchmarking them so I can compare performance of the code, servers, etc over time. Locating the …