How to divide a binary file into 6-byte blocks in C++ or Python quickly? [closed]

2024/10/8 4:26:15

I’m reading a file in C++ and Python as a binary file. I need to divide the binary into blocks, each 6 bytes. For example, if my file is 600 bytes, the result should be 100 blocks, each 6 bytes.

I have tried struct (in C++ and Python) and array (Python). None of them divide the binary into blocks of 6 bytes. They can only divide the binary into blocks whose sizes are powers of two (1, 2, 4, 8, 16, etc.).

The array approach was very fast, reading 1 GB of binary data as 4-byte blocks in less than a second. The other methods I tried were all extremely slow, taking tens of minutes for just a few megabytes.

How can I read the binary as blocks of 6 bytes as fast as possible? Any help in either C++ or Python will be great. Thank you.

EDIT - The Code:

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

struct Block
{
    char data[6];
};

class BinaryData
{
public:
    void readBinaryFile(string strFile)
    {
        Block block;
        ifstream binaryFile;
        binaryFile.open(strFile, ios::in | ios::binary);   // was ios::out, which is wrong for an ifstream

        // determine the file size
        binaryFile.seekg(0, ios::end);
        int size = (int)binaryFile.tellg();
        binaryFile.seekg(0, ios::beg);
        cout << size << endl;

        while ((int)binaryFile.tellg() < size)
        {
            cout << binaryFile.tellg() << " , " << size << " , "
                 << size - (int)binaryFile.tellg() << endl;
            binaryFile.read(block.data, sizeof(block.data));
            cout << block.data << endl;   // note: block.data is not null-terminated
        }
        binaryFile.close();
    }
};

Notes :

  • In the file, the numbers are stored in big-endian byte order.
  • The goal is to read the numbers as fast as possible and then sort them in ascending order.
Answer

Let's start simple, then optimize.

Simple Loop

uint8_t  array1[6];
while (my_file.read((char *) &array1[0], 6))
{
    Process_Block(&array1[0]);
}

The above code reads the file 6 bytes at a time and sends each block to a function.
It meets the requirements, but it is not very efficient.
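
Since the notes on the question say the values are big-endian and need to be sorted afterwards, here is one possible Process_Block (the name is just a placeholder used throughout this answer): a minimal sketch that decodes each 6-byte block into a 64-bit integer and collects the values for sorting later.

#include <cstdint>
#include <vector>

std::vector<uint64_t> values;   // collected here so they can be sorted afterwards

// One possible Process_Block: decode 6 big-endian bytes into a 64-bit integer.
void Process_Block(uint8_t const * block)
{
    uint64_t value = 0;
    for (int i = 0; i < 6; ++i)
    {
        value = (value << 8) | block[i];   // most significant byte first
    }
    values.push_back(value);
}

After the whole file has been read, std::sort(values.begin(), values.end()) (from <algorithm>) gives the ascending order the question asks for.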

Reading Larger Blocks

Files are streaming devices. They have an overhead to start streaming, but are very efficient once the streaming is under way. In other words, we want to read as much data per transaction as possible to reduce the overhead.

static const unsigned int CAPACITY = 6 * 1024;
uint8_t block1[CAPACITY];
// read() sets failbit on a short final read, so also check gcount()
while (my_file.read((char *) &block1[0], CAPACITY) || my_file.gcount() > 0)
{
    const size_t bytes_read = my_file.gcount();
    size_t blocks_read = bytes_read / 6;    // not const: decremented below
    uint8_t const * block_pointer = &block1[0];
    while (blocks_read > 0)
    {
        Process_Block(block_pointer);
        block_pointer += 6;
        --blocks_read;
    }
}

The above code reads up to 1024 blocks in one transaction. After reading, each block is sent to a function for processing.

This version is more efficient than the Simple Loop, as it reads more data per transaction. Adjust the CAPACITY to find the optimal size on your platform.

Loop Unrolling

The previous code reduces the first bottleneck of input transfer speed (although there is still room for optimization). Another technique is to reduce the overhead of the processing loop by performing more data processing inside the loop. This is called loop unrolling.

const size_t bytes_read = my_file.gcount();
size_t blocks_read = bytes_read / 6;    // not const: modified below
uint8_t const * block_pointer = &block1[0];
while ((blocks_read / 4) != 0)
{
    // process four blocks per iteration to reduce loop overhead
    Process_Block(block_pointer);  block_pointer += 6;
    Process_Block(block_pointer);  block_pointer += 6;
    Process_Block(block_pointer);  block_pointer += 6;
    Process_Block(block_pointer);  block_pointer += 6;
    blocks_read -= 4;
}
while (blocks_read > 0)
{
    // handle the remaining 0-3 blocks
    Process_Block(block_pointer);
    block_pointer += 6;
    --blocks_read;
}

You can adjust the number of operations in the loop to see how it affects your program's speed.

Multi-Threading & Multiple Buffers

Two more techniques for speeding up the reading of the data are multiple threads and multiple buffers.

One thread, an input thread, reads the file into a buffer. After reading into the first buffer, the thread sets a semaphore indicating there is data to process. The input thread reads into the next buffer. This repeats until the data is all read. (For a challenge, figure out how to reuse the buffers and notify the other thread of which buffers are available).

The second thread is the processing thread. This processing thread is started first and waits for the first buffer to be completely read. After the buffer has the data, the processing thread starts processing the data. After the first buffer has been processed, the processing thread starts on the next buffer. This repeats until all the buffers have been processed.

The goal here is to use as many buffers as necessary to keep the processing thread running and not waiting.
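
Here is a minimal sketch of that scheme with two buffers, using std::thread and a condition variable in place of a semaphore. CAPACITY and Process_Block are carried over from the examples above; the file name input.bin and everything else are illustrative assumptions, not a definitive implementation.

#include <condition_variable>
#include <cstdint>
#include <fstream>
#include <mutex>
#include <thread>

static const unsigned int CAPACITY = 6 * 1024;

void Process_Block(uint8_t const * block);   // same placeholder as above

struct Buffer
{
    uint8_t data[CAPACITY];
    size_t  bytes = 0;
    bool    ready = false;   // filled by the reader, not yet processed
    bool    done  = false;   // set when the reader reaches end of file
};

int main()
{
    Buffer buffers[2];
    std::mutex m;
    std::condition_variable cv;

    // Input thread: alternately fills the two buffers.
    std::thread reader([&]() {
        std::ifstream my_file("input.bin", std::ios::binary);  // hypothetical file name
        int index = 0;
        for (;;)
        {
            Buffer & buf = buffers[index];
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return !buf.ready; });     // wait until processed
            }
            my_file.read((char *) buf.data, CAPACITY);
            buf.bytes = (size_t) my_file.gcount();
            buf.done  = (buf.bytes == 0);
            {
                std::lock_guard<std::mutex> lock(m);
                buf.ready = true;                              // hand buffer over
            }
            cv.notify_all();
            if (buf.done) break;
            index = 1 - index;                                 // switch buffers
        }
    });

    // Processing thread (here: the main thread).
    int index = 0;
    for (;;)
    {
        Buffer & buf = buffers[index];
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return buf.ready; });          // wait for data
        }
        if (buf.done) break;
        size_t blocks = buf.bytes / 6;
        uint8_t const * p = buf.data;
        for (size_t i = 0; i < blocks; ++i, p += 6)
        {
            Process_Block(p);
        }
        {
            std::lock_guard<std::mutex> lock(m);
            buf.ready = false;                                 // hand buffer back
        }
        cv.notify_all();
        index = 1 - index;
    }

    reader.join();
    return 0;
}

With only two buffers, the reader and the processor strictly alternate; the challenge mentioned above (reusing more buffers) generalizes this to a ring of N buffers.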

Edit 1: Other techniques

Memory Mapped Files

Some operating systems support memory mapped files. The OS reads a portion of the file into memory. When a location outside the memory is accessed, the OS loads another portion into memory. Whether this technique improves performance needs to be measured (profiled).
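
As a rough illustration, here is a minimal POSIX sketch, assuming a Linux or similar target (on Windows the equivalent would be CreateFileMapping/MapViewOfFile). Process_Block is the same placeholder as above.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstddef>

void Process_Block(uint8_t const * block);   // same placeholder as above

void process_mapped_file(const char * path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    struct stat sb;
    if (fstat(fd, &sb) == 0 && sb.st_size > 0)
    {
        // Map the whole file read-only; the OS pages it in on demand.
        void * mapping = mmap(nullptr, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapping != MAP_FAILED)
        {
            uint8_t const * bytes = (uint8_t const *)mapping;
            const size_t blocks = (size_t)sb.st_size / 6;
            for (size_t i = 0; i < blocks; ++i)
            {
                Process_Block(bytes + i * 6);   // no explicit read() calls needed
            }
            munmap(mapping, (size_t)sb.st_size);
        }
    }
    close(fd);
}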

Parallel Processing & Threading

Adding multiple threads may show negligible performance gain. Computers have a data bus (data highway) connecting many hardware devices, including memory, file I/O and the processor. Devices are paused to let other devices use the data highway. With multiple cores or processors, one processor may have to wait while another processor is using the data highway. Because of this waiting, multiple threads or parallel processing may yield negligible performance gain. Also, the operating system has overhead when constructing and maintaining threads.
