How to write Huffman coding to a file using Python?

2024/7/2 15:12:22

I created a Python script to compress text by using the Huffman algorithm. Say I have the following string:

string = 'The quick brown fox jumps over the lazy dog'

Running my algorithm returns the following 'bits':

result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'

By comparing the amount of bits of the result with the input string, the algorithm seems to work:

>>> print len(result), len(string) * 8
194 344

But now comes the question: how do I write this to a file, while still being able to decode it. You can only write to a file per byte, not per bit. By writing the 'codes' as bytes, there is no compression at all!

I am new at computer science, and the online resources just don't cut it for me. All help is much appreciated!

Edit: note that I had my codes something like this (in case of another input string 'xxxxxxxyzz'):

{'y': '00', 'x': '1', 'z': '10'}

The way I create the resulting string is by concatenating these codes in order of the input string:

result = '1111111001010'

How to get back to the original string from this result? Or am I getting this completely wrong? Thank you!

Answer

First you need to convert your input string to bytes:

def _to_Bytes(data):b = bytearray()for i in range(0, len(data), 8):b.append(int(data[i:i+8], 2))return bytes(b)

Then, open a file to write in binary mode:

result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'
with open('test.bin', 'wb') as f:f.write(_to_Bytes(result))

Now, writing the original string to a file, a comparison of bytes can take place:

import os
with open('test_compare.txt', 'a') as f:f.write('The quick brown fox jumps over the lazy dog')_o = os.path.getsize('test_compare.txt')
_c = os.path.getsize('test.bin')
print(f'Original file: {_o} bytes')
print(f'Compressed file: {_c} bytes')
print('Compressed file to about {}% of original'.format(round((((_o-_c)/_o)*100), 0)))

Output:

Original file: 43 bytes
Compressed file: 25 bytes
Compressed file to about 42.0% of original

To get back to the original, you can write a function that determines the possible ordering of characters:

d = {'y': '00', 'x': '1', 'z': '10'}
result = '1111111001010'
from typing import Generator
def reverse_encoding(content:str, _lookup) -> Generator[str, None, None]:while content:_options = [i for i in _lookup if content.startswith(i) and (any(content[len(i):].startswith(b) for b in _lookup) or not content[len(i):])]if not _options:raise Exception("Decoding error")yield _lookup[_options[0]]content = content[len(_options[0]):]print(''.join(reverse_encoding(result, {b:a for a, b in d.items()})))

Output:

'xxxxxxxyzz'
https://en.xdnf.cn/q/73398.html

Related Q&A

How to know if a Python multiprocessing.Lock is released or not?

>>> l = Lock() >>> l.acquire() True >>> l.release() >>> l.release() Traceback (most recent call last):File "<stdin>", line 1, in <module> Value…

How do you ensure a Celery chord callback gets called with failed subtasks?

I am using a Chord in Celery to have a callback that gets called when a Group of parallel tasks finish executing. Specifically, I have a group of functions that wrap calls to an external API. I want to…

Unpacking nested C structs in Python

I am trying to unpack a C struct that is handed to my Python program in binary form and includes another nested struct. The relevant part of the C header looks like this:typedef struct {uint8_t seq;uin…

Remove black borders on images with watermarks in Python

I have a bunch of image I would like to uniformise by removing black borders. Usually I use the Trim function of Imagemagick with the fuzz parameters but in the case the image have some watermark the r…

scipy cdist with sparse matrices

I need to calculate the distances between two sets of vectors, source_matrix and target_matrix.I have the following line, when both source_matrix and target_matrix are of type scipy.sparse.csr.csr_matr…

NumPy arrays with SQLite

The most common SQLite interface Ive seen in Python is sqlite3, but is there anything that works well with NumPy arrays or recarrays? By that I mean one that recognizes data types and does not requir…

Binary Phase Shift Keying in Python

Im currently working on some code to transmit messages/files/and other data over lasers using audio transformation. My current code uses the hexlify function from the binascii module in python to conve…

Django. Listing files from a static folder

One seemingly basic thing that Im having trouble with is rendering a simple list of static files (say the contents of a single repository directory on my server) as a list of links. Whether this is sec…

inconsistent migration history when changing a django apps name

Im trying to rename one of the apps in my django website. There is another app which depends on it and its mysql tables. I went over all the files in both apps and changed the instances of the old name…

Tensorflow vs Numpy math functions

Is there any real difference between the math functions performed by numpy and tensorflow. For example, exponential function, or the max function? The only difference I noticed is that tensorflow take…