Python, can someone guess the type of a file only by its base64 encoding?

2024/9/8 8:50:52

Let's say I have the following:

image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""

This is just a dot image (from https://en.wikipedia.org/wiki/Data_URI_scheme). But suppose I do not know whether it is an image, text, or something else. Is it possible to work out what it is from the encoded string alone? I am trying this in Python, but it is also a general question, so any insight on either is very welcome.

Answer

You can't, at least not without decoding, because the bytes that help identify the filetype are spread across the base64 characters, which don't directly align with whole bytes. Each character encodes 6 bits, which means that for every 4 characters, there are 3 bytes encoded.
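To make the 4-characters-to-3-bytes alignment concrete, here is a minimal sketch that unpacks one base64 group by hand. The helper names `sextets` and `group_to_bytes` are made up for illustration, and the sketch assumes the standard base64 alphabet with no padding in the group:

```python
import base64

# Standard base64 alphabet; a character's index is its 6-bit value.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def sextets(chars):
    """Return the 6-bit value that each base64 character stands for."""
    return [ALPHABET.index(c) for c in chars]


def group_to_bytes(chars):
    """Pack four 6-bit values into 24 bits and split them into 3 bytes."""
    if len(chars) != 4:
        raise ValueError("expected exactly one 4-character group")
    bits = 0
    for value in sextets(chars):
        bits = (bits << 6) | value
    return bits.to_bytes(3, "big")


first_group = "iVBO"                      # first 4 characters of the sample
manual = group_to_bytes(first_group)      # decoded by hand
library = base64.b64decode(first_group)   # decoded by the standard library
print(sextets(first_group), manual, library)
```

No single character in the group corresponds to a whole byte: the second character, for instance, carries the last 2 bits of the first byte and the first 4 bits of the second.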

Identifying a filetype requires access to those bytes in different block sizes. A JPEG image, for example, starts with the bytes FF D8 and ends with FF D9, but that's only two bytes; the third byte that follows is also encoded as part of the same 4-character block.

What you can do is decode just enough of the base64 string to do your filetype fingerprinting. So you can decode the first 4 characters to get the 3 bytes, and then use the first two to see if the object is a JPEG image. A large number of file formats can be identified from just the first or last series of bytes (a PNG image can be identified by the first 8 bytes, a GIF by the first 6, etc.). Decoding just those bytes from the base64 string is trivial.
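A rough sketch of that partial decoding follows. It assumes the string is unwrapped base64 (no line breaks) whose length is a multiple of 4; the `SIGNATURES` table and the functions `decode_head`, `decode_tail` and `sniff` are hypothetical names for this example, and the table is far from exhaustive:

```python
import base64

# Hypothetical, non-exhaustive table of leading signatures ("magic numbers").
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),  # 8-byte PNG signature
    (b"GIF87a", "gif"),             # 6-byte GIF signatures
    (b"GIF89a", "gif"),
    (b"\xff\xd8", "jpeg"),          # JPEG start-of-image marker
]


def decode_head(encoded, n_bytes=32):
    """Decode at least the first n_bytes bytes without decoding everything."""
    n_chars = -(-n_bytes // 3) * 4  # round up to whole 4-character groups
    return base64.b64decode(encoded[:n_chars])


def decode_tail(encoded, n_bytes=8):
    """Decode the last n_bytes bytes; assumes len(encoded) % 4 == 0."""
    # One extra group, because padding may remove up to 2 bytes at the end.
    n_chars = (-(-n_bytes // 3) + 1) * 4
    return base64.b64decode(encoded[-n_chars:])[-n_bytes:]


def sniff(encoded):
    """Return a type name from SIGNATURES, or None if nothing matches."""
    head = decode_head(encoded)
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None


image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""
print(sniff(image_data))        # type guessed from the leading bytes
print(decode_tail(image_data))  # trailing bytes, e.g. for end-of-file markers
```

Slicing on 4-character boundaries is what keeps each slice valid base64 on its own; a slice that cut through a group would decode to garbage or raise an error.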

Your sample is a PNG image; you can test for image types using the imghdr module (part of the standard library up to Python 3.12; it was deprecated in 3.11 and removed in 3.13):

>>> import base64
>>> import imghdr
>>> image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""
>>> # 33 bytes / 3 times 4 is 44 base64 chars
>>> sample = base64.b64decode(image_data[:44])
>>> # Run each imghdr test function on the decoded bytes until one matches
>>> for tf in imghdr.tests:
...     res = tf(sample, None)
...     if res:
...         break
...
>>> print(res)
png

I only used the first 33 bytes from the base64 data, to echo what the imghdr.what() function will read from the file you pass it (it reads 32 bytes, but 32 is not divisible by 3, so I rounded up to the next multiple of 3).

There is an equivalent sndhdr module for sound files, and there is also the python-magic project, which lets you pass in a number of bytes to determine a file type.
