Is it possible to restore corrupted “interned” bytes-objects?

2024/10/13 19:21:05

It is well known that small bytes-objects are automatically "interned" by CPython (similar to the intern-function for strings). Correction: As explained by @abarnert, it is more like the integer-pool than the interned strings.
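The pool-like behaviour can be observed with identity checks. This is a minimal sketch that assumes CPython; `pool_identity_demo` is an illustrative name, and the results are implementation details rather than language guarantees, so no particular output is promised.

```python
def pool_identity_demo(first=b"abcdef", second=b"xabcyz"):
    """Compare object identity for equal values built independently at run time."""
    # Two small ints parsed from text: CPython keeps a pool of small ints.
    small_a, small_b = int("100"), int("100")

    # Two one-byte slices (both b"a") taken from unrelated bytes objects.
    byte_a = first[0:1]
    byte_b = second[1:2]

    # Two longer slices (both b"abc") for contrast.
    long_a = first[0:3]
    long_b = second[1:4]

    return {
        "small_int_shared": small_a is small_b,
        "single_byte_shared": byte_a is byte_b,
        "longer_bytes_shared": long_a is long_b,
    }


print(pool_identity_demo())
```

The inputs are passed as arguments rather than sliced from literals inside the function, so that the compiler cannot fold the slices into shared constants and skew the comparison.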

Is it possible to restore the interned bytes-objects after they have been corrupted by, let's say, an "experimental" third-party library, or is restarting the kernel the only way?

The proof of concept can be done with Cython-functionality (Cython>=0.28):

%%cython
def do_bad_things():
    cdef bytes b = b'a'
    cdef const unsigned char[:] safe = b
    cdef char *unsafe = <char *> &safe[0]   # who needs const and type-safety anyway?
    unsafe[0] = 98                          # replace through `b`

or as suggested by @jfs through ctypes:

import ctypes
import sys
def do_bad_things():
    b = b'a'
    (ctypes.c_ubyte * sys.getsizeof(b)).from_address(id(b))[-2] = 98
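The index -2 in the ctypes variant can be checked without writing to memory. The sketch below assumes CPython, where id() is the object's address and sys.getsizeof() of a bytes object covers the header, the payload, and one trailing NUL byte. `tail_bytes` is an illustrative helper name, and it only reads.

```python
import ctypes
import sys


def tail_bytes(obj, count=2):
    """Return the last `count` raw bytes of a bytes object's memory (read-only)."""
    if sys.implementation.name != "cpython":
        raise RuntimeError("id() is only a memory address on CPython")
    size = sys.getsizeof(obj)
    raw = (ctypes.c_ubyte * size).from_address(id(obj))
    # Slicing a c_ubyte array gives a list of ints; turn it back into bytes.
    return bytes(raw[size - count:size])


sample = bytes(bytearray(b"xyz"))   # fresh object built at run time
print(tail_bytes(sample))
```

Under these assumptions the last byte is the NUL terminator and the one before it is the final payload byte, which is why writing to [-2] on a one-byte object overwrites its only character.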

Obviously, by misusing C-functionality, do_bad_things changes the immutable (or so CPython thinks) object b'a' to b'b', and because this bytes-object is "interned", we can see bad things happen afterwards:

>>> do_bad_things() #b'a' means now b'b'
>>> b'a'==b'b'  #wait for a surprise  
True
>>> print(b'a') #another one
b'b'

Is it possible to restore/clear the bytes-object-pool, so that b'a' means b'a' once again?


A little side note: It seems as if not every bytes-creation process is using this pool. For example:

>>> do_bad_things()
>>> print(b'a')
b'b'
>>> print((97).to_bytes(1, byteorder='little')) #ord('a')=97
b'a'
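To see which creation paths share an object on a given interpreter, one can compare identities directly. This is a hedged sketch: `same_object_as_slice` is an illustrative name, and the outcome depends on the CPython version, so it reports what it finds instead of assuming a result.

```python
def same_object_as_slice(value=97):
    """Report which ways of building a one-byte object reuse the slice result."""
    source = bytes(bytearray([value, 0]))   # two bytes, built at run time
    reference = source[0:1]                 # one-byte slice used as the baseline

    candidates = {
        "second slice": source[0:1],
        "int.to_bytes": value.to_bytes(1, byteorder="little"),
        "bytes([value])": bytes([value]),
        "bytes(bytearray)": bytes(bytearray([value])),
    }
    # True means the candidate is the very same object as the baseline slice.
    return {name: obj is reference for name, obj in candidates.items()}


print(same_object_as_slice())
```

A path that reports False builds its own object, so it would keep printing b'a' even after the shared object has been overwritten.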

Answer

Python 3 doesn't intern bytes objects the way it does str. Instead, it keeps a static array of them the way it does with int.

This is very different under the covers. On the down side, it means there's no table (with an API) to be manipulated. On the up side, it means that if you can find the static array, you can fix it, the same way you would for ints, because the array index and the character value of the string are supposed to be identical.

If you look in bytesobject.c, the array is declared at the top:

static PyBytesObject *characters[UCHAR_MAX + 1];

… and then, for example, within PyBytes_FromStringAndSize:

if (size == 1 && str != NULL &&
    (op = characters[*str & UCHAR_MAX]) != NULL)
{
#ifdef COUNT_ALLOCS
    one_strings++;
#endif
    Py_INCREF(op);
    return (PyObject *)op;
}
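The lookup can be modelled in pure Python to show why overwriting one cached object affects every later request for that byte. This is a toy model, not CPython code: `SingleByteTable` is a made-up name, a mutable bytearray stands in for the C struct, and the lazy filling of empty slots is an assumption about how the table gets populated.

```python
UCHAR_MAX = 255


class SingleByteTable:
    """Toy model of a table of one-byte objects indexed by byte value."""

    def __init__(self):
        self.characters = [None] * (UCHAR_MAX + 1)

    def from_string_and_size(self, data):
        # Fast path: a one-byte request whose slot is already filled.
        if len(data) == 1:
            cached = self.characters[data[0] & UCHAR_MAX]
            if cached is not None:
                return cached
        # Slow path: build a new object.
        obj = bytearray(data)
        if len(data) == 1:
            self.characters[data[0] & UCHAR_MAX] = obj   # assumed lazy fill
        return obj


table = SingleByteTable()
first = table.from_string_and_size(b"a")
second = table.from_string_and_size(b"a")
print(first is second)                     # True: both come from slot 97
first[0] = 98                              # "corrupt" the shared object
print(table.from_string_and_size(b"a"))    # bytearray(b'b')
```

The slot index is still 97 after the corruption, which is what makes the repair possible: the index records what the object is supposed to contain.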

Notice that the array is static, so it's not accessible from outside this file, and that it's still refcounting the objects, so callers (even internal stuff in the interpreter, much less your C API extension) can't tell that there's anything special going on.

So, there's no "correct" way to clean this up.

But if you want to get hacky…

If you have a reference to any of the single-char bytes, and you know which character it was supposed to be, you can get to the start of the array and then clean up the whole thing.

Unless you've screwed up even more than you think, you can just construct a one-char bytes and subtract the character it was supposed to be. PyBytes_FromStringAndSize("a", 1) is going to return the object that's supposed to be 'a', even if it happens to actually hold 'b'. How do we know that? Because that's exactly the problem that you're trying to fix.

Actually, there are probably ways you could break things even worse… which all seem very unlikely, but to be safe, let's use a character you're less likely to have broken than a, like \x80:

PyBytesObject *byte80 = (PyBytesObject *)PyBytes_FromStringAndSize("\x80", 1);
PyBytesObject *characters = byte80 - 0x80;

The only other caveat is that if you try to do this from Python with ctypes instead of from C code, it would require some extra care [1], but since you're not using ctypes, let's not worry about that.

So, now we have a pointer to characters, we can walk it. We can't just delete the objects to "unintern" them, because that will hose anyone who has a reference to any of them, and probably lead to a segfault. But we don't have to. Any object that's in the table, we know what it's supposed to be—characters[i] is supposed to be a one-char bytes whose one character is i. So just set it back to that, with a loop something like this:

for (size_t i = 0; i <= UCHAR_MAX; i++) {
    if (characters[i]) {
        // do the same hacky stuff you did to break the string in the first place,
        // this time writing the byte value i back into characters[i]
    }
}
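For readers who want to try the same idea from Python, here is a hedged ctypes sketch. It is not the C approach above. Because `characters` is declared as an array of pointers, the one-byte objects are separate allocations, so the sketch does not try to locate the array. Instead it asks for each one-byte object in turn by slicing a freshly built two-byte object. That slicing returns the shared object is an assumption, as is the memory layout (payload byte directly before the trailing NUL). `repair_single_byte_objects` is an illustrative name.

```python
import ctypes
import sys


def repair_single_byte_objects():
    """Write byte value i back into the shared one-byte object for i, if needed."""
    if sys.implementation.name != "cpython":
        raise RuntimeError("relies on id() being a memory address (CPython)")
    repaired = []
    for value in range(256):
        # Built from integers at run time, so no (possibly corrupted)
        # bytes literal is involved.
        source = bytes(bytearray([value, 0]))
        shared = source[0:1]              # assumed to return the shared object
        size = sys.getsizeof(shared)
        raw = (ctypes.c_ubyte * size).from_address(id(shared))
        if raw[size - 2] != value:        # payload byte sits before the NUL
            raw[size - 2] = value         # only written when it is wrong
            repaired.append(value)
    return repaired


print(repair_single_byte_objects())       # byte values that had to be fixed
```

On an uncorrupted interpreter the function writes nothing and returns an empty list. It does not reset a hash that was cached while an object held the wrong byte, so dictionaries or sets keyed on such an object may still misbehave.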

That's all there is to it.


Well, except for compilation [2].

Fortunately, at the interactive interpreter, each complete top-level statement is its own compilation unit, so… you should be OK with any new line you type after running the fix.

But a module you've imported, that had to be compiled, while you had the broken strings? You've probably screwed up its constants. And I can't think of a good way to clean this up except to forcibly recompile and reimport every module.
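One way to force a recompile of a single module is to compile its source text again and execute the result in the existing module namespace, which bypasses any cached bytecode. This is a minimal sketch that assumes the module's loader can supply source; `recompile_in_place` is an illustrative name, not a standard-library function.

```python
def recompile_in_place(module):
    """Recompile a module from its source text and re-execute it in place."""
    spec = getattr(module, "__spec__", None)
    loader = getattr(spec, "loader", None)
    get_source = getattr(loader, "get_source", None)
    source = get_source(module.__name__) if get_source else None
    if source is None:
        raise ImportError(f"no source available for {module.__name__!r}")
    # compile() builds fresh constants from the source text.
    code = compile(source, spec.origin or "<unknown>", "exec")
    exec(code, module.__dict__)
    return module
```

Typical use would be `recompile_in_place(some_module)` for each affected module after the shared objects have been repaired. Objects created from the old code, such as functions or classes already referenced elsewhere, are not replaced by this, and bytecode cache files written during the broken session may need to be deleted as well.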


1. The compiler might turn your b'\x80' argument into the wrong thing before it even gets to the C call. And you'd be surprised at all the places you think you're passing around a c_char_p and it's actually getting magically converted to and from bytes. Probably better to use a POINTER(c_uint8).

2. If you compiled some code with b'a' in it, the consts array should have a reference to b'a', which will get fixed. But, since bytes are known immutable to the compiler, if it knows that b'a' == b'b', it may actually store the pointer to the b'b' singleton instead, for the same reason that 123456 is 123456 is true, in which case fixing b'a' may not actually solve the problem.
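To inspect which bytes constants a piece of compiled code holds, the code object's co_consts tuple can be examined. This is a small sketch; `bytes_constants` is an illustrative helper name.

```python
def bytes_constants(source):
    """Return the bytes constants stored in the compiled code object."""
    code = compile(source, "<demo>", "exec")
    return [const for const in code.co_consts if isinstance(const, bytes)]


print(bytes_constants("x = b'a'\ny = b'b'"))
```

Comparing id() of these entries with id() of a freshly obtained one-byte object is one way to check whether a given compilation unit references the shared object or a separate copy.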

https://en.xdnf.cn/q/69498.html
