
> But Python code (even buggy Python code!) is not supposed to be able to trigger C-level undefined behavior, that’s part of the point of using a high-level language like Python.

Is this really true when you’re directly modifying the underlying data structure of a python object? That can’t really be considered “pure python” anymore, and it seems like there would be LOTS of ways to use it to cause shenanigans.



Rust's answer here would be yes. The code which made this possible is faulty.

For example, you can trivially reintroduce C-style undefined behaviour on buffer overflow by writing a Rust type which implements Index (and optionally IndexMut) using unsafe to just not bother with a bounds check. Rust says this type is wrong, it should not do this, that's a bug -- even if your program using it has no buffer mistakes, the type is wrong anyway.

This is a fact of Rust's culture rather than a technical implication of the Rust language design, albeit they're closely related.


This is one thing that I appreciate in the Rust culture: it goes back to the days of Object Pascal/Modula-2, or the managed languages, where we build a working solution first, and only if it doesn't meet the performance requirements do we go through an optimization phase, possibly with somewhat hacky code that disables bounds checks and such ({$R-} on TP since forever).

C++ culture was also a bit like that in the early days, when we discussed safer abstractions over C code on Usenet; somehow that got lost along the way as more C folks migrated into it.


It’s good that Rust has this culture. But the flip side is that issues that wouldn’t even be bugs in other languages have CVEs filed for them in Rust. A layman looking at the Rust language thinks it looks very insecure. Example from a couple weeks ago - https://news.ycombinator.com/item?id=33151963


The problem, and this goes back as far as I can remember, is that C folks use the remaining 30% of bug causes to dismiss any language that is sold as improving safety in systems programming.

It is Rust today, it was Ada, Object Pascal and Modula-2 in the early 80's, and anything in between since then.


It seems like rust has passed some tipping point to wider adoption that I haven't seen in any of those other languages. Yes, flight safety stuff went to ada, but much of that has now gone to modern c++ subsets with lots of testing. And car safety stuff has been misra c for quite a while, I think? Google has released a rust based os, and the browser I'm typing in has a rust css layout engine. I don't think "rewrite it in rust" is going to be some sweeping fire, it is too much work, but I could see driver writers and kernel authors migrating the whole thing over a long period. Ship of Theseus style.

Edit: or as part of the webapp Armageddon wasm and wasm runtimes all get written in rust and the os running all those webapps and wasm services atrophies to a tiny thing that eventually gets rewritten itself as something(s) non linuxlike.



Agreed, my point was simply that ada never really had the adoption levels that rust already enjoys. Yes, ada was everywhere in safety of flight, but that is fairly niche compared to a language that will be part of the linux kernel next year.


You're right that a C extension could be implemented in a way that breaks the usual Python C extension contract and then have undefined behaviour. For example, if you release the GIL and then iterate over a Python list then another thread could modify that list (causing the underlying buffer to point at another location) causing undefined behaviour.

But this is a case of undefined behaviour even for C extensions that do obey the usual contract. It even includes numpy for example. The two relevant elements are:

* When a C extension uses the buffer protocol, it should operate directly on the buffer - the whole point of the buffer protocol is to avoid copying and allow speed close to that of native code.

* When a C extension is doing CPU intensive work, it should release the Python GIL (but not use any CPython APIs while it is released, e.g. to instantiate a new object, iterate a list, etc.)

A major selling point of numpy, for example, is that you can use it to do CPU intensive work without the GIL locked, allowing true parallelism. I don't think it's a deal breaker that you can use this to get proper C-level undefined behaviour but it's definitely interesting, and I don't consider it a bug in numpy or Python.


I’ve never used buffers, but my understanding is that you can modify the contents directly from python without even needing a C extension. Surely I could trigger UB by just scribbling into the data structure for one of python’s native types and then accessing it normally.


I don't think that's true - you can use the buffer protocol to give one object a view into another but those objects' classes need to be written in C in order to allow that exchange to happen.

But even if you were able to modify a buffer-protocol buffer directly using native Python, that would keep the GIL locked, which would prevent simultaneous access from multiple threads.
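
For what it's worth, here is a minimal sketch of what plain Python can do with a buffer, using only the built-in `bytearray` and `memoryview` (both implemented in C inside CPython). It assumes CPython behaviour: writes through the view are in place and bounds-checked, and the `bytearray` refuses to resize while a view is exported, which is the guard that keeps the view from pointing at freed memory.

```python
# Export a bytearray's buffer through a memoryview (no C extension needed).
data = bytearray(b"hello")
view = memoryview(data)

view[0] = ord("j")              # in-place write through the view: allowed

try:
    data.extend(b" world")      # would have to reallocate under the live view
    resize_blocked = False
except BufferError:
    resize_blocked = True       # CPython rejects the resize instead

view.release()                  # drop the export
data.extend(b" world")          # resizing is allowed again
```

So the dangerous combination is not reachable this way; it needs a consumer that keeps using the raw pointer after the exporter has been allowed to change.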


No, it's not. But the general idea is, I think: if the composition of correct code can lead to incorrect behavior, and the language that handles this composition does not allow the error to be corrected (and explicitly supports the composition), then that language is somehow lacking.


Python also makes no distinction between safe and unsafe code. Now, "ctypes" is maybe enough of a marker, but it just illustrates how easy it is to create dangling pointers in Python just like in C, if desired.

    import ctypes
    ptr = ctypes.c_char_p(0x1234)  # integer to pointer cast
    ptr.value  # deref dangling pointer

And similar operations are available through numpy and the buffer API, just a bit harder to jump into those, but once you do, the same kind of mistakes can happen.
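
A runnable variant of the same idea, as a sketch: it does the integer-to-pointer round trip against memory that is still alive, so it is safe to execute. The names `buf`, `addr` and `roundtrip` are just for illustration. Nothing in the code marks the read as unsafe; it is valid only because `buf` is still referenced.

```python
import ctypes

# 6 bytes (5 characters plus a trailing NUL), owned by the `buf` object.
buf = ctypes.create_string_buffer(b"hello")

# The address is an ordinary Python int; no special marker is attached to it.
addr = ctypes.addressof(buf)

# int -> pointer -> raw read of 5 bytes. Valid only while `buf` is alive;
# the identical call after `buf` is freed would be a dangling read.
roundtrip = ctypes.string_at(addr, 5)
```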


In BASIC, a wrong POKE could also kill the machine and a reboot was required.

The big difference is that most of these mistakes are opt-in, while in C every line of code is a possible source for problems.


Right but his point was that they aren't explicitly opt-in. There's no `unsafe` keyword you have to use in Python before you are allowed to do unsafe things. It's just based on documentation essentially.

Therefore there's not really any issue with having a Python API that you have to be a bit careful with to use safely - as long as it is documented and relatively obvious, it's no different to ctypes.


I would assert that import ctypes is the opt-in somehow.


Yeah, that's a fair argument. You'd probably expect any public API exposed by functions in the module importing ctypes to provide a "safe" interface. The difference would be if any of that public API was unsafe, in Rust you could force the caller to use the unsafe keyword too (since unsafe functions can only be called in unsafe blocks), whereas in Python you'd rely on documentation.


As an experienced python developer that dabbles in Rust (and the connection of both) I'd say that you should have a hard time to program Python in a way that makes it run into undefined behavior. However, I have more confidence in the Rust code being unexploitable than the Python code. Rust's strict ownership rules, the default immutable variables, the strict type system and the good test framework all combine to produce code where much less can happen beyond the intended paths.

Python hooks into C at some points. I have not verified every single one of those points.


> you should have a hard time to program Python in a way that makes it run into undefined behavior

Iterating a data structure while modifying it can get some very weird behaviour in Python.

I think it qualifies as undefined behaviour. Python recommends not to do this, without any safeguards in place.


Iterating a pure-Python list in one thread while modifying it in another will not give you true C-level undefined behaviour. It might throw an exception or return the same element twice, etc., but it won't crash or do any nasal daemon stuff (time travelling crashes). In part, that's due to the Python GIL.
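
A single-threaded sketch of the "might throw an exception" case, assuming CPython: growing a dict while iterating over it is detected and reported as an ordinary `RuntimeError`, not as memory corruption. The variable names are made up for the example.

```python
counts = {"a": 1, "b": 2}

try:
    for key in counts:
        # Adding a key changes the dict's size in the middle of the loop.
        counts[key + "_copy"] = counts[key]
    outcome = "no error"
except RuntimeError as exc:
    # The iterator notices the size change on its next step and raises.
    outcome = str(exc)
```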

What this article says is that the buffer protocol allows bugs that are qualitatively different.


Wouldn't that produce unexpected but not necessarily undefined behavior? To parent's point, I'm not sure there's going to be a language level _exploit_ there, but it will certainly not work the way you'd think.


Undefined behavior means it's not specified by the language how the code should be compiled. Just because something is weird and hard to predict doesn't mean it's UB.


So undefined behaviour only applies to compiled languages? That seems strange.

I thought the behaviour was simply whatever cpython does but it seems it is documented (so not undefined, indeed as you say). Copy pasted below with source.

Note

There is a subtlety when the sequence is being modified by the loop (this can only occur for mutable sequences, e.g. lists). An internal counter is used to keep track of which item is used next, and this is incremented on each iteration. When this counter has reached the length of the sequence the loop terminates. This means that if the suite deletes the current (or a previous) item from the sequence, the next item will be skipped (since it gets the index of the current item which has already been treated). Likewise, if the suite inserts an item in the sequence before the current item, the current item will be treated again the next time through the loop. This can lead to nasty bugs that can be avoided by making a temporary copy using a slice of the whole sequence, e.g.,

https://docs.python.org/3.7/reference/compound_stmts.html#th...
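
To make the quoted note concrete, here is a small sketch of both cases it describes: removing items while looping over the same list skips an element, and looping over a slice copy avoids that. The function names are hypothetical, chosen for the example.

```python
def drop_negatives_in_place(values):
    # Iterates the same list it mutates: after a removal the internal
    # counter moves on, so the element that slid into the gap is skipped.
    for x in values:
        if x < 0:
            values.remove(x)
    return values


def drop_negatives_with_copy(values):
    # Iterates a temporary copy (values[:]) and mutates the original.
    for x in values[:]:
        if x < 0:
            values.remove(x)
    return values


skipped = drop_negatives_in_place([1, -2, -3, 4])   # -3 is never visited
clean = drop_negatives_with_copy([1, -2, -3, 4])
```

The first result is surprising but fully specified by the note above, which is why it counts as documented behaviour and not undefined behaviour.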


I had the same reaction: clearly all bets are off when you expose memory directly to another runtime.


> Is this really true when you’re directly modifying the underlying data structure of a python object? That can’t really be considered “pure python” anymore, and it seems like there would be LOTS of ways to use it to cause shenanigans.

I've done this against my own code. I got to a point where I needed shared memory due to multiprocessing but couldn't get anything to work quite right.

https://docs.python.org/3/library/multiprocessing.html#modul...

Whatever idiocy got me there need not be mentioned, but I was in fact able to code a buffer overflow, and I only needed to bypass DEP. ASLR wasn't important, even though it was Win 10 with ASLR on.

In hindsight, not a python problem. Definitely a me problem.



