> But Python code (even buggy Python code!) is not supposed to be able to trigger C-level undefined behavior, that’s part of the point of using a high-level language like Python.
Is this really true when you’re directly modifying the underlying data structure of a Python object? That can’t really be considered “pure Python” anymore, and it seems like there would be LOTS of ways to use it to cause shenanigans.
Rust's answer here would be yes. The code which made this possible is faulty.
For example, you can trivially reintroduce C-style undefined behaviour on buffer overflow by writing a Rust type which implements Index (and optionally IndexMut) using unsafe to just not bother with a bounds check. Rust says this type is wrong, it should not do this, that's a bug -- even if your program using it has no buffer mistakes, the type is wrong anyway.
This is a fact of Rust's culture rather than a technical implication of the Rust language design, although the two are closely related.
This is one thing I appreciate in Rust culture, going back to the days of Object Pascal/Modula-2 and the managed languages: write a working solution first, and only if it doesn't meet the performance requirements go through an optimization phase, possibly with somewhat hacky code that disables bounds checks and such ({$R-} in Turbo Pascal since forever).
C++ culture was also a bit like that in the early days, when we discussed safer abstractions over C code on Usenet; somehow that got lost along the way as more C folks migrated into it.
It’s good that Rust has this culture. But the flip side is that issues that wouldn’t even be bugs in other languages have CVEs filed for them in Rust. A layman looking at the Rust language thinks it looks very insecure. Example from a couple weeks ago - https://news.ycombinator.com/item?id=33151963
The problem, and this goes back as far as I can remember, is that C folks use the remaining 30% of bug causes to dismiss any language that is sold as improving safety in systems programming.
It is Rust today; it was Ada, Object Pascal, and Modula-2 in the early '80s, and everything in between since then.
It seems like Rust has passed some tipping point toward wider adoption that I haven't seen in any of those other languages. Yes, flight-safety stuff went to Ada, but much of that has now gone to modern C++ subsets with lots of testing. And car-safety stuff has been MISRA C for quite a while, I think? Google has released a Rust-based OS, and the browser I'm typing in has a Rust CSS layout engine. I don't think "rewrite it in Rust" is going to be some sweeping fire, it is too much work, but I could see driver writers and kernel authors migrating the whole thing over a long period, Ship of Theseus style.
Edit: or, as part of the webapp Armageddon, WASM and WASM runtimes all get written in Rust, and the OS running all those webapps and WASM services atrophies to a tiny thing that eventually gets rewritten itself as something(s) non-Linux-like.
Agreed, my point was simply that Ada never really had the adoption levels that Rust already enjoys. Yes, Ada was everywhere in safety-of-flight work, but that is fairly niche compared to a language that will be part of the Linux kernel next year.
You're right that a C extension could be implemented in a way that breaks the usual Python C extension contract and then have undefined behaviour. For example, if you release the GIL and then iterate over a Python list then another thread could modify that list (causing the underlying buffer to point at another location) causing undefined behaviour.
But this is a case of undefined behaviour even for C extensions that do obey the usual contract. It even applies to numpy, for example. The two relevant elements are:
* When a C extension uses the buffer protocol, it should operate directly on the buffer - the whole point of the buffer protocol is to avoid copying and to allow speed close to that of native code.
* When a C extension is doing CPU-intensive work, it should release the Python GIL (but not use any CPython APIs, e.g. to instantiate a new object, iterate a list, etc.)
A major selling point of numpy, for example, is that you can use it to do CPU intensive work without the GIL locked, allowing true parallelism. I don't think it's a deal breaker that you can use this to get proper C-level undefined behaviour but it's definitely interesting, and I don't consider it a bug in numpy or Python.
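To make that concrete, here's a rough sketch (illustrative only, not numpy internals): because numpy drops the GIL inside big element-wise operations, ordinary Python threads can get real parallelism out of it.

    import threading
    import numpy as np

    # Sketch: numpy releases the GIL inside large element-wise operations,
    # so these plain Python threads can genuinely run on multiple cores at once.
    rng = np.random.default_rng(0)
    chunks = [rng.random(5_000_000) for _ in range(4)]
    results = [0.0] * len(chunks)

    def work(i):
        # The heavy loop runs in C with the GIL released.
        results[i] = float(np.sqrt(chunks[i]).sum())

    threads = [threading.Thread(target=work, args=(i,)) for i in range(len(chunks))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sum(results))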
I’ve never used buffers, but my understanding is that you can modify the contents directly from Python without even needing a C extension. Surely I could trigger UB by just scribbling into the data structure for one of Python’s native types and then accessing it normally.
I don't think that's true - you can use the buffer protocol to give one object a view into another, but those objects' classes need to be written in C in order to allow that exchange to happen.
But even if you were able to modify a buffer-protocol buffer directly using native Python, that would keep the GIL locked, which would prevent simultaneous access from multiple threads.
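For what it's worth, a small sketch of what pure-Python buffer access via memoryview looks like (names here are just illustrative): you can write through a view without copying, but CPython refuses operations that would invalidate the exported buffer.

    # memoryview is the pure-Python face of the buffer protocol: a zero-copy
    # view into another object's buffer.
    data = bytearray(b"hello world")
    view = memoryview(data)

    view[0:5] = b"HELLO"   # writes go straight into the bytearray's buffer
    print(data)            # bytearray(b'HELLO world')

    # CPython pins the buffer while a view exists: resizing is refused rather
    # than silently invalidating the exported pointer.
    try:
        data.append(0)
    except BufferError as exc:
        print(exc)         # "Existing exports of data: object cannot be re-sized"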
No, it's not. But the general idea, I think, is: if the composition of correct code can lead to incorrect behavior, and the language that handles this composition does not allow you to correct the error (and explicitly supports the composition), then that language is somehow lacking.
Python also makes no distinction between safe and unsafe code. Now, "ctypes" is maybe enough of a marker, but it just illustrates how easy it is to create dangling pointers in Python just like in C, if desired.
And similar operations are available through numpy and the buffer API; it's just a bit harder to jump into those, but once you do, the same kinds of mistakes can happen.
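To make the ctypes point concrete, a deliberately broken little sketch (the names are just for illustration): nothing in the language marks the dangerous step, it's just another function call.

    import ctypes

    # Deliberately broken: take the raw address of a temporary buffer, let the
    # object die, then read through the stale address. A dangling pointer,
    # written in "pure" Python.
    buf = ctypes.create_string_buffer(b"hello")
    addr = ctypes.addressof(buf)
    del buf                                     # the allocation may now be freed and reused

    dangling = (ctypes.c_char * 5).from_address(addr)
    print(dangling[0])                          # garbage, a crash, or b'h' if you're lucky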
Right, but his point was that they aren't explicitly opt-in. There's no `unsafe` keyword you have to use in Python before you are allowed to do unsafe things; it's essentially just based on documentation.
Therefore there's not really any issue with having a Python API that you have to be a bit careful with to use safely - as long as it is documented and relatively obvious, it's no different to ctypes.
Yeah, that's a fair argument. You'd probably expect any public API exposed by functions in the module importing ctypes to provide a "safe" interface. The difference would be if any of that public API was unsafe, in Rust you could force the caller to use the unsafe keyword too (since unsafe functions can only be called in unsafe blocks), whereas in Python you'd rely on documentation.
As an experienced Python developer who dabbles in Rust (and the connection between the two), I'd say that you should have a hard time programming Python in a way that makes it run into undefined behavior. However, I have more confidence in the Rust code being unexploitable than the Python code. Rust's strict ownership rules, the default-immutable variables, the strict type system, and the good test framework all combine into code where much less can happen beyond the intended paths.
Python hooks into C at some points. I have not verified every single one of those points.
Iterating a pure-Python list in one thread while modifying it in another will not give you true C-level undefined behaviour. It might throw an exception or return the same element twice, etc., but it won't crash or do any nasal-demons stuff (time-travelling crashes). In part, that's due to the Python GIL.
What this article says is that the buffer protocol allows bugs that are qualitatively different.
Wouldn't that produce unexpected but not necessarily undefined behavior? To the parent's point, I'm not sure there's going to be a language-level _exploit_ there, but it will certainly not work the way you'd think.
Undefined behavior means it's not specified by the language how the code should be compiled. Just because something is weird and hard to predict doesn't mean it's UB.
So undefined behaviour only applies to compiled languages? That seems strange.
I thought the behaviour was simply whatever CPython does, but it seems it is documented (so not undefined, indeed as you say). Copy-pasted below with source.
> Note
> There is a subtlety when the sequence is being modified by the loop (this can only occur for mutable sequences, e.g. lists). An internal counter is used to keep track of which item is used next, and this is incremented on each iteration. When this counter has reached the length of the sequence the loop terminates. This means that if the suite deletes the current (or a previous) item from the sequence, the next item will be skipped (since it gets the index of the current item which has already been treated). Likewise, if the suite inserts an item in the sequence before the current item, the current item will be treated again the next time through the loop. This can lead to nasty bugs that can be avoided by making a temporary copy using a slice of the whole sequence, e.g.,
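A quick illustration of that documented behaviour, surprising but perfectly well-defined:

    items = [1, 2, 3, 4, 5]
    for x in items:
        if x == 2:
            items.remove(x)   # shifts later elements left while the loop's index advances
        print(x)
    # Prints 1, 2, 4, 5: the 3 is skipped, exactly as the docs describe.
    # Unexpected, perhaps, but documented behaviour, not undefined behaviour.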
> Is this really true when you’re directly modifying the underlying data structure of a python object? That can’t really be considered “pure python” anymore, and it seems like there would LOTS of ways to use it to cause shenanigans.
I've done this against my own code. I got to a point where I needed shared memory due to multiprocessing but couldn't get anything to work quite right.
Whatever idiocy got me there need not be mentioned, but I in fact was able to code a buffer overflow, and I only needed to bypass DEP. ASLR wasn't important, even though it was Windows 10 with ASLR on.
In hindsight, not a python problem. Definitely a me problem.
Naively, going into this I assumed “channels”[0] would be the solution.
All I took away was the age-old adage: try to avoid passing objects back and forth to native code, and also that I write inefficient code that copies buffers more often than I thought.
If you’re interested in Python performance, I saw a really good talk from Strange Loop this year titled "Python Performance Matters": https://youtu.be/vVUnCXKuNOg ; Emery Berger has been talking about Python performance for quite some time.
I mentioned this last time I spotted a link to a Strange Loop talk, but worth reiterating so fewer people miss the opportunity: 2023 will be the last conference. Make it if you can, it’s an amazing experience.
> If we have a Python buffer, and we want to represent the data in Rust, how should we do so? The natural answer would be a &[u8], but as you may have picked up, in the face of the possibility of concurrent writes, this would be unsound. Similarly, an &mut [u8] is unacceptable because the Python buffer protocol provides no assurances that only one mutable buffer is handed out at a time.
Idk, to me the natural answer for anything crossing the "FFI" border would be *mut u8. AFAIK rustc makes fewer assumptions about raw pointers, and it might be closer to a sound interface?
- The C extension releases the GIL while accessing the buffer (explicitly allowed by the spec)
In this case, the C extension is reading and writing to the same buffer at the same time without the use of atomics, which is the definition of a data race, and therefore triggers UB.
IMO, the best fix would be for the buffer implementation to expose a "concurrency mode", and require that users of the buffer API use the concurrency mode specified by the buffer when accessing it. In one mode, no synchronization is necessary because the buffer only ever hands out one writable pointer or multiple readable pointers but never both. In another mode, you might be required to have the GIL, in another mode there might be a buffer-specific RWLock you have to hold while accessing it.
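Purely as a hypothetical sketch of that idea (none of these names or modes exist in the real buffer protocol), the shape might be something like:

    from enum import Enum, auto

    class ConcurrencyMode(Enum):
        EXCLUSIVE = auto()  # buffer hands out one writer OR many readers; no extra sync needed
        GIL = auto()        # caller must hold the GIL for every access
        RWLOCK = auto()     # caller must hold this buffer's own read/write lock

    class HypotheticalBuffer:
        """Illustrative only: a buffer that declares how it may be accessed concurrently."""
        concurrency_mode = ConcurrencyMode.RWLOCK

        def read_locked(self):
            ...  # acquire the buffer's read lock, expose a read-only view, release

        def write_locked(self):
            ...  # acquire the buffer's write lock, expose a writable view, release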
"Yeah a raw pointer makes sense but is equally unhelpful for the caller, since there's still no safe way to access the data."
The problem is exactly that there is no safe way to access the data, according to Rust's definition of safety. The entire point of Rust is to exclude the C-esque "well, I have a pointer, I guess I have the right to do whatever I want with it, I hope, maybe?" access model, but that's what Python is operating under. No amount of Rust hackery can add Rust's promises onto the Python code. I'd argue that the "correct" answer is to hand back an opaque pointer value into the Rust, and obligate all Rust access to that opaque pointer to be through "unsafe". To put things back into Rust's safe space would require an unsafe copy, but at least the copy would be back in "true" Rust.
The problem here isn't that that is incorrect, the problem is that it's uselessly unergonomic, both for the programmer, and because I may not want to pay the price of making a copy. At this point we're just down to figuring out what promise of Rust we want to break. (The only reason we're not discussing which promises of Python we want to break is that on this dimension, it doesn't make any. There are other conceivable situations where we might have compromise in both directions, e.g., Haskell's immutability.) We have the freedom of acknowledging there isn't a perfect solution, and then pursuing the "good enough"... and the "good enough" is probably to just hand Rust the buffer wrapped in whatever Rust type you feel like/is most appropriate from Rust's point of view, and then just saying "all this type can do, and therefore all it does, is constrain my Rust code; I can do nothing about the Python at all on this level". Since there is no solution, spinning about trying to find one is not terribly productive.
Because CPython code is extremely slow there is an active cottage industry of extensions for making things very fast.
This of course creates the somewhat strange outcome that writing really fast Python code is all about writing your code in anything but Python but the syntax and ecosystem are obviously worth it to people.
> strange outcome that writing really fast Python code is all about writing your code in anything but Python
I really have to wonder why people are so insistent on keeping up the “Python charade”. Surely at some point we recognise that all the effort of trying to back-port performance into something that really isn’t built for it isn’t worth it, and that we’re better off just porting the bits of Python that work out to other languages that are fast/ergonomic/correct/etc., so everyone involved can get some time back.
This is a slow process which takes many years, or even decades.
And what language is really the alternative? There is no clear answer. Many languages might be somewhat OK, but I'm not sure. Python has grown so much because:
- It's very simple.
- Its syntax almost reflects algorithmic pseudo code.
- It's a managed language with garbage collection, you will usually not see segfaults or other memory corruption, and also not the simple kind of memleaks.
- Debugging is easy.
Those are features which are really nice for newcomers from other scientific fields, but also for professionals when Python is just the gluing language which orchestrates some heavy workers, e.g. TensorFlow or PyTorch. Or when it's just used as a better shell scripting language.
I'm myself using Python mostly for machine learning (TensorFlow, PyTorch), and as a shell script language alternative, and for other small utilities. And I really don't know what alternative language there is, which would be a better fit for this, even ignoring the state of the tooling and library support.
Some people might say Julia. But Julia is more complicated and more difficult to understand, especially for newcomers. And Julia has slow startup times. Slow startup times alone already disqualify Julia for these applications, or at least they're a very big disadvantage.
It appears simple on the surface, but it is as complex as Common Lisp or C++ once one starts exploring the language surface, runtime capabilities, and standard library.
As an experienced Python programmer who has dabbled in C++ (and Rust), I find this claim very dubious, though I can't say either way about Common Lisp. The difficult parts of Python-the-language (e.g. metaclasses) are both fairly easy to explain and so rarely used even by library authors that explanation isn't usually necessary.
I suppose the standard library can be considered complex, but is hardly worse than C++.
I've seen plenty of non-programmers become proficient enough in Python to be productive, but maybe one guy -- a statistician maintaining R libraries -- learn enough C++ to be dangerous.
How much would you score on a Python pub quiz, including all versions of Python between 1.6 and 3.10?
Questions would include changes across minor versions, meta programming, old and new classes, ABC, slots, magical methods, async, standard library, famous libraries like stackless, ...
Outside of tiny niches the world is on Python 3 now, so I don't think older versions are particularly relevant when evaluating how complex Python is. For that reason old vs new style classes don't cause issues. Nor do the existence of niche implementations like Stackless, unless we want to also claim such implementations as sources of complexity for C++ or (especially) Common Lisp, at which point it's a wash.
Having trained many newly-graduated data scientists and engineers to a level of being reasonably productive in Python, my assessment is that it is simply not that difficult to get a handle on. But if I'm expected to take non-career-programmers to comparable C++ fluency, that would be a considerably longer process.
Things they'd have to worry about in C++ that they don't have to worry about in Python: memory leaks, memory debugging, complex templates, libraries written in the lowest-level way possible, move semantics, ...
The thing is, what works really well for Python is not just its easily portable features.
Having a REPL and things like notebooks is great. Having pip install. Having a syntax that is very beginner friendly, especially for reading. Just making it easy to glue code together without worrying about types.
Half of these things are hard to impossible to port, and many others are actively undesirable to port.
If you consider python as a sort of bash-scripting language, that is explicitly meant to call 'real' code and very easily combine it into something new that is more usable, that might make more sense. Sadly the extra ergonomics of python vs bash mean that making code that can easily be called by python is a lot more work.
pip (and its predecessor, easy_install) was an improvement over CPAN, PEAR, or the complete absence of solutions at the time in languages like Java or C, but I think it's fair to state that a package manager is table stakes for a language now, and newer iterations on the concept like cargo have surpassed pip.
Because Python is the new BASIC. The big difference is that BASIC not only has had compilers for decades, it was originally designed (Dartmouth BASIC) as a JIT-based language.
The interpreters got famous on the 8-bit machines because there wasn't space for either JIT or AOT compilers.
The latest talk I saw regarding Python performance actually addresses this charade, with a profiler that helps locate which parts of Python code should be rewritten in C; some people really love their "Python".
Well, there are a bunch of reasons for maintaining the "charade".
Firstly, faster languages don't have the ergonomics.
Secondly, there's a tonne of Python code out there already that isn't going away any time soon, and any effort to make the runtime faster benefits all that code.
Thirdly, there's little reason Python can't be as fast as the likes of JavaScript beyond the issues imposed by the C extension API. The reality is that not a lot of effort has historically been put into making the CPython runtime particularly fast. It's only recently that anyone's put any concerted effort into it.
That's a whole other issue. PyPy and CPython have conflicting aims, which is partly why it never replaced CPython as the de facto default Python runtime. Also, PyPy is much slower for stuff that's only going to run for a short time. Still, many of the lessons learned in PyPy can be used to improve CPython performance.
They only have conflicting aims due to politics, otherwise a way to work towards a common goal would have happened.
CPython devs will only accept a proper JIT when the language really starts getting a beating from the likes of Julia, Chapel, Go, and JVM and .NET bindings to the same underlying C/C++/Fortran libs.
Rust has a good shot at taking over Python's role in a couple years: it is fast, cargo is even simpler to use than pip, multithreading is mostly idiot-proof, and there's a growing ecosystem of data science and ML libraries that bring the benefits of Rust's more expressive type system to the table. But it will take years to grow that ecosystem to a size that makes it really competitive with Python.
Rust will never overtake Python's role as long as data scientists are the rank-and-file using machine learning libraries. Rust is simply too complex and too difficult to learn for non-CS types. I'm not casting aspersions at data scientists, who largely could adapt to Rust, but in my experience they wouldn't (in general!) unless they were forced to by the job market. And employers aren't going to see the need to make that change (again, in general), because a lengthy re-education process would be expensive.
My gut feeling is that Rust won't compete with Python in the ML space, but contrary to the common assertion that Rust is too hard for non-CS types, it has actually seen notable adoption from Python-users. While the strictness of the compiler can be seen as a deterrent to some, it is also a huge benefit for people who are new to systems programming and worried about shooting themselves in the foot.
Julia is probably the most likely competitor for Python in the data science space. It's designed for such use cases and it's very good at it.
Still, Python does have so much momentum that it's almost certainly going to continue its rise. But other languages can still succeed, the ecosystem doesn't need to end up as winner-take-all.
If you can write your whole program in Python and just pay a bit more attention to the one speed-critical bit, you can often get perfectly decent performance (let's say it takes 20% longer than writing the whole thing natively, to just make up a number, or even 100% longer may be acceptable) while taking much, much less effort to write and maintain the whole thing (I mean like an order of magnitude less effort).
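As a toy sketch of that workflow (the function names are made up): keep the program in Python and rewrite only the hot loop against a native-backed library like numpy, leaving everything around it untouched.

    import numpy as np

    def pairwise_sum_python(xs, ys):
        # Fine for small inputs; becomes the bottleneck at scale.
        return [x + y for x, y in zip(xs, ys)]

    def pairwise_sum_fast(xs, ys):
        # The hot loop now runs in C inside numpy; the rest of the program is unchanged.
        return (np.asarray(xs) + np.asarray(ys)).tolist()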