Pickle’s Nine Flaws (nedbatchelder.com)
64 points by gilad on July 16, 2020 | 38 comments


There is a way to read pickles without running them, but it is Python-only and still requires one to know how pickles work. The module `pickletools` can be used to disassemble a pickle into its opcodes, much as `dis` does for ordinary Python bytecode. Honestly, though, I wouldn't say that this invalidates the point about unreadability; it just hammers in exactly how unreadable they really are.
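For anyone who wants to try it, here is a minimal sketch of that disassembly using only the standard library. The dictionary being pickled is just an illustrative payload, and protocol 2 is chosen only to keep the listing short.

```python
import io
import pickle
import pickletools

# An ordinary, harmless pickle of built-in types (illustrative payload).
data = pickle.dumps({"name": "spam", "count": 3}, protocol=2)

# pickletools.dis() walks the opcode stream and writes a listing;
# it does not execute the pickle, so nothing is imported or called.
listing = io.StringIO()
pickletools.dis(data, out=listing)
print(listing.getvalue())

# genops() yields (opcode, argument, position) tuples for programmatic inspection.
opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]
print(opcode_names)
```

The listing is readable only if you already know what the opcodes mean, which is rather the point being made above.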


To me it's interesting that pickle can be thought of as recording some of the implicit assumptions GvR made about the expected use of Python semantics.

Formally serialization/deserialization is very crunchy and precise. (And I remember how stoked I was to find out that Python included an implementation!) In practice, things get messy and we break the implicit assumptions.

Is it a flaw of the pickle module? Or are our designs too clever?

Patient: "It hurts when I do this."

Doctor: "Don't do that."

;-)


* Insecure: If you are unpickling insecure code, you have other problems. Deserializers should not be used as a protection against hacking.

* Old pickles look like old code: Again, convert your object into json and serialize that to your database. Oh no, you are missing an attribute. Pickle should not be used so you don't have to employ a release engineer.

* Implicit: No software works everywhere with defaults. So use copyreg.

* Over-serializes: USE copyreg.

* __init__ isn’t called: USE COPYREG.

* Python only: what's this for, then? http://www.picklingtools.com/

* Unreadable: Great feature.

* Appears to pickle code: Another great feature.

* Slow: check again, it has been 8 years. I can't find any faster method.
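Since copyreg comes up three times in that list, here is a rough sketch of what it looks like in practice. The `Connection` class and its fields are hypothetical; the only real API used is `copyreg.pickle`, which registers a reduction function for a type.

```python
import copyreg
import pickle
import threading


class Connection:
    """Hypothetical class that holds a resource which cannot be pickled."""

    def __init__(self, host, port=5432):
        self.host = host
        self.port = port
        self.lock = threading.Lock()  # stand-in for a live, unpicklable resource


def reduce_connection(conn):
    # Tell pickle to rebuild the object by calling Connection(host, port).
    # Only the constructor arguments are stored, and __init__ runs on load,
    # so the lock is recreated rather than serialized.
    return Connection, (conn.host, conn.port)


# Register the reducer in copyreg's global dispatch table.
copyreg.pickle(Connection, reduce_connection)

original = Connection("db.example", 6543)
restored = pickle.loads(pickle.dumps(original))
print(restored.host, restored.port)
```

This is explicit work per class, which is arguably the article's complaint: the defaults do not do it for you.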


> * Python only: what's this for, then? http://www.picklingtools.com/

From the picklingtools FAQ:

What versions of Python does PicklingTools support? Historically, versions 2.1.x to 2.6.x have been tested extensively. More recently, 2.7 has been tested and should work, but it has not been tested as much as the other versions.

   3.x has not been tested: We are waiting for our main paying customer to adopt the 3.x series.
[edit] formatting


> * Insecure: If you are unpickling insecure code, you have other problems. Deserializers should not be used as a protection against hacking.

There are lots of good use cases for deserializing untrusted data -- it's what you do in almost any client-server situation. So the fact you can't do this with pickle really is an important limitation.


> There are lots of good use cases for deserializing untrusted data

Yes, and untrusted user data should not have arrived at the system in pickle format. And in the case of RoR, it should not have arrived in YAML format or been loaded with YAML.load[1]. So my point stands.

[1] https://gist.github.com/niklasb/df9dba3097df536820888aeb4de3...


In what client server situation does it make sense to use pickles over JSON/YAML?


None, except when you're taking a huge shortcut. Which is why you want to be super cautious about using Pickle, or Java serialization, or any serialization solution that deserializes arbitrary objects. Once your deserialization isn't explicit about what objects you accept, you have to be super careful about the provenance of that data.
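To make "explicit about what objects you accept" concrete, here is a sketch along the lines of the "restricting globals" pattern from the pickle documentation. The allow-list contents and the helper name `restricted_loads` are my own choices for illustration. It narrows what a pickle can reference; it should not be read as making untrusted pickles safe.

```python
import io
import pickle

# Hypothetical allow-list: the only (module, name) globals a pickle may reference.
ALLOWED_GLOBALS = {
    ("builtins", "set"),
    ("builtins", "frozenset"),
    ("builtins", "range"),
}


class AllowListUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks for a global; refuse anything unlisted.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is not allowed")


def restricted_loads(data):
    return AllowListUnpickler(io.BytesIO(data)).load()


# Plain data made of built-in types loads normally.
ok = restricted_loads(pickle.dumps({"ids": {1, 2, 3}, "span": range(5)}))

# Hand-written protocol-0 pickle that would call os.system under plain pickle.loads().
hostile = b"cos\nsystem\n(S'echo hello'\ntR."
try:
    restricted_loads(hostile)
    blocked = False
except pickle.UnpicklingError:
    blocked = True  # rejected at the global lookup, before any call happens
print(ok, blocked)
```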


Like hyperpape said, virtually none. And this point is the biggest reason why.

The article is, after all, about reasons not to use pickle. :-)


"* Unreadable: Great feature."

???


Subjective, isn't it?


I’m curious why you find it to be a great feature. Being unreadable doesn’t mean meaningless, obviously, so as far as I can tell, unreadable just means it’s a little more hassle to figure out what it means.


I just like it when floats and integers are serialized into bytes and it's harder to see what is where. IPC becomes a little harder to decipher. Obfuscation is not exactly a use case for pickle, but still nice extra feature.


I'm skeptical of the point about over-serialization. In my opinion, throwing an exception on an unserializable attribute is a good default. If an object is using a file, it will more often than not be unusable when deserialized without the file.

This is one of the few things Java gets right about its built in serialization: if you have an object that can't be serialized, anything using that object has to declare it as transient, meaning it won't be serialized or deserialized. Hopefully you'll think about whether the result makes sense before using the keyword.

If you don't mark an unserializable field transient, you'll get an exception at runtime. It's not enforced by the compiler, which would be ideal, but linters will warn you.
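For comparison, a sketch of the Python side of this. Pickling an object that holds an open file does raise by default, and `__getstate__` is the closest thing to `transient`. The `Report` and `NaiveReport` classes are hypothetical.

```python
import os
import pickle


class NaiveReport:
    """Hypothetical class that keeps an open file and does nothing special."""

    def __init__(self, title, log_file):
        self.title = title
        self.log_file = log_file


class Report(NaiveReport):
    def __getstate__(self):
        # Rough analogue of Java's `transient`: leave the file out of the state.
        state = self.__dict__.copy()
        state["log_file"] = None
        return state


with open(os.devnull, "w") as fh:
    try:
        pickle.dumps(NaiveReport("q3", fh))
        naive_error = None
    except TypeError as exc:
        naive_error = exc  # open file objects refuse to be pickled

    restored = pickle.loads(pickle.dumps(Report("q3", fh)))

print(type(naive_error).__name__, restored.title, restored.log_file)
```

Whether a `Report` without its file is still usable is exactly the question you have to think about before dropping the field.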


Hawking my own (incomplete) contribution to Pickle security/analysis https://github.com/moreati/pickle-fuzz#rehabilitating-python...


This seems to seriously misunderstand the point of pickle. It's not for data interchange. It's for e.g. caching objects or debugging. That's it.

The fact it keeps "old code" is a feature. The object is exactly as it was at the time it was saved.


I've read quite a few blog posts of people discouraging the use of pickle for data interchange, which led me to believe that quite a few orgs actually use it for precisely that.

I wouldn't use it, simply because I like my formats easy to inspect (if you want privacy, just put it on a secure channel like TLS or use some other standard wrapping), but I guess other people have fewer scruples.


I mean, the most popular async message queue for python broadly supports pickle as its serialization format. I don't think this problem is exclusive to this blog post.


I should have been more precise. I meant interchange between different programs/platforms/etc. Not internal messages.


I think these flaws are fairly minor; if anything, they nudge you towards use cases where you're not overly reliant on pickle for complex work.

If readability is an issue there's a JSON version that's quite useful.

Other than that, most of the other concerns are addressable. If security matters, perhaps use an encryption lib around the pickle, rather than ask for it to be built in? As for speed, you're already using Python, and chances are you're not constantly pickling and unpickling?


Encryption won't do you any good - encrypted messages can be forged. But let's say you meant a signature or other form of authentication.

That still does you no good if you're, say, a server getting data from a client. Very few servers want to allow clients to execute arbitrary code inside them.

That still leaves some situations where it can be used - but it's a major limitation on the scope of those situations.
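A sketch of what the authentication variant could look like for one of those remaining situations, such as a cache that only your own processes write. The key and the helper names `signed_dumps` and `verified_loads` are made up for illustration; the `hmac` and `hashlib` calls are standard library.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"hypothetical-shared-secret"  # assumption: known only to trusted writers
DIGEST_SIZE = hashlib.sha256().digest_size


def signed_dumps(obj):
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload


def verified_loads(blob):
    tag, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # Check the tag in constant time BEFORE the payload reaches pickle.loads().
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


blob = signed_dumps({"job": 17, "args": [1, 2, 3]})
result = verified_loads(blob)

# Flip one bit in the payload to simulate tampering.
tampered = blob[:-1] + bytes([blob[-1] ^ 0x01])
try:
    verified_loads(tampered)
    rejected = False
except ValueError:
    rejected = True
print(result, rejected)
```

This only proves the pickle came from someone holding the key. Anyone with the key can still run code in the reader, which is why it does nothing for the client-to-server case.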


I am surprised when people use pickle NOT as a last resort.

For numeric data, H5 is nice. For configs, JSON is pretty much a standard. For Python code... well, nothing beats Python code.


Pickle's greatest flaw is the complete lack of forward and backward compatibility. Compatibility is not guaranteed when upgrading any of the dependencies. Dependencies would have to stay the same across releases, halting forward progress in the development process.


Can you elaborate? What dependencies? Pickle is in the standard library.


Library dependencies, such as managed by Conda or Pip (from PyPI). Any change in the expected interface from a dependency to your depickled objects will create an error.

I author package "FooBar". FooBar creates objects of type "Baz" which will later be serialized by you, the FooBar user. Within the FooBar package there is a function "def show(baz)" that takes a Baz and displays it. If I decide at some point to add an extra attribute to my "Baz" objects, I cannot use that attribute within that function (or must be extra careful doing so) if I expect "Baz" objects to have been pickled. This situation can be unpredictable and cause hard-to-track errors.
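The FooBar scenario can be reproduced in one file by redefining the class between dump and load. This is a sketch: the `color` attribute and `show_tolerant` are invented for illustration, and the redefinition stands in for a package upgrade.

```python
import pickle


class Baz:
    """Version 1 of the hypothetical FooBar type: only `name`."""

    def __init__(self, name):
        self.name = name


old_pickle = pickle.dumps(Baz("widget"))  # saved by a user running version 1


class Baz:  # version 2 replaces version 1 (simulates upgrading FooBar)
    def __init__(self, name, color="red"):
        self.name = name
        self.color = color


def show(baz):
    return f"{baz.name} ({baz.color})"


# Unpickling restores the saved attributes without calling __init__,
# so the old object never receives `color`.
stale = pickle.loads(old_pickle)
try:
    show(stale)
    failure = None
except AttributeError as exc:
    failure = exc


def show_tolerant(baz):
    # One defensive option: fall back to a default for attributes old pickles lack.
    return f"{baz.name} ({getattr(baz, 'color', 'red')})"


print(type(failure).__name__, show_tolerant(stale))
```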


Perhaps they're referring to your application's dependencies, in a situation where you're pickling instances of those dependencies' types. Then this is an example of "old pickles look like old code".


I have this problem a lot with pandas DataFrames. I know I could engineer around it, but most of the time I just want to distribute some number crunching to a bunch of docker images quickly, and a database is overkill for a one-off analysis. Works fine until an image updates pandas. Not taking issue with pandas or pickle, but it's an issue of time trade-off. JSON is okay, but object conversion/NaN/None/inf can be a bear.
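A small illustration of the NaN/None/inf part, using plain floats and the standard `json` module rather than pandas (the `row` dictionary is made up):

```python
import json
import math

row = {"mean": float("nan"), "upper": float("inf"), "note": None}

# By default the json module emits NaN and Infinity tokens,
# which are not part of the JSON standard and which other parsers may reject.
text = json.dumps(row)
print(text)

# Asking for strictly valid JSON makes the same data an error instead.
try:
    json.dumps(row, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc

# Python's own parser accepts the non-standard tokens, so the round trip
# works here even though the text is not portable.
round_tripped = json.loads(text)
print(math.isnan(round_tripped["mean"]), round_tripped["upper"], round_tripped["note"])
```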


Data serialization is hard and the artifacts are much longer lived than executable code and even our API interfaces. YAML, JSON, XML are all flawed. There are many competing binary serialization frameworks. Beware. Dar be Dragons in Durable Data.


I find it interesting that everyone so far has suggested JSON as a pickle alternative. Depending on why you are serializing and deserializing the data, a lot of times the true replacement for pickle is a full-fledged database.


Right now I'm working on a large Monte Carlo project where I need to export the result of every simulation. Inserting into a database takes longer than just pickling the result.


It must only be used for small programs and it serves the purpose well.


I use it for transferring stuff from a data collection program to a Jupyter notebook. It works quite well for that. Converting big numpy arrays to text and back again would be cumbersome.


If all you're transferring is numpy arrays, they natively support an efficient, cross-platform, forward-compatible binary format that might work better for you.


Good point, I'll look into that. Usually it's a mixture of things like numpy arrays and meta-data. I find it's easy to save a lot of stuff along with the data, that I might regret not having later on.


Similar to PHP's serialize() and unserialize()


Also very, very similar to Java's serialization mechanism.


Basically every language with a vm with boxed types has to have this. Erlang has term_to_binary, which has some crazy superpowers, like I can serialize a lambda, put it on a pigeon, and have it run on an airgapped machine (assuming any module referenced in the lambda has an equivalently named version in the airgapped VM). Of course you can see how this could also be a security problem if you're not careful.

On the other hand, this is part of how erlang distributed systems (which are crazy easy) can communicate with extremely simple semantics, and the security model for erlang distribution is very explicitly "trusted only; locking down the cluster as a single security domain is YOUR responsibility".


Indeed. The book Effective Java dedicates a whole chapter to how serialization works, what the pitfalls are, and how to avoid them (with care and effort, not automatically).



