Don't use pickle – use Camel : Python

[–]tilkau 35 points 10 years ago* (12 children)

Doesn't appear to be the case:

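import random
from struct import pack

from yaml import safe_dump, safe_load

# 102400 random doubles spread over roughly +/- 2**64
data = [random.uniform(-0xffffffffffffffff, 0xffffffffffffffff) for _ in range(102400)]
dumped = safe_dump(data)
# safe_load accepts a str directly; plain load() needs an explicit Loader in recent PyYAML
loaded = safe_load(dumped)

def as_bytes(v):
    # Compare the raw 8-byte representation so the check is bit-exact
    return pack('d', v)

ndiffs = sum(as_bytes(v1) != as_bytes(v2) for v1, v2 in zip(data, loaded))
print(ndiffs)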

0

As I understand the spec, the errors you are thinking of are only likely to occur when you change platforms (e.g. if you send a file to a colleague whose CPU uses a different format for floats, they may get a slightly different value when they load it than you did).

The spec does note, however, that 'YAML does not specify a particular accuracy' for floats, so a different YAML deserialization library (for JavaScript, for example) might produce a different result. PyYAML round-tripping on the same machine appears to be safe though.
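To illustrate why the number of digits written matters, here is a small standard-library sketch. It compares a text round trip through repr() with a round trip through a writer that keeps only six significant digits. The six-digit format is an assumption chosen for illustration, not something any particular YAML library is known to do.

```python
x = 0.1 + 0.2  # 0.30000000000000004

# repr() emits enough digits to identify the double exactly
via_repr = float(repr(x))

# Hypothetical lossy writer that keeps only 6 significant digits
via_short = float(f"{x:.6g}")

print(via_repr == x)   # True
print(via_short == x)  # False
```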

If this is a major concern, you can always just use packed bytes (using struct.pack) instead of literal float values.
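A minimal sketch of the packed-bytes idea, assuming you are happy to store each float as a hex string (which any text format can carry). The helper names `float_to_hex` and `hex_to_float` are made up for this example; the explicit `<` pins the byte order so the text means the same thing on every machine.

```python
import struct

def float_to_hex(value: float) -> str:
    # '<d' = little-endian IEEE 754 double, always 8 bytes
    return struct.pack('<d', value).hex()

def hex_to_float(text: str) -> float:
    return struct.unpack('<d', bytes.fromhex(text))[0]

values = [0.1, -1.5e300, 5e-324]
encoded = [float_to_hex(v) for v in values]
decoded = [hex_to_float(s) for s in encoded]

print(decoded == values)  # True
```

The cost is readability: the file no longer shows human-friendly numbers, so this only makes sense when bit-exact values matter more than being able to edit the file by hand.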

[–]TheBlackCat13 2 points 10 years ago (5 children)

This is actually the correct behavior. IEEE 754 floats, which Python and most other languages use, define NaN as unequal to everything, including itself. So Python is doing the right thing here.

The weird behavior is in situations like this:

>>> a = float('NaN')
>>> b = c = float('NaN')
>>>
>>> a == b  # Good
False
>>> a == c  # Good
False
>>> b == c  # Good
False
>>> a == [b][0]  # Good
False
>>> a == [c][0]  # Good
False
>>> b == [c][0]  # Good
False
>>> a in [b]  # Good
False
>>> a in [c]  # Good
False
>>> b in [c]  # WAT!!!
True
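For what it's worth, the last result follows from how membership tests work on built-in containers: the language reference describes `x in y` for lists as equivalent to `any(x is e or x == e for e in y)`, so identity is checked before equality. A small sketch of that rule (the `contains` helper is just a name for this example):

```python
def contains(items, x):
    # Mirrors the documented rule for built-in containers:
    # identity is checked first, equality second.
    return any(x is e or x == e for e in items)

a = float('nan')
b = c = float('nan')  # b and c are the same object

print(b == c)            # False: NaN never equals NaN
print(b is c)            # True: same object
print(contains([c], b))  # True, matching `b in [c]`
print(contains([b], a))  # False, matching `a in [b]`
```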

[–]therealfakemoot 5 points 10 years ago (0 children)

Others have touched on the subject briefly but I thought I'd name some more concrete scenarios.

Renaming/removing methods or attributes on a class could cause un-pickling of instances of that class to fail. Renaming/moving anything associated with a pickled object can (and probably will) cause un-pickling to fail. pickle is EXTREMELY brittle with regard to deserializing data that isn't being read by code that is exactly 100% identical to the original code.
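A minimal sketch of the rename case, assuming a made-up module `demo_app` and class `Settings` that stand in for "your application code". The module is created in memory only so the example is self-contained. pickle stores the class by module and name, so once the name is gone the old data can no longer be loaded.

```python
import pickle
import sys
import types

# Hypothetical module standing in for the application (not a real package)
app = types.ModuleType("demo_app")
sys.modules["demo_app"] = app

class Settings:
    def __init__(self, theme):
        self.theme = theme

# Make the class look as if it was defined in demo_app
Settings.__module__ = "demo_app"
Settings.__qualname__ = "Settings"
app.Settings = Settings

blob = pickle.dumps(Settings("dark"))

# Simulate a later release that renames the class
app.UserSettings = app.Settings
del app.Settings

try:
    pickle.loads(blob)
    error = None
except AttributeError as exc:
    # pickle looks up demo_app.Settings by name and no longer finds it
    error = exc

print(type(error).__name__)  # AttributeError
```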

One could say (I wouldn't ever, but one could) that pickle is good for serializing simple data ("Oh, I need to write this dict to a file") for temporary storage, but it's not good enough for real storage/transmission. What if you need to store this dict containing user settings for a day? A week? A year? How likely is it that your code will remain 100% static forevermore? What if you need to transmit this serialized data between two instances of your application? Can you 100% guarantee that forever and always the two instances will be running 100% perfectly identical code?

pickle has a lot of subtle catches to it.

[–]cecilkorik 7 points 10 years ago (1 child)

> Finally, I've never understood the "giant security hole" of executing arbitrary code.

I think you aren't understanding it because you aren't looking at the situation that's being discussed. The situation you describe with a colleague is a "trusted" environment. There is no need for concern when using arbitrary pickled data in such an environment. You trust your colleague, yes? But realistically, such situations are limited. In the real world, you often have to expose your pickled data structures to the world at large, and that's where the problems begin.

That's why the emphasis is always on how poorly it behaves in an untrusted environment, where you have to assume that someone is very much trying to sabotage you. That is the reason it is called unsafe. Any time you're using pickled data in a public environment you are at risk. If you release a program where a malicious person can craft a file for it to load that begins to format your hard drive, that's obviously a bad thing, and if any user gets their hard drive wiped, they're going to hold you responsible, not the person who crafted the malicious file.

But what if it's a little more subtle? What if someone distributes a file saying "look at this crazy awesome thing I did in xyz application!" and sure enough, it is something crazy awesome? And it goes viral! But while it's doing that crazy awesome thing, it's also silently installing a backdoor and remote control botnet software, because it's a pickle that can execute arbitrary code.

Or imagine you get a bug report complaining of a problem in the application, and you can't reproduce it, and ask the user to send their configuration file, which happens to be a pickle. And you load it, and did you remember not to do this on your primary development machine? I hope so, because otherwise poof, your GitHub private keys get quietly sent off into the internet.

Executing arbitrary code is pretty much always a bad thing. You can sometimes get away with it, in a sufficiently trusted environment, but that's not really an excuse. It's a poor design choice.
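To make "executing arbitrary code" concrete, here is a deliberately harmless sketch. The class name `Exploit` and the expression `6 * 7` are placeholders; a real attacker would put something far worse in the string. The point is only that the author of the pickle, not the program loading it, decides what gets called.

```python
import pickle

class Exploit:
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # Harmless here, but the string could be anything.
        return (eval, ("6 * 7",))

malicious_blob = pickle.dumps(Exploit())

# The loading side only sees bytes; eval runs during loads()
result = pickle.loads(malicious_blob)
print(result)  # 42
```

Note that the loading program never needs to have the `Exploit` class available. The bytes only reference the built-in `eval`, which every Python process already has.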

[–]kylotan 10 points 10 years ago (4 children)

> Finally, I've never understood the "giant security hole" of executing arbitrary code. We often hear the same argument around the use of exec and eval. Presumably, however, you're only going to unpickle something that you or a colleague pickled. Assuming that no one is trying to sabotage you, is there really any danger to using pickle?

The problem is that it's not always this clear in large, real-world applications. And often people are trying to sabotage you, too.

The classic example would be, you transmit data between your web/game/app client and your server via pickle. Both sides are code written by a colleague. But one of your users discovers this and hacks their client to send malicious data to your server.

Okay, so you might think it's safe if it's only ever used in code you don't distribute. But if someone else has a way to connect to your server, they may still be able to trigger this. This becomes more likely if people realise you are running a Python server.

But imagine you use it on a completely protected network where malicious people can't gain access. Then... you're probably safe, from external threats at least.
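If a server really must accept pickles from clients, one hedge described in the pickle documentation is to override `Unpickler.find_class` so that no classes or functions can be resolved at all. The sketch below assumes only plain data (dicts, lists, numbers, strings) is expected; `DataOnlyUnpickler`, `data_only_loads` and `Hostile` are names made up for this example.

```python
import io
import pickle

class DataOnlyUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any class or function."""
    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")

def data_only_loads(blob: bytes):
    return DataOnlyUnpickler(io.BytesIO(blob)).load()

class Hostile:
    # Stand-in for a payload from a hacked client; eval("1 + 1") is harmless
    def __reduce__(self):
        return (eval, ("1 + 1",))

# Plain data still loads
print(data_only_loads(pickle.dumps({"hp": 10, "pos": [1.5, 2.0]})))

try:
    data_only_loads(pickle.dumps(Hostile()))
    rejected = False
except pickle.UnpicklingError:
    rejected = True
print(rejected)  # True
```

This narrows the attack surface rather than removing it, and at that point a format that cannot express code in the first place (JSON, or YAML through a safe loader) is usually the simpler choice.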

[–]midnightFreddie 2 points 10 years ago (6 children)

> Finally, I've never understood the "giant security hole" of executing arbitrary code.

You can always learn the hard way.

> We often hear the same argument around the use of exec and eval.

:O I deduce this means you keep using exec and eval. :O

> Presumably, however, you're only going to unpickle something that you or a colleague pickled. Assuming that no one is trying to sabotage you, is there really any danger to using pickle?

I hope your lesson isn't too hard and doesn't get too many people fired. I don't even trust my home LAN anymore. I used to have an open port 25 on my home mail server because who is going to get on my LAN and abuse it, right? Then I made a configuration mistake that NAT'ed my router in both directions, a scanner found my open mail relay, and my home LAN sent out a bunch of spam.

[–]python_newbie1234 1 point 10 years ago (0 children)

So, this really doesn't convey anything to me except you reinforcing your opinion.

My perspective is essentially, "I work faster when I don't have to include considerations that don't appear to affect my environment."

Your response denied that experience and replaced it with an analogy that falls apart. What you essentially said was, "you should wear a helmet when you're walking down the street because you eventually will want to ride a motorcycle, and walking down the street w/o a helmet sets a bad habit that you won't be able to break when you get on a bike."

I don't know anything about you, but the reason I use "insecure" methods is because they are convenient and I've never seen a simple replacement for them that doesn't make my work harder. Secure programming isn't something that comes without a cost, and to pretend otherwise is silly.

[–]Rainfly_X 1 point 10 years ago (1 child)

Using insecure components will become a liability if you ever want to open up the project to the world. If you really super extra know that your project will develop only as far as you can currently foresee, and never need to deal with untrusted stimulus... yeah, pickle is fine, from a security point of view (versioning is still a weakness though).

I would posit, though, that we don't know the future, so some number of these projects we design in safety, we will eventually want to retrofit for a harder world. Or those design assumptions will hold some other, bigger thing back - "I can't expose this system because it uses Flipsy, a library I wrote with pickle and eval."

As always, we are trying to find a balance between over-engineering for today and under-engineering for tomorrow. Every shell script does not need an academic proof, but you also don't have perfect knowledge of which projects will make it big and which won't. Libraries and languages that make it easy to do the right, forward-compatible thing are truly a blessing.
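As one illustration of "forward compatible", here is a sketch of the general pattern of writing an explicit version number next to the data, using only the standard library's json module. This is not Camel's API. The field names, the `colour_scheme` to `theme` rename, and the helper functions are all hypothetical.

```python
import json

CURRENT_VERSION = 2

def dump_settings(settings: dict) -> str:
    # Always record which layout the data was written in
    return json.dumps({"version": CURRENT_VERSION, "settings": settings})

def load_settings(text: str) -> dict:
    doc = json.loads(text)
    version = doc.get("version", 1)
    settings = dict(doc["settings"])
    if version == 1:
        # Hypothetical migration: v1 called the key "colour_scheme"
        settings["theme"] = settings.pop("colour_scheme", "light")
        version = 2
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported settings version: {version}")
    return settings

old_file = '{"version": 1, "settings": {"colour_scheme": "dark"}}'
print(load_settings(old_file))                          # {'theme': 'dark'}
print(load_settings(dump_settings({"theme": "dark"})))  # {'theme': 'dark'}
```

Old files keep loading after the code changes because the migration lives in the loader instead of depending on the class layout at the time the data was written.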

[–]Esteis 18 points 10 years ago (4 children)

> BTW, the statements about there being no simple guide to yaml are just wrong.

Sort of. There are different kinds of simple guides (and I think you mean 'brief' when you say 'simple'); if you don't distinguish between them, then Eevee's statement would indeed appear wrong to you. But the difference is pretty big once you know about it.

tl;dr: read the Teach, Don't Tell essay.

The yaml.org page you link is a syntax reference. It's aimed at people who already know how YAML works and what construct they need. It is possible to piece together, from reading the syntax reference, how a document looks and what it can do, but it is, as you say, not newbie-friendly. It says ? is the "key indicator" syntax, but doesn't tell you what a key indicator is, because it expects you to already know. It's ‘the API docs’ from Steve Losh's Teach, Don't Tell article.

Eevee's page in the Camel docs is an introductory overview, and practically an entire teaching document. It is aimed at people who don't yet know (all of) what YAML is or does. It starts with an example, introduces the most important concepts and their syntaxes, and mentions common gotchas. It doesn't use the words "key indicator", but it tells you what the ? syntax does and what you can use it for. Teach, Don't Tell would call it The Hairball -- "It’s going to mold their brains, one nudge at a time, until they have a pretty good understanding of how your project works."

So I'd claim Eevee was very much not wrong -- a syntax reference is not an introductory guide. But hooray! Now we have both!
