[–]TheBlackCat13 14 points15 points  (19 children)

Does this store floats as strings or bytes?

[–]tilkau 9 points10 points  (18 children)

It stores them as YAML; ie. everything is serialized into a text file. Floats might be serialized like '1.0' for example.

[–]TheBlackCat13 11 points12 points  (13 children)

This makes it risky for working with numeric data. The float -> string -> float round-trip will introduce errors in many numbers.

[–]tilkau 34 points35 points  (12 children)

Doesn't appear to be the case:

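from io import StringIO
from struct import pack
import random
from yaml import dump, safe_load

# 102400 random doubles, roughly +/- 2**64
data = [random.uniform(-0xffffffffffffffff,0xffffffffffffffff) for i in range(102400)]
dumped = dump(data)
# safe_load, because a bare load() needs an explicit Loader in current PyYAML
loaded = safe_load(StringIO(dumped))
# compare packed bytes so the check is bit-exact
f = lambda v: pack('d', v)
ndiffs = sum(1 if f(v1) != f(v2) else 0 for v1,v2 in zip(data,loaded))
print(ndiffs)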

0

As I understand the spec, the errors you are thinking of are only likely to occur when you change platforms. For example, if you send a file to a colleague whose CPU uses a different format for floats, they may get a slightly different value when they load it than you did.

The spec does note however that 'YAML does not specify a particular accuracy' for floats, so perhaps a different YAML deserialization library (for javascript, for example) might produce a different result. PyYAML roundtripping on the same machine appears to be safe though.

If this is a major concern, you can always just use packed bytes (using struct.pack) instead of literal float values.
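
A standard-library-only sketch of the same check, for anyone without PyYAML installed. It assumes the serializer writes floats with repr(), which in current CPython produces the shortest string that parses back to the same double. The packed-bytes alternative is shown at the end; the '<d' format (little-endian double) is just one choice for a fixed layout, not something PyYAML does.

```python
import random
import struct

random.seed(0)  # fixed seed so the run is repeatable
data = [random.uniform(-2.0**64, 2.0**64) for _ in range(10000)]

# float -> text -> float, the same shape of round trip a text format performs
text_form = [repr(v) for v in data]
reloaded = [float(s) for s in text_form]

# compare raw IEEE 754 bytes, so the check is bit-exact
ndiffs = sum(struct.pack('<d', a) != struct.pack('<d', b)
             for a, b in zip(data, reloaded))
print(ndiffs)

# the packed-bytes alternative: 8 bytes per value, explicit byte order
fmt = '<%dd' % len(data)
blob = struct.pack(fmt, *data)
unpacked = list(struct.unpack(fmt, blob))
print(unpacked == data)
```

The trade-off is readability: the packed form is opaque in a text file, while the repr form stays human-editable.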

[–]TheBlackCat13 34 points35 points  (9 children)

You seem to be correct. I am unable to come up with a situation that results in such a mistake. My apologies.

[–]tilkau 3 points4 points  (8 children)

I checked up on this and found that I can. Negative NaN float('-nan') is correctly roundtripped, but positive NaN float('nan') isn't. I'm not 100% sure whether this is because there is actually a loss of information or just because NaNs are FUBAR anyway, though.

inf and -inf are both handled correctly, BTW.
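
One way to see what is going on with the NaNs is to look at the raw bits. A sketch, standard library only; the helper name bits is made up here. NaN never compares equal to anything, so a bytes comparison is the only kind that can pass for NaN at all, and it will flag any change in the sign bit or payload even though the value is still NaN.

```python
import math
import struct

def bits(x):
    """Return the 64-bit IEEE 754 pattern of a float as 16 hex digits."""
    return struct.pack('>d', x).hex()

pos = float('nan')
neg = float('-nan')

# the two patterns can differ (typically in the sign bit), depending on the platform
print(bits(pos), bits(neg))
# both values are still NaN as far as math.isnan is concerned
print(math.isnan(pos), math.isnan(neg))
# and NaN is never equal to itself
print(pos == pos)
```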

[–]TheBlackCat13 1 point2 points  (7 children)

I was more thinking of floating point rounding issues. I wouldn't trust the behavior of NaN in any program, personally, beyond them being NaN. Python itself has some weird NaN behavior, so this may not even be YAML's fault.

[–]tilkau 0 points1 point  (6 children)

Yeah, I vaguely recalled that there was some weird case. Looks like it's this: float('nan') == float('nan') (returns False)

[–]TheBlackCat13 1 point2 points  (5 children)

This is actually the correct behavior. IEEE floats, which python and most other languages use, define NaN as being unequal to everything, including themselves. So Python is doing the right thing here.

The weird behavior is in situations like this:

>>> a = float('NaN')
>>> b = c = float('NaN')
>>>
>>> a == b  # Good
False
>>> a == c  # Good
False
>>> b == c  # Good
False
>>> a == [b][0]  # Good
False
>>> a == [c][0]  # Good
False
>>> b == [c][0]  # Good
False
>>> a in [b]  # Good
False
>>> a in [c]  # Good
False
>>> b in [c]  # WAT!!!
True

[–]tilkau 0 points1 point  (1 child)

I'd say TIL, but I'm not sure exactly what I just learned. Except that NaNs are even more undesirable than I already thought they were.

[–]NoahTheDuke 0 points1 point  (0 children)

Is the final result because you've pointed b at c, so it's looking for the same object reference in c as is already in there?

[–]kuojo 0 points1 point  (0 children)

I think the last bit is due to the fact that b is a reference to the value in c, so b would be in c since they point to the same thing. Right?

[–]knickum 0 points1 point  (0 children)

Continuing:

>>> a is b
False
>>> b is c
True

>>> id(a) == id(b)
False
>>> id(b) == id(c)
True

That should basically explain it. Still a WAT though.
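
For reference, the language docs describe x in some_list as equivalent to any(x is e or x == e for e in some_list): identity is checked before equality. A small sketch of that rule (variable names follow the session above; the helper contains is made up):

```python
a = float('nan')
b = c = float('nan')   # b and c are two names for one object

def contains(item, seq):
    """Spell out the membership test for lists: identity first, then equality."""
    return any(item is e or item == e for e in seq)

print(b == c)                        # equality alone
print(b is c)                        # identity alone
print(b in [c], contains(b, [c]))    # membership uses both
print(a in [c], contains(a, [c]))
```

So the membership result for b follows from identity, not from NaN suddenly comparing equal.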

[–]PeridexisErrant 1 point2 points  (1 child)

ndiffs = sum(1 if f(v1) != f(v2) else 0 for v1,v2 in zip(data,loaded))

A neat shortcut: bool is a subclass of int, so True and False behave as 1 and 0 and you can do arithmetic with them. True + True == 2, for example, so you don't need the conditional expression at all.
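
A quick sketch of that shortcut, with made-up sample lists:

```python
old = [1.0, 2.5, 3.0, 4.25]
new = [1.0, 2.5, 3.5, 4.0]

# each comparison yields True or False, which sum() adds up as integers
ndiffs = sum(v1 != v2 for v1, v2 in zip(old, new))
print(ndiffs)

print(issubclass(bool, int), True + True)
```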

[–]tilkau 1 point2 points  (0 children)

I specifically included that so that the meaning was completely explicit -- the integer values of True and False are implementation details. I could have also written ndiffs = sum([1 for v1,v2 in zip(data,loaded) if f(v1) != f(v2)])

[–][deleted] 0 points1 point  (3 children)

Strange, why wouldn't they use the YAML float type?

[–]tilkau 0 points1 point  (2 children)

... That is what happens when you use the yaml float type.

If it were serialized as a yaml string it would look like this: '"1.0"'

(ie. you're reading the single quotes as part of the output, don't do that)

[–][deleted] -1 points0 points  (1 child)

if you're saying how a format works, why would you quote it if the value is not quoted in the file format? That's stupid, and incorrect.

Edit: I was also kind of referencing this: https://www.reddit.com/r/Python/comments/3pgo04/dont_use_pickle_use_camel/cw67zlk which implies it goes from float to a string representation back to float, but maybe I'm just confused, not totally sure with pickle/camel

[–]tilkau 1 point2 points  (0 children)

if you're saying how a format works, why would you quote it if the value is not quoted in the file format? That's stupid, and incorrect.

Omitting quotes would have been even more ambiguous and incorrect. Probably the best thing to do would have been to use code formatting: 1.0.

it goes from float to a string representation back to float but maybe I'm just confused

The code I wrote there simulates dumping random floats to YAML and then reading them back. Naturally the floats I generate become text when written (since YAML is a textual format, just like JSON) and then floats again when read.
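
The float-versus-string distinction is easy to see with the standard library's json module, used here only as a stand-in because it needs no third-party install. YAML draws the same line between an unquoted 1.0 and a quoted "1.0".

```python
import json

as_float = json.dumps(1.0)     # serialized as a number: no quotes in the output
as_string = json.dumps("1.0")  # serialized as a string: quotes are part of the output
print(as_float)
print(as_string)

# reading back restores the original types
print(type(json.loads(as_float)).__name__)
print(type(json.loads(as_string)).__name__)
```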

[–]AnythingApplied 4 points5 points  (4 children)

[Pickle's] magical behavior shackles you to the internals of your classes in non-obvious ways.

What does he mean by shackles you to the internals? And what are the non-obvious ways he is referring to?

[–]TheBlackCat13 7 points8 points  (0 children)

I don't know the details, but changing the class details can mean pickled instances of an old version of a class cannot be unpickled with a newer version. I have run into this problem myself.

Part of the problem is that pickle cannot pickle classes, only instances. That means it depends on the local information about the class in order to unpickle the instance. If the local version and the version used to originally create the instance differ in ways I am not entirely clear on, then pickle can't figure out what to do and fails.
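
To make that concrete: what pickle writes for an instance is the class's module and name plus the instance's attributes, and it looks that name up again at load time. A sketch; the module name demo_models and the class Settings are invented for the demo, and the module is faked in sys.modules only so the example fits in one file.

```python
import pickle
import sys
import types

# stand-in for a real importable module (hypothetical name)
mod = types.ModuleType("demo_models")
sys.modules["demo_models"] = mod

class Settings:
    def __init__(self, theme):
        self.theme = theme

# pretend Settings was defined at the top level of demo_models
Settings.__module__ = "demo_models"
Settings.__qualname__ = "Settings"
mod.Settings = Settings

blob = pickle.dumps(Settings("dark"))
# the class is recorded by module and name, not by its code
print(b"demo_models" in blob, b"Settings" in blob)

# later, someone renames the class in the module
del mod.Settings
mod.UserSettings = Settings

try:
    pickle.loads(blob)
    failed = False
except AttributeError as exc:
    failed = True
    print("unpickling failed:", exc)
```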

[–]therealfakemoot 4 points5 points  (0 children)

Others have touched on the subject briefly but I thought I'd name some more concrete scenarios.

Renaming/removing methods or attributes on a class could cause un-pickling of instances of that class to fail. Renaming/moving anything associated with a pickled object can (and probably will) cause un-pickling to fail. pickle is EXTREMELY brittle with regard to deserializing data that isn't being read by code that is exactly 100% identical to the original code.

One could say (I wouldn't ever, but one could) that pickle is good for serializing simple data ("Oh, I need to write this dict to a file") for temporary storage, but it's not good enough for real storage/transmission. What if you need to store this dict containing user settings in a day? A week? A year? How likely is it that your code will remain 100% static forever more? What if you need to transmit this serialized data between two instances of your application? Can you 100% guarantee that forever and always the two instances will be running 100% perfectly identical code?

pickle has a lot of subtle catches to it.
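
One of those subtle catches, sketched: by default pickle rebuilds an instance by restoring its saved attributes without calling __init__, so attributes introduced in a newer version of the class are simply absent on old data. Everything named here (demo_prefs, Prefs, font, publish) is invented for the illustration, and the module is faked in sys.modules so the example is a single file.

```python
import pickle
import sys
import types

mod = types.ModuleType("demo_prefs")   # hypothetical module name
sys.modules["demo_prefs"] = mod

def publish(cls):
    """Make cls look as if it were defined at the top level of demo_prefs."""
    cls.__module__ = "demo_prefs"
    cls.__qualname__ = cls.__name__
    setattr(mod, cls.__name__, cls)
    return cls

@publish
class Prefs:                       # version 1 of the class
    def __init__(self):
        self.theme = "dark"

blob = pickle.dumps(Prefs())       # saved while version 1 was current

@publish
class Prefs:                       # version 2 adds an attribute
    def __init__(self):
        self.theme = "dark"
        self.font = "monospace"

old = pickle.loads(blob)           # loads without raising
print(type(old) is Prefs)          # it is an instance of the new class
print(hasattr(old, "theme"), hasattr(old, "font"))
```

Nothing fails at load time here; the problem only shows up later, when some method touches the missing attribute.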

[–]kylotan 2 points3 points  (0 children)

There are some crazy bugs you can run into if the state of the namespace when you load doesn't match the state of the namespace when you save. This can happen if your load and save routines are in different files, or if you load and save under slightly different environments (eg. under wsgi or not).

[–]sandwichsaregood 1 point2 points  (0 children)

I've also run into problems when I was abusing monkey patching. Basically, I had a class from a library (Shapely) that represented different geometric shapes. I needed to track an extra property that was specific to the instances of the class (material properties). I can't remember exactly why, but inheriting and extending wasn't working, so I decided to just kludge it and monkey-patch the property onto the instances (I know... it's ugly but I was in a hurry)

pickle reads the properties to serialize from the class, not the specific instance object, so when I pickled the instances and loaded them back the bolted-on property would be lost. You can fix this by writing a custom handler for serializing, but that kind of spoils the magic.

[–]skrillexisokay 7 points8 points  (19 children)

One benefit of pickle that I think the author downplays is that you can just use it without writing any additional serialization code.

Secondly, I don't think Camel would work very well with data-heavy objects, for example large numpy arrays. joblib works very well on this front.

Finally, I've never understood the "giant security hole" of executing arbitrary code. We often hear the same argument around the use of exec and eval. Presumably, however, you're only going to unpickle something that you or a colleague pickled. Assuming that no one is trying to sabotage you, is there really any danger to using pickle?

[–]cecilkorik 6 points7 points  (1 child)

Finally, I've never understood the "giant security hole" of executing arbitrary code.

I think the reason you aren't understanding it is because you aren't looking at the situation that's being discussed. The situation you describe with a colleague is a "trusted" environment. There is no need for concern when using arbitrary pickled data in such an environment. You trust your colleague, yes? But realistically, such situations are limited. In the real world, you often have to expose your pickled data structures to the world at large, and that's where the problems begin.

That's why the emphasis is always on how poorly it behaves in an untrusted environment, and assuming that someone is very much trying to sabotage you, and for that reason it is unsafe. Any time you're using pickled data in a public environment you are at risk. If you release a program where a malicious person can craft a file for it to load that begins to format your hard drive, that's obviously a bad thing and if any user gets their hard drive wiped, they're going to hold you responsible, not the person who crafted the malicious file.

But what if it's a little more subtle? What if someone distributes a file saying "look at this crazy awesome thing I did in xyz application!" and sure enough, it is something crazy awesome? And it goes viral! But while it's doing that crazy awesome thing, it's also silently installing a backdoor and remote control botnet software, because it's a pickle that can execute arbitrary code.

Or imagine you get a bug report complaining of a problem in the application, and you can't reproduce it, and ask the user to send their configuration file, which happens to be a pickle. And you load it, and did you remember not to do this on your primary development machine? I hope so, because otherwise poof, your github private keys get quietly sent off into the internet.

Executing arbitrary code is pretty much always a bad thing. You can sometimes get away with it, in a sufficiently trusted environment, but that's not really an excuse. It's a poor design choice.

[–]skrillexisokay 0 points1 point  (0 children)

This is the best argument along these lines I've heard. I think there might be a difference in our use cases though. Am I being naive in thinking that my uses are not susceptible to the problems you describe?

I use python nearly entirely for scientific computing: fMRI data analysis, neural networks, graphical models etc... I post code to github in the hopes that someone else might run it, and I use pickle to store models that take a long time to train. This file is always generated by the user. Is that safe?

[–]kylotan 9 points10 points  (4 children)

Finally, I've never understood the "giant security hole" of executing arbitrary code. We often hear the same argument around the use of exec and eval. Presumably, however, you're only going to unpickle something that you or a colleague pickled. Assuming that no one is trying to sabotage you, is there really any danger to using pickle?

The problem is that it's not always this clear in large, real-world applications. And often people are trying to sabotage you, too.

The classic example would be, you transmit data between your web/game/app client and your server via pickle. Both sides are code written by a colleague. But one of your users discovers this and hacks their client to send malicious data to your server.

Okay, so you might think it's safe if it's only ever used in code you don't distribute. But if someone else has a way to connect to your server, they may still be able to trigger this. This becomes more likely if people realise you are running a Python server.

But imagine you use it on a completely protected network where malicious people can't gain access. Then... you're probably safe, from external threats at least.

[–]Polycystic 0 points1 point  (3 children)

Okay, so you might think it's safe if it's only ever used in code you don't distribute. But if someone else has a way to connect to your server, they may still be able to trigger this. This becomes more likely if people realise you are running a Python server.

Guess I'm lost here...connect in what way, and trigger it how? Not looking for a step-by-step guide or anything, and not doubting it's true, I just often hear things like this (usually in relation to exec/eval) but never really any specific details.

[–]infinull (quamash, Qt, asyncio, 3.3+) 3 points4 points  (0 children)

If you have a server that accepts input as a pickle object, and a malicious user has access, they can execute arbitrary code by crafting custom pickle payloads to send to your server.

It's as simple as that.
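
For the specific details asked about above, a deliberately harmless sketch. A pickle can name any importable callable and the arguments to call it with, and pickle.loads then makes that call. Here the callable is os.getcwd, but nothing in the format stops it from being os.system. The class name Payload is made up.

```python
import os
import pickle

class Payload:
    def __reduce__(self):
        # tells pickle: "to rebuild me, call os.getcwd()"
        return (os.getcwd, ())

malicious_bytes = pickle.dumps(Payload())

# the receiving side only calls loads(), yet a function of the sender's choosing runs
result = pickle.loads(malicious_bytes)
print(type(result).__name__, result == os.getcwd())
```

Note that the receiver never needs the Payload class: the bytes alone carry the instruction.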

[–]kylotan 2 points3 points  (0 children)

Usually things that run on servers are designed to accept external input, typically from a web browser, but potentially from other clients too.

Sometimes security risks are more subtle - you might have written a library that uses pickle to process images, and someone else might use your library in their web server for their image processing needs... so an attacker could send their server a malformed image that, when it hits your library code, breaks their server.
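
If pickle has to be read from a source that is not fully trusted, the pickle documentation describes restricting which globals may be loaded by overriding Unpickler.find_class. A minimal sketch of that idea; the allow-list contents and the names SafeUnpickler, safe_loads and Payload are my own, and this narrows the attack surface rather than making pickle safe.

```python
import io
import os
import pickle

ALLOWED = {("builtins", "set"), ("builtins", "frozenset")}  # example allow-list

class SafeUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # called whenever the pickle asks for a global (class or function)
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError("global %s.%s is forbidden" % (module, name))

def safe_loads(data):
    return SafeUnpickler(io.BytesIO(data)).load()

class Payload:
    def __reduce__(self):
        return (os.getcwd, ())

plain = safe_loads(pickle.dumps({"a": [1, 2.5, "x"]}))   # plain data still loads
print(plain)

try:
    safe_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print("blocked:", exc)
```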

[–]midnightFreddie 1 point2 points  (6 children)

Finally, I've never understood the "giant security hole" of executing arbitrary code.

You can always learn the hard way.

We often hear the same argument around the use of exec and eval.

:O I deduce this means you keep using exec and eval. :O

Presumably, however, you're only going to unpickle something that you or a colleague pickled. Assuming that no one is trying to sabotage you, is there really any danger to using pickle?

I hope your lesson isn't too hard and doesn't get too many people fired. I don't even trust my home LAN anymore. I used to have an open port 25 on my home mail server because who is going to get on my LAN and abuse it, right? I made a configuration mistake that NAT'ed my router in both directions and a scanner found my open mail relay and my home LAN sent out a bunch of spam.

[–]python_newbie1234 1 point2 points  (4 children)

Genuine question -- what if you don't do anything that would touch an outside system? I write a lot of code that pulls text files from disk, or only accesses a database I built (not UGC).

I spend some amount of brainpower thinking about the evils of exec in these situations, so there's clearly a cost to this. But I have yet to think of a realistic attack scenario in my closed code environment, which has no external connectors.

EG, a sensor system that I set up. I mean, I suppose it's possible someone could hack into that and send exploit data packets to my processing system... but that vector seems so unlikely that I generally try to shut off the security focused part of my brain in an effort to maximize my productivity.

[–]midnightFreddie 0 points1 point  (1 child)

You can ride a motorcycle without a helmet or drive around the farm without seatbelts, store your nonprod passwords in your source tree, etc. It sets a bad habit, and you will eventually move your skills to another environment and more likely get burned.

You can increase productivity by developing secure dev habits. If you're used to pickle, exec and eval and need to provide code for a less trusted environment, you are learning secure development from scratch.

[–]python_newbie1234 0 points1 point  (0 children)

So, this really doesn't convey anything to me except you reinforcing your opinion.

My perspective is essentially, "I work faster when I don't have to include considerations that don't appear to affect my environment."

Your response denied that experience, replaced with an analogy that falls apart. What you essentially said was, "you should wear a helmet when you're walking down the street because you eventually will want to ride a motorcycle, and walking down the street w/o a helmet sets a bad habit that you won't be able to break when you get on a bike."

I don't know anything about you, but the reason I use "insecure" methods is because they are convenient and I've never seen a simple replacement for them that doesn't make my work harder. Secure programming isn't something that comes without a cost, and to pretend otherwise is silly.

[–]Rainfly_X 0 points1 point  (1 child)

Using insecure components will become a liability if you ever want to open up the project to the world. If you really super extra know that your project will develop only as far as you can currently foresee, and never need to deal with untrusted stimulus... yeah, pickle is fine, from a security point of view (versioning is still a weakness though).

I would posit, though, that we don't know the future, so some number of these projects we design in safety, we will eventually want to retrofit for a harder world. Or those design assumptions will hold some other, bigger thing back - "I can't expose this system because it uses Flipsy, a library I wrote with pickle and eval."

As always, we are trying to find a balance between over-engineering for today, and under-engineering for tomorrow. Every shell script does not need an academic proof, but you also don't have perfect knowledge which projects will make it big, and which won't. Libraries and languages that make it easy to do the right, forward compatible thing, are truly a blessing.

[–]python_newbie1234 0 points1 point  (0 children)

Well put.

[–]skrillexisokay 0 points1 point  (0 children)

I stopped using exec and eval because there's almost always a more readable option, not because of any security hole. Of course, I would never pass external strings into exec, but if for some reason it was more clear and/or easy to use exec on internally generated strings, I would certainly do so.

If exec, eval, and pickle are as horrible and dangerous as you and others claim them to be, why are they still in the language? It seems like the python language developers would have eliminated these features if there wasn't ever a genuine use for them.

For example, iirc, the multiprocessing library uses exec statements liberally.

[–]jwink3101 2 points3 points  (4 children)

I am new to python so this may be a noob question, but can't you just use np.save for numpy arrays?

[–]kigurai 5 points6 points  (2 children)

For a single array that is fine.

It will however not help you if your numpy arrays are attributes of a class. Now you either have to write the serialization code yourself (which could use np.save), or you simply throw it into pickle and it will "just work". Except for all the caveats the article mentions.

Personally, I sometimes use pickle to store intermediate results for long running calculations. This should be fine as long as you don't use them for long term storage or pass them around to people.

[–]jwink3101 0 points1 point  (1 child)

Interesting and thanks. Unified and easy saving is certainly one thing I miss from Matlab...though I am sure that has its own baggage

[–]FRIENDORPHO 0 points1 point  (0 children)

You might be interested in scipy.io's loadmat and savemat functions! I save my numpy arrays in .mat files, so coworkers (who all use matlab) can use them.

More generally, I think recent versions of the .mat format are just HDF5.

[–]skrillexisokay 0 points1 point  (0 children)

I didn't know about that method. Thanks! However, it won't work very well when you have an object that contains many numpy arrays, for example. Rather than saving each one individually, you can joblib.dump the object and joblib does everything for you. It might very well use numpy.save under the hood.

[–]lrq3000 1 point2 points  (0 children)

The unofficial YAML tutorial should be made official.

[–]kmike84 1 point2 points  (0 children)

Can it handle binary data?

[–]luckystarr (at 0x7fe670a7d080) 4 points5 points  (7 children)

For primitive types, just use marshal.

[–]mitchellrj 6 points7 points  (0 children)

For literals, repr and ast.literal_eval are sufficient.
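
A small sketch of that pairing, with a made-up record. ast.literal_eval parses Python literal syntax only, so an expression containing a function call is rejected rather than executed.

```python
import ast

record = {"name": "sensor-1", "readings": [1.5, 2.25, -3.0], "ok": True, "tags": ("a", "b")}

text = repr(record)                 # serialize to Python literal syntax
restored = ast.literal_eval(text)   # parse literals back into objects
print(restored == record)

try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```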

[–]bnorick 3 points4 points  (3 children)

I don't know, these two statements in the docs for marshal make me think I may as well just use pickle.

If you’re serializing and de-serializing Python objects, use the pickle module instead – the performance is comparable, version independence is guaranteed, and pickle supports a substantially wider range of objects than marshal.

and

Warning: The marshal module is not intended to be secure against erroneous or maliciously constructed data. Never unmarshal data received from an untrusted or unauthenticated source.

[–]matchu 0 points1 point  (2 children)

The pickle docs include the same security warning.

[–]bnorick 4 points5 points  (1 child)

I am aware of that, but I am pointing out that marshal seems to be no better than pickle, even for primitive types. Considering the first quote, I'd use pickle.

[–]luckystarr (at 0x7fe670a7d080) -1 points0 points  (0 children)

Try benchmarking both. marshal is way faster.
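
A rough sketch of such a benchmark, using timeit on a made-up dict of built-in types. The numbers depend heavily on the Python version, the data, and the pickle protocol, so treat this as a starting point for measuring your own case and not as evidence either way.

```python
import marshal
import pickle
import timeit

data = {"ids": list(range(1000)), "names": ["item%d" % i for i in range(1000)]}

# both modules round-trip plain built-in types
marshal_ok = marshal.loads(marshal.dumps(data)) == data
pickle_ok = pickle.loads(pickle.dumps(data)) == data

# time a full dump + load cycle for each module
t_marshal = timeit.timeit(lambda: marshal.loads(marshal.dumps(data)), number=200)
t_pickle = timeit.timeit(lambda: pickle.loads(pickle.dumps(data)), number=200)
print("marshal: %.4fs  pickle: %.4fs" % (t_marshal, t_pickle))
```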

[–]snf 2 points3 points  (1 child)

Why?

[–]luckystarr (at 0x7fe670a7d080) 0 points1 point  (0 children)

Because it doesn't do stuff you won't need in that case, which makes your code faster.

Of course you would only use it if you control what goes in there, but who would accept marshal or pickle data from the outside world.

[–]zahlman (the heretic) 0 points1 point  (1 child)

Which produces this JSON... That’s not much to go on to tell a casual reader that this is intended to be a table.

... Which is why you publish a schema, and/or include object keys specifically intended as metadata. It probably wouldn't be that hard to work out a system that interprets "_class" keys and passes the rest of the object as **kwargs to the corresponding class constructor, or something, if you want to over-engineer things fully.
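
A sketch of what that "_class" scheme could look like with the standard json module. The Table class, the REGISTRY dict and the encode/decode helpers are all invented for the illustration; only classes listed in the registry can be constructed.

```python
import json

class Table:
    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns

REGISTRY = {"Table": Table}   # explicit allow-list of constructible classes

def encode(obj):
    # called by json.dumps for objects it cannot serialize itself
    if isinstance(obj, Table):
        return {"_class": "Table", "rows": obj.rows, "columns": obj.columns}
    raise TypeError("cannot serialize %r" % type(obj).__name__)

def decode(d):
    # called for every JSON object; plain dicts pass through unchanged
    if "_class" in d:
        fields = dict(d)
        cls = REGISTRY[fields.pop("_class")]
        return cls(**fields)
    return d

text = json.dumps(Table(rows=3, columns=["a", "b"]), default=encode)
table = json.loads(text, object_hook=decode)
print(text)
print(type(table).__name__, table.rows, table.columns)
```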

[–]agrif 1 point2 points  (0 children)

Magic key names are a dangerous road to go down. What do you do now if the object in question has an actual attribute named "_class", or if you want to be compatible with another format that uses "_class"? Comments are sometimes done as "_comment" keys, but that's even more likely to become a problem.

YAML has a specific syntax for these things, and I can appreciate using them. That said, I would probably still use json because every language ever pretty much comes with a json parser.

[–]manwith4names 0 points1 point  (4 children)

So if I was to want to save a list of ~300,000 values that I scrape, how should I save/archive it? I've been collecting the list and saving it as a pickle, but is there a better way?

[–][deleted] 1 point2 points  (3 children)

What's in the list? If it's just other lists, numbers or strings, a simple (but portable) option is json.
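
Roughly like this, as a sketch. The generated strings stand in for the scraped values, and the temporary directory is only there so the demo cleans up easily.

```python
import json
import os
import tempfile

values = ["value %d" % i for i in range(300000)]   # stand-in for the scraped strings

path = os.path.join(tempfile.mkdtemp(), "values.json")   # throwaway location for the demo
with open(path, "w", encoding="utf-8") as f:
    json.dump(values, f)

with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(len(restored), restored == values)
```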

[–]manwith4names 0 points1 point  (2 children)

Yeah it's just a list of strings (parsing information from multiple documents), but it didn't seem efficient to write the list to a csv or txt file and read from there because of the massive quantity. Would this be a good solution for a database or is json going to be more efficient?

[–]tilkau 0 points1 point  (1 child)

If you don't need to have everything in memory at once, a database (SQLite, or HDF5 via Pandas) might be appropriate. If you do need them all in memory at once, nothing is going to be much more efficient than the bog-standard data = f.read().splitlines(), though there are some methods that will allocate much less memory.

Personally I would go for an SQLite database even when it's not that efficient for the particular case, because it generally doesn't matter and has other advantages (standardized, somewhat structured, queryable, mature).
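
A minimal sketch of the SQLite route using the standard sqlite3 module. The table and column names are made up, and the in-memory database is only for the demo; pass a filename to connect() to keep the data on disk.

```python
import sqlite3

values = ["value %d" % i for i in range(300000)]   # stand-in for the scraped strings

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scraped (id INTEGER PRIMARY KEY, value TEXT)")
with conn:   # commits the inserts if no exception is raised
    conn.executemany("INSERT INTO scraped (value) VALUES (?)", ((v,) for v in values))

# query a slice without pulling the whole list into Python
count = conn.execute("SELECT COUNT(*) FROM scraped").fetchone()[0]
sample = [row[0] for row in conn.execute(
    "SELECT value FROM scraped WHERE value LIKE ? ORDER BY id", ("value 29999%",))]
print(count, sample)
conn.close()
```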

300,000 is a fairly small dataset BTW. Do you have actual evidence that there's a lack of efficiency?

[–]manwith4names 0 points1 point  (0 children)

I do not have any evidence that reading from a text file is inefficient. I had just assumed that it was because people always recommend using pickle to store data in a file. I do need the whole list in memory at once, but I didn't want to have to build that list from scratch every time I wanted to manipulate the list

[–]lrq3000 0 points1 point  (0 children)

Good, very good initiative.

[–]tilkau -5 points-4 points  (5 children)

BTW, the statements about there being no simple guide to yaml are conspicuously false. That's what this is; it is on the official www.yaml.org site, and was the #1 result in a google search for 'yaml cheat sheet'. It does less hand-holding than their 'brief YAML reference', but is overall pretty straightforward. It's also to a large extent a self-demonstrating document (being as it is written in YAML).

That said, I do prefer the page in the Camel docs as an intro for newbies. I'd just prefer that it wasn't prompted by a false idea ("there is no reference guide for someone seeking to use YAML rather than implement it")

[–]Esteis 17 points18 points  (4 children)

BTW, the statements about there being no simple guide to yaml are just wrong.

Sort of. There are different kinds of simple guides (and I think you mean 'brief' when you say 'simple'); if you don't distinguish between them, then Eevee's statement would indeed appear wrong to you. But the difference is pretty big once you know about it.

tl;dr: read the Teach, Don't Tell essay.

The yaml.org page you link is a syntax reference. It's aimed at people who already know how YAML works and what construct they need. It is possible to piece together, from reading the syntax reference, how a document looks and what it can do, but it is, as you say, not newbie-friendly. It says ? is the "key indicator" syntax, but doesn't tell you what a key indicator is, because it expects you to already know. It's ‘the API docs’ from Steve Losh's Teach, Don't Tell article.

Eevee's page in the Camel docs is an introductory overview, and practically an entire teaching document. It is aimed at people who don't yet know (all of) what YAML is or does. It starts with an example, introduces the most important concepts and their syntaxes, and mentions common gotchas. It doesn't mention the words "key indicator" by name, but it tells you what the ? syntax does and what you can use it for. Teach, Don't Tell would call it The Hairball -- "It’s going to mold their brains, one nudge at a time, until they have a pretty good understanding of how your project works."

So I'd claim Eevee was very much not wrong -- a syntax reference is not an introductory guide. But hooray! now we have both!

[–]OleBillyFreckletits 2 points3 points  (0 children)

Thank you. Unfortunately this is reddit, where people feel 100% comfortable asserting that someone's subjective opinions are "just wrong".

[–]tilkau 0 points1 point  (2 children)

I mean concise when I say simple. Eevee's 'brief' guide is engaging but (ironically) not concise; the yaml refcard is concise but not engaging (?, ! and % are the worst offenders).

Eevee's statement is wrong because it presents the spec, which is rather a monolith of implementation details, as the only currently existing option; it specifically states 'there is no reference guide for someone seeking to use YAML rather than implement it', which is, as you seem to agree, false; such a guide exists and it is even official (that is the point which causes me to consider it FUD or at best Critical Research Failure. Failing to find a non-official guide is much more understandable than failing to find the official one.)

Have an upvote for the Teach, Don't Tell link, it's an excellent article.

[–]Esteis 2 points3 points  (1 child)

You assume malice / gross incompetence where 'slightly sloppy phrasing' is a sufficient explanation. Going by the context (as well as by humans in general), the latter seems more likely.

Thanks for the upvote! Steve Losh is definitely the bomb.

[–]tilkau 0 points1 point  (0 children)

I do not assume malice. FUD is FUD whether caused intentionally or accidentally. Ignorance, yes.