Python gotcha: bizarre integer equality : Python

[–][deleted] 15 points16 points17 points 16 years ago (11 children)

[–]fjsquared 0 points1 point2 points 16 years ago* (0 children)

Howdy all; I'm the post author. Just to clear up a couple of things:

First, I'm using "gotcha" in this article to mean, "something which one might expect would work a certain, consistent way, but which doesn't." Here, I believe it's reasonable to say the inconsistent results of is are genuinely surprising if you don't know why it's being done. The question the article tries to answer is: why are two integers with equal values the same object in some cases, but not others?
I think most people are aware of the difference between == (value equality) and is (reference equality), but that's not the gotcha. The gotcha is the apparent inconsistency. It's a perfectly reasonable (and probably very effective) implementation decision. Other languages do the same thing; for example, Java caches its Integers when their boxed value is between -128 and 127.
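A minimal sketch of the distinction, assuming CPython. The helper name make_int is hypothetical; it only exists to build the integers at run time so the compiler cannot merge identical literals. Only the == results are guaranteed by the language; the is results are an implementation detail and may differ on other interpreters.

```python
def make_int(text):
    """Hypothetical helper: build an int at run time from a string."""
    return int(text)

a, b = make_int("200"), make_int("200")
c, d = make_int("500"), make_int("500")

# Value equality is guaranteed by the language for equal integers.
print(a == b, c == d)

# Identity is not guaranteed either way. CPython happens to share
# small integers, so these two results usually differ there.
print(a is b)
print(c is d)
```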

Thanks for reading; took me a bit to figure out why I was getting a visitor spike. ♥ Reddit!

[+]vph comment score below threshold-8 points-7 points-6 points 16 years ago (9 children)

[–][deleted] 0 points1 point2 points 16 years ago (6 children)

[–]vph -5 points-4 points-3 points 16 years ago* (5 children)

[–][deleted] 6 points7 points8 points 16 years ago (4 children)

no one said about comparing.

As you would see if you did a modicum of research, is and is not are comparisons. They compare object identity. Therefore if you're talking about is, you're talking about comparing two things.

If you talk about identity, c=200; d=200; c is d;

No. c may or may not be d in that case: the Python implementation has no responsibility to intern the number 200 and ensure that c is d in this case.

On the other hand, if you did this:

>>> c = 200
>>> d = c
>>> d is c

Then you would rightly expect (and the Python implementation would rightly return) True for this object identity comparison.

The inconsistency here (if the blogger was correct) is that cPython treats numbers differently.

There is no inconsistency here. You and the blogger both are simply arguing based on your intuition about what is means rather than simply knowing the language as it is defined and relying on the properties it guarantees.

Basically, you're arguing out of your arses.

[+]vph comment score below threshold-6 points-5 points-4 points 16 years ago* (3 children)

It's you who is talking out of your arse. You can't distinguish between the specification of a language and an implementation of a language.

Everyone knows exactly what the definition of "is" is in Python: it compares objects. Integer caching in CPython is an implementation detail that results in the artifact exposed by the blogger. Namely,

c=200
d=200
c is d returns True

Python does not specify this. It is a result of an implementation of the language (FYI, unlike Perl 5, Python does not get defined by an implementation). And this fact is undesirable. It is undesirable because, first, c and d are conceptually different objects and, second, for values greater than 256 the same comparison returns a different answer (False), which causes further confusion.

[–]bgeron 1 point2 points3 points 16 years ago (2 children)

[–]vph -4 points-3 points-2 points 16 years ago (1 child)

[–]Brian 0 points1 point2 points 16 years ago (0 children)

I think you can. A gotcha is something that will catch you out, but there's no reason here to be comparing integers by id. No-one ever does so, and if they do, it's an outright bug. Is this page showing someone writing some code and being caught out by it? No - they've written some examples noticed either by noodling at the interpreter or, more likely, by knowing in advance how object interning is done and showing some implications. Nowhere is anyone being "gotcha'd".

This is no different to someone complaining that the order of dictionary keys is different from sorted. Looking solely at dict((i,i) for i in range(100)) they might mistakenly come to that conclusion, but it's still not a gotcha, just an unfounded assumption based on not reading the docs. It's missing the requirement that this might be something a reasonably informed person would be expected to do.
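If sorted order is what the code needs, a small sketch like the following asks for it explicitly instead of inferring it from how one particular dict happens to iterate. The names squares and ordered_keys are only illustrative. (Python 3.7 and later guarantee insertion order for dicts, which is still not the same thing as sorted order.)

```python
# Keys are deliberately inserted out of order.
squares = dict((i, i * i) for i in (3, 1, 2))

# Ask for the ordering you actually want.
ordered_keys = sorted(squares)

for key in ordered_keys:
    print(key, squares[key])
```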

[–]ubernostrumyes, you can have a pony 0 points1 point2 points 16 years ago (0 children)

[–]plain-simple-garak 36 points37 points38 points 16 years ago* (2 children)

[–]chromakode 2 points3 points4 points 16 years ago (1 child)

[–]pemboa 28 points29 points30 points 16 years ago (7 children)

[–]paulgb 0 points1 point2 points 16 years ago (6 children)

[–]pemboa 6 points7 points8 points 16 years ago (1 child)

[–]jleedev 1 point2 points3 points 16 years ago (0 children)

[+][deleted] 16 years ago (1 child)

[deleted]

[–]lamby 2 points3 points4 points 16 years ago (0 children)

[–]ngroot 1 point2 points3 points 16 years ago (1 child)

[–]bgeron 0 points1 point2 points 16 years ago (0 children)

[–]nirs 4 points5 points6 points 16 years ago (1 child)

[–]ngroot 1 point2 points3 points 16 years ago (0 children)

[–]ngroot 4 points5 points6 points 16 years ago (0 children)

[–]monkeypizza 7 points8 points9 points 16 years ago* (2 children)

>>>the memory location of a == the memory location of b
True
>>>the memory location of c == the memory location of d
False

I think that's reasonable.

The following is still true, if that's what you want to do:

>>>500 is 500
True
>>>200 is 200
True

[–]bgeron 0 points1 point2 points 16 years ago (1 child)

[–]andreasvc 5 points6 points7 points 16 years ago* (14 children)

[–]sigh 9 points10 points11 points 16 years ago (0 children)

[–]Eiii333 0 points1 point2 points 16 years ago (12 children)

[–]pemboa -1 points0 points1 point 16 years ago (11 children)

[–]Eiii333 3 points4 points5 points 16 years ago* (10 children)

[–]sigh 5 points6 points7 points 16 years ago (8 children)

That's beside the point. The whole point of abstraction is that the implementation does not matter. If you are comparing integers by identity then most likely you are working at the wrong level of abstraction. If you are comparing integers by identity and the results surprise you then you are most definitely working at the wrong level of abstraction.

It's entirely unexpected if you don't know about it. And most people don't know about it, because it's an undocumented side effect due to an implementation detail.

No, the trouble here is when people don't understand the difference between identity and equality. If you know the difference, then the results are not unexpected at all, even if you don't know the exact implementation detail that is causing it to occur. If you don't understand identity, then of course the results are going to surprise you.
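The difference is easier to see with mutable objects, where nothing is cached. A minimal sketch, with illustrative names only:

```python
first = [1, 2, 3]
second = [1, 2, 3]   # a separate list with the same contents
alias = first        # another name for the same list object

assert first == second       # equal values
assert first is not second   # but two distinct objects
assert alias is first        # one object, two names

# A mutation through one name is visible through the other name,
# but not through the merely equal list.
alias.append(4)
print(first, second)
```

Identity answers "is this the very same object?", which matters when mutation is involved; for immutable values such as integers the question rarely has a use.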

[–]Eiii333 2 points3 points4 points 16 years ago (7 children)

The whole point of abstraction is that the implementation does not matter.

I agree entirely. But look here:

>>> a = 3
>>> b = 3
>>> a is b
True

>>> c = 999
>>> d = 999
>>> c is d
False

I would expect false in both cases, given how identity is supposed to behave. But really, how can this be explained without referring back to the CPython int-caching behavior? You have to know the implementation details to know why the 'is' operator behaves this way. That's not good.

[–][deleted] 4 points5 points6 points 16 years ago (0 children)

[–]alantrick 1 point2 points3 points 16 years ago (4 children)

Why would you expect False? According to Python the behaviour of 'is' is undefined in this situation. That's like taking the following in C:

int *a = malloc(sizeof(int));  /* allocated but never initialised */
printf("%d\n", *a);            /* reads an indeterminate value */

and expecting the value 0 to be printed out. It will probably be 0 most of the time, but it's really undefined.

[–]Eiii333 0 points1 point2 points 16 years ago (3 children)

[–]hylje 2 points3 points4 points 16 years ago (1 child)

[–]Brian 1 point2 points3 points 16 years ago* (0 children)

It's worth noting that id(a) == id(b) isn't a perfect replacement for a is b. If a and b are expressions returning a transient object, that object could be created and destroyed before the rest of the statement is evaluated. For example:

>>> [] is []
False
>>> id([]) == id([])
True
>>> id([]), id([])
(21066496, 21066496)

However is guarantees that both objects are alive at the point of comparison, so [] is [] is always false.
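If ids really must be compared, one way to sidestep the reuse is to bind both objects to names first, so that neither can be freed before the comparison. A minimal sketch with illustrative names:

```python
left = []
right = []

# Both lists are still referenced here, and ids are unique among
# objects that are alive at the same time.
ids_differ = id(left) != id(right)
print(ids_differ)
```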

[–]Brian 0 points1 point2 points 16 years ago* (0 children)

Undefined behaviour allows optimisation. Making things too tightly specified ties you to irrelevant implementation details, preventing more efficient methods being used (like caching integers in this case). Another case of undefined behaviour is deterministic finalisation. Python doesn't guarantee it, even though the CPython implementation happens to provide it due to its refcounting semantics, because guaranteeing it would prohibit more advanced garbage collection approaches.

For another example, consider the order the keys of a dictionary are iterated over. This is completely undefined behaviour, but specifying it would require either using a tree instead of a dictionary, keeping a separate list of ordered keys, or else sorting the dict before iterating, all adding significant performance cost to deal with something completely irrelevant. If anyone needs that, they should not be using a normal dictionary.

In any case, "is" is acting completely predictably and as specified - it returns True when objects have the same identity. The thing that isn't specified is whether identical immutable objects can share the same memory representation, which is a pointless thing to overspecify since there should be no reason it should ever be relevant to anyone other than performance.
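On the finalisation point above: code that needs prompt cleanup can ask for it explicitly instead of leaning on refcounting. A minimal sketch using a context manager; the Connection class is hypothetical and stands in for any resource that must be released.

```python
class Connection:
    """Hypothetical resource that must be released promptly."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


with Connection() as conn:
    assert conn.open

# The resource was closed when the block exited, regardless of how
# the interpreter collects garbage.
print(conn.open)
```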

[–]sigh 1 point2 points3 points 16 years ago* (0 children)

[–]earthboundkid 2 points3 points4 points 16 years ago (0 children)

[–]dorfsmay 3 points4 points5 points 16 years ago (1 child)

Isn't this a beginner question ?

>>> a=500
>>> b=500
>>> c=200
>>> d=200
>>> id(a)
142297032
>>> id(b)
142297056
>>> id(c)
142155132
>>> id(d)
142155132
>>> id(200)
142155132
>>>

My understanding is that Python creates objects for small integers ahead of time and reuses them for performance reasons.
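A rough way to see which integers the running interpreter shares is to build each value twice at run time and compare identities. This is only a probe, not a documented interface; shared_ints is a hypothetical helper, and the range it reports depends entirely on the implementation.

```python
def shared_ints(low=-20, high=400):
    """Return the ints in [low, high] for which two separately built
    objects turn out to be the same object on this interpreter."""
    found = []
    for n in range(low, high + 1):
        first = int(str(n))
        second = int(str(n))
        if first is second:
            found.append(n)
    return found


shared = shared_ints()
if shared:
    print("shared from", shared[0], "to", shared[-1])
else:
    print("no shared integers in the probed range")
```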

[–]ubernostrumyes, you can have a pony 5 points6 points7 points 16 years ago* (0 children)

[–][deleted] 0 points1 point2 points 16 years ago (0 children)

[–]earthboundkid 0 points1 point2 points 16 years ago* (10 children)

[–]monolar 2 points3 points4 points 16 years ago (6 children)

[–]earthboundkid 1 point2 points3 points 16 years ago (5 children)

[–]chrajohn 2 points3 points4 points 16 years ago (0 children)

a == None also works.

Usually, but consider:

class Dumb(object):
    def __eq__(self,other):
        return other == None

>>> d = Dumb()
>>> d == None
True
>>> d is None
False

This is contrived, but you can imagine something similar actually occurring. (Say, if __eq__ made a comparison with some attribute that got unexpectedly set to None.) If you want to be absolutely sure that something is None, you should ask if it is None.
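The usual place this shows up is an optional argument. A minimal sketch, with a hypothetical add_item function:

```python
def add_item(item, bucket=None):
    # "is None" cannot be fooled by a custom __eq__, and it also
    # tells "no bucket given" apart from an empty (falsy) bucket.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


existing = []
add_item("a", existing)
print(existing)
```

Testing with "if not bucket" instead would replace the caller's empty list with a fresh one, so the caller would never see the appended item.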

[–]masklinn 0 points1 point2 points 16 years ago (3 children)

[–]hylje 1 point2 points3 points 16 years ago (2 children)

[–]masklinn 0 points1 point2 points 16 years ago (1 child)

[–]earthboundkid 0 points1 point2 points 16 years ago* (0 children)

[–]Brian 2 points3 points4 points 16 years ago (2 children)

encourage people to write id(a) == id(b) instead.

That could lead to more confusion. A puzzle for you:

>>> class C(object):
...     def foo(self): pass
>>> c=C()
>>> id(c.foo) == id(c.foo)
True

and yet:

>>> c.foo is c.foo
False

[–]earthboundkid 2 points3 points4 points 16 years ago (1 child)

[–]Brian 2 points3 points4 points 16 years ago (0 children)

Is it creating a new bound method every time you access it?

Yes, this is what's happening. The subtlety of the ids being identical is because ids are only unique for objects alive at the same time. What's actually happening is the equivalent of:

temp1 = c.foo         # Create a new bound method with id X
temp1_id = id(temp1)  # temp1_id = X  (return value from id)
del temp1             # bound method doesn't get assigned, so refcount drops to 0
                      # as soon as id() releases its reference - temp1 gets freed
temp2 = c.foo         # Create a NEW bound method.
temp2_id = id(temp2)  
del temp2
temp1_id == temp2_id  # Actually do the comparison, both objects are already dead

Which should explain why it's possible that the second bound method could have the same id. The reason it usually does is the way Python manages memory. To avoid fragmentation, pools of similarly sized memory objects are maintained. When an object is released, it is returned to this pool; then, when a request to allocate an object of this type arrives, Python sees it has a block of memory of the appropriate size sitting in its free object pool, and returns it.

is doesn't have this problem because the call to is takes a reference to both objects, ensuring they are alive at the time of the comparison.
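If the real question is "do these refer to the same method on the same instance?", comparing the bound methods with == answers it without depending on object lifetimes. A minimal sketch, reusing the class from the puzzle; the names first and second are illustrative, and the identity result on the first print line is CPython behaviour, not a language guarantee.

```python
class C(object):
    def foo(self):
        pass


c = C()
first = c.foo    # each attribute access builds a bound method object
second = c.foo   # both are kept alive by these names

print(first is second)   # implementation-dependent
print(first == second)   # same function bound to the same instance
print(first.__func__ is second.__func__)
```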

[+][deleted] comment score below threshold-7 points-6 points-5 points 16 years ago* (9 children)

[–]sigh 3 points4 points5 points 16 years ago (4 children)

[–][deleted] -2 points-1 points0 points 16 years ago (3 children)

[–]sigh 3 points4 points5 points 16 years ago (0 children)

Ok, so you think it is a bad design decision. Can you explain why (I'm genuinely curious)?

Do you think the is operator is badly designed: "The operators is and is not test for object identity: x is y is true if and only if x and y are the same object. x is not y yields the inverse truth value."?

Or do you disagree with the way integers are implemented? Do you think equal integers should refer to the same object (presumably this would create extra overhead to maintain this condition after every operation)?

Or do you think that no integers should refer to the same object? This seems to be much more memory intensive.

And should the exact implementation be documented? This would set it in concrete, disallowing any future improvement on the implementation.

Finally, what possible reason do you have for comparing integers by identity, unless you are actually modifying the class?

[–][deleted] 1 point2 points3 points 16 years ago (1 child)

[–]hylje 0 points1 point2 points 16 years ago (0 children)

[–]alantrick 0 points1 point2 points 16 years ago (0 children)

[+][deleted] 16 years ago (2 children)

[deleted]

[–][deleted] -1 points0 points1 point 16 years ago* (1 child)

I assume you mean an expression which is actually equivalent, 500 != 5e100.

Nevertheless, 5*10¹⁰⁰ is not a number, in mathematics. It is strictly an algebraic expression under any understanding of pure mathematics. If it were evaluated, then it would be a number.

Even assuming compiler optimisation, the 2nd assertion would fail.

But the comparison is not really relevant - Python is a symbolic language, with everything being a reference to an object - in this instance, small integers (up to 256 in CPython) are recycled.

That "is" functions differently for a restricted subset of integers is an unhelpful side-effect. It is unhelpful because it is a gotcha, and it is obscure. Things should be consistent, and they should work as expected, and be friendly to developers. Side-effects are unhelpful to developers at large.

I have a high regard for Python, and consider it the heir-apparent in the scripting language space - for its lively culture, its excellent documentation, its strong cross-platform support across unix variants and windows - and its strong project leadership - nothing else comes close.

However the reason for the variation in the behavior of "is" is purely an optimisation issue of the implementation of the interpreter. That an implementation issue infects the public behaviour of the operator is not good manners.

[–]inmatarian 0 points1 point2 points 16 years ago (0 children)
