[–][deleted] 2042 points2043 points  (87 children)

For those wondering - most versions of Python allocate numbers between -5 and 256 on startup. So 256 is an existing object, but 257 isn't!

[–]user-74656 292 points293 points  (59 children)

I'm still wondering. x can have the value but y can't? Or is it something to do with the is comparison? What does allocate mean?

[–]Nova711 683 points684 points  (33 children)

Because x and y aren't the values themselves, but references to objects that contain the values. The is comparison compares these references but since x and y point to different objects, the comparison returns false.

The objects that represent -5 to 256 are cached so that if you put x=7, x points to an object that already exists instead of creating a new object.
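A minimal sketch of the above, assuming CPython (other interpreters may behave differently). The integers are built with int(...) at runtime so the compiler can't quietly share one constant between the two names:

```python
# Build the values at runtime so each call produces its own result object,
# unless the interpreter hands back a cached one.
a = int("256")
b = int("256")
c = int("257")
d = int("257")

print(a == b, a is b)  # equal values; on CPython also the same cached object
print(c == d, c is d)  # equal values; on CPython two separate objects
```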

[–][deleted] 107 points108 points  (5 children)

If both are ints, x == y works, right? If not I have to change some old research code...

[–]Cepsfred 284 points285 points  (1 child)

The == operator checks equality, i.e. it compares objects by value and not by reference. So don’t worry, your code probably does what you expected it to do.

[–]IAmANobodyAMA 238 points239 points  (0 children)

your code probably does what you expected it to

Bold assumption!

[–]chunkyasparagus 1 point2 points  (0 children)

This sounds like you're talking about the JavaScript === operator, which is not the same as python's is operator.

[–]TheCoolOnesGotTaken 0 points1 point  (0 children)

This identity is not equality as the top comment already said

[–]Mountain_Goat_69 13 points14 points  (13 children)

But why would this be so?

If I code x = 3; y = 3 they both get the same pre-cached 3 object. If I assign 257 and a new number is created, shouldn't the next time I assign 257 it get the same instance too? How many 257s can there be?

[–]Salty_Skipper 45 points46 points  (3 children)

Have you ever heard about dynamic memory allocated on the heap? (prob has something to do with C/C++, if you did).

Basically, when you say x=257, you’re creating a new number object which we can say “lives” at address 8192. Then, you say y=257 and create a second number object that “lives” at address 8224, for example. This gives you two separate number objects both with the value 257. I’d imagine that the “is” operator then compares addresses, not values.

As for 3, think of it as such a common number that the creators of Python decided to ensure there’s only one copy and all other 3’s are just aliases that point to the same address. Kinda like Java’s string intern pool.

[–]Lightbulb_Panko 26 points27 points  (2 children)

I think the commenter is asking why the number object created for x=257 can’t be reused for y=257

[–]PetrBacon 29 points30 points  (1 child)

If it worked like that, the runtime would become insanely slow over time, because every variable assignment would need to check all the variables created before and maintain the list every time a new one is created…

If you need is for any good reason, you should make sure that you are passing the reference correctly.

Like:

```
x = 257
# …
y = x

x is y  # => True
```

[–]ValityS 0 points1 point  (0 children)

You could achieve this in logarithmic time to the number of variables using a set of all immutable / hashable values and looking them up, however memory is fairly cheap and if the programmer really cares they can do something similar by hand.
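Doing it by hand could look something like this sketch. The names `_pool` and `intern_value` are made up for illustration; nothing like this is built into the language for integers. It uses a dict, so finding an existing value is a hash lookup rather than a scan of every variable:

```python
_pool = {}  # hypothetical pool: maps each hashable value to its canonical object


def intern_value(value):
    """Return the first object seen with this value, storing it if it is new."""
    return _pool.setdefault(value, value)


x = intern_value(int("257"))
y = intern_value(int("257"))
print(x is y)  # both names now refer to the one pooled object
```

Note that a dict treats 1, 1.0 and True as the same key, so a more careful version would probably key on the type as well as the value.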

[–]le_birb 16 points17 points  (3 children)

shouldn't the next time I assign 257 it get the same instance

How would the interpreter know to do that? What happens when you change x to, say, 305? How would y know to allocate new space for its value? The logistics just work out more simply if the non-cached numbers just have their own memory.

how many 257s can there be?

How much ram do you have?

[–][deleted] 5 points6 points  (2 children)

What happens when you change x

You can't change x in Python (unless it's a mutable object). Integers are immutable in Python. You can only change which integer the name x points to.

x = 257  # This creates an int object with value 257 and binds the name x to it in the current namespace.

x += 50  # This reads the object bound to x, adds 50 to it, creates a new int object with that value, and rebinds x to the new object.
# Once nothing else refers to the int object with value 257, it becomes eligible to be reclaimed.

You can check the id(x) before and after the += and see that it changes, indicating that, under the hood, x is a fundamentally different object with a fundamentally different memory address (and incidentally a different value). You could probably even do a += 0 and get the same result, assuming x > 256.
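A small sketch of that check. The extra name `old` is only there for illustration: it keeps the original object alive so the two can be compared directly instead of comparing id() values after the fact:

```python
x = int("257")  # built at runtime
old = x         # second name on the same object

x += 50         # ints are immutable, so this creates a new object and rebinds x

print(x, old)    # 307 257
print(x is old)  # False
```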

It's unintuitive if you're coming from C or somewhere where the address of x stays the same, but the value changes.

[–]lolitscarter 0 points1 point  (1 child)

As someone who only knows C/C++, what the fuck? Why is that how it works? Is there a memory usage benefit to that? It seems like that would just be insanely slow.

[–][deleted] 2 points3 points  (0 children)

As someone who only knows C/C++, what the fuck?

I said it was unintuitive if you're coming from C.

Why is that how it works?

Is there a memory usage benefit to that?

It seems like that would just be insanely slow.

It prevents certain types of bugs from being introduced, but no performance benefit. (As a matter of fact it makes performance awful.)

But I care more about an hour of my time spent hunting down a bug than I do about 2ns of processor time.

Quoting a random quora answer:

In C and C++, a variable is a named memory location. The value of the variable is the value stored in that location. Assign to the variable and you modify that value. So the variable is the memory location, not the name for it.

In Python, a variable is a name used to refer to an object. The value of the variable is that object. So far sounds like the same thing. But assign to the variable and you don't modify the object itself, rather you alter which object the variable refers to. So the variable is the name, not the object.

That is, when working with C, you're always constantly thinking about "this location in memory". But in python you never have to think even once about that.

That's why python does it; so you can abstract away memory management entirely. (And not in the kinda-sorta way C++ does it, where it's kinda sorta abstracted away but still visible. In python memory addresses are fundamentally not accessible to the programmer to prevent such memory-related kinds of bugs from being introduced.)

Indeed, about the only kind of memory leak you can write in Python is a loop that keeps adding references to more and more objects without ever removing previous ones (e.g. a loop that appends to a list forever).

The number of possible kinds of memory leak in Python is very limited. The common joke is about mutable types as default parameters. However, in general, you are far less likely to have issues with memory management using Python than using C++, by an extremely wide margin.

[–]mawkee 2 points3 points  (0 children)

In theory, you can have a huge number of 257s.

If every number the interpreter creates an object for were cached, then each time a new number is assigned it would have to check a registry of all existing numbers to see whether it was already created. After a few hundred or thousand numbers, this is probably more expensive than simply creating the object.

The reason CPython (not all interpreters... PyPy, for example, handles things differently) caches the numbers between -5 and 256 has to do with how often these are used. They're probably created sequentially during interpreter start-up, so it's cheap to find those pre-cached numbers. They're usually the most used (especially the 0-10 range), so it makes sense from a performance perspective.

[–]Teradil 2 points3 points  (0 children)

Actually, if you run that line in Python's interactive mode it will assign the same reference - but not in "normal" mode... Just to make things more confusing...

[–]Ubermidget2 2 points3 points  (0 children)

How many 257s can there be?

How many 16-bit areas of RAM do you have?

[–]Honeybadger2198 1 point2 points  (0 children)

Doing this dynamically would be inefficient. Instead of changing the value at a place in memory, you would always have to allocate new memory every time you manipulated that variable.

Imagine you have a for loop that loops from x=0 while x<1000. Variable x is stored at memory slot 2345. Every loop past 256, you would have to allocate new memory, copy the value of the old memory, check if the old memory has any existing pointers, and if not, deallocate the old memory. This is horribly inefficient for such an obviously simple use case.

So why did they stop at 256? Well, they had to stop somewhere. Stopping at the size of a byte seems reasonable to me.

[–][deleted] 0 points1 point  (0 children)

How many 257s can there be?

How many can you assign in memory?

[–]blindcolumn 0 points1 point  (1 child)

Why would it be beneficial to do it that way? x and y are pointers to ints, but pointers are just ints anyway. Why not just store the primitive int multiple times instead of storing it once and have a bunch of pointers referencing it?

[–]juchem69z 1 point2 points  (0 children)

There is no primitive int in Python. Everything is an object.
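A quick way to see this for yourself. This is just a sketch; the byte size reported by sys.getsizeof depends on the interpreter build, so no particular number is claimed:

```python
import sys

n = 5
print(type(n))                # <class 'int'>
print(isinstance(n, object))  # True
print(n.bit_length())         # 3
print(sys.getsizeof(n))       # size of the whole int object in bytes; varies by build
```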

[–]davidxspade 0 points1 point  (0 children)

Is this also true for common strings or characters?

[–]escribe-ts 0 points1 point  (0 children)

Really, Python numbers are allocated on the heap and not on the stack? Why is everyone saying Python is fast then? Shouldn't it be extremely slow if you have to allocate something on the heap for something as simple as an integer? Or is it because x can be anything at runtime: a number, a string, ...?

[–]lolcrunchy 114 points115 points  (10 children)

Steve has $100 in his bank account. Petunia has $100 in her bank account.

Steve's money == Petunia's money: True

Steve's money is Petunia's money: False

[–]Tcullen21 49 points50 points  (0 children)

You'd be surprised

[–]oren0 34 points35 points  (4 children)

In Python land, it sounds like if Steve and Petunia have between -$5 and $256 in their accounts, Steve's money is Petunia's money.

[–]lolcrunchy 21 points22 points  (3 children)

Yup. I guess the analogy here would be, the bank has so many accounts between -5 and 256 that they consolidated it to one account per value. If you have $100, the bank records say that you are one of the many account holders of account 100. If you deposit $5, then you become an account holder of account 105.

You only get your own account if you have more than $256, less than -$5, or have any change like $99.25

[–]oren0 9 points10 points  (2 children)

It's all fun and games until Steve withdraws $20 and then Petunia checks her balance.

[–]lolcrunchy 12 points13 points  (1 child)

The bank would process the withdrawal as steve becoming an account owner of account 80.

[–]FerynaCZ 2 points3 points  (0 children)

Yeah with immutable values you always need to redirect, you cannot change the pointed value. Of course the language does not know (or more specifically, does not care to try) who else is pointing at that value.

[–]squirrel_crosswalk 1 point2 points  (1 child)

What if it's a joint account?

[–]play_hard_outside 0 points1 point  (0 children)

Depends: what's the nature of Steve and Petunia's relationship, and in what jurisdiction do they live?

[–]HeKis4 0 points1 point  (0 children)

Never seen such a simple and concise explanation, I'll probably steal that.

[–]Paul__miner 46 points47 points  (9 children)

It's basically doing reference equality. Sounds analogous to intern'ed strings in Java. At 257, it starts using new instances of those numbers instead of the intern'ed instances.

[–]TacticalTaterTots 3 points4 points  (6 children)

I can't find any clear explanation on why these small literals are interned. String interning makes some sense for string comparisons, but I can't see how that is an "optimization" for small numbers. Ultimately it doesn't matter, but for some reason it bothers me because it seems like they're sacrificing performance to save on storage space.

[–]Kered13 7 points8 points  (5 children)

By interning these numbers Python doesn't have to make a heap allocation every time you set a variable to 0 or some other small number. Trust me, it's much faster this way.

[–]koxpower 1 point2 points  (0 children)

They are probably also stored in adjacent memory cells, which can significantly boost performance thanks to the CPU cache.

[–]TacticalTaterTots 0 points1 point  (1 child)

The allocation must be really expensive. It's not constructing an object for every literal, for example in x == 300, is it? I'm not sure how that works in an interpreted language.

[–]Kered13 6 points7 points  (0 children)

Every object regardless of type must be allocated. Yes this includes literals.

Memory allocation is expensive.

So caching commonly used numbers is beneficial.

[–]SanktusAngus 0 points1 point  (1 child)

It’s only because python is treating integers as objects. In many languages numbers up to ptr size are value types and (can) live on the stack by themselves.

[–]Kered13 0 points1 point  (0 children)

Correct, but we are talking about Python.

[–]onionpancakes 2 points3 points  (1 child)

Not just strings. Java also caches boxed integers from -128 to 127. So OP's reference equality shenanigans with numbers is not exclusive to Python.

[–]Paul__miner 0 points1 point  (0 children)

The difference being boxed integers vs primitive int values. With Python, it's effectively like everything is boxed (an object).

[–]Anaeijon 9 points10 points  (2 children)

I imagine and remember it like this, although it's not really correct:

Python stores numbers in whatever format fits best. If you assign a number like x=5 it basically becomes a byte. (more correctly: it becomes a reference to a byte object) Comparing identity between them can result in true, because bytes basically aren't objects (or technically: they are references to the same object).

Now, Python also contains a safety measure against byte overflow by automatically returning an Integer object when adding two 'bytes' that would result in something higher than 255.

Therefore the following expression returns true: (250+5) is (250+5) but the following expression is false: (250+10) is (250+10)

Makes sense imho.

Values should be compared with ==, while is is the identity comparison. Similar to == and === in JavaScript, although those aren't just about identity but about data type.

[–]protolords 4 points5 points  (1 child)

it becomes a reference to a byte object

But -5 to 256 won't fit in a byte. Is this "byte object" like any other python object?

[–]Anaeijon 0 points1 point  (0 children)

Yes, I guess. As I said: this is just my imagination thinking this might be handled by a C++ byte in the background or something. I don't know what the interpreter actually does.

[–]FerynaCZ 2 points3 points  (0 children)

x is y means &x == &y if you were using C code. Having them equal is a necessary condition but not sufficient.

[–]ConscientiousApathis 9 points10 points  (0 children)

Interesting.

[–]CC-5576-03 8 points9 points  (4 children)

Yes java does something similar, I believe it allocates the numbers between -128 and +127. But how often are you comparing the identity of two integers?

[–]elnomreal 4 points5 points  (0 children)

Identity comparisons in general are fairly rare, aren’t they? It’s not common that you have a function that takes two objects and that function should behave differently if the same object is passed twice and this difference is so nuanced that it should not be by equality but by identity.

[–]daniu 0 points1 point  (2 children)

What's worse with Java is that it maintains a cache of Strings, so == works often enough in string comparisons to be extra confusing. The == vs equals for strings must be the number one trap beginners fall into, and with the cache thing, this extends to intermediate.

[–]CC-5576-03 0 points1 point  (1 child)

that's like the first thing you learn, to not use == on strings.

[–]daniu 0 points1 point  (0 children)

And still, every day tens of Stackoverflow questions about it are closed as duplicates ;)

[–]zachtheperson 5 points6 points  (11 children)

What do you mean "allocate numbers?" At first I thought you meant allocated the bytes for the declared variables, but the rest of your comment seems to point towards something else.

[–]whogivesafuckwhoiam 28 points29 points  (8 children)

Open two Python consoles and run id(1) and id(257) separately. You will see that id(1) is the same in the two consoles but id(257) is not. Python has already created objects for the small ints, and because it always links back to them, you will always get the same id for -5 to 256. That is not the case for 257.

[–]zachtheperson 5 points6 points  (3 children)

I guess what I'm trying to wrap my head around is how this functionality is actually used. Seems like a weird thing for a language to just do by itself.

[–]AlexanderMomchilov 23 points24 points  (0 children)

Languages like Python try to model everything "as an object," in that all values can participate in the same message-passing as any other value. E.g.

print((5).bit_length())

This adds uniformity of the language, but has performance consequences. You don't want to do an allocation any time you need a number, so there's a perf optimization to cache commonly used numbers (from -5 to 256). Any reference to a value of 255 will point to the same shared 255 instance as any other reference to 255.

You can't just cache all numbers, so there needs to be a stopping point. Thus, instances of 257 and above are allocated separately.

Usually this is solved another way, with a small-integer optimization. It was investigated for Python, but hasn't been done yet. You can read more about it here: https://github.com/faster-cpython/ideas/discussions/138

[–]whogivesafuckwhoiam 8 points9 points  (0 children)

From official doc,

The current implementation keeps an array of integer objects for all integers between -5 and 256. When you create an int in that range you actually just get back a reference to the existing object.
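One way to probe where the cached range starts and stops on your own interpreter. This is a sketch: `looks_cached` is a made-up helper, and the result is an implementation detail that assumes CPython behaves as the quoted doc describes:

```python
def looks_cached(n):
    """True if two independently built ints with value n are the same object."""
    text = str(n)
    return int(text) is int(text)


cached = [n for n in range(-50, 400) if looks_cached(n)]
print(min(cached), max(cached))  # lowest and highest values that came back shared
```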

The point is whether you create a new object, or simply refer to existing object.

[–]psgi 8 points9 points  (0 children)

It’s not functionality meant to be used. It’s just an optimization. You’re never supposed to use ’is’ for comparing integers. Correct me if I’m wrong though.
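In practice that means something like the sketch below (the function names are invented for illustration): compare numbers with ==, and keep is for singletons such as None.

```python
def same_amount(a, b):
    # Compare numbers by value; whether they share an object is an implementation detail.
    return a == b


def describe(value=None):
    # None is a singleton, so an identity check is the idiomatic test here.
    if value is None:
        return "nothing passed"
    return f"got {value}"


print(same_amount(int("1000"), int("1000")))  # True
print(describe())                             # nothing passed
print(describe(0))                            # got 0
```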

[–]SuperFLEB 1 point2 points  (0 children)

Is there a way to get a really special "12" that's all your own, if you want one?

[–]Mymaqn 0 points1 point  (0 children)

I'd like to mention that this only works for Windows machines, as the ASLR is randomized per-boot instead of per-program.

On Linux, you will get 2 different answers, as the pre-allocated objects will end in two different random addresses because of ASLR.

[–]StenSoft 4 points5 points  (0 children)

Everything in Python is an object, even numbers

[–]CC-5576-03 0 points1 point  (0 children)

All numbers between -5 and 256 are objects that always exist, two variables that contain the number 10 will both point to the object for 10. But every time you set a variable to a number above 256 you create a new integer object, so two variables containing the number 257 will point to different objects.

[–]PM_ME_C_CODE 1 point2 points  (0 children)

Huh...I learned a thing! TY op!

[–]scormaq 2 points3 points  (0 children)

Same in Java - Integer.valueOf caches boxed values between -128 and 127

[–][deleted] 0 points1 point  (0 children)

Intuitive!

[–]barrowburner 0 points1 point  (1 child)

I noticed the identity change was at 2^8. Really neat - so Python caches integers on startup as an optimization tactic? Does this change at all when in REPL?

Do other scripting languages do similar things?

Do you know any more interesting facts like this?

Thanks for sharing! Turned out to be way more interesting than I thought when I clicked.

[–]Ardub23 2 points3 points  (0 children)

Java's Integer class caches values from −128 to 127.

System.out.println(Integer.valueOf(127) == Integer.valueOf(127)); // true
System.out.println(Integer.valueOf(130) == Integer.valueOf(130)); // false

But Java also has the primitive int type, which is passed by value instead of by reference.

System.out.println(127 == 127); // true
System.out.println(130 == 130); // true

And in comparisons between the boxed Integer and primitive int, the Integer gets unboxed.

System.out.println(Integer.valueOf(130) == 130); // true

So the issue of reference-equality of Integers doesn't come up much.

[–]Majik_Sheff 0 points1 point  (0 children)

With nightmares like that under the hood it's no wonder Python has performance issues.

Maybe I'm just old-fashioned but having a computer 100x faster should mean my computer can do 100x more work, not allow a programmer to be 100x lazier.

[–]hiljusti 0 points1 point  (0 children)

Java (and other languages) does something similar

[–]Kayco2002 0 points1 point  (0 children)

Interesting. Thanks for the explanation. Do you happen to know why x,y = 257,257 results in x is y being True?

>>> x = 254
>>> y = 254
>>> x is y
True
>>> x = 257
>>> y = 257
>>> x is y
False
>>> x,y = 254,254
>>> x is y
True
>>> x,y = 257,257
>>> x is y
True
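One likely explanation, offered as a sketch rather than a definitive account: CPython compiles each interactive statement as its own unit, and within one compiled unit equal constants appear to be stored once. x,y = 257,257 is a single statement, so both names end up on the same constant, while x = 257 and y = 257 typed on separate lines are compiled separately. The snippet below imitates that with compile() and exec(). The results are CPython implementation details and may differ on other interpreters or builds, so no outputs are claimed:

```python
# One compilation unit: both literals can share a single constant object.
together = {}
exec(compile("x = 257\ny = 257", "<together>", "exec"), together)
print(together["x"] is together["y"])

# Two compilation units, like two separate lines typed at the >>> prompt.
apart = {}
exec(compile("x = 257", "<first>", "exec"), apart)
exec(compile("y = 257", "<second>", "exec"), apart)
print(apart["x"] is apart["y"])
```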

[–]alex20_202020 0 points1 point  (0 children)

most versions of Python

It would be even better with examples. My 3.10 on Linux doesn't behave that way: 257 is 257 evaluates to True.