ORIGINAL QUESTION:
(My question applies to Python 3.2+, but I doubt this has changed since Python 2.7.)
Suppose I use an expression that we usually expect to create an object. Examples: [1,2,3]; 42; 'abc'; range(10); True; open('readme.txt'); MyClass(); lambda x : 2 * x; etc.
Suppose two such expressions are executed at different times and "evaluate to the same value" (i.e., have the same type, and compare as equal). Under what conditions does Python provide what I call a distinct object guarantee that the two expressions actually create two distinct objects (i.e., x is y evaluates as False, assuming the two objects are bound to x and y, and both are in scope at the same time)?
I understand that for objects of any mutable type, the "distinct object guarantee" holds:
x = [1,2]
y = [1,2]
assert x is not y # guaranteed to pass
I also know that for certain immutable types (str, int) the guarantee does not hold; and for certain other immutable types (bool, NoneType), the opposite guarantee holds:
x = True
y = not not x
assert x is not y # guaranteed to fail
x = 2
y = 3 - 1
assert x is not y # implementation-dependent; likely to fail in CPython
x = 1234567890
y = x + 1 - 1
assert x is not y # implementation-dependent; likely to pass in CPython
But what about all the other immutable types?
In particular, can two tuples created at different times have the same identity?
The reason I'm interested in this is that I represent nodes in my graph as tuples of int, and the domain model is such that any two nodes are distinct (even if they are represented by tuples with the same values). I need to create sets of nodes. If Python guarantees that tuples created at different times are distinct objects, I could simply subclass tuple to redefine equality to mean identity:
class DistinctTuple(tuple):
    __hash__ = tuple.__hash__
    def __eq__(self, other):
        return self is other
x = (1,2)
y = (1,2)
s = set([x, y])
assert len(s) == 1 # pass; but not what I want
x = DistinctTuple(x)
y = DistinctTuple(y)
s = set([x, y])
assert len(s) == 2 # pass; as desired
But if tuples created at different times are not guaranteed to be distinct, then the above is a terrible technique, which hides a dormant bug that may appear at random and may be very hard to replicate and find. In that case, subclassing won't help; I will actually need to add to each tuple, as an extra element, a unique id. Alternatively, I can convert my tuples to lists. Either way, I'd use more memory. Obviously, I'd prefer not to use these alternatives unless my original subclassing solution is unsafe.
My guess is that Python does not offer the "distinct object guarantee" for immutable types, either built-in or user-defined. But I haven't found a clear statement about it in the documentation.
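A quick probe of that guess, offered as a sketch rather than a proof. The data model chapter of the language reference says that, for immutable types, operations that compute new values may return a reference to an existing object with the same type and value. The variable names below are illustrative only, and every identity result is an implementation detail that can differ between interpreters and versions.

```python
# Build equal tuples at run time, so the compiler cannot merge them as constants.
a = tuple([1, 2])
b = tuple([1, 2])

# Do the same for two empty tuples.
e1 = tuple([])
e2 = tuple([])

print(a == b)    # equal values on any implementation
print(a is b)    # identity: implementation-dependent, do not rely on it
print(e1 is e2)  # CPython shares a single empty tuple; not a language guarantee
```

If the last line prints True on your interpreter, that is one concrete case of two tuples created at different times sharing an identity.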
UPDATE 1:
@LuperRouch @larsmans Thank you for the discussion and the answer so far. Here's the last issue I'm still unclear about:
Is there any chance that the creation of an object of a user-defined type results in a reuse of an existing object?
If this is possible, I'd like to know how I can verify for any class I work with whether it might exhibit such a behavior.
Here's my understanding. Any time an object of a user-defined class is created, the class' __new__() method is called first. If this method is overridden, nothing in the language would prevent the programmer from returning a reference to an existing object, thus violating my "distinct object guarantee". Obviously, I can observe it by examining the class definition.
I am not sure what happens if a user-defined class does not override __new__() (or explicitly relies on __new__() from the base class). If I write
class MyInt(int):
    pass
the object creation is handled by int.__new__(). I would expect that this means I may sometimes see the following assertion fail:
x = MyInt(1)
y = MyInt(1)
assert x is not y # may fail, since int.__new__() might return the same object twice?
But in my experimentation with CPython I could not achieve such behavior. Does this mean the language provides "distinct object guarantee" for user-defined classes that don't override __new__, or is it just an arbitrary implementation behavior?
UPDATE 2:
While my DistinctTuple turned out to be a perfectly safe implementation, I now understand that my design idea of using DistinctTuple to model nodes is very bad.
The identity operator is already available in the language; making == behave in the same way as is is logically superfluous.
Worse, if == could have done something useful, I made it unavailable. For instance, it's quite likely that somewhere in my program I'll want to see if two nodes are represented by the same pair of integers; == would have been perfect for that - and in fact, that's what it does by default...
Worse yet, most people actually do expect == to compare some "value" rather than identity - even for a user-defined class. They would be caught unawares by my override, which only looks at identity.
Finally... the only reason I had to redefine == was to allow multiple nodes with the same tuple representation to be part of a set. This is the wrong way to go about it! It's not == behavior that needs to change, it's the container type! I simply needed to use multisets instead of sets.
In short, while my question may have some value for other situations, I am absolutely convinced that creating class DistinctTuple is a terrible idea for my use case (and I strongly suspect it has no valid use case at all).
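As a rough sketch of the multiset idea, assuming collections.Counter from the standard library stands in for the multiset (the text above does not name a particular type), plain tuples keep their normal == and the container records how many nodes share each representation:

```python
from collections import Counter

# Each key is a node representation; each value counts the nodes that share it.
nodes = Counter()
nodes[(1, 2)] += 1
nodes[(1, 2)] += 1   # a second, distinct node with the same representation
nodes[(3, 4)] += 1

print(nodes[(1, 2)])        # 2 nodes represented by (1, 2)
print(sum(nodes.values()))  # 3 nodes in total
print(len(nodes))           # 2 distinct representations
```

Note that a Counter only tracks counts; if individual nodes must carry their own data, this sketch would not be enough on its own.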
ANSWER:
Is there any chance that the creation of an object of a user-defined type results in a reuse of an existing object?
This will happen if, and only if, the user-defined type is explicitly designed to do that, with __new__() or some metaclass.
I'd like to know how I can verify for any class I work with whether it might exhibit such a behavior.
Use the source, Luke.
When it comes to int, CPython pre-allocates small integers, and these pre-allocated objects are reused wherever you create or calculate with integers. You can't reproduce this with MyInt(1) is MyInt(1), because those objects are instances of MyInt, not plain int objects. However:
>>> MyInt(1) + MyInt(1) is 2
True
This is because, of course, MyInt(1) + MyInt(1) does not return a MyInt. It returns an int, because that's what the __add__ of an integer returns (and that's where the check for pre-allocated integers occurs as well). If anything, this just shows that subclassing int in general isn't particularly useful. :-)
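The same point can be checked without relying on is against a literal (which newer CPython versions flag with a SyntaxWarning). This is a small sketch that reuses the MyInt class from the question; the identity result in the last line is what CPython does, not something the language promises.

```python
class MyInt(int):
    pass


a = MyInt(1)
b = MyInt(1)
total = a + b

print(type(a).__name__)      # MyInt
print(type(total).__name__)  # int: the inherited __add__ returns a plain int
print(a == b)                # True: equal values
print(a is b)                # False in CPython: each MyInt(1) is a fresh object
```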
Does this mean the language provides "distinct object guarantee" for user-defined classes that don't override __new__, or is it just an arbitrary implementation behavior?
It doesn't guarantee it, because there is no need to do so. The default behavior is to create a new object. You have to override it if you don't want that to happen. Having a guarantee makes no sense.
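A short sketch of that default behavior, using a hypothetical Node class that overrides neither __new__() nor __eq__(): each call produces a separate object, and the inherited equality and hash are identity-based, so both instances fit in one set.

```python
class Node:
    """Hypothetical node type; __new__, __eq__ and __hash__ come from object."""

    def __init__(self, coords):
        self.coords = coords


x = Node((1, 2))
y = Node((1, 2))

print(x is y)                # False: two live objects never share an identity
print(x == y)                # False: the default == falls back to identity
print(len({x, y}))           # 2
print(x.coords == y.coords)  # True: the representations can still be compared by value
```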