The Spark Row class in pyspark/sql/types.py contains no __init__ method, but shows the following overloaded type hints for __new__:
@overload
def __new__(cls, *args: str) -> "Row": ...
@overload
def __new__(cls, **kwargs: Any) -> "Row": ...
def __new__(cls, *args: Optional[str], **kwargs: Optional[Any]) -> "Row": ...  # body omitted here
The docstring for Row shows various instantiations:
>>> Person = Row("name", "age")
>>> row1 = Row("Alice", 11) # This is the one that is hard to understand
>>> row2 = Row(name="Alice", age=11)
>>> row1 == row2
True
The second line above does not fit any of the overloaded prototypes.
It almost fits the prototype with *args, except for the fact that all
of the arguments for *args are supposed to be strings. This is
obviously not the case for Row("Alice", 11), but that invocation
doesn't generate any messages when issued at the REPL prompt.
Obviously, there is something that I am missing about how
type hinting and overloading work. Can someone please explain?
P.S. For context,
I got to this point by trying to see how the constructor
knows that Row("name", "age") specifies field names while
Row("Alice", 11) specifies field values. The source code for
__new__ shows that it depends on whether the argument list is
*args or **kwargs. Both of the Row method invocations in
this paragraph use *args, but the
second one simply doesn't fit the prototype for *args above.
Row is a subclass of tuple, so it behaves like one. Looking at the source code, I can see that the type hints used there do not perform any runtime checks to enforce types (note [1] below provides background on type checking). Therefore, a Row object can be created from objects of any type, not just ints and strings. Here is one example:
class Foo:
    pass
class Bar:
    pass
>>> Row(Foo(), Bar())
# <Row(<__main__.Foo object at 0x7f5bb3ef4700>, <__main__.Bar object at 0x7f5bb36458d0>)>
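To see why nothing complains at the REPL, here is a minimal sketch that does not use pyspark at all. The function name make is hypothetical; it only mirrors the shape of the overloads on Row.__new__. As far as I understand it, only the last, undecorated definition is bound to the name at runtime, and its annotations are never checked by the interpreter:

```python
from typing import Any, overload


@overload
def make(*args: str) -> tuple: ...
@overload
def make(**kwargs: Any) -> tuple: ...
def make(*args: Any, **kwargs: Any) -> tuple:
    # Only this final definition is bound to the name "make" at runtime;
    # the @overload stubs above exist for static type checkers.
    return tuple(args) if args else tuple(kwargs.values())


# An int where the first overload says str: Python itself does not object.
result = make("Alice", 11)
print(result)
```

A static checker such as mypy would be expected to report that no overload matches make("Alice", 11); the interpreter simply runs the call.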
Creating a Row object using only positional args will not automatically set the __fields__ attribute. Therefore your assumption that the constructor knows that Row("name", "age") specifies field names while Row("Alice", 11) specifies field values is incorrect.
Instead, by default both calls simply set the values of the tuple. This is also why row1 == row2 in your example: the comparison is the ordinary tuple comparison, which looks only at the values and not at the field names.
>>> row = Row('name', 'age')
>>> row.__fields__
# AttributeError: __fields__
>>> row1 = Row("Alice", 11)
>>> row2 = Row(name="Alice", age=11)
>>> row1 == row2
# True
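The equality result can be reproduced without pyspark. The sketch below uses a hypothetical class Tagged (not part of pyspark) and assumes, as the Row source appears to do, that the tuple subclass does not override __eq__, so extra attributes attached to an instance play no part in the comparison:

```python
class Tagged(tuple):
    """Hypothetical tuple subclass that carries extra metadata."""

    def __new__(cls, values, fields=None):
        obj = super().__new__(cls, values)
        # Metadata lives in the instance __dict__, outside the tuple contents.
        obj.fields = fields
        return obj


plain = Tagged(("Alice", 11))
named = Tagged(("Alice", 11), fields=["name", "age"])

# Inherited tuple.__eq__ compares element by element and ignores "fields".
same = plain == named
print(same)
```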
One important point here is that Row also defines a __call__ method, which creates a new instance of Row from fields and values. Essentially, when you call an existing Row object in this manner, its existing values become the field names of the new row:
>>> row = Row('name', 'age')('Alice', 11)
>>> row
# Row(name='Alice', age=11)
>>> row.__fields__
# <Row('name', 'age')>
But when you use the kwargs form, the fields and values are set automatically:
>>> row = Row(name="Alice", age=11)
>>> row
# Row(name='Alice', age=11)
>>> row.__fields__
# ['name', 'age']
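The two construction paths described above can be imitated in a few lines. This is only a rough sketch under my own assumptions: MiniRow and its fields attribute are hypothetical names, and the real Row does more (attribute access, pickling, a custom repr). It is meant to show the mechanism, not to reproduce the pyspark implementation:

```python
class MiniRow(tuple):
    """Hypothetical, simplified imitation of Row's construction paths."""

    def __new__(cls, *args, **kwargs):
        if args and kwargs:
            raise ValueError("use positional or keyword arguments, not both")
        if kwargs:
            # Keyword form: values go into the tuple, names are remembered.
            row = super().__new__(cls, kwargs.values())
            row.fields = list(kwargs)
            return row
        # Positional form: just a tuple of whatever was passed, no field names.
        return super().__new__(cls, args)

    def __call__(self, *values):
        if len(values) > len(self):
            raise ValueError("more values than fields")
        row = MiniRow(*values)
        # The values of the row being called become the new row's field names.
        row.fields = self
        return row


Person = MiniRow("name", "age")   # positional: no field names yet
alice = Person("Alice", 11)       # calling it supplies the values
bob = MiniRow(name="Alice", age=11)
print(alice, tuple(alice.fields), bob.fields)
```

In this sketch neither MiniRow("name", "age") nor MiniRow("Alice", 11) is treated specially; the strings only become field names once the first object is called.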
Notes:
[1] Here is a primer on type hinting and checking. Third-party tools can be used to check the types specified by hints. This is not built into IDEs like Spyder.