python, apache-spark, pyspark, overloading, python-typing

Spark Row object instantiated differently from overloaded prototypes?


The Spark Row class in pyspark/sql/types.py contains no __init__ method, but shows the following overloaded type hints for __new__:

@overload
def __new__(cls, *args: str) -> "Row":
    ...

@overload
def __new__(cls, **kwargs: Any) -> "Row":
    ...

def __new__(cls, *args: Optional[str], **kwargs: Optional[Any]) -> "Row":
    ...  # actual implementation body follows in the source

The doc string for Row shows various instantiations:

>>> Person = Row("name", "age")
>>> row1 = Row("Alice", 11) # This is the one that is hard to understand
>>> row2 = Row(name="Alice", age=11)
>>> row1 == row2
True

The second line above does not fit any of the overloaded prototypes. It almost fits the prototype with *args, except that all of the arguments for *args are supposed to be strings. That is clearly not the case for Row("Alice", 11), yet that invocation doesn't generate any messages when issued at the REPL prompt. There is evidently something I am missing about how type hinting and overloading work. Can someone please explain?

P.S. For context, I got to this point by trying to see how the constructor knows that Row("name","age") specifies field names while Row("Alice", 11) specifies field values. The source code for __new__ shows that it depends on whether the argument list is *args or **kwargs. Both of the Row method invocations in this paragraph use *args, but the second one simply doesn't fit the prototype for *args above.


Solution

  • Row is a subclass of tuple and therefore behaves like one. Looking at the source code, I can deduce that the type hints used there do not perform any runtime checks to enforce the annotated types (note [1] below provides background on type checking). A Row object can therefore be created from objects of any type, not just ints and strings. Here is one example:

    from pyspark.sql import Row

    class Foo:
        pass
    
    class Bar:
        pass
    
    
    >>> Row(Foo(), Bar())
    # <Row(<__main__.Foo object at 0x7f5bb3ef4700>, <__main__.Bar object at 0x7f5bb36458d0>)>
    
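    Because Row inherits from tuple, the positional-only form behaves just like a plain tuple: it supports indexing, unpacking and isinstance checks. A minimal sketch of that behaviour (repr and output formatting may vary slightly between Spark versions):

    >>> from pyspark.sql import Row
    >>> row1 = Row("Alice", 11)
    >>> isinstance(row1, tuple)
    # True
    >>> row1[0], row1[1]
    # ('Alice', 11)
    >>> name, age = row1   # ordinary tuple unpacking
    >>> name
    # 'Alice'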

    Creating Row objects using only positional arguments (*args) does not automatically set the __fields__ attribute. Therefore the assumption in your P.S., that the constructor knows Row("name", "age") specifies field names while Row("Alice", 11) specifies field values, is incorrect.

    Instead, by default both calls simply set the values of the tuple. That is why row1 == row2 in your example:

    >>> row = Row('name', 'age')
    >>> row.__fields__
    # AttributeError: __fields__
    
    >>> row1 = Row("Alice", 11)
    >>> row2 = Row(name="Alice", age=11)
    >>> row1 == row2
    # True
    
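    Equality between rows compares only the tuple values; the field names (when present) play no part in the comparison, which is exactly why row1 == row2 even though only row2 carries __fields__. A small illustration, assuming the Spark 3.x behaviour shown in the question's docstring excerpt:

    >>> from pyspark.sql import Row
    >>> row1 = Row("Alice", 11)
    >>> row2 = Row(name="Alice", age=11)
    >>> row1 == row2 == ("Alice", 11)    # plain tuple comparison of the values
    # True
    >>> row2 == Row(name="Bob", age=11)  # different values, so not equal
    # False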

    One important point: Row also defines a __call__ method, which creates a new Row instance from fields and values. Essentially, when you call an existing Row object in this manner, its existing values become the field names of the new row:

    >>> row = Row('name', 'age')('Alice', 11)
    >>> row
    # Row(name='Alice', age=11)
    
    >>> row.__fields__
    # <Row('name', 'age')>
    
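    Conceptually (this is just a way to think about it, not the actual implementation), the call step pairs the existing values, now acting as field names, with the new values, much as if the row were built from keyword arguments:

    >>> from pyspark.sql import Row
    >>> fields = Row('name', 'age')
    >>> values = ('Alice', 11)
    >>> fields(*values) == Row(**dict(zip(fields, values)))
    # True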

    But when you use the kwargs form, both the fields and the values are set automatically:

    >>> row = Row(name="Alice", age=11)
    >>> row
    # Row(name='Alice', age=11)
    
    >>> row.__fields__
    # ['name', 'age']
    
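    To come back to the P.S. in the question: the dispatch is based purely on whether positional or keyword arguments were passed, not on the types of the arguments. The following SimpleRow class is a hypothetical toy sketch of that idea, not the real pyspark implementation:

    from typing import Any, Optional

    class SimpleRow(tuple):
        """Toy illustration of args-vs-kwargs dispatch in a tuple subclass."""

        def __new__(cls, *args: Optional[str], **kwargs: Optional[Any]) -> "SimpleRow":
            if args and kwargs:
                raise ValueError("cannot mix positional and keyword arguments")
            if kwargs:
                # keyword form: keys become field names, values become tuple values
                row = tuple.__new__(cls, kwargs.values())
                row.__fields__ = list(kwargs.keys())
                return row
            # positional form: only values are stored, no field names are recorded
            return tuple.__new__(cls, args)

    >>> SimpleRow("Alice", 11) == SimpleRow(name="Alice", age=11)
    # True: tuple values are equal, but only the keyword form has __fields__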

    Notes:

    [1] A primer on type hinting and checking: hints are not enforced when the code runs; third-party tools such as mypy can be used to check the types specified by hints. This kind of checking is not built into IDEs like Spyder.
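    As a small, self-contained illustration (using a hypothetical describe function rather than pyspark itself): @overload stubs exist only for static checkers; at runtime only the final implementation is called, so a mismatched call runs without complaint, while a checker such as mypy would be expected to flag it.

    from typing import Any, overload

    @overload
    def describe(*args: str) -> str: ...
    @overload
    def describe(**kwargs: Any) -> str: ...

    def describe(*args: Any, **kwargs: Any) -> str:
        # the only code that exists at runtime; the stubs above are ignored here
        return f"args={args} kwargs={kwargs}"

    print(describe("Alice", 11))  # runs fine: hints are not enforced at runtime
    # a static checker would likely report something like:
    # "No overload variant of 'describe' matches argument types 'str', 'int'"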