
Correctly specify the types of unpacked `zip`


I need to restructure some lists of tuples in Python: I want to collect the n-th value of every tuple in a list into its own tuple. The tuples in each list are all structured the same way (e.g. the first position is always an int), and I provided the respective type hints. However, I unexpectedly receive an error message when I write the following code:

test_list: list[tuple[int, str]] = [(1, 'testa'), (2, 'testb')]

a: tuple[int]
b: tuple[str]

a, b = zip(*test_list)

As expected, a and b now only consist of int and str, respectively:

print(a)  # Output: (1, 2)
print(b)  # Output: ('testa', 'testb')

However, Pylance still complains about the zip expression:

Expression of type "tuple[int | str]" cannot be assigned to declared type "tuple[int]"
  "tuple[int | str]" is incompatible with "tuple[int]"
    Tuple entry 1 is incorrect type
      Type "int | str" cannot be assigned to type "int"
        "str" is incompatible with "int" (PylancereportGeneralTypeIssues)
Expression of type "tuple[int | str]" cannot be assigned to declared type "tuple[str]"
  "tuple[int | str]" is incompatible with "tuple[str]"
    Tuple entry 1 is incorrect type
      Type "int | str" cannot be assigned to type "str"
        "int" is incompatible with "str" (PylancereportGeneralTypeIssues)

What do I have to change to get rid of the error message? Or is this a bug in Pylance? Does it not recognize the star operator?


Solution

  • I don't think the problem here is with Pylance or your code.

    zip accepts generic iterables

    The problem is in the way that zip is designed/annotated. If we look at typeshed (always a great source for figuring out the types of built-in functions), we can see that the two-argument overload looks something like this (simplified):

    from __future__ import annotations
    from collections.abc import Iterable, Iterator
    from typing import TypeVar
    
    T = TypeVar("T", covariant=True)
    T1 = TypeVar("T1")
    T2 = TypeVar("T2")
    
    ...
    
    class zip(Iterator[T]):
        def __new__(cls, iter1: Iterable[T1], iter2: Iterable[T2]) -> zip[tuple[T1, T2]]: ...
    
        def __next__(self) -> T: ...
    

    (Source, zip starting in line 1673 as of today's main branch)

    What this means is that zip implements the iterator protocol (also mentioned in the docs). It is in fact a generic iterator over a type T and calling next on an instance of such a zip iterator returns something of type T.

    Moreover, the type T is fully specified upon construction of a zip instance as indicated by the __new__ return type annotation.

    The key point however is that the arguments taken by __new__ are annotated as Iterable. Those are generic over only one type argument. In this example the first iterable is generic over type argument T1 and the second over T2. The type argument to the resulting zip is then a tuple[T1, T2].
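
    To make the iterator behavior described above concrete, here is a small runtime check. It is only a sketch; the variable names are made up for illustration.

```python
from collections.abc import Iterator

numbers = [1, 2]
letters = ["a", "b"]

pairs = zip(numbers, letters)

# A zip object implements the iterator protocol.
is_iterator = isinstance(pairs, Iterator)

# Each call to next() yields one tuple with one item per input iterable.
first_pair = next(pairs)

# Whatever has not been consumed yet can still be collected.
remaining = list(pairs)
```

    Statically, a checker should see pairs as zip[tuple[int, str]] here, because each input list has a single element type.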

    Tuples to iterables and back to tuples

    This is totally fine when each of our iterables has a single, "consistent" element type. Take the following example:

    a = (1, 2)
    b = ("1", "2")
    x, y = zip(a, b)
    
    reveal_type(x)
    reveal_type(y)
    

    Pylance (via Pyright) understands reveal_type as well, but here is what mypy notes for those statements:

    note: Revealed type is "Tuple[builtins.int, builtins.str]"
    note: Revealed type is "Tuple[builtins.int, builtins.str]"
    

    This makes sense: a and b are of type tuple[int, int] and tuple[str, str] respectively. When passed to the zip constructor they are generalized to Iterable[int] and Iterable[str], which yields the type argument tuple[int, str] for the values returned by the iterator.

    Tuples can have multiple type arguments

    But what happens if we change the setup just slightly:

    a = (1, "2")
    b = ("1", 2)
    x, y = zip(a, b)
    reveal_type(x)
    reveal_type(y)
    

    Now we get the following from mypy:

    note: Revealed type is "Tuple[builtins.object, builtins.object]"
    note: Revealed type is "Tuple[builtins.object, builtins.object]"
    

    To be clear, the types of a and b are still correctly inferred as tuple[int, str] and tuple[str, int]. But when those tuples are passed to zip, the type checker has to reduce each tuple's type arguments to a single one, because zip is defined on iterables with one type argument. mypy does this with a join, and as you can see, int and str have only object as their closest common base.

    So the iterables are seen as containing object, which then gives us the zip[tuple[object, object]].

    Unions vs. joins, same difference

    Pylance seems to use unions instead of joins to find the element type of those iterables, which is why you get that notice about the zip iterator yielding tuple[int | str]. That is arguably the better approach compared to mypy's, but it is still unsatisfactory for your purposes. I hope you can see that the fundamental issue is the same.
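
    The same loss of positional information shows up without zip at all, as soon as a tuple is passed where an Iterable is expected. A minimal sketch, assuming a hypothetical helper named first that is not part of any library:

```python
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def first(items: Iterable[T]) -> T:
    """Hypothetical helper: return the first element of any iterable."""
    return next(iter(items))


same = (1, 2)     # inferred as tuple[int, int]
mixed = (1, "2")  # inferred as tuple[int, str]

# Both positions are int, so T can be solved as int.
from_same = first(same)

# Indexing with a literal keeps the per-position type: int.
by_index = mixed[0]

# Going through Iterable[T] forces the checker to pick one T for all
# positions: object with a join, int | str with a union.
by_iteration = first(mixed)
```

    At runtime by_index and by_iteration are the same value; only the static type differs.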

    Tuples are special in that they are generic over a variable number of type arguments. The reasoning is supposedly that they are of fixed length and you can therefore very precisely parameterize them. Something like a list can be changed in length and iterators don't even have a concept of length; they can potentially yield elements forever. So it seems reasonable to only give them one type parameter.

    The problems arise when we need to view a tuple as an Iterable, and I see no way around that.
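
    For reference, here is how the different parameterizations mentioned above look as annotations. This is just an illustrative sketch with made-up variable names:

```python
from collections.abc import Iterator

# Fixed-length tuple: one type argument per position.
pair: tuple[int, str] = (1, "testa")

# Note that tuple[int] would mean a tuple of exactly ONE int.
single: tuple[int] = (1,)

# Variable-length homogeneous tuple: one element type followed by "...".
numbers: tuple[int, ...] = (1, 2, 3)

# Lists and iterators take a single type argument, whatever their length.
names: list[str] = ["testa", "testb"]
name_iter: Iterator[str] = iter(names)
first_name = next(name_iter)
```

    The distinction between tuple[int] and tuple[int, ...] matters for the annotations in the question: a tuple holding two ints matches the latter, not the former.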


    Practical solutions

    As for how to proceed for you, this depends on what the actual use case is. The simplest way based on your minimal example would obviously be a type: ignore at the moment of unpacking the zip:

    test_list: list[tuple[int, str]] = [(1, 'testa'), (2, 'testb')]
    
    # tuple[int, ...] means "any number of ints"; tuple[int] would mean exactly one.
    a: tuple[int, ...]
    b: tuple[str, ...]
    
    a, b = zip(*test_list)  # type: ignore[assignment]
    

    If you are actually dealing with functions and your setup is more involved, maybe there are other ways around this problem. If you elaborate, maybe we can work something else out. Other than that, there is no shame in using duly considered type: ignores in your code, when you hit the limits of the available typing system.
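
    One such way, if you always unpack pairs, could be a small typed helper that never routes the inner tuples through a bare Iterable. This is only a sketch under that assumption: unzip_pairs is a hypothetical name, not an existing library function, and I have not run it against every checker version.

```python
from collections.abc import Iterable
from typing import TypeVar

A = TypeVar("A")
B = TypeVar("B")


def unzip_pairs(pairs: Iterable[tuple[A, B]]) -> tuple[tuple[A, ...], tuple[B, ...]]:
    """Hypothetical helper: split an iterable of 2-tuples into two tuples."""
    firsts: list[A] = []
    seconds: list[B] = []
    # Unpacking each 2-tuple directly keeps the per-position types A and B.
    for first, second in pairs:
        firsts.append(first)
        seconds.append(second)
    return tuple(firsts), tuple(seconds)


test_list: list[tuple[int, str]] = [(1, "testa"), (2, "testb")]

a: tuple[int, ...]
b: tuple[str, ...]
a, b = unzip_pairs(test_list)
```

    The trade-off is that the helper is fixed to 2-tuples, whereas zip(*test_list) works for any width; for wider tuples you would need one helper (or overload) per width.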