palantir-foundry, foundry-code-repositories, foundry-python-transform

Why is my Code Repo warning me not to use union and instead use unionByName?


My repository warns me about using union and says I should use unionByName instead. Aren't these the same thing? Why would I care which one to use?


Solution

  • The PySpark docs note the following about union:

    Also as standard in SQL, this function resolves columns by position (not by name).

    This is dangerous in most cases: if your schemas have the same types but different column names or purposes, you may silently merge incompatible schemas. For example, if schema1 is [('col1', T.IntegerType()), ('col2', T.StringType())] and schema2 is [('col3', T.IntegerType()), ('col4', T.StringType())], union merges them successfully even though col1 and col3 may have fundamentally different meanings, as may col2 and col4.

    This is different from unionByName, in that:

    The difference between this function and union() is that this function resolves columns by name (not by position)

    This is a safer way to conduct a union in most cases, so it is preferred.
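
To make the difference concrete, here is a minimal plain-Python sketch of the two resolution rules. It is a model only, not Spark itself: the `(columns, rows)` table shape, the helper names `union_by_position` and `union_by_name`, and the sample data are all hypothetical, and it ignores Spark's type checking and coercion.

```python
# Plain-Python model of the two column-resolution rules (no Spark needed).
# A "table" here is a hypothetical (columns, rows) pair, not a Spark DataFrame.

def union_by_position(left, right):
    """Mimic union(): keep the left column names, append right rows as-is."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if len(left_cols) != len(right_cols):
        raise ValueError("column counts differ")
    return left_cols, left_rows + right_rows


def union_by_name(left, right):
    """Mimic unionByName(): reorder right rows to match the left column names."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if set(left_cols) != set(right_cols):
        raise ValueError(f"cannot resolve {left_cols} among {right_cols}")
    order = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in order) for row in right_rows]
    return left_cols, left_rows + reordered


people = (["age", "name"], [(34, "Ana")])
orders = (["quantity", "sku"], [(2, "A-100")])
swapped = (["name", "age"], [("Bo", 51)])

# Positional union silently files an order row under the people columns.
mixed = union_by_position(people, orders)

# Same names in a different order: by-position scrambles, by-name lines up.
scrambled = union_by_position(people, swapped)
aligned = union_by_name(people, swapped)

# Different names: by-name refuses instead of merging silently.
try:
    union_by_name(people, orders)
    rejected = False
except ValueError:
    rejected = True
```

In PySpark the corresponding calls are `df1.union(df2)` and `df1.unionByName(df2)`. With mismatched column names, the by-name variant should fail at analysis time rather than return mixed rows, which is the behavior the repository warning is steering you toward.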