I was wondering if there are performance difference between calling except (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#except(org.apache.spark.sql.Dataset) and using a left anti-join. So far, the only difference I can see is that with the left anti-join, the 2 datasets can have different columns.
Your title vs. explanation differ, actually.
Feb '24: Your comment is noted and clarifies: Let's say I have 2 datasets with the same schema, Dataset A and Dataset B. My goal is to find all the rows in Dataset A that are not present in Dataset B; should I do that with an EXCEPT or a LEFT ANTI JOIN?
If you have the same structure in Datasets A & B, you would simply use EXCEPT. Using LEFT ANTI JOIN would be convoluted coding (as aluded to in Comments), but is technically possible.
EXCEPT
is a specific implementation that enforces same structure and is a subtract operation, whereas
LEFT ANTI JOIN
allows different structures to be compared and where
clause is needed..
Use cases differ: 1) Left Anti Join can apply to many situations pertaining to missing data - customers with no orders (yet), orphans in a database. 2) Except is for subtracting things, e.g. Machine Learning splitting data into test- and training sets, or your use case "...new DataFrame containing only rows present in the DS A but not present in DS B ..."
Performance should not be a real deal breaker as they are different use cases in general and therefore difficult to compare. Except will involve the same data source whereas LAJ will involve different data sources.
So, you need to use EXCEPT.