pythonsnowflake-cloud-data-platform

Snowpark Consecutive Filters Not Working as Expected


Found what I believe to be a bug. Posting here for posterity. Comes from consecutive filter operations on a dataframe.

enter image description here

Expected output is the empty dataframe (no rows meet the A=0) condition. However, I believe the consecutive filter is being parsed as A=1 or (B=3 and A=0)


Solution

  • However, I believe the consecutive filter is being parsed as A=1 or (B=3 and A=0)

    Correct. It is parsed as A = 1 OR B=3 AND A = 0 (AND has precedence over OR).

    It can be easily confirmed by using snowflake.snowpark.Session.query_history:

    import snowflake.snowpark
    session = snowflake.snowpark.context.get_active_session() 
    
    with session.query_history(True) as query_history:
      df = session.sql("SELECT 1 AS a, 2 AS B")
    
      df.filter('A = 1 OR B=3') \
        .filter('A = 0') \
        .show()
    
    query_history.queries[1]
    

    Output:

    enter image description here

    Chained filter conditions are connected with AND, therefore for correct evaluation the condition with OR it should be wrapped with parenthesis:

    df.filter('(A = 1 OR B=3)') \
      .filter('A = 0') \
      .show()
    

    Sidenote: Usage of verbatim SQL text in the condition can be replaced with col('<name>') syntax:

    import snowflake.snowpark
    from snowflake.snowpark.functions import col
    session = snowflake.snowpark.context.get_active_session() 
    
    with session.query_history(True) as query_history:
        df = session.sql("SELECT 1 AS A, 2 AS B")
    
        df.filter((col('A') == 1) | (col('B') == 3)) \
          .filter(col('A') == 0) \
          .show()
    
    query_history.queries[1]
    

    It generates:

    SELECT  *  
    FROM (SELECT 1 AS A, 2 AS B) 
    WHERE ((("A" = 1) OR ("B" = 3)) AND ("A" = 0))
    LIMIT 10