sqlapache-sparkdatabricksinstr

String comparison in Databricks Spark SQL


select distinct
 promo_name 
  ,case 
    when substring(promo_name,instr(promo_name, "P0"),2)  = "P0"   then 0
    when substring(promo_name,instr(promo_name, "P1"),2)  = "P1"   then 1
    When substring(promo_name,instr(promo_name, "P01"),3) = "P01"  then 1
    when substring(promo_name,instr(promo_name, "P2"),2)  = "P2"   then 2 
    When substring(promo_name,instr(promo_name, "P02"),3) = "P02"  then 2  
    when substring(promo_name,instr(promo_name, "P3"),2)  = "P3"   then 3 
    when substring(promo_name,instr(promo_name, "P03"),3) = "P03"  then 3
    when substring(promo_name,instr(promo_name, "P4"),2)  = "P4"   then 4  
    when substring(promo_name,instr(promo_name, "P04"),3) = "P04"  then 4 
    when substring(promo_name,instr(promo_name, "P5"),2)  = "P5"   then 5
    when substring(promo_name,instr(promo_name, "P05"),3) = "P05"  then 5
    when substring(promo_name,instr(promo_name, "P6"),2)  = "P6"   then 6
    when substring(promo_name,instr(promo_name, "P06"),3) = "P06"  then 6
    when substring(promo_name,instr(promo_name, "P7"),2)  = "P7"   then 7
    when substring(promo_name,instr(promo_name, "P07"),3) = "P07"  then 7 
    when trim(substring(promo_name,instr(promo_name, "P8"),2))  ="P8"  then 8 
    when trim(substring(promo_name,instr(promo_name, "P08"),3)) ="P08" then 8
    when trim(substring(promo_name,instr(promo_name, "P9"),2))  ="P9"  then 9
    when trim(substring(promo_name,instr(promo_name, "P09"),3)) ="P09" then 9
    when trim(substring(promo_name,instr(promo_name, "P10"),3)) ="P10" then 10 
    when trim(substring(promo_name,instr(promo_name, "P11"),3)) ="P11" then 11
    when trim(substring(promo_name,instr(promo_name, "P12"),3)) ="P12" then 12 

else 0 end as promo_id ,Case when trim(substring(promo_name,instr(promo_name, "P10"),3)) = "P10" then 10 when trim(substring(promo_name,instr(promo_name, "P11"),3)) = "P11" then 11 when trim(substring(promo_name,instr(promo_name, "P12"),3)) = "P12" then 12 when trim(substring(promo_name,instr(promo_name, "P13"),3)) = "P13" then 13 when trim(substring(promo_name,instr(promo_name, "P14"),3)) = "P14" then 14 else 0 end as id from hbi_dns_protected.store_zones_stock_v7_1_4 where promo_name is not null

Trying to extract ID from String, it works fine from P10 to P14 when I use in a separate column, when I do in the same column it just pick 1 instead of 11, 1 instead of 12 etc...

Am I making an mistake here? sample data


Solution

  • Why not use regexp_extract to do a RegEx extraction from your string, rather than write code for each case, something like:

    %sql
    SELECT *,
      regexp_extract( promo_name, ' P(\\d+)', 1 ) AS promoNumber
    FROM tmp
    

    My results:

    My results

    NB The regex is case-sensitive. If you have a requirement to capture either lower or upper case Ps, then you could use a character class ie [pP] instead.

    A full explanation of the RegEx pattern used:

    1. the regex starts off with a space character and a capital P. This will match literally a space and a capital P. If you want to make the match case-insensitive, you can use a character class eg [pP] which says match any character (case-sensitive) in the brackets
    2. The next component of the RegEx is (\\d+). This is made up of the RegEx pattern \d for matching digits, the + symbol which means, 'match one or more'. The brackets make it into a group, which is group 1. The \d has an additional slash which is an escape character required for the Spark SQL implementation of regexp_extract.
    3. the last argument of regexp_extract has the value of 1 which means 'return group 1 from the function'

    I use regex101.com for testing and practicing with RegEx expressions.