
Spark regex 'COIN' in column values -> rlike approach


I would like to check whether the column values contain 'COIN' (or similar strings). Is it possible to change my regex so that it does not have to list "CRYPTOCOIN|KUCOIN|COINBASE" explicitly? I'd like to have something like
"regex associated with COIN word|BTCBIT.NET"

My current code is below:

import org.apache.spark.sql.functions.{col, upper, when}

val CRYPTO_CARD_INDICATOR: String = "BTCBIT.NET|KUCOIN|COINBASE|CRYPTCOIN"
val CryptoCheckDataset = df.withColumn("is_crypto_indicator", when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))

Solution

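  • Since rlike reports a match when the pattern is found anywhere in the value, a single COIN alternative covers KUCOIN, COINBASE and CRYPTOCOIN, so I think the following should work:

    COIN|BTCBIT.NET

    Note that the unescaped . matches any character; write BTCBIT\.NET if it should match only a literal dot.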

    Full test in PySpark:

    from pyspark.sql.functions import col, upper, when
    CRYPTO_CARD_INDICATOR = "COIN|BTCBIT.NET"
    df = spark.createDataFrame([('kucoin',), ('coinbase',), ('crypto',)], ['company_name'])
    
    CryptoCheckDataset = df.withColumn("is_crypto_indicator", when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))
    CryptoCheckDataset.show()
    # +------------+-------------------+
    # |company_name|is_crypto_indicator|
    # +------------+-------------------+
    # |      kucoin|                  1|
    # |    coinbase|                  1|
    # |      crypto|                  0|
    # +------------+-------------------+
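
The matching behaviour can also be checked without a Spark session. The following is only a sketch: it uses Python's re.search as a stand-in for rlike (Spark itself uses Java regular expressions, but both find the pattern anywhere in the string for simple patterns like these). The names LOOSE, STRICT and is_crypto, and the extra sample values, are made up for illustration and are not part of the original code.

```python
import re

# re.search is a stand-in for rlike: both succeed when the pattern is found
# anywhere in the string. Spark uses Java regex, so this is an approximation.
LOOSE = "COIN|BTCBIT.NET"      # pattern from the answer; "." matches any character
STRICT = r"COIN|BTCBIT\.NET"   # assumed variant: dot escaped to match a literal "."


def is_crypto(name: str, pattern: str) -> int:
    """Return 1 if the upper-cased name contains a match for pattern, else 0."""
    return 1 if re.search(pattern, name.upper()) else 0


# Hypothetical sample values; "btcbitxnet" shows the effect of the unescaped dot.
names = ["kucoin", "coinbase", "crypto", "btcbit.net", "btcbitxnet"]
for name in names:
    print(name, is_crypto(name, LOOSE), is_crypto(name, STRICT))
```

With the unescaped pattern, "btcbitxnet" is flagged as well, because the dot accepts the "x"; the escaped variant flags only "btcbit.net". Whether that difference matters depends on the data in company_name.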