[SOLVED] Spark regex 'COIN' in column values -> rlike approach

I would like to check whether the values in a column contain 'COIN' or similar keywords. Is it possible to change my regex so that I don't have to list every variant, such as "CRYPTOCOIN|KUCOIN|COINBASE"? I'd like to have something like
"regex associated with COIN word|BTCBIT.NET"

Here is my current code:

import org.apache.spark.sql.functions.{col, upper, when}
val CRYPTO_CARD_INDICATOR: String = "BTCBIT.NET|KUCOIN|COINBASE|CRYPTOCOIN"
val CryptoCheckDataset = df.withColumn("is_crypto_indicator", when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))

Solution

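rlike returns true when the pattern is found anywhere in the value, not only when it matches the whole string, so the bare token COIN already covers KUCOIN, COINBASE and CRYPTOCOIN. I think the following should work: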

COIN|BTCBIT.NET

Full test in PySpark:

from pyspark.sql.functions import col, upper, when

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
CRYPTO_CARD_INDICATOR = "COIN|BTCBIT.NET"
df = spark.createDataFrame([('kucoin',), ('coinbase',), ('crypto',)], ['company_name'])

CryptoCheckDataset = df.withColumn("is_crypto_indicator", when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))
CryptoCheckDataset.show()
# +------------+-------------------+
# |company_name|is_crypto_indicator|
# +------------+-------------------+
# |      kucoin|                  1|
# |    coinbase|                  1|
# |      crypto|                  0|
# +------------+-------------------+
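
One caveat worth noting: in a regex an unescaped `.` matches any character, so a pattern written as BTCBIT.NET also matches values such as BTCBITXNET. Below is a minimal sketch of building the alternation from literal keywords so that such characters are escaped. It uses Python's re module as a stand-in for the Java regex engine behind rlike (for a simple pattern like this one the search-anywhere behaviour is the same). KEYWORDS and is_crypto are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical keyword list; each entry is treated as literal text, not as a regex.
KEYWORDS = ["COIN", "BTCBIT.NET"]

# re.escape turns "BTCBIT.NET" into "BTCBIT\.NET", so the dot matches only a literal dot.
CRYPTO_CARD_INDICATOR = "|".join(re.escape(k) for k in KEYWORDS)


def is_crypto(company_name: str) -> int:
    """Mimic upper(col).rlike(pattern): 1 if the pattern occurs anywhere, else 0."""
    return 1 if re.search(CRYPTO_CARD_INDICATOR, company_name.upper()) else 0


print(CRYPTO_CARD_INDICATOR)

# The unescaped pattern over-matches: "." accepts the "X" here.
print(bool(re.search("COIN|BTCBIT.NET", "BTCBITXNET")))

# The escaped pattern does not, while the intended values still match.
for name in ["btcbitxnet", "btcbit.net", "kucoin", "crypto"]:
    print(name, is_crypto(name))
```

The resulting pattern string can be passed to rlike in place of the hand-written one. For keywords containing more unusual characters, it is worth checking that the escaped form is also valid Java regex, since the two engines do not agree on every escape.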