I have a pyspark dataframe like so with rows that have 'zodiac' and rows that have 'nonzodiac'
spark.createDataFrame(
[
(1, '1234ESPNnonzodiac'),
(2, '1234ESPNzodiac'),
(3, '963CNNnonzodiac'),
(4, '963CNNzodiac'),
],
['id', 'col1']
)
I can get all the nonzodiac rows like so:
nonzodiac_rows = df.where(f.col("col1").rlike("nonzodiac"))
But I'm having trouble returning only rows with 'zodiac' since doing something similar returns both zodiac and nonzodiac rows
zodiac_rows = df.where(f.col("col1").rlike("zodiac"))
It is because nonzodiac contains zodiac substring. You need to write more strict regexp For example add one more character:
zodiac_rows = df.where(f.col("col1").rlike("Nzodiac"))
Or restrict non before zodiac
zodiac_rows = df.where(f.col("col1").rlike("(?<!non)zodiac"))