How to find position of substring column in another column using PySpark?


I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext within text for each row?

Input data:

+---------------------------+---------+
|           text            | subtext | 
+---------------------------+---------+
| Where is my string?       | is      |
| Hm, this one is different | on      |
+---------------------------+---------+

Expected output:

+---------------------------+---------+----------+
|           text            | subtext | position |
+---------------------------+---------+----------+
| Where is my string?       | is      |       6  |
| Hm, this one is different | on      |       9  |
+---------------------------+---------+----------+

Note: I can do this with a static text/regex without issue, but I have not been able to find any resources on doing it with a row-specific text/regex.


Solution

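  • You can use locate, which returns the position of the first occurrence of subtext in text. That position is 1-based (and 0 means "not found"), so subtract 1 to get the 0-based position shown in the expected output.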

    import pyspark.sql.functions as F
    
    # Use a SQL expression so that both arguments to locate() can be columns
    df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))
    
    df2.show(truncate=False)
    +-------------------------+-------+--------+
    |text                     |subtext|position|
    +-------------------------+-------+--------+
    |Where is my string?      |is     |6       |
    |Hm, this one is different|on     |9       |
    +-------------------------+-------+--------+
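
To sanity-check the expected positions without a Spark session, the off-by-one can be modelled in plain Python. This is only a sketch: the `locate` function below is a hypothetical stand-in that mimics the documented behaviour of Spark SQL's locate (1-based position, 0 when the substring is absent) and is not Spark's implementation.

```python
def locate(substr: str, text: str) -> int:
    """Hypothetical plain-Python model of Spark SQL locate(substr, str).

    Returns the 1-based position of the first occurrence, or 0 if absent.
    """
    # str.find is 0-based and returns -1 when absent, so adding 1 gives
    # a 1-based position and maps "not found" to 0.
    return text.find(substr) + 1


# The two rows from the question, as (text, subtext) pairs.
rows = [
    ("Where is my string?", "is"),
    ("Hm, this one is different", "on"),
]

# Same arithmetic as the SQL expression 'locate(subtext, text) - 1'.
positions = [locate(subtext, text) - 1 for text, subtext in rows]
print(positions)  # [6, 9]
```

Under this model, `locate(subtext, text) - 1` would give -1 for a row where subtext does not occur, which matches the convention of Python's own `str.find`.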