I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext in text?
Input data:
| text | subtext |
| Where is my string? | is |
| Hm, this one is different | on |
Expected output:
| text | subtext | position |
| Where is my string? | is | 6 |
| Hm, this one is different | on | 9 |
Note: I can do this with a static text/regex without issue, but I have not been able to find any resources on doing it with a row-specific text/regex.
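You can use locate. Calling it through F.expr lets both arguments refer to columns, so the substring is taken from each row. You need to subtract 1 because locate returns a 1-based position, while the expected output is 0-based.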
import pyspark.sql.functions as F
df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))
|text |subtext|position|
|Where is my string? |is |6 |
|Hm, this one is different|on |9 |
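To sanity-check the index arithmetic without a Spark session, here is a minimal plain-Python sketch. The helper names locate_one_based and position_zero_based are hypothetical, made up for illustration. It assumes locate returns a 1-based position and 0 when the substring is absent, and it ignores null handling.

```python
def locate_one_based(subtext: str, text: str) -> int:
    """Mimic Spark SQL's locate(substr, str): 1-based position, 0 when absent.

    Hypothetical helper for illustration only; nulls are not handled.
    """
    # str.find is 0-based and returns -1 when absent, so adding 1
    # gives the 1-based position, or 0 for "not found".
    return text.find(subtext) + 1


def position_zero_based(subtext: str, text: str) -> int:
    """Plain-Python equivalent of the expression locate(subtext, text) - 1."""
    return locate_one_based(subtext, text) - 1


# The two rows from the question: (text, subtext)
rows = [
    ("Where is my string?", "is"),
    ("Hm, this one is different", "on"),
]

positions = [position_zero_based(subtext, text) for text, subtext in rows]
print(positions)  # [6, 9]
```

Under that assumption, a row whose subtext does not occur in text would come out as -1 after the subtraction. That cannot happen here, since subtext is guaranteed to be present.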