Tags: regex, pyspark

PySpark Regular Expression: add double quotes after comma


I have the below string present within a dataframe:

30,kUsUO,6,18,97,42,SAM,lmhYK,49,aLaTA,51,34,3,49,75,39,pdwvW,54,7,63,12,25,26,SJ12u,rUFUV,34,xXBv3,XHtz4,r4Fyh,14,20,0jZL2,izrsC,44,K5Kw3,8,tcKu7,5,RPLcy,kg4IR,Kvs3p,lyG09,dJmZB,34,84,7,qED2y,8uNen,5,96,81,88,bGgqK,FAsIV,81,YXZ,PQR,Flat No B1002, Balaji Whitefield society, sus road, pune,Mh,22,591213,LbAo7,21,18,text,,,,,

The requirement here is to add an opening double quote after the 57th comma if a string/digits immediately follows the 57th comma, and to close the double quote before the pattern digits,digits (here, before ,22,591213).

So basically I am trying to enclose the below substring within double quotes:
"Flat No B1002, Balaji Whitefield society, sus road, pune,Mh"

For that I have written the below regular expression:

from pyspark.sql.functions import col, regexp_replace

pattern = r'^((?:[^,]*,){57})("?[a-zA-Z_][^"]*?"?)(,\d{2},\d{4}.*)$'

df = df.withColumn("text", regexp_replace(col("text"), pattern, '$1"$2"$3'))

This regular expression works very well for the above string.

But if I get a variation in the string, for example the one below, then the comma count goes wrong:

30,kUsUO,6,18,97,42,"SAM,K,KARAN" lmhYK,49,aLaTA,51,34,3,49,75,39,pdwvW,54,7,63,12,25,26,SJ12u,rUFUV,34,xXBv3,XHtz4,r4Fyh,14,20,0jZL2,izrsC,44,K5Kw3,8,tcKu7,5,RPLcy,kg4IR,Kvs3p,lyG09,dJmZB,34,84,7,qED2y,8uNen,5,96,81,88,bGgqK,FAsIV,81,YXZ,PQR,Flat No B1002, Balaji Whitefield society, sus road, pune,Mh,22,591213,LbAo7,21,18,text,,,,,

Here a name appears within double quotes with commas ("SAM,K,KARAN"). Due to this, my comma count goes wrong.

Is there any way to modify the above regular expression in PySpark so that it will not count a comma if it is present within double quotes?

This quoted-comma case can appear any number of times, in any position.
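To see the miscount concretely, here is a minimal sketch using Python's re module rather than Spark (Spark's regexp_replace uses Java regex, but this pattern behaves the same in both engines). The variant row is adjusted for illustration so that a comma separates the quoted name from the next field:

```python
import re

# The naive pattern from the question: every comma counts, including
# commas inside double quotes.
naive = r'^((?:[^,]*,){57})("?[a-zA-Z_][^"]*?"?)(,\d{2},\d{4}.*)$'

# Variant row with a quoted, comma-containing name as field 7 (adjusted for
# illustration so that a comma separates the quoted name from the next field).
row = '30,kUsUO,6,18,97,42,"SAM,K,KARAN",lmhYK,49,aLaTA,51,34,3,49,75,39,pdwvW,54,7,63,12,25,26,SJ12u,rUFUV,34,xXBv3,XHtz4,r4Fyh,14,20,0jZL2,izrsC,44,K5Kw3,8,tcKu7,5,RPLcy,kg4IR,Kvs3p,lyG09,dJmZB,34,84,7,qED2y,8uNen,5,96,81,88,bGgqK,FAsIV,81,YXZ,PQR,Flat No B1002, Balaji Whitefield society, sus road, pune,Mh,22,591213,LbAo7,21,18,text,,,,,'

result = re.sub(naive, r'\1"\2"\3', row)

# The two commas inside "SAM,K,KARAN" are counted as separators, so the
# opening quote lands two fields too early: before YXZ instead of Flat.
print(result)
```

The opening quote ends up before "YXZ,PQR,Flat No B1002..." rather than directly before the address, which is exactly the miscount described above.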


Solution

  • You might change the regex to:

    ^((?:[^,"]*(?:"[^"]*"[^,"]*)*,){57})("?[a-zA-Z_][^"]*"?)(,\d{2},\d{4}.*)$
    

    The different groups match:

      • ^ - start of string
      • ((?:[^,"]*(?:"[^"]*"[^,"]*)*,){57}) - Group 1: fifty-seven repetitions of a field followed by a comma, where a field is any number of chars other than , and ", optionally followed by one or more "..."-quoted chunks (whose commas are therefore not counted as separators) with more chars other than , and " in between
      • ("?[a-zA-Z_][^"]*"?) - Group 2: an optional ", a letter or underscore, any number of chars other than ", and an optional "
      • (,\d{2},\d{4}.*)$ - Group 3: a comma, two digits, a comma, four digits, and then the rest of the string up to its end

    See a regex demo
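As a quick cross-check outside Spark, the pattern can be exercised with Python's re module (everything this pattern uses behaves the same in Python and Java regex; note that replacement backreferences are \1 in Python but $1 in Spark's regexp_replace). The quoted-name row is a variant constructed for illustration, with a comma kept between the quoted name and the next field:

```python
import re

# Quote-aware pattern from the answer: each of the 57 leading fields may
# contain "-enclosed sections whose commas are not treated as separators.
pattern = r'^((?:[^,"]*(?:"[^"]*"[^,"]*)*,){57})("?[a-zA-Z_][^"]*"?)(,\d{2},\d{4}.*)$'

plain_row = '30,kUsUO,6,18,97,42,SAM,lmhYK,49,aLaTA,51,34,3,49,75,39,pdwvW,54,7,63,12,25,26,SJ12u,rUFUV,34,xXBv3,XHtz4,r4Fyh,14,20,0jZL2,izrsC,44,K5Kw3,8,tcKu7,5,RPLcy,kg4IR,Kvs3p,lyG09,dJmZB,34,84,7,qED2y,8uNen,5,96,81,88,bGgqK,FAsIV,81,YXZ,PQR,Flat No B1002, Balaji Whitefield society, sus road, pune,Mh,22,591213,LbAo7,21,18,text,,,,,'

# Same row, but field 7 replaced by a quoted name containing commas
# (constructed for illustration, keeping a comma before lmhYK).
quoted_row = plain_row.replace('42,SAM,lmhYK', '42,"SAM,K,KARAN",lmhYK')

for row in (plain_row, quoted_row):
    # In PySpark the replacement string would be '$1"$2"$3' (Java syntax).
    print(re.sub(pattern, r'\1"\2"\3', row))
```

Both rows come out with the address enclosed as "Flat No B1002, Balaji Whitefield society, sus road, pune,Mh". In PySpark itself this becomes df.withColumn("text", regexp_replace(col("text"), pattern, '$1"$2"$3')).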