I am trying to split the string to different columns using regular expression
Below is my data
decodeData = [('M|C705|Exx05','2'),
('M|Exx05','4'),
('M|C705 P|Exx05','6'),
('M|C705 P|8960 L|Exx05','7'),('M|C705 P|78|8960','9')]
df = sc.parallelize(decodeData).toDF(['Decode',''])
dfNew = df.withColumn('Exx05',regexp_extract(col('Decode'), '(M|P|M)(\\|Exx05)', 1)).withColumn('C705',regexp_extract(col('Decode'), '(M|P|M)(\\|C705)', 1)) .withColumn('8960',regexp_extract(col('Decode'), '(M|P|M)(\\|8960)', 1))
dfNew.show()
Result
+--------------------+---+-----+----+-----+
| Decode| |Exx05|C705| 8960|
+--------------------+---+-----+----+-----+
| M|C705|Exx05 | 2 | | M| |
| M|Exx05 | 4 | M| | |
| M|C705 P|Exx05 | 6 | P| M| |
|M|C705 P|8960 L|Exx05| 7 | M| M| P|
| M|C705 P|78|8960 | 9 | | M| |
+--------------------+---+-----+----+-----+
Here I am trying to extract the Code for string Exx05,C705,8960 and this can fall into M/P/L codes eg: While decoding 'M|C705 P|8960 L|Exx05' I expect the results as L M P in respective columns. However I am missing some logic here,which I am finding difficulty to crack
Expected results
+--------------------+---+-----+----+-----+
| Decode| |Exx05|C705| 8960|
+--------------------+---+-----+----+-----+
| M|C705|Exx05 | | M| M| |
| M|Exx05 | | M| | |
| M|C705 P|Exx05 | | P| M| |
|M|C705 P|8960 L|Exx05| | L| M| P|
| M|C705 P|78|8960 | | | M| P|
+--------------------+---+-----+----+-----+
When I am trying to change the reg expression accordingly , It works for some cases and it wont work for other sample cases, and this is just a subset of the actual data I am working on.
eg: 1. Exx05 can fall in any code M/L/P and even it can fall at any position,begining,middle,end etc
You can add ([^ ])*
in the regex to extend it so that it matches any consecutive patterns that are not separated by a space:
dfNew = df.withColumn(
'Exx05',
regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|Exx05)', 1)
).withColumn(
'C705',
regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|C705)', 1)
).withColumn(
'8960',
regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|8960)', 1)
)
dfNew.show(truncate=False)
+---------------------+---+-----+----+----+
|Decode | |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|M|C705|Exx05 |2 |M |M | |
|M|Exx05 |4 |M | | |
|M|C705 P|Exx05 |6 |P |M | |
|M|C705 P|8960 L|Exx05|7 |L |M |P |
|M|C705$P|78|8960 |9 | |M |P |
+---------------------+---+-----+----+----+