I'm using PySpark SQL with regexp_extract in this way:
df = spark.createDataFrame([['id_20_30', 10], ['id_40_50', 30]], ['id', 'age'])
df.createOrReplaceTempView("table")
sql_statement="""
select regexp_extract(id, r'(\d+)', 1) as id
from table
"""
df = spark.sql(sql_statement)
But I obtain this error:
Traceback (most recent call last):
File "/tmp/spark-5b33a47c-6490-4f80-9a97-3acb37f683ec/script.py", line 86, in <module>
df = spark.sql(sql_statement)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported.
The result should be
+---+
| id|
+---+
| 20|
| 40|
+---+
If I remove the r prefix it doesn't fail, but it produces a wrong result.
This code works when I run it locally with PySpark 3.3.0, but it fails when I run it in an EMR job using the emr-6.6.0 release.
Do you have any suggestions? I know I could avoid PySpark SQL, but I'd prefer to keep using it.
The error message tells you that the Spark version on emr-6.6.0 doesn't support raw string literals (r'...') in SQL, which is why the query only works locally on 3.3.0. In a plain SQL string literal you have to escape the backslash with another backslash, and because Python unescapes the string first, you end up needing four backslashes in the source. Alternatively, rewrite the pattern with a character class so no backslash is needed at all:
sql_statement = "select regexp_extract(id, '([0-9]+)', 1) as id from table"
# or, escaping the backslash once for Python and once for the SQL parser:
sql_statement = "select regexp_extract(id, '(\\\\d+)', 1) as id from table"
df = spark.sql(sql_statement)
df.show()
+---+
| id|
+---+
| 20|
| 40|
+---+
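To see why four backslashes are needed, you can trace the pattern through its two unescaping layers in plain Python (no Spark required). This is just an illustrative sketch: Python's re module is used here to stand in for the Java regex engine that regexp_extract ultimately calls, which behaves the same way for a simple pattern like \d+.

```python
import re

# The pattern passes through two string-literal layers before it
# reaches the regex engine.
python_literal = "select regexp_extract(id, '(\\\\d+)', 1) as id from table"

# Layer 1: Python has already turned \\\\ into \\,
# so the Spark SQL parser sees the five characters ( \ \ d +
sql_sees = "(\\\\d+)"
assert sql_sees in python_literal

# Layer 2: the SQL parser unescapes \\ to \,
# so the regex engine finally gets ( \ d +
regex_gets = "(\\d+)"

# That final pattern matches the first run of digits in each id:
assert re.search(regex_gets, "id_20_30").group(1) == "20"
assert re.search(regex_gets, "id_40_50").group(1) == "40"
```

If you want to skip this bookkeeping entirely, the '([0-9]+)' form above is equivalent for this data and needs no escaping at either layer.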