apache-spark · pyspark · apache-spark-sql

How do I remove every space inside a string with PySpark?


import pyspark.sql.functions as F

df1 = spark.read.csv('/content/drive/MyDrive/BigData2021/Lecture23/datasets/cities.csv', header=True, inferSchema=True)

for name in df1.columns:
    df1 = df1.withColumn(name, F.trim(df1[name]))

df1.show()

Here is my piece of code. I am trying to trim every space in the column headers and also in the values, but it doesn't work. I need a function I can reuse on any other DataFrame.


Solution

  • You can use regexp_replace to replace whitespace in column values with the empty string "". Note that F.trim only removes leading and trailing spaces, which is why your loop leaves interior spaces intact.

    You can use str.replace to remove spaces from the column names themselves.

    from pyspark.sql import functions as F
    
    df = spark.createDataFrame([("col1 with spaces  ", "col 2 with spaces")], ["col 1", "col 2"])
    
    """
    +------------------+-----------------+
    |             col 1|            col 2|
    +------------------+-----------------+
    |col1 with spaces  |col 2 with spaces|
    +------------------+-----------------+
    """
    select_expr = [
        F.regexp_replace(F.col(c), r"\s", "").alias(c.replace(" ", ""))
        for c in df.columns
    ]
    
    df.select(*select_expr).show()
    
    """
    +--------------+--------------+
    |          col1|          col2|
    +--------------+--------------+
    |col1withspaces|col2withspaces|
    +--------------+--------------+
    """