regex pyspark udf

string manipulation for column names in pyspark


This article gives a great overview of how to change column names: How to change dataframe column names in pyspark?

Nonetheless, I need something slightly different that I can't work out myself. Can anybody help me remove the spaces from all column names? This is needed for e.g. join commands, and a systematic approach saves the effort of handling 30 columns by hand. I suppose a combination of regex and a UDF would work best.

Example:

    root
     |-- CLIENT: string (nullable = true)
     |-- Branch Number: string (nullable = true)


Solution

  • There is a really simple solution that needs neither a regex nor a UDF:

    for name in df.schema.names:
      df = df.withColumnRenamed(name, name.replace(' ', ''))
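If the names may also contain tabs or repeated spaces, a regex variant along these lines might help. This is only a sketch: `strip_whitespace` is a hypothetical helper name, not part of PySpark, and it assumes no two columns collapse to the same name once whitespace is removed. The names are computed in plain Python first, so the DataFrame can then be renamed in a single `toDF` call instead of one `withColumnRenamed` per column.

```python
import re


def strip_whitespace(names):
    """Return the given column names with all whitespace removed.

    Hypothetical helper for illustration; not part of PySpark.
    """
    return [re.sub(r"\s+", "", name) for name in names]


# Column names taken from the example schema in the question.
new_names = strip_whitespace(["CLIENT", "Branch Number"])
print(new_names)  # ['CLIENT', 'BranchNumber']

# With an existing PySpark DataFrame `df`, all columns can be renamed at once:
# df = df.toDF(*strip_whitespace(df.columns))
```

Since only column names are touched and no row data is processed, a UDF is not needed here; the renaming is a metadata change.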