pandas · pyspark · azure-synapse

How to fix a space in a column name when converting a PySpark DataFrame to Pandas/Polars


I'm reading a SQL table in a notebook on Synapse and loading it into a PySpark DataFrame:

df = spark.read.synapsesql("dbtablename")

Unfortunately some columns have a space in their name e.g.: Job Title.

I tried different methods to rename the columns and remove the space. I can see the new DataFrames with the fixed column names, but when I try to convert to a Polars or Pandas DataFrame, the operation fails and the error message refers back to the original column name, with the space, from the first DataFrame.

I tried with:
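For reference, the renaming attempts typically look something like this; a minimal sketch with a hypothetical `sanitize` helper (the PySpark usage is shown as comments, since it needs an active Spark session):

```python
def sanitize(name: str) -> str:
    """Strip spaces from a column name, e.g. 'Job Title' -> 'JobTitle'."""
    return name.replace(" ", "")

# Hypothetical PySpark usage (requires an active Spark session):
# for old in df.columns:
#     df = df.withColumnRenamed(old, sanitize(old))
```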

Thank you for your kind help.


Solution

  • I was able to solve this problem by changing how I query the table on the dedicated SQL pool.

    Before:

    df = spark.read.synapsesql("db.tablename")
    

    After:

    # Constants comes from the Synapse Spark connector:
    # from com.microsoft.spark.sqlanalytics.Constants import Constants
    df = (spark.read
        .option(Constants.SERVER, "server_name.sql.azuresynapse.net")
        .option(Constants.DATABASE, "db_name")
        .synapsesql("select [Job Title] as JobTitle from dbo.TableName")
        )
    
    pandas_df = df.toPandas()
    

    Thank you for your help. I tried again with the previously shared code, and with a new pool and fresh resources, but I still hit the same issue. This approach instead lets me rename the column at query time.

    Note that:

    select [Job Title] from dbo.TableName
    

    still fails both when converting to a pandas DataFrame and when simply calling:

    df.show()
    

    So it is the column alias without a space that solves it.

    Thank you.
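The alias trick above can also be generated programmatically when many columns contain spaces; a minimal sketch (the `aliased_select` helper and the table name are illustrative, not part of the original answer):

```python
def aliased_select(columns, table):
    """Build a T-SQL SELECT that aliases bracketed column names
    to space-free identifiers, e.g. [Job Title] -> JobTitle."""
    parts = []
    for col in columns:
        alias = col.replace(" ", "")
        # Only add an alias when the name actually contains a space
        parts.append(f"[{col}] as {alias}" if alias != col else f"[{col}]")
    return f"select {', '.join(parts)} from {table}"

query = aliased_select(["Job Title", "Salary"], "dbo.TableName")
# query == "select [Job Title] as JobTitle, [Salary] from dbo.TableName"
```

The resulting string can then be passed to `synapsesql(...)` in place of the hand-written query.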