Looking to convert String Column to Integer Column in PySpark. What happens to strings that can't be converted?


I'm trying to convert a column in a dataframe to IntegerType. Here is an example of the dataframe:

+----+-------+
|From|     To|
+----+-------+
|   1|1664968|
|   2|      3|
|   2| 747213|
|   2|1664968|
|   2|1691047|
|   2|4095634|
+----+-------+

I'm using the following code:

exploded_df = exploded_df.withColumn('To', exploded_df['To'].cast(IntegerType()))

However, I want to know what happens to strings that are not digits. For example, what happens to a string made up of several spaces? I'm asking because I want to filter the dataframe to get the values in column From whose corresponding value in column To is not a number.

Is there a simpler way to filter by this condition without converting the columns to IntegerType?

Thank you!


Solution

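  • Values that cannot be cast are set to null, and the result is a nullable column of the target type. (This is the behavior when ANSI mode, spark.sql.ansi.enabled, is off; with it on, an invalid cast raises an error instead.) Here's a simple example: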

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import IntegerType
    
    spark = SparkSession.builder.getOrCreate()
    
    # One string column; the last value is not a valid integer
    df = spark.createDataFrame([("1",),
                                ("2",),
                                ("3",),
                                ("4",),
                                ("hello world",)], schema=['id'])
    
    # show() prints the table itself and returns None, so no print() is needed
    df.show()
    
    # Values that cannot be parsed as integers become null
    df = df.withColumn("id", F.col("id").cast(IntegerType()))
    
    df.show()
    

    Output:

    +-----------+
    |         id|
    +-----------+
    |          1|
    |          2|
    |          3|
    |          4|
    |hello world|
    +-----------+
    
    +----+
    |  id|
    +----+
    |   1|
    |   2|
    |   3|
    |   4|
    |null|
    +----+
    

    And to verify the schema is correct:

    df.printSchema()
    

    Output:

    root
     |-- id: integer (nullable = true)