
Change the datatype of a column in a DataFrame in Julia


I have a dataframe df = DataFrame(CSV.File("file.csv", delim=";;")). The dataframe has three columns (column1 = Date, column2 = String31, column3 = String15).

column1    |  column2   |  column3
 date      |  String31  |  String15
2022-06-29 |    Test    |   100.00

Only column1 has the right datatype. I would like to change both column2 (to plain String) and column3 (to Real or Float64). I managed to change column2, but when I tried to change column3 I got an error saying that a string can't be converted to Real.

How would I go about changing these two columns?


Solution

  • For column2, I would recommend leaving it as String31 unless that causes a problem (and if it does, consider raising an issue with the InlineStrings.jl package). String31 is a fixed-size string type aimed mainly at data analysis workflows that create large numbers of strings in memory (such as a long DataFrame column), which otherwise puts a lot of pressure on Julia's garbage collector. Working with InlineStrings like String31 is therefore likely to speed up the analysis in many cases (this won't matter if your data set is small).
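
If a plain String column is needed after all, the conversion can be sketched as below. This assumes the InlineStrings.jl package is installed; the vector `inline_col` is a hypothetical stand-in for `df.column2`. CSV.jl also documents a `stringtype` keyword for `CSV.File` that may avoid inline strings at read time, which is worth checking against the version in use.

```julia
using InlineStrings

# Hypothetical stand-in for df.column2, which CSV.jl read as String31.
inline_col = String31.(["Test", "Other"])

# Broadcasting the String constructor yields an ordinary Vector{String}.
plain_col = String.(inline_col)

println(eltype(inline_col))  # element type before conversion
println(eltype(plain_col))   # element type after conversion
```

On a DataFrame, the same idea would be `df.column2 = String.(df.column2)`.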

    For column3, if you want to get a number from a string you need to parse it:

    julia> parse(Float64, "100.0")
    100.0
    

    You can apply this to the whole column by broadcasting:

    df.column3 = parse.(Float64, df.column3)
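
As a minimal sketch of the broadcast on a whole column, the snippet below builds a small DataFrame by hand instead of reading `file.csv`. It assumes DataFrames.jl is installed, and it uses plain `String` values where the real file would give `String15`; `parse` accepts either.

```julia
using DataFrames

# Hand-built stand-in for the DataFrame in the question.
df = DataFrame(
    column1 = ["2022-06-29", "2022-06-30"],
    column2 = ["Test", "Other"],
    column3 = ["100.00", "250.50"],   # numbers stored as strings
)

# Replace the string column with its parsed Float64 values.
df.column3 = parse.(Float64, df.column3)

println(eltype(df.column3))
println(df)
```

Assigning to `df.column3` replaces the column vector, so its element type changes from a string type to Float64.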
    

    That said, this operation is likely to fail: if it worked, CSV.jl would already have parsed the column as numeric. The fact that the column came in as a string type tells you there is probably something in it that can't be parsed as a number. One common example is a thousands separator (e.g. in files that came from Excel).

    parse will, however, tell you which value it failed on:

    julia> parse.(Float64, ["100.0", "1,000.0"])
    ERROR: ArgumentError: cannot parse "1,000.0" as Float64
    

    In this case you would use parse.(Float64, replace.(df.column3, "," => "")) to remove the thousands separators before parsing.
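
When a column holds more than one kind of bad value, it can help to list every entry that fails instead of stopping at the first error. The sketch below uses only Base functions; `raw` is a hypothetical stand-in for `df.column3`, and turning leftover failures into `missing` is just one possible policy, not something the original answer prescribes.

```julia
# Hypothetical stand-in for df.column3.
raw = ["100.00", "1,000.50", "n/a", "42"]

# tryparse returns `nothing` instead of throwing on bad input.
parsed = tryparse.(Float64, raw)

# Indices of the entries that could not be parsed as written.
bad = findall(isnothing, parsed)
println("unparseable rows: ", bad, " => ", raw[bad])

# Strip thousands separators and try again.
cleaned = replace.(raw, "," => "")
reparsed = tryparse.(Float64, cleaned)

# Anything that still fails becomes `missing`.
result = something.(reparsed, missing)
println(result)
```

Inspecting `raw[bad]` first shows which clean-up steps are actually needed before committing to a conversion.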

    [If parse just works without any changes, you might have discovered a bug in CSV.jl's type detection algorithm, which might be worth filing an issue for.]