I have a dataframe df = DataFrame(CSV.File("file.csv", delim=";;")).
The dataframe has three columns (column1 = Date, column2 = String31, column3 = String15).
column1 | column2 | column3
date | String31 | String15
2022-06-29 | Test | 100.00
Only column1 has the right datatype. I would like to change both column2 (to just String) and column3 (to Real or Float64). I managed to change column2, but when I tried to change column3 I got an error saying that a string can't be converted to a Real.
How would I go about changing these two columns?
On column2, I would recommend leaving it as String31 unless you run into an issue with that (and if you do, maybe raise an issue with the InlineStrings.jl package). String31 is a datatype mainly aimed at data analysis workflows where large numbers of strings are created in memory (such as in a long DataFrame column), which puts a lot of pressure on Julia's garbage collector. Working with InlineStrings like String31 is therefore likely to speed up the analysis in many cases (this won't matter if your data set is small).
For column3, if you want to get a number from a string you need to parse it:
julia> parse(Float64, "100.0")
100.0
You can apply this to the whole column by broadcasting:
df.column3 = parse.(Float64, df.column3)
That said, this operation is likely to fail, because if it worked, CSV.jl would have parsed the column as numeric already. The fact that the column is String15 rather than a numeric type tells you that there's likely something in there which can't be parsed as a number. One popular example is a thousands separator (e.g. in files that came from Excel). parse will however tell you where it failed:
julia> parse.(Float64, ["100.0", "1,000.0"])
ERROR: ArgumentError: cannot parse "1,000.0" as Float64
In this case you would use parse.(Float64, replace.(df.column3, "," => "")) to remove thousands separators.
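If the column is long, it can help to list every entry that fails before deciding how to clean it. The following is a minimal sketch using only Base Julia. The vector raw is a made-up stand-in for df.column3, and the names parsed, bad, cleaned and numbers are illustrative, not from the question:

```julia
# Hypothetical stand-in for df.column3; real data would come from the DataFrame.
raw = ["100.00", "1,000.50", "n/a", "42"]

# tryparse returns `nothing` instead of throwing when a value cannot be parsed.
parsed = tryparse.(Float64, raw)

# Indices of the entries that are not valid numbers as written.
bad = findall(isnothing, parsed)

# Strip thousands separators and try again; anything that still
# cannot be parsed (here "n/a") becomes `missing`.
cleaned = replace.(raw, "," => "")
numbers = something.(tryparse.(Float64, cleaned), missing)
```

Looking at raw[bad] shows the offending strings. Whether to turn leftovers into missing, as assumed here, or to handle them differently depends on what they mean in your data.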
[If parse just works without any changes you might have discovered a bug in CSV.jl's type detection algorithm, which might be worth filing an issue for.]