I'm using Julia 0.6.3 with DataFrames.jl.
I was wondering whether there is an easy way to detect categorical features in Julia.
For large datasets it can be impractical to list them all by hand.
My workaround so far is to treat string columns as categorical (they usually have low cardinality), but it's not foolproof:
cat_cols = []
for col in cols
    if contains(string(typeof(X_train[col])), "String") == true
        push!(cat_cols, col)
    end
end
But it seems kind of ugly, and it doesn't catch label-encoded values because they are integers.
I could also try to rely on low unique counts, but then sparse features would be picked up as well.
Any ideas?
As 张实唯 indicates, if you are reading the data from an external source you have to do it manually; there is no workaround.
However, if you are reading a DataFrame that was properly prepared by someone else, this is simple: categorical columns should be of type CategoricalArray, so you can check for that as follows.
Assume df is your data frame; then you can do either:
isa.(collect(eachcol(df)), CategoricalArray)
or
map(col -> isa(df[col], CategoricalArray), 1:size(df,2))
or (in this case you will get a DataFrame as a result)
map(col -> isa(col, CategoricalArray), eachcol(df))
Additionally, CategoricalArray allows you to differentiate between ordinal and nominal categorical values. One way to extract this information is, for instance:
map(col -> isa(df[col], CategoricalArray) ?
               (isordered(df[col]) ? :ordered : :categorical) :
               :other, 1:size(df,2))
In general in Julia, and in DataFrames.jl in particular, you can expect important metadata about an object to be carried by its type, because working with user-defined types is efficient in Julia. CategoricalArray is one such type.
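For readers on a newer setup, the same idea can be sketched as a self-contained script. This is only an illustration under assumptions: it targets Julia 1.x with recent DataFrames.jl and CategoricalArrays.jl releases (where columns are accessed as df[!, name] rather than df[col]), and the toy data, the column names and the helper column_kind are all made up for this example.

```julia
using DataFrames, CategoricalArrays

# Hypothetical toy data standing in for something read from an external source:
# at this point every column is a plain Vector, so nothing is marked categorical.
df = DataFrame(
    city  = ["Paris", "Lyon", "Paris"],
    grade = ["low", "high", "low"],
    label = [0, 1, 0],        # label-encoded integers
    name  = ["a", "b", "c"],
)

# Manual step: mark the columns you know to be categorical.
df.city  = categorical(df.city)                   # nominal
df.grade = categorical(df.grade, ordered = true)  # ordinal

# Classify a column by its type (column_kind is a hypothetical helper).
function column_kind(col)
    col isa CategoricalArray || return :other
    return isordered(col) ? :ordered : :categorical
end

# Names of the categorical columns, in column order.
cat_cols = [n for n in names(df) if df[!, n] isa CategoricalArray]

# Kind of every column, keyed by column name.
kinds = Dict(n => column_kind(df[!, n]) for n in names(df))
```

Note that label stays :other: integers that were never converted with categorical carry no type information, which is why the manual step cannot be skipped for raw data.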