dataframe, julia, categorical-data

Is there any way to collect categorical features quickly in Julia DataFrames?


I'm using Julia 0.6.3 with DataFrames.jl.

I was wondering if there is an easy way to identify the categorical features in Julia?

For large datasets it can be impossible to enter everything by hand.

I currently treat string columns as categorical, since they usually have low cardinality, but that is not foolproof.

My workaround so far:

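# Collect every column whose type name mentions "String" (Julia 0.6 syntax).
# `cols` holds the column names to inspect.
cat_cols = []
for col in cols
    if contains(string(typeof(X_train[col])), "String")
        push!(cat_cols, col)
    end
end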

But it seems kind of ugly, and it doesn't catch label-encoded values because they are integers.

I could also rely on low unique counts, but then sparse features would be picked up as well.

Any ideas?


Solution

  • As 张实唯 indicates, if you are reading the data from an external source, you have to mark the categorical columns manually; there is no workaround.

    However, if you are reading a DataFrame that someone else has properly prepared, this is simple: categorical columns should be of CategoricalArray type, so you can check for that as follows.

    Assuming df is your data frame, you can do any of the following:

    isa.(collect(eachcol(df)), CategoricalArray)
    

    or

    map(col -> isa(df[col], CategoricalArray), 1:size(df,2))
    

    or (in this case you will get a DataFrame as a result)

    map(col -> isa(col, CategoricalArray), eachcol(df))
    

    Additionally, CategoricalArray lets you distinguish between ordinal and nominal categorical values. One way to extract this information is, for instance:

    map(col -> isa(df[col], CategoricalArray) ?
               (isordered(df[col]) ? :ordered : :categorical) :
               :other, 1:size(df,2))
    

    In general, in Julia, and in DataFrames.jl in particular, you can expect important metadata about an object to be carried by its type, because working with user-defined types is efficient in Julia. CategoricalArray is one such type.
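
    For readers on newer package versions, here is a minimal sketch of the same idea. It assumes DataFrames.jl 1.x with CategoricalArrays.jl loaded explicitly, where a column is accessed as df[!, name] rather than df[col]. The example data, the column names, and the column_kind helper are made up for illustration and are not part of either package.

    ```julia
    using DataFrames, CategoricalArrays

    # Hypothetical data standing in for a freshly loaded table.
    df = DataFrame(city  = ["Oslo", "Rome", "Oslo"],
                   grade = ["small", "large", "medium"],
                   price = [1.5, 2.0, 3.25])

    # Manual step: the loader cannot know which columns are categorical.
    df.city  = categorical(df.city)     # nominal
    df.grade = categorical(df.grade)    # ordinal, set up in the next two lines
    levels!(df.grade, ["small", "medium", "large"])
    ordered!(df.grade, true)

    # Classify a column purely by its type (hypothetical helper).
    function column_kind(col)
        col isa CategoricalArray || return :other
        return isordered(col) ? :ordered : :categorical
    end

    kinds = [column_kind(col) for col in eachcol(df)]
    # [:categorical, :ordered, :other]

    cat_cols = [name for name in names(df) if df[!, name] isa CategoricalArray]
    # ["city", "grade"]
    ```

    Once the columns have been marked, the detection itself is only a type check, so label-encoded integer columns are caught as well, provided whoever prepared the data converted them with categorical.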