julia, julia-dataframe

How to balance a dataset from a countmap table


I have this dataset:

text               sentiment
randomstring        positive
randomstring        negative
randomstring        neutral
random              mixed

Then if I run a countmap I have:

"mixed"    -> 600
"positive" -> 2000
"negative" -> 3300
"netrual"  -> 780

I want to randomly sample from this dataset so that I keep all records of the smallest class (mixed = 600) and the same number from each of the other classes (positive = 600, negative = 600, neutral = 600).

I know how to do this in pandas:

df_teste = [data.loc[data.sentiment == i]
                .sample(n=int(data['sentiment'].value_counts().nsmallest(1)[0]),
                        random_state=SEED)
            for i in data.sentiment.unique()]

df_teste = pd.concat(df_teste, axis=0, ignore_index=True)

But I am having a hard time doing this in Julia.

Note: I don't want to hardcode which class is the smallest one, so I am looking for a solution that infers that from the countmap or freqtable, if possible.


Solution

  • Why do you want a countmap or freqtable solution if you seem to want to use a data frame in the end?

    This is how you would do this with DataFrames.jl (without StatsBase.jl and FreqTables.jl, as they are not needed for this):

    julia> using Random
    
    julia> using DataFrames
    
    julia> df = DataFrame(text = [randstring() for i in 1:6680],
                                  sentiment = shuffle!([fill("mixed", 600);
                                                        fill("positive", 2000);
                                                        fill("negative", 3300);
                                                        fill("neutral", 780)]))
    6680×2 DataFrame
      Row │ text      sentiment
          │ String    String
    ──────┼─────────────────────
        1 │ R3W1KL5b  positive
        2 │ uCCpNrat  negative
        3 │ fwqYTCWG  negative
      ⋮   │    ⋮          ⋮
     6678 │ UJiNrlcw  negative
     6679 │ 7aiNOQ1o  neutral
     6680 │ mbIOIQmQ  negative
               6674 rows omitted
    
    julia> gdf = groupby(df, :sentiment);
    
    julia> min_len = minimum(nrow, gdf)
    600
    
    julia> df_sampled = combine(gdf) do sdf
               return sdf[randperm(nrow(sdf))[1:min_len], :]
           end
    2400×2 DataFrame
      Row │ sentiment  text
          │ String     String
    ──────┼─────────────────────
        1 │ positive   O0QsyrJZ
        2 │ positive   7Vt70PSh
        3 │ positive   ebFd8m4o
      ⋮   │     ⋮         ⋮
     2398 │ neutral    Kq8Wi2Vv
     2399 │ neutral    yygOzKuC
     2400 │ neutral    NemZu7R3
               2394 rows omitted
    
    julia> combine(groupby(df_sampled, :sentiment), nrow)
    4×2 DataFrame
     Row │ sentiment  nrow
         │ String     Int64
    ─────┼──────────────────
       1 │ positive     600
       2 │ negative     600
       3 │ mixed        600
       4 │ neutral      600
    

    If your data is very large and you need the operation to be very fast, there are more efficient ways to do it, but in most situations this should be fast enough, and the solution does not require any extra packages.
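
    Since the question explicitly asks about inferring the smallest class from a countmap, here is a sketch of that variant. It assumes StatsBase.jl is installed (for `countmap`) and uses a small synthetic dataset with the same `text`/`sentiment` column names as above; the variable names `cm`, `min_len`, and `df_balanced` are illustrative:

    ```julia
    using Random, DataFrames
    using StatsBase: countmap

    # Synthetic data standing in for the real dataset
    df = DataFrame(text = [randstring(8) for _ in 1:100],
                   sentiment = rand(["mixed", "positive", "negative", "neutral"], 100))

    # countmap returns Dict(class => count); its smallest value is the target size
    cm = countmap(df.sentiment)
    min_len = minimum(values(cm))

    # Sample min_len rows without replacement from each class
    df_balanced = combine(groupby(df, :sentiment)) do sdf
        sdf[randperm(nrow(sdf))[1:min_len], :]
    end
    ```

    Compared to `minimum(nrow, gdf)`, this computes the same number from the countmap, so it only makes sense if you already depend on StatsBase.jl for other reasons.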