I have this dataset:
text sentiment
randomstring positive
randomstring negative
randomstring netrual
random mixed
Then if I run a countmap
i have:
"mixed" -> 600
"positive" -> 2000
"negative" -> 3300
"netrual" -> 780
I want to random sample from this dataset in a way that I have records of all smallest class (mixed = 600)
and the same amount of each of other classes (positive=600, negative=600, neutral = 600)
I know how to do this in pandas:
df_teste = [data.loc[data.sentiment==i]\
.sample(n=int(data['sentiment']
.value_counts().nsmallest(1)[0]),random_state=SEED) for i in data.sentiment.unique()]
df_teste = pd.concat(df_teste, axis=0, ignore_index=True)
But I am having a hard time to do this in Julia.
Note: I don´t want to hardcode which of the class is the lowest one, so I am looking for a solution that infer that from the countmap
or freqtable
, if possible.
Why do you want a countmap
or freqtable
solution if you seem do want to use a data frame in the end?
This is how you would do this with DataFrames.jl (but without StatsBase.jl and FreqTables.jl as they are not needed for this):
julia> using Random
julia> using DataFrames
julia> df = DataFrame(text = [randstring() for i in 1:6680],
sentiment = shuffle!([fill("mixed", 600);
fill("positive", 2000);
fill("ngative", 3300);
fill("neutral", 780)]))
6680×2 DataFrame
Row │ text sentiment
│ String String
──────┼─────────────────────
1 │ R3W1KL5b positive
2 │ uCCpNrat ngative
3 │ fwqYTCWG ngative
⋮ │ ⋮ ⋮
6678 │ UJiNrlcw ngative
6679 │ 7aiNOQ1o neutral
6680 │ mbIOIQmQ ngative
6674 rows omitted
julia> gdf = groupby(df, :sentiment);
julia> min_len = minimum(nrow, gdf)
600
julia> df_sampled = combine(gdf) do sdf
return sdf[randperm(nrow(sdf))[1:min_len], :]
end
2400×2 DataFrame
Row │ sentiment text
│ String String
──────┼─────────────────────
1 │ positive O0QsyrJZ
2 │ positive 7Vt70PSh
3 │ positive ebFd8m4o
⋮ │ ⋮ ⋮
2398 │ neutral Kq8Wi2Vv
2399 │ neutral yygOzKuC
2400 │ neutral NemZu7R3
2394 rows omitted
julia> combine(groupby(df_sampled, :sentiment), nrow)
4×2 DataFrame
Row │ sentiment nrow
│ String Int64
─────┼──────────────────
1 │ positive 600
2 │ ngative 600
3 │ mixed 600
4 │ neutral 600
If your data is very large and you need the operation to be very fast there are more efficient ways to do it, but in most situations this should be fast enough and the solution does not require any extra packages.