I have a pandas dataframe with 2 columns, tag and message:
tag | message
["string1","string2"] | some text
["string","string3"] | another text
["string2"] | another another text
I want to build a dataset for multi-label classification, so I need to extract all the distinct strings from tag, because they are my labels.
What I need:
I need to turn the list of about 40 distinct strings in tag into columns, and then fill in, for each message row, the count of each string in that row's tag.
So the final dataframe should be like this:
tag | message | string | string1 | string2 | string3
["string1","string2"] | some text | 0 | 1 | 1 | 0
["string","string3"] | another text | 1 | 0 | 0 | 1
["string2"] | another another text | 0 | 0 | 1 | 0
Note that the new_df dataframe must have the 2 original columns plus ~40 new columns, because there are about 40 distinct strings in the tag column.
How can I do this in Julia?
There are many ways to do it; here are two examples:
julia> using DataFrames

julia> df = DataFrame(tag=[["string1","sttring2"], ["string","string3"], ["string2"]],
                      message=["some text", "another text", "another another text"])
3×2 DataFrame
 Row │ tag                      message
     │ Array…                   String
─────┼───────────────────────────────────────────────
   1 │ ["string1", "sttring2"]  some text
   2 │ ["string", "string3"]    another text
   3 │ ["string2"]              another another text
julia> [df DataFrame([col => in.(col, df.tag) for col in foldl(union!, df.tag, init=Set{String}())])]
3×7 DataFrame
 Row │ tag                      message               string  string2  string3  string1  sttring2
     │ Array…                   String                Bool    Bool     Bool     Bool     Bool
─────┼────────────────────────────────────────────────────────────────────────────────────────────
   1 │ ["string1", "sttring2"]  some text              false    false    false     true      true
   2 │ ["string", "string3"]    another text            true    false     true    false     false
   3 │ ["string2"]              another another text   false     true    false    false     false
julia> transform(df, [:tag => ByRow(x -> in(col, x)) => col for col in foldl(union!, df.tag, init=Set{String}())])
3×7 DataFrame
 Row │ tag                      message               string  string2  string3  string1  sttring2
     │ Array…                   String                Bool    Bool     Bool     Bool     Bool
─────┼────────────────────────────────────────────────────────────────────────────────────────────
   1 │ ["string1", "sttring2"]  some text              false    false    false     true      true
   2 │ ["string", "string3"]    another text            true    false     true    false     false
   3 │ ["string2"]              another another text   false     true    false    false     false
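Both examples produce Bool columns in whatever order the Set happens to iterate. If you want integer counts and a predictable column order, as in the table the question asks for, a variation along these lines might work. This is only a sketch: the names labels and new_df are illustrative, the sample tags are the ones from the question's expected output, and it assumes every element of tag is a Vector{String}.

```julia
using DataFrames

# Sample data, using the tags from the question's expected output.
df = DataFrame(tag=[["string1", "string2"], ["string", "string3"], ["string2"]],
               message=["some text", "another text", "another another text"])

# Collect the distinct labels and sort them so the column order is deterministic
# (iterating a Set gives no guaranteed order).
labels = sort!(unique(reduce(vcat, df.tag)))

# One Int column per label: how many times the label occurs in each row's tag vector.
new_df = transform(df, [:tag => ByRow(t -> count(==(lab), t)) => lab for lab in labels])
```

Here new_df keeps tag and message and adds one Int column per label, in sorted order. If a label can never repeat within a row, the counts are just 0 or 1, the same information as the Bool columns above.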