julia ijulia-notebook

Extracting strings, counting them, and transposing them into columns of a dataframe


I have a pandas dataframe with 2 columns, tag and message:

       tag              |     message
["string1","sttring2"]  |    some text
["string","string3"]    |  another text
["string2"]             | another another text

I want to build a dataset for multi-label classification, so I need to extract all the distinct strings from tag because they are my labels.

What I need:

I need to turn the roughly 40 distinct strings in tag into columns and then, for each row, insert the count of each string in that row's tag next to the message column. So the final dataframe should look like this:

         tag             |       message        | string | string1 | string2 | string3
["string1","string2"]    | some text            |   0    |    1    |    1    |    0
["string","string3"]     | another text         |   1    |    0    |    0    |    1
["string2"]              | another another text |   0    |    0    |    1    |    0

Note that the new_df dataframe must have the 2 original columns plus about 40 new columns, because there are about 40 distinct strings in the tag column.

How can I do this in Julia?


Solution

  • There are many ways to do it; here are two examples:

    julia> using DataFrames
    
    julia> df = DataFrame(tag=[["string1","sttring2"], ["string","string3"], ["string2"]],
                          message=["some text", "another text", "another another text"])
    3×2 DataFrame
     Row │ tag                      message
         │ Array…                   String
    ─────┼───────────────────────────────────────────────
       1 │ ["string1", "sttring2"]  some text
       2 │ ["string", "string3"]    another text
       3 │ ["string2"]              another another text
    
    julia> [df DataFrame([col => in.(col, df.tag) for col in foldl(union!, df.tag, init=Set{String}())])]
    3×7 DataFrame
     Row │ tag                      message               string  string2  string3  string1  sttring2 
         │ Array…                   String                Bool    Bool     Bool     Bool     Bool     
    ─────┼────────────────────────────────────────────────────────────────────────────────────────────
       1 │ ["string1", "sttring2"]  some text              false    false    false     true      true
       2 │ ["string", "string3"]    another text            true    false     true    false     false
       3 │ ["string2"]              another another text   false     true    false    false     false
    
    julia> transform(df, [:tag => ByRow(x -> in(col, x)) => col for col in foldl(union!, df.tag, init=Set{String}())])
    3×7 DataFrame
     Row │ tag                      message               string  string2  string3  string1  sttring2 
         │ Array…                   String                Bool    Bool     Bool     Bool     Bool
    ─────┼────────────────────────────────────────────────────────────────────────────────────────────
       1 │ ["string1", "sttring2"]  some text              false    false    false     true      true
       2 │ ["string", "string3"]    another text            true    false     true    false     false
       3 │ ["string2"]              another another text   false     true    false    false     false
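
Both examples above produce Bool columns in whatever order the Set happens to iterate. If integer counts and a predictable column order are preferred, as in the table in the question, a variant along the following lines might work. This is only a sketch: the names `labels` and `new_df` are chosen here for illustration, the tag values are taken from the desired output table in the question, and sorting the labels alphabetically is an assumption about the wanted column order.

```julia
using DataFrames

df = DataFrame(tag=[["string1", "string2"], ["string", "string3"], ["string2"]],
               message=["some text", "another text", "another another text"])

# Distinct labels across all rows, sorted so the column order is deterministic
labels = sort(unique(reduce(vcat, df.tag)))

# One Int column per label: how many times the label occurs in that row's tag
new_df = transform(df, [:tag => ByRow(t -> count(==(label), t)) => label for label in labels])
```

Using `count` rather than `in` means that a label repeated within one row's tag would be counted more than once; if only presence matters, wrapping the `in` call from the examples above in `Int` should give 0/1 values instead.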