hadoop clojure cascalog

Turning co-occurrence counts into co-occurrence probabilities with Cascalog


I have a table of co-occurrence counts stored on S3, where each row is [key-a, key-b, count], and I want to produce the co-occurrence probability matrix from it.

To do that I need to calculate the sum of the counts for each key-a, and then divide each row by the sum for its key-a.
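Written out, and assuming "co-occurrence probability" here means the conditional probability of key-b given key-a (the symbols c and P are notation introduced only for this sketch), the calculation is:

$$
P(b \mid a) = \frac{c(a, b)}{\sum_{b'} c(a, b')}
$$

where c(a, b) is the count stored in the row [a, b, count] and the sum runs over every key-b that appears with a given key-a. The probabilities for each key-a therefore add up to 1.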

If I were doing this "by hand" I would make one pass over the data to build a hash table from keys to totals (in LevelDB or something like it), and then a second pass to do the division. That doesn't sound like a very Cascalog-y way to do it.

Is there some way I can get the total for a row by doing the equivalent of a self-join?


Solution

  • Sample data:

    (def coocurrences
      [["foo" "bar" 3]
       ["bar" "foo" 3]
       ["foo" "quux" 6]
       ["quux" "foo" 6]
       ["bar" "quux" 2]
       ["quux" "bar" 2]])
    

    Query:

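    (require '[cascalog.api :refer :all] '[cascalog.ops :as c])
    
    ;; Subquery: total count for each key-a.
    (let [total (<- [?key-a ?sum]
                  (coocurrences ?key-a _ ?c)
                  (c/sum ?c :> ?sum))]
      ;; Join each row with its key-a total (the shared ?key-a variable
      ;; is the join key), then divide the count by that total.
      (?<- (stdout) [?key-a ?key-b ?prob]
        (coocurrences ?key-a ?key-b ?c)
        (total ?key-a ?sum)
        (div ?c ?sum :> ?prob)))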
    

    Output:

    RESULTS
    -----------------------
    bar     foo     0.6
    bar     quux    0.4
    foo     bar     0.3333333333333333
    foo     quux    0.6666666666666666
    quux    foo     0.75
    quux    bar     0.25
    -----------------------
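
    As a cross-check on the numbers above, the two-pass "by hand" approach from the question can be sketched in plain Clojure over the same sample data. This is only an in-memory illustration, not a replacement for the Cascalog query: the names `sample-rows`, `totals-by-key` and `cooccurrence-probs` are made up for this sketch, and it assumes the whole table fits in memory.

```clojure
;; Plain Clojure only; no Cascalog or Hadoop involved.
;; All names below are hypothetical, chosen for this sketch.
(def sample-rows
  [["foo" "bar" 3]
   ["bar" "foo" 3]
   ["foo" "quux" 6]
   ["quux" "foo" 6]
   ["bar" "quux" 2]
   ["quux" "bar" 2]])

(defn totals-by-key
  "First pass: build a map from each key-a to the sum of its counts."
  [rows]
  (reduce (fn [acc [key-a _ cnt]]
            (update-in acc [key-a] (fnil + 0) cnt))
          {}
          rows))

(defn cooccurrence-probs
  "Second pass: divide each row's count by the total for its key-a."
  [rows]
  (let [totals (totals-by-key rows)]
    (for [[key-a key-b cnt] rows]
      ;; Coerce to double so the result is a decimal, not a Clojure ratio.
      [key-a key-b (/ cnt (double (totals key-a)))])))

(doseq [row (cooccurrence-probs sample-rows)]
  (println row))
```

    The totals map plays the role of the `total` subquery, and the lookup `(totals key-a)` plays the role of the join on `?key-a`. Rows are printed in input order here, whereas the order of the Cascalog output is not something to rely on.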