hadoop clojure cascalog

How can Cascalog process multi-line JSON files?


I have a directory of JSON files that I want to process using Cascalog. My current solution requires me to strip all newline characters from the JSON files with a bash script. I am looking for a better solution, because I sync these files using rsync.

My question is: can I read the contents of a file in Cascalog and return them as a single tuple? At present, 'lfs-textline' returns one tuple per line of the file, which is why I have to remove the newline characters. Ideally I want one tuple per file.

(defn textline-parsed [dir]
    (let [source (lfs-textline dir)]
        (<- [?line]
            (source ?line))))

Solution

  • Use hfs-wholefile from cascalog.more-taps to do this.

    (:require [cascalog.more-taps :as taps])
    
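    (defn- byte-writable-to-str
      "Convert a BytesWritable to a string."
      [bw]
      ;; getBytes returns the backing buffer, which can be longer than the data,
      ;; so decode only the first getLength bytes (assumes the files are UTF-8).
      [(String. (.getBytes bw) 0 (.getLength bw) "UTF-8")])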
    

    And, use

    (??<- [?str] 
        ((taps/hfs-wholefile path) ?filename ?file-content) 
        (byte-writable-to-str ?file-content :> ?str))
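
A note on the byte-to-string conversion above: Hadoop's BytesWritable documents that getBytes returns the backing array, and only the first getLength bytes of it are valid data. The sketch below illustrates the effect with a plain byte array standing in for that buffer, so it runs without Hadoop on the classpath. The helper name `padded-bytes->str` and the sample payload are hypothetical, and UTF-8 is an assumption about the file encoding.

```clojure
;; Sketch only: a plain byte array stands in for the buffer behind a BytesWritable.
;; `padded-bytes->str` is a hypothetical helper, not part of Cascalog or Hadoop.
(defn padded-bytes->str
  "Decode the first `len` bytes of `buf` as UTF-8, ignoring any padding after them."
  [^bytes buf len]
  (String. buf 0 (int len) "UTF-8"))

;; Simulate a buffer with spare capacity: the JSON bytes followed by zero padding.
(def payload (.getBytes "{\"a\": 1}\n" "UTF-8"))
(def buffer (java.util.Arrays/copyOf payload 16))

;; Decoding the whole buffer would keep the trailing NUL characters;
;; limiting the decode to the valid length recovers the original text.
(def decoded (padded-bytes->str buffer (count payload)))
```

Decoding through `String` also handles multi-byte UTF-8 characters, which a byte-by-byte `char` conversion cannot represent.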