scala hadoop design-patterns cascading scalding

Scalding TypedPipe API External Operations pattern

I have a copy of Programming MapReduce with Scalding by Antonios Chalkiopoulos. In the book he discusses the External Operations design pattern for Scalding code. You can see an example on his website here. I have chosen to use the Type Safe API. This introduces new challenges, but I prefer it over the Fields API, which is what the book and the site mostly cover.

I am wondering how people have implemented the external operations pattern with the Type Safe API. My initial implementation is as follows:

I create a class that extends com.twitter.scalding.Job which will serve as my Scalding job class where I will 'manage arguments, define taps, and use external operations to construct data processing pipelines'.

I create an object where I define the functions to be used in the Type Safe pipes. Because the Type Safe pipe operations take functions as arguments, I can pass the functions defined in the object directly to them.

This creates code that looks like this:

class MyJob(args: Args) extends Job(args) {

  import MyOperations._

  val input_path = args(MyJob.inputArgPath)
  val output_path = args(MyJob.outputArgPath)

  val eventInput: TypedPipe[(LongWritable, Text)] = this.mode match {
    case m: HadoopMode => TypedPipe.from(WritableSequenceFile[LongWritable, Text](input_path))
    case _ => TypedPipe.from(WritableSequenceFile[LongWritable, Text](input_path))
  }

  val eventOutput: FixedPathSource with TypedSink[(LongWritable, Text)] with TypedSource[(LongWritable, Text)] = this.mode match {
    case m: HadoopMode => WritableSequenceFile[LongWritable, Text](output_path)
    case _ => TypedTsv[(LongWritable, Text)](output_path)
  }

  val validatedEvents: TypedPipe[(LongWritable, Either[Text, Event])] = eventInput.map(convertTextToEither).fork
  validatedEvents.filter(isEvent).map(removeEitherWrapper).write(eventOutput)
}

object MyOperations {

  def convertTextToEither(v: (LongWritable, Text)): (LongWritable, Either[Text, Event]) = {
    ...
  }

  def isEvent(v: (LongWritable, Either[Text, Event])): Boolean = {
    ...
  }

  def removeEitherWrapper(v: (LongWritable, Either[Text, Event])): (LongWritable, Text) = {
    ...
  }
}

As you can see, the functions that are passed to the Scalding Type Safe operations are kept separate from the job itself. While this is not as 'clean' as the external operations pattern presented in the book, it is a quick way to write this kind of code. Additionally, I can use JUnitRunner for job-level integration tests and ScalaTest for function-level unit tests.

The main point of this post, though, is to ask how other people are doing this sort of thing. Documentation for the Scalding Type Safe API is sparse. Are there more functional, Scala-friendly ways of doing this? Am I missing a key component of the design pattern? I feel nervous about this because with the Fields API you can write unit tests on pipes with ScaldingTest. As far as I know, you can't do that with TypedPipes. Please let me know if there is a generally agreed-upon pattern for the Scalding Type Safe API, or how you create reusable, modular, and testable Type Safe API code. Thanks for the help!

Update 2 after Antonios' reply

Thank you for the reply. That was basically the answer I was looking for, but I wanted to continue the conversation. The main issue I see in your answer, as I commented, is that the implementation expects one specific type. What if the types change throughout the job? I have explored the code below, and it seems to work, but it feels hacked on.

def self: TypedPipe[Any]

def testingPipe: TypedPipe[(LongWritable, Text)] = self.map(
    (firstVar: Any) => {
        val tester = firstVar.asInstanceOf[(LongWritable, Text)]
        (tester._1, tester._2)
    }
)

The upside is that I declare only one implementation of self; the downside is the ugly type cast. Additionally, I have not tested this in depth with a more complex pipeline. So, what are your thoughts on how to handle types as they change while keeping a single self implementation for cleanliness and brevity?

Solution

In Scala, extension methods are implemented using implicit classes. An implicit class lets the compiler convert a TypedPipe into a wrapper class that contains your external operations:

import com.twitter.scalding._
import cascading.flow.FlowDef

class MyJob(args: Args) extends Job(args) {

  implicit class MyOperationsWrapper(val self: TypedPipe[Double]) extends MyOperations with Serializable

  val pipe = TypedPipe.from(TypedTsv[Double](args("input")))

  val result = pipe
    .operation1
    .operation2(x => x*2)
    .write(TypedTsv[Double](args("output")))

}

trait MyOperations {

  def self: TypedPipe[Double]

  def operation1(implicit fd: FlowDef): TypedPipe[Double] =
    self.map { x =>
      println(s"Input: $x")
      x / 100
    }

  def operation2(datafn:Double => Double)(implicit fd: FlowDef): TypedPipe[Double] =
    self.map { x=>
      val result = datafn(x)
      println(s"Result: $result")
      result
    }

}

import org.apache.hadoop.util.ToolRunner
import org.apache.hadoop.conf.Configuration

object MyRunner extends App {

  ToolRunner.run(new Configuration(), new Tool, (classOf[MyJob].getName :: "--local" ::
    "--input" :: "doubles.tsv" ::
    "--output":: "result.tsv" :: args.toList).toArray)

}
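On the testability question, operations written this way can be exercised on an in-memory pipe without touching any files. What follows is a minimal sketch, not a definitive recipe: it assumes scalding-core is on the classpath, and the names ScalingOperations, percentToFraction, rescale, ScalingWrapper and runLocally are hypothetical. The trait mirrors MyOperations above but drops the println calls so that results are easy to compare.

```scala
import com.twitter.scalding.{Config, Local, TypedPipe}

// Hypothetical operations trait in the same style as MyOperations,
// without the println side effects.
trait ScalingOperations {
  def self: TypedPipe[Double]

  def percentToFraction: TypedPipe[Double] = self.map(_ / 100)

  def rescale(f: Double => Double): TypedPipe[Double] = self.map(f)
}

object ScalingOperationsCheck {

  // Same wrapper idea as in the job, declared outside of a Job.
  implicit class ScalingWrapper(val self: TypedPipe[Double])
      extends ScalingOperations with Serializable

  // Runs a pipe in Scalding's local mode and collects its contents.
  def runLocally[T](pipe: TypedPipe[T]): List[T] =
    pipe.toIterableExecution
      .waitFor(Config.default, Local(true))
      .get
      .toList

  def main(args: Array[String]): Unit = {
    // In-memory input instead of a TypedTsv source.
    val input = TypedPipe.from(List(100.0, 250.0))
    val output = runLocally(input.percentToFraction.rescale(_ * 2)).sorted
    assert(output == List(2.0, 5.0))
    println(output)
  }
}
```

The same check can be moved into a ScalaTest or JUnit test body; only the assertion style changes.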

Regarding how to manage types across the pipes, my recommendation would be to work out some basic types that make sense and use case classes. To use your example, I would rename the method convertTextToEither to extractEvents:

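case class LogInput(l: Long, text: Text)
case class Event(data: String)

// Defined in a trait whose self is a TypedPipe[LogInput].
// isEvent (LogInput => Boolean) and getEvent (Text => Event) are helpers you provide.
def extractEvents: TypedPipe[Event] =
  self.filter(line => isEvent(line))
      .map(line => getEvent(line.text))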

Then you would have

LogInputOperations for LogInput types
EventOperations for Event types
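
A minimal sketch of how that split could look, assuming scalding-core is on the classpath. To keep it small, String stands in for Hadoop's Text. The "EVENT " prefix rule, the withData operation and the PipeSyntax object are hypothetical and only there for illustration.

```scala
import com.twitter.scalding.TypedPipe

// Basic types carried through the pipeline.
// String replaces Text here (an assumption made for brevity).
case class LogInput(l: Long, text: String)
case class Event(data: String)

// Operations available on pipes of LogInput.
trait LogInputOperations {
  def self: TypedPipe[LogInput]

  // Hypothetical rule: a line is an event when it starts with "EVENT ".
  def extractEvents: TypedPipe[Event] =
    self.filter(_.text.startsWith("EVENT "))
        .map(line => Event(line.text.stripPrefix("EVENT ")))
}

// Operations available on pipes of Event.
trait EventOperations {
  def self: TypedPipe[Event]

  // Hypothetical operation: drop events that carry no data.
  def withData: TypedPipe[Event] = self.filter(_.data.nonEmpty)
}

// One implicit wrapper per element type; the compiler picks the
// wrapper that matches the pipe's type at each step.
object PipeSyntax {
  implicit class LogInputOps(val self: TypedPipe[LogInput])
      extends LogInputOperations with Serializable

  implicit class EventOps(val self: TypedPipe[Event])
      extends EventOperations with Serializable
}
```

With import PipeSyntax._ in scope, a TypedPipe[LogInput] gains extractEvents, and the resulting TypedPipe[Event] gains withData, so each trait keeps a single, precisely typed self and no cast is needed when the element type changes along the pipeline.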