scala scalding

How do I change how keys are serialized in Scalding?


I am grouping by a custom type in my Scalding job:

typedPipe
   .map(someMapper)
   .groupBy(_.nonPrimitiveField)
   .sum
   .write(sink)

In my output, the keys show up as their toString output, which is not useful. How can I make Scalding use a custom serializer for these keys?

My current workaround is to call toTypedPipe and explicitly call my serialization function in the mappers, but this seems wasteful.

The sink is a TypedTsv[(Key, Value)], where Key is the type of the field that I would like to serialize differently.


Solution

  • TSV is a text format, so at the end of the day everything becomes a string. The simplest way is to override .toString on your Key type, or to wrap it in another object that overrides .toString. Alternatively, replace the key with a String as a final step (I think that's what you are already doing anyway). I am not sure what you mean by "wasteful". It does not add an extra step to the flow, if that's your concern, and the conversion to a string has to happen in any case, so that cost is fixed.

    typedPipe
       .map(someMapper)
       .groupBy(x => beautifulString(x.nonPrimitiveField))
       .sum
       .write(sink)
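
The wrapper option mentioned above can be sketched as follows. This is only an illustration under stated assumptions: Key, PrettyKey, and the body of beautifulString are hypothetical stand-ins, and plain Scala collections are used in place of a TypedPipe so the snippet runs without Scalding on the classpath.

```scala
object KeyFormatting {
  // Hypothetical key type standing in for the asker's nonPrimitiveField.
  final case class Key(region: String, id: Int)

  // Hypothetical formatter; the answer calls this beautifulString.
  def beautifulString(k: Key): String = s"${k.region}-${k.id}"

  // Wrapper whose toString is the desired text form, so any writer that
  // renders fields via toString prints the formatted key.
  final case class PrettyKey(key: Key) {
    override def toString: String = beautifulString(key)
  }

  // Plain Scala collections stand in for the TypedPipe here (assumption).
  val records: List[(Key, Long)] =
    List((Key("eu", 1), 10L), (Key("us", 2), 5L), (Key("eu", 1), 7L))

  // Group by the wrapped key and sum the values per key.
  val summed: Map[PrettyKey, Long] =
    records
      .groupBy { case (k, _) => PrettyKey(k) }
      .map { case (k, vs) => k -> vs.map(_._2).sum }

  // Build tab-separated lines using only toString on the key.
  val lines: List[String] =
    summed.toList.map { case (k, v) => s"$k\t$v" }.sorted
}
```

If this were moved into a real Scalding job, note that groupBy on a TypedPipe needs an implicit Ordering for the key type, so a wrapper like PrettyKey would need one defined (for example by ordering on the formatted string), whereas grouping directly by the String already has one.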